DCGM Modularity

DCGM is modular: its functional areas are split into separate shared libraries. DCGM lazy-loads each library the first time a corresponding API or dcgmi call uses it. To prevent a module from ever being loaded, blacklist it using dcgmi, API calls, or nv-hostengine command-line arguments. Alternatively, to blacklist a module permanently, remove its so.1 file; DCGM will then behave as if the library is blacklisted.

Module List

The following modules have been added to DCGM.

Module ID                 #   Description
DcgmModuleIdNvSwitch      1   Manages NVSwitches and is required for DGX-2 / HGX-2 systems to function properly. It can be loaded explicitly by adding "-l -g" to your nv-hostengine command line. Requires nv-hostengine to run as root.
DcgmModuleIdVGPU          2   Provides telemetry on paravirtualized GPUs.
DcgmModuleIdIntrospect    3   Provides time-series data about the running state of nv-hostengine.
DcgmModuleIdHealth        4   Provides passive health checks of GPUs and NVSwitches.
DcgmModuleIdPolicy        5   Allows users to register callbacks based on GPU events such as XIDs and over-temperature conditions.
DcgmModuleIdConfig        6   Allows users to set GPU configuration. Requires nv-hostengine to run as root.
DcgmModuleIdDiag          7   Enables users to call the DCGM GPU Diagnostic.

Blacklisting Modules

Users can prevent DCGM from loading modules by passing a command-line option to nv-hostengine when starting it. The option takes the module numbers from the # column in the table above, separated by commas.

For instance, to start nv-hostengine with the introspection (3) and health (4) modules blacklisted, change nv-hostengine’s service file to pass the following arguments:

nv-hostengine --blacklist-modules 3,4

Note that the NVSwitch module must be explicitly loaded with the -l and -g options. To load only the NVSwitch module and blacklist all others, use the following command line:

nv-hostengine -l -g --blacklist-modules 2,3,4,5,6,7
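The list passed to --blacklist-modules in the command line above is simply every module number except the ones you want to keep. As a minimal sketch, the shell function below builds that list; blacklist_all_except is a hypothetical helper name, not part of DCGM, and it assumes the module numbers 1 through 7 from the table above. It only prints the resulting command line and does not start nv-hostengine.

```shell
#!/bin/sh
# Hypothetical helper (not part of DCGM): print a comma-separated list of
# every module number except the ones given as arguments.
blacklist_all_except() {
    all="1 2 3 4 5 6 7"   # assumption: the # column of the module table
    out=""
    for id in $all; do
        keep=0
        for wanted in "$@"; do
            if [ "$id" = "$wanted" ]; then
                keep=1
            fi
        done
        if [ "$keep" -eq 0 ]; then
            # Append with a comma only when the list is already non-empty.
            out="${out:+$out,}$id"
        fi
    done
    printf '%s\n' "$out"
}

# Keep only NVSwitch (1); print the command line instead of running it.
echo "nv-hostengine -l -g --blacklist-modules $(blacklist_all_except 1)"
```

Keeping the module numbers in one variable means the helper needs a one-line change if your DCGM version lists a different set of modules.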

You can query the status of all DCGM modules with the following command:

# dcgmi modules -l
+-----------+--------------------+------------------------------------------+ 
| List Modules                                                              | 
| Status: Success                                                           |
+===========+====================+==========================================+
| Module ID | Name               | State                                    |
+-----------+--------------------+------------------------------------------+
| 0         | Core               | Loaded                                   |
| 1         | NvSwitch           | Not loaded                               |
| 2         | VGPU               | Not loaded                               |
| 3         | Introspection      | Not loaded                               |
| 4         | Health             | Not loaded                               |
| 5         | Policy             | Not loaded                               |
| 6         | Config             | Not loaded                               |
| 7         | Diag               | Not loaded                               |
+-----------+--------------------+------------------------------------------+ 

Only modules in the "Not loaded" state can be blacklisted.

To blacklist a module, note its name in the Name column of the listing above. This example blacklists the Policy module:

# dcgmi modules --blacklist Policy
+-----------------------------+---------------------------------------------+ 
| Blacklist Module                                                          | 
| Status: Success                                                           | 
| Successfully blacklisted module Policy                                    | 
+=============================+=============================================+ 
+-----------------------------+---------------------------------------------+ 

Once a module has been blacklisted, you can verify that by listing modules again:

# dcgmi modules -l
+-----------+--------------------+------------------------------------------+ 
| List Modules                                                              | 
| Status: Success                                                           | 
+===========+====================+==========================================+ 
| Module ID | Name               | State                                    | 
+-----------+--------------------+------------------------------------------+ 
| 0         | Core               | Loaded                                   | 
| 1         | NvSwitch           | Not loaded                               | 
| 2         | VGPU               | Not loaded                               | 
| 3         | Introspection      | Not loaded                               | 
| 4         | Health             | Not loaded                               | 
| 5         | Policy             | Blacklisted                              |
| 6         | Config             | Not loaded                               | 
| 7         | Diag               | Not loaded                               | 
+-----------+--------------------+------------------------------------------+
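To check a module's state from a script rather than by eye, the listing can be parsed. The sketch below is one possible approach, assuming the output keeps the pipe-delimited three-column layout shown above; module_state is a hypothetical helper name, not a dcgmi feature. The example feeds it two sample rows so it runs without a host engine.

```shell
#!/bin/sh
# Hypothetical helper: read `dcgmi modules -l` style output on stdin and
# print the State column of the row whose Name column equals $1.
# Exits non-zero if no such row is found.
module_state() {
    awk -F'|' -v name="$1" '
        NF >= 4 {
            n = $3; s = $4
            # Trim the padding around the Name and State cells.
            gsub(/^[ \t]+|[ \t]+$/, "", n)
            gsub(/^[ \t]+|[ \t]+$/, "", s)
            if (n == name) { print s; found = 1 }
        }
        END { exit found ? 0 : 1 }
    '
}

# Sample rows copied from the listing above.
sample='| 4         | Health             | Not loaded                               |
| 5         | Policy             | Blacklisted                              |'
printf '%s\n' "$sample" | module_state Policy
```

Against a running host engine, the same function could be used as `dcgmi modules -l | module_state Policy`.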