DCGM Modularity

DCGM is modular: its functional areas are split into separate shared libraries, which DCGM lazy-loads the first time a corresponding API or dcgmi call uses them. To prevent a module from ever loading, disable it using dcgmi, API calls, or nv-hostengine command-line arguments. Alternatively, remove the .so.1 file of the module you want to disable permanently; DCGM then behaves as if that module were disabled.

[Figure: DCGM modularity]

Module List

The following modules have been added to DCGM.

Module ID               Number  Description
----------------------  ------  -----------
DcgmModuleIdNvSwitch    1       Manages NVSwitches and is required for DGX-2 / HGX-2 systems to function properly. Can be loaded explicitly by adding -l -g to the nv-hostengine command line. Requires nv-hostengine to run as root.
DcgmModuleIdVGPU        2       Provides telemetry on paravirtualized GPUs.
DcgmModuleIdIntrospect  3       Provides time-series data about the running state of nv-hostengine.
DcgmModuleIdHealth      4       Provides passive health checks of GPUs and NVSwitches.
DcgmModuleIdPolicy      5       Allows users to register callbacks based on GPU events such as XIDs and over-temperature conditions.
DcgmModuleIdConfig      6       Allows users to set GPU configuration. Requires nv-hostengine to run as root.
DcgmModuleIdDiag        7       Enables users to call the DCGM GPU Diagnostic.
DcgmModuleIdProfiling   8       Enables users to monitor profiling metrics.
DcgmModuleIdSysmon      9       Enables users to monitor supported NVIDIA CPUs.

Disabling Modules

Users may prevent DCGM from loading modules by passing the --blacklist-modules option to nv-hostengine at startup. The option takes a comma-separated list of values from the Number column in the table above.

For instance, to start nv-hostengine with the introspection (3) and health (4) modules disabled, you would change nv-hostengine’s service file to pass the following arguments:

$ nv-hostengine --blacklist-modules 3,4

Note that the NVSwitch module must be loaded explicitly with the -l and -g options. To load only the NVSwitch module and disable all others, use the following command line:

$ nv-hostengine -l -g --blacklist-modules 2,3,4,5,6,7,8,9
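Typing module numbers by hand is easy to get wrong, so the comma-separated value for --blacklist-modules can instead be assembled from module names. The following is a minimal sketch, not part of DCGM: the helper names module_number and blacklist_arg are hypothetical, and the name-to-number mapping is copied from the Module List table above. It only prints a command line and does not start nv-hostengine.

```shell
#!/bin/sh
# Hypothetical helper: map a module name to its number from the
# Module List table. Returns 1 for an unknown name.
module_number() {
    case "$1" in
        NvSwitch)                 echo 1 ;;
        VGPU)                     echo 2 ;;
        Introspect|Introspection) echo 3 ;;
        Health)                   echo 4 ;;
        Policy)                   echo 5 ;;
        Config)                   echo 6 ;;
        Diag)                     echo 7 ;;
        Profiling)                echo 8 ;;
        Sysmon)                   echo 9 ;;
        *)                        return 1 ;;
    esac
}

# Hypothetical helper: join the numbers of the named modules with commas,
# giving the value that --blacklist-modules expects.
blacklist_arg() {
    list=""
    for name in "$@"; do
        num=$(module_number "$name") || {
            echo "unknown module: $name" >&2
            return 1
        }
        # Append a comma only when the list already has an entry.
        list="${list:+$list,}$num"
    done
    printf '%s\n' "$list"
}

# Print the resulting command line instead of running it.
echo "nv-hostengine --blacklist-modules $(blacklist_arg Introspection Health)"
# prints: nv-hostengine --blacklist-modules 3,4
```

Rejecting unknown names, rather than silently skipping them, keeps a typo from leaving a module enabled by accident.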

You can query the status of all DCGM modules with the following command:

$ dcgmi modules -l
+-----------+--------------------+------------------------------------------+
| List Modules                                                              |
| Status: Success                                                           |
+===========+====================+==========================================+
| Module ID | Name               | State                                    |
+-----------+--------------------+------------------------------------------+
| 0         | Core               | Loaded                                   |
| 1         | NvSwitch           | Not loaded                               |
| 2         | VGPU               | Not loaded                               |
| 3         | Introspection      | Not loaded                               |
| 4         | Health             | Not loaded                               |
| 5         | Policy             | Not loaded                               |
| 6         | Config             | Not loaded                               |
| 7         | Diag               | Not loaded                               |
| 8         | Profiling          | Not loaded                               |
| 9         | Sysmon             | Not loaded                               |
+-----------+--------------------+------------------------------------------+

Only modules in the Not loaded state can be disabled; a module that has already been loaded cannot be disabled.
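Because only modules that are not yet loaded can be disabled, it can help to filter the listing by state before choosing one. The sketch below assumes the table layout shown above (columns separated by | characters, with the module number in the first cell). The function names modules_in_state and sample_output are hypothetical, and the sample is an abridged copy of the listing so the example runs without a host engine.

```shell
#!/bin/sh
# Hypothetical helper: print the names of modules whose State column
# equals $1, reading `dcgmi modules -l` style output on standard input.
modules_in_state() {
    awk -F'|' -v want="$1" '
        # Data rows only: five |-separated fields, first cell is a number.
        NF >= 5 && $2 ~ /^ *[0-9]+ *$/ {
            name = $3; state = $4
            gsub(/^ +| +$/, "", name)     # trim padding around the cells
            gsub(/^ +| +$/, "", state)
            if (state == want) print name
        }'
}

# Abridged sample of the listing above (assumed layout).
sample_output() {
    cat <<'EOF'
+-----------+--------------------+------------------------------------------+
| Module ID | Name               | State                                    |
+-----------+--------------------+------------------------------------------+
| 0         | Core               | Loaded                                   |
| 4         | Health             | Not loaded                               |
| 5         | Policy             | Not loaded                               |
+-----------+--------------------+------------------------------------------+
EOF
}

# Lists Health and Policy, one per line.
sample_output | modules_in_state "Not loaded"
```

On a live system the same filter would be fed from the real command, for example `dcgmi modules -l | modules_in_state "Not loaded"`.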

To disable a module, note its name in the Name column of the listing above. This example disables the Policy module:

$ dcgmi modules --blacklist Policy
+-----------------------------+---------------------------------------------+
| Blacklist Module                                                          |
| Status: Success                                                           |
| Successfully blacklisted module Policy                                    |
+=============================+=============================================+
+-----------------------------+---------------------------------------------+

After disabling a module, verify its state by listing the modules again:

$ dcgmi modules -l
+-----------+--------------------+------------------------------------------+
| List Modules                                                              |
| Status: Success                                                           |
+===========+====================+==========================================+
| Module ID | Name               | State                                    |
+-----------+--------------------+------------------------------------------+
| 0         | Core               | Loaded                                   |
| 1         | NvSwitch           | Not loaded                               |
| 2         | VGPU               | Not loaded                               |
| 3         | Introspection      | Not loaded                               |
| 4         | Health             | Not loaded                               |
| 5         | Policy             | Blacklisted                              |
| 6         | Config             | Not loaded                               |
| 7         | Diag               | Not loaded                               |
| 8         | Profiling          | Not loaded                               |
| 9         | Sysmon             | Not loaded                               |
+-----------+--------------------+------------------------------------------+
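In a provisioning script, the same verification can be expressed as an exit status instead of a visual check. This is a rough sketch under the assumption that the listing keeps the layout shown above; module_has_state and listing are hypothetical names, and the embedded sample is an abridged copy of that output.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if module $1 is reported in state $2.
# Reads `dcgmi modules -l` style output on standard input.
module_has_state() {
    awk -F'|' -v mod="$1" -v want="$2" '
        # Data rows only: five |-separated fields, first cell is a number.
        NF >= 5 && $2 ~ /^ *[0-9]+ *$/ {
            name = $3; state = $4
            gsub(/^ +| +$/, "", name)
            gsub(/^ +| +$/, "", state)
            if (name == mod) { found = (state == want); exit }
        }
        # A module that never appears in the listing counts as a failure.
        END { exit(found ? 0 : 1) }'
}

# Abridged sample of the listing above (assumed layout).
listing() {
    cat <<'EOF'
| Module ID | Name               | State                                    |
+-----------+--------------------+------------------------------------------+
| 0         | Core               | Loaded                                   |
| 4         | Health             | Not loaded                               |
| 5         | Policy             | Blacklisted                              |
EOF
}

if listing | module_has_state Policy Blacklisted; then
    echo "Policy is disabled"
else
    echo "Policy is not disabled"
fi
# prints: Policy is disabled
```

On a live system, `dcgmi modules -l | module_has_state Policy Blacklisted` would take the place of the sample listing.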