DCGM Modularity
DCGM is modular: its functional and feature areas are separated into different shared
libraries. DCGM lazy-loads each library the first time a corresponding API or dcgmi call
uses it. If you never want a module to be loaded, you can disable that module using dcgmi,
API calls, or nv-hostengine command-line arguments. To disable a module permanently, you
can also remove the module's so.1 file, and DCGM will behave as if the module is disabled.
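The lazy-loading behavior described above can be pictured with a small sketch. This is not DCGM's implementation: `ModuleState`, `LazyModule`, and the `loader` callback are hypothetical names invented for illustration. The sketch only models two rules from this page: a module is loaded the first time it is used, and a module that has been disabled is never loaded.

```python
from enum import Enum


class ModuleState(Enum):
    """Hypothetical states, named after the State column that dcgmi prints."""
    NOT_LOADED = "Not loaded"
    LOADED = "Loaded"
    BLACKLISTED = "Blacklisted"


class LazyModule:
    """Illustrative stand-in for a module backed by a shared library."""

    def __init__(self, name, loader):
        self.name = name
        self._loader = loader      # hypothetical callable that "loads the library"
        self._impl = None
        self.state = ModuleState.NOT_LOADED

    def blacklist(self):
        # Only a module that has not been loaded yet can be disabled.
        if self.state is ModuleState.LOADED:
            raise RuntimeError(f"{self.name} is already loaded")
        self.state = ModuleState.BLACKLISTED

    def get(self):
        # The first use triggers the load; later calls reuse the loaded module.
        if self.state is ModuleState.BLACKLISTED:
            raise RuntimeError(f"{self.name} is disabled")
        if self._impl is None:
            self._impl = self._loader()
            self.state = ModuleState.LOADED
        return self._impl


load_calls = []
health = LazyModule("Health", lambda: load_calls.append("load") or "health-impl")
health.get()
health.get()
print(health.state.value, len(load_calls))  # Loaded 1
```

The loader runs once no matter how many times `get()` is called, which is the property that makes disabling a module before its first use effective.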
Module List
The following modules have been added to DCGM.
| Module ID | Number | Description |
|---|---|---|
| DcgmModuleIdNvSwitch | 1 | Manages NVSwitches and is required for DGX-2 / HGX-2 systems to function properly. This module is loaded explicitly with the -l and -g options of nv-hostengine. Requires nv-hostengine to run as root. |
| DcgmModuleIdVGPU | 2 | Provides telemetry on paravirtualized GPUs. |
| DcgmModuleIdIntrospect | 3 | Provides time-series data about the running state of nv-hostengine. |
| DcgmModuleIdHealth | 4 | Provides passive health checks of GPUs and NVSwitches. |
| DcgmModuleIdPolicy | 5 | Allows users to register callbacks based on GPU events such as XIDs and overtemperature. |
| DcgmModuleIdConfig | 6 | Allows users to set GPU configuration. Requires nv-hostengine to run as root. |
| DcgmModuleIdDiag | 7 | Enables users to call the DCGM GPU Diagnostic. |
Disabling Modules
Users can prevent DCGM from loading modules by passing the --blacklist-modules option to nv-hostengine when starting it. The option takes a comma-separated list of module numbers, taken from the Number column in the table above.
For instance, to start nv-hostengine with the introspection (3) and health (4) modules disabled, you would change nv-hostengine’s service file to pass the following arguments:
$ nv-hostengine --blacklist-modules 3,4
Note that the NVSwitch module must be explicitly loaded with the -l and -g options. To load only the NVSwitch module and disable all others, use the following command line:
$ nv-hostengine -l -g --blacklist-modules 2,3,4,5,6,7
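When the set of disabled modules is managed by a script, the module numbers can be derived from names instead of being typed by hand. The following is a minimal sketch, assuming the numbers from the module table; the short names used as dictionary keys and the helpers `blacklist_args` and `only_modules` are hypothetical and are not part of DCGM. The sketch only builds an argument list and does not start nv-hostengine.

```python
# Module numbers copied from the Number column of the module table.
# The short names used as keys are a convention chosen for this sketch.
MODULE_NUMBERS = {
    "NvSwitch": 1,
    "VGPU": 2,
    "Introspect": 3,
    "Health": 4,
    "Policy": 5,
    "Config": 6,
    "Diag": 7,
}


def blacklist_args(disabled):
    """Return the --blacklist-modules arguments for the given module names."""
    numbers = sorted({MODULE_NUMBERS[name] for name in disabled})
    if not numbers:
        return []
    return ["--blacklist-modules", ",".join(str(n) for n in numbers)]


def only_modules(keep):
    """Return arguments that disable every listed module except those in keep."""
    keep = set(keep)
    unknown = keep - MODULE_NUMBERS.keys()
    if unknown:
        raise KeyError(f"unknown module names: {sorted(unknown)}")
    return blacklist_args(name for name in MODULE_NUMBERS if name not in keep)


# Reproduces the two command lines shown in this section.
print(["nv-hostengine", *blacklist_args(["Introspect", "Health"])])
print(["nv-hostengine", "-l", "-g", *only_modules(["NvSwitch"])])
```

The resulting lists can be passed to `subprocess.run` or joined into the argument string of a service file.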
You can query the status of all DCGM modules with the following command:
$ dcgmi modules -l
+-----------+--------------------+------------------------------------------+
| List Modules |
| Status: Success |
+===========+====================+==========================================+
| Module ID | Name | State |
+-----------+--------------------+------------------------------------------+
| 0 | Core | Loaded |
| 1 | NvSwitch | Not loaded |
| 2 | VGPU | Not loaded |
| 3 | Introspection | Not loaded |
| 4 | Health | Not loaded |
| 5 | Policy | Not loaded |
| 6 | Config | Not loaded |
| 7 | Diag | Not loaded |
+-----------+--------------------+------------------------------------------+
Only modules in the Not loaded state can be disabled.
To disable a module, note its name in the Name column of the listing above. This example disables the Policy module:
$ dcgmi modules --blacklist Policy
+-----------------------------+---------------------------------------------+
| Blacklist Module |
| Status: Success |
| Successfully blacklisted module Policy |
+=============================+=============================================+
+-----------------------------+---------------------------------------------+
Once a module has been disabled, you can verify its state by listing the modules again:
$ dcgmi modules -l
+-----------+--------------------+------------------------------------------+
| List Modules |
| Status: Success |
+===========+====================+==========================================+
| Module ID | Name | State |
+-----------+--------------------+------------------------------------------+
| 0 | Core | Loaded |
| 1 | NvSwitch | Not loaded |
| 2 | VGPU | Not loaded |
| 3 | Introspection | Not loaded |
| 4 | Health | Not loaded |
| 5 | Policy | Blacklisted |
| 6 | Config | Not loaded |
| 7 | Diag | Not loaded |
+-----------+--------------------+------------------------------------------+
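The verification step can also be scripted by parsing the listing. The sketch below assumes the table layout shown above, in which each data row has three pipe-separated cells beginning with a numeric module ID. `parse_module_list` and `query_modules` are hypothetical helper names, and `query_modules` needs dcgmi and a running nv-hostengine, so the example calls only the parser on a sample copied from the listing.

```python
import re
import subprocess

# Matches data rows such as "| 5 | Policy | Blacklisted |". Border lines and
# header rows do not start with a numeric module ID, so they are skipped.
ROW = re.compile(r"^\|\s*(\d+)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|$")


def parse_module_list(text):
    """Return {name: (module_id, state)} parsed from `dcgmi modules -l` output."""
    modules = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            module_id, name, state = match.groups()
            modules[name] = (int(module_id), state)
    return modules


def query_modules():
    """Run `dcgmi modules -l` and parse its output (not called in this sketch)."""
    result = subprocess.run(
        ["dcgmi", "modules", "-l"], capture_output=True, text=True, check=True
    )
    return parse_module_list(result.stdout)


# A few rows copied from the listing above.
SAMPLE = """\
+-----------+--------------------+------------------------------------------+
| Module ID | Name | State |
+-----------+--------------------+------------------------------------------+
| 0 | Core | Loaded |
| 4 | Health | Not loaded |
| 5 | Policy | Blacklisted |
+-----------+--------------------+------------------------------------------+
"""

modules = parse_module_list(SAMPLE)
print(modules["Policy"])  # (5, 'Blacklisted')
```

A script could compare `modules["Policy"][1]` against `"Blacklisted"` after running the disable command and report a failure if the state differs.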