DCGM Modularity
DCGM is modular: its functional and feature areas are separated into different shared libraries. DCGM lazy-loads each of these libraries the first time a corresponding API or DCGMI call uses it. If you never want a module to be loaded, you can disable it using dcgmi, API calls, or nv-hostengine command-line arguments. Alternatively, to disable a module permanently, you can remove that module's so.1 file, and DCGM will behave as if the library is disabled.
![../_images/dcgm-modularity.png](../_images/dcgm-modularity.png)
Module List
The following modules have been added to DCGM.
| Module ID | Number | Description |
|---|---|---|
| DcgmModuleIdNvSwitch | 1 | Manages NVSwitches and is required for DGX-2 / HGX-2 systems to function properly. This module can be loaded explicitly by passing the -l and -g options to nv-hostengine. Requires nv-hostengine to run as root. |
| DcgmModuleIdVGPU | 2 | Provides telemetry on paravirtualized GPUs. |
| DcgmModuleIdIntrospect | 3 | Provides time-series data about the running state of nv-hostengine. |
| DcgmModuleIdHealth | 4 | Provides passive health checks of GPUs and NVSwitches. |
| DcgmModuleIdPolicy | 5 | Allows users to register callbacks for GPU events such as XIDs and overtemperature. |
| DcgmModuleIdConfig | 6 | Allows users to set GPU configuration. Requires nv-hostengine to run as root. |
| DcgmModuleIdDiag | 7 | Enables users to run the DCGM GPU Diagnostic. |
Disabling Modules
Users can prevent DCGM from loading modules by passing a command-line option to nv-hostengine at startup. The option takes a comma-separated list of values from the Number column in the table above.
For instance, to start nv-hostengine with the introspection (3) and health (4) modules disabled, change nv-hostengine's service file to pass the following arguments:
$ nv-hostengine --blacklist-modules 3,4
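Because the option takes numbers rather than names, a small lookup can make service files easier to audit. The following is a minimal sketch, not part of DCGM: `module_number` and `blacklist_arg` are hypothetical helper names, and the name-to-number mapping is copied from the Module List table. The script only prints the command line instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: map a module name to its Number from the Module List table.
# Returns non-zero for an unknown name.
module_number() {
    case "$1" in
        NvSwitch)                 echo 1 ;;
        VGPU)                     echo 2 ;;
        Introspect|Introspection) echo 3 ;;
        Health)                   echo 4 ;;
        Policy)                   echo 5 ;;
        Config)                   echo 6 ;;
        Diag)                     echo 7 ;;
        *)                        return 1 ;;
    esac
}

# Hypothetical helper: join the numbers of all given module names with commas.
blacklist_arg() {
    list=""
    for name in "$@"; do
        num=$(module_number "$name") || {
            echo "unknown module: $name" >&2
            return 1
        }
        # Append with a comma only when the list is already non-empty.
        list="${list:+$list,}$num"
    done
    printf '%s\n' "$list"
}

# Print (do not run) the command line for disabling Introspection and Health.
echo "nv-hostengine --blacklist-modules $(blacklist_arg Introspection Health)"
```

Failing on an unknown name, rather than silently skipping it, avoids starting the host engine with a module enabled that was meant to be disabled.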
Note that the NVSwitch module must be loaded explicitly with the -l and -g options. To load only the NVSwitch module and disable all others, use the following command line:
$ nv-hostengine -l -g --blacklist-modules 2,3,4,5,6,7
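The list in this command is simply every module number from the table except the one to keep. As an illustration, the sketch below computes that complement; `all_modules_except` is a hypothetical helper name, and the range 1 through 7 is an assumption taken from the Module List table (the Core module, 0, is not part of it).

```shell
#!/bin/sh
# Hypothetical helper: print a comma-separated list of module numbers 1-7,
# leaving out every number passed as an argument.
all_modules_except() {
    list=""
    for num in 1 2 3 4 5 6 7; do
        keep=no
        for wanted in "$@"; do
            if [ "$num" = "$wanted" ]; then
                keep=yes
            fi
        done
        if [ "$keep" = no ]; then
            list="${list:+$list,}$num"
        fi
    done
    printf '%s\n' "$list"
}

# Print (do not run) the command line that keeps only NVSwitch (1).
echo "nv-hostengine -l -g --blacklist-modules $(all_modules_except 1)"
```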
You can query the status of all DCGM modules with the following command:
$ dcgmi modules -l
+-----------+--------------------+------------------------------------------+
| List Modules                                                              |
| Status: Success                                                           |
+===========+====================+==========================================+
| Module ID | Name               | State                                    |
+-----------+--------------------+------------------------------------------+
| 0         | Core               | Loaded                                   |
| 1         | NvSwitch           | Not loaded                               |
| 2         | VGPU               | Not loaded                               |
| 3         | Introspection      | Not loaded                               |
| 4         | Health             | Not loaded                               |
| 5         | Policy             | Not loaded                               |
| 6         | Config             | Not loaded                               |
| 7         | Diag               | Not loaded                               |
+-----------+--------------------+------------------------------------------+
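For scripting, the listing can be reduced to one tab-separated row per module. The sketch below assumes the output keeps the pipe-delimited layout shown above, which may differ between DCGM versions; `parse_modules` and `disableable_modules` are hypothetical helper names. It reads the listing from standard input, so the example runs on sample rows without a host engine.

```shell
#!/bin/sh
# Hypothetical helper: read a module listing on stdin and print
# "ID<TAB>Name<TAB>State" for each module row. Borders, the title rows,
# and the column header are skipped because their first cell is not a number.
parse_modules() {
    awk -F'|' '
        NF >= 4 {
            id = $2; name = $3; state = $4
            gsub(/^[ \t]+|[ \t]+$/, "", id)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", state)
            if (id ~ /^[0-9]+$/) print id "\t" name "\t" state
        }'
}

# Hypothetical helper: print the names of modules still in the "Not loaded" state.
disableable_modules() {
    parse_modules | awk -F'\t' '$3 == "Not loaded" { print $2 }'
}

# Sample rows in the same layout as the listing above.
printf '%s\n' \
    '| Module ID | Name | State |' \
    '| 0 | Core | Loaded |' \
    '| 1 | NvSwitch | Not loaded |' \
    '| 5 | Policy | Blacklisted |' | disableable_modules
```

Against a running host engine, the same filter would be fed with `dcgmi modules -l | disableable_modules`.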
Only modules in the Not loaded state can be disabled.
To disable a module, note its name in the Name column of the listing above. This example disables the Policy module:
$ dcgmi modules --blacklist Policy
+-----------------------------+---------------------------------------------+
| Blacklist Module                                                          |
| Status: Success                                                           |
| Successfully blacklisted module Policy                                    |
+=============================+=============================================+
+-----------------------------+---------------------------------------------+
Once a module has been disabled, you can verify its state by listing the modules again:
$ dcgmi modules -l
+-----------+--------------------+------------------------------------------+
| List Modules                                                              |
| Status: Success                                                           |
+===========+====================+==========================================+
| Module ID | Name               | State                                    |
+-----------+--------------------+------------------------------------------+
| 0         | Core               | Loaded                                   |
| 1         | NvSwitch           | Not loaded                               |
| 2         | VGPU               | Not loaded                               |
| 3         | Introspection      | Not loaded                               |
| 4         | Health             | Not loaded                               |
| 5         | Policy             | Blacklisted                              |
| 6         | Config             | Not loaded                               |
| 7         | Diag               | Not loaded                               |
+-----------+--------------------+------------------------------------------+
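In a provisioning script, this visual check could be automated. The sketch below is one possible approach, assuming the listing keeps the pipe-delimited layout and the state strings shown above; `module_has_state` is a hypothetical helper name. It reads the listing from standard input and succeeds only if the named module is in the expected state.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the listing on stdin contains a row whose
# Name column equals $1 and whose State column equals $2.
module_has_state() {
    awk -F'|' -v want_name="$1" -v want_state="$2" '
        NF >= 4 {
            name = $3; state = $4
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", state)
            if (name == want_name && state == want_state) found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Sample row in the same layout as the listing above.
if printf '%s\n' '| 5 | Policy | Blacklisted |' | module_has_state Policy Blacklisted; then
    echo "Policy is disabled"
fi
```

Against a running host engine, the check would be `dcgmi modules -l | module_has_state Policy Blacklisted`.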