Known Issues: All DGX Systems

Refer to the version-specific sections to determine which of these issues are open in each version.

DCGM Service Reported as Deprecated

Issue

When querying the status of dcgm.service, the service is reported as deprecated.

$ sudo systemctl status dcgm.service
dcgm.service - DEPRECATED. Please use nvidia-dcgm.service
...

Explanation and Workaround

The message can be ignored. dcgm.service is deprecated but can still be used without issue. The DCGM service is being renamed from dcgm.service to nvidia-dcgm.service, and DCGM 2.2.8 includes both units during the transition. A later version of DGX EL7 will enable nvidia-dcgm.service by default. Although there is no functional difference between the two units, you can switch to nvidia-dcgm.service manually as follows:

$ sudo systemctl stop dcgm.service
$ sudo systemctl disable dcgm.service
$ sudo systemctl start nvidia-dcgm.service
$ sudo systemctl enable nvidia-dcgm.service
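
The four commands above can also be wrapped in a small function that skips the switch when nvidia-dcgm.service is already running. This is a minimal sketch, not an NVIDIA-provided tool: the function name migrate_dcgm and the SYSTEMCTL override variable are assumptions introduced here for illustration.

```shell
# Sketch only: migrate_dcgm is a hypothetical helper, not part of DCGM.
# SYSTEMCTL may point at a stand-in command for dry runs; otherwise run as root.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

migrate_dcgm() {
    # Nothing to do when the new unit is already running.
    if "$SYSTEMCTL" is-active --quiet nvidia-dcgm.service; then
        echo "nvidia-dcgm.service is already active"
        return 0
    fi
    # Same four steps as above; stop at the first failure.
    "$SYSTEMCTL" stop dcgm.service &&
        "$SYSTEMCTL" disable dcgm.service &&
        "$SYSTEMCTL" start nvidia-dcgm.service &&
        "$SYSTEMCTL" enable nvidia-dcgm.service &&
        echo "switched to nvidia-dcgm.service"
}
```

Calling migrate_dcgm from a root shell performs the switch. Chaining the steps with && means a failed stop or disable leaves the new unit untouched.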

Fabric Manager May Fail to Start if DCGM Service Installed

Issue

On systems where both the datacenter-gpu-manager=1.x and datacenter-gpu-manager-fabricmanager=1.x packages are installed, you may see the nvidia-fabricmanager service fail to start with the following errors:

nvhostengine_daemon[18519]: nv-hostengine version 1.7.2 daemon started
nvhostengine_daemon[18519]: DCGM initialized
nv-hostengine[18517]: ERROR: TCP bind failed for port 5555 address 16777343 errno 98
nv-hostengine[18517]: Failed to start host engine server
nvhostengine_daemon[18519]: Err: Failed to start DCGM Server
systemd[1]: Started FabricManager service.
systemd[1]: nvidia-fabricmanager.service: main process exited, code=exited, status=255/n/a
systemd[1]: Unit nvidia-fabricmanager.service entered failed state.
systemd[1]: nvidia-fabricmanager.service failed.

This happens because both dcgm.service and nvidia-fabricmanager.service try to launch nv-hostengine, and the second instance cannot bind TCP port 5555 because the first already holds it. This issue does not affect datacenter-gpu-manager version 2.x and later.
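
The bind error in the log can be read more easily once its numbers are decoded. On Linux, errno 98 is EADDRINUSE (address already in use). The address 16777343 appears to be an IPv4 address printed as an integer; the sketch below decodes it under the assumption that the lowest byte is the first octet. The function name decode_ipv4 is hypothetical.

```shell
# Sketch only: decode_ipv4 is a hypothetical helper.
# Assumption: the logged address is a network-order IPv4 value printed as a
# little-endian integer, so the lowest byte is the first octet.
decode_ipv4() {
    addr=$1
    printf '%d.%d.%d.%d\n' \
        $(( addr & 255 )) \
        $(( (addr >> 8) & 255 )) \
        $(( (addr >> 16) & 255 )) \
        $(( (addr >> 24) & 255 ))
}

decode_ipv4 16777343   # prints 127.0.0.1

# To see which process currently holds the port on a live system:
#   sudo ss -ltnp 'sport = :5555'
```

Under that assumption the failing bind targets 127.0.0.1:5555, which is consistent with a second nv-hostengine colliding with one that is already listening.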

Explanation and Workaround

On non-NVSwitch systems such as DGX-1, stop and disable the nvidia-fabricmanager service:

$ sudo systemctl stop nvidia-fabricmanager
$ sudo systemctl disable nvidia-fabricmanager

On NVSwitch systems such as DGX-2 and DGX A100, stop and disable the dcgm service:

$ sudo systemctl stop dcgm
$ sudo systemctl disable dcgm
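
The choice between the two workarounds can be summarized as a small decision function. This is an illustrative sketch only: the function name unit_to_disable and its two arguments are assumptions, and it does not detect NVSwitch hardware or query the package manager itself.

```shell
# Sketch only: unit_to_disable is a hypothetical helper.
# $1: datacenter-gpu-manager version string, for example 1.7.2
# $2: "nvswitch" for NVSwitch systems (DGX-2, DGX A100), anything else otherwise
unit_to_disable() {
    case "$1" in
        1.*) ;;                       # 1.x: affected by the conflict
        *) echo "none"; return 0 ;;   # assumed unaffected (2.x and later)
    esac
    if [ "$2" = "nvswitch" ]; then
        echo "dcgm"
    else
        echo "nvidia-fabricmanager"
    fi
}

unit_to_disable 1.7.2 nvswitch   # prints dcgm
```

The printed unit name is the one to pass to systemctl stop and systemctl disable; "none" means no workaround is needed.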

nvsm-plugin-memory Service Fails to Launch

Issue

After attempting to start nvsm-plugin-memory.service, the service fails. Issuing systemctl status nvsm-plugin-memory.service returns error messages such as "nvsm-plugin-memory.service failed" or "Failed to start NVSM API plugin service to monitor system memory devices".

Explanation and Workaround

This can occur with GPU driver release 418 when one or more DIMMs are bad. To work around the issue, update to Release 450 or 470 as described in Installing and Updating the Software.
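
To check whether a system is on the affected driver branch, the installed driver version can be compared against 418. The sketch below is an assumption-laden illustration: driver_branch and needs_driver_update are hypothetical helper names, and the branch is taken to be the part of the version string before the first dot.

```shell
# Sketch only: driver_branch and needs_driver_update are hypothetical helpers.
driver_branch() {
    # Strip everything from the first dot: 418.87.01 -> 418
    echo "${1%%.*}"
}

needs_driver_update() {
    # True when the version string belongs to the 418 branch.
    [ "$(driver_branch "$1")" = "418" ]
}

# Usage on a live system (nvidia-smi ships with the GPU driver):
#   ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   needs_driver_update "$ver" && echo "driver $ver is on the 418 branch"
```

A match only indicates that the system is on the branch where this failure can occur; it does not confirm that a DIMM is bad.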