Known Issues: All DGX Systems
See the version-specific sections to determine which of these issues are open in those versions.
Issue
Pulling the cuda:11.7.0-base-ubi8 container fails; the root cause and resolution are unknown.
Explanation
There may be stale Docker data, or older images or containers, on your system that need to be removed.
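Before removing anything, you can list the images and containers currently on the system to see what might be stale (a quick check; not part of the documented workaround):

.. code:: bash

   # List all images and all containers, including stopped ones,
   # to see what might need to be removed.
   sudo docker images -a
   sudo docker ps -a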
Workaround
.. code:: bash

   sudo rpm -e nv-docker-gpus
   sudo rpm -e nv-docker-options
   sudo yum group remove -y 'NVIDIA Container Runtime'
   sudo yum install -y docker
   sudo yum install nv-docker-gpus
   sudo yum group install -y 'NVIDIA Container Runtime'
   sudo systemctl restart docker
   sudo systemctl restart nv-docker-gpus
   sudo docker rmi $(sudo docker images -q)   # may get an error if there are no images
   sudo docker rm $(sudo docker ps -aq)       # may get an error if there are no containers
   sudo docker run --security-opt label=type:nvidia_container_t --rm nvcr.io/nvidia/cuda:11.0-base nvidia-smi
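After the cleanup, you can retry the pull that originally failed to confirm the problem is resolved (a sketch, assuming the image is hosted at nvcr.io/nvidia/cuda as in the workaround above):

.. code:: bash

   # Retry the originally failing pull; it should now complete without error.
   sudo docker pull nvcr.io/nvidia/cuda:11.7.0-base-ubi8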
Issue
When querying the status of dcgm.service, it is reported as deprecated.
sudo systemctl status dcgm.service
dcgm.service - DEPRECATED. Please use nvidia-dcgm.service
...
Explanation and Workaround
The message can be ignored. dcgm.service is, indeed, deprecated, but can still be used without issue. The name of the DCGM service is in the process of migrating from dcgm.service to nvidia-dcgm.service. During the transition, both are included in DCGM 2.2.8. A later version of DGX EL7 will enable nvidia-dcgm.service by default. You can enable nvidia-dcgm.service manually (even though there is no functional difference) as follows:
.. code:: bash

   sudo systemctl stop dcgm.service
   sudo systemctl disable dcgm.service
   sudo systemctl start nvidia-dcgm.service
   sudo systemctl enable nvidia-dcgm.service
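To confirm the switch, you can check the status of the new service (a simple verification step; the exact output depends on your DCGM version):

.. code:: bash

   # nvidia-dcgm.service should now report active (running),
   # and dcgm.service should remain disabled.
   sudo systemctl status nvidia-dcgm.service
   sudo systemctl is-enabled dcgm.service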
Issue
On systems where both the datacenter-gpu-manager=1.x and datacenter-gpu-manager-fabricmanager=1.x packages are installed, you may see the nvidia-fabricmanager service fail to start with the following errors:
nvhostengine_daemon[18519]: nv-hostengine version 1.7.2 daemon started
nvhostengine_daemon[18519]: DCGM initialized
nv-hostengine[18517]: ERROR: TCP bind failed for port 5555 address 16777343 errno 98
nv-hostengine[18517]: Failed to start host engine server
nvhostengine_daemon[18519]: Err: Failed to start DCGM Server
systemd[1]: Started FabricManager service.
systemd[1]: nvidia-fabricmanager.service: main process exited, code=exited, status=255/n/a
systemd[1]: Unit nvidia-fabricmanager.service entered failed state.
systemd[1]: nvidia-fabricmanager.service failed.
This happens because both dcgm.service and nvidia-fabricmanager.service try to launch nv-hostengine. This issue does not affect datacenter-gpu-manager version 2.x and later.
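To see which process already holds TCP port 5555 when the bind error occurs, you can inspect the listening sockets (a diagnostic sketch; ss is provided by the iproute package on EL7):

.. code:: bash

   # Show which process is listening on port 5555; typically this is the
   # nv-hostengine instance launched by the other service.
   sudo ss -ltnp | grep ':5555'
   ps -ef | grep [n]v-hostengine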
Explanation and Workaround
On non-NVSwitch systems such as DGX-1, stop and disable the nvidia-fabricmanager service:
systemctl stop nvidia-fabricmanager
systemctl disable nvidia-fabricmanager
On NVSwitch systems such as DGX-2 and DGX A100, stop and disable the dcgm service:
systemctl stop dcgm
systemctl disable dcgm
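After disabling the conflicting service, you can restart the remaining one and confirm that it now starts cleanly (shown here for an NVSwitch system; on non-NVSwitch systems, restart and check dcgm instead):

.. code:: bash

   # Restart Fabric Manager and confirm it is active (running).
   systemctl restart nvidia-fabricmanager
   systemctl status nvidia-fabricmanager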
Issue
After attempting to start nvsm-plugin-memory.service, the service fails. Issuing systemctl status nvsm-plugin-memory.service returns error messages such as nvsm-plugin-memory.service failed or Failed to start NVSM API plugin service to monitor system memory devices.
Explanation and Workaround
This can occur with GPU driver release 418 when one or more DIMMs are bad. To work around the issue, update to Release 450 or 470 as described in Installing and Updating the Software.
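To check whether the system is running the affected 418 driver before updating, you can query the installed driver version (a quick check using nvidia-smi):

.. code:: bash

   # Report the installed GPU driver version; releases 450 and 470
   # are not affected by this issue.
   nvidia-smi --query-gpu=driver_version --format=csv,noheader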