Known Issues: All DGX Systems
See the version-specific sections to determine which of these issues are open in those versions.
Issue
Pulling the cuda:11.7.0-base-ubi8 container fails; the root cause and resolution are unknown.
Explanation
There may be stale Docker data, or older images or containers, on your system that need to be removed.
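Before removing anything, you can list the images and containers currently on the system to see what might be stale (a quick check; not part of the documented workaround):

.. code:: bash

   # List all images and all containers, including stopped ones,
   # to see what might need to be removed.
   sudo docker images -a
   sudo docker ps -a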
Workaround
.. code:: bash

   sudo rpm -e nv-docker-gpus
   sudo rpm -e nv-docker-options
   sudo yum group remove -y 'NVIDIA Container Runtime'
   sudo yum install -y docker
   sudo yum install nv-docker-gpus
   sudo yum group install -y 'NVIDIA Container Runtime'
   sudo systemctl restart docker
   sudo systemctl restart nv-docker-gpus
   sudo docker rmi $(sudo docker images -q)   # may get an error if there are no images
   sudo docker rm $(sudo docker ps -aq)       # may get an error if there are no containers
   sudo docker run --security-opt label=type:nvidia_container_t --rm nvcr.io/nvidia/cuda:11.0-base nvidia-smi
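After the cleanup, you can retry the pull that originally failed to confirm the problem is resolved (a sketch, assuming the image is hosted at nvcr.io/nvidia/cuda as in the workaround above):

.. code:: bash

   # Retry the originally failing pull; it should now complete without error.
   sudo docker pull nvcr.io/nvidia/cuda:11.7.0-base-ubi8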
Issue
When querying the status of dcgm.service, it is reported as deprecated.
sudo systemctl status dcgm.service
dcgm.service - DEPRECATED. Please use nvidia-dcgm.service
...
Explanation and Workaround
The message can be ignored. dcgm.service is, indeed, deprecated, but can still be used without issue. The name of the DCGM service is in the process of migrating from dcgm.service to nvidia-dcgm.service. During the transition, both are included in DCGM 2.2.8. A later version of DGX EL7 will enable nvidia-dcgm.service by default. You can enable nvidia-dcgm.service manually (even though there is no functional difference) as follows:
.. code:: bash

   sudo systemctl stop dcgm.service
   sudo systemctl disable dcgm.service
   sudo systemctl start nvidia-dcgm.service
   sudo systemctl enable nvidia-dcgm.service
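To confirm the switch, you can check the status of the new service (a simple verification step; the exact output depends on your DCGM version):

.. code:: bash

   # nvidia-dcgm.service should now report active (running),
   # and dcgm.service should remain disabled.
   sudo systemctl status nvidia-dcgm.service
   sudo systemctl is-enabled dcgm.service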
Issue
On systems where both the datacenter-gpu-manager=1.x and datacenter-gpu-manager-fabricmanager=1.x packages are installed, you may see the nvidia-fabricmanager service fail to start with the following errors:
nvhostengine_daemon[18519]: nv-hostengine version 1.7.2 daemon started
nvhostengine_daemon[18519]: DCGM initialized
nv-hostengine[18517]: ERROR: TCP bind failed for port 5555 address 16777343 errno 98
nv-hostengine[18517]: Failed to start host engine server
nvhostengine_daemon[18519]: Err: Failed to start DCGM Server
systemd[1]: Started FabricManager service.
systemd[1]: nvidia-fabricmanager.service: main process exited, code=exited, status=255/n/a
systemd[1]: Unit nvidia-fabricmanager.service entered failed state.
systemd[1]: nvidia-fabricmanager.service failed.
This happens because both dcgm.service and nvidia-fabricmanager.service try to launch nv-hostengine. This issue does not affect datacenter-gpu-manager version 2.x and later.
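To see which process already holds TCP port 5555 when the bind error occurs, you can inspect the listening sockets (a diagnostic sketch; ss is provided by the iproute package on EL7):

.. code:: bash

   # Show which process is listening on port 5555; typically this is the
   # nv-hostengine instance launched by the other service.
   sudo ss -ltnp | grep ':5555'
   ps -ef | grep [n]v-hostengine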
Explanation and Workaround
On non-NVSwitch systems such as DGX-1, stop and disable the nvidia-fabricmanager service:
systemctl stop nvidia-fabricmanager
systemctl disable nvidia-fabricmanager
On NVSwitch systems such as DGX-2 and DGX A100, stop and disable the dcgm service:
systemctl stop dcgm
systemctl disable dcgm
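After disabling the conflicting service, you can restart the remaining one and confirm that it now starts cleanly (shown here for an NVSwitch system; on non-NVSwitch systems, restart and check dcgm instead):

.. code:: bash

   # Restart Fabric Manager and confirm it is active (running).
   systemctl restart nvidia-fabricmanager
   systemctl status nvidia-fabricmanager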
Issue
After attempting to start nvsm-plugin-memory.service, the service fails. Issuing systemctl status nvsm-plugin-memory.service returns error messages such as nvsm-plugin-memory.service failed or Failed to start NVSM API plugin service to monitor system memory devices.
Explanation and Workaround
This can occur with GPU driver release 418 when one or more DIMMs are bad. To work around the issue, update to Release 450 or 470 as described in Installing and Updating the Software.
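To check whether the system is running the affected 418 driver before updating, you can query the installed driver version (a quick check using nvidia-smi):

.. code:: bash

   # Report the installed GPU driver version; releases 450 and 470
   # are not affected by this issue.
   nvidia-smi --query-gpu=driver_version --format=csv,noheader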