This chapter describes known issues with the DGX OS software or DGX hardware at the time of this software release.
Known Software Issues
The following are known issues with the software.
Pull container CUDA: 11.7.0-base-ubi8 failure
DCGM Service Labelled as Deprecated
NVSM May Raise ‘md1 is corrupted’ Alert
nvsm show health Reports Empty /proc/driver Folders
NVSM Reports “Unknown” for Number of logical CPU cores on non-English system
InfiniBand Bandwidth Drops for KVM Guest VMs
Issue: Pulling the CUDA container 11.7.0-base-ubi8 fails. The root cause and resolution are unknown.
There may be stale Docker data, older images, or containers on your system that need to be removed. The following commands remove them and reinstall the NVIDIA container tooling:
sudo rpm -e nv-docker-gpus
sudo rpm -e nv-docker-options docker
sudo yum group remove -y 'NVIDIA Container Runtime'
sudo yum install -y docker
sudo yum install nv-docker-gpus
sudo yum group install -y 'NVIDIA Container Runtime'
sudo systemctl restart docker
sudo systemctl restart nv-docker-gpus
sudo docker rmi `sudo docker images -q`  # may report an error if there are no images
sudo docker rm `sudo docker ps -aq`  # may report an error if there are no containers
sudo docker run --security-opt label=type:nvidia_container_t --rm nvcr.io/nvidia/cuda:11.0-base nvidia-smi
Issue: DCGM Service Labelled as Deprecated
When querying the status of dcgm.service, the service is reported as deprecated.
$ sudo systemctl status dcgm.service
dcgm.service - DEPRECATED. Please use nvidia-dcgm.service
The message can be ignored.
dcgm.service is, indeed, deprecated, but can still be used without issue. The name of the DCGM service is in the process of migrating from dcgm.service to nvidia-dcgm.service. During the transition, both are included in DCGM 2.2.8.
A later version of DGX OS 4 will enable nvidia-dcgm.service by default. You can enable nvidia-dcgm.service manually (even though there is no functional difference) as follows:
$ sudo systemctl stop dcgm.service
$ sudo systemctl disable dcgm.service
$ sudo systemctl start nvidia-dcgm.service
$ sudo systemctl enable nvidia-dcgm.service
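After switching services, the enabled state of both unit names can be checked with a short loop. This is a sketch; the reported states depend on your systemd configuration, and "unknown" is printed for units systemd does not recognize.

```shell
# Report the enabled state of both DCGM service names.
for unit in dcgm.service nvidia-dcgm.service; do
  state=$(systemctl is-enabled "$unit" 2>/dev/null)
  printf '%s: %s\n' "$unit" "${state:-unknown}"
done
```

After the migration, dcgm.service should report disabled and nvidia-dcgm.service should report enabled.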
Issue: NVSM May Raise ‘md1 is corrupted’ Alert
On a system where one OS drive is used for the EFI boot partition and one is used for the root file system (each configured as RAID 1), NVSM raises ‘md1 is corrupted’ alerts.
The OS RAID 1 drives are running in a non-standard configuration, resulting in erroneous alert messages. If you alter the default configuration, you must let NVSM know so that the utility does not flag the configuration as an error, and so that NVSM can continue to monitor the health of the drives.
To configure NVSM to support a custom drive partitioning, perform the following steps.
1. Stop the NVSM services.
$ sudo systemctl stop nvsm
2. Edit /etc/nvsm/nvsm.config and set the "use_standard_config_storage" parameter to false.
3. Remove the NVSM database.
$ sudo rm /var/lib/nvsm/sqlite/nvsm.db
4. Restart the NVSM services.
$ sudo systemctl restart nvsm
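The steps above can be combined into a short script. This is a sketch: the sed expression assumes nvsm.config stores the parameter as `"use_standard_config_storage": true`; verify the exact format in your /etc/nvsm/nvsm.config before running.

```shell
# Stop NVSM, flip the storage-config flag, clear the database, restart.
sudo systemctl stop nvsm
sudo sed -i 's/\("use_standard_config_storage"[[:space:]]*:[[:space:]]*\)true/\1false/' /etc/nvsm/nvsm.config
sudo rm /var/lib/nvsm/sqlite/nvsm.db
sudo systemctl restart nvsm
```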
Issue: nvsm show health Reports Empty /proc/driver Folders
When issuing nvsm show health, the nvsmhealth_log.txt log file reports that the /proc/driver/ folders are empty.
Example from a DGX-1
2020-09-01 20:03:05,204 INFO: Found empty path glob "/proc/driver/nvidia/*/gpus/*/information"
2020-09-01 20:03:06,206 INFO: Found empty path glob "/proc/driver/nvidia/*/gpus/*/registry"
2020-09-01 20:03:09,742 INFO: Found empty path glob "/proc/driver/nvidia/*/params"
2020-09-01 20:03:10,743 INFO: Found empty path glob "/proc/driver/nvidia/*/registry"
2020-09-01 20:03:11,745 INFO: Found empty path glob "/proc/driver/nvidia/*/version"
2020-09-01 20:03:12,747 INFO: Found empty path glob "/proc/driver/nvidia/*/warnings/*"
These messages are erroneous; the folder content is actually loaded during the software installation. The messages can be ignored. This will be resolved in a future NVSM release.
Issue: NVSM Reports “Unknown” for Number of logical CPU cores on non-English system
On systems set up for a non-English locale, the nvsm show health command lists the number of logical CPU cores as Unknown.
Number of logical CPU cores [None]………………………. Unknown
This issue will be resolved in a later version of the DGX OS software.
Issue: InfiniBand Bandwidth Drops for KVM Guest VMs
The InfiniBand bandwidth when running on multi-GPU guest VMs is lower than when running on bare metal.
Currently, performance when using GPUDirect within a guest VM will be lower than when used on a bare-metal system.
Known DGX-2 System Issues
The following are known issues specific to the DGX-2 server.
DGX KVM: nvidia-vm health-check May Fail
NVSM Does not Detect Downgraded GPU PCIe Link
Issue: DGX KVM: nvidia-vm health-check May Fail
When running nvidia-vm health-check to check the health of specific GPUs used by the DGX KVM guest VM, the command may fail.
sudo nvidia-vm health-check --gpu-count 1 --gpu-index 0 --fulltest run
…
ERROR: Unexpected response from blacklist "connection"
ERROR: Unexpected response from blacklist "to"
ERROR: Unexpected response from blacklist "the"
ERROR: Unexpected response from blacklist "host"
ERROR: Unexpected response from blacklist "engine"
ERROR: Unexpected response from blacklist "is"
ERROR: Unexpected response from blacklist "not"
ERROR: Unexpected response from blacklist "valid"
ERROR: Unexpected response from blacklist "any"
ERROR: Unexpected response from blacklist "longer"
ERROR: No healthy/unhealthy data returned from blacklist command
Explanation and Resolution
This occurs because the health-check VM is created from an image based on the DGX OS ISO, which uses the R418 driver package, but the host was updated to the R450 driver package. The two packages use different DCGM releases which cannot communicate with each other, resulting in the error.
Issue: NVSM Does not Detect Downgraded GPU PCIe Link
If the GPU PCIe link is downgraded to Gen1, NVSM still reports the GPU health status as OK.
Explanation and Resolution
The NVSM software currently does not check for this condition. The check will be added in a future software release.
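Until the check is added, the link generation can be verified manually. The sketch below queries nvidia-smi for the current and maximum PCIe link generation per GPU and flags any downgrade; the helper function name is illustrative, and the query fields assume a driver recent enough to expose them.

```shell
# Flag GPUs whose current PCIe link generation is below the maximum.
# Expects CSV lines of the form: index, current_gen, max_gen
check_pcie_downgrade() {
  awk -F', *' '$2 < $3 { printf "GPU %s: link downgraded to Gen%s (max Gen%s)\n", $1, $2, $3 }'
}
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max \
           --format=csv,noheader | check_pcie_downgrade
```

No output means no downgrade was detected.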
Known DGX-1 System Issues
The following are known issues specific to the DGX-1 server.
nvidia-nvswitch Version Mismatch Message Appears when Running DCGM
Forced Reboot Hangs the OS
Issue: nvidia-nvswitch Version Mismatch Message Appears when Running DCGM
When starting the DCGM service, a version mismatch error message similar to the following will appear:
78075.772392 nvidia-nvswitch: Version mismatch, kernel version 450.80.02 user version 450.51.06
This occurs with GPU driver versions later than 450.51.06. The version check occurs on all DGX systems, but applies only to NVSwitch systems, so the message can be ignored on non-NVSwitch systems such as the DGX Station or DGX-1.
Issue: Forced Reboot Hangs the OS
When issuing reboot -f (forced reboot), I/O error messages appear on the console and then the system hangs.
The system reboots normally when issuing reboot.
This issue will be resolved in a future version of the DGX OS server software.
Known Ubuntu OS and Linux Kernel Issues
The following are known issues related to the Ubuntu OS or the Linux kernel that affect the DGX server.
System May Slow Down When Using mpirun
Issue: System May Slow Down When Using mpirun
Customers running Message Passing Interface (MPI) workloads may experience the OS becoming very slow to respond. When this occurs, a log message similar to the following would appear in the kernel log:
kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
Due to the current design of the Linux kernel, the condition may be triggered when get_user_pages is used on a file that is on persistent storage. For example, this can happen when cudaHostRegister is used on a file path that is stored in an ext4 filesystem. DGX systems implement /tmp on a persistent ext4 filesystem.
If you performed this workaround on a previous DGX OS software version, you do not need to do it again after updating to the latest DGX OS version.
To avoid using persistent storage, MPI can be configured to use shared memory at /dev/shm (this is a temporary filesystem). If you are using Open MPI, then you can solve the issue by configuring the Modular Component Architecture (MCA) parameters so that mpirun uses the temporary file system in memory.
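One way to do this is sketched below. It assumes Open MPI's vader shared-memory transport and its btl_vader_backing_directory MCA parameter; confirm the parameter exists on your installation (for example with `ompi_info --param btl vader --level 9`), and substitute your own application for the placeholder.

```shell
# Point Open MPI's shared-memory backing files at /dev/shm instead of /tmp,
# so cudaHostRegister is not used on a persistent ext4 path.
export OMPI_MCA_btl_vader_backing_directory=/dev/shm
mpirun -np 8 ./my_mpi_app  # ./my_mpi_app is a placeholder for your workload
```

The same setting can also be passed per invocation with `mpirun --mca btl_vader_backing_directory /dev/shm ...`.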