Known Issues

This chapter describes issues related to the DGX OS software or DGX hardware that were known at the time of the software release.

Known Software Issues

The following are known issues with the software.

  • Pull container CUDA: 11.7.0-base-ubi8 failure

  • DCGM Service Labelled as Deprecated

  • NVSM May Raise ‘md1 is corrupted’ Alert

  • nvsm show health Reports Empty /proc/driver Folders

  • NVSM Reports “Unknown” for Number of logical CPU cores on non-English system

  • InfiniBand Bandwidth Drops for KVM Guest VMs

Issue: Pull container CUDA: 11.7.0-base-ubi8 failure. Root cause and resolution unknown

Explanation

There may be stale Docker data, or possibly older images or containers, on your system that need to be removed.

Workaround

sudo rpm -e nv-docker-gpus
sudo rpm -e nv-docker-options docker
sudo yum group remove -y 'NVIDIA Container Runtime'
sudo yum install -y docker
sudo yum install nv-docker-gpus
sudo yum group install -y 'NVIDIA Container Runtime'
sudo systemctl restart docker
sudo systemctl restart nv-docker-gpus
sudo docker rmi `sudo docker images -q` # may get an error if there are no images
sudo docker rm `sudo docker ps -aq` # may get an error if there are no containers
sudo docker run --security-opt label=type:nvidia_container_t --rm nvcr.io/nvidia/cuda:11.0-base nvidia-smi
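
After the cleanup, you can retry the original pull to confirm that the failure is resolved (a suggested verification step, assuming the image is pulled from nvcr.io as in the command above):

sudo docker pull nvcr.io/nvidia/cuda:11.7.0-base-ubi8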

Issue: DCGM Service Labelled as Deprecated

When querying the status of dcgm.service, the service is reported as deprecated.

$ sudo systemctl status dcgm.service

dcgm.service - DEPRECATED. Please use nvidia-dcgm.service

Explanation

The message can be ignored.

dcgm.service is, indeed, deprecated, but can still be used without issue. The name of the DCGM service is in the process of migrating from dcgm.service to nvidia-dcgm.service. During the transition, both are included in DCGM 2.2.8.

A later version of DGX OS 4 will enable nvidia-dcgm.service by default. You can enable nvidia-dcgm.service manually (even though there is no functional difference) as follows:

$ sudo systemctl stop dcgm.service

$ sudo systemctl disable dcgm.service

$ sudo systemctl start nvidia-dcgm.service

$ sudo systemctl enable nvidia-dcgm.service
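
You can then confirm that the new service is active and enabled at boot (a suggested check, not part of the original instructions):

$ systemctl status nvidia-dcgm.service

$ systemctl is-enabled nvidia-dcgm.service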

Issue: NVSM May Raise ‘md1 is corrupted’ Alert

On a system where one OS drive is used for the EFI boot partition and one is used for the root file system (each configured as RAID 1), NVSM raises ‘md1 is corrupted’ alerts.

Explanation

The OS RAID 1 drives are running in a non-standard configuration, resulting in erroneous alert messages. If you alter the default configuration, you must let NVSM know so that the utility does not flag the configuration as an error, and so that NVSM can continue to monitor the health of the drives.

To configure NVSM to support custom drive partitioning, perform the following steps.

  1. Stop NVSM services.

    systemctl stop nvsm
    
  2. Edit /etc/nvsm/nvsm.config and set the "use_standard_config_storage" parameter to false.

    "use_standard_config_storage":false

  3. Remove the NVSM database.

    sudo rm /var/lib/nvsm/sqlite/nvsm.db

  4. Restart NVSM, then verify the result as shown below.

    systemctl restart nvsm
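
Once NVSM has restarted, you can confirm that the 'md1 is corrupted' alert is no longer raised (a suggested verification, not part of the documented procedure):

    sudo nvsm show health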
    

Issue: nvsm show health Reports Empty /proc/driver Folders

When issuing nvsm show health, the nvsmhealth_log.txt log file reports that the /proc/driver/ folders are empty.

Example from a DGX-1

2020-09-01 20:03:05,204 INFO: Found empty path glob "/proc/driver/nvidia/*/gpus/*/information"
2020-09-01 20:03:06,206 INFO: Found empty path glob "/proc/driver/nvidia/*/gpus/*/registry"
2020-09-01 20:03:09,742 INFO: Found empty path glob "/proc/driver/nvidia/*/params"
2020-09-01 20:03:10,743 INFO: Found empty path glob "/proc/driver/nvidia/*/registry"
2020-09-01 20:03:11,745 INFO: Found empty path glob "/proc/driver/nvidia/*/version"
2020-09-01 20:03:12,747 INFO: Found empty path glob "/proc/driver/nvidia/*/warnings/*"

Explanation

This is an erroneous message as the folder content is actually loaded during the software installation. The message can be ignored. This will be resolved in a future NVSM release.
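
If you want to confirm that the driver entries are in fact populated, you can inspect one of the paths from the log directly (an optional check, not part of the original text):

cat /proc/driver/nvidia/version   # prints the loaded NVIDIA driver version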

Issue: NVSM Reports “Unknown” for Number of logical CPU cores on non-English system

On systems set up for a non-English locale, the nvsm show health command lists the number of logical CPU cores as Unknown.

Number of logical CPU cores [None]………………………. Unknown

Resolution

This issue will be resolved in a later version of the DGX OS software.
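
Until then, the logical core count can be read directly from the OS as a cross-check (standard Linux utilities, not part of NVSM):

nproc                       # number of logical CPU cores
getconf _NPROCESSORS_ONLN   # same value, reported by getconf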

Issue: InfiniBand Bandwidth Drops for KVM Guest VMs

The InfiniBand bandwidth when running on multi-GPU guest VMs is lower than when running on bare metal.

Explanation

Currently, performance when using GPUDirect within a guest VM is lower than on a bare-metal system.
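
If you want to quantify the gap, the same point-to-point bandwidth test can be run inside the guest VM and on bare metal and the results compared, for example with the ib_write_bw tool from the perftest package (an illustrative measurement; the device name mlx5_0 is an example and depends on your configuration):

ib_write_bw -d mlx5_0 --report_gbits                 # server side
ib_write_bw -d mlx5_0 --report_gbits <server-host>   # client side, pointing at the server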

Known DGX-2 System Issues

The following are known issues specific to the DGX-2 server.

  • DGX KVM: nvidia-vm health-check May Fail

  • NVSM Does not Detect Downgraded GPU PCIe Link

Issue: DGX KVM: nvidia-vm health-check May Fail

When running nvidia-vm health-check to check the health of specific GPUs used by the DGX KVM guest VM, the command may fail.

Example:

sudo nvidia-vm health-check --gpu-count 1 --gpu-index 0 --fulltest run
…
ERROR: Unexpected response from blacklist "connection"
ERROR: Unexpected response from blacklist "to"
ERROR: Unexpected response from blacklist "the"
ERROR: Unexpected response from blacklist "host"
ERROR: Unexpected response from blacklist "engine"
ERROR: Unexpected response from blacklist "is"
ERROR: Unexpected response from blacklist "not"
ERROR: Unexpected response from blacklist "valid"
ERROR: Unexpected response from blacklist "any"
ERROR: Unexpected response from blacklist "longer""
ERROR: No healthy/unhealthy data returned from blacklist command

Explanation and Resolution

This occurs because the health-check VM is created from an image based on the DGX OS ISO, which uses the R418 driver package, but the host was updated to the R450 driver package. The two packages use different DCGM releases which cannot communicate with each other, resulting in the error.
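
One way to confirm the mismatch is to compare the driver version reported on the host with the version reported inside the health-check VM (a suggested check, not part of the original explanation):

nvidia-smi --query-gpu=driver_version --format=csv,noheader   # run on both the host and in the guest VM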

Issue: NVSM Does not Detect Downgraded GPU PCIe Link

If the GPU PCIe link is downgraded to Gen1, NVSM still reports the GPU health status as OK.

Explanation and Resolution

The NVSM software currently does not check for this condition. The check will be added in a future software release.
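
Until the check is added, the current and maximum PCIe link generation of each GPU can be queried manually (a suggested interim check, not part of the original text):

nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max --format=csv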

Known DGX-1 System Issues

The following are known issues specific to the DGX-1 server.

  • nvidia-nvswitch Version Mismatch Message Appears when Running DCGM

  • Forced Reboot Hangs the OS

Issue: nvidia-nvswitch Version Mismatch Message Appears when Running DCGM

When starting the DCGM service, a version mismatch error message similar to the following will appear:

78075.772392 nvidia-nvswitch: Version mismatch, kernel version 450.80.02 user version 450.51.06

Explanation

This occurs with GPU driver versions later than 450.51.06. The version check occurs on all DGX systems, but applies only to NVSwitch systems, so the message can be ignored on non-NVSwitch systems such as the DGX Station or DGX-1.
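
If you want to confirm whether the installed kernel driver is later than 450.51.06, you can query the module version directly (an optional check, not part of the original explanation):

modinfo -F version nvidia   # prints the version of the installed nvidia kernel module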

Issue: Forced Reboot Hangs the OS

When issuing reboot -f (forced reboot), I/O error messages appear on the console and then the system hangs.

The system reboots normally when issuing reboot.

Resolution

This issue will be resolved in a future version of the DGX OS server software.

Known Ubuntu OS and Linux Kernel Issues

The following are known issues related to the Ubuntu OS or the Linux kernel that affect the DGX server.

  • System May Slow Down When Using mpirun

Issue: System May Slow Down When Using mpirun

Customers running Message Passing Interface (MPI) workloads may experience the OS becoming very slow to respond. When this occurs, a log message similar to the following appears in the kernel log:

kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!

Explanation

Due to the current design of the Linux kernel, the condition may be triggered when get_user_pages is used on a file that is on persistent storage. For example, this can happen when cudaHostRegister is used on a file path that is stored in an ext4 filesystem. DGX systems implement /tmp on a persistent ext4 filesystem.

Workaround

Note

If you performed this workaround on a previous DGX OS software version, you do not need to do it again after updating to the latest DGX OS version.

To avoid using persistent storage, MPI can be configured to use shared memory at /dev/shm (this is a temporary filesystem).

If you are using Open MPI, then you can solve the issue by configuring the Modular Component Architecture (MCA) parameters so that mpirun uses the temporary file system in memory.

For details on how to accomplish this, see the Knowledge Base Article DGX System Slows Down When Using mpirun (requires login to the NVIDIA Enterprise Support portal).
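
As an illustration only (the Knowledge Base Article above is the authoritative reference), one common way to point Open MPI's temporary session files at the in-memory filesystem is to set the corresponding MCA parameter on the mpirun command line or through the environment; the parameter shown below is an example and may differ depending on your Open MPI version:

mpirun --mca orte_tmpdir_base /dev/shm <your-mpi-application>
# or, equivalently, for all subsequent runs in the shell:
export OMPI_MCA_orte_tmpdir_base=/dev/shm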