Known Issues: DGX Station
See the section for each release to find out which issues are open in that release.
Issue (fixed in 20.02)
Attempting to run GPU-accelerated Docker containers may return the following error.

.. code:: text

   Failed to initialize NVML: Unknown Error
Explanation and Workaround
This issue occurs if you have installed docker-1.13.1-108 provided by Red Hat Enterprise Linux.
An updated Docker version that resolves the issue is now available. To obtain the update, run the following command:

.. code:: text

   sudo yum update
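Before updating, it can help to confirm that the affected build is the one installed. The following is a minimal sketch, assuming the package is named ``docker`` and that ``rpm -q docker`` prints a string beginning with ``docker-1.13.1-108``. The helper name ``is_affected_docker`` is hypothetical and is not part of any NVIDIA or Red Hat tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given package string names the
# affected build from this issue (docker-1.13.1-108).
is_affected_docker() {
    # $1: package string in the form printed by `rpm -q docker`,
    #     that is, name-version-release followed by optional suffixes.
    case "$1" in
        docker-1.13.1-108|docker-1.13.1-108.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a live system (assumes rpm is available):
#   if is_affected_docker "$(rpm -q docker)"; then
#       echo "Affected Docker build installed; run: sudo yum update"
#   fi
```

The pattern requires either an exact match or a dot after ``108``, so a later release number such as ``1080`` is not reported as affected.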
Issue
[Fixed in EL7-21-07] Removing one version of the NVIDIA CUDA Toolkit can remove the /usr/local/cuda symbolic link even if other versions of the NVIDIA CUDA Toolkit remain installed.
Workaround
This workaround requires sudo privileges.
Re-create the /usr/local/cuda symbolic link so that it points to a versioned CUDA directory that is still installed, for example, /usr/local/cuda-10.1.

.. code:: text

   sudo ln -s /usr/local/cuda-10.1 /usr/local/cuda
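The workaround above can be wrapped in a small guard so that the link is only created when the versioned directory exists and nothing other than a symbolic link is in the way. This is a sketch under those assumptions; ``relink_cuda`` is a hypothetical helper name, and the paths are passed as arguments so that nothing is hard-coded.

```shell
#!/usr/bin/env bash
# Hypothetical helper: relink_cuda LINK TARGET
# Points LINK (for example /usr/local/cuda) at TARGET
# (for example /usr/local/cuda-10.1).
relink_cuda() {
    local link="$1" target="$2"

    # Refuse to create a dangling link.
    if [ ! -d "$target" ]; then
        echo "target $target does not exist" >&2
        return 1
    fi

    # Refuse to overwrite a real file or directory.
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        echo "$link exists and is not a symbolic link; leaving it alone" >&2
        return 1
    fi

    # -f replaces an existing link; -n keeps ln from descending into
    # the directory that an existing link points to.
    ln -sfn "$target" "$link"
}

# Usage from a shell with sufficient privileges:
#   relink_cuda /usr/local/cuda /usr/local/cuda-10.1
```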
Issue
The nvhealth command incorrectly lists the serial number of the motherboard in the DGX Serial Number entry under Checks. The correct serial number is listed under System Summary.
.. code:: text

   $ sudo nvhealth
   Info
   ----
   Timestamp: Thu Mar 7 08:54:52 2019 -0800
   Version: 19.01.6
   Checks
   ------
   DGX BaseOS Version [4.0.5]...........................................
   BIOS Version [0406]..................................................
   DGX Serial Number [160984157800056]..................................
   ...
   System Summary
   --------------
   Product Name: DGX Station
   Manufacturer: NVIDIA
   DGX Serial Number: 0154017000004
   Uptime: up 5 days, 17 hours, 44 minutes
   Motherboard:
   BIOS Version: 0406
   Serial Number: 160984157800056
   ...
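To read the correct value in a script, one option is to take the DGX Serial Number line only after the System Summary heading has been seen. The sketch below assumes that the output layout matches the sample above and that nvhealth writes its report to standard output; ``dgx_serial_from_summary`` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads nvhealth output on stdin and prints the
# DGX Serial Number found under "System Summary", ignoring the
# incorrect entry under "Checks".
dgx_serial_from_summary() {
    awk '
        /^[[:space:]]*System Summary/ { in_summary = 1; next }
        in_summary && !found && /^[[:space:]]*DGX Serial Number:/ {
            print $NF      # last field on the line is the serial number
            found = 1
        }
    '
}

# Usage on a live system:
#   sudo nvhealth | dgx_serial_from_summary
```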
Issue (fixed with EL7-22.02)
The DGX Station cannot be resumed after being suspended, whether from the desktop GUI or with the systemctl suspend command. Pressing a keyboard key or the power button when the system is suspended has no effect: the display remains dark, it is not possible to log in to the system, and the system does not respond to a ping command from a remote host.
Workaround
To avoid this issue, do not suspend the system.
If you encounter this issue, turn off the power to the system and then turn it on again.
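One way to keep the system from being suspended by accident is to mask the systemd sleep targets. This is a general systemd technique and an assumption on our part, not a step taken from NVIDIA's instructions; ``disable_suspend`` is a hypothetical helper name. Passing ``echo`` as the first argument prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Standard systemd targets for the sleep states.
SLEEP_TARGETS="sleep.target suspend.target hibernate.target hybrid-sleep.target"

# Hypothetical helper: masks the sleep targets so that suspend requests
# from the desktop or from `systemctl suspend` are refused.
disable_suspend() {
    # $1 (optional): command runner; defaults to sudo.
    # SLEEP_TARGETS is left unquoted on purpose so it splits into words.
    ${1:-sudo} systemctl mask $SLEEP_TARGETS
}

# Dry run:   disable_suspend echo
# Apply:     disable_suspend
# Undo:      sudo systemctl unmask sleep.target suspend.target \
#                hibernate.target hybrid-sleep.target
```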