Known Issues: DGX Station
Docker GPU Containers Cannot be Run
Issue (fixed in 20.02)
GPU containers fail with the following error:
Failed to initialize NVML: Unknown Error
Explanation and Workaround
This issue occurs if you have installed docker-1.13.1-108 provided by Red Hat Enterprise Linux. To work around this issue, update the installed packages:
$ sudo yum update
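To check whether a system has the affected package before updating, the installed Docker build can be compared against the version named above. The following is a minimal sketch, not an NVIDIA-provided tool: the function name `is_affected_docker` is hypothetical, and it assumes that `rpm -q docker` reports the package as `docker-1.13.1-108` followed by an optional release suffix.

```shell
#!/bin/sh
# Hypothetical helper (not part of the DGX software): succeeds if the given
# package string names the docker-1.13.1-108 build described above.
# Assumption: the string has the form printed by `rpm -q docker`, that is,
# "docker-1.13.1-108" optionally followed by ".<release suffix>".
is_affected_docker() {
    case "$1" in
        docker-1.13.1-108|docker-1.13.1-108.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on a live system:
#   if is_affected_docker "$(rpm -q docker)"; then
#       echo "Affected Docker build installed; run: sudo yum update"
#   fi
```

If the docker package is not installed, `rpm -q` prints a "not installed" message, which the function treats as not affected.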
DGX Station: The Symbolic Link to /usr/local/cuda Is Missing
Issue
[Fixed in EL7-21.07] Removing the NVIDIA CUDA Toolkit can cause the symbolic link to /usr/local/cuda to be removed even if multiple versions of the NVIDIA CUDA Toolkit are installed.
Workaround
This workaround requires sudo privileges.
Re-create the symbolic link to /usr/local/cuda from the versioned CUDA directory, for example, /usr/local/cuda-10.1.
$ sudo ln -s /usr/local/cuda-10.1 /usr/local/cuda
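When several versioned CUDA directories remain, the same step can be scripted so that the link is re-created only if it is missing. The following is a minimal sketch under stated assumptions: the function name `relink_cuda` and its optional directory argument are hypothetical, the choice of the highest-versioned `cuda-*` directory as the link target is an assumption (the workaround above only gives /usr/local/cuda-10.1 as an example), and `sort -V` requires GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: re-create <prefix>/cuda as a symbolic link to the
# highest-versioned <prefix>/cuda-* directory. <prefix> defaults to /usr/local.
relink_cuda() {
    prefix="${1:-/usr/local}"
    link="$prefix/cuda"

    # Leave an existing, resolvable link (or directory) untouched.
    if [ -e "$link" ]; then
        echo "$link already exists; nothing to do"
        return 0
    fi

    # Assumption: the newest versioned directory is the desired target.
    # sort -V (GNU coreutils) orders cuda-9.2 before cuda-10.1.
    target=$(printf '%s\n' "$prefix"/cuda-*/ | sort -V | tail -n 1)
    target="${target%/}"

    # If the glob matched nothing, $target is the literal pattern, not a directory.
    if [ ! -d "$target" ]; then
        echo "no versioned CUDA directory found under $prefix" >&2
        return 1
    fi

    # -f replaces a dangling link; -n avoids descending into an old link target.
    ln -sfn "$target" "$link"
    echo "$link -> $target"
}
```

Calling `relink_cuda` with no argument operates on /usr/local and therefore needs to run as root; passing a scratch directory allows a dry run without privileges.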
DGX Station: An Incorrect Serial Number Is Listed in nvhealth Output
Issue
The nvhealth command incorrectly lists the serial number of the motherboard in the DGX Serial Number entry under Checks. The correct serial number is listed under System Summary.
$ sudo nvhealth
Info
----
Timestamp: Thu Mar 7 08:54:52 2019 -0800
Version: 19.01.6

Checks
------
DGX BaseOS Version [4.0.5]...........................................
BIOS Version [0406]..................................................
DGX Serial Number [160984157800056]..................................
...

System Summary
--------------
Product Name: DGX Station
Manufacturer: NVIDIA
DGX Serial Number: 0154017000004
Uptime: up 5 days, 17 hours, 44 minutes
Motherboard:
    BIOS Version: 0406
    Serial Number: 160984157800056
...
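Scripts that need the serial number can read it from the System Summary section instead of the Checks section. The following is a minimal sketch, assuming the nvhealth output keeps the layout shown above (a "System Summary" heading followed by a "DGX Serial Number:" line); the function name `dgx_serial_from_summary` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read nvhealth output on standard input and print the
# DGX serial number found in the System Summary section.
# Assumption: the output layout matches the sample shown above.
dgx_serial_from_summary() {
    awk '
        # Start looking only after the System Summary heading.
        /^[ \t]*System Summary/ { in_summary = 1; next }
        # The summary entry uses "Name: value"; the Checks entry uses "Name [value]".
        in_summary && /DGX Serial Number:/ {
            sub(/^.*DGX Serial Number:[ \t]*/, "")
            sub(/[ \t]+$/, "")
            print
            exit
        }
    '
}

# Example use on a live system:
#   sudo nvhealth | dgx_serial_from_summary
```

Restricting the match to lines after the System Summary heading keeps the motherboard serial number listed under Checks from being picked up.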
DGX Station: The System Cannot be Resumed After Suspension
Issue (fixed with EL7-22.02)
The DGX Station cannot be resumed after being suspended either from the desktop GUI or by using the systemctl suspend command. Pressing a keyboard key or the power button when the system is suspended has no effect: The display remains dark, it is not possible to log in to the system, and the system does not respond to a ping command from a remote host.
Workaround
To avoid this issue, do not suspend the system.
If you encounter this issue, power the system off and then on again.
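One way to keep the system from being suspended by accident is to mask the systemd sleep targets, so that suspend requests are refused. This is not part of the documented workaround; it is a sketch that assumes suspend requests on the system go through systemd. The function name `mask_sleep_targets` and the `DRY_RUN` variable are hypothetical; `systemctl mask` and the listed targets are standard systemd.

```shell
#!/bin/sh
# Standard systemd targets involved in suspend and hibernate.
SLEEP_TARGETS="sleep.target suspend.target hibernate.target hybrid-sleep.target"

# Hypothetical helper: mask the sleep targets so suspend requests fail.
# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 and
# run as root to apply it.
mask_sleep_targets() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "systemctl mask $SLEEP_TARGETS"
    else
        # Intentionally unquoted so each target is passed as its own argument.
        systemctl mask $SLEEP_TARGETS
    fi
}
```

The change can be reversed with `systemctl unmask` and the same list of targets, for example after updating to a release in which the issue is fixed.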