Known Issues
This section provides summaries of the issues in the DGX Software for Red Hat Enterprise Linux.
Reboot Hang after Configuring RAID
Platform
DGX H100 System with EL9-23.08 and RHEL 9.1 or 9.2
Issue
After installing the DGX H100 Configurations group, configuring RAID with the sudo /usr/bin/configure_raid_array.py -c -f -5 command, and rebooting, the system can hang and display console messages like the following example:
...
[ 1944.542085] md: md124 stopped.
[ 1944.545711] md: md124 stopped.
...
Workaround
Perform a power cycle to reboot the system successfully. The system boots normally on subsequent reboots.
Explanation
This issue is triggered when the RAID array is in the active, degraded, recovering state at the time of the reboot. To display the state before rebooting, run the sudo mdadm --detail /dev/mdXXX command. Replace XXX with the number of the RAID array that you configured with the configure_raid_array.py command.
Refer to the following sample output:
$ sudo mdadm --detail /dev/md125
/dev/md125:
Version : 1.2
Creation Time : Wed Aug 30 11:39:08 2023
Raid Level : raid5
Array Size : 26254240768 (24.45 TiB 26.88 TB)
Used Dev Size : 3750605824 (3.49 TiB 3.84 TB)
Raid Devices : 8
Total Devices : 8
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Aug 30 11:55:51 2023
State : active, degraded, recovering
Active Devices : 7
Working Devices : 8
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Rebuild Status : 5% complete
Name : nv-data-array
UUID : 2dbe34c6:70decf1e:c54206a6:e78b9161
Events : 204
Number Major Minor RaidDevice State
0 259 1 0 active sync /dev/nvme2n1
1 259 3 1 active sync /dev/nvme3n1
2 259 6 2 active sync /dev/nvme4n1
3 259 7 3 active sync /dev/nvme5n1
4 259 9 4 active sync /dev/nvme6n1
5 259 13 5 active sync /dev/nvme7n1
6 259 14 6 active sync /dev/nvme8n1
8 259 15 7 spare rebuilding /dev/nvme9n1
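The state check described above can be scripted before a reboot. The following is a minimal sketch and not part of the DGX software: the helper names raid_state and state_is_recovering are hypothetical, and the sketch only reports the state. Whether waiting for the rebuild to finish avoids the hang is an assumption that this section does not confirm.

```shell
# Hypothetical helpers (not shipped with DGX software).

# Read `mdadm --detail` output on stdin and print the value of the
# "State :" field. The device table header also contains the word
# "State", but it does not start the line, so it is not matched.
raid_state() {
    sed -n -e 's/[[:space:]]*$//' \
           -e 's/^[[:space:]]*State[[:space:]]*:[[:space:]]*//p'
}

# Succeed if the given state string includes "recovering".
state_is_recovering() {
    case "$1" in
        *recovering*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (replace md125 with the array that you configured):
#   state=$(sudo mdadm --detail /dev/md125 | raid_state)
#   if state_is_recovering "$state"; then
#       echo "Array is still rebuilding: $state"
#   fi
```

With the sample output above, raid_state prints active, degraded, recovering, and state_is_recovering succeeds.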
MOFED mlnxofedinstall reports “Current operation system is not supported” using RHEL 9.2
Platform
EL9-23.01 and RHEL 9.2 with MLNX_OFED_LINUX-5.8-2.0.3.0-rhel9.1-x86_64.iso
Issue
When installing the MLNX MOFED driver from the downloaded ISO by using mlnxofedinstall --add-kernel-support, the system generates the following warning: “Current operation system is not supported!”
Workaround
Specify the last supported version of RHEL on the command line by adding --distro rhel9.1:
mlnxofedinstall --distro rhel9.1 --add-kernel-support
Explanation
If the OS is upgraded to a version that the current MLNX MOFED installer script does not yet support, the installer can require you to specify the most recent supported OS version by name.
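The decision to pass --distro can be sketched as a version comparison. This is an illustration only: the function name mofed_distro_arg is hypothetical, and the last supported version (9.1 here, taken from the ISO name in this entry) is something you supply yourself. The installer is not queried for it. The sketch relies on sort -V from GNU coreutils.

```shell
# Hypothetical helper: print "--distro rhel<supported>" only when the
# running RHEL version is newer than the last version the installer
# supports. Prints nothing otherwise.
#   $1 = running version (for example, VERSION_ID from /etc/os-release)
#   $2 = last supported version (an assumption that you provide)
mofed_distro_arg() {
    running=$1
    supported=$2
    # sort -V orders version strings numerically, so 9.10 sorts after 9.9.
    newest=$(printf '%s\n%s\n' "$running" "$supported" | sort -V | tail -n 1)
    if [ "$running" != "$supported" ] && [ "$newest" = "$running" ]; then
        printf '%s\n' "--distro rhel$supported"
    fi
}

# Example use from the mounted ISO directory:
#   . /etc/os-release
#   ./mlnxofedinstall $(mofed_distro_arg "$VERSION_ID" 9.1) --add-kernel-support
```

On RHEL 9.2 with 9.1 as the last supported version, the helper prints --distro rhel9.1. On RHEL 9.1 it prints nothing, so the installer runs without the override.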
Precompiled GPU Driver 525 package is not available for Rocky 9.1
Platform
Rocky 9.1 with EL9-23.01
Issue
The precompiled GPU driver might not support the installed Rocky Linux kernel.
Workaround
Install the GPU driver through the DKMS subsystem by running the following commands:
sudo dnf module reset -y nvidia-driver
sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
sudo dnf module install nvidia-driver:525-dkms
Yellow screen appears during RHEL 9.1 installation
Issue
When installing the Red Hat Enterprise Linux 9.1 ISO on a DGX Station V100, the first installation page shows a yellow screen. The yellow screen can persist throughout the installation and after it completes.
Workaround
Install Red Hat Enterprise Linux 9.0 on the DGX Station V100, then perform the over-the-air (OTA) update to the latest RHEL 9 version and the DGX EL9-23.01 updates.
DGX A100: VBIOS cannot update due to running service processes
Issue
VBIOS fails to update on Red Hat Enterprise Linux 9 because one or more services or processes are holding the resource that is about to be updated.
Workaround
The following services (system processes) must be stopped manually for the firmware update to start:
process nvidia-persistenced
process nv-hostengine
process cache_mgr_event
process cache_mgr_main
process dcgm_ipc
If xorg is holding the resources, try to stop it by running the following command:
sudo systemctl stop (display manager)
Replace (display manager) with the name returned by the following command:
cat /etc/X11/default-display-manager
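Before starting the firmware update, it can help to confirm which of the listed processes are still running. The following is a minimal sketch with a hypothetical helper name, blocking_processes. It reads one process name per line on stdin, for example from ps -eo comm=. Because the kernel limits a process name to 15 characters, the helper also matches the truncated form of longer names such as nvidia-persistenced. How each process is stopped is not covered by the sketch.

```shell
# Hypothetical helper: read process names (one per line) on stdin and
# print every name from the workaround list that is present.
blocking_processes() {
    running=$(cat)
    for name in nvidia-persistenced nv-hostengine cache_mgr_event \
                cache_mgr_main dcgm_ipc; do
        # Kernel process names are capped at 15 characters, so also
        # compare against the truncated form.
        short=$(printf '%.15s' "$name")
        if printf '%s\n' "$running" | grep -qxF -e "$name" -e "$short"; then
            printf '%s\n' "$name"
        fi
    done
}

# Example use:
#   ps -eo comm= | blocking_processes
```

An empty result suggests that none of the listed processes is still running.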
NVSM Unsupported Drive Error
Issue
When running nvsm show storage, the NV-DRIVE-01 alert displays an “Unsupported Drive Configuration” message.
Workaround
Disable NVMe multipath by performing the following steps:
Create a config file to disable nvme multipath:
sudo sh -c 'echo "options nvme-core multipath=n" > /etc/modprobe.d/nvidia-nvme.conf'
Recreate the initramfs.
sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
Reboot the system.
sudo systemctl reboot
The message might be displayed when you log in or when you run the nvsm show alert and nvsm show storage commands and can be safely ignored. This issue will be fixed in a future release.
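After the reboot, you can check whether the multipath setting took effect. The sketch below assumes that the nvme_core module exposes its multipath parameter at /sys/module/nvme_core/parameters/multipath, reporting Y or N. The helper name nvme_multipath_state is hypothetical.

```shell
# Hypothetical helper: report whether NVMe multipath is enabled,
# based on the nvme_core module parameter in sysfs.
#   $1 = parameter file (optional; defaults to the sysfs path)
nvme_multipath_state() {
    param_file=${1:-/sys/module/nvme_core/parameters/multipath}
    if [ ! -r "$param_file" ]; then
        echo "unknown"
        return 1
    fi
    case "$(cat "$param_file")" in
        N) echo "disabled" ;;
        Y) echo "enabled" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Example use:
#   nvme_multipath_state
```

A result of disabled indicates that the options nvme-core multipath=n setting was applied through the rebuilt initramfs.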
Tuned profiles do not take effect in graphical mode
Issue
DGX tuned profiles might not take effect due to a known Red Hat Enterprise Linux 9 issue. This affects systems that use a graphical target mode.
Workaround
This issue can be fixed by running the following commands:
Mask the power-profiles-daemon service so that tuned is able to start during boot.
sudo systemctl mask power-profiles-daemon
Reboot the system.
sudo reboot
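After the reboot, the result can be checked with systemctl is-enabled and tuned-adm active. The sketch below uses hypothetical helper names and assumes that tuned-adm active prints a line beginning with "Current active profile:" and that systemctl is-enabled prints masked or masked-runtime for a masked unit.

```shell
# Hypothetical helpers for verifying the workaround.

# Read `tuned-adm active` output on stdin and print the profile name.
active_tuned_profile() {
    sed -n 's/^Current active profile: //p'
}

# Succeed if the given `systemctl is-enabled` output shows a masked unit.
unit_is_masked() {
    [ "$1" = "masked" ] || [ "$1" = "masked-runtime" ]
}

# Example use:
#   unit_is_masked "$(systemctl is-enabled power-profiles-daemon)" \
#       && echo "power-profiles-daemon is masked"
#   tuned-adm active | active_tuned_profile
```

If the second command prints no profile name, tuned is probably not active and the workaround might not have taken effect.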