Known Issues#
This section provides summaries of the issues in the DGX Software for Red Hat Enterprise Linux. For information about issues that were previously listed as known issues, but have been resolved, see Resolved Issues.
Enable the DCGM systemd Service After Installation or Reboot#
Issue
In DCGM 4.x, dcgm.service has been demoted from a standalone systemd unit to an alias of the nvidia-dcgm.service systemd unit.
After installation of DCGM 4.x or after a reboot with DCGM 4.x already installed, the DCGM service might be inactive. For instance,
$ sudo systemctl status nvidia-dcgm
- nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; enabled; preset: disabled)
Active: inactive (dead) since Wed 2025-03-05 11:00:29 PST; 3min 35s ago
Duration: 3h 20min 29.595s
Main PID: 3878817 (code=exited, status=0/SUCCESS)
CPU: 5.563s
Workaround
If the DCGM systemd service is inactive, enable the DCGM systemd service and start it now:
sudo systemctl --now enable nvidia-dcgm
As seen in the following systemctl status command output, the DCGM service is now active:
$ sudo systemctl status nvidia-dcgm
- nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; enabled; preset: disabled)
Active: active (running) since Wed 2025-03-05 11:05:31 PST; 16s ago
Main PID: 80888 (nv-hostengine)
Tasks: 9 (limit: 3355442)
Memory: 16.5M
CPU: 93ms
CGroup: /system.slice/nvidia-dcgm.service
└─80888 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
In DCGM systemd systemctl commands, nvidia-dcgm and dcgm can be used interchangeably.
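For example, because dcgm is an alias of nvidia-dcgm, the following two commands are equivalent:
sudo systemctl status dcgm
sudo systemctl status nvidia-dcgm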
See the Post-Install section in the DCGM User Guide for more information.
Symbolic Links Removed During nvidia-mig-manager Upgrade#
Issue
During the upgrade of the nvidia-mig-manager package to versions 0.10.1 through 0.12.1, the following two symbolic links might be inadvertently removed:
/etc/nvidia-mig-manager/config.yaml
/etc/nvidia-mig-manager/hooks.yaml
Workaround
To resolve this issue, run the following command:
sudo dnf reinstall nvidia-mig-manager
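To confirm that the symbolic links were restored, list them:
ls -l /etc/nvidia-mig-manager/config.yaml /etc/nvidia-mig-manager/hooks.yaml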
Red Hat Installer Fails to Delete Partitions#
Issue
When performing disk partitioning with the Red Hat installation utility, the utility can fail to delete partitions that were previously used by other operating systems.
Workaround
To manually wipe the devices, perform the following steps.
Click Done twice from the Manual Partitioning menu.
You are returned to the main menu of the installation utility.
Press Ctrl+Alt+F2 to use a shell in a different virtual console.
If the installation utility does not respond to the keystrokes, add a hotkey that sends Ctrl+Alt+F2.
Select Hot Keys > Add Hot Keys.
The User Defined Macros window opens.
Click Add.
The Add Macro window opens.
Press Ctrl+Alt+F2. Ensure that the key sequence appears in the text field and then click Insert.
Click Close on the User Defined Macros window.
Select Hot Keys > Ctrl+Alt+F2 to use a different virtual console.
Run the remaining commands in the virtual console.
Stop any RAID devices.
Run the lsblk command. If the command output includes any md devices, stop the devices:
mdadm --stop /dev/md<device-id>
Run the wipefs command for all the drives:
wipefs -a /dev/nvme0n1
wipefs -a /dev/nvme1n1
...
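If the system has many drives, a short shell loop can save typing. This sketch assumes that every NVMe namespace should be wiped, so verify the device list in the lsblk output first:
for dev in /dev/nvme*n1; do wipefs -a "$dev"; done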
Reboot the machine and restart the installation process.
Virtualization Not Supported#
Issue
Virtualization technology, such as ESXi hypervisors or kernel-based virtual machines (KVM), is not an intended use case on DGX systems and has not been tested.
NVSM Service and Fabric Manager Service Reported as Inactive#
Platform
DGX H100 System, A100 System, and A100 Station with EL9-24.06
Issue
After the EL9-24.06 upgrade and a system reboot, the status of nvsm.service and nvidia-fabricmanager.service shows inactive (dead) when you run systemctl status nvsm and systemctl status nvidia-fabricmanager, respectively.
$ sudo systemctl status nvsm
...
nvsm.service - NVIDIA System Management service suite
Loaded: loaded (/usr/lib/systemd/system/nvsm.service; enabled; preset: disabled)
Active: inactive (dead)
...
$ sudo systemctl status nvidia-fabricmanager
...
nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: disabled)
Active: inactive (dead)
...
Workaround
The nvsm.service service manages the start and stop of services running under NVSM. Because the NVSM services are operating normally and NVSM is fully functioning, you can ignore the inactive status of nvsm.service. To fix the nvsm.service status issue, run the systemctl start nvsm command after a system reboot.
However, the nvidia-fabricmanager.service service remains inactive. To resolve this issue, manually start the service by running the systemctl start nvidia-fabricmanager.service command.
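For example:
sudo systemctl start nvsm
sudo systemctl start nvidia-fabricmanager.service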
Explanation
After a system reboot on DGX systems running GPU Driver Release 550 or newer, nvsm.service and nvidia-fabricmanager.service appear inactive because systemd finds a dependency on nvidia-fabricmanager.service during startup. The circular dependency between nvsm.service and nvidia-fabricmanager.service makes one service wait for the other and prevents the services from starting.
Excessive OpenSM Log Growth Causing DGX Systems to Become Inoperable#
Issue
An exceptionally large /var/log/opensm.log file can cause DGX systems to become inoperable.
Explanation
During the installation process of the MLNX_OFED or DOCA OFED software, the opensm package is also installed. By default, OpenSM is disabled. On systems where OpenSM is enabled, the /etc/logrotate.d/opensm file should be configured to include the following options to manage the size of the opensm.log file:
The maximum size of log files for log rotation, such as maxsize 10M or maxsize 100M
The rotation duration, such as daily, weekly, or monthly
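For example, a minimal /etc/logrotate.d/opensm configuration that sets both options might look like the following sketch; the rotate count, compress, and missingok directives are illustrative additions, not required values:
/var/log/opensm.log {
    maxsize 10M
    daily
    rotate 7
    compress
    missingok
}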
Not specifying these two configuration options might result in an exceptionally large /var/log/opensm.log file that can cause DGX systems to become inoperable. For more information about OpenSM network topology, configuration, and enablement, refer to the NVIDIA OpenSM documentation.
Reboot Hang after Configuring RAID#
Platform
DGX H100 System with EL9-23.08 and RHEL 9.1 or 9.2
Issue
After installing the DGX H100 Configurations group, configuring RAID with the sudo /usr/bin/configure_raid_array.py -c -f -5 command, and rebooting, the system can hang and display console messages like the following example:
...
[ 1944.542085] md: md124 stopped.
[ 1944.545711] md: md124 stopped.
...
Workaround
Perform a power cycle to reboot the system successfully. The system boots normally on subsequent reboots.
Explanation
This issue is triggered when the system is rebooted while the RAID State is active, degraded, recovering, which can be displayed before rebooting by running the sudo mdadm --detail /dev/mdXXX command.
Replace XXX with the RAID array that you configured with the configure_raid_array.py command.
Refer to the following sample output:
$ sudo mdadm --detail /dev/md125
/dev/md125:
Version : 1.2
Creation Time : Wed Aug 30 11:39:08 2023
Raid Level : raid5
Array Size : 26254240768 (24.45 TiB 26.88 TB)
Used Dev Size : 3750605824 (3.49 TiB 3.84 TB)
Raid Devices : 8
Total Devices : 8
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Aug 30 11:55:51 2023
State : active, degraded, recovering
Active Devices : 7
Working Devices : 8
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Rebuild Status : 5% complete
Name : nv-data-array
UUID : 2dbe34c6:70decf1e:c54206a6:e78b9161
Events : 204
Number Major Minor RaidDevice State
0 259 1 0 active sync /dev/nvme2n1
1 259 3 1 active sync /dev/nvme3n1
2 259 6 2 active sync /dev/nvme4n1
3 259 7 3 active sync /dev/nvme5n1
4 259 9 4 active sync /dev/nvme6n1
5 259 13 5 active sync /dev/nvme7n1
6 259 14 6 active sync /dev/nvme8n1
8 259 15 7 spare rebuilding /dev/nvme9n1
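Because the hang is associated with rebooting while the array state is active, degraded, recovering, one way to avoid it may be to let the rebuild finish before rebooting. You can monitor the rebuild progress with:
watch -n 60 cat /proc/mdstat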
MOFED mlnxofedinstall Reports “Current operation system is not supported” on RHEL 9.2#
Platform
EL9-23.01 and RHEL 9.2 with MLNX_OFED_LINUX-5.8-2.0.3.0-rhel9.1-x86_64.iso
Issue
When installing the MLNX MOFED driver from the downloaded ISO by using mlnxofedinstall --add-kernel-support, the system generates a warning: “Current operation system is not supported!”
Workaround
Specify the last supported version of RHEL on the command line by adding --distro rhel9.1:
mlnxofedinstall --distro rhel9.1 --add-kernel-support
Explanation
The current MLNX MOFED installer script can require the most recent supported OS to be specified by name if the OS is upgraded before the installer support is added for that OS version.
Precompiled GPU Driver 525 package is not available for Rocky 9.1#
Platform
Rocky 9.1 with EL9-23.01
Issue
The precompiled GPU driver might not support the installed Rocky Linux kernel.
Workaround
You can install the GPU driver by using the DKMS subsystem by running the following commands:
sudo dnf module reset -y nvidia-driver
sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
sudo dnf module install nvidia-driver:525-dkms
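After the DKMS modules are built and the system is rebooted, you can verify that the driver loaded by running:
nvidia-smi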
Yellow screen appears during RHEL 9.1 installation#
Issue
When installing the Red Hat Enterprise Linux 9.1 ISO on a DGX Station V100, the first installation page shows a yellow screen. The yellow tint can persist throughout the installation process and after the installation completes.
Workaround
Install Red Hat Enterprise Linux 9.0 on the DGX Station V100, and then perform the over-the-air (OTA) update to the latest RHEL 9 version and the DGX EL9-23.01 updates.
DGX A100: VBIOS cannot update due to running service processes#
Issue
The VBIOS fails to update on Red Hat Enterprise Linux 9 because services or processes are holding the resource that is about to be upgraded.
Workaround
The following services (system processes) must be stopped manually for the firmware update to start:
process nvidia-persistenced
process nv-hostengine
process cache_mgr_event
process cache_mgr_main
process dcgm_ipc
If Xorg is holding the resources, try to stop it by running:
sudo systemctl stop <display-manager>
where <display-manager> can be determined by running:
cat /etc/X11/default-display-manager
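As a sketch, assuming the listed processes are managed by their usual systemd services (nv-hostengine by nvidia-dcgm, and the cache_mgr_event, cache_mgr_main, and dcgm_ipc processes by nvsm), the services can be stopped as follows:
sudo systemctl stop nvidia-persistenced
sudo systemctl stop nvidia-dcgm
sudo systemctl stop nvsm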
NVSM Unsupported Drive Error#
Issue
When running nvsm show storage, the NV-DRIVE-01 alert displays an “Unsupported Drive Configuration” message.
Workaround
To resolve this issue, perform the following steps:
Create a config file to disable nvme multipath:
sudo sh -c 'echo "options nvme-core multipath=n" > /etc/modprobe.d/nvidia-nvme.conf'
Recreate the initramfs.
sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
Reboot the system.
sudo systemctl reboot
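After the reboot, you can verify that NVMe multipath is disabled; the following command should print N:
cat /sys/module/nvme_core/parameters/multipath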
The message might be displayed when you log in or when you run the nvsm show alert and nvsm show storage commands and can be safely ignored. This issue will be fixed in a future release.
nvidia-peermem Cannot Be Loaded#
Issue
After installing DOCA, the GPU driver, and nvidia-peermem-loader and rebooting, nvidia-peermem cannot be loaded.
After rebooting, lsmod output shows that nvidia-peermem is not loaded:
lsmod | grep nvidia_peermem
[Nothing is returned.]
Workaround
No service automatically loads the nvidia-peermem module. To load the module automatically without error at boot time, perform the following steps:
Ensure that MLNX_OFED (see Installing NVIDIA MLNX_OFED) or DOCA-OFED (see Installing NVIDIA DOCA-OFED), the GPU driver (see Installing the GPU Driver), and nvidia-peermem-loader are installed.
After installing the nvidia-peermem-loader package, manually attempt to load nvidia-peermem:
sudo modprobe nvidia-peermem
Check whether nvidia-peermem is loaded:
lsmod | grep nvidia_peermem
If the modprobe command above reported the errors shown in Error messages indicating that the DOCA-host kernel does not match the running kernel, run the commands shown in Commands to rebuild and install the DOCA-host kernel modules to rebuild and install the DOCA-host kernel modules that match the running kernel. (See DOCA Extra Package and doca-kernel-support for more details.)
Note
If updates are made to the kernel in the future, the following commands will need to be run again in order for nvidia_peermem to load successfully.
Error messages indicating that the DOCA-host kernel does not match the running kernel#
Error message from the sudo modprobe nvidia-peermem command:
modprobe: ERROR: could not insert 'nvidia_peermem': Unknown symbol in module, or unknown parameter (see dmesg)
dmesg error messages:
$ dmesg | grep peermem
[<timestamp>] nvidia_peermem: Unknown symbol ib_register_peer_memory_client (err -2)
[<timestamp>] nvidia_peermem: Unknown symbol ib_unregister_peer_memory_client (err -2)
Commands to rebuild and install the DOCA-host kernel modules#
Install doca-extra:
sudo dnf install -y doca-extra
Execute the doca-kernel-support script, which rebuilds and installs the DOCA-host kernel modules that match the running kernel:
/opt/mellanox/doca/tools/doca-kernel-support
Verify that a new /tmp/DOCA.* directory was created in /tmp:
ls -l /tmp/DOCA.*
Run the following commands to complete installing the DOCA-host kernel modules that match the running kernel:
TMP_DOCA_DIR=$(ls -td /tmp/DOCA* | head -1)
sudo rpm -ivh ${TMP_DOCA_DIR}/doca-kernel-repo*.rpm
sudo dnf makecache
sudo dnf install doca-ofed-userspace
DOCA_KERNEL=$(sudo dnf list | grep "^doca-kernel-[0-9].*noarch" | awk ' {print $1}')
sudo dnf install --disablerepo=doca $DOCA_KERNEL
sudo reboot
Check that nvidia-peermem is loaded correctly by running:
lsmod | grep nvidia_peermem
Unable to Reboot System on RHEL 9.6#
Issue
On RHEL 9.6, running sudo reboot to reboot a system that uses MD RAID devices may hang the system. This issue will be fixed in a future release.
Workaround
To work around this issue, use the following command instead:
sudo reboot -f
NVIDIA GPUDirect Storage (GDS) 1.7 for CUDA 12.2 Is Not Supported by DGX EL9-25.08#
Issue
Due to changes in RHEL 9.x kernels, DGX EL9-25.08 is not compatible with nvidia-fs 2.17. Because of this incompatibility, DGX EL9-25.08 does not support GDS 1.7 for CUDA 12.2.
Older versions of GDS, such as 1.7, are now uncommon; this issue typically arises in installations created from local repositories, a process described in Installing with Local Repositories.
Workaround
To work around the issue, either:
Install the latest version of GDS from NVIDIA/gds-nvidia-fs, or
Upgrade to CUDA 12.9 or later
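If you are unsure which nvidia-fs version an installation carries, a quick check (assuming the nvidia_fs kernel module is installed) is:
modinfo nvidia_fs | grep ^version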
H100 and H200 Platforms: Serial Console Is Unusable Due to Incorrect Parameter#
Issue
On H100 and H200 platforms, the serial console is unusable because the console kernel parameter is set incorrectly.
Workaround
On H100 and H200 platforms, when you are installing the DGX Software by following the steps in Installing Required Components, do the following after you perform the Install DGX tools and configuration files step that installs the DGX [platform-type] Configurations group and before rebooting:
On the H100 platform, modify /etc/tuned/dgx-h100-performance/tuned.conf and /etc/tuned/dgx-h100-no-mitigations/tuned.conf.
On the H200 platform, modify /etc/tuned/dgx-h200-performance/tuned.conf and /etc/tuned/dgx-h200-no-mitigations/tuned.conf.
Run the commands below for the platform that you are installing to update the console parameter in the H100 or H200 tuned.conf profiles, replacing console=ttyS1,115200n8 with console=ttyS0,115200n8.
On the H100 platform:
sudo sed -i 's/console=ttyS1/console=ttyS0/' /etc/tuned/dgx-h100-performance/tuned.conf
sudo sed -i 's/console=ttyS1/console=ttyS0/' /etc/tuned/dgx-h100-no-mitigations/tuned.conf
sudo tuned-adm profile dgx-h100-performance
On the H200 platform:
sudo sed -i 's/console=ttyS1/console=ttyS0/' /etc/tuned/dgx-h200-performance/tuned.conf
sudo sed -i 's/console=ttyS1/console=ttyS0/' /etc/tuned/dgx-h200-no-mitigations/tuned.conf
sudo tuned-adm profile dgx-h200-performance
Note
If the above commands are not run during installation, they can be run at any time to enable the serial console. After the commands are run, a reboot is required for the serial console enablement to take effect.
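After rebooting, you can confirm that the kernel picked up the updated parameter by inspecting the kernel command line:
grep -o 'console=[^ ]*' /proc/cmdline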
GPUs visible in lspci, but missing in nvidia-smi#
Issue
On B200 systems, lspci output shows all GPUs on a system, but the nvidia-smi command output may be missing some of the GPUs.
Workaround
Shut down and restart the system, after which the nvidia-smi command output may show all the GPUs in the system. This issue is still being investigated.
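To compare the two views, you can count NVIDIA devices on the PCI bus against the GPUs the driver reports. The vendor-ID filter below matches every NVIDIA PCI function, including non-GPU functions, so treat the numbers as approximate:
lspci -d 10de: | wc -l
nvidia-smi -L | wc -l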