Known Issues#

This section provides summaries of the issues in the DGX Software for Red Hat Enterprise Linux.

DGX Station A100 Fails to Boot After Applying MIG Configurations#

Issue

When MIG configurations are applied to a DGX station A100 system running DGX EL9-25.04 with the Release 570 GPU Driver installed, the system might fail to boot when you run the sudo reboot command. Resetting the GPUs by performing a DC power cycle does not recover the system.

Workaround

NVIDIA DGX Software for Red Hat Enterprise Linux 9 releases do not support the 570 release of the GPU Driver on DGX Station A100 systems.

To resolve the boot failure, reinstall DGX EL9-25.04, but do not set up MIG configurations. Alternatively, for systems other than the DGX B200, install a previous version of the GPU driver.

Enable the DCGM systemd Service After Installation or Reboot#

Issue

In DCGM 4.x, dcgm.service has been demoted from being a standalone systemd unit to being an alias of the nvidia-dcgm.service systemd unit.

After installation of DCGM 4.x or after a reboot with DCGM 4.x already installed, the DCGM service might be inactive. For instance,

$ sudo systemctl status nvidia-dcgm

- nvidia-dcgm.service - NVIDIA DCGM service
  Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; enabled; preset: disabled)
  Active: inactive (dead) since Wed 2025-03-05 11:00:29 PST; 3min 35s ago
  Duration: 3h 20min 29.595s
  Main PID: 3878817 (code=exited, status=0/SUCCESS)
  CPU: 5.563s

Workaround

If the DCGM systemd service is inactive, enable the DCGM systemd service and start it now:

sudo systemctl --now enable nvidia-dcgm

As seen in the following systemctl status command output, the DCGM service is now active:

$ sudo systemctl status nvidia-dcgm

- nvidia-dcgm.service - NVIDIA DCGM service
  Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; enabled; preset: disabled)
  Active: active (running) since Wed 2025-03-05 11:05:31 PST; 16s ago
  Main PID: 80888 (nv-hostengine)
  Tasks: 9 (limit: 3355442)
  Memory: 16.5M
  CPU: 93ms
  CGroup: /system.slice/nvidia-dcgm.service
          └─80888 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

In DCGM systemd systemctl commands, nvidia-dcgm and dcgm can be used interchangeably. See the Post-Install section in the DCGM User Guide for more information.

Symbolic Links Removed During nvidia-mig-manager Upgrade#

Issue

During the upgrade of the nvidia-mig-manager package to version 0.10.1, the following two symbolic links might be inadvertently removed.

/etc/nvidia-mig-manager/config.yaml
/etc/nvidia-mig-manager/hooks.yaml

Workaround

To resolve this issue, run the following commands.

sudo dnf remove nvidia-mig-manager
sudo dnf install nvidia-mig-manager

Red Hat Installer Fails to Delete Partitions#

Issue

When performing disk partitioning with the Red Hat installation utility, the utility can fail to delete partitions that were previously used by other operating systems.

Workaround

To manually wipe the devices, perform the following steps.

Click Done twice from the Manual Partitioning menu.

You returned to the main menu for the installation utility.
Press Ctrl+Alt+F2 to use a shell in a different virtual console.

If the installation utility does not respond to the keystrokes, add a hotkey that sends Ctrl+Alt+F2.
1. Select Hot Keys > Add Hot Keys.
  
  The User Defined Macros window opens.
2. Click Add.
  
  The Add Macro window opens.
3. Press Ctrl+Alt+F2. Ensure that the key sequence appears in the text field and then click Insert.
4. Click Close on the User Defined Macros window.
5. Select Hot Keys > Ctrl+Alt+F2 to use a different virtual console.
Run the remaining commands in the virtual console.
Stop RAID devices.
1. Run the lsblk command.
2. If the command output includes any md devices, stop the devices:
```
mdadm --stop /dev/md<device-id>
```

Run the wipefs command for all the drives:

wipefs -a /dev/nvme0n1
wipefs -a /dev/nvme1n1
...

Reboot the machine and restart the installation process.

Virtualization Not Supported#

Issue

Virtualization technology, such as ESXi hypervisors or kernel-based virtual machines (KVM), is not an intended use case on DGX systems and has not been tested.

NVSM Service and Fabric Manager Service Reported as Inactive#

Platform

DGX H100 System, A100 System, and A100 Station with EL9-24.06

Issue

After EL9-24.06 upgrade and a system reboot, the status of nvsm.service and nvidia-fabricmanager.service shows inactive (dead) when you run systemctl status nvsm and systemctl status nvidia-fabricmanager, respectively.

$ sudo systemctl status nvsm
...
nvsm.service - NVIDIA System Management service suite
  Loaded: loaded (/usr/lib/systemd/system/nvsm.service; enabled; preset: disabled)
  Active: inactive (dead)
...

$ sudo systemctl status nvidia-fabricmanager
...
nvidia-fabricmanager.service - NVIDIA fabric manager service
  Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: disabled)
  Active: inactive (dead)
...

Workaround

The nvsm.service service manages the start and stop of services running under NVSM. Because the NVSM services are operating normally and NVSM is fully functioning, you can ignore the inactive status of nvsm.service. To fix the nvsm.service status issue, run the systemctl start nvsm command after a system reboot.

However, the nvidia-fabricmanager.service service remains inactive. To resolve this issue, manually start the service by running the systemctl start nvidia-fabricmanager.service command.

Explanation

After a system reboot on DGX systems running the GPU Driver Release 550 or newer, nvsm.service and nvidia-fabricmanager.service appear inactive because systemd finds a dependency on nvidia-fabricmanager.service during startup. The circular dependency between nvsm.service and nvidia-fabricmanager.service makes one service wait for the other and prevents the services from starting.

Excessive OpenSM Log Growth Causing DGX Systems to Become Inoperable#

Issue

An exceptionally large /var/log/opensm.log file can cause DGX systems to become inoperable.

Explanation

During the installation process of the MLNX_OFED or DOCA OFED software, the opensm package is also installed. By default, OpenSM is disabled. On systems where OpenSM is enabled, the /etc/logrotate.d/opensm file should be configured to include the following options to manage the size of the opensm.log file:

The maximum size of log files for log rotation, such as maxsize 10M or maxsize 100M
The rotation duration, such as daily, weekly, or monthly

Not specifying the two configuration options might result in an exceptionally large /var/log/opensm.log file that can cause DGX systems to become inoperable. For more information about OpenSM network topology, configuration, and enablement, refer to the NVIDIA OpenSM documentation.

Reboot Hang after Configuring RAID#

Platform

DGX H100 System with EL9-23.08 and RHEL 9.1 or 9.2

Issue

After installing the DGX H100 Configurations group and configuring RAID with the sudo /usr/bin/configure_raid_array.py -c -f -5 command and rebooting, the system can hang and display console messages like the following example:

...
[ 1944.542085] md: md124 stopped.
[ 1944.545711] md: md124 stopped.
...

Workaround

Perform a power cycle to reboot the system successfully. The system boots normally on subsequent reboots.

Explanation

Before rebooting, this issue is triggered by a RAID State of active, degraded, recovering that can be displayed by running the sudo mdadm --detail /dev/mdXXX command. Replace XXX with the RAID array that you configured with the configure_raid_array.py command.

Refer to the following sample output:

$ sudo mdadm --detail /dev/md125
/dev/md125:
           Version : 1.2
     Creation Time : Wed Aug 30 11:39:08 2023
        Raid Level : raid5
        Array Size : 26254240768 (24.45 TiB 26.88 TB)
     Used Dev Size : 3750605824 (3.49 TiB 3.84 TB)
      Raid Devices : 8
     Total Devices : 8
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Aug 30 11:55:51 2023
             State : active, degraded, recovering
    Active Devices : 7
   Working Devices : 8
    Failed Devices : 0
     Spare Devices : 1

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

    Rebuild Status : 5% complete

              Name : nv-data-array
              UUID : 2dbe34c6:70decf1e:c54206a6:e78b9161
            Events : 204

    Number   Major   Minor   RaidDevice State
       0     259        1        0      active sync   /dev/nvme2n1
       1     259        3        1      active sync   /dev/nvme3n1
       2     259        6        2      active sync   /dev/nvme4n1
       3     259        7        3      active sync   /dev/nvme5n1
       4     259        9        4      active sync   /dev/nvme6n1
       5     259       13        5      active sync   /dev/nvme7n1
       6     259       14        6      active sync   /dev/nvme8n1
       8     259       15        7      spare rebuilding   /dev/nvme9n1

MOFED mlnxofedinstall reports “Current operation system is not supported” using RHEL 9.2#

Platform

EL9-23.01 and RHEL 9.2 with MLNX_OFED_LINUX-5.8-2.0.3.0-rhel9.1-x86_64.iso

Issue

When installing MLNX MOFED driver from the downloaded ISO using mlnxofedinstall --add-kernel-support the system generates a warning: “Current operation system is not supported!”

Workaround

Specify the last supported version of RHEL on the commandline by adding “–distro rhel9.1” `` mlnxofedinstall --distro rhel9.1 --add-kernel-support

Explanation

The current MLNX MOFED installer script can require the most recent supported OS to be specified by name if the OS is upgraded before the installer support is added for that OS version.

Precompiled GPU Driver 525 package is not available for Rocky 9.1#

Platform

Rocky 9.1 with EL9-23.01

Issue

The Pre-compiled GPU Driver might not support the installed Rocky Linux kernel.

Workaround

You can install the GPU driver by using the DKMS subsystem by running the following commands:

sudo dnf module reset -y nvidia-driver
sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
sudo dnf module install nvidia-driver:525-dkms

Yellow screen appears during RHEL 9.1 installation#

Issue

When installing the RedHat Enterprise Linux 9.1 ISO on a DGX Station V100, the first installation page shows a yellow screen. This can persist through the installation process and when complete.

Workaround

Install RedHat Enterprise Linux 9.0 on the DGX Station V100, then perform the over the air (OTA) update for the latest RHEL9 version and the DGX EL9-23.01 updates.

DGX A100: VBIOS cannot update due to running service processes#

Issue

VBIOS fails to update on Red Hat Enterprise Linux 9 because service(s)/process(es) are holding onto the resource about to be upgraded.

Workaround

The following services (system processes) must be stopped manually for the firmware update to start:

process nvidia-persistenced
process nv-hostengine
process cache_mgr_event
process cache_mgr_main
process dcgm_ipc

If xorg is holding the resources, try to stop it by running

sudo systemctl stop (display manager)

where the (display manager) can be acquired by

cat /etc/X11/default-display-manager

NVSM Unsupported Drive Error#

Issue

When running nvsm show storage, the NV-DRIVE-01 alert displays an “Unsupported Drive Configuration” message.

Workaround

The following services (system processes) must be stopped manually for the firmware update to start:

Create a config file to disable nvme multipath:

sudo sh -c 'echo "options nvme-core multipath=n" > /etc/modprobe.d/nvidia-nvme.conf'

Recreate the initramfs.

dracut --force /boot/initramfs-$(uname -r).img $(uname -r)

Reboot the system.
```
sudo systemctl reboot
```

The message might be displayed when you log in or when you run the nvsm show alert and nvsm show storage commands and can be safely ignored. This issue will be fixed in a future release.

Tuned profiles do not take effect in graphical mode#

Issue

DGX tuned profiles might not take effect due to a known Red Hat Enterprise Linux 9 issue. This affects systems that use a graphical target mode.

Workaround

This issue can be fixed by running the following commands:

Mask the power-profiles-daemon service then tuned is able to start during boot.
```
systemctl mask power-profiles-daemon
```
Reboot the system
```
reboot
```