Known Issues

This section summarizes the known issues in the DGX Software for Red Hat Enterprise Linux.

Enable the DCGM systemd Service After Installation or Reboot

Issue

In DCGM 4.x, dcgm.service is no longer a standalone systemd unit. It is now an alias of the nvidia-dcgm.service systemd unit.

After you install DCGM 4.x, or after you reboot a system with DCGM 4.x already installed, the DCGM service might be inactive. For example:

$ sudo systemctl status nvidia-dcgm

- nvidia-dcgm.service - NVIDIA DCGM service
  Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; enabled; preset: disabled)
  Active: inactive (dead) since Wed 2025-03-05 11:00:29 PST; 3min 35s ago
  Duration: 3h 20min 29.595s
  Main PID: 3878817 (code=exited, status=0/SUCCESS)
  CPU: 5.563s

Workaround

If the DCGM systemd service is inactive, enable it and start it immediately:

sudo systemctl --now enable nvidia-dcgm

As seen in the following systemctl status command output, the DCGM service is now active:

$ sudo systemctl status nvidia-dcgm

- nvidia-dcgm.service - NVIDIA DCGM service
  Loaded: loaded (/usr/lib/systemd/system/nvidia-dcgm.service; enabled; preset: disabled)
  Active: active (running) since Wed 2025-03-05 11:05:31 PST; 16s ago
  Main PID: 80888 (nv-hostengine)
  Tasks: 9 (limit: 3355442)
  Memory: 16.5M
  CPU: 93ms
  CGroup: /system.slice/nvidia-dcgm.service
          └─80888 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

In systemctl commands, the unit names nvidia-dcgm and dcgm can be used interchangeably. See the Post-Install section in the DCGM User Guide for more information.
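
The check and the workaround can be combined so that the service is enabled only when it is actually inactive. The following is a minimal sketch, not an NVIDIA-provided script; ensure_dcgm_active is a hypothetical helper name, and the unit name defaults to nvidia-dcgm as used above.

```shell
# Sketch: enable and start the DCGM service only if it is not already active.
# ensure_dcgm_active is a hypothetical helper name, not part of DCGM.
ensure_dcgm_active() {
    local unit="${1:-nvidia-dcgm}"
    if systemctl is-active --quiet "$unit"; then
        echo "$unit is already active"
    else
        echo "$unit is inactive; enabling and starting it"
        sudo systemctl --now enable "$unit"
    fi
}
```

Calling ensure_dcgm_active with no argument checks nvidia-dcgm. Because dcgm is an alias of the same unit, passing dcgm as the argument should behave the same way.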

Red Hat Installer Fails to Delete Partitions

Issue

When performing disk partitioning with the Red Hat installation utility, the utility can fail to delete partitions that were previously used by other operating systems.

Workaround

To manually wipe the devices, perform the following steps.

  1. Click Done twice from the Manual Partitioning menu.

    You are returned to the main menu of the installation utility.

  2. Press Ctrl+Alt+F2 to use a shell in a different virtual console.

    If the installation utility does not respond to the keystrokes, add a hotkey that sends Ctrl+Alt+F2.

    1. Select Hot Keys > Add Hot Keys.

      The User Defined Macros window opens.

    2. Click Add.

      The Add Macro window opens.

    3. Press Ctrl+Alt+F2. Ensure that the key sequence appears in the text field and then click Insert.

    4. Click Close on the User Defined Macros window.

    5. Select Hot Keys > Ctrl+Alt+F2 to use a different virtual console.

    Run the remaining commands in the virtual console.

  3. Stop RAID devices.

    1. Run the lsblk command.

    2. If the command output includes any md devices, stop the devices:

      mdadm --stop /dev/md<device-id>
      
  4. Run the wipefs command for all the drives:

    wipefs -a /dev/nvme0n1
    wipefs -a /dev/nvme1n1
    ...
    
  5. Reboot the machine and restart the installation process.
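
Steps 3 and 4 can be sketched as a dry run that only prints the commands to review. This is an illustrative sketch under stated assumptions: plan_wipe is a hypothetical helper name, it assumes awk is available in the installer shell, and it lists every device that lsblk reports with TYPE disk, which can include the installation media. Remove any drive that must not be wiped before running the printed commands.

```shell
# Dry-run sketch: turn `lsblk -rno NAME,TYPE` output (read from stdin) into
# the mdadm and wipefs commands from steps 3 and 4. Nothing is executed.
# plan_wipe is a hypothetical helper name.
plan_wipe() {
    awk '
        # md devices appear once per member drive, so record each name once.
        $1 ~ /^md[0-9]+$/ { if (!seen[$1]++) md[++m] = $1; next }
        # Whole drives only; partitions have TYPE "part" and are skipped.
        $2 == "disk"      { disk[++d] = $1 }
        END {
            for (i = 1; i <= m; i++) print "mdadm --stop /dev/" md[i]
            for (i = 1; i <= d; i++) print "wipefs -a /dev/" disk[i]
        }'
}

# Usage (review the printed list before running anything from it):
#   lsblk -rno NAME,TYPE | plan_wipe
```

The RAID devices are printed first because an md device must be stopped before its member drives are wiped, matching the order of the steps above.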

Virtualization Not Supported

Issue

Virtualization technology, such as ESXi hypervisors or kernel-based virtual machines (KVM), is not an intended use case on DGX systems and has not been tested.

Excessive OpenSM Log Growth Causing DGX Systems to Become Inoperable

Issue

An exceptionally large /var/log/opensm.log file can cause DGX systems to become inoperable.

Explanation

Installing the DOCA-OFED software also installs the opensm package. By default, OpenSM is disabled. On systems where OpenSM is enabled, you can keep /var/log/opensm.log from growing too large in several ways, including:

  • Modify /etc/logrotate.d/opensm to specify the rotation interval, such as daily, weekly, or monthly.

  • Specify the size at which /var/log/opensm.log will be rotated.

Without rotation settings for /var/log/opensm.log that are appropriate for your system, the file can grow exceptionally large and could possibly cause the DGX system to become inoperable. For more information about configuring OpenSM log rotation, refer to NVIDIA Networking Software InfiniBand Cluster Bring-Up Procedure, SM Logs, and other NVIDIA OpenSM documentation.

For more information about OpenSM network topology, configuration, and enablement, refer to the NVIDIA OpenSM documentation.
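
To illustrate the two rotation methods listed above, the following sketch prints a logrotate stanza that combines a time interval with a size limit. It is an example only, not the file shipped with the opensm package: opensm_logrotate_stanza is a hypothetical helper, the size and rotation count are placeholder values, and copytruncate is an assumption made so that the example does not depend on how OpenSM reopens its log file.

```shell
# Print an illustrative logrotate stanza for the OpenSM log.
# opensm_logrotate_stanza is a hypothetical helper; the defaults below are
# placeholder values, not NVIDIA recommendations.
opensm_logrotate_stanza() {
    local max_size="${1:-100M}" keep="${2:-4}"
    cat <<EOF
/var/log/opensm.log {
    # Rotate weekly, or sooner if the file grows past maxsize.
    weekly
    maxsize ${max_size}
    # Keep this many rotated logs and compress them.
    rotate ${keep}
    compress
    missingok
    notifempty
    # Assumption: truncate in place instead of signalling OpenSM.
    copytruncate
}
EOF
}
```

Compare the printed stanza with the existing /etc/logrotate.d/opensm before changing anything. Running logrotate -d /etc/logrotate.d/opensm shows what logrotate would do with the current file without rotating any logs.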

CIFS Returns an Error after DOCA is Installed

Issue

After installing DOCA on a system where CIFS had previously been installed, any attempt to use CIFS fails; the following error is reported when trying to mount a CIFS filesystem:

$ sudo mount -t cifs -o <options> //SERVER_IP_OR_HOSTNAME/SHARE_NAME /MOUNT_POINT
mount error: cifs filesystem not supported by the system
mount error(19): No such device

The DOCA Framework Known Issues describe this behavior under Issue #2657392:

OFED installation caused CIFS to break in RHEL 8.4 and above. A dummy
module was added so that CIFS will be disabled after OFED installation in RHEL 8.4
and above.
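
To confirm that a failed mount is caused by this issue and not by the mount options, you can check whether the kernel has a cifs filesystem type registered. The following is a minimal sketch; fs_registered is a hypothetical helper name, and the optional second argument is an assumption added only so that the check can be run against a saved copy of /proc/filesystems.

```shell
# Return success if the named filesystem type is registered with the kernel.
# fs_registered is a hypothetical helper; the second argument exists only so
# the check can be pointed at a saved copy of /proc/filesystems.
fs_registered() {
    local fstype="$1" table="${2:-/proc/filesystems}"
    # The filesystem name is the last field on each line of the table.
    awk -v fs="$fstype" '$NF == fs { found = 1 } END { exit !found }' "$table"
}

# Usage: cifs is listed only after the cifs module has been loaded, so
# attempt the mount (or run `sudo modprobe cifs`) before checking.
#   if fs_registered cifs; then echo "cifs is available"; else echo "cifs is not registered"; fi
```

If cifs is not registered after a load attempt, modinfo -n cifs prints the path of the module file that the kernel would load, which can help identify whether it comes from the OFED installation.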

Workaround

There is no workaround. CIFS and DOCA cannot be installed at the same time.

NVSM and sosreport/doca-sosreport May Be Removed when DOCA is Uninstalled

Issue

The Installing NVIDIA DOCA-OFED section states: “Before installing a different version of DOCA-OFED software, you must remove the installed DOCA-OFED or MLNX_OFED software on your system, if it was previously installed.”

Uninstalling DOCA-OFED might also uninstall the NVSM package and the doca-sosreport or sosreport package, if any of these are installed.

Workaround

After you reinstall DOCA-OFED (see Installing DOCA-OFED), reinstall NVSM and either the doca-sosreport or sosreport package, as needed.
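
To know which of these packages to reinstall, it can help to record which ones are present before uninstalling DOCA-OFED. The following is a minimal sketch; installed_subset is a hypothetical helper name, and the package names in the usage comment are assumptions taken from the names used in this section, so verify them against your system.

```shell
# Print the subset of the given package names that rpm reports as installed.
# installed_subset is a hypothetical helper name.
installed_subset() {
    local pkg
    for pkg in "$@"; do
        # rpm -q exits with a non-zero status when the package is not installed.
        if rpm -q "$pkg" >/dev/null 2>&1; then
            echo "$pkg"
        fi
    done
}

# Usage (package names are assumptions; adjust them for your system):
#   installed_subset nvsm doca-sosreport sosreport > ~/before-doca-removal.txt
```

Run the same command again after DOCA-OFED is reinstalled and compare the two lists to see which packages were removed.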