Known Issues

Refer to the following known issues with NVIDIA DGX OS 6 and NVIDIA DGX Systems.

Reduced Network Communication Speeds on DGX H100 System

Issue

On DGX H100 Systems running DGX OS 6, the MAX_ACC_OUT_READ configuration for the ConnectX-7 controllers can be too low and result in reduced data transfer rates.

This issue is fixed in DGX OS 6.1.

Explanation

For DGX H100 Systems, the value is set to 44 by the nvidia-mlnx-config service.

You can check the value by using the mlxconfig or mstconfig command to query each device and view the value of the MAX_ACC_OUT_READ configuration.
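
As a minimal sketch of that check, assuming the MLNX_OFED tools are installed and that mlxconfig accepts the device names found under /sys/class/infiniband (as the workaround loop in this section does), the following helper queries each device. The function name query_max_acc_out_read and its optional directory argument are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: print MAX_ACC_OUT_READ for every InfiniBand device.
# The optional argument overrides the sysfs directory (useful for dry runs).
query_max_acc_out_read() {
  ib_dir="${1:-/sys/class/infiniband}"
  for path in "${ib_dir}"/*; do
    [ -e "${path}" ] || continue   # glob matched nothing: no devices found
    dev="${path##*/}"              # strip the directory, keep the device name
    echo "== ${dev} =="
    sudo mlxconfig -d "${dev}" query MAX_ACC_OUT_READ
  done
}

query_max_acc_out_read
```

With the inbox drivers, mstconfig and the PCI addresses under /sys/bus/pci/devices would take the place of mlxconfig and the device names.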

Workaround

Perform the following steps to set the value to 128 and disable the nvidia-mlnx-config service.

  1. Stop the service:

    sudo systemctl stop nvidia-mlnx-config
    
  2. Disable the service so that it does not start at the next boot and revert the configuration:

    sudo systemctl disable nvidia-mlnx-config
    
  3. Set the configuration based on whether your system uses the MLNX_OFED drivers or the Inbox drivers.

    • MLNX_OFED drivers

      for dev in $(ls /sys/class/infiniband/); do \
        sudo mlxconfig -y -d ${dev} set ADVANCED_PCI_SETTINGS=1; \
        sudo mlxconfig -y -d ${dev} set MAX_ACC_OUT_READ=128; \
      done
      
    • Inbox OFED drivers

      for bdf in $(ls /sys/bus/pci/devices); do \
        if [[ -e "/sys/bus/pci/devices/${bdf}/infiniband" ]]; then \
          sudo mstconfig -y -d "${bdf}" set ADVANCED_PCI_SETTINGS=1; \
          sudo mstconfig -y -d "${bdf}" set MAX_ACC_OUT_READ=128; \
        fi \
      done
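
After steps 1 and 2, it can be useful to confirm that the service is neither running nor enabled. The following is a sketch only; the helper name check_mlnx_config_service is hypothetical, and it relies on the standard systemctl is-active and is-enabled subcommands.

```shell
# Hypothetical helper: succeed only if nvidia-mlnx-config is neither
# running nor enabled for the next boot.
check_mlnx_config_service() {
  svc="nvidia-mlnx-config"
  # is-active and is-enabled exit non-zero for inactive or disabled units,
  # so "|| true" keeps the function going under "set -e".
  active="$(systemctl is-active "${svc}" 2>/dev/null || true)"
  enabled="$(systemctl is-enabled "${svc}" 2>/dev/null || true)"
  echo "${svc}: active=${active:-unknown} enabled=${enabled:-unknown}"
  [ "${active}" != "active" ] && [ "${enabled}" = "disabled" ]
}

check_mlnx_config_service || echo "nvidia-mlnx-config still needs attention"
```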
      

NVSM Raises Alerts for Missing Devices on DGX H100 System

Issue

NVSM reports the following missing devices on NVIDIA DGX H100 Systems:

/systems/localhost/pcie/alerts/alert0
    message_details = Device is missing on b1:00.1.
    ...

/systems/localhost/pcie/alerts/alert1
    message_details = Device is missing on b1:00.0.
    ...

/systems/localhost/pcie/alerts/alert2
    message_details = Device is missing on 0b:00.1.

Explanation

NVSM version 22.12.02 lists the preceding devices as resources in the DGX H100 System configuration file, but the devices are not present in the system at the reported PCI IDs.

You can ignore the alerts for these PCI IDs. The alerts are false positives.
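
To separate these false positives from alerts that do need attention, a small text filter can help. The sketch below is not part of NVSM; the function name filter_known_false_positives is hypothetical, and it matches only the three PCI IDs listed in this issue.

```shell
# Hypothetical filter: read alert text on stdin and print only the
# message_details lines that are NOT one of the three known false positives.
filter_known_false_positives() {
  grep 'message_details' |
    grep -v -E 'Device is missing on (b1:00\.0|b1:00\.1|0b:00\.1)\.' || true
}

# Example input: two alerts from this issue plus one made-up alert.
printf '%s\n' \
  '    message_details = Device is missing on b1:00.1.' \
  '    message_details = Device is missing on 0b:00.1.' \
  '    message_details = Example alert that is not a known false positive.' |
  filter_known_false_positives
```

On a live system the input could come from a command such as sudo nvsm show alerts, assuming that command is available in your NVSM version.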

Workaround

After upgrading to a newer version of NVSM, the alerts can persist in local storage. Perform the following steps to remove the NVSM alert database from local storage:

  1. Stop the NVSM service:

    sudo systemctl stop nvsm
    
  2. Delete the alert database from local storage:

    sudo rm /var/lib/nvsm/sqlite/nvsm.db
    
  3. Start the NVSM service:

    sudo systemctl start nvsm
    

DGX A800 Station/Server: mig-parted config

Issue

DGX Station A800 is not currently supported in the all-balanced configuration of the default mig-parted config file.

Workaround

To add the A800 device ID to the all-balanced configuration:

  1. Make a copy of the default configuration.

  2. Add device ID 0x20F310DE to the device-filter of the all-balanced config.

  3. Point mig-parted apply at this new file when selecting a config.
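
The steps above can be sketched as follows. Both paths are assumptions (the location of the default config file is not stated here), and the helper has_device_id is hypothetical. The helper assumes that the named configs are indented by two spaces and that the device-filter is written on a single line.

```shell
# Assumed paths: the source path is a guess at where the default config
# lives; the destination is an arbitrary example. Adjust both.
SRC="${MIG_CONFIG_SRC:-/etc/nvidia-mig-manager/config.yaml}"
DST="${MIG_CONFIG_DST:-${HOME}/mig-config-a800.yaml}"

# Hypothetical helper: succeed if FILE lists ID in a device-filter line of
# its all-balanced section.
has_device_id() {
  awk -v id="$2" '
    /^  [^ #-][^:]*:/ { in_section = ($1 == "all-balanced:") }
    in_section && /device-filter/ && index($0, id) { found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Step 1: cp "${SRC}" "${DST}"
# Step 2: edit "${DST}" by hand and add "0x20F310DE" to the device-filter
# Step 3 (after the check below passes):
#   sudo nvidia-mig-parted apply -f "${DST}" -c all-balanced
if [ -f "${DST}" ] && has_device_id "${DST}" 0x20F310DE; then
  echo "${DST}: A800 device ID present in all-balanced"
fi
```

Checking the edited copy before applying it catches the case where the ID was added to a different named config by mistake.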

Erroneous Insufficient Power Error May Occur for PCIe Slots

Issue

Reported in release 4.99.9.

The DGX A100 server reports “Insufficient power” on PCIe slots when network cables are connected.

Explanation

This can occur with optical cables. It indicates that the calculated power draw of the card plus two optical cables is higher than what the PCIe slot can provide.

The message can be ignored.

Applications that call the cuCtxCreate API Might Experience a Performance Drop

Issue

Reported in release 5.0.

When some applications call cuCtxCreate, cuGLCtxCreate, or cuCtxDestroy, there might be a drop in performance.

Explanation

This issue occurs with Ubuntu 22.04, but not with previous versions. The issue affects applications that perform graphics/compute interoperation, applications that have a plugin mechanism for CUDA where every plugin creates its own context, and video streaming applications that need computation. Examples include ffmpeg, Blender, simpleDrvRuntime, and cuSolverSp_LinearSolver.

This issue is not expected to impact deep learning training.

Incorrect nvidia-container-toolkit version after upgrade from 5.X to 6.0

Issue

Reported in release 6.0.

Following a release upgrade from DGX OS 5.X to DGX OS 6.X, the version of nvidia-container-toolkit is not updated to the 6.X version.

Explanation

During the release upgrade process, the older version of nvidia-container-toolkit has a higher priority than the newer version, so it is not updated.

The workaround is to run sudo apt update and sudo apt upgrade again after nvidia-release-upgrade completes.
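
A minimal sketch of the additional update and upgrade, followed by a version check, is shown below. The function names toolkit_versions and finish_release_upgrade are hypothetical. The check parses the Installed and Candidate fields that apt-cache policy prints.

```shell
# Hypothetical helper: read "apt-cache policy" output on stdin and print
# the installed and candidate versions.
toolkit_versions() {
  awk '$1 == "Installed:" { print "installed=" $2 }
       $1 == "Candidate:" { print "candidate=" $2 }'
}

# Hypothetical wrapper: run the additional update and upgrade, then report
# the nvidia-container-toolkit versions.
finish_release_upgrade() {
  sudo apt update && sudo apt upgrade &&
    apt-cache policy nvidia-container-toolkit | toolkit_versions
}
```

Run finish_release_upgrade after nvidia-release-upgrade has finished. If the installed and candidate versions match afterwards, no newer package is pending.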

UBSAN error and mstconfig stack dump in kernel logs at boot

Issue

Reported in release 6.0.

On boot, an error message similar to UBSAN: shift-out-of-bounds in /build/linux-nvidia-s96GJ3/linux-nvidia-5.15.0/debian/build/build-nvidia may appear, followed by a stack trace.

Explanation

This warning is generated when mstconfig closes a device file. The warning can be safely ignored and will be fixed in a future release.

The BMC Redfish interface is not active on first boot after installation

Issue

Reported in release 6.0.

When the system is first booted after installation, the BMC Redfish network interface is in a DOWN state and does not appear when running the ifconfig command.

Explanation

The interface is reconfigured after the system has already brought up interfaces, so it doesn’t get automatically started.

Running sudo netplan apply starts the interface. The interface also starts automatically on all subsequent boots, even without running sudo netplan apply again.
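
As a sketch of that first-boot check, the helper below reads the operational state of an interface from sysfs and runs sudo netplan apply only when the interface is not up. The function name ensure_iface_up and the REDFISH_IFACE variable are hypothetical; the name of the BMC Redfish interface is not given here, so supply the one used on your system. Some interface types report a state other than up even when working, so treat the result as a hint.

```shell
# Hypothetical helper: ensure_iface_up IFACE [SYSFS_NET_DIR]
# Runs "sudo netplan apply" unless the interface already reports "up".
ensure_iface_up() {
  iface="$1"
  net_dir="${2:-/sys/class/net}"
  # operstate is absent when the interface does not exist at all.
  state="$(cat "${net_dir}/${iface}/operstate" 2>/dev/null || echo missing)"
  if [ "${state}" = "up" ]; then
    echo "${iface}: up"
  else
    echo "${iface}: ${state}, running netplan apply"
    sudo netplan apply
  fi
}

# Runs only when REDFISH_IFACE is set to your interface name.
if [ -n "${REDFISH_IFACE:-}" ]; then
  ensure_iface_up "${REDFISH_IFACE}"
fi
```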