Known Issues#

DGX System Device ID Not Found in /usr/share/misc/pci.ids#

Issue#

When you run the following command to apply the default mig-parted configuration, the nvidia-mig-parted tool issues warnings about failing to find the device ID for the DGX system:

$ sudo nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config-default.yaml -c all-balanced -k /etc/nvidia-mig-manager/hooks-default.yaml

2024/09/05 01:00:00 WARNING: unable to get device name: [failed to find device with id '22a3']
2024/09/05 01:00:00 WARNING: unable to get device name: [failed to find device with id '22a3']
2024/09/05 01:00:00 WARNING: unable to get device name: [failed to find device with id '22a3']

Workaround#

Update the system with the current version of the PCI ID list by running the update-pciids command:

sudo update-pciids
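
To confirm that the update added the missing entry, you can search the PCI ID list for the device ID from the warning. The following is a minimal sketch: pci_id_known is a hypothetical helper, not part of nvidia-mig-parted, and it assumes the standard pci.ids layout in which device entries are indented with one tab.

```shell
# Hypothetical helper: succeed if a device ID is listed in a pci.ids file.
# Assumes device entries look like "<TAB>22a3  <device name>".
# Device IDs are only unique per vendor, so a match under another vendor
# is possible; treat the result as a quick sanity check.
pci_id_known() (
    id="$1"
    ids_file="${2:-/usr/share/misc/pci.ids}"
    tab=$(printf '\t')
    grep -qi "^${tab}${id} " "$ids_file"
)

# Example: check for the ID from the warning after running update-pciids.
# pci_id_known 22a3 && echo "22a3 is listed" || echo "22a3 is still missing"
```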

Virtualization Not Supported#

Issue#

Virtualization technology, such as ESXi hypervisors or kernel-based virtual machines (KVM), is not an intended use case on DGX systems and has not been tested.

Excessive Growth of OpenSM Log Causing DGX Systems to Become Inoperable#

Issue#

An exceptionally large /var/log/opensm.log file can cause DGX systems to become inoperable.

Explanation#

During the installation process of the MLNX_OFED or DOCA OFED software, the opensm package is also installed. By default, OpenSM is disabled. On systems where OpenSM is enabled, the /etc/logrotate.d/opensm file should be configured to include the following options to manage the size of the opensm.log file:

  • The maximum size of log files for log rotation, such as maxsize 10M or maxsize 100M

  • The rotation duration, such as daily, weekly, or monthly

Not specifying the two configuration options might result in an exceptionally large /var/log/opensm.log file that can cause DGX systems to become inoperable. For more information about OpenSM network topology, configuration, and enablement, refer to the NVIDIA OpenSM documentation.
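
As an illustration, the following sketch checks whether a logrotate file contains both options. logrotate_has_limits is a hypothetical helper, and the stanza in the comment is only an example of the two options described above, not a recommended configuration.

```shell
# Example stanza with both options (values are illustrative):
#
#   /var/log/opensm.log {
#       weekly
#       maxsize 100M
#   }

# Hypothetical helper: succeed only if the file sets a maxsize and a
# rotation duration (daily, weekly, or monthly).
logrotate_has_limits() (
    conf="${1:-/etc/logrotate.d/opensm}"
    grep -Eq '^[[:space:]]*maxsize[[:space:]]+[0-9]+[kMG]?[[:space:]]*$' "$conf" &&
    grep -Eq '^[[:space:]]*(daily|weekly|monthly)[[:space:]]*$' "$conf"
)

# Example:
# logrotate_has_limits || echo "opensm log rotation limits are not configured"
```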

Incorrect DCGM Version After Upgrade from 5.X to 6.2.0#

Issue#

During the upgrade process from DGX OS 5.X to DGX OS 6.2.0, the version of the NVIDIA® Data Center GPU Manager (DCGM) is not updated to the latest version.

Workaround#

When you upgrade from DGX OS 5.X to DGX OS 6.2.0, DCGM might not be upgraded automatically. To ensure that the latest DCGM version is installed, run the following commands after the upgrade and reboot are complete:

sudo apt update
sudo apt upgrade

Errors Occur When Loading Mirrored Repositories on Air-Gapped Systems#

Issue#

When you run the apt update command to load mirrored repositories on an air-gapped system, the following error messages appear:

File not found - /media/repository/mirror/security.ubuntu.com/ubuntu/dists/jammy-security/main/cnf/Commands-amd64 (2: No such file or directory)
Failed to fetch file:/media/repository/mirror/security.ubuntu.com/ubuntu/dists/jammy-security/main/cnf/Commands-amd64  File not found - /media/repository/mirror/security.ubuntu.com/ubuntu/dists/jammy-security/main/cnf/Commands-amd64 (2: No such file or directory)

Explanation#

This issue occurs because a fix for the apt-mirror package, which is available in Ubuntu 23.10, is not yet available in the Ubuntu 22.04 repositories. The next step depends on the version of your apt-mirror package:

  • Version later than 0.5.4-1: Contact NVIDIA Enterprise Services by filing a support case.

  • Version 0.5.4-1: Use the following workaround to mirror the repositories.

You can run the following command to determine the version of your apt-mirror package:

$ dpkg -l | grep apt-mirror

ii  apt-mirror                  0.5.4-1               all             APT sources mirroring tool
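
The version check can also be scripted. The following is a rough sketch: apt_mirror_action is a hypothetical helper, and it uses GNU sort -V as an approximation of Debian version ordering, which can differ in edge cases. Versions earlier than 0.5.4-1 are not covered by this section, so the helper only reports that fact.

```shell
# Hypothetical helper: map an apt-mirror version to the action listed above.
apt_mirror_action() (
    v="$1"
    if [ "$v" = "0.5.4-1" ]; then
        echo "workaround"
    elif [ "$(printf '%s\n' "0.5.4-1" "$v" | sort -V | tail -n 1)" = "$v" ]; then
        # The version sorts after 0.5.4-1.
        echo "support-case"
    else
        echo "not-covered"
    fi
)

# Example:
# apt_mirror_action "$(dpkg-query -W -f='${Version}' apt-mirror)"
```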

Workaround#

To resolve the issue, follow these instructions using an Ubuntu 23.10 Docker image:

  1. On an Ubuntu 20.04 or later system with network access, format a removable USB flash drive and mount that drive at /media. For example, where device is the block device for the drive:

    sudo mkfs.ext4 device
    sudo mount -t ext4 device /media
    
  2. Create an empty directory and make it accessible by a user who can access a Docker container, such as joe.

    mkdir /media/repository
    chown joe /media/repository
    chmod 755 /media/repository
    
  3. As the user specified in step 2, create the following two files:

    ./mirror.list
    
    set base_path /media/repository
    set run_postmirror 0
    set nthreads 20
    set _tilde 0
    deb http://security.ubuntu.com/ubuntu jammy-security main multiverse universe restricted
    deb http://archive.ubuntu.com/ubuntu/ jammy main multiverse universe restricted
    deb http://archive.ubuntu.com/ubuntu/ jammy-updates main multiverse universe restricted
    deb [ arch=amd64 ] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /
    deb [ arch=amd64 ] https://repo.download.nvidia.com/baseos/ubuntu/jammy/x86_64/ jammy common dgx
    deb [ arch=amd64 ] https://repo.download.nvidia.com/baseos/ubuntu/jammy/x86_64/ jammy-updates common dgx
    deb https://developer.download.nvidia.com/hpc-sdk/ubuntu/amd64 /
    
    ./Dockerfile
    
    FROM ubuntu:23.10
    ENV DEBIAN_FRONTEND=noninteractive
    RUN apt update
    RUN apt install -y apt-mirror
    COPY ./mirror.list /etc/apt/mirror.list
    RUN chmod 644 /etc/apt/mirror.list
    
    CMD ["apt-mirror"]
    
  4. As the user specified in step 2, run the following commands to build the mirrors on /media/repository.

    docker build -t dgxos6mirror .
    docker run --rm -it -v /media/repository/:/media/repository dgxos6mirror
    

    Note

    This step takes a long time to complete because nearly 1 TB of data must be downloaded.

  5. Unmount the drive from the networked system:

    sudo umount /media
    
  6. Move the drive to the target DGX system and mount it at /media:

    sudo mount -t ext4 <device> /media
    
  7. As root, edit the sources.list, cuda-compute-repo.list, dgx.list, and nvhpc.list files to point to the correct local mirrors as follows:

    /etc/apt/sources.list
    deb file:///media/repository/mirror/archive.ubuntu.com/ubuntu/ jammy main restricted universe multiverse
    deb file:///media/repository/mirror/archive.ubuntu.com/ubuntu/ jammy-updates main restricted universe multiverse
    deb file:///media/repository/mirror/security.ubuntu.com/ubuntu/ jammy-security main restricted universe multiverse
    
    /etc/apt/sources.list.d/cuda-compute-repo.list
    deb [arch=amd64 signed-by=/usr/share/keyrings/cuda_debian_prod.gpg] file:///media/repository/mirror/developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /
    
    /etc/apt/sources.list.d/dgx.list
    deb [arch=amd64 signed-by=/usr/share/keyrings/dgx_debian_prod.gpg] file:///media/repository/mirror/repo.download.nvidia.com/baseos/ubuntu/jammy/x86_64/ jammy common dgx
    deb [arch=amd64 signed-by=/usr/share/keyrings/dgx_debian_prod.gpg] file:///media/repository/mirror/repo.download.nvidia.com/baseos/ubuntu/jammy/x86_64/ jammy-updates common dgx
    
    /etc/apt/sources.list.d/nvhpc.list
    deb [arch=amd64 signed-by=/usr/share/keyrings/nvidia-hpcsdk-archive-keyring.gpg] file:///media/repository/mirror/developer.download.nvidia.com/hpc-sdk/ubuntu/amd64 /
    
  8. Review other files in the sources.list.d directory to verify that you do not have duplicate entries for the same repositories.

  9. Test that your target system can load these repositories.

    sudo apt update
    

    If you see error messages, contact NVIDIA Enterprise Services.
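
Two parts of the procedure above lend themselves to small scripts: deriving the local file:// URI for a mirrored repository (step 7) and spotting duplicate repository lines (step 8). The following is a sketch only. Both function names are hypothetical, the default base path is the base_path value from mirror.list, and the duplicate check compares whole lines, so entries that differ only in their bracketed options are not detected.

```shell
# Hypothetical helper: map a repository URL to the file:// URI where
# apt-mirror stores it, which is <base_path>/mirror/<host>/<path>.
mirror_url_to_file_uri() (
    base="${2:-/media/repository}"
    printf '%s\n' "$1" | sed -E "s#^https?://#file://${base}/mirror/#"
)

# Hypothetical helper: print deb lines that appear more than once across
# the given files, after collapsing runs of whitespace.
duplicate_apt_entries() {
    grep -hE '^[[:space:]]*deb[[:space:]]' "$@" |
        sed -E 's/[[:space:]]+/ /g; s/^ //; s/ $//' |
        sort | uniq -d
}

# Examples:
# mirror_url_to_file_uri http://archive.ubuntu.com/ubuntu/
# duplicate_apt_entries /etc/apt/sources.list /etc/apt/sources.list.d/*.list
```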

Reduced Network Communication Speeds on DGX H100 System#

Issue#

On DGX H100 Systems running DGX OS 6.0, the MAX_ACC_OUT_READ configuration for the ConnectX-7 controllers can be too low and result in reduced data transfer rates.

This issue is fixed for DGX H100 systems that ship from NVIDIA with DGX OS 6.1 preinstalled.

For DGX H100 systems that were initially preinstalled with DGX OS 6.0, the workaround in the following section is required even if the system is upgraded or re-imaged to DGX OS 6.1.

Explanation#

For DGX H100 Systems, the MAX_ACC_OUT_READ value is set to 44 by the nvidia-mlnx-config service.

You can check the value by using the mlxconfig or mstconfig command to query each device and view the value of the MAX_ACC_OUT_READ configuration.
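
For example, the query output can be filtered down to the one setting. The following is a sketch: max_acc_out_read_value is a hypothetical helper, and it assumes that the query output lists each parameter name followed by its value on one line, which might vary between tool versions.

```shell
# Hypothetical helper: read mlxconfig or mstconfig query output on stdin
# and print the value reported for MAX_ACC_OUT_READ.
max_acc_out_read_value() {
    awk '$1 == "MAX_ACC_OUT_READ" { print $2 }'
}

# Example for systems with the MLNX_OFED drivers:
# for dev in /sys/class/infiniband/*; do
#     sudo mlxconfig -d "$(basename "$dev")" query | max_acc_out_read_value
# done
```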

Workaround#

Perform the following steps to set the value to 0. For DGX OS 6.0, disable the nvidia-mlnx-config service.

On systems that were upgraded or re-imaged with DGX OS 6.1, the following systemctl commands might return an error message that begins with Failed to stop.... This message can occur if the nvidia-mlnx-config service is not running or not installed. Continue to perform the remaining steps.

  1. For DGX OS 6 systems, stop and disable the nvidia-mlnx-config service.

    1. Stop the service:

      sudo systemctl stop nvidia-mlnx-config
      
    2. Disable the service so that it does not start after the next boot and revert the configuration:

      sudo systemctl disable nvidia-mlnx-config
      
  2. For all systems, set the configuration based on whether your system uses the MLNX_OFED drivers or the Inbox drivers.

    • MLNX_OFED drivers

      for dev in $(ls /sys/class/infiniband/); do \
        sudo mlxconfig -y -d ${dev} set ADVANCED_PCI_SETTINGS=1; \
        sudo mlxconfig -y -d ${dev} set MAX_ACC_OUT_READ=0; \
      done
      
    • Inbox OFED drivers

      for bdf in $(ls /sys/bus/pci/devices); do \
        if [[ -e "/sys/bus/pci/devices/${bdf}/infiniband" ]]; then \
          sudo mstconfig -y -d "${bdf}" set ADVANCED_PCI_SETTINGS=1; \
          sudo mstconfig -y -d "${bdf}" set MAX_ACC_OUT_READ=0; \
        fi \
      done
      
  3. Perform a chassis power cycle so the changes take effect.

NVSM Raises Alerts for Missing Devices on DGX H100 System#

Issue#

NVSM reports the following missing devices on NVIDIA DGX H100 Systems:

/systems/localhost/pcie/alerts/alert0
    message_details = Device is missing on b1:00.1.
    ...

/systems/localhost/pcie/alerts/alert1
    message_details = Device is missing on b1:00.0.
    ...

/systems/localhost/pcie/alerts/alert2
    message_details = Device is missing on 0b:00.1.

Explanation#

NVSM version 22.12.02 is configured with the preceding devices as resources in the DGX H100 System configuration file, but the devices are not present in the system at the reported PCI IDs.

You can ignore the alerts for these PCI IDs. The alerts are false positives.

Workaround#

After upgrading to a newer version of NVSM, the alerts can persist in local storage. Perform the following steps to remove the NVSM alert database from local storage:

  1. Stop the NVSM service:

    sudo systemctl stop nvsm
    
  2. Delete the alert database from local storage:

    sudo rm /var/lib/nvsm/sqlite/nvsm.db
    
  3. Start the NVSM service:

    sudo systemctl start nvsm
    

DGX A800 Station/Server: mig-parted config#

Issue#

DGX Station A800 is not currently supported in the all-balanced configuration of the default mig-parted config file.

Workaround#

To add the A800 device ID to the all-balanced configuration:

  1. Make a copy of the default configuration.

  2. Add device ID 0x20F310DE to the device-filter of the all-balanced config.

  3. Point mig-parted apply at this new file when selecting a config.
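
The steps above can be sketched as follows. The path of the copy (config-a800.yaml) is a hypothetical name, and all_balanced_has_device_id is a hypothetical helper that assumes the configuration names appear as keys indented by two spaces. The helper only verifies the manual edit from step 2 and does not modify the file.

```shell
# Step 1 (hypothetical destination path):
# sudo cp /etc/nvidia-mig-manager/config-default.yaml /etc/nvidia-mig-manager/config-a800.yaml

# Step 2: edit the copy by hand, then verify the edit with this helper.
# It succeeds if a device-filter line inside the all-balanced section
# contains the given device ID (compared case-insensitively).
all_balanced_has_device_id() {
    awk -v id="$2" '
        /^  [^ ].*:[[:space:]]*$/ { section = $1 }
        section == "all-balanced:" && /device-filter/ &&
            index(tolower($0), tolower(id)) { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Step 3:
# sudo nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config-a800.yaml -c all-balanced -k /etc/nvidia-mig-manager/hooks-default.yaml
```

For example, all_balanced_has_device_id /etc/nvidia-mig-manager/config-a800.yaml 0x20F310DE returns a zero exit status only after the ID has been added to the all-balanced section.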

Erroneous Insufficient Power Error May Occur for PCIe Slots#

Issue#

Reported in release 4.99.9.

The DGX A100 server reports “Insufficient power” on PCIe slots when network cables are connected.

Explanation#

This message might occur with optical cables. It indicates that the calculated power of the card plus two optical cables is higher than what the PCIe slot can provide.

The message can be ignored.

Applications that Call the cuCtxCreate API Might Experience a Performance Drop#

Issue#

Reported in release 5.0.

When some applications call cuCtxCreate, cuGLCtxCreate, or cuCtxDestroy, there might be a drop in performance.

Explanation#

This issue occurs with Ubuntu 22.04, but not with previous versions. The issue affects applications that perform graphics/compute interoperations, applications that have a plugin mechanism for CUDA where every plugin creates its own context, and video streaming applications where computations are needed. Examples include ffmpeg, Blender, simpleDrvRuntime, and cuSolverSp_LinearSolver.

This issue is not expected to impact deep learning training.

Incorrect nvidia-container-toolkit version after upgrade from 5.X to 6.0#

Issue#

Reported in release 6.0.

Following a release upgrade from DGX OS 5.X to DGX OS 6.X, the version of nvidia-container-toolkit is not updated to the 6.X version.

Explanation#

During the release upgrade process, the older version of nvidia-container-toolkit has a higher priority than the newer version, so the package is not updated.

The workaround is to run sudo apt update and sudo apt upgrade again after nvidia-release-upgrade completes.
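
To see whether a newer version is still pending, you can compare the installed and candidate versions that apt-cache policy reports after running sudo apt update. The following is a sketch only, and upgrade_pending is a hypothetical helper that parses the Installed and Candidate lines of that output.

```shell
# Hypothetical helper: read "apt-cache policy <package>" output on stdin
# and succeed if the package is installed and the candidate version
# differs from the installed version.
upgrade_pending() {
    awk '
        $1 == "Installed:" { installed = $2 }
        $1 == "Candidate:" { candidate = $2 }
        END {
            pending = (installed != "" && installed != "(none)" && installed != candidate)
            exit (pending ? 0 : 1)
        }
    '
}

# Example:
# apt-cache policy nvidia-container-toolkit | upgrade_pending && echo "run: sudo apt upgrade"
```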

UBSAN error and mstconfig stack dump in kernel logs at boot#

Issue#

Reported in release 6.0.

On boot, an error message similar to UBSAN: shift-out-of-bounds in /build/linux-nvidia-s96GJ3/linux-nvidia-5.15.0/debian/build/build-nvidia might appear, followed by a stack trace.

Explanation#

This warning is generated when mstconfig closes a device file. The warning can be safely ignored and will be fixed in a future release.

The BMC Redfish interface is not active on first boot after installation#

Issue#

Reported in release 6.0.

When the system is first booted after installation, the BMC Redfish network interface is in a DOWN state and does not appear when you run the ifconfig command.

Explanation#

The interface is reconfigured after the system has already brought up its interfaces, so it is not started automatically.

Running the command sudo netplan apply starts the interface. The interface also starts automatically on all subsequent boots, even without running sudo netplan apply again.
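
To confirm the interface state before and after running sudo netplan apply, you can read the operational state that the kernel exposes in sysfs. The following is a minimal sketch: iface_operstate is a hypothetical helper, and the interface name in the example (bmc_redfish0) is an assumption, so substitute the name used on your system.

```shell
# Hypothetical helper: print the operational state of a network interface
# (for example "up" or "down"), or "absent" if the interface does not exist.
iface_operstate() (
    sysfs_net="${2:-/sys/class/net}"
    if [ -r "${sysfs_net}/$1/operstate" ]; then
        cat "${sysfs_net}/$1/operstate"
    else
        echo "absent"
    fi
)

# Example (the interface name is an assumption):
# iface_operstate bmc_redfish0
```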