Known Issues#

Review the following sections to determine which issues are open in specific versions.

Virtualization Not Supported#

Issue

Virtualization technology, such as ESXi hypervisors or kernel-based virtual machines (KVM), is not an intended use case on DGX systems and has not been tested.

Data corruption seen with NCCL using the LL128 protocol on DGX H100 and DGX H800#

Issue

DGX Base OS 6.0 and 6.1 enable PCIe relaxed ordering for GPU I/O on DGX H100 and DGX H800. This setting is correct for DGX A100, which uses AMD Rome CPUs, but it should not be enabled on the Intel Sapphire Rapids CPUs used in DGX H100 and DGX H800, where it can lead to data corruption when NCCL uses the LL128 protocol.

Workaround

To resolve this issue, follow these steps:

  1. Run this command:

    sudo rm /etc/modprobe.d/nvidia-relaxed-ordering.conf
    
  2. Reboot the system.
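
The removal step above can be wrapped in a small check so that it is safe to run more than once. This is a minimal sketch, not an official tool: the function name `remove_relaxed_ordering_conf` and its optional path argument are assumptions made for illustration, and only the file path comes from the workaround.

```shell
#!/bin/sh
# Sketch: remove the relaxed-ordering modprobe configuration if it is present.
# The default path is the one named in the workaround; the function name and
# the optional path argument are hypothetical.

remove_relaxed_ordering_conf() {
    conf="${1:-/etc/modprobe.d/nvidia-relaxed-ordering.conf}"
    if [ -e "$conf" ]; then
        # Deleting the file takes effect only after the next reboot.
        rm -- "$conf" && echo "removed: $conf (reboot required)"
    else
        echo "not present: $conf"
    fi
}
```

Call the function from a root shell, then reboot the system as described in step 2.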

CIFS Returns an Error after DOCA is Installed#

Issue

After DOCA is installed on a system where CIFS was previously installed, any attempt to use CIFS fails. Mounting a CIFS filesystem reports the following error:

$ sudo mount -t cifs -o <options> //SERVER_IP_OR_HOSTNAME/SHARE_NAME /MOUNT_POINT
mount error: cifs filesystem not supported by the system
mount error(19): No such device

Workaround

There is no workaround. CIFS and DOCA cannot be installed at the same time.

The DOCA Framework Known Issues document describes this behavior under Issue #2657392:

OFED installation caused CIFS to break in RHEL 8.4 and above. A dummy
module was added so that CIFS will be disabled after OFED installation in RHEL 8.4
and above.
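
Scripts that mount CIFS shares may want to recognize this failure and report it clearly instead of retrying. The following is a minimal sketch under stated assumptions: `is_cifs_disabled_error` is a hypothetical helper name, and the strings it matches are the two error lines quoted in the issue description. Error 19 can have other causes, so a match is a hint, not proof.

```shell
#!/bin/sh
# Sketch: recognize the CIFS failure described in this section from captured
# mount output. The helper name is hypothetical; the matched strings are the
# error lines quoted in the issue description.

is_cifs_disabled_error() {
    # $1: text captured from a failed `mount -t cifs` call
    case "$1" in
        *"cifs filesystem not supported by the system"*) return 0 ;;
        *"mount error(19): No such device"*)             return 0 ;;
        *)                                               return 1 ;;
    esac
}

# Example use (commented out so that sourcing this file has no side effects):
# err=$(mount -t cifs -o "$opts" "$share" "$mount_point" 2>&1) || {
#     is_cifs_disabled_error "$err" &&
#         echo "CIFS appears to be disabled on this system" >&2
# }
```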

DGX-1, DGX-2: GPU MIG Partitions do not return output fields#

Issue

After MIG is enabled and a MIG partition is created on the GPU, the following command returns no output for non-device-specific fields: dcgmi dmon -e 1,2,3,4,5

Explanation

This issue affects EL 8 with:

  • Driver Version: 470.141.03

  • CUDA Version: 11.4.152

  • DCGM: 2.4.5
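
To check whether a system matches the combination listed above, the installed versions can be compared against those strings. This is a sketch only: `matches_affected_versions` is a hypothetical helper, and it tests for an exact match with the listed driver and DCGM versions, which is a narrow reading of the list.

```shell
#!/bin/sh
# Sketch: report whether a driver/DCGM pair equals the affected versions listed
# in this section. The helper name is hypothetical; the version strings are
# taken from the list above.

matches_affected_versions() {
    driver="$1"   # for example, the first line printed by:
                  #   nvidia-smi --query-gpu=driver_version --format=csv,noheader
    dcgm="$2"     # DCGM version as reported by your package manager
    [ "$driver" = "470.141.03" ] && [ "$dcgm" = "2.4.5" ]
}
```

The function returns success (exit status 0) only when both values match, so it can be used directly in an `if` statement.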

DGX-1, DGX-2: Log displays CEC error#

Issue

The DGX A100/A800 Firmware Update Container log may show error messages such as "Unable to send RAW command (channel=0x0 netfn=0x3c lun=0x0 cmd=0xf rsp=0xd3): Destination unavailable".

This error appears when running supported commands and can be safely ignored.

DGX-1: NVSM show controllers SerialNumber shows “NOT_SET”#

Issue

After rebooting, nvsm show controllers may display a blank serial number.

Explanation

This issue is specific to the DGX-1 platform with the MegaRAID controller. It can be remedied by restarting the nvsm service after 30 minutes. To restart the service, run systemctl restart nvsm.
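
The wait and the restart can be combined in one helper. The sketch below assumes that the 30 minutes are counted from the moment the function is called. The name `restart_nvsm_after` and both of its arguments are hypothetical, and the second argument exists only so that the command can be replaced for a dry run.

```shell
#!/bin/sh
# Sketch: wait for a delay, then restart the nvsm service.
# The function name and its arguments are hypothetical; the default command is
# the one given in the explanation above.

restart_nvsm_after() {
    delay_seconds="${1:-1800}"                  # 30 minutes by default
    restart_cmd="${2:-systemctl restart nvsm}"  # override for a dry run
    sleep "$delay_seconds"
    # Intentionally unquoted so that the command string splits into words.
    $restart_cmd
}
```

Restarting a service normally requires root privileges, so call the function from a root shell.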