Known Issues

See the sections for specific versions to find out which issues are open in each version.

Data corruption seen with NCCL using the LL128 protocol on DGX H100 and DGX H800


DGX Base OS 6.0 and 6.1 enable PCIe relaxed ordering on GPU I/Os on DGX H100 and H800. This is the correct behavior on DGX A100, which uses AMD Rome CPUs, but the Intel Sapphire Rapids CPU used in DGX H100 and H800 should not have this option enabled.


To resolve this issue, follow these steps:

  1. Run this command:

    sudo rm /etc/modprobe.d/nvidia-relaxed-ordering.conf

  2. Reboot the system.
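
The steps above can be sketched as a small shell helper that only removes the file when it exists, so that running it twice is harmless. This is a minimal sketch, not an official tool: the function name remove_relaxed_ordering_conf and its optional path argument are hypothetical and exist only for illustration. The default path is the one given in step 1.

```shell
# Hypothetical helper: remove the relaxed-ordering modprobe file if present.
# The optional first argument overrides the path; it is an assumption added
# so the function can be tried against a scratch file first.
remove_relaxed_ordering_conf() {
    conf_file="${1:-/etc/modprobe.d/nvidia-relaxed-ordering.conf}"
    if [ -e "$conf_file" ]; then
        # Removal of the real file requires root privileges.
        rm -- "$conf_file" && echo "removed: $conf_file (reboot required)"
    else
        echo "not present: $conf_file (nothing to do)"
    fi
}
```

Run the function as root with no argument to act on the real file, then reboot the system as described in step 2.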

DGX-1, DGX-2: GPU MIG Partitions do not return output fields


After you enable MIG and create a MIG partition on the GPU, the following command returns no output for non-device-specific fields: dcgmi dmon -e 1,2,3,4,5


This issue affects EL 8 with:

  • Driver Version: 470.141.03

  • CUDA Version: 11.4.152

  • DCGM: 2.4.5
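
To see whether a system runs the driver version listed above, the installed version can be compared against it. The following is a rough sketch only: the function name is_affected_driver is hypothetical, and the check covers the driver version alone, not the CUDA or DCGM versions.

```shell
# Hypothetical helper: succeeds when the given driver version matches the
# affected version listed above (470.141.03).
is_affected_driver() {
    [ "$1" = "470.141.03" ]
}

# Example call (requires an NVIDIA driver; nvidia-smi prints one line per GPU):
#   ver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   is_affected_driver "$ver" && echo "driver $ver is listed as affected"
```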

DGX-1, DGX-2: Log displays CEC error


The DGX A100/A800 Firmware Update Container log may show error messages such as "Unable to send RAW command (channel=0x0 netfn=0x3c lun=0x0 cmd=0xf rsp=0xd3): Destination unavailable".

This error appears when running supported commands and can be safely ignored.

DGX-1: NVSM show controllers SerialNumber shows “NOT_SET”


After rebooting, nvsm show controllers may display a blank serial number.


This issue is specific to the DGX-1 platform with the MegaRAID controller. To work around it, wait 30 minutes and then restart the nvsm service by running systemctl restart nvsm
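
The workaround can be sketched as a helper that computes how long to wait before restarting the service. This sketch assumes that the 30 minutes are counted from boot, which the text above does not state explicitly, and that the system exposes /proc/uptime. The function name remaining_wait is hypothetical.

```shell
# Hypothetical helper: print how many seconds remain until 30 minutes of
# uptime have passed. Assumption: the 30-minute wait is measured from boot.
remaining_wait() {
    uptime_s="${1%%.*}"          # drop the fractional part, e.g. 100.25 -> 100
    wait_s=$((30 * 60))
    if [ "$uptime_s" -ge "$wait_s" ]; then
        echo 0
    else
        echo $((wait_s - uptime_s))
    fi
}

# Example call (Linux only; the first field of /proc/uptime is seconds since boot):
#   read -r up _ < /proc/uptime
#   sleep "$(remaining_wait "$up")"
#   sudo systemctl restart nvsm
#   sudo nvsm show controllers
```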