Known Issues

See the section for each version to find out which issues are open in that version.

Data corruption seen with NCCL using the LL128 protocol on DGX H100 and DGX H800

Issue

DGX Base OS 6.0 and 6.1 enable PCIe relaxed ordering on GPU I/Os on DGX H100 and H800. This is the correct behavior on DGX A100, which uses AMD Rome CPUs, but the Intel Sapphire Rapids CPU used in DGX H100 and H800 should not have this option enabled.

Workaround

To resolve this issue, follow these steps:

  1. Run this command:

    sudo rm /etc/modprobe.d/nvidia-relaxed-ordering.conf
    
  2. Reboot the system.
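As a minimal sketch, the presence of the file can be checked before and after the workaround. The path is the one given in step 1; the helper name `relaxed_ordering_conf_present` and the `CONF` variable are hypothetical and exist only for this illustration. The script only reports and removes nothing.

```shell
#!/bin/sh
# Path taken from step 1 of the workaround; CONF can be overridden for testing.
CONF="${CONF:-/etc/modprobe.d/nvidia-relaxed-ordering.conf}"

# Hypothetical helper: succeed when the given file exists.
relaxed_ordering_conf_present() {
    [ -e "$1" ]
}

if relaxed_ordering_conf_present "$CONF"; then
    echo "present: $CONF (remove it with sudo rm, then reboot)"
else
    echo "absent: $CONF"
fi
```

Running the check again after the reboot should report the file as absent.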

DGX-1, DGX-2: GPU MIG Partitions do not return output fields

Issue

After MIG is enabled and a MIG partition is created on the GPU, the following command returns no output for non-device-specific fields:

    dcgmi dmon -e 1,2,3,4,5

Explanation

This issue affects EL 8 with:

  • Driver Version: 470.141.03

  • CUDA Version: 11.4.152

  • DCGM: 2.4.5
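To see whether a system runs the driver listed above, the installed version can be compared against it. This is a sketch under assumptions: the helper name `is_affected_driver` is hypothetical, and only the driver version is checked, not the CUDA or DCGM versions.

```shell
#!/bin/sh
# Driver version listed above as affected.
AFFECTED_DRIVER="470.141.03"

# Hypothetical helper: succeed when the given version is the affected one.
is_affected_driver() {
    [ "$1" = "$AFFECTED_DRIVER" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Read the driver version reported for the first GPU.
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if is_affected_driver "$driver"; then
        echo "driver $driver matches the affected version"
    else
        echo "driver $driver is not the version listed for this issue"
    fi
else
    echo "nvidia-smi not found; cannot read the driver version"
fi
```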

DGX-1, DGX-2: Log displays CEC error

Issue

The DGX A100/A800 Firmware Update Container log may show error messages such as "Unable to send RAW command (channel=0x0 netfn=0x3c lun=0x0 cmd=0xf rsp=0xd3): Destination unavailable".

This error will be displayed when running supported commands and may be safely ignored.

DGX-1: NVSM show controllers SerialNumber shows “NOT_SET”

Issue

After rebooting, nvsm show controllers may display a blank serial number.

Explanation

This issue is specific to the DGX-1 platform with the MegaRAID controller and can be remedied by restarting the nvsm service after 30 minutes. To restart the service, run:

    systemctl restart nvsm
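The wait can be estimated from the system uptime. The sketch below assumes that "after 30 minutes" means 30 minutes after the reboot, which the text does not state explicitly; the helper name `remaining_wait` is hypothetical. It only prints how long to wait and does not restart anything.

```shell
#!/bin/sh
# Assumed target: 30 minutes of uptime, in seconds.
TARGET_SECONDS=$((30 * 60))

# Hypothetical helper: seconds left until the given uptime reaches the target.
remaining_wait() {
    if [ "$1" -ge "$TARGET_SECONDS" ]; then
        echo 0
    else
        echo $((TARGET_SECONDS - $1))
    fi
}

if [ -r /proc/uptime ]; then
    # The first field of /proc/uptime is the uptime in seconds; drop the fraction.
    uptime_s=$(cut -d. -f1 /proc/uptime)
    echo "wait $(remaining_wait "$uptime_s") s, then run: systemctl restart nvsm"
else
    echo "/proc/uptime not readable; wait 30 minutes, then restart nvsm"
fi
```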