Known Issues: DGX-2

Refer to the section for each software version to determine which issues are open in that version.

Docker GPU Containers Cannot be Run

Issue

Attempting to run GPU-accelerated Docker containers may return the following error.
Failed to initialize NVML: Unknown Error

Explanation and Workaround

This issue occurs if you have installed docker-1.13.1-108 provided by Red Hat Enterprise Linux.

An updated Docker version that resolves the issue is now available. To obtain the update, issue the following:
sudo yum update
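
Before updating, it can help to confirm that the installed package is the affected build. The following is a minimal sketch rather than an official procedure; the function name is_affected_docker is hypothetical, and the check assumes the package string has the default name-version-release form printed by rpm -q.

```shell
# Return success (0) if the given package string names the affected
# docker-1.13.1-108 build, failure (1) otherwise.
# is_affected_docker is a hypothetical helper, not part of any DGX tool.
is_affected_docker() {
    case "$1" in
        docker-1.13.1-108|docker-1.13.1-108.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the DGX system:
#   if is_affected_docker "$(rpm -q docker)"; then sudo yum update; fi
```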

DGX-2: NVSM Error Occurs When Accessing Systems/Localhost

Issue

Switching to the systems/localhost folder results in an error with the message "Error connecting to NVSM backend".

Explanation and Workaround

This is due to the NVSM mosquitto service accessing the IPv6 interface. To work around the issue, inspect the /etc/hosts configuration file for the following lines:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 
::1      localhost localhost.localdomain localhost6 localhost6.localdomain6 
Comment out the localhost6 line.
#::1      localhost localhost.localdomain localhost6 localhost6.localdomain6 
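
The edit can also be scripted. The sketch below is one possible approach, not the documented method: the function name comment_out_localhost6 is hypothetical, and it assumes the IPv6 entry starts with ::1 and mentions localhost6, as in the lines shown above. The function only prints the modified contents, so the result can be reviewed before it replaces /etc/hosts.

```shell
# Print the contents of a hosts file with any uncommented "::1" entry that
# mentions localhost6 commented out. The input file is not modified.
# comment_out_localhost6 is a hypothetical helper name.
comment_out_localhost6() {
    sed 's/^\([[:space:]]*::1[[:space:]].*localhost6.*\)$/#\1/' "$1"
}

# Usage on the DGX system (keep a backup, then write the filtered copy):
#   sudo cp /etc/hosts /etc/hosts.bak
#   comment_out_localhost6 /etc/hosts.bak | sudo tee /etc/hosts >/dev/null
```

Lines that are already commented out are left unchanged, so running the function a second time has no further effect.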

DGX: NVSM Services May Fail to Load

Issue

After installing the DGX software stack for Red Hat Enterprise Linux, the following query returns messages indicating that various nvsm services have failed to load.
sudo systemctl --state=failed

Explanation and Workaround

This issue occurs with later versions of the Mosquitto messaging service installed with Red Hat Enterprise Linux 7.7. The latest version compatible with some NVSM services is 1.5.8. To work around the issue, restore the Mosquitto service to version 1.5.8 as follows:
  1. Downgrade the Mosquitto service to 1.5.8.
    sudo yum downgrade mosquitto-1.5.8-1.el7.x86_64
  2. Restart NVSM services.
    sudo systemctl restart "nvsm*"
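
To decide whether the downgrade is needed, the installed Mosquitto version can be compared against 1.5.8. This is a minimal sketch under stated assumptions: the function name mosquitto_newer_than_supported is hypothetical, and the comparison relies on the version sort (sort -V) provided by GNU coreutils.

```shell
# Return success (0) if the given Mosquitto version is newer than 1.5.8,
# the latest version described above as compatible with NVSM services.
# mosquitto_newer_than_supported is a hypothetical helper name.
mosquitto_newer_than_supported() {
    supported="1.5.8"
    # Newer means: not equal, and sorts last in a version-aware sort.
    [ "$1" != "$supported" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$supported" | sort -V | tail -n 1)" = "$1" ]
}

# Usage on the DGX system:
#   ver=$(rpm -q --queryformat '%{VERSION}' mosquitto)
#   if mosquitto_newer_than_supported "$ver"; then
#       echo "Mosquitto $ver is newer than 1.5.8; downgrade as described above"
#   fi
```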

DGX-2: NVSM Erroneously Reports PSUs and Fans as Unhealthy

This issue is fixed in EL7-19.10.

Issue

After updating the BMC to version 1.05.07, output from nvsm show health reports PSUs and Fans as "unhealthy" and that they cannot be detected, even though they are fine as indicated when using ipmitool.

Explanation

The "unhealthy" status is erroneous and does not impact functionality.

DGX-2: NVSM Does not Show Alerts for Modified EFI Directory on Boot Drive

This issue was fixed in EL7-19.07.

Issue

If the EFI directory of one of the RAID 1 OS drives is inadvertently modified, the system will boot from the good drive but NVSM does not show an alert. The nvsm show command reports the drive as healthy.

Explanation

The EFI directory is used to hold the UEFI boot file. The ESP monitor will not be aware of changes to the directory name and will not generate an alert. This will be resolved in a future release of the NVSM software.

DGX-2, DGX Station: Ubuntu Boot Option Appears After Installing Red Hat Enterprise Linux

Issue

After installing Red Hat Enterprise Linux 7.6 and rebooting, the Ubuntu boot option still appears in the boot menu.

Explanation and Workaround

After installing Red Hat Enterprise Linux, the OS leaves entries from the previous DGX OS in the EFI boot table. These entries have no effect on the system other than potentially causing confusion. You can manually remove the entries as follows.

  1. Obtain a list of all the entries in the boot table.
    efibootmgr list
  2. To remove an entry, run the following.
    sudo efibootmgr -b <xxxx> -B

    Where <xxxx> is the boot entry number.

    Example: To remove the following boot entry

    Boot000A* ubuntu    HD(1,GPT,ae7ba5cb-d73f-43af-ae8c-96d8579d7299,0x800,0x100000)/File(\EFI\UBUNTU\GRUBX64.EFI)..BO

    run the following.
    sudo efibootmgr -b 000A -B
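
When several leftover entries exist, the boot numbers can be extracted from the efibootmgr listing instead of being read off by hand. The following is a sketch only: the function name ubuntu_boot_entries is hypothetical, and it assumes each entry line has the form shown in the example above (Boot, four hexadecimal digits, an optional asterisk, then a label beginning with "ubuntu").

```shell
# Read efibootmgr output on stdin and print the four-character boot number
# of every entry whose label starts with "ubuntu" (case-insensitive).
# ubuntu_boot_entries is a hypothetical helper name.
ubuntu_boot_entries() {
    sed -n 's/^Boot\([0-9A-Fa-f]\{4\}\)\*\{0,1\}[[:space:]]\{1,\}[Uu][Bb][Uu][Nn][Tt][Uu].*/\1/p'
}

# Usage on the DGX system:
#   sudo efibootmgr | ubuntu_boot_entries
# Review the printed numbers, then pass each one to
#   sudo efibootmgr -b <xxxx> -B
```

Lines such as BootCurrent and BootOrder do not match the pattern and are ignored.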

DGX-2: NVSM Reports "System Has Unsupported Drive" During RAID 1 Rebuild

This issue was fixed in EL7-19.07.

Issue

While rebuilding the RAID 1 array, "unsupported drive" alerts appear for the volume being rebuilt.

Workaround

This is an erroneous alert and can be ignored. To prevent the alert from being raised, mute monitoring for all storage components, including drives and volumes, before rebuilding the RAID array as follows.

 # nvsm set /systems/localhost/storage/policy drive_mute_monitoring=Slot0,Slot1,Slot2,Slot3,Slot4,Slot5,Slot6,Slot7,Slot8,Slot9,Slot10,Slot11,Slot12,Slot13,Slot14,Slot15
 # nvsm set /systems/localhost/storage/policy volume_mute_monitoring=md0,md1 
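
The long slot list in the first command can be generated rather than typed. This is a minimal sketch; the function name slot_list is hypothetical, and the count of 16 slots (Slot0 through Slot15) is taken from the command above.

```shell
# Print "Slot0,Slot1,...,Slot<N-1>" for the given number of slots N.
# slot_list is a hypothetical helper name.
slot_list() {
    i=0
    list=""
    while [ "$i" -lt "$1" ]; do
        # Prepend a comma only when the list already has an element.
        list="${list:+$list,}Slot$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Usage on the DGX system (as root, matching the commands above):
#   nvsm set /systems/localhost/storage/policy drive_mute_monitoring="$(slot_list 16)"
```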

DGX-2: NVSM EFI Sync Hangs on CentOS

Issue

On CentOS, when attempting to replicate the EFI partition and rebuild RAID 1, the rebuild process hangs.

Explanation

This is an issue with synchronizing the EFI partition on CentOS, and is resolved in EL7-19.07.