Known Issues: DGX-2

Refer to the sections for specific versions to see which issues are open in those versions.

Issue (fixed in 20.02)

Attempting to run GPU-accelerated Docker containers may return the following error.

Failed to initialize NVML: Unknown Error

Explanation and Workaround

This issue occurs if you have installed docker-1.13.1-108 provided by Red Hat Enterprise Linux.

An updated Docker version that resolves the issue is now available. To obtain the update, issue the following:

sudo yum update
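Before updating, it can help to confirm whether the installed package is the affected build. The following is a minimal sketch, not an NVIDIA or Red Hat tool: the function names are hypothetical, and it assumes that the output of rpm -q docker begins with docker-1.13.1-108 on affected systems.

```shell
# Hypothetical helper: succeeds when the given package string names the
# affected build (assumed to start with "docker-1.13.1-108").
is_affected_docker_build() {
  case "$1" in
    docker-1.13.1-108|docker-1.13.1-108.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: queries the RPM database and reports the result.
report_docker_build() {
  pkg="$(rpm -q docker 2>/dev/null)" || {
    echo "docker package is not installed"
    return 0
  }
  if is_affected_docker_build "$pkg"; then
    echo "$pkg is the affected build; run: sudo yum update"
  else
    echo "$pkg is not the affected build"
  fi
}
```

Calling report_docker_build on the system prints one line describing the installed package; it does not change anything.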

Fixed in EL7-20.02

Issue

Switching to the systems/localhost folder results in an error with message “Error connecting to NVSM backend”.

Explanation and Workaround

This is due to the NVSM mosquitto service accessing the IPv6 interface. To work around, inspect the /etc/hosts config file for the following lines:

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

Comment out the localhost6 line.

#::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
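The edit can also be scripted. The following is a minimal sketch under stated assumptions: the function name comment_out_localhost6 is hypothetical, and it assumes the IPv6 loopback line starts with ::1 and contains localhost6, as in the lines shown above. It keeps a backup copy with a .bak suffix.

```shell
# Hypothetical helper: prefixes the IPv6 loopback (::1) line that defines
# localhost6 with "#". The original file is saved alongside as <file>.bak.
# Lines that are already commented out are left unchanged.
comment_out_localhost6() {
  hosts_file="${1:-/etc/hosts}"
  sed -i.bak -E '/^[[:space:]]*::1[[:space:]].*localhost6/ s/^/#/' "$hosts_file"
}
```

Try it on a copy of the file first, for example comment_out_localhost6 /tmp/hosts.test. Modifying /etc/hosts itself requires root privileges.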

Fixed in EL7-20.02

Issue

After installing the DGX software stack for Red Hat Enterprise Linux, the following query returns messages indicating that various nvsm services have failed to load.

sudo systemctl --state=failed
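To narrow the output to NVSM units only, the query can be piped through a small filter. This is a sketch, not part of the DGX software stack: the function name failed_nvsm_units is hypothetical, and it assumes that the affected unit names begin with nvsm.

```shell
# Hypothetical helper: reads `systemctl --state=failed --plain --no-legend`
# output on stdin and prints the names of failed units that start with "nvsm".
failed_nvsm_units() {
  awk '$1 ~ /^nvsm/ { print $1 }'
}

# Example invocation on a live system:
#   sudo systemctl --state=failed --plain --no-legend | failed_nvsm_units
```

The --plain and --no-legend options remove the status bullet and the header and footer lines, so the unit name is always the first field.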

Explanation and Workaround

This issue occurs with later versions of the Mosquitto messaging service installed with Red Hat Enterprise Linux 7.7. The latest version compatible with some NVSM services is 1.5.8. To work around, restore the Mosquitto service to version 1.5.8 as follows:

  1. Downgrade the Mosquitto service to 1.5.8.

    sudo yum downgrade mosquitto-1.5.8-1.el7.x86_64

  2. Restart NVSM services.

    sudo systemctl restart nvsm*
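Before downgrading, it may be useful to confirm that the installed Mosquitto version is actually newer than 1.5.8. The following is a minimal sketch: the function names are hypothetical, and the comparison relies on sort -V from GNU coreutils.

```shell
# Hypothetical helper: succeeds when version $1 is strictly newer than $2.
# Uses the version sort (`sort -V`) provided by GNU coreutils.
version_newer_than() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Hypothetical wrapper: compares the installed mosquitto version with 1.5.8.
check_mosquitto_version() {
  installed="$(rpm -q --queryformat '%{VERSION}' mosquitto 2>/dev/null)" || {
    echo "mosquitto is not installed"
    return 0
  }
  if version_newer_than "$installed" "1.5.8"; then
    echo "mosquitto $installed is newer than 1.5.8; a downgrade may be needed"
  else
    echo "mosquitto $installed needs no downgrade"
  fi
}
```

Calling check_mosquitto_version only reads the RPM database; the downgrade itself is still performed with the yum command in step 1.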

This issue is fixed in EL7-19.10.

Issue

After updating the BMC to version 1.05.07, the output from nvsm show health reports PSUs and fans as “unhealthy” and undetectable, even though ipmitool shows that they are working properly.

Explanation

The “unhealthy” status is erroneous and does not impact functionality.

This issue was fixed in EL7-19.07.

Issue

If the EFI directory of one of the RAID 1 OS drives is inadvertently modified, the system will boot off the good drive but NVSM does not show an alert. The nvsm show command reports the drive as healthy.

Explanation

The EFI directory is used to hold the UEFI boot file. The ESP monitor will not be aware of changes to the directory name and will not generate an alert. This will be resolved in a future release of the NVSM software.

Issue

After installing Red Hat Enterprise Linux 7.6 and rebooting, the Ubuntu boot option still appears in the boot menu.

Explanation and Workaround

After installing Red Hat Enterprise Linux, the OS leaves entries from the previous DGX OS in the EFI boot table. These entries have no effect on the system other than potentially causing confusion. You can manually remove the entries as follows.

  1. Obtain a list of all the entries in the boot table.

efibootmgr list

  2. To remove an entry, run the following.

sudo efibootmgr -b <xxxx> -B

Where <xxxx> is the boot entry number.

Example: To remove the following boot entry

Boot000A* ubuntu HD(1,GPT,ae7ba5cb-d73f-43af-ae8c-96d8579d7299,0x800,0x100000)/File(\EFI\UBUNTU\GRUBX64.EFI)..BO

run

sudo efibootmgr -b 000A -B
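When several leftover entries exist, their numbers can be extracted from the boot table listing instead of being read by eye. The following is a minimal sketch: the function name ubuntu_boot_entries is hypothetical, and it assumes the leftover entries are labeled ubuntu, as in the example above.

```shell
# Hypothetical helper: reads efibootmgr output on stdin and prints the
# four-character number of every boot entry whose label starts with "ubuntu".
ubuntu_boot_entries() {
  awk 'tolower($2) == "ubuntu" && $1 ~ /^Boot[0-9A-Fa-f]+\*?$/ { print substr($1, 5, 4) }'
}

# Example invocation on a live system (review the list before deleting anything):
#   sudo efibootmgr | ubuntu_boot_entries
```

Each number printed can then be passed to the -b option of the removal command. The helper only prints numbers and never deletes an entry itself.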

This issue was fixed in EL7-19.07.

Issue

While rebuilding the RAID 1 array, “unsupported drive” alerts appear for the volume being rebuilt.

Workaround

This is an erroneous alert and can be ignored. To prevent the alert from being raised, mute monitoring for all storage components, including drives and volumes, before rebuilding the RAID array as follows.

nvsm set /systems/localhost/storage/policy drive_mute_monitoring=Slot0,Slot1,Slot2,Slot3,Slot4,Slot5,Slot6,Slot7,Slot8,Slot9,Slot10,Slot11,Slot12,Slot13,Slot14,Slot15
nvsm set /systems/localhost/storage/policy volume_mute_monitoring=md0,md1
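The long slot list can be generated rather than typed by hand. The following is a minimal sketch: the function names are hypothetical, and it assumes 16 drive slots named Slot0 through Slot15 and the volumes md0 and md1, as in the commands above. It prints the commands for review and does not run them.

```shell
# Hypothetical helper: prints Slot0,Slot1,...,Slot(N-1) for the given count N.
slot_list() {
  count="$1"
  i=0
  list=""
  while [ "$i" -lt "$count" ]; do
    # Append a comma only when the list already has an element.
    list="${list}${list:+,}Slot${i}"
    i=$((i + 1))
  done
  printf '%s\n' "$list"
}

# Hypothetical wrapper: prints (does not execute) the two mute commands.
print_mute_commands() {
  printf 'nvsm set /systems/localhost/storage/policy drive_mute_monitoring=%s\n' "$(slot_list 16)"
  printf 'nvsm set /systems/localhost/storage/policy volume_mute_monitoring=%s\n' "md0,md1"
}
```

Run print_mute_commands, check the output against the system's actual slot and volume names, and then paste the commands into the shell.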

Issue

On CentOS, when attempting to replicate the EFI partition and rebuild RAID 1, the rebuild process hangs.

Explanation

This is an issue with syncing EFI on CentOS, and is resolved in EL7-19.07.

© Copyright 2022-2023, NVIDIA. Last updated on Jun 27, 2023.