Known Issues: DGX-1

See the sections for specific versions to see which issues are open in those versions.

Issue (fixed in 20.02)

Attempting to run GPU-accelerated Docker containers may return the following error:

.. code:: bash

   Failed to initialize NVML: Unknown Error

Explanation and Workaround

This issue occurs if you have installed docker-1.13.1-108 provided by Red Hat Enterprise Linux.

An updated Docker version that resolves the issue is now available. To obtain the update, issue the following:

.. code:: bash

   sudo yum update
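Before updating, you can confirm whether the installed package is the affected build. The following is a minimal sketch; the `is_affected_docker` helper is written for this example, and on a live system the package string would come from `rpm -q docker --qf '%{NAME}-%{VERSION}-%{RELEASE}\n'`.

```shell
# Sketch: flag the docker build known to trigger the NVML error.
# is_affected_docker is an illustrative helper; the sample package
# strings mimic RHEL "name-version-release" output from rpm -q.

is_affected_docker() {
    # $1: package string, e.g. "docker-1.13.1-108.git4ef4b30.el7"
    case "$1" in
        docker-1.13.1-108*) echo "affected: update with 'sudo yum update'" ;;
        *)                  echo "not the known-bad build" ;;
    esac
}

is_affected_docker "docker-1.13.1-108.git4ef4b30.el7"
is_affected_docker "docker-1.13.1-162.git64e9980.el7_8"
```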

Fixed in EL7-20.02

Issue

After installing the DGX software stack for Red Hat Enterprise Linux, the following query returns messages indicating that various nvsm services have failed to load:

.. code:: bash

   sudo systemctl --state=failed

Explanation and Workaround

This issue occurs with later versions of the Mosquitto messaging service installed with Red Hat Enterprise Linux 7.7. The latest version compatible with some NVSM services is 1.5.8. To work around the issue, restore the Mosquitto service to version 1.5.8 as follows:

  1. Downgrade the Mosquitto service to 1.5.8.

     .. code:: bash

        sudo yum downgrade mosquitto-1.5.8-1.el7.x86_64

  2. Restart NVSM services.

     .. code:: bash

        sudo systemctl restart nvsm*
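The steps above only apply when the installed Mosquitto is newer than 1.5.8. A minimal sketch of that check follows; `check_mosquitto` is an illustrative helper, and on a live system the version would come from `rpm -q mosquitto --qf '%{VERSION}\n'` (an assumption for this example).

```shell
# Sketch: compare an installed Mosquitto version against the last
# NVSM-compatible release (1.5.8), using sort -V for version ordering.
compatible="1.5.8"

check_mosquitto() {
    # $1: installed version string
    # If $1 sorts strictly after 1.5.8, it is newer than the
    # known-compatible release and should be downgraded.
    newest=$(printf '%s\n%s\n' "$compatible" "$1" | sort -V | tail -n1)
    if [ "$1" != "$compatible" ] && [ "$newest" = "$1" ]; then
        echo "newer than $compatible: downgrade before restarting nvsm*"
    else
        echo "ok"
    fi
}

check_mosquitto "1.6.3"   # newer than 1.5.8
check_mosquitto "1.5.8"   # ok
```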

Issue (resolved with updated installation instructions)

When updating the driver, the DKMS module may not build for a newly installed kernel, resulting in a driver/library mismatch. This can be confirmed by the following output when issuing nvidia-smi:

.. code:: bash

   Failed to initialize NVML: Driver/library version mismatch

Workaround

Initiate a DKMS build manually by issuing the following:

.. code:: bash

   $ sudo dkms install nvidia/418.67 -k $(uname -r)
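The check-then-rebuild flow can be sketched as follows. The `needs_rebuild` helper is written for this example; it only classifies captured `nvidia-smi` output, and the commented lines show where the documented dkms command would run on a live system.

```shell
# Sketch: detect the driver/library mismatch in nvidia-smi output and,
# only then, rebuild the DKMS module for the running kernel.

needs_rebuild() {
    # $1: captured nvidia-smi output
    case "$1" in
        *"Driver/library version mismatch"*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a live system (assumption: driver version 418.67 as in this release):
#   out=$(nvidia-smi 2>&1)
#   if needs_rebuild "$out"; then
#       sudo dkms install nvidia/418.67 -k "$(uname -r)"
#   fi

if needs_rebuild "Failed to initialize NVML: Driver/library version mismatch"; then
    echo "rebuild needed"
fi
```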

Issue

(Fixed in EL7-19.09) After removing one of the cache SSDs from the DGX-1, checking the status using NVSM CLI, and then hot-plugging the SSD back in, NVSM CLI reports an HTTP code 500 error.

Example, where drive 20:4 is the reinserted SSD (20 is the enclosure ID and 4 is the drive slot):

.. code:: bash

   nvsm-> show /systems/localhost/storage/drives/20:4
   /systems/localhost/storage/drives/20:4
   ERROR:nvsm:Bad HTTP status code "500" from NVSM backend: Internal Server Error

Explanation and Workaround

After re-inserting the SSD into the system, NVSM recognizes the drive but fails to get full device information from storCLI. Additionally, the RAID controller sets the array to offline and marks the re-inserted SSD as Unconfigured_Bad (UBad), which prevents the RAID 0 array from being recreated.

To correct this condition:

  1. Set the drive back to a good state.

     .. code:: bash

        # sudo /opt/MegaRAID/storcli/storcli64 /c0/e /s set good force

  2. Run the script to recreate the array.

     .. code:: bash

        # sudo configure_raid_array.py -c -f
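To find which slots need step 1, the storCLI drive table can be scanned for the UBad state. The sketch below assumes the `EID:Slt DID State ...` column layout printed by `storcli64 /c0 show`; the sample lines and the `find_ubad` helper are illustrative only.

```shell
# Sketch: list drive slots (EID:Slt) that storCLI reports as
# Unconfigured Bad (UBad). The table format below is an assumption
# modeled on "storcli64 /c0 show" output.

find_ubad() {
    # reads a storCLI drive table on stdin; prints EID:Slt of UBad drives
    awk '$3 == "UBad" { print $1 }'
}

find_ubad <<'EOF'
20:3  8  Onln  0  1.745 TB SATA SSD
20:4  9  UBad  -  1.745 TB SATA SSD
EOF
# prints 20:4
```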

Fixed in EL7-19.07

Issue

The alert that comes up (for example, from the “nvsm show alerts” command) when removing the RAID 0 data drive is not cleared after replacing the drive, recreating the RAID 0 array, and then rebooting the system.

Workaround

To clear the alerts, run the following command:

.. code:: bash

   systemctl restart nvsm-storage-dshm
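To confirm the stale alert is gone after the restart, the saved output of “nvsm show alerts” can be checked for remaining alert entries. The `alert_id` field name and the sample output below are assumptions for illustration; `count_alerts` is a helper written for this example.

```shell
# Sketch: count alert entries in captured "nvsm show alerts" output.
# The "alert_id" property name in the sample is an assumption.

count_alerts() {
    # reads nvsm output on stdin; prints the number of alert entries
    grep -c 'alert_id' || true
}

count_alerts <<'EOF'
/systems/localhost/alerts/alert0
Properties:
    alert_id = alert0
    system_name = dgx-1
EOF
# prints 1
```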

Issue

(Resolved in Red Hat Enterprise Linux 7.7) Upon rebooting the server, you may see the following message on the boot screen.

.. code:: bash

   error: failure reading sector 0x0 from 'hd0'. Press any key to continue ...

Action to Take

Press any key to continue. The server continues the boot process without other problems.

Resolution

Upgrade to Red Hat Enterprise Linux 7.7.

Fixed in EL7-20.02

Issue

The DGX serial number returned by “nvsm show health” is a generic serial number that does not reflect the actual chassis serial number. The same occurs with the NVSM API.

Resolution

This issue will be resolved in a future update.
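Until the fix, the NVSM-reported serial can be cross-checked against the chassis serial. On a live system the two values could come from “nvsm show health” and `sudo dmidecode -s system-serial-number`; the `serial_matches` helper and the sample strings below are illustrative only.

```shell
# Sketch: flag a serial number that looks like a generic placeholder
# rather than the real chassis serial. Sample values are illustrative,
# not the actual string NVSM returns.

serial_matches() {
    # $1: serial reported by NVSM, $2: serial from dmidecode
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch: NVSM reports '$1', chassis reports '$2'"
    fi
}

serial_matches "0334018000000" "0334018000000"   # match
serial_matches "To be filled by O.E.M." "0334018000000"
```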

© Copyright 2022-2023, NVIDIA. Last updated on Jun 27, 2023.