Known Issues: DGX-2
Refer to the version-specific sections to see which issues are open in each version.
Issue (fixed in 20.02)
Attempting to run GPU-accelerated Docker containers may return the following error.
Failed to initialize NVML: Unknown Error
Explanation and Workaround
This issue occurs if you have installed docker-1.13.1-108 provided by Red Hat Enterprise Linux.
An updated Docker version that resolves the issue is now available. To obtain the update, issue the following:
sudo yum update
Fixed in EL7-20.02
Issue
Switching to the systems/localhost folder results in an error with the message “Error connecting to NVSM backend”.
Explanation and Workaround
This occurs because the NVSM mosquitto service accesses the IPv6 interface. To work around the issue, inspect the /etc/hosts config file for the following lines:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
Comment out the localhost6 line.
#::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
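The edit described above can also be scripted. The following is a minimal sketch, not an NVIDIA-provided tool: the function name ``comment_out_localhost6`` and the temporary-file approach are assumptions made for illustration. It prefixes ``#`` to any uncommented line that starts with ``::1`` and mentions ``localhost6``, and leaves every other line unchanged.

```shell
# Sketch: comment out the IPv6 localhost entry in a hosts-style file.
# comment_out_localhost6 is a hypothetical helper name chosen for this example.
comment_out_localhost6() {
    # $1: path to a hosts-style file; rewritten through a temporary file
    # so the original file keeps its ownership and permissions.
    hosts_file=$1
    tmp_file=$(mktemp) || return 1
    # Only lines that begin with "::1" are touched, so a line that is
    # already commented out (starts with '#') is left alone.
    sed '/^::1[[:space:]].*localhost6/s/^/#/' "$hosts_file" > "$tmp_file" &&
        cat "$tmp_file" > "$hosts_file"
    status=$?
    rm -f "$tmp_file"
    return $status
}
```

It is safer to try the function on a copy first, for example by copying /etc/hosts to a scratch location, running ``comment_out_localhost6`` on the copy, and comparing the two files with ``diff`` before editing the real file with root privileges.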
Fixed in EL7-20.02
Issue
After installing the DGX software stack for Red Hat Enterprise Linux, the following query returns messages indicating that various nvsm services have failed to load.
sudo systemctl --state=failed
Explanation and Workaround
This issue occurs with later versions of the Mosquitto messaging service installed with Red Hat Enterprise Linux 7.7. The latest version compatible with some NVSM services is 1.5.8. To work around, restore the Mosquitto service to version 1.5.8 as follows:
Downgrade the Mosquitto service to 1.5.8.
sudo yum downgrade mosquitto-1.5.8-1.el7.x86_64
Restart NVSM services.
sudo systemctl restart "nvsm*"
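Before downgrading, it can help to confirm that the installed Mosquitto version is actually newer than 1.5.8. The sketch below is illustrative only: ``version_newer_than`` is a hypothetical helper, and it assumes a ``sort`` that supports the ``-V`` (version sort) option, as GNU coreutils does.

```shell
# Sketch: test whether one version string sorts strictly after another.
# version_newer_than is a hypothetical helper, not part of yum, rpm, or NVSM.
version_newer_than() {
    # $1: version to test, $2: reference version.
    # Succeeds (returns 0) only when $1 is different from $2 and
    # version-sorts last of the two.
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Example wiring (commented out because it needs rpm and an installed package):
# installed=$(rpm -q --queryformat '%{VERSION}' mosquitto)
# if version_newer_than "$installed" 1.5.8; then
#     echo "mosquitto $installed is newer than 1.5.8; the downgrade applies"
# fi
```

Version sort is used rather than a plain string comparison so that a version such as 1.5.10 is treated as newer than 1.5.8.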
This issue is fixed in EL7-19.10.
Issue
After updating the BMC to version 1.05.07, output from nvsm show health reports PSUs and fans as “unhealthy” and states that they cannot be detected, even though ipmitool shows that they are fine.
Explanation
The “unhealthy” status is erroneous and does not impact functionality.
This issue was fixed in EL7-19.07.
Issue
If the EFI directory of one of the RAID 1 OS drives is inadvertently modified, the system boots from the good drive but NVSM does not show an alert. The nvsm show command reports the drive as healthy.
Explanation
The EFI directory is used to hold the UEFI boot file. The ESP monitor will not be aware of changes to the directory name and will not generate an alert. This will be resolved in a future release of the NVSM software.
Issue
After installing Red Hat Enterprise Linux 7.6 and rebooting, the Ubuntu boot option still appears in the boot menu.
Explanation and Workaround
After installing Red Hat Enterprise Linux, the OS leaves entries from the previous DGX OS in the EFI boot table. These entries have no effect on the system other than potentially causing confusion. You can manually remove the entries as follows.
Obtain a list of all the entries in the boot table.
efibootmgr list
To remove an entry, run the following.
sudo efibootmgr -b <xxxx> -B
Where ``<xxxx>`` is the boot entry number.
**Example**: To remove the following boot entry

``Boot000A* ubuntu HD(1,GPT,ae7ba5cb-d73f-43af-ae8c-96d8579d7299,0x800,0x100000)/File(\EFI\UBUNTU\GRUBX64.EFI)..BO``

run

.. code:: bash

   sudo efibootmgr -b 000A -B
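When several leftover entries exist, the boot numbers can be extracted from the listing instead of being read off by eye. The following is a minimal sketch under stated assumptions: ``ubuntu_boot_entries`` is a hypothetical helper, and it assumes each entry line has the form shown in the example above, that is, ``Boot`` followed by four hexadecimal digits, a ``*`` or a space, another space, and the label. It only prints boot numbers; it does not remove anything.

```shell
# Sketch: print the boot numbers of entries labelled "ubuntu".
# ubuntu_boot_entries is a hypothetical helper; it reads an efibootmgr-style
# listing on stdin and deletes nothing.
ubuntu_boot_entries() {
    # Match lines such as "Boot000A* ubuntu ..." (case-insensitive) and keep
    # characters 5-8, which hold the four-digit hexadecimal boot number.
    grep -i '^Boot[0-9A-F]\{4\}[* ] ubuntu' | cut -c5-8
}
```

One way to use it is to pipe the boot table listing into ``ubuntu_boot_entries``, review the printed numbers, and then pass each one to ``sudo efibootmgr -b <xxxx> -B`` as described above.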
This issue was fixed in EL7-19.07.
Issue
While rebuilding the RAID 1 array, “unsupported drive” alerts appear for the volume being rebuilt.
Workaround
This is an erroneous alert and can be ignored. To prevent the alert from being raised, mute monitoring for all storage components, including drives and volumes, before rebuilding the RAID array as follows.
nvsm set /systems/localhost/storage/policy drive_mute_monitoring=Slot0,Slot1,Slot2,Slot3,Slot4,Slot5,Slot6,Slot7,Slot8,Slot9,Slot10,Slot11,Slot12,Slot13,Slot14,Slot15
nvsm set /systems/localhost/storage/policy volume_mute_monitoring=md0,md1
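The long slot list in the first command is easy to mistype. As a sketch, it can be generated instead; ``slot_list`` is a hypothetical helper, and the count of 16 is taken from the ``Slot0`` through ``Slot15`` names in the command above rather than queried from the system.

```shell
# Sketch: build a comma-separated list Slot0,Slot1,...,Slot(N-1).
# slot_list is a hypothetical helper name chosen for this example.
slot_list() {
    # $1: number of slots to list.
    count=$1
    i=0
    list=
    while [ "$i" -lt "$count" ]; do
        # Add a comma separator only when the list is already non-empty.
        list="${list:+$list,}Slot$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Example wiring (commented out because it needs NVSM):
# nvsm set /systems/localhost/storage/policy drive_mute_monitoring="$(slot_list 16)"
```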
Issue
On CentOS, when attempting to replicate the EFI partition and rebuild RAID 1, the rebuild process hangs.
Explanation
This is an issue with syncing EFI on CentOS, and is resolved in EL7-19.07.