Known Issues: DGX A100
See the version-specific sections to determine which of these issues remain open in a given version.
Issue (fixed with EL7-21.10/R470)
On a system where one OS drive is used for the EFI boot partition and one is used for the root file system (each configured as RAID 1), NVSM raises ‘md1 is corrupted’ alerts.
Explanation and Workaround
The OS RAID 1 drives are running in a non-standard configuration, resulting in erroneous alert messages. If you alter the default configuration, you must let NVSM know so that the utility does not flag the configuration as an error, and so that NVSM can continue to monitor the health of the drives.
To resolve this issue, enable the R470 repository and update to EL7-21.10 or later. Otherwise, configure NVSM to support custom drive partitioning by performing the following steps.
1. Stop NVSM services.
   $ systemctl stop nvsm
2. Edit /etc/nvsm/nvsm.config and set the “use_standard_config_storage” parameter to false:
   "use_standard_config_storage":false
3. Remove the NVSM database.
   $ sudo rm /var/lib/nvsm/sqlite/nvsm.db
4. Restart NVSM.
   $ systemctl restart nvsm
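The steps above can also be applied as a single command sequence. The following is a minimal sketch, assuming that "use_standard_config_storage" appears as a one-line entry set to true in /etc/nvsm/nvsm.config; verify the edited file before restarting NVSM.

# Stop NVSM, flip the storage-configuration flag, clear the database, and restart
$ sudo systemctl stop nvsm
$ sudo sed -i 's/"use_standard_config_storage": *true/"use_standard_config_storage":false/' /etc/nvsm/nvsm.config
$ sudo rm /var/lib/nvsm/sqlite/nvsm.db
$ sudo systemctl restart nvsm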
Issue (fixed in EL7-21.01)
After installing nvidia-peer-memory-dkms in order to use InfiniBand on DGX servers, the nv_peer_mem module is not loaded.
Explanation and Workaround
The nv_peer_mem module needs to be loaded, either

Manually, by issuing
$ sudo systemctl start nv_peer_mem

or automatically on every system boot, by performing the following:
1. Create a file /etc/modules-load.d/nv-peer-mem.conf with contents “nv_peer_mem”.
2. Issue
   $ dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
3. Reboot.
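For reference, the automatic approach can be carried out as the following command sequence, which simply consolidates the steps above (the immediate systemctl start is optional if you reboot right away).

# Load nv_peer_mem now, then make it load on every boot
$ sudo systemctl start nv_peer_mem
$ echo "nv_peer_mem" | sudo tee /etc/modules-load.d/nv-peer-mem.conf
$ sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
$ sudo reboot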
Issue (fixed in EL7-21.01)
With eight U.2 NVMe drives installed, the nvsm-plugin-pcie service reports “ERROR: Device not found in mapping table” (for example, in response to systemctl status nvsm*) for the additional four drives.
Explanation and Workaround
This is an issue with the NVSM plugin PCIe service, which is not detecting the additional four drives. nvsm show health and nvsm dump health function normally and no false alerts are raised in connection with this issue.
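To confirm that the errors are confined to the plugin service log and do not affect health reporting, you can compare the two outputs; the commands below only restate the checks already mentioned for this issue.

# Shows the spurious "Device not found in mapping table" errors
$ systemctl status nvsm*
# Health reporting still functions normally
$ nvsm show health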
Issue (fixed in EL7-20.9)
When nvidia-smi is run within a non-privileged container, the output shows that persistence mode is off for all the GPUs.
Explanation and Workaround
Within non-privileged containers, persistence mode cannot be viewed or managed. Persistence mode for the GPUs is actually ON, as demonstrated when running nvidia-smi outside of the container.
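One way to observe this behavior is to query persistence mode on the host and then from inside a non-privileged container. The container image and runtime flags below are illustrative assumptions, not a required setup.

# On the host, persistence mode reports as Enabled
$ nvidia-smi --query-gpu=persistence_mode --format=csv
# Inside a non-privileged container, the same query reports Disabled,
# even though persistence mode is actually on (image name is an example)
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 \
    nvidia-smi --query-gpu=persistence_mode --format=csv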