Known Issues: DGX A100

Refer to the sections for specific versions to determine which issues remain open in those versions.

NVSM May Raise 'md1 is corrupted' Alert

Issue (fixed in EL7-21.10/R470)

On a system where one OS drive is used for the EFI boot partition and one is used for the root file system (each configured as RAID 1), NVSM raises 'md1 is corrupted' alerts.

Explanation and Workaround

The OS RAID 1 drives are running in a non-standard configuration, which results in erroneous alert messages. If you alter the default configuration, you must let NVSM know so that the utility does not flag the configuration as an error and can continue to monitor the health of the drives.
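
To see the current layout before making any changes, you can list the software RAID arrays and block devices with standard Linux utilities; the output is illustrative and the device names may differ on your system.

  $ cat /proc/mdstat    # lists the md RAID 1 arrays and their member drives
  $ lsblk -o NAME,SIZE,TYPE,MOUNTPOINT    # shows which devices back the EFI boot and root file systems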

To resolve this issue, enable the R470 repository and update to EL7-21.10 or later. Otherwise, configure NVSM to support custom drive partitioning by performing the following steps.

  1. Stop NVSM services.
    $ sudo systemctl stop nvsm
  2. Edit /etc/nvsm/nvsm.config and set the "use_standard_config_storage" parameter to false.
    "use_standard_config_storage":false
  3. Remove the NVSM database.
    $ sudo rm /var/lib/nvsm/sqlite/nvsm.db
  4. Restart NVSM.
    $ sudo systemctl restart nvsm
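
After NVSM restarts, you can confirm that the custom layout is accepted; for example, the following checks (shown as an illustration) should complete without raising the 'md1 is corrupted' alert.

  $ systemctl status nvsm    # NVSM services should be active
  $ sudo nvsm show health    # drive health should be reported without the erroneous alert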

nv_peer_mem Doesn't Start Automatically

Issue (fixed in EL7-21.01)

After installing nvidia-peer-memory-dkms in order to use InfiniBand on DGX servers, the nv_peer_mem module is not loaded.

Explanation and Workaround

The nv_peer_mem module needs to be loaded, either

  • Manually, by issuing sudo systemctl start nv_peer_mem, or
  • Automatically on every system boot by performing the following steps (a command-line sketch follows this list):
    1. Create the file /etc/modules-load.d/nv-peer-mem.conf with the contents "nv_peer_mem".
    2. Issue sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
    3. Reboot the system.
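
A minimal sketch of those steps from a shell, using the file path and dracut command given above; the lsmod check is just a convenient way to confirm that the module is loaded after the reboot.

  $ echo nv_peer_mem | sudo tee /etc/modules-load.d/nv-peer-mem.conf
  $ sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
  $ sudo reboot
  $ lsmod | grep nv_peer_mem    # run after the reboot; non-empty output means the module is loaded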

With Eight NVMe Drives Installed, nvsm-plugin-pcie Generates "ERROR: Device not found in mapping table"

Issue (fixed in EL7-21.01)

With eight U.2 NVMe drives installed, the nvsm-plugin-pcie service reports "ERROR: Device not found in mapping table" (for example, in response to systemctl status nvsm*) for the additional four drives.

Explanation and Workaround

This is an issue with the NVSM plugin PCIe service, which is not detecting the additional four drives. nvsm show health and nvsm dump health function normally and no false alerts are raised in connection with this issue.
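
As an illustration, the error can be observed in the plugin service status while the health commands continue to report correctly.

  $ systemctl status 'nvsm*'    # the plugin status shows the "Device not found in mapping table" errors
  $ sudo nvsm show health       # completes normally; no false drive alerts are raised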

nvidia-smi Reports Persistence Mode is Off Within a Container

Issue (fixed in EL7-20.9)

When nvidia-smi is run within a non-privileged container, the output shows that persistence mode is off for all the GPUs.

Explanation and Workaround

Within non-privileged containers, persistence mode cannot be viewed or managed. Persistence mode for the GPUs is actually ON, as can be verified by running nvidia-smi outside of the container.
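
As an illustration, querying persistence mode on the host and inside a non-privileged container shows the difference; the container image name below is only a placeholder for whichever CUDA-capable image you use.

  $ nvidia-smi --query-gpu=persistence_mode --format=csv    # on the host: reports Enabled
  $ docker run --rm --gpus all <cuda-image> nvidia-smi --query-gpu=persistence_mode --format=csv    # in the container: reports Disabled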