Known Issues: DGX A100

See the sections for specific versions to see which issues are open in those versions.

Issue (fixed with EL7-21.10/R470)

On a system where one OS drive is used for the EFI boot partition and one is used for the root file system (each configured as RAID 1), NVSM raises ‘md1 is corrupted’ alerts.

Explanation and Workaround

The OS RAID 1 drives are running in a non-standard configuration, resulting in erroneous alert messages. If you alter the default configuration, you must let NVSM know so that the utility does not flag the configuration as an error, and so that NVSM can continue to monitor the health of the drives.
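Before reconfiguring anything, you can confirm that the arrays themselves are healthy with the standard Linux software RAID tools. This is only a quick check, not part of the workaround; /dev/md1 is the device named in the alert and may differ on your system.

    # Summary of all software RAID arrays
    $ cat /proc/mdstat
    # Detailed state of the array named in the alert (device name may differ)
    $ sudo mdadm --detail /dev/md1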

To resolve this issue, enable the R470 repository and update to EL7-21.10 or later. Otherwise, configure NVSM to support a custom drive partitioning by performing the following steps (a consolidated sketch of the commands appears after the list).

  1. Stop NVSM services.

    $ systemctl stop nvsm

  2. Edit /etc/nvsm/nvsm.config and set the “use_standard_config_storage” parameter to false.

    "use_standard_config_storage":false

  3. Remove the NVSM database.

    $ sudo rm /var/lib/nvsm/sqlite/nvsm.db

  4. Restart NVSM.

    $ systemctl restart nvsm
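For reference, the four steps above can also be run as a single sequence. The sed command is only a sketch: it assumes the "use_standard_config_storage" key is present in /etc/nvsm/nvsm.config and currently set to true; if your file differs, edit it by hand as described in step 2.

    $ sudo systemctl stop nvsm
    # Flip the storage-configuration flag (assumes the key exists with value true)
    $ sudo sed -i 's/"use_standard_config_storage":true/"use_standard_config_storage":false/' /etc/nvsm/nvsm.config
    $ sudo rm /var/lib/nvsm/sqlite/nvsm.db
    $ sudo systemctl restart nvsm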

Issue (fixed in EL7-21.01)

After installing nvidia-peer-memory-dkms in order to use InfiniBand on DGX servers, the nv_peer_mem module is not loaded.

Explanation and Workaround

The nv_peer_mem module needs to be loaded, either

  • Manually, by issuing sudo systemctl start nv_peer_mem, or

  • Automatically on every system boot, by performing the following (a scripted sketch follows these steps):

    1. Create a file /etc/modules-load.d/nv-peer-mem.conf with the contents “nv_peer_mem”.

    2. Issue dracut --force /boot/initramfs-$(uname -r).img $(uname -r)

    3. Reboot the system.
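For convenience, the boot-time workaround can be scripted as shown below. The tee invocation is just one way to create the configuration file, and the final lsmod command only verifies that the module is loaded after the reboot.

    # Load nv_peer_mem automatically at boot
    $ echo nv_peer_mem | sudo tee /etc/modules-load.d/nv-peer-mem.conf
    # Rebuild the initramfs for the running kernel
    $ sudo dracut --force /boot/initramfs-$(uname -r).img $(uname -r)
    $ sudo reboot
    # After the reboot, confirm the module is loaded
    $ lsmod | grep nv_peer_mem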

Issue (fixed in 21.01)

With eight U.2 NVMe drives installed, the nvsm-plugin-pcie service reports “ERROR: Device not found in mapping table” (for example, in response to systemctl status nvsm*) for the additional four drives.

Explanation and Workaround

This is an issue with the NVSM plugin PCIe service, which does not detect the additional four drives. nvsm show health and nvsm dump health function normally, and no false alerts are raised in connection with this issue.
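To see the message and confirm that health reporting is unaffected, you can run, for example:

    # The error appears in the plugin service status output
    $ systemctl status nvsm-plugin-pcie
    # Health queries still complete normally and raise no false alerts
    $ sudo nvsm show health
    $ sudo nvsm dump health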

Issue (fixed in EL7-20.9)

When nvidia-smi is run within a non-privileged container, the output shows that persistence mode is off for all the GPUs.

Explanation and Workaround

Within non-privileged containers, persistence mode cannot be viewed or managed. Persistence mode for the GPUs is actually ON, as can be confirmed by running nvidia-smi outside of the container.
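For example, querying persistence mode with nvidia-smi on the host reports the true state, while the same query inside a non-privileged container may show it as disabled:

    # Run on the host (outside the container); inside a non-privileged
    # container the same query may incorrectly report "Disabled"
    $ nvidia-smi --query-gpu=index,persistence_mode --format=csv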

© Copyright 2022-2023, NVIDIA. Last updated on Jun 27, 2023.