Release Notes

NVSM 21.07.15 Release

NVIDIA® System Management (NVSM) is a software framework for monitoring NVIDIA DGX™ nodes in a data center. It includes active health monitoring, system alerts, and log generation. See the NVSM User Guide for more information.

  • NVSM Versions 21.07.14 and 21.07.15 were released in August 2021.

  • NVSM Version 21.07.14 is provided as part of the EL8-21.08 software.

  • NVSM Version 21.07.15 is provided as part of the DGX OS 5.1.0 software.

Changes and New Features

The following are the changes since Release 20.09.33.

  • Added the ability to generate a test alert and email.

  • [DGX Server]: Added firmware versions to the list of versions returned by the nvsm show versions command.

  • [DGX Server]: Added firmware to the list of components checked by the nvsm show health and nvsm dump health commands.

  • NVSM now binds port 273 to 127.0.0.1 to limit external communications.

    To open other ports for IPv4 or IPv6, edit the bindaddress setting in nvsm.config and then restart NVSM.
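
One way to confirm the loopback-only binding described above is to try a TCP connection on both the loopback address and one of the host's external addresses. The following is a minimal sketch using only the Python standard library, not an NVSM tool. The function names accepts_connections and loopback_only are hypothetical; the port number 273 comes from the note above, and the external address must be supplied by the administrator.

```python
import socket


def accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not accepting".
        return False


def loopback_only(external_host: str, port: int = 273) -> bool:
    """Return True if the port answers on 127.0.0.1 but not on external_host.

    external_host is an address of this machine other than loopback
    (hypothetical parameter; supply the node's real address).
    """
    on_loopback = accepts_connections("127.0.0.1", port)
    on_external = accepts_connections(external_host, port)
    return on_loopback and not on_external
```

Calling loopback_only with the node's own external address, from the node itself, is expected to return True when NVSM is running with the default binding. A False result only indicates that the two connection attempts did not match that pattern; a firewall or a stopped service can produce the same outcome, so treat it as a prompt to inspect the configuration rather than as a diagnosis.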

Bug Fixes

  • [DGX A100]: On a system where one OS drive is used for the EFI boot partition and one is used for the root file system (each configured as RAID 1), NVSM raises ‘md1 is corrupted’ alerts.

  • [DGX A100]: nvsm dump health can take up to 25 minutes to complete, depending on the size of the log file.

  • [DGX A100]: NVSM enumerates NVSwitches as 8-13 instead of 0-5.

Known Issues

  • [DGX A100]: NVSM does not raise alerts for missing ESP partition.

  • [DGX-2]: NVSM does not detect a downgraded GPU PCIe link.

  • [DGX-2 KVM]: nvidia-vm vmshow command does not work for running VMs.

  • The configure_raid_array.py script cannot recreate the RAID array after a known good SSD is re-inserted.