Release Notes
NVSM 21.07.15 Release
NVIDIA® System Management (NVSM) is a software framework for monitoring NVIDIA DGX™ nodes in a data center. It includes active health monitoring, system alerts, and log generation. See the NVSM User Guide for more information.
NVSM Versions 21.07.14 and 21.07.15 were released in August 2021.
NVSM Version 21.07.14 is provided as part of the EL8-21.08 software.
NVSM Version 21.07.15 is provided as part of the DGX OS 5.1.0 software.
Changes and New Features
The following are the changes since Release 20.09.33.
Added ability to generate a test alert and email.
[DGX Server]: Added firmware versions to the list of versions returned by the nvsm show versions command.
[DGX Server]: Added firmware to the list of components checked by the nvsm show health and nvsm dump health commands.
NVSM now binds port 273 to 127.0.0.1 to limit external communications. To open other ports for IPv4 or IPv6, edit nvsm.config (bindaddress) and then restart NVSM.
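After changing bindaddress and restarting NVSM, it can be useful to confirm which addresses actually accept connections on the port. The following is a minimal sketch using only the Python standard library. It is not an NVSM tool: the helper name accepts_tcp is hypothetical, and the port number 273 is taken from the note above. It simply attempts a TCP connection and reports whether the connection succeeded.

```python
import socket


def accepts_tcp(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    NVSM_PORT = 273  # port number as stated in these release notes
    print("127.0.0.1 accepts connections:", accepts_tcp("127.0.0.1", NVSM_PORT))
```

Running the same check from another host against the node's external address should fail while NVSM is bound only to 127.0.0.1.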
Bug Fixes
[DGX A100]: On a system where one OS drive is used for the EFI boot partition and one is used for the root file system (each configured as RAID 1), NVSM raises ‘md1 is corrupted’ alerts.
[DGX A100]: nvsm dump health can take up to 25 minutes to complete, depending on the size of the log file.
[DGX A100]: NVSM enumerates NVSwitches as 8-13 instead of 0-5.
Known Issues
[DGX A100]: NVSM does not raise alerts for a missing ESP partition.
[DGX-2]: NVSM does not detect a downgraded GPU PCIe link.
[DGX-2 KVM]: The nvidia-vm vmshow command does not work for running VMs.
The configure_raid_array.py script cannot recreate the RAID array after re-inserting a known good SSD.