Using the NVSM CLI
NVIDIA DGX-2 servers running DGX OS version 4.0.1 or later should come with NVSM pre-installed.
NVSM CLI communicates with the privileged NVSM API server, so NVSM CLI requires superuser privileges to run. All examples given in this guide are prefixed with the sudo command.
Using the NVSM CLI Interactively
Starting an interactive session
The command “sudo nvsm” will start an NVSM CLI interactive session.
user@dgx-2:~$ sudo nvsm
[sudo] password for user:
nvsm->
Once at the “nvsm->” prompt, the user can enter NVSM CLI commands to view and manage the DGX system.
Example command
One such command is “show fans”, which prints the state of all fans known to NVSM.
nvsm-> show fans
/chassis/localhost/thermal/fans/FAN10_F
Properties:
Status_State = Enabled
Status_Health = OK
Name = FAN10_F
MemberId = 19
ReadingUnits = RPM
LowerThresholdNonCritical = 5046.000
Reading = 9802 RPM
LowerThresholdCritical = 3596.000
...
/chassis/localhost/thermal/fans/PDB_FAN4
Properties:
Status_State = Enabled
Status_Health = OK
Name = PDB_FAN4
MemberId = 23
ReadingUnits = RPM
LowerThresholdNonCritical = 11900.000
Reading = 14076 RPM
LowerThresholdCritical = 10744.000
nvsm->
Leaving an interactive session
To leave the NVSM CLI interactive session, use the “exit” command.
nvsm-> exit
user@dgx2:~$
Using the NVSM CLI Non-Interactively
Any NVSM CLI command can be invoked from the system shell, without starting an NVSM CLI interactive session. To do this, simply append the desired NVSM CLI command to the “sudo nvsm” command. The “show fans” command given above can be invoked directly from the system shell as follows.
user@dgx2:~$ sudo nvsm show fans
/chassis/localhost/thermal/fans/FAN10_F
Properties:
Status_State = Enabled
Status_Health = OK
Name = FAN10_F
MemberId = 19
ReadingUnits = RPM
LowerThresholdNonCritical = 5046.000
Reading = 9802 RPM
LowerThresholdCritical = 3596.000
...
/chassis/localhost/thermal/fans/PDB_FAN4
Properties:
Status_State = Enabled
Status_Health = OK
Name = PDB_FAN4
MemberId = 23
ReadingUnits = RPM
LowerThresholdNonCritical = 11900.000
Reading = 14076 RPM
LowerThresholdCritical = 10744.000
user@dgx2:~$
The output of some NVSM commands can be too large to fit on one screen, so it is sometimes useful to pipe this output to a paging utility such as “less”.
user@dgx2:~$ sudo nvsm show fans | less
Throughout this chapter, examples are given for both interactive and non-interactive NVSM CLI use cases. Note that these interactive and non-interactive examples are interchangeable.
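Because non-interactive invocations behave like ordinary shell commands, their output can also be filtered or saved with standard utilities. The following lines are an illustrative sketch only; the grep pattern and output file name are arbitrary examples, not NVSM options.
user@dgx2:~$ sudo nvsm show fans | grep -A 8 FAN10_F
user@dgx2:~$ sudo nvsm show health > /tmp/nvsm-health-report.txt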
Getting Help
Apart from the NVSM CLI User Guide (this document), there are many sources for finding additional help for NVSM CLI and the related NVSM tools.
nvsm “man” Page
A man page for NVSM CLI is included on DGX systems with NVSM installed. The user can view this man page by invoking the “man nvsm” command.
user@dgx2:~$ man nvsm
nvsm --help/-h Flag
Passing the --help or -h flag to the nvsm command displays a help message similar to the “man nvsm” page. The help message shows a description, the nvsm command verbs, options, and a few examples.
Example output:
user@dgxa100:~$ sudo nvsm --help
Run 'sudo nvsm [command] -h' for a command-specific help message
NVSM(1) NVSM CLI NVSM(1)
NAME
nvsm - NVSM CLI Documentation
User Guide: https://docs.nvidia.com/datacenter/nvsm/latest/pdf/nvsm-user-guide.pdf
SYNOPSIS
nvsm [help] [--color WHEN] [-i] [--log-level LEVEL] [--] [<command>]
DESCRIPTION
nvsm(1), also known as NVSM CLI, is a command-line interface for System Management on
NVIDIA DGX systems. Internally, NVSM CLI is a client of the NVSM (NVIDIA System Management)
API server, which is facilitated by the nvsm(1) daemon.
Invoking the nvsm(1) command without any arguments will start an NVSM CLI interactive session.
Alternatively, by passing commands as part of the [<command>] argument, NVSM CLI can be run
in a non-interactive mode.
Note: nvsm must be run with root privileges.
NVSM COMMANDS
nvsm show [-h, --help] [-level LEVEL] [-display CATEGORIES] [-all] [target] [where] :
Display information about devices and other entities managed by NVSM
nvsm cd [-h, --help] [target]:
Change the working target address used by NVSM verbs
nvsm set [-h, --help] [target] :
Change the value of NVSM target properties
nvsm start [-h, --help] [-noblock] [-force] [-quiet] [-timeout TIMEOUT] [target] :
Start a job managed by NVSM
nvsm dump health [-h, --help] [-o OUTPUT] [-t, -tags "tag1,tag2"]
[-tfp, -tar_file_path "/x/y/path"] [-tfn, -tar_file_name "name.tar.xz"] :
Generates a health report file
nvsm stress-test [--usage, -h, --help] [-force] [-no-prompt] [<test>] [DURATION] :
NVIDIA System Management Stress Testing
nvsm lock [-h, --help] [target] :
Enable locking of SED
nvsm create [-h, --help] [target] :
The create command is used to generate new resources on demand
OPTIONS
--color WHEN
Control colorization of output. Possible values for WHEN are "always", "never", or "auto".
Default value is "auto".
-i, --interactive
When this option is given, run in interactive mode. The default is automatic.
--log-level LEVEL
Set the output logging level. Possible values for LEVEL are "debug", "info", "warning",
"error", and "critical". The default value is "warning".
EXAMPLES
sudo nvsm help
Display the help message for NVSM CLI
sudo nvsm show -h
Display the help message for the NVSM show command
sudo nvsm show gpus
Display information for all GPUs in the system.
sudo nvsm
Run nvsm in interactive mode
sudo nvsm show versions
Display system version properties
sudo nvsm update firmware
Run through the steps of selecting a firmware update container on the local DGX system,
and running it to update the firmware on the system. This requires that you have already
loaded the container onto the DGX system.
sudo nvsm dump health
Produce a health report file suitable for attaching to support tickets.
AUTHOR
NVIDIA Corporation
COPYRIGHT
2021, NVIDIA Corporation
Help for NVSM CLI Commands
Each NVSM command verb within the NVSM CLI interactive session, such as show, cd, set, start, dump health, stress-test, lock, and create, recognizes a “-h” or “--help” flag that describes the NVSM command and its arguments. These commands also have their own man pages, which can be invoked, for example, using “man nvsm_show”.
The help messages show the description, NVSM command nouns (or sub commands), options and examples.
Example output:
user@dgxa100:~$ sudo nvsm show -h
NVSM_SHOW(1) NVSM CLI NVSM_SHOW(1)
NAME
nvsm_show - NVSM SHOW CLI Documentation
SYNOPSIS
nvsm show [-h, --help] [-level LEVEL] [-display CATEGORIES] [-all] [target] [where]
DESCRIPTION
Show is used to display information about system components. It displays information
about devices and other entities managed by NVSM
OPTIONS
--help, -h
show this help message and exit
-level LEVEL, -l LEVEL
Specify the target depth level to which the show command will traverse the
target hierarchy.
The default value for LEVEL is 1, which means "the current target only".
-display CATEGORIES, -d CATEGORIES
Select the categories of information displayed about the given target.
Valid values for CATEGORIES are 'associations', 'targets', 'properties', 'verbs',
and 'all'. The default value for CATEGORY is 'all'. Multiple values can be
specified by separating those values with colon. Sub-arguments for properties
are supported which are separated by comma with paranthesis as optional.
-all, -a
Show data that are normally hidden. This includes OEM properties and OEM targets
unique to NVSM.
target The target address of the Managed Element to show. The target address can be relative
to the current working target, or it can be absolute. Simple globbing to select multiple
Managed Elements is also possible.
where Using this argument, targets can be filtered based on the value of their properties.
This can be used to quickly find targets with interesting properties. Currently this
supports '==' and '!=' operations, which mean 'equal' and 'not equal' respectively.
UNIX-style wildcards using '*' are also supported.
COMMANDS
show alerts
Display warnings and critical alerts for all subsystems
show drives
Display the storage drives
show versions
Display system version properties
show fans
Display information for all the fans in the system.
show firmware
Walk through steps of selecting a firmware update container on the local DGX system,
and run it to show the firmware versions installed on system. This requires that you
have already loaded the container onto the DGX system.
update firmware
Walk through steps of selecting a firmware update container on the local DGX system,
and run it to update the firmware on system. This requires that you have already loaded
the container onto the DGX system.
show gpus
Display information for all GPUs in the system
show health
Display overall system health
show memory
Display information for all installed DIMMs
show networkadapters
Display information for the physical network adapters
show networkdevicefunctions
Display information for the PCIe functions for a given network adapter
show networkinterfaces
Display information for each logical network adapter on the system.
show networkports
Display information for the network ports of a given networkadapter
show nvswitches
Display information for all the NVSwitch interconnects in the system.
show policy
Display alert policies for subsystems
show power
Display information for all power supply units (PSUs) in the system.
show processors
Display information for all processors in the system.
show storage
Display storage related information
show temperature
Display temperature information for all sensors in the system
show volumes
Show storage volumes
show powermode
Display the current system power mode
show led
Lists values for available system LED status. Includes u.2 NVME, Chassis/Blade LED
status (on applicable platforms)
disable exporters
Disable NVSM metric collection data
show controllers
List applicable controllers properties. Applicable for SAS storage controller in dgx1,
and M.2 and U.2 NVMe controller properties for other platforms.
EXAMPLES
sudo nvsm show -h
Display the help message for the NVSM show command
sudo nvsm show health -h
Display the help message for the NVSM show health command
sudo nvsm show gpus
Display information for all GPUs in the system.
sudo nvsm show versions
Display system version properties
sudo nvsm show storage
View all storage-related information
sudo nvsm show processors
Information for all CPUs installed on the system
AUTHOR
NVIDIA Corporation
COPYRIGHT
2021, NVIDIA Corporation
21.07.12-4-g5586f4ba Aug 04, 2021 NVSM_SHOW(1)
When an invalid command is entered, the CLI directs the user to the appropriate help message.
:~$ sudo nvsm show wrong_command
ERROR:nvsm:Target address "wrong_command" does not exist
Run: 'sudo nvsm show --help' for more options
Setting DGX H100 BMC Redfish Password
On DGX H100 systems, Redfish services in the BMC can be accessed using the BMC Redfish host IP address, which is termed the Host Interface. NVSM deployed on the host OS communicates with the BMC Redfish services over the Host Interface to obtain system data.
The Redfish Host Interface is a secured communication channel. As a prerequisite, BMC credentials with at least read-level access must be set up in the host OS before NVSM can communicate with the BMC Redfish services.
The following NVSM command sets up the BMC credentials for NVSM consumption in the host OS:
# nvsm set -bmccred (or) # nvsm set --bmccredentials
$ sudo nvsm set -bmccred
BMC credentials entered will be encrypted and stored.
Enter BMC username: admin
Enter BMC password:
Re Enter BMC password:
Entered credentials stored successfully.
The credentials are encrypted and stored on the host.
Examining System Health
The most basic functionality of NVSM CLI is examination of system state. NVSM CLI provides a “show” command for this purpose.
Because NVSM CLI is modeled after the SMASH CLP, the output of the NVSM CLI “show” command should be familiar to users of BMC command line interfaces.
List of Basic Commands
The following table lists the basic commands (primarily “show”). Detailed use of these commands is explained in subsequent sections of this document.
Note
On DGX Station, the following are the only commands supported.
nvsm show health
nvsm dump health
| Global Commands | Descriptions |
|---|---|
| $ sudo nvsm show alerts | Displays warnings and critical alerts for all subsystems |
| $ sudo nvsm show policy | Displays alert policies for subsystems |
| $ sudo nvsm show versions | Displays system version properties |

| Health Commands | Descriptions |
|---|---|
| $ sudo nvsm show health | Displays overall system health |
| $ sudo nvsm dump health | Generates a health report file |

| Storage Commands | Descriptions |
|---|---|
| $ sudo nvsm show storage | Displays all storage-related information |
| $ sudo nvsm show drives | Displays the storage drives |
| $ sudo nvsm show controllers | Displays the storage controllers |
| $ sudo nvsm show volumes | Displays the storage volumes |

| GPU Commands | Descriptions |
|---|---|
| $ sudo nvsm show gpus | Displays information for all GPUs in the system |

| Processor Commands | Descriptions |
|---|---|
| $ sudo nvsm show processors | Displays information for all CPUs in the system |
| $ sudo nvsm show cpus | Alias for “show processors” |

| Memory Commands | Descriptions |
|---|---|
| $ sudo nvsm show memory | Displays information for all installed DIMMs |
| $ sudo nvsm show dimms | Alias for “show memory” |

| Thermal Commands | Descriptions |
|---|---|
| $ sudo nvsm show fans | Displays information for all the fans in the system |
| $ sudo nvsm show temperatures | Displays temperature information for all sensors in the system |
| $ sudo nvsm show temps | Alias for “show temperatures” |

| Network Commands | Descriptions |
|---|---|
| $ sudo nvsm show networkadapters | Displays information for the physical network adapters |
| $ sudo nvsm show networkinterfaces | Displays information for the logical network interfaces |
| $ sudo nvsm show networkports | Displays information for the network ports of a given network adapter |
| $ sudo nvsm show networkdevicefunctions | Displays information for the PCIe functions for a given network adapter |

| Power Commands | Descriptions |
|---|---|
| $ sudo nvsm show power | Displays information for all power supply units (PSUs) in the system |
| $ sudo nvsm show powermode | Displays the current system power mode |
| $ sudo nvsm show psus | Alias for “show power” |

| NVSwitch Commands | Descriptions |
|---|---|
| $ sudo nvsm show nvswitches | Displays information for all the NVSwitch interconnects in the system |

| Firmware Commands | Descriptions |
|---|---|
| $ sudo nvsm show firmware | Guides you through the steps of selecting a firmware update container on your local DGX system, and running it to show the firmware versions installed on the system. This requires that you have already loaded the container onto the DGX system. |
| $ sudo nvsm update firmware | Guides you through the steps of selecting a firmware update container on your local DGX system, and running it to update the firmware on the system. This requires that you have already loaded the container onto the DGX system. |
Show Health
The “show health” command can be used to quickly assess overall system health.
user@dgx-2:~$ sudo nvsm show health
Example output:
...
Checks
------
Verify installed DIMM memory sticks.......................... Healthy
Number of logical CPU cores [96]............................. Healthy
GPU link speed [0000:39:00.0][8GT/s]......................... Healthy
GPU link width [0000:39:00.0][x16]........................... Healthy
...
Health Summary
--------------
205 out of 205 checks are Healthy
Overall system status is Healthy
If any system health problems are found, they are reflected in the health summary at the bottom of the “show health” output. Detailed information on the health checks performed appears above the summary.
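For scripted or periodic checks, the summary can be extracted from the non-interactive output with standard shell tools. This is a minimal sketch that assumes the summary format shown above; the grep filter is an ordinary shell pipeline, not an NVSM option.
user@dgx-2:~$ sudo nvsm show health | grep "Overall system status"
Overall system status is Healthy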
Dump Health
The “dump health” command produces a health report file suitable for attaching to support tickets.
user@dgx-2:~$ sudo nvsm dump health
Example output:
Writing output to /tmp/nvsm-health-dgx-1-20180907085048.tar.xz
Done.
The file produced by “dump health” is a familiar compressed tar archive, and its contents can be examined by using the “tar” command as shown in the following example.
user@dgx-2:~$ cd /tmp
user@dgx-2:/tmp$ sudo tar xlf nvsm-health-dgx-1-20180907085048.tar.xz
user@dgx-2:/tmp$ sudo ls ./nvsm-health-dgx-1-20180907085048
date java nvsysinfo_commands sos_reports
df last nvsysinfo_log.txt sos_strings
dmidecode lib proc sys
etc lsb-release ps uname
free lsmod pstree uptime
hostname lsof route usr
initctl lspci run var
installed-debs mount sos_commands version.txt
ip_addr netstat sos_logs vgdisplay
The option -qkd or --quick_dump can be used to collect the health report more quickly, at the cost of higher CPU/memory consumption.
# nvsm dump health -qkd
Show Versions
The “nvsm show versions” command displays the hardware components on board, along with their firmware versions. It also shows the installed versions of NVSM, Data Center GPU Manager (DCGM), and the OS, among others.
user@dgxa100:~$ sudo nvsm show versions
Example output:
Initializing NVSM Core...
/versions
Properties:
dgx-release = 5.1.0
nvidia-driver = 470.57.01
cuda-driver = 11.4
os-release = Ubuntu 20.04.2 LTS (Focal Fossa)
kernel = 5.4.0-77-generic
nvidia-container-runtime-docker = 3.4.0-1
docker-ce = 20.10.7
platform = DGXA100
nvsm = 21.07.12-5-g9775e940-dirty
mlnx-ofed = MLNX_OFED_LINUX-5.4-1.0.3.0:
datacenter-gpu-manager = 1:2.2.9
datacenter-gpu-manager-fabricmanager = 470.57.01-1
sBIOS = 1.03
vBIOS-GPU-0 = 92.00.45.00.06
vBIOS-GPU-1 = 92.00.45.00.06
vBIOS-GPU-2 = 92.00.45.00.06
vBIOS-GPU-3 = 92.00.45.00.06
vBIOS-GPU-4 = 92.00.45.00.06
vBIOS-GPU-5 = 92.00.45.00.06
vBIOS-GPU-6 = 92.00.45.00.06
vBIOS-GPU-7 = 92.00.45.00.06
BMC = 0.14.17
CEC-BMC-1 = 03.28
CEC-Delta-2 = 04.00
PSU-0 Chassis-1 = 01.05.01.05.01.05
PSU-1 Chassis-1 = 01.05.01.05.01.05
PSU-2 Chassis-1 = 01.05.01.05.01.05
PSU-3 Chassis-1 = 01.05.01.05.01.05
PSU-4 Chassis-1 = 01.05.01.05.01.05
PSU-5 Chassis-1 = 01.07.01.05.01.06
MB-FPGA = 0.01.03
MID-FPGA = 0.01.03
NvSwitch-0 = 92.10.18.00.02
NvSwitch-1 = 92.10.18.00.02
NvSwitch-2 = 92.10.18.00.02
NvSwitch-3 = 92.10.18.00.02
NvSwitch-4 = 92.10.18.00.02
NvSwitch-5 = 92.10.18.00.02
SSD-nvme0 (S/N S4YPNE0MB00495) System-1 = EPK9CB5Q
SSD-nvme1 (S/N S436NA0M510827) System-1 = EDA7602Q
SSD-nvme2 (S/N S436NA0M510817) System-1 = EDA7602Q
SSD-nvme3 (S/N S4YPNE0MB01307) System-1 = EPK9CB5Q
SSD-nvme4 (S/N S4YPNE0MC01447) System-1 = EPK9CB5Q
Show Storage
NVSM CLI provides a “show storage” command to view all storage-related information. This command can be invoked from the command line as follows.
user@dgx-2:~$ sudo nvsm show storage
The following NVSM commands also show storage-related information.
user@dgx-2:~$ sudo nvsm show drives
user@dgx-2:~$ sudo nvsm show volumes
user@dgx-2:~$ sudo nvsm show controllers
user@dgx-2:~$ sudo nvsm show led
Within an NVSM CLI interactive session, the CLI targets related to storage are located under the /systems/localhost/storage target.
user@dgx2:~$ sudo nvsm
nvsm-> cd /systems/localhost/storage/
nvsm(/systems/localhost/storage/)-> show
Example output:
/systems/localhost/storage/
Properties:
DriveCount = 10
Volumes = [ md0, md1, nvme0n1p1, nvme1n1p1 ]
Targets:
alerts
drives
policy
volumes
Verbs:
cd
show
Show Storage Alerts
Storage alerts are generated when the DSHM monitoring daemon detects a storage-related problem and attempts to alert the user (via email or otherwise). Past storage alerts can be viewed within an NVSM CLI interactive session under the /systems/localhost/storage/alerts target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/storage/alerts
nvsm(/systems/localhost/storage/alerts)-> show
Example output:
/systems/localhost/storage/alerts
Targets:
alert0
alert1
Verbs:
cd
show
In this example listing, there appear to be two storage alerts associated with this system. The contents of these alerts can be viewed with the “show” command.
For example:
nvsm(/systems/localhost/storage/alerts)-> show alert1
/systems/localhost/storage/alerts/alert1
Properties:
system_name = dgx-2
message_details = EFI System Partition 1 is corrupted
nvme0n1p1
component_id = nvme0n1p1
description = Storage sub-system is reporting an error
event_time = 2018-07-14 12:51:19
recommended_action =
1. Please run nvsysinfo
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
alert_id = NV-VOL-03
system_serial = productserial
message = System entered degraded mode, storage sub-system is reporting an error
severity = Warning
Verbs:
cd
show
The message seen in this alert suggests a possible EFI partition corruption, which is an error condition that might adversely affect this system’s ability to boot. Note that the text seen here reflects the exact message that the user would have seen when this alert was generated.
Possible categories for storage alerts are given in the table below.
| Alert ID | Severity | Details |
|---|---|---|
| NV-DRIVE-01 | Critical | Drive missing |
| NV-DRIVE-07 | Warning | System has unsupported drive |
| NV-DRIVE-09 | Warning | Unsupported SED drive configuration |
| NV-DRIVE-10 | Critical | Unsupported volume encryption configuration |
| NV-DRIVE-11 | Warning | M.2 firmware version mismatch |
| NV-VOL-01 | Critical | RAID-0 corruption observed |
| NV-VOL-02 | Critical | RAID-1 corruption observed |
| NV-VOL-03 | Warning | EFI System Partition 1 corruption observed |
| NV-VOL-04 | Warning | EFI System Partition 2 corruption observed |
| NV-CONTROLLER-01 | Warning | Controller is reporting an error |
| NV-CONTROLLER-02 | Warning | Storage controller is reporting PHY error |
| NV-CONTROLLER-03 | Warning | Controller set at lower than expected speed |
| NV-CONTROLLER-04 | Critical | Controller is reporting an error |
| NV-CONTROLLER-05 | Critical | Controller is reporting an error |
| NV-CONTROLLER-06 | Critical | Controller is reporting an error |
| NV-CONTROLLER-07 | Critical | LEDStatus for controller needs to be cleared |
Show Storage Drives
Within an NVSM CLI interactive session, each storage drive on the system is represented by a target under the /systems/localhost/storage/drives target. A listing of drives can be obtained as follows.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/storage/drives
nvsm(/systems/localhost/storage/drives)-> show
Example output:
/systems/localhost/storage/drives
Targets:
nvme0n1
nvme1n1
nvme2n1
nvme3n1
nvme4n1
nvme5n1
nvme6n1
nvme7n1
nvme8n1
nvme9n1
Verbs:
cd
show
Details for any particular drive can be viewed with the “show” command.
For example:
nvsm(/systems/localhost/storage/drives)-> show nvme2n1
/systems/localhost/storage/drives/nvme2n1
Properties:
Capacity = 3840755982336
BlockSizeBytes = 7501476528
SerialNumber = 18141C244707
PartNumber = N/A
Model = Micron_9200_MTFDHAL3T8TCT
Revision = 100007C0
Manufacturer = Micron Technology Inc
Status_State = Enabled
Status_Health = OK
Name = Non-Volatile Memory Express
MediaType = SSD
IndicatorLED = N/A
EncryptionStatus = N/A
HotSpareType = N/A
Protocol = NVMe
NegotiatedSpeedsGbs = 0
Id = 2
Verbs:
cd
show
Show Storage Volumes
Within an NVSM CLI interactive session, each storage volume on the system is represented by a target under the /systems/localhost/storage/volumes target. A listing of volumes can be obtained as follows.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/storage/volumes
nvsm(/systems/localhost/storage/volumes)-> show
Example output:
/systems/localhost/storage/volumes
Targets:
md0
md1
nvme0n1p1
nvme1n1p1
Verbs:
cd
show
Details for any particular volume can be viewed with the “show” command.
For example:
nvsm(/systems/localhost/storage/volumes)-> show md0
/systems/localhost/storage/volumes/md0
Properties:
Status_State = Enabled
Status_Health = OK
Name = md0
Encrypted = False
VolumeType = RAID-1
Drives = [ nvme0n1, nvme1n1 ]
CapacityBytes = 893.6G
Id = md0
Verbs:
cd
show
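The same details can be retrieved non-interactively by passing the full target path to the “show” command. The volume name md0 below is taken from the example above; substitute a volume present on your system.
user@dgx-2:~$ sudo nvsm show /systems/localhost/storage/volumes/md0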
Show GPUs
Information for all GPUs installed on the system can be viewed invoking the “show gpus” command as follows.
user@dgx-2:~$ sudo nvsm show gpus
Within an NVSM CLI interactive session, the same information can be accessed under the /systems/localhost/gpus CLI target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/gpus
nvsm(/systems/localhost/gpus)-> show
Example output:
/systems/localhost/gpus
Targets:
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Verbs:
cd
show
Details for any particular GPU can also be viewed with the “show” command.
For example:
nvsm(/systems/localhost/gpus)-> show 6
/systems/localhost/gpus/6
Properties:
Inventory_ModelName = Tesla V100-SXM3-32GB
Inventory_UUID = GPU-4c653056-0d6e-df7d-19c0-4663d6745b97
Inventory_SerialNumber = 0332318503073
Inventory_PCIeDeviceId = 1DB810DE
Inventory_PCIeSubSystemId = 12AB10DE
Inventory_BrandName = Tesla
Inventory_PartNumber = 699-2G504-0200-000
Verbs:
cd
show
Showing Individual GPUs
The full set of properties for an individual GPU can also be viewed with the “show” command.
For example:
nvsm(/systems/localhost/gpus)-> show GPU6
/systems/localhost/gpus/GPU6
Properties:
Inventory_ModelName = Tesla V100-SXM3-32GB
Inventory_UUID = GPU-4c653056-0d6e-df7d-19c0-4663d6745b97
Inventory_SerialNumber = 0332318503073
Inventory_PCIeDeviceId = 1DB810DE
Inventory_PCIeSubSystemId = 12AB10DE
Inventory_BrandName = Tesla
Inventory_PartNumber = 699-2G504-0200-000
Specifications_MaxPCIeGen = 3
Specifications_MaxPCIeLinkWidth = 16x
Specifications_MaxSpeeds_GraphicsClock = 1597 MHz
Specifications_MaxSpeeds_MemClock = 958 MHz
Specifications_MaxSpeeds_SMClock = 1597 MHz
Specifications_MaxSpeeds_VideoClock = 1432 MHz
Connections_PCIeGen = 3
Connections_PCIeLinkWidth = 16x
Connections_PCIeLocation = 00000000:34:00.0
Power_PowerDraw = 50.95 W
Stats_ErrorStats_ECCMode = Enabled
Stats_FrameBufferMemoryUsage_Free = 32510 MiB
Stats_FrameBufferMemoryUsage_Total = 32510 MiB
Stats_FrameBufferMemoryUsage_Used = 0 MiB
Stats_PCIeRxThroughput = 0 KB/s
Stats_PCIeTxThroughput = 0 KB/s
Stats_PerformanceState = P0
Stats_UtilDecoder = 0 %
Stats_UtilEncoder = 0 %
Stats_UtilGPU = 0 %
Stats_UtilMemory = 0 %
Status_Health = OK
Verbs:
cd
show
Identifying GPU Health Incidents
NVSM uses NVIDIA Data Center GPU Manager (DCGM) to continuously monitor GPU health, and reports GPU health issues as “GPU health incidents”. Whenever GPU health incidents are present, NVSM indicates this state in the “Status_HealthRollup” property of the /systems/localhost/gpus CLI target.
“Status_HealthRollup” captures the overall health of all GPUs in the system in a single value. Check the “Status_HealthRollup” property before checking other properties when checking for GPU health incidents.
To check for GPU health incidents, do the following:
Display the “Properties” section of GPU health.
~$ sudo nvsm
nvsm-> cd /systems/localhost/gpus
nvsm(/systems/localhost/gpus)-> show -display properties
A system with a GPU-related issue might report the following.
Properties:
Status_HealthRollup = Critical
Status_Health = OK
The “Status_Health = OK” property in this example indicates that NVSM did not find any system-level problems, such as missing drivers or incorrect device file permissions. The “Status_HealthRollup = Critical” property indicates that at least one GPU in this system is exhibiting a “Critical” health incident. To find this GPU, issue the following command to list the health status for each GPU.
~$ sudo nvsm
nvsm-> show -display properties=*health /systems/localhost/gpus/*
The GPU with the health incidents will be reported as in the following example for GPU14.
/systems/localhost/gpus/GPU14
Properties:
Status_Health = Critical
Issue the following command to show the detailed health information for a particular GPU (GPU14 in this example).
nvsm-> cd /systems/localhost/gpus
nvsm(/systems/localhost/gpus)-> show -level all GPU14/health
The output shows all the incidents involving that particular GPU.
/systems/localhost/gpus/GPU14/health
Properties:
Health = Critical
Targets:
incident0
Verbs:
cd
show
/systems/localhost/gpus/GPU14/health/incident0
Properties:
Message = GPU 14's NvLink link 2 is currently down.
Health = Critical
System = NVLink
Verbs:
cd
show
The output in this example narrows down the scope to a specific incident (or incidents) on a specific GPU. DCGM will monitor for a variety of GPU conditions, so check “Status_HealthRollup” using NVSM CLI to understand each incident.
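The same checks can also be run non-interactively from the system shell, using only the commands shown above. The target GPU14 is the example from this section; substitute the GPU reported on your system.
~$ sudo nvsm show -display properties=*health /systems/localhost/gpus/*
~$ sudo nvsm show -level all /systems/localhost/gpus/GPU14/health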
Show Processors
Information for all CPUs installed on the system can be viewed using the “show processors” command.
user@dgx-2$ sudo nvsm show processors
From within an NVSM CLI interactive session, the same information is available under the /systems/localhost/processors target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/processors
nvsm(/systems/localhost/processors)-> show
Example output:
/systems/localhost/processors
Targets:
CPU0
CPU1
alerts
policy
Verbs:
cd
show
Details for any particular CPU can be viewed using the “show” command.
For example:
nvsm(/systems/localhost/processors)-> show CPU0
/systems/localhost/processors/CPU0
Properties:
Id = CPU0
InstructionSet = x86-64
Manufacturer = Intel(R) Corporation
MaxSpeedMHz = 3600
Model = Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Name = Central Processor
ProcessorArchitecture = x86
ProcessorId_EffectiveFamily = 6
ProcessorId_EffectiveModel = 85
ProcessorId_IdentificationRegisters = 0xBFEBFBFF00050654
ProcessorId_Step = 4
ProcessorId_VendorId = GenuineIntel
ProcessorType = CPU
Socket = CPU 0
Status_Health = OK
Status_State = Enabled
TotalCores = 24
TotalThreads = 48
Verbs:
cd
show
Show Processor Alerts
Processor alerts are generated when the DSHM monitoring daemon detects a CPU Internal Error (IERR) or Thermal Trip and attempts to alert the user (via email or otherwise). Past processor alerts can be viewed within an NVSM CLI interactive session under the /systems/localhost/processors/alerts target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/processors/alerts
nvsm(/systems/localhost/processors/alerts)-> show
Example output:
/systems/localhost/processors/alerts
Targets:
alert0
alert1
alert2
Verbs:
cd
show
This example listing appears to show three processor alerts associated with this system. The contents of these alerts can be viewed with the “show” command.
For example:
nvsm(/systems/localhost/processors/alerts)-> show alert2
/systems/localhost/processors/alerts/alert2
Properties:
system_name = xpl-bu-06
component_id = CPU0
description = CPU is reporting an error.
event_time = 2018-07-18T16:42:20.580050
recommended_action =
1. Please run nvsysinfo
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
severity = Critical
alert_id = NV-CPU-02
system_serial = To be filled by O.E.M.
message = System entered degraded mode, CPU0 is reporting an error.
message_details = CPU Thermtrip has occurred, processor socket temperature exceeded the thermal specifications of the component.
Verbs:
cd
show
Possible categories for processor alerts are given in the table below.
| Alert ID | Severity | Details |
|---|---|---|
| NV-CPU-01 | Critical | An unrecoverable CPU Internal error has occurred. |
| NV-CPU-02 | Critical | CPU Thermtrip has occurred, processor socket temperature exceeded the thermal specifications of the component. |
Show Memory
Information for all system memory (i.e. all DIMMs installed near the CPU, not including GPU memory) can be viewed using the “show memory” command.
user@dgx-2:~$ sudo nvsm show memory
From within an NVSM CLI interactive session, system memory information is accessible under the /systems/localhost/memory target.
lab@xpl-dvt-42:~$ sudo nvsm
nvsm-> cd /systems/localhost/memory
nvsm(/systems/localhost/memory)-> show
Example output:
/systems/localhost/memory
Targets:
CPU0_DIMM_A1
CPU0_DIMM_A2
CPU0_DIMM_B1
CPU0_DIMM_B2
CPU0_DIMM_C1
CPU0_DIMM_C2
CPU0_DIMM_D1
CPU0_DIMM_D2
CPU0_DIMM_E1
CPU0_DIMM_E2
CPU0_DIMM_F1
CPU0_DIMM_F2
CPU1_DIMM_G1
CPU1_DIMM_G2
CPU1_DIMM_H1
CPU1_DIMM_H2
CPU1_DIMM_I1
CPU1_DIMM_I2
CPU1_DIMM_J1
CPU1_DIMM_J2
CPU1_DIMM_K1
CPU1_DIMM_K2
CPU1_DIMM_L1
CPU1_DIMM_L2
alerts
policy
Verbs:
cd
show
Details for any particular memory DIMM can be viewed using the “show” command.
For example:
nvsm(/systems/localhost/memory)-> show CPU2_DIMM_B1
/systems/localhost/memory/CPU2_DIMM_B1
Properties:
CapacityMiB = 65536
DataWidthBits = 64
Description = DIMM DDR4 Synchronous
Id = CPU2_DIMM_B1
Name = Memory Instance
OperatingSpeedMhz = 2666
PartNumber = 72ASS8G72LZ-2G6B2
SerialNumber = 1CD83000
Status_Health = OK
Status_State = Enabled
VendorId = Micron
Verbs:
cd
show
Show Memory Alerts
On DGX systems with a Baseboard Management Controller (BMC), the BMC will monitor DIMMs for correctable and uncorrectable errors. Whenever memory error counts cross a certain threshold (as determined by SBIOS), a memory alert is generated by the DSHM daemon in an attempt to notify the user (via email or otherwise).
Past memory alerts are accessible from an NVSM CLI interactive session under the /systems/localhost/memory/alerts target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/memory/alerts
nvsm(/systems/localhost/memory/alerts)-> show
Example output:
/systems/localhost/memory/alerts
Targets:
alert0
Verbs:
cd
show
This example listing appears to show one memory alert associated with this system. The contents of this alert can be viewed with the “show” command.
For example:
nvsm(/systems/localhost/memory/alerts)-> show alert0
/systems/localhost/memory/alerts/alert0
Properties:
system_name = xpl-bu-06
component_id = CPU1_DIMM_A2
description = DIMM is reporting an error.
event_time = 2018-07-18T16:48:09.906572
recommended_action =
1. Please run nvsysinfo
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
severity = Critical
alert_id = NV-DIMM-01
system_serial = To be filled by O.E.M.
message = System entered degraded mode, CPU1_DIMM_A2 is reporting an error.
message_details = Uncorrectable error is reported.
Verbs:
cd
show
Possible categories for memory alerts are given in the table below.
| Alert Type | Severity | Details |
|---|---|---|
| NV-DIMM-01 | Critical | Uncorrectable error is reported. |
Show Fans and Temperature
NVSM CLI provides a “show fans” command to display information for each fan on the system.
~$ sudo nvsm show fans
Likewise, NVSM CLI provides a “show temperatures” command to display temperature information for each temperature sensor known to NVSM.
~$ sudo nvsm show temperatures
Within an NVSM CLI interactive session, targets related to fans and temperature are located under the /chassis/localhost/thermal target.
~$ sudo nvsm
nvsm-> cd /chassis/localhost/thermal
nvsm(/chassis/localhost/thermal)-> show
Example output:
/chassis/localhost/thermal
Targets:
alerts
fans
policy
temperatures
Verbs:
cd
show
Show Thermal Alerts
The DSHM daemon monitors fan speed and temperature sensors. When the values of these sensors violate certain threshold criteria, DSHM generates a thermal alert in an attempt to notify the user (via email or otherwise).
Past thermal alerts can be viewed in an NVSM CLI interactive session under the /chassis/localhost/thermal/alerts target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/thermal/alerts
nvsm(/chassis/localhost/thermal/alerts)-> show
Example output:
/chassis/localhost/thermal/alerts
Targets:
alert0
Verbs:
cd
show
This example listing appears to show one thermal alert associated with this system. The contents of this alert can be viewed with the “show” command.
For example:
nvsm(/chassis/localhost/thermal/alerts)-> show alert0
/chassis/localhost/thermal/alerts/alert0
Properties:
system_name = system-name
component_id = FAN1_R
description = Fan Module is reporting an error.
event_time = 2018-07-12T15:12:22.076814
recommended_action =
1. Please run nvsysinfo
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin 3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
severity = Critical
alert_id = NV-FAN-01
system_serial = To be filled by O.E.M.
message = System entered degraded mode, FAN1_R is reporting an error.
message_details = Fan speed reading has fallen below the expected speed setting.
Verbs: cd show
From the message in this alert, it appears that one of the rear fans is broken in this system. This is the exact message that the user would have received at the time this alert was generated, assuming alert notifications were enabled.
Possible categories for thermal-related (fan and temperature) alerts are given in the table below.
| Alert ID | Severity | Details |
|---|---|---|
| NV-FAN-01 | Critical | Fan speed reading has fallen below the expected speed setting. |
| NV-FAN-02 | Critical | Fan readings are inaccessible. |
| NV-PDB-01 | Critical | Operating temperature exceeds the thermal specifications of the component. |
Show Fans
Within an NVSM CLI interactive session, each fan on the system is represented by a target under the /chassis/localhost/thermal/fans target. The “show” command can be used to obtain a listing of fans on the system.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/thermal/fans
nvsm(/chassis/localhost/thermal/fans)-> show
Example output:
/chassis/localhost/thermal/fans
Targets:
FAN10_F
FAN10_R
FAN1_F
FAN1_R
FAN2_F
FAN2_R
FAN3_F
FAN3_R
FAN4_F
FAN4_R
FAN5_F
FAN5_R
FAN6_F
FAN6_R
FAN7_F
FAN7_R
FAN8_F
FAN8_R
FAN9_F
FAN9_R
PDB_FAN1
PDB_FAN2
PDB_FAN3
PDB_FAN4
Verbs:
cd
show
Again using the “show” command, the details for any given fan can be obtained as follows.
For example:
nvsm(/chassis/localhost/thermal/fans)-> show PDB_FAN2
/chassis/localhost/thermal/fans/PDB_FAN2
Properties:
Status_State = Enabled
Status_Health = OK
Name = PDB_FAN2
MemberId = 21
ReadingUnits = RPM
LowerThresholdNonCritical = 11900.000
Reading = 13804 RPM
LowerThresholdCritical = 10744.000
Verbs:
cd
show
Show Temperatures
Each temperature sensor known to NVSM is represented as a target under the /chassis/localhost/thermal/temperatures target. A listing of temperature sensors on the system can be obtained using the following command.
nvsm(/chassis/localhost/thermal/temperatures)-> show
Example output:
/chassis/localhost/thermal/temperatures
Targets:
PDB1
PDB2
Verbs:
cd
show
As with fans, the details for any temperature sensor can be viewed with the “show” command.
For example:
nvsm(/chassis/localhost/thermal/temperatures)-> show PDB2
/chassis/localhost/thermal/temperatures/PDB2
Properties:
Status_State = Enabled
Status_Health = OK
Name = PDB2
PhysicalContext = PDB
MemberId = 1
ReadingCelsius = 20 degrees C
UpperThresholdNonCritical = 127.000
SensorNumber = 66h
UpperThresholdCritical = 127.000
Verbs:
cd
show
Show Power Supplies
NVSM CLI provides a “show power” command to display information for all power supplies present on the system.
user@dgx-2:~$ sudo nvsm show power
From an NVSM CLI interactive session, power supply information can be found under the /chassis/localhost/power target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/power
nvsm(/chassis/localhost/power)-> show
Example output:
/chassis/localhost/power
Targets:
PSU1
PSU2
PSU3
PSU4
PSU5
PSU6
alerts
policy
Verbs:
cd
show
Details for any particular power supply can be viewed using the “show” command as follows.
For example:
nvsm(/chassis/localhost/power)-> show PSU4
/chassis/localhost/power/PSU4
Properties:
Status_State = Present
Status_Health = OK
LastPowerOutputWatts = 442
Name = PSU4
SerialNumber = DTHTCD18240
MemberId = 3
PowerSupplyType = AC
Model = ECD16010081
Manufacturer = Delta
Verbs:
cd
show
Show Power Alerts
The DSHM daemon monitors PSU status. When the PSU status is not Ok, DSHM generates a power alert in an attempt to notify the user (via email or otherwise).
Prior power alerts can be viewed under the /chassis/localhost/power/alerts target of an NVSM CLI interactive session.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/power/alerts
nvsm(/chassis/localhost/power/alerts)-> show
Example output:
/chassis/localhost/power/alerts
Targets:
alert0
alert1
alert2
alert3
alert4
Verbs:
cd
show
This example listing shows a system with five prior power alerts. The details for any one of these alerts can be viewed using the “show” command.
For example:
nvsm(/chassis/localhost/power/alerts)-> show alert4
/chassis/localhost/power/alerts/alert4
Properties:
system_name = system-name
component_id = PSU4
description = PSU is reporting an error.
event_time = 2018-07-18T16:01:27.462005
recommended_action =
1. Please run nvsysinfo
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
severity = Warning
alert_id = NV-PSU-05
system_serial = To be filled by O.E.M.
message = System entered degraded mode, PSU4 is reporting an error.
message_details = PSU is missing
Verbs:
cd
show
Possible categories for power alerts are given in the table below.
| Alert ID | Severity | Details |
|---|---|---|
| NV-PSU-01 | Critical | Power supply module has failed. |
| NV-PSU-02 | Warning | Detected predictive failure of the Power supply module. |
| NV-PSU-03 | Critical | Input to the Power supply module is missing. |
| NV-PSU-04 | Critical | Input voltage is out of range for the Power Supply Module. |
| NV-PSU-05 | Warning | PSU is missing |
Show Network Adapters
NVSM CLI provides a “show networkadapters” command to display information for each physical network adapter in the chassis.
~$ sudo nvsm show networkadapters
Within an NVSM CLI interactive session, targets related to network adapters are located under the /chassis/localhost/NetworkAdapters target.
~$ sudo nvsm
nvsm-> cd /chassis/localhost/NetworkAdapters
nvsm(/chassis/localhost/NetworkAdapters)-> show
Display a List of Muted Adapters
To display a list of the muted adapters, run the following command:
$ sudo nvsm show /chassis/localhost/NetworkAdapters/policy
/chassis/localhost/NetworkAdapters/policy
Properties:
mute_monitoring = <NOT_SET>
mute_notification = <NOT_SET>
Show Network Ports
NVSM CLI provides a “show networkports” command to display information for each physical network port in the chassis.
~$ sudo nvsm show networkports
Within an NVSM CLI interactive session, targets related to network ports are located under the /chassis/localhost/NetworkAdapters/<id>/NetworkPorts target, where <id> is one of the network adapter IDs displayed by the nvsm show networkadapters command.
~$ sudo nvsm
nvsm-> cd /chassis/localhost/NetworkAdapters/<id>/NetworkPorts
nvsm(/chassis/localhost/NetworkAdapters/<id>/NetworkPorts)-> show
Show Network Device Functions
NVSM CLI provides a “show networkdevicefunctions” command to display information for each network adapter-centric PCIe function in the chassis.
~$ sudo nvsm show networkdevicefunctions
Within an NVSM CLI interactive session, targets related to network device functions are located under the /chassis/localhost/NetworkAdapters/<id>/NetworkDeviceFunctions target, where <id> is one of the network adapter IDs displayed by the nvsm show networkadapters command.
~$ sudo nvsm
nvsm-> cd /chassis/localhost/NetworkAdapters/<id>/NetworkDeviceFunctions
nvsm(/chassis/localhost/NetworkAdapters/<id>/NetworkDeviceFunctions)-> show
Display a List of Interfaces
Run the following command:
$ sudo nvsm show /chassis/localhost/NetworkAdapters
/chassis/localhost/NetworkAdapters
Targets:
PCI0000_0c_00
PCI0000_12_00
PCI0000_4b_00
PCI0000_54_00
PCI0000_8d_00
PCI0000_94_00
PCI0000_ba_00
PCI0000_cc_00
PCI0000_e1_00
PCI0000_e2_00
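Using one of the adapter IDs from a listing such as the one above, the ports and PCIe functions of that adapter can be shown directly. The ID PCI0000_0c_00 below is only an example taken from this listing; substitute an ID present on your system.
$ sudo nvsm show /chassis/localhost/NetworkAdapters/PCI0000_0c_00/NetworkPorts
$ sudo nvsm show /chassis/localhost/NetworkAdapters/PCI0000_0c_00/NetworkDeviceFunctions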
Show Network Interfaces
NVSM CLI provides a “show networkinterfaces” command to display information for each logical network adapter on the system.
~$ sudo nvsm show networkinterfaces
In an NVSM CLI interactive session, targets related to network interfaces are located under the /system/localhost/NetworkInterfaces target.
~$ sudo nvsm
nvsm-> cd /system/localhost/NetworkInterfaces
nvsm(/system/localhost/NetworkInterfaces)-> show
Add an Interface to the Mute Notifications
Here is an example of a command you can run to add an interface to the mute notifications:
$ sudo nvsm set chassis/localhost/NetworkAdapters/policy mute_notification=PCI0000_0c_00,PCI0000_12_00
System Monitoring Configuration
NVSM provides a DSHM service that monitors the state of the DGX system.
NVSM CLI can be used to interact with the DSHM system monitoring service via the NVSM API server.
Configuring Email Alerts
To receive the alerts generated by DSHM through email, configure the email settings in the global policy using NVSM CLI. An email is then sent to the configured recipients whenever a new alert is generated. The sender address, recipient address(es), SMTP server name, and SMTP server port number must be configured according to the SMTP server settings hosted by the user.
Email configuration properties
| Property | Description |
|---|---|
| email_sender | Sender email address. Must be a valid email address, otherwise no emails will be sent. |
| email_recipients | List of recipients to which the email shall be sent [ user1@domain.com,user2@domain.com ] |
| email_smtp_server_name | SMTP server name that the user wants to use for relaying email [ smtp.domain.com ] |
| email_smtp_server_port | Port number used by the SMTP server for providing SMTP relay service. Numeric value |
The following examples illustrate how to configure email settings in global policy using NVSM CLI.
user@dgx-2:~$sudo nvsm set /policy email_sender=dgx-admin@nvidia.com
user@dgx-2:~$sudo nvsm set /policy email_smtp_server_name=smtpserver.nvidia.com
user@dgx-2:~$sudo nvsm set /policy email_recipients=jdoe@nvidia.com,jdeer@nvidia.com
user@dgx-2:~$sudo nvsm set /policy email_smtp_server_port=465
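The configured values can then be verified by showing the global policy target (see “Global Monitoring Policy” later in this chapter).
user@dgx-2:~$ sudo nvsm show /policy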
Generating a Test Alert for Email
From within an NVSM CLI interactive session, a user may generate a test alert in order to exercise the configured SMTP settings and receive an email notification.
Creating a Test Alert
NVSM CLI provides a “create testalert” command to generate a dummy alert that will trigger any SMTP or Call Home defined notification. Within an NVSM CLI interactive session, this basic command generates a dummy alert with default component_id = Test0 and severity = Warning.
~$ sudo nvsm create testalert
To configure the Severity and Component of a test alert, issue the following:
~$ sudo nvsm create testalert <component_id> <severity>
Example of generating a dummy alert with component_id = Email1 and severity = Critical:
~$ sudo nvsm create testalert Email1 Critical
Clearing a Test Alert
NVSM CLI also provides a “clear testalert” command to dismiss a generated dummy alert. Within an NVSM CLI interactive session, this basic command will clear any test alert with component_id = Test0, even if there are multiple such alerts.
~$ sudo nvsm clear testalert
To specify which test alert to dismiss, issue the following:
~$ sudo nvsm clear testalert <component_id>
Showing a Test Alert
To display all generated test alerts, NVSM CLI provides a “show testalerts” command.
~$ sudo nvsm show testalerts
Example output:
/systems/localhost/testalerts/alert0
Properties:
system_name = system-name5
message_details = Dummy Test
component_id = Test0
description = No component is reporting an error. This is a test.
event_time = 2021-08-04T15:55:46.926710484-07:00
recommended_action = Please run 'sudo nvsm clear testalert' to dismiss this alert.
alert_id = NV-TEST-01
system_serial = To be filled by O.E.M.
message = Test Alert.
severity = Warning
clear_time = -
hidden = false
type = TestAlerts
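A typical end-to-end check of the notification path, using only the commands described above, might look like the following; the component ID Email1 is just an example.
~$ sudo nvsm create testalert Email1 Critical
~$ sudo nvsm show testalerts
~$ sudo nvsm clear testalert Email1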
Understanding System Monitoring Policies
From within an NVSM CLI interactive session, system monitor policy settings are accessible under the following targets.
| CLI Target | Description |
|---|---|
| /policy | Global NVSM monitoring policy, such as email settings for alert notifications. |
| /systems/localhost/gpus/policy | |
| /systems/localhost/memory/policy | NVSM policy for monitoring DIMM correctable and uncorrectable errors. |
| /systems/localhost/processors/policy | NVSM policy for monitoring CPU machine-check exceptions (MCE) |
| /systems/localhost/storage/policy | NVSM policy for monitoring storage drives and volumes |
| /chassis/policy | |
| /chassis/localhost/thermal/policy | NVSM policy for monitoring fan speed and temperature as reported by the baseboard management controller (BMC) |
| /chassis/localhost/power/policy | NVSM policy for monitoring power supply voltages as reported by the BMC |
| /chassis/localhost/NetworkAdapters/policy | NVSM policy for monitoring the physical network adapters |
| /chassis/localhost/NetworkAdapters/<ETHx>/NetworkPorts/policy | NVSM policy for monitoring the network ports for the specified Ethernet network adapter |
| /chassis/localhost/NetworkAdapters/<IBy>/NetworkPorts/policy | NVSM policy for monitoring the network ports for the specified InfiniBand network adapter |
| /chassis/localhost/NetworkAdapters/<ETHx>/NetworkDeviceFunctions/policy | NVSM policy for monitoring the PCIe functions for the specified Ethernet network adapter |
| /chassis/localhost/NetworkAdapters/<IBy>/NetworkDeviceFunctions/policy | NVSM policy for monitoring the PCIe functions for the specified InfiniBand network adapter |
Global Monitoring Policy
Global monitoring policy is represented by the /policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /policy
Example output:
/policy
Properties:
email_sender = NVIDIA DSHM Service
email_smtp_server_name = smtp.example.com
email_recipients = jdoe@nvidia.com,jdeer@nvidia.com
email_smtp_server_port = 465
Verbs:
cd
set
show
The properties for global monitoring policy are described in the table below.
| Property | Description |
|---|---|
| email_sender | Sender email address |
| email_recipients | List of recipients to which the email shall be sent [ user1@domain.com,user2@domain.com ] |
| email_smtp_server_name | SMTP server name that the user wants to use for relaying email [ smtp.domain.com ] |
| email_smtp_server_port | Port number used by the SMTP server for providing SMTP relay service. Numeric value |
Memory Monitoring Policy
Memory monitoring policy is represented by the /systems/localhost/memory/policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /systems/localhost/memory/policy
Example output:
/systems/localhost/memory/policy
Properties:
mute_notification = <NOT_SET>
mute_monitoring = <NOT_SET>
Verbs:
cd
set
show
The properties for memory monitoring policy are described in the table below.
| Property | Syntax | Description |
|---|---|---|
| mute_notification | List of comma separated DIMM IDs. Example: CPU1_DIMM_A1,CPU2_DIMM_F2 | Email alert notification is suppressed for devices in the list. |
| mute_monitoring | List of comma separated DIMM IDs. Example: CPU1_DIMM_A1,CPU2_DIMM_F2 | Health monitoring is suppressed for devices in the list. |
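For example, health monitoring and email notification for a particular DIMM can be suppressed by setting these properties with the “set” verb, as with the other policy targets in this chapter. The DIMM ID CPU1_DIMM_A2 is taken from an earlier example; substitute the IDs relevant to your system.
user@dgx-2:~$ sudo nvsm set /systems/localhost/memory/policy mute_monitoring=CPU1_DIMM_A2
user@dgx-2:~$ sudo nvsm set /systems/localhost/memory/policy mute_notification=CPU1_DIMM_A2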
Processor Monitoring Policy
Processor monitoring policy is represented by the /systems/localhost/processors/policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /systems/localhost/processors/policy
Example output:
/systems/localhost/processors/policy
Properties:
mute_notification = <NOT_SET>
mute_monitoring = <NOT_SET>
Verbs:
cd
set
show
The properties for processor monitoring policy are described in the table below.
| Property | Syntax | Description |
|---|---|---|
| mute_notification | List of comma separated CPU IDs. Example: CPU0,CPU1 | Email alert notification is suppressed for devices in the list. |
| mute_monitoring | List of comma separated CPU IDs. Example: CPU0,CPU1 | Health monitoring is suppressed for devices in the list. |
Storage Monitoring Policy
Storage monitoring policy is represented by the /systems/localhost/storage/policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /systems/localhost/storage/policy
Example output:
/systems/localhost/storage/policy
Properties:
volume_mute_monitoring = <NOT_SET>
volume_poll_interval = 10
drive_mute_monitoring = <NOT_SET>
drive_mute_notification = <NOT_SET>
drive_poll_interval = 10
volume_mute_notification = <NOT_SET>
Verbs:
cd
set
show
The properties for storage monitoring policy are described in the table below.
| Property | Syntax | Description |
|---|---|---|
| drive_mute_notification | List of comma separated drive slots. Example: 0, 1 etc. | Email alert notification is suppressed for drives in the list. |
| drive_mute_monitoring | List of comma separated drive slots. Example: 0, 1 etc. | Health monitoring is suppressed for drives in the list. |
| drive_poll_interval | Positive integer | DSHM checks the health of the drives periodically. By default, this polling occurs every 10 seconds. The poll interval can be configured through this property. |
| volume_mute_notification | List of comma separated volume identifiers. Example: md0, md1 etc. | Email alert notification is suppressed for volumes in the list. |
| volume_mute_monitoring | List of comma separated volume identifiers. Example: md0, md1 etc. | Health monitoring is suppressed for volumes in the list. |
| volume_poll_interval | Positive integer | DSHM checks the health of the volumes periodically. By default, this polling occurs every 10 seconds. The poll interval can be configured through this property. |
NVSM identifies storage volumes uniquely by their associated UUID, so mute monitoring for volume resources uses the UUID instead of the volume name. This is required for NVSM versions greater than 21.09.
The steps to identify the UUID of a volume to be used for mute monitoring and notification are listed below.
To get the list of volumes on the server, run the following command:
# nvsm show volumes
# nvsm show volumes
/systems/localhost/storage/volumes/md0
Properties:
CapacityBytes = 1918641373184
Encrypted = False
Id = md0
Name = md0
Status_Health = OK
Status_State = Enabled
VolumeType = Mirrored
To find the UUID of a particular volume, run the following command. The command lists the properties, which contain the UUID, for the volume with the name md0:
# mdadm --detail /dev/{volume name}
# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Tue Feb 23 18:04:37 2021
        Raid Level : raid1
        Array Size : 1873673216 (1786.87 GiB 1918.64 GB)
     Used Dev Size : 1873673216 (1786.87 GiB 1918.64 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent
     Intent Bitmap : Internal
       Update Time : Tue Apr 11 08:13:48 2023
             State : active
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0
Consistency Policy : bitmap
              Name : dgx-20-04:0
              UUID : 3568aa82:dc3da8ac:5c17ea13:b04cf894
            Events : 78460
    Number   Major   Minor   RaidDevice State
       0     259        5        0      active sync   /dev/nvme2n1p2
       1     259       15        1      active sync   /dev/nvme3n1p2
Run the following command to set the UUID for mute monitoring:
# nvsm set /systems/localhost/storage/policy volume_mute_monitoring=<UUID>
# nvsm set /systems/localhost/storage/policy volume_mute_monitoring=3568aa82:dc3da8ac:5c17ea13:b04cf894
Note
If you get an error message that states
WARNING:nvsm:Unknown volume ID: 3568aa82:dc3da8ac:5c17ea13:b04cf894
it can be ignored. This is a known issue that will be fixed in a future version of NVSM.
Run the following command to set the UUID for mute notification:
# nvsm set /systems/localhost/storage/policy volume_mute_notification=<UUID>
# nvsm set /systems/localhost/storage/policy volume_mute_notification=3568aa82:dc3da8ac:5c17ea13:b04cf894
Run the following command to verify that the policies were set correctly:
# nvsm show /systems/localhost/storage/policy
# nvsm show /systems/localhost/storage/policy
/systems/localhost/storage/policy
Properties:
controller_mute_monitoring = <NOT_SET>
controller_mute_notification = <NOT_SET>
controller_poll_interval = 60
drive_mute_monitoring = <NOT_SET>
drive_mute_notification = <NOT_SET>
drive_poll_interval = 60
volume_mute_monitoring = 3568aa82:dc3da8ac:5c17ea13:b04cf894
volume_mute_notification = 3568aa82:dc3da8ac:5c17ea13:b04cf894
volume_poll_interval = 60
Targets:
Verbs:
cd
set
show
Thermal Monitoring Policy
Thermal monitoring policy (for fan speed and temperature) is represented by the /chassis/localhost/thermal/policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /chassis/localhost/thermal/policy
Example output:
/chassis/localhost/thermal/policy
Properties:
fan_mute_notification = <NOT_SET>
pdb_mute_monitoring = <NOT_SET>
fan_mute_monitoring = <NOT_SET>
pdb_mute_notification = <NOT_SET>
Verbs:
cd
set
show
The properties for thermal monitoring policy are described in the table below.
Property | Syntax | Description
---|---|---
fan_mute_notification | List of comma separated FAN IDs. Example: FAN2_R,FAN1_L,PDB_FAN2 | Email alert notification is suppressed for devices in the list.
fan_mute_monitoring | List of comma separated FAN IDs. Example: FAN6_F,PDB_FAN1 | Health monitoring is suppressed for devices in the list.
pdb_mute_notification | List of comma separated PDB IDs. Example: PDB1,PDB2 | Email alert notification is suppressed for devices in the list.
pdb_mute_monitoring | List of comma separated PDB IDs. Example: PDB1 | Health monitoring is suppressed for devices in the list.
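For example, to suppress health monitoring for a couple of fans, use the same set verb shown for the storage policy. The fan IDs below are the sample IDs from the table; substitute the IDs reported by nvsm show fans on your system:
$ sudo nvsm set /chassis/localhost/thermal/policy fan_mute_monitoring=FAN6_F,PDB_FAN1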
Power Monitoring Policy
Power monitoring policy is represented by the /chassis/localhost/power/policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /chassis/localhost/power/policy
Example output:
/chassis/localhost/power/policy
Properties:
mute_notification = <NOT_SET>
mute_monitoring = <NOT_SET>
Verbs:
cd
set
show
The properties for power monitoring policy are described in the table below.
Property | Syntax | Description
---|---|---
mute_notification | List of comma separated PSU IDs. Example: PSU4,PSU2 | Email alert notification is suppressed for devices in the list.
mute_monitoring | List of comma separated PSU IDs. Example: PSU1,PSU4 | Health monitoring is suppressed for devices in the list.
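Following the same pattern, suppressing email alerts for specific power supplies looks like this. The PSU IDs are taken from the sample values in the table; use the IDs reported for your system:
$ sudo nvsm set /chassis/localhost/power/policy mute_notification=PSU4,PSU2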
PCIe Monitoring Policy
PCIe monitoring policy is represented by the /systems/localhost/pcie/policy target of NVSM CLI.
:~$ sudo nvsm show /systems/localhost/pcie/policy
Example output:
/systems/localhost/pcie/policy
Properties:
mute_notification = <NOT_SET>
mute_monitoring = <NOT_SET>
Verbs:
cd
set
show
The properties for PCIe monitoring policy are described in the table below.
Property | Syntax | Description
---|---|---
mute_notification | List of comma separated PCIe IDs | Email alert notification is suppressed for devices in the list.
mute_monitoring | List of comma separated PCIe IDs | Health monitoring is suppressed for devices in the list.
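Muting a specific PCIe device uses the set verb on this target. The PCIe ID below is a placeholder; you can browse the parent target (for example, sudo nvsm show /systems/localhost/pcie) to find the IDs NVSM uses on your system:
$ sudo nvsm set /systems/localhost/pcie/policy mute_monitoring=<PCIe-ID>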
GPU Monitoring Policy
GPU monitoring policy is represented by the /systems/localhost/gpus/policy target of NVSM CLI.
:~$ sudo nvsm show /systems/localhost/gpus/policy
Example output:
/systems/localhost/gpus/policy
Properties:
mute_notification = <NOT_SET>
mute_monitoring = <NOT_SET>
Verbs:
cd
set
show
The properties for GPU monitoring policy are described in the table below.
Property | Syntax | Description
---|---|---
mute_notification | List of comma separated GPU IDs | Email alert notification is suppressed for devices in the list.
mute_monitoring | List of comma separated GPU IDs | Health monitoring is suppressed for devices in the list.
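The same set pattern applies to GPUs. The GPU ID below is a placeholder; list the GPU targets first (for example, sudo nvsm show /systems/localhost/gpus) and use one of the IDs shown there:
$ sudo nvsm set /systems/localhost/gpus/policy mute_notification=<GPU-ID>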
Network Adapter Monitoring Policies
Network Adapter Policy
The physical network adapter monitoring policy is represented by the /chassis/localhost/NetworkAdapters/policy target of the NVSM CLI.
:~$ sudo nvsm show /chassis/localhost/NetworkAdapters/policy
Example output:
/chassis/localhost/NetworkAdapters/policy
Properties:
mute_notification = <NOT_SET>
mute_monitoring = <NOT_SET>
Verbs:
cd
set
show
The properties are described in the following table.
Property | Syntax | Description
---|---|---
mute_notification | List of comma separated physical network adapter IDs. | Email alert notification is suppressed for devices in the list.
mute_monitoring | List of comma separated physical network adapter IDs. | Health monitoring is suppressed for devices in the list.
Mute monitoring is assigned using the physical adapter name, not the logical name. To get the physical adapter names, use the following command:
$ sudo nvsm show /chassis/localhost/NetworkAdapters
This command will display a list of target adapter names as shown below:
$ sudo nvsm show /chassis/localhost/NetworkAdapters
/chassis/localhost/NetworkAdapters
Targets:
PCI0000_0c_00
PCI0000_12_00
PCI0000_4b_00
PCI0000_54_00
PCI0000_8d_00
PCI0000_94_00
PCI0000_ba_00
PCI0000_cc_00
PCI0000_e1_00
PCI0000_e2_00
Note
Use these adapter names to assign monitoring policies.
Here is an example that uses the PCI0000_0c_00 network interface:
:~$ sudo nvsm show /chassis/localhost/NetworkAdapters/PCI0000_0c_00/NetworkPorts/policy
Example output:
/chassis/localhost/NetworkAdapters/PCI0000_0c_00/NetworkPorts/policy
Properties:
mute_notification = <NOT_SET>
mute_monitoring = <NOT_SET>
Verbs:
cd
set
show
The properties are described in the following table.
Property | Syntax | Description
---|---|---
mute_notification | List of comma separated physical network port IDs. | Email alert notification is suppressed for devices in the list.
mute_monitoring | List of comma separated physical network port IDs. | Health monitoring is suppressed for devices in the list.
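To mute a specific port on this adapter, list the port targets under the adapter first and then set the policy. The port ID below is a placeholder for one of the targets returned by the listing:
$ sudo nvsm show /chassis/localhost/NetworkAdapters/PCI0000_0c_00/NetworkPorts
$ sudo nvsm set /chassis/localhost/NetworkAdapters/PCI0000_0c_00/NetworkPorts/policy mute_monitoring=<port-ID>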
Network Devices Functions Policy
The network devices functions monitoring policy is represented by the /chassis/localhost/NetworkAdapters/<network-id>/NetworkDeviceFunctions/policy target of NVSM CLI.
The following command uses the PCI0000_0c_00 adapter as an example.
:~$ sudo nvsm show /chassis/localhost/NetworkAdapters/PCI0000_0c_00/NetworkDeviceFunctions/policy
Example output:
/chassis/localhost/NetworkAdapters/PCI0000_0c_00/NetworkDeviceFunctions/policy
Properties:
mute_monitoring = <NOT_SET>
mute_notification = <NOT_SET>
rx_collision_threshold = 5
rx_crc_threshold = 5
tx_collision_threshold = 5
Verbs:
cd
set
show
The properties are described in the following table.
Property | Syntax | Description
---|---|---
mute_notification | List of comma separated network-centric PCIe function IDs. | Email alert notification is suppressed for devices in the list.
mute_monitoring | List of comma separated network-centric PCIe function IDs. | Health monitoring is suppressed for devices in the list.
rx_collision_threshold | Positive integer | 
rx_crc_threshold | Positive integer | 
tx_collision_threshold | Positive integer | 
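The thresholds are plain integers, so adjusting them follows the usual set pattern. For instance, raising the CRC error threshold on this adapter (the value 10 is only an illustration):
$ sudo nvsm set /chassis/localhost/NetworkAdapters/PCI0000_0c_00/NetworkDeviceFunctions/policy rx_crc_threshold=10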
Performing System Management Tasks
This section describes commands for accomplishing some system management tasks.
Rebuilding a RAID/ESP Array for Current NVSM
On DGX systems, cache drives are configured as a RAID 0 array by default. This volume is mounted to /raid. In the example below, it shows as /dev/md1, but the name can be different depending on the OS naming schema and configuration.
Additionally, for DGX systems with two NVMe OS drives, each OS drive has two partitions:
The second partition on each drive is part of a RAID 1 array with the operating system installed; in the examples below, it shows as /dev/md0.
The first partition is known as the EFI System Partition (ESP). NVSM monitors the content of this partition on both drives. If one of the ESPs is corrupted, NVSM can be used to recover that partition from the healthy ESP.
Note
This is not a RAID array, because UEFI does not support booting from software RAID volumes.
Viewing a Healthy RAID/ESP Volume
On a healthy system, the OS volume appears with VolumeType = Mirrored and Status_Health = OK. For example:
nvsm(/systems/localhost/storage)-> show volumes/md0
/systems/localhost/storage/volumes/md0
Properties:
CapacityBytes = 1918641373184
Encrypted = False
Id = md0
Name = md0
Status_Health = OK
Status_State = Enabled
VolumeType = Mirrored
Targets:
Verbs:
cd
show
The cache volume appears with VolumeType = NonRedundant and Status_Health = OK. For example:
nvsm-> cd /systems/localhost/storage
nvsm(/systems/localhost/storage)-> show volumes/md1
/systems/localhost/storage/volumes/md1
Properties:
CapacityBytes = 30724962910208
Encrypted = False
Id = md1
Name = md1
Status_Health = OK
Status_State = Enabled
VolumeType = NonRedundant
Targets:
encryption
Verbs:
cd
show
The ESP volume appears with VolumeType = EFI system partition and Status_Health = OK. The name of the ESP volume varies per system; you can use the command nvsm show volumes to list all volumes and look for VolumeType = EFI system partition. Here is an example from a DGX A100:
nvsm(/systems/localhost/storage)-> show volumes
...
/systems/localhost/storage/volumes/nvme2n1p1
Properties:
CapacityBytes = 536870912
Encrypted = False
Id = nvme2n1p1
Name = nvme2n1p1
Status_Health = OK
Status_State = Enabled
VolumeType = EFI system partition
...
/systems/localhost/storage/volumes/nvme3n1p1
Properties:
CapacityBytes = 536870912
Encrypted = False
Id = nvme3n1p1
Name = nvme3n1p1
Status_Health = OK
Status_State = StandbyOffline
VolumeType = EFI system partition
Targets:
Verbs:
cd
show
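A quick way to scan the health of all volumes at once, for example before or after replacing a drive, is to filter the non-interactive output. This is just a convenience built from the commands and output fields shown above:
$ sudo nvsm show /systems/localhost/storage/volumes | grep -E 'volumes/|Id =|VolumeType|Status_Health|Status_State'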
Viewing a Degraded RAID/ESP Volume
On a system with a degraded OS volume, the md0 volume will appear with only one drive and the following Status_Health = Critical message:
nvsm-> cd /systems/localhost/storage
nvsm(/systems/localhost/storage)-> show volumes/md0
/systems/localhost/storage/volumes/md0
Properties:
CapacityBytes = 1918641373184
Encrypted = False
Id = md0
Name = md0
Status_Health = Critical
Status_State = Enabled
VolumeType = Mirrored
Targets:
Verbs:
cd
show
On a system with a corrupted ESP, the volume will appear with the following Status_Health = Critical and Status_State = UnavailableOffline messages:
nvsm-> cd /systems/localhost/storage
nvsm(/systems/localhost/storage)-> show volumes/nvme2n1p1
/systems/localhost/storage/volumes/nvme2n1p1
Properties:
CapacityBytes = 536870912
Encrypted = False
Id = nvme2n1p1
Name = nvme2n1p1
Status_Health = Critical
Status_State = UnavailableOffline
VolumeType = EFI system partition
Targets:
Verbs:
cd
show
Rebuilding the RAID/ESP Volume
To rebuild the RAID/ESP volume, make sure that you have replaced failed NVMe drives.
The RAID rebuilding process should begin automatically upon turning on the system. If it does not start automatically, use NVSM CLI to manually rebuild the array as follows.
Start an NVSM CLI interactive session and switch to the storage target.
~$ sudo nvsm
nvsm-> cd /systems/localhost/storage
Start the rebuilding process, and select which volumes to rebuild.
raid-1 for OS volume
raid-0 for cache volume
esp for EFI system partition
For raid-1 volume, you also need to enter the replaced drive name.
Note
This is not the partition name. For example, use nvme3 instead of nvme3n1p2.
nvsm(/systems/localhost/storage)-> start volumes/rebuild
PROMPT: In order to rebuild volume, volume type is required. Please specify the volume type to rebuild from options below.
raid-0: create raid-0 data volume
raid-1: rebuild OS boot and root volumes
esp: find and replicate an empty EFI system partition
Type of volume rebuild (CTRL-C to cancel): raid-1
PROMPT: In order to rebuild this volume, a spare drive is required. Please specify the spare drive to use to rebuild RAID-1.
Name of spare drive for RAID-1 rebuild (CTRL-C to cancel): nvme3
WARNING: Once the rebuild process is started, the process cannot be stopped.
Start RAID-1 rebuild? [y/n] y
After entering y at the prompt to start the RAID 1 rebuild, the “Initiating rebuild …” message appears.
/systems/localhost/storage/volumes/rebuild started at 2023-04-10
Initiating RAID-1 rebuild on volume md0...
0.0% [\ ]
After a few seconds, the “Rebuilding RAID-1 …” message appears.
/systems/localhost/storage/volumes/rebuild started at 2023-04-10 08:22:58.910025
Rebuilding RAID-1...
31.0% [=============/ ]
If this message remains at Initiating RAID-1 rebuild for more than 30 seconds, there is a problem with the rebuild process. Verify that the name of the replacement drive is correct and try again.
The RAID 1 rebuild process should take about 1 hour to complete.
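While a rebuild is in progress, you can also watch the underlying Linux md resync directly. This is a standard mdadm/Linux check rather than an NVSM command, and it assumes the md device names shown in the examples above:
$ watch -n 5 cat /proc/mdstat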
For more detailed information on replacing a failed NVMe drive, see the NVIDIA DGX-2 Service Manual or NVIDIA DGX A100 Service Manual.
Rebuilding a RAID 1 Array for Legacy NVSM (< 21.09)
For DGX systems with two NVMe OS drives configured as a RAID 1 array, the operating system is installed on volume md0. You can use NVSM CLI to view the health of the RAID volume and then rebuild the RAID array on two healthy drives.
Viewing a Healthy RAID Volume
On a healthy system, this volume appears with two drives and Status_Health = OK. For example:
nvsm-> cd /systems/localhost/storage
nvsm(/systems/localhost/storage)-> show volumes/md0
/systems/localhost/storage/volumes/md0
Properties:
Status_State = Enabled
Status_Health = OK
Name = md0
Encrypted = False
VolumeType = RAID-1
Drives = [ nvme0n1, nvme1n1 ]
CapacityBytes = 893.6G
Id = md0
Targets:
rebuild
Verbs:
cd
show
Viewing a Degraded RAID Volume
On a system with a degraded OS volume, the md0 volume will appear with only one drive and the following Status_Health = Warning and Status_State = Degraded messages:
nvsm-> cd /systems/localhost/storage
nvsm(/systems/localhost/storage)-> show volumes/md0
/systems/localhost/storage/volumes/md0
Properties:
Status_State = Degraded
Status_Health = Warning
Name = md0
Encrypted = False
VolumeType = RAID-1
Drives = [ nvme1n1 ]
CapacityBytes = 893.6G
Id = md0
Targets:
rebuild
Verbs:
cd
show
In this situation, the OS volume is missing its parity drive.
Rebuilding the RAID 1 Volume
To rebuild the RAID array, make sure that you have installed a known good NVMe drive for the parity drive.
The RAID rebuilding process should begin automatically upon turning on the system. If it does not start automatically, use NVSM CLI to manually rebuild the array as follows.
Start an NVSM CLI interactive session and switch to the storage target.
$ sudo nvsm
nvsm-> cd /systems/localhost/storage
Start the rebuilding process and be ready to enter the device name of the replaced drive.
nvsm(/systems/localhost/storage)-> start volumes/md0/rebuild
PROMPT: In order to rebuild this volume, a spare drive is required. Please specify the spare drive to use to rebuild md0.
Name of spare drive for md0 rebuild (CTRL-C to cancel): nvmeXn1
WARNING: Once the volume rebuild process is started, the process cannot be stopped.
Start RAID-1 rebuild on md0? [y/n] y
After entering y at the prompt to start the RAID 1 rebuild, the “Initiating rebuild …” message appears.
/systems/localhost/storage/volumes/md0/rebuild started at 2018-10-12 15:27:26.525187
Initiating RAID-1 rebuild on volume md0...
0.0% [\ ]
After about 30 seconds, the "Rebuilding RAID-1 …" message should appear.
/systems/localhost/storage/volumes/md0/rebuild started at 2018-10-12 15:27:26.525187
Rebuilding RAID-1 rebuild on volume md0...
31.0% [=============/ ]
If this message remains at "Initiating RAID-1 rebuild" for more than 30 seconds, there is a problem with the rebuild process. Verify that the name of the replacement drive is correct and try again.
The RAID 1 rebuild process should take about 1 hour to complete.
For more detailed information on replacing a failed NVMe OS drive, see the NVIDIA DGX-2 Service Manual or NVIDIA DGX A100 Service Manual.
Setting MaxQ/MaxP on DGX-2 Systems
Beginning with DGX OS 4.0.5, you can set one of two GPU performance modes: MaxQ or MaxP.
Note
Support on DGX-2 systems requires BMC firmware version 1.04.03 or later. MaxQ/MaxP is not supported on DGX-2H systems.
MaxQ
Maximum efficiency mode
Allows two DGX-2 systems to be installed in racks that have a power budget of 18 kW.
Switch to MaxQ mode as follows.
$ sudo nvsm set powermode=maxq
The settings are preserved across reboots.
MaxP
Default mode for maximum performance
GPUs operate unconstrained up to the thermal design power (TDP) level.
In this setting, the maximum DGX-2 power consumption is 10 kW.
When only 3 or 4 PSUs are working, performance is reduced but still better than in MaxQ mode.
If you switch to MaxQ mode, you can switch back to MaxP mode as follows:
$ sudo nvsm set powermode=maxp
The settings are preserved across reboots.
Performing a Stress Test
NVSM can simultaneously stress various components of the system (GPU, PCIe, DIMMs, storage drives, CPUs, and network cards) with large workloads. At the end of the run, the stress test provides a summary that indicates whether each stressed component passed or failed with an error. NVSM also monitors various system metrics during the stress test to give a clearer picture of the computational load imposed. The stress test is invoked from the CLI.
Syntax:
$ sudo nvsm stress-test [--usage] [--force] [--no-prompt] [<test>...] [DURATION]
For help on running the test, issue the following.
$ sudo nvsm stress-test --usage
Example output for sudo nvsm stress-test 60 --force: