Using the NVSM CLI
NVIDIA DGX-2 servers running DGX OS version 4.0.1 or later should come with NVSM pre-installed.
NVSM CLI communicates with the privileged NVSM API server, so NVSM CLI requires superuser privileges to run. All examples given in this guide are prefixed with the “sudo” command.
Using the NVSM CLI Interactively
Starting an interactive session
The command “sudo nvsm” will start an NVSM CLI interactive session.
user@dgx-2:~$ sudo nvsm
[sudo] password for user:
nvsm->
Once at the “nvsm->” prompt, the user can enter NVSM CLI commands to view and manage the DGX system.
Example command
One such command is “show fans”, which prints the state of all fans known to NVSM.
nvsm-> show fans
/chassis/localhost/thermal/fans/FAN10_F
Properties:
Status_State = Enabled
Status_Health = OK
Name = FAN10_F
MemberId = 19
ReadingUnits = RPM
LowerThresholdNonCritical = 5046.000
Reading = 9802 RPM
LowerThresholdCritical = 3596.000
...
/chassis/localhost/thermal/fans/PDB_FAN4
Properties:
Status_State = Enabled
Status_Health = OK
Name = PDB_FAN4
MemberId = 23
ReadingUnits = RPM
LowerThresholdNonCritical = 11900.000
Reading = 14076 RPM
LowerThresholdCritical = 10744.000
nvsm->
Leaving an interactive session
To leave the NVSM CLI interactive session, use the “exit” command.
nvsm-> exit
user@dgx2:~$
Using the NVSM CLI Non-Interactively
Any NVSM CLI command can be invoked from the system shell, without starting an NVSM CLI interactive session. To do this, simply append the desired NVSM CLI command to the “sudo nvsm” command. The “show fans” command given above can be invoked directly from the system shell as follows.
user@dgx2:~$ sudo nvsm show fans
/chassis/localhost/thermal/fans/FAN10_F
Properties:
Status_State = Enabled
Status_Health = OK
Name = FAN10_F
MemberId = 19
ReadingUnits = RPM
LowerThresholdNonCritical = 5046.000
Reading = 9802 RPM
LowerThresholdCritical = 3596.000
...
/chassis/localhost/thermal/fans/PDB_FAN4
Properties:
Status_State = Enabled
Status_Health = OK
Name = PDB_FAN4
MemberId = 23
ReadingUnits = RPM
LowerThresholdNonCritical = 11900.000
Reading = 14076 RPM
LowerThresholdCritical = 10744.000
user@dgx2:~$
The output of some NVSM commands can be too large to fit on one screen, so it is sometimes useful to pipe this output to a paging utility such as “less”.
user@dgx2:~$ sudo nvsm show fans | less
Throughout this chapter, examples are given for both interactive and non-interactive NVSM CLI use cases. Note that these interactive and non-interactive examples are interchangeable.
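For example, the following two commands display the same fan information, the first from within an interactive session and the second directly from the system shell.
nvsm-> show fans
user@dgx2:~$ sudo nvsm show fans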
Getting Help
Apart from the NVSM CLI User Guide (this document), there are many sources for finding additional help for NVSM CLI and the related NVSM tools.
nvsm “man” Page
A man page for NVSM CLI is included on DGX systems with NVSM installed. The user can view this man page by invoking the “man nvsm” command.
user@dgx2:~$ man nvsm
nvsm --help Flag
By passing the --help flag, the nvsm command itself will print a short description of the command line arguments it recognizes. These arguments affect the behavior of the NVSM CLI session, such as colorization of output and the logging level.
user@dgx2:~$ nvsm --help
usage: nvsm [-h] [--color WHEN] [-i] [--] [<command>...]
NVIDIA System Management interface
optional arguments:
-h, --help show this help message and exit
--color WHEN Control colorization of output. Possible
values for WHEN are "always", "never", or
"auto". Default value is "auto".
-i, --interactive When this option is given, run in
interactive mode. The default is
automatic.
--log-level {debug,info,warning,error,critical}
Set the output logging level. Default is
'warning'.
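For example, to run a single command with debug-level logging and colorized output disabled (a usage sketch combining the options listed above):
user@dgx2:~$ sudo nvsm --color never --log-level debug show fans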
Help for NVSM CLI Commands
Each NVSM command within the NVSM CLI interactive session, such as show, set, and exit, recognizes a “-help” flag that describes the NVSM command and its arguments.
user@dgx2:~$ sudo nvsm
nvsm-> exit -help
usage: exit [-help]
Leave the NVSM shell.
optional arguments:
-help, -h show this help message and exit
Examining System Health
The most basic functionality of NVSM CLI is examination of system state. NVSM CLI provides a “show” command for this purpose.
Because NVSM CLI is modeled after the SMASH CLP, the output of the NVSM CLI “show” command should be familiar to users of BMC command line interfaces.
List of Basic Commands
The following tables list the basic commands (primarily “show”). Detailed use of these commands is explained in subsequent sections of this document.
Note
On DGX Station, only the following commands are supported.
nvsm show health
nvsm dump health
| Global Commands | Descriptions |
|---|---|
| $ sudo nvsm show alerts | Displays warnings and critical alerts for all subsystems |
| $ sudo nvsm show policy | Displays alert policies for subsystems |
| $ sudo nvsm show versions | Displays system version properties |

| Health Commands | Descriptions |
|---|---|
| $ sudo nvsm show health | Displays overall system health |
| $ sudo nvsm dump health | Generates a health report file |

| Storage Commands | Descriptions |
|---|---|
| $ sudo nvsm show storage | Displays all storage-related information |
| $ sudo nvsm show drives | Displays the storage drives |
| $ sudo nvsm show volumes | Displays the storage volumes |

| GPU Commands | Descriptions |
|---|---|
| $ sudo nvsm show gpus | Displays information for all GPUs in the system |

| Processor Commands | Descriptions |
|---|---|
| $ sudo nvsm show processors | Displays information for all CPUs in the system |
| $ sudo nvsm show cpus | Alias for “show processors” |

| Memory Commands | Descriptions |
|---|---|
| $ sudo nvsm show memory | Displays information for all installed DIMMs |
| $ sudo nvsm show dimms | Alias for “show memory” |

| Thermal Commands | Descriptions |
|---|---|
| $ sudo nvsm show fans | Displays information for all fans in the system |
| $ sudo nvsm show temperatures | Displays temperature information for all sensors in the system |
| $ sudo nvsm show temps | Alias for “show temperatures” |

| Network Commands | Descriptions |
|---|---|
| $ sudo nvsm show networkadapters | Displays information for the physical network adapters |
| $ sudo nvsm show networkinterfaces | Displays information for the logical network interfaces |
| $ sudo nvsm show networkports | Displays information for the network ports of a given network adapter |
| $ sudo nvsm show networkdevicefunctions | Displays information for the PCIe functions of a given network adapter |

| Power Commands | Descriptions |
|---|---|
| $ sudo nvsm show power | Displays information for all power supply units (PSUs) in the system |
| $ sudo nvsm show powermode | Displays the current system power mode |
| $ sudo nvsm show psus | Alias for “show power” |

| NVSwitch Commands | Descriptions |
|---|---|
| $ sudo nvsm show nvswitches | Displays information for all NVSwitch interconnects in the system |

| Firmware Commands | Descriptions |
|---|---|
| $ sudo nvsm show firmware | Guides you through selecting a firmware update container on your local DGX system and running it to show the firmware versions installed on the system. This requires that you have already loaded the container onto the DGX system. |
| $ sudo nvsm update firmware | Guides you through selecting a firmware update container on your local DGX system and running it to update the firmware on the system. This requires that you have already loaded the container onto the DGX system. |
Show Health
The “show health” command can be used to quickly assess overall system health.
user@dgx-2:~$ sudo nvsm show health
Example output:
...
Checks
------
Verify installed DIMM memory sticks.......................... Healthy
Number of logical CPU cores [96]............................. Healthy
GPU link speed [0000:39:00.0][8GT/s]......................... Healthy
GPU link width [0000:39:00.0][x16]........................... Healthy
...
Health Summary
--------------
205 out of 205 checks are Healthy
Overall system status is Healthy
If any system health problems are found, they will be reflected in the health summary at the bottom of the “show health” output. Detailed information on the health checks performed appears above the summary.
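Because the summary always appears at the end of the output, a quick way to view just the summary from the system shell is to pipe the output through “tail” (a minimal sketch; adjust the line count as needed):
user@dgx-2:~$ sudo nvsm show health | tail -n 4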
Dump Health
The “dump health” command produces a health report file suitable for attaching to support tickets.
user@dgx-2:~$ sudo nvsm dump health
Example output:
Writing output to /tmp/nvsm-health-dgx-1-20180907085048.tar.xz
Done.
The file produced by “dump health” is a standard compressed tar archive, and its contents can be examined using the “tar” command as shown in the following example.
user@dgx-2:~$ cd /tmp
user@dgx-2:/tmp$ sudo tar xlf nvsm-health-dgx-1-20180907085048.tar.xz
user@dgx-2:/tmp$ sudo ls ./nvsm-health-dgx-1-20180907085048
date java nvsysinfo_commands sos_reports
df last nvsysinfo_log.txt sos_strings
dmidecode lib proc sys
etc lsb-release ps uname
free lsmod pstree uptime
hostname lsof route usr
initctl lspci run var
installed-debs mount sos_commands version.txt
ip_addr netstat sos_logs vgdisplay
Show Storage
NVSM CLI provides a “show storage” command to view all storage-related information. This command can be invoked from the command line as follows.
user@dgx-2:~$ sudo nvsm show storage
Alternatively, the “show drives” and “show volumes” NVSM commands will show the storage drives or storage volumes respectively.
user@dgx-2:~$ sudo nvsm show drives
...
user@dgx-2:~$ sudo nvsm show volumes
...
Within an NVSM CLI interactive session, the CLI targets related to storage are located under the /systems/localhost/storage target.
user@dgx2:~$ sudo nvsm
nvsm-> cd /systems/localhost/storage/
nvsm(/systems/localhost/storage/)-> show
Example output:
/systems/localhost/storage/
Properties:
DriveCount = 10
Volumes = [ md0, md1, nvme0n1p1, nvme1n1p1 ]
Targets:
alerts
drives
policy
volumes
Verbs:
cd
show
Show Storage Alerts
Storage alerts are generated when the DSHM monitoring daemon detects a storage-related problem and attempts to alert the user (via email or otherwise). Past storage alerts can be viewed within an NVSM CLI interactive session under the /systems/localhost/storage/alerts target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/storage/alerts
nvsm(/systems/localhost/storage/alerts)-> show
Example output:
/systems/localhost/storage/alerts
Targets:
alert0
alert1
Verbs:
cd
show
In this example listing, there appear to be two storage alerts associated with this system. The contents of these alerts can be viewed with the “show” command.
For example:
nvsm(/systems/localhost/storage/alerts)-> show alert1
/systems/localhost/storage/alerts/alert1
Properties:
system_name = dgx-2
message_details = EFI System Partition 1 is corrupted
nvme0n1p1
component_id = nvme0n1p1
description = Storage sub-system is reporting an error
event_time = 2018-07-14 12:51:19
recommended_action =
1. Please run nvsysinfo
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
alert_id = NV-VOL-03
system_serial = productserial
message = System entered degraded mode, storage sub-system is reporting an error
severity = Warning
Verbs:
cd
show
The message seen in this alert suggests a possible EFI partition corruption, which is an error condition that might adversely affect this system’s ability to boot. Note that the text seen here reflects the exact message that the user would have seen when this alert was generated.
Possible categories for storage alerts are given in the table below.
| Alert ID | Severity | Details |
|---|---|---|
| NV-DRIVE-01 | Critical | Drive missing |
| NV-DRIVE-02 | Warning | Media errors detected in drive |
| NV-DRIVE-03 | Warning | IO errors detected in drive |
| NV-DRIVE-04 | Critical | NVMe controller failure detected in drive |
| NV-DRIVE-05 | Warning | Available spare block percentage is below critical threshold of ten percent |
| NV-DRIVE-06 | Warning | NVM subsystem usage exceeded ninety percent |
| NV-DRIVE-07 | Warning | System has unsupported drive |
| NV-VOL-01 | Critical | RAID-0 corruption observed |
| NV-VOL-02 | Critical | RAID-1 corruption observed |
| NV-VOL-03 | Warning | EFI System Partition 1 corruption observed |
| NV-VOL-04 | Warning | EFI System Partition 2 corruption observed |
Show Storage Drives
Within an NVSM CLI interactive session, each storage drive on the system is represented by a target under the /systems/localhost/storage/drives target. A listing of drives can be obtained as follows.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/storage/drives
nvsm(/systems/localhost/storage/drives)-> show
Example output:
/systems/localhost/storage/drives
Targets:
nvme0n1
nvme1n1
nvme2n1
nvme3n1
nvme4n1
nvme5n1
nvme6n1
nvme7n1
nvme8n1
nvme9n1
Verbs:
cd
show
Details for any particular drive can be viewed with the “show” command.
For example:
nvsm(/systems/localhost/storage/drives)-> show nvme2n1
/systems/localhost/storage/drives/nvme2n1
Properties:
Capacity = 3840755982336
BlockSizeBytes = 7501476528
SerialNumber = 18141C244707
PartNumber = N/A
Model = Micron_9200_MTFDHAL3T8TCT
Revision = 100007C0
Manufacturer = Micron Technology Inc
Status_State = Enabled
Status_Health = OK
Name = Non-Volatile Memory Express
MediaType = SSD
IndicatorLED = N/A
EncryptionStatus = N/A
HotSpareType = N/A
Protocol = NVMe
NegotiatedSpeedsGbs = 0
Id = 2
Verbs:
cd
show
Show Storage Volumes
Within an NVSM CLI interactive session, each storage volume on the system is represented by a target under the /systems/localhost/storage/volumes target. A listing of volumes can be obtained as follows.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/storage/volumes
nvsm(/systems/localhost/storage/volumes)-> show
Example output:
/systems/localhost/storage/volumes
Targets:
md0
md1
nvme0n1p1
nvme1n1p1
Verbs:
cd
show
Details for any particular volume can be viewed with the “show” command.
For example:
nvsm(/systems/localhost/storage/volumes)-> show md0
/systems/localhost/storage/volumes/md0
Properties:
Status_State = Enabled
Status_Health = OK
Name = md0
Encrypted = False
VolumeType = RAID-1
Drives = [ nvme0n1, nvme1n1 ]
CapacityBytes = 893.6G
Id = md0
Verbs:
cd
show
Show GPUs
Information for all GPUs installed on the system can be viewed by invoking the “show gpus” command as follows.
user@dgx-2:~$ sudo nvsm show gpus
Within an NVSM CLI interactive session, the same information can be accessed under the /systems/localhost/gpus CLI target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/gpus
nvsm(/systems/localhost/gpus)-> show
Example output:
/systems/localhost/gpus
Targets:
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Verbs:
cd
show
Showing Individual GPUs
Details for any particular GPU can also be viewed with the “show” command.
For example:
nvsm(/systems/localhost/gpus)-> show GPU6
/systems/localhost/gpus/GPU6
Properties:
Inventory_ModelName = Tesla V100-SXM3-32GB
Inventory_UUID = GPU-4c653056-0d6e-df7d-19c0-4663d6745b97
Inventory_SerialNumber = 0332318503073
Inventory_PCIeDeviceId = 1DB810DE
Inventory_PCIeSubSystemId = 12AB10DE
Inventory_BrandName = Tesla
Inventory_PartNumber = 699-2G504-0200-000
Specifications_MaxPCIeGen = 3
Specifications_MaxPCIeLinkWidth = 16x
Specifications_MaxSpeeds_GraphicsClock = 1597 MHz
Specifications_MaxSpeeds_MemClock = 958 MHz
Specifications_MaxSpeeds_SMClock = 1597 MHz
Specifications_MaxSpeeds_VideoClock = 1432 MHz
Connections_PCIeGen = 3
Connections_PCIeLinkWidth = 16x
Connections_PCIeLocation = 00000000:34:00.0
Power_PowerDraw = 50.95 W
Stats_ErrorStats_ECCMode = Enabled
Stats_FrameBufferMemoryUsage_Free = 32510 MiB
Stats_FrameBufferMemoryUsage_Total = 32510 MiB
Stats_FrameBufferMemoryUsage_Used = 0 MiB
Stats_PCIeRxThroughput = 0 KB/s
Stats_PCIeTxThroughput = 0 KB/s
Stats_PerformanceState = P0
Stats_UtilDecoder = 0 %
Stats_UtilEncoder = 0 %
Stats_UtilGPU = 0 %
Stats_UtilMemory = 0 %
Status_Health = OK
Verbs:
cd
show
Identifying GPU Health Incidents
NVSM uses NVIDIA Data Center GPU Manager (DCGM) to continuously monitor GPU health, and reports GPU health issues as “GPU health incidents”. Whenever GPU health incidents are present, NVSM indicates this state in the “Status_HealthRollup” property of the /systems/localhost/gpus CLI target.
“Status_HealthRollup” captures the overall health of all GPUs in the system in a single value. When checking for GPU health incidents, check the “Status_HealthRollup” property before checking other properties.
To check for GPU health incidents, do the following:
Display the “Properties” section of GPU health.
~$ sudo nvsm
nvsm-> cd /systems/localhost/gpus
nvsm(/systems/localhost/gpus)-> show -display properties
A system with a GPU-related issue might report the following.
Properties:
Status_HealthRollup = Critical
Status_Health = OK
The “Status_Health = OK” property in this example indicates that NVSM did not find any system-level problems, such as missing drivers or incorrect device file permissions. The “Status_HealthRollup = Critical” property indicates that at least one GPU in this system is exhibiting a “Critical” health incident. To find this GPU, issue the following command to list the health status of each GPU.
~$ sudo nvsm
nvsm-> show -display properties=*health /systems/localhost/gpus/*
The GPU with the health incidents will be reported as in the following example for GPU14.
/systems/localhost/gpus/GPU14
Properties:
Status_Health = Critical
Issue the following command to show the detailed health information for a particular GPU (GPU14 in this example).
nvsm-> cd /systems/localhost/gpus
nvsm(/systems/localhost/gpus)-> show -level all GPU14/health
The output shows all the incidents involving that particular GPU.
/systems/localhost/gpus/GPU14/health
Properties:
Health = Critical
Targets:
incident0
Verbs:
cd
show

/systems/localhost/gpus/GPU14/health/incident0
Properties:
Message = GPU 14's NvLink link 2 is currently down.
Health = Critical
System = NVLink
Verbs:
cd
show
The output in this example narrows the scope down to a specific incident (or incidents) on a specific GPU. DCGM monitors a variety of GPU conditions, so check “Status_HealthRollup” using NVSM CLI to understand each incident.
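As with any NVSM CLI command, the per-GPU health query shown above can also be run non-interactively from the system shell; quoting the arguments prevents the shell from expanding the wildcards:
user@dgx-2:~$ sudo nvsm show -display "properties=*health" "/systems/localhost/gpus/*"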
Show Processors
Information for all CPUs installed on the system can be viewed using the “show processors” command.
user@dgx-2$ sudo nvsm show processors
From within an NVSM CLI interactive session, the same information is available under the /systems/localhost/processors target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/processors
nvsm(/systems/localhost/processors)-> show
Example output:
/systems/localhost/processors
Targets:
CPU0
CPU1
alerts
policy
Verbs:
cd
show
Details for any particular CPU can be viewed using the “show” command.
For example:
nvsm(/systems/localhost/processors)-> show CPU0
/systems/localhost/processors/CPU0
Properties:
Id = CPU0
InstructionSet = x86-64
Manufacturer = Intel(R) Corporation
MaxSpeedMHz = 3600
Model = Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Name = Central Processor
ProcessorArchitecture = x86
ProcessorId_EffectiveFamily = 6
ProcessorId_EffectiveModel = 85
ProcessorId_IdentificationRegisters = 0xBFEBFBFF00050654
ProcessorId_Step = 4
ProcessorId_VendorId = GenuineIntel
ProcessorType = CPU
Socket = CPU 0
Status_Health = OK
Status_State = Enabled
TotalCores = 24
TotalThreads = 48
Verbs:
cd
show
Show Processor Alerts
Processor alerts are generated when the DSHM monitoring daemon detects a CPU Internal Error (IERR) or Thermal Trip and attempts to alert the user (via email or otherwise). Past processor alerts can be viewed within an NVSM CLI interactive session under the /systems/localhost/processors/alerts target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/processors/alerts
nvsm(/systems/localhost/processors/alerts)-> show
Example output:
/systems/localhost/processors/alerts
Targets:
alert0
alert1
alert2
Verbs:
cd
show
This example listing appears to show three processor alerts associated with this system. The contents of these alerts can be viewed with the “show” command.
For example:
nvsm(/systems/localhost/processors/alerts)-> show alert2
/systems/localhost/processors/alerts/alert2
Properties:
system_name = xpl-bu-06
component_id = CPU0
description = CPU is reporting an error.
event_time = 2018-07-18T16:42:20.580050
recommended_action =
1. Please run nvsysinfo
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
severity = Critical
alert_id = NV-CPU-02
system_serial = To be filled by O.E.M.
message = System entered degraded mode, CPU0 is reporting an error.
message_details = CPU Thermtrip has occurred, processor socket temperature exceeded the thermal specifications of the component.
Verbs:
cd
show
Possible categories for processor alerts are given in the table below.
| Alert ID | Severity | Details |
|---|---|---|
| NV-CPU-01 | Critical | An unrecoverable CPU internal error has occurred. |
| NV-CPU-02 | Critical | CPU Thermtrip has occurred, processor socket temperature exceeded the thermal specifications of the component. |
Show Memory
Information for all system memory (i.e. all DIMMs installed near the CPU, not including GPU memory) can be viewed using the “show memory” command.
user@dgx-2:~$ sudo nvsm show memory
From within an NVSM CLI interactive session, system memory information is accessible under the /systems/localhost/memory target.
lab@xpl-dvt-42:~$ sudo nvsm
nvsm-> cd /systems/localhost/memory
nvsm(/systems/localhost/memory)-> show
Example output:
/systems/localhost/memory
Targets:
CPU0_DIMM_A1
CPU0_DIMM_A2
CPU0_DIMM_B1
CPU0_DIMM_B2
CPU0_DIMM_C1
CPU0_DIMM_C2
CPU0_DIMM_D1
CPU0_DIMM_D2
CPU0_DIMM_E1
CPU0_DIMM_E2
CPU0_DIMM_F1
CPU0_DIMM_F2
CPU1_DIMM_G1
CPU1_DIMM_G2
CPU1_DIMM_H1
CPU1_DIMM_H2
CPU1_DIMM_I1
CPU1_DIMM_I2
CPU1_DIMM_J1
CPU1_DIMM_J2
CPU1_DIMM_K1
CPU1_DIMM_K2
CPU1_DIMM_L1
CPU1_DIMM_L2
alerts
policy
Verbs:
cd
show
Details for any particular memory DIMM can be viewed using the “show” command.
For example:
nvsm(/systems/localhost/memory)-> show CPU2_DIMM_B1
/systems/localhost/memory/CPU2_DIMM_B1
Properties:
CapacityMiB = 65536
DataWidthBits = 64
Description = DIMM DDR4 Synchronous
Id = CPU2_DIMM_B1
Name = Memory Instance
OperatingSpeedMhz = 2666
PartNumber = 72ASS8G72LZ-2G6B2
SerialNumber = 1CD83000
Status_Health = OK
Status_State = Enabled
VendorId = Micron
Verbs:
cd
show
Show Memory Alerts
On DGX systems with a Baseboard Management Controller (BMC), the BMC will monitor DIMMs for correctable and uncorrectable errors. Whenever memory error counts cross a certain threshold (as determined by SBIOS), a memory alert is generated by the DSHM daemon in an attempt to notify the user (via email or otherwise).
Past memory alerts are accessible from an NVSM CLI interactive session under the /systems/localhost/memory/alerts target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/memory/alerts
nvsm(/systems/localhost/memory/alerts)-> show
Example output:
/systems/localhost/memory/alerts
Targets:
alert0
Verbs:
cd
show
This example listing appears to show one memory alert associated with this system. The contents of this alert can be viewed with the “show” command.
For example:
nvsm(/systems/localhost/memory/alerts)-> show alert0
/systems/localhost/memory/alerts/alert0
Properties:
system_name = xpl-bu-06
component_id = CPU1_DIMM_A2
description = DIMM is reporting an error.
event_time = 2018-07-18T16:48:09.906572
recommended_action =
1. Please run nvsysinfo
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
severity = Critical
alert_id = NV-DIMM-01
system_serial = To be filled by O.E.M.
message = System entered degraded mode, CPU1_DIMM_A2 is reporting an error.
message_details = Uncorrectable error is reported.
Verbs:
cd
show
Possible categories for memory alerts are given in the table below.
| Alert ID | Severity | Details |
|---|---|---|
| NV-DIMM-01 | Critical | Uncorrectable error is reported. |
Show Fans and Temperature
NVSM CLI provides a “show fans” command to display information for each fan on the system.
~$ sudo nvsm show fans
Likewise, NVSM CLI provides a “show temperatures” command to display temperature information for each temperature sensor known to NVSM.
~$ sudo nvsm show temperatures
Within an NVSM CLI interactive session, targets related to fans and temperature are located under the /chassis/localhost/thermal target.
~$ sudo nvsm
nvsm-> cd /chassis/localhost/thermal
nvsm(/chassis/localhost/thermal)-> show
Example output:
/chassis/localhost/thermal
Targets:
alerts
fans
policy
temperatures
Verbs:
cd
show
Show Thermal Alerts
The DSHM daemon monitors fan speed and temperature sensors. When the values of these sensors violate certain threshold criteria, DSHM generates a thermal alert in an attempt to notify the user (via email or otherwise).
Past thermal alerts can be viewed in an NVSM CLI interactive session under the /chassis/localhost/thermal/alerts target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/thermal/alerts
nvsm(/chassis/localhost/thermal/alerts)-> show
Example output:
/chassis/localhost/thermal/alerts
Targets:
alert0
Verbs:
cd
show
This example listing appears to show one thermal alert associated with this system. The contents of this alert can be viewed with the “show” command.
For example:
nvsm(/chassis/localhost/thermal/alerts)-> show alert0
/chassis/localhost/thermal/alerts/alert0
Properties:
system_name = system-name
component_id = FAN1_R
description = Fan Module is reporting an error.
event_time = 2018-07-12T15:12:22.076814
recommended_action =
1. Please run nvsysinfo
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
severity = Critical
alert_id = NV-FAN-01
system_serial = To be filled by O.E.M.
message = System entered degraded mode, FAN1_R is reporting an error.
message_details = Fan speed reading has fallen below the expected speed setting.
Verbs:
cd
show
From the message in this alert, it appears that one of the rear fans is broken in this system. This is the exact message that the user would have received at the time this alert was generated, assuming alert notifications were enabled.
Possible categories for thermal-related (fan and temperature) alerts are given in the table below.
| Alert ID | Severity | Details |
|---|---|---|
| NV-FAN-01 | Critical | Fan speed reading has fallen below the expected speed setting. |
| NV-FAN-02 | Critical | Fan readings are inaccessible. |
| NV-PDB-01 | Critical | Operating temperature exceeds the thermal specifications of the component. |
Show Fans
Within an NVSM CLI interactive session, each fan on the system is represented by a target under the /chassis/localhost/thermal/fans target. The “show” command can be used to obtain a listing of fans on the system.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/thermal/fans
nvsm(/chassis/localhost/thermal/fans)-> show
Example output:
/chassis/localhost/thermal/fans
Targets:
FAN10_F
FAN10_R
FAN1_F
FAN1_R
FAN2_F
FAN2_R
FAN3_F
FAN3_R
FAN4_F
FAN4_R
FAN5_F
FAN5_R
FAN6_F
FAN6_R
FAN7_F
FAN7_R
FAN8_F
FAN8_R
FAN9_F
FAN9_R
PDB_FAN1
PDB_FAN2
PDB_FAN3
PDB_FAN4
Verbs:
cd
show
Again using the “show” command, the details for any given fan can be obtained as follows.
For example:
nvsm(/chassis/localhost/thermal/fans)-> show PDB_FAN2
/chassis/localhost/thermal/fans/PDB_FAN2
Properties:
Status_State = Enabled
Status_Health = OK
Name = PDB_FAN2
MemberId = 21
ReadingUnits = RPM
LowerThresholdNonCritical = 11900.000
Reading = 13804 RPM
LowerThresholdCritical = 10744.000
Verbs:
cd
show
Show Temperatures
Each temperature sensor known to NVSM is represented as a target under the /chassis/localhost/thermal/temperatures target. A listing of temperature sensors on the system can be obtained using the following commands.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/thermal/temperatures
nvsm(/chassis/localhost/thermal/temperatures)-> show
Example output:
/chassis/localhost/thermal/temperatures
Targets:
PDB1
PDB2
Verbs:
cd
show
As with fans, the details for any temperature sensor can be viewed with the “show” command.
For example:
nvsm(/chassis/localhost/thermal/temperatures)-> show PDB2
/chassis/localhost/thermal/temperatures/PDB2
Properties:
Status_State = Enabled
Status_Health = OK
Name = PDB2
PhysicalContext = PDB
MemberId = 1
ReadingCelsius = 20 degrees C
UpperThresholdNonCritical = 127.000
SensorNumber = 66h
UpperThresholdCritical = 127.000
Verbs:
cd
show
Show Power Supplies
NVSM CLI provides a “show power” command to display information for all power supplies present on the system.
user@dgx-2:~$ sudo nvsm show power
From an NVSM CLI interactive session, power supply information can be found under the /chassis/localhost/power target.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/power
nvsm(/chassis/localhost/power)-> show
Example output:
/chassis/localhost/power
Targets:
PSU1
PSU2
PSU3
PSU4
PSU5
PSU6
alerts
policy
Verbs:
cd
show
Details for any particular power supply can be viewed using the “show” command as follows.
For example:
nvsm(/chassis/localhost/power)-> show PSU4
/chassis/localhost/power/PSU4
Properties:
Status_State = Present
Status_Health = OK
LastPowerOutputWatts = 442
Name = PSU4
SerialNumber = DTHTCD18240
MemberId = 3
PowerSupplyType = AC
Model = ECD16010081
Manufacturer = Delta
Verbs:
cd
show
Show Power Alerts
The DSHM daemon monitors PSU status. When the PSU status is not Ok, DSHM generates a power alert in an attempt to notify the user (via email or otherwise).
Prior power alerts can be viewed under the /chassis/localhost/power/alerts target of an NVSM CLI interactive session.
user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/power/alerts
nvsm(/chassis/localhost/power/alerts)-> show
Example output:
/chassis/localhost/power/alerts
Targets:
alert0
alert1
alert2
alert3
alert4
Verbs:
cd
show
This example listing shows a system with five prior power alerts. The details for any one of these alerts can be viewed using the “show” command.
For example:
nvsm(/chassis/localhost/power/alerts)-> show alert4
/chassis/localhost/power/alerts/alert4
Properties:
system_name = system-name
component_id = PSU4
description = PSU is reporting an error.
event_time = 2018-07-18T16:01:27.462005
recommended_action =
1. Please run nvsysinfo
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
severity = Warning
alert_id = NV-PSU-05
system_serial = To be filled by O.E.M.
message = System entered degraded mode, PSU4 is reporting an error.
message_details = PSU is missing
Verbs:
cd
show
Possible categories for power alerts are given in the table below.
| Alert ID | Severity | Details |
|---|---|---|
| NV-PSU-01 | Critical | Power supply module has failed. |
| NV-PSU-02 | Warning | Detected predictive failure of the power supply module. |
| NV-PSU-03 | Critical | Input to the power supply module is missing. |
| NV-PSU-04 | Critical | Input voltage is out of range for the power supply module. |
| NV-PSU-05 | Warning | PSU is missing. |
Show Network Adapters
NVSM CLI provides a “show networkadapters” command to display information for each physical network adapter in the chassis.
~$ sudo nvsm show networkadapters
Within an NVSM CLI interactive session, targets related to network adapters are located under the /chassis/localhost/NetworkAdapters target.
~$ sudo nvsm
nvsm-> cd /chassis/localhost/NetworkAdapters
nvsm(/chassis/localhost/NetworkAdapters)-> show
Show Network Ports
NVSM CLI provides a “show networkports” command to display information for each physical network port in the chassis.
~$ sudo nvsm show networkports
Within an NVSM CLI interactive session, targets related to network ports are located under the /chassis/localhost/NetworkAdapters/<id>/NetworkPorts target, where <id> is one of the network adapter IDs displayed by the nvsm show networkadapters command.
~$ sudo nvsm
nvsm-> cd /chassis/localhost/NetworkAdapters/<id>/NetworkPorts
nvsm(/chassis/localhost/NetworkAdapters/<id>/NetworkPorts)-> show
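For example, for an InfiniBand adapter whose ID is IB0 (the adapter IDs on your system may differ):
~$ sudo nvsm
nvsm-> cd /chassis/localhost/NetworkAdapters/IB0/NetworkPorts
nvsm(/chassis/localhost/NetworkAdapters/IB0/NetworkPorts)-> show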
Show Network Device Functions
NVSM CLI provides a “show networkdevicefunctions” command to display information for each network adapter-centric PCIe function in the chassis.
~$ sudo nvsm show networkdevicefunctions
Within an NVSM CLI interactive session, targets related to network device functions are located under the /chassis/localhost/NetworkAdapters/<id>/NetworkDeviceFunctions target, where <id> is one of the network adapter IDs displayed by the nvsm show networkadapters command.
~$ sudo nvsm
nvsm-> cd /chassis/localhost/NetworkAdapters/<id>/NetworkDeviceFunctions
nvsm(/chassis/localhost/NetworkAdapters/<id>/NetworkDeviceFunctions)-> show
Show Network Interfaces
NVSM CLI provides a “show networkinterfaces” command to display information for each logical network adapter on the system.
~$ sudo nvsm show networkinterfaces
Within an NVSM CLI interactive session, targets related to network interfaces are located under the /systems/localhost/NetworkInterfaces target.
~$ sudo nvsm
nvsm-> cd /systems/localhost/NetworkInterfaces
nvsm(/systems/localhost/NetworkInterfaces)-> show
System Monitoring Configuration
NVSM provides a DSHM service that monitors the state of the DGX system.
NVSM CLI can be used to interact with the DSHM system monitoring service via the NVSM API server.
Configuring Email Alerts
To receive the alerts generated by DSHM through email, configure the email settings in the global policy using NVSM CLI. An email will then be sent to the configured recipients whenever a new alert is generated. The sender address, recipient address(es), SMTP server name, and SMTP server port number must be configured to match the SMTP server settings hosted by the user.
Email configuration properties
| Property | Description |
|---|---|
| email_sender | Sender email address. Must be a valid email address, otherwise no emails will be sent. |
| email_recipients | List of recipients to which the email will be sent [ user1@domain.com,user2@domain.com ] |
| email_smtp_server_name | SMTP server name to use for relaying email [ smtp.domain.com ] |
| email_smtp_server_port | Port number used by the SMTP server for providing SMTP relay service. Numeric value. |
The following examples illustrate how to configure email settings in global policy using NVSM CLI.
user@dgx-2:~$ sudo nvsm set /policy email_sender=dgx-admin@nvidia.com
user@dgx-2:~$ sudo nvsm set /policy email_smtp_server_name=smtpserver.nvidia.com
user@dgx-2:~$ sudo nvsm set /policy email_recipients=jdoe@nvidia.com,jdeer@nvidia.com
user@dgx-2:~$ sudo nvsm set /policy email_smtp_server_port=465
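After setting these properties, you can confirm the configuration by displaying the global policy target (see Global Monitoring Policy below):
user@dgx-2:~$ sudo nvsm show /policy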
Understanding System Monitoring Policies
From within an NVSM CLI interactive session, system monitor policy settings are accessible under the following targets.
| CLI Target | Description |
|---|---|
| /policy | Global NVSM monitoring policy, such as email settings for alert notifications |
| /systems/localhost/gpus/policy | NVSM policy for monitoring GPUs |
| /systems/localhost/memory/policy | NVSM policy for monitoring DIMM correctable and uncorrectable errors |
| /systems/localhost/processors/policy | NVSM policy for monitoring CPU machine-check exceptions (MCE) |
| /systems/localhost/storage/policy | NVSM policy for monitoring storage drives and volumes |
| /systems/localhost/pcie/policy | NVSM policy for monitoring PCIe devices |
| /chassis/policy | Chassis-level monitoring policy |
| /chassis/localhost/thermal/policy | NVSM policy for monitoring fan speed and temperature as reported by the baseboard management controller (BMC) |
| /chassis/localhost/power/policy | NVSM policy for monitoring power supply voltages as reported by the BMC |
| /chassis/localhost/NetworkAdapters/policy | NVSM policy for monitoring the physical network adapters |
| /chassis/localhost/NetworkAdapters/<ETHx>/NetworkPorts/policy | NVSM policy for monitoring the network ports of the specified Ethernet network adapter |
| /chassis/localhost/NetworkAdapters/<IBy>/NetworkPorts/policy | NVSM policy for monitoring the network ports of the specified InfiniBand network adapter |
| /chassis/localhost/NetworkAdapters/<ETHx>/NetworkDeviceFunctions/policy | NVSM policy for monitoring the PCIe functions of the specified Ethernet network adapter |
| /chassis/localhost/NetworkAdapters/<IBy>/NetworkDeviceFunctions/policy | NVSM policy for monitoring the PCIe functions of the specified InfiniBand network adapter |
Global Monitoring Policy
Global monitoring policy is represented by the /policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /policy
Example output:
/policy
Properties:
email_sender = NVIDIA DSHM Service
email_smtp_server_name = smtp.example.com
email_recipients = jdoe@nvidia.com,jdeer@nvidia.com
email_smtp_server_port = 465
Verbs:
cd
set
show
The properties for global monitoring policy are described in the table below.
| Property | Description |
|---|---|
| email_sender | Sender email address |
| email_recipients | List of recipients to which the email will be sent [ user1@domain.com,user2@domain.com ] |
| email_smtp_server_name | SMTP server name to use for relaying email [ smtp.domain.com ] |
| email_smtp_server_port | Port number used by the SMTP server for providing SMTP relay service. Numeric value. |
Memory Monitoring Policy
Memory monitoring policy is represented by the /systems/localhost/memory/policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /systems/localhost/memory/policy
Example output:
/systems/localhost/memory/policy
Properties:
mute_notification =
mute_monitoring =
Verbs:
cd
set
show
The properties for memory monitoring policy are described in the table below.
| Property | Syntax | Description |
|---|---|---|
| mute_notification | List of comma-separated DIMM IDs. Example: CPU1_DIMM_A1,CPU2_DIMM_F2 | Email alert notification is suppressed for devices in the list. |
| mute_monitoring | List of comma-separated DIMM IDs. Example: CPU1_DIMM_A1,CPU2_DIMM_F2 | Health monitoring is suppressed for devices in the list. |
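Policy properties are modified with the “set” verb, in the same way as the email settings shown earlier. For example, to suppress health monitoring for a single DIMM, using the CPU1_DIMM_A1 ID from the syntax examples above (substitute a DIMM ID from your own system):
user@dgx-2:~$ sudo nvsm set /systems/localhost/memory/policy mute_monitoring=CPU1_DIMM_A1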
Processor Monitoring Policy
Processor monitoring policy is represented by the /systems/localhost/processors/policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /systems/localhost/processors/policy
Example output:
/systems/localhost/processors/policy
Properties:
mute_notification =
mute_monitoring =
Verbs:
cd
set
show
The properties for processor monitoring policy are described in the table below.
| Property | Syntax | Description |
|---|---|---|
| mute_notification | List of comma-separated CPU IDs. Example: CPU0,CPU1 | Email alert notification is suppressed for devices in the list. |
| mute_monitoring | List of comma-separated CPU IDs. Example: CPU0,CPU1 | Health monitoring is suppressed for devices in the list. |
Storage Monitoring Policy
Storage monitoring policy is represented by the /systems/localhost/storage/policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /systems/localhost/storage/policy
Example output:
/systems/localhost/storage/policy
Properties:
volume_mute_monitoring =
volume_poll_interval = 10
drive_mute_monitoring =
drive_mute_notification =
drive_poll_interval = 10
volume_mute_notification =
Verbs:
cd
set
show
The properties for storage monitoring policy are described in the table below.
| Property | Syntax | Description |
|---|---|---|
| drive_mute_notification | List of comma-separated drive slots. Example: 0,1 | Email alert notification is suppressed for drives in the list. |
| drive_mute_monitoring | List of comma-separated drive slots. Example: 0,1 | Health monitoring is suppressed for drives in the list. |
| drive_poll_interval | Positive integer | DSHM checks the health of the drives periodically. By default, this polling occurs every 10 seconds. The poll interval can be configured through this property. |
| volume_mute_notification | List of comma-separated volume identifiers. Example: md0,md1 | Email alert notification is suppressed for volumes in the list. |
| volume_mute_monitoring | List of comma-separated volume identifiers. Example: md0,md1 | Health monitoring is suppressed for volumes in the list. |
| volume_poll_interval | Positive integer | DSHM checks the health of the volumes periodically. By default, this polling occurs every 10 seconds. The poll interval can be configured through this property. |
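The poll intervals can likewise be adjusted with the “set” verb. For example, to poll drive health every 60 seconds instead of the default 10 (60 is an arbitrary illustrative value):
user@dgx-2:~$ sudo nvsm set /systems/localhost/storage/policy drive_poll_interval=60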
Thermal Monitoring Policy
Thermal monitoring policy (for fan speed and temperature) is represented by the /chassis/localhost/thermal/policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /chassis/localhost/thermal/policy
Example output:
/chassis/localhost/thermal/policy
Properties:
fan_mute_notification =
pdb_mute_monitoring =
fan_mute_monitoring =
pdb_mute_notification =
Verbs:
cd
set
show
The properties for thermal monitoring policy are described in the table below.
| Property | Syntax | Description |
|---|---|---|
| fan_mute_notification | List of comma-separated fan IDs. Example: FAN2_R,FAN1_L,PDB_FAN2 | Email alert notification is suppressed for devices in the list. |
| fan_mute_monitoring | List of comma-separated fan IDs. Example: FAN6_F,PDB_FAN1 | Health monitoring is suppressed for devices in the list. |
| pdb_mute_notification | List of comma-separated PDB IDs. Example: PDB1,PDB2 | Email alert notification is suppressed for devices in the list. |
| pdb_mute_monitoring | List of comma-separated PDB IDs. Example: PDB1 | Health monitoring is suppressed for devices in the list. |
Power Monitoring Policy
Power monitoring policy is represented by the /chassis/localhost/power/policy target of NVSM CLI.
user@dgx-2:~$ sudo nvsm show /chassis/localhost/power/policy
Example output:
/chassis/localhost/power/policy
Properties:
mute_notification =
mute_monitoring =
Verbs:
cd
set
show
The properties for power monitoring policy are described in the table below.
| Property | Syntax | Description |
|---|---|---|
| mute_notification | List of comma-separated PSU IDs. Example: PSU4,PSU2 | Email alert notification is suppressed for devices in the list. |
| mute_monitoring | List of comma-separated PSU IDs. Example: PSU1,PSU4 | Health monitoring is suppressed for devices in the list. |
PCIe Monitoring Policy
PCIe monitoring policy is represented by the /systems/localhost/pcie/policy target of NVSM CLI.
:~$ sudo nvsm show /systems/localhost/pcie/policy
Example output:
/systems/localhost/pcie/policy
Properties:
mute_notification =
mute_monitoring =
Verbs:
cd
set
show
The properties for PCIe monitoring policy are described in the table below.

| Property | Syntax | Description |
|---|---|---|
| mute_notification | List of comma-separated PCIe IDs | Email alert notification is suppressed for devices in the list. |
| mute_monitoring | List of comma-separated PCIe IDs | Health monitoring is suppressed for devices in the list. |
GPU Monitoring Policy
GPU monitoring policy is represented by the /systems/localhost/gpus/policy target of NVSM CLI.
:~$ sudo nvsm show /systems/localhost/gpus/policy
Example output:
/systems/localhost/gpus/policy
Properties:
mute_notification =
mute_monitoring =
Verbs:
cd
set
show
The properties for GPU monitoring policy are described in the table below.

| Property | Syntax | Description |
|---|---|---|
| mute_notification | List of comma-separated GPU IDs | Email alert notification is suppressed for devices in the list. |
| mute_monitoring | List of comma-separated GPU IDs | Health monitoring is suppressed for devices in the list. |
Network Adapter Monitoring Policies
Network Adapter Policy
The physical network adapter monitoring policy is represented by the /chassis/localhost/NetworkAdapters/policy target of NVSM CLI.
:~$ sudo nvsm show /chassis/localhost/NetworkAdapters/policy
Example output:
/chassis/localhost/NetworkAdapters/policy
Properties:
mute_notification =
mute_monitoring =
Verbs:
cd
set
show
The properties are described in the following table.
| Property | Syntax | Description |
|---|---|---|
| mute_notification | List of comma-separated physical network adapter IDs | Email alert notification is suppressed for devices in the list. |
| mute_monitoring | List of comma-separated physical network adapter IDs | Health monitoring is suppressed for devices in the list. |
Network Ports Policy
The physical network port monitoring policy is represented by the /chassis/localhost/NetworkAdapters/<network-id>/NetworkPorts/policy target of NVSM CLI.
The following example uses the network adapter ID IB0 to demonstrate this command.
:~$ sudo nvsm show /chassis/localhost/NetworkAdapters/IB0/NetworkPorts/policy
Example output:
/chassis/localhost/NetworkAdapters/IB0/NetworkPorts/policy
Properties:
mute_notification =
mute_monitoring =
Verbs:
cd
set
show
The properties are described in the following table.
| Property | Syntax | Description |
|---|---|---|
| mute_notification | List of comma-separated physical network port IDs | Email alert notification is suppressed for devices in the list. |
| mute_monitoring | List of comma-separated physical network port IDs | Health monitoring is suppressed for devices in the list. |
Network Device Functions Policy
The network device functions monitoring policy is represented by the /chassis/localhost/NetworkAdapters/<network-id>/NetworkDeviceFunctions/policy target of NVSM CLI.
The following example uses the network adapter ID IB0 to demonstrate this command.
:~$ sudo nvsm show /chassis/localhost/NetworkAdapters/IB0/NetworkDeviceFunctions/policy
Example output:
/chassis/localhost/NetworkAdapters/IB0/NetworkDeviceFunctions/policy
Properties:
mute_monitoring =
mute_notification =
rx_collision_threshold = 5
rx_crc_threshold = 5
tx_collision_threshold = 5
Verbs:
cd
set
show
The properties are described in the following table.
| Property | Syntax | Description |
|---|---|---|
| mute_notification | List of comma-separated network-centric PCIe function IDs | Email alert notification is suppressed for devices in the list. |
| mute_monitoring | List of comma-separated network-centric PCIe function IDs | Health monitoring is suppressed for devices in the list. |
| rx_collision_threshold | Positive integer | Threshold for receive collision errors. |
| rx_crc_threshold | Positive integer | Threshold for receive CRC errors. |
| tx_collision_threshold | Positive integer | Threshold for transmit collision errors. |
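As with the other policy targets, these properties can be modified with the “set” verb. For example, to change the receive CRC error threshold on the IB0 adapter used above (the value 10 is arbitrary):
:~$ sudo nvsm set /chassis/localhost/NetworkAdapters/IB0/NetworkDeviceFunctions/policy rx_crc_threshold=10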
Performing System Management Tasks
This section describes commands for accomplishing some system management tasks.
Rebuilding a RAID 1 Array
For DGX systems with two NVMe OS drives configured as a RAID 1 array, the operating system is installed on volume md0. You can use NVSM CLI to view the health of the RAID volume and then rebuild the RAID array on two healthy drives.
Viewing a Healthy RAID Volume
On a healthy system, this volume appears with two drives and “Status_Health = OK”. For example:
nvsm-> cd /systems/localhost/storage
nvsm(/systems/localhost/storage)-> show volumes/md0
/systems/localhost/storage/volumes/md0
Properties:
Status_State = Enabled
Status_Health = OK
Name = md0
Encrypted = False
VolumeType = RAID-1
Drives = [ nvme0n1, nvme1n1 ]
CapacityBytes = 893.6G
Id = md0
Targets:
rebuild
Verbs:
cd
show
Viewing a Degraded RAID Volume
On a system with a degraded OS volume, the md0 volume will appear with only one drive, and with “Status_Health = Warning” and “Status_State = Degraded” reported as follows.
nvsm-> cd /systems/localhost/storage
nvsm(/systems/localhost/storage)-> show volumes/md0
/systems/localhost/storage/volumes/md0
Properties:
Status_State = Degraded
Status_Health = Warning
Name = md0
Encrypted = False
VolumeType = RAID-1
Drives = [ nvme1n1 ]
CapacityBytes = 893.6G
Id = md0
Targets:
rebuild
Verbs:
cd
show
In this situation, the OS volume is missing its mirror drive.
Rebuilding the RAID 1 Volume
To rebuild the RAID 1 array, make sure that you have installed a known good NVMe drive as the replacement mirror drive.
The RAID rebuilding process should begin automatically upon turning on the system. If it does not start automatically, use NVSM CLI to manually rebuild the array as follows.
Start an NVSM CLI interactive session and switch to the storage target.
$ sudo nvsm
nvsm-> cd /systems/localhost/storage
Start the rebuilding process and be ready to enter the device name of the replaced drive.
nvsm(/systems/localhost/storage)-> start volumes/md0/rebuild
PROMPT: In order to rebuild this volume, a spare drive is required. Please specify the spare drive to use to rebuild md0.
Name of spare drive for md0 rebuild (CTRL-C to cancel): nvmeXn1
WARNING: Once the volume rebuild process is started, the process cannot be stopped.
Start RAID-1 rebuild on md0? [y/n] y
After entering y at the prompt to start the RAID 1 rebuild, the “Initiating rebuild …” message appears.
/systems/localhost/storage/volumes/md0/rebuild started at 2018-10-12 15:27:26.525187
Initiating RAID-1 rebuild on volume md0...
0.0% [\ ]
After about 30 seconds, the “Rebuilding RAID-1 …” message should appear.
/systems/localhost/storage/volumes/md0/rebuild started at 2018-10-12 15:27:26.525187
Rebuilding RAID-1 rebuild on volume md0...
31.0% [=============/ ]
If this message remains at “Initiating RAID-1 rebuild” for more than 30 seconds, then there is a problem with the rebuild process. In this case, make sure the name of the replacement drive is correct and try again.
The RAID 1 rebuild process should take about 1 hour to complete.
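Once the rebuild completes, you can confirm that the volume is healthy again by repeating the “show volumes/md0” query shown earlier; the output should again list two drives and report “Status_Health = OK”.
nvsm(/systems/localhost/storage)-> show volumes/md0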
For more detailed information on replacing a failed NVMe OS drive, see the NVIDIA DGX-2 Service Manual.
Setting MaxQ/MaxP on DGX-2 Systems
Beginning with DGX OS 4.0.5, you can set one of two GPU performance modes: MaxQ or MaxP.
Note
Support on DGX-2 systems requires BMC firmware version 1.04.03 or later. MaxQ/MaxP is not supported on DGX-2H systems.
MaxQ
Maximum efficiency mode
Allows two DGX-2 systems to be installed in racks that have a power budget of 18 kW.
Switch to MaxQ mode as follows.
$ sudo nvsm set powermode=maxq
The settings are preserved across reboots.
MaxP
Default mode for maximum performance
GPUs operate unconstrained up to the thermal design power (TDP) level.
In this setting, the maximum DGX-2 power consumption is 10 kW.
When only 3 or 4 PSUs are working, performance is reduced but still better than in MaxQ mode.
If you switch to MaxQ mode, you can switch back to MaxP mode as follows:
$ sudo nvsm set powermode=maxp
The settings are preserved across reboots.
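In either case, you can confirm the active mode with the “show powermode” command listed in the basic commands table.
$ sudo nvsm show powermode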
Configuring Support for Custom Drive Partitioning
DGX systems incorporate data drives configured as either RAID 0 or RAID 5 arrays, depending on the product. You can alter the default configuration by adding or removing drives, or by switching between a RAID 0 configuration and a RAID 5 configuration. If you alter the default configuration, you must let NVSM know so that the utility does not flag the configuration as an error, and so that NVSM can continue to monitor the health of the drives.
To configure NVSM to support a custom drive partitioning, perform the following.
1. Stop NVSM services.
$ systemctl stop nvsm
2. Edit /etc/nvsm/nvsm.config and set the "use_standard_config_storage" parameter to false:
"use_standard_config_storage":false
3. Remove the NVSM database.
$ sudo rm /var/lib/nvsm/sqlite/nvsm.db
4. Restart NVSM.
$ systemctl restart nvsm
Remember to set the parameter back to true if you restore the drive partitioning to the default configuration.