Using the NVSM CLI

NVIDIA DGX-2 servers running DGX OS version 4.0.1 or later should come with NVSM pre-installed.

NVSM CLI communicates with the privileged NVSM API server, so NVSM CLI requires superuser privileges to run. All examples given in this guide are prefixed with the “sudo” command.

Using the NVSM CLI Interactively

Starting an interactive session

The command “sudo nvsm” will start an NVSM CLI interactive session.

user@dgx-2:~$ sudo nvsm
[sudo] password for user:
nvsm->

Once at the “nvsm-> ” prompt, the user can enter NVSM CLI commands to view and manage the DGX system.

Example command

One such command is “show fans”, which prints the state of all fans known to NVSM.

nvsm-> show fans
/chassis/localhost/thermal/fans/FAN10_F
Properties:
    Status_State = Enabled
    Status_Health = OK
    Name = FAN10_F
    MemberId = 19
    ReadingUnits = RPM
    LowerThresholdNonCritical = 5046.000
    Reading = 9802 RPM
    LowerThresholdCritical = 3596.000
    ...
/chassis/localhost/thermal/fans/PDB_FAN4
Properties:
    Status_State = Enabled
    Status_Health = OK
    Name = PDB_FAN4
    MemberId = 23
    ReadingUnits = RPM
    LowerThresholdNonCritical = 11900.000
    Reading = 14076 RPM
    LowerThresholdCritical = 10744.000
nvsm->

Leaving an interactive session

To leave the NVSM CLI interactive session, use the “exit” command.

nvsm-> exit
user@dgx2:~$

Using the NVSM CLI Non-Interactively

Any NVSM CLI command can be invoked from the system shell, without starting an NVSM CLI interactive session. To do this, simply append the desired NVSM CLI command to the “sudo nvsm” command. The “show fans” command given above can be invoked directly from the system shell as follows.

user@dgx2:~$ sudo nvsm show fans
/chassis/localhost/thermal/fans/FAN10_F
Properties:
    Status_State = Enabled
    Status_Health = OK
    Name = FAN10_F
    MemberId = 19
    ReadingUnits = RPM
    LowerThresholdNonCritical = 5046.000
    Reading = 9802 RPM
    LowerThresholdCritical = 3596.000
...
/chassis/localhost/thermal/fans/PDB_FAN4
Properties:
    Status_State = Enabled
    Status_Health = OK
    Name = PDB_FAN4
    MemberId = 23
    ReadingUnits = RPM
    LowerThresholdNonCritical = 11900.000
    Reading = 14076 RPM
    LowerThresholdCritical = 10744.000
user@dgx2:~$

The output of some NVSM commands can be too large to fit on one screen, so it is sometimes useful to pipe this output to a paging utility such as “less”.

user@dgx2:~$ sudo nvsm show fans | less

Throughout this chapter, examples are given for both interactive and non-interactive NVSM CLI use cases. Note that these interactive and non-interactive examples are interchangeable.

Getting Help

Apart from the NVSM CLI User Guide (this document), there are many sources for finding additional help for NVSM CLI and the related NVSM tools.

nvsm “man” Page

A man page for NVSM CLI is included on DGX systems with NVSM installed. The user can view this man page by invoking the “man nvsm” command.

user@dgx2:~$ man nvsm

nvsm --help Flag

By passing the “--help” flag, the nvsm command itself will print a short description of the command line arguments it recognizes. These arguments affect the behavior of the NVSM CLI interactive session, such as colorization of output or the logging level.

user@dgx2:~$ nvsm --help
usage: nvsm [-h] [--color WHEN] [-i] [--] [<command>...]
NVIDIA System Management interface
optional arguments:
  -h, --help            show this help message and exit
  --color WHEN          Control colorization of output. Possible
                        values for WHEN are "always", "never", or
                        "auto". Default value is "auto".
  -i, --interactive     When this option is given, run in
                        interactive mode. The default is
                        automatic.
  --log-level {debug,info,warning,error,critical}
                        Set the output logging level. Default is
                        'warning'.
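
These flags can be combined with any non-interactive NVSM command. For example, the following invocation (using only the flags listed in the help output above, plus ordinary shell redirection) disables colorization while saving a health report to a file:

user@dgx2:~$ sudo nvsm --color never show health > health.txt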

Help for NVSM CLI Commands

Each NVSM command within the NVSM CLI interactive session, such as show, set, and exit, recognizes a “-help” flag that describes the NVSM command and its arguments.

user@dgx2:~$ sudo nvsm
nvsm-> exit -help
usage: exit [-help]

Leave the NVSM shell.

optional arguments:
  -help, -h  show this help message and exit

Examining System Health

The most basic functionality of NVSM CLI is examination of system state. NVSM CLI provides a “show” command for this purpose.

Because NVSM CLI is modeled after the SMASH CLP, the output of the NVSM CLI “show” command should be familiar to users of BMC command line interfaces.

List of Basic Commands

The following list gives the basic commands (primarily “show”). Detailed use of these commands is explained in subsequent sections of the document.

Note

On DGX Station, only the following commands are supported.

  • nvsm show health

  • nvsm dump health

Global Commands

  • $ sudo nvsm show alerts
    Displays warnings and critical alerts for all subsystems.

  • $ sudo nvsm show policy
    Displays alert policies for subsystems.

  • $ sudo nvsm show versions
    Displays system version properties.

Health Commands

  • $ sudo nvsm show health
    Displays overall system health.

  • $ sudo nvsm dump health
    Generates a health report file.

Storage Commands

  • $ sudo nvsm show storage
    Displays all storage-related information.

  • $ sudo nvsm show drives
    Displays the storage drives.

  • $ sudo nvsm show volumes
    Displays the storage volumes.

GPU Commands

  • $ sudo nvsm show gpus
    Displays information for all GPUs in the system.

Processor Commands

  • $ sudo nvsm show processors
    Displays information for all CPUs in the system.

  • $ sudo nvsm show cpus
    Alias for “show processors”.

Memory Commands

  • $ sudo nvsm show memory
    Displays information for all installed DIMMs.

  • $ sudo nvsm show dimms
    Alias for “show memory”.

Thermal Commands

  • $ sudo nvsm show fans
    Displays information for all the fans in the system.

  • $ sudo nvsm show temperatures
    Displays temperature information for all sensors in the system.

  • $ sudo nvsm show temps
    Alias for “show temperatures”.

Network Commands

  • $ sudo nvsm show networkadapters
    Displays information for the physical network adapters.

  • $ sudo nvsm show networkinterfaces
    Displays information for the logical network interfaces.

  • $ sudo nvsm show networkports
    Displays information for the network ports of a given network adapter.

  • $ sudo nvsm show networkdevicefunctions
    Displays information for the PCIe functions of a given network adapter.

Power Commands

  • $ sudo nvsm show power
    Displays information for all power supply units (PSUs) in the system.

  • $ sudo nvsm show powermode
    Displays the current system power mode.

  • $ sudo nvsm show psus
    Alias for “show power”.

NVSwitch Commands

  • $ sudo nvsm show nvswitches
    Displays information for all the NVSwitch interconnects in the system.

Firmware Commands

  • $ sudo nvsm show firmware
    Guides you through the steps of selecting a firmware update container on your local DGX system, and running it to show the firmware versions installed on the system. This requires that you have already loaded the container onto the DGX system.

  • $ sudo nvsm update firmware
    Guides you through the steps of selecting a firmware update container on your local DGX system, and running it to update the firmware on the system. This requires that you have already loaded the container onto the DGX system.

Show Health

The “show health” command can be used to quickly assess overall system health.

user@dgx-2:~$ sudo nvsm show health

Example output:

...
Checks
------
Verify installed DIMM memory sticks.............................. Healthy
Number of logical CPU cores [96]................................. Healthy
GPU link speed [0000:39:00.0][8GT/s]............................. Healthy
GPU link width [0000:39:00.0][x16]............................... Healthy
...
Health Summary
--------------
205 out of 205 checks are Healthy
Overall system status is Healthy

If any system health problems are found, this is reflected in the health summary at the bottom of the “show health” output. Detailed information on the health checks performed appears above the summary.
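
Because the output lists every check individually, it can be convenient to filter it with standard shell tools. For example, to print just the summary block (plain “grep” usage, not an NVSM feature):

user@dgx-2:~$ sudo nvsm show health | grep -A3 "Health Summary"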

Dump Health

The “dump health” command produces a health report file suitable for attaching to support tickets.

user@dgx-2:~$ sudo nvsm dump health

Example output:

Writing output to /tmp/nvsm-health-dgx-1-20180907085048.tar.xz
Done.

The file produced by “dump health” is a standard compressed tar archive, and its contents can be examined using the “tar” command as shown in the following example.

user@dgx-2:~$ cd /tmp
user@dgx-2:/tmp$ sudo tar xf nvsm-health-dgx-1-20180907085048.tar.xz
user@dgx-2:/tmp$ sudo ls ./nvsm-health-dgx-1-20180907085048
date            java         nvsysinfo_commands  sos_reports
df              last         nvsysinfo_log.txt   sos_strings
dmidecode       lib          proc                sys
etc             lsb-release  ps                  uname
free            lsmod        pstree              uptime
hostname        lsof         route               usr
initctl         lspci        run                 var
installed-debs  mount        sos_commands        version.txt
ip_addr         netstat      sos_logs            vgdisplay
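
To inspect the report without unpacking it, the archive contents can also be listed in place; this is standard “tar” usage rather than an NVSM feature:

user@dgx-2:/tmp$ sudo tar tf nvsm-health-dgx-1-20180907085048.tar.xz | head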

Show Storage

NVSM CLI provides a “show storage” command to view all storage-related information. This command can be invoked from the command line as follows.

user@dgx-2:~$ sudo nvsm show storage

Alternatively, the “show drives” and “show volumes” NVSM commands will show the storage drives or storage volumes respectively.

user@dgx-2:~$ sudo nvsm show drives
...
user@dgx-2:~$ sudo nvsm show volumes
...

Within an NVSM CLI interactive session, the CLI targets related to storage are located under the /systems/localhost/storage target.

user@dgx2:~$ sudo nvsm
nvsm-> cd /systems/localhost/storage/
nvsm(/systems/localhost/storage/)-> show

Example output:

/systems/localhost/storage/
Properties:
    DriveCount = 10
    Volumes = [ md0, md1, nvme0n1p1, nvme1n1p1 ]
Targets:
    alerts
    drives
    policy
    volumes
Verbs:
    cd
    show

Show Storage Alerts

Storage alerts are generated when the DSHM monitoring daemon detects a storage-related problem and attempts to alert the user (via email or otherwise). Past storage alerts can be viewed within an NVSM CLI interactive session under the /systems/localhost/storage/alerts target.

user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/storage/alerts
nvsm(/systems/localhost/storage/alerts)-> show

Example output:

/systems/localhost/storage/alerts
Targets:
    alert0
    alert1
Verbs:
    cd
    show

In this example listing, there appear to be two storage alerts associated with this system. The contents of these alerts can be viewed with the “show” command.

For example:

nvsm(/systems/localhost/storage/alerts)-> show alert1
/systems/localhost/storage/alerts/alert1
Properties:
    system_name = dgx-2
    message_details = EFI System Partition 1 is corrupted
    component_id = nvme0n1p1
    description = Storage sub-system is reporting an error
    event_time = 2018-07-14 12:51:19
    recommended_action =
         1. Please run nvsysinfo
         2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
         3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
    alert_id = NV-VOL-03
    system_serial = productserial
    message = System entered degraded mode, storage sub-system is reporting an error
    severity = Warning
Verbs:
    cd
    show

The message seen in this alert suggests a possible EFI partition corruption, which is an error condition that might adversely affect this system’s ability to boot. Note that the text seen here reflects the exact message that the user would have seen when this alert was generated.

Possible categories for storage alerts are given in the table below.

Alert ID     Severity  Details
NV-DRIVE-01  Critical  Drive missing
NV-DRIVE-02  Warning   Media errors detected in drive
NV-DRIVE-03  Warning   IO errors detected in drive
NV-DRIVE-04  Critical  NVMe controller failure detected in drive
NV-DRIVE-05  Warning   Available spare block percentage is below critical threshold of ten percent
NV-DRIVE-06  Warning   NVM subsystem usage exceeded ninety percent
NV-DRIVE-07  Warning   System has unsupported drive
NV-VOL-01    Critical  RAID-0 corruption observed
NV-VOL-02    Critical  RAID-1 corruption observed
NV-VOL-03    Warning   EFI System Partition 1 corruption observed
NV-VOL-04    Warning   EFI System Partition 2 corruption observed

Show Storage Drives

Within an NVSM CLI interactive session, each storage drive on the system is represented by a target under the /systems/localhost/storage/drives target. A listing of drives can be obtained as follows.

user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/storage/drives
nvsm(/systems/localhost/storage/drives)-> show

Example output:

/systems/localhost/storage/drives
Targets:
    nvme0n1
    nvme1n1
    nvme2n1
    nvme3n1
    nvme4n1
    nvme5n1
    nvme6n1
    nvme7n1
    nvme8n1
    nvme9n1
Verbs:
    cd
    show

Details for any particular drive can be viewed with the “show” command.

For example:

nvsm(/systems/localhost/storage/drives)-> show nvme2n1
/systems/localhost/storage/drives/nvme2n1
Properties:
    Capacity = 3840755982336
    BlockSizeBytes = 7501476528
    SerialNumber = 18141C244707
    PartNumber = N/A
    Model = Micron_9200_MTFDHAL3T8TCT
    Revision = 100007C0
    Manufacturer = Micron Technology Inc
    Status_State = Enabled
    Status_Health = OK
    Name = Non-Volatile Memory Express
    MediaType = SSD
    IndicatorLED = N/A
    EncryptionStatus = N/A
    HotSpareType = N/A
    Protocol = NVMe
    NegotiatedSpeedsGbs = 0
    Id = 2
Verbs:
    cd
    show

Show Storage Volumes

Within an NVSM CLI interactive session, each storage volume on the system is represented by a target under the /systems/localhost/storage/volumes target. A listing of volumes can be obtained as follows.

user@dgx-2:~$ sudo nvsm

nvsm-> cd /systems/localhost/storage/volumes
nvsm(/systems/localhost/storage/volumes)-> show

Example output:

/systems/localhost/storage/volumes
Targets:
    md0
    md1
    nvme0n1p1
    nvme1n1p1
Verbs:
    cd
    show

Details for any particular volume can be viewed with the “show” command.

For example:

nvsm(/systems/localhost/storage/volumes)-> show md0
/systems/localhost/storage/volumes/md0
Properties:
    Status_State = Enabled
    Status_Health = OK
    Name = md0
    Encrypted = False
    VolumeType = RAID-1
    Drives = [ nvme0n1, nvme1n1 ]
    CapacityBytes = 893.6G
    Id = md0
Verbs:
    cd
    show

Show GPUs

Information for all GPUs installed on the system can be viewed by invoking the “show gpus” command as follows.

user@dgx-2:~$ sudo nvsm show gpus

Within an NVSM CLI interactive session, the same information can be accessed under the /systems/localhost/gpus CLI target.

user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/gpus
nvsm(/systems/localhost/gpus)-> show

Example output:

/systems/localhost/gpus
Targets:
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
Verbs:
    cd
    show

Details for any particular GPU can also be viewed with the “show” command.

For example:

nvsm(/systems/localhost/gpus)-> show 6
/systems/localhost/gpus/6
Properties:
    Inventory_ModelName = Tesla V100-SXM3-32GB
    Inventory_UUID = GPU-4c653056-0d6e-df7d-19c0-4663d6745b97
    Inventory_SerialNumber = 0332318503073
    Inventory_PCIeDeviceId = 1DB810DE
    Inventory_PCIeSubSystemId = 12AB10DE
    Inventory_BrandName = Tesla
    Inventory_PartNumber = 699-2G504-0200-000
Verbs:
    cd
    show

Showing Individual GPUs

The full set of properties for an individual GPU can also be viewed with the “show” command.

For example:

nvsm(/systems/localhost/gpus)-> show GPU6
/systems/localhost/gpus/GPU6
Properties:
    Inventory_ModelName = Tesla V100-SXM3-32GB
    Inventory_UUID = GPU-4c653056-0d6e-df7d-19c0-4663d6745b97
    Inventory_SerialNumber = 0332318503073
    Inventory_PCIeDeviceId = 1DB810DE
    Inventory_PCIeSubSystemId = 12AB10DE
    Inventory_BrandName = Tesla
    Inventory_PartNumber = 699-2G504-0200-000
    Specifications_MaxPCIeGen = 3
    Specifications_MaxPCIeLinkWidth = 16x
    Specifications_MaxSpeeds_GraphicsClock = 1597 MHz
    Specifications_MaxSpeeds_MemClock = 958 MHz
    Specifications_MaxSpeeds_SMClock = 1597 MHz
    Specifications_MaxSpeeds_VideoClock = 1432 MHz
    Connections_PCIeGen = 3
    Connections_PCIeLinkWidth = 16x
    Connections_PCIeLocation = 00000000:34:00.0
    Power_PowerDraw = 50.95 W
    Stats_ErrorStats_ECCMode = Enabled
    Stats_FrameBufferMemoryUsage_Free = 32510 MiB
    Stats_FrameBufferMemoryUsage_Total = 32510 MiB
    Stats_FrameBufferMemoryUsage_Used = 0 MiB
    Stats_PCIeRxThroughput = 0 KB/s
    Stats_PCIeTxThroughput = 0 KB/s
    Stats_PerformanceState = P0
    Stats_UtilDecoder = 0 %
    Stats_UtilEncoder = 0 %
    Stats_UtilGPU = 0 %
    Stats_UtilMemory = 0 %
    Status_Health = OK
Verbs:
    cd
    show

Identifying GPU Health Incidents

NVSM uses NVIDIA Data Center GPU Manager (DCGM) to continuously monitor GPU health, and reports GPU health issues as “GPU health incidents”. Whenever GPU health incidents are present, NVSM indicates this state in the “Status_HealthRollup” property of the /systems/localhost/gpus CLI target.

“Status_HealthRollup” captures the overall health of all GPUs in the system in a single value. When checking for GPU health incidents, check the “Status_HealthRollup” property before checking other properties.

To check for GPU health incidents, do the following:

  1. Display the “Properties” section of GPU health.

    ~$ sudo nvsm
    nvsm-> cd /systems/localhost/gpus
    nvsm(/systems/localhost/gpus)-> show -display properties
    

    A system with a GPU-related issue might report the following.

    Properties:
        Status_HealthRollup = Critical
        Status_Health = OK
    

    The “Status_Health = OK” property in this example indicates that NVSM did not find any system-level problems, such as missing drivers or incorrect device file permissions.

    The “Status_HealthRollup = Critical” property indicates that at least one GPU in this system is exhibiting a “Critical” health incident.

  2. To find this GPU, issue the following command to list the health status of each GPU.

    ~$ sudo nvsm
    nvsm-> show -display properties=*health /systems/localhost/gpus/*
    

    The GPU with the health incidents will be reported as in the following example for GPU14.

    /systems/localhost/gpus/GPU14
    Properties:
        Status_Health = Critical
    
  3. Issue the following command to show the detailed health information for a particular GPU (GPU14 in this example).

    nvsm-> cd /systems/localhost/gpus
    nvsm(/systems/localhost/gpus)-> show -level all GPU14/health
    

    The output shows all the incidents involving that particular GPU.

    /systems/localhost/gpus/GPU14/health
    Properties:
        Health = Critical
    Targets:
        incident0
    Verbs:
        cd
        show
    /systems/localhost/gpus/GPU14/health/incident0
    Properties:
        Message = GPU 14's NvLink link 2 is currently down.
        Health = Critical
        System = NVLink
    Verbs:
        cd
        show
    

The output in this example narrows down the scope to a specific incident (or incidents) on a specific GPU. DCGM monitors for a variety of GPU conditions, so check “Status_HealthRollup” using NVSM CLI to understand each incident.
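
Because interactive and non-interactive usage are interchangeable, the same per-GPU health query can be run as a one-liner from the system shell (quoting the arguments so the shell does not expand the wildcards):

user@dgx-2:~$ sudo nvsm show -display 'properties=*health' '/systems/localhost/gpus/*'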

Show Processors

Information for all CPUs installed on the system can be viewed using the “show processors” command.

user@dgx-2:~$ sudo nvsm show processors

From within an NVSM CLI interactive session, the same information is available under the /systems/localhost/processors target.

user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/processors
nvsm(/systems/localhost/processors)-> show

Example output:

/systems/localhost/processors
Targets:
    CPU0
    CPU1
    alerts
    policy
Verbs:
    cd
    show

Details for any particular CPU can be viewed using the “show” command.

For example:

nvsm(/systems/localhost/processors)-> show CPU0
/systems/localhost/processors/CPU0
Properties:
    Id = CPU0
    InstructionSet = x86-64
    Manufacturer = Intel(R) Corporation
    MaxSpeedMHz = 3600
    Model = Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
    Name = Central Processor
    ProcessorArchitecture = x86
    ProcessorId_EffectiveFamily = 6
    ProcessorId_EffectiveModel = 85
    ProcessorId_IdentificationRegisters = 0xBFEBFBFF00050654
    ProcessorId_Step = 4
    ProcessorId_VendorId = GenuineIntel
    ProcessorType = CPU
    Socket = CPU 0
    Status_Health = OK
    Status_State = Enabled
    TotalCores = 24
    TotalThreads = 48
Verbs:
    cd
    show

Show Processor Alerts

Processor alerts are generated when the DSHM monitoring daemon detects a CPU Internal Error (IERR) or Thermal Trip and attempts to alert the user (via email or otherwise). Past processor alerts can be viewed within an NVSM CLI interactive session under the /systems/localhost/processors/alerts target.

user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/processors/alerts
nvsm(/systems/localhost/processors/alerts)-> show

Example output:

/systems/localhost/processors/alerts
Targets:
    alert0
    alert1
    alert2
Verbs:
    cd
    show

This example listing appears to show three processor alerts associated with this system. The contents of these alerts can be viewed with the “show” command.

For example:

nvsm(/systems/localhost/processors/alerts)-> show alert2
/systems/localhost/processors/alerts/alert2
Properties:
      system_name = xpl-bu-06
      component_id = CPU0
      description = CPU is reporting an error.
      event_time = 2018-07-18T16:42:20.580050
      recommended_action =
      1. Please run nvsysinfo
      2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
      3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
      severity = Critical
      alert_id = NV-CPU-02
      system_serial = To be filled by O.E.M.
      message = System entered degraded mode, CPU0 is reporting an error.
      message_details = CPU Thermtrip has occurred, processor socket temperature exceeded the thermal specifications of the component.
Verbs:
    cd
    show

Possible categories for processor alerts are given in the table below.

Alert ID   Severity  Details
NV-CPU-01  Critical  An unrecoverable CPU Internal error has occurred.
NV-CPU-02  Critical  CPU Thermtrip has occurred, processor socket temperature exceeded the thermal specifications of the component.

Show Memory

Information for all system memory (i.e. all DIMMs installed near the CPU, not including GPU memory) can be viewed using the “show memory” command.

user@dgx-2:~$ sudo nvsm show memory

From within an NVSM CLI interactive session, system memory information is accessible under the /systems/localhost/memory target.

lab@xpl-dvt-42:~$ sudo nvsm
nvsm-> cd /systems/localhost/memory
nvsm(/systems/localhost/memory)-> show

Example output:

/systems/localhost/memory
Targets:
    CPU0_DIMM_A1
    CPU0_DIMM_A2
    CPU0_DIMM_B1
    CPU0_DIMM_B2
    CPU0_DIMM_C1
    CPU0_DIMM_C2
    CPU0_DIMM_D1
    CPU0_DIMM_D2
    CPU0_DIMM_E1
    CPU0_DIMM_E2
    CPU0_DIMM_F1
    CPU0_DIMM_F2
    CPU1_DIMM_G1
    CPU1_DIMM_G2
    CPU1_DIMM_H1
    CPU1_DIMM_H2
    CPU1_DIMM_I1
    CPU1_DIMM_I2
    CPU1_DIMM_J1
    CPU1_DIMM_J2
    CPU1_DIMM_K1
    CPU1_DIMM_K2
    CPU1_DIMM_L1
    CPU1_DIMM_L2
    alerts
    policy
Verbs:
    cd
    show

Details for any particular memory DIMM can be viewed using the “show” command.

For example:

nvsm(/systems/localhost/memory)-> show CPU2_DIMM_B1
/systems/localhost/memory/CPU2_DIMM_B1
Properties:
    CapacityMiB = 65536
    DataWidthBits = 64
    Description = DIMM DDR4 Synchronous
    Id = CPU2_DIMM_B1
    Name = Memory Instance
    OperatingSpeedMhz = 2666
    PartNumber = 72ASS8G72LZ-2G6B2
    SerialNumber = 1CD83000
    Status_Health = OK
    Status_State = Enabled
    VendorId = Micron
Verbs:
    cd
    show

Show Memory Alerts

On DGX systems with a Baseboard Management Controller (BMC), the BMC will monitor DIMMs for correctable and uncorrectable errors. Whenever memory error counts cross a certain threshold (as determined by SBIOS), a memory alert is generated by the DSHM daemon in an attempt to notify the user (via email or otherwise).

Past memory alerts are accessible from an NVSM CLI interactive session under the /systems/localhost/memory/alerts target.

user@dgx-2:~$ sudo nvsm
nvsm-> cd /systems/localhost/memory/alerts
nvsm(/systems/localhost/memory/alerts)-> show

Example output:

/systems/localhost/memory/alerts
Targets:
    alert0
Verbs:
    cd
    show

This example listing appears to show one memory alert associated with this system. The contents of this alert can be viewed with the “show” command.

For example:

nvsm(/systems/localhost/memory/alerts)-> show alert0
/systems/localhost/memory/alerts/alert0
Properties:
   system_name = xpl-bu-06
   component_id = CPU1_DIMM_A2
   description = DIMM is reporting an error.
   event_time = 2018-07-18T16:48:09.906572
   recommended_action =
       1. Please run nvsysinfo
       2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
       3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
   severity = Critical
   alert_id = NV-DIMM-01
   system_serial = To be filled by O.E.M.
   message = System entered degraded mode, CPU1_DIMM_A2 is reporting an error.
   message_details = Uncorrectable error is reported.
Verbs:
    cd
    show

Possible categories for memory alerts are given in the table below.

Alert ID    Severity  Details
NV-DIMM-01  Critical  Uncorrectable error is reported.

Show Fans and Temperature

NVSM CLI provides a “show fans” command to display information for each fan on the system.

~$ sudo nvsm show fans

Likewise, NVSM CLI provides a “show temperatures” command to display temperature information for each temperature sensor known to NVSM.

~$ sudo nvsm show temperatures

Within an NVSM CLI interactive session, targets related to fans and temperature are located under the /chassis/localhost/thermal target.

~$ sudo nvsm
nvsm-> cd /chassis/localhost/thermal
nvsm(/chassis/localhost/thermal)-> show

Example output:

/chassis/localhost/thermal
Targets:
    alerts
    fans
    policy
    temperatures
Verbs:
    cd
    show

Show Thermal Alerts

The DSHM daemon monitors fan speed and temperature sensors. When the values of these sensors violate certain threshold criteria, DSHM generates a thermal alert in an attempt to notify the user (via email or otherwise).

Past thermal alerts can be viewed in an NVSM CLI interactive session under the /chassis/localhost/thermal/alerts target.

user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/thermal/alerts
nvsm(/chassis/localhost/thermal/alerts)-> show

Example output:

/chassis/localhost/thermal/alerts
Targets:
    alert0
Verbs:
    cd
    show

This example listing appears to show one thermal alert associated with this system. The contents of this alert can be viewed with the “show” command.

For example:

nvsm(/chassis/localhost/thermal/alerts)-> show alert0
/chassis/localhost/thermal/alerts/alert0
Properties:
    system_name = system-name
    component_id = FAN1_R
    description = Fan Module is reporting an error.
    event_time = 2018-07-12T15:12:22.076814
    recommended_action =
        1. Please run nvsysinfo
        2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
        3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
    severity = Critical
    alert_id = NV-FAN-01
    system_serial = To be filled by O.E.M.
    message = System entered degraded mode, FAN1_R is reporting an error.
    message_details = Fan speed reading has fallen below the expected speed setting.
Verbs:
    cd
    show

From the message in this alert, it appears that one of the rear fans is broken in this system. This is the exact message that the user would have received at the time this alert was generated, assuming alert notifications were enabled.

Possible categories for thermal-related (fan and temperature) alerts are given in the table below.

Alert ID   Severity  Details
NV-FAN-01  Critical  Fan speed reading has fallen below the expected speed setting.
NV-FAN-02  Critical  Fan readings are inaccessible.
NV-PDB-01  Critical  Operating temperature exceeds the thermal specifications of the component.

Show Fans

Within an NVSM CLI interactive session, each fan on the system is represented by a target under the /chassis/localhost/thermal/fans target. The “show” command can be used to obtain a listing of fans on the system.

user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/thermal/fans
nvsm(/chassis/localhost/thermal/fans)-> show

Example output:

/chassis/localhost/thermal/fans
Targets:
    FAN10_F
    FAN10_R
    FAN1_F
    FAN1_R
    FAN2_F
    FAN2_R
    FAN3_F
    FAN3_R
    FAN4_F
    FAN4_R
    FAN5_F
    FAN5_R
    FAN6_F
    FAN6_R
    FAN7_F
    FAN7_R
    FAN8_F
    FAN8_R
    FAN9_F
    FAN9_R
    PDB_FAN1
    PDB_FAN2
    PDB_FAN3
    PDB_FAN4
Verbs:
    cd
    show

Again using the “show” command, the details for any given fan can be obtained as follows.

For example:

nvsm(/chassis/localhost/thermal/fans)-> show PDB_FAN2
/chassis/localhost/thermal/fans/PDB_FAN2
Properties:
    Status_State = Enabled
    Status_Health = OK
    Name = PDB_FAN2
    MemberId = 21
    ReadingUnits = RPM
    LowerThresholdNonCritical = 11900.000
    Reading = 13804 RPM
    LowerThresholdCritical = 10744.000
Verbs:
    cd
    show

Show Temperatures

Each temperature sensor known to NVSM is represented as a target under the /chassis/localhost/thermal/temperatures target. A listing of temperature sensors on the system can be obtained using the following commands.

user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/thermal/temperatures
nvsm(/chassis/localhost/thermal/temperatures)-> show

Example output:

/chassis/localhost/thermal/temperatures
Targets:
    PDB1
    PDB2
Verbs:
    cd
    show

As with fans, the details for any temperature sensor can be viewed with the “show” command.

For example:

nvsm(/chassis/localhost/thermal/temperatures)-> show PDB2
/chassis/localhost/thermal/temperatures/PDB2
Properties:
    Status_State = Enabled
    Status_Health = OK
    Name = PDB2
    PhysicalContext = PDB
    MemberId = 1
    ReadingCelsius = 20 degrees C
    UpperThresholdNonCritical = 127.000
    SensorNumber = 66h
    UpperThresholdCritical = 127.000
Verbs:
    cd
    show

Show Power Supplies

NVSM CLI provides a “show power” command to display information for all power supplies present on the system.

user@dgx-2:~$ sudo nvsm show power

From an NVSM CLI interactive session, power supply information can be found under the /chassis/localhost/power target.

user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/power
nvsm(/chassis/localhost/power)-> show

Example output:

/chassis/localhost/power
Targets:
    PSU1
    PSU2
    PSU3
    PSU4
    PSU5
    PSU6
    alerts
    policy
Verbs:
    cd
    show

Details for any particular power supply can be viewed using the “show” command as follows.

For example:

nvsm(/chassis/localhost/power)-> show PSU4
/chassis/localhost/power/PSU4
Properties:
    Status_State = Present
    Status_Health = OK
    LastPowerOutputWatts = 442
    Name = PSU4
    SerialNumber = DTHTCD18240
    MemberId = 3
    PowerSupplyType = AC
    Model = ECD16010081
    Manufacturer = Delta
Verbs:
    cd
    show

Show Power Alerts

The DSHM daemon monitors PSU status. When a PSU status is not OK, DSHM generates a power alert in an attempt to notify the user (via email or otherwise).

Prior power alerts can be viewed under the /chassis/localhost/power/alerts target of an NVSM CLI interactive session.

user@dgx-2:~$ sudo nvsm
nvsm-> cd /chassis/localhost/power/alerts
nvsm(/chassis/localhost/power/alerts)-> show

Example output:

/chassis/localhost/power/alerts
Targets:
    alert0
    alert1
    alert2
    alert3
    alert4
Verbs:
    cd
    show

This example listing shows a system with five prior power alerts. The details for any one of these alerts can be viewed using the “show” command.

For example:

nvsm(/chassis/localhost/power/alerts)-> show alert4
/chassis/localhost/power/alerts/alert4
Properties:
   system_name = system-name
   component_id = PSU4
   description = PSU is reporting an error.
   event_time = 2018-07-18T16:01:27.462005
   recommended_action =
       1. Please run nvsysinfo
       2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/enterpriselogin
       3. Attach this notification and the nvsysinfo log file from /tmp/nvsysinfo-XYZ*
   severity = Warning
   alert_id = NV-PSU-05
   system_serial = To be filled by O.E.M.
   message = System entered degraded mode, PSU4 is reporting an error.
   message_details = PSU is missing
Verbs:
    cd
    show

Possible categories for power alerts are given in the table below.

Alert ID   Severity  Details
NV-PSU-01  Critical  Power supply module has failed.
NV-PSU-02  Warning   Detected predictive failure of the Power supply module.
NV-PSU-03  Critical  Input to the Power supply module is missing.
NV-PSU-04  Critical  Input voltage is out of range for the Power Supply Module.
NV-PSU-05  Warning   PSU is missing.

Show Network Adapters

NVSM CLI provides a “show networkadapters” command to display information for each physical network adapter in the chassis.

~$ sudo nvsm show networkadapters

Within an NVSM CLI interactive session, targets related to network adapters are located under the /chassis/localhost/NetworkAdapters target.

~$ sudo nvsm
nvsm-> cd /chassis/localhost/NetworkAdapters
nvsm(/chassis/localhost/NetworkAdapters)-> show

Show Network Ports

NVSM CLI provides a “show networkports” command to display information for each physical network port in the chassis.

~$ sudo nvsm show networkports

Within an NVSM CLI interactive session, targets related to network ports are located under the /chassis/localhost/NetworkAdapters/<id>/NetworkPorts target, where <id> is one of the network adapter IDs displayed by the nvsm show networkadapters command.

~$ sudo nvsm
nvsm-> cd /chassis/localhost/NetworkAdapters/<id>/NetworkPorts
nvsm(/chassis/localhost/NetworkAdapters/<id>/NetworkPorts)-> show
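
For example, assuming an InfiniBand adapter with ID IB0 (the same adapter ID used in the policy examples later in this chapter; list the IDs on your system with “show networkadapters” first):

~$ sudo nvsm show /chassis/localhost/NetworkAdapters/IB0/NetworkPorts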

Show Network Device Functions

NVSM CLI provides a “show networkdevicefunctions” command to display information for each network adapter-centric PCIe function in the chassis.

~$ sudo nvsm show networkdevicefunctions

Within an NVSM CLI interactive session, targets related to network device functions are located under the /chassis/localhost/NetworkAdapters/<id>/NetworkDeviceFunctions target, where <id> is one of the network adapter IDs displayed by the nvsm show networkadapters command.

~$ sudo nvsm
nvsm-> cd /chassis/localhost/NetworkAdapters/<id>/NetworkDeviceFunctions
nvsm(/chassis/localhost/NetworkAdapters/<id>/NetworkDeviceFunctions)-> show

Show Network Interfaces

NVSM CLI provides a “show networkinterfaces” command to display information for each logical network interface on the system.

~$ sudo nvsm show networkinterfaces

Within an NVSM CLI interactive session, targets related to network interfaces are located under the /systems/localhost/networkinterfaces target.

~$ sudo nvsm
nvsm-> cd /systems/localhost/networkinterfaces
nvsm(/systems/localhost/networkinterfaces)-> show

System Monitoring Configuration

NVSM provides a DSHM service that monitors the state of the DGX system.

NVSM CLI can be used to interact with the DSHM system monitoring service via the NVSM API server.

Configuring Email Alerts

To receive the alerts generated by DSHM through email, configure the email settings in the global policy using NVSM CLI. An email is sent to the configured recipients whenever a new alert is generated. The sender address, recipient address(es), SMTP server name, and SMTP server port number must be configured to match the SMTP service hosted by the user.

Email configuration properties

email_sender
    Sender email address. Must be a valid email address, otherwise no emails will be sent. Example: sender@domain.com

email_recipients
    List of recipients to which the email shall be sent. Example: user1@domain.com,user2@domain.com

email_smtp_server_name
    SMTP server name to use for relaying email. Example: smtp.domain.com

email_smtp_server_port
    Port number used by the SMTP server for providing SMTP relay service. Numeric value. Example: 465

The following examples illustrate how to configure email settings in global policy using NVSM CLI.

user@dgx-2:~$ sudo nvsm set /policy email_sender=dgx-admin@nvidia.com
user@dgx-2:~$ sudo nvsm set /policy email_smtp_server_name=smtpserver.nvidia.com
user@dgx-2:~$ sudo nvsm set /policy email_recipients=jdoe@nvidia.com,jdeer@nvidia.com
user@dgx-2:~$ sudo nvsm set /policy email_smtp_server_port=465
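
The configured values can then be confirmed by displaying the global policy, using the same “show /policy” command described in the next section:

user@dgx-2:~$ sudo nvsm show /policy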

Understanding System Monitoring Policies

From within an NVSM CLI interactive session, system monitor policy settings are accessible under the following targets.

/policy
    Global NVSM monitoring policy, such as email settings for alert notifications.

/systems/localhost/gpus/policy
    NVSM policy for monitoring GPUs.

/systems/localhost/memory/policy
    NVSM policy for monitoring DIMM correctable and uncorrectable errors.

/systems/localhost/processors/policy
    NVSM policy for monitoring CPU machine-check exceptions (MCE).

/systems/localhost/storage/policy
    NVSM policy for monitoring storage drives and volumes.

/chassis/policy
    Global policy for monitoring chassis components.

/chassis/localhost/thermal/policy
    NVSM policy for monitoring fan speed and temperature as reported by the baseboard management controller (BMC).

/chassis/localhost/power/policy
    NVSM policy for monitoring power supply voltages as reported by the BMC.

/chassis/localhost/NetworkAdapters/policy
    NVSM policy for monitoring the physical network adapters.

/chassis/localhost/NetworkAdapters/<ETHx>/NetworkPorts/policy
    NVSM policy for monitoring the network ports for the specified Ethernet network adapter.

/chassis/localhost/NetworkAdapters/<IBy>/NetworkPorts/policy
    NVSM policy for monitoring the network ports for the specified InfiniBand network adapter.

/chassis/localhost/NetworkAdapters/<ETHx>/NetworkDeviceFunctions/policy
    NVSM policy for monitoring the PCIe functions for the specified Ethernet network adapter.

/chassis/localhost/NetworkAdapters/<IBy>/NetworkDeviceFunctions/policy
    NVSM policy for monitoring the PCIe functions for the specified InfiniBand network adapter.

Global Monitoring Policy

Global monitoring policy is represented by the /policy target of NVSM CLI.

user@dgx-2:~$ sudo nvsm show /policy

Example output:

/policy
Properties:
    email_sender = NVIDIA DSHM Service
    email_smtp_server_name = smtp.example.com
    email_recipients = jdoe@nvidia.com,jdeer@nvidia.com
    email_smtp_server_port = 465
Verbs:
    cd
    set
    show

The properties for global monitoring policy are described in the table below.

email_sender
    Sender email address. Example: sender@domain.com

email_recipients
    List of recipients to which the email shall be sent. Example: user1@domain.com,user2@domain.com

email_smtp_server_name
    SMTP server name to use for relaying email. Example: smtp.domain.com

email_smtp_server_port
    Port number used by the SMTP server for providing SMTP relay service. Numeric value.

Memory Monitoring Policy

Memory monitoring policy is represented by the /systems/localhost/memory/policy target of NVSM CLI.

user@dgx-2:~$ sudo nvsm show /systems/localhost/memory/policy

Example output:

/systems/localhost/memory/policy
Properties:
    mute_notification =
    mute_monitoring =

Verbs:
    cd
    set
    show

The properties for memory monitoring policy are described in the table below.

mute_notification
    List of comma-separated DIMM IDs. Example: CPU1_DIMM_A1,CPU2_DIMM_F2
    Email alert notification is suppressed for devices in the list.

mute_monitoring
    List of comma-separated DIMM IDs. Example: CPU1_DIMM_A1,CPU2_DIMM_F2
    Health monitoring is suppressed for devices in the list.
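
These properties are set with the same “set” verb used for the email configuration above. For example, to suppress email notifications for the two example DIMMs while leaving their health monitoring active:

user@dgx-2:~$ sudo nvsm set /systems/localhost/memory/policy mute_notification=CPU1_DIMM_A1,CPU2_DIMM_F2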

Processor Monitoring Policy

Processor monitoring policy is represented by the /systems/localhost/processors/policy target of NVSM CLI.

user@dgx-2:~$ sudo nvsm show /systems/localhost/processors/policy

Example output:

/systems/localhost/processors/policy
Properties:
    mute_notification =
    mute_monitoring =

Verbs:
    cd
    set
    show

The properties for processor monitoring policy are described in the table below.

mute_notification
    List of comma-separated CPU IDs. Example: CPU0,CPU1
    Email alert notification is suppressed for devices in the list.

mute_monitoring
    List of comma-separated CPU IDs. Example: CPU0,CPU1
    Health monitoring is suppressed for devices in the list.

Storage Monitoring Policy

Storage monitoring policy is represented by the /systems/localhost/storage/policy target of NVSM CLI.

user@dgx-2:~$ sudo nvsm show /systems/localhost/storage/policy

Example output:

/systems/localhost/storage/policy
Properties:
    volume_mute_monitoring =
    volume_poll_interval = 10
    drive_mute_monitoring =
    drive_mute_notification =
    drive_poll_interval = 10
    volume_mute_notification =
Verbs:
    cd
    set
    show

The properties for storage monitoring policy are described in the table below.

drive_mute_notification
    List of comma-separated drive slots. Example: 0,1
    Email alert notification is suppressed for drives in the list.

drive_mute_monitoring
    List of comma-separated drive slots. Example: 0,1
    Health monitoring is suppressed for drives in the list.

drive_poll_interval
    Positive integer.
    DSHM checks the health of the drives periodically. By default, this polling occurs every 10 seconds. The poll interval can be configured through this property.

volume_mute_notification
    List of comma-separated volume identifiers. Example: md0,md1
    Email alert notification is suppressed for volumes in the list.

volume_mute_monitoring
    List of comma-separated volume identifiers. Example: md0,md1
    Health monitoring is suppressed for volumes in the list.

volume_poll_interval
    Positive integer.
    DSHM checks the health of the volumes periodically. By default, this polling occurs every 10 seconds. The poll interval can be configured through this property.
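
For example, to poll drive health every 30 seconds instead of the default 10, set the property with the same “set” verb shown for the email configuration:

user@dgx-2:~$ sudo nvsm set /systems/localhost/storage/policy drive_poll_interval=30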

Thermal Monitoring Policy

Thermal monitoring policy (for fan speed and temperature) is represented by the /chassis/localhost/thermal/policy target of NVSM CLI.

user@dgx-2:~$ sudo nvsm show /chassis/localhost/thermal/policy

Example output:

/chassis/localhost/thermal/policy
Properties:
    fan_mute_notification =
    pdb_mute_monitoring =
    fan_mute_monitoring =
    pdb_mute_notification =
Verbs:
    cd
    set
    show

The properties for thermal monitoring policy are described in the table below.

fan_mute_notification
    List of comma-separated FAN IDs. Example: FAN2_R,FAN1_L,PDB_FAN2
    Email alert notification is suppressed for devices in the list.

fan_mute_monitoring
    List of comma-separated FAN IDs. Example: FAN6_F,PDB_FAN1
    Health monitoring is suppressed for devices in the list.

pdb_mute_notification
    List of comma-separated PDB IDs. Example: PDB1,PDB2
    Email alert notification is suppressed for devices in the list.

pdb_mute_monitoring
    List of comma-separated PDB IDs. Example: PDB1
    Health monitoring is suppressed for devices in the list.

Power Monitoring Policy

Power monitoring policy is represented by the /chassis/localhost/power/policy target of NVSM CLI.

user@dgx-2:~$ sudo nvsm show /chassis/localhost/power/policy

Example output:

/chassis/localhost/power/policy
Properties:
    mute_notification =
    mute_monitoring =

Verbs:
    cd
    set
    show

The properties for power monitoring policy are described in the table below.

mute_notification
    List of comma-separated PSU IDs. Example: PSU4,PSU2
    Email alert notification is suppressed for devices in the list.

mute_monitoring
    List of comma-separated PSU IDs. Example: PSU1,PSU4
    Health monitoring is suppressed for devices in the list.

PCIe Monitoring Policy

PCIe monitoring policy is represented by the /systems/localhost/pcie/policy target of NVSM CLI.

:~$ sudo nvsm show /systems/localhost/pcie/policy

Example output:

/systems/localhost/pcie/policy
Properties:
    mute_notification =
    mute_monitoring =

Verbs:
    cd
    set
    show

The properties for PCIe monitoring policy are described in the table below.

mute_notification
    List of comma-separated PCIe IDs.
    Email alert notification is suppressed for devices in the list.

mute_monitoring
    List of comma-separated PCIe IDs.
    Health monitoring is suppressed for devices in the list.

GPU Monitoring Policy

GPU monitoring policy is represented by the /systems/localhost/gpus/policy target of NVSM CLI.

:~$ sudo nvsm show /systems/localhost/gpus/policy

Example output:

/systems/localhost/gpus/policy
Properties:
    mute_notification =
    mute_monitoring =

Verbs:
    cd
    set
    show

The properties for GPU monitoring policy are described in the table below.

mute_notification
    List of comma-separated GPU IDs.
    Email alert notification is suppressed for devices in the list.

mute_monitoring
    List of comma-separated GPU IDs.
    Health monitoring is suppressed for devices in the list.

Network Adapter Monitoring Policies

Network Adapter Policy

The physical network adapter monitoring policy is represented by the /chassis/localhost/NetworkAdapters/policy target of NVSM CLI.

:~$ sudo nvsm show /chassis/localhost/NetworkAdapters/policy

Example output:

/chassis/localhost/NetworkAdapters/policy
Properties:
    mute_notification =
    mute_monitoring =
Verbs:
    cd
    set
    show

The properties are described in the following table.

mute_notification
    List of comma-separated physical network adapter IDs.
    Email alert notification is suppressed for devices in the list.

mute_monitoring
    List of comma-separated physical network adapter IDs.
    Health monitoring is suppressed for devices in the list.

Network Ports Policy

The physical network port monitoring policy is represented by the /chassis/localhost/NetworkAdapters/<network-id>/NetworkPorts/policy target of NVSM CLI.

The following uses the network adapter IB0 to demonstrate this command.

:~$ sudo nvsm show /chassis/localhost/NetworkAdapters/IB0/NetworkPorts/policy

Example output:

/chassis/localhost/NetworkAdapters/IB0/NetworkPorts/policy
Properties:
    mute_notification =
    mute_monitoring =
Verbs:
    cd
    set
    show

The properties are described in the following table.

mute_notification
    List of comma-separated physical network port IDs.
    Email alert notification is suppressed for devices in the list.

mute_monitoring
    List of comma-separated physical network port IDs.
    Health monitoring is suppressed for devices in the list.

Network Device Functions Policy

The network device functions monitoring policy is represented by the /chassis/localhost/NetworkAdapters/<network-id>/NetworkDeviceFunctions/policy target of NVSM CLI.

The following uses the network adapter IB0 to demonstrate this command.

:~$ sudo nvsm show /chassis/localhost/NetworkAdapters/IB0/NetworkDeviceFunctions/policy

Example output:

/chassis/localhost/NetworkAdapters/IB0/NetworkDeviceFunctions/policy
Properties:
    mute_monitoring =
    mute_notification =
    rx_collision_threshold = 5
    rx_crc_threshold = 5
    tx_collision_threshold = 5
Verbs:
    cd
    set
    show

The properties are described in the following table.

mute_notification
    List of comma-separated network-centric PCIe function IDs.
    Email alert notification is suppressed for devices in the list.

mute_monitoring
    List of comma-separated network-centric PCIe function IDs.
    Health monitoring is suppressed for devices in the list.

rx_collision_threshold
    Positive integer.

rx_crc_threshold
    Positive integer.

tx_collision_threshold
    Positive integer.

Performing System Management Tasks

This section describes commands for accomplishing some system management tasks.

Rebuilding a RAID 1 Array

For DGX systems with two NVMe OS drives configured as a RAID 1 array, the operating system is installed on volume md0. You can use NVSM CLI to view the health of the RAID volume and then rebuild the RAID array on two healthy drives.

Viewing a Healthy RAID Volume

On a healthy system, this volume appears with two drives and “Status_Health = OK”. For example:

nvsm-> cd /systems/localhost/storage
nvsm(/systems/localhost/storage)-> show volumes/md0
/systems/localhost/storage/volumes/md0
Properties:
    Status_State = Enabled
    Status_Health = OK
    Name = md0
    Encrypted = False
    VolumeType = RAID-1
    Drives = [ nvme0n1, nvme1n1 ]
    CapacityBytes = 893.6G
    Id = md0
Targets:
    rebuild
Verbs:
    cd
    show

Viewing a Degraded RAID Volume

On a system with a degraded OS volume, the md0 volume will appear with only one drive, with “Status_Health = Warning” and “Status_State = Degraded” reported as follows.

nvsm-> cd /systems/localhost/storage
nvsm(/systems/localhost/storage)-> show volumes/md0
/systems/localhost/storage/volumes/md0
Properties:
    Status_State = Degraded
    Status_Health = Warning
    Name = md0
    Encrypted = False
    VolumeType = RAID-1
    Drives = [ nvme1n1 ]
    CapacityBytes = 893.6G
    Id = md0
Targets:
    rebuild
Verbs:
    cd
    show

In this situation, the OS volume is missing its mirror drive.

Rebuilding the RAID 1 Volume

To rebuild the RAID array, make sure that you have installed a known good NVMe drive as the replacement drive.

The RAID rebuilding process should begin automatically upon turning on the system. If it does not start automatically, use NVSM CLI to manually rebuild the array as follows.

  1. Start an NVSM CLI interactive session and switch to the storage target.

    $ sudo nvsm
    nvsm-> cd /systems/localhost/storage
    
  2. Start the rebuilding process and be ready to enter the device name of the replaced drive.

    nvsm(/systems/localhost/storage)-> start volumes/md0/rebuild
    PROMPT: In order to rebuild this volume, a spare drive
            is required. Please specify the spare drive to use
            to rebuild md0.
    Name of spare drive for md0 rebuild (CTRL-C to cancel): nvmeXn1
    WARNING: Once the volume rebuild process is started, the
             process cannot be stopped.
    Start RAID-1 rebuild on md0? [y/n] y
    
  3. After entering y at the prompt to start the RAID 1 rebuild, the “Initiating rebuild …” message appears.

    /systems/localhost/storage/volumes/md0/rebuild started at 2018-10-12 15:27:26.525187
     Initiating RAID-1 rebuild on volume md0...
     0.0% [\ ]
    

    After about 30 seconds, the “Rebuilding RAID-1 …” message should appear.

    /systems/localhost/storage/volumes/md0/rebuild started at 2018-10-12 15:27:26.525187
     Rebuilding RAID-1 on volume md0...
     31.0% [=============/ ]
    

    If this message remains at “Initiating RAID-1 rebuild” for more than 30 seconds, then there is a problem with the rebuild process. In this case, make sure the name of the replacement drive is correct and try again.

The RAID 1 rebuild process should take about 1 hour to complete.
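
The progress and final state of the rebuild can be checked at any time by showing the volume again, or from outside NVSM via the standard Linux /proc/mdstat interface:

$ sudo nvsm show /systems/localhost/storage/volumes/md0
$ cat /proc/mdstat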

For more detailed information on replacing a failed NVMe OS drive, see the NVIDIA DGX-2 Service Manual.

Setting MaxQ/MaxP on DGX-2 Systems

Beginning with DGX OS 4.0.5, you can set two GPU performance modes – MaxQ or MaxP.

Note

Support on DGX-2 systems requires BMC firmware version 1.04.03 or later. MaxQ/MaxP is not supported on DGX-2H systems.

MaxQ

  • Maximum efficiency mode

  • Allows two DGX-2 systems to be installed in racks that have a power budget of 18 kW.

  • Switch to MaxQ mode as follows.

    $ sudo nvsm set powermode=maxq
    

    The settings are preserved across reboots.

MaxP

  • Default mode for maximum performance

  • GPUs operate unconstrained up to the thermal design power (TDP) level.

    In this setting, the maximum DGX-2 power consumption is 10 kW.

  • When only three or four PSUs are working, performance is reduced but remains better than in MaxQ mode.

  • If you switch to MaxQ mode, you can switch back to MaxP mode as follows:

    $ sudo nvsm set powermode=maxp
    

    The settings are preserved across reboots.
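
In either case, the currently active mode can be verified with the “show powermode” command listed among the basic commands earlier in this chapter:

$ sudo nvsm show powermode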

Configuring Support for Custom Drive Partitioning

DGX systems incorporate data drives configured as either RAID 0 or RAID 5 arrays, depending on the product. You can alter the default configuration by adding or removing drives, or by switching between a RAID 0 configuration and a RAID 5 configuration. If you alter the default configuration, you must let NVSM know so that the utility does not flag the configuration as an error, and so that NVSM can continue to monitor the health of the drives.

To configure NVSM to support a custom drive partitioning, perform the following.

  1. Stop NVSM services.

    $ sudo systemctl stop nvsm
    
  2. Edit /etc/nvsm/nvsm.config and set the "use_standard_config_storage" parameter to false.

    "use_standard_config_storage":false
    
  3. Remove the NVSM database.

    $ sudo rm /var/lib/nvsm/sqlite/nvsm.db
    
  4. Restart NVSM.

    $ sudo systemctl restart nvsm
    

Remember to set the parameter back to true if you restore the drive partition back to the default configuration.
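
After NVSM restarts, you can confirm that the custom layout is accepted by re-checking the storage view and overall health with the commands described earlier in this chapter:

$ sudo nvsm show storage
$ sudo nvsm show health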