C. Using Custom DGX Software Utilities for the DGX Station

The DGX Software includes custom utilities for maintaining the DGX Station persistent storage. Custom utilities for managing and obtaining diagnostic information for the DGX Station were included only in version EL7-20.01 of the DGX Software.

C.1. Rebuilding or Re-Creating the DGX Station RAID Array

Failure of a single SSD in a RAID 5 array is a recoverable error, but the array loses data redundancy until it is rebuilt. After replacing a single failed SSD in a RAID 5 array, you must rebuild the array to restore data redundancy. Failure of any number of SSDs in a RAID 0 array, or of more than one SSD in a RAID 5 array, is an unrecoverable failure. After replacing the SSDs in response to an unrecoverable failure, you must re-create the array.

If the DGX Station RAID array is degraded because one or more SSDs failed, replace each failed SSD as explained in the DGX Station User Guide.

The DGX Station software includes the custom script configure_raid_array.py for rebuilding or re-creating the RAID array.

  • To rebuild a RAID 5 array after replacing a single failed SSD, run the following command:

    $ sudo configure_raid_array.py -r

    Note: The time required to rebuild a RAID 5 array depends on factors such as system load, SSD capacity, and the number of SSDs in the array. Rebuilding the array of three 1.92-terabyte SSDs in the DGX Station may require several hours.

    You can monitor the progress of a long-running rebuild by examining the contents of the /proc/mdstat file:

    $ cat /proc/mdstat
    Personalities : [raid0] [linear] [multipath] [raid1] [raid6] [raid5] [raid4] [raid10]
    md0 : active raid5 sdb[0] sdd[3] sdc[1]
          3750486016 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
          [>....................]  recovery =  4.0% (75580956/1875243008) finish=438.3min speed=68419K/sec
          bitmap: 2/14 pages [8KB], 65536KB chunk
    
    unused devices: <none>

    In this example, the rebuild is 4.0% complete and is estimated to finish in 438.3 minutes.

  • To re-create a RAID 5 array after replacing more than one failed SSD, run the following command:

    $ sudo configure_raid_array.py -c -5 -f

    CAUTION: Specify the -c option only if an unrecoverable failure, such as the failure of more than one SSD, has occurred. The -c option erases all data in the array.

  • To re-create a RAID 0 array after replacing any number of failed SSDs, run the following command:

    $ sudo configure_raid_array.py -c -f

The RAID array is rebuilt or re-created with the RAID level that you specified.

  • If you re-created a RAID 0 or RAID 5 array, all data that was on the array is erased.
  • If you rebuilt a RAID 5 array, the data on the array is preserved.

If you re-created a RAID 0 or RAID 5 array and have a backup of the data that you want to preserve, restore the data from the backup.
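
The choice between rebuilding and re-creating the array can be summarized as a small shell sketch. The function name raid_recovery_command is hypothetical and is not part of the DGX software. It only prints the command that this section gives for each case and does not run anything.

```shell
# Hypothetical helper (not part of the DGX software): print the
# configure_raid_array.py command that this section gives for a
# RAID level and a number of replaced SSDs. It runs nothing.
raid_recovery_command() {
    level=$1    # RAID level of the array: 0 or 5
    failed=$2   # number of failed SSDs that were replaced

    if [ "$failed" -lt 1 ]; then
        echo "no failed SSDs: nothing to do" >&2
        return 1
    fi

    case "$level" in
        5)
            if [ "$failed" -eq 1 ]; then
                # Single failure in RAID 5 is recoverable: rebuild, data preserved.
                echo "sudo configure_raid_array.py -r"
            else
                # More than one failure is unrecoverable: re-create, data erased.
                echo "sudo configure_raid_array.py -c -5 -f"
            fi
            ;;
        0)
            # Any failure in RAID 0 is unrecoverable: re-create, data erased.
            echo "sudo configure_raid_array.py -c -f"
            ;;
        *)
            echo "unsupported RAID level: $level" >&2
            return 1
            ;;
    esac
}

# Example: a RAID 5 array in which one SSD was replaced.
#   raid_recovery_command 5 1
```

Printing the command instead of running it leaves the final decision, and the confirmation that a backup exists, with the administrator.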

C.2. Changing the RAID Level of the RAID Array

During the initial installation of the DGX software on CentOS, the data SSDs in the DGX Station are configured as a RAID 0 or RAID 5 array. If your requirements for redundancy or storage capacity change, you can change the RAID level of the array from the level that was initially configured.

Before changing the RAID level of the DGX Station RAID array, back up all data on the array that you want to preserve. Changing the RAID level of the DGX Station RAID array erases all data stored on the array.

The DGX Station software includes the custom script configure_raid_array.py, which you can use to change the level of the RAID array without unmounting the RAID volume.

  • To change the RAID level to RAID 5, run the following command:

    $ sudo configure_raid_array.py -m raid5
    Note:

    After you change the RAID level to RAID 5, the RAID array is rebuilt. A RAID array that is being rebuilt is online and ready to be used, but a check on the health of the DGX Station reports the status of the RAID volume as unhealthy. Therefore, avoid checking the health of the DGX Station while the RAID array is being rebuilt. For more information, see EL7-20.01 Only: Checking the Health of the DGX Station.

    The time required to rebuild the RAID array depends on the workload on the system. On an idle system, the rebuild might be complete within 30 minutes.

  • To change the RAID level to RAID 0, run the following command:

    $ sudo configure_raid_array.py -m raid0

To confirm that the RAID level was changed as required, run the lsblk command. The entry in the TYPE column for each SSD in the RAID array indicates the RAID level of the array.

The following example shows that the RAID level of the array is RAID 0. The name of the RAID volume is md0 and the mount point of the volume is /raid.

$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda      8:0    0  1.8T  0 disk
|_sda1   8:1    0  487M  0 part  /boot/efi
|_sda2   8:2    0  1.8T  0 part  /
sdb      8:16   0  1.8T  0 disk
|_md0    9:0    0  5.2T  0 raid0 /raid
sdc      8:32   0  1.8T  0 disk
|_md0    9:0    0  5.2T  0 raid0 /raid
sdd      8:48   0  1.8T  0 disk
|_md0    9:0    0  5.2T  0 raid0 /raid
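
The same check can be scripted. The following is a minimal sketch, assuming the lsblk command from util-linux with its -r (raw output), -n (no headings), and -o (column selection) options. The function name raid_level_of is hypothetical; it reads lsblk output on standard input so that it can be tried on saved output as well.

```shell
# Hypothetical helper: read "NAME TYPE" pairs (as printed by
# `lsblk -rn -o NAME,TYPE`) on standard input and print the TYPE
# of the named RAID volume once, for example "raid0" or "raid5".
raid_level_of() {
    volume=$1
    # The volume appears once per member SSD, so report only the first match.
    awk -v vol="$volume" '$1 == vol && !seen { print $2; seen = 1 }'
}

# Usage on a DGX Station (md0 is the volume name from the example above):
#   lsblk -rn -o NAME,TYPE | raid_level_of md0
```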

C.3. EL7-20.01 Only: Checking the Health of the DGX Station

Note: Starting with release EL7-20.02, the NVIDIA System Health Checker (nvhealth) tool is replaced by NVIDIA System Management (NVSM). For information about how to use NVSM to perform this task, see Show Health in the NVIDIA System Management User Guide.

The DGX Station provides the NVIDIA System Health Checker (nvhealth) tool to exercise the system and verify its health. The output of nvhealth is an itemized list of checks and their status, typically Healthy or Unhealthy. On a healthy system, all checks should return Healthy. You should investigate any checks that return Unhealthy to determine their root cause and resolve them.

To check the health of the DGX Station, run the following command:

$ sudo nvhealth [-k output-file]

output-file

The name and path of the file to which the raw state of the system is written. The nvhealth command displays this file name at the end of its output.

If you omit the output file, the information is written to the file /tmp/nvhealth-log.random-string.jsonl, for example, /tmp/nvhealth-log.6wf3WriAC3.jsonl.

Note:

If you run the nvhealth command while the RAID array is being rebuilt after a change in RAID level to RAID 5, nvhealth reports the status of the RAID volume as unhealthy. To avoid this potentially misleading result, wait until the RAID array is rebuilt before running nvhealth.

To check the progress of the rebuild and show the percentage complete and an estimate of the time to completion, run this command:

$ cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdb[0] sdc[1] sdd[2]
     181764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
     [===>.................]  recovery = 17.2% (10426232/60588032) finish=45.8min speed=18238K/sec
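
Waiting for the rebuild before running nvhealth can be scripted by reading /proc/mdstat. The following is a minimal sketch, assuming that a running rebuild appears as a "recovery =" line as in the examples in this appendix. The function names mdstat_progress and rebuild_in_progress are hypothetical and are not part of the DGX software.

```shell
# Hypothetical helpers that parse text in the format of /proc/mdstat.
# Each takes an optional file argument and defaults to /proc/mdstat.

# Print the percentage complete and the estimated time remaining of a
# running recovery. Print nothing if no recovery line is present.
mdstat_progress() {
    awk '/recovery =/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /%$/)       pct = $i   # field such as "17.2%"
            if ($i ~ /^finish=/) eta = $i   # field such as "finish=45.8min"
        }
        print pct, eta
    }' "${1:-/proc/mdstat}"
}

# Succeed (exit status 0) while a recovery is in progress.
rebuild_in_progress() {
    grep -q 'recovery =' "${1:-/proc/mdstat}"
}

# Usage on a DGX Station: poll once a minute, then check system health.
#   while rebuild_in_progress; do mdstat_progress; sleep 60; done
#   sudo nvhealth
```

Both functions accept a file argument so that they can be tried on a saved copy of /proc/mdstat before they are used on a live system.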

C.4. EL7-20.01 Only: Collecting Information for Troubleshooting the DGX Station

Note: Starting with release EL7-20.02, the tool for collecting troubleshooting information (nvsysinfo) is replaced by NVIDIA System Management (NVSM). For information about how to use NVSM to perform this task, see Dump Health in the NVIDIA System Management User Guide.

To help diagnose and resolve issues, the DGX Station provides a tool to collect troubleshooting information for NVIDIA Support Enterprise Services.

The tool verifies basic functionality and performance of the DGX Station and collects the following information in an xz-compressed tar archive:

  • Log files
  • Hardware inventory
  • Software inventory

To collect information for troubleshooting the DGX Station, run the following command:

$ sudo nvsysinfo [-o output-file]
output-file

The path of the file in which the information is written.

If you omit the output file, the information is written to the file /tmp/nvsysinfo-host-name-timestamp.tar.xz.

Use any method that is convenient for you to send the file to NVIDIA Support Enterprise Services. For example, send the file as an e-mail attachment.
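
When the output file is omitted, the archive name contains a host name and a timestamp, so it can be convenient to look up the newest archive before sending it. The following is a minimal sketch; the function name latest_nvsysinfo_archive is hypothetical, and the file name pattern is assumed from the default path given above.

```shell
# Hypothetical helper: print the most recently modified nvsysinfo
# archive in a directory (default /tmp, the default output location).
# Print nothing if no matching archive exists.
latest_nvsysinfo_archive() {
    dir=${1:-/tmp}
    # ls -t sorts by modification time, newest first.
    ls -t "$dir"/nvsysinfo-*.tar.xz 2>/dev/null | head -n 1
}

# Usage: list the contents of the newest archive before sending it
# (assumes GNU tar, whose -J option selects xz compression).
#   tar -tJf "$(latest_nvsysinfo_archive)"
```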