Internal Storage Configuration

For deep learning to be effective and to take full advantage of the DGX family, the various aspects of the system, including storage and I/O, have to be balanced. This is particularly important for feeding data to the GPUs to keep them busy and to dramatically reduce model run times. This section presents some best practices for storage within and outside of the DGX system. It also discusses storage considerations as the number of DGX units, primarily DGX-1 and DGX-2 systems, is scaled out.

  • NFS Cache

  • Data drive

The first storage consideration is storage within the DGX itself. The focus of the internal storage, outside of the OS drive, is performance.

NFS Cache for Deep Learning

Deep Learning I/O patterns typically consist of multiple iterations of reading the training data. The first epoch of training reads the data that is used to start training the model. Subsequent passes through the data can avoid rereading it from NFS if adequate local caching is provided on the node. If you can estimate the maximum size of your data, you can architect your system to provide enough cache so that the data only needs to be read once during any training job. A set of very fast SSDs can provide an inexpensive and scalable way of providing adequate caching for your applications. The DGX family NFS read cache was created for precisely this purpose, offering roughly 5, 7, and 30+ TB of fast local cache on the DGX Station, DGX-1, and DGX-2, respectively.

For training the best possible model, the input data is randomized. This adds some additional statistical noise to the training and also keeps the model from being “overfit” to the training data (in other words, trained very well on the training data but performing poorly on the validation data). Randomizing the order of the data for training puts pressure on data access: the I/O pattern becomes random rather than streaming. The DGX family NFS cache is SSD-based with a very high level of random IOPS performance.

The benefit of adequate caching is that your external filesystem does not have to provide maximum performance during a cold start (the first epoch), since this first pass through the data is only a small part of the overall training time. For example, typical training sessions can iterate over the data 100 times. If we assume a 5x slower read access time during the first cold start iteration vs the remaining iterations with cached access, then the total run time of training increases by the following amount.

  • 5x slower shared storage for the 1st iteration + 99 locally cached iterations -> 4% increase in runtime over 100 iterations

Even if your external file system cannot sustain peak training I/O performance, the impact on overall training time is small. Keep this in mind when designing your storage system so that it is as cost-effective as possible for your workloads.
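The arithmetic above can be sketched in a few lines of shell; the 5x cold-start penalty and 100 iterations are the example figures from the text:

```shell
# Cold-start cost model from the text: 1 uncached epoch at a 5x read penalty,
# then 99 cached epochs, compared against 100 fully cached epochs.
epochs=100
slowdown=5        # assumed cold-start read penalty vs. cached reads
cold_epochs=1
total=$(( slowdown * cold_epochs + (epochs - cold_epochs) ))  # 104 epoch-units
increase=$(( (total - epochs) * 100 / epochs ))               # percent
echo "${increase}% increase in runtime over ${epochs} iterations"
```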

For either the DGX Station or the DGX-1, you cannot put additional drives into the system without voiding your warranty. For the DGX-2, you can add an additional eight U.2 NVMe drives to those already in the system.

RAID-0

The internal SSD drives are configured as a RAID-0 array, formatted with ext4, and mounted as a file system. This file system is then used as an NFS read cache to cache data reads. Recall that its number one focus is performance.

RAID-0 stripes the contents of each file across all disks in the RAID group, but doesn't perform any mirroring or parity checks. This reduces the availability of the RAID group, but it also improves its performance and capacity. The capacity of a RAID-0 group is the sum of the capacities of the drives in the set.

The theoretical performance of a RAID-0 group, which determines the throughput of read and write operations to any file, is the number of drives multiplied by the performance of a single drive. For example, if the drives are capable of a sequential read throughput of 550 MB/s and you have three drives in the RAID group, then the theoretical sequential throughput is 3 x 550 MB/s = 1650 MB/s.
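The same multiplication can be checked with shell arithmetic; 550 MB/s and three drives are the example figures above:

```shell
# Theoretical sequential throughput of a RAID-0 group: per-drive rate
# multiplied by the number of drives (no redundancy overhead in RAID-0).
drives=3
per_drive=550                           # MB/s sequential read per drive
theoretical=$(( drives * per_drive ))   # MB/s for the whole group
echo "${theoretical} MB/s theoretical sequential read throughput"
```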

DGX Internal Storage

The DGX-2 has 8 or 16 3.84 TB NVMe drives that are managed by the OS using mdadm (software RAID). On systems with 8 NVMe drives, you can add an additional 8.

The DGX-1 has a total of five or six 1.92 TB SSDs, which are plugged into an LSI controller (hardware RAID). Two RAID arrays are configured:

  • Either a single-drive RAID-0 array or a dual-drive RAID-1 array for the OS

  • A four-drive RAID-0 array to be used as a read cache for NFS file systems

The Storage Command Line Tool (StorCLI) is used to manage the LSI card.

Note: You cannot put additional cache drives into the DGX-1 without voiding your warranty.

The DGX Station has three 1.92 TB SSDs in a RAID-0 group. The Linux software RAID tool, mdadm, is used to manage and monitor this RAID-0 group.

Warning

You cannot put additional cache drives into the DGX Station without voiding your warranty.

Monitoring the RAID Array

This section explains how to use mdadm to monitor the RAID array in DGX-2 and DGX Station systems.

The RAID-0 group is created and managed by the Linux software RAID tool, mdadm. This approach is referred to as “software RAID” because all of the common RAID functions are carried out by the host CPUs and the host OS instead of a dedicated RAID controller processor. Linux software RAID configurations can include anything presented to the Linux kernel as a block device. Examples include whole hard drives (for example, /dev/sda) and their partitions (for example, /dev/sda1).

Of particular importance is that, since version 3.7 of the mainline Linux kernel, mdadm supports TRIM operations on the underlying solid-state drives (SSDs) for linear, RAID 0, RAID 1, RAID 5, and RAID 10 layouts. TRIM is very important because it helps with garbage collection on SSDs, which reduces write amplification and wear on the drives.
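As a quick check, you can confirm that the cache filesystem is mounted with the discard option (continuous TRIM). The sketch below runs against a captured mount line; on a live system you could inspect the output of mount | grep /raid instead:

```shell
# Hedged sketch: detect the 'discard' (TRIM) mount option in a mount line.
# The sample string is captured output; replace it with live 'mount' output.
mount_line='/dev/md0 on /raid type ext4 (rw,relatime,discard,stripe=384,data=ordered)'
case "$mount_line" in
    *discard*) trim_status="enabled" ;;
    *)         trim_status="disabled" ;;
esac
echo "TRIM (discard): ${trim_status}"
```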

There are some very simple mdadm commands that you can use to monitor the status of the RAID array. The first thing to do is find the mount point for the RAID group. You can do this by simply running the command mount. Look through the output for mdadm-created devices with the naming format /dev/md*.

$ mount

sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
...
/dev/md0 on /raid type ext4 (rw,relatime,discard,stripe=384,data=ordered)
/dev/sdb1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
...

Here, /dev/md0 is a RAID-0 array that acts as a read cache for NFS file systems.

One of the first commands to run checks the status of the RAID group. The command is simple: cat /proc/mdstat.

$ cat /proc/mdstat
Personalities : [raid0] [linear] [multipath] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid0 sde[2] sdd[1] sdc[0]
            5625729024 blocks super 1.2 512k chunks

unused devices: <none>

This command reads the mdstat file in the /proc filesystem. The output looks compact, but there is a great deal of information in it. The first line of output lists the RAID personalities (levels and layouts) that the kernel's md driver currently supports.

The next lines of output will present some details on each md device. In this case, the DGX Station only has one RAID group, /dev/md0. The output for /dev/md0 means that it is an active RAID-0 group. It has three devices:

  • sdc

  • sdd

  • sde

It also lists the number of blocks in the device, the version of the superblock (1.2), and the chunk size (512k), which is the amount of data written to each device when mdadm stripes a file across the array. This information is repeated for each md device (if there is more than one).
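If you want to pull the member devices out of that output programmatically, a small pipeline over the md0 line is enough. This is a sketch against the sample line above; on a live system you could feed in grep '^md0' /proc/mdstat instead:

```shell
# Extract member-device names (sdc, sdd, sde) from an md array status line.
mdstat_line='md0 : active raid0 sde[2] sdd[1] sdc[0]'
members=$(echo "$mdstat_line" | grep -o '[a-z]*\[' | tr -d '[' | tr '\n' ' ')
set -- $members            # word-split the names into positional parameters
count=$#
echo "array md0 has ${count} member devices: ${members}"
```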

You can also use mdadm to examine or query the individual block devices in a RAID group, as well as the RAID group itself. A simple example from a DGX Station is below; this command queries the RAID group.

$ mdadm --query /dev/md0
/dev/md0: 5365.11GiB raid0 3 devices, 0 spares. Use mdadm --detail for more detail.

Notice that there are 3 devices with a total capacity of 5,365.11 GiB (note that GiB is different from GB). If this were a RAID level that supported redundancy rather than one focused on maximizing performance, you could allocate drives as ‘spares’ in case an active one failed. Because DGX systems use RAID-0 across all available cache drives, there are no spares.
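The GiB/GB distinction can be verified directly from the block count reported by /proc/mdstat (5625729024 one-KiB blocks):

```shell
# Convert the array's block count to decimal GB (10^9) and binary GiB (2^30).
# 5625729024 is the block count from /proc/mdstat; each block is 1 KiB.
blocks=5625729024
bytes=$(( blocks * 1024 ))
gb=$(( bytes / 1000000000 ))    # decimal gigabytes
gib=$(( bytes / 1073741824 ))   # binary gibibytes
echo "${gb} GB = ${gib} GiB"
```

The truncated integer results, 5760 GB and 5365 GiB, line up with the 5760.75 GB and 5365.11 GiB figures in the mdadm output.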

Next is an example of querying a block device that is part of a RAID group.

$ mdadm --query /dev/sdc
/dev/sdc: is not an md array
/dev/sdc: device 0 in 3 device active raid0 /dev/md0.  Use mdadm --examine for more detail.

The query informs you that the drive is not itself a RAID group but is part of one (/dev/md0). It also advises you to examine the device using the --examine (-E) option.

By querying both the block devices and the RAID group itself, you can piece together how the block devices form the RAID group. Also notice that the commands are run by the root user (or a user with root privileges).

To get even more detail about the md RAID group, you can use the --examine option. It prints the md superblock (if present) from a block device that could be an array component.

$ mdadm --examine /dev/sdc
/dev/sdc:
           Magic : a92b4efc
         Version : 1.2
     Feature Map : 0x0
      Array UUID : 1feabd66:ec5037af:9a40a569:d7023bc5
            Name : demouser-DGX-Station:0  (local to host demouser-DGX-Station)
   Creation Time : Wed Mar 14 16:01:24 2018
      Raid Level : raid0
    Raid Devices : 3

  Avail Dev Size : 3750486704 (1788.37 GiB 1920.25 GB)
     Data Offset : 262144 sectors
    Super Offset : 8 sectors
    Unused Space : before=262056 sectors, after=0 sectors
           State : clean
     Device UUID : 482e0074:35289a95:7d15e226:fe5cbf30

     Update Time : Wed Mar 14 16:01:24 2018
   Bad Block Log : 512 entries available at offset 72 sectors
        Checksum : ee25db67 - correct
          Events : 0

      Chunk Size : 512K

     Device Role : Active device 0
     Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)

It provides information about the RAID array (group), including:

  • Creation time

  • UUID of the array (RAID group)

  • RAID level (RAID-0 here)

  • Number of RAID devices

  • Size of the device, in both GiB and GB (they are different)

  • The state of the device (clean)

  • Number of active devices in the RAID array (3)

  • The role of the device (here, device 0 in the RAID array)

  • The checksum and whether it is correct

  • The number of events on the array

Another way to get much the same information, with some extra detail, is to use the --detail option with the RAID array, as below.

$ mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Mar 14 16:01:24 2018
     Raid Level : raid0
     Array Size : 5625729024 (5365.11 GiB 5760.75 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Wed Mar 14 16:01:24 2018
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K

           Name : demouser-DGX-Station:0  (local to host demouser-DGX-Station)
           UUID : 1feabd66:ec5037af:9a40a569:d7023bc5
         Events : 0

       Number   Major   Minor   RaidDevice State
       0    8       32      0   active sync   /dev/sdc
       1    8       48      1   active sync   /dev/sdd
       2    8       64      2   active sync   /dev/sde