Storage
For deep learning to be effective and to take full advantage of the DGX family, the various aspects of the system have to be balanced. This includes storage and I/O. This is particularly important for feeding data to the GPUs to keep them busy and dramatically reduce run times for models. This section presents some best practices for storage within and outside of the DGX-2, DGX-1, or DGX Station. It also discusses storage considerations as the number of DGX units, primarily DGX-1 and DGX-2 systems, is scaled out.
Internal Storage (NFS Cache)
The first storage consideration is storage within the DGX itself. The focus of the internal storage, outside of the OS drive, is performance.
NFS Cache for Deep Learning
Deep Learning I/O patterns typically consist of multiple iterations of reading the training data. The first epoch of training reads the data that is used to start training the model. Subsequent passes through the data can avoid rereading the data from NFS if adequate local caching is provided on the node. If you can estimate the maximum size of your data, you can architect your system to provide enough cache so that the data only needs to be read once during any training job. A set of very fast SSD disks can provide an inexpensive and scalable way of providing adequate caching for your applications. The DGX family NFS read cache was created for precisely this purpose, offering roughly 5, 7, and 30+ TB of fast local cache on DGX Station, DGX-1, and DGX-2, respectively.
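The sizing logic can be sketched as a quick shell check (the 4 TB dataset size here is purely illustrative; the 7 TB cache figure is the approximate DGX-1 value quoted above):

```shell
# Does the training set fit entirely in the local NFS read cache?
# DATASET_TB is a hypothetical example value, not a measurement.
DATASET_TB=4
CACHE_TB=7
if [ "$DATASET_TB" -le "$CACHE_TB" ]; then
  echo "dataset fits in cache: data is read from NFS only once"
else
  echo "dataset exceeds cache: expect NFS rereads after the first epoch"
fi
```

If the dataset exceeds the cache, every epoch after the first will go back to NFS for at least part of the data, so the shared storage must be sized for that sustained read load.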
For training the best possible model, the input data is randomized. This adds some additional statistical noise to the training and also keeps the model from being “overfit” on the training data (in other words, trained very well on the training data but doesn’t do well on the validation data). Randomizing the order of the data for training puts pressure on the data access. The I/O pattern becomes random oriented rather than streaming oriented. The DGX family NFS cache is SSD-based with a very high level of random IOPs performance.
For either the DGX Station or the DGX-1, you cannot put additional drives into the system without voiding your warranty. For the DGX-2, you can add 8 additional U.2 NVMe drives to those already in the system.
RAID-0
The internal SSD drives are configured as a RAID-0 array, formatted with ext4, and mounted as a file system. This is then used as an NFS read cache to cache data reads. Recall that its number one focus is performance.
RAID-0 stripes the contents of each file across all disks in the RAID group but doesn't perform any mirroring or parity checks. This reduces the availability of the RAID group, but it also improves its performance and capacity. The capacity of a RAID-0 group is the sum of the capacities of the drives in the set.
The performance of a RAID-0 group, which results in improved throughput of read and write operations to any file, is the number of drives multiplied by their performance. As an example, if the drives are capable of a sequential read throughput of 550 MB/s and you have three drives in the RAID group, then the theoretical sequential throughput is 3 x 550MB/s = 1650 MB/s.
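The arithmetic above can be checked directly in the shell, using the same example figures (three drives at 550 MB/s each):

```shell
# RAID-0 theoretical sequential throughput: per-drive throughput
# multiplied by the number of drives in the group.
DRIVES=3
PER_DRIVE_MBS=550
TOTAL_MBS=$((DRIVES * PER_DRIVE_MBS))
echo "theoretical sequential throughput: ${TOTAL_MBS} MB/s"
```

Real-world throughput will fall somewhat below this theoretical figure because of controller, bus, and file system overheads.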
DGX Internal Storage
The DGX-2 has 8 or 16 3.84 TB NVMe drives that are managed by the OS using mdadm (software RAID). On systems with 8 NVMe drives, you can add an additional 8.
The DGX-1 drives are managed by a hardware RAID controller (an LSI card) and are configured as:
- Either a single-drive RAID-0 or a dual-drive RAID-1 array for the OS, and a
- Four-drive RAID-0 array to be used as read cache for NFS file systems.
The RAID controller is managed with the Storage Command Line Tool (StorCLI).
The DGX Station has three 1.92 TB SSDs in a RAID-0 group. The Linux software RAID tool, mdadm, is used to manage and monitor the RAID-0 group.
Monitoring the RAID Array
This section explains how to use mdadm to monitor the RAID array in DGX-2 and DGX Station systems.
The RAID-0 group is created and managed by Linux software, mdadm. mdadm is also referred to as “software RAID” because all of the common RAID functions are carried out by the host CPUs and the host OS instead of a dedicated RAID controller processor. Linux software RAID configurations can include anything presented to the Linux kernel as a block device. Examples include whole hard drives (for example, /dev/sda), and their partitions (for example, /dev/sda1).
Of particular importance is that since version 3.7 of the Linux kernel mainline, mdadm supports TRIM operations for the underlying solid-state drives (SSDs), for linear, RAID 0, RAID 1, RAID 5 and RAID 10 layouts. TRIM is very important because it helps with garbage collection on SSDs. This reduces write amplification and reduces the wear on the drive.
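Continuous TRIM is enabled by mounting the file system with the discard option, which is how the DGX cache array appears in the mount output below. As an illustrative sketch (device name and mount point follow the DGX layout; do not copy this verbatim into a live system):

```shell
# Illustrative /etc/fstab entry: mount the RAID-0 cache with continuous
# TRIM enabled via the 'discard' option. Device and mount point are the
# DGX defaults shown elsewhere in this section.
/dev/md0  /raid  ext4  rw,relatime,discard  0  2
```

An alternative to continuous TRIM is periodic TRIM, i.e., running fstrim on the mount point from a scheduled job.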
```
# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
...
/dev/md0 on /raid type ext4 (rw,relatime,discard,stripe=384,data=ordered)
/dev/sdb1 on /boot/efi type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
...
```
This /dev/md0 is a RAID-0 array that acts as a read cache for NFS file systems.
```
# cat /proc/mdstat
Personalities : [raid0] [linear] [multipath] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid0 sde[2] sdd[1] sdc[0]
      5625729024 blocks super 1.2 512k chunks

unused devices: <none>
```
This command reads the mdstat file in the /proc file system. The output looks compact, but it contains a great deal of information. The first line lists the RAID personalities (the RAID levels and layouts) supported by this kernel. The next lines show that md0 is an active RAID-0 array composed of three drives:
- sdc
- sdd
- sde
It also lists the number of blocks in the device, the version of the superblock (1.2), and the chunk size (512k), which is the amount of data written to each device when mdadm breaks up a file. This information is repeated for each md device (if there is more than one).
```
# mdadm --query /dev/md0
/dev/md0: 5365.11GiB raid0 3 devices, 0 spares. Use mdadm --detail for more detail.
```
Notice that there are 3 devices with a total capacity of 5,365.11 GiB (which is different from GB). If this were a RAID level that supported redundancy rather than one focused on maximizing performance, you could allocate drives as spares in case an active one failed. Because the DGX systems use RAID-0 across all available cache drives, there are no spares.
```
# mdadm --query /dev/sdc
/dev/sdc: is not an md array
/dev/sdc: device 0 in 3 device active raid0 /dev/md0. Use mdadm --examine for more detail.
```
The query informs you that the drive is not an md array itself but is a member of one (/dev/md0). It also advises examining the RAID group using the --examine (-E) option.
By querying both the block devices and the RAID group itself, you can piece together how the block devices make up the RAID group. Also notice that the commands are run as the root user (or by a user with root privileges).
```
# mdadm --examine /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 1feabd66:ec5037af:9a40a569:d7023bc5
           Name : demouser-DGX-Station:0  (local to host demouser-DGX-Station)
  Creation Time : Wed Mar 14 16:01:24 2018
     Raid Level : raid0
   Raid Devices : 3

 Avail Dev Size : 3750486704 (1788.37 GiB 1920.25 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=0 sectors
          State : clean
    Device UUID : 482e0074:35289a95:7d15e226:fe5cbf30

    Update Time : Wed Mar 14 16:01:24 2018
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : ee25db67 - correct
         Events : 0

     Chunk Size : 512K

    Device Role : Active device 0
    Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
```
Among other details, this output shows:
- Creation time
- UUID of the array (RAID group)
- RAID level (this is RAID-0)
- Number of RAID devices
- Size of the device in both GiB and GB (the two differ: GiB is a power of two, GB a power of ten)
- The state of the device (clean)
- Number of active devices in RAID array (3)
- The role of the device (it is device 0 in the RAID array)
- The checksum and whether it is correct
- The number of events on the array
You can also examine the RAID group as a whole with the --detail option:
```
# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Mar 14 16:01:24 2018
     Raid Level : raid0
     Array Size : 5625729024 (5365.11 GiB 5760.75 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Wed Mar 14 16:01:24 2018
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K

           Name : demouser-DGX-Station:0  (local to host demouser-DGX-Station)
           UUID : 1feabd66:ec5037af:9a40a569:d7023bc5
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       1       8       48        1      active sync   /dev/sdd
       2       8       64        2      active sync   /dev/sde
```
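For routine monitoring, it is handy to pull just the State line out of this output. The sketch below parses a captured sample so it can be tried without an actual array; on a live system you would pipe in `mdadm --detail /dev/md0` instead:

```shell
# Extract the array state from 'mdadm --detail' style output.
parse_state() {
  awk -F' : ' '/ State : / {print $2}'
}

# Captured sample used for illustration; on a DGX, replace with:
#   mdadm --detail /dev/md0 | parse_state
state=$(cat <<'EOF' | parse_state
/dev/md0:
        Version : 1.2
          State : clean
EOF
)
echo "array state: $state"
```

A cron job could run this check and alert if the state is anything other than clean.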
External Storage
As an organization scales out its GPU-enabled data center, there are many shared storage technologies that pair well with GPU applications. Since the performance of a GPU-enabled server is so much greater than that of a traditional CPU server, special care must be taken to ensure that the performance of your storage system does not become a bottleneck to your workflow.
- Running parallel HPC applications may require the storage technology to support multiple processes accessing the same files simultaneously.
- To support accelerated analytics, storage technologies often need to support many threads with quick access to small pieces of data.
- For vision based deep learning, accessing images or video used in classification, object detection or segmentation may require high streaming bandwidth, fast random access, or fast memory mapped (mmap()) performance.
- For other deep learning techniques, such as recurrent networks, working with text or speech can require any combination of fast bandwidth with random and small files.
HPC workloads typically drive high simultaneous multi-system write performance and benefit greatly from traditional scalable parallel file system solutions. You can size HPC storage and network performance to meet the increased dense compute needs of GPU servers. It is not uncommon to see per-node performance increases of 10-40x for a 4-GPU system versus a CPU system for many HPC applications.
Data Analytics workloads, similar to HPC, drive high simultaneous access, but are more read focused than HPC. Again, it is important to size Data Analytics storage to match the dense compute performance of GPU servers. As you adopt accelerated analytics technologies such as GPU-enabled in-memory databases, make sure that you can populate the database from your data warehousing solution quickly to minimize startup time when you change database schemas. This may require a network of 10 GbE or faster. To support clients at this rate, you may have to revisit your data warehouse architecture to identify and eliminate bottlenecks.
Deep learning is a fast-evolving computational paradigm, and it is important to know your requirements in the near and long term to properly architect a storage system. The ImageNet database is often used as a reference when benchmarking deep learning frameworks and networks. The resolution of the images in ImageNet is 256x256. However, it is more common to find images at 1080p or 4K. Images at 1080p resolution are roughly 30 times larger than those in ImageNet. Images at 4K resolution are 4 times larger than that (120x the size of ImageNet images). Uncompressed images are 5-10 times larger than compressed images. If your data cannot be compressed for some reason, for example if you are using a custom image format, the bandwidth requirements increase dramatically.
For AI-driven storage, it is suggested that you make use of deep learning framework features that build databases and archives rather than accessing small files directly; reading and writing many small files will reduce performance on the network and local file systems. Storing files in formats such as HDF5, LMDB, or TFRecord can reduce metadata accesses to the file system, which helps performance. However, these formats can lead to their own challenges with additional memory overhead or requiring support for fast mmap() performance. All this means that you should plan to be able to read data at 150-200 MB/s per GPU for files at 1080p resolution. Consider more if you are working with 4K or uncompressed files.
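Scaled up to a full node, that per-GPU guideline translates into a substantial aggregate requirement. A back-of-the-envelope sketch for an 8-GPU system, using the 150-200 MB/s per-GPU figures above:

```shell
# Aggregate read-bandwidth estimate for one 8-GPU node at 1080p,
# using the per-GPU guideline from the text (150-200 MB/s).
GPUS=8
LOW_MBS=$((GPUS * 150))
HIGH_MBS=$((GPUS * 200))
echo "plan for ${LOW_MBS}-${HIGH_MBS} MB/s of read bandwidth per node"
```

Multiply again by the number of nodes training simultaneously to size the shared storage backend.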
NFS Storage
NFS can provide a good starting point for AI workloads on small GPU server configurations with properly sized storage and network bandwidth. NFS-based solutions can scale well for larger deployments, but be aware of possible single-node and aggregate bandwidth requirements and make sure these match what your vendor of choice can deliver. As you scale your data center to need more than 10 GB/s, or as it grows to hundreds or thousands of nodes, other technologies may be more efficient and scale better.
Generally, it is a good idea to start with NFS using one or more of the Gigabit Ethernet connections on the DGX family. After this is configured, run your applications and check whether IO performance is a bottleneck. Typically, NFS over 10 Gb/s Ethernet provides up to 1.25 GB/s of IO throughput for large block sizes. If, in your testing, you see NFS performance that is significantly lower than this, check the network between the NFS server and a DGX server to make sure there are no bottlenecks (for example, a 1 GigE network connection somewhere, a misconfigured NFS server, or a smaller MTU somewhere in the network). If IO is still a bottleneck, there are several options to experiment with on the NFS client and server, including:
- Increasing Read, Write buffer sizes
- TCP optimizations including larger buffer sizes
- Increasing the MTU size to 9000
- Sync vs. Async
- NFS Server options
- Increasing the number of NFS server daemons
- Increasing the amount of NFS server memory
As an example, the following Linux kernel parameters control network buffer sizes:
```
net.core.rmem_max=67108864
net.core.rmem_default=67108864
net.core.optmem_max=67108864
```
The values after the variable are example values (they are in bytes). You can change these values on the NFS client and the NFS server, and then run experiments to determine if the IO performance improves.
The previous examples are for the kernel read buffer values. You can do the same for the write buffers by using wmem instead of rmem.
You can also tune the TCP parameters in the NFS client to make them larger. For example, you could change the net.ipv4.tcp_rmem=”4096 87380 33554432” system parameter.
This changes the TCP buffer sizes for IPv4 to a minimum of 4,096 bytes, a default of 87,380 bytes, and a maximum of 33,554,432 bytes.
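To make such settings persist across reboots, they can be placed in a file under /etc/sysctl.d/ and loaded with `sysctl -p`. The file name below is a hypothetical example, and the values simply repeat the experimental examples from the text (the wmem value mirrors the rmem one, as suggested above); they are starting points for experimentation, not tuned recommendations:

```shell
# Hypothetical /etc/sysctl.d/90-nfs-tuning.conf
# Values are the example figures from the text; experiment to find
# what suits your network before adopting them.
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
```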
If you can control the NFS server, one suggestion is to increase the number of NFS daemons on the server.
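On Debian- and Ubuntu-based NFS servers, the thread count is typically set through RPCNFSDCOUNT in /etc/default/nfs-kernel-server (the file location and variable name vary by distribution), for example:

```shell
# /etc/default/nfs-kernel-server (Debian/Ubuntu; other distributions differ).
# Raise the number of nfsd threads from the common default of 8; the
# value 64 here is an illustrative starting point, not a recommendation.
RPCNFSDCOUNT=64
```

After changing the value, restart the NFS server service for it to take effect.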
One way to determine whether more NFS threads would help performance is to check the data in the /proc/net/rpc/nfs entry for the load on the NFS daemons. The output line that starts with th lists the number of threads, and the last 10 numbers are a histogram of the number of seconds the first 10% of threads were busy, the second 10%, and so on.
Ideally, you want the last two numbers to be zero or close to zero, indicating that the thread pool is rarely saturated. If the last two numbers are fairly high, you should add NFS daemons, because the NFS server has become the bottleneck. If the last two, three, or four numbers are zero, then some threads are probably not being used.
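The th line can be inspected directly. The sketch below parses a captured sample line so it can be tried without a live NFS server; on a real server you would read /proc/net/rpc/nfs itself (the histogram values here are illustrative, not real measurements):

```shell
# The 'th' line: thread count, seconds all threads were busy, then a
# 10-bucket busy-time histogram. Sample data for illustration only;
# on a live server use: grep '^th' /proc/net/rpc/nfs
sample='th 8 0 1.240 0.180 0.060 0.020 0.010 0.000 0.000 0.000 0.000 0.000'
threads=$(echo "$sample" | awk '/^th/ {print $2}')
last_two=$(echo "$sample" | awk '/^th/ {print $(NF-1), $NF}')
echo "nfsd threads: $threads; last two histogram buckets: $last_two"
```

In this sample the last buckets are zero, so by the rule of thumb above the 8 threads are sufficient.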
One other option, while a little more complex, can prove useful if the IO pattern becomes more write intensive: if you are not getting the IO performance you need, change the mount behavior on the NFS clients from “sync” to “async”.
Switching from “sync” to “async” means that the NFS server responds to the NFS client that the data has been received when the data is in the NFS buffers on the server (in other words, in memory). The data hasn’t actually been written to the storage yet, it’s still in memory. Typically, writing to the storage is much slower than writing to memory, so write performance with “async” is much faster than with “sync”. However, if, for some reason, the NFS server goes down before the data in memory is written to the storage, then the data is lost.
If you try using “async” on the NFS client (in other words, the DGX system), ensure that the data on the NFS server is replicated somewhere else so that if the server goes down, there is always a copy of the original data. The reason is if the NFS clients are using “async” and the NFS server goes down, data that is in memory on the NFS server will be lost and cannot be recovered.
NFS “async” mode is very useful for write IO, both streaming (sequential) and random IO. It is also very useful for “scratch” file systems where data is stored temporarily (in other words, not permanent storage or storage that is not replicated or backed up).
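On the client side, the difference comes down to a mount option. The following illustrative /etc/fstab entry shows an async NFS mount; the server name, export path, and mount point are placeholders, not values from the text:

```shell
# Illustrative /etc/fstab entry mounting an NFS export with 'async'.
# 'nfs-server:/export/data' and '/mnt/scratch' are placeholders.
nfs-server:/export/data  /mnt/scratch  nfs  rw,async  0  0
```

Given the data-loss caveat above, reserve entries like this for scratch or replicated data.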
If you find that the IO performance is not what you expected and your applications are spending a great deal of time waiting for data, then you can also connect NFS to the DGX system over InfiniBand using IPoIB (IP over IB). This is part of the DGX family software stack and can be easily configured. The main point is that the NFS server should be InfiniBand attached as well as the NFS clients. This can greatly improve IO performance.
Distributed Filesystems
Distributed filesystems such as EXAScaler, GRIDScaler, Ceph, Lustre, MapR-FS, General Parallel File System, Weka.io, and Gluster can provide features like improved aggregate IO performance, scalability, and/or reliability (fault tolerance). These filesystems are supported by their respective providers unless otherwise noted.
Scaling Out Recommendations
Based on the general IO patterns of deep learning frameworks (see External Storage), below are suggestions for storage needs based on the use case. These are suggestions only and are to be viewed as general guidelines.
| Use Case | Adequate Read Cache? | Network Type Recommended | Network File System Options |
|---|---|---|---|
| Data Analytics | NA | 10 GbE | Object storage, NFS, or other system with good multithreaded read and small file performance |
| HPC | NA | 10/40/100 GbE, InfiniBand | NFS or HPC-targeted file system with support for large numbers of clients and fast single-node performance |
| DL, 256x256 images | Yes | 10 GbE | NFS or storage with good small file support |
| DL, 1080p images | Yes | 10/40 GbE, InfiniBand | High-end NFS, HPC file system, or storage with fast streaming performance |
| DL, 4K images | Yes | 40 GbE, InfiniBand | HPC file system, high-end NFS, or storage with fast streaming performance capable of 3+ GB/s per node |
| DL, uncompressed images | Yes | InfiniBand, 40/100 GbE | HPC file system, high-end NFS, or storage with fast streaming performance capable of 3+ GB/s per node |
| DL, datasets that are not cached | No | InfiniBand, 10/40/100 GbE | Same as above; aggregate storage performance must scale to meet the needs of all applications simultaneously |
As always, it is best to understand your own applications’ requirements to architect the optimal storage system.
Lastly, this discussion has focused only on performance needs. Reliability, resiliency and manageability are as important as the performance characteristics. When choosing between different solutions that meet your performance needs, make sure that you have considered all aspects of running a storage system and the needs of your organization to select the solution that will provide the maximum overall value.