Storage Architecture
Data, and lots of it, is the key to developing accurate deep learning (DL) models. Data volume continues to grow exponentially, and the data used to train individual models continues to grow as well. Data format, not just volume, can be a key factor in the rate at which data is accessed. The DGX H100 system delivers up to nine times the performance of its predecessor. To achieve this in practice, storage system performance must scale commensurately.
The key I/O operation in DL training is re-read. Data is not just read once; it is reused again and again because of the iterative nature of DL training. Pure read performance is still important, because some model types can train in a fraction of an epoch (for example, some recommender models), and inference with existing models can be highly I/O intensive, much more so than training. Write performance can also be important. As DL models grow in size and time-to-train, writing checkpoints is necessary for fault tolerance. Checkpoint files can be terabytes in size and, while not written frequently, are typically written synchronously, which blocks forward progress of the DL model.
Ideally, data is cached during the first read of the dataset, so it does not have to be retrieved across the network again on later reads. Shared filesystems typically use RAM as the first layer of cache. Reading files from cache can be an order of magnitude faster than reading them from remote storage. In addition, the DGX H100 system provides local NVMe storage that can also be used for caching or staging data.
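The cache-on-first-read idea can be sketched at the application level as follows. This is a minimal illustration, not how a shared filesystem client implements its cache: the function name `stage_to_local_cache` and the directory layout are hypothetical, and files are keyed by name only, so two remote files with the same name would collide.

```python
import shutil
from pathlib import Path


def stage_to_local_cache(remote_path, cache_dir):
    """Return a local copy of remote_path, copying it on first access only.

    Hypothetical helper: cache_dir is assumed to sit on local NVMe.
    """
    remote_path = Path(remote_path)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / remote_path.name

    if not local_path.exists():
        # First read: pull the file across the network once.
        tmp_path = local_path.with_name(local_path.name + ".partial")
        shutil.copyfile(remote_path, tmp_path)
        # Rename only after the copy completes, so a partially copied
        # file is never mistaken for a cache hit.
        tmp_path.replace(local_path)

    # Later reads (every subsequent epoch) are served from local storage.
    return local_path
```

A data loader would call this helper on each file before opening it, for example `stage_to_local_cache("/mnt/shared/train/shard-0000.bin", "/raid/cache")`, where both paths are placeholders. The sketch does not handle cache eviction or detect changes to the remote file.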
DGX SuperPOD is designed to support all workloads, but the storage performance required to maximize training performance can vary depending on the type of model and dataset. The guidelines in Table 5 and Table 6 are provided to help determine the I/O levels required for different types of models.
Table 5. Storage performance requirements
Performance Level | Work Description | Dataset Size
---|---|---
Good | Natural Language Processing (NLP) | Datasets generally fit within local cache
Better | Image processing with compressed images, ImageNet/ResNet-50 | Many to most datasets can fit within the local node’s cache
Best | Training with 1080p, 4K, or uncompressed images, offline inference, ETL | Datasets are too large to fit into cache, massive first epoch I/O requirements, workflows that only read the dataset once
Table 6. Guidelines for storage performance
Performance Characteristic | Good (GBps) | Better (GBps) | Best (GBps)
---|---|---|---
Single node read | 4 | 8 | 40
Single node write | 2 | 4 | 20
Single SU aggregate system read | 15 | 40 | 125
Single SU aggregate system write | 7 | 20 | 62
4 SU aggregate system read | 60 | 160 | 500
4 SU aggregate system write | 30 | 80 | 250
Even for the best category in Table 6, it is desirable that the single node read performance is closer to the maximum network performance of 80 GBps.
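When benchmarking a storage system against these guidelines, it can help to encode Table 6 as a lookup. The sketch below is one possible way to do that; the dictionary keys and the `classify` function are illustrative names, not part of any NVIDIA tool, and the measured throughput is assumed to come from your own benchmark.

```python
# Table 6 guideline values in GBps, keyed by characteristic, then by tier.
TABLE_6_GBPS = {
    "single_node_read": {"Good": 4, "Better": 8, "Best": 40},
    "single_node_write": {"Good": 2, "Better": 4, "Best": 20},
    "single_su_read": {"Good": 15, "Better": 40, "Best": 125},
    "single_su_write": {"Good": 7, "Better": 20, "Best": 62},
    "four_su_read": {"Good": 60, "Better": 160, "Best": 500},
    "four_su_write": {"Good": 30, "Better": 80, "Best": 250},
}


def classify(characteristic, measured_gbps):
    """Return the highest tier whose guideline the measurement meets.

    Returns None when the measurement falls below the "Good" guideline.
    """
    tiers = TABLE_6_GBPS[characteristic]
    met = None
    for tier in ("Good", "Better", "Best"):
        if measured_gbps >= tiers[tier]:
            met = tier
    return met
```

For example, a measured single node read rate of 10 GBps meets the "Better" guideline but not "Best", so `classify("single_node_read", 10)` returns `"Better"`.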
Note
As datasets get larger, they may no longer fit in cache on the local system. Pairing large datasets that do not fit in cache with very fast GPUs can make it difficult to achieve maximum training performance. NVIDIA GPUDirect Storage® (GDS) provides a way to read data from the remote filesystem or local NVMe directly into GPU memory, providing higher sustained I/O performance with lower latency. Using the storage fabric on the DGX SuperPOD, a GDS-enabled application should be able to read data at over 40 GBps directly into the GPUs.
High-speed storage provides a shared view of an organization’s data to all nodes. It must be optimized for small, random I/O patterns, and provide high peak node performance and high aggregate filesystem performance to meet the variety of workloads an organization may encounter. High-speed storage should support efficient multi-threaded reads and writes from a single system, although most DL workloads will be read-dominant.
Use cases in automotive and other computer vision-related tasks, where 1080p images are used for training (and in some cases are uncompressed), involve datasets that easily exceed 30 TB in size. In these cases, 4 GBps of read performance per GPU is needed.
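The per-GPU figure translates into a per-node requirement through simple arithmetic. The sketch below assumes 8 GPUs per DGX H100 system and decimal units (1 TB = 1000 GB); the function names are illustrative. It also treats one node as reading the whole dataset, which is a simplification, because in multi-node training each node typically reads only part of it.

```python
GPUS_PER_NODE = 8  # assumption: one DGX H100 system has 8 GPUs


def node_read_gbps(per_gpu_gbps, gpus=GPUS_PER_NODE):
    """Read bandwidth one node needs to feed all of its GPUs, in GBps."""
    return per_gpu_gbps * gpus


def full_pass_seconds(dataset_tb, read_gbps):
    """Seconds to read the dataset once at read_gbps (1 TB = 1000 GB)."""
    return dataset_tb * 1000 / read_gbps


required = node_read_gbps(4)               # 4 GBps per GPU -> 32 GBps per node
one_pass = full_pass_seconds(30, required)  # 30 TB at 32 GBps -> 937.5 s
```

At 4 GBps per GPU, a node needs 32 GBps of sustained read bandwidth, which is much closer to the "Best" single node read guideline of 40 GBps than to the "Better" guideline of 8 GBps. One uncached pass over a 30 TB dataset at that rate takes about 16 minutes.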
While NLP cases often do not require as much read performance for training, peak read and write performance is needed for creating and reading checkpoint files. Checkpointing is a synchronous operation, and training stops during this phase. If you are looking for the best end-to-end training performance, do not ignore I/O operations for checkpoints.
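The cost of synchronous checkpoints can be estimated with a short calculation. The sketch below is a rough model under stated assumptions: the checkpoint is written at a constant aggregate rate, training is fully blocked for the duration of the write, and 1 TB = 1000 GB. The function names, checkpoint size, and checkpoint interval are hypothetical inputs, not measured values.

```python
def checkpoint_stall_seconds(checkpoint_tb, write_gbps):
    """Seconds training is blocked while one checkpoint is written."""
    return checkpoint_tb * 1000 / write_gbps


def stall_fraction(checkpoint_tb, write_gbps, interval_seconds):
    """Fraction of wall-clock time spent blocked on checkpoint writes.

    interval_seconds is the training time between two checkpoints.
    """
    stall = checkpoint_stall_seconds(checkpoint_tb, write_gbps)
    return stall / (interval_seconds + stall)
```

For a hypothetical 1 TB checkpoint, writing at 20 GBps blocks training for 50 seconds. If 950 seconds of training separate checkpoints, 5% of wall-clock time is spent waiting on checkpoint writes; doubling the write rate halves the stall time.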
The preceding metrics assume a variety of workloads and datasets, and the need to train both locally and directly from the high-speed storage system. It is best to characterize workloads and organizational needs before finalizing performance and capacity requirements.