High Performance Storage Architecture

Data, lots of data, is the key to developing accurate deep learning (DL) models. Data volume continues to grow exponentially, and the amount of data used to train individual models grows with it, so storage system performance must scale commensurately. Data format, not just volume, also plays a key role in the rate at which data can be accessed.

The key I/O operation in DL training is re-read. It is not just that data is read; it must be reused repeatedly because of the iterative nature of DL training. Pure read performance is still important, as some model types can train in a fraction of an epoch (for example, some recommender models), and inference of existing models can be highly I/O intensive, often more so than training. Write performance can also be important. As DL models grow and time-to-train increases, writing checkpoints is necessary for fault tolerance. The size of checkpoint files can reach terabytes, and while they are not written frequently, they are typically written synchronously and can block the forward progress of DL models.
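To make the re-read requirement concrete, one rough sizing approach is to divide the dataset size by the target epoch time; if the dataset cannot be cached, that quotient is the sustained read bandwidth the storage system must deliver every epoch. The sketch below uses assumed, illustrative numbers (a 30 TB dataset and a one-hour epoch), not measured values:

```python
# Back-of-the-envelope estimate of the sustained read bandwidth needed to
# re-read a dataset once per epoch. All inputs are illustrative assumptions.

dataset_size_tb = 30     # assumed dataset size in TB
epoch_time_min = 60      # assumed wall-clock time per epoch in minutes

dataset_bytes = dataset_size_tb * 1e12
epoch_seconds = epoch_time_min * 60

# If the dataset cannot be cached, every epoch must re-read it in full.
required_read_gbps = dataset_bytes / epoch_seconds / 1e9
print(f"Sustained read required: {required_read_gbps:.1f} GB/s")
# -> 8.3 GB/s for these assumed values; caching reduces this after epoch one.
```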

Ideally, data is cached during the first read of the dataset so that it does not have to be retrieved across the network. Shared file systems typically use RAM as the first layer of cache. Reading files from cache can be an order of magnitude faster than from remote storage. In addition, the DGX RUBIN NVL8 system provides local NVMe storage that can also be used for caching or staging data.
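A minimal staging sketch is shown below, assuming the local NVMe RAID is mounted at /raid (the conventional DGX mount point) and the shared file system at /lustre; both paths and the dataset name are illustrative:

```python
# Minimal sketch of staging a dataset onto local NVMe before training.
# The /lustre and /raid paths and the dataset name are assumptions.
import shutil
from pathlib import Path

SHARED = Path("/lustre/datasets/imagenet")   # assumed shared-storage path
LOCAL = Path("/raid/cache/imagenet")         # assumed local NVMe cache path

def stage_dataset(shared: Path, local: Path) -> Path:
    """Copy the dataset to local NVMe once; later runs read the local copy."""
    if not local.exists():
        local.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(shared, local)
    return local

train_dir = stage_dataset(SHARED, LOCAL)
print(f"Training will read from {train_dir}")
```

After the first run, subsequent epochs and jobs read at local NVMe speed instead of crossing the network.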

DGX SuperPOD is designed to support all workloads, but the storage performance required to maximize training performance can vary depending on the type of model and dataset. The guidelines below are provided to help determine the I/O levels required for different types of models.

Table 5. Storage Performance Requirements

| Level | Work Description | Dataset Size |
| --- | --- | --- |
| Standard | Multiple concurrent LLM or fine-tuning training jobs with periodic checkpoints, where the compute requirements significantly dominate the data I/O requirements. | Most datasets can fit within the local compute systems' memory cache during training. The datasets are single modality, and models have millions of parameters. |
| Enhanced | Multiple concurrent multimodal training jobs with periodic checkpoints, where data I/O performance is an important factor in end-to-end training time. | Datasets are too large to fit into the local compute systems' memory cache; caching helps but does not obviate the need for frequent I/O during training. The datasets have multiple modalities, and models have billions (or more) of parameters. |

Table 6. Guidelines for Storage Performance

| Performance Characteristic | Standard (GB/s) | Enhanced (GB/s) |
| --- | --- | --- |
| Single node aggregate system read | 1.25 | 3.91 |
| Single node aggregate system write | 0.63 | 1.95 |
| Full SU (72 nodes) aggregate system read | 90 | 281.25 |
| Full SU (72 nodes) aggregate system write | 45 | 140.63 |
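The full-SU rows in Table 6 follow directly from the per-node rows scaled by the 72 nodes in an SU; the quick check below reproduces them (small differences come from the table's two-decimal rounding):

```python
# Sanity check: the full-SU figures in Table 6 are the per-node targets
# scaled by the 72 nodes in an SU (values rounded as in the table).
NODES_PER_SU = 72

targets = {
    # tier: (per-node read GB/s, per-node write GB/s)
    "Standard": (1.25, 0.63),
    "Enhanced": (3.91, 1.95),
}

for tier, (read_gbps, write_gbps) in targets.items():
    print(f"{tier}: SU read ~ {read_gbps * NODES_PER_SU:.2f} GB/s, "
          f"SU write ~ {write_gbps * NODES_PER_SU:.2f} GB/s")
# Standard: SU read ~ 90.00 GB/s, SU write ~ 45.36 GB/s (table rounds to 45)
# Enhanced: SU read ~ 281.52 GB/s, SU write ~ 140.40 GB/s
```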

High-speed storage provides a shared view of an organization’s data to all nodes. It must be optimized for small, random I/O patterns and must deliver both high peak per-node performance and high aggregate file system performance to meet the variety of workloads an organization may encounter. High-speed storage should support efficient multithreaded reads and writes from a single system, although most DL workloads are read-dominant.
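One way to see whether a mount sustains concurrent reads from a single node is a simple multithreaded read loop. The sketch below is illustrative only (the /lustre/bench path and thread count are assumptions); a dedicated tool such as fio is the better choice for real benchmarking:

```python
# Minimal sketch of a multithreaded read throughput check against a shared
# file system. The benchmark directory and thread count are assumptions.
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

BENCH_DIR = Path("/lustre/bench")   # assumed location of pre-created test files
THREADS = 16                        # assumed concurrent reader count
CHUNK = 1 << 20                     # read in 1 MiB chunks

def read_file(path: Path) -> int:
    """Read one file sequentially and return the byte count."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            total += len(chunk)
    return total

files = sorted(BENCH_DIR.iterdir())
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    total_bytes = sum(pool.map(read_file, files))
elapsed = time.perf_counter() - start
print(f"Read {total_bytes / 1e9:.1f} GB at {total_bytes / elapsed / 1e9:.2f} GB/s")
```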

Use cases in automotive and other computer-vision-related tasks, where high-resolution images are used for training (and in some cases are uncompressed), involve datasets that easily exceed 30 TB in size. In these cases, 4 GB/s per GPU of read performance is required.
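As a quick illustration of what that guideline implies at the node level (the per-node GPU count here is an assumption for the sketch, not a specification):

```python
# Illustrative sizing for an image-heavy training node: at the 4 GB/s-per-GPU
# guideline, an assumed 8-GPU node needs 32 GB/s of aggregate read bandwidth.
GPUS_PER_NODE = 8          # assumed GPU count per node
READ_PER_GPU_GBPS = 4      # per-GPU read guideline from the text

node_read_gbps = GPUS_PER_NODE * READ_PER_GPU_GBPS
print(f"Per-node read requirement: {node_read_gbps} GB/s")
```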

While Natural Language Processing (NLP) and large language model (LLM) use cases often do not require as much read performance for training, peak performance for reads and writes is needed for creating and reading checkpoint files. This is a synchronous operation, and training stops during this phase. For best end-to-end training performance, I/O operations for checkpoints must not be ignored. At least one-half of the recommended read performance should be used as the target write performance for LLM and large-model use cases.
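The following sketch applies that rule of thumb and estimates how long a synchronous checkpoint stalls training; the checkpoint size and the read target are illustrative assumptions:

```python
# Rough model of the training stall from a synchronous checkpoint, plus the
# at-least-half-of-read write target described above. Sizes and bandwidths
# are illustrative assumptions, not measurements.
checkpoint_size_tb = 2.0      # assumed checkpoint size in TB
read_target_gbps = 281.25     # e.g., the Enhanced full-SU read figure

write_target_gbps = read_target_gbps / 2   # half-of-read write target
stall_seconds = checkpoint_size_tb * 1e12 / (write_target_gbps * 1e9)
print(f"Write target: {write_target_gbps:.2f} GB/s")
print(f"Synchronous checkpoint stall: {stall_seconds:.0f} s")
# -> about 14 s to flush a 2 TB checkpoint at ~140 GB/s aggregate write
```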

The preceding metrics assume a variety of workloads and datasets, and a need for training locally and directly from the high-speed storage system. It is best to characterize workloads and organizational needs before finalizing performance and capacity requirements.