Storage Architecture
Data, and lots of it, is the key to developing accurate deep learning (DL) models. Data volume continues to grow exponentially, and the data used to train individual models continues to grow as well. Data format, not just volume, can be a key factor in the rate at which data is accessed, so storage system performance must scale commensurately.
The key I/O operation in DL training is re-read. Data is not just read once; it is reused again and again because of the iterative nature of DL training. Pure read performance is still important, as some model types can train in a fraction of an epoch (for example, some recommender models), and inference with existing models can be highly I/O intensive, much more so than training. Write performance can also be important. As DL models grow in size and time-to-train, writing checkpoints is necessary for fault tolerance. Checkpoint files can be terabytes in size and, while not written frequently, are typically written synchronously, which blocks forward progress of the DL model.
Ideally, data is cached during the first read of the dataset, so that subsequent reads do not have to retrieve it across the network. Shared filesystems typically use RAM as the first layer of cache. Reading files from cache can be an order of magnitude faster than reading them from remote storage. In addition, the DGX B200 system provides local NVMe storage that can also be used for caching or staging data.
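The effect of caching on a re-read workload can be illustrated with a rough model. The following is a minimal sketch, not a benchmark: the function name and all input numbers are hypothetical, and the 10x gap between cache and remote bandwidth is only chosen to mirror the "order of magnitude" figure mentioned above.

```python
def total_read_time_s(dataset_gb, epochs, remote_gbps, cache_gbps, fits_in_cache):
    """Estimate the time spent reading a dataset over several epochs.

    The first epoch always reads from remote storage. Later epochs read
    from cache only if the whole dataset fits; otherwise they go back to
    remote storage. All inputs are illustrative assumptions.
    """
    first_epoch_s = dataset_gb / remote_gbps
    later_bandwidth = cache_gbps if fits_in_cache else remote_gbps
    later_epoch_s = dataset_gb / later_bandwidth
    return first_epoch_s + (epochs - 1) * later_epoch_s


# Hypothetical example: 1000 GB dataset, 10 epochs,
# 4 GBps from remote storage, 40 GBps from cache.
cached = total_read_time_s(1000, 10, 4, 40, fits_in_cache=True)
uncached = total_read_time_s(1000, 10, 4, 40, fits_in_cache=False)
print(f"cached: {cached:.0f} s, uncached: {uncached:.0f} s")
```

In this simplified model, the first epoch costs the same either way, which is why workloads that read the dataset only once gain nothing from the cache and depend entirely on remote read performance.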
DGX SuperPOD is designed to support all workloads, but the storage performance required to maximize training performance can vary depending on the type of model and dataset. The guidelines in Table 5 and Table 6 are provided to help determine the I/O levels required for different types of models.

| Performance Level | Work Description | Dataset Size |
| --- | --- | --- |
| Good | Natural Language Processing (NLP) | Datasets generally fit within local cache |
| Better | Training Compressed Images, Compressed Audio and Text Data, such as LLM Training | Many to most datasets can fit within the local system’s cache |
| Best | Training with large Video and Image files (such as AV replay), offline inference, ETL, generative networks such as stable diffusion, 3D images such as Medical U-Net, genomics workload and protein prediction such as AlphaFold | Datasets are too large to fit into cache, massive first epoch I/O requirements, workflows that only read the dataset once |


| Performance Characteristic | Good (GBps) | Better (GBps) | Best (GBps) |
| --- | --- | --- | --- |
| Single SU aggregate system read | 15 | 40 | 125 |
| Single SU aggregate system write | 7 | 20 | 62 |
| 4 SU aggregate system read | 60 | 160 | 500 |
| 4 SU aggregate system write | 30 | 80 | 250 |

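One way to use the throughput guidelines above is to check which performance level a measured storage system reaches. The sketch below encodes the table values as a lookup; the dictionary and function names are hypothetical, and the rule that a level counts only when both read and write meet its thresholds is an assumption made for illustration.

```python
# Aggregate throughput guidelines in GBps, as (read, write) per level,
# keyed by the number of scalable units (SUs). Values are copied from
# the guideline table; the structure itself is a hypothetical helper.
GUIDELINE_GBPS = {
    1: {"Good": (15, 7), "Better": (40, 20), "Best": (125, 62)},
    4: {"Good": (60, 30), "Better": (160, 80), "Best": (500, 250)},
}


def performance_level(su_count, read_gbps, write_gbps):
    """Return the highest level whose read AND write thresholds are met.

    Returns None if the measured throughput is below the "Good" level.
    """
    achieved = None
    for level in ("Good", "Better", "Best"):  # ordered lowest to highest
        min_read, min_write = GUIDELINE_GBPS[su_count][level]
        if read_gbps >= min_read and write_gbps >= min_write:
            achieved = level
    return achieved


# A single-SU system measured at 130 GBps read but only 20 GBps write
# is held back by its write throughput.
print(performance_level(1, read_gbps=130, write_gbps=20))
```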
High-speed storage provides a shared view of an organization’s data to all nodes. It must be optimized for small, random I/O patterns, and provide high peak node performance and high aggregate filesystem performance to meet the variety of workloads an organization may encounter. High-speed storage should support efficient multi-threaded reads and writes from a single system, although most DL workloads will be read-dominant.
Use cases in automotive and other computer vision-related tasks, where high-resolution images (in some cases uncompressed) are used for training, involve datasets that easily exceed 30 TB in size. In these cases, read performance of 4 GBps per GPU is needed.
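To put the per-GPU figure in system terms, the arithmetic can be sketched as follows. This is a rough illustration under stated assumptions: the function names are hypothetical, the count of eight GPUs per DGX B200 system comes from the system's published configuration rather than from this section, the per-GPU requirement is assumed to scale linearly across GPUs reading concurrently, and 1 TB is treated as 1000 GB.

```python
def node_read_gbps(per_gpu_gbps, gpus_per_node):
    """Read bandwidth one node needs if every GPU reads concurrently."""
    return per_gpu_gbps * gpus_per_node


def min_full_read_s(dataset_tb, read_gbps):
    """Lower bound on the time to stream the whole dataset once.

    Assumes 1 TB = 1000 GB and ignores caching and protocol overhead.
    """
    return dataset_tb * 1000 / read_gbps


# 4 GBps per GPU (from the text) on an 8-GPU system (assumption).
per_node = node_read_gbps(4.0, 8)
print(f"per-node read target: {per_node} GBps")
print(f"one pass over 30 TB: {min_full_read_s(30, per_node)} s")
```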
While NLP and LLM cases often do not require as much read performance for training, peak read and write performance is needed for creating and reading checkpoint files. Checkpointing is a synchronous operation, and training stops during this phase. If you are looking for the best end-to-end training performance, do not ignore I/O operations for checkpoints. For LLM and large model use cases, plan for write performance of at least half of the read performance.
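The cost of synchronous checkpoints can be estimated with simple arithmetic. The sketch below is illustrative only: the function names, the 2 TB checkpoint size, and the one-hour checkpoint interval are hypothetical inputs, and the model assumes the checkpoint is written at the full available write bandwidth while training is paused.

```python
def recommended_write_gbps(read_gbps):
    """Apply the guideline of at least half the read performance."""
    return 0.5 * read_gbps


def checkpoint_stall_s(checkpoint_gb, write_gbps):
    """Seconds that training is blocked while one checkpoint is written."""
    return checkpoint_gb / write_gbps


def stall_fraction(checkpoint_gb, write_gbps, interval_s):
    """Fraction of wall-clock time spent blocked on checkpoint writes.

    interval_s is the training time between consecutive checkpoints.
    """
    stall = checkpoint_stall_s(checkpoint_gb, write_gbps)
    return stall / (interval_s + stall)


# Hypothetical example: 40 GBps read, 2000 GB checkpoint, hourly writes.
write_gbps = recommended_write_gbps(40)
stall = checkpoint_stall_s(2000, write_gbps)
fraction = stall_fraction(2000, write_gbps, 3600)
print(f"write target: {write_gbps} GBps, stall: {stall} s per checkpoint")
print(f"time lost to checkpoints: {fraction:.1%}")
```

Because the stall time is inversely proportional to write bandwidth, halving the available write bandwidth doubles the stall for every checkpoint, which is why write performance matters even for read-dominant training.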
The preceding metrics assume a variety of workloads and datasets, and the need to train locally and directly from the high-speed storage system. It is best to characterize workloads and organizational needs before finalizing performance and capacity requirements.