Storage#
NVIDIA DGX GB200 systems are deployed as part of NVIDIA DGX SuperPOD configurations designed for massive-scale AI and HPC workloads. Their storage architecture is a critical component for feeding data to the GPUs efficiently. The system uses a multi-tiered storage hierarchy that combines internal storage with network-attached storage accessed over a high-speed network.
Internal Storage#
Each DGX compute tray within the DGX GB200 is equipped with local NVMe storage for caching, and a single M.2 NVMe drive is used for booting the operating system.
For local caching, the E1.S NVMe drives are configured as a single RAID 0 volume. During the first read of a dataset from shared storage, the DGX system software can automatically cache a copy of the data on the local NVMe devices using cachefilesd, if configured. Subsequent reads of the same data are then served from the local NVMe cache, which provides significantly faster access than retrieving the data again across the network. This caching process is transparent to users and applications.
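Because cachefilesd operates at the kernel level, applications need no changes to benefit from it. For cases where explicit staging is preferred instead, the following is a minimal Python sketch of the same read-through pattern; the mount points /mnt/shared (shared parallel file system) and /raid/cache (local NVMe RAID 0 volume) are hypothetical and should be adjusted to the actual system layout.

```python
import shutil
from pathlib import Path

# Hypothetical mount points; adjust to the actual system layout.
SHARED = Path("/mnt/shared")   # shared parallel file system
CACHE = Path("/raid/cache")    # local NVMe RAID 0 volume

def cached_open(relpath: str, mode: str = "rb"):
    """Read-through cache: the first access copies the file from
    shared storage to local NVMe; later reads are served locally."""
    local_path = CACHE / relpath
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(SHARED / relpath, local_path)  # first read populates the cache
    return open(local_path, mode)

# Example: the first call copies the file over the network; later
# calls read it from the local RAID 0 cache.
# with cached_open("datasets/train/shard-00000.tar") as f:
#     data = f.read()
```

Explicit staging like this trades the transparency of cachefilesd for direct control over exactly which files occupy the local cache.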
Local storage can also be used to accelerate checkpointing. While primary checkpoints typically go to high-performance shared storage for reliability, local NVMe can hold temporary checkpoints that are written quickly and later flushed to shared storage, or smaller, very frequent checkpoints taken within an epoch, as sketched below.
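The sketch below illustrates this two-tier pattern; it is not part of the DGX software stack, and the paths /raid/ckpt (local NVMe) and /mnt/shared/ckpt (shared storage) are assumptions for illustration. The checkpoint is written to local NVMe first so training can resume immediately, while a background thread flushes the durable copy to shared storage.

```python
import shutil
import threading
from pathlib import Path

# Hypothetical paths; adjust to the actual system layout.
LOCAL_CKPT = Path("/raid/ckpt")          # fast local NVMe RAID 0 volume
SHARED_CKPT = Path("/mnt/shared/ckpt")   # durable shared parallel file system

def save_checkpoint(state: bytes, name: str) -> threading.Thread:
    """Write a checkpoint to local NVMe, then flush it to shared
    storage in the background. Returns the flush thread so the
    caller can join() it before relying on the durable copy."""
    LOCAL_CKPT.mkdir(parents=True, exist_ok=True)
    local_path = LOCAL_CKPT / name
    local_path.write_bytes(state)  # fast local write; training resumes here

    def flush() -> None:
        SHARED_CKPT.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local_path, SHARED_CKPT / name)  # durable copy

    t = threading.Thread(target=flush)
    t.start()
    return t
```

A caller would typically join() the returned thread at the next checkpoint interval, ensuring a local checkpoint is never overwritten before its durable copy exists on shared storage.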
External Storage#
The DGX GB200’s primary data source is a high-performance shared parallel file system, available from partners as described in the DGX SuperPOD reference architecture. This storage is accessed over converged Ethernet by default, although InfiniBand can also be used. It holds the massive datasets required for AI training and serves as the original source from which the local NVMe cache is populated.
For complete information on sizing and performance requirements, see Storage Architecture — NVIDIA DGX SuperPOD in the reference architecture.