Overview
The NVIDIA Enterprise Reference Architecture (RA) for the 2-4-5-800 (dual-plane) node architecture with NVIDIA GB300 NVL72 and NVIDIA Spectrum-X Networking describes a fully integrated, rack-scale solution optimized for the most demanding AI workloads. It documents a modular architecture based on NVIDIA-Certified GB300 NVL72 systems, each with 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA® Grace CPUs.
Each Scalable Unit (a single liquid-cooled NVL72 rack) integrates 18 compute trays connected via fifth-generation NVLink. Each tray (a single server, or node) can still operate independently when needed, but the NVLink interconnect also allows GPUs to be combined dynamically so that the system functions as a single multi-GPU unit of compute for larger workloads. This tightly coupled configuration minimizes system bottlenecks to provide the best performance and application scalability.
A fully tested system scales up to 8 Scalable Units (SUs). Larger clusters can be built based on customer requirements.
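The sizing figures above can be turned into a quick capacity calculation. The following is a minimal sketch, not an NVIDIA tool: the per-rack constants come from this overview, while the names `ClusterSize`, `cluster_size`, and `per_tray_counts` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass

# Per-rack figures stated in this overview (one Scalable Unit = one NVL72 rack).
GPUS_PER_SU = 72
CPUS_PER_SU = 36
TRAYS_PER_SU = 18
MAX_TESTED_SUS = 8  # largest fully tested configuration per this overview


@dataclass(frozen=True)
class ClusterSize:
    """Hypothetical container for aggregate hardware counts."""
    scalable_units: int
    compute_trays: int
    gpus: int
    cpus: int
    within_tested_range: bool


def per_tray_counts() -> tuple[int, int]:
    """Return (GPUs per tray, CPUs per tray) derived from the per-rack totals."""
    return GPUS_PER_SU // TRAYS_PER_SU, CPUS_PER_SU // TRAYS_PER_SU


def cluster_size(scalable_units: int) -> ClusterSize:
    """Aggregate tray, GPU, and CPU counts for a given number of SUs."""
    if scalable_units < 1:
        raise ValueError("scalable_units must be at least 1")
    return ClusterSize(
        scalable_units=scalable_units,
        compute_trays=scalable_units * TRAYS_PER_SU,
        gpus=scalable_units * GPUS_PER_SU,
        cpus=scalable_units * CPUS_PER_SU,
        within_tested_range=scalable_units <= MAX_TESTED_SUS,
    )


if __name__ == "__main__":
    gpus_per_tray, cpus_per_tray = per_tray_counts()
    print(f"Per tray: {gpus_per_tray} GPUs, {cpus_per_tray} CPUs")
    for n in (1, 4, 8):
        print(cluster_size(n))
```

Under these figures, each compute tray works out to 4 GPUs and 2 CPUs, and the largest tested configuration of 8 SUs totals 144 compute trays, 576 GPUs, and 288 CPUs. Configurations beyond 8 SUs are flagged rather than rejected, since the overview states that larger clusters can be built based on customer requirements.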
The rack solution is delivered as a pre-configured system, available through OEM fulfillment together with hardware support. Software support from NVIDIA is based on a paid, per-GPU subscription to NVIDIA AI Enterprise. Although this Enterprise RA document may make the solution look simple, using NVIDIA Mission Control to operate the NVIDIA GB300 NVL72 is highly recommended.
The NVIDIA GB300 NVL72 and NVIDIA Spectrum-X Platform RA has been developed to support enterprise-grade AI deployments for the following use cases:
Inference on today’s largest AI models, in real time
Training and fine-tuning of trillion-parameter language models
Data analytics that’s ultra-fast, at massive scale
High-performance computing for single and mixed precision
For all use cases, this architecture is ideal for multi-user, single-tenant workloads. Specifically, the logical design and software are streamlined for ease of deployment and maintenance: the configuration assumes that all users belong to the same enterprise, so accounting and access control can be consolidated.
This RA is architected to support deployment of Kubernetes, Slurm, and associated applications and tools for non-virtualized workloads.