Networking Physical Topologies#
The Enterprise RA configurations use three physical network fabrics:
GPU Compute (Node East/West) Network
CPU Converged (Node North/South) Network
Out-of-Band Management Network
Each network is discussed in this section.
GPU Compute (Node East/West) Network#
The compute fabric (East-West) is built using switches with NVIDIA Spectrum technology in a fully non-blocking fat-tree topology to provide the highest level of performance for applications running on the GB300 NVL72 cluster. The compute fabric is RDMA-based. It uses a leaf-spine design to minimize the number of network hops for the application, and the GPUs are connected in a rail-optimized network topology through their respective ConnectX-8 NICs. This design allows the most efficient communication for multi-GPU applications within and across the trays.
The leaf and spine architecture allows for a scalable and reliable network that can fit varied sizes of clusters using the same architecture.
The compute fabric is designed to provide redundancy, maximize bandwidth, and minimize the network latency between GPUs within a server and within a rail. Models beyond 1920B parameters (assuming FP4 quantization) require model parallelism and more than a single tray of the GB300 NVL72 rack; the trays are connected through NVLink and NVSwitch technology.
Multi-Plane Topology Approach#
While each GPU is associated with a dedicated ConnectX-8 SuperNIC, connecting each NIC to the fabric through a single link creates a single point of failure (SPOF). Customers are therefore opting for a load-balanced multi-link approach that both delivers the full GPU bandwidth and avoids a SPOF. Each compute networking plane forms a separate fabric, and resiliency and load balancing between the two planes are handled by NCCL on the host. If one plane cannot provide GPU-to-GPU connectivity, traffic is redirected to the other plane; if both planes are active, traffic is balanced between them so that the bandwidth of both links is used. Load balancing via NCCL is reported to be more efficient than L2/L3 bonding such as LAG, which has only “local” awareness for traffic balancing and relies on static hash-based distribution that is suboptimal for AI workloads because of their low entropy.
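The low-entropy problem with static hashing can be illustrated with a toy model. This is only a sketch under stated assumptions: the flow names, the 200 Gb/s flow size, and the SHA-256 based hash are hypothetical, and the "balanced" function is an idealized even split, not a description of how NCCL actually distributes traffic.

```python
import hashlib


def static_hash_link(flow_id: str, num_links: int) -> int:
    """Pin a flow to one link using a static hash of its identifier."""
    digest = hashlib.sha256(flow_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_links


def hashed_load(flows: dict[str, float], num_links: int) -> list[float]:
    """Per-link load (Gb/s) when every flow stays on its hashed link."""
    load = [0.0] * num_links
    for flow_id, gbps in flows.items():
        load[static_hash_link(flow_id, num_links)] += gbps
    return load


def balanced_load(flows: dict[str, float], num_links: int) -> list[float]:
    """Per-link load (Gb/s) in the ideal case: total traffic split evenly."""
    total = sum(flows.values())
    return [total / num_links] * num_links


# Hypothetical low-entropy workload: four large, long-lived flows.
flows = {f"gpu0->gpu{i}": 200.0 for i in range(1, 5)}

print("static hash:", hashed_load(flows, 2))
print("ideal split:", balanced_load(flows, 2))
```

With only a handful of large flows, the hashed placement can leave one link carrying most of the traffic while the other is underused, whereas the ideal split always yields equal per-link load.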
A second benefit of plane separation is the increased fan-out of the NICs and switches. A higher radix of interfaces enables building a larger network with fewer network tiers in a Clos topology.
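The relationship between radix and tier count can be sketched with the standard capacity formula for a non-blocking fat tree, in which half of each switch's ports face down and half face up. The radix values below are hypothetical and are not taken from this reference architecture.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints supported by a non-blocking fat tree of `tiers` switch tiers.

    Uses the standard result 2 * (radix / 2) ** tiers, for example
    radix**2 / 2 for a two-tier (leaf-spine) fabric.
    """
    return 2 * (radix // 2) ** tiers


def tiers_needed(radix: int, endpoints: int) -> int:
    """Smallest number of tiers that can connect `endpoints` without blocking."""
    tiers = 1
    while max_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


# Hypothetical comparison: the same 1024 endpoints on two switch radices.
for radix in (32, 64):
    print(f"radix {radix}: {tiers_needed(radix, 1024)} tiers for 1024 endpoints")
```

Under these assumptions, doubling the radix from 32 to 64 lets the same 1024 endpoints fit in a two-tier fabric instead of a three-tier one.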
Dual Plane Topology#
In this Enterprise RA we utilize a Dual Plane topology, which involves building two identical planes that each connect to every GPU interface. Each GPU generates 800 Gb/s of bandwidth through its ConnectX-8 SuperNIC, so the dual plane topology splits that interface into 2x400 Gb/s interfaces. Each of these interfaces is connected to a different leaf switch, and each leaf switch is part of an independent fabric that scales to 1024 interfaces of 400 Gb/s in this reference architecture. The GPU is not aware of the two independent fabrics, each carrying 50% of the generated traffic; it is only aware that 100% of the traffic is carried by the ConnectX-8 SuperNIC. Plane tracking, load balancing, and failure handling are performed by the ConnectX-8 SuperNIC at the hardware level. A failed or degraded plane has an impact that scales linearly with the bandwidth lost.
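The linear relationship between plane health and per-GPU bandwidth can be written out as a short calculation. This is a minimal sketch: the `plane_health` fractions and the function name are hypothetical, and only the 2x400 Gb/s split comes from the text above.

```python
PLANE_GBPS = 400.0  # each plane carries one 400 Gb/s interface per GPU
NUM_PLANES = 2      # dual plane topology


def gpu_bandwidth_gbps(plane_health: list[float]) -> float:
    """Per-GPU bandwidth given the usable fraction (0.0 to 1.0) of each plane."""
    if len(plane_health) != NUM_PLANES:
        raise ValueError(f"expected {NUM_PLANES} plane health values")
    # Clamp each fraction to [0, 1], then sum the usable bandwidth per plane.
    return sum(PLANE_GBPS * max(0.0, min(1.0, h)) for h in plane_health)


print(gpu_bandwidth_gbps([1.0, 1.0]))  # both planes healthy
print(gpu_bandwidth_gbps([1.0, 0.5]))  # one plane degraded to half capacity
print(gpu_bandwidth_gbps([1.0, 0.0]))  # one plane failed
```

In this model, losing one plane entirely halves the per-GPU bandwidth, and a partially degraded plane reduces it by the same fraction of 400 Gb/s.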
CPU Converged (Node North/South) Network#
A converged network for both storage and in-band management is used in the Enterprise RA. This provides enterprises with flexible storage allocation and easy network management. The converged network has the following attributes:
It provides high bandwidth to shared storage and connects the customer through the converged network.
It is independent of the compute fabric to maximize both storage and application performance.
Each compute tray connects to two separate switches using dual 400 Gb/s ports, while each management node connects to the same switches with four 200 Gb/s ports. This design provides redundancy and enables per-node storage bandwidth of up to 40 GB/s.
The fabric is built on Ethernet technology with RDMA over Converged Ethernet (RoCE) support and utilizes NVIDIA BlueField-3 B3240 DPUs in each compute tray to deliver existing and emerging cloud and storage services.
It is flexible and can scale to meet specific capacity and bandwidth requirements.
Hybrid storage fabric design with support for tenant isolation provides access to both shared and dedicated storage per tenant.
It is used for node provisioning, data movement, Internet access, and other services that must be accessible by the users.
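The port counts listed in the attributes above can be converted into raw aggregate line rates as a quick sanity check. This is only a sketch: the helper name is hypothetical, and the figures are nominal line rates that ignore protocol overhead and any storage-side limits, so they are upper bounds rather than expected storage throughput.

```python
def aggregate_gigabytes_per_s(ports: int, gbps_per_port: float) -> float:
    """Raw aggregate line rate in GB/s for `ports` links of `gbps_per_port` Gb/s."""
    return ports * gbps_per_port / 8  # 8 bits per byte


compute_tray = aggregate_gigabytes_per_s(ports=2, gbps_per_port=400)
management_node = aggregate_gigabytes_per_s(ports=4, gbps_per_port=200)

print(f"compute tray:    {compute_tray} GB/s raw")
print(f"management node: {management_node} GB/s raw")
```

Both connection types have the same nominal aggregate rate, and the quoted per-node storage bandwidth of up to 40 GB/s sits within that raw capacity.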
Out-of-Band Management (Node) Network#
The OOB management network connects all the baseboard management controller (BMC) ports, as well as other devices that should be physically isolated from system users, to allow infrastructure management. This includes the 1 Gb/s switch management ports and the BlueField-3 DPU management ports.