Networking Physical Topologies#
The Enterprise RA configurations use three physical network fabrics:
Compute (Node East/West) Network
CPU Converged (Node North/South) Network
Out-of-Band Management Network
Each network is discussed in this section.
Compute (Node East/West) Network#
The compute fabric (East/West) is built using switches with NVIDIA Spectrum technology in a full non-blocking fat tree topology to provide the highest level of performance for applications running on the NVIDIA HGX B300 cluster. The compute fabric is an RDMA-based fabric. It is designed to provide the fewest hops through the network for the application in a leaf-spine manner, with the GPUs connected in a rail-optimized network topology through their respective NVIDIA ConnectX-8 SuperNICs. This design enables the most efficient communication for multi-GPU applications within and across nodes.
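As a purely illustrative sketch of the rail-optimized layout (the node count below is a placeholder and not part of the RA), GPU i of every node attaches through its SuperNIC to the same rail, and therefore the same leaf switch:

```python
# Illustrative sketch of a rail-optimized mapping: GPU i of every node
# attaches (through its ConnectX-8 SuperNIC) to leaf switch i, so that
# same-index GPUs across nodes communicate through a single leaf hop.
# The node count is a placeholder; 8 GPUs per node matches HGX B300.
NODES = 4
GPUS_PER_NODE = 8

def rail_of(gpu_index: int) -> int:
    """A GPU's rail (and leaf switch) is simply its index within the node."""
    return gpu_index

for node in range(NODES):
    for gpu in range(GPUS_PER_NODE):
        print(f"node{node} gpu{gpu} -> leaf{rail_of(gpu)} (rail {rail_of(gpu)})")
```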
The leaf and spine architecture allows for a scalable and reliable network that can fit varied sizes of clusters using the same architecture.
The compute fabric is designed to facilitate redundancy, maximize bandwidth and minimize network latency required to connect GPUs within a server and within a rail. While the architecture below recommends a dual plane compute fabric for resiliency purposes, a cost-effective single plane configuration is also supported based on deployment needs.
The compute network is not necessarily required for inference workloads. Most common pre-trained models do not exceed the size of a single NVIDIA B300 SXM GPU. Models beyond 120B parameters may require model parallelism and require more than one NVIDIA B300 SXM GPU. This parallelism will still reside within the same node as each server node can hold up to 8x NVIDIA B300 SXM GPUs.
For larger network design points, due to the high number of endpoints required, the Compute Fabric utilizes a spine and leaf networking architecture. This is principally to support the required design points but also to allow extensibility in the architecture.
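As a back-of-the-envelope illustration of spine-and-leaf sizing (the 64-port radix below is an assumed parameter, not a statement about a specific Spectrum switch SKU), a non-blocking two-tier fabric can be sized roughly as follows:

```python
# Rough, hypothetical sizing of a non-blocking two-tier leaf-spine fabric.
# The switch radix is an assumption; substitute the real port count of the
# deployed Spectrum switch at the chosen link speed.
import math

def size_two_tier(endpoints: int, radix: int) -> tuple[int, int]:
    """Return (leaf_count, spine_count) for a non-blocking two-tier Clos."""
    down_per_leaf = radix // 2                  # half of each leaf faces endpoints
    leaves = math.ceil(endpoints / down_per_leaf)
    uplinks = leaves * (radix - down_per_leaf)  # the other half goes to spines
    spines = math.ceil(uplinks / radix)         # spines are fully populated
    return leaves, spines

# Example: 1024 endpoints of 400 Gb/s on an assumed 64-port radix.
print(size_two_tier(1024, 64))  # -> (32, 16)
```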
In the architectures below, the Compute Fabric is merged with the Converged Fabric for smaller design points. This is to lower cost and to simplify the network architecture.
Compute Fabric Excluded for Pure Inference Deployments#
For the use case of pure inference, a compute network may not be necessary. Each NVIDIA B300 SXM GPU can support a maximum model size of approximately 120B parameters. In addition, tensor parallelism yields lower per-GPU performance than running the same model on the same GPUs in data-parallel mode.
Because running an inference model across multiple GPUs, whether on a single node or across multiple nodes, offers no performance gain, a compute network is not necessary for this use case.
The drawback of an infrastructure without a compute network is that it cannot be used for hybrid workflows, including model training. A compute network can be retrofitted later, albeit with potentially significant downtime and reconfiguration.
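To make the single-GPU capacity point above concrete, the following is illustrative arithmetic only; the per-GPU memory figure is an assumption to be checked against the actual SKU:

```python
# Illustrative estimate of the largest model one GPU can serve in memory.
# The HBM capacity is an assumption for this sketch, not a specification.
HBM_GB = 288                   # assumed per-GPU memory
BYTES_PER_PARAM = 2            # FP16/BF16 weights
RUNTIME_OVERHEAD = 0.2         # rough headroom for KV cache, activations, runtime

usable_gb = HBM_GB * (1 - RUNTIME_OVERHEAD)
max_params_b = usable_gb / BYTES_PER_PARAM   # GB / (bytes per param) = billions of params
print(f"~{max_params_b:.0f}B parameters fit in FP16 on one GPU")  # ~115B
```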
Multi Plane Topology Approach#
While each GPU is associated with a dedicated NVIDIA ConnectX-8 SuperNIC, coupling a single link from a NIC to a single point of failure (SPOF) in the fabric is a concern. Increasingly, customers are opting for a load-balanced multi-link approach that can satisfy the full GPU bandwidth while avoiding a SPOF. Each compute networking plane forms a separate fabric, where resiliency and load balancing between the two planes are handled by NCCL on the host. A failure of GPU-to-GPU connectivity via one plane confines traffic to the alternative plane; if both planes are active, traffic is balanced between them in a manner that utilizes the bandwidth of both links. Load balancing via NCCL is more efficient than L2/L3 bonding schemes such as LAG, which have only “local” awareness for traffic balancing and rely on static hash-based distribution that is suboptimal for AI workloads due to their low entropy.
A second benefit of plane separation is expansion of the fan-out of the NIC and switches. A higher radix of interfaces enables the creation of a larger network with fewer network tiers in a Clos topology.
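As a hedged illustration of the host-side NCCL handling described above (device names such as mlx5_0 are placeholders, and the exact HCA list depends on the host), the RDMA devices backing the compute planes can be pinned before a job starts:

```python
# Hypothetical example of steering NCCL onto the multi-rail compute fabric.
# Device names (mlx5_*) are placeholders for the ConnectX-8 ports that back
# the compute planes on the actual host. Set these before NCCL initializes
# (for example, before torch.distributed or MPI launches the job).
import os

# Restrict NCCL to the RDMA devices that form the compute fabric planes.
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"  # placeholder device names

# Keep a given ring/tree on the same NIC where possible; NCCL balances and
# fails over across the devices listed above.
os.environ["NCCL_CROSS_NIC"] = "0"
```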
Dual Plane Topology#
In this Enterprise RA, we recommend using a Dual Plane topology. A Dual Plane Topology involves building two identical planes that connect to each GPU interface. With each GPU generating 800 Gb/s of bandwidth through its NVIDIA ConnectX-8 SuperNIC, the dual plane topology splits the interface into 2x 400 Gb/s interfaces. Each interface is then connected to a different leaf switch, and each leaf switch is part of an independent fabric that scales to 1024 interfaces of 400 Gb/s as part of this reference architecture. The GPU is not aware of the two independent fabrics, each carrying 50% of the generated traffic; it is only aware that 100% of the traffic is carried by the NVIDIA ConnectX-8 SuperNIC. Tracking of each plane, load balancing, and failure handling are handled by the ConnectX-8 SuperNIC at the hardware level. A failing or degraded plane has an impact proportional to the lost bandwidth.
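The bandwidth arithmetic behind this behavior is straightforward; the sketch below is illustrative only:

```python
# Illustrative arithmetic for the dual plane split described above.
GPU_NIC_BANDWIDTH_GBPS = 800   # per-GPU bandwidth through the ConnectX-8 SuperNIC
PLANES = 2

per_plane = GPU_NIC_BANDWIDTH_GBPS / PLANES   # 400 Gb/s toward each plane
surviving = GPU_NIC_BANDWIDTH_GBPS - per_plane

print(f"Per-plane share: {per_plane:.0f} Gb/s")
# Losing one plane removes only that plane's share, so the impact is
# proportional to the dropped bandwidth (50% in the dual plane case).
print(f"With one plane down: {surviving:.0f} Gb/s "
      f"({surviving / GPU_NIC_BANDWIDTH_GBPS:.0%} of nominal)")
```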
Single Plane Topology#
A Single Plane Topology is also possible as a cost-effective alternative to the Dual Plane topology. In this configuration, each GPU operates at 400 Gb/s of bandwidth through the NVIDIA ConnectX-8 SuperNIC without splitting the interface, using a single 1x 400 Gb/s connection instead of 2x 400 Gb/s. Each interface is connected to a leaf switch within a single compute fabric that scales to 1024 interfaces of 400 Gb/s as part of the reference architecture. While this approach reduces the total GPU bandwidth by 50%, it is well suited for workloads that do not warrant maximum throughput and can benefit from lower networking infrastructure costs. The use of OSFP transceiver modules enables seamless migration between single plane and dual plane topologies, allowing flexibility in adapting to evolving performance needs.
CPU Converged (Node North/South) Network#
A converged network for both storage and in-band management is used in the Enterprise RA. This provides enterprises with flexible storage allocation and simplified network management. The converged network has the following attributes:
Provides high bandwidth to shared storage and customer connectivity through the converged network.
Is independent of the compute fabric to maximize both storage and application performance.
Each compute and management node is connected with two 400 GbE ports to two separate switches to provide redundancy and high storage throughput that can reach up to 40 GB/s per node (see the sketch after this list).
The fabric is built on Ethernet technology with RDMA over Converged Ethernet (RoCE) support and utilizes NVIDIA BlueField-3 B3240 DPUs in each compute node to deliver existing and emerging cloud and storage services.
It is flexible and can scale to meet specific capacity and bandwidth requirements.
Tenant-controlled management nodes provide tenants the flexibility to deploy the OS and job scheduler of their choice.
Hybrid storage fabric design with support for tenant isolation provides access to both shared and dedicated storage per tenant.
Used for node provisioning, data movement, Internet access, and other services that must be accessible by the users.
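To put the per-node storage figure from the list above in context, the following conversion is illustrative arithmetic only:

```python
# Illustrative conversion for the converged network attributes listed above.
LINKS_PER_NODE = 2
LINK_SPEED_GBPS = 400                         # each port runs at 400 GbE

raw_gbps = LINKS_PER_NODE * LINK_SPEED_GBPS   # 800 Gb/s of raw line rate per node
raw_gb_per_s = raw_gbps / 8                   # = 100 GB/s before protocol overheads

QUOTED_STORAGE_GB_PER_S = 40                  # per-node figure quoted in the RA
print(f"Raw line rate per node: {raw_gb_per_s:.0f} GB/s")
print(f"Quoted storage throughput: {QUOTED_STORAGE_GB_PER_S} GB/s "
      f"({QUOTED_STORAGE_GB_PER_S / raw_gb_per_s:.0%} of line rate)")
```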
Out-of-Band Management (Node) Network#
The OOB management network connects all the Baseboard Management Controller (BMC) ports, as well as other devices that should be physically isolated from system users to allow infrastructure management. This includes the 1 GbE switch management ports and the NVIDIA BlueField-3 DPU management ports.
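As a hedged example of how the OOB network is typically consumed (the BMC address and credentials below are placeholders, and this assumes the node BMCs expose the standard DMTF Redfish API), a management host on the OOB network can query node health as follows:

```python
# Hypothetical health check over the OOB management network, assuming the
# node BMCs expose the standard DMTF Redfish API. The address and
# credentials are placeholders; use proper secrets handling and valid TLS
# certificates in a real deployment.
import requests

BMC = "https://192.0.2.10"        # placeholder BMC address on the OOB network
AUTH = ("admin", "changeme")      # placeholder credentials

# List the systems managed by this BMC and print power state and health.
resp = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False, timeout=10)
resp.raise_for_status()
for member in resp.json().get("Members", []):
    system = requests.get(f"{BMC}{member['@odata.id']}",
                          auth=AUTH, verify=False, timeout=10).json()
    print(system.get("Id"), system.get("PowerState"),
          system.get("Status", {}).get("Health"))
```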