Networking Logical Architecture#
This Enterprise RA uses a converged spine-leaf network that provides the physical fabrics for use cases such as inference and fine-tuning. Each fabric role is explained in more detail below:
GPU Compute (East/West) Network
The GPU Compute (East/West) network is an RDMA-based fabric in a leaf-spine architecture in which the GPUs are connected in a rail-optimized network topology through their respective SuperNICs. Given the large number of GPU endpoints, this design provides the most efficient communication for multi-GPU applications within and across the compute nodes. The leaf-spine architecture also allows for future extensibility.
CPU Converged (North/South) Network
The CPU Converged (North/South) network connects the nodes using two 400 GbE ports to two separate switches to provide redundancy and high storage throughput. This network is used for node communications with compute, storage, in-band management, and end-user connections.
Storage (North/South) connectivity
This network provides converged connectivity for the storage infrastructure. Storage is attached principally to this network and is then presented to the nodes over the CPU Converged (North/South) network as required.
Customer (North/South) Network Connectivity
Typically, this is the upstream connectivity that connects the cluster to the rest of an enterprise customer's networking infrastructure. It is provisioned to provide ample bandwidth for typical user traffic.
Support Server Networking
This is dedicated networking for the support servers, which generally provide management, provisioning, monitoring, and control services to the rest of the cluster. High-performance networking is used here because of requirements such as cluster deployment and imaging.
Out-of-band Management Networking for the infrastructure
All infrastructure requires management. This network provides bulk 1 Gb RJ45 management connectivity for all nodes, using low-cost bulk management switches. These switches have upstream connections to the core networking infrastructure, allowing flexibility and wider consolidation of services such as management and monitoring.
Note
VLAN isolation would be used to provide logical separation of the networks above over the single physical fabric.
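As an illustration of the note above, the sketch below lays out the logically separated networks that share the converged fabric as a simple VLAN plan. The network names and roles come from this section; the VLAN IDs are placeholder values for illustration only, not prescribed assignments.

```python
# Hypothetical VLAN plan for the logically separated networks described above.
# Network names and roles come from this section; the VLAN IDs are
# illustrative placeholders only, not prescribed assignments.
VLAN_PLAN = {
    101: "GPU Compute (East/West): RDMA, rail-optimized GPU-to-GPU traffic",
    201: "CPU Converged (North/South): compute, storage, in-band mgmt, end users",
    202: "Storage (North/South): storage infrastructure attachment",
    301: "Customer (North/South): upstream enterprise connectivity",
    302: "Support Server Networking: management, provisioning, monitoring",
    401: "Out-of-band Management: BMC and switch management (1 GbE)",
}

for vlan_id, role in sorted(VLAN_PLAN.items()):
    print(f"VLAN {vlan_id:4d}  {role}")
```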
Enterprise RA Scalable Unit (SU)#
This Enterprise RA is built on scalable units (SUs) of 4 compute nodes. Each SU is a discrete unit of compute sized to match the port availability of the network devices. SUs can be replicated to adjust the scale of the deployment with ease.
Figure 7: Example diagram of a 4-node SU.
The 4-node scalable unit provides the following connectivity building blocks (the aggregate bandwidth arithmetic is reproduced in a sketch after this list):
For the Compute (East/West) fabric: 4 servers, each with 8x NVIDIA ConnectX-8 SuperNICs whose 800 Gb/s ports are broken out to 2x 400 Gb/s, providing 64x 400 Gb/s connections and a total aggregate bandwidth of 25.6 Tb/s
For the Converged (North/South) fabric: 4 servers, each with 1x NVIDIA BlueField-3 B3240 DPU, providing 8x 400 Gb/s connections and a total aggregate bandwidth of 3.2 Tb/s
For the Out-of-Band Management fabric: 4 servers, each with 6x 1 Gb/s connections, providing 24x 1 Gb/s connections for management
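The aggregate figures above follow directly from the per-server configuration. The short sketch below reproduces the arithmetic, assuming each 800 Gb/s ConnectX-8 port is broken out into 2x 400 Gb/s connections and each BlueField-3 B3240 DPU exposes two 400 GbE ports.

```python
# Back-of-the-envelope check of the 4-node SU connectivity figures above.
# Assumptions (from this section): each ConnectX-8 SuperNIC port is broken
# out into 2x 400 Gb/s connections, and each BlueField-3 B3240 DPU exposes
# two 400 GbE ports.
NODES_PER_SU = 4
CX8_PER_NODE = 8          # ConnectX-8 SuperNICs per server
CX8_BREAKOUT = 2          # 800 Gb/s port -> 2x 400 Gb/s connections
DPU_PORTS_PER_NODE = 2    # BlueField-3 B3240 dual 400 GbE ports
OOB_PORTS_PER_NODE = 6    # 1 Gb/s management connections per server

compute_links = NODES_PER_SU * CX8_PER_NODE * CX8_BREAKOUT   # 64
compute_bw_tbps = compute_links * 400 / 1000                 # 25.6 Tb/s
converged_links = NODES_PER_SU * DPU_PORTS_PER_NODE          # 8
converged_bw_tbps = converged_links * 400 / 1000             # 3.2 Tb/s
oob_links = NODES_PER_SU * OOB_PORTS_PER_NODE                # 24

print(f"Compute (E/W):   {compute_links}x 400G = {compute_bw_tbps} Tb/s")
print(f"Converged (N/S): {converged_links}x 400G = {converged_bw_tbps} Tb/s")
print(f"OOB management:  {oob_links}x 1G")
```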
Spine-Leaf Networking#
The network fabrics are built using switches with NVIDIA Spectrum-X Ethernet technology in a full non-blocking fat-tree topology to provide the highest level of performance for applications running over the Enterprise RA configuration. The networks are RDMA-based fabrics in a leaf-spine architecture in which the GPUs are connected in a rail-optimized network topology through their respective SuperNICs. This design provides the most efficient communication for multi-GPU applications within and across the nodes.
The leaf-spine architecture supports a scalable and reliable network that can accommodate clusters of varied sizes using the same design. The compute network is designed to maximize bandwidth and minimize the network latency required to connect GPUs within a server and within a rail.
In addition to the compute fabric, the converged spine-leaf network also has the following attributes (the non-blocking condition is illustrated in a sketch after this list):
Provides high bandwidth to high performance storage and connects to the data center network.
Each compute node is connected with two 400 GbE ports and each management node is connected with two 200 GbE ports to two separate switches to provide redundancy and high storage throughput.
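A leaf-spine fabric is non-blocking when each leaf's aggregate uplink bandwidth to the spines matches or exceeds its aggregate downlink bandwidth to the nodes. The sketch below expresses that check generically; the example figures are illustrative and not tied to a specific table row in this document.

```python
# Generic non-blocking (1:1) check for a leaf switch: aggregate uplink
# bandwidth to the spine layer must be at least the aggregate downlink
# bandwidth toward the nodes. Example figures are illustrative only.
def leaf_oversubscription(downlinks: int, downlink_gbps: int,
                          uplinks: int, uplink_gbps: int) -> float:
    """Return the downlink:uplink bandwidth ratio (<= 1.0 means non-blocking)."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# A leaf with 64x 400G node-facing connections and 32x 800G spine uplinks
# carries equal bandwidth in both directions, i.e. a 1:1 (non-blocking) ratio.
print(leaf_oversubscription(downlinks=64, downlink_gbps=400,
                            uplinks=32, uplink_gbps=800))   # 1.0
```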
Out-of-Band (OOB) Management Network#
The OOB management network connects the following components:
Baseboard management controller (BMC) ports of the server nodes
BMC ports of the BlueField-3 DPUs and ConnectX-8 SuperNICs
OOB management ports of switches
This OOB network can also connect other devices whose management connectivity should be physically isolated for security purposes. The NVIDIA SN2201 switch is used to connect to the 1 Gbps BMC/OOB ports of these components. Scale out this network using a 25 Gbps or 100 Gbps spine layer to aggregate the NVIDIA SN2201 uplinks.
Optimized Use Case#
This architecture is optimized for both inference and training use cases. Deployment of this single architecture will enable both types of workloads.
Architecture Details#
The components of the NVIDIA HGX B300 Spectrum Enterprise RA using Spectrum-X 800 Gbps for the compute fabric are described in Table 5.
Table 5: HGX B300 Server Ethernet components with Spectrum-X fabric
| Component | Technology |
|---|---|
| Compute node servers (4-128) | NVIDIA HGX B300 Servers with 8 NVIDIA B300 SXM GPUs, configured with 8x NVIDIA ConnectX-8 SuperNICs and 1x NVIDIA BlueField-3 B3240 DPU per server |
| Compute (East/West) Spine-Leaf Fabric | NVIDIA SN5600 128-port 400 GbE switches |
| Converged (North/South) Spine-Leaf Fabric | NVIDIA SN5600 128-port 400 GbE switches |
| OOB management fabric | NVIDIA SN2201 48-port 1 GbE plus 4-port 100 GbE switches |
Compute (Node East/West) Fabric Table#
Table 6 below shows the number of switches, transceivers, and cables required for the dual-plane Compute (Node East/West) fabric at different SU sizes.
Table 6: Compute (Node East/West) and switch component count
| Nodes | GPUs | SUs | Leaf Switches | Spine Switches | Uplinks per Leaf to Spine @ 400G | Node-to-Leaf Transceivers (Compute) | Node-to-Leaf Transceivers (Switch) | Switch-to-Switch Transceivers | Compute-to-Leaf Cables | Switch-to-Switch Cables |
|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 256 | 8 | 8 | 4 | 32 | 256 | 256 | 512 | 512 | 512 |
| 64 | 512 | 16 | 16 | 8 | 16 | 512 | 512 | 1024 | 1024 | 1024 |
| 128 | 1024 | 32 | 32 | 16 | 8 | 1024 | 1024 | 2048 | 2048 | 2048 |
Figures shown are aggregate amounts for both planes.
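The node-side columns of Table 6 follow mechanically from the server configuration. The sketch below derives them for each cluster size, assuming one dual-port (2x 400G) transceiver per ConnectX-8 SuperNIC on the node side and per corresponding 800G leaf port on the switch side, and counting one cable per 400G connection; the switch-to-switch columns depend on the leaf and spine layout and are not derived here.

```python
# Derive the node-side compute-fabric counts of Table 6 for each cluster size.
# Assumptions: 8 ConnectX-8 SuperNICs per node; one dual-port (2x 400G)
# transceiver per NIC on the node side and per matching 800G port on the
# leaf side; one cable per 400G connection (one connection per GPU per plane).
# Switch-to-switch transceivers and cables are not derived here.
NICS_PER_NODE = 8
CONNECTIONS_PER_NIC = 2   # each 800G port broken out to 2x 400G, one per plane

print("Nodes  GPUs  N2L-xcvr(compute)  N2L-xcvr(switch)  Compute-to-Leaf cables")
for nodes in (32, 64, 128):
    gpus = nodes * 8
    node_xcvrs = nodes * NICS_PER_NODE                    # 256 / 512 / 1024
    switch_xcvrs = node_xcvrs                             # same count on the leaf side
    cables = nodes * NICS_PER_NODE * CONNECTIONS_PER_NIC  # 512 / 1024 / 2048
    print(f"{nodes:5d} {gpus:5d} {node_xcvrs:18d} {switch_xcvrs:17d} {cables:23d}")
```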
Converged (Node North/South) Fabric Table#
Table 7 below shows the number of cables and switches required for the Converged (Node North/South) fabric at different SU sizes. For lower node counts, such as 16 nodes, where the Converged Network is consolidated with the Compute Network, the transceivers and cables in Table 7 are in addition to those used for the East/West fabric.
Table 7: Converged (Node North/South) and switch component count
| Nodes | GPUs | SUs | Leaf Switches | Spine Switches | CPU (Node N-S) Ports | Storage Ports | Mgmt. Uplink Ports | Customer Ports | Support Ports | ISL Ports (both ends) | Node-to-Leaf Transceivers (Node) | Node-to-Leaf Transceivers (Switch) | Switch-to-Switch Transceivers | Node-to-Leaf Cables | Switch-to-Switch Cables |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 256 | 8 | 2 | N/A | 32 | 8 | 4 | 16 | 4 | 26 | 188 | 62 | 26 | 188 | 26 |
| 64 | 512 | 16 | 4 | 2 | 64 | 16 | 4 | 32 | 4 | 256 | 352 | 118 | 256 | 352 | 256 |
| 128 | 1024 | 32 | 8 | 4 | 128 | 32 | 8 | 64 | 4 | 512 | 688 | 232 | 512 | 688 | 512 |
32 Nodes with 256 NVIDIA HGX B300 SXM GPUs#
Figure 8: NVIDIA HGX B300 Spectrum Enterprise RA with 32 servers (256 GPUs) and dual plane compute fabric.
Architecture Overview
32 NVIDIA HGX B300 Servers (256 GPUs) in 2-8-9-800 configuration with NVIDIA Spectrum-X compute fabric
8 scalable units (SUs) with 4 server nodes per unit
Network Design
Cost-efficient converged two-switch fabric for CPU (North/South) Network
GPU Compute (East/West) Network uses separate, isolated dual-plane leaf-spine architecture for high-bandwidth, low-latency communication
Dual-plane rail-optimized, non-blocking Spectrum fabric with 64-port switches
Port breakout functionality consolidates ports while maintaining resiliency
4 leaf switches per plane, each supporting 2 rails (1+5, 2+6, 3+7, 4+8)
Each GPU has an individual 400Gb connection to each plane via an 800Gb port broken out to 2x 400Gb (see the rail-mapping sketch after this list)
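As a minimal sketch of the rail-optimized assignment described above, the snippet below maps each GPU to the leaf that carries its rail in each plane, assuming the conventional mapping of GPU n to rail n; the helper is purely illustrative, not a configuration tool.

```python
# Rail-optimized leaf assignment for one plane of the 32-node design:
# 4 leaf switches per plane, each carrying a pair of rails (1+5, 2+6, 3+7, 4+8).
# Assumption: GPU n attaches to rail n, and its 800G port is broken out into
# one 400G link per plane, landing on the same leaf index in each plane.
RAIL_PAIRS = [(1, 5), (2, 6), (3, 7), (4, 8)]

def leaf_for_rail(rail: int) -> int:
    """Return the leaf index (0-3) within a plane that carries this rail."""
    for leaf, pair in enumerate(RAIL_PAIRS):
        if rail in pair:
            return leaf
    raise ValueError(f"rail must be 1-8, got {rail}")

def gpu_attachment(gpu: int):
    """Where GPU `gpu` (1-8) of any node attaches: one 400G link per plane."""
    leaf = leaf_for_rail(gpu)   # rail-optimized: GPU index == rail index
    return [("plane-A", leaf), ("plane-B", leaf)]

# Example: GPU 6 (rail 6) lands on leaf index 1 of each plane (rail pair 2+6).
print(gpu_attachment(6))   # [('plane-A', 1), ('plane-B', 1)]
```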
Connectivity (Under Optimal Conditions)
64x 400G uplinks per SU for GPU Compute traffic (East/West)
8x 200G uplinks per SU for CPU traffic (North/South)
Up to 8 support servers, dual-connected at 200Gb
64x 100G/200G connections to customer network (minimum 25Gb bandwidth per GPU)
32x 100G/200G connections for storage (minimum 12.5Gb bandwidth per GPU), as checked in the sketch after this list
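The per-GPU bandwidth figures quoted above can be reproduced directly from the connection counts; the sketch below does so for the 32-node configuration, using the conservative 100G end of the quoted 100G/200G range.

```python
# Check the minimum per-GPU bandwidth figures for the 32-node configuration,
# assuming the conservative 100G end of the quoted 100G/200G range.
GPUS = 256

customer_links, storage_links = 64, 32
customer_per_gpu = customer_links * 100 / GPUS   # 25 Gb/s per GPU
storage_per_gpu = storage_links * 100 / GPUS     # 12.5 Gb/s per GPU

print(f"Customer network: {customer_per_gpu} Gb/s per GPU")
print(f"Storage:          {storage_per_gpu} Gb/s per GPU")
```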
Management
SN2201 switch for every 2 SUs (4 switches total)
Each SN2201 uplinked to core via 2x 100G links
200 GbE management fabric with HA using eight 100GbE core connections
All nodes connected to 1Gb management switches (the sizing arithmetic is sketched after this list)
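The management sizing above can be sanity-checked with the same kind of arithmetic. The sketch below counts only the six 1 Gb server management connections noted in the SU description against the SN2201's 48x 1 Gb ports; switch OOB ports and other devices would consume additional ports and are not counted here.

```python
# Sanity check of the 32-node OOB management sizing: one SN2201 per 2 SUs,
# two 100G core uplinks per SN2201. Only server management ports are counted;
# switch OOB ports and other devices would need additional ports.
import math

SUS = 8
NODES_PER_SU = 4
MGMT_PORTS_PER_NODE = 6      # 1 Gb/s connections per server (see SU description)
SN2201_1G_PORTS = 48
UPLINKS_PER_SN2201 = 2       # 2x 100G to the core

sn2201_count = math.ceil(SUS / 2)                          # 4 switches
ports_per_switch = 2 * NODES_PER_SU * MGMT_PORTS_PER_NODE  # 48 of 48 ports
core_links = sn2201_count * UPLINKS_PER_SN2201             # 8x 100GbE

print(sn2201_count, ports_per_switch, SN2201_1G_PORTS, core_links)
```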
Additional Considerations
VLAN isolation separates the North/South and East/West fabrics that are collapsed onto the shared physical infrastructure
Rack layout must provide power supply redundancy; otherwise, consider an alternative rack layout
The number of GPU servers per rack depends on available rack power
64 Nodes with 512 NVIDIA HGX B300 SXM GPUs#
Figure: HGX B300 Spectrum Enterprise RA with 64 servers (512 GPUs) with dual plane compute fabric.
Architecture Overview
64 NVIDIA HGX B300 Servers (512 GPUs) in 2-8-9-800 configuration with NVIDIA Spectrum-X compute fabric
16 scalable units (SUs) with 4 server nodes per unit
Network Design
Cost-efficient converged two-switch fabric for CPU (North/South) Network
GPU Compute (East/West) Network uses separate, isolated dual-plane leaf-spine architecture for high-bandwidth, low-latency communication
Dual-plane rail-optimized, non-blocking Spectrum fabric with 64-port switches
Port breakout functionality consolidates ports while maintaining resiliency
8 leaf switches (in 2 blocks of 4) per plane, each supporting 2 rails (1+5, 2+6, 3+7, 4+8)
2 leaf switches supporting each pair of rails with corresponding spine increase
Each GPU has individual 400Gb connection to each plane via 800Gb port broken out to 2x400Gb
Connectivity (Under Optimal Conditions)
64x 400G uplinks per SU for GPU Compute traffic (East/West)
8x 200G uplinks per SU for CPU traffic (North/South)
Up to 8 support servers, dual-connected at 200Gb
128x 100G/200G connections to customer network (minimum 25Gb bandwidth per GPU)
64x 100G/200G connections for storage (minimum 12.5Gb bandwidth per GPU)
Management
SN2201 switch for every 2 SUs (8 switches total)
Each SN2201 uplinked to core via 2x 100G links
200 GbE management fabric with HA using sixteen 100GbE core connections
All nodes connected to 1Gb management switches
Additional Considerations
VLAN isolation separates fabrics on Converged (North/South) Network
Rack layout must provide power supply redundancy
The number of HGX B300 servers per rack depends on available rack power
A dual-port optic is recommended to simplify breakout at the NVIDIA ConnectX-8 OSFP port
128 Nodes with 1024 NVIDIA HGX B300 SXM GPUs#
Figure: HGX B300 Spectrum Enterprise RA with 128 servers (1024 GPUs) with dual plane compute fabric.
Architecture Overview
128 NVIDIA HGX B300 Servers (1024 GPUs) in 2-8-9-800 configuration with NVIDIA Spectrum-X compute fabric
32 scalable units (SUs) with 4 server nodes per unit
Network Design
Cost-efficient spine-leaf design for all fabrics
GPU Compute (East/West) Network uses separate, isolated dual-plane leaf-spine architecture for high-bandwidth, low-latency communication
Dual-plane rail-optimized, non-blocking Spectrum fabric with 64-port switches
Port breakout functionality consolidates ports while maintaining resiliency
16 leaf switches (in 4 blocks of 4) per plane, each supporting 2 rails (1+5, 2+6, 3+7, 4+8)
4 leaf switches supporting each pair of rails with corresponding spine increase
Each GPU has individual 400Gb connection to each plane via 800Gb port broken out to 2x400Gb
Connectivity (Under Optimal Conditions)
64x 400G uplinks per SU for GPU Compute traffic (East/West)
8x 200G uplinks per SU for CPU traffic (North/South)
Up to 8 support servers, dual-connected at 200Gb
256x 100G/200G connections to customer network (minimum 25Gb bandwidth per GPU)
128x 100G/200G connections for storage (minimum 12.5Gb bandwidth per GPU)
Management
NVIDIA SN2201 switch per SU (32 switches total)
Each NVIDIA SN2201 uplinked to core via 2x 100G links
200 GbE management fabric with HA using thirty-two 100GbE core connections
All nodes connected to 1Gb management switches
Additional Considerations
VLAN isolation separates fabrics on Converged (North/South) Network
Rack layout must provide power supply redundancy
The number of NVIDIA HGX B300 servers per rack depends on available rack power
A dual-port optic is recommended to simplify breakout at the NVIDIA ConnectX-8 OSFP port