Networking Logical Architecture#

This Enterprise RA uses a converged spine-leaf network that provides the fabrics required for use cases such as inferencing and fine-tuning. The role of each fabric is described below:

  • GPU Compute (East/West) Network

    The GPU Compute (East/West) network is an RDMA-based fabric in a leaf-spine architecture in which the GPUs are connected through their respective SuperNICs using a rail-optimized network topology (see the sketch following the note below). This design provides the most efficient communication for multi-GPU applications within and across the compute nodes, which require a large number of high-bandwidth endpoints. The leaf-spine architecture also allows for future extensibility.

  • CPU Converged (North/South) Network

    The CPU Converged (North/South) network connects each node using two 400 GbE ports to two separate switches to provide redundancy and high storage throughput. This network carries node traffic for compute, storage, in-band management, and end-user connections.

  • Storage (North/South) connectivity

    This network provides converged connectivity for the storage infrastructure. Storage is attached to this network and is then presented to the nodes over the CPU Converged (North/South) network as required.

  • Customer (North/South) Network Connectivity

    Typically, this is the upstream connectivity that connects the cluster to the rest of an enterprise customer's networking infrastructure. It is provisioned to provide ample bandwidth for typical user traffic.

  • Support Server Networking

    This is dedicated networking for the support servers, which generally provide management, provisioning, monitoring, and control services to the rest of the cluster. High-performance networking is used here because of requirements such as cluster deployment and imaging.

  • Out-of-band Management Networking for the infrastructure

    All infrastructure requires management. This network provides 1 Gb RJ45 management connectivity for all nodes using low-cost, high-port-count management switches. These switches have upstream connections to the core networking infrastructure, allowing flexibility and wider consolidation of services such as management and monitoring.

Note

VLAN isolation is used to provide logical separation of the networks above over a single physical fabric.
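
To make the rail-optimized wiring concrete, the sketch below maps a GPU's rail to a leaf switch, assuming the dual-plane layout described later in this document (each ConnectX-8 800 Gb port broken out into one 400 Gb link per plane, each leaf serving two rails, and one block of four leaf switches per 32 nodes per plane). The labels and function name are illustrative, not part of the RA.

```python
# Illustrative sketch (not from the RA itself): map a (node, rail) pair to a leaf
# switch in a dual-plane, rail-optimized fabric where each leaf serves two rails
# (1+5, 2+6, 3+7, 4+8) and a block of four leaf switches is added per 32 nodes per plane.

RAILS_PER_NODE = 8          # one ConnectX-8 SuperNIC / rail per GPU
NODES_PER_LEAF_BLOCK = 32   # nodes served by one block of 4 leaf switches in a plane

def leaf_for(node: int, rail: int, plane: int) -> str:
    """Return an illustrative leaf-switch label for a node, rail (1-8), and plane (0 or 1)."""
    if not 1 <= rail <= RAILS_PER_NODE:
        raise ValueError("rail must be between 1 and 8")
    rail_pair = (rail - 1) % 4            # rails 1+5 -> 0, 2+6 -> 1, 3+7 -> 2, 4+8 -> 3
    block = node // NODES_PER_LEAF_BLOCK  # additional leaf blocks are added as the cluster grows
    return f"plane{plane}-block{block}-leaf{rail_pair}"

# Example: rail 6 of node 40 on plane 1 lands on leaf 1 of the second leaf block.
print(leaf_for(node=40, rail=6, plane=1))   # plane1-block1-leaf1
```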

Enterprise RA Scalable Unit (SU)#

This Enterprise RA is built on scalable units (SUs) of 4 compute nodes. Each SU is a discrete unit of compute sized to match the port availability of the network devices, and SUs can be replicated to scale the deployment more easily.

Figure 7: Example diagram of 4-Node SU.

_images/hgx-ai-factory-07.png

The 4-Node scalable unit provides the following connectivity building blocks:

  • For the Compute (East/West) fabric: 4 servers, each with 8x NVIDIA ConnectX-8 SuperNICs, providing 64x 400 Gb/s connections and a total aggregate bandwidth of 25.6 Tb/s

  • For the Converged (North/South) fabric: 4 servers, each with 1x NVIDIA BlueField-3 B3240 dual-port 400 GbE DPU, providing 8x 400 Gb/s connections and a total aggregate bandwidth of 3.2 Tb/s

  • For the Out-of-band Management fabric: 4 servers, each with 6x 1 Gb/s connections, providing 24x 1 Gb/s connections for management
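
These per-SU figures follow directly from the port counts; the short check below is a minimal sketch using only the constants listed above and reproduces them.

```python
# Reproduce the per-SU connectivity figures from the port counts listed above.
SERVERS_PER_SU = 4
SUPERNICS_PER_SERVER = 8     # ConnectX-8, each 800 GbE port broken out to 2x 400 Gb/s
DPU_PORTS_PER_SERVER = 2     # BlueField-3 B3240 dual-port 400 GbE
OOB_PORTS_PER_SERVER = 6     # 1 Gb/s out-of-band management connections
LINK_GBPS = 400

compute_links = SERVERS_PER_SU * SUPERNICS_PER_SERVER * 2   # 64x 400 Gb/s
converged_links = SERVERS_PER_SU * DPU_PORTS_PER_SERVER     # 8x 400 Gb/s
oob_links = SERVERS_PER_SU * OOB_PORTS_PER_SERVER           # 24x 1 Gb/s

print(f"Compute:   {compute_links}x 400 Gb/s = {compute_links * LINK_GBPS / 1000} Tb/s")
print(f"Converged: {converged_links}x 400 Gb/s = {converged_links * LINK_GBPS / 1000} Tb/s")
print(f"OOB mgmt:  {oob_links}x 1 Gb/s")
```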

Spine-Leaf Networking#

The network fabrics are built using switches with NVIDIA Spectrum-X Ethernet technology in a fully nonblocking fat-tree topology to provide the highest level of performance for applications running on the Enterprise RA configuration. The networks are RDMA-capable fabrics in a leaf-spine architecture in which the GPUs are connected through their respective SuperNICs using a rail-optimized network topology. This design provides the most efficient communication for multi-GPU applications within and across the nodes.

The leaf-spine architecture supports a scalable and reliable network that can accommodate clusters of varied sizes using the same design. The compute network is designed to maximize bandwidth and minimize latency when connecting GPUs within a server and within a rail.
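
A fat tree is nonblocking when each leaf's uplink capacity matches its downlink capacity. The sketch below checks that condition with per-leaf link counts consistent with Table 6 (64x 400 Gb/s downlinks and 64x 400 Gb/s uplinks per leaf); those per-leaf figures are assumptions derived from the table, not statements from the RA.

```python
# Check the 1:1 (nonblocking) oversubscription condition for a leaf switch:
# aggregate uplink bandwidth must be at least the aggregate downlink bandwidth.
def oversubscription(downlinks: int, down_gbps: int, uplinks: int, up_gbps: int) -> float:
    """Downlink-to-uplink bandwidth ratio; 1.0 or lower means nonblocking."""
    return (downlinks * down_gbps) / (uplinks * up_gbps)

# Per-leaf figures consistent with Table 6: 32 nodes x 2 rails = 64 downlinks per leaf,
# matched by 64x 400 Gb/s uplinks toward the spine layer.
ratio = oversubscription(downlinks=64, down_gbps=400, uplinks=64, up_gbps=400)
print(f"oversubscription ratio: {ratio:.1f}:1")   # 1.0:1 -> nonblocking
```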

In addition to the compute fabric, the converged spine-leaf network also provides the following attributes:

  • Provides high bandwidth to high performance storage and connects to the data center network.

  • Each compute node is connected with two 400 GbE ports and each management node is connected with two 200 GbE ports to two separate switches to provide redundancy and high storage throughput.

Out-of-Band (OOB) Management Network#

The OOB management network connects the following components:

  • Baseboard management controller (BMC) ports of the server nodes

  • BMC ports of the BlueField-3 DPUs and ConnectX-8 SuperNICs

  • OOB management ports of switches

This OOB network can also connect other devices whose management connectivity should be physically isolated for security purposes. The NVIDIA SN2201 switch is used to connect to the 1 Gbps BMC/OOB ports of these components. Scale out this network using a 25 Gbps or 100 Gbps spine layer that aggregates the NVIDIA SN2201 uplinks.
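
As a sizing illustration, the sketch below estimates the number of SN2201 switches from the per-node OOB port count (6x 1 Gb connections per node, as listed for the SU) and the SN2201's 48x 1 Gb ports. The helper name is illustrative; actual switch placement also depends on rack layout.

```python
import math

# Estimate the number of OOB management switches from 1 Gb port demand.
OOB_PORTS_PER_NODE = 6   # BMC/OOB connections per compute node (server BMC, DPU, SuperNICs)
SN2201_1G_PORTS = 48     # 1 Gb RJ45 ports on an NVIDIA SN2201 (uplinks use separate 100 GbE ports)

def sn2201_count(nodes: int) -> int:
    """Illustrative estimate of SN2201 switches needed for a given node count."""
    return math.ceil(nodes * OOB_PORTS_PER_NODE / SN2201_1G_PORTS)

for nodes in (32, 64, 128):
    print(f"{nodes} nodes -> {sn2201_count(nodes)} SN2201 switches")
# 32 -> 4, 64 -> 8, 128 -> 16: one switch per 8 nodes, i.e. per 2 scalable units.
```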

Optimized Use Case#

This architecture is optimized for both inference and training use cases; a single deployment supports both types of workloads.

Architecture Details#

The components of the NVIDIA HGX B300 Spectrum Enterprise RA using Spectrum-X 800 Gbps for the compute fabric are described in Table 5.

Table 5: HGX B300 Server Ethernet components with Spectrum-X fabric

Component                                   | Technology
--------------------------------------------|-----------------------------------------------------------------------------------
Compute node servers (4-128)                | NVIDIA HGX B300 Servers with 8x NVIDIA B300 SXM GPUs, each configured with one NVIDIA BlueField-3 B3240 dual-port 400 GbE DPU and eight NVIDIA ConnectX-8 single-port 800 GbE SuperNICs (configured for 2x 400 Gb Ethernet breakout)
Compute (East/West) Spine-Leaf Fabric       | NVIDIA SN5600 128-port 400 GbE switches
Converged (North/South) Spine-Leaf Fabric   | NVIDIA SN5600 128-port 400 GbE switches
OOB management fabric                       | NVIDIA SN2201 48-port 1 Gb plus 4-port 100 GbE switches

Compute (Node East/West) Fabric Table#

Table 6 below shows the number of switches, transceivers, and cables required for the dual-plane Compute (Node East/West) fabric at different SU sizes.

Table 6: Compute (Node East/West) and switch component count

Nodes | GPUs | SUs | Leaf | Spine | Uplinks per leaf to spine @ 400G | Node-to-leaf transceivers (compute) | Node-to-leaf transceivers (switch) | Switch-to-switch transceivers | Compute-to-leaf cables | Switch-to-switch cables
------|------|-----|------|-------|----------------------------------|-------------------------------------|------------------------------------|-------------------------------|------------------------|------------------------
32    | 256  | 8   | 8    | 4     | 32                               | 256                                 | 256                                | 512                           | 512                    | 512
64    | 512  | 16  | 16   | 8     | 16                               | 512                                 | 512                                | 1024                          | 1024                   | 1024
128   | 1024 | 32  | 32   | 16    | 8                                | 1024                                | 1024                               | 2048                          | 2048                   | 2048

Figures shown are aggregate amounts for both planes.
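
The compute-fabric quantities scale linearly with node count. The sketch below reproduces the rows of Table 6 from the node count; the per-node and per-leaf factors are read off the table itself and are stated here as assumptions rather than a general sizing rule.

```python
# Reproduce the Table 6 rows from a node count (figures aggregate both planes).
# The factors below are read off Table 6; they are not a general sizing rule.
def compute_fabric_counts(nodes: int) -> dict:
    leaf = nodes // 4    # one leaf switch per 4 nodes (both planes combined)
    spine = nodes // 8   # one spine switch per 8 nodes (both planes combined)
    return {
        "gpus": nodes * 8,
        "sus": nodes // 4,
        "leaf": leaf,
        "spine": spine,
        # 64x 400G uplinks per leaf, spread across the spines in that leaf's plane
        "uplinks_per_leaf_to_spine_400g": 64 // (spine // 2),
        "node_to_leaf_transceivers_per_end": nodes * 8,   # compute end and switch end each
        "switch_to_switch_transceivers": leaf * 64,
        "compute_to_leaf_cables": nodes * 16,             # 8 SuperNICs x 2x 400G breakout per node
        "switch_to_switch_cables": leaf * 64,
    }

for n in (32, 64, 128):
    print(n, compute_fabric_counts(n))
```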

Converged (Node North/South) Fabric Table#

Table 7 below shows the number of switches, transceivers, and cables required for the Converged (Node North/South) fabric at different SU sizes. For lower node counts, such as 16 nodes, where the Converged Network is consolidated onto the Compute Network, the transceiver and cable quantities in Table 7 are in addition to those used for the East/West fabric.

Table 7: Converged (Node North/South) and switch component count

Nodes | GPUs | SUs | Leaf | Spine | CPU (Node N-S) ports | Storage ports | Mgmt. uplink ports | Customer ports | Support ports | ISL ports (both ends) | Node-to-leaf transceivers (node) | Node-to-leaf transceivers (switch) | Switch-to-switch transceivers | Node-to-leaf cables | Switch-to-switch cables
------|------|-----|------|-------|----------------------|---------------|--------------------|----------------|---------------|-----------------------|----------------------------------|------------------------------------|-------------------------------|---------------------|------------------------
32    | 256  | 8   | 2    | N/A   | 32                   | 8             | 4                  | 16             | 4             | 26                    | 188                              | 62                                 | 26                            | 188                 | 26
64    | 512  | 16  | 4    | 2     | 64                   | 16            | 4                  | 32             | 4             | 256                   | 352                              | 118                                | 256                           | 352                 | 256
128   | 1024 | 32  | 8    | 4     | 128                  | 32            | 8                  | 64             | 4             | 512                   | 688                              | 232                                | 512                           | 688                 | 512

32 Nodes with 256 NVIDIA HGX B300 SXM GPUs#

Figure 8: NVIDIA HGX B300 Spectrum Enterprise RA with 32 servers (256 GPUs) and dual-plane compute fabric.

_images/hgx-ai-factory-08.png

Architecture Overview

  • 32 NVIDIA HGX B300 Servers (256 GPUs) in 2-8-9-800 configuration with NVIDIA Spectrum-X compute fabric

  • 8 scalable units (SUs) with 4 server nodes per unit

Network Design

  • Cost-efficient converged two-switch fabric for CPU (North/South) Network

  • GPU Compute (East/West) Network uses separate, isolated dual-plane leaf-spine architecture for high-bandwidth, low-latency communication

  • Dual-plane rail-optimized, non-blocking Spectrum fabric with 64-port switches

  • Port breakout functionality consolidates ports while maintaining resiliency

  • 4 leaf switches per plane, each supporting 2 rails (1+5, 2+6, 3+7, 4+8)

  • Each GPU has an individual 400 Gb connection to each plane via its 800 Gb port broken out to 2x 400 Gb

Connectivity (Under Optimal Conditions)

  • 64x 400G uplinks per SU for GPU Compute traffic (East/West)

  • 8x 200G uplinks per SU for CPU traffic (North/South)

  • Up to 8 support servers, dual-connected at 200Gb

  • 64x 100G/200G connections to customer network (minimum 25Gb bandwidth per GPU)

  • 32x 100G/200G connections for storage (minimum 12.5Gb bandwidth per GPU)
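
The per-GPU minimums in the last two bullets are simply aggregate connection bandwidth divided by GPU count. The check below reproduces them for the 32-, 64-, and 128-node configurations at 100G (the case that gives the stated minimums).

```python
# Verify the per-GPU bandwidth minimums: aggregate connection bandwidth / GPU count.
def per_gpu_gbps(connections: int, gbps_per_connection: int, gpus: int) -> float:
    return connections * gbps_per_connection / gpus

# nodes: (GPUs, customer-network connections, storage connections), from this document
configs = {32: (256, 64, 32), 64: (512, 128, 64), 128: (1024, 256, 128)}

for nodes, (gpus, customer, storage) in configs.items():
    print(f"{nodes} nodes: "
          f"{per_gpu_gbps(customer, 100, gpus)} Gb/s per GPU to customer network, "
          f"{per_gpu_gbps(storage, 100, gpus)} Gb/s per GPU to storage")
# Each configuration gives 25.0 Gb/s (customer) and 12.5 Gb/s (storage) per GPU at 100G.
```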

Management

  • SN2201 switch for every 2 SUs (4 switches total)

  • Each SN2201 uplinked to core via 2x 100G links

  • 200 GbE management fabric with HA using eight 100GbE core connections

  • All nodes connected to 1Gb management switches

Additional Considerations

  • VLAN isolation separates the North/South and East/West fabrics where they are collapsed onto the same physical infrastructure

  • Rack layout must provide power supply redundancy; if it cannot, consider an alternative rack layout

  • The number of GPU servers per rack depends on available rack power

64 Nodes with 512 NVIDIA HGX B300 SXM GPUs#

Figure 9: NVIDIA HGX B300 Spectrum Enterprise RA with 64 servers (512 GPUs) and dual-plane compute fabric.

_images/hgx-ai-factory-09.png

Architecture Overview

  • 64 NVIDIA HGX B300 Servers (512 GPUs) in 2-8-9-800 configuration with NVIDIA Spectrum-X compute fabric

  • 16 scalable units (SUs) with 4 server nodes per unit

Network Design

  • Cost-efficient converged two-switch fabric for CPU (North/South) Network

  • GPU Compute (East/West) Network uses separate, isolated dual-plane leaf-spine architecture for high-bandwidth, low-latency communication

  • Dual-plane rail-optimized, non-blocking Spectrum fabric with 64-port switches

  • Port breakout functionality consolidates ports while maintaining resiliency

  • 8 leaf switches (in 2 blocks of 4) per plane, each supporting 2 rails (1+5, 2+6, 3+7, 4+8)

  • 2 leaf switches supporting each pair of rails with corresponding spine increase

  • Each GPU has an individual 400 Gb connection to each plane via its 800 Gb port broken out to 2x 400 Gb

Connectivity (Under Optimal Conditions)

  • 64x 400G uplinks per SU for GPU Compute traffic (East/West)

  • 8x 200G uplinks per SU for CPU traffic (North/South)

  • Up to 8 support servers, dual-connected at 200Gb

  • 128x 100G/200G connections to customer network (minimum 25Gb bandwidth per GPU)

  • 64x 100G/200G connections for storage (minimum 12.5Gb bandwidth per GPU)

Management

  • SN2201 switch for every 2 SUs (8 switches total)

  • Each SN2201 uplinked to core via 2x 100G links

  • 200 GbE management fabric with HA using sixteen 100GbE core connections

  • All nodes connected to 1Gb management switches

Additional Considerations

  • VLAN isolation separates the fabrics on the Converged (North/South) Network

  • Rack layout must provide power supply redundancy

  • The number of HGX B300 servers per rack depends on available rack power

  • A dual-port optic is recommended to simplify the breakout at the NVIDIA ConnectX-8 OSFP port

128 Nodes with 1024 NVIDIA HGX B300 SXM GPUs#

Figure 10: NVIDIA HGX B300 Spectrum Enterprise RA with 128 servers (1024 GPUs) and dual-plane compute fabric.

_images/hgx-ai-factory-10.png

Architecture Overview

  • 128 NVIDIA HGX B300 Servers (1024 GPUs) in 2-8-9-800 configuration with NVIDIA Spectrum-X compute fabric

  • 32 scalable units (SUs) with 4 server nodes per unit

Network Design

  • Cost-efficient spine-leaf design for all fabrics

  • GPU Compute (East/West) Network uses separate, isolated dual-plane leaf-spine architecture for high-bandwidth, low-latency communication

  • Dual-plane rail-optimized, non-blocking Spectrum fabric with 64-port switches

  • Port breakout functionality consolidates ports while maintaining resiliency

  • 16 leaf switches (in 4 blocks of 4) per plane, each supporting 2 rails (1+5, 2+6, 3+7, 4+8)

  • 4 leaf switches supporting each pair of rails with corresponding spine increase

  • Each GPU has an individual 400 Gb connection to each plane via its 800 Gb port broken out to 2x 400 Gb

Connectivity (Under Optimal Conditions)

  • 64x 400G uplinks per SU for GPU Compute traffic (East/West)

  • 8x 200G uplinks per SU for CPU traffic (North/South)

  • Up to 8 support servers, dual-connected at 200Gb

  • 256x 100G/200G connections to customer network (minimum 25Gb bandwidth per GPU)

  • 128x 100G/200G connections for storage (minimum 12.5Gb bandwidth per GPU)

Management

  • NVIDIA SN2201 switch for every 2 SUs (16 switches total)

  • Each NVIDIA SN2201 uplinked to core via 2x 100G links

  • 200 GbE management fabric with HA using thirty-two 100GbE core connections

  • All nodes connected to 1Gb management switches

Additional Considerations

  • VLAN isolation separates the fabrics on the Converged (North/South) Network

  • Rack layout must provide power supply redundancy

  • The number of NVIDIA HGX B300 servers per rack depends on available rack power

  • A dual-port optic is recommended to simplify the breakout at the NVIDIA ConnectX-8 OSFP port