Network Fabrics#

Building systems by SU provides the most efficient designs. However, if a different node count is required due to budgetary constraints, data center constraints, or other needs, the fabric should still be designed to support the full SU, including leaf switches and leaf-spine cables, with the portion of the fabric where these nodes would be located left unused. This ensures optimal traffic routing and consistent performance across all portions of the fabric.

DGX SuperPOD configurations use the following network fabrics:

  • Compute Fabric

  • Storage Fabric

  • Ethernet Fabric with two network segments

    • In-band Network

    • Out-of-band Network

Figure 5 shows the ports on the back of the DGX B300 CPU tray and the connectivity they provide. The compute fabric ports are on the edges of the tray and provide access to all eight GPUs. Each pair of in-band management and storage ports provides parallel pathways into the DGX B300 system for increased performance. The BMC port provides out-of-band access to the baseboard management controller (BMC).

_images/image6.png

Figure 5 DGX B300 network ports#

Compute Fabric#

Figure 6 shows the compute fabric layout for the full 512-node DGX SuperPOD. Each group of 64 nodes is rail-aligned, so traffic on each rail of a DGX B300 system is always one hop away from the other nodes in its SU. Traffic between nodes in different SUs, or between rails, traverses the spine layer.

The Spectrum-X based compute fabric features a twin-planar design (denoted in blue and green). Each GPU has 2x 400GbE connectivity through two different planes. The multi-planar design not only provides the high performance and low latency required for AI training, but also enhances fault tolerance because two independent data paths are available. A single switch, transceiver, or cable failure will not cause a catastrophic job abort; instead, the application can continue to operate at half of the original bandwidth.

Another benefit of the multi-planar design is a significantly larger SuperPOD scale with a two-layer, leaf-spine fabric. Because the two independent planes are not connected at a core layer, twice as many nodes can be connected to a two-layer fabric, reducing the total cost of construction while maintaining the same high standard of performance.
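To make the rail-aligned, twin-plane layout concrete, the following is a minimal sketch that assumes 64-node SUs, eight GPU rails per node, and one leaf switch per rail and plane; the naming scheme and counts are illustrative only and do not represent the exact cabling plan.

```python
# Illustrative model of a rail-aligned, twin-plane compute fabric.
# Assumptions (not the exact DGX SuperPOD cabling plan): 64-node SUs,
# 8 GPUs (rails) per node, one leaf switch per rail and plane.

NODES_PER_SU = 64
RAILS_PER_NODE = 8          # one rail per GPU
PLANES = ("blue", "green")  # the two independent fabric planes

def leaf_for(node_id: int, rail: int, plane: str) -> str:
    """Return the (illustrative) leaf switch a given GPU rail lands on."""
    su = node_id // NODES_PER_SU
    return f"su{su}-{plane}-leaf{rail}"

# All nodes of an SU share the same leaf on a given rail and plane, so
# rail-local traffic stays one hop away; inter-SU traffic crosses the spines.
assert leaf_for(0, 3, "blue") == leaf_for(63, 3, "blue")
assert leaf_for(0, 3, "blue") != leaf_for(64, 3, "blue")   # different SU
assert leaf_for(0, 3, "blue") != leaf_for(0, 3, "green")   # independent plane
```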

_images/image7.png

Figure 6 Compute fabric for full 512-node DGX SuperPOD#

The multi-planar (in this case, two-plane) fabric provides automatic load balancing through both hardware and software load balancers. With NVIDIA Route software and the onboard plane load-balancer unit on the ConnectX-8 NIC, traffic is automatically routed and balanced over the best path. In case of a single link failure, the GPU application continues to operate because it does not receive an aborted-connection error through the RDMA driver.
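As a conceptual illustration of this behavior (the actual balancing is performed by the ConnectX-8 NIC and the fabric, not by application code), the sketch below spreads flows across the two planes and shows aggregate bandwidth dropping to half when one plane fails; the per-plane bandwidth and plane names come from the description above, everything else is illustrative.

```python
# Conceptual sketch of twin-plane load balancing and failover.
# This only models the observable behavior: flows spread over healthy
# planes, and a single plane failure halves aggregate bandwidth rather
# than aborting the job.

PLANE_BW_GBPS = 400          # per-plane GPU connectivity (2x 400GbE per GPU)
planes_up = {"blue": True, "green": True}

def pick_plane(flow_id: int) -> str:
    """Pick a healthy plane for a flow (simple hash-based spread)."""
    healthy = [p for p, up in planes_up.items() if up]
    if not healthy:
        raise RuntimeError("no healthy plane: fabric unreachable")
    return healthy[hash(flow_id) % len(healthy)]

def aggregate_bw() -> int:
    """Aggregate per-GPU bandwidth across healthy planes, in Gbps."""
    return PLANE_BW_GBPS * sum(planes_up.values())

print(aggregate_bw())        # 800 Gbps with both planes healthy
planes_up["green"] = False   # e.g. a leaf switch or cable failure
print(aggregate_bw())        # 400 Gbps: the job continues at half bandwidth
print(pick_plane(42))        # all flows now land on the "blue" plane
```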

NVIDIA NetQ delivers real-time network telemetry and monitoring for Spectrum-X, enabling operators to gain visibility into the health and performance of the compute fabric. By providing tools for rapid troubleshooting and network validation, NetQ helps ensure reliable operation and efficient management of large-scale AI workloads in DGX SuperPOD environments.

InfiniBand Storage Fabric#

The storage fabric employs an InfiniBand network fabric, which is essential for maximum bandwidth (Figure 7), because the per-node I/O for the DGX SuperPOD must exceed 40 GBps. These high bandwidth requirements, combined with advanced fabric management features such as congestion control and adaptive routing (AR), provide significant benefits for the storage fabric.
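As a quick sanity check of that requirement, assuming each DGX B300 system attaches to the storage fabric with two NDR 400 Gb/s ports (the port count here is an assumption for illustration), the available line rate comfortably exceeds the 40 GBps per-node target:

```python
# Rough per-node storage bandwidth check (port count is illustrative).
NDR_LINK_GBPS = 400                    # NDR InfiniBand line rate per port
STORAGE_PORTS_PER_NODE = 2             # assumed storage ports per DGX B300

line_rate_gbytes = NDR_LINK_GBPS / 8   # 50 GB/s per port, ignoring overhead
per_node_gbytes = line_rate_gbytes * STORAGE_PORTS_PER_NODE

REQUIRED_GBYTES = 40                   # per-node I/O target from the text
print(per_node_gbytes, per_node_gbytes > REQUIRED_GBYTES)   # 100.0 True
```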

_images/image9.png

Figure 7 Storage fabric logical design#

There are two storage fabric options for NVIDIA DGX SuperPOD. The InfiniBand storage fabric uses MQM9700-NS2R (AC power) or MQM9701-NS2R (DC power) NDR switches (Figure 8). The high-speed storage devices are connected at a 1:1 port-to-uplink ratio. The DGX B300 system connections are slightly oversubscribed, with a ratio near 4:3, adjusted as needed to provide more flexibility in balancing storage cost and performance.
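The oversubscription ratio can be expressed simply as the ratio of node-facing to uplink bandwidth per leaf; the port counts in the sketch below are placeholders rather than the exact DGX SuperPOD port map, and the same calculation applies to the Ethernet storage option described later.

```python
# Illustrative oversubscription calculation for a storage leaf switch.
# Port counts are placeholders, not the exact DGX SuperPOD port map.
from fractions import Fraction

def oversubscription(downlink_ports: int, uplink_ports: int) -> Fraction:
    """Ratio of node-facing to uplink bandwidth (same speed on all ports)."""
    return Fraction(downlink_ports, uplink_ports)

print(oversubscription(16, 16))   # storage-facing ports at 1:1
print(oversubscription(16, 12))   # DGX-facing ports near 4:3
```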

_images/image10.png

Figure 8 MQM9701-NS2R switch#

Ethernet Storage Fabric#

There are two storage fabric options for NVIDIA DGX SuperPOD. The Ethernet storage fabric employs a high-speed Ethernet network fabric, which is essential for maximum bandwidth (Figure 9), because the per-node I/O for the DGX SuperPOD must exceed 40 GBps. These high bandwidth requirements, combined with advanced fabric management features, provide significant benefits for the storage fabric. Supported Ethernet storage appliances leverage RoCE (RDMA over Converged Ethernet) to provide the best performance while minimizing CPU usage.

_images/image11.png

Figure 9 Ethernet Storage fabric logical design#

The storage fabric uses SN5600 (AC power) or SN5600D (DC power) switches (Figure 10). The high-speed storage devices are connected at a 1:1 port-to-uplink ratio. The DGX B300 system connections are slightly oversubscribed, with a ratio near 4:3, adjusted as needed to provide more flexibility in balancing storage cost and performance.

_images/image12.jpg

Figure 10 NVIDIA Spectrum SN5600D Ethernet Switch#

Network Segmentation of the Ethernet Fabric#

On the DGX SuperPOD, the Ethernet fabric is divided into several network segments. In this reference design, the entire Ethernet fabric (except for any dedicated storage and compute fabrics) is built on a common physical network and segregated with VXLAN and EVPN to achieve network isolation between control traffic and admin traffic. VTEPs for the different network segments terminate on the leaf switches (either SN2201 or SN5600D) to provide access to the different networks.

The following sections introduce these networks in detail.
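As an illustration of how such a segmentation plan might be captured before it is rendered into switch configuration, the sketch below maps each segment to a hypothetical VLAN and VXLAN network identifier (VNI) and checks for collisions; all IDs are placeholders, not the values used in the reference design.

```python
# Hypothetical VXLAN segmentation plan for the Ethernet fabric.
# VLAN/VNI values are placeholders, not the reference-design values.

SEGMENTS = {
    "inband-management":  {"vlan": 100, "vni": 10100},
    "oob-management":     {"vlan": 200, "vni": 10200},
}

def validate(segments: dict) -> None:
    """Ensure no two segments share a VLAN or VNI before generating config."""
    for key in ("vlan", "vni"):
        values = [s[key] for s in segments.values()]
        assert len(values) == len(set(values)), f"duplicate {key} in plan"

validate(SEGMENTS)
for name, ids in SEGMENTS.items():
    # VTEPs for each segment terminate on the leaf switches (SN2201/SN5600D).
    print(f"{name}: vlan {ids['vlan']} -> vni {ids['vni']}")
```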

_images/image13.png

Figure 11 Network Segmentation diagram#

In-band Management Network#

The in-band management network provides several key functions:

  • Connects all the services that manage the cluster.

  • Enables access to the lower-speed, NFS tier of storage.

  • Provides uplink (border) connectivity for in-cluster services such as Mission Control, Base Command Manager, Slurm, and Kubernetes to services outside of the cluster, such as the NGC registry, code repositories, and data sources.

  • Provides end user access to the Slurm head nodes and Kubernetes services.

_images/image14.png

Figure 12 In-band Ethernet network#

The in-band management network uses SN5600D switches (Figure 10 and Figure 12).

Out-of-band Management Network#

Figure 13 shows the OOB Ethernet fabric. It connects the management ports of all devices, including DGX B300 compute trays, switches and other networking gear, management servers, storage, and rack PDUs, onto their own separate network. There is no use case in which a non-privileged user needs direct access to these ports, and they are secured using logical network separation.

The OOB network carries all IPMI-related control traffic and serves as the network for fabric management of the InfiniBand storage fabric and the compute fabric.
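As an example of the kind of IPMI control traffic that rides on the OOB network, the following sketch polls chassis power status through the standard ipmitool client; the BMC addresses and credentials are placeholders.

```python
# Poll BMC chassis power status over the OOB network using ipmitool.
# BMC addresses and credentials below are placeholders for illustration.
import subprocess

OOB_BMCS = ["10.0.100.11", "10.0.100.12"]   # hypothetical BMC IPs on the OOB network

def power_status(bmc_ip: str, user: str, password: str) -> str:
    """Return the chassis power status string reported by the BMC."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_ip,
         "-U", user, "-P", password, "chassis", "power", "status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

for bmc in OOB_BMCS:
    print(bmc, power_status(bmc, "admin", "example-password"))
```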

The OOB network is physically rolled up into the aggregation (spine) layer of each SU as a dedicated VXLAN.

_images/image15.png

Figure 13 Logical OOB management network layout#

The OOB management network uses SN2201 switches (Figure 14).

_images/image16.jpeg

Figure 14 SN2201 switch#

Customer Edge Connectivity#

To connect the DGX SuperPOD to the customer edge for uplink and corporate network access, we recommend at least 2x 100GbE links with DR1 single-mode connectivity to cope with the growing demand for high-speed data transfer into and out of the DGX SuperPOD.

For route handover, BGP is used to peer with the customer's network. Routes to and from the in-band and out-of-band networks are announced.
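A minimal sketch of what such a peering definition could look like, rendered in an FRR-style syntax from Python, is shown below; the ASNs, neighbor address, and announced prefixes are placeholders to be agreed with the customer.

```python
# Render an FRR-style BGP stanza for peering with the customer edge.
# ASNs, neighbor address, and prefixes are placeholders only.

LOCAL_ASN = 65100                    # hypothetical SuperPOD ASN
PEER_ASN = 65000                     # hypothetical customer ASN
PEER_IP = "192.0.2.1"                # documentation-range neighbor address
ANNOUNCED = ["10.130.0.0/16",        # e.g. in-band management prefix
             "10.131.0.0/16"]        # e.g. out-of-band management prefix

def render_bgp() -> str:
    lines = [f"router bgp {LOCAL_ASN}",
             f" neighbor {PEER_IP} remote-as {PEER_ASN}",
             " address-family ipv4 unicast"]
    lines += [f"  network {prefix}" for prefix in ANNOUNCED]
    lines += [" exit-address-family"]
    return "\n".join(lines)

print(render_bgp())
```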

Customers who cannot provide DR1-based connectivity are encouraged to use a pair of dedicated border leaf switches (not part of the SuperPOD) to provide the connectivity.

_images/image17.png

Figure 15 Customer Edge Example#

User Storage Connectivity#

For operation with DGX SuperPOD, customers are required to provide NFS-based home and configuration storage, which is integrated with NVIDIA Mission Control.

DGX SuperPOD with DGX B300 supports user storage integration over single-mode DR1 (or faster) connectivity. User storage is connected to the leaf-layer SN5600D switches.

If the customer-provided user storage does not support DR1 connectivity, we recommend implementing border ToRs that can connect to the SN5600Ds.

_images/image18.png

Figure 16 User Storage Example#