Network Fabrics

Building systems by SU provides the most efficient designs. However, if a different node count is required due to budgetary constraints, data center constraints, or other needs, the fabric should still be designed to support the full SU, including leaf switches and leaf-spine cables, with the portion of the fabric where the missing nodes would be located left unused. This ensures optimal traffic routing and consistent performance across all portions of the fabric.

DGX SuperPOD configurations utilize four network fabrics:

  • Compute Fabric

  • Storage Fabric

  • In-Band Management Network

  • Out-of-Band Management Network

Each network is detailed in this section.

Figure 4 shows the ports on the back of the DGX H100 CPU tray and the connectivity provided. The compute fabric ports in the middle use a two-port transceiver to access all eight GPUs. Each pair of in-band management and storage ports provides parallel pathways into the DGX H100 system for increased performance. The OOB port is used for BMC access. (The LAN port next to the BMC port is not used in DGX SuperPOD configurations.)

Figure 4. DGX H100 network ports

_images/network-arch-01.png

Compute—InfiniBand Fabric

Figure 5 shows the compute fabric layout for the full 127-node DGX SuperPOD. Each group of 32 nodes is rail-aligned. Within an SU, traffic on a given rail of the DGX H100 systems is always one hop away from the other 31 nodes. Traffic between SUs, or between rails, traverses the spine layer.
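This hop behavior can be summarized with a small model. The sketch below is illustrative only and is not NVIDIA tooling; the node/rail indexing and the 32-node SU grouping constant are assumptions chosen to mirror the description above.

```python
# Minimal sketch of rail-aligned routing in the compute fabric.
# Assumption: nodes are indexed so each consecutive block of 32 forms an SU,
# and each DGX H100 rail (0-7) lands on the matching leaf switch of its SU.

NODES_PER_SU = 32  # rail-aligned group size described above


def fabric_path(src_node: int, dst_node: int, src_rail: int, dst_rail: int) -> str:
    """Return which fabric layer a flow between two HCAs traverses."""
    same_su = (src_node // NODES_PER_SU) == (dst_node // NODES_PER_SU)
    same_rail = src_rail == dst_rail
    if same_su and same_rail:
        return "leaf only (one switch hop)"
    return "leaf -> spine -> leaf"


print(fabric_path(0, 17, src_rail=3, dst_rail=3))   # same SU, same rail: one hop
print(fabric_path(0, 40, src_rail=3, dst_rail=3))   # different SU: spine layer
print(fabric_path(5, 17, src_rail=1, dst_rail=6))   # different rail: spine layer
```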

Figure 5. Compute InfiniBand fabric for full 127 node DGX SuperPOD

_images/network-arch-02.png

Table 4 shows the number of cables and switches required for the compute fabric for different SU sizes.

Table 4. Compute fabric component count

SU Count  Node Count  GPU Count  Leaf Switches  Spine Switches  Compute and UFM Cables  Spine-Leaf Cables
1         31¹         248        8              4               252                     256
2         63          504        16             8               508                     512
3         95          760        24             16              764                     768
4         127         1016       32             16              1020                    1024

¹. This is a 32-node-per-SU design; however, one DGX system must be removed to accommodate UFM connectivity.
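For reference, the counts in Table 4 can be derived from the SU count. The short sketch below is illustrative only; the spine-switch counts are copied from the table (they do not follow a simple closed-form rule), and the four UFM links are inferred from the difference between the cable column and eight compute cables per node.

```python
# Reproduces the Table 4 compute fabric component counts from the SU count.
# SPINE_SWITCHES is taken directly from Table 4; UFM_LINKS = 4 is inferred
# from the "Compute and UFM" column minus 8 cables per node.

SPINE_SWITCHES = {1: 4, 2: 8, 3: 16, 4: 16}  # per Table 4
UFM_LINKS = 4                                # inferred from Table 4


def compute_fabric_counts(su_count: int) -> dict:
    nodes = 32 * su_count - 1  # one DGX system removed for UFM connectivity (footnote 1)
    return {
        "nodes": nodes,
        "gpus": 8 * nodes,
        "leaf_switches": 8 * su_count,
        "spine_switches": SPINE_SWITCHES[su_count],
        "compute_and_ufm_cables": 8 * nodes + UFM_LINKS,
        "spine_leaf_cables": 256 * su_count,
    }


for su in range(1, 5):
    print(su, compute_fabric_counts(su))
```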

Storage—InfiniBand Fabric

The storage fabric employs an InfiniBand network fabric because maximum bandwidth is essential (Figure 6): per-node I/O for the DGX SuperPOD must exceed 40 GBps. These high-bandwidth requirements, combined with advanced fabric management features such as congestion control and adaptive routing (AR), provide significant benefits for the storage fabric.
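To put the per-node requirement in perspective, the aggregate requirement scales linearly with node count; the arithmetic below is illustrative only and uses the full 127-node configuration.

```python
# Illustrative aggregate storage bandwidth, assuming every node sustains the
# stated 40 GB/s per-node minimum (full 127-node DGX SuperPOD).
nodes = 127
per_node_gbs = 40  # GB/s, per the requirement above
print(f"aggregate storage I/O >= {nodes * per_node_gbs / 1000:.2f} TB/s")  # ~5.08 TB/s
```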

Figure 6. InfiniBand storage fabric logical design

_images/network-arch-03.png

The storage fabric uses MQM9700-NS2F switches (Figure 7). Storage devices are connected at a 1:1 port-to-uplink ratio. The DGX H100 system connections are slightly oversubscribed, at a ratio near 4:3, which can be adjusted as needed to provide more flexibility in storage cost and performance.
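The oversubscription figure is simply the ratio of DGX-facing downlink bandwidth to spine-facing uplink bandwidth on a storage-fabric leaf. A minimal sketch follows; the port counts are hypothetical placeholders chosen to yield the ~4:3 ratio mentioned above, not SuperPOD port allocations.

```python
# Illustrative oversubscription calculation for a storage-fabric leaf switch,
# assuming all ports run at the same speed. Port counts are hypothetical.
from fractions import Fraction

dgx_facing_ports = 16    # hypothetical downlinks toward DGX H100 systems
spine_facing_ports = 12  # hypothetical uplinks toward the spine layer

ratio = Fraction(dgx_facing_ports, spine_facing_ports)
print(f"oversubscription ratio: {ratio.numerator}:{ratio.denominator}")  # 4:3
```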

Figure 7. MQM9700-NS2F switch

_images/network-arch-04.png

In-Band Management Network

The in-band management network provides several key functions:

  • Connects all the services that manage the cluster.

  • Enables access to the home filesystem and storage pool.

  • Provides connectivity for in-cluster services such as Base Command Manager and Slurm, and to services outside the cluster such as the NGC registry, code repositories, and data sources.

Figure 8 shows the logical layout of the in-band Ethernet network. The in-band network connects the compute nodes and management nodes. In addition, the OOB network is connected to the in-band network to provide high-speed interfaces from the management nodes, supporting parallel operations on devices connected to the OOB fabric, such as storage.

Figure 8. In-band Ethernet network

_images/network-arch-05.png

The in-band management network uses SN4600C switches (Figure 9).

Figure 9. SN4600C switch

_images/network-arch-06.png

Out-of-Band Management Network

Figure 10 shows the OOB Ethernet fabric. It connects the management ports of all devices, including DGX systems, management servers, storage, networking gear, rack PDUs, and all other devices. These ports are separated onto their own fabric because there is no use case in which users need access to them, and the fabric is secured using logical network separation.

Figure 10. Logical OOB management network layout

_images/network-arch-07.png

The OOB management network uses SN2201 switches (Figure 11).

Figure 11. SN2201 switch

_images/network-arch-08.png