Setting the InfiniBand Cluster Topology


InfiniBand fabric components can be connected in different topologies. The topology should be decided before building the cluster.

Fat-Tree is NVIDIA's recommended topology. An AI factory should be based on a rail-optimized design.

In a rail-optimized design, a GPU node with multiple interfaces connects each GPU “rail” (IB network interface) to a different first-level (leaf) switch of the cluster interconnect. This allows nodes to use their internal NVSwitch path to communicate through a NIC that is just one switch hop away, instead of crossing multiple switches and incurring additional latency.
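The rail-to-leaf assignment described above can be sketched in a few lines of Python. This is only an illustration: the function name, the `nodeNN`/`railN`/`leafN` naming scheme, and the assumption of exactly one leaf switch per rail are hypothetical and do not come from any NVIDIA tool.

```python
def rail_optimized_links(num_nodes, rails_per_node):
    """Return (node, rail, leaf) tuples for a rail-optimized layout.

    Assumption (illustrative only): there is one leaf switch per rail,
    so rail r of every node is cabled to leaf r.
    """
    links = []
    for node in range(num_nodes):
        for rail in range(rails_per_node):
            links.append((f"node{node:02d}", f"rail{rail}", f"leaf{rail}"))
    return links


def nodes_on_leaf(links, leaf):
    """List the nodes that have an interface on the given leaf switch."""
    return sorted({node for node, _rail, l in links if l == leaf})


# Example: 4 GPU nodes with 8 rails each (hypothetical sizes).
links = rail_optimized_links(num_nodes=4, rails_per_node=8)

# Every node places its 8 rails on 8 different leaf switches ...
leaves_of_node00 = {leaf for node, _rail, leaf in links if node == "node00"}

# ... and every leaf switch sees the same rail of all nodes, so traffic
# that stays on one rail crosses a single leaf switch.
peers_on_leaf3 = nodes_on_leaf(links, "leaf3")
```

In a larger cluster, a rail would typically span several leaf switches; the sketch keeps one leaf per rail only to make the mapping easy to read.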

The diagrams below show examples of cluster topologies for an AI factory (based on a rail-optimized design):

Figure: 3 Tier Topology

Figure: 2 Tier Topology
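As a rough guide to when a third tier becomes necessary, the textbook capacity of a non-blocking fat-tree built from identical switches can be computed as follows. This is a minimal sketch under stated assumptions: every switch has the same port count, every non-top tier splits its ports evenly between downlinks and uplinks, and no ports are reserved for other purposes. The function name and the 64-port and 40-port values are illustrative, and real cluster designs may differ.

```python
def max_endpoints(switch_ports, tiers):
    """Maximum endpoints of a non-blocking fat-tree of identical switches.

    Top-tier switches use all their ports as downlinks; every lower tier
    uses half of its ports as downlinks and half as uplinks.
    """
    if tiers < 1:
        raise ValueError("tiers must be at least 1")
    if switch_ports < 2 or switch_ports % 2:
        raise ValueError("switch_ports must be an even number >= 2")
    half = switch_ports // 2
    return switch_ports * half ** (tiers - 1)


# Illustrative port counts (assumptions, not a sizing recommendation).
two_tier_64 = max_endpoints(64, tiers=2)    # 64 * 32
three_tier_64 = max_endpoints(64, tiers=3)  # 64 * 32 * 32
two_tier_40 = max_endpoints(40, tiers=2)    # 40 * 20
```

With k-port switches this gives k²/2 endpoints for two tiers and k³/4 for three tiers, which is why clusters that outgrow the two-tier limit move to a three-tier design.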

To choose the cluster plan that best fits the cluster's needs, contact NVIDIA Support.

After selecting the InfiniBand interconnect principles, create a PTP Excel file that describes the cluster connectivity. This file is used to generate the Topology file.

Note

It is imperative to verify that the cluster has been connected according to the cluster plan. Cluster maintenance depends on the physical connectivity matching the plan.
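The kind of check this note calls for can be sketched as a set comparison between planned and discovered links. This is a hedged illustration only: the data layout (each link as a pair of `(device, port)` endpoints), the function name, and the sample device names are assumptions. It does not parse the PTP file or the Topology file, whose formats are not described here.

```python
def compare_links(planned, discovered):
    """Return (missing, unexpected) links between a plan and the real cabling.

    Each link is a pair of (device, port) endpoints. Links are treated as
    undirected, so the order of the two endpoints does not matter.
    """
    planned_set = {frozenset(link) for link in planned}
    discovered_set = {frozenset(link) for link in discovered}

    def as_sorted_list(link_set):
        # Convert back to plain tuples with a stable, readable ordering.
        return sorted(tuple(sorted(link)) for link in link_set)

    missing = as_sorted_list(planned_set - discovered_set)
    unexpected = as_sorted_list(discovered_set - planned_set)
    return missing, unexpected


# Hypothetical sample data: one cable was plugged into the wrong leaf.
planned = [
    (("node01", "ib0"), ("leaf0", "p1")),
    (("node01", "ib1"), ("leaf1", "p1")),
]
discovered = [
    (("leaf0", "p1"), ("node01", "ib0")),  # same link, endpoints swapped
    (("node01", "ib1"), ("leaf0", "p2")),  # miscabled link
]

missing, unexpected = compare_links(planned, discovered)
```

An empty `missing` list and an empty `unexpected` list would indicate that the discovered cabling matches the plan exactly.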

© Copyright 2024, NVIDIA. Last updated on May 28, 2024.