InfiniBand Cluster Bring-up Procedure

Adaptive Routing

Adaptive Routing (AR) enables the switch to select the output port based on the port's load. It assumes there are no constraints on the output port selection (free adaptive routing). The subnet manager (SM) enables and configures the AR mechanism on the fabric switches: it scans all the fabric switches, identifies which ones support AR, and configures the AR functionality on those switches. The SM configures the AR groups and AR LFTs so that a switch can select an output port out of an AR group for a specific destination LID. The configuration of the AR groups relies on the selection of one of the following supported algorithms:

  • LAG: All ports that are linked to the same remote switch are in the same AR group. This algorithm is suitable for any topology with multiple links between switches, especially Hypercube/3D torus/mesh, where there are several links in each direction along the X/Y/Z axes.

  • TREE: All ports with minimal hops to the destination are in the same AR group. This algorithm is suitable for tree topologies such as fat-tree, quasi-fat-tree, parallel-links fat-tree, etc.

  • DF_PLUS: This algorithm is designed for the Dragonfly plus topology.
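
The LAG and TREE grouping rules can be illustrated with a small shell sketch. This is a toy model of the rules as stated above, not the SM's implementation. The input formats (one "local_port remote_switch" pair per line for LAG, one "local_port destination_lid hops" triple per line for TREE) and the function names lag_groups and tree_groups are assumptions made up for this example.

```shell
#!/usr/bin/env bash
# Toy illustration of AR group selection. Input formats are hypothetical.

# LAG rule: ports linked to the same remote switch share one AR group.
# stdin: "<local_port> <remote_switch>" per line.
lag_groups() {
  awk '{
         if ($2 in grp) grp[$2] = grp[$2] "," $1
         else           grp[$2] = $1
       }
       END { for (sw in grp) print sw ": " grp[sw] }' | sort
}

# TREE rule: for each destination LID, the ports with the minimal hop
# count form the AR group.
# stdin: "<local_port> <destination_lid> <hops>" per line.
tree_groups() {
  awk '{
         port = $1; lid = $2; hops = $3 + 0
         if (!(lid in best) || hops < best[lid]) {
           best[lid] = hops; grp[lid] = port      # new minimum: restart group
         } else if (hops == best[lid]) {
           grp[lid] = grp[lid] "," port           # ties join the group
         }
       }
       END { for (lid in grp) print lid ": " grp[lid] }' | sort
}
```

For example, feeding the lines "1 swA", "2 swA", "3 swB" to lag_groups prints "swA: 1,2" and "swB: 3". Feeding "1 10 2", "2 10 2", "3 10 4" to tree_groups prints "10: 1,2", because port 3 needs more hops to reach LID 10.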

AR is enabled by default in the opensm.conf file.
The routing_engine subnet manager configuration option must be adjusted to match the network's topology. The default value is ar_updn, which is suited to a fat-tree topology.
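
To see which routing engine is currently configured, a helper along these lines may be useful. It is a minimal sketch: the function name get_routing_engine is made up for this example, and it assumes the option appears as a "routing_engine <value>" line, as in the examples in this section. The location of opensm.conf depends on the installation, so it is passed as an argument.

```shell
#!/usr/bin/env bash
# Print the value of every active (non-commented) routing_engine line
# in an opensm.conf-style file. The file path is supplied by the caller.
get_routing_engine() {
  awk '$1 == "routing_engine" { $1 = ""; sub(/^[ \t]+/, ""); print }' "$1"
}

# Example (the path is a placeholder; adjust to your installation):
#   get_routing_engine /path/to/opensm.conf
```

On an unmodified configuration this would be expected to print ar_updn, the default named above.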

To disable AR in the fabric, change the routing_engine subnet manager configuration option to a non-AR routing engine.
Fat-tree topology example: routing_engine updn.

After changing the opensm.conf file, run the "pkill -HUP opensm" bash command.
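
The edit-then-signal sequence can be sketched as two small shell functions. The names set_routing_engine and reload_opensm are hypothetical, and the path to opensm.conf is a placeholder that depends on the installation. The sketch only rewrites an existing active routing_engine line; if the file has none, it is left unchanged.

```shell
#!/usr/bin/env bash
# Replace the active routing_engine line in an opensm.conf-style file.
# A backup of the original is kept next to it with a .bak suffix.
set_routing_engine() {
  local conf="$1" engine="$2"
  sed -i.bak -E \
    "s/^[[:space:]]*routing_engine[[:space:]].*/routing_engine ${engine}/" \
    "$conf"
}

# Send SIGHUP to the running SM, using the command given above.
reload_opensm() {
  pkill -HUP opensm
}

# Example: disable AR on a fat tree (the path is a placeholder):
#   set_routing_engine /path/to/opensm.conf updn && reload_opensm
```

Keeping the .bak copy makes it easy to restore the previous routing engine if the fabric does not behave as expected after the change.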

Note

It is recommended to specify the correct root GUIDs file in the opensm.conf file.

The root GUIDs file lists all the root GUIDs, one per line.

By default, the root GUIDs file is taken from the SM, as shown in the opensm.conf file:

root_guid_file /opt/ufm/files/conf/opensm/root_guid.conf
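
Before pointing the SM at a root GUIDs file, it can help to check that every line looks like a GUID. The following is a minimal sketch under a stated assumption: each GUID is written as a 0x-prefixed hexadecimal value of up to 16 digits. The function name check_root_guid_file is made up for this example; adjust the pattern if your files use a different notation.

```shell
#!/usr/bin/env bash
# Verify that each non-empty line of a root GUIDs file is a single GUID.
# Assumed format (not confirmed by this document): 0x + 1..16 hex digits.
# Returns 0 if all lines match, 1 otherwise; bad lines go to stderr.
check_root_guid_file() {
  local file="$1" line n=0 bad=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    [ -z "$line" ] && continue
    if ! [[ "$line" =~ ^0x[0-9a-fA-F]{1,16}$ ]]; then
      echo "line $n: not a GUID: $line" >&2
      bad=1
    fi
  done < "$file"
  return "$bad"
}

# Example, using the default path shown above:
#   check_root_guid_file /opt/ufm/files/conf/opensm/root_guid.conf
```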

For further details and configuration options, see: https://enterprise-support.nvidia.com/s/article/Recommended-Topologies-for-Implementing-an-HPC-Cluster-with-NVIDIA-Quantum-InfiniBand-Solutions-Part-2

© Copyright 2024, NVIDIA. Last updated on May 28, 2024.