InfiniBand Cluster Bring-up Procedure

SHIELD

Self-Healing Interconnect Enhancement for Intelligent Datacenters (SHIELD), referred to as Fast Link Fault Recovery (FLFR) throughout this document, enables the switch to select an alternative output port if the output port provided in the Linear Forwarding Table is not in the Armed/Active state. This mode allows the fastest traffic recovery from switch-to-switch port failures caused by link flaps or neighbor switch reboots, without intervention of the Subnet Manager.

Fast Link Fault Notification (FLFN) enables the switch to report to neighbor switches that an alternative output port should be selected for traffic to a specific destination LID, so that they avoid sending that traffic to the reporting switch. This is required when FLFR on the switch has no alternative port to select for the destination LID.

Adaptive Routing Notification (ARN) enables the switch to send a report to neighbor switches if its ports are congested above the threshold. This report causes the neighbor switches to select another output port to deliver the traffic to the destination LID.
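The FLFR decision described above can be illustrated with a small toy model. This is only a sketch of the selection rule as stated in this section, not the switch implementation: the function name select_output_port, the port numbers, and the representation of port states as strings are all assumptions made for illustration.

```python
from typing import Mapping, Optional, Sequence

# Port states in which a port may carry traffic, per the description above.
USABLE_STATES = {"Armed", "Active"}


def select_output_port(
    lft_port: int,
    alternatives: Sequence[int],
    port_state: Mapping[int, str],
) -> Optional[int]:
    """Toy model of FLFR port selection (hypothetical helper).

    lft_port     -- output port given by the Linear Forwarding Table
    alternatives -- candidate alternative output ports, in preference order
    port_state   -- mapping of port number to its current state string

    Returns the port to forward on, or None when no usable port exists.
    """
    # Normal case: the port from the Linear Forwarding Table is usable.
    if port_state.get(lft_port) in USABLE_STATES:
        return lft_port

    # FLFR case: fall back to the first usable alternative port.
    for port in alternatives:
        if port_state.get(port) in USABLE_STATES:
            return port

    # No alternative is available: this is the situation in which
    # FLFN would notify the neighbor switches.
    return None
```

For example, with port 1 down and ports 2 and 3 usable, select_output_port(1, [2, 3], {1: "Down", 2: "Active", 3: "Armed"}) returns 2. A result of None corresponds to the case where the switch relies on FLFN so that its neighbors reroute traffic for that destination LID.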

Note

Fast Link Fault Notification (FLFN) is supported for fat-tree and quasi-fat-tree topologies only.

SHIELD is enabled by default in the opensm.conf file through the shield_mode Subnet Manager configuration option.

To disable SHIELD in the fabric, set the shield_mode Subnet Manager configuration option to 0.
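As a minimal sketch, the change could be scripted as shown below. It assumes that opensm.conf uses one "option value" pair per line with "#" starting a comment; the helper name set_conf_option is hypothetical, and the location of opensm.conf and the way the Subnet Manager picks up the new value depend on the installation and are not covered here.

```python
import re


def set_conf_option(conf_text: str, option: str, value: str) -> str:
    """Return conf_text with `option` set to `value` (hypothetical helper).

    Assumes an 'option value' line format. Commented-out lines are left
    untouched, and the option is appended if it is not present.
    """
    # Match an active (uncommented) line that starts with the option name.
    pattern = re.compile(r"^\s*" + re.escape(option) + r"\s+\S.*$")
    lines = conf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if pattern.match(line):
            lines[i] = f"{option} {value}"
            replaced = True
    if not replaced:
        lines.append(f"{option} {value}")
    return "\n".join(lines) + "\n"
```

Calling set_conf_option(text, "shield_mode", "0") on the contents of opensm.conf returns text in which SHIELD is disabled. Working on a string rather than editing the file directly makes it easy to review the result before writing it back.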

For further details, see: https://enterprise-support.nvidia.com/s/article/Recommended-Topologies-for-Implementing-an-HPC-Cluster-with-NVIDIA-Quantum-InfiniBand-Solutions-Part-2

© Copyright 2024, NVIDIA. Last updated on May 28, 2024.