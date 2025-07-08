If the SM configuration becomes corrupted, or the subnet manager cannot raise any logical links, it is recommended that you restore the default SM configuration.

To restore subnet manager configuration:

1. Enter config mode. Run:

* switch [subnet2: master] > enable
* switch [subnet2: master] # configure terminal
* switch [subnet2: master] (config) #

2. Run the command “ib sm reset-config”. Run:

* switch [subnet2: master] (config) # ib sm reset-config

Note The asterisk in the example above (*switch) indicates the local system from where the command is running.

To get information on the running state of a specific node, run one of the following commands with the required parameter:

show ib smnode <name> sm-running

show ib smnode <name> sm-state

show ib smnode <name> sm-priority

show ib smnode <name> active

show ib smnode <name> ha-state

show ib smnode <name> ha-role
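As a minimal sketch, the commands above can be generated for a given node with a small helper, for example to paste into a session or feed to an automation tool. The function name `smnode_status_commands` and the node name `switch-1` are hypothetical and used only for illustration; the parameter list is taken from the commands listed above.

```python
# Parameters listed in this section for "show ib smnode <name> ...".
SMNODE_PARAMS = (
    "sm-running",
    "sm-state",
    "sm-priority",
    "active",
    "ha-state",
    "ha-role",
)


def smnode_status_commands(node_name, params=SMNODE_PARAMS):
    """Return one 'show ib smnode' command string per requested parameter.

    This helper only builds strings; it does not connect to a switch.
    """
    if not node_name or any(ch.isspace() for ch in node_name):
        raise ValueError("node name must be a non-empty token without whitespace")
    unknown = [p for p in params if p not in SMNODE_PARAMS]
    if unknown:
        raise ValueError(f"unsupported parameter(s): {unknown}")
    return [f"show ib smnode {node_name} {p}" for p in params]


if __name__ == "__main__":
    # "switch-1" is a hypothetical node name.
    for cmd in smnode_status_commands("switch-1"):
        print(cmd)
```

Passing a subset, such as `smnode_status_commands("switch-1", ["ha-state", "ha-role"])`, returns only the HA-related queries.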

To configure the subnet manager, log into the centralized management IP (VIP). Once the SM configuration is created, the SM database is duplicated to the other nodes.

Note The SM must be configured from the MLNX-OS centralized management IP (VIP). Any configuration that is not created or modified on the master node (using the VIP) is overridden by the master configuration.

The user can configure different SM parameters, such as where to run the SM(s) or the SM priority, by running the commands for the desired action.

Note NVIDIA products are fully compliant and interoperable with OpenSM.

Once an SM fails, the SM which takes over the subnet needs to reproduce the internal state of the failed master. Most of the information required is obtained by scanning the subnet and extracting the information from the devices. However, some information which is not stored directly in the network devices cannot be reproduced this way. InfiniBand management architecture limits such information to data exchanged between clients (either user-level programs or kernel modules) and the Subnet Administration (SA) service (attached to the SM). The SA keeps this set of client registrations in an internal data structure called SA-DB. The SA-DB information includes the multicast groups, the multicast group members, subscriptions for event forwarding and service records.

The new SM may retrieve the SA-DB either by requesting the clients to re-register with the SA or by obtaining a copy of the previous master SM's internal SA-DB via an SA-DB dump file. Client re-registration offers database correctness, while SA-DB dump file replication provides a lower setup time. Client re-registration is still required, since the SA-DB may not be up to date on the registrations listed in the master SM.

Furthermore, since the SM does not maintain SA-DB information for unknown nodes, it is quite possible that some of the SA-DB information relating to nodes momentarily disconnected from the master SM is purged. Therefore, these nodes must re-register with the new SM when they are reconnected (they receive a client re-register request from the SM). Relying only on client re-registration is also not optimal, as it takes some time to recreate the entire SA-DB and the network state.

NVIDIA SM HA replicates the SA-DB dump file from the current master SM to all the standby SMs running on NVIDIA switches. The SA-DB dump file replication provides further optimization to the standby SM that becomes master.

The standby SM loads the existing SA-DB file that the old master used. Using the existing SA-DB lessens the amount of processing needed on client re-registration, which reduces the time needed to complete setting up the network.
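The trade-off between the two recovery paths can be illustrated with a toy model. This is only a sketch under stated assumptions, not how the SM is implemented: registrations are modeled as plain tuples, the names `SaDb` and `take_over` are hypothetical, and the model assumes that the set reported by clients during re-registration is the correct final state.

```python
from dataclasses import dataclass, field


@dataclass
class SaDb:
    """Toy stand-in for the SA-DB: a set of client registrations."""
    registrations: set = field(default_factory=set)


def take_over(dump, client_registrations):
    """Model a standby SM becoming master (illustrative assumption only).

    dump: registrations loaded from the replicated SA-DB dump file
          (empty if no dump is available).
    client_registrations: what clients report when asked to re-register.
    Returns the rebuilt SaDb and the number of entries that had to be
    added or dropped to reach the state reported by the clients.
    """
    db = SaDb(set(dump))
    current = set(client_registrations)
    added = current - db.registrations   # missing from the dump
    stale = db.registrations - current   # in the dump, not confirmed by any client
    db.registrations = current           # re-registration decides the final state
    return db, len(added) + len(stale)


if __name__ == "__main__":
    # Hypothetical registrations: (kind, node, target).
    clients = {
        ("mcast-member", "node-a", "group-1"),
        ("mcast-member", "node-b", "group-1"),
        ("service-record", "node-c", "svc-x"),
    }
    dump = {
        ("mcast-member", "node-a", "group-1"),
        ("mcast-member", "node-b", "group-1"),
    }
    _, cold = take_over(set(), clients)  # re-registration only
    _, warm = take_over(dump, clients)   # dump file plus re-registration
    print(cold, warm)
```

In this model both paths end with the same database, but the path that starts from the dump has fewer entries left to process, which mirrors the reduced setup time described above.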

Note SM HA does not replace the InfiniBand specification requirement for client re-registration.

Note When running an SM HA cluster with more than 2 active OpenSM instances, IB multicast applications need to support client re-registration, or they may not work correctly after an OpenSM failover.



