Subnet Manager High Availability
All nodes in an SM HA subnet must be of the same CPU type (e.g. x86), and must run the same MLNX-OS version.
High availability (HA) refers to a system or component that remains operational without interruption over an extended period of time.
NVIDIA Subnet Manager (SM) HA reduces subnet downtime and disruption by keeping subnet management running even when one of the SMs fails. The SM database is synchronized across all the nodes participating in the InfiniBand subnet whenever a configuration change is made. The synchronization is done out-of-band using an Ethernet management network.
NVIDIA SM HA allows the system administrator to enter and modify the InfiniBand SM configuration of all subnet managers from a single location. It creates an InfiniBand subnet ID and associates with it all the NVIDIA management appliances that are attached to the same InfiniBand subnet. All subnet managers can be controlled, started, or stopped from that single location.
All the nodes that participate in NVIDIA SM HA are joined to the InfiniBand subnet ID and, once joined, the synchronized SMs are launched. One of the nodes is elected as master and the others act as standby nodes (or are down). NVIDIA SM HA uses a virtual IP address (VIP) that is always directed to the SM HA master to monitor the SM state and to verify that all configurations are executed.
When transitioning from standalone into a group or vice versa, a few seconds are required for the node state to stabilize. During that time, group feature commands (e.g. SM HA commands) should not be executed. To run group features, wait for the CLI prompt to turn into [standalone:master], [<group>:master] or [<group>:standby] instead of [standalone:*unknown*] or [<group>:*unknown*].
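When the CLI is driven from a script, the stabilization check can be automated by inspecting the prompt. The following is a minimal sketch, assuming the prompt text has already been captured as a string; the function names are hypothetical and not part of MLNX-OS.

```python
import re

# Matches the bracketed "[<group>:<mode>]" part of the CLI prompt.
# Optional whitespace after the colon is tolerated, because the prompts
# in this document appear both with and without a space.
PROMPT_RE = re.compile(r"\[(?P<group>[^:\]]+):\s*(?P<mode>[^\]]+)\]")

# Modes that indicate the node state has stabilized.
STABLE_MODES = {"master", "standby"}


def parse_prompt(prompt):
    """Return (group, mode) from a prompt string, or None if no match."""
    match = PROMPT_RE.search(prompt)
    if match is None:
        return None
    return match.group("group").strip(), match.group("mode").strip()


def is_stable(prompt):
    """True once the prompt shows master or standby rather than *unknown*."""
    parsed = parse_prompt(prompt)
    return parsed is not None and parsed[1] in STABLE_MODES
```

A script would poll the prompt and run SM HA commands only after `is_stable` returns True.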
An InfiniBand subnet is formed by a network of InfiniBand nodes interconnected via InfiniBand switches. It includes all systems that can run an SM and are part of the SM HA domain. A switch that can potentially run an SM must be a member of an InfiniBand subnet ID to be associated with the NVIDIA SM HA domain. An IB subnet is recognized by its ID, which the system uses to either join or leave the subnet.
Every system that is not associated with an existing IB subnet (because it has never been part of one or has left one), or that does not have an MLNX-OS license installed, is by default associated with a subnet called “Standalone”.
To create, join, or leave an InfiniBand subnet, use the following commands:
- Create – “ib ha <IB_subnet_ID> ip <ip_addr> <netmask>” 
- Join – “ib ha <IB_subnet_ID>” 
- Leave – “no ib ha” 
When leaving an SM HA cluster, SM configuration is not saved on the node leaving the cluster. After leaving, the configuration is reset to its default values.
For further information see section “Creating and Adding Systems to an InfiniBand Subnet ID”.
The MLNX-OS centralized management infrastructure enables the user to configure or modify an existing configuration and to monitor the subnet running status. The MLNX-OS centralized management IP (VIP) is defined when a new subnet ID is created by running the command “ib ha <IB_subnet_ID> ip <ip_addr> <netmask>”. The created VIP serves as an alias for the current subnet master and therefore assumes the same roles as the master.
The VIP always points to one of the systems that are part of the SM HA domain. It remains active even if one or more of the members are down. For example:
            
            switch (config) # ib ha subnet2 ip 192.168.10.110 255.255.255.0
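Before issuing the create command, the VIP and netmask can be checked offline. The sketch below uses Python's standard `ipaddress` module and accepts the netmask either in dotted form or as a prefix length such as `/24`; the helper names are hypothetical, and the rule that rejects network and broadcast addresses is an added assumption, not a documented MLNX-OS check.

```python
import ipaddress


def normalize_vip(ip_addr, netmask):
    """Validate a VIP/netmask pair and return (ip, dotted_netmask).

    Raises ValueError if either value is malformed.
    """
    iface = ipaddress.IPv4Interface(f"{ip_addr}/{netmask.lstrip('/')}")
    net = iface.network
    # Assumption: a VIP should be a usable host address in its network.
    if net.prefixlen <= 30 and iface.ip in (net.network_address, net.broadcast_address):
        raise ValueError(f"{ip_addr} is the network or broadcast address of {net}")
    return str(iface.ip), str(net.netmask)


def build_create_command(subnet_id, ip_addr, netmask):
    """Build the 'ib ha <IB_subnet_ID> ip <ip_addr> <netmask>' command string."""
    ip, mask = normalize_vip(ip_addr, netmask)
    return f"ib ha {subnet_id} ip {ip} {mask}"
```

For instance, `build_create_command("subnet2", "192.168.10.110", "/24")` produces the same command line as the example above.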
    
A node is an InfiniBand switch system. Every node member of an IB subnet ID has one of the following roles:
- Master – the node that manages SM configurations and provides services to the Virtual IP (VIP) addresses 
- Standby – the node that replaces the Master node and takes over its responsibilities once the Master node is down 
- Offline – a node that has run an SM in the past and is currently offline, or that was created manually with the “ib smnode <node name> create” command. If the node has been removed from the environment, you can remove it from the list with the “no ib smnode <node name>” command. 
To see the mode of the current node, look at the CLI prompt for the following format:
            
            <host name> [<subnet ID>:<mode>] (config) #
    
For example:
switch [ibstandalone: master] (config) #
To see a list of the existing nodes and details about their running state, run the command “show ib smnodes [brief]”.
The VIP is used to configure or modify the existing configuration and to monitor the subnet running status. To configure the VIP, run the command “ib ha <IB_subnet_ID> ip <ip_addr> <netmask>”:
            
            switch [standalone: master] (config) # ib ha subnet2 ip 192.168.10.110 255.255.255.0
            switch [subnet2: master] (config) #
    
To create and add systems to a subnet:
- Log into the system from which you intend to create the subnet. 
- Enter config mode. Run:
            switch [standalone: master] > enable
            switch [standalone: master] # configure terminal
- Create a new subnet using the command “ib ha <IB_subnet_ID> ip <ip_addr> <netmask>”. Run:
            switch [standalone: master] (config) # ib ha subnet2 ip 192.168.10.110 255.255.255.0
            switch [subnet2: master] (config) #
  Note: You must run the “ib ha <IB_subnet_ID> ip <ip_addr> <netmask>” command only once per subnet ID. 
- Log into the system that you are going to join to the newly created subnet. 
- Join the system to the subnet using the “ib ha <IB_subnet_ID>” command. Run:
            switch [standalone: master] (config) # ib ha subnet2
            switch [subnet2: standby] (config) #
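The procedure above can be expressed as a per-system command plan, which makes it easy to confirm that the create form of the command is issued only once per subnet ID. This is an illustrative sketch only: the function and the host names are hypothetical, and actually sending the commands to the switches is left out.

```python
def plan_subnet_bringup(subnet_id, vip, netmask, creator, joiners):
    """Return {hostname: [CLI commands]} for creating and joining a subnet.

    The creator runs the create form (with the VIP) exactly once; every
    other system runs the join form (without an IP).
    """
    if creator in joiners:
        raise ValueError("the creating system must not also be listed as a joiner")
    enter_config = ["enable", "configure terminal"]
    plan = {creator: enter_config + [f"ib ha {subnet_id} ip {vip} {netmask}"]}
    for host in joiners:
        plan[host] = enter_config + [f"ib ha {subnet_id}"]
    return plan


# Hypothetical host names, with the subnet values used in this section.
plan = plan_subnet_bringup(
    "subnet2", "192.168.10.110", "255.255.255.0",
    creator="switch-a", joiners=["switch-b", "switch-c"],
)
for host, commands in plan.items():
    print(host, commands)
```

After each join, the prompt on that system should settle into the standby state before further group commands are run.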
If the SM configuration becomes corrupted or the subnet manager cannot raise any logical links, it is suggested that you restore the default SM configuration.
To restore subnet manager configuration:
- Enter config mode. Run:
            *switch [subnet2: master] > enable
            *switch [subnet2: master] # configure terminal
            *switch [subnet2: master] (config) #
- Run the command “ib sm reset-config”. Run:
            *switch [subnet2: master] (config) # ib sm reset-config
  Note: The asterisk in the example above (*switch) indicates the local system from where the command is running. 
To get information on the running state of a specific node, run one of the following commands with the requested parameter:
- show ib smnode <name> sm-running 
- show ib smnode <name> sm-state 
- show ib smnode <name> sm-priority 
- show ib smnode <name> active 
- show ib smnode <name> ha-state 
- show ib smnode <name> ha-role 
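To collect every state field for one node in a single pass, the command strings listed above can be generated programmatically. The helper below is a small sketch with a hypothetical name; it only builds the strings and does not talk to a switch.

```python
# The parameters listed above, in the same order.
SMNODE_FIELDS = (
    "sm-running",
    "sm-state",
    "sm-priority",
    "active",
    "ha-state",
    "ha-role",
)


def smnode_queries(name):
    """Return one 'show ib smnode <name> <field>' command per listed field."""
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("node name must be a non-empty string without whitespace")
    return [f"show ib smnode {name} {field}" for field in SMNODE_FIELDS]
```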
Subnet Manager Configuration
To configure the subnet manager, log into the centralized management IP (VIP). Once the SM configuration is created, the SM database is duplicated to the other nodes.
The SM must be configured from the MLNX-OS centralized management IP (VIP). Any configuration that is not created or modified on the master node (using the VIP) is overridden by the master configuration.
The user can configure different SM parameters, such as where to run the SM(s) or the SM priority, by running the commands that match the desired action.
NVIDIA High Availability and OpenSM Handover/Failover
NVIDIA products are fully compliant and interoperable with OpenSM.
Once an SM fails, the SM which takes over the subnet needs to reproduce the internal state of the failed master. Most of the information required is obtained by scanning the subnet and extracting the information from the devices. However, some information which is not stored directly in the network devices cannot be reproduced this way. InfiniBand management architecture limits such information to data exchanged between clients (either user-level programs or kernel modules) and the Subnet Administration (SA) service (attached to the SM). The SA keeps this set of client registrations in an internal data structure called SA-DB. The SA-DB information includes the multicast groups, the multicast group members, subscriptions for event forwarding and service records.
The new SM may rebuild the SA-DB either by requesting the clients to re-register with the SA or by obtaining a copy of the previous master SM's internal SA-DB via an SA-DB dump file. Client re-registration offers database correctness, whereas SA-DB dump file replication provides lower setup time. Client re-registration is required since the replicated SA-DB may not be up to date with the registrations held by the master SM.
Furthermore, since the SM does not maintain SA-DB information for unknown nodes, it is quite possible that some of the SA-DB information relating to nodes momentarily disconnected from the master SM is purged. Therefore, these nodes must re-register with the new SM when they are reconnected (they receive a client re-register request from the SM). Relying only on client re-registration is also suboptimal, as it takes some time to recreate the entire SA-DB and the network state.
NVIDIA SM HA replicates the SA-DB dump file from the current master SM to all the standby SMs running on NVIDIA switches. The SA-DB dump file replication provides further optimization to the standby SM that becomes master.
The standby SM loads the SA-DB file that the old master used. Using the existing SA-DB lessens the amount of processing needed on client re-registration, which reduces the time needed to finish setting up the network.
SM HA does not replace the InfiniBand specification's requirement for client re-registration.
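The benefit of starting from a replicated SA-DB can be illustrated with a toy model. This is not how OpenSM stores or processes the SA-DB; it is a simplified sketch in which each registration is a hypothetical `(kind, client, key)` tuple and the cost of a takeover is counted as the number of records that have to be created from scratch during client re-registration.

```python
def take_over(replicated_dump, reregistrations):
    """Simulate a standby SM becoming master.

    replicated_dump: records loaded from the replicated SA-DB dump file
                     (empty if no dump is available).
    reregistrations: records that clients re-register after the handover.
    Returns (resulting SA-DB as a set, number of records created from scratch).
    """
    sa_db = set(replicated_dump)
    created = 0
    for record in reregistrations:
        if record not in sa_db:
            # Record unknown to the new master: it must be rebuilt.
            sa_db.add(record)
            created += 1
    return sa_db, created


# Hypothetical registrations held by the old master.
registrations = [
    ("mcast-member", "node-a", "group-1"),
    ("mcast-member", "node-b", "group-1"),
    ("service-record", "node-c", "svc-x"),
]
# The dump lags slightly behind: the last registration is missing.
dump = registrations[:2]

db_with_dump, created_with_dump = take_over(dump, registrations)
db_without_dump, created_without_dump = take_over([], registrations)
print(created_with_dump, created_without_dump)
```

In this model both paths end with the same database, which reflects that re-registration provides correctness, while the path that starts from the dump has fewer records to create.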
When running an SM HA cluster with more than 2 active OpenSM instances, IB multicast applications need to support client re-register or they may not work correctly after OpenSM failover.
    
    
ib ha
ib ha <IB_subnet_ID> [ip <IP address> <subnet mask> [force]]
no ib ha
Creates a subnet <IB_subnet_ID> with the specified IP. The no form of the command removes this node from an InfiniBand subnet ID.
| Syntax Description | IB subnet ID | Simple group name for shared IB config |
| | ip <IP address> | Assigns management IP address |
| | netmask | Netmask (e.g. 255.255.255.0 or /24) |
| | force | Joins if exists or creates if not |
| Default | N/A | |
| Configuration Mode | config | |
| History | 3.1.0000 | |
| Example | switch (config) # ib ha my-subnet | |
| Related Commands | show ib ha | |
| Notes | A new subnet may be joined only after leaving the current one | |
ib smnode
ib smnode <hostname> [create | disable | enable | sm-priority <priority>]
no ib smnode <hostname> [create | disable | enable | sm-priority]
Manages HA SM. The no form of the command removes HA SM node configuration.
| Syntax Description | hostname | Specifies <hostname> SM configuration to modify. |
| | create | Creates SM configuration for selected node. |
| | disable | Makes SM inactive on selected node. |
| | enable | Makes SM active on selected node. |
| | sm-priority <priority> | Sets SM selected node priority (0=low, 15=high). |
| Default | N/A | |
| Configuration Mode | config | |
| History | 3.1.0000 | |
| Example | switch (config) # ib smnode switch-1133ce create | |
| Related Commands | show ib smnode, show ib smnodes | |
| Notes | | |
show ib smnode
show ib smnode <hostname> {active | ha-role | ha-state | ip | sm-priority | sm-running | sm-state}
Displays SM High availability information.
| Syntax Description | hostname | Specifies <hostname> SM configuration to display |
| | active | Displays whether <hostname> is currently active |
| | ha-role | Displays the High Availability role of <hostname>. Possible return values are: offline, unknown, master, standby, or disabled |
| | ha-state | Possible return values are: offline, init, searching, joining, online, creating, waiting, leaving, join-sync, failed, removed, or regroup |
| | ip | Displays the local management IP address associated with the active node, <hostname>. If <hostname> is not active, the command displays “offline” |
| | sm-priority | Displays the SM priority for SM running on <hostname> |
| | sm-running | Displays if <hostname> has an SM running. The command will display “active” (that is, SM is running) only if <hostname> is currently active, has a license, is enabled as a potential SM, is active as SM, and if there is a maximum of 2 SMs in the fabric. |
| | sm-state | Displays if SM is enabled to run on <hostname> |
| Default | N/A | |
| Configuration Mode | config | |
| History | 3.1.0000 | |
| | 3.8.1000 | Updated Syntax Description |
| Example | switch (config) # show ib smnode my-hostname sm-state | |
| Related Commands | show ib smnodes | |
| Notes | | |
show ib smnodes
show ib smnodes [brief]
Displays SM High availability information.
| Syntax Description | brief | Displays information on all HA nodes |
| Default | N/A | |
| Configuration Mode | config | |
| History | 3.1.0000 | |
| | 3.8.1000 | Updated example |
| | 3.9.3100 | Updated output to reflect the OpenSM master also when the command is triggered from non-SM master |
| Example | switch (config) # show ib smnodes | |
| Related Commands | | |
| Notes | | |
show ib ha
show ib ha [brief]
Displays information about all the systems that are active or might be able to run SM.
| Syntax Description | brief | Displays brief HA information |
| Default | N/A | |
| Configuration Mode | config | |
| History | 3.1.0000 | |
| | 3.9.1000 | Updated example |
Example:
            switch (config) # show ib ha
            Global HA state:
              HA node local information:
                Name                     : barracuda-216 (active) <--- (local node)
                SM-HA state              : standby
                IP                       : 10.7.48.50
                Virtual switch membership: infiniband-default

              HA node local information:
                Name                     : barracuda-217 (not active)
                IP                       : offline
                Virtual switch membership: infiniband-default

              HA node local information:
                Name                     : scorpionib2-19 (active)
                SM-HA state              : master
                IP                       : 10.7.51.169
                Virtual switch membership: infiniband-default

            switch (config) # show ib ha brief
            Global HA state:
            -----------------------------------------------------------------------------------------
            ID              Local node  SM-HA state  IP           Virtual switch membership
            ---------------------------------------------------------------------------------------
            barracuda-216   *           standby      10.7.48.50   infiniband-default
            barracuda-217               standby      10.7.48.51   infiniband-default
            scorpionib2-19              master       10.7.51.169  infiniband-default
| Related Commands | | |
| Notes | | |
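For monitoring scripts, the brief output can be turned into structured records. The parser below is a minimal sketch that assumes the layout shown in the example (two dashed separator lines around the header, whitespace-separated columns, and `*` marking the local node); the function name is hypothetical and other MLNX-OS versions may format the output differently.

```python
SAMPLE_OUTPUT = """\
Global HA state:
-----------------------------------------------------------------
ID              Local node  SM-HA state  IP           Virtual switch membership
-----------------------------------------------------------------
barracuda-216   *           standby      10.7.48.50   infiniband-default
barracuda-217               standby      10.7.48.51   infiniband-default
scorpionib2-19              master       10.7.51.169  infiniband-default
"""


def parse_ha_brief(output):
    """Parse 'show ib ha brief' output into a list of dicts."""
    rows = []
    separators = 0
    for line in output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if set(stripped) == {"-"}:
            # Dashed separator; data rows start after the second one.
            separators += 1
            continue
        if separators < 2:
            continue
        tokens = stripped.split()
        # The local-node column is either "*" or empty.
        local = len(tokens) > 1 and tokens[1] == "*"
        if local:
            tokens.pop(1)
        if len(tokens) != 4:
            continue  # Not a data row (for example, a trailing prompt).
        node, state, ip, membership = tokens
        rows.append({"id": node, "local": local, "sm_ha_state": state,
                     "ip": ip, "membership": membership})
    return rows


nodes = parse_ha_brief(SAMPLE_OUTPUT)
master = [n["id"] for n in nodes if n["sm_ha_state"] == "master"]
print(master)
```

A caller could use the result to confirm that exactly one node reports the master state before applying configuration through the VIP.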