NVIDIA MLNX-OS User Manual v3.11.2016

Subnet Manager

The InfiniBand Subnet Manager (SM) is a centralized entity running in the switch. The SM discovers and configures all the InfiniBand fabric devices to enable traffic flow between those devices.

The SM applies network-traffic configurations such as Quality of Service (QoS), routing, and partitioning to the fabric devices. You can view and configure the SM parameters via the CLI or the WebUI. The embedded SM in MLNX-OS can manage fabrics of up to 2048 nodes on x86-based systems.


Warning
  • The Subnet Manager running via MLNX-OS does not support the Dragonfly+ topology or a combination of Fat-Tree and Dragonfly+ topologies.

  • The Subnet Manager running via MLNX-OS does not support adaptive routing, fault routing (i.e., SHIELD or FRN), congestion control, or SHARP.

To enable Subnet Manager:

  1. Enable Subnet Manager (disabled by default). Run:

    switch (config) # ib smnode my-sm enable

  2. (Optional) Set the priority for the Subnet Manager. Run:

    switch (config) # ib smnode my-sm sm-priority <priority>

Warning

If rapid SM restarts are observed in what should be a quiet subnet, verify that all nodes running SM in the same management domain are in the same IB subnet. If they are not, fix the subnet.

Partitioning enforces isolation among systems sharing an InfiniBand fabric. Partitioning is not related to boundaries established by subnets, switches, or routers. Rather, a partition describes a set of end nodes within the fabric that may communicate. Each port of an end node is a member of at least one partition and may be a member of multiple partitions. A partition manager (part of the SM) assigns partition keys (PKEYs) to each channel adapter port. Each PKEY represents a partition. Reception of an invalid PKEY causes the packet to be discarded. Switches and routers may optionally be used to enforce partitioning. In this case the partition manager programs the switch or router with PKEY information and when the switch or router detects a packet with an invalid PKEY, it discards the packet.
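
The PKEY check described above can be sketched in a few lines of Python. This is an illustration only, not MLNX-OS code. It assumes the membership convention of the InfiniBand Architecture Specification, which this manual does not spell out: the top bit of the 16-bit PKEY marks full membership, the low 15 bits identify the partition, and two limited members of the same partition may not communicate. The function names are hypothetical.

```python
from __future__ import annotations

# Illustrative sketch of PKEY matching; not MLNX-OS code.
MEMBERSHIP_BIT = 0x8000  # assumed: top bit set = full member, clear = limited
BASE_MASK = 0x7FFF       # assumed: low 15 bits identify the partition


def pkeys_match(a: int, b: int) -> bool:
    """Return True if two 16-bit PKEYs allow communication."""
    same_partition = (a & BASE_MASK) == (b & BASE_MASK)
    one_is_full = bool((a | b) & MEMBERSHIP_BIT)
    return same_partition and one_is_full


def accept_packet(packet_pkey: int, port_pkey_table: list[int]) -> bool:
    """A port accepts a packet only if some table entry matches its PKEY."""
    return any(pkeys_match(packet_pkey, entry) for entry in port_pkey_table)
```

For example, under these assumptions a packet carrying 0xFFF2 (full member of partition 0x7ff2) would be accepted by a port whose table holds 0x7FF2, while a packet carrying a PKEY from any other partition would be discarded.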

Fabric administrators can assign specific Service Levels (SLs) to particular partitions. This allows the SM to isolate traffic flows between those partitions: even if two partitions operate at the same QoS level, each is guaranteed its fair share of bandwidth regardless of whether nodes in other partitions misbehave or are oversubscribed.

The switch enables the configuration of partitions in an InfiniBand fabric.

The default partition is created by the SM unconditionally (whether it was defined or not).

Relationship with ib0 Interface

IP interface “ib0” is running under the default PKEY (0x7fff) and can be used for in-band management connectivity to the system.

Configuring Partition

Warning

Partition configuration applies only when the SM is enabled and running on the system.

  1. Create a partition. Run:

    switch (config) # ib partition my-partition pkey 0x7ff2

  2. Enter partition configuration mode. Run:

    switch (config) # partition my-partition
    switch (config partition name my-partition) #

  3. Add partition members. Run:

    switch (config partition my-partition) # member all

  4. Verify the partition configuration. Run:

    switch (config partition my-partition) # show ib partition
    Default
      PKey = 0x7FFF
      defmember = full
      ipoib = yes
      members
        GUID='ALL' member='full'
    my-partition
      PKey = 0x7ff2
      members
        GUID='ALL' member='default'

Adaptive routing (AR) allows optimizing data traffic flow. The InfiniBand protocol uses multiple paths between any two points. Thus, when unexpected traffic patterns cause some paths to be overloaded, AR can automatically move traffic to less congested paths according to the current temporal state of the network.
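
As a rough illustration of the idea, the sketch below picks the least congested of several candidate output ports. It is a simplification under stated assumptions, not the switch's actual AR logic: the function name, the use of queue depth as the congestion measure, and the lowest-port tie-break are all hypothetical.

```python
def pick_output_port(candidate_ports, queue_depth):
    """Pick the candidate port with the smallest current queue depth.

    candidate_ports: ports that all lead toward the destination.
    queue_depth: dict mapping port -> current backlog (assumed metric);
                 ports missing from the dict are treated as idle.
    Ties go to the lowest port number.
    """
    return min(candidate_ports, key=lambda port: (queue_depth.get(port, 0), port))
```

Because the decision is taken from the current state of the queues, the same destination may be reached through different ports at different times, which is what distinguishes this from static path assignment.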

AR support is enabled by default on the system profile “ib-single-switch”. To disable AR, run either the command “system profile ib-no-adaptive-routing-single-switch” or the command “system profile ib” with the no-adaptive-routing parameter.

Warning

The AR option needs to be enabled in the SM for it to take effect.

When assigning logical paths to physical links, the UpDn algorithm tries to map the same number of paths per link to maximize use of the available bandwidth. This balancing is done statically, without knowledge of actual workloads and traffic patterns. Path balancing decisions are made locally, at each switch, without assuming anything about the physical topology. The resulting path assignments may not be optimal for typical Clos/Fat Tree workloads.

A routing option called “scatter-ports” is available for the MinHop and UpDn routing engines. It instructs the routing algorithm to randomize the local assignment of paths to links, which often results in better link utilization. The scatter-ports option requires an integer argument, which is the seed for the random number generator. It is recommended to use a prime number for the seed; a seed of zero turns off randomization.
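
The effect of the seed can be sketched as follows. This is a minimal illustration, not the SM's routing code: the function name and the least-loaded selection rule are assumptions made for the example. It keeps the number of paths per port balanced, and uses the seed only to decide which of the equally loaded ports receives the next path.

```python
import random


def assign_paths(dest_lids, ports, seed=0):
    """Map each destination LID to one of several equal-cost output ports.

    Always picks among the least-loaded ports, so path counts stay balanced.
    seed == 0: ties go to the lowest port number (no randomization).
    seed != 0: ties are broken by a generator seeded with `seed`.
    """
    rng = random.Random(seed) if seed != 0 else None
    load = {port: 0 for port in ports}
    assignment = {}
    for lid in dest_lids:
        lightest = min(load.values())
        tied = sorted(p for p in ports if load[p] == lightest)
        port = rng.choice(tied) if rng else tied[0]
        assignment[lid] = port
        load[port] += 1
    return assignment
```

With a seed of zero, every switch running this sketch would make the same ordered choices, which is how correlated hot spots can arise. A non-zero seed keeps the assignment reproducible while breaking that regularity.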

The GUID routing order list controls the order in which the SM processes destination LIDs when calculating output ports. It applies to the MinHop and Up/Down routing algorithms only.

The order of GUID appearance is important as destinations corresponding to GUIDs appearing earlier in the routing list get precedence during the routing calculations over other destinations in the fabric. This can improve load balancing towards a specific set of end ports (e.g. storage nodes or other service nodes requiring high throughput).

If the scatter-ports option (randomization of the output port) is set to a non-zero value, guid-routing-order-no-scatter defines whether randomization is applied to the destination GUIDs listed in the GUID routing order list.
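
The precedence rule can be sketched as a simple reordering of the destinations before routing is calculated. This is an illustration under assumptions, not the SM's implementation: the function name is hypothetical, listed GUIDs that are absent from the fabric are assumed to be skipped, and unlisted destinations are assumed to keep their original relative order.

```python
def order_destinations(fabric_guids, routing_order):
    """Return GUIDs in processing order: listed GUIDs first, in list order."""
    present = set(fabric_guids)
    first = []
    seen = set()
    for guid in routing_order:
        # Skip GUIDs that are not in the fabric, and ignore duplicates.
        if guid in present and guid not in seen:
            first.append(guid)
            seen.add(guid)
    # Everything not named in the list follows in its original order.
    rest = [guid for guid in fabric_guids if guid not in seen]
    return first + rest
```

Destinations processed earlier get first pick of the least-loaded output ports, which is why listing storage or service nodes first can improve load balancing toward them.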

Bulk update mode allows users to set multiple IB SM configurations without applying them until bulk mode is disabled.

When bulk update is disabled (the default), every SM configuration change is applied immediately. When bulk update is enabled, all SM configuration changes are saved internally and are not applied until the mode is disabled.

Bulk mode is a non-persistent state. That is, if the switch is restarted, it boots up with this mode disabled, and all configuration changes that were saved before the restart are applied.

Warning

Show commands display every configuration change, even if it has not been applied yet.
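
The bulk update behavior can be modeled as a small staging layer. The sketch below is an illustration only, not MLNX-OS code: the class, its method names, and the configuration keys used in the example are hypothetical.

```python
class SmConfig:
    """Toy model of SM configuration with a bulk update mode."""

    def __init__(self):
        self.applied = {}   # settings currently in effect
        self.pending = {}   # settings staged while bulk mode is on
        self.bulk = False   # bulk mode is disabled by default

    def set(self, key, value):
        """Apply immediately, or stage the change if bulk mode is on."""
        if self.bulk:
            self.pending[key] = value
        else:
            self.applied[key] = value

    def set_bulk(self, enabled):
        """Turning bulk mode off applies all staged changes at once."""
        if self.bulk and not enabled:
            self.applied.update(self.pending)
            self.pending.clear()
        self.bulk = enabled

    def show(self):
        """Report staged changes too, even though they are not applied yet."""
        return {**self.applied, **self.pending}
```

In this model, a value set while bulk mode is on appears in `show()` immediately but reaches `applied` only after `set_bulk(False)`, which mirrors the warning about show commands.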

© Copyright 2023, NVIDIA. Last updated on May 26, 2024.