NVIDIA MLNX-GW User Manual for NVIDIA Skyway Appliance v8.1.4300 (LTS)

Gateway Networking Features

Skyway GA100 is an appliance-based InfiniBand-to-Ethernet gateway, enabling Ethernet storage or other Ethernet-based communications to access the InfiniBand datacenter, and vice versa. The solution, leveraging ConnectX’s hardware-based forwarding of IP packets and standard IP-routing protocols, supports 200Gb/s HDR connectivity today, and is future-ready to support higher speeds.

Skyway contains 8 ConnectX VPI dual-port adapter cards, which enable hardware-based forwarding of IP packets between InfiniBand and Ethernet systems.

The Skyway solution comprises a Skyway appliance and one or two Layer-3 Ethernet switches. These Ethernet switches can be provided by the customer, but are an integral part of the solution.

A single Skyway module supports a maximum bandwidth of 1.6Tb/s, utilizing 16 ports that each carry up to 100Gb/s of traffic. Connectivity-wise, the InfiniBand ports can be connected to the InfiniBand network via HDR/HDR100 or EDR, and the Ethernet ports via 200Gb/s or 100Gb/s Ethernet.

NVIDIA Skyway enables establishing a High Availability (HA) environment that shares resources among multiple Skyway appliances (comprising a Skyway domain). HA minimizes downtime when any system or connectivity failure occurs. Skyway leverages its load balancing capabilities to distribute the workload to optimize the aggregate domain performance for traffic.

On the Ethernet side, Skyway load balancing and HA functions are achieved by leveraging Ethernet Link Aggregation (LAG) support. Link Aggregation Control Protocol (LACP) is used to establish LAG and to verify connectivity. On the InfiniBand side, these functions are achieved through guaranteed availability of fallback network adapters (HCAs) of the Skyway appliances that will execute the traffic flows if an HCA drops.

At initialization, up to 64 gateway group identifiers (GIDs) are spread evenly among all InfiniBand ports of the Skyway gateway appliance. When an InfiniBand node initiates a traffic flow through a gateway, it first sends a broadcast ARP request with the default Gateway IP Address to determine the gateway GID. All HCAs receive the request, but only the adapter assigned to handle the relevant range of GIDs corresponding to the sending node IP address will send back a response to the ARP request. When the originating node receives the gateway GID, it sends a path query to the subnet manager (SM) to determine the gateway local identifier (LID), and the communication flow is then performed as usual.

The dynamic assignment of the 64 gateway GIDs is the basic element for the load balancing and high availability of the entire system. If a change to the gateway(s) configuration occurs—for example, if a cable is dropped, an Ethernet link is disabled, or an appliance is powered off—then the gateway GIDs are reassigned by the MLNX-GW operating system to other HCAs to be handled. From the end-node point of view, nothing has changed—the same GID and LID remain valid even when handled by a different HCA (on the same or different Skyway appliance).
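The assignment and reassignment behavior described above can be illustrated with a minimal sketch. It assumes a simple round-robin spread of GID indices over the active ports; the actual algorithm used by the MLNX-GW operating system is not described here, and the port names (`gw1/hca0`, and so on) are hypothetical.

```python
from collections import defaultdict

NUM_GIDS = 64  # from the text: up to 64 gateway GIDs per domain


def assign_gids(active_ports, num_gids=NUM_GIDS):
    """Spread GID indices evenly over the active InfiniBand ports.

    Round-robin is an assumption made for illustration only.
    """
    if not active_ports:
        raise ValueError("no active ports to assign GIDs to")
    assignment = defaultdict(list)
    for gid in range(num_gids):
        port = active_ports[gid % len(active_ports)]
        assignment[port].append(gid)
    return dict(assignment)


# Hypothetical domain: 2 appliances with 8 InfiniBand ports each.
ports = [f"gw{a}/hca{h}" for a in (1, 2) for h in range(8)]
before = assign_gids(ports)

# One port drops; its GIDs are redistributed among the remaining ports.
after = assign_gids([p for p in ports if p != "gw1/hca3"])

print({port: len(gids) for port, gids in after.items()})
```

In this sketch every GID index remains assigned after the failure, which mirrors the point made above: end nodes keep using the same GID while a different HCA serves it.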

High Availability (HA) Details

  • A gateway domain is a set of Skyway appliances sharing the same InfiniBand subnet.

  • The HA protocol runs individually on each Skyway appliance in the domain.

  • Skyway appliances which belong to the same domain share the same domain ID.

  • Possible gateway domain roles are as follows:

    • Master Gateway

    • Active Backup Gateway(s)

    • Non-Active Backup Gateway(s)

  • In each gateway domain there is a single Master Gateway.

  • The domain's Master Gateway is responsible for GID assignment, which is the basis of HA and load balancing.

    • Based on the GID assignment, each HCA is configured with the ARP requests it should respond to and the host IP addresses it should pass traffic to.

    • Every domain member distributes its InfiniBand host list to the rest of the domain members.

  • To monitor the health of the Skyway’s domain members, each member sends unicast UDP "keepalive" messages to the Master, containing, among other things, the number of its active ports in the domain (that is, the number of active HCAs that can pass traffic). Skyway HA information (including keepalive statistics) will be reflected in the CLI.

  • If an Active Backup Gateway fails to receive an advertisement confirming that the Master Gateway is functioning well within a prescheduled timeout, it will take over as Master Gateway and will inform the rest of the domain members of the role change.

  • To determine which gateway becomes Master, each gateway appliance uses a priority value. The value 0 (zero) is reserved for a gateway appliance to indicate that it is releasing the Master Gateway role. The range 1-255 is available for the gateway appliance. Higher values indicate higher priority. The default value is (decimal) 100.

  • In case two gateway appliances share the same priority, the one with the higher system GUID (Globally Unique ID) will be considered as higher priority and will become the new domain Master Gateway.
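The priority and tie-break rules above can be sketched as follows. This is an illustration of the selection rule only, not the HA protocol itself; the `Gateway` class, its field names, and the example GUID values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gateway:
    name: str
    priority: int      # 1-255; 0 means the appliance is releasing the Master role
    system_guid: int   # used only to break priority ties


def elect_master(gateways):
    """Return the gateway with the highest priority; ties go to the higher GUID."""
    candidates = [g for g in gateways if 1 <= g.priority <= 255]
    if not candidates:
        return None
    return max(candidates, key=lambda g: (g.priority, g.system_guid))


domain = [
    Gateway("gw-a", priority=100, system_guid=0x1000),
    Gateway("gw-b", priority=100, system_guid=0x2000),
    Gateway("gw-c", priority=0, system_guid=0xFFFF),  # releasing mastership
]
print(elect_master(domain).name)  # gw-b: same priority as gw-a, higher GUID
```

Comparing the tuple `(priority, system_guid)` applies both rules in one step: priority decides first, and the GUID matters only when priorities are equal.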

For a list of HA-related commands, see the High Availability section in the MLNX-GW Routing Overview chapter.
For a configuration example, see Configuring High Availability (HA) section.

VF Hashing

Virtual instances called Virtual Functions (VFs) can be seen as additional devices connected to the Physical Function (PF). All VFs under a PF share the same resources with the Physical Function.
In MLNX-GW, each HCA utilizes multiple VFs for load balancing. These VFs are created automatically.

There are two algorithms to load balance incoming traffic from InfiniBand to Ethernet between the various VFs:

  1. Hash based on Destination IP.

  2. The last byte of the Destination IP, taken modulo the number of VFs.

Consider moving to the second method if traffic is not balanced as expected. This may be relevant when the number of IPs is relatively small.
The second method is efficient when the IPs are allocated from a few Class-C address pools.
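The difference between the two policies can be seen in a small sketch. The hash function MLNX-GW uses is not documented in this section, so `zlib.crc32` stands in for it purely as an assumption; the VF count of 4 and the addresses are also hypothetical.

```python
import ipaddress
import zlib
from collections import Counter


def vf_by_hash(dst_ip, num_vfs):
    """Policy 1: hash of the destination IP (crc32 is a stand-in hash)."""
    packed = ipaddress.IPv4Address(dst_ip).packed
    return zlib.crc32(packed) % num_vfs


def vf_by_last_byte(dst_ip, num_vfs):
    """Policy 2: last byte of the destination IP modulo the number of VFs."""
    return ipaddress.IPv4Address(dst_ip).packed[-1] % num_vfs


# A small pool of consecutive addresses from one Class-C range.
ips = [f"192.168.1.{i}" for i in range(1, 9)]
num_vfs = 4

print("hash     :", Counter(vf_by_hash(ip, num_vfs) for ip in ips))
print("last byte:", Counter(vf_by_last_byte(ip, num_vfs) for ip in ips))
```

With consecutive last bytes, the modulo policy places exactly two of the eight addresses on each of the four VFs, whereas a generic hash gives no such guarantee for a small address set.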

For more information, see the commands gw vf-hash-policy and show gw vf-hash-policy.

To facilitate the bundling of InfiniBand hosts into groups, each with its own designated IPoIB subnet, the Skyway appliance/domain supports multiple PKEYs, where each PKEY represents a single IPoIB subnet.
A single Skyway appliance and a Skyway domain support traffic flows over both default (0x7fff) and non-default PKEYs.
Default PKEY traffic flows are supported without additional configuration. A non-default PKEY becomes available only after the user configures it.
Multiple PKEYs/IPoIB subnets (up to 20) are supported over a single Skyway appliance/domain for a single InfiniBand subnet. Best practice implementations suggest having up to 10 such IPoIB subnets. A higher number of IPoIB subnets will incur a longer boot time.
PKEY interfaces (including the default PKEY interface) support IPv4 addressing only.

Warning
  • Do not assign the same IP subnet to two different PKEY interfaces. The software does not prevent this configuration.

  • While PKEY configuration is synced across all appliances forming a single Skyway domain, it is the user’s responsibility to properly configure the PKEYs across all the Skyway domain appliances.
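Because the software does not reject duplicate subnets, a planned configuration can be checked offline before it is applied. The sketch below is a hypothetical helper, not part of MLNX-GW; the non-default PKEY values and the addresses are made up for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlapping_pkey_subnets(pkey_subnets):
    """Return pairs of PKEYs whose IPv4 subnets overlap.

    pkey_subnets maps a PKEY to an interface address in CIDR form.
    IPv4Network is used because PKEY interfaces support IPv4 only.
    """
    nets = {
        pkey: ipaddress.IPv4Network(cidr, strict=False)
        for pkey, cidr in pkey_subnets.items()
    }
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


planned = {
    0x7FFF: "10.10.0.1/24",  # default PKEY
    0x0001: "10.20.0.1/24",  # hypothetical non-default PKEY
    0x0002: "10.10.0.2/24",  # same subnet as the default PKEY: a mistake
}
for a, b in find_overlapping_pkey_subnets(planned):
    print(f"PKEY {a:#06x} and PKEY {b:#06x} share an IP subnet")
```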

For a list of PKEY-related commands, see the PKEY InfiniBand Interface section in the MLNX-GW Routing Overview chapter.
For a configuration example, see Configuring Partition Keys (PKEYs) section.

© Copyright 2023, NVIDIA. Last updated on Sep 8, 2023.