IP over InfiniBand (IPoIB)

The IP over IB (IPoIB) ULP driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. The IPoIB driver, ib_ipoib, exploits the following capabilities:

  • VLAN simulation over an InfiniBand network via child interfaces

  • High Availability via Bonding

  • Various MTU values:

    • up to 4k in Datagram mode

    • up to 64k in Connected mode

  • Uses any ConnectX® IB ports (one or two)

  • Inserts IP/UDP/TCP checksum on outgoing packets

  • Calculates checksum on received packets

  • Supports net device TSO through the ConnectX® LSO capability to segment large datagrams into MTU-sized quanta

  • Dual operation mode - datagram and connected

  • Large MTU support through connected mode

IPoIB also supports the following software based enhancements:

  • Generic Receive Offload (GRO)

  • NAPI

  • Ethtool support

Warning

Note that switching between Enhanced mode and ULP mode can be done by setting ib_ipoib module parameter “ipoib_enhanced” to 1 or 0.

The Enhanced IPoIB feature enables offloading basic ULP capabilities to a lower, vendor-specific driver in order to optimize the IPoIB data path. This allows IPoIB to support multiple stateless offloads, such as RSS/TSS, and to better utilize the supported features, enabling IPoIB datagram to reach peak performance in both bandwidth and latency.

Enhanced IPoIB supports/performs the following:

  • Stateless offloads (RSS, TSS)

  • Multi queues

  • Interrupt moderation

  • Multi partitions optimizations

  • Sharing send/receive Work Queues

  • Vendor specific optimizations

  • UD mode only

In order to verify that the driver is using Enhanced IPoIB, run:

ip link show ibX

Output example:

8: ib1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT qlen 1024
    link/infiniband 00:00:00:67:fe:80:00:00:00:00:00:00:e4:1d:2d:03:00:a5:f0:2f brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff

Note: The driver MAC should start with 00:xxxxxxx (Upstream) or 20:xxxxxxx (OFED) in case Enhanced IPoIB is enabled.
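
The prefix check described in the note can be scripted. The following is a minimal sketch, assuming the link-layer address is read from /sys/class/net/<interface>/address. The function name ipoib_enhanced_hint is hypothetical and is not part of the driver or the OFED tooling.

```shell
# Classify an IPoIB link-layer address by its first byte, following the
# note above: 00 (upstream) or 20 (OFED) suggests Enhanced IPoIB.
# ipoib_enhanced_hint is an illustrative name, not an OFED command.
ipoib_enhanced_hint() {
    case $1 in
        00:*) echo "enhanced (upstream prefix)" ;;
        20:*) echo "enhanced (OFED prefix)" ;;
        *)    echo "not indicated" ;;
    esac
}
```

For example, ipoib_enhanced_hint "$(cat /sys/class/net/ib1/address)" classifies the address of ib1.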

IPoIB ULP can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB ULP is set to work in Datagram mode.
For better scalability and performance, we recommend using the Datagram mode. However, the mode can be changed to Connected mode by editing the file /etc/infiniband/openib.conf and setting 'SET_IPOIB_CM=yes'.
After changing the mode, you need to restart the driver by running:

/etc/init.d/openibd restart

To check the current mode used for outgoing connections, enter:

cat /sys/class/net/ib<n>/mode

Warning

Changing the IPoIB mode (CM vs UD) requires the interface to be in 'down' state.

Warning

Connected mode is not supported when using enhanced IPoIB.

In Datagram mode, the default physical port MTU (which indicates the port capability) is 4k, whereas the default IPoIB port MTU (the "logical" MTU) is 2k, as set by OpenSM.
To change the IPoIB MTU to 4k, edit the IPoIB setting in the OpenSM partition file as follows:

Default=0xffff, ipoib, mtu=5 : ALL=full;

where:

"mtu=5" indicates that all IPoIB ports in the fabric are using 4k MTU ("mtu=4" indicates 2k MTU)

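The mtu value in the partition file is an InfiniBand MTU code rather than a byte count. The sketch below maps the code to bytes and subtracts the 4-byte IPoIB encapsulation header, which accounts for the 4092 and 2044 values reported by the interface in Datagram mode. The function names are illustrative only.

```shell
# Map an InfiniBand MTU code (1..5) to its size in bytes:
# 1=256, 2=512, 3=1024, 4=2048, 5=4096.
ib_mtu_bytes() {
    case $1 in
        1|2|3|4|5) echo $((128 << $1)) ;;
        *) echo "unsupported MTU code: $1" >&2; return 1 ;;
    esac
}

# IPoIB datagram-mode MTU: the IB MTU minus the 4-byte IPoIB header.
ipoib_ud_mtu() {
    local bytes
    bytes=$(ib_mtu_bytes "$1") || return 1
    echo $((bytes - 4))
}
```

With these definitions, ipoib_ud_mtu 5 corresponds to the mtu=5 setting shown above.
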
Unless you have run the installation script mlnxofedinstall with the flag '-n', IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on.
IPoIB configuration can be based on DHCP or on a static configuration that you need to supply (see below). You can also apply a manual configuration that persists only until the next reboot or driver restart (see below).

IPoIB Configuration Based on DHCP

Setting an IPoIB interface configuration based on DHCP is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line:

  • For RedHat:

    BOOTPROTO=dhcp

  • For SLES:

    BOOTPROTO='dhcp'

    Warning

    If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under:

    /etc/sysconfig/network-scripts/ on a RedHat machine
    /etc/sysconfig/network/ on a SuSE machine.

    Warning

    A patch for DHCP may be required for supporting IPoIB. For further information, please see the README file available under the docs/dhcp/ directory.

Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier.
The length of the client identifier field is not fixed in the specification. For the NVIDIA OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier.

DHCP Server

In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag (See a file example at docs/dhcpd.conf).
The DHCP server must run on a machine which has loaded the IPoIB module. To run the DHCP server from the command line, enter:

dhcpd <IB network interface name> -d

Example:

host1# dhcpd ib0 -d

DHCP Client (Optional)

Warning

A DHCP client can be used if you need to prepare a diskless machine with an IB driver.

In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier.

Then run the DHCP client with this file using the following command:

dhclient -cf <client conf file> <IB network interface name>

Example of a configuration file for the ConnectX (PCI Device ID 26428), called dhclient.conf:

# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}

Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf:

# The value indicates a hexadecimal number
interface "ib1" {
send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
}

In order to use the configuration file, run:

host1# dhclient -cf dhclient.conf ib1

Static IPoIB Configuration

If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the ‘-n’ option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface:

  • A static IPoIB configuration

  • An IPoIB configuration based on an Ethernet configuration
    See your Linux distribution documentation for additional information about configuring IP addresses.

The following code lines are an excerpt from a sample IPoIB configuration file:

# Static settings; all values provided by this file
IPADDR_ib0=11.4.3.175
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1

# Based on eth0; each '*' will be replaced with a corresponding octet
# from eth0.
LAN_INTERFACE_ib0=eth0
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1

# Based on the first eth<n> interface that is found (for n=0,1,...);
# each '*' will be replaced with a corresponding octet from eth<n>.
LAN_INTERFACE_ib0=
IPADDR_ib0=11.4.'*'.'*'
NETMASK_ib0=255.255.0.0
NETWORK_ib0=11.4.0.0
BROADCAST_ib0=11.4.255.255
ONBOOT_ib0=1
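
The '*' substitution described in the sample file's comments can be illustrated as follows. This is only a sketch of the idea. How the installation script actually performs the substitution is not shown here, and the function name fill_ipoib_template is hypothetical.

```shell
# Replace each '*' octet in an IPoIB address template with the octet at
# the same position in an Ethernet address.
# fill_ipoib_template is an illustrative name, not part of mlnxofedinstall.
fill_ipoib_template() {
    local template=$1 eth_addr=$2
    # Drop the single quotes that surround '*' in the sample file.
    template=$(printf '%s' "$template" | tr -d "'")
    printf '%s\n%s\n' "$template" "$eth_addr" | awk -F. '
        NR == 1 { for (i = 1; i <= 4; i++) t[i] = $i }
        NR == 2 {
            out = ""
            for (i = 1; i <= 4; i++) {
                sep = (i > 1) ? "." : ""
                val = (t[i] == "*") ? $i : t[i]
                out = out sep val
            }
            print out
        }'
}
```

For example, with eth0 at the assumed address 10.0.3.175, the template 11.4.'*'.'*' resolves to 11.4.3.175.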

Manually Configuring IPoIB

Warning

This manual configuration persists only until the next reboot or driver restart.

To manually configure IPoIB for the default IB partition (VLAN), perform the following steps:

  1. Configure the interface by entering the ifconfig command with the following items:
    - The appropriate IB interface (ib0, ib1, etc.)
    - The IP address that you want to assign to the interface
    - The netmask keyword
    - The subnet mask that you want to assign to the interface
    The following example shows how to configure an IB interface:

    host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0

  2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument.
    The following example shows how to verify the configuration:

    host1$ ifconfig ib0
    ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
    inet addr:11.4.3.175 Bcast:11.4.255.255 Mask:255.255.0.0
    UP BROADCAST MULTICAST MTU:65520 Metric:1
    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
    TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:128
    RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

  3. Repeat the first two steps on the remaining interface(s).
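
On systems where ifconfig is not installed, the same manual configuration can be expressed with the ip utility from iproute2. The sketch below only prints the equivalent commands so they can be reviewed before being run as root. The helper names are illustrative, and the netmask conversion assumes a contiguous mask.

```shell
# Convert a dotted-quad netmask (e.g. 255.255.0.0) to a prefix length.
# Assumes a contiguous mask; octet order is not validated.
netmask_to_prefix() {
    local mask=$1 prefix=0 octet
    for octet in $(printf '%s' "$mask" | tr . ' '); do
        case $octet in
            255) prefix=$((prefix + 8)) ;;
            254) prefix=$((prefix + 7)) ;;
            252) prefix=$((prefix + 6)) ;;
            248) prefix=$((prefix + 5)) ;;
            240) prefix=$((prefix + 4)) ;;
            224) prefix=$((prefix + 3)) ;;
            192) prefix=$((prefix + 2)) ;;
            128) prefix=$((prefix + 1)) ;;
            0)   ;;
            *) echo "invalid netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$prefix"
}

# Print the iproute2 commands equivalent to:
#   ifconfig <dev> <addr> netmask <mask>
ipoib_ip_commands() {
    local dev=$1 addr=$2 mask=$3 prefix
    prefix=$(netmask_to_prefix "$mask") || return 1
    echo "ip addr add $addr/$prefix dev $dev"
    echo "ip link set dev $dev up"
}
```

Running ipoib_ip_commands ib0 11.4.3.175 255.255.0.0 prints the two ip commands for the example above. As with ifconfig, the resulting configuration does not persist across a reboot or driver restart.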

You can create sub-interfaces for a primary IPoIB interface to provide traffic isolation. Each such sub-interface (also called a child interface) has different IP and network addresses from the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface.
This section describes how to:

  • Create a subinterface

  • Remove a subinterface

Creating a Subinterface

In the following procedure, ib0 is used as an example of an IB sub-interface.

To create a child interface (sub-interface), follow this procedure:

  1. Decide on the PKey to be used in the subnet (valid values can be 0 or any 16-bit unsigned value). The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 1 will give a PKey with the value 0x8001.

  2. Create a child interface by running:

    host1$ echo <PKey> > /sys/class/net/<IB subinterface>/create_child

    Example:

    host1$ echo 1 > /sys/class/net/ib0/create_child

    This will create the interface ib0.8001.

  3. Verify the configuration of this interface by running:

    host1$ ifconfig <subinterface>.<subinterface PKey>

    Using the example of the previous step:

    host1$ ifconfig ib0.8001
    ib0.8001 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
    BROADCAST MULTICAST MTU:2044 Metric:1
    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
    TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:128
    RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

  4. As can be seen, the interface does not have IP or network addresses. To configure those, follow the manual configuration procedure described in the "Manually Configuring IPoIB" section above.

  5. To be able to use this interface, the Subnet Manager must be configured so that the chosen PKey, which defines a broadcast address, is recognized.
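
The relationship between the value written to create_child, the resulting PKey and the child interface name can be sketched as follows. The helper names are illustrative only.

```shell
# Return the PKey with the most significant bit (0x8000) set,
# e.g. 1 -> 0x8001. Accepts decimal or 0x-prefixed hexadecimal input.
pkey_full_member() {
    printf '0x%04x\n' $(( $1 | 0x8000 ))
}

# Return the name of the child interface created for a parent and PKey,
# e.g. ib0 and 1 -> ib0.8001.
child_ifname() {
    printf '%s.%04x\n' "$1" $(( $2 | 0x8000 ))
}
```

For example, child_ifname ib0 1 gives the ib0.8001 name shown in step 2, and pkey_full_member 1 gives the value to use when the interface is later removed.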

Removing a Subinterface

To remove a child interface (subinterface), run:

echo <subinterface PKey> > /sys/class/net/<ib_interface>/delete_child

Using the example of the second step from the previous section:

echo 0x8001 > /sys/class/net/ib0/delete_child

Note that when deleting the interface you must use the PKey value with the most significant bit set (e.g., 0x8001 in the example above).

To verify that your configuration and IPoIB functionality are working, perform the following steps:

  1. Verify the IPoIB functionality by using the ifconfig command.
    The following example shows how two IB nodes are used to verify IPoIB functionality. In the following example, IB node 1 is at 11.4.3.175, and IB node 2 is at 11.4.3.176:

    host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0
    host2# ifconfig ib0 11.4.3.176 netmask 255.255.0.0

  2. Enter the ping command from 11.4.3.175 to 11.4.3.176.
    The following example shows how to enter the ping command:

    host1# ping -c 5 11.4.3.176
    PING 11.4.3.176 (11.4.3.176) 56(84) bytes of data.
    64 bytes from 11.4.3.176: icmp_seq=0 ttl=64 time=0.079 ms
    64 bytes from 11.4.3.176: icmp_seq=1 ttl=64 time=0.044 ms
    64 bytes from 11.4.3.176: icmp_seq=2 ttl=64 time=0.055 ms
    64 bytes from 11.4.3.176: icmp_seq=3 ttl=64 time=0.049 ms
    64 bytes from 11.4.3.176: icmp_seq=4 ttl=64 time=0.065 ms
    --- 11.4.3.176 ping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 3999ms
    rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2
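
When this check is scripted, the packet-loss figure can be taken from the ping summary line. The following is a minimal sketch that assumes the summary format shown in the example output. The function name ping_loss_percent is illustrative.

```shell
# Read ping output on standard input and print the packet-loss
# percentage from the summary line (e.g. "0" for "0% packet loss").
ping_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

For example, ping -c 5 11.4.3.176 | ping_loss_percent prints the loss percentage, which should be 0 on a working link.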

To create an interface configuration script for the ibX and bondX interfaces, you should use the standard syntax (depending on your OS).
Bonding of IPoIB interfaces is accomplished in the same manner as bonding of Ethernet interfaces: via the Linux Bonding Driver.

  • Network Script files for IPoIB slaves are named after the IPoIB interfaces (e.g., ifcfg-ib0)

  • The only meaningful bonding policy in IPoIB is High-Availability (bonding mode number 1, or active-backup)

  • Bonding parameter "fail_over_mac" is meaningless in IPoIB interfaces, hence, the only supported value is the default: 0

For a persistent bonding IPoIB Network configuration, use the same Linux Network Scripts semantics, with the following exceptions/additions:

  • In the bonding master configuration file (e.g: ifcfg-bond0), in addition to Linux bonding semantics, use the following parameter: MTU=65520

    Warning

    65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (See "IPoIB Mode Setting") and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If you do not set the correct MTU or do not set MTU at all, performance of the interface might decrease.

  • In the bonding slave configuration file (e.g: ifcfg-ib0), use the same Linux Network Scripts semantics. In particular: DEVICE=ib0

  • In the bonding slave configuration file (e.g: ifcfg-ib0.8003), the line TYPE=InfiniBand is necessary when using bonding over devices configured with partitions (p_key)

  • For RHEL users:
    In /etc/modprobe.d/bond.conf add the following lines:

    alias bond0 bonding

  • For SLES users:
    It is necessary to update the MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names of the IPoIB slave devices (e.g. ib0, ib1, etc.). Otherwise, bonding master may be created before IPoIB slave interfaces at boot time.
    It is possible to have multiple IPoIB bonding masters and a mix of IPoIB bonding masters and Ethernet bonding masters. However, it is NOT possible to mix Ethernet and IPoIB slaves under the same bonding master.
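
The MTU rule for the bonding master can be checked against the mode reported by each slave. The sketch below assumes the mode string is read from /sys/class/net/ib<n>/mode and uses the MTU values given above. The function name bond_mtu_for_mode is illustrative.

```shell
# Print the bonding master MTU that matches the IPoIB mode of its slaves:
# 65520 for Connected mode, 2044 for Datagram mode.
bond_mtu_for_mode() {
    case $1 in
        connected) echo 65520 ;;
        datagram)  echo 2044 ;;
        *) echo "unknown IPoIB mode: $1" >&2; return 1 ;;
    esac
}
```

For example, bond_mtu_for_mode "$(cat /sys/class/net/ib0/mode)" prints the value to place in the MTU= line of ifcfg-bond0, provided that all slaves report the same mode.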

Warning

Restarting openibd does not keep the bonding configuration via Network Scripts. You have to restart the network service in order to bring up the bonding master. After the configuration is saved, restart the network service by running: /etc/init.d/network restart.

Dynamic PKey change means the PKey can be changed (added/removed) in the SM database and the interface that is attached to that PKey is updated immediately without the need to restart the driver.
If the PKey is already configured in the port by the SM, the child-interface can be used immediately. If not, the interface will be ready to use only when SM adds the relevant PKey value to the port after the creation of the child interface. No additional configuration is required once the child-interface is created.

The Precision Time Protocol (PTP) feature allows for accurate synchronization between the distributed entities over the network. The synchronization is based on symmetric Round Trip Time (RTT) between the master and slave devices.

This feature is enabled by default, and is also supported over PKey interfaces.

For more on the PTP feature, refer to the "Running Linux PTP with ConnectX-4/ConnectX-5/ConnectX-6" Community post.

For further information on Time-Stamping, follow the steps in "Time-Stamping Service".

1PPS is a time synchronization feature that allows the adapter to send or receive 1 pulse per second on a dedicated pin on the adapter card using an SMA connector (SubMiniature version A). Only one pin is supported, and it can be configured as either 1PPS in or 1PPS out.
For further information, refer to the "HowTo Test 1PPS on NVIDIA Adapters" Community post.

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.