IP over InfiniBand (IPoIB)
The IP over IB (IPoIB) ULP driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. The IPoIB driver, ib_ipoib, exploits the following capabilities:
VLAN simulation over an InfiniBand network via child interfaces
High Availability via Bonding
Varies MTU values:
up to 4k in Datagram mode
up to 64k in Connected mode
Uses any ConnectX® IB ports (one or two)
Inserts IP/UDP/TCP checksum on outgoing packets
Calculates checksum on received packets
Support net device TSO through ConnectX® LSO capability to defragment large data- grams to MTU quantas.
Dual operation mode - datagram and connected
Large MTU support through connected mode
IPoIB also supports the following software based enhancements:
Giant Receive Offload
NAPI
Ethtool support
Note that switching between Enhanced mode and ULP mode can be done by setting ib_ipoib module parameter “ipoib_enhanced” to 1 or 0.
Enhanced IPoIB feature enables offloading ULP basic capabilities to a lower vendor specific driver, in order to optimize IPoIB data path. This will allow IPoIB to support multiple stateless offloads, such as RSS/TSS, and better utilize the features supported, enabling IPoIB datagram to reach peak performance in both bandwidth and latency.
Enhanced IPoIB supports/performs the following:
Stateless offloads (RSS, TSS)
Multi queues
Interrupt moderation
Multi partitions optimizations
Sharing send/receive Work Queues
Vendor specific optimizations
UD mode only
In order to verify that the driver is using Enhanced IPoIB, run:
ip link show ibX
Output example:
8
: ib1: <BROADCAST,MULTICAST> mtu 4092
qdisc noop state DOWN mode DEFAULT qlen 1024
link/infiniband 00
:00
:00
:67
:fe:80
:00
:00
:00
:00
:00
:00
:e4:1d:2d:03
:00
:a5:f0:2f brd
00
:ff:ff:ff:ff:12
:40
:1b:ff:ff:00
:00
:00
:00
:00
:00
:ff:ff:ff:ff
Note: The driver MAC should start with 00:xxxxxxx (Upstream) or 20:xxxxxxx (OFED) in case Enhanced IPoIB is enabled.
IPoIB ULP can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB ULP is set to work in Datagram mode.
For better scalability and performance, we recommend using the Datagram mode. However, the mode can be changed to Connected mode by editing the file /etc/infiniband/openib.conf and setting 'SET_IPOIB_CM=yes'.
After changing the mode, you need to restart the driver by running:
/etc/init.d/openibd restart
To check the current mode used for out-going connections, enter
:
cat /sys/class
/net/ib<n>/mode
Changing the IPoIB mode (CM vs UD) requires the interface to be in 'down' state.
Connected mode is not supported when using enhanced IPoIB.
The physical port MTU in Datagram mode (indicates the port capability) default value is 4k, whereas the IPoIB port MTU ("logical" MTU) default value is 2k as it is set by the OpenSM.
To change the IPoIB MTU to 4k, edit the OpenSM partition file in the section of IPoIB setting as follow:
Default=0xffff
, ipoib, mtu=5
: ALL=full;
where:
"mtu=5" indicates that all IPoIB ports in the fabric are using 4k MTU, ("mtu=4" indi- cates 2k MTU)
Unless you have run the installation script mlnxofedinstall with the flag '-n', then IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on.
IPoIB configuration can be based on DHCP or on a static configuration that you need to supply (see below). You can also apply a manual configuration that persists only until the next reboot or driver restart (see below).
IPoIB Configuration Based on DHCP
Setting an IPoIB interface configuration based on DHCP is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line:
For RedHat:
BOOTPROTO=dhcp
For SLES:
BOOTPROTO=
'dchp'
WarningIf IPoIB configuration files are included, ifcfg-ib<n> files will be installed under:
/etc/sysconfig/network-scripts/ on a RedHat machine
/etc/sysconfig/network/ on a SuSE machine.WarningA patch for DHCP may be required for supporting IPoIB. For further information, please see the REAME file available under the docs/dhcp/ directory.
Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier.
The length of the client identifier field is not fixed in the specification. For the NVIDIA OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier.
DHCP Server
In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag (See a file example at docs/dhcpd.conf).
The DHCP server must run on a machine which has loaded the IPoIB module. To run the DHCP server from the command line, enter:
dhcpd <IB network interface
name> -d
Example:
host1# dhcpd ib0 -d
DHCP Client (Optional)
A DHCP client can be used if you need to prepare a diskless machine with an IB driver.
In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier.
Then run the DHCP client with this file using the following command:
dhclient –cf <client conf file> <IB network interface
name>
Example of a configuration file for the ConnectX (PCI Device ID 26428), called dhclient.conf:
The value indicates a hexadecimal number interface
"ib1"
{
send dhcp-client-identifier
ff:00
:00
:00
:00
:00
:02
:00
:00
:02
:c9:00
:00
:02
:c9:03
:00
:00
:10
:39
;
}
Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf:
The value indicates a hexadecimal number interface
"ib1"
{
send dhcp-client-identifier
20
:00
:55
:04
:01
:fe:80
:00
:00
:00
:00
:00
:00
:00
:02
:c9:02
:00
:23
:13
:92
;
}
In order to use the configuration file, run
:
host1# dhclient –cf dhclient.conf ib1
Static IPoIB Configuration
If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the ‘-n’ option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface:
A static IPoIB configuration
An IPoIB configuration based on an Ethernet configuration
See your Linux distribution documentation for additional information about configuring IP addresses.
The following code lines are an excerpt from a sample IPoIB configuration file:
# Static settings; all values provided by this
file
IPADDR_ib0=10.4
.3.175
NETMASK_ib0=255.255
.0.0
NETWORK_ib0=10.4
.0.0
BROADCAST_ib0=10.4
.255.255
ONBOOT_ib0=1
# Based on eth0; each '*'
will be replaced with a corresponding octet
# from eth0.
LAN_INTERFACE_ib0=eth0
IPADDR_ib0=10.4
.'*'
.'*'
NETMASK_ib0=255.255
.0.0
NETWORK_ib0=10.4
.0.0
BROADCAST_ib0=10.4
.255.255
ONBOOT_ib0=1
# Based on the first eth<n> interface
that is found (for
n=0
,1
,...);
# each '*'
will be replaced with a corresponding octet from eth<n>.
LAN_INTERFACE_ib0=
IPADDR_ib0=10.4
.'*'
.'*'
NETMASK_ib0=255.255
.0.0
NETWORK_ib0=10.4
.0.0
BROADCAST_ib0=10.4
.255.255
ONBOOT_ib0=1
Manually Configuring IPoIB
This manual configuration persists only until the next reboot or driver restart.
To manually configure IPoIB for the default IB partition (VLAN), perform the following steps:
Configure the interface by entering the ifconfig command with the following items:
- The appropriate IB interface (ib0, ib1, etc.)
- The IP address that you want to assign to the interface
- The netmask keyword
- The subnet mask that you want to assign to the interface
The following example shows how to configure an IB interface:host1$ ifconfig ib0
10.4
.3.175
netmask255.255
.0.0
(Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument.
The following example shows how to verify the configuration:host1$ ifconfig ib0 b0 Link encap:UNSPEC HWaddr
80
-00
-04
-04
-FE-80
-00
-00
-00
-00
-00
-00
-00
-00
-00
-00
inet addr:10.4
.3.175
Bcast:10.4
.255.255
Mask:255.255
.0.0
UP BROADCAST MULTICAST MTU:65520
Metric:1
RX packets:0
errors:0
dropped:0
overruns:0
frame:0
TX packets:0
errors:0
dropped:0
overruns:0
carrier:0
collisions:0
txqueuelen:128
RX bytes:0
(0.0
b) TX bytes:0
(0.0
b)Repeat the first two steps on the remaining interface(s).
You can create sub-interfaces for a primary IPoIB interface to provide traffic isolation. Each such sub-interface (also called a child interface) has a different IP and network addresses from the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface.
This section describes how to:
Create a subinterface
Remove a subinterface
Creating a Subinterface
In the following procedure, ib0 is used as an example of an IB sub-interface.
To create a child interface (sub-interface), follow this procedure:
Decide on the PKey to be used in the subnet (valid values can be 0 or any 16-bit unsigned value). The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 1 will give a PKey with the value 0x8001.
Create a child interface by running:
host1$ echo <PKey> > /sys/
class
/net/<IB subinterface>/create_childExample:
host1$ echo
1
> /sys/class
/net/ib0/create_childThis will create the interface ib0.8001.
Verify the configuration of this interface by running:
host1$ ifconfig <subinterface>.<subinterface PKey>
Using the example of the previous step:
host1$ ifconfig ib0.
8001
ib0.8001
Link encap:UNSPEC HWaddr80
-00
-00
-4A-FE-80
-00
-00
-00
-00
-00
-00
-00
-00
-00
-00
BROADCAST MULTICAST MTU:2044
Metric:1
RX packets:0
errors:0
dropped:0
overruns:0
frame:0
TX packets:0
errors:0
dropped:0
overruns:0
carrier:0
collisions:0
txqueuelen:128
RX bytes:0
(0.0
b) TX bytes:0
(0.0
b)As can be seen, the interface does not have IP or network addresses. To configure those, you should follow the manual configuration procedure described in "Manually Configuring IPoIB" section above.
To be able to use this interface, a configuration of the Subnet Manager is needed so that the PKey chosen, which defines a broadcast address, be recognized.
Removing a Subinterface
To remove a child interface (subinterface), run:
echo <subinterface PKey> /sys/class
/net/<ib_interface>/delete_child
Using the example of the second step from the previous chapter:
echo 0x8001
> /sys/class
/net/ib0/delete_child
Note that when deleting the interface you must use the PKey value with the most significant bit set (e.g., 0x8000 in the example above).
To verify your configuration and IPoIB functionality are successful, perform the following steps:
Verify the IPoIB functionality by using the ifconfig command.
The following example shows how two IB nodes are used to verify IPoIB functionality. In the following example, IB node 1 is at 10.4.3.175, and IB node 2 is at 10.4.3.176:host1# ifconfig ib0
10.4
.3.175
netmask255.255
.0.0
host2# ifconfig ib010.4
.3.176
netmask255.255
.0.0
Enter the ping command from 10.4.3.175 to 10.4.3.176.
The following example shows how to enter the ping command:
host1# ping -c
5
10.4
.3.176
PING10.4
.3.176
(10.4
.3.176
)56
(84
) bytes of data.64
bytes from10.4
.3.176
: icmp_seq=0
ttl=64
time=0.079
ms64
bytes from10.4
.3.176
: icmp_seq=1
ttl=64
time=0.044
ms64
bytes from10.4
.3.176
: icmp_seq=2
ttl=64
time=0.055
ms64
bytes from10.4
.3.176
: icmp_seq=3
ttl=64
time=0.049
ms64
bytes from10.4
.3.176
: icmp_seq=4
ttl=64
time=0.065
ms ---10.4
.3.176
ping statistics ---5
packets transmitted,5
received,0
% packet loss, time 3999ms rtt min/avg/max/mdev =0.044
/0.058
/0.079
/0.014
ms, pipe2
To create an interface configuration script for the ibX and bondX interfaces, you should use the standard syntax (depending on your OS).
Bonding of IPoIB interfaces is accomplished in the same manner as would bonding of Ethernet interfaces: via the Linux Bonding Driver.
Network Script files for IPoIB slaves are named after the IPoIB interfaces (e.g: ifcfg- ib0)
The only meaningful bonding policy in IPoIB is High-Availability (bonding mode number 1, or active-backup)
Bonding parameter "fail_over_mac" is meaningless in IPoIB interfaces, hence, the only supported value is the default: 0
For a persistent bonding IPoIB Network configuration, use the same Linux Network Scripts semantics, with the following exceptions/ additions:
In the bonding master configuration file (e.g: ifcfg-bond0), in addition to Linux bonding semantics, use the following parameter: MTU=65520
Warning65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (See "IPoIB Mode Setting") and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If you do not set the correct MTU or do not set MTU at all, performance of the interface might decrease.
Dynamically Connected Transport (DCT)
In the bonding slave configuration file (e.g: ifcfg-ib0), use the same Linux Network Scripts semantics. In particular: DEVICE=ib0
In the bonding slave configuration file (e.g: ifcfg-ib0.8003), the line TYPE=InfiniBand is necessary when using bonding over devices configured with partitions (p_key)
For RHEL users:
In /etc/modprobe.b/bond.conf add the following lines:alias bond0 bonding
For SLES users:
It is necessary to update the MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names of the IPoIB slave devices (e.g. ib0, ib1, etc.). Otherwise, bonding master may be created before IPoIB slave interfaces at boot time.
It is possible to have multiple IPoIB bonding masters and a mix of IPoIB bonding master and Ethernet bonding master. However, It is NOT possible to mix Ethernet and IPoIB slaves under the same bonding master.
Restarting openibd does no keep the bonding configuration via Network Scripts. You have to restart the network service in order to bring up the bonding master. After the configuration is saved, restart the network service by running: /etc/init.d/network restart.
Dynamic PKey change means the PKey can be changed (add/removed) in the SM database and the interface that is attached to that PKey is updated immediately without the need to restart the driver.
If the PKey is already configured in the port by the SM, the child-interface can be used immediately. If not, the interface will be ready to use only when SM adds the relevant PKey value to the port after the creation of the child interface. No additional configuration is required once the child-interface is created.
This feature allows for accurate synchronization between the distributed entities over the network. The synchronization is based on symmetric Round Trip Time (RTT) between the master and slave devices.
This feature is enabled by default, and is also supported over PKey interfaces.
For more on the PTP feature, refer to Running Linux PTP with ConnectX-4/ConnectX-5/ConnectX-6 Community post.
For further information on Time-Stamping, follow the steps in "Time-Stamping Service".
1PPS is a time synchronization feature that allows the adapter to be able to send or receive 1 pulse per second on a dedicated pin on the adapter card using an SMA connector (SubMiniature version A). Only one pin is supported and could be configured as 1PPS in or 1PPS out.
For further information, refer to HowTo Test 1PPS on NVIDIA Adapters Community post.