ESXi 6.5: NVIDIA ConnectX-4 onwards NICs NATIVE ESXi Driver for VMware vSphere User Manual v4.16.71.1

Ethernet Network

ConnectX®-4 onward adapter cards' ports can be individually configured to work as InfiniBand or Ethernet ports. The port type depends on the card type. In the case of a VPI card, the default type is IB. If you wish to change the port type, use the mlxconfig script.

To use a VPI card as an Ethernet only card, run:

/opt/mellanox/bin/mlxconfig -d /dev/mt4115_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2

The protocol types are:

  • Port Type 1 = IB

  • Port Type 2 = Ethernet

For further information on how to set the port type in ConnectX®-4 onwards, please refer to the MFT User Manual (www.mellanox.com → Products → Software → InfiniBand/VPI Software → MFT - Firmware Tools).
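To verify how the ports are currently configured, you can query the firmware configuration with mlxconfig. This is a standard mlxconfig query; the device name below is the same example device used above, and the grep filter is only for convenience:

/opt/mellanox/bin/mlxconfig -d /dev/mt4115_pciconf0 query | grep LINK_TYPE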

Warning

Wake-on-LAN (WoL) is applicable only to adapter cards that support this feature.

Wake-on-LAN (WoL) is a technology that allows a network professional to remotely power on a computer or to wake it up from sleep mode.

  • To enable WoL:

    esxcli network nic set -n <nic name> -w g

    or

    vsish -e set /net/pNics/<nic name>/wol g

  • To disable WoL:

    vsish -e set /net/pNics/<nic name>/wol d

  • To verify configuration:

    esxcli network nic get -n vmnic5
       Advertised Auto Negotiation: true
       Advertised Link Modes: 10000baseT/Full, 40000baseT/Full, 100000baseT/Full, 100baseT/Full, 1000baseT/Full, 25000baseT/Full, 50000baseT/Full
       Auto Negotiation: false
       Cable Type: DA
       Current Message Level: -1
       Driver Info:
             Bus Info: 0000:82:00:1
             Driver: nmlx5_core
             Firmware Version: 12.20.1010
             Version: 4.15.10.3
       Link Detected: true
       Link Status: Up
       Name: vmnic5
       PHYAddress: 0
       Pause Autonegotiate: false
       Pause RX: false
       Pause TX: false
       Supported Ports:
       Supports Auto Negotiation: true
       Supports Pause: false
       Supports Wakeon: false
       Transceiver:
       Wakeon: MagicPacket(tm)

For further information, see https://kb.vmware.com/s/article/1004089

The driver is set to auto-negotiate by default. However, the link speed can be forced to a specific link speed supported by ESXi using the following command:

esxcli network nic set -n <vmnic> -S <speed> -D <full, half>

Example:

esxcli network nic set -n vmnic4 -S 10000 -D full

Where:

  • <speed> can be one of the supported speeds that can be queried using: "esxcli network nic get -n vmnic1"

  • <vmnic> is the vmnic for the Mellanox card as provided by ESXi

  • <full, half> The duplex to set this NIC to. Acceptable values are: [full, half]

The driver can be reset to auto-negotiate using the following command:

esxcli network nic set -n <vmnic> -a

Example:

esxcli network nic set -n vmnic4 -a

where <vmnic> is the vmnic for the Mellanox card as provided by ESXi.

Priority Flow Control (PFC)

Priority Flow Control (PFC) IEEE 802.1Qbb applies pause functionality to specific classes of traffic on the Ethernet link. PFC can provide different levels of service to specific classes of Ethernet traffic (using IEEE 802.1p traffic classes).

Warning

When PFC is enabled, Global Pause will be operationally disabled, regardless of what is configured for the Global Pause Flow Control.

To configure PFC, do the following:

  1. Enable PFC for specific priorities.

    esxcfg-module nmlx5_core -s "pfctx=0x08 pfcrx=0x08"

    The parameters “pfctx” (PFC TX) and “pfcrx” (PFC RX) are specified per host. If you have more than a single card on the server, all ports will be enabled with PFC (Global Pause will be disabled even if configured).

    The value is a bitmap of 8 bits = 8 priorities. We recommend that you enable only lossless applications on a specific priority.

    To run more than one flow type on the server, turn on only one priority (e.g. priority 3). Priorities are counted from 0, so priority 3 corresponds to the 4th bit, and the parameters should be set to "0x08" = 00001000b (binary). You can verify the applied values after the reboot, as shown in the example following these steps.

    Warning

    The values of “pfctx” and “pfcrx” must be identical.

  2. Restart the host for changes to the module parameters to take effect.

    reboot
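After the host comes back up, you can confirm that the PFC module parameters were applied. This is the standard esxcli module parameter query used elsewhere in this manual; the grep filter is only for readability:

esxcli system module parameters list -m nmlx5_core | grep pfc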

Receive Side Scaling (RSS)

Receive Side Scaling (RSS) technology allows spreading incoming traffic between different receive descriptor queues. Assigning each queue to different CPU cores allows better load balancing of the incoming traffic and improves performance.

Default Queue Receive Side Scaling (DRSS)

Default Queue RSS (DRSS) allows the user to configure multiple hardware queues backing up the default RX queue. DRSS improves performance for large-scale multicast traffic between hypervisors and Virtual Machine interfaces.

To configure DRSS, use the 'DRSS' module parameter, which replaces the previously advertised 'device_rss' module parameter ('device_rss' is now obsolete). The 'DRSS' and 'device_rss' module parameters are mutually exclusive.

If the 'device_rss' module parameter is enabled, the following functionality will be configured:

  • The new Default Queue RSS mode will be triggered and all hardware RX rings will be utilized, similar to the previous 'device_rss' functionality

  • Module parameters 'DRSS' and 'RSS' will be ignored, thus neither NetQ RSS nor the standard NetQ will be active

To query the 'DRSS' module parameter default, its minimal or maximal values, and restrictions, run a standard esxcli command.

For example:

#esxcli system module parameters list -m nmlx5_core
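As an illustration only, the following command enables DRSS with a hypothetical value of 4 hardware queues; check the query output above for the values and default actually supported on your system. Note that 'esxcli system module parameters set' replaces the module's entire parameter string, so include any other parameters you have already set (e.g. pfctx/pfcrx), and reboot the host for the change to take effect.

esxcli system module parameters set -m nmlx5_core -p "DRSS=4"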


NetQ RSS

NetQ RSS is a new module parameter for ConnectX-4 adapter cards providing functionality identical to that of the ConnectX-3 module parameter 'num_rings_per_rss_queue'. The new module parameter allows the user to configure multiple hardware queues backing up the single RX queue. NetQ RSS improves vMotion performance and the bandwidth of multiple streams of IPv4/IPv6 TCP/UDP/IPsec over a single interface between the Virtual Machines.

To configure NetQ RSS, use the 'RSS' module parameter. To query the 'RSS' module parameter default, its minimal or maximal values, and restrictions, run a standard esxcli command.

For example:

#esxcli system module parameters list -m nmlx5_core
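Similarly, the example below enables NetQ RSS with a hypothetical value of 4 queues; the valid range and default are reported by the query above, and a host reboot is required for the change to take effect.

esxcli system module parameters set -m nmlx5_core -p "RSS=4"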

Warning

Using NetQ RSS is preferred over the Default Queue RSS. Therefore, if both module parameters are set but the system lacks resources to support both, NetQ RSS will be used instead of DRSS.

Important Notes

If the 'DRSS' and 'RSS' module parameters set by the user cannot be enforced by the system due to lack of resources, the following actions are taken in a sequential order:

  1. The system will attempt to provide the module parameters default values instead of the ones set by the user

  2. The system will attempt to provide 'RSS' (NetQ RSS mode) default value. The Default Queue RSS will be disabled

  3. The system will load with only standard NetQ queues

  4. 'DRSS' and 'RSS' parameters are disabled by default, and the system loads with standard NetQ mode

RDMA over Converged Ethernet (RoCE)

Remote Direct Memory Access (RDMA) is the remote memory management capability that allows server-to-server data movement directly between application memory without any CPU involvement. RDMA over Converged Ethernet (RoCE) is a mechanism to provide this efficient data transfer with very low latencies on lossless Ethernet networks. With advances in data center convergence over reliable Ethernet, ConnectX® EN with RoCE uses the proven and efficient RDMA transport to provide the platform for deploying RDMA technology in mainstream data center applications at 10GigE and 40GigE link speeds. ConnectX® EN with its hardware offload support takes advantage of these efficient RDMA transport (InfiniBand) services over Ethernet to deliver ultra-low latency for performance-critical and transaction-intensive applications such as financial, database, storage, and content delivery networks.

When working with RDMA applications over the Ethernet link layer, the following points should be noted:

  • The presence of a Subnet Manager (SM) is not required in the fabric. Thus, operations that require communication with the SM are managed in a different way in RoCE. This does not affect the API itself, but only actions, such as joining a multicast group, that need to be taken when using the API

  • Since LID is a layer 2 attribute of the InfiniBand protocol stack, it is not set for a port and is displayed as zero when querying the port

  • With RoCE, the alternate path is not set for RC QP and therefore APM is not supported

  • GID format can be of 2 types, IPv4 and IPv6. An IPv4 GID is an IPv4-mapped IPv6 address, while an IPv6 GID is the IPv6 address itself

  • VLAN tagged Ethernet frames carry a 3-bit priority field. The value of this field is derived from the InfiniBand SL field by taking the 3 least significant bits of the SL field

  • RoCE traffic is not shown in the associated Ethernet device's counters since it is offloaded by the hardware and does not go through the Ethernet network driver. RoCE traffic is counted in the same place where InfiniBand traffic is counted (the RDMA device name can be obtained as shown in the example after this list):

    esxcli rdma device stats get -d [RDMA device]

    Warning

    It is recommended to use RoCE with PFC enabled in the driver and network switches.

    For instructions on how to enable PFC in the driver, see the “Priority Flow Control (PFC)” section.
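To find the RDMA device name to pass to the stats command above, you can list the RDMA devices known to ESXi and then query the statistics of the relevant device (vmrdma0 below is just an example name):

esxcli rdma device list

esxcli rdma device stats get -d vmrdma0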

RoCE Modes

RoCE encapsulates InfiniBand transport in one of the following Ethernet packet formats:

  • RoCEv1 - dedicated ether type (0x8915)

  • RoCEv2 - UDP and dedicated UDP port (4791)

RoCEv1

RoCE v1 protocol is defined as RDMA over Ethernet header. It uses ethertype 0x8915 and can be used with or without the VLAN tag. The regular Ethernet MTU applies to the RoCE frame.

RoCEv2

A straightforward extension of the RoCE protocol enables traffic to operate in IP layer 3 environments. This capability is obtained via a simple modification of the RoCE packet format. Instead of the GRH used in RoCE, IP routable RoCE packets carry an IP header which allows traversal of IP L3 Routers and a UDP header (RoCEv2 only) that serves as a stateless encapsulation layer for the RDMA Transport Protocol Packets over IP.

The proposed RoCEv2 packets use a well-known UDP destination port value that unequivocally distinguishes the datagram. Similar to other protocols that use UDP encapsulation, the UDP source port field is used to carry an opaque flow-identifier that allows network devices to implement packet forwarding optimizations (e.g. ECMP) while staying agnostic to the specifics of the protocol header format.

Furthermore, since this change exclusively affects the packet format on the wire, and due to the fact that with RDMA semantics packets are generated and consumed below the API, applications can seamlessly operate over any form of RDMA service, in a completely transparent way.

GID Table Population

The GID table is automatically populated by the ESXi RDMA stack using the 'binds' mechanism, and has a maximum size of 128 entries per port. Each bind can be of type RoCE v1 or RoCE v2, where entries of both types can coexist in the same table. Binds are created using an IP-based GID generation scheme.

For more information, please refer to the "VMkernel APIs Reference Manual."

Prerequisites

The following are the driver's prerequisites for setting or configuring RoCE (an example of checking the installed firmware version follows the list):

  • ConnectX®-4 firmware version 12.17.2020 and above

  • ConnectX®-4 Lx firmware version 14.17.2020 and above

  • ConnectX®-5 firmware version 16.20.1000 and above

  • All applications that run over InfiniBand verbs should work on RoCE links if they use GRH headers.

  • All ports must be set to use Ethernet protocol
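To check the firmware version currently running on an adapter, you can read it from the NIC properties shown earlier in this section (the vmnic name is an example; the grep filter simply narrows the output):

esxcli network nic get -n vmnic4 | grep "Firmware Version"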

Running and Configuring RoCE on ESXi VMs

RoCE on ESXi VMs can run on VMs that are associated with either SR-IOV EN Virtual Functions or passthrough devices.

In order to function reliably, RoCE requires a form of flow control. While it is possible to use global flow control, this is normally undesirable, for performance reasons.

The normal and optimal way to use RoCE is to use Priority Flow Control (PFC). To use PFC, it must be enabled on all endpoints and switches in the flow path.

On ESXi, the PFC settings should be set on the ESXi host only and not on the VMs, as the ESXi host is the one to control PFC settings. PFC settings can be changed using the nmlx5_core module parameters pfctx and pfcrx. For further information, please refer to “nmlx5_core Parameters”.

For further information on how to use and run RoCE on the VM, please refer to the VM's driver User Manual. Additional information can be found in the RoCE Over L2 Network Enabled with PFC User Guide:

http://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf

Explicit Congestion Notification (ECN)

Explicit Congestion Notification (ECN) is an extension to the Internet Protocol and to the Transmission Control Protocol and is defined in RFC 3168 (2001). ECN allows end-to-end notification of network congestion without dropping packets. ECN is an optional feature that may be used between two ECN-enabled endpoints when the underlying network infrastructure also supports it.

ECN is enabled by default (ecn=1). To disable it, set the “ecn” module parameter to 0. For most use cases, the default ECN settings are sufficient. However, if further changes are required, use the nmlxcli management tool to tune the ECN algorithm behavior. For further information on the tool, see the "34258511” section. The nmlxcli management tool can also be used to provide different ECN statistics.
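For example, ECN can be disabled by setting the 'ecn' module parameter to 0 and rebooting the host:

esxcli system module parameters set -m nmlx5_core -p "ecn=0"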

VXLAN/Geneve Hardware Offload

VXLAN/Geneve hardware offload enables the traditional offloads to be performed on the encapsulated traffic. With ConnectX® family adapter cards, data center operators can decouple the overlay network layer from the physical NIC performance, thus achieving native performance in the new network architecture.

Configuring Overlay Networking Stateless Hardware Offload

VXLAN/Geneve hardware offload includes:

  • TX: Calculates the Inner L3/L4 and the Outer L3 checksum

  • RX:

    • Checks the Inner L3/L4 and the Outer L3 checksum

    • Maps the VXLAN traffic to an RX queue according to:

      • Inner destination MAC address

      • Outer destination MAC address

      • VXLAN ID

VXLAN/Geneve hardware offload is enabled by default and its status cannot be changed.

VXLAN/Geneve configuration is done in the ESXi environment via VMware NSX manager. For additional NSX information, please refer to VMware documentation: http://pubs.vmware.com/NSX-62/index.jsp#com.vmware.nsx.install.doc/GUID-D8578F6E-A40C-493A-9B43-877C2B75ED52.html.

Packet Capture Utility

The Packet Capture Utility duplicates all traffic, including RoCE, in its raw Ethernet form (before stripping) to a dedicated "sniffing" QP, and then passes it to an ESX drop capture point.
It allows gathering of bidirectional Ethernet and RoCE traffic via pktcap-uw and viewing it using regular Ethernet tools, e.g. Wireshark.

Warning

By nature, RoCE traffic is much faster than Ethernet traffic, meaning there is a significant gap between the RDMA traffic rate and the capture rate.
Therefore, actually "sniffing" RoCE traffic with an Ethernet capture utility is not feasible for long periods.

Components

Packet Capture Utility is comprised of two components:

  • ConnectX-4 RDMA module sniffer:
    This component is part of the Native ConnectX-4 RDMA driver for ESX and resides in Kernel space.

  • RDMA management interface:
    User space utility which manages the ConnectX-4 Packet Capture Utility

Usage

  1. Install the latest ConnectX-4 driver bundle.

  2. Make sure all Native nmlx5 drivers are loaded:

    esxcli system module list | grep nmlx
    nmlx5_core                     true        true
    nmlx5_rdma                     true        true

  3. Install the nmlxcli management tool (esxcli extension) using the supplied bundle
    MLNX-NATIVE-NMLXCLI_<version>.zip

    When the nmlxcli management tool is installed, the following esxcli command namespace is available:

    # esxcli mellanox uplink sniffer

    This namespace allows the user to perform basic packet capture utility operations, such as query, enable, and disable.
    Usage of the tool is shown by running one of the options below:

    esxcli mellanox uplink sniffer {cmd} [cmd options]

    Options:

    disable       Disable sniffer on specified uplink
                  * Requires -u/--uplink-name parameter
    enable        Enable sniffer on specified uplink
                  * Requires -u/--uplink-name parameter
    query         Query operational state of sniffer on specified uplink
                  * Requires -u/--uplink-name parameter

  4. Determine the uplink device name.

    Name    PCI Device    Driver      Admin Status  Link Status  Speed   Duplex  MAC Address        MTU   Description
    ------  ------------  ----------  ------------  -----------  ------  ------  -----------------  ----  -------------------------------------------------
    vmnic4  0000:07:00.0  nmlx5_core  Up            Up           100000  Full    7c:fe:90:63:f2:d6  1500  Mellanox Technologies MT27700 Family [ConnectX-4]
    vmnic5  0000:07:00.1  nmlx5_core  Up            Up           100000  Full    7c:fe:90:63:f2:d7  1500  Mellanox Technologies MT27700 Family [ConnectX-4]

  5. Enable the packet capture utility for the required device(s).

    esxcli mellanox uplink sniffer enable -u <vmnic_name>

  6. Use the ESX internal packet capture utility to capture the packets.

    pktcap-uw --capture Drop --o <capture_file>

  7. Generate the RDMA traffic through the RDMA device.

  8. Stop the capture.

  9. Disable the packet capture utility.

    esxcli mellanox uplink sniffer disable -u <vmnic_name>

  10. Query the packet capture utility.

    esxcli mellanox uplink sniffer query -u <vmnic_name>

Limitations

  • Capture duration: The Packet Capture Utility is a debug tool, meant to be used for bind failure diagnostics and short-period packet sniffing. Running it for a long period of time under heavy RDMA traffic will cause undefined behavior, and gaps in the captured packets may appear.

  • Overhead: A significant performance decrease is expected when the tool is enabled:

    • The tool creates a dedicated QP and HW duplicates all RDMA traffic to this QP, before stripping the ETH headers.

    • The captured packets reported to ESX are duplicated by the network stack, adding to the overall execution time

  • Drop capture point: The tool uses the VMK_PKTCAP_POINT_DROP to pass the captured traffic. This means that whoever views the capture file will see all the captured RDMA traffic in addition to all the dropped packets reported to the network stack.

  • ESX packet exhaustion: During the enable phase (/opt/mellanox/bin/nmlx4_sniffer_mgmt-user -a vmrdma3 -e), the Kernel component allocates sniffer resources, among them the OS packets, which are freed upon the tool's disable. Multiple consecutive enable/disable calls may cause temporary failures when the tool requests to allocate these packets. It is recommended to allow sufficient time between consecutive disable and enable calls to avoid this issue.

Data Center Bridging (DCB)

Data Center Bridging (DCB) uses DCBX to exchange configuration information with directly connected peers. DCBX operations can be configured to set PFC or ETS values. DCB is enabled by default on the host side; you can choose between DCB modes using the dcbx module parameter. Example for setting the software mode:

esxcli system module parameters set -m nmlx5_core -p dcbx=2

For hardware mode, you also need to make sure it is supported and enabled in the firmware by setting these values with the mlxconfig tool:

  • LLDP_NB_DCBX = 1

  • Both: LLDP_NB_RX = 2 and LLDP_NB_TX = 2

  • At least one of: DCBX_IEEE = 1 or DCBX_CEE = 1

Example: /opt/mellanox/bin/mlxconfig -d 0000:05:00.0 set LLDP_NB_DCBX_P1=1
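To confirm which DCBX mode the driver is currently running with, you can list the nmlx5_core module parameters and inspect the dcbx value:

esxcli system module parameters list -m nmlx5_core | grep dcbx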

© Copyright 2023, NVIDIA. Last updated on Aug 31, 2023.