Created on Jun 9, 2019
Updated Mar 30, 2020
This how-to describes how to configure Priority Flow Control (PFC) on an NVIDIA Spectrum switch installed with MLNX-OS and how to run RoCE over a lossless network in PCP-based QoS mode.
This post assumes VMware ESXi 6.5/6.7 with native drivers and MLNX-OS version 3.6.5000 or later.
- vSphere Command-Line Interface Concepts and Examples
Hardware and Software Requirements
1. A server platform with an adapter card based on one of the following NVIDIA HCA devices:
2. Installer Privileges: The installation requires administrator privileges on the target machine.
3. Device ID: For the latest list of device IDs, visit the NVIDIA website.
RDMA over Converged Ethernet
RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol, which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, it can also be used on a traditional or non-converged Ethernet network.
Priority Flow Control (PFC)
Priority Flow Control (PFC), defined in IEEE 802.1Qbb, applies pause functionality to specific classes of traffic on the Ethernet link. The goal of this mechanism is to ensure zero loss under congestion in data center bridging (DCB) networks and to allow, for example, prioritization of RoCE traffic over TCP traffic. PFC can provide different levels of service to specific classes of Ethernet traffic (using IEEE 802.1p traffic classes).
Explicit Congestion Notification (ECN)
Explicit Congestion Notification (ECN) is an extension to the Internet Protocol and to the Transmission Control Protocol and is defined in RFC 3168 (2001). ECN allows end-to-end notification of network congestion without dropping packets. ECN is an optional feature that may be used between two ECN-enabled endpoints when the underlying network infrastructure also supports it.
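To make the "notification without dropping packets" idea concrete, here is a minimal Python sketch of the ECN field as RFC 3168 defines it: the two least significant bits of the IP ToS / Traffic Class octet, next to the 6-bit DSCP. The helper names `split_tos` and `mark_congestion` are hypothetical and exist only for this illustration; they are not part of any NVIDIA or VMware tool.

```python
# ECN codepoints from RFC 3168: the two least significant bits of the
# IPv4 ToS / IPv6 Traffic Class octet.
ECN_CODEPOINTS = {
    0b00: "Not-ECT",  # sender is not ECN-capable
    0b01: "ECT(1)",   # ECN-capable transport
    0b10: "ECT(0)",   # ECN-capable transport
    0b11: "CE",       # congestion experienced
}


def split_tos(tos):
    """Split an 8-bit ToS octet into (dscp, ecn_name)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in 8 bits")
    return tos >> 2, ECN_CODEPOINTS[tos & 0b11]


def mark_congestion(tos):
    """Hypothetical helper: set CE on an ECN-capable packet instead of dropping it.

    Packets marked Not-ECT are returned unchanged, because the endpoints
    did not signal ECN support.
    """
    if tos & 0b11 == 0b00:
        return tos
    return tos | 0b11


if __name__ == "__main__":
    tos = 0x6A  # DSCP 26 with ECT(0)
    print(split_tos(tos))
    print(split_tos(mark_congestion(tos)))
```

The DSCP bits are untouched by the marking step, so the packet keeps its QoS classification while carrying the congestion signal to the receiver.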
- 2x ESXi 6.5/6.7 hosts.
- 2x ConnectX®-3/ConnectX®-4/ConnectX®-4 Lx/ConnectX®-5, or any combination thereof.
- 1x NVIDIA Ethernet Spectrum Switch SN2700
Network Switch Configuration
If you are not familiar with NVIDIA switch software, start with the How-To Get Started with NVIDIA switches guide.
As a first step, update your switch to the latest Onyx software by following the community guide How-To Upgrade MLNX-OS Software on NVIDIA switch systems.
There are several industry-standard network configurations for RoCE deployment.
You are welcome to follow the Recommended Network Configuration Examples for RoCE Deployment for our recommendations and instructions.
In our deployment we will configure our network to be lossless and will use PCP or DSCP-based QoS mode on host and switch sides.
NVIDIA Onyx version 3.8.2004 and above.
To see RoCE configuration run:
To monitor RoCE counters run:
MLNX-OS version 3.6.5000 up to version 3.8.2004
Configure your switch by following these five steps:
1. Make sure that your switch runs MLNX-OS version 3.6.5000 up to version 3.8.2004.
2. Enable ECN Marking.
3. Create the RoCE pool and set QoS. Configure the traffic pool for RoCE.
4. Set a strict priority to CNPs over traffic class 6
5. Per port configuration
Set a QoS trust mode for the interface.
PCP only – tagged
DSCP only – non-tagged IP frames
DSCP for non-tagged, PCP for tagged
Configure the switchport
Switch configuration example
Below is our switch configuration, which you can use as a reference, with the QoS trust mode for the interface set to DSCP only (default). You can copy and paste it to your switch, but be aware that this is a clean switch configuration and applying it may corrupt your existing configuration.
Configure PFC on the Mellanox drivers (nmlx drivers). Note that there is a different driver for each adapter. In this example we will enable PFC on priority 3 for both receive (Rx) and transmit (Tx).
The following command enables PFC on the host. The parameters "pfctx" (PFC TX) and "pfcrx" (PFC RX) are specified per host. If you have more than one card on the server, PFC must be enabled on all ports.
The value is a bitmap of 8 bits = 8 priorities.
To run more than one flow type on the server, turn on only one priority (e.g. priority 3). Priority 3 is configured with the value "0x08" = 00001000b (binary): only the 4th bit is on, because priorities are counted from 0 (0, 1, 2, 3 -> 4th bit).
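The bitmap arithmetic can be checked with a short Python sketch. The function names `pfc_bitmap` and `enabled_priorities` are hypothetical helpers for this illustration only; the sketch assumes nothing beyond the rule stated above, namely that bit N of the 8-bit value corresponds to priority N.

```python
def pfc_bitmap(priorities):
    """Build the 8-bit PFC bitmap: bit N is set when priority N is enabled."""
    mask = 0
    for prio in priorities:
        if not 0 <= prio <= 7:
            raise ValueError(f"priority {prio} is outside 0..7")
        mask |= 1 << prio
    return mask


def enabled_priorities(mask):
    """Decode an 8-bit PFC bitmap back into a sorted list of priorities."""
    if not 0 <= mask <= 0xFF:
        raise ValueError("bitmap must fit in 8 bits")
    return [prio for prio in range(8) if mask & (1 << prio)]


if __name__ == "__main__":
    mask = pfc_bitmap([3])
    # Show the value in the hexadecimal and binary forms used in the text.
    print(f"0x{mask:02x} = {mask:08b}b")
    print(enabled_priorities(mask))
```

For priority 3 this yields 0x08 (00001000b), matching the value used for both "pfctx" and "pfcrx" in this guide.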
Note: When PFC is enabled, Global Pause will be operationally disabled, regardless of what is configured for the Global Pause Flow Control.
The values of “pfctx” and “pfcrx” must be identical.
If you configured SR-IOV, you need to re-enable SR-IOV in the driver and set the max_vfs module parameter.
We recommend that you enable only lossless applications on a specific priority.
Setting the trust mode to DSCP and enabling PFC allows PFC to run based on the L3 DSCP field rather than the L2 PCP field. This eliminates the need for a VLAN tag in the Ethernet header.
Trust state has two values:
- pcp - mapping pcp to priority. The default mapping is 1 to 1.
- dscp - mapping dscp to priority. The default mapping is priority[dscp] = dscp>>3.
The default trust state is "pcp".
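The two default mappings above can be written out as a small Python sketch. The function `default_priority` and its argument names are hypothetical and serve only to illustrate the stated rules (PCP maps 1 to 1, DSCP maps through a right shift by 3); it does not query or configure the driver.

```python
def default_priority(trust, value):
    """Return the priority under the default mapping for a trust state.

    trust: "pcp" or "dscp"
    value: the PCP (0..7) or DSCP (0..63) carried by the packet
    """
    if trust == "pcp":
        if not 0 <= value <= 7:
            raise ValueError("PCP must be in 0..7")
        return value  # default mapping is 1 to 1
    if trust == "dscp":
        if not 0 <= value <= 63:
            raise ValueError("DSCP must be in 0..63")
        return value >> 3  # priority[dscp] = dscp >> 3
    raise ValueError(f"unknown trust state: {trust!r}")


if __name__ == "__main__":
    # Every DSCP value in 24..31 lands on priority 3 under the default mapping.
    print([d for d in range(64) if default_priority("dscp", d) == 3])
    print(default_priority("pcp", 3))
```

With the default DSCP mapping, each priority covers a block of eight consecutive DSCP values, so traffic intended for priority 3 needs a DSCP between 24 and 31.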
ConnectX-3, specify (TM: mapping PCP to priority):
ConnectX-4/5, specify (TM: mapping DSCP to priority):
To read the current module configuration, run:
Configure Global RDMA PCP (L2 Egress Priority) and DSCP Values (L3-Optional)
The RDMA service level (sl) field of the address handle carries the user priority and is mapped to the PCP portion of the VLAN tag.
The traffic class (tc) field of the address handles the GRH header and is mapped to the IP header DSCP bits.
You can force PCP and DSCP values (for RDMA traffic only).
The RDMA driver (nmlx5_rdma) supports global settings for the PCP (sl) and DSCP (traffic class) through the following module parameters:
pcp_force: values: (-1) - 7, default: (-1 = off)
The specified value will be set as the PCP for all outgoing RoCE traffic, regardless of the sl value specified. This parameter cannot be enabled when dscp_to_pcp is enabled.
dscp_force: values: (-1) - 63, default: (-1 = off)
The specified value will be set as the DSCP portion (6 bits) of the Type of Service (ToS) (8 bits) for all outgoing RoCE traffic, regardless of the traffic class specified.
dscp_to_pcp: values 0 (off) - 1 (on), default: 0
When enabled, the three MSBs of the DSCP value will be considered as the PCP for all outgoing RoCE traffic. If dscp_force is not used, then the DSCP value used for mapping is taken from the traffic class field in the GRH header. Otherwise, it takes the value set in dscp_force.
This parameter cannot be enabled when pcp_force is enabled.
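The interaction between the three parameters can be summarized in a Python sketch. This is only a model of the rules described above, not driver code: the function names `resolve_egress` and `dscp_to_tos` are hypothetical, `tc_dscp` stands for the DSCP value taken from the traffic class field, and limiting `sl` to 0..7 is an assumption of the sketch based on PCP being a 3-bit field.

```python
def resolve_egress(sl, tc_dscp, pcp_force=-1, dscp_force=-1, dscp_to_pcp=0):
    """Return the (pcp, dscp) pair applied to outgoing RoCE traffic."""
    if not 0 <= sl <= 7:
        raise ValueError("sl must be in 0..7 (assumption of this sketch)")
    if not 0 <= tc_dscp <= 63:
        raise ValueError("tc_dscp must be in 0..63")
    if not -1 <= pcp_force <= 7:
        raise ValueError("pcp_force must be in -1..7")
    if not -1 <= dscp_force <= 63:
        raise ValueError("dscp_force must be in -1..63")
    if dscp_to_pcp not in (0, 1):
        raise ValueError("dscp_to_pcp must be 0 or 1")
    if pcp_force != -1 and dscp_to_pcp:
        raise ValueError("pcp_force and dscp_to_pcp are mutually exclusive")

    # dscp_force overrides the DSCP taken from the traffic class.
    dscp = dscp_force if dscp_force != -1 else tc_dscp

    if pcp_force != -1:
        pcp = pcp_force      # forced PCP, regardless of sl
    elif dscp_to_pcp:
        pcp = dscp >> 3      # three MSBs of the 6-bit DSCP
    else:
        pcp = sl             # sl maps to the PCP of the VLAN tag
    return pcp, dscp


def dscp_to_tos(dscp):
    """Place the 6-bit DSCP in the upper bits of the 8-bit ToS (ECN bits zero)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp << 2


if __name__ == "__main__":
    print(resolve_egress(sl=0, tc_dscp=0, pcp_force=3))
    print(resolve_egress(sl=1, tc_dscp=8, dscp_force=26, dscp_to_pcp=1))
    print(hex(dscp_to_tos(26)))
```

Note that combining dscp_force=26 with dscp_to_pcp=1 gives PCP 3, because the three most significant bits of 26 (011010b) are 011b.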
Log in to the ESXi vSphere Command-Line Interface with root permissions.
For example, to force the PCP value to egress with a value of 3:
For example, to configure the DSCP (optional):
To enable ECN with default parameters:
1. Download the latest NVIDIA Mellanox Packet Capture Utility for ESXi 6.5/6.7.
2. Use SCP or any other file transfer method to copy the utility to the required ESXi host.
3. Log in to the ESXi vSphere Command-Line Interface with root permissions.
4. Put the ESXi host into Maintenance Mode.
5. Install the Packet Capture Utility on the ESXi host.
6. Reboot the ESXi server.
7. Check the physical network interface status.
8. Enable ECN on the relevant device.
9. Take the ESXi host out of Maintenance Mode.
ESXi VLAN Configuration
The topology below describes two machines. Both of them have vmnic5 as the adapter uplink.
To set the VLAN ID for the traffic to 100, run:
1. Edit the distributed port group settings.
2. Choose "VLAN" from the left panel.
3. Set the VLAN type to "VLAN".
4. Set the VLAN tag to "100".
5. Click "OK".
For more information refer to VMware documentation.
Log in to the ESXi vSphere Command-Line Interface with root permissions.
For verification purposes, when you are using Mellanox switches you can lower the speed of one of the switch ports, forcing the use of PFC pause frames due to insufficient bandwidth:
Note: You can also use other methods to create congestion and trigger PFC. For example, a simple configuration is to have two hosts send traffic to a third host.
Run traffic between the hosts on priority 3.
Note: The final PCP value is determined by the pcp_force parameter.
Note: If PFC for a priority is not enabled by pfctx and pfcrx, the HCA counters for that priority will not increment, and the data will be counted on priority 0 instead.
Verify that both the HCA and the switch transmitted and received pause frames on priority 3: