
Created on Jun 9, 2019

Updated Mar 30, 2020

Introduction

This how-to describes how to configure Priority Flow Control (PFC) on an NVIDIA Spectrum switch installed with MLNX-OS and running RoCE over a lossless network, in PCP-based QoS mode.

This post assumes VMware ESXi 6.5/6.7 with native drivers and MLNX-OS version 3.6.5000 and above.

References

Hardware and Software Requirements

1. A server platform with an adapter card based on one of the following NVIDIA Technologies’ HCA devices:

2. Installer Privileges: The installation requires administrator privileges on the target machine.

3. Device ID: For the latest list of device IDs, please visit the NVIDIA website.

Components Overview

RDMA over Converged Ethernet

From Wikipedia

RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, the protocol can also be used on a traditional or non-converged Ethernet network.

Priority Flow Control (PFC)

Priority Flow Control (PFC), IEEE 802.1Qbb, applies pause functionality to specific classes of traffic on the Ethernet link. The goal of this mechanism is to ensure zero loss under congestion in data center bridging (DCB) networks and to allow, for example, prioritization of RoCE traffic over TCP traffic. PFC can provide different levels of service to specific classes of Ethernet traffic (using IEEE 802.1p traffic classes).

Explicit Congestion Notification (ECN)

Explicit Congestion Notification (ECN) is an extension to the Internet Protocol and to the Transmission Control Protocol and is defined in RFC 3168 (2001). ECN allows end-to-end notification of network congestion without dropping packets. ECN is an optional feature that may be used between two ECN-enabled endpoints when the underlying network infrastructure also supports it.
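
To make the ECN definition above concrete, here is a minimal Python sketch of how the ECN codepoint shares the IP ToS byte with the DSCP value, following the bit layout defined in RFC 3168. The function name split_tos and the sample value are illustrative assumptions, not part of any NVIDIA or VMware tool.

```python
from typing import Tuple

# ECN codepoints from RFC 3168: the two least-significant bits of the
# IPv4 ToS / IPv6 Traffic Class byte.
ECN_CODEPOINTS = {
    0b00: "Not-ECT",  # sender is not ECN-capable
    0b10: "ECT(0)",   # ECN-capable transport
    0b01: "ECT(1)",   # ECN-capable transport
    0b11: "CE",       # congestion experienced, set by a congested node
}


def split_tos(tos: int) -> Tuple[int, str]:
    """Split a ToS byte into (DSCP value, ECN codepoint name)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in one byte")
    return tos >> 2, ECN_CODEPOINTS[tos & 0b11]


# Illustrative value only: DSCP 26 with the ECT(0) codepoint set.
print(split_tos((26 << 2) | 0b10))
```

A congested ECN-enabled switch changes an ECT codepoint to CE instead of dropping the packet, which is what allows the receiver to notify the sender.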

Setup Overview

Equipment

  • 2x ESXi 6.5/6.7 hosts.
  • 2x ConnectX®-3/ConnectX®-4/ConnectX®-4 Lx/ConnectX®-5, or any combination thereof.
  • 1x NVIDIA Ethernet Spectrum Switch SN2700

Logical Design

Network Switch Configuration

If you are not familiar with NVIDIA switch software, please start with the How-To Get Started with NVIDIA switches guide.

For more information, please refer to the MLNX-OS User Manual located at enterprise-support.nvidia.com/s/ → Switches.

As a first step, update your switch OS to the latest Onyx software, following the community guide How-To Upgrade MLNX-OS Software on NVIDIA switch systems.

There are several industry-standard network configurations for RoCE deployment.

For our recommendations and instructions, see Recommended Network Configuration Examples for RoCE Deployment.

In our deployment we configure the network to be lossless and use PCP-based or DSCP-based QoS mode on both the host and switch sides.

Switch Configuration

NVIDIA Onyx version 3.8.2004 and above.

Run: 

Switch console
switch (config) # roce lossless

Note: The RoCE configuration has been automated: running the "roce" command applies everything needed (PFC and ECN) to run RoCE on a lossless fabric.

To see RoCE configuration run:

Switch console
show roce

To monitor RoCE counters run: 

Switch console
show interface ethernet counters roce


MLNX-OS version 3.6.5000 up to version 3.8.2004

Configure the switch with the following five steps:

1. Make sure that the switch is running MLNX-OS version 3.6.5000 up to version 3.8.2004.

2. Enable ECN Marking.

Note: For fair sharing of the switch buffer with other traffic classes, it is recommended to configure ECN on all other traffic classes as well.
Switch Console
switch (config) # interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500

3. Create the RoCE traffic pool and configure QoS for it.

Switch Console
switch (config) # no advanced buffer management force
switch (config) # traffic pool roce type lossless
switch (config) # traffic pool roce memory percent 50.00
switch (config) # traffic pool roce map switch-priority 3

4. Set strict priority for CNPs on traffic class 6.

Switch Console
switch (config) # interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict

5. Per port configuration

Set a QoS trust mode for the interface.

PCP only – tagged

Switch Console
switch (config) # interface ethernet 1/1 qos trust L2

DSCP only – untagged IP frames

Switch Console
switch (config) # interface ethernet 1/1 qos trust L3

DSCP for untagged, PCP for tagged

Switch Console
switch (config) # interface ethernet 1/1-1/32 qos trust both 

Configure the switchport

Switch Console
switch (config) # interface ethernet 1/1-1/32 switchport mode trunk
switch (config) # interface ethernet 1/1-1/32 switchport trunk allowed-vlan [vlan-id]
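
Steps 2 to 5 above can be collected into a small generator that renders the command list for a given port range, trust mode, and VLAN. This is only a sketch in Python: the helper name roce_switch_commands and the "pcp"/"dscp"/"both" labels are assumptions made for illustration, while the emitted command strings are the ones shown in the steps above.

```python
# Map a descriptive trust mode to the keyword used by "qos trust".
TRUST_KEYWORD = {"pcp": "L2", "dscp": "L3", "both": "both"}


def roce_switch_commands(ports, trust, vlan_id):
    """Return the switch commands from steps 2-5 as a list of strings.

    ports   -- port or port range, e.g. "1/1-1/32"
    trust   -- "pcp", "dscp" or "both" (hypothetical labels)
    vlan_id -- VLAN allowed on the trunk
    """
    if trust not in TRUST_KEYWORD:
        raise ValueError(f"trust must be one of {sorted(TRUST_KEYWORD)}")
    iface = f"interface ethernet {ports}"
    return [
        # Step 2: ECN marking on traffic class 3
        f"{iface} traffic-class 3 congestion-control ecn "
        "minimum-absolute 150 maximum-absolute 1500",
        # Step 3: lossless RoCE traffic pool
        "no advanced buffer management force",
        "traffic pool roce type lossless",
        "traffic pool roce memory percent 50.00",
        "traffic pool roce map switch-priority 3",
        # Step 4: strict priority for CNPs on traffic class 6
        f"{iface} traffic-class 6 dcb ets strict",
        # Step 5: trust mode and switchport
        f"{iface} qos trust {TRUST_KEYWORD[trust]}",
        f"{iface} switchport mode trunk",
        f"{iface} switchport trunk allowed-vlan {vlan_id}",
    ]


for command in roce_switch_commands("1/1-1/32", "dscp", 100):
    print(command)
```

The output is plain text intended for review before it is pasted into a "switch (config) #" session; the sketch does not connect to the switch.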

Switch configuration example
Below is our switch configuration, which you can use as a reference; the interface QoS trust mode is DSCP only (default). You can copy and paste it to your switch, but be aware that this is a clean switch configuration and applying it may corrupt your existing configuration.

Switch Configuration Sample
swx-vmw-1-1 [standalone: master] > enable
swx-vmw-1-1 [standalone: master] # configure terminal
swx-vmw-1-1 [standalone: master] (config) # show running-config

##
## Running database "initial"
## Generated at 2018/03/10 09:38:38 +0000
## Hostname: swx-vmw-1-1
##
##
## Running-config temporary prefix mode setting
##
no cli default prefix-modes enable
##
## License keys
##
license install LK2-RESTRICTED_CMDS_GEN2-44T1-4H83-RWA5-G423-GY7U-8A60-E0AH-ABCD
##
## Interface Ethernet buffer configuration
##
traffic pool roce type lossless
traffic pool roce memory percent 50.00
traffic pool roce map switch-priority 3
##
## QoS switch configuration
##
interface ethernet 1/1-1/32 qos trust L3
interface ethernet 1/1-1/32 traffic-class 3 congestion-control ecn minimum-absolute 150 maximum-absolute 1500
##
## DCBX ETS configuration
##
interface ethernet 1/1-1/32 traffic-class 6 dcb ets strict
##
## Other IP configuration
##
hostname swx-vmw-1-1
##
## AAA remote server configuration
##
# ldap bind-password ********
# radius-server key ********
# tacacs-server key ********
##
## Network management configuration
##
# web proxy auth basic password ********
##
## X.509 certificates configuration
##
#
# Certificate name system-self-signed, ID 108bb9eb3e99edff47fc86e71cba530b6a6b8991
# (public-cert config omitted since private-key config is hidden)
##
## Persistent prefix mode setting
##
cli default prefix-modes enable

Driver Configuration

Configure PFC

Configure PFC on the NVIDIA drivers (nmlx drivers). Note that there is a different driver for each adapter. In this example we enable PFC on priority 3 in both the receive (Rx) and transmit (Tx) directions.

The following command enables PFC on the host. The parameters, "pfctx" (PFC TX) and "pfcrx" (PFC RX), are specified per host. If you have more than one card on the server, all ports must be enabled with PFC.

The value is an 8-bit bitmap, one bit per priority (0 to 7).

To run more than one flow type on the server, enable PFC on only one priority (e.g. priority 3). For priority 3, set both parameters to "0x08" = 00001000b (binary): only the 4th bit is on (counting priorities 0, 1, 2, 3, priority 3 is the 4th bit).
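
The bitmap arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name pfc_bitmap is hypothetical and is not part of the driver or of esxcli.

```python
def pfc_bitmap(priorities):
    """Return the 8-bit PFC bitmap with one bit set per listed priority."""
    mask = 0
    for prio in priorities:
        if not 0 <= prio <= 7:
            raise ValueError(f"priority {prio} is outside 0-7")
        mask |= 1 << prio  # bit N corresponds to priority N
    return mask


mask = pfc_bitmap([3])
print(f"0x{mask:02x} = {mask:08b}b")  # prints "0x08 = 00001000b"
```

The same value is then used for both "pfctx" and "pfcrx".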

Note: When PFC is enabled, Global Pause is operationally disabled, regardless of what is configured for Global Pause Flow Control.
The values of “pfctx” and “pfcrx” must be identical.
If you configured SR-IOV, you need to re-enable SR-IOV in the driver and set the max_vfs module parameter.
We recommend that you enable only lossless applications on a specific priority.

Setting the trust mode to DSCP and enabling PFC allows PFC to operate based on the L3 DSCP field rather than the L2 PCP field. This eliminates the need for a VLAN tag in the Ethernet header.

Trust state has two values:

  • pcp - mapping pcp to priority. The default mapping is 1 to 1.
  • dscp - mapping dscp to priority. The default mapping is priority[dscp] = dscp>>3.

The default trust state is "pcp".
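
As a quick illustration of the two default mappings listed above, the following Python sketch returns the priority for a given trust state. The function name default_priority is an assumption made for this example; only the mapping rules (1 to 1 for pcp, dscp>>3 for dscp) come from the text.

```python
def default_priority(trust_state, pcp=None, dscp=None):
    """Return the priority chosen by the default mapping for a trust state."""
    if trust_state == "pcp":
        if pcp is None or not 0 <= pcp <= 7:
            raise ValueError("pcp must be in 0-7")
        return pcp  # default PCP mapping is 1 to 1
    if trust_state == "dscp":
        if dscp is None or not 0 <= dscp <= 63:
            raise ValueError("dscp must be in 0-63")
        return dscp >> 3  # priority[dscp] = dscp >> 3
    raise ValueError("trust_state must be 'pcp' or 'dscp'")


print(default_priority("dscp", dscp=26))  # prints 3
print(default_priority("pcp", pcp=3))     # prints 3
```

With the default DSCP mapping, DSCP values 24 through 31 all land on priority 3, the priority on which PFC is enabled in this example.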

Examples:

ConnectX-3 (trust mode: mapping PCP to priority):

ESXi Console
~ esxcli system module parameters set -m nmlx4_en -p "pfctx=0x08 pfcrx=0x08 trust_state=1 max_vfs=4"

ConnectX-4/5 (trust mode: mapping DSCP to priority):

ESXi Console
~ esxcli system module parameters set -m nmlx5_core -p "pfctx=0x08 pfcrx=0x08 trust_state=2 max_vfs=4"
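
When the same settings are applied to several hosts, the parameter string can be assembled programmatically so that "pfctx" and "pfcrx" cannot drift apart. The Python sketch below only builds and prints the command; the helper name pfc_module_command is hypothetical, and the parameter names and values are the ones used in the examples above.

```python
import shlex


def pfc_module_command(module, priorities, trust_state, max_vfs=None):
    """Build the esxcli argument list that sets PFC module parameters.

    pfctx and pfcrx are derived from the same bitmap, so they always match.
    """
    mask = 0
    for prio in priorities:
        if not 0 <= prio <= 7:
            raise ValueError(f"priority {prio} is outside 0-7")
        mask |= 1 << prio
    params = [
        f"pfctx=0x{mask:02x}",
        f"pfcrx=0x{mask:02x}",
        f"trust_state={trust_state}",
    ]
    if max_vfs is not None:  # only needed when SR-IOV is configured
        params.append(f"max_vfs={max_vfs}")
    return ["esxcli", "system", "module", "parameters", "set",
            "-m", module, "-p", " ".join(params)]


cmd = pfc_module_command("nmlx5_core", [3], trust_state=2, max_vfs=4)
print(shlex.join(cmd))  # shell-quoted form for copy and paste
```

Printing the command instead of executing it keeps the sketch safe to run anywhere; on an ESXi host the printed line can be pasted into the console.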

To read the current module configuration, run:

ConnectX-3:

ESXi Console
~ esxcli system module parameters list -m nmlx4_en

ConnectX-4/5:

ESXi Console
~ esxcli system module parameters list -m nmlx5_core
... 
pfcrx int 0x08 Priority based Flow Control policy on RX. 
Values : 0-255
It's 8 bits bit mask, each bit indicates priority [0-7]. Bit value:
1 - respect incoming pause frames on the specified priority.
0 - ignore incoming pause frames on the specified priority.
Notes: Must be equal to pfctx.
Default: 0
pfctx int 0x08 Priority based Flow Control policy on TX.
Values : 0-255
It's 8 bits bit mask, each bit indicates priority [0-7]. Bit value:
1 - respect incoming pause frames on the specified priority.
0 - ignore incoming pause frames on the specified priority.
Notes: Must be equal to pfcrx.
Default: 0
...

Configure Global RDMA PCP (L2 Egress Priority) and DSCP Values (L3-Optional)

The RDMA service level (sl) field of the address handle is the user priority and is mapped to the PCP portion of the VLAN tag.

The traffic class (tc) field of the address handle's GRH header is mapped to the IP header DSCP bits.

You can force PCP and DSCP values (for RDMA traffic only).

The RDMA driver (nmlx5_rdma) supports global settings for the PCP (sl) and DSCP (traffic class) through the following module parameters:

  1. pcp_force: values: (-1) - 7, default: (-1 = off)

    The specified value will be set as the PCP for all outgoing RoCE traffic, regardless of the sl value specified. This parameter cannot be enabled when dscp_to_pcp is enabled.

  2. dscp_force: values: (-1) - 63, default: (-1 = off)

    The specified value will be set as the DSCP portion (6 bits) of the Type of Service (ToS) (8 bits) for all outgoing RoCE traffic, regardless of the traffic class specified.

  3. dscp_to_pcp: values 0 (off) - 1 (on), default: 0

    When enabled, the three MSBs of the DSCP value will be considered as the PCP for all outgoing RoCE traffic. If dscp_force is not used, then the DSCP value used for mapping is taken from the traffic class field in the GRH header. Otherwise, it takes the value set in dscp_force.

    This parameter cannot be enabled when pcp_force is enabled.
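
The interaction of the three parameters above can be summarized in a short Python sketch. It models only the rules stated in the list; it is not the driver's code. The function name outgoing_marking is hypothetical, and two details are assumptions: that the DSCP value occupies the upper six bits of the 8-bit traffic class, and that sl is limited to 0-7 so it maps directly to a PCP value.

```python
def outgoing_marking(sl, traffic_class, pcp_force=-1, dscp_force=-1,
                     dscp_to_pcp=0):
    """Return (pcp, dscp) for outgoing RoCE traffic under the rules above."""
    if pcp_force != -1 and dscp_to_pcp:
        raise ValueError("pcp_force and dscp_to_pcp cannot both be enabled")
    if not -1 <= pcp_force <= 7:
        raise ValueError("pcp_force must be -1 (off) or 0-7")
    if not -1 <= dscp_force <= 63:
        raise ValueError("dscp_force must be -1 (off) or 0-63")
    if not 0 <= sl <= 7 or not 0 <= traffic_class <= 0xFF:
        raise ValueError("sl must be 0-7 and traffic_class one byte")

    # DSCP: the forced value, otherwise the upper six bits of the traffic
    # class (assumed layout: DSCP in bits 7-2, ECN in bits 1-0).
    dscp = dscp_force if dscp_force != -1 else traffic_class >> 2

    if pcp_force != -1:
        pcp = pcp_force        # overrides the sl value
    elif dscp_to_pcp:
        pcp = dscp >> 3        # three MSBs of the 6-bit DSCP value
    else:
        pcp = sl               # sl maps to the PCP field
    return pcp, dscp


print(outgoing_marking(sl=0, traffic_class=0, pcp_force=3))
print(outgoing_marking(sl=0, traffic_class=0, dscp_force=26, dscp_to_pcp=1))
```

Both calls yield PCP 3: the first because it is forced directly, the second because the three MSBs of DSCP 26 (011010b) are 011b.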

Log in to the ESXi vSphere Command-Line Interface with root permissions.

For example, to force the PCP value to egress with a value of 3:

ConnectX-3:

ESXi Console
# esxcli system module parameters set -m nmlx4_rdma -p "pcp_force=3"

ConnectX-4/5:

ESXi Console
# esxcli system module parameters set -m nmlx5_rdma -p "pcp_force=3"

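Listing the module parameters afterwards (for example, with "esxcli system module parameters list -m nmlx5_rdma") shows pcp_force set to 3:

Name Type Value Description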
----------------- ---- ----- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
dscp_force int DSCP value to force on outgoing RoCE traffic.
Values : -1 - Disabled, 0-63 - DSCP value to force
Default: -1
dscp_to_pcp int Map the three MSBs of DSCP value to PCP on outgoing RoCE traffic.
Cannot be enabled when pcp_force is active.
Values : 0 - Disabled, 1 - Enabled
Default: 0
enable_nmlx_debug int Enable debug prints.
Values : 1 - enabled, 0 - disabled
Default: 0
pcp_force int 3 PCP value to force on outgoing RoCE traffic. 
Cannot be active when dscp_to_pcp is enabled.
Values : -1 - Disabled, 0-7 - PCP value to force
Default: -1


For example, to configure the DSCP value (optional):

ConnectX-3:

ESXi Console
# esxcli system module parameters set -m nmlx4_rdma -p "dscp_force=26"

ConnectX-4/5:

ESXi Console
# esxcli system module parameters set -m nmlx5_rdma -p "dscp_force=26"

Configure ECN

To enable ECN with default parameters:

1. Download the latest NVIDIA Packet Capture Utility for ESXi 6.5/6.7.

2. Use SCP or any other file transfer method to copy the utility to the required ESXi host.

3. Log in to the ESXi vSphere Command-Line Interface with root permissions.

4. Put the ESXi host into Maintenance Mode.

5. Install the Packet Capture Utility on the ESXi host.
Sample:

ESXi Console
~ esxcli software vib install -d /tmp/MLNX-NATIVE-NMLXCLI_1.17.13.8-10EM-670.0.0.7535516.zip

6. Reboot the ESXi server.

7. Check physical network interface status.

ESXi Console
~ esxcli network nic list

Name    PCI Device    Driver      Admin Status  Link Status  Speed   Duplex  MAC Address        MTU   Description
------  ------------  ----------  ------------  -----------  ------  ------  -----------------  ----  ----------------------------------------------------
vmnic0  0000:81:00.0  igbn        Up            Up           1000    Full    0c:c4:7a:e3:5c:8c  1500  Intel Corporation I350 Gigabit Network Connection
vmnic1  0000:81:00.1  igbn        Up            Down         0       Half    0c:c4:7a:e3:5c:8d  1500  Intel Corporation I350 Gigabit Network Connection
...
vmnic4  0000:02:00.1  nmlx5_core  Up            Down         0       Half    ec:0d:9a:8c:c7:97  1500  Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
vmnic5  0000:02:00.0  nmlx5_core  Up            Up           100000  Full    ec:0d:9a:8c:c7:96  1500  Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

8. Enable ECN on the relevant device.

ConnectX-3/4/5:

ESXi Console
~ esxcli mellanox uplink ecn rRoceNp enable -u vmnic5

~ esxcli mellanox uplink ecn rRoceRp enable -u vmnic5

9. Exit Maintenance Mode on the ESXi host.
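
For step 7, picking the right uplink out of the "esxcli network nic list" output can be scripted. The Python sketch below parses a saved copy of that output; the helper name linked_nmlx_uplinks is hypothetical, the sample rows are copied from the listing in step 7, and the parsing assumes the column order shown there.

```python
SAMPLE = """\
vmnic0 0000:81:00.0 igbn Up Up 1000 Full 0c:c4:7a:e3:5c:8c 1500 Intel Corporation I350 Gigabit Network Connection
vmnic4 0000:02:00.1 nmlx5_core Up Down 0 Half ec:0d:9a:8c:c7:97 1500 Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
vmnic5 0000:02:00.0 nmlx5_core Up Up 100000 Full ec:0d:9a:8c:c7:96 1500 Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
"""


def linked_nmlx_uplinks(text):
    """Return names of uplinks bound to an nmlx* driver whose link is Up."""
    uplinks = []
    for line in text.splitlines():
        # Nine single-word columns, then a free-text description.
        fields = line.split(None, 9)
        if len(fields) < 10:
            continue  # blank lines, "..." and other short lines
        name, _pci, driver, _admin, link = fields[:5]
        if driver.startswith("nmlx") and link == "Up":
            uplinks.append(name)
    return uplinks


print(linked_nmlx_uplinks(SAMPLE))  # prints ['vmnic5']
```

The returned name is the value passed to the "-u" option in step 8.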

ESXi VLAN Configuration

The topology below describes two machines. Both of them have vmnic5 as the adapter uplink.

To set the VLAN ID for the traffic to 100, follow these steps:

1. Edit the distributed port group settings.

2. Choose "VLAN" from the left panel.

3. Set the VLAN type to "VLAN".

4. Set the VLAN tag to "100".

5. Click "OK".

For more information refer to VMware documentation.

Verification

Log in to the ESXi vSphere Command-Line Interface with root permissions.

For verification purposes, when you are using NVIDIA switches you can lower the speed of one of the switch ports, forcing the use of PFC pause frames due to insufficient bandwidth:

Switch Console
switch (config) # interface ethernet 1/1 shutdown
switch (config) # interface ethernet 1/1 speed 10000
switch (config) # no interface ethernet 1/1 shutdown

Note: You can also create congestion, and so trigger PFC, by other methods. A simple one is to have two hosts send traffic to a third host.

Run traffic between the hosts on priority 3.

Note: The final PCP value is determined by pcp_force.

Note: If PFC for a priority is not enabled by pfctx and pfcrx, the HCA counters for that priority will not increment, and the data will be counted on priority 0 instead.

Verify that both the HCA and the switch transmitted and received pause frames on priority 3:

# vsish -e cat /net/pNics/vmnic5/stats | grep -e "Pause\|PerPrio"

PerPrio[0]
rxPause : 0
txPause : 0
PerPrio[1]
rxPause : 0
txPause : 0
PerPrio[2]
rxPause : 0
txPause : 0
PerPrio[3]
rxPause : 3348591
txPause : 12217
PerPrio[4]
rxPause : 0
txPause : 0
PerPrio[5]
rxPause : 0
txPause : 0
PerPrio[6]
rxPause : 0
txPause : 0
PerPrio[7]
rxPause : 0
txPause : 0
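
Checking the counters by eye works for one host; for several hosts it may be easier to parse the output. The Python sketch below reads text in the format shown above and reports which priorities saw pause frames. The function names are hypothetical, and the sample is a shortened copy of the output above.

```python
import re

SAMPLE = """\
PerPrio[0]
rxPause : 0
txPause : 0
PerPrio[3]
rxPause : 3348591
txPause : 12217
"""


def pause_counters(text):
    """Parse per-priority pause counters into {priority: {name: count}}."""
    counters = {}
    prio = None
    for line in text.splitlines():
        header = re.match(r"\s*PerPrio\[(\d)\]", line)
        if header:
            prio = int(header.group(1))
            counters[prio] = {}
            continue
        value = re.match(r"\s*(rxPause|txPause)\s*:\s*(\d+)", line)
        if value and prio is not None:
            counters[prio][value.group(1)] = int(value.group(2))
    return counters


def active_priorities(counters):
    """Return the priorities with at least one non-zero pause counter."""
    return sorted(p for p, c in counters.items() if any(c.values()))


print(active_priorities(pause_counters(SAMPLE)))  # prints [3]
```

For the configuration in this document the expected result is a single active priority, 3; activity on any other priority suggests traffic is not being classified as intended.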


Done!

Authors


Boris Kovalev

Boris Kovalev has worked for the past several years as a Solutions Architect, focusing on NVIDIA Networking/Mellanox technology, and is responsible for complex machine learning, Big Data and advanced VMware-based cloud research and design. Boris previously spent more than 20 years as a senior consultant and solutions architect at multiple companies, most recently at VMware. He has written multiple reference designs covering VMware, machine learning, Kubernetes, and container solutions which are available at the Mellanox Documents website.



Trademarks
NVIDIA, the NVIDIA logo, and Mellanox are trademarks and/or registered trademarks of NVIDIA Corporation and/or Mellanox Technologies Ltd. in the U.S. and in other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright
© 2022 NVIDIA Corporation & affiliates. All Rights Reserved.