Quality of Service (QoS)

Quality of Service (QoS) is a mechanism for assigning a priority to a network flow (socket, rdma_cm connection) and managing its guarantees, limitations, and priority over other flows. This is accomplished by mapping the user's priority to a hardware TC (traffic class) through a two- or three-stage process. The QoS attributes are assigned to the TC, and the flows mapped to it behave accordingly.

Mapping traffic to TCs consists of several user-controllable actions, some controlled by the application itself and others by the system/network administrators.
The general flow for mapping traffic to Traffic Classes is as follows:

  1. The application sets the required Type of Service (ToS).

  2. The ToS is translated into a Socket Priority (sk_prio).

  3. The sk_prio is mapped to a User Priority (UP) by the system administrator (some applications set sk_prio directly).

  4. The UP is mapped to TC by the network/system administrator.

  5. The TCs hold the actual QoS parameters.
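As an illustration of steps 3 to 5, the mapping can be modeled as a chain of table lookups. This is a minimal sketch only: the tables below (`SKPRIO_TO_UP`, `UP_TO_TC`, `TC_PARAMS`) are hypothetical example values, not defaults of any driver or tool, and the function name `resolve_qos` is made up for this example.

```python
# Hypothetical example tables. Real values come from the administrator's
# configuration and are not defined by this document.
SKPRIO_TO_UP = {sk_prio: sk_prio % 8 for sk_prio in range(16)}  # assumed mapping
UP_TO_TC = [0, 1, 1, 1, 2, 2, 2, 0]  # index = UP, value = TC
TC_PARAMS = {
    0: {"tsa": "ets", "bw_percent": 30},
    1: {"tsa": "ets", "bw_percent": 70},
    2: {"tsa": "strict"},
}


def resolve_qos(sk_prio):
    """Follow sk_prio -> UP -> TC and return (UP, TC, QoS parameters)."""
    up = SKPRIO_TO_UP[sk_prio]
    tc = UP_TO_TC[up]
    return up, tc, TC_PARAMS[tc]


if __name__ == "__main__":
    for sk_prio in (0, 2, 4, 6):
        print(sk_prio, resolve_qos(sk_prio))
```

The point of the sketch is that the flow itself carries only a priority; the QoS behavior is looked up from the TC at the end of the chain.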

QoS can be applied on the following types of traffic. However, the general QoS flow may vary among them:

  • Plain Ethernet - Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver

  • RoCE - Applications use the RDMA API to transmit using Queue Pairs (QPs)

  • Raw Ethernet QP - Applications use the Verbs API to transmit using a Raw Ethernet QP

Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver. The following is the Plain Ethernet QoS mapping flow:

  1. The application sets the ToS of the socket using setsockopt (IP_TOS, value).

  2. ToS is translated into the sk_prio using a fixed translation:

    TOS 0  <=> sk_prio 0
    TOS 8  <=> sk_prio 2
    TOS 24 <=> sk_prio 4
    TOS 16 <=> sk_prio 6

  3. The Socket Priority is mapped to the UP in the following conditions:

    1. If the underlying device is a VLAN device, the egress_map, which is controlled by the vconfig command, is used. This is a per-VLAN mapping.

    2. If the underlying device is not a VLAN device, the mapping is done in the driver.

  4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.
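The fixed ToS to sk_prio translation in step 2 can be sketched as a lookup table, together with the socket call from step 1. This is a minimal sketch: the helper names (`tos_to_sk_prio`, `open_udp_socket_with_tos`) are hypothetical, and the table encodes only the four ToS values listed in this document; any other value is rejected here rather than guessed.

```python
import socket

# Fixed translation as listed in this document (four ToS values only).
TOS_TO_SK_PRIO = {0: 0, 8: 2, 24: 4, 16: 6}


def tos_to_sk_prio(tos):
    """Return the sk_prio for a documented ToS value, else raise ValueError."""
    if tos not in TOS_TO_SK_PRIO:
        raise ValueError("ToS %d is not in the documented table" % tos)
    return TOS_TO_SK_PRIO[tos]


def open_udp_socket_with_tos(tos):
    """Create a UDP socket and request the given ToS with IP_TOS (step 1)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    return sock


if __name__ == "__main__":
    s = open_udp_socket_with_tos(8)
    print("ToS 8 is expected to translate to sk_prio", tos_to_sk_prio(8))
    s.close()
```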

Warning

Socket applications can use setsockopt (SO_PRIORITY, value) to set the sk_prio of the socket directly. In this case, the fixed ToS to sk_prio mapping is not needed. This allows the application and the administrator to use more than the 4 values available via ToS.
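On Linux, the socket option that sets sk_prio directly is SO_PRIORITY at the SOL_SOCKET level. The following is a minimal sketch; the helper name `set_sk_prio` is hypothetical, and the 0 to 15 range check is an assumption taken from the sk_prio values that the tools in this document display. On Linux, values above 6 typically require the CAP_NET_ADMIN capability.

```python
import socket


def set_sk_prio(sock, sk_prio):
    """Set the socket priority directly, bypassing the ToS translation."""
    # Assumed valid range: 0..15, matching the sk_prio values shown by the tools.
    if not 0 <= sk_prio <= 15:
        raise ValueError("sk_prio must be in the range 0..15")
    so_priority = getattr(socket, "SO_PRIORITY", None)
    if so_priority is None:
        raise OSError("SO_PRIORITY is not available on this platform")
    # Values above 6 usually need CAP_NET_ADMIN on Linux.
    sock.setsockopt(socket.SOL_SOCKET, so_priority, sk_prio)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    set_sk_prio(s, 5)
    s.close()
```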

Warning

In the case of a VLAN interface, the UP obtained according to the above mapping is also used in the VLAN tag of the traffic.

Applications use RDMA-CM API to create and use QPs. The following is the RoCE QoS mapping flow:

  1. The application sets the ToS of the QP using rdma_set_option (RDMA_OPTION_ID_TOS, value).

  2. ToS is translated into the Socket Priority (sk_prio) using a fixed translation:

    TOS 0  <=> sk_prio 0
    TOS 8  <=> sk_prio 2
    TOS 24 <=> sk_prio 4
    TOS 16 <=> sk_prio 6

  3. The Socket Priority is mapped to the User Priority (UP) using the tc command.

    • In the case of a VLAN device, the parent real device is used for the purpose of this mapping

    • If the underlying device is a VLAN device, and the parent real device was not used for the mapping, the VLAN device's egress_map is used

  4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.

Warning

With RoCE, there can only be 4 predefined ToS values for the purpose of QoS mapping.

For RoCE on old kernels that do not support set_egress_map, use the tc_wrap script to map between sk_prio and UP. Use tc_wrap with the -u option. For example:

tc_wrap -i <ethX> -u <skprio2up mapping>

The different QoS properties that can be assigned to a TC are:

Strict Priority

When a TC's transmission algorithm is set to 'strict', the TC has absolute (strict) priority over the strict TCs that come before it, as determined by the TC number: TC 7 is the highest priority and TC 0 is the lowest. It also has absolute priority over nonstrict (ETS) TCs.
Use this property with care, as it can easily starve other TCs.
A higher strict priority TC is always given the first chance to transmit. Only when the highest strict priority TC has nothing more to transmit is the next highest TC considered.
Nonstrict priority TCs are considered last.
This property is extremely useful for low-latency, low-bandwidth traffic that needs immediate service when it exists but is not of high enough volume to starve other transmitters in the system.
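The selection rule described above can be sketched as follows. This is an illustrative model only, not the hardware scheduler: the function name `next_strict_tc` and the `backlog` list (packets waiting per TC) are assumptions made for this example.

```python
def next_strict_tc(tsa, backlog):
    """Return the highest-numbered strict TC that has traffic waiting.

    tsa     -- list of algorithm names per TC, index = TC number
    backlog -- list of queued packet counts per TC, index = TC number
    Returns None when no strict TC has traffic, in which case the
    nonstrict (ETS) TCs would be considered.
    """
    for tc in range(len(tsa) - 1, -1, -1):
        if tsa[tc] == "strict" and backlog[tc] > 0:
            return tc
    return None


if __name__ == "__main__":
    tsa = ["ets", "ets", "strict", "strict"]
    print(next_strict_tc(tsa, [5, 5, 1, 3]))
```

The sketch also shows why starvation is possible: as long as a strict TC keeps a nonzero backlog, the function never returns a lower TC.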

Enhanced Transmission Selection (ETS)

Enhanced Transmission Selection standard (ETS) exploits the time periods in which the offered load of a particular Traffic Class (TC) is less than its minimum allocated bandwidth by allowing the difference to be available to other traffic classes.
After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be split among other TCs according to a minimal guarantee policy.
If, for instance, TC0 is set to an 80% guarantee and TC1 to 20% (the TC percentages must sum to 100), then the BW left after servicing all strict priority TCs is split according to this ratio.
Since this is a minimum guarantee, there is no maximum enforcement. This means, in the same example, that if TC1 does not use its 20% share, the remainder is used by TC0.
ETS is configured using the mlnx_qos tool, which allows you to:

  • Assign a transmission algorithm to each TC (strict or ETS)

  • Set minimal BW guarantee to ETS TCs
    Usage:

    mlnx_qos -i <interface> [options]
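The 80%/20% ETS example can be worked through numerically. The following is a simplified model under stated assumptions, not the device's scheduling algorithm: `ets_split` is a hypothetical helper, rates are in Gbps, and the bandwidth left after strict TCs is passed in as `leftover_gbps`.

```python
def ets_split(leftover_gbps, guarantee_percent, offered_gbps):
    """Split leftover bandwidth among ETS TCs by minimum-guarantee ratio.

    A TC that offers less than its share keeps only what it needs, and the
    unused part is redistributed to the TCs that still have traffic.
    """
    alloc = {tc: 0.0 for tc in guarantee_percent}
    active = {tc for tc in guarantee_percent if offered_gbps.get(tc, 0) > 0}
    remaining = float(leftover_gbps)
    while active and remaining > 1e-9:
        total = sum(guarantee_percent[tc] for tc in active)
        if total == 0:
            break
        budget = remaining
        done = set()
        for tc in sorted(active):
            share = budget * guarantee_percent[tc] / total
            give = min(share, offered_gbps[tc] - alloc[tc])
            alloc[tc] += give
            remaining -= give
            if offered_gbps[tc] - alloc[tc] <= 1e-9:
                done.add(tc)
        if not done:
            break  # every active TC consumed its full share
        active -= done
    return alloc


if __name__ == "__main__":
    # Both TCs saturated, then TC1 offering only 1 Gbps of a 10 Gbps leftover.
    print(ets_split(10, {0: 80, 1: 20}, {0: 10, 1: 10}))
    print(ets_split(10, {0: 80, 1: 20}, {0: 10, 1: 1}))
```

In the second call TC1 needs less than its 20% share, so the unused part goes to TC0, which matches the "no maximum enforcement" behavior described above.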

Rate Limit

Rate limit defines a maximum bandwidth allowed for a TC. Please note that 10% deviation from the requested values is considered acceptable.

Trust State

Trust state enables prioritizing sent/received packets based on packet fields.
The default trust state is PCP. Ethernet packets are prioritized based on the value of the field (PCP/DSCP).
For further information on how to configure Trust mode, please refer to the HowTo Configure Trust State on NVIDIA Adapters community post.

Warning

Setting the Trust State mode shall be done before enabling SR-IOV in order to propagate the Trust State to the VFs.


Receive Buffer

By default, the receive buffer configuration is controlled automatically. Users can override the receive buffer size and the receive buffer's xon and xoff thresholds using the mlnx_qos tool.
For further information, please refer to the HowTo Tune the Receive buffers on NVIDIA Adapters community post.

DCBX Control Mode

DCBX settings, such as "ETS" and "strict priority" can be controlled by firmware or software. When DCBX is controlled by firmware, changes of QoS settings cannot be done by the software. The DCBX control mode is configured using the mlnx_qos -d os/fw command.
For further information on how to configure the DCBX control mode, please refer to the mlnx_qos community post.

mlnx_qos

mlnx_qos is a centralized tool used to configure QoS features of the local host. It communicates directly with the driver and thus does not require setting up a DCBX daemon on the system.
The mlnx_qos tool enables the administrator of the system to:

  • Inspect the current QoS mappings and configuration
    The tool also displays maps configured by the tc and vconfig set_egress_map tools, in order to give a centralized view of all QoS mappings.

  • Set UP to TC mapping

  • Assign a transmission algorithm to each TC (strict or ETS)

  • Set minimal BW guarantee to ETS TCs

  • Set rate limit to TCs

  • Set DCBX control mode

  • Set cable length

  • Set trust state

Warning

For an unlimited ratelimit, set the ratelimit to 0.

Usage

mlnx_qos -i <interface> [options]

Options

--version
    Show the program's version number and exit

-h, --help
    Show this help message and exit

-f LIST, --pfc=LIST
    Set priority flow control for each priority. LIST is a comma-separated value for each priority, starting from 0 to 7. Example: 0,0,0,0,1,1,1,1 enables PFC on priorities 4-7

-p LIST, --prio_tc=LIST
    Maps UPs to TCs. LIST is 8 comma-separated TC numbers. Example: 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1

-s LIST, --tsa=LIST
    Transmission algorithm for each TC. LIST is comma-separated algorithm names for each TC. Possible algorithms: strict, ets and vendor. Example: vendor,strict,ets,ets,ets,ets,ets,ets sets TC0 to vendor, TC1 to strict, TC2-7 to ets

-t LIST, --tcbw=LIST
    Set the minimally guaranteed %BW for ETS TCs. LIST is comma-separated percents for each TC. Values set to TCs that are not configured to the ETS algorithm are ignored but must be present. Example: if TC0,TC2 are set to ETS, then 10,0,90,0,0,0,0,0 will set TC0 to 10% and TC2 to 90%. Percents must sum to 100

-r LIST, --ratelimit=LIST
    Rate limit for TCs (in Gbps). LIST is a comma-separated Gbps limit for each TC. Example: 1,8,8 will limit TC0 to 1 Gbps, and TC1,TC2 to 8 Gbps each

-d DCBX, --dcbx=DCBX
    Set dcbx mode to firmware controlled (fw) or OS controlled (os). Note: when in OS mode, mlnx_qos should not be used in parallel with other dcbx tools, such as lldptool

--trust=TRUST
    Set priority trust state to pcp or dscp

--dscp2prio=DSCP2PRIO
    Set/del a (dscp,prio) mapping. Example: 'set,30,2' maps dscp 30 to priority 2. 'del,30,2' resets the dscp 30 mapping back to the default setting priority 0

--cable_len=CABLE_LEN
    Set cable_len for buffer's xoff and xon thresholds

-i INTF, --interface=INTF
    Interface name

-a
    Show all interface's TCs
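When the tool is driven from a script, it can help to validate the lists before invoking it. The following is a minimal sketch that only assembles an argument list from the -i, -p, -s, -t and -r options described above; the helper name `build_mlnx_qos_args` and the interface name `eth0` are hypothetical, and the sketch does not run the tool.

```python
VALID_TSA = ("strict", "ets", "vendor")


def _join(values):
    return ",".join(str(v) for v in values)


def build_mlnx_qos_args(interface, prio_tc=None, tsa=None, tcbw=None,
                        ratelimit=None):
    """Return an argv list for mlnx_qos after basic sanity checks."""
    args = ["mlnx_qos", "-i", interface]
    for name, values in (("prio_tc", prio_tc), ("tsa", tsa),
                         ("tcbw", tcbw), ("ratelimit", ratelimit)):
        if values is not None and len(values) > 8:
            raise ValueError("%s accepts at most 8 entries" % name)
    if prio_tc is not None:
        args += ["-p", _join(prio_tc)]
    if tsa is not None:
        for algo in tsa:
            if algo not in VALID_TSA:
                raise ValueError("unknown transmission algorithm: %s" % algo)
        args += ["-s", _join(tsa)]
    if tcbw is not None:
        if sum(tcbw) != 100:
            raise ValueError("ETS percents must sum to 100")
        args += ["-t", _join(tcbw)]
    if ratelimit is not None:
        args += ["-r", _join(ratelimit)]
    return args


if __name__ == "__main__":
    print(" ".join(build_mlnx_qos_args("eth0", tsa=["ets", "ets", "strict"],
                                       tcbw=[30, 70])))
```

The returned list could be handed to `subprocess.run` on a host where the tool is installed; that step is left out here because it needs the driver and root privileges.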

Get Current Configuration

ofed_scripts/utils/mlnx_qos -i ens1f0
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
        prio:0 dscp:07,06,05,04,03,02,01,00,
        prio:1 dscp:15,14,13,12,11,10,09,08,
        prio:2 dscp:23,22,21,20,19,18,17,16,
        prio:3 dscp:31,30,29,28,27,26,25,24,
        prio:4 dscp:39,38,37,36,35,34,33,32,
        prio:5 dscp:47,46,45,44,43,42,41,40,
        prio:6 dscp:55,54,53,52,51,50,49,48,
        prio:7 dscp:63,62,61,60,59,58,57,56,
Cable len: 7
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   0   0   0   0   0
tc: 0 ratelimit: unlimited, tsa: vendor
         priority:  1
tc: 1 ratelimit: unlimited, tsa: vendor
         priority:  0
tc: 2 ratelimit: unlimited, tsa: vendor
         priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
         priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
         priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
         priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
         priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
         priority:  7
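In the dscp2prio mapping shown in this output, each priority covers eight consecutive DSCP values, so the displayed mapping is equivalent to taking the top three bits of the 6-bit DSCP. A minimal sketch, with hypothetical helper names, that reproduces this table and relates DSCP to the ToS byte (DSCP occupies the upper six bits of that byte):

```python
def displayed_dscp_to_prio(dscp):
    """Priority for a DSCP value according to the table in the output above."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp >> 3


def tos_byte_to_dscp(tos_byte):
    """Extract the 6-bit DSCP from an 8-bit ToS/DS byte."""
    return (tos_byte & 0xFF) >> 2


if __name__ == "__main__":
    for dscp in (0, 8, 30, 63):
        print("dscp", dscp, "-> prio", displayed_dscp_to_prio(dscp))
```

A mapping changed with --dscp2prio would no longer follow this rule; the sketch only mirrors the table as displayed.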

Set ratelimit: 3 Gbps for tc0, 4 Gbps for tc1, and 2 Gbps for tc2

# mlnx_qos -i <interface> -p 0,1,2 -r 3,4,2
tc: 0 ratelimit: 3 Gbps, tsa: strict
         up: 0
                skprio: 0
                skprio: 1
                skprio: 2 (tos: 8)
                skprio: 3
                skprio: 4 (tos: 24)
                skprio: 5
                skprio: 6 (tos: 16)
                skprio: 7
                skprio: 8
                skprio: 9
                skprio: 10
                skprio: 11
                skprio: 12
                skprio: 13
                skprio: 14
                skprio: 15
         up: 3
         up: 4
         up: 5
         up: 6
         up: 7
tc: 1 ratelimit: 4 Gbps, tsa: strict
         up: 1
tc: 2 ratelimit: 2 Gbps, tsa: strict
         up: 2

Configure QoS: map UP 0,7 to tc0, UP 1,2,3 to tc1, and UP 4,5,6 to tc2. Set tc0 and tc1 as ets and tc2 as strict. Divide the ETS bandwidth 30% for tc0 and 70% for tc1

# mlnx_qos -i <interface> -s ets,ets,strict -p 0,1,1,1,2,2,2 -t 30,70
tc: 0 ratelimit: 3 Gbps, tsa: ets, bw: 30%
         up: 0
                skprio: 0
                skprio: 1
                skprio: 2 (tos: 8)
                skprio: 3
                skprio: 4 (tos: 24)
                skprio: 5
                skprio: 6 (tos: 16)
                skprio: 7
                skprio: 8
                skprio: 9
                skprio: 10
                skprio: 11
                skprio: 12
                skprio: 13
                skprio: 14
                skprio: 15
         up: 7
tc: 1 ratelimit: 4 Gbps, tsa: ets, bw: 70%
         up: 1
         up: 2
         up: 3
tc: 2 ratelimit: 2 Gbps, tsa: strict
         up: 4
         up: 5
         up: 6

tc and tc_wrap.py

The tc_wrap.py tool is used to create 8 Traffic Classes (TCs).
The tool uses either sysfs (/sys/class/net/<ethX>/qos/tc_num) or the tc tool to create the TCs.

Usage

tc_wrap.py -i <interface> [options]

Options

--version
    Show the program's version number and exit

-h, --help
    Show this help message and exit

-u SKPRIO_UP, --skprio_up=SKPRIO_UP
    Maps sk_prio to priority for RoCE. LIST is up to 16 comma-separated priorities; the index of each element is the sk_prio

-i INTF, --interface=INTF
    Interface name
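Because the -u list is positional (element i is the priority for sk_prio i), it is easy to misplace an entry when writing it by hand. A minimal sketch that builds the list from an explicit mapping; the helper name `skprio_up_list` and the choice of 0 as the filler for unspecified entries are assumptions made for this example.

```python
def skprio_up_list(mapping, default_up=0):
    """Build the -u LIST string from a {sk_prio: UP} dictionary.

    Entries below the highest given sk_prio that are not in the mapping
    are filled with default_up (an assumption, not tool behavior).
    """
    if not mapping:
        raise ValueError("mapping must not be empty")
    if min(mapping) < 0 or max(mapping) > 15:
        raise ValueError("sk_prio must be in the range 0..15")
    for up in mapping.values():
        if not 0 <= up <= 7:
            raise ValueError("UP must be in the range 0..7")
    highest = max(mapping)
    return ",".join(str(mapping.get(i, default_up)) for i in range(highest + 1))


if __name__ == "__main__":
    # sk_prio 2, 4 and 6 correspond to ToS 8, 24 and 16 in the fixed table.
    print(skprio_up_list({0: 0, 2: 1, 4: 2, 6: 3}))
```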

Example

Run:

tc_wrap.py -i enp139s0

Output:

Tarrfic classes are set to 8
UP 0
        skprio: 0 (vlan 5)
UP 1
        skprio: 1 (vlan 5)
UP 2
        skprio: 2 (vlan 5 tos: 8)
UP 3
        skprio: 3 (vlan 5)
UP 4
        skprio: 4 (vlan 5 tos: 24)
UP 5
        skprio: 5 (vlan 5)
UP 6
        skprio: 6 (vlan 5 tos: 16)
UP 7
        skprio: 7 (vlan 5)

Additional Tools

The tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. It is part of the iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is available.

  • mlnx_qos tool (package: ofed-scripts) requires Python version 2.5 or higher

  • tc_wrap.py (package: ofed-scripts) requires Python version 2.5 or higher

ConnectX-4 and above devices allow packet pacing (traffic shaping) per flow. This capability is achieved by mapping a flow to a dedicated send queue and setting a rate limit on that send queue.
Note the following:

  • Up to 512 send queues are supported

  • 16 different rates are supported

  • The rates can vary from 1 Mbps to line rate in 1 Mbps resolution

  • Multiple queues can be mapped to the same rate (each queue is paced independently)

  • It is possible to configure rate limit per CPU and per flow in parallel

System Requirements

  • MLNX_OFED, v3.3 or higher

  • MLNX_EN, v3.3 or higher

  • Linux kernel v4.1 or higher

  • ConnectX-4 or ConnectX-4 Lx adapter cards with an official firmware version

Packet Pacing Configuration

Warning

This configuration is non-persistent and does not survive driver restart.

  1. Firmware Activation:
    First, make sure MFT service (mst) is started:

    # mst start

    Then run:

    # echo "MLNX_RAW_TLV_FILE" > /tmp/mlxconfig_raw.txt
    # echo "0x00000004 0x0000010c 0x00000000 0x00000001" >> /tmp/mlxconfig_raw.txt
    # yes | mlxconfig -d <mst_dev> -f /tmp/mlxconfig_raw.txt set_raw > /dev/null
    # reboot / mlxfwreset

    To deactivate, run the same sequence with the last word set to 0x00000000:

    # echo "MLNX_RAW_TLV_FILE" > /tmp/mlxconfig_raw.txt
    # echo "0x00000004 0x0000010c 0x00000000 0x00000000" >> /tmp/mlxconfig_raw.txt
    # yes | mlxconfig -d <mst_dev> -f /tmp/mlxconfig_raw.txt set_raw > /dev/null
    # reboot / mlxfwreset

  2. Driver Activation:
    There are two operation modes for Packet Pacing:

  1. Rate limit per CPU core:
    When XPS is enabled, traffic from a CPU core will be sent using the corresponding send queue. By limiting the rate on that queue, the transmit rate on that CPU core will be limited. For example:

    echo 300 > /sys/class/net/ens2f1/queues/tx-0/tx_maxrate

    In this case, the rate on Core 0 (tx-0) is limited to 300 Mbit/sec.

  2. Rate limit per flow:

    1. The driver allows opening up to 512 additional send queues using the following command:

      ethtool -L ens2f1 other 1200

      In this case, 1200 additional queues are opened

    2. Create flow mapping.
      Users can map a certain destination IP and/or destination layer 4 Port to a specific send queue. The match precedence is as follows:

        • IP + L4 Port

        • IP only

        • L4 Port only

        • No match (the flow would be mapped to default queues)

      To create flow mapping:
      Configure the destination IP. Write the IP address in hexadecimal representation to the relevant sysfs entry. For example, to map IP address 192.168.1.1 (0xc0a80101) to send queue 310, run the following command:

    echo 0xc0a80101 > /sys/class/net/ens2f1/queues/tx-310/flow_map/dst_ip

    To map destination L4 port 3333 (either TCP or UDP) to the same queue, run:

    echo 3333 > /sys/class/net/ens2f1/queues/tx-310/flow_map/dst_port

    From this point on, all traffic destined to the given IP address and L4 port will be sent using send queue 310. All other traffic will be sent using the original send queue.

    3. Limit the rate of this flow using the following command:

      echo 100 > /sys/class/net/ens2f1/queues/tx-310/tx_maxrate

Warning

Each queue supports only a single IP+Port combination.
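The per-flow steps above (convert the IP to hexadecimal, write dst_ip and dst_port, then set tx_maxrate) can be scripted. The following is a minimal sketch that writes to the same sysfs entries used in the examples; the helper names and the `root` parameter are hypothetical, and `root` exists only so the functions can be pointed at a scratch directory instead of a live system.

```python
import ipaddress
import os

SYSFS_NET = "/sys/class/net"


def ipv4_to_hex(addr):
    """Return an IPv4 address as 0x-prefixed hex, e.g. 192.168.1.1 -> 0xc0a80101."""
    return "0x%08x" % int(ipaddress.IPv4Address(addr))


def queue_dir(iface, queue, root=SYSFS_NET):
    return os.path.join(root, iface, "queues", "tx-%d" % queue)


def write_attr(path, value):
    with open(path, "w") as f:
        f.write("%s\n" % value)


def map_flow_to_queue(iface, queue, dst_ip=None, dst_port=None, root=SYSFS_NET):
    """Map a destination IP and/or L4 port to a send queue."""
    flow_map = os.path.join(queue_dir(iface, queue, root), "flow_map")
    if dst_ip is not None:
        write_attr(os.path.join(flow_map, "dst_ip"), ipv4_to_hex(dst_ip))
    if dst_port is not None:
        write_attr(os.path.join(flow_map, "dst_port"), dst_port)


def set_queue_maxrate(iface, queue, mbps, root=SYSFS_NET):
    """Limit the rate of a send queue (value as written to tx_maxrate)."""
    write_attr(os.path.join(queue_dir(iface, queue, root), "tx_maxrate"), mbps)


if __name__ == "__main__":
    print(ipv4_to_hex("192.168.1.1"))
```

On a configured host, `map_flow_to_queue("ens2f1", 310, "192.168.1.1", 3333)` followed by `set_queue_maxrate("ens2f1", 310, 100)` would perform the same writes as the echo commands above; both require root privileges and the packet pacing feature to be active.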

© Copyright 2023, NVIDIA. Last updated on Nov 27, 2023.