Quality of Service

Quality of Service (QoS) is a mechanism for assigning a priority to a network flow and managing its guarantees, its limitations, and its priority relative to other flows. This is accomplished by mapping the User Priority (UP) to a hardware Traffic Class (TC). QoS attributes are assigned to the TC, and the flows mapped to it behave accordingly.

Warning

Packet Pacing and Quality of Service (QoS) features do not co-exist.

To be able to work with QoS, make sure to disable Packet Pacing in firmware:

  1. Create a file with the following content.

    # vim /tmp/disable_packet_pacing.txt
    MLNX_RAW_TLV_FILE
    0x00000004 0x0000010c 0x00000000 0x00000000

  2. Update the firmware configuration to disable Packet Pacing.

    mlxconfig -d pci0:<x>:0:0 -f /tmp/disable_packet_pacing.txt set_raw

  3. Reset the firmware.

    mlxfwreset -d pci0:<x>:0:0 reset

Priority Code Point (PCP) is used to classify and manage network traffic and to provide QoS in Layer 2 Ethernet networks. Packets are classified by the 3-bit PCP field in the VLAN header.
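As a rough illustration of where the three PCP bits sit: the 16-bit Tag Control Information (TCI) field of an 802.1Q header carries the PCP in its top three bits, followed by a 1-bit DEI flag and the 12-bit VLAN ID. The sketch below only does this bit arithmetic. The helper name vlan_tci is hypothetical and is not part of the driver or of ifconfig.

```shell
#!/bin/sh
# vlan_tci PCP DEI VID: print the 16-bit 802.1Q TCI value in hex.
# Hypothetical helper for illustration only.
vlan_tci() {
    pcp=$1 dei=$2 vid=$3
    # Reject values that do not fit their fields (3, 1 and 12 bits).
    [ "$pcp" -ge 0 ] && [ "$pcp" -le 7 ] || return 1
    [ "$dei" -ge 0 ] && [ "$dei" -le 1 ] || return 1
    [ "$vid" -ge 0 ] && [ "$vid" -le 4095 ] || return 1
    # PCP occupies bits 15..13, DEI bit 12, VLAN ID bits 11..0.
    printf '0x%04x\n' $(( ($pcp << 13) | ($dei << 12) | $vid ))
}

vlan_tci 5 0 100   # priority 5 on VLAN 100, prints 0xa064
```

With a VLAN ID of 0, the same field carries only the priority, for example vlan_tci 3 0 0 prints 0x6000.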

To create a VLAN interface and assign the desired priority to it:

# ifconfig mce<N>.<vlan> create
# ifconfig mce<N>.<vlan> vlanpcp <prio>

VLAN 0 Priority Tagging

The VLAN 0 Priority Tagging feature enables 802.1Q Ethernet frames to be transmitted with VLAN ID set to zero.
Setting the VLAN ID to zero causes the VLAN ID to be ignored, so the Ethernet frame is processed according to the priority configured in the 802.1p bits of the 802.1Q Ethernet frame header.

To enable VLAN 0 priority tagging on a specific interface:

# ifconfig mce<N> pcp <prio>

To disable VLAN 0 priority tagging on a specific interface:

# ifconfig mce<N> -pcp

Warning

The switch port must be configured to accept VLAN 0 priority-tagged packets. Otherwise, these packets may be dropped.

Differentiated services (DiffServ) is a computer networking architecture that specifies a simple and scalable mechanism for classifying and managing network traffic and providing QoS on IP networks.
DiffServ uses the 6-bit Differentiated Services Code Point (DSCP) in the 8-bit DS field of the IP header for packet classification. The DS field replaces the outdated IPv4 TOS field.
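Because the DSCP occupies the upper six bits of the DS byte (the lower two bits are used for ECN), converting between a DSCP value and the byte an application sets as ToS or traffic class is a two-bit shift. The following is a minimal sketch of that arithmetic. The function names are hypothetical and exist only for this example.

```shell
#!/bin/sh
# dscp_to_ds DSCP: print the 8-bit DS (former ToS) byte with the ECN bits zero.
# Hypothetical helper for illustration only.
dscp_to_ds() {
    [ "$1" -ge 0 ] && [ "$1" -le 63 ] || return 1
    echo $(( $1 << 2 ))
}

# ds_to_dscp DS: extract the 6-bit DSCP from an 8-bit DS byte.
ds_to_dscp() {
    [ "$1" -ge 0 ] && [ "$1" -le 255 ] || return 1
    echo $(( $1 >> 2 ))
}

dscp_to_ds 46    # DSCP 46 (Expedited Forwarding), prints 184
ds_to_dscp 184   # prints 46
```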

Trust state enables prioritizing sent/received packets based on packet fields.

The default trust state is PCP. Ethernet packets are prioritized based on the value of the field selected by the trust state (PCP, DSCP, or both).

To configure Trust State, use the following sysctl node:

# sysctl -d dev.mce.<N>.conf.qos.trust_state
dev.mce.<N>.conf.qos.trust_state: Set trust state, 1:PCP 2:DSCP 3:BOTH

The RDMA application is responsible for setting QoS values.

  • In RDMA CM mode, QoS is set in the rdma_id_private struct in the tos field.
    Incoming RDMA CM connections always take precedence in setting the current priority.

  • In non-RDMA CM mode, priority values are set using a modify_qp command with ibv_qp_attr parameter. IPv4 type of service (“ToS”) and IPv6 traffic class are set using the attr.ah_attr.grh.traffic_class field. VLAN PCP is set using the attr.ah_attr.sl field.

Priority-to-TC mapping allows users to map a specific User Priority (UP) to a specific TC.

Note that this configuration is permanent and will not be reset to default unless manually changed.

Example

To map UP 5 to TC 4 on device mce0:

# sysctl dev.mce.0.conf.qos.prio_0_7_tc=1,0,2,3,4,4,6,7
dev.mce.0.conf.qos.prio_0_7_tc: 1 0 2 3 4 5 6 7 -> 1 0 2 3 4 4 6 7

Note: By default, UP 0 is mapped to TC 1, and UP 1 is mapped to TC 0:

# sysctl dev.mce.0.conf.qos.prio_0_7_tc
dev.mce.0.conf.qos.prio_0_7_tc: 1 0 2 3 4 5 6 7
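Since the sysctl takes all eight entries at once, changing a single UP means rewriting the whole list. The sketch below builds the new comma-separated value from the current one. The helper remap_up is hypothetical, and the sketch assumes that the position in the list is the UP (0 to 7) and the value at that position is the TC, as the examples above suggest.

```shell
#!/bin/sh
# remap_up CURRENT UP TC: print CURRENT (eight space- or comma-separated TC
# values, indexed by UP 0..7) with the entry for UP replaced by TC.
# Hypothetical helper for illustration only.
remap_up() {
    cur=$(echo "$1" | tr ',' ' ')
    up=$2 tc=$3
    [ "$up" -ge 0 ] && [ "$up" -le 7 ] || return 1
    [ "$tc" -ge 0 ] && [ "$tc" -le 7 ] || return 1
    i=0 out=
    for v in $cur; do
        if [ "$i" -eq "$up" ]; then v=$tc; fi
        out="${out:+$out,}$v"      # join entries with commas
        i=$(( i + 1 ))
    done
    [ "$i" -eq 8 ] || return 1     # the list must hold exactly eight entries
    echo "$out"
}

remap_up "1 0 2 3 4 5 6 7" 5 4    # prints 1,0,2,3,4,4,6,7

# Possible use on a live system (not run here):
#   cur=$(sysctl -n dev.mce.0.conf.qos.prio_0_7_tc)
#   sysctl dev.mce.0.conf.qos.prio_0_7_tc="$(remap_up "$cur" 5 4)"
```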

Each DSCP value can be mapped to a priority using the following sysctl nodes:

dev.mce.<N>.conf.qos.dscp_56_63_prio: 7 7 7 7 7 7 7 7
dev.mce.<N>.conf.qos.dscp_48_55_prio: 6 6 6 6 6 6 6 6
dev.mce.<N>.conf.qos.dscp_40_47_prio: 5 5 5 5 5 5 5 5
dev.mce.<N>.conf.qos.dscp_32_39_prio: 4 4 4 4 4 4 4 4
dev.mce.<N>.conf.qos.dscp_24_31_prio: 3 3 3 3 3 3 3 3
dev.mce.<N>.conf.qos.dscp_16_23_prio: 2 2 2 2 2 2 2 2
dev.mce.<N>.conf.qos.dscp_8_15_prio: 1 1 1 1 1 1 1 1
dev.mce.<N>.conf.qos.dscp_0_7_prio: 0 0 0 0 0 0 0 0

Example:

# sysctl dev.mce.0.conf.qos.dscp_0_7_prio=1,1,1,1,1,1,1,1
dev.mce.0.conf.qos.dscp_0_7_prio: 0 0 0 0 0 0 0 0 -> 1 1 1 1 1 1 1 1
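The 64 DSCP values are spread over eight nodes of eight entries each, so finding the entry for a given DSCP takes a small calculation. The sketch below derives the node name and the position within it. The helper dscp_node is hypothetical, and the sketch assumes that entries within a node are listed from the lowest DSCP to the highest, which the node names suggest but the text above does not state.

```shell
#!/bin/sh
# dscp_node N DSCP: print the sysctl node covering DSCP on device mce<N>,
# followed by the 0-based position of DSCP within that node's eight values.
# Hypothetical helper; entry ordering within a node is an assumption.
dscp_node() {
    n=$1 dscp=$2
    [ "$dscp" -ge 0 ] && [ "$dscp" -le 63 ] || return 1
    base=$(( dscp / 8 * 8 ))       # first DSCP covered by the node
    echo "dev.mce.$n.conf.qos.dscp_${base}_$(( base + 7 ))_prio $(( dscp % 8 ))"
}

dscp_node 0 46   # prints dev.mce.0.conf.qos.dscp_40_47_prio 6
```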

Maximum rate limiting allows users to cap the bandwidth of a specific TC. The rate limit defines the maximum bandwidth allowed for that TC. A deviation of up to 10% from the requested value is considered acceptable.

Warning

Note that instead of setting the maximum rate for a single priority, you should pass the maximum rates for all relevant priorities as a single input.

Notes:

  • This configuration is permanent and will not be reset to default unless manually changed.

  • Rate is specified in kilobits per second, where kilo=1000.

  • Rate must be divisible by 100,000, meaning that values must be in units of 100 Mb/s.

  • Examples for valid values:

    • 200000 - 200 Mb/s

    • 1000000 - 1 Gb/s

    • 3400000 - 3.4 Gb/s

  • 0 value = unlimited rate

Example:

To rate limit TC 4 on device mce0 to 2.4 Gb/s:

# sysctl dev.mce.0.conf.qos.tc_max_rate=0,0,0,0,2400000,0,0,0
dev.mce.0.conf.qos.tc_max_rate: 0 0 0 0 0 0 0 0 -> 0 0 0 0 2400000 0 0 0
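Because all eight rates are passed as one value and each must be a multiple of 100,000, it can help to check the list before applying it. The following is a minimal sketch of such a check. The helper check_max_rates is hypothetical, and the driver may apply its own validation that differs from this.

```shell
#!/bin/sh
# check_max_rates LIST: succeed only if LIST holds exactly eight comma-separated
# rates, each a non-negative multiple of 100000 (0 means unlimited).
# Hypothetical helper for illustration only.
check_max_rates() {
    n=0
    for r in $(echo "$1" | tr ',' ' '); do
        [ "$r" -ge 0 ] 2>/dev/null || return 1     # must be a non-negative integer
        [ $(( r % 100000 )) -eq 0 ] || return 1    # must be in 100 Mb/s units
        n=$(( n + 1 ))
    done
    [ "$n" -eq 8 ]
}

rates=0,0,0,0,2400000,0,0,0
if check_max_rates "$rates"; then
    echo "ok: $rates"
    # Possible use on a live system (not run here):
    #   sysctl dev.mce.0.conf.qos.tc_max_rate="$rates"
fi
```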

Warning

To be able to fully utilize this feature, make sure the Priority Flow Control (PFC) feature is enabled.

The Enhanced Transmission Selection (ETS) standard exploits the time periods in which the offered load of a particular Traffic Class (TC) is less than its minimum allocated bandwidth by making the unused bandwidth available to other traffic classes.

After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be split among other TCs according to a minimal guarantee policy.

If, for instance, TC0 is set to an 80% guarantee and TC1 to 20% (the TC shares must sum to 100), then the BW left after servicing all strict priority TCs will be split according to this ratio.

Since this is a minimal guarantee, there is no maximum enforcement. This means, in the same example, that if TC1 does not use its 20% share, the remainder will be used by TC0.

Example:

sysctl dev.mce.0.conf.qos.tc_rate_share=20,10,10,10,10,10,10,20

In this example, the first and last entries (TC 0 and TC 7) are each guaranteed 20% of the bandwidth, and each of the remaining entries is guaranteed 10%.
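To make the percentages concrete, the sketch below checks that a share list sums to 100 and prints the minimum bandwidth each TC would be guaranteed out of a given amount of bandwidth left after the strict priority TCs are serviced. The helper ets_guarantee and the 10000 Mb/s figure are hypothetical and serve only as a worked example.

```shell
#!/bin/sh
# ets_guarantee SHARES AVAIL_MBPS: verify that the eight comma-separated shares
# sum to 100, then print the minimum bandwidth guaranteed to each TC out of
# AVAIL_MBPS. Hypothetical helper for illustration only.
ets_guarantee() {
    avail=$2 sum=0 n=0
    list=$(echo "$1" | tr ',' ' ')
    for s in $list; do
        sum=$(( sum + s ))
        n=$(( n + 1 ))
    done
    [ "$n" -eq 8 ] && [ "$sum" -eq 100 ] || return 1
    tc=0
    for s in $list; do
        echo "TC$tc: $(( avail * s / 100 )) Mb/s"
        tc=$(( tc + 1 ))
    done
}

# Assume 10000 Mb/s remains after strict priority traffic.
ets_guarantee 20,10,10,10,10,10,10,20 10000
```

These figures are lower bounds only: any share a TC leaves unused remains available to the other TCs.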

Hardware buffers configuration can be tuned for priority flow control (PFC).

| Parameter | Description |
|---|---|
| dev.mce.X.conf.qos.buffers_size | Sets the buffer sizes. The hardware allows configuring up to eight buffer sizes, and sysctl allows setting each of them. The total sum of all buffers must not exceed the hardware memory size; this limit is enforced automatically. Buffer space exhaustion causes the adapter card to send xoff to the other side of the link. |
| dev.mce.X.conf.qos.buffers_prio | Shows the mapping between priority and buffer. Maps the buffer index to the hardware-defined priority. Note that the priority is the internal number after translation from the external QoS parameters. |
| dev.mce.X.conf.qos.cable_length | To determine more precisely when xoff should be issued, users may specify the cable length in meters, which is used to calculate the signal propagation delay. |

© Copyright 2023, NVIDIA. Last updated on May 24, 2023.