Quality of Service (QoS) is a mechanism for assigning a priority to a network flow and managing its guarantees, limitations, and precedence over other flows. This is accomplished by mapping the User Priority (UP) to a hardware Traffic Class (TC). QoS attributes are assigned to each TC, and the flows mapped to that TC behave accordingly.
Packet Pacing and Quality of Service (QoS) features do not co-exist.
To be able to work with QoS, make sure to disable Packet Pacing in firmware:
Create a file with the following content.
# vim /tmp/disable_packet_pacing.txt
MLNX_RAW_TLV_FILE
0x00000004 0x0000010c 0x00000000 0x00000000
Update the firmware configuration to disable Packet Pacing.
# mlxconfig -d pci0:<x>:0:0 -f /tmp/disable_packet_pacing.txt set_raw
Reset the firmware.
# mlxfwreset -d pci0:<x>:0:0 reset
Priority Code Point (PCP) is used to classify and manage network traffic and to provide QoS in Layer 2 Ethernet networks. It uses the 3-bit PCP field in the VLAN header for packet classification.
To create a VLAN interface and assign the desired priority to it:
# ifconfig mce<N>.<vlan> create
# ifconfig mce<N>.<vlan> vlanpcp <prio>
VLAN 0 Priority Tagging
The VLAN 0 Priority Tagging feature enables 802.1Q Ethernet frames to be transmitted with the VLAN ID set to zero.
Setting the VLAN ID to zero allows the VLAN tag to be ignored, so that the Ethernet frame is processed according to the priority configured in the 802.1p bits of the 802.1Q Ethernet frame header.
To enable VLAN 0 priority tagging on a specific interface:
# ifconfig mce<N> pcp <prio>
To disable VLAN 0 priority tagging on a specific interface:
# ifconfig mce<N> -pcp
The switch port must be configured to accept VLAN 0 priority-tagged packets. Otherwise, these packets may be dropped.
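To make the frame layout concrete, the following is a minimal sketch of how the 16-bit Tag Control Information (TCI) of an 802.1Q header combines the 3-bit PCP, the 1-bit DEI, and the 12-bit VLAN ID. The helper name vlan_tci is hypothetical and exists only for this illustration. It does not configure any interface.

```shell
#!/bin/sh
# vlan_tci PCP DEI VID  (hypothetical helper, illustration only)
# Prints the 802.1Q TCI: bits 15-13 = PCP, bit 12 = DEI, bits 11-0 = VLAN ID.
vlan_tci() {
    pcp=$1; dei=$2; vid=$3
    if [ "$pcp" -lt 0 ] || [ "$pcp" -gt 7 ]; then
        echo "PCP must be in the range 0-7" >&2; return 1
    fi
    if [ "$dei" -ne 0 ] && [ "$dei" -ne 1 ]; then
        echo "DEI must be 0 or 1" >&2; return 1
    fi
    if [ "$vid" -lt 0 ] || [ "$vid" -gt 4095 ]; then
        echo "VLAN ID must be in the range 0-4095" >&2; return 1
    fi
    printf '0x%04x\n' $(( ($pcp << 13) | ($dei << 12) | $vid ))
}

vlan_tci 5 0 100   # VLAN 100 with priority 5: prints 0xa064
vlan_tci 5 0 0     # priority-tagged frame (VLAN ID 0), priority 5: prints 0xa000
```

With the VLAN ID set to zero, only the upper bits of the TCI carry information, which is why such a frame is classified by priority alone.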
Differentiated Services (DiffServ) is a computer networking architecture that specifies a simple and scalable mechanism for classifying and managing network traffic and providing QoS on IP networks.
DiffServ uses a 6-bit Differentiated Services Code Point (DSCP) in the 8-bit DS field of the IP header for packet classification. The DS field replaces the outdated IPv4 TOS field.
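As an illustration of the field layout, the sketch below splits an 8-bit DS byte into its 6-bit DSCP (upper bits) and the remaining 2 bits, which are used for ECN. The helper name split_ds_field is hypothetical and is not part of the driver or its tools.

```shell
#!/bin/sh
# split_ds_field BYTE  (hypothetical helper, illustration only)
# BYTE may be decimal or 0x-prefixed hex. DSCP occupies the upper 6 bits
# of the DS field; the lower 2 bits are the ECN bits.
split_ds_field() {
    ds=$(( $1 ))
    if [ "$ds" -lt 0 ] || [ "$ds" -gt 255 ]; then
        echo "DS field must be in the range 0-255" >&2; return 1
    fi
    echo "dscp=$(( $ds >> 2 )) ecn=$(( $ds & 3 ))"
}

split_ds_field 0xb8   # prints dscp=46 ecn=0
```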
The trust state determines which packet field is used to prioritize sent and received packets.
The default trust state is PCP. Ethernet packets are prioritized based on the value of the selected field (PCP, DSCP, or both).
To configure Trust State, use the following sysctl node:
# sysctl -d dev.mce.<N>.conf.qos.trust_state
dev.mce.<N>.conf.qos.trust_state: Set trust state, 1:PCP 2:DSCP 3:BOTH
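The numeric values in the node description can be wrapped in a small helper. The sketch below only prints the command it would run, and assumes the node accepts a value through the usual sysctl name=value syntax used elsewhere in this section. The function names are hypothetical.

```shell
#!/bin/sh
# trust_state_value NAME  (hypothetical helper)
# Maps a trust state name to the number listed in the node description.
trust_state_value() {
    case "$1" in
        pcp|PCP)   echo 1 ;;
        dscp|DSCP) echo 2 ;;
        both|BOTH) echo 3 ;;
        *) echo "unknown trust state: $1" >&2; return 1 ;;
    esac
}

# trust_state_cmd UNIT NAME  (hypothetical helper)
# Prints, but does not execute, the sysctl command for device mce<UNIT>.
trust_state_cmd() {
    unit=$1
    value=$(trust_state_value "$2") || return 1
    echo "sysctl dev.mce.$unit.conf.qos.trust_state=$value"
}

trust_state_cmd 0 dscp   # prints sysctl dev.mce.0.conf.qos.trust_state=2
```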
The RDMA application is responsible for setting QoS values.
In RDMA CM mode, QoS is set in the tos field of the rdma_id_private struct.
Incoming RDMA CM connections always take precedence in setting the current priority.
In non-RDMA CM mode, priority values are set using a modify_qp command with an ibv_qp_attr parameter. The IPv4 Type of Service (ToS) and IPv6 Traffic Class are set using the attr.ah_attr.grh.traffic_class field. The VLAN PCP is set using the attr.ah_attr.sl field.
This feature allows users to map a specific User Priority (UP) to a specific TC.
Note that this configuration is permanent and will not be reset to default unless manually changed.
Example
To map UP 5 to TC 4 on device mce0:
# sysctl dev.mce.0.conf.qos.prio_0_7_tc=1,0,2,3,4,4,6,7
dev.mce.0.conf.qos.prio_0_7_tc: 1 0 2 3 4 5 6 7 -> 1 0 2 3 4 4 6 7
Note: By default, UP 0 is mapped to TC 1, and UP 1 is mapped to TC 0:
# sysctl dev.mce.0.conf.qos.prio_0_7_tc
dev.mce.0.conf.qos.prio_0_7_tc: 1 0 2 3 4 5 6 7
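Because the node takes all eight entries at once, it can help to generate the list instead of typing it by hand. The sketch below starts from the default mapping shown in the note above and overrides a single UP. The helper name prio_tc_map is hypothetical, the TC range of 0-7 is an assumption based on the eight-entry lists in this section, and the script only prints the resulting command.

```shell
#!/bin/sh
# Default UP-to-TC mapping, as reported by the driver in the note above.
DEFAULT_PRIO_TC="1 0 2 3 4 5 6 7"

# prio_tc_map UP TC  (hypothetical helper)
# Prints the comma-separated 8-entry list with the given UP remapped to TC.
prio_tc_map() {
    up=$1; tc=$2
    if [ "$up" -lt 0 ] || [ "$up" -gt 7 ] || [ "$tc" -lt 0 ] || [ "$tc" -gt 7 ]; then
        echo "UP and TC must be in the range 0-7" >&2; return 1
    fi
    i=0; out=""
    for cur in $DEFAULT_PRIO_TC; do
        if [ "$i" -eq "$up" ]; then cur=$tc; fi
        out="${out:+$out,}$cur"   # join entries with commas
        i=$(( $i + 1 ))
    done
    echo "$out"
}

# Map UP 5 to TC 4 on mce0 (printed, not executed).
echo "sysctl dev.mce.0.conf.qos.prio_0_7_tc=$(prio_tc_map 5 4)"
```

Starting from the default list rather than from the current device state is a simplification. On a system whose mapping was already changed, the current value of the node would be the better starting point.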
Each DSCP value can be mapped to a priority using the following sysctl nodes:
dev.mce.<N>.conf.qos.dscp_56_63_prio: 7 7 7 7 7 7 7 7
dev.mce.<N>.conf.qos.dscp_48_55_prio: 6 6 6 6 6 6 6 6
dev.mce.<N>.conf.qos.dscp_40_47_prio: 5 5 5 5 5 5 5 5
dev.mce.<N>.conf.qos.dscp_32_39_prio: 4 4 4 4 4 4 4 4
dev.mce.<N>.conf.qos.dscp_24_31_prio: 3 3 3 3 3 3 3 3
dev.mce.<N>.conf.qos.dscp_16_23_prio: 2 2 2 2 2 2 2 2
dev.mce.<N>.conf.qos.dscp_8_15_prio: 1 1 1 1 1 1 1 1
dev.mce.<N>.conf.qos.dscp_0_7_prio: 0 0 0 0 0 0 0 0
Example:
# sysctl dev.mce.0.conf.qos.dscp_0_7_prio=1,1,1,1,1,1,1,1
dev.mce.0.conf.qos.dscp_0_7_prio: 0 0 0 0 0 0 0 0 -> 1 1 1 1 1 1 1 1
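Since the 64 DSCP values are spread over eight nodes of eight entries each, locating the entry for a given DSCP is simple integer arithmetic. The sketch below derives the node name, the position inside the node, and the default priority listed above. The helper names are hypothetical, and the assumption that entries within a node are ordered by ascending DSCP is not confirmed by the text.

```shell
#!/bin/sh
# dscp_check DSCP  (hypothetical helper): succeeds for values 0-63.
dscp_check() {
    [ "$1" -ge 0 ] && [ "$1" -le 63 ]
}

# dscp_node DSCP: name of the sysctl leaf that covers this DSCP value.
dscp_node() {
    dscp_check "$1" || { echo "DSCP must be in the range 0-63" >&2; return 1; }
    lo=$(( $1 / 8 * 8 ))
    echo "dscp_${lo}_$(( $lo + 7 ))_prio"
}

# dscp_slot DSCP: index (0-7) within the node, assuming ascending DSCP order.
dscp_slot() {
    dscp_check "$1" || return 1
    echo $(( $1 % 8 ))
}

# dscp_default_prio DSCP: default priority according to the list above.
dscp_default_prio() {
    dscp_check "$1" || return 1
    echo $(( $1 / 8 ))
}

# DSCP 46: node dscp_40_47_prio, slot 6, default priority 5.
echo "$(dscp_node 46) slot $(dscp_slot 46) default priority $(dscp_default_prio 46)"
```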
This feature allows users to rate limit a specific TC. A rate limit defines the maximum bandwidth allowed for a TC. Note that a deviation of up to 10% from the requested values is considered acceptable.
Note that instead of setting the maximum rate for a single priority, you should pass the maximum rates for all relevant priorities as a single input.
Notes:
This configuration is permanent and will not be reset to default unless manually changed.
Rate is specified in kilobits per second, where kilo=1000.
Rate must be divisible by 100,000, meaning that values must be in units of 100 Mb/s.
Examples of valid values:
200000 - 200 Mb/s
1000000 - 1 Gb/s
3400000 - 3.4 Gb/s
0 - unlimited rate
Example:
To rate limit TC 4 on device mce0 to 2.4 Gb/s:
# sysctl dev.mce.0.conf.qos.tc_max_rate=0,0,0,0,2400000,0,0,0
dev.mce.0.conf.qos.tc_max_rate: 0 0 0 0 0 0 0 0 -> 0 0 0 0 2400000 0 0 0
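The divisibility rule and the eight-entry list format can be checked before the value is written. The following is a minimal sketch with hypothetical helper names. It limits one TC and leaves all others at 0 (unlimited), so it would need to be extended when several TCs must be limited in the same call. It only prints the command.

```shell
#!/bin/sh
# check_rate KBPS  (hypothetical helper)
# Succeeds if the rate is 0 (unlimited) or a multiple of 100000 kbit/s.
check_rate() {
    [ "$1" -ge 0 ] && [ $(( $1 % 100000 )) -eq 0 ]
}

# tc_max_rate_list TC KBPS  (hypothetical helper)
# Prints an 8-entry list in which only the given TC is rate limited.
tc_max_rate_list() {
    tc=$1; rate=$2
    if [ "$tc" -lt 0 ] || [ "$tc" -gt 7 ]; then
        echo "TC must be in the range 0-7" >&2; return 1
    fi
    check_rate "$rate" || {
        echo "rate must be 0 or a multiple of 100000 kbit/s" >&2; return 1
    }
    i=0; out=""
    while [ "$i" -lt 8 ]; do
        if [ "$i" -eq "$tc" ]; then cur=$rate; else cur=0; fi
        out="${out:+$out,}$cur"   # join entries with commas
        i=$(( $i + 1 ))
    done
    echo "$out"
}

# Limit TC 4 on mce0 to 2.4 Gb/s (printed, not executed).
echo "sysctl dev.mce.0.conf.qos.tc_max_rate=$(tc_max_rate_list 4 2400000)"
```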
To be able to fully utilize this feature, make sure Priority Flow Control (PFC) feature is enabled.
The Enhanced Transmission Selection (ETS) standard exploits the time periods in which the offered load of a particular Traffic Class (TC) is less than its minimum allocated bandwidth by making the unused bandwidth available to other traffic classes.
After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be split among other TCs according to a minimal guarantee policy.
If, for instance, TC0 is set to an 80% guarantee and TC1 to 20% (the TC shares must sum to 100), then the BW left after servicing all strict priority TCs will be split according to this ratio.
Since this is a minimal guarantee, there is no maximum enforcement. This means, in the same example, that if TC1 does not use its 20% share, the remainder will be used by TC0.
Example:
# sysctl dev.mce.0.conf.qos.tc_rate_share=20,10,10,10,10,10,10,20
In this example, TC 0 and TC 7 are each guaranteed 20% of the bandwidth, and each of the remaining TCs is guaranteed 10% of the bandwidth.
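The effect of a share list can be estimated with a short calculation. The sketch below checks that the shares sum to 100 and prints the minimum bandwidth each TC would receive from a given amount of bandwidth left after the strict priority TCs are serviced. The helper name ets_split and the 10 Gb/s figure are hypothetical, and the calculation ignores the redistribution of unused shares described above.

```shell
#!/bin/sh
# ets_split AVAILABLE_KBPS SHARE0 SHARE1 ...  (hypothetical helper)
# Prints the minimum guaranteed bandwidth per TC for the given shares.
ets_split() {
    avail=$1; shift
    sum=0
    for s in "$@"; do
        sum=$(( $sum + $s ))
    done
    if [ "$sum" -ne 100 ]; then
        echo "shares must sum to 100, got $sum" >&2; return 1
    fi
    tc=0
    for s in "$@"; do
        echo "TC$tc: $(( $avail * $s / 100 )) kbit/s minimum"
        tc=$(( $tc + 1 ))
    done
}

# Shares from the example, applied to an assumed 10 Gb/s of remaining bandwidth.
ets_split 10000000 20 10 10 10 10 10 10 20
```

Under these assumptions, TC0 and TC7 each receive at least 2,000,000 kbit/s and every other TC at least 1,000,000 kbit/s.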
Hardware buffers configuration can be tuned for priority flow control (PFC).
Parameter | Description |
dev.mce.X.conf.qos.buffers_size | Sets the buffer size. |
dev.mce.X.conf.qos.buffers_prio | Shows the mapping of priorities to buffers. |
dev.mce.X.conf.qos.cable_length | Specifies the cable length in meters. The driver uses it to calculate the signal propagation delay and so determine more precisely when XOFF should be issued. |