Packet Burst Handling

This feature allows packet burst handling, while avoiding packet drops that may occur when a large number of packets is sent in a short period of time. For the feature's registry keys, see section Performance Registry Keys.

By default, the feature is disabled, and the AsyncReceiveIndicate registry key is set to 0. To enable the feature, choose one of the following options:

  • To enable packet burst buffering using threaded DPC (recommended), set the AsyncReceiveIndicate registry key to 1.
  • To enable packet burst buffering using polling, set the AsyncReceiveIndicate registry key to 2.

To control the number of reserved receive packets, set the RfdReservationFactor registry key:

Default        150
Recommended    10,000
Maximum        5,000,000

Memory consumption increases in accordance with the RfdReservationFactor registry key value.
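
Assuming these keys live under the adapter's miniport registry path (the same per-adapter path used elsewhere in this document; see Performance Registry Keys), the following is a minimal PowerShell sketch with a hypothetical IndexValue of 0008 and an adapter named "Ethernet":

    # Enable packet burst buffering using threaded DPC and raise the RFD reservation (example values).
    $path = "HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0008"
    Set-ItemProperty -Path $path -Name "AsyncReceiveIndicate" -Value "1"
    Set-ItemProperty -Path $path -Name "RfdReservationFactor" -Value "10000"
    Restart-NetAdapter -Name "Ethernet"    # reload the adapter so the keys take effect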

Dropless Mode

This feature helps avoid dropping packets when the driver is not posting receive descriptors fast enough to the device (e.g. in cases of high CPU utilization).

Enabling/Disabling the Feature

There are two ways to enable/disable this feature:

  • Send down an OID to the driver. The following is the information buffer format:

    typedef struct _DROPLESS_MODE
    {
        UINT32 signature;       // must be set to the signature value below
        UINT8  dropless_mode;   // 1 - enable, 2 - disable
    } DROPLESS_MODE, *PDROPLESS_MODE;

    OID code:             0xFFA0C932
    Signature value:      (ULONG) 0x0C1EA2
    Dropless_mode value:  1 - Enables the feature
                          2 - Disables the feature

    The driver sets a default timeout value of 5 milliseconds.

  • Add the "DelayDropTimeout" registry key, set the value to one of the following options, and reload the adapter:

    DelayDropTimeout:  "50" - recommended value; sets the timeout to 5 milliseconds
                       "0"  - disables the feature

The registry key should be added to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<IndexValue>
To find the IndexValue, refer to section Finding the Index Value of the Network Interface.
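
A minimal PowerShell sketch of adding the key, assuming a hypothetical IndexValue of 0008 and an adapter named "Ethernet":

    # Set the delay-drop timeout to 5 milliseconds (value 50, in units of 100 usec) and reload the adapter.
    $path = "HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0008"
    Set-ItemProperty -Path $path -Name "DelayDropTimeout" -Value "50"
    Restart-NetAdapter -Name "Ethernet"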

Status Query

The status of the feature can be queried by sending down the same OID code (0xFFA0C932). If enabled, the driver will fill in the information buffer in the following format:

DROPLESS_MODE *answer = (DROPLESS_MODE *)InformationBuffer; 
answer->signature = MLX_OID_BUFFER_SIGNATURE;
answer->dropless_mode = 1;

The Dropless_mode value will be set to 0 if disabled.

Timeout Values and Timeout Notification

The feature’s timeout values are defined as follows:

Registry value units:  100 usec
Default driver value:  50 (5 milliseconds)
Accepted values:       0 (disabled) to 100 (10 milliseconds)


When the feature is enabled and a packet is received for an RQ with no receive WQEs, the packet processing is delayed, waiting for receive WQEs to be posted. The feature allows the flow control mechanism to take over, thus avoiding packet loss. During this period, the timer starts ticking, and if receive WQEs are not posted before the timer expires, the packet is dropped, and the feature is disabled. 

The driver notifies of the timer’s expiration by generating an event log with event ID 75 and the following message:

"Delay drop timer timed out for RQ Index [RqId]. Dropless mode feature is now disabled".

The feature can be re-enabled by sending down an OID call again with a non-zero timeout value. Every time the feature is enabled by the user, the driver logs an event with event ID 77 and the following message:
"Dropless mode entered. For more details, please refer to the user manual document"

Similarly, every time the feature is disabled by the user, the driver logs an event with event ID 78 and the following message:
"Dropless mode exited. For more details, please refer to the user manual document."

RDMA over Converged Ethernet (RoCE)

Remote Direct Memory Access (RDMA) is a remote memory management capability that allows server-to-server data movement directly between application memory, without any CPU involvement. RDMA over Converged Ethernet (RoCE) is a mechanism that provides this efficient data transfer with very low latencies on lossless Ethernet networks. With advances in data center convergence over reliable Ethernet, ConnectX® EN with RoCE uses the proven and efficient RDMA transport to provide a platform for deploying RDMA technology in mainstream data center applications at 10GigE and 40GigE link speeds. With its hardware offload support, ConnectX® EN takes advantage of this efficient RDMA transport (InfiniBand) service over Ethernet to deliver ultra-low latency for performance-critical and transaction-intensive applications such as financial, database, storage, and content delivery networks.

RoCE encapsulates the IB transport and GRH headers in Ethernet packets bearing a dedicated EtherType. While the use of GRH is optional within InfiniBand subnets, it is mandatory when using RoCE. Applications written over IB verbs should work seamlessly, but they require provisioning of GRH information when creating address vectors. The library and driver are modified to provide the mapping from GID to MAC addresses required by the hardware.

IP Routable (RoCEv2)

RoCE has two addressing modes: MAC based GIDs, and IP address based GIDs. In RoCE IP based, if the IP address changes while the system is running, the GID for the port will automatically be updated with the new IP address, using either IPv4 or IPv6.

RoCE IP based allows RoCE traffic between Windows and Linux systems, which use IP based GIDs by default.

A straightforward extension of the RoCE protocol enables traffic to operate in layer 3 environments. This capability is obtained via a simple modification of the RoCE packet format. Instead of the GRH used in RoCE, routable RoCE packets carry an IP header which allows traversal of IP L3 Routers and a UDP header that serves as a stateless encapsulation layer for the RDMA Transport Protocol Packets over IP.

The proposed RoCEv2 packets use a well-known UDP destination port value that unequivocally distinguishes the datagram. Similar to other protocols that use UDP encapsulation, the UDP source port field is used to carry an opaque flow-identifier that allows network devices to implement packet forwarding optimizations (e.g. ECMP) while staying agnostic to the specifics of the protocol header format.

The UDP source port is calculated as follows: UDP.SrcPort = (SrcPort XOR DstPort) OR 0xC000, where SrcPort and DstPort are the ports used to establish the connection.
For example, in a Network Direct application, when connecting to a remote peer, the destination IP address and the destination port must be provided as they are used in the calculation above. The source port provision is optional.
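
For illustration, the calculation can be reproduced in PowerShell (the port values below are arbitrary examples):

    # UDP.SrcPort = (SrcPort XOR DstPort) OR 0xC000
    $srcPort = 0x1234                                     # source port used to establish the connection
    $dstPort = 0x5678                                     # destination port used to establish the connection
    $udpSrcPort = ($srcPort -bxor $dstPort) -bor 0xC000   # 0x444C OR 0xC000 = 0xC44C
    "UDP.SrcPort = 0x{0:X4}" -f $udpSrcPort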

Furthermore, since this change exclusively affects the packet format on the wire, and since with RDMA semantics packets are generated and consumed below the API, applications can seamlessly operate over any form of RDMA service (including the routable version of RoCE, as shown in the RoCE and RoCE v2 Frame Format Differences diagram) in a completely transparent way (standard RDMA APIs are already IP based for all existing RDMA technologies).



The fabric must use the same protocol stack in order for nodes to communicate.

In earlier versions, the default value of the RoCE mode was RoCE v1. As of WinOF-2 v1.30, the default value of the RoCE mode is RoCEv2.

Upgrading from an earlier version to version 1.30 or above preserves the old default value (RoCE v1).

RoCE Configuration

In order to function reliably, RoCE requires a form of flow control. While it is possible to use global flow control, this is normally undesirable, for performance reasons.

The normal and optimal way to use RoCE is to use Priority Flow Control (PFC). To use PFC, it must be enabled on all endpoints and switches in the flow path.

In the following section we present instructions to configure PFC on Mellanox ConnectX™ cards. There are multiple configuration steps required, all of which may be performed via PowerShell. Therefore, although we present each step individually, you may ultimately choose to write a PowerShell script to do them all in one step. Note that administrator privileges are required for these steps.

The NIC is configured by default to enable RoCE. If the switch is not configured to enable ECN and/or PFC, this will cause performance degradation. Thus, it is recommended to enable ECN on the switch or disable the *NetworkDirect registry key.

For more information on how to enable ECN and PFC on the switch, refer to the https://support.mellanox.com/docs/DOC-2855 community page.

Configuring Windows Host

Since PFC is responsible for flow controlling at the granularity of traffic priority, it is necessary to assign different priorities to different types of network traffic.

As per RoCE configuration, all ND/NDK traffic is assigned to one or more chosen priorities, where PFC is enabled on those priorities.

Configuring the Windows host requires configuring QoS. To configure QoS, please follow the procedure described in section Configuring Quality of Service (QoS).

Global Pause (Flow Control)

To use Global Pause (Flow Control) mode, disable QoS and Priority:

PS $ Disable-NetQosFlowControl
PS $ Disable-NetAdapterQos <interface name>

To confirm flow control is enabled in adapter parameters:

Go to: Device manager --> Network adapters --> Mellanox ConnectX-4/ConnectX-5 Ethernet Adapter --> Properties -->Advanced tab
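
The same setting can also be inspected from PowerShell; this is a sketch that assumes the adapter is named "Ethernet" and that the setting is exposed under the display name "Flow Control" (the display name may differ between driver versions):

    # Query the flow control advanced property of the adapter.
    Get-NetAdapterAdvancedProperty -Name "Ethernet" -DisplayName "Flow Control"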


Configuring SwitchX® Based Switch System

To enable RoCE, the SwitchX should be configured as follows:

  • Ports facing the host should be configured as access ports, and either use global pause or Priority Code Point (PCP) for priority flow control
  • Ports facing the network should be configured as trunk ports, and use Priority Code Point (PCP) for priority flow control

For further information on how to configure SwitchX, please refer to SwitchX User Manual.

Configuring Arista Switch

  1. Set the ports that face the hosts as trunk.

    (config)# interface et10
    (config-if-Et10)# switchport mode trunk
  2. Set VID allowed on trunk port to match the host VID.

    (config-if-Et10)# switchport trunk allowed vlan 100
  3. Set the ports that face the network as trunk.

    (config)# interface et20
    (config-if-Et20)# switchport mode trunk
  4. Assign the relevant ports to LAG.

    (config)# interface et10
    (config-if-Et10)# dcbx mode ieee
    (config-if-Et10)# speed forced 40gfull
    (config-if-Et10)# channel-group 11 mode active
  5. Enable PFC on ports that face the network.

    (config)# interface et20
    (config-if-Et20)# load-interval 5
    (config-if-Et20)# speed forced 40gfull
    (config-if-Et20)# switchport trunk native vlan tag
    (config-if-Et20)# switchport trunk allowed vlan 11
    (config-if-Et20)# switchport mode trunk
    (config-if-Et20)# dcbx mode ieee
    (config-if-Et20)# priority-flow-control mode on
    (config-if-Et20)# priority-flow-control priority 3 no-drop
    
Using Global Pause (Flow Control)

To enable Global Pause on ports that face the hosts, perform the following:

(config)# interface et10
(config-if-Et10)# flowcontrol receive on
(config-if-Et10)# flowcontrol send on
Using Priority Flow Control (PFC)

To enable PFC on ports that face the hosts, perform the following:

(config)# interface et10
(config-if-Et10)# dcbx mode ieee
(config-if-Et10)# priority-flow-control mode on
(config-if-Et10)# priority-flow-control priority 3 no-drop

Configuring Router (PFC only)

The router uses the L3 DSCP value to mark the egress traffic with an L2 PCP value. The required mapping maps the three most significant bits of the DSCP into the PCP. This is the default behavior, and no additional configuration is required.
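
For example, extracting the three most significant DSCP bits is a simple right shift:

    # DSCP 26 (binary 011010) -> PCP 3 (binary 011)
    $dscp = 26
    $pcp  = $dscp -shr 3
    "DSCP $dscp maps to PCP $pcp"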

Copying Priority Code Point (PCP) between Subnets

The captured PCP option from the Ethernet header of the incoming packet can be used to set the PCP bits on the outgoing Ethernet header.

Configuring the RoCE Mode

RoCE mode is configured per adapter or per driver. If the RoCE mode key is set for the adapter, it is used. Otherwise, the mode is configured by the per-driver key, which is shared between all devices in the system.

The supported RoCE modes depend on the firmware installed. If the firmware does not support the needed mode, the fallback mode would be the maximum supported RoCE mode of the installed NIC.

RoCE is enabled by default. Configuring or disabling the RoCE mode can be done via the registry key.

To update it for a specific adapter using the registry key, set the roce_mode as follows:

  1. Find the registry key index value of the adapter according to section Finding the Index Value of the Network Interface.
  2. Set the roce_mode in the following path:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<IndexValue>

To update it for all the devices using the registry key, set the roce_mode as follows:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\mlx5\Parameters\Roce

For changes to take effect, please restart the network adapter after changing this registry key.
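
A minimal PowerShell sketch of both options, assuming a hypothetical IndexValue of 0008 and an adapter named "Ethernet" (the value 2 selects RoCE v2, per the table below):

    # Per-adapter setting (takes precedence over the per-driver setting)
    $adapterPath = "HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0008"
    Set-ItemProperty -Path $adapterPath -Name "roce_mode" -Value 2

    # Per-driver setting, shared by all devices in the system (create the Roce key first if it does not exist)
    $globalPath = "HKLM:\SYSTEM\CurrentControlSet\Services\mlx5\Parameters\Roce"
    if (-not (Test-Path $globalPath)) { New-Item -Path $globalPath -Force | Out-Null }
    Set-ItemProperty -Path $globalPath -Name "roce_mode" -Value 2

    # Restart the adapter for the change to take effect
    Restart-NetAdapter -Name "Ethernet"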


Registry Key Parameters

The following are per-driver and will apply to all available adapters.

Parameter Name: roce_mode
Parameter Type: DWORD
Description: Sets the RoCE mode. The following are the possible RoCE modes:
  • RoCE MAC Based
  • RoCE v2
  • No RoCE
Allowed and Default Values:
  • RoCE MAC Based = 0
  • [Default] RoCE v2 = 2
  • No RoCE = 4


RoCEv2 Congestion Management (RCM)

Network congestion occurs when the number of packets being transmitted through the network approaches the packet-handling capacity of the network. A congested network will suffer from throughput deterioration manifested by increasing time delays and high latency.

In lossy environments, this leads to a packet loss. In lossless environments, it leads to “victim flows” (streams of data which are affected by the congestion, caused by other data flows that pass through the same network).

The figure below demonstrates a victim flow scenario. In the absence of congestion control, flow X'Y suffers from reduced bandwidth due to flow F'G, which experiences congestion. To address this, Congestion Control methods and protocols were defined.

This chapter describes, at a high level, RoCEv2 Congestion Management (RCM) and provides a guide on how to configure it in a Windows environment.

RoCEv2 Congestion Management (RCM) provides the capability to avoid congestion hot spots and optimize the throughput of the fabric.

With RCM, congestion in the fabric is reported back to the “sources” of traffic. The sources, in turn, react by throttling down their injection rates, thus preventing the negative effects of fabric buffer saturation and increased queuing delays.

For signaling of congestion, RCM relies on the ECN (Explicit Congestion Notification) mechanism defined in RFC 3168.

The source node and destination node can be considered as a “closed-loop control” system. Starting from the trigger, when the destination node reflects the congestion alert to the source node, the source node reacts by decreasing, and later on increasing, the Tx rates according to the feedback provided. The source node keeps increasing the Tx rates until the system reaches a steady state of non-congested flow with traffic as high rate as possible.

The RoCEv2 Congestion Management feature is composed of the following components:

  • Congestion Point (CP) - detects congestion and marks packets using the ECN bits
  • Notification Point (NP) (receiving end node) - reacts to the ECN-marked packets by sending congestion notification packets (CNPs)
  • Reaction Point (RP) (transmitting end node) - reduces the transmission rate according to the received CNPs

These components can be seen in the High-Level sequence diagram below:

For further details, please refer to the IBTA RoCEv2 specification, Annex A-17.

Restrictions and Limitations


General
  • In order for RCM to function properly, the elements in the communication path must support and be configured for RCM (nodes) and ECN marking (switches, routers).
  • ConnectX®-4 and ConnectX®-4 Lx support congestion control only with RoCEv2.
  • RCM does not remove/replace the need for flow control.
    In order for RoCEv2 to work properly, flow control must be configured.
    It is not recommended to configure RCM without PFC or Global Pause.
Mellanox
  • Minimal firmware version - 2.30
  • Minimal driver version - 1.35
  • Mellanox switches are supported as of "Spectrum" based switch systems
  • RCM is supported only when using a physical adapter

RCM Configuration

RCM configuration of the Mellanox adapter is done via the mlx5Cmd tool.

To view the current status of RCM on the adapter, run the following command:

mlx5Cmd.exe -Qosconfig -Dcqcn -Name <Network Adapter Name> -Get


Example of RCM being disabled:

PS C:\Users\admin\Desktop> mlx5Cmd.exe -Qosconfig -Dcqcn -Name "Ethernet" -Get
DCQCN RP attributes for adapter "Ethernet":
	DcqcnRPEnablePrio0: 0
	DcqcnRPEnablePrio1: 0
	DcqcnRPEnablePrio2: 0
	DcqcnRPEnablePrio3: 0
	DcqcnRPEnablePrio4: 0
	DcqcnRPEnablePrio5: 0
	DcqcnRPEnablePrio6: 0
	DcqcnRPEnablePrio7: 0
	DcqcnClampTgtRate: 0
	DcqcnClampTgtRateAfterTimeInc: 1
	DcqcnRpgTimeReset: 100
	DcqcnRpgByteReset: 400
	DcqcnRpgThreshold: 5
	DcqcnRpgAiRate: 10
	DcqcnRpgHaiRate: 100
	DcqcnAlphaToRateShift: 11
	DcqcnRpgMinDecFac: 50
	DcqcnRpgMinRate: 1
	DcqcnRateToSetOnFirstCnp: 3000
	DcqcnDceTcpG: 32
	DcqcnDceTcpRtt: 4
	DcqcnRateReduceMonitorPeriod: 32
	DcqcnInitialAlphaValue: 0

DCQCN NP attributes for adapter "Ethernet":
	DcqcnNPEnablePrio0: 0
	DcqcnNPEnablePrio1: 0
	DcqcnNPEnablePrio2: 0
	DcqcnNPEnablePrio3: 0
	DcqcnNPEnablePrio4: 0
	DcqcnNPEnablePrio5: 0
	DcqcnNPEnablePrio6: 0
	DcqcnNPEnablePrio7: 0
	DcqcnCnpDscp: 0
	DcqcnCnp802pPrio: 7
	DcqcnCnpPrioMode: 1
The command was executed successfully


To enable/disable DCQCN on the adapter, run the following command:

mlx5Cmd.exe -Qosconfig -Dcqcn -Name <Network Adapter Name> -Enable/Disable


This can be used on all priorities or on a specific priority.

PS C:\Users\admin\Desktop> mlx5Cmd.exe -Qosconfig -Dcqcn -Name "Ethernet" -Enable

PS C:\Users\admin\Desktop> mlx5Cmd.exe -Qosconfig -Dcqcn -Name "Ethernet" -Get
DCQCN RP attributes for adapter "Ethernet":
	DcqcnRPEnablePrio0: 1
	DcqcnRPEnablePrio1: 1
	DcqcnRPEnablePrio2: 1
	DcqcnRPEnablePrio3: 1
	DcqcnRPEnablePrio4: 1
	DcqcnRPEnablePrio5: 1
	DcqcnRPEnablePrio6: 1
	DcqcnRPEnablePrio7: 1
	DcqcnClampTgtRate: 0
	DcqcnClampTgtRateAfterTimeInc: 1
	DcqcnRpgTimeReset: 100
	DcqcnRpgByteReset: 400
	DcqcnRpgThreshold: 5
	DcqcnRpgAiRate: 10
	DcqcnRpgHaiRate: 100
	DcqcnAlphaToRateShift: 11
	DcqcnRpgMinDecFac: 50
	DcqcnRpgMinRate: 1
	DcqcnRateToSetOnFirstCnp: 3000
	DcqcnDceTcpG: 32
	DcqcnDceTcpRtt: 4
	DcqcnRateReduceMonitorPeriod: 32
	DcqcnInitialAlphaValue: 0

DCQCN NP attributes for adapter "Ethernet":
	DcqcnNPEnablePrio0: 1
	DcqcnNPEnablePrio1: 1
	DcqcnNPEnablePrio2: 1
	DcqcnNPEnablePrio3: 1
	DcqcnNPEnablePrio4: 1
	DcqcnNPEnablePrio5: 1
	DcqcnNPEnablePrio6: 1
	DcqcnNPEnablePrio7: 1
	DcqcnCnpDscp: 0
	DcqcnCnp802pPrio: 7
	DcqcnCnpPrioMode: 1
The command was executed successfully


RCM Parameters

The table below lists the parameters that can be configured, their description and allowed values.

Parameter (Type)                         Allowed Values

DcqcnEnablePrio0 (BOOLEAN)               0/1
DcqcnEnablePrio1 (BOOLEAN)               0/1
DcqcnEnablePrio2 (BOOLEAN)               0/1
DcqcnEnablePrio3 (BOOLEAN)               0/1
DcqcnEnablePrio4 (BOOLEAN)               0/1
DcqcnEnablePrio5 (BOOLEAN)               0/1
DcqcnEnablePrio6 (BOOLEAN)               0/1
DcqcnEnablePrio7 (BOOLEAN)               0/1
DcqcnClampTgtRate (1 bit)                0/1
DcqcnClampTgtRateAfterTimeInc (1 bit)    0/1
DcqcnCnpDscp (6 bits)                    0 - 63
DcqcnCnp802pPrio (3 bits)                0 - 7
DcqcnCnpPrioMode (1 bit)                 0/1
DcqcnRpgTimeReset (uint32)               0 - 131071 [uSec]
DcqcnRpgByteReset (uint32)               0 - 32767 [64 bytes]
DcqcnRpgThreshold (uint32)               1 - 31
DcqcnRpgAiRate (uint32)                  1 - line rate [Mbit/sec]
DcqcnRpgHaiRate (uint32)                 1 - line rate [Mbit/sec]
DcqcnAlphaToRateShift (uint32)           0 - 11
DcqcnRpgMinDecFac (uint32)               0 - 100
DcqcnRpgMinRate (uint32)                 0 - line rate
DcqcnRateToSetOnFirstCnp (uint32)        0 - line rate [Mbit/sec]
DcqcnDceTcpG (uint32)                    0 - 1023 (fixed point fraction of 1024)
DcqcnDceTcpRtt (uint32)                  0 - 131071 [uSec]
DcqcnRateReduceMonitorPeriod (uint32)    0 - UINT32-1 [uSec]
DcqcnInitialAlphaValue (uint32)          0 - 1023 (fixed point fraction of 1024)

An attempt to set a value greater than the parameter's maximum "line rate" value (where one exists) will not be applied as requested; the maximum "line rate" value will be set instead.

RCM Default Parameters

Every parameter has a default value assigned to it. The default value was set for optimal congestion control by Mellanox. In order to view the default parameters on the adapter, run the following command:

mlx5Cmd.exe -Qosconfig -Dcqcn -Name <Network Adapter Name> -Defaults

RCM with Untagged Traffic

Congestion control for untagged traffic is configured with the port default priority that is used for untagged frames. The port default priority is configured via the mlx5Cmd tool.

Parameter (Type): DefaultUntaggedPriority
Allowed and Default Values: 0 - 7 (Default: 0)
Note: As of WinOF-2 v2.10, this key can be changed dynamically. In any case of an illegal input, the value will fall back to the default value and not to the last value used.

To view the current default priority on the adapter, run the following command:

mlx5Cmd.exe -QoSConfig -DefaultUntaggedPriority -Name <Network Adapter Name> -Get



To set the default priority to a specific priority on the adapter, run the following command:

mlx5Cmd.exe -QoSConfig -DefaultUntaggedPriority -Name <Network Adapter Name> -Set <prio>


Congestion Control Behavior when Changing the Parameters

Changing the values of these parameters may strongly affect congestion control efficiency.
Please make sure you fully understand a parameter's usage, values, and expected results before changing its default value.

CNP Priority
Parameter: DcqcnCnpDscp
Description: This parameter changes the priority value at the IP level (DSCP) that is set for generated CNPs.

Parameter: DcqcnCnpPrioMode
Description: If this parameter is set to '0', DcqcnCnp802pPrio is used as the priority value (802.1p) in the Ethernet header of generated CNPs. Otherwise, the priority value of CNPs is taken from the received packets that were ECN-marked.

Parameter: DcqcnCnp802pPrio
Description: This parameter changes the priority value (802.1p) in the Ethernet header of generated CNPs. Set DcqcnCnpPrioMode to '0' in order to use this priority value.
Alpha ("α") - Rate Reduction Factor

The device maintains an “alpha” value per QP. This alpha value estimates the current congestion severity in the fabric.

Parameter: DcqcnInitialAlphaValue
Description: This parameter sets the initial value of alpha that should be used when receiving the first CNP for a flow (expressed as a fixed point fraction of 2^10).

The value of alpha is updated once every DcqcnDceTcpRtt, regardless of the reception of a CNP. If a CNP is received during this time frame, the alpha value will increase; if no CNP is received, the alpha value will decrease.

Parameter: DcqcnDceTcpG / DcqcnDceTcpRtt
Description: These two parameters maintain alpha (see the sketch after this list):

  • If a CNP is received on the RP, alpha is increased: α = (1 - DcqcnDceTcpG) * α + DcqcnDceTcpG
  • If no CNP is received for a duration of DcqcnDceTcpRtt microseconds, alpha is decreased: α = (1 - DcqcnDceTcpG) * α
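
The following PowerShell sketch illustrates the alpha update rule, assuming DcqcnDceTcpG is interpreted as a fraction of 1024 (per the parameter table above); the CNP arrival pattern is invented for illustration:

    # Illustrative alpha update, evaluated once per DcqcnDceTcpRtt interval (not driver code).
    $g     = 32 / 1024                                   # DcqcnDceTcpG = 32, as a fraction of 1024
    $alpha = 1.0                                         # assume alpha starts fully saturated
    $cnpPerInterval = @($true, $true, $false, $false)    # invented CNP reception pattern

    foreach ($cnpReceived in $cnpPerInterval) {
        if ($cnpReceived) {
            $alpha = (1 - $g) * $alpha + $g              # CNP received: alpha increases toward 1
        } else {
            $alpha = (1 - $g) * $alpha                   # no CNP in this interval: alpha decays
        }
        "alpha = {0:N4}" -f $alpha
    }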
“RP” Decrease

The DcqcnRateToSetOnFirstCnp parameter determines the current rate (CR) that is set when the first CNP is received.

The rate is updated only once every DcqcnRateReduceMonitorPeriod microseconds (multiple CNPs received during this time frame will not affect the rate) by using the following two formulas:

  • Cr1(new) = (1 - (α / 2^DcqcnAlphaToRateShift)) * Cr(old)
  • Cr2(new) = Cr(old) / DcqcnRpgMinDecFac

The maximum of the two resulting rates is chosen (see the worked example below).

The target rate will be updated to the previous current rate according to the behavior stated in section Increase on the “RP”.
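
A small worked example of the two decrease formulas, assuming α in the formula is the device's fixed-point alpha value (a fraction of 1024) and using invented rate values:

    # Illustrative RP rate decrease (not driver code).
    $crOld                 = 40000          # current rate in Mbit/sec (example value)
    $alpha                 = 512            # fixed-point alpha, i.e. roughly 0.5
    $dcqcnAlphaToRateShift = 11
    $dcqcnRpgMinDecFac     = 50

    $cr1 = (1 - ($alpha / [math]::Pow(2, $dcqcnAlphaToRateShift))) * $crOld   # 30000
    $cr2 = $crOld / $dcqcnRpgMinDecFac                                        # 800
    $crNew = [math]::Max($cr1, $cr2)        # the maximal (least reduced) rate is used
    "Cr1 = $cr1, Cr2 = $cr2 -> new current rate = $crNew Mbit/sec"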

Parameter: DcqcnRpgMinDecFac
Description: This parameter defines the maximal ratio of decrease in a single step (it is used as a denominator in the formula above, so it must be non-zero).

Parameter: DcqcnAlphaToRateShift
Description: This parameter defines the decrease rate for a given alpha (see the formula above).

Parameter: DcqcnRpgMinRate
Description: In addition to DcqcnRpgMinDecFac, the DcqcnRpgMinRate parameter defines the minimal rate value for the entire single flow.
Note: Setting it to the line rate will disable Congestion Control.

“RP” Increase

The RP increases its sending rate using a timer and a byte counter. The byte counter increases the rate every DcqcnRpgByteReset x 64 bytes (marked as B), while the timer increases the rate every DcqcnRpgTimeReset time units (marked as T). Every successful increase due to bytes transmitted or time passing is counted in the variables rpByteStage and rpTimeStage, respectively.

The DcqcnRpgThreshold parameter defines the number of successive increase iterations (marked as Th). The increase flow is divided into 3 types of phases, which are actually states in the "RP Rate Control State Machine". The transition between the phases is decided according to the DcqcnRpgThreshold parameter, as follows (a small selection sketch appears after the list):

  • Fast Recovery
    If MAX (rpByteStage, rpTimeStage) < Th.
    No change to Target Rate (Tr)
  • Additive Increase
    If MAX (rpByteStage, rpTimeStage) > Th. && MIN (rpByteStage, rpTimeStage) < Th.
    DcqcnRpgAiRate value is used to increase Tr
  • Hyper Additive Increase
    If MAX (rpByteStage, rpTimeStage) > Th. && MIN (rpByteStage, rpTimeStage) > Th.
    DcqcnRpgHaiRate value is used to increase Tr

For further details, please refer to 802.1Qau standard, sections 32.11-32.15.
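
A minimal sketch of the phase selection described above (illustrative only; variable names follow the text, and the boundary handling at exactly Th is an assumption):

    # Illustrative RP increase phase selection (not driver code).
    function Get-RpIncreasePhase {
        param([int]$rpByteStage, [int]$rpTimeStage, [int]$Th)

        $max = [math]::Max($rpByteStage, $rpTimeStage)
        $min = [math]::Min($rpByteStage, $rpTimeStage)

        if ($max -lt $Th)     { return "Fast Recovery (no change to Tr)" }
        elseif ($min -lt $Th) { return "Additive Increase (Tr increased using DcqcnRpgAiRate)" }
        else                  { return "Hyper Additive Increase (Tr increased using DcqcnRpgHaiRate)" }
    }

    Get-RpIncreasePhase -rpByteStage 3 -rpTimeStage 7 -Th 5    # Additive Increase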

Parameter: DcqcnClampTgtRateAfterTimeInc
Description: When receiving a CNP, the target rate should be updated if the transmission rate was increased due to the timer, and not only due to the byte counter.

Parameter: DcqcnClampTgtRate
Description: If set, whenever a CNP is processed, the target rate is updated to be the current rate.

Mellanox Commands and Examples

For a full description of Congestion Control commands please refer to section MlxCmd Utilities.

Set a value for one or more parameters:
Command

Mlx5Cmd.exe -Qosconfig -Dcqcn -Name <Network Adapter Name> -Set -Arg1 <value> -Arg2 <value>

Example

PS C:\Users\admin\Desktop> Mlx5Cmd.exe -Qosconfig -Dcqcn -Name "Ethernet" -Set -DcqcnClampTgtRate 1 -DcqcnCnpDscp 3

Enable/Disable DCQCN for a specific priority:
Command

Mlx5Cmd.exe -Qosconfig -Dcqcn -Name <Network Adapter Name> -Enable <prio>

Example

PS C:\Users\admin\Desktop> Mlx5Cmd.exe -Qosconfig -Dcqcn -Name "Ethernet" -Enable 3

Enable/Disable DCQCN for all priorities:
Command

Mlx5Cmd.exe -Qosconfig -Dcqcn -Name <Network Adapter Name> -Enable

Example

PS C:\Users\admin\Desktop> Mlx5Cmd.exe -Qosconfig -Dcqcn -Name "Ethernet" -Enable

Set port default priority for a specific priority:
Command

Mlx5Cmd.exe -Qosconfig -DefaultUntaggedPriority -Name <Network Adapter Name> -Set <prio>

Example

PS C:\Users\admin\Desktop> Mlx5Cmd.exe -Qosconfig -DefaultUntaggedPriority -Name "Ethernet" -Set 3

Restore the default DCQCN settings that are defined by Mellanox:
Command

Mlx5Cmd.exe -Dcqcn -Name <Network Adapter Name> -Restore

Example

PS C:\Users\admin\Desktop> Mlx5Cmd.exe -Dcqcn -Name "Ethernet" -Restore

For information on the RCM counters, please refer to section Mellanox WinOF-2 Congestion Control.

Teaming and VLAN

Windows Server 2012 and above supports Teaming as part of the operating system. Please refer to Microsoft guide “NIC Teaming in Windows Server 2012”. 

Note that the Microsoft teaming mechanism is only available on Windows Server distributions.

Configuring a Network to Work with VLAN in Windows Server 2012 and Above

In this procedure, you DO NOT create a VLAN; rather, you use an existing VLAN ID.

To configure a port to work with VLAN using the Device Manager (a PowerShell alternative is sketched after these steps):

  1. Open the Device Manager.
  2. Go to the Network adapters.
  3. Go to the properties of Mellanox ConnectX®-4 Ethernet Adapter card.
  4. Go to the Advanced tab.
  5. Choose the VLAN ID in the Property window.
  6. Set its value in the Value window.
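
Alternatively, the VLAN ID can be set from PowerShell using the same registry keyword shown later in this document (the adapter name and VLAN ID below are examples):

    # Set VLAN ID 100 on the adapter and read it back.
    Set-NetAdapterAdvancedProperty -Name "Ethernet 4" -RegistryKeyword "VlanID" -RegistryValue "100"
    Get-NetAdapterAdvancedProperty -Name "Ethernet 4" -AllProperties | Where-Object RegistryKeyword -eq "VlanID"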

Configuring Quality of Service (QoS)

QoS Configuration

Prior to configuring Quality of Service, you must install Data Center Bridging using one of the following methods:

Disabling Flow Control Configuration

To disable the flow control configuration, go to:
Device manager->Network adapters->Mellanox ConnectX-4/ConnectX-5 Ethernet Adapter->Properties->Advanced tab

Installing the Data Center Bridging using the Server Manager
  1. Open the 'Server Manager'.
  2. Select 'Add Roles and Features'.
  3. Click Next.
  4. Select 'Features' on the left panel.
  5. Check the 'Data Center Bridging' checkbox.
  6. Click 'Install'.
Installing the Data Center Bridging using PowerShell

Enable Data Center Bridging (DCB).

PS $ Install-WindowsFeature Data-Center-Bridging
Configuring QoS on the Host

The procedure below is not saved after you reboot your system. Hence, we recommend you create a script using the steps below and run it on the startup of the local machine.
Please see the procedure below on how to add the script to the local machine startup scripts.

  1. Change the Windows PowerShell execution policy.

    PS $ Set-ExecutionPolicy AllSigned
  2. Remove the entire previous QoS configuration.

    PS $ Remove-NetQosTrafficClass
    PS $ Remove-NetQosPolicy -Confirm:$False
  3. Create a Quality of Service (QoS) policy and tag each type of traffic with the relevant priority.
    In this example, TCP/UDP traffic uses priority 1 and SMB over TCP uses priority 3.

    PS $ New-NetQosPolicy "DEFAULT" -PolicyStore Activestore -Default -PriorityValue8021Action 3
    PS $ New-NetQosPolicy "TCP" -PolicyStore Activestore -IPProtocolMatchCondition TCP -PriorityValue8021Action 1
    PS $ New-NetQosPolicy "UDP" -PolicyStore Activestore -IPProtocolMatchCondition UDP -PriorityValue8021Action 1
    PS $ New-NetQosPolicy "SMB" -SMB -PriorityValue8021Action 3
  4. Create a QoS policy for SMB over SMB Direct traffic on Network Direct port 445.

    PS $ New-NetQosPolicy "SMBDirect" -PolicyStore Activestore -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
  5. [Optional] If VLANs are used, mark the egress traffic with the relevant VlanID.
    The NIC is referred to as "Ethernet 4" in the examples below.

    PS $ Set-NetAdapterAdvancedProperty -Name "Ethernet 4" -RegistryKeyword "VlanID" -RegistryValue "55"
  6. [Optional] Configure the IP address for the NIC.
    If DHCP is used, the IP address will be assigned automatically.

    PS $ Set-NetIPInterface -InterfaceAlias "Ethernet 4" -DHCP Disabled
    PS $ Remove-NetIPAddress -InterfaceAlias "Ethernet 4" -AddressFamily IPv4 -Confirm:$false
    PS $ New-NetIPAddress -InterfaceAlias "Ethernet 4" -IPAddress 192.168.1.10 -PrefixLength 24 -Type Unicast
  7. [Optional] Set the DNS server (assuming its IP address is 192.168.1.2).

    PS $ Set-DnsClientServerAddress -InterfaceAlias "Ethernet 4" -ServerAddresses 192.168.1.2

    After establishing the priorities of ND/NDK traffic, the priorities must have PFC enabled on them.

  8. Disable Priority Flow Control (PFC) for all other priorities except for 3.

    PS $ Disable-NetQosFlowControl 0,1,2,4,5,6,7
  9. Enable QoS on the relevant interface.

    PS $ Enable-NetAdapterQos -InterfaceAlias "Ethernet 4"
  10. Enable PFC on priority 3.

    PS $ Enable-NetQosFlowControl -Priority 3
Adding the Script to the Local Machine Startup Scripts
  1. From PowerShell, invoke:

    gpedit.msc
  2. In the pop-up window, under the 'Computer Configuration' section, perform the following:
    1. Select Windows Settings.
    2. Select Scripts (Startup/Shutdown).
    3. Double click Startup to open the Startup Properties.
    4. Move to “PowerShell Scripts” tab.
    5. Click Add.
      The script should include only the following commands:

      PS $ Remove-NetQosTrafficClass
      PS $ Remove-NetQosPolicy -Confirm:$False
      PS $ set-NetQosDcbxSetting -Willing 0
      PS $ New-NetQosPolicy "SMB" -Policystore Activestore -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
      PS $ New-NetQosPolicy "DEFAULT" -Policystore Activestore -Default -PriorityValue8021Action 3
      PS $ New-NetQosPolicy "TCP" -Policystore Activestore -IPProtocolMatchCondition TCP -PriorityValue8021Action 1
      PS $ New-NetQosPolicy "UDP" -Policystore Activestore -IPProtocolMatchCondition UDP -PriorityValue8021Action 1
      PS $ Disable-NetQosFlowControl 0,1,2,4,5,6,7
      PS $ Enable-NetAdapterQos -InterfaceAlias "port1"
      PS $ Enable-NetAdapterQos -InterfaceAlias "port2"
      PS $ Enable-NetQosFlowControl -Priority 3
      PS $ New-NetQosTrafficClass -name "SMB class" -priority 3 -bandwidthPercentage 50 -Algorithm ETS
    6. Browse for the script's location.

    7. Click OK

    8. To confirm the settings were applied after boot, run:

      PS $ get-netqospolicy -policystore activestore

Enhanced Transmission Selection (ETS)

Enhanced Transmission Selection (ETS) provides a common management framework for assignment of bandwidth to frame priorities as described in the IEEE 802.1Qaz specification.

For further details on configuring ETS on Windows™ Server, please refer to: http://technet.microsoft.com/en-us/library/hh967440.aspx

Differentiated Services Code Point (DSCP)

DSCP is a mechanism used for classifying network traffic on IP networks. It uses the 6-bit Differentiated Services Field (DS or DSCP field) in the IP header for packet classification purposes. Using Layer 3 classification enables you to maintain the same classification semantics beyond local network, across routers.

Every transmitted packet holds the information allowing network devices to map the packet to the appropriate 802.1Qbb CoS. For DSCP based PFC or ETS, the packet is marked with a DSCP value in the Differentiated Services (DS) field of the IP header. In case DSCP is enabled, QoS traffic counters are incremented based on the DSCP mapping described in section Receive Trust State.

System Requirements
Operating Systems: Windows Server 2012 and onward
Firmware version: 12/14/16.18.1000 and higher

Setting the DSCP in the IP Header


Marking the DSCP value in the IP header is done differently for IP packets constructed by the NIC (e.g. RDMA traffic) and for packets constructed by the IP stack (e.g. TCP traffic).

  • For IP packets generated by the IP stack, the DSCP value is provided by the IP stack. The NIC does not validate the match between DSCP and Class of Service (CoS) values. CoS and DSCP values are expected to be set through standard tools, such as PowerShell command New-NetQosPolicy using PriorityValue8021Action and DSCPAction flags respectively.
  • For IP packets generated by the NIC (RDMA), the DSCP value is generated according to the CoS value programmed for the interface. The CoS value is set through standard tools, such as the PowerShell command New-NetQosPolicy using the PriorityValue8021Action flag. The NIC uses a mapping table between the CoS value and the DSCP value, configured through the RroceDscpMarkPriorityFlowControl[0-7] registry keys.

Configuring Quality of Service for TCP and RDMA Traffic

  1. Verify that DCB is installed and enabled (it is not installed by default).

    PS $ Install-WindowsFeature Data-Center-Bridging
  2. Import the PowerShell modules that are required to configure DCB.

    PS $ import-module NetQos
    PS $ import-module DcbQos
    PS $ import-module NetAdapter
  3. Enable Network Adapter QoS.

    PS $ Set-NetAdapterQos -Name "CX4_P1" -Enabled 1
  4. Enable Priority Flow Control (PFC) on the specific priorities 3 and 5.

    PS $ Enable-NetQosFlowControl 3,5

Configuring DSCP to Control PFC for TCP Traffic

Create a QoS policy to tag All TCP/UDP traffic with CoS value 3 and DSCP value 9.

PS $ New-NetQosPolicy "DEFAULT" -Default -PriorityValue8021Action 3 -DSCPAction 9

DSCP can also be configured per protocol.

PS $ New-NetQosPolicy "TCP" -IPProtocolMatchCondition TCP -PriorityValue8021Action 3 -DSCPAction 16
PS $ New-NetQosPolicy "UDP" -IPProtocolMatchCondition UDP -PriorityValue8021Action 3 -DSCPAction 32

Configuring DSCP to Control ETS for TCP Traffic

  • Create a QoS policy to tag All TCP/UDP traffic with CoS value 0 and DSCP value 8.

    PS $ New-NetQosPolicy "DEFAULT" -Default -PriorityValue8021Action 0 -DSCPAction 8 -PolicyStore activestore
  • Configure DSCP with value 16 for TCP/IP connections with a range of ports.

    PS $ New-NetQosPolicy "TCP1" -DSCPAction 16 -IPDstPortStartMatchCondition 31000 -IPDstPortEndMatchCondition 31999 -IPProtocol TCP -PriorityValue8021Action 0 -PolicyStore activestore
  • Configure DSCP with value 24 for TCP/IP connections with another range of ports.

    PS $ New-NetQosPolicy "TCP2" -DSCPAction 24 -IPDstPortStartMatchCondition 21000 -IPDstPortEndMatchCondition 31999 -IPProtocol TCP -PriorityValue8021Action 0 -PolicyStore activestore
  • Configure two Traffic Classes with bandwidths of 16% and 80%.

    PS $ New-NetQosTrafficClass -name "TCP1" -priority 3 -bandwidthPercentage 16 -Algorithm ETS
    PS $ New-NetQosTrafficClass -name "TCP2" -priority 5 -bandwidthPercentage 80 -Algorithm ETS

Configuring DSCP to Control PFC for RDMA Traffic

Create a QoS policy to tag the ND traffic for port 10000 with CoS value 3.

PS $ New-NetQosPolicy "ND10000" -NetDirectPortMatchCondition 10000 -PriorityValue8021Action 3



Related Commands

Get-NetAdapterQos      - Gets the QoS properties of the network adapter
Get-NetQosPolicy       - Retrieves network QoS policies
Get-NetQosFlowControl  - Gets QoS status per priority

Receive Trust State

Quality of Service classification of received packets can be done according to the DSCP value instead of the PCP, using the RxTrustedState registry key. The mapping from wire DSCP values to the OS priority (PCP) is static, as follows:

DSCP Value    Priority
0-7           0
8-15          1
16-23         2
24-31         3
32-39         4
40-47         5
48-55         6
56-63         7

Mellanox Commands and Examples

When using this feature, it is expected that the transmit DSCP to priority mapping (the PriorityToDscpMappingTable_* registry keys) will match the above table, to create a consistent mapping in both directions.

Registry Settings

The following attributes must be set manually and will be added to the miniport registry.

For more information on configuring registry keys, see section Configuring the Driver Registry Keys.

Registry Key: TxUntagPriorityTag
Description: If 0x1, do not add an 802.1Q tag to transmitted packets which are assigned an 802.1p priority but are not assigned a non-zero VLAN ID (i.e. priority-tagged packets).
Default: 0x0. For DSCP based PFC, set to 0x1.
Note: These packets are still counted against the original priority, even if this key is enabled.

Registry Key: RxUntaggedMapToLossless
Description: If 0x1, all untagged traffic is mapped to the lossless receive queue.
Default: 0x0. For DSCP based PFC, set to 0x1.
Note: As of WinOF-2 v2.10, this key can be changed dynamically. In any case of an illegal input, the value will fall back to the default value and not to the last value used.

Registry Key: PriorityToDscpMappingTable_<ID>
Description: The DSCP value to mark on RoCE packets assigned to CoS=ID when priority flow control is enabled. The valid values range from 0 to 63; ID values range from 0 to 7.
Default: the ID value itself, e.g. PriorityToDscpMappingTable_3 is 3.

Registry Key: DscpBasedEtsEnabled
Description: If 0x1, the DSCP based ETS feature is enabled; if 0x0, it is disabled.
Default: 0x0.

Registry Key: DscpBasedpfcEnabled
Description: If set, the DSCP value of RoCE packets is set according to the assigned priority.

Registry Key: DscpForGlobalFlowControl
Description: Default DSCP value for global flow control.
Default: 0x1a.

Registry Key: RxTrustedState
Description: Selects the receive trust state: 1 - use the host priority (PCP) (default); 2 - use the DSCP value.

For changes to take effect, restart the network adapter after changing any of the above registry keys.
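
A hedged PowerShell sketch of setting some of these keys using the registry-keyword pattern shown elsewhere in this document (the adapter name and values are examples; whether a given key is accepted this way depends on the driver version and INF):

    # Example: enable DSCP based PFC related keys on the adapter, then reload it.
    Set-NetAdapterAdvancedProperty -Name "Ethernet" -RegistryKeyword "TxUntagPriorityTag" -RegistryValue "1" -AllProperties
    Set-NetAdapterAdvancedProperty -Name "Ethernet" -RegistryKeyword "RxUntaggedMapToLossless" -RegistryValue "1" -AllProperties
    Set-NetAdapterAdvancedProperty -Name "Ethernet" -RegistryKeyword "DscpBasedpfcEnabled" -RegistryValue "1" -AllProperties
    Restart-NetAdapter -Name "Ethernet"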

Default Settings

When DSCP configuration registry keys are missing in the miniport registry, the following defaults are assigned:

Registry Key                      Default Value
TxUntagPriorityTag                0
RxUntaggedMapToLossless           0
PriorityToDscpMappingTable_0      0
PriorityToDscpMappingTable_1      1
PriorityToDscpMappingTable_2      2
PriorityToDscpMappingTable_3      3
PriorityToDscpMappingTable_4      4
PriorityToDscpMappingTable_5      5
PriorityToDscpMappingTable_6      6
PriorityToDscpMappingTable_7      7
DscpBasedEtsEnabled               eth:0
DscpForGlobalFlowControl          26

Receive Segment Coalescing (RSC)

RSC reduces CPU utilization when dealing with large TCP message sizes, by allowing the driver to indicate the received data to the operating system once per message rather than once per MTU. RSC can be disabled for IPv4 or IPv6 traffic in the Advanced tab of the driver properties.

RSC provides diagnostic counters, documented in section Receive Segment Coalescing (RSC).

Wake-on-LAN (WoL)

Wake-on-LAN is a technology that allows a network admin to remotely power on a system or to wake it up from sleep mode by a network message. WoL is enabled by default.

To check whether or not WoL is supported by the adapter card:

  1. Check if mlxconfig recognizes the feature.

    mlxconfig -d /dev/mst/mt4117_pciconf0 show_confs
  2. Check if the firmware used in your system supports WoL.

    mlxconfig -d /dev/mst/mt4117_pciconf0 query

Data Center Bridging Exchange (DCBX)

Data Center Bridging Exchange (DCBX) protocol is an LLDP based protocol which manages and negotiates host and switch configuration. The WinOF-2 driver supports the following:

  • PFC - Priority Flow Control
  • ETS - Enhanced Transmission Selection
  • Application priority

The protocol is widely used to assure lossless path when running multiple protocols at the same time. DCBX is functional as part of configuring QoS mentioned in section Configuring Quality of Service (QoS). Users should make sure the willing bit on the host is enabled, using PowerShell if needed:

set-NetQosDcbxSetting -Willing 1


This is required to allow negotiating and accepting peer configurations. Willing bit is set to 1 by default by the operating system. The new settings can be queried by calling the following command in PowerShell.

Get-NetAdapterQos

In the example below, the configuration was received from the switch.


The output would look like the following:


In a scenario where both peers are set to Willing, the adapter with a lower MAC address takes the settings of the peer. 

DCBX is disabled in the driver by default, and in some firmware versions as well.

To use DCBX:

  1. Query and enable DCBX in the firmware.
    1. Download the WinMFT package from the following link: http://www.mellanox.com/page/management_tools
    2. Install the WinMFT package and go to \Program Files\Mellanox\WinMFT.
    3. Get the list of devices, run "mst status".
    4. Verify whether DCBX is enabled or disabled by running "mlxconfig.exe -d mt4117_pciconf0 query".
    5. If disabled, run the following commands for a dual-port card.

      mlxconfig -d mt4117_pciconf0 set LLDP_NB_RX_MODE_P1=2
      mlxconfig -d mt4117_pciconf0 set LLDP_NB_TX_MODE_P1=2
      mlxconfig -d mt4117_pciconf0 set LLDP_NB_DCBX_P1=1
      mlxconfig -d mt4117_pciconf0 set LLDP_NB_RX_MODE_P2=2
      mlxconfig -d mt4117_pciconf0 set LLDP_NB_TX_MODE_P2=2
      mlxconfig -d mt4117_pciconf0 set LLDP_NB_DCBX_P2=1
  2. Add the "DcbxMode" registry key, set the value to "2" and reload the adapter (a PowerShell sketch follows).
    The registry key should be added to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<IndexValue>
    To find the IndexValue, refer to section Finding the Index Value of the Network Interface.
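
A minimal PowerShell sketch of step 2, assuming a hypothetical IndexValue of 0008 and an adapter named "Ethernet":

    # Add the DcbxMode key with value 2 and reload the adapter.
    $path = "HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0008"
    Set-ItemProperty -Path $path -Name "DcbxMode" -Value "2"
    Restart-NetAdapter -Name "Ethernet"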

Receive Path Activity Monitoring

In the event where the device or the Operating System unexpectedly becomes unresponsive for a long period of time, the Flow Control mechanism may send pause frames, which will cause congestion spreading to the entire network.

To prevent this scenario, the device monitors its status continuously, attempting to detect when the receive pipeline is stalled. When the device detects a stall for a period longer than a pre-configured timeout, the Flow Control mechanisms (Global Pause and PFC) are automatically disabled.

If the PFC is in use, and one or more priorities are stalled, the PFC will be disabled on all priorities. When the device detects that the stall has ceased, the flow control mechanism will resume with its previously configured behavior.

Two registry parameters control the mechanism's behavior: the DeviceRxStallTimeout key controls the time threshold for disabling the flow control, and the DeviceRxStallWatermark key controls a diagnostics counter that can be used for early detection of stalled receive. WinOF-2 provides two counters to monitor the activity of this feature: "Minor Stall Watermark Reached" and "Critical Stall Watermark Reached". For more information, see section Ethernet Registry Keys.

Head of Queue Lifetime Limit

This feature enables the system to drop the packets that have been awaiting transmission for a long period of time, preventing the system from hanging. The implementation of the feature complies with the Head of Queue Lifetime Limit (HLL) definition in the InfiniBand™ Architecture Specification (see Related Documents).

The HLL has three registry keys for configuration:

TCHeadOfQueueLifeTimeLimit, TCStallCount and TCHeadOfQueueLifeTimeLimitEnable (see section Ethernet Registry Keys).

VXLAN

VXLAN technology provides solutions to scalability and security challenges. It requires an extension of the traditional stateless offloads to avoid a performance drop. ConnectX®-4 and onwards adapter cards offer stateless offloads for VXLAN packets, similar to the ones offered to non-encapsulated packets. The VXLAN protocol encapsulates its packets using an outer UDP header.

ConnectX®-4 and onwards support offloading of tasks related to VXLAN packet processing, such as TCP header checksum and VMQ (i.e.: directing incoming VXLAN packets to the appropriate VM queue).

Due to hardware limitation, on a dual-port adapter, VXLAN offload service cannot be provided simultaneously for both Ethernet ports if they are not using the same UDP port for VXLAN tunneling.

VXLAN can be configured using the standardized *VxlanUDPPortNumber and *EncapsulatedPacketTaskOffloadVxlan keys.
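
For example, the standardized keywords can be inspected and set from PowerShell (a sketch; the adapter name and UDP port are examples):

    # Query and set the standardized VXLAN offload keywords (example values; 4789 is the IANA VXLAN port).
    Get-NetAdapterAdvancedProperty -Name "Ethernet" -AllProperties |
        Where-Object { $_.RegistryKeyword -in "*VxlanUDPPortNumber", "*EncapsulatedPacketTaskOffloadVxlan" }
    Set-NetAdapterAdvancedProperty -Name "Ethernet" -RegistryKeyword "*VxlanUDPPortNumber" -RegistryValue "4789" -AllProperties
    Set-NetAdapterAdvancedProperty -Name "Ethernet" -RegistryKeyword "*EncapsulatedPacketTaskOffloadVxlan" -RegistryValue "1" -AllProperties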

Threaded DPC

A threaded DPC is a DPC that the system executes at IRQL = PASSIVE_LEVEL. An ordinary DPC preempts the execution of all threads, and cannot be preempted by a thread or by another DPC. If the system has a large number of ordinary DPCs queued, or if one of those DPCs runs for a long period of time, every thread will remain paused for an arbitrarily long period of time. Thus, each ordinary DPC increases the system latency, which can damage the performance of time-sensitive applications, such as audio or video playback.

Conversely, a threaded DPC can be preempted by an ordinary DPC, but not by other threads. Therefore, the user should use threaded DPCs rather than ordinary DPCs, unless a particular DPC must not be preempted, even by another DPC.

For more information, please refer to Introduction to Threaded DPCs.