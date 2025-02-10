Remote Direct Memory Access (RDMA) is a remote memory management capability that moves data directly between application memory on different servers without CPU involvement. RDMA over Converged Ethernet (RoCE) provides this efficient data transfer with very low latency on lossless Ethernet networks. With advances in data center convergence over reliable Ethernet, ConnectX® EN with RoCE uses the proven and efficient RDMA transport to provide a platform for deploying RDMA technology in mainstream data center applications at 10GigE and 40GigE link speeds. With its hardware offload support, ConnectX® EN takes advantage of these efficient RDMA transport (InfiniBand) services over Ethernet to deliver ultra-low latency for performance-critical and transaction-intensive applications such as financial, database, storage, and content delivery networks.

RoCE encapsulates the IB transport and GRH headers in Ethernet packets bearing a dedicated EtherType. While the use of GRH is optional within InfiniBand subnets, it is mandatory when using RoCE. Applications written over IB verbs should work seamlessly, but they require provisioning of GRH information when creating address vectors. The library and driver are modified to provide the mapping from GID to MAC addresses required by the hardware.

RoCE has two addressing modes: MAC-based GIDs and IP-address-based GIDs. In IP-based mode, if the IP address changes while the system is running, the GID for the port is automatically updated with the new IP address, whether IPv4 or IPv6.

IP-based RoCE allows RoCE traffic between Windows and Linux systems, which use IP-based GIDs by default.
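To make the two addressing modes concrete, the following is a minimal sketch of how a 16-byte GID could be derived in each mode. It assumes the common conventions that an IPv4 address is carried as an IPv4-mapped IPv6 address (::ffff:a.b.c.d) and that a MAC-based GID is a link-local address built from the modified EUI-64 form of the MAC. Neither encoding is stated in this document, and the function names are hypothetical.

```python
import ipaddress


def ip_to_gid(ip: str) -> bytes:
    """Return a 16-byte GID for an IPv4 or IPv6 address (assumed encoding)."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        # Assumption: IPv4-mapped IPv6 form, 80 zero bits + 0xffff + IPv4.
        return bytes(10) + b"\xff\xff" + addr.packed
    return addr.packed


def mac_to_gid(mac: str) -> bytes:
    """Return a 16-byte link-local GID built from a MAC (assumed encoding)."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 6-octet MAC address")
    # Modified EUI-64: flip the universal/local bit and insert ff:fe.
    eui64 = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:6]
    return bytes.fromhex("fe80000000000000") + eui64


def format_gid(gid: bytes) -> str:
    """Format a GID as eight colon-separated 16-bit hex groups."""
    return ":".join(gid[i:i + 2].hex() for i in range(0, 16, 2))
```

Under these assumptions, a change of the port's IP address changes the output of `ip_to_gid`, which mirrors the automatic GID update described above, while `mac_to_gid` stays fixed for a given adapter.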

A straightforward extension of the RoCE protocol enables traffic to operate in layer 3 environments. This capability is obtained via a simple modification of the RoCE packet format. Instead of the GRH used in RoCE, routable RoCE packets carry an IP header which allows traversal of IP L3 Routers and a UDP header that serves as a stateless encapsulation layer for the RDMA Transport Protocol Packets over IP.

The proposed RoCEv2 packets use a well-known UDP destination port value that unequivocally distinguishes the datagram. Similar to other protocols that use UDP encapsulation, the UDP source port field is used to carry an opaque flow-identifier that allows network devices to implement packet forwarding optimizations (e.g. ECMP) while staying agnostic to the specifics of the protocol header format.

The UDP source port is calculated as follows: UDP.SrcPort = (SrcPort XOR DstPort) OR 0xC000, where SrcPort and DstPort are the ports used to establish the connection.

For example, in a Network Direct application, when connecting to a remote peer, the destination IP address and the destination port must be provided as they are used in the calculation above. The source port provision is optional.
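The source-port formula above can be sketched in a few lines. This is an illustration of the arithmetic only, not driver code. The function name is hypothetical, and the constant 4791 is the IANA-registered RoCE v2 UDP destination port, which this document refers to only as "well-known".

```python
ROCE_V2_UDP_DST_PORT = 4791  # IANA-registered port, not stated in this document


def rocev2_udp_src_port(src_port: int, dst_port: int) -> int:
    """Compute UDP.SrcPort = (SrcPort XOR DstPort) OR 0xC000."""
    for port in (src_port, dst_port):
        if not 0 <= port <= 0xFFFF:
            raise ValueError(f"port out of range: {port}")
    # OR-ing with 0xC000 sets the two top bits, so the result
    # always falls in the range 0xC000..0xFFFF.
    return (src_port ^ dst_port) | 0xC000
```

For instance, connection ports 0x1234 and 0x5678 give 0x444C after the XOR and 0xC44C after the OR. Because the value depends only on the connection ports, all packets of one connection carry the same source port, which is what lets ECMP keep a flow on a single path.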

Furthermore, because this change affects only the packet format on the wire, and because with RDMA semantics packets are generated and consumed below the API, applications can operate seamlessly over any form of RDMA service (including the routable version of RoCE, as shown in the RoCE and RoCE v2 Frame Format Differences diagram) in a completely transparent way. Standard RDMA APIs are already IP based for all existing RDMA technologies.

Note: The fabric must use the same protocol stack in order for nodes to communicate.

Note: In earlier versions, the default RoCE mode was RoCE v1. As of WinOF-2 v1.30, the default RoCE mode is RoCE v2. Upgrading from an earlier version to version 1.30 or above preserves the old default value (RoCE v1).





In order to function reliably, RoCE requires a form of flow control. While it is possible to use global flow control, this is normally undesirable, for performance reasons.

The normal and optimal way to use RoCE is to use Priority Flow Control (PFC). To use PFC, it must be enabled on all endpoints and switches in the flow path.

In the following section we present instructions to configure PFC on NVIDIA® ConnectX® family cards. There are multiple configuration steps required, all of which may be performed via PowerShell. Therefore, although we present each step individually, you may ultimately choose to write a PowerShell script to do them all in one step. Note that administrator privileges are required for these steps.

Note: The NIC is configured by default to enable RoCE. If the switch is not configured to enable ECN and/or PFC, performance will degrade. It is therefore recommended to either enable ECN on the switch or disable the *NetworkDirect registry key. For more information on how to enable ECN and PFC on the switch, refer to the https://enterprise-support.nvidia.com/docs/DOC-2855 community page.

Note: Because PFC applies flow control at the granularity of traffic priority, different types of network traffic must be assigned different priorities. In a RoCE configuration, all ND/NDK traffic is assigned to one or more chosen priorities, and PFC is enabled on those priorities.

Configuring a Windows host requires configuring QoS. To configure QoS, follow the procedure described in Configuring Quality of Service (QoS).

To use Global Pause (Flow Control) mode, disable QoS and Priority:

    PS $ Disable-NetQosFlowControl
    PS $ Disable-NetAdapterQos <interface name>

Go to: Device Manager --> Network adapters --> Mellanox ConnectX-4/ConnectX-5 Ethernet Adapter --> Properties --> Advanced tab

Set the ports that face the hosts as trunk.

    (config)# interface et10
    (config-if-Et10)# switchport mode trunk

Set VID allowed on trunk port to match the host VID.

    (config-if-Et10)# switchport trunk allowed vlan 100

Set the ports that face the network as trunk.

    (config)# interface et20
    (config-if-Et20)# switchport mode trunk

Assign the relevant ports to LAG.

    (config)# interface et10
    (config-if-Et10)# dcbx mode ieee
    (config-if-Et10)# speed forced 40gfull
    (config-if-Et10)# channel-group 11 mode active

Enable PFC on ports that face the network.

    (config)# interface et20
    (config-if-Et20)# load-interval 5
    (config-if-Et20)# speed forced 40gfull
    (config-if-Et20)# switchport trunk native vlan tag
    (config-if-Et20)# switchport trunk allowed vlan 11
    (config-if-Et20)# switchport mode trunk
    (config-if-Et20)# dcbx mode ieee
    (config-if-Et20)# priority-flow-control mode on
    (config-if-Et20)# priority-flow-control priority 3 no-drop

    (config)# interface et10
    (config-if-Et10)# flowcontrol receive on
    (config-if-Et10)# flowcontrol send on





    (config)# interface et10
    (config-if-Et10)# dcbx mode ieee
    (config-if-Et10)# priority-flow-control mode on
    (config-if-Et10)# priority-flow-control priority 3 no-drop

The router uses the L3 DSCP value to mark the L2 PCP of egress traffic. The required mapping copies the three most significant bits of the DSCP into the PCP. This is the default behavior, and no additional configuration is required.
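As a worked illustration of this mapping: DSCP is a 6-bit field and PCP is a 3-bit field, so taking the three most significant bits amounts to a right shift by three. The function name below is hypothetical, and the sketch shows the arithmetic only.

```python
def dscp_to_pcp(dscp: int) -> int:
    """Map a 6-bit DSCP value to a 3-bit PCP using its three top bits."""
    if not 0 <= dscp <= 63:
        raise ValueError(f"DSCP must be in 0..63, got {dscp}")
    return dscp >> 3
```

With this mapping, DSCP values 24 through 31 all land on PCP 3, the priority that the switch examples in this section configure as no-drop.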

The captured PCP option from the Ethernet header of the incoming packet can be used to set the PCP bits on the outgoing Ethernet header.

RoCE mode is configured per adapter or per driver. If the RoCE mode key is set for the adapter, that value is used. Otherwise, the mode is taken from the per-driver key, which is shared between all devices in the system.

Note: The supported RoCE modes depend on the firmware installed. If the firmware does not support the requested mode, the driver falls back to the maximum RoCE mode supported by the installed NIC.
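The selection rules above (per-adapter key first, then the per-driver key, then fallback to the highest mode the firmware supports) can be sketched as follows. The enum values only express an ordering for the sake of the example and are not the numeric values of the roce_mode registry key. All names here are hypothetical.

```python
from enum import IntEnum
from typing import Iterable, Optional


class RoceMode(IntEnum):
    # Illustrative ordering only; NOT the registry values of roce_mode.
    V1 = 1
    V2 = 2


def effective_roce_mode(
    per_adapter: Optional[RoceMode],
    per_driver: RoceMode,
    supported: Iterable[RoceMode],
) -> RoceMode:
    """Pick the RoCE mode the way the text above describes it."""
    supported_set = set(supported)
    if not supported_set:
        raise ValueError("firmware reports no supported RoCE mode")
    # The per-adapter key wins when it is set; otherwise use the per-driver key.
    requested = per_adapter if per_adapter is not None else per_driver
    if requested in supported_set:
        return requested
    # Fallback: the maximum mode supported by the installed firmware.
    return max(supported_set)
```

For example, an adapter with no key of its own inherits the per-driver value, while an adapter whose firmware supports only RoCE v1 ends up in RoCE v1 even if RoCE v2 was requested.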

Note: RoCE is enabled by default. Configuring or disabling the RoCE mode can be done via the registry key.

To update it for a specific adapter using the registry key, set the roce_mode as follows:

Find the registry key index value of the adapter according to section Finding the Index Value of the Network Interface.

Set the roce_mode in the following path:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<IndexValue>

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\mlx5\Parameters\Roce

Note: For changes to take effect, restart the network adapter after changing this registry key.
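To show how the two registry locations differ, here is a small sketch that only builds the key paths as strings. It does not read or write the registry. The helper name is hypothetical, and the index value is assumed to be the four-digit subkey name (for example 0007) found by the lookup procedure referenced above.

```python
from typing import Optional

HKLM = "HKEY_LOCAL_MACHINE"
NET_CLASS_GUID = "{4d36e972-e325-11ce-bfc1-08002be10318}"
PER_DRIVER_KEY = HKLM + r"\SYSTEM\CurrentControlSet\Services\mlx5\Parameters\Roce"


def roce_mode_key(index_value: Optional[int] = None) -> str:
    """Return the per-adapter key path if an index is given, else the per-driver one."""
    if index_value is None:
        return PER_DRIVER_KEY
    if not 0 <= index_value <= 9999:
        raise ValueError("index value must fit in four digits")
    # Adapter instances live under the network class GUID as 0000, 0001, ...
    return (HKLM + r"\SYSTEM\CurrentControlSet\Control\Class"
            + "\\" + NET_CLASS_GUID + "\\" + f"{index_value:04d}")
```

Calling `roce_mode_key(7)` yields the per-adapter path ending in `\0007`, while `roce_mode_key()` yields the per-driver path shared by all devices.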

The following are per-driver and will apply to all available adapters.