RDMA Over Converged Ethernet (RoCE)
RDMA over Converged Ethernet (RoCE) is a network protocol that leverages Remote Direct Memory Access (RDMA) capabilities to accelerate communications between applications hosted on clusters of servers and storage arrays. RoCE incorporates the IBTA RDMA semantics to allow devices to perform direct memory-to-memory transfers at the application level without involving the host CPU. Both the transport processing and the memory translation and placement are performed by the hardware, which enables lower latency and higher throughput than software-based protocols.
RoCE traffic can take advantage of IP/Ethernet L3/L2 Quality of Service (QoS). Given some of the most prevalent use cases for RDMA technology (e.g. low latency, high bandwidth), the use of QoS becomes particularly relevant in a converged environment where RoCE traffic shares the underlying network with other TCP/UDP packets. In this regard, RoCE traffic is no different than other IP flows: QoS is achieved through proper configuration of relevant mechanisms in the fabric.
RoCE Packet Structure
Configuration of IP/Ethernet L3/L2 QoS is determined by the RoCE application using the SL component in the Address Vector.
RoCE Congestion Management
RoCE Congestion Management (RCM) relies on the Explicit Congestion Notification (ECN) mechanism defined in RFC 3168 to signal congestion. ECN marks packets on their way to the destination; the congestion notification is then sent back to the source in a CNP packet, which causes the source to limit its packet injection rate for the relevant QP.
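The signaling loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the switch or adapter implementation: the ECN codepoints are the ones defined in RFC 3168, while the `QueuePair` class, the `on_cnp` method, and the fixed rate-halving factor are hypothetical names and values chosen only for the example. Real adapters use their own rate-control algorithm.

```python
from dataclasses import dataclass

# ECN codepoints from RFC 3168: the two low-order bits of the IP ToS byte.
NOT_ECT = 0b00  # sender is not ECN-capable
ECT_1 = 0b01    # ECN-capable transport, codepoint 1
ECT_0 = 0b10    # ECN-capable transport, codepoint 0
CE = 0b11       # congestion experienced

ECN_MASK = 0b11


def mark_congestion(tos: int) -> int:
    """Switch side: set CE on an ECN-capable packet, leave Not-ECT untouched."""
    if tos & ECN_MASK == NOT_ECT:
        return tos
    return tos | CE


def needs_cnp(tos: int) -> bool:
    """Receiver side: a CE-marked packet should trigger a CNP to the source."""
    return tos & ECN_MASK == CE


@dataclass
class QueuePair:
    """Hypothetical sender-side QP state, for illustration only."""
    qpn: int
    rate_gbps: float

    def on_cnp(self, factor: float = 0.5) -> None:
        # Assumption: a simple multiplicative decrease stands in for the
        # adapter's real rate-limiting behavior.
        self.rate_gbps *= factor
```

In this sketch, a packet sent with ToS `(26 << 2) | ECT_0` (DSCP 26 is an arbitrary example value) that passes through a congested switch comes out of `mark_congestion` with the CE codepoint set and the DSCP bits unchanged. The receiver sees `needs_cnp` return `True`, and the resulting CNP leads the sender to call `on_cnp` on the affected QP only.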
Definitions/Abbreviation
Definitions/Abbreviation | Description
RDMA | Remote Direct Memory Access
RoCE | RDMA over Converged Ethernet
Lossless Network | As with RoCE, the underlying networks for RoCEv2 should be configured as lossless. In this context, lossless does not mean that packets are absolutely never lost.
RCM | RoCE Congestion Management
ECN | Explicit Congestion Notification
CNP | Congestion Notification Packet
PFC | Priority Flow Control
Configuring simplified RoCE in NVIDIA Onyx allows the user to select the RoCE configuration that best suits their use case. To configure simplified RoCE, use either the default mode, which is based on the NVIDIA recommended definitions, or the advanced mode for specific DCN and use cases. There are three modes in which RoCE can be configured: lossless, semi-lossless, and lossy.
RoCE Configuration Modes
Options |
Functionality |
Lossless |
This is the most optimal and automated option and is the default mode for the command, but it requires a lossless network (PFC). In addition to the PFC control that exists in semi-lossless, it includes the following features:
|
Semi-lossless |
Requires a one-way PFC between the host and the ToR (the fabric will remain lossy). In addition to the elements common to all options, it includes the following:
|
Lossy |
No PFC, but has the factors common to all modes. |
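As a rough aid for choosing among the three modes, the PFC requirements listed in the table can be expressed as a small decision helper. This is only a sketch of the table's logic: `RoceMode`, `select_roce_mode`, and its two boolean parameters are hypothetical names invented for the example and are not part of the Onyx CLI.

```python
from enum import Enum


class RoceMode(Enum):
    LOSSY = "lossy"
    SEMI_LOSSLESS = "semi-lossless"
    LOSSLESS = "lossless"


def select_roce_mode(host_to_tor_pfc: bool, fabric_pfc: bool) -> RoceMode:
    """Pick the strictest mode whose PFC requirement the network can meet.

    host_to_tor_pfc: PFC can be enabled between the hosts and the ToR.
    fabric_pfc:      PFC can be enabled across the rest of the fabric.
    """
    if host_to_tor_pfc and fabric_pfc:
        # Lossless requires a lossless (PFC-enabled) network.
        return RoceMode.LOSSLESS
    if host_to_tor_pfc:
        # Semi-lossless needs PFC only on the host-to-ToR link;
        # the fabric itself remains lossy.
        return RoceMode.SEMI_LOSSLESS
    # Assumption: without host-to-ToR PFC, fall back to lossy.
    return RoceMode.LOSSY
```

For example, `select_roce_mode(host_to_tor_pfc=True, fabric_pfc=False)` returns `RoceMode.SEMI_LOSSLESS`, matching the case where the fabric remains lossy.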
The following configuration is used in each of the predefined modes:
RoCE Parameters
Parameters |
Lossy |
Semi-lossless |
Lossless |
Port trust mode L3 |
|
|
|
Port sw-prio-TC mapping
|
|
|
|
Port ETS
|
|
|
|
Port ECN absolute threshold 150-1500 TC 3 (RoCE) |
|
|
|
LLDP + Application TLV (RoCE) (UDP, Protocol: 4791, Priority 3) |
|
|
|
Enable PFC on sw-prio 3 (RoCE) |
|
|
|
Prio 3 to roce lossless traffic pool |
|
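To give a feel for what the ECN absolute threshold pair (150-1500) on TC 3 could mean in practice, the following sketch models marking probability as a function of queue depth. It assumes a RED-style linear ramp between the minimum and maximum thresholds, which is an assumption for illustration rather than a statement of how the switch computes marking. The function name `ecn_mark_probability` is hypothetical, and the queue depth is taken to be in the same (unstated) units as the thresholds.

```python
def ecn_mark_probability(
    queue_depth: float,
    min_threshold: float = 150.0,
    max_threshold: float = 1500.0,
) -> float:
    """Return the assumed probability of ECN-marking a packet.

    Below min_threshold nothing is marked, at or above max_threshold
    everything is marked, and in between the probability rises linearly
    (an assumed RED-style ramp).
    """
    if min_threshold >= max_threshold:
        raise ValueError("min_threshold must be smaller than max_threshold")
    if queue_depth <= min_threshold:
        return 0.0
    if queue_depth >= max_threshold:
        return 1.0
    return (queue_depth - min_threshold) / (max_threshold - min_threshold)
```

Under these assumptions, a queue depth halfway between the two thresholds (825) yields a marking probability of 0.5, so roughly every second RoCE packet on TC 3 would carry the congestion mark.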
The RoCE command sets the switch default values for several parameters, which are described in detail in the RoCE Parameters table above. RoCE-related parameters that the user has already changed are not overwritten when the RoCE command is executed.
Changing the buffer configuration mode to "advanced buffer management" after configuring RoCE resets the buffer configuration to its defaults.
For more information about this feature and its potential applications, please refer to the following community posts:
Recommended Network Configuration Examples for RoCE Deployment
How To Configure RoCEv2 for ConnectX-3 Pro Using SwitchX Switches
Lossless RoCE Configuration for Onyx Switches in DSCP-Based QoS Mode
How To Configure RoCE Over a Lossy Fabric (ECN) End-to-End Using ConnectX-4 and Spectrum (Trust L3)
How To Configure RoCE With ECN End-to-End Using ConnectX-4 and Spectrum (Trust L2)
RoCE Configuration for Onyx Switches in PCP-Based QoS Mode (Advanced Mode)
How To Configure Resilient RoCE End-to-End Using ConnectX-4 and Spectrum (No QoS)
Lossless RoCE Configuration for Onyx Switches in PCP-Based QoS Mode
Lossless RoCE Configuration for MLNX-OS Switches in DSCP-Based QoS Mode (Advanced Mode)