RDMA Over Converged Ethernet (RoCE)
RDMA over Converged Ethernet (RoCE) is a network protocol that leverages Remote Direct Memory Access (RDMA) capabilities to accelerate communications between applications hosted on clusters of servers and storage arrays. RoCE incorporates the IBTA RDMA semantics to allow devices to perform direct memory-to-memory transfers at the application level without involving the host CPU. Both the transport processing and the memory translation and placement are performed by the hardware, which enables lower latency, higher throughput, and better performance than software-based protocols.
RoCE traffic can take advantage of IP/Ethernet L3/L2 Quality of Service (QoS). Given some of the most prevalent use cases for RDMA technology (e.g. low latency, high bandwidth), the use of QoS becomes particularly relevant in a converged environment where RoCE traffic shares the underlying network with other TCP/UDP packets. In this regard, RoCE traffic is no different than other IP flows: QoS is achieved through proper configuration of relevant mechanisms in the fabric.
RoCE Packet Structure
The RoCE application determines the IP/Ethernet L3/L2 QoS configuration through the SL component of the Address Vector.
RoCE Congestion Management
RoCE Congestion Management (RCM) relies on the Explicit Congestion Notification (ECN) mechanism defined in RFC 3168 to signal congestion. ECN marks packets on their way to the destination; the destination then sends the congestion notification back to the source in a CNP packet, which causes the source to limit the packet injection rate of the relevant QP.
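The signaling loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the switch or NIC implementation: the linear marking ramp between two thresholds, the packet dictionaries, the function names, and the rate-halving reaction are all hypothetical choices made for the example. The source specifies neither the marking policy nor the rate-reduction algorithm.

```python
import random

CE = "CE"  # "Congestion Experienced" ECN codepoint from RFC 3168


def mark_probability(queue_depth, min_threshold, max_threshold):
    """Assumed marking policy: 0 below min, 1 above max, linear in between."""
    if queue_depth <= min_threshold:
        return 0.0
    if queue_depth >= max_threshold:
        return 1.0
    return (queue_depth - min_threshold) / (max_threshold - min_threshold)


def switch_forward(packet, queue_depth, min_threshold, max_threshold, rng=random):
    """Switch side: return a copy of the packet, CE-marked under congestion."""
    forwarded = dict(packet)
    if rng.random() < mark_probability(queue_depth, min_threshold, max_threshold):
        forwarded["ecn"] = CE
    return forwarded


def receiver_handle(packet):
    """Destination side: answer a CE-marked packet with a CNP for its QP."""
    if packet.get("ecn") == CE:
        return {"type": "CNP", "qp": packet["qp"]}
    return None


def sender_handle_cnp(qp_rates, cnp, factor=0.5):
    """Source side: cut the injection rate of the QP named in the CNP.

    Halving the rate is an illustrative assumption.
    """
    qp_rates[cnp["qp"]] *= factor
    return qp_rates


if __name__ == "__main__":
    rates = {7: 100.0}  # hypothetical QP number -> rate in Gb/s
    pkt = switch_forward({"qp": 7, "ecn": "ECT"}, queue_depth=2000,
                         min_threshold=150, max_threshold=1500)
    cnp = receiver_handle(pkt)
    if cnp is not None:
        sender_handle_cnp(rates, cnp)
    print(rates)
```

Only the QP named in the CNP is slowed down, which matches the per-QP rate limiting described in the text; other flows from the same source keep their rate.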
RDMA: Remote Direct Memory Access
RoCE: RDMA over Converged Ethernet
As with RoCE, the underlying networks for RoCEv2 should be configured as lossless. In this context, lossless does not mean that packets are absolutely never lost.
RCM: RoCE Congestion Management
ECN: Explicit Congestion Notification
CNP: Congestion Notification Packet
PFC: Priority Flow Control
Simplified RoCE configuration in NVIDIA Onyx lets the user select the RoCE configuration that best suits their use case. Either use the default RoCE mode, which is based on the NVIDIA recommended definitions, or use the advanced mode for a specific DCN and use case. RoCE can be configured in one of three modes: lossless, semi-lossless, and lossy.
RoCE Configuration Modes
Lossless: The optimal, most automated option and the default mode for the command, but it requires a lossless network (PFC).
In addition to the PFC control that exists in semi-lossless, it includes the following features:
Semi-lossless: Requires one-way PFC between the host and the ToR (the fabric remains lossy).
In addition to the elements common to all options, it includes the following:
Lossy: No PFC, but has the factors common to all modes.
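The difference between the three modes comes down to where PFC is required. The sketch below records that as a small Python data structure with a helper that picks a mode from what the network can support. The class, field, and function names are hypothetical and are not part of the Onyx CLI; the model deliberately ignores the direction of the one-way PFC in semi-lossless mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoceMode:
    name: str
    pfc_host_to_tor: bool  # PFC on the link between the host and the ToR
    pfc_in_fabric: bool    # PFC on the rest of the fabric


# PFC scope per mode, as described in the text above.
MODES = {
    "lossless": RoceMode("lossless", pfc_host_to_tor=True, pfc_in_fabric=True),
    "semi-lossless": RoceMode("semi-lossless", pfc_host_to_tor=True, pfc_in_fabric=False),
    "lossy": RoceMode("lossy", pfc_host_to_tor=False, pfc_in_fabric=False),
}


def select_mode(fabric_supports_pfc, host_link_supports_pfc):
    """Pick the strictest mode whose PFC requirements the network can meet.

    Hypothetical helper for illustration; the real choice also depends on
    the workload and deployment.
    """
    if fabric_supports_pfc and host_link_supports_pfc:
        return MODES["lossless"]
    if host_link_supports_pfc:
        return MODES["semi-lossless"]
    return MODES["lossy"]


if __name__ == "__main__":
    print(select_mode(fabric_supports_pfc=False, host_link_supports_pfc=True).name)
```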
The following configuration is used in each of the predefined modes:
Port trust mode L3
Port sw-prio-TC mapping
Port ECN absolute threshold 150-1500 TC 3 (RoCE)
LLDP + Application TLV (RoCE) (UDP, Protocol: 4791, Priority 3)
Enable PFC on sw-prio 3 (RoCE)
Prio 3 to roce lossless traffic pool
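To show how these parameters fit together, the following Python sketch traces a flow from the Application TLV entry (UDP, protocol ID 4791) to priority 3 and then to traffic class 3. It is an illustrative model only: the table and function names are hypothetical, and the default priority and traffic class of 0 for non-RoCE traffic are assumptions that do not come from the parameter list.

```python
ROCE_UDP_PORT = 4791

# Application TLV entries advertised over LLDP:
# (selector, protocol id) -> priority the host should use for that traffic.
APP_TLV = {("udp", ROCE_UDP_PORT): 3}

# Switch-priority to traffic-class mapping. Only the RoCE entry is taken from
# the parameter list; the fallback values below are assumptions.
PRIO_TO_TC = {3: 3}
DEFAULT_PRIORITY = 0
DEFAULT_TC = 0


def priority_for(selector, protocol_id):
    """Priority assigned to a flow according to the Application TLV table."""
    return APP_TLV.get((selector, protocol_id), DEFAULT_PRIORITY)


def traffic_class_for(priority):
    """Traffic class the switch uses for a given switch priority."""
    return PRIO_TO_TC.get(priority, DEFAULT_TC)


def classify(selector, protocol_id):
    """Return the priority and traffic class for a flow."""
    priority = priority_for(selector, protocol_id)
    return {"priority": priority, "traffic_class": traffic_class_for(priority)}


if __name__ == "__main__":
    print(classify("udp", 4791))  # RoCE traffic
    print(classify("tcp", 443))   # any other traffic
```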
The RoCE command sets the switch default values for several parameters, which are described in detail in the RoCE Parameters table above. Values that the user has changed for RoCE-related parameters are not overwritten when the RoCE command is executed.
Changing buffer configuration mode to "advanced buffer management" after configuring RoCE returns the buffer configuration to its default configuration.
For more information about this feature and its potential applications, please refer to the following community posts: