RoCE Overview

RDMA over Converged Ethernet (RoCE) is a network protocol that uses Remote Direct Memory Access (RDMA) to accelerate communication between applications hosted on clusters of servers and storage arrays. RoCE incorporates the IBTA RDMA semantics, which allow devices to perform direct memory-to-memory transfers at the application level without involving the host CPU. Both the transport processing and the memory translation and placement are performed in hardware, which gives lower latency and higher throughput than software-based protocols.

RoCE traffic can take advantage of IP/Ethernet L3/L2 Quality of Service (QoS). Because RDMA is most often used where low latency and high bandwidth are required, QoS is particularly relevant in a converged environment where RoCE traffic shares the underlying network with other TCP/UDP packets. In this regard, RoCE traffic is no different from other IP flows: QoS is achieved through proper configuration of the relevant mechanisms in the fabric.

RoCE Packet Structure

Configuration of IP/Ethernet L3/L2 QoS is determined by the RoCE application using the SL component in the Address Vector.

RoCE Congestion Management

RoCE Congestion Management (RCM) relies on the Explicit Congestion Notification (ECN) mechanism defined in RFC 3168 to signal congestion. ECN marks packets on their way to the destination. The destination then sends the congestion notification back to the source in a Congestion Notification Packet (CNP), which causes the source to limit the packet injection rate for the relevant QP.
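The feedback loop described above can be sketched in a few lines of Python. This is only an illustrative model, not the algorithm implemented by any NIC or switch: the `QueuePair` class, the function names, the halving factor, and the 1 Gb/s recovery step are all assumptions chosen to make the loop visible.

```python
from dataclasses import dataclass


@dataclass
class QueuePair:
    """Hypothetical sender-side state for one QP."""
    rate_gbps: float
    line_rate_gbps: float


def switch_marks_ecn(queue_depth: float, ecn_threshold: float) -> bool:
    """Switch side: mark the packet when the egress queue is congested."""
    return queue_depth > ecn_threshold


def receiver_sends_cnp(ecn_marked: bool) -> bool:
    """Receiver side: answer an ECN-marked packet with a CNP to the source."""
    return ecn_marked


def sender_update(qp: QueuePair, cnp_received: bool,
                  cut: float = 0.5, recover_gbps: float = 1.0) -> float:
    """Sender side: reduce the QP rate on a CNP, otherwise recover slowly.

    The cut factor and recovery step are illustrative assumptions.
    """
    if cnp_received:
        qp.rate_gbps *= cut
    else:
        qp.rate_gbps = min(qp.line_rate_gbps, qp.rate_gbps + recover_gbps)
    return qp.rate_gbps


def run(queue_depths, ecn_threshold: float = 150.0):
    """Feed a sequence of observed queue depths through the loop."""
    qp = QueuePair(rate_gbps=100.0, line_rate_gbps=100.0)
    rates = []
    for depth in queue_depths:
        marked = switch_marks_ecn(depth, ecn_threshold)
        cnp = receiver_sends_cnp(marked)
        rates.append(sender_update(qp, cnp))
    return rates
```

Calling `run([100, 200, 200, 100])` shows the rate staying at line rate while the queue is below the threshold, dropping on each CNP, and then creeping back up once marking stops.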




Remote Direct Memory Access


RDMA over Converged Ethernet

Lossless Network

As with RoCE, the underlying networks for RoCEv2 should be configured as lossless. In this context, lossless does not mean that packets are absolutely never lost.


RoCE Congestion Management


Explicit Congestion Notification


Congestion Notification Packet

PFC (Priority Flow Control)

Configuring RoCE

The simplified RoCE configuration in NVIDIA Onyx lets the user select the RoCE configuration that best suits their use case. Use either the default mode, which follows the NVIDIA-recommended definitions, or the advanced mode for specific DCN and use cases. RoCE can be configured in three modes: lossless, semi-lossless, and lossy.

RoCE Configuration Modes

Lossless: This is the optimal and most automated option and is the default mode for the command, but it requires a lossless network (PFC).

In addition to the PFC control that exists in semi-lossless, it includes the following features:

  • Adds a traffic pool for lossless traffic and maps switch priority 3 to it.
  • Enables PFC on the RoCE priority (3) on all ports.

Semi-lossless: Requires one-way PFC between the host and the ToR (the fabric remains lossy).

In addition to the elements common to all modes, it includes the following:

  • Micro-burst absorption (pause rx compliant, no pause propagation).

Lossy: No PFC, but includes the elements common to all modes.
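As a way of keeping the three modes apart, the sketch below records what each mode adds according to the descriptions above. The `RoceMode` class, its field names, and the selection rule in `pick_mode` are hypothetical and exist only for illustration; they are not part of the Onyx CLI or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoceMode:
    """Hypothetical summary of one RoCE configuration mode."""
    name: str
    pfc_host_to_tor: bool      # one-way PFC between host and ToR
    pfc_all_ports: bool        # PFC on the RoCE priority (3) on all ports
    lossless_pool: bool        # switch priority 3 mapped to a lossless pool


MODES = {
    "lossless": RoceMode("lossless", True, True, True),
    "semi-lossless": RoceMode("semi-lossless", True, False, False),
    "lossy": RoceMode("lossy", False, False, False),
}


def pick_mode(fabric_supports_pfc: bool, host_links_support_pfc: bool) -> RoceMode:
    """Assumed rule of thumb: use the most lossless mode the network allows."""
    if fabric_supports_pfc:
        return MODES["lossless"]
    if host_links_support_pfc:
        return MODES["semi-lossless"]
    return MODES["lossy"]
```

For example, `pick_mode(False, True).name` gives `"semi-lossless"`: PFC is available on the host links only, so the fabric stays lossy.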

The following configuration is used in each of the predefined modes:

RoCE Parameters

Port trust mode: L3

Port sw-prio-TC mapping:

  • sw-prio 3: TC 3 (RoCE)
  • sw-prio 6: TC 6 (CNP)
  • other sw-prio: TC 0

Port ETS:

  • TC 6 (CNP): strict
  • TC 3 (RoCE): WRR 50%
  • TC 0 (other traffic): WRR 50%

Port ECN absolute threshold: 150-1500 on TC 3 (RoCE)

LLDP + Application TLV (RoCE): UDP, Protocol: 4791, Priority 3

Enable PFC on sw-prio 3 (RoCE)

Map sw-prio 3 to the RoCE lossless traffic pool
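The parameters listed above can be modeled in a short Python sketch to show how they interact. All names here are hypothetical. Two behaviors are assumptions rather than statements from this page: that the strict TC is served first and the remaining bandwidth is split by the WRR weights when both weighted TCs are backlogged, and that ECN marking probability rises linearly between the two thresholds (a RED-style ramp). The threshold unit is not stated in the table, so the sketch leaves it unitless.

```python
# Values taken from the RoCE Parameters list; names are illustrative only.
SW_PRIO_TO_TC = {3: 3, 6: 6}          # every other sw-prio maps to TC 0
STRICT_TCS = (6,)                     # CNP traffic class
WRR_WEIGHTS = {3: 50, 0: 50}          # RoCE and "other" traffic classes
ECN_MIN, ECN_MAX = 150, 1500          # absolute thresholds for TC 3


def traffic_class(sw_prio: int) -> int:
    """Map a switch priority to its traffic class."""
    return SW_PRIO_TO_TC.get(sw_prio, 0)


def ets_shares(link_gbps: float, strict_load_gbps: float) -> dict:
    """Bandwidth per TC when both weighted TCs are fully backlogged.

    Assumption: the strict TC is served first, and the remainder is
    divided in proportion to the WRR weights.
    """
    strict = min(link_gbps, strict_load_gbps)
    remaining = link_gbps - strict
    total_weight = sum(WRR_WEIGHTS.values())
    shares = {tc: remaining * w / total_weight for tc, w in WRR_WEIGHTS.items()}
    shares[STRICT_TCS[0]] = strict
    return shares


def ecn_mark_probability(queue_depth: float) -> float:
    """Assumed linear marking ramp between the two ECN thresholds."""
    if queue_depth <= ECN_MIN:
        return 0.0
    if queue_depth >= ECN_MAX:
        return 1.0
    return (queue_depth - ECN_MIN) / (ECN_MAX - ECN_MIN)
```

With a 100 Gb/s link carrying 2 Gb/s of CNP traffic, `ets_shares(100.0, 2.0)` leaves 49 Gb/s each for TC 3 and TC 0 under these assumptions.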

  • The RoCE command sets the switch default values for several parameters, which are described in detail in the RoCE Parameters table above. RoCE-related parameters that the user has already changed are not overridden when the RoCE command is executed.
  • Changing the buffer configuration mode to "advanced buffer management" after configuring RoCE returns the buffer configuration to its default.
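The first note describes a "defaults that do not override user changes" rule. A minimal sketch of that rule follows, assuming the configuration can be viewed as key-value pairs. The function name, the keys, and the idea of tracking user-modified keys in a set are all hypothetical and are not taken from Onyx.

```python
def apply_roce_defaults(current: dict, defaults: dict, user_modified: set) -> dict:
    """Return a new config where defaults fill in only untouched parameters.

    Keys listed in user_modified keep their current values.
    """
    result = dict(current)
    for key, value in defaults.items():
        if key not in user_modified:
            result[key] = value
    return result


# Hypothetical example: the user has already tuned the lower ECN threshold.
current = {"trust_mode": "L2", "ecn_min": 200}
defaults = {"trust_mode": "L3", "ecn_min": 150, "ecn_max": 1500}
merged = apply_roce_defaults(current, defaults, user_modified={"ecn_min"})
```

Here `merged` takes the default trust mode and upper threshold but keeps the user's `ecn_min` of 200.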

RoCE Commands

Further Information

For more information about this feature and its potential applications, please refer to the following community posts: