RDMA Over Converged Ethernet (RoCE)

RDMA over Converged Ethernet (RoCE) is a network protocol that leverages Remote Direct Memory Access (RDMA) capabilities to accelerate communications between applications hosted on clusters of servers and storage arrays. RoCE incorporates the IBTA RDMA semantics to allow devices to perform direct memory-to-memory transfers at the application level without involving the host CPU. Both the transport processing and the memory translation and placement are performed by hardware, which enables lower latency and higher throughput than software-based protocols.

RoCE traffic can take advantage of IP/Ethernet L3/L2 Quality of Service (QoS). Given some of the most prevalent use cases for RDMA technology (e.g. low latency, high bandwidth), the use of QoS becomes particularly relevant in a converged environment where RoCE traffic shares the underlying network with other TCP/UDP packets. In this regard, RoCE traffic is no different than other IP flows: QoS is achieved through proper configuration of relevant mechanisms in the fabric.

RoCE Packet Structure

image2019-9-3_15-22-4.png

Configuration of IP/Ethernet L3/L2 QoS is determined by the RoCE application using the SL component in the Address Vector.

RoCE Congestion Management

RoCE Congestion Management (RCM) relies on the Explicit Congestion Notification (ECN) mechanism defined in RFC 3168 to signal congestion. ECN marks packets on their way to the destination; the destination then reports the congestion back to the source in a CNP, which causes the source to limit the packet injection rate for the relevant QP.
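As a rough illustration of this signaling path, the following Python sketch walks a packet through ECN marking and the CNP reaction. The ECN codepoint values come from RFC 3168. The `Sender` class, the `cut_factor` of 0.5, the function names, and the example DSCP value are all assumptions made for the example; real adapters apply their own rate-control algorithm.

```python
from dataclasses import dataclass

# ECN codepoints from RFC 3168: the low two bits of the IPv4 ToS / IPv6 Traffic Class byte.
NOT_ECT = 0b00  # sender is not ECN-capable
ECT_1 = 0b01    # ECN-capable transport
ECT_0 = 0b10    # ECN-capable transport
CE = 0b11       # congestion experienced


def mark_congestion(tos: int) -> int:
    """Switch side: set CE on an ECN-capable packet; leave Not-ECT packets untouched."""
    if tos & 0b11 == NOT_ECT:
        return tos
    return tos | CE


@dataclass
class Sender:
    """Hypothetical source that tracks one injection rate per QP."""
    rates_gbps: dict
    cut_factor: float = 0.5  # assumed value, for illustration only

    def on_cnp(self, qp: int) -> None:
        # A CNP for this QP lowers only this QP's injection rate.
        self.rates_gbps[qp] *= self.cut_factor


def receive(tos: int, qp: int, sender: Sender) -> bool:
    """Destination side: a CE-marked packet triggers a CNP back to the source."""
    if tos & 0b11 == CE:
        sender.on_cnp(qp)
        return True
    return False


if __name__ == "__main__":
    sender = Sender({7: 100.0, 8: 100.0})
    tos = (26 << 2) | ECT_0  # DSCP 26 is an arbitrary example value
    receive(mark_congestion(tos), qp=7, sender=sender)
    print(sender.rates_gbps)  # only QP 7 has been slowed down
```

Only the QP named in the CNP is throttled, which matches the per-QP rate limiting described above; other QPs on the same port keep their rate.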

Definitions/Abbreviation

| Definitions/Abbreviation | Description |
|---|---|
| RDMA | Remote Direct Memory Access |
| RoCE | RDMA over Converged Ethernet |
| Lossless Network | As with RoCE, the underlying networks for RoCEv2 should be configured as lossless. In this context, lossless does not mean that packets are absolutely never lost. |
| RCM | RoCE Congestion Management |
| ECN | Explicit Congestion Notification |
| CNP | Congestion Notification Packet |
| PFC | Priority Flow Control |

Configuring simplified RoCE in NVIDIA Onyx allows the user to select the RoCE configuration that best suits their use case. To configure the simplified RoCE setting, use either the default RoCE mode, which is based on the NVIDIA-recommended definitions, or the advanced mode for specific DCN and use cases. RoCE can be configured in one of three modes: lossless, semi-lossless, and lossy.

RoCE Configuration Modes

Options

Functionality

Lossless

This is the optimal and most automated option and is the default mode for the command, but it requires a lossless network (PFC).

In addition to the PFC control that exists in semi-lossless, it includes the following features:

  • Adds a traffic pool for lossless traffic and maps switch priority 3 to it

  • Enables PFC on the RoCE priority (3) on all ports

Semi-lossless

Requires a one-way PFC between the host and the ToR (the fabric will remain lossy).

In addition to the elements common to all options, it includes the following:

  • Micro-burst absorption (pause rx compliant, no pause propagation).

Lossy

No PFC; includes only the elements common to all modes.

The following configuration is used in each of the predefined modes:

RoCE Parameters

| Parameters | Lossy | Semi-lossless | Lossless |
|---|---|---|---|
| Port trust mode L3 | image2019-9-11_11-56-37.png | image2019-9-11_11-56-37.png | image2019-9-11_11-56-37.png |
| Port sw-prio-TC mapping: sw-prio 3 to TC 3 (RoCE); sw-prio 6 to TC 6 (CNP); other sw-prio to TC 0 | image2019-9-11_11-56-31.png | image2019-9-11_11-56-34.png | image2019-9-11_11-56-37.png |
| Port ETS: TC 6 (CNP) strict; TC 3 (RoCE) WRR 50%; TC 0 (other traffic) WRR 50% | image2019-9-11_11-56-34.png | image2019-9-11_11-56-37.png | image2019-9-11_11-56-37.png |
| Port ECN absolute threshold 150-1500 TC 3 (RoCE) | image2019-9-11_11-56-34.png | image2019-9-11_11-56-37.png | image2019-9-11_11-56-37.png |
| LLDP + Application TLV (RoCE) (UDP, Protocol: 4791, Priority 3) | image2019-9-11_11-56-34.png | image2019-9-11_11-56-34.png | image2019-9-11_11-56-37.png |
| Enable PFC on sw-prio 3 (RoCE) | | image2019-9-11_11-56-37.png | image2019-9-11_11-56-37.png |
| Prio 3 to roce lossless traffic pool | | | image2019-9-11_11-56-37.png |
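To make the sw-prio-TC mapping and the ETS settings in the table more concrete, here is a minimal Python sketch of how packets would be classified and ordered under them. It is a toy model, not the switch implementation: the function names are hypothetical, the weighted round robin counts packets rather than bytes, and no new packets arrive while the queues drain.

```python
from collections import deque

STRICT_TCS = (6,)           # TC 6 (CNP): strict priority
WRR_WEIGHTS = {3: 1, 0: 1}  # TC 3 (RoCE) and TC 0 (other traffic): equal weights, 50% each


def traffic_class(sw_prio: int) -> int:
    """sw-prio 3 -> TC 3, sw-prio 6 -> TC 6, every other sw-prio -> TC 0."""
    return {3: 3, 6: 6}.get(sw_prio, 0)


def new_queues() -> dict:
    """One FIFO queue per traffic class used by the mapping."""
    return {tc: deque() for tc in (*STRICT_TCS, *WRR_WEIGHTS)}


def enqueue(queues: dict, sw_prio: int, packet: str) -> None:
    queues[traffic_class(sw_prio)].append(packet)


def drain(queues: dict) -> list:
    """Return packets in transmit order: strict TCs first, then weighted round robin."""
    order = []
    while any(queues.values()):
        # Strict traffic classes are emptied before any weighted class is served.
        for tc in STRICT_TCS:
            while queues[tc]:
                order.append(queues[tc].popleft())
        # One round of weighted round robin over the remaining classes.
        for tc, weight in WRR_WEIGHTS.items():
            for _ in range(weight):
                if queues[tc]:
                    order.append(queues[tc].popleft())
    return order


if __name__ == "__main__":
    queues = new_queues()
    for prio, pkt in [(0, "tcp-a"), (3, "roce-a"), (5, "tcp-b"), (6, "cnp-a"), (3, "roce-b")]:
        enqueue(queues, prio, pkt)
    print(drain(queues))  # ['cnp-a', 'roce-a', 'tcp-a', 'roce-b', 'tcp-b']
```

The CNP leaves first even though it was queued fourth, which is the point of giving TC 6 strict priority: congestion feedback is not delayed behind the traffic it is meant to slow down.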

Warning
  • The RoCE command sets the switch default values for several parameters, which are described in detail in the RoCE Parameters table above. RoCE-related parameters that the user has already changed are not overwritten when the RoCE command is executed.

  • Changing the buffer configuration mode to "advanced buffer management" after configuring RoCE returns the buffer configuration to its default configuration.

© Copyright 2023, NVIDIA. Last updated on May 23, 2023.