If you are using the current version of Cumulus Linux, the content on this page may not be up to date. The current version of the documentation is available here. If you are redirected to the main page of the user guide, then this page may have been renamed; please search for it there.

RDMA over Converged Ethernet - RoCE

RoCE enables you to write to compute or storage elements using RDMA over an Ethernet network instead of using host CPUs. RoCE relies on ECN and PFC to operate. Cumulus Linux supports features that can enable lossless Ethernet for RoCE environments.

While Cumulus Linux can support RoCE environments, the end hosts must support the RoCE protocol.

RoCE helps you obtain a converged network, where all services run over the Ethernet infrastructure, including Infiniband apps.

Default RoCE Mode Configuration

The following table shows the default RoCE configuration for lossy and lossless mode.

ConfigurationLossy ModeLossless Mode
Port trust modeYESYES
Port switch priority to traffic class mapping
  • Switch priority 3 to traffic class 3 (RoCE)
  • Switch priority 6 to traffic class 6 (CNP)
  • Other switch priority to traffic class 0
YESYES
Port ETS:
  • Traffic class 6 (CNP) - Strict
  • Traffic class 3 (RoCE) - WRR 50%
  • Traffic class 0 (Other traffic) - WRR 50%
YESYES
Port ECN absolute threshold is 1501500 bytes for traffic class 3 (RoCE)YESYES
LLDP and Application TLV (RoCE)
(UDP, Protocol:4791, Priority: 3)
YESYES
Enable PFC on switch priority 3 (RoCE)NOYES
Switch priority 3 allocated to RoCE lossless traffic poolNOYES

Enable RDMA over Converged Ethernet lossless (with PFC and ECN)

RoCE uses the Infiniband (IB) Protocol over converged Ethernet. The IB global route header rides directly on top of the Ethernet header. The lossless Ethernet layer handles congestion hop by hop.

To configure RoCE with PFC and ECN:

cumulus@switch:~$ nv set qos roce
cumulus@switch:~$ nv config apply

NVUE defaults to roce mode lossless. The command nv set qos roce and nv set qos roce mode lossless are equivalent.

If you enable mode lossy, configuring nv set qos roce without a mode does not change the RoCE mode. To change to lossless, you must configure mode lossless.

Link pause is another way to provide lossless ethernet; however, PFC is the preferred method. PFC allows more granular control by pausing the traffic flow for a given CoS group instead of the entire link.

Enable RDMA over Converged Ethernet lossy (with ECN)

RoCEv2 requires flow control for lossless Ethernet. RoCEv2 uses the Infiniband (IB) Transport Protocol over UDP. The IB transport protocol includes an end-to-end reliable delivery mechanism and has its own sender notification mechanism.

RoCEv2 congestion management uses RFC 3168 to signal congestion experienced to the receiver. The receiver generates an RoCEv2 congestion notification packet directed to the source of the packet.

To configure RoCE with ECN:

cumulus@switch:~$ nv set qos roce mode lossy
cumulus@switch:~$ nv config apply

Remove RoCE Configuration

To remove RoCE configurations:

cumulus@switch:~$ nv unset qos roce
cumulus@switch:~$ nv config apply

Verify RoCE Configuration

You can verify RoCE configuration with NVUE nv show commands.

To show detailed information about the configured buffers, utilization and DSCP markings, run the nv show qos roce command:

cumulus@switch:mgmt:~$ nv show qos roce
                    operational  applied   description
------------------  -----------  --------  ------------------------------------------------------
enable                           on        Turn the feature 'on' or 'off'.  The default is 'off'.
mode                lossless     lossless  Roce Mode
cable-length        100          100       Cable Length(in meters) for Roce Lossless Config
congestion-control
  congestion-mode   ECN                    Congestion config mode
  enabled-tc        0,3                    Congestion config enabled Traffic Class
  max-threshold     1.43 MB                Congestion config max-threshold
  min-threshold     146.48 KB              Congestion config min-threshold
pfc
  pfc-priority      3                      switch-prio on which PFC is enabled
  rx-enabled        enabled                PFC Rx Enabled status
  tx-enabled        enabled                PFC Tx Enabled status
trust
  trust-mode        pcp,dscp               Trust Setting on the port for packet classification
 
 
RoCE PCP/DSCP->SP mapping configurations
===========================================
        pcp  dscp                     switch-prio
    --  ---  -----------------------  -----------
    0   0    0,1,2,3,4,5,6,7          0
    1   1    8,9,10,11,12,13,14,15    1
    2   2    16,17,18,19,20,21,22,23  2
    3   3    24,25,26,27,28,29,30,31  3
    4   4    32,33,34,35,36,37,38,39  4
    5   5    40,41,42,43,44,45,46,47  5
    6   6    48,49,50,51,52,53,54,55  6
    7   7    56,57,58,59,60,61,62,63  7
 
 
RoCE SP->TC mapping and ETS configurations
=============================================
        switch-prio  traffic-class  scheduler-weight
    --  -----------  -------------  ----------------
    0   0            0              DWRR-50%
    1   1            0              DWRR-50%
    2   2            0              DWRR-50%
    3   3            3              DWRR-50%
    4   4            0              DWRR-50%
    5   5            0              DWRR-50%
    6   6            6              strict-priority
    7   7            0              DWRR-50%
 
 
RoCE pool config
===================
        name                   mode     size   switch-priorities  traffic-class
    --  ---------------------  -------  -----  -----------------  -------------
    0   lossy-default-ingress  Dynamic  50.0%  0,1,2,4,5,6,7      -
    1   roce-reserved-ingress  Dynamic  50.0%  3                  -
    2   lossy-default-egress   Dynamic  50.0%  -                  0,6
    3   roce-reserved-egress   Dynamic  inf    -                  3
 
 
Exception List
=================
        description
    --  -----------

To show detailed RoCE information about a single interface, run the nv show interface <interface> qos roce status command.

cumulus@switch:mgmt:~$ nv show interface swp16 qos roce status
                    operational    applied  description
------------------  -------------  -------  ---------------------------------------------------
congestion-control
  congestion-mode   ecn, absolute           Congestion config mode
  enabled-tc        0,3                     Congestion config enabled Traffic Class
  max-threshold     1.43 MB                 Congestion config max-threshold
  min-threshold     153.00 KB               Congestion config min-threshold
pfc
  pfc-priority      3                       switch-prio on which PFC is enabled
  rx-enabled        yes                     PFC Rx Enabled status
  tx-enabled        yes                     PFC Tx Enabled status
trust
  trust-mode        pcp,dscp                Trust Setting on the port for packet classification
mode                lossless                Roce Mode
 
 
RoCE PCP/DSCP->SP mapping configurations
===========================================
          pcp  dscp  switch-prio
    ----  ---  ----  -----------
    cnp   6    48    6
    roce  3    26    3
 
 
RoCE SP->TC mapping and ETS configurations
=============================================
          switch-prio  traffic-class  scheduler-weight
    ----  -----------  -------------  ----------------
    cnp   6            6              strict priority
    roce  3            3              dwrr-50%
 
 
RoCE Pool Status
===================
        name                   mode     pool-id  switch-priorities  traffic-class  size      current-usage  max-usage
    --  ---------------------  -------  -------  -----------------  -------------  --------  -------------  ---------
    0   lossy-default-ingress  DYNAMIC  2        0,1,2,4,5,6,7      -              15.16 MB  0 Bytes        16.00 MB
    1   roce-reserved-ingress  DYNAMIC  3        3                  -              15.16 MB  7.30 MB        7.90 MB
    2   lossy-default-egress   DYNAMIC  13       -                  0,6            15.16 MB  0 Bytes        16.01 MB
    3   roce-reserved-egress   DYNAMIC  14       -                  3              inf       7.29 MB        13.47 MB

To show detailed information about current buffer utilization as well as historic RoCE byte and packet counts, run the nv show interface <interface> qos roce counters command:

cumulus@switch:mgmt:~$ nv show interface swp16 qos roce counters
                               operational   applied  description
-----------------------------  ------------  -------  ------------------------------------------------------
rx-stats
  rx-non-roce-stats
    buffer-max-usage           144 Bytes              Max Ingress Pool-buffer usage for non-RoCE traffic
    buffer-usage               0 Bytes                Current Ingress Pool-buffer usage for non-RoCE traffic
    no-buffer-discard          55                     Rx buffer discards for non-RoCE traffic
    non-roce-bytes             56.52 MB               non-roce rx bytes
    non-roce-packets           462975                 non-roce rx packets
    pg-max-usage               144 Bytes              Max PG-buffer usage for non-RoCE traffic
    pg-usage                   0 Bytes                Current PG-buffer usage for non-RoCE traffic
  rx-pfc-stats
    pause-duration             0                      Rx PFC pause duration for RoCE traffic
    pause-packets              0                      Rx PFC pause packets for RoCE traffic
  rx-roce-stats
    buffer-max-usage           0 Bytes                Max Ingress Pool-buffer usage for RoCE traffic
    buffer-usage               0 Bytes                Current Ingress Pool-buffer usage for RoCE traffic
    no-buffer-discard          0                      Rx buffer discards for RoCE traffic
    pg-max-usage               0 Bytes                Max PG-buffer usage for RoCE traffic
    pg-usage                   0 Bytes                Current PG-buffer usage for RoCE traffic
    roce-bytes                 0 Bytes                Rx RoCE Bytes
    roce-packets               0                      Rx RoCE Packets
tx-stats
  tx-cnp-stats
    buffer-max-usage           16.02 MB               Max Egress Pool-buffer usage for CNP traffic
    buffer-usage               0 Bytes                Current Egress Pool-buffer usage for CNP traffic
    cnp-bytes                  0 Bytes                Tx CNP Packet Bytes
    cnp-packets                0                      Tx CNP Packets
    tc-max-usage               0 Bytes                Max TC-buffer usage for CNP traffic
    tc-usage                   0 Bytes                Current TC-buffer usage for CNP traffic
    unicast-no-buffer-discard  0                      Tx buffer discards for CNP traffic
  tx-ecn-stats
    ecn-marked-packets         693777677344           Tx ECN marked packets
  tx-pfc-stats
    pause-duration             0                      Tx PFC pause duration for RoCE traffic
    pause-packets              0                      Tx PFC pause packets for RoCE traffic
  tx-roce-stats
    buffer-max-usage           13.47 MB               Max Egress Pool-buffer usage for RoCE traffic
    buffer-usage               7.29 MB                Current Egress Pool-buffer usage for RoCE traffic
    roce-bytes                 92824.38 GB            Tx RoCE Packet bytes
    roce-packets               803785675319           Tx RoCE Packets
    tc-max-usage               16.02 MB               Max TC-buffer usage for RoCE traffic
    tc-usage                   7.29 MB                Current TC-buffer usage for RoCE traffic
    unicast-no-buffer-discard  663060754115           Tx buffer discards for RoCE traffic

To reset the counters that the nv show interface <interface> qos roce command displays, run the nv action clear interface <interface> qos roce counters command.

Change RoCE Configuration

You can adjust RoCE settings using NVUE after you enable RoCE. To change the memory allocation for RoCE lossless mode to 60 percent:

cumulus@switch:mgmt:~$ nv set qos traffic-pool default-lossy memory-percent 40
cumulus@switch:mgmt:~$ nv set qos traffic-pool roce-lossless memory-percent 60
cumulus@switch:mgmt:~$ nv config apply

To change the memory allocation of the RoCE lossy traffic pool to 60 percent and remap switch priority 4 to RoCE lossy traffic:

cumulus@switch:mgmt:~$ nv set qos traffic-pool default-lossy switch-priority 0-3,5-7
cumulus@switch:mgmt:~$ nv set qos traffic-pool roce-lossy memory-percent 60
cumulus@switch:mgmt:~$ nv set qos traffic-pool default-lossy memory-percent 40
cumulus@switch:mgmt:~$ nv set qos traffic-pool roce-lossy switch-priority 4
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 4 traffic-class 3
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 3 traffic-class 0
cumulus@switch:mgmt:~$ nv set qos mapping default-global trust both
cumulus@switch:mgmt:~$ nv set qos mapping default-global dscp 26 switch-priority 4
cumulus@switch:mgmt:~$ nv config apply

To change the RoCE lossless switch priority from switch priority 3 to switch priority 2:

cumulus@switch:mgmt:~$ nv set qos pfc default-global switch-priority 2
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 2 traffic-class 3
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 3 traffic-class 0
cumulus@switch:mgmt:~$ nv set qos mapping default-global trust both
cumulus@switch:mgmt:~$ nv set qos mapping default-global dscp 26 switch-priority 2