Networking
This switch configuration guide provides the steps necessary to configure Ethernet switches for optimal network performance in the EGX Reference Architecture. Optimal performance depends on configuring the RDMA over Converged Ethernet (RoCE) protocol properly, with correctly tuned buffer and congestion notification thresholds. This guide provides the commands needed to enable RoCE as well as the inter-switch LAG (Link Aggregation Group). It assumes two uplinks per server node to the leaf switches.
The guide includes both greenfield and brownfield deployment steps.
Why RoCE?
RDMA over Converged Ethernet (RoCE) is a standard protocol that enables efficient data transfer from the memory of a server/storage device to another server/storage device over Ethernet networks. RDMA is typically offloaded to a NIC with hardware RDMA engine implementation for higher throughput, lower latency, and much lower CPU utilization.
Example Cabling Diagram of Servers with Ethernet
Port Allocation Example
The following port allocation illustrates typical connectivity and is not prescriptive. It assumes a greenfield deployment with two leaf switches, where ports 1-16 on each switch are assigned to server downlinks. If using a single leaf switch, skip the peer-link connectivity and configuration between the switches. The peer-link should be sized for the expected inter-switch traffic, and expanded as needed, to ensure a fully non-blocking deployment.
Source | Port | Destination | Port | Role |
---|---|---|---|---|
compute01 | 1 | EGX-leaf01 | 1 | Downlink |
compute01 | 2 | EGX-leaf02 | 1 | Downlink |
compute02 | 1 | EGX-leaf01 | 2 | Downlink |
compute02 | 2 | EGX-leaf02 | 2 | Downlink |
compute03 | 1 | EGX-leaf01 | 3 | Downlink |
compute03 | 2 | EGX-leaf02 | 3 | Downlink |
compute04 | 1 | EGX-leaf01 | 4 | Downlink |
compute04 | 2 | EGX-leaf02 | 4 | Downlink |
EGX-leaf01 | 31 | EGX-leaf02 | 31 | Peer-link |
EGX-leaf01 | 32 | EGX-leaf02 | 32 | Peer-link |
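The allocation pattern in the table above can be generated for any number of nodes. The following is a minimal sketch, assuming the same naming scheme as the example (compute01, compute02, ...) and that uplink k of each node lands on leaf k at the switch port matching the node index. The function name `port_allocation` and its parameters are hypothetical and exist only for illustration.

```python
def port_allocation(num_nodes, leaves=("EGX-leaf01", "EGX-leaf02"), peer_ports=(31, 32)):
    """Return rows of (source, source_port, destination, destination_port, role)."""
    rows = []
    for n in range(1, num_nodes + 1):
        node = f"compute{n:02d}"
        # Uplink k of a node connects to leaf k, on the switch port equal to the node index.
        for uplink, leaf in enumerate(leaves, start=1):
            rows.append((node, uplink, leaf, n, "Downlink"))
    # Peer-link rows only make sense when there are two leaves.
    if len(leaves) == 2:
        for port in peer_ports:
            rows.append((leaves[0], port, leaves[1], port, "Peer-link"))
    return rows


if __name__ == "__main__":
    for row in port_allocation(4):
        print(" | ".join(str(cell) for cell in row))
```

Passing a single-element `leaves` tuple produces only downlink rows, which mirrors the guidance to skip the peer-link when one leaf switch is used.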
The following steps assume that the Ethernet switch is using the NVIDIA Cumulus Linux Network Operating System. The configuration is based on the port allocation example.
Configuration includes LAG and RoCE:
LAG is used to enable connectivity over multiple links between the leaves.
If using a single leaf switch, skip Step 1.A and start at Step 1.B.
RoCE is used to take advantage of the hardware RDMA engine offload on the NIC.
In the case of using one leaf switch, follow the steps for leaf01 only.
Greenfield
This section assumes the use of NVIDIA Cumulus Linux Network Operating System (NOS) version 4.4 and above. For brownfield deployments or any other deployment where an older version of Cumulus Linux is being used, it is recommended to install 4.4. For more information about update paths to Cumulus Linux 4.4, please see the Cumulus Linux documentation.
Paste the following commands, assuming ports swp31-32 connect the switches and ports 1-4 are downlinks connected to servers.
nv set interface bond0 bond member swp31-32
nv set interface swp1-4,bond0 bridge domain br_default vlan 111
nv set qos roce
nv config apply
Commands are subject to change. Make sure to check NVIDIA documentation for the latest commands.
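When the port ranges or VLAN differ from the example, the same command sequence can be built from parameters. The sketch below only assembles the command strings given in this section and does not run anything on a switch. The function name `greenfield_commands` is hypothetical, and the single-leaf variant (dropping the bond0 lines) is an assumption derived from the instruction to skip peer-link configuration, not a sequence taken from NVIDIA documentation.

```python
def greenfield_commands(downlinks="swp1-4", peer_links="swp31-32", vlan=111, single_leaf=False):
    """Build the NVUE command list from this guide for the given ports and VLAN."""
    commands = []
    bridge_members = downlinks
    if not single_leaf:
        # Bond the inter-switch ports and add the bond to the bridge with the downlinks.
        commands.append(f"nv set interface bond0 bond member {peer_links}")
        bridge_members = f"{downlinks},bond0"
    commands.append(f"nv set interface {bridge_members} bridge domain br_default vlan {vlan}")
    commands.append("nv set qos roce")
    commands.append("nv config apply")
    return commands


if __name__ == "__main__":
    print("\n".join(greenfield_commands()))
```

With the default arguments the output matches the four commands listed above, so the result can be reviewed before pasting it into the switch.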
Brownfield
To configure the switches, complete the following steps on each leaf:
Configure a LAG between the two switches for all ports connected between them.
Configure RoCE lossless on all ports:
Enable ECN marking.
For a fair sharing of switch buffer with other traffic classes, it is recommended to configure ECN on all other traffic classes.
Create a RoCE pool of 50% of memory.
Set QoS DSCP to 3.
Set a strict priority to CNPs over traffic class 6.
Optional: Enable DCBX LLDP
This is required if the adapter card relies on LLDP configuration in the switch for setting priority for PFC.
Confirm RoCE traffic is received from the server by checking the RoCE and QoS counters. For more information about the counters, see the Switch Validations section.
Confirm all downlinks to servers share the same VLAN.
Validate all interfaces are up with the command below.
nv show interface | grep up
Expected output:
+ lo up 65536 loopback IP Address: 127.0.0.1/8
+ swp1 up 9216 swp bridge.domain: br_default
+ swp2 up 9216 swp bridge.domain: br_default
+ swp3 up 9216 swp bridge.domain: br_default
+ swp31 up 9216 swp link.stats.carrier-transitions: 2
+ swp32 up 9216 swp link.stats.carrier-transitions: 2
+ swp4 up 9216 swp bridge.domain: br_default
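Checking this output by eye becomes error-prone as the port count grows. The following is a rough sketch that parses captured output and reports which expected interfaces are missing. It assumes one interface per line in the layout shown in this guide (optional leading `+`, then name, state, MTU); the helper names are hypothetical, and the output format may differ between Cumulus Linux releases.

```python
def interfaces_up(output):
    """Return the set of interface names whose state column reads 'up'."""
    up = set()
    for line in output.splitlines():
        # Drop the leading '+' marker and spaces, then split into columns.
        fields = line.lstrip("+ ").split()
        if len(fields) >= 2 and fields[1] == "up":
            up.add(fields[0])
    return up


def missing_interfaces(output, expected):
    """Return the expected interfaces that do not appear as 'up', sorted by name."""
    return sorted(set(expected) - interfaces_up(output))


# Sample text copied from the expected output in this guide.
SAMPLE = """\
+ lo up 65536 loopback IP Address: 127.0.0.1/8
+ swp1 up 9216 swp bridge.domain: br_default
+ swp2 up 9216 swp bridge.domain: br_default
+ swp3 up 9216 swp bridge.domain: br_default
+ swp31 up 9216 swp link.stats.carrier-transitions: 2
+ swp32 up 9216 swp link.stats.carrier-transitions: 2
+ swp4 up 9216 swp bridge.domain: br_default
"""

EXPECTED = ["swp1", "swp2", "swp3", "swp4", "swp31", "swp32"]

if __name__ == "__main__":
    print(missing_interfaces(SAMPLE, EXPECTED))
```

An empty list means every downlink and peer-link port from the port allocation example is up.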
Validate connectivity among compute nodes by using ping tests.
Validate RoCE configuration with the command below.
nv show qos roce
Expected output:
                    operational  applied   description
------------------  -----------  --------  ------------------------------------------------------
enable                           on        Turn the feature 'on' or 'off'. The default is 'off'.
mode                lossless     lossless  Roce Mode
cable-length        100          100       Cable Length(in meters) for Roce Lossless Config
congestion-control
  congestion-mode   ECN                    Congestion config mode
  enabled-tc        0,3                    Congestion config enabled Traffic Class
  max-threshold     1.43 MB                Congestion config max-threshold
  min-threshold     146.48 KB              Congestion config min-threshold
pfc
  pfc-priority      3                      switch-prio on which PFC is enabled
  rx-enabled        enabled                PFC Rx Enabled status
  tx-enabled        enabled                PFC Tx Enabled status
trust
  trust-mode        pcp,dscp               Trust Setting on the port for packet classification

RoCE PCP/DSCP->SP mapping configurations
===========================================
    pcp  dscp                     switch-prio
--  ---  -----------------------  -----------
0   0    0,1,2,3,4,5,6,7          0
1   1    8,9,10,11,12,13,14,15    1
2   2    16,17,18,19,20,21,22,23  2
3   3    24,25,26,27,28,29,30,31  3
4   4    32,33,34,35,36,37,38,39  4
5   5    40,41,42,43,44,45,46,47  5
6   6    48,49,50,51,52,53,54,55  6
7   7    56,57,58,59,60,61,62,63  7

RoCE SP->TC mapping and ETS configurations
=============================================
    switch-prio  traffic-class  scheduler-weight
--  -----------  -------------  ----------------
0   0            0              DWRR-50%
1   1            0              DWRR-50%
2   2            0              DWRR-50%
3   3            3              DWRR-50%
4   4            0              DWRR-50%
5   5            0              DWRR-50%
6   6            6              strict-priority
7   7            0              DWRR-50%

RoCE pool config
===================
    name                   mode     size   switch-priorities  traffic-class
--  ---------------------  -------  -----  -----------------  -------------
0   lossy-default-ingress  Dynamic  50.0%  0,1,2,4,5,6,7      -
1   roce-reserved-ingress  Dynamic  50.0%  3                  -
2   lossy-default-egress   Dynamic  50.0%  -                  0,6
3   roce-reserved-egress   Dynamic  inf    -                  3

Exception List
=================
    description
--  -----------
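The two mapping tables in this output follow a simple pattern: each block of eight DSCP values maps to one switch priority, and only switch priorities 3 and 6 get their own traffic class. The sketch below reproduces that pattern so a given DSCP value can be traced to its traffic class. It is derived only from the expected output shown here, the function names are hypothetical, and a switch with a customized QoS configuration may map values differently.

```python
def dscp_to_switch_prio(dscp):
    """Map a DSCP value (0-63) to a switch priority, per the PCP/DSCP->SP table."""
    if not 0 <= dscp <= 63:
        raise ValueError(f"DSCP must be in 0..63, got {dscp}")
    # Each run of eight consecutive DSCP values shares one switch priority.
    return dscp // 8


def switch_prio_to_traffic_class(switch_prio):
    """Map a switch priority (0-7) to a traffic class, per the SP->TC table."""
    if not 0 <= switch_prio <= 7:
        raise ValueError(f"switch priority must be in 0..7, got {switch_prio}")
    # Priorities 3 and 6 keep dedicated traffic classes; all others share class 0.
    return switch_prio if switch_prio in (3, 6) else 0


if __name__ == "__main__":
    for dscp in (0, 26, 48):
        sp = dscp_to_switch_prio(dscp)
        print(dscp, sp, switch_prio_to_traffic_class(sp))
```

For example, DSCP 26 falls in the 24-31 range, so it lands on switch priority 3 and traffic class 3, the lossless class on which PFC is enabled in the output above.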
Validate RoCE traffic is received from compute with the command below on each downlink. This example shows the counters on swp1.
sudo /usr/lib/cumulus/mlxcmd roce --port swp1 counters
The RX and TX packet counts should both be greater than 0. Example of expected output:
Port: swp1 (77056)
Rx:
RoCE PG packets: 121
RoCE PG bytes: 1147802
Tx:
RoCE TC packets: 152
RoCE TC bytes: 1441866
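Because this check is repeated on every downlink, it can help to script it against captured output. The following is a minimal sketch that extracts the four counters and applies the "greater than 0" rule. It assumes the counter labels appear exactly as in the example above; the function names are hypothetical, and the text is expected to be captured separately, since nothing here invokes the switch command.

```python
import re

# Matches labels such as "RoCE PG packets: 121" or "RoCE TC bytes: 1441866".
COUNTER_PATTERN = re.compile(r"(RoCE (?:PG|TC) (?:packets|bytes)):\s*(\d+)")


def parse_roce_counters(output):
    """Return a dict of counter label -> integer value found in the output text."""
    return {label: int(value) for label, value in COUNTER_PATTERN.findall(output)}


def roce_traffic_seen(output):
    """True when both the Rx (PG) and Tx (TC) packet counters are above zero."""
    counters = parse_roce_counters(output)
    return counters.get("RoCE PG packets", 0) > 0 and counters.get("RoCE TC packets", 0) > 0


# Sample text copied from the expected output in this guide.
SAMPLE_COUNTERS = (
    "Port: swp1 (77056) Rx: RoCE PG packets: 121 RoCE PG bytes: 1147802 "
    "Tx: RoCE TC packets: 152 RoCE TC bytes: 1441866"
)

if __name__ == "__main__":
    print(parse_roce_counters(SAMPLE_COUNTERS))
    print(roce_traffic_seen(SAMPLE_COUNTERS))
```

The pattern works whether the counters arrive on one line or several, so the captured text does not need to be reformatted first.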
Validate drops using What-Just-Happened with the command below.
what-just-happened poll
Confirm that no drops are reported for congestion reasons. For more information, see the What Just Happened documentation.
NVIDIA brings the best-accelerated computing experience to customers and accelerates the adoption of ML/AI applications in the enterprise. This deployment guide illustrated how to set up a high-performance multi-node cluster using EGX and NVIDIA Networking switches.