RDG for a Scalable, High-performance Kubernetes Cluster over NVIDIA Ethernet Fabric

Created on Oct 27, 2024

This Reference Deployment Guide (RDG) provides a practical and scalable Ethernet fabric deployment suitable for high-performance workloads in K8s. This fabric is based on EVPN and provides both a primary K8s network (such as Calico) and a secondary high-performance network for RDMA/DPDK/GDR, using the NVIDIA Network Operator and NVIDIA GPU Operator.

The proposed fabric configuration supports up to 480 workload servers in its maximum scale and provides a non-blocking throughput of up to 200Gbps between pods.

The information presented is written for experienced system administrators, system engineers, and solution architects who want to deploy a scalable, high-performance Kubernetes cluster over an NVIDIA Ethernet fabric.

Term        Definition
----------  ----------------------------------------------------------------------
BGP         Border Gateway Protocol
LACP        Link Aggregation Control Protocol
CNI         Container Network Interface
LLDP        Link Layer Discovery Protocol
CUDA        Compute Unified Device Architecture
MLNX_OFED   NVIDIA OpenFabrics Enterprise Distribution for Linux (network driver)
DMA         Direct Memory Access
RDG         Reference Deployment Guide
DPDK        Data Plane Development Kit
RDMA        Remote Direct Memory Access
EVPN        Ethernet Virtual Private Network
SR-IOV      Single Root Input/Output Virtualization
EVPN-MH     EVPN Multi Homing
TOR         Top of Rack
GDR         GPUDirect RDMA
VLAN        Virtual LAN (Local Area Network)
GPU         Graphical Processing Unit
VRR         Virtual Router Redundancy
ISL         Inter-Switch Link
VTEP        Virtual Tunnel End Point
K8S         Kubernetes
VXLAN       Virtual Extensible LAN

Provisioning a Kubernetes cluster that is scalable and suitable for running high-performance applications can be an extremely complicated task: it requires weighing many considerations, such as the logical and fabric design, software and hardware component selection, deployment methods, and the performance optimizations needed to meet the workload's requirements.

This Reference Deployment Guide (RDG) provides a complete and practical solution for a scalable Ethernet fabric deployment suitable for high-performance workloads in K8s (Kubernetes). The solution is delivered on top of standard servers, while the NVIDIA end-to-end Ethernet infrastructure is used to handle the network communication.

In this guide, a single fabric provides both the primary K8s network and a secondary high-performance network, which handles the workload.

Using the NVIDIA Network Operator and NVIDIA GPU Operator, which deploy and configure the network and GPU components in the K8s cluster, practical examples show how to accelerate workloads with technologies like RDMA, DPDK, and GDR.

This document walks step by step through the solution implementation, beginning with the fabric design and K8s deployment, continuing with the actual deployment and configuration steps, and concluding with performance tests that demonstrate the benefits of the solution.

Key Components and Technologies

  • NVIDIA ConnectX SmartNICs

    10/25/40/50/100/200 and 400G Ethernet Network Adapters

    The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offer advanced hardware offloads and accelerations.

    NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.

  • NVIDIA LinkX Cables

    The NVIDIA® LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400GbE in Ethernet and 100, 200 and 400Gb/s InfiniBand products for Cloud, HPC, hyperscale, Enterprise, telco, storage and artificial intelligence, data center applications.

  • NVIDIA Spectrum Ethernet Switches

    Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.

    Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects.

    NVIDIA combines the benefits of NVIDIA Spectrum switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux , SONiC and NVIDIA Onyx®.

  • NVIDIA Cumulus Linux

    NVIDIA® Cumulus® Linux is the industry's most innovative open network operating system that allows you to automate, customize, and scale your data center network like no other.

  • Kubernetes

    Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.

  • Kubespray

    Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks and provides:

    • A highly available cluster
    • Composable attributes
    • Support for most popular Linux distributions

  • NVIDIA Network Operator

    The NVIDIA Network Operator simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster. The operator automatically installs the required host networking software - bringing together all the needed components to provide high-speed network connectivity. These components include the NVIDIA networking driver, Kubernetes device plugin, CNI plugins, IP address management (IPAM) plugin and others. The NVIDIA Network Operator works in conjunction with the NVIDIA GPU Operator to deliver high-throughput, low-latency networking for scale-out, GPU computing clusters.

  • NVIDIA GPU Operator

    The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPU. These components include the NVIDIA drivers (to enable CUDA), Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring, and more.

  • NVIDIA CUDA

    CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs. In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute-intensive portion of the application runs on thousands of GPU cores in parallel.

  • NVIDIA PMD

    NVIDIA Poll Mode Driver (PMD) is an open-source upstream driver embedded within dpdk.org releases. PMD is designed for fast packet processing and low latency by providing kernel bypass for receive and send, and avoiding the interrupt processing performance overhead.

  • TRex—Realistic Traffic Generator

    TRex is an open source stateful and stateless traffic generator fueled by DPDK. TRex generates layer 3 through 7 traffic and provides in one tool the capabilities provided by commercial tools. TRex can scale up to 200Gb/sec with one server.

  • RDMA

    RDMA is a technology that allows computers in a network to exchange data without involving the processor, cache or operating system of either computer.

    Like locally based DMA, RDMA improves throughput and performance and frees up compute resources.

  • GPUDirect RDMA

    GPUDirect (GDR) RDMA provides a direct P2P (Peer-to-Peer) data path between the GPU memory directly to and from NVIDIA host networking devices. This reduces GPU-to-GPU communication latency and completely offloads the CPU, removing it from all GPU-to-GPU communications across the network.

[Figure: GPUDirect RDMA data path]

Solution Design

The logical design includes the following components:

  • 1 x Deployment node running Kubespray that deploys the Kubernetes cluster
  • 1 x Master node running all Kubernetes management components
  • 4 x Worker nodes, each with 1 x ConnectX-7 NIC and 2 x NVIDIA A100 Tensor Core GPUs
  • Single, high speed Ethernet fabric for deployment, management, tenant and high speed networks

[Figure: K8s cluster, full view]
Info

This document covers a single K8s controller deployment scenario. For high-availability cluster deployment, refer to the Kubespray HA mode documentation (docs/operations/ha-mode.md in the kubernetes-sigs/kubespray GitHub repository).

Network / Fabric Design

The design of the fabric is based on a 2-level leaf-spine topology for workload and K8s management, and an additional IPMI network (not covered in this document).

This document demonstrates a minimal scale of one infrastructure/management rack and two workload racks with two servers each (as shown in the diagram below):

[Figure: Network/fabric design]

By using the same design, the fabric can scale to accommodate up to 480 workload servers using up to 30 workload racks with up to 16 servers each. Every workload rack uses a single leaf switch (TOR). The infrastructure rack consists of a highly-available border router (EVPN-MH) that provides a connection to an external gateway or router and to up to 15 additional servers.

The high-speed network consists of two logical segments:

  1. The management network and the primary K8s network (used by Calico) - VLAN10
  2. The secondary K8s network that provides RDMA to the pods - VLAN20

The fabric implements a VXLAN overlay network with a BGP EVPN control plane that enables you to "stretch" the VLANs across all the racks.
Every leaf switch has a VTEP that takes care of VXLAN encapsulation and decapsulation. Communication between the VTEPs is done by routing through the spines, controlled by a BGP control plane.
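
Later, once the switches are configured (see Fabric Configuration below), this EVPN control plane can be inspected from any leaf. The following is a minimal sketch, assuming the FRR routing suite bundled with Cumulus Linux; output is omitted:

Leaf Console

sudo vtysh -c "show bgp l2vpn evpn summary"   # BGP EVPN peerings toward the spines
sudo vtysh -c "show evpn vni"                 # locally configured VNIs
sudo vtysh -c "show evpn mac vni all"         # MAC addresses learned over VXLAN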

The infrastructure rack (as seen on the left in the illustration below) has two leaf switches that act as a highly available border router, providing both highly available connectivity for the infrastructure servers (the deployment server and the K8s master node) and redundant routing into and out of the cluster through a gateway node. This high availability is achieved by an EVPN-MH configuration, the use of LACP bonds, and a redundant router mechanism that uses VRR.

Below is a diagram demonstrating the maximum possible scale for a non-blocking deployment that uses 200GbE to the host (30 racks, 16 servers each using 16 spines and 32 leaves).

[Figure: Network fabric at maximum scale]
Info

With a maximum scale fabric (as shown above), there are 16 x 200Gbps links going up from each leaf to the spines and, therefore, a maximum of 16 x 200Gbps links going to servers in each rack.

Software Stack Components

[Figure: Software stack components]
Warning

Make sure to upgrade all the NVIDIA software components to their latest released version.

Bill of Materials

[Figure: Bill of materials]

Node and Switch Definitions

These are the definitions and parameters used for deploying the demonstrated fabric:

Spines

Hostname          Router ID     Autonomous System   Downlinks
spine1 (MSN3700)  10.0.0.1/32   65100               swp1-4
spine2 (MSN3700)  10.0.0.2/32   65100               swp1-4

Leaves

Hostname          Router ID       Autonomous System   Uplinks    Downlinks
leaf1a (MSN3700)  10.0.0.101/32   65101               swp31-32   swp1-3
leaf1b (MSN3700)  10.0.0.102/32   65102               swp31-32   swp1-3
leaf2 (MSN3700)   10.0.0.103/32   65103               swp31-32   swp1-2
leaf3 (MSN3700)   10.0.0.104/32   65104               swp31-32   swp1-2

Workload Server Ports

Hostname   Rack ID   Ports    Access VLAN   Trunk VLAN
leaf2      2         swp1-2   10            20
leaf3      3         swp1-2   10            20

Border Routers (Infrastructure Rack TORs)

Hostname   Segment MAC Address   df-preference
leaf1a     44:38:39:BE:EF:AA     50000
leaf1b     44:38:39:BE:EF:AA     50000

Border VLANs

VLAN ID   Virt MAC            Virt IP       First Router IP   Second Router IP
1         00:00:5e:00:01:01   10.1.0.1/24   10.1.0.2/24       10.1.0.3/24

Infrastructure Server Ports

Hostname         Ports   Bond    Access VLAN
leaf1a, leaf1b   swp1    bond1   1
leaf1a, leaf1b   swp2    bond2   10
leaf1a, leaf1b   swp3    bond3   10

Hosts

Rack                     Server Type        Server Name   Switch Port   IP and NICs                                            Default Gateway
Rack1 (Infrastructure)   External Gateway   gateway       swp1          bond0 (enp204s0f0np0, enp204s0f1np1), 10.1.0.254/24    -
Rack1 (Infrastructure)   Deployment Node    depserver     swp2          bond0 (enp203s0f0np0, enp203s0f1np1), 10.10.0.250/16   10.10.0.1
Rack1 (Infrastructure)   Master Node        node1         swp3          bond0 (enp203s0f0np0, enp203s0f1np1), 10.10.1.1/16     10.10.0.1
Rack2 (Workload)         Worker Node        node2         swp1          enp63s0f0np0, 10.10.1.2/16                             10.10.0.1
Rack2 (Workload)         Worker Node        node3         swp2          enp63s0f0np0, 10.10.1.3/16                             10.10.0.1
Rack3 (Workload)         Worker Node        node4         swp1          enp63s0f0np0, 10.10.1.4/16                             10.10.0.1
Rack3 (Workload)         Worker Node        node5         swp2          enp63s0f0np0, 10.10.1.5/16                             10.10.0.1


Wiring

This is the wiring principle for the workload racks:

  • Each server in the racks is wired to the leaf (or "TOR") switch

  • Every leaf is wired to all the spines

[Figure: Workload rack wiring]

This is the wiring principle for the infrastructure rack:

  • Each server in the racks is wired to two leaf switches (or "TORs")

  • Every leaf is wired to all the spines

[Figure: Infrastructure rack (master node) wiring]


Fabric Configuration

Updating Cumulus Linux

As a best practice, make sure to use the latest released Cumulus Linux NOS version.

For information on how to upgrade Cumulus Linux, refer to the Cumulus Linux User Guide.
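
To check which version is currently running before deciding on an upgrade, a quick check (assuming the NVUE CLI available on recent Cumulus Linux releases):

Switch Console

nv show system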

Configuring the Cumulus Linux Switch

Switch configuration is as follows:

Spine1 Console

nv set interface lo ip address 10.0.0.1/32
nv set interface swp1-4
nv set router bgp autonomous-system 65100
nv set router bgp router-id 10.0.0.1
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp1-4 peer-group underlay
nv set vrf default router bgp address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply

Spine2 Console

nv set interface lo ip address 10.0.0.2/32
nv set interface swp1-4
nv set router bgp autonomous-system 65100
nv set router bgp router-id 10.0.0.2
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp1-4 peer-group underlay
nv set vrf default router bgp address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply

Leaf1A Console

nv set interface lo ip address 10.0.0.101/32
nv set interface swp1-3,swp31-32
nv set interface bond1 bond member swp1
nv set interface bond2 bond member swp2
nv set interface bond3 bond member swp3
nv set interface bond1-3 bond lacp-bypass on
nv set interface bond1-3 link mtu 8950
nv set interface bond1-3 bridge domain br_default
nv set interface bond1 bridge domain br_default access 1
nv set interface bond2 bridge domain br_default access 10
nv set interface bond3 bridge domain br_default access 10
nv set interface vlan10 ip address 10.10.0.2/16
nv set interface vlan10 ip vrr address 10.10.0.1/16
nv set interface vlan10 ip vrr state up
nv set interface vlan1 ip address 10.1.0.2/24
nv set interface vlan1 ip vrr address 10.1.0.1/24
nv set interface vlan1 ip vrr state up
nv set vrf RED
nv set bridge domain br_default vlan 1 vni 1
nv set bridge domain br_default vlan 10 vni 10
nv set interface vlan1 ip vrf RED
nv set interface vlan10 ip vrf RED
nv set nve vxlan source address 10.0.0.101
nv set nve vxlan arp-nd-suppress on
nv set vrf RED evpn vni 4001
nv set evpn enable on
nv set router bgp autonomous-system 65101
nv set router bgp router-id 10.0.0.101
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31-32 peer-group underlay
nv set vrf default router bgp address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv set vrf RED router bgp autonomous-system 65101
nv set vrf RED router bgp router-id 10.0.0.101
nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
nv set vrf RED router static 0.0.0.0/0 via 10.1.0.254
nv set vrf RED router bgp address-family ipv4-unicast redistribute static
nv set evpn multihoming enable on
nv set interface bond1 evpn multihoming segment local-id 1
nv set interface bond2 evpn multihoming segment local-id 2
nv set interface bond3 evpn multihoming segment local-id 3
nv set interface bond1-3 evpn multihoming segment mac-address 44:38:39:BE:EF:AA
nv set interface bond1-3 evpn multihoming segment df-preference 50000
nv set interface swp31-32 evpn multihoming uplink on
nv config apply

Leaf1B Console

nv set interface lo ip address 10.0.0.102/32
nv set interface swp1-3,swp31-32
nv set interface bond1 bond member swp1
nv set interface bond2 bond member swp2
nv set interface bond3 bond member swp3
nv set interface bond1-3 bond lacp-bypass on
nv set interface bond1-3 link mtu 8950
nv set interface bond1-3 bridge domain br_default
nv set interface bond1 bridge domain br_default access 1
nv set interface bond2 bridge domain br_default access 10
nv set interface bond3 bridge domain br_default access 10
nv set interface vlan10 ip address 10.10.0.3/16
nv set interface vlan10 ip vrr address 10.10.0.1/16
nv set interface vlan10 ip vrr state up
nv set interface vlan1 ip address 10.1.0.3/24
nv set interface vlan1 ip vrr address 10.1.0.1/24
nv set interface vlan1 ip vrr state up
nv set vrf RED
nv set bridge domain br_default vlan 1 vni 1
nv set bridge domain br_default vlan 10 vni 10
nv set interface vlan1 ip vrf RED
nv set interface vlan10 ip vrf RED
nv set nve vxlan source address 10.0.0.102
nv set nve vxlan arp-nd-suppress on
nv set vrf RED evpn vni 4001
nv set evpn enable on
nv set router bgp autonomous-system 65102
nv set router bgp router-id 10.0.0.102
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31-32 peer-group underlay
nv set vrf default router bgp address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv set vrf RED router bgp autonomous-system 65102
nv set vrf RED router bgp router-id 10.0.0.102
nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
nv set vrf RED router static 0.0.0.0/0 via 10.1.0.254
nv set vrf RED router bgp address-family ipv4-unicast redistribute static
nv set evpn multihoming enable on
nv set interface bond1 evpn multihoming segment local-id 1
nv set interface bond2 evpn multihoming segment local-id 2
nv set interface bond3 evpn multihoming segment local-id 3
nv set interface bond1-3 evpn multihoming segment mac-address 44:38:39:BE:EF:AA
nv set interface bond1-3 evpn multihoming segment df-preference 50000
nv set interface swp31-32 evpn multihoming uplink on
nv config apply

Leaf2 Console

nv set interface lo ip address 10.0.0.103/32
nv set interface swp1-2,swp31-32
nv set interface swp1-2 link mtu 8950
nv set interface swp1-2 bridge domain br_default untagged 10
nv set interface swp1-2 bridge domain br_default vlan 20
nv set bridge domain br_default vlan 10,20
nv set interface vlan10 ip address 10.10.0.4/16
nv set interface vlan10 ip vrr address 10.10.0.1/16
nv set interface vlan10 ip vrr state up
nv set interface vlan20 vlan 20
nv set vrf RED
nv set bridge domain br_default vlan 10 vni 10
nv set bridge domain br_default vlan 20 vni 20
nv set interface vlan10 ip vrf RED
nv set interface vlan20 ip vrf RED
nv set nve vxlan source address 10.0.0.103
nv set nve vxlan arp-nd-suppress on
nv set vrf RED evpn vni 4001
nv set evpn enable on
nv set router bgp autonomous-system 65103
nv set router bgp router-id 10.0.0.103
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31-32 peer-group underlay
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv set vrf RED router bgp autonomous-system 65103
nv set vrf RED router bgp router-id 10.0.0.103
nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply

Leaf3 Console

nv set interface lo ip address 10.0.0.104/32
nv set interface swp1-2,swp31-32
nv set interface swp1-2 link mtu 8950
nv set interface swp1-2 bridge domain br_default untagged 10
nv set interface swp1-2 bridge domain br_default vlan 20
nv set bridge domain br_default vlan 10,20
nv set interface vlan10 ip address 10.10.0.5/16
nv set interface vlan10 ip vrr address 10.10.0.1/16
nv set interface vlan10 ip vrr state up
nv set interface vlan20 vlan 20
nv set vrf RED
nv set bridge domain br_default vlan 10 vni 10
nv set bridge domain br_default vlan 20 vni 20
nv set interface vlan10 ip vrf RED
nv set interface vlan20 ip vrf RED
nv set nve vxlan source address 10.0.0.104
nv set nve vxlan arp-nd-suppress on
nv set vrf RED evpn vni 4001
nv set evpn enable on
nv set router bgp autonomous-system 65104
nv set router bgp router-id 10.0.0.104
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31-32 peer-group underlay
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv set vrf RED router bgp autonomous-system 65104
nv set vrf RED router bgp router-id 10.0.0.104
nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply

To verify that the configuration is correct, check Validating the fabric below.

Connecting the Infrastructure Servers

Infrastructure servers (deployment and K8s master servers) are placed in the infrastructure rack.

This requires the following additional configuration steps:

  1. Add the ports connected to the servers to a bond.

  2. Place the bond in the relevant VLAN.

In the example, the servers connect to ports swp2 and swp3 on both leaf switches (Leaf1A and Leaf1B) and use VLAN10 created on the border leaf switches. The commands on both Leaf1A and Leaf1B are shown above.

The server side is configured with netplan (depserver as an example):

Depserver bond configuration

bonds:
  bond0:
    mtu: 8950
    addresses: [10.10.0.250/16]
    interfaces:
      - enp203s0f0np0
      - enp203s0f1np1
    parameters:
      mode: 802.3ad
      transmit-hash-policy: layer3+4
      mii-monitor-interval: 1
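
After editing the file, apply it and confirm that the LACP bond has formed; the kernel bonding driver exposes its state under /proc (a quick check; output omitted):

Depserver Console

$ sudo netplan apply
$ cat /proc/net/bonding/bond0    # expect 802.3ad mode, MII status up, and both member NICs listed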

Connecting an External Gateway to the Infrastructure Rack

The example connects an external gateway machine (10.1.0.254/24) over an LACP bond to swp1 of both border leaf switches (through VLAN1).

This gateway is used to access any external network (such as the Internet). The configuration commands on both border leaf switches are shown above.

This document does not provide the gateway configuration.

Host Configuration

Warning

Make sure that the BIOS settings on the worker node servers have SR-IOV enabled and that the servers are tuned for maximum performance.

Warning

All worker nodes must have the same PCIe placement for the NIC and must show the same interface name.

Network Configuration on the Worker Nodes

Set an IP address for network connectivity to the servers and set the MTU on the physical port of the server to allow for optimized throughput.

The fabric uses a VXLAN overlay; therefore, use the maximum MTU of 9216 on the core links (leaf-to-spine links) and an MTU of 8950 on the edge links (server links), so that the VXLAN header added to the packets does not cause fragmentation. The VXLAN encapsulation adds roughly 50 bytes of outer Ethernet, IP, UDP, and VXLAN headers, so 8950-byte frames fit comfortably within the 9216-byte core MTU.

In addition, you need to configure the gateway for the high-speed network: the VRR address (10.10.0.1/16).

To configure the IP, MTU, and gateway on the server ports, edit the netplan configuration file. The following example is for node2; repeat on all the worker nodes with the appropriate address:

Node2 netplan configuration

network:
  ethernets:
    enp63s0f0np0:
      dhcp4: false
      mtu: 8950
      addresses: [10.10.1.2/16]
      routes:
        - to: default
          via: 10.10.0.1
  version: 2

Apply the configuration:

Worker Node Console


root@node2:~# netplan apply
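
To confirm the settings took effect, a few quick checks can be run on the node (a minimal sketch; the 8922-byte ping payload is 8950 minus the 28 bytes of IP and ICMP headers):

Worker Node Console

root@node2:~# ip -br addr show enp63s0f0np0      # expect 10.10.1.2/16
root@node2:~# ip link show enp63s0f0np0          # expect mtu 8950
root@node2:~# ping -M do -s 8922 -c 3 10.10.0.1  # jumbo ping to the VRR gateway without fragmentation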


Installing and Updating the OS

Make sure the Ubuntu Server 22.04 operating system, including the OpenSSH server packages, is installed on all servers, and create a non-root user account with passwordless sudo privileges.

Also, make sure to assign the correct network configuration to the hosts (IP addresses, default gateway, DNS server, NTP server) and to create bonds on the nodes in the infrastructure rack (master node and deployment node).

Update the Ubuntu software packages with the following commands:

Update Ubuntu software

# apt-get update -y
# apt-get upgrade -y
# reboot


Non-root User Account Prerequisites

The example creates a non-root user account with sudo privileges without a password (on each of the hosts in the deployment):

Server Console

# sed -i 's/SHELL=\/bin\/sh/SHELL=\/bin\/bash/g' /etc/default/useradd
# useradd -mG 27 user
# passwd user
# su - user
$ echo "$USER ALL=(ALL:ALL) NOPASSWD: ALL" | sudo tee "/etc/sudoers.d/$USER"
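
To confirm passwordless sudo works for the new account, a quick check (sudo's -n flag fails instead of prompting if a password would be required):

Server Console

$ sudo -n true && echo "passwordless sudo OK"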


Installing rdma-core and Setting RDMA to "Exclusive Mode"

On all the worker nodes, install the rdma-core package:

Worker Node Console


# apt install rdma-core -y

Set netns to exclusive mode to provide namespace isolation on the high-speed interface. With this configuration, each pod can only see and access its own virtual functions.

Create the following file:

Worker Node Console

# vi /etc/modprobe.d/ib_core.conf

# Set netns to exclusive mode for namespace isolation
options ib_core netns_mode=0

Run the following commands:

Worker Node Console

# update-initramfs -u
# reboot

After the node comes back, check that netns mode is set to exclusive mode:

Worker Node Console

# rdma system
netns exclusive copy-on-fork on

Info

You can use the "rdma link" command to identify the name assigned to the high-speed interface, for example:

# rdma link

link rocep63s0f0/1 state ACTIVE physical_state LINK_UP netdev enp63s0f0np0


NIC Firmware Upgrade

NVIDIA recommends that you upgrade the NIC firmware on the worker nodes to the latest released version.

Make sure to use the root account:

Worker Node Console


$ sudo su -

Download the "mlxup" utility to each worker node and install the latest firmware for the NIC (requires Internet connectivity; check the official download page):

Worker Node Console

# wget https://www.mellanox.com/downloads/firmware/mlxup/4.28.0/SFX/linux_x64/mlxup
# chmod 755 mlxup
# lspci | grep Mellanox
3f:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
3f:00.1 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
# ./mlxup -d 0000:3f:00.0 -u --online
# reboot

To verify that the firmware is updated, rerun the commands after reboot. The output looks similar to the following:

Worker Node Console

# ./mlxup -d 0000:3f:00.0 -u --online
Querying Mellanox devices firmware ...

Device #1:
----------

Device Type:      ConnectX7
Part Number:      MCX713106AC-VEA_Ax
Description:      NVIDIA ConnectX-7 HHHL Adapter Card; 200GbE; Dual-port QSFP112; PCIe 5.0 x16; Crypto Enabled; Secure Boot Enabled
PSID:             MT_0000000841
PCI Device Name:  0000:3f:00.0
Base GUID:        b83fd2030018683a
Base MAC:         b83fd218683a
Versions:         Current        Available
     FW           28.41.1000     28.41.1000
     PXE          3.7.0400       N/A
     UEFI         14.34.0012     N/A

Status: Up to date

The K8s cluster in this solution is installed using Kubespray with a non-root user account from the Deployment Node.

SSH Private Key and SSH Passwordless Login

Log into the Deployment Node as the deployment user (in this case, user) and create an SSH key pair to configure passwordless authentication:

Deployment Node Console

$ sudo su - user
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Created directory '/home/user/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:PaZkvxV4K/h8q32zPWdZhG1VS0DSisAlehXVuiseLgA user@depl-node
The key's randomart image is:
+---[RSA 2048]----+
|      ...+oo+o..o|
|      .oo   .o. o|
|     . .. . o  +.|
|   E  .  o +  . +|
|    .   S = +  o |
|     . o = + o  .|
|      . o.o +   o|
|       ..+.*. o+o|
|        oo*ooo.++|
+----[SHA256]-----+

Run the following commands to copy your SSH public key, such as ~/.ssh/id_rsa.pub, to all nodes in your deployment. The example shows node1 in the deployment.

Deployment Node Console

$ ssh-copy-id 10.10.1.1
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/user/.ssh/id_rsa.pub"
The authenticity of host '10.10.1.1 (10.10.1.1)' can't be established.
ECDSA key fingerprint is SHA256:uyglY5g0CgPNGDm+XKuSkFAbx0RLaPijpktANgXRlD8.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
user@10.10.1.1's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'user@10.10.1.1'"
and check to make sure that only the key(s) you wanted were added.
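
Rather than copying the key node by node, a short loop over this deployment's addresses distributes it to all five nodes (a sketch; adjust the address list to your inventory):

Deployment Node Console

$ for ip in 10.10.1.1 10.10.1.2 10.10.1.3 10.10.1.4 10.10.1.5; do ssh-copy-id "$ip"; done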

To verify that you have password-less SSH connectivity to all nodes in your deployment, run the following command:

Deployment Node Console


$ ssh user@10.10.1.1


Kubespray Deployment and Configuration

To install dependencies for running Kubespray with Ansible on the Deployment server, run following commands:

Deployment Node Console

$ cd ~
$ sudo apt -y install python3-pip jq python3.10-venv
$ git clone https://github.com/kubernetes-sigs/kubespray.git
$ cd kubespray
$ python3 -m venv .venv
$ source .venv/bin/activate
$ python3 -m pip install --upgrade pip
$ pip install -U -r requirements.txt
$ pip install ruamel-yaml

Create a new cluster configuration. The default folder for subsequent commands is ~/kubespray.

Replace the IP addresses below with the IP addresses of your nodes:

Deployment Node Console

$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(10.10.1.1 10.10.1.2 10.10.1.3 10.10.1.4 10.10.1.5)
$ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

The inventory/mycluster/hosts.yaml file is created.

Review and change the host configuration in the file. The following is an example for this deployment:

inventory/mycluster/hosts.yaml

$ vi inventory/mycluster/hosts.yaml

all:
  hosts:
    node1:
      ansible_host: 10.10.1.1
      ip: 10.10.1.1
      access_ip: 10.10.1.1
    node2:
      ansible_host: 10.10.1.2
      ip: 10.10.1.2
      access_ip: 10.10.1.2
    node3:
      ansible_host: 10.10.1.3
      ip: 10.10.1.3
      access_ip: 10.10.1.3
    node4:
      ansible_host: 10.10.1.4
      ip: 10.10.1.4
      access_ip: 10.10.1.4
    node5:
      ansible_host: 10.10.1.5
      ip: 10.10.1.5
      access_ip: 10.10.1.5
  children:
    kube_control_plane:
      hosts:
        node1:
    kube_node:
      hosts:
        node2:
        node3:
        node4:
        node5:
    etcd:
      hosts:
        node1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Note

In the example deployment, there is 1 master node (node1) and 4 worker nodes (node2-5) so configure the hosts.yaml to be as follows:

  • kube_control_plane: node1

  • kube_node: node2-5

  • etcd: node1

Review and change the cluster installation parameters in the inventory/mycluster/group_vars/all/all.yml and inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml files.

In the inventory/mycluster/group_vars/all/all.yml file, remove the comment from the following line to enable Kubelet to serve on a read-only API (for metrics exposure) with no authentication or authorization:

Deployment Node Console


$ sed -i 's/#\ kube_read_only_port:/kube_read_only_port:/g' inventory/mycluster/group_vars/all/all.yml

In the inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml file, set the value of kube_version to v1.29.0, set the container_manager to containerd, and make sure multi_networking is set to false - kube_network_plugin_multus: false (the plugin is installed later as part of NVIDIA network operator):

Deployment Node Console

$ vi inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

…
## Change this to use another Kubernetes version, e.g. a current beta release
kube_version: v1.29.0
…
## Container runtime
## docker for docker, crio for cri-o and containerd for containerd.
## Default: containerd
container_manager: containerd
…
# Setting multi_networking to true will install Multus: https://github.com/intel/multus-cni
kube_network_plugin_multus: false

In the inventory/mycluster/group_vars/all/etcd.yml file, set the etcd_deployment_type to host:

Deployment Node Console

$ vi inventory/mycluster/group_vars/all/etcd.yml

...

## Settings for etcd deployment type
# Set this to docker if you are using container_manager: docker
etcd_deployment_type: host


Deploying the cluster using Kubespray Ansible Playbook

To start the deployment process, run the following command:

Deployment Node Console


$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

It takes a while for this deployment to complete. Make sure there are no errors.

A successful result looks similar to the following:

[Figure: Successful Kubespray playbook run]

Note

Now that the K8s cluster is deployed, connect to the K8s Master Node for the following sections and use the root account (where the K8s cluster credentials are stored).


K8s Deployment Verification

Below is example output from the deployed K8s cluster, using the default Kubespray configuration with the Calico CNI plugin.

To ensure that the K8s cluster is installed correctly, run the following commands:

Master Node Console

root@node1:~# kubectl get nodes -o wide

NAME    STATUS   ROLES           AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
node1   Ready    control-plane   2m8s   v1.29.0   10.10.1.1     <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.16
node2   Ready    <none>          93s    v1.29.0   10.10.1.2     <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.16
node3   Ready    <none>          92s    v1.29.0   10.10.1.3     <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.16
node4   Ready    <none>          93s    v1.29.0   10.10.1.4     <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.16
node5   Ready    <none>          93s    v1.29.0   10.10.1.5     <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.16

root@node1:~# kubectl get pods -n kube-system -o wide
NAME                                       READY   STATUS    RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
calico-kube-controllers-68485cbf9c-6sf4h   1/1     Running   0          62s     10.233.102.143   node1   <none>           <none>
calico-node-fxpxl                          1/1     Running   0          79s     10.10.1.2        node2   <none>           <none>
calico-node-k6qzp                          1/1     Running   0          79s     10.10.1.5        node5   <none>           <none>
calico-node-mh4pp                          1/1     Running   0          79s     10.10.1.4        node4   <none>           <none>
calico-node-mslh4                          1/1     Running   0          79s     10.10.1.3        node3   <none>           <none>
calico-node-ngnxx                          1/1     Running   0          79s     10.10.1.1        node1   <none>           <none>
coredns-69db55dd76-qq5mw                   1/1     Running   0          51s     10.233.75.23     node2   <none>           <none>
coredns-69db55dd76-qrl6q                   1/1     Running   0          54s     10.233.102.129   node1   <none>           <none>
dns-autoscaler-6f4b597d8c-5cmgz            1/1     Running   0          52s     10.233.102.130   node1   <none>           <none>
kube-apiserver-node1                       1/1     Running   1          2m15s   10.10.1.1        node1   <none>           <none>
kube-controller-manager-node1              1/1     Running   2          2m15s   10.10.1.1        node1   <none>           <none>
kube-proxy-2hfcg                           1/1     Running   0          98s     10.10.1.3        node3   <none>           <none>
kube-proxy-444mg                           1/1     Running   0          98s     10.10.1.2        node2   <none>           <none>
kube-proxy-52ctj                           1/1     Running   0          98s     10.10.1.4        node4   <none>           <none>
kube-proxy-7g9xv                           1/1     Running   0          98s     10.10.1.1        node1   <none>           <none>
kube-proxy-zg6t2                           1/1     Running   0          98s     10.10.1.5        node5   <none>           <none>
kube-scheduler-node1                       1/1     Running   1          2m14s   10.10.1.1        node1   <none>           <none>
nginx-proxy-node2                          1/1     Running   0          101s    10.10.1.2        node2   <none>           <none>
nginx-proxy-node3                          1/1     Running   0          101s    10.10.1.3        node3   <none>           <none>
nginx-proxy-node4                          1/1     Running   0          102s    10.10.1.4        node4   <none>           <none>
nginx-proxy-node5                          1/1     Running   0          102s    10.10.1.5        node5   <none>           <none>
nodelocaldns-7tnjx                         1/1     Running   0          52s     10.10.1.2        node2   <none>           <none>
nodelocaldns-qkm5t                         1/1     Running   0          52s     10.10.1.4        node4   <none>           <none>
nodelocaldns-rhd9g                         1/1     Running   0          52s     10.10.1.5        node5   <none>           <none>
nodelocaldns-tg5pm                         1/1     Running   0          52s     10.10.1.3        node3   <none>           <none>
nodelocaldns-wlwkn                         1/1     Running   0          52s     10.10.1.1        node1   <none>           <none>


NVIDIA Network Operator Installation

The NVIDIA Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components and to enable fast networking and RDMA for workloads in a K8s cluster. The fast network is a secondary network of the K8s cluster for applications that require high bandwidth or low latency.

You need to provision and configure several components. Perform all operator configuration and installation steps from the K8S master node with the root user account.

Prerequisites

Install helm on the K8S master node:

Master Node Console

# wget https://get.helm.sh/helm-v3.15.1-linux-amd64.tar.gz
# tar -zxvf helm-v3.15.1-linux-amd64.tar.gz
# mv linux-amd64/helm /usr/local/bin/helm

Label the worker nodes:

Master Node Console

# for i in $(seq 2 5); do kubectl label nodes node$i node-role.kubernetes.io/worker=; done
node/node2 labeled
node/node3 labeled
node/node4 labeled
node/node5 labeled
# kubectl get nodes
NAME    STATUS   ROLES           AGE   VERSION
node1   Ready    control-plane   12d   v1.29.0
node2   Ready    worker          12d   v1.29.0
node3   Ready    worker          12d   v1.29.0
node4   Ready    worker          12d   v1.29.0
node5   Ready    worker          12d   v1.29.0

Note

K8s Worker Node labeling is required for a proper installation of the NVIDIA Network Operator.


Deployment

Add the NVIDIA Network Operator Helm repository:

Master Node Console

# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
# helm repo update

Install the operator with custom values; use a configuration file to override some of the default values.

Generate the values.yaml file:

Master Node Console


# helm show values nvidia/network-operator --version v24.4.0 > values.yaml

Edit the values.yaml file to enable SR-IOV support, the secondary network for K8S pods, and to install the MLNX_OFED driver as part of the operator deployment (necessary for GDR):

values.yaml

...
nfd:
  enabled: true
...
sriovNetworkOperator:
  enabled: true
...

# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: true
  env:
  - name: UNLOAD_STORAGE_MODULES
    value: "true"
...

rdmaSharedDevicePlugin:
  deploy: false
...

sriovDevicePlugin:
  deploy: false
...

secondaryNetwork:
  deploy: true
  cniPlugins:
    deploy: true
  ...
  multus:
    deploy: true
  ...
  ipamPlugin:
    deploy: true

Deploy the operator:

Master Node Console


# helm install --wait network-operator nvidia/network-operator -n nvidia-network-operator --create-namespace --version v24.4.0 -f ./values.yaml

After deployment, the SRIOV Network Operator is configured, and SriovNetworkNodePolicy and SriovNetwork are deployed.

You can speed up the deployment by creating a SriovNetworkPoolConfig before you deploy the operator, setting the maxUnavailable parameter to 2 instead of 1 so that more than one node is drained at a time:

sriovnetwork-pool-config.yaml

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: worker
  namespace: nvidia-network-operator
spec:
  maxUnavailable: 2
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""

Apply the file:

Master Node Console


# kubectl apply -f sriovnetwork-pool-config.yaml

Create the configuration files and apply them.

sriovnetwork-node-policy.yaml configuration file example:

sriovnetwork-node-policy.yaml

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: nvidia-network-operator
spec:
  deviceType: netdevice
  mtu: 8950
  nicSelector:
    vendor: "15b3"
    pfNames: ["enp63s0f0np0"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriov_rdma

sriovnetwork.yaml configuration file example:

sriovnetwork.yaml

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: "sriov20"
  namespace: nvidia-network-operator
spec:
  vlan: 20
  spoofChk: "off"
  networkNamespace: "default"
  resourceName: "sriov_rdma"
  capabilities: '{ "mac": true }'
  ipam: |-
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.20.0/24"
    }
  metaPlugins: |
    {
      "type": "rdma"
    }

Apply the configuration files described above:

Master Node Console

# kubectl apply -f sriovnetwork-node-policy.yaml
# kubectl apply -f sriovnetwork.yaml
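
The operator now reconfigures the NICs on the selected nodes, which can take several minutes. Progress can be followed through the per-node state objects (a sketch, assuming the SriovNetworkNodeState CRD exposed by the SR-IOV Network Operator; syncStatus should eventually read Succeeded on every worker):

Master Node Console

# kubectl -n nvidia-network-operator get sriovnetworknodestates
# kubectl -n nvidia-network-operator get sriovnetworknodestates node2 -o jsonpath='{.status.syncStatus}'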

Wait for all required pods to be spawned:

Master Node Console

# kubectl get pod -n nvidia-network-operator
NAME                                                              READY   STATUS    RESTARTS   AGE
cni-plugins-ds-bqpc5                                              1/1     Running   0          8h
cni-plugins-ds-c98p7                                              1/1     Running   0          8h
cni-plugins-ds-jrxss                                              1/1     Running   0          8h
cni-plugins-ds-z65q4                                              1/1     Running   0          8h
kube-multus-ds-fdfpq                                              1/1     Running   0          8h
kube-multus-ds-kq6hr                                              1/1     Running   0          8h
kube-multus-ds-lw666                                              1/1     Running   0          8h
kube-multus-ds-nx5tb                                              1/1     Running   0          8h
mofed-ubuntu22.04-7d7f9f998-ds-47t7q                              1/1     Running   0          8h
mofed-ubuntu22.04-7d7f9f998-ds-8hsl8                              1/1     Running   0          8h
mofed-ubuntu22.04-7d7f9f998-ds-rhq7v                              1/1     Running   0          8h
mofed-ubuntu22.04-7d7f9f998-ds-vmjxr                              1/1     Running   0          8h
network-operator-5b75d4455d-tdgqm                                 1/1     Running   0          8h
network-operator-node-feature-discovery-master-568478db7d-k8l55  1/1     Running   0          8h
network-operator-node-feature-discovery-worker-8r94l              1/1     Running   0          8h
network-operator-node-feature-discovery-worker-bm6sm              1/1     Running   0          8h
network-operator-node-feature-discovery-worker-d67xg              1/1     Running   0          8h
network-operator-node-feature-discovery-worker-pnrn9              1/1     Running   0          8h
network-operator-node-feature-discovery-worker-rgfrg              1/1     Running   0          8h
network-operator-sriov-network-operator-6478f68965-tqlbb          1/1     Running   0          8h
sriov-device-plugin-2nz4d                                         1/1     Running   0          8h
sriov-device-plugin-8x64x                                         1/1     Running   0          8h
sriov-device-plugin-vw7mh                                         1/1     Running   0          8h
sriov-device-plugin-x4fnx                                         1/1     Running   0          8h
sriov-device-plugin-zxlc8                                         1/1     Running   0          8h
sriov-network-config-daemon-2w42j                                 1/1     Running   0          8h
sriov-network-config-daemon-4t7bb                                 1/1     Running   0          8h
sriov-network-config-daemon-fvl66                                 1/1     Running   0          8h
sriov-network-config-daemon-gvjgh                                 1/1     Running   0          8h
sriov-network-config-daemon-srbhs                                 1/1     Running   0          8h
whereabouts-87wmm                                                 1/1     Running   0          8h
whereabouts-kkg9q                                                 1/1     Running   0          8h
whereabouts-qk4v2                                                 1/1     Running   0          8h
whereabouts-trx2q                                                 1/1     Running   0          8h

Verify that a network attachment definition was created for the network and that the allocatable resources now include sriov_rdma, with a count equal to the number of VFs:

Master Node Console

# kubectl get net-attach-def
NAME      AGE
sriov20   13m

# kubectl describe net-attach-def sriov20
Name:         sriov20
Namespace:    default
Labels:       <none>
Annotations:  k8s.v1.cni.cncf.io/resourceName: nvidia.com/sriov_rdma
API Version:  k8s.cni.cncf.io/v1
Kind:         NetworkAttachmentDefinition
Metadata:
  Creation Timestamp:  2024-07-07T13:15:08Z
  Generation:          1
  Resource Version:    5071113
  UID:                 3da65cc7-eab6-4cc6-8a0a-0be000c5ea2d
Spec:
  Config:  { "cniVersion": "0.3.1", "name": "sriov20", "plugins": [ { "type": "sriov", "vlan": 20, "spoofchk": "off", "vlanQoS": 0, "capabilities": { "mac": true }, "logLevel": "info", "ipam": { "datastore": "kubernetes", "kubernetes": { "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig" }, "log_file": "/tmp/whereabouts.log", "log_level": "debug", "type": "whereabouts", "range": "192.168.20.0/24" } }, { "type": "rdma" } ] }

# for i in $(seq 2 5); do kubectl get node node$i -o json | jq '.status.allocatable."nvidia.com/sriov_rdma"'; done
"8"
"8"
"8"
"8"

NVIDIA GPU Operator Installation

The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision the GPU. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for the GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring, and others. For information on platform support and getting started, visit the official documentation repository.

Prerequisites

Install Helm on the K8S master node (done previously).

Deployment

Add the NVIDIA GPU Operator Helm repository (same as with Network Operator):

Master Node Console

# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
# helm repo update

Verify that NFD is running on the cluster (enabled through NVIDIA Network Operator). The output should be true for all the nodes:

Master Node Console

# kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'
true
true
true
true
true

Deploy the GPU Operator, enable GPUDirect RDMA, and disable the NFD plugin as it is already running in the cluster:

Master Node Console

# helm install --wait gpu-operator -n nvidia-gpu-operator --create-namespace nvidia/gpu-operator --set nfd.enabled=false --set driver.rdma.enabled=true
NAME: gpu-operator
LAST DEPLOYED: Wed Jun 19 10:40:35 2024
NAMESPACE: nvidia-gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Wait for all required pods to be spawned:

Master Node Console

# kubectl get pods -n nvidia-gpu-operator
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-2mx2x                1/1     Running     0          11m
gpu-feature-discovery-gz5lm                1/1     Running     0          7m23s
gpu-feature-discovery-vxfvp                1/1     Running     0          14m
gpu-feature-discovery-wfhhl                1/1     Running     0          4m19s
gpu-operator-7bbf8bb6b7-6mnrl              1/1     Running     0          20d
nvidia-container-toolkit-daemonset-cg4h6   1/1     Running     0          11m
nvidia-container-toolkit-daemonset-d9xr5   1/1     Running     0          7m23s
nvidia-container-toolkit-daemonset-fqx7n   1/1     Running     0          14m
nvidia-container-toolkit-daemonset-qj2rg   1/1     Running     0          4m19s
nvidia-cuda-validator-8nmqs                0/1     Completed   0          5m51s
nvidia-cuda-validator-dk9q2                0/1     Completed   0          13m
nvidia-cuda-validator-mtmn8                0/1     Completed   0          2m44s
nvidia-cuda-validator-zb9lc                0/1     Completed   0          9m45s
nvidia-dcgm-exporter-227m9                 1/1     Running     0          11m
nvidia-dcgm-exporter-7lptj                 1/1     Running     0          7m23s
nvidia-dcgm-exporter-7pfvv                 1/1     Running     0          4m19s
nvidia-dcgm-exporter-cmg9x                 1/1     Running     0          14m
nvidia-device-plugin-daemonset-njjc7       1/1     Running     0          14m
nvidia-device-plugin-daemonset-nnqgs       1/1     Running     0          11m
nvidia-device-plugin-daemonset-p2hqd       1/1     Running     0          4m19s
nvidia-device-plugin-daemonset-zqmbh       1/1     Running     0          7m23s
nvidia-driver-daemonset-2vc5m              2/2     Running     0          8m11s
nvidia-driver-daemonset-gst7x              2/2     Running     0          15m
nvidia-driver-daemonset-hpw6m              2/2     Running     0          12m
nvidia-driver-daemonset-xbm7n              2/2     Running     0          5m4s
nvidia-mig-manager-5nph5                   1/1     Running     0          7m23s
nvidia-mig-manager-84txd                   1/1     Running     0          14m
nvidia-mig-manager-clfzv                   1/1     Running     0          4m19s
nvidia-mig-manager-npl2x                   1/1     Running     0          11m
nvidia-operator-validator-4h5rc            1/1     Running     0          11m
nvidia-operator-validator-8krdh            1/1     Running     0          4m19s
nvidia-operator-validator-8m7nk            1/1     Running     0          14m
nvidia-operator-validator-g9qwj            1/1     Running     0          7m23s

Verify that the allocatable resources now include gpu, and that the NVIDIA kernel modules are loaded successfully on the worker nodes (in addition to the regular kernel modules, the nvidia-peermem kernel module must be loaded to enable GDR):

Master Node Console

# for i in $(seq 2 5); do kubectl get node node$i -o json | jq '.status.allocatable."nvidia.com/gpu"'; done
"2"
"2"
"2"
"2"
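
As an optional smoke test, a minimal pod that requests one GPU and runs nvidia-smi can confirm the stack end to end. This is a sketch: the CUDA base image tag is an assumption, and any CUDA image accessible to your cluster works.

Master Node Console

# cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed tag; any CUDA image works
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# kubectl logs gpu-smoke-test     # expect the nvidia-smi table listing one A100
# kubectl delete pod gpu-smoke-test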

Depserver Node Console

user@depserver:~/kubespray$ ansible -m shell -a "lsmod | grep nvidia" -i inventory/mycluster/hosts.yaml kube_node
node5 | CHANGED | rc=0 >>
nvidia_peermem         16384  0
nvidia_modeset       1343488  0
nvidia_uvm           4644864  4
nvidia              54018048  45 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core               434176  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm                   622592  7 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,ttm
node2 | CHANGED | rc=0 >>
nvidia_peermem         16384  0
nvidia_modeset       1343488  0
nvidia_uvm           4644864  4
nvidia              54018048  45 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core               434176  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm                   622592  7 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,ttm
node3 | CHANGED | rc=0 >>
nvidia_peermem         16384  0
nvidia_modeset       1343488  0
nvidia_uvm           4644864  4
nvidia              54018048  45 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core               434176  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm                   622592  7 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,ttm
node4 | CHANGED | rc=0 >>
nvidia_peermem         16384  0
nvidia_modeset       1343488  0
nvidia_uvm           4644864  4
nvidia              54018048  45 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_core               434176  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm                   622592  7 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,ttm

Verify the deployment and confirm that you can reach link-speed performance on the high-speed network by running the following tests:

  1. RDMA

  2. Iperf TCP

  3. GPUDirect RDMA

  4. DPDK

Each test is described thoroughly. At the end of each test, the achieved performance is reported, demonstrating link-speed results.

Note
  • Make sure that the servers are tuned for maximum performance (not covered in this document).

  • You must enable and configure IOMMU in passthrough mode (mainly relevant for Iperf and DPDK tests).

  • Make sure to implement each one of the optimizations described below to achieve maximum performance.

Optimizing Worker Nodes for Performance

Before starting the different tests and to accommodate performance-sensitive applications, optimize the worker nodes for better performance by enabling pod scheduling on exclusive cores that are mapped to the same NUMA node of the NIC. Also, enable IOMMU and set it to passthrough mode for better performance.

Configuring CPU and Topology Manager Policies

Drain the worker node and make the node unschedulable (the example uses node3):

Master Node Console


# kubectl drain node3 --delete-emptydir-data --force --ignore-daemonsets

On the worker node, make sure to use the root account:

Worker Node Console


$ sudo su -

Check to which NUMA node the NIC is wired:

Worker Node Console

# cat /sys/class/net/enp63s0f0np0/device/numa_node
0

In this example, the NIC is wired to NUMA node 0.

Check the NUMA nodes of the CPU and which cores are in NUMA node 0:

Worker Node Console

# lscpu | grep NUMA
NUMA node(s):       2
NUMA node0 CPU(s):  0-23
NUMA node1 CPU(s):  24-47

In this example, the cores that are in NUMA node 0 are: 0-23.

Configure the kubelet service on the worker node (using the kubelet-config.yaml file):

  • The "cpuManagerPolicy" attribute specifies the selected CPU manger policy (which can be either "none" or "static").

  • The "reservedSystemCPUs" attribute defines an explicit CPU set for OS system daemons and Kubernetes system daemons. (To move the system daemons and the Kubernetes daemons to the explicit CPU set defined by this option, use other mechanisms outside Kubernetes).

  • The "topologyManagerPolicy" attribute specifies the selected policy for the topology manager ("none", "best-effort", "restricted", or "single-numa-node").

Set the cpuManagerPolicy to static, which allows containers in Guaranteed pods with integer CPU requests access to exclusive CPUs on the node. Reserve some cores for the system using the reservedSystemCPUs option (kubelet requires a CPU reservation greater than zero to be made when the static policy is enabled), and make sure they belong to NUMA 1 (because the NIC in the example is wired to NUMA node 0, use cores from NUMA 0 if the NIC is wired to NUMA node 1). Also, define the topology to be single-numa-node so it only allows a pod to be admitted if all requested CPUs and devices can be allocated from exactly one NUMA node:

/etc/kubernetes/kubelet-config.yaml

...
# enable CPU Manager and Topology Manager
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 10s
reservedSystemCPUs: "44,45,46,47"
topologyManagerPolicy: single-numa-node
featureGates:
  CPUManager: true
...

When you change reservedSystemCPUs or cpuManagerPolicy, delete the /var/lib/kubelet/cpu_manager_state file and restart the kubelet service:

Worker Node Console

# systemctl stop kubelet
# rm -f /var/lib/kubelet/cpu_manager_state
# systemctl restart kubelet

Reenable scheduling on the specified worker node:

Master Node Console


# kubectl uncordon node3

To verify that the configuration is a success, schedule a pod in Guaranteed QoS class (make sure to schedule it on the specific node you just configured):

nginx.yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node3
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"

Apply the configuration file above and verify it is running on the desired node:

Master Node Console

# kubectl apply -f nginx.yaml
pod/nginx created
# kubectl get pods -o wide | grep nginx
nginx   1/1     Running   0          3m57s   10.233.71.34   node3   <none>   <none>

SSH to the worker node on which it is running and issue the following commands:

Worker Node Console

# ssh node3
# crictl ps | grep nginx
c0fa7b5edac77   fffffc90d343c   4 minutes ago   Running   nginx   0   325e71bdb981d   nginx
# crictl inspect c0fa7b5edac77 | jq '.status.resources.linux.cpusetCpus'
"0-1"

The output displays exclusive cores (0-1), which are located in NUMA node 0, the same NUMA node as the NIC in this example.

Enable IOMMU in Passthrough Mode

Modify the GRUB_CMDLINE_LINUX_DEFAULT parameter in the /etc/default/grub file and configure IOMMU in passthrough mode:

Worker Node Console

# vi /etc/default/grub
...
GRUB_CMDLINE_LINUX_DEFAULT="iommu=pt"
...

Apply the new configuration to grub by running the command below and reboot the host (remember to enable IOMMU on the BIOS in case it is disabled):

Worker Node Console

# update-grub
# reboot

When the worker nodes are back up, verify that the IOMMU configuration is applied correctly by running the following ansible command from the depserver:

Depserver Node Console

user@depserver:~/kubespray$ ansible -m shell -a "dmesg | grep 'type: Passthrough'" -i inventory/mycluster/hosts.yaml kube_node --become
node2 | CHANGED | rc=0 >>
[    3.051710] iommu: Default domain type: Passthrough (set via kernel command line)
node3 | CHANGED | rc=0 >>
[    3.067741] iommu: Default domain type: Passthrough (set via kernel command line)
node4 | CHANGED | rc=0 >>
[    3.174857] iommu: Default domain type: Passthrough (set via kernel command line)
node5 | CHANGED | rc=0 >>
[    3.190489] iommu: Default domain type: Passthrough (set via kernel command line)

Performance Tests

RoCE Bandwidth Test

    1. Create a test DaemonSet using the following YAML to create a pod on every node that you can use to test RDMA connectivity and performance over the high-speed network.

      Notice that the YAML includes an annotation referencing the required network ("sriov20") and resource requests for a single SR-IOV virtual function ("nvidia.com/sriov_rdma").

      The container image specified below must include NVIDIA user space drivers and perftest.

      example-daemon.yaml

      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: example-daemon
        labels:
          app: example-dae
      spec:
        selector:
          matchLabels:
            app: example-dae
        template:
          metadata:
            labels:
              app: example-dae
            annotations:
              k8s.v1.cni.cncf.io/networks: sriov20
          spec:
            containers:
            - image: <container_image>
              name: example-dae-pod
              securityContext:
                capabilities:
                  add: [ "IPC_LOCK" ]
              resources:
                limits:
                  memory: 16Gi
                  cpu: 8
                  nvidia.com/sriov_rdma: '1'
                requests:
                  memory: 16Gi
                  cpu: 8
                  nvidia.com/sriov_rdma: '1'
              command:
              - sleep
              - inf

    2. Apply the resource.

      Master Node Console

      # kubectl apply -f example-daemon.yaml

    3. Validate that the daemon set is running successfully. You should see four pods running, one on each worker node.

      Master Node Console

      # kubectl get pod -o wide
      NAME                   READY   STATUS    RESTARTS   AGE   IP              NODE    NOMINATED NODE   READINESS GATES
      example-daemon-l52tb   1/1     Running   0          6s    10.233.75.54    node2   <none>           <none>
      example-daemon-p7xt8   1/1     Running   0          6s    10.233.97.173   node5   <none>           <none>
      example-daemon-phmcm   1/1     Running   0          6s    10.233.71.20    node3   <none>           <none>
      example-daemon-pvjcs   1/1     Running   0          6s    10.233.74.86    node4   <none>           <none>

  1. Now that the test DaemonSet is running, run a performance test to check the RDMA performance between two pods running on two different worker nodes.

    1. Connect to one of the pods in the DaemonSet.

      Master Node Console

      # kubectl exec -it example-daemon-l52tb -- bash

    2. From within the container, check its IP address on the high-speed network interface (net1) and see that it is recognizable as an RDMA device.

      First pod console

      root@example-daemon-l52tb:/# ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host
             valid_lft forever preferred_lft forever
      2: eth0@if84: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
          link/ether ba:c6:9c:98:99:e0 brd ff:ff:ff:ff:ff:ff link-netnsid 0
          inet 10.233.75.54/32 scope global eth0
             valid_lft forever preferred_lft forever
          inet6 fe80::b8c6:9cff:fe98:99e0/64 scope link
             valid_lft forever preferred_lft forever
      65: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq state UP group default qlen 1000
          link/ether 0e:4a:41:e5:71:9c brd ff:ff:ff:ff:ff:ff
          inet 192.168.20.16/24 brd 192.168.20.255 scope global net1
             valid_lft forever preferred_lft forever
          inet6 fe80::c4a:41ff:fee5:719c/64 scope link
             valid_lft forever preferred_lft forever

      root@example-daemon-l52tb:/# rdma link
      link rocep63s0f0v4/1 state ACTIVE physical_state LINK_UP netdev net1

    3. Start the ib_write_bw server side.

      First pod console

      root@example-daemon-l52tb:/# ib_write_bw -a -F -q 4 --report_gbits

      ************************************
      * Waiting for client to connect... *
      ************************************

    4. Using another console window, reconnect to the master node and connect to the second pod in the deployment.

      Master Node Console

      # kubectl exec -it example-daemon-p7xt8 -- bash

    5. Repeat the steps above to verify that it has a high-speed network interface that is recognizable as an RDMA device.

  2. From within the container, start the ib_write_bw client (use the IP address from the server side container).

    Verify that the maximum bandwidth between containers reaches more than 190 Gb/s.

    Second pod console

    root@example-daemon-p7xt8:/# ib_write_bw -a -F -q 4 --report_gbits 192.168.20.16
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF          Device         : rocep63s0f0v7
     Number of qps   : 4            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : Ethernet
     GID index       : 3
     Max inline data : 0[B]
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0000 QPN 0x01cc PSN 0x5283ae RKey 0x048f07 VAddr 0x007f7f2b956000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:21
     local address: LID 0000 QPN 0x01cd PSN 0x9fcf00 RKey 0x048f07 VAddr 0x007f7f2c156000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:21
     local address: LID 0000 QPN 0x01ce PSN 0x76a44a RKey 0x048f07 VAddr 0x007f7f2c956000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:21
     local address: LID 0000 QPN 0x01cf PSN 0x7d0ed1 RKey 0x048f07 VAddr 0x007f7f2d156000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:21
     remote address: LID 0000 QPN 0x016c PSN 0xf537cb RKey 0x030f07 VAddr 0x007ffaae24a000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:16
     remote address: LID 0000 QPN 0x016d PSN 0x748d59 RKey 0x030f07 VAddr 0x007ffaaea4a000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:16
     remote address: LID 0000 QPN 0x016e PSN 0x1ba62f RKey 0x030f07 VAddr 0x007ffaaf24a000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:16
     remote address: LID 0000 QPN 0x016f PSN 0x8e9b52 RKey 0x030f07 VAddr 0x007ffaafa4a000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:16
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
     2          20000          0.045390           0.045313             2.832047
     4          20000          0.091348           0.091268             2.852139
     8          20000          0.18               0.18                 2.853169
     16         20000          0.37               0.36                 2.850147
     32         20000          0.73               0.73                 2.849719
     64         20000          1.46               1.46                 2.854070
     128        20000          2.93               2.93                 2.861795
     256        20000          5.85               5.85                 2.854513
     512        20000          11.62              11.61                2.833429
     1024       20000          33.96              25.34                3.093360
     2048       20000          67.56              57.63                3.517562
     4096       20000          134.61             119.12               3.635178
     8192       20000          192.58             187.25               2.857281
     16384      20000          195.09             191.90               1.464081
     32768      20000          193.87             193.85               0.739461
     65536      20000          194.66             194.65               0.371261
     131072     20000          195.18             195.18               0.186135
     262144     20000          193.91             191.18               0.091160
     524288     20000          195.69             195.69               0.046655
     1048576    20000          195.80             195.80               0.023341
     2097152    20000          195.84             195.84               0.011673
     4194304    20000          195.87             195.87               0.005837
     8388608    20000          195.88             195.88               0.002919
    ---------------------------------------------------------------------------------------

iperf TCP Test

  1. Create a test DaemonSet using the YAML from the previous example to create a pod on every node that you can use to test TCP connectivity and performance over the high-speed network. Note that the container image specified in the test must include iperf.

    Note

    The example test above uses an iperf3 version (3.16) that supports multi-threading and parallel client streams.

    If you are using an older version for your testing, start multiple iperf3 servers, each on a different port, and bind each one to a different core to achieve the best performance, as shown in the sketch below.
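
    For example, with an older iperf3 you could start several single-threaded servers, each pinned to its own core and listening on its own port. This is a sketch only; the core range, bind address, and port numbers below are illustrative:

      First Pod Console

      # for i in $(seq 0 7); do taskset -c $i iperf3 -s -B 192.168.20.18 -p $((5201 + i)) -D; done

    Matching clients would then connect to ports 5201-5208, one stream per server.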

    1. Connect to one of the pods in the DaemonSet.

      Master Node Console

      # kubectl exec -it example-daemon-tv626 -- bash

    2. From within the container, check its IP address on the high-speed network interface (net1).

      First pod console

      root@example-daemon-tv626:/# ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host
             valid_lft forever preferred_lft forever
      2: eth0@if90: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
          link/ether da:18:68:be:70:18 brd ff:ff:ff:ff:ff:ff link-netnsid 0
          inet 10.233.75.36/32 scope global eth0
             valid_lft forever preferred_lft forever
          inet6 fe80::d818:68ff:febe:7018/64 scope link
             valid_lft forever preferred_lft forever
      84: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq state UP group default qlen 1000
          link/ether 96:7f:6a:f0:54:5b brd ff:ff:ff:ff:ff:ff
          inet 192.168.20.18/24 brd 192.168.20.255 scope global net1
             valid_lft forever preferred_lft forever
          inet6 fe80::947f:6aff:fef0:545b/64 scope link
             valid_lft forever preferred_lft forever

    3. Start an iperf3 server listener (make sure to bind it to the high-speed interface).

      First Pod Console

      root@example-daemon-tv626:/# iperf3 -s -B 192.168.20.18
      -----------------------------------------------------------
      Server listening on 5201 (test #1)
      -----------------------------------------------------------

    4. Use another console window to reconnect to the master node and connect to the second pod in the deployment.

      Master Node Console

      # kubectl exec -it example-daemon-n7kc4 -- bash

    5. From within the container, check its IP address on the high-speed network interface (net1).

      Second pod console

      root@example-daemon-n7kc4:/# ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host
             valid_lft forever preferred_lft forever
      2: eth0@if84: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
          link/ether 86:14:2d:e7:80:73 brd ff:ff:ff:ff:ff:ff link-netnsid 0
          inet 10.233.74.65/32 scope global eth0
             valid_lft forever preferred_lft forever
          inet6 fe80::8414:2dff:fee7:8073/64 scope link
             valid_lft forever preferred_lft forever
      75: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq state UP group default qlen 1000
          link/ether ee:f1:64:93:54:c9 brd ff:ff:ff:ff:ff:ff
          inet 192.168.20.21/24 brd 192.168.20.255 scope global net1
             valid_lft forever preferred_lft forever
          inet6 fe80::ecf1:64ff:fe93:54c9/64 scope link
             valid_lft forever preferred_lft forever

  2. Start the iperf3 client and verify that it reaches above 190 Gb/s.

    Second pod console

    root@example-daemon-n7kc4:/# iperf3 -c 192.168.20.18 -B 192.168.20.21 -P 8 -t 30 -i 10
    Connecting to host 192.168.20.18, port 5201
    [  5] local 192.168.20.21 port 49789 connected to 192.168.20.18 port 5201
    [  7] local 192.168.20.21 port 42701 connected to 192.168.20.18 port 5201
    [  9] local 192.168.20.21 port 40905 connected to 192.168.20.18 port 5201
    [ 11] local 192.168.20.21 port 36625 connected to 192.168.20.18 port 5201
    [ 13] local 192.168.20.21 port 47409 connected to 192.168.20.18 port 5201
    [ 15] local 192.168.20.21 port 52191 connected to 192.168.20.18 port 5201
    [ 17] local 192.168.20.21 port 50621 connected to 192.168.20.18 port 5201
    [ 19] local 192.168.20.21 port 55499 connected to 192.168.20.18 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-10.01  sec  26.9 GBytes  23.1 Gbits/sec  449   2.61 MBytes
    [  7]   0.00-10.01  sec  32.3 GBytes  27.7 Gbits/sec  183   1.82 MBytes
    [  9]   0.00-10.01  sec  27.0 GBytes  23.1 Gbits/sec  451   2.96 MBytes
    [ 11]   0.00-10.01  sec  32.4 GBytes  27.8 Gbits/sec  305   2.93 MBytes
    [ 13]   0.00-10.01  sec  29.6 GBytes  25.4 Gbits/sec  171   1.85 MBytes
    [ 15]   0.00-10.01  sec  24.2 GBytes  20.7 Gbits/sec  170   3.05 MBytes
    [ 17]   0.00-10.01  sec  25.9 GBytes  22.3 Gbits/sec  74    2.98 MBytes
    [ 19]   0.00-10.01  sec  28.5 GBytes  24.5 Gbits/sec  453   2.61 MBytes
    [SUM]   0.00-10.01  sec   227 GBytes   195 Gbits/sec  2256
    - - - - - - - - - - - - - - - - - - - - - - - - -
    ...
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate         Retr
    [  5]   0.00-30.01  sec  80.4 GBytes  23.0 Gbits/sec  995    sender
    [  5]   0.00-30.01  sec  80.4 GBytes  23.0 Gbits/sec         receiver
    [  7]   0.00-30.01  sec  92.1 GBytes  26.4 Gbits/sec  637    sender
    [  7]   0.00-30.01  sec  92.1 GBytes  26.4 Gbits/sec         receiver
    [  9]   0.00-30.01  sec  82.4 GBytes  23.6 Gbits/sec  781    sender
    [  9]   0.00-30.01  sec  82.4 GBytes  23.6 Gbits/sec         receiver
    [ 11]   0.00-30.01  sec  95.2 GBytes  27.3 Gbits/sec  801    sender
    [ 11]   0.00-30.01  sec  95.2 GBytes  27.3 Gbits/sec         receiver
    [ 13]   0.00-30.01  sec  88.5 GBytes  25.3 Gbits/sec  580    sender
    [ 13]   0.00-30.01  sec  88.5 GBytes  25.3 Gbits/sec         receiver
    [ 15]   0.00-30.01  sec  81.1 GBytes  23.2 Gbits/sec  674    sender
    [ 15]   0.00-30.01  sec  81.1 GBytes  23.2 Gbits/sec         receiver
    [ 17]   0.00-30.01  sec  80.5 GBytes  23.1 Gbits/sec  691    sender
    [ 17]   0.00-30.01  sec  80.5 GBytes  23.1 Gbits/sec         receiver
    [ 19]   0.00-30.01  sec  82.8 GBytes  23.7 Gbits/sec  1049   sender
    [ 19]   0.00-30.01  sec  82.8 GBytes  23.7 Gbits/sec         receiver
    [SUM]   0.00-30.01  sec   683 GBytes   196 Gbits/sec  6208   sender
    [SUM]   0.00-30.01  sec   683 GBytes   196 Gbits/sec         receiver

    iperf Done.

GPUDirect RDMA Test

Note
  • Performing an optimal GPUDirect RDMA Benchmark test requires a server with PCIe Bridges. The network adapter and GPU used in this test must be located under the same PCIe Bridge device and associated with the same CPU NUMA Node.

    • You can use the "lspci -tv" command to display the device hierarchy and verify that the adapter and GPU PCI devices are hosted under the same PCIe Bridge.

    • You can use "lspci -vvv -s <PCI_Device_ID>" to identify the NUMA node associated with the adapter or GPU PCI devices.

  • In the servers used for this test, the Network-RDMA device (ConnectX-7) and GPU device (PCIe A100) share NUMA Node 0 and are connected under the same PCIe Bridge device.

  • For the GPUDirect RDMA benchmark test described in this section, the NVIDIA Network Operator and NVIDIA GPU Operator are installed with the appropriate drivers - MLNX_OFED and nvidia-peermem. Without them (and CUDA), GPUDirect RDMA does not work!

  • Some of the configurations applied in this section are not persistent; you must reapply the configuration after a server or instance reboot.
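
  As a quick sanity check, you can combine the two commands above for the adapter and GPU used in this example (3f:00.0 and 40:00.0 here; substitute your own PCI addresses):

  Worker Node Console

  # lspci -tv | grep -Ei "mellanox|nvidia"
  # lspci -vvv -s 3f:00.0 | grep "NUMA node"
  # lspci -vvv -s 40:00.0 | grep "NUMA node"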

  1. Prepare the setup for running GDR by applying the following steps on all of the worker nodes.

    1. Install the mstflint tool to be able to perform firmware changes on your adapter.

      Worker Node Console

      # apt install -y mstflint

    2. Set the Advanced PCI settings firmware parameter on your adapter to true and reboot the host for the change to take effect.

      Worker Node Console

      # mstconfig -d 3f:00.0 -y set ADVANCED_PCI_SETTINGS=1
      # reboot

    3. Increase the adapter’s maximum accumulated read requests and reboot the host.

      Worker Node Console

      # mstconfig -d 3f:00.0 -y set MAX_ACC_OUT_READ=44
      # reboot

      Note
      • The value of 44 maximum requests used in the example above is a best practice value for a 200Gb/s test over a server with a PCIe Gen4 CPU.

      • In some cases, you might have to increase the PCIe MaxReadReq size of the network device to 4KB using the setpci command to further optimize the bandwidth test results.

    4. Verify that the adapter firmware parameters have been applied.

      Worker Node Console

      # mstconfig -d 3f:00.0 q | egrep "ADVANCED_PCI_SETTINGS|MAX_ACC_OUT_READ"
               MAX_ACC_OUT_READ                    44
               ADVANCED_PCI_SETTINGS               True(1)

    5. Set the PCIe MaxReadReq size of the adapter to 4KB (the default is 512 bytes) and verify that the changes have been applied.

      Worker Node Console

      # setpci -s 3f:00.0 68.w=5000:f000
      # lspci -s 3f:00.0 -vvv | grep MaxReadReq
               MaxPayload 256 bytes, MaxReadReq 4096 bytes

    6. Disable ACS on all PCIe bridge devices in the system.

      Note
      • IO virtualization (also known as VT-d or IOMMU) can interfere with GPUDirect by redirecting all PCIe point-to-point traffic to the CPU root complex, causing a significant performance reduction or even a hang. Make sure that ACS is disabled on the PCIe bridges. A value of '+' after a flag means it is enabled, while '-' means it is disabled. Make sure all ACS flags are disabled.

      • In many server architectures, there are multiple chained PCIe Bridge devices serving a bulk of PCIe slots. The adapter and GPU might be connected to different sub devices in this PCIe bridge chain.

      • The provided script disables ACS on all PCIe Bridge devices in the system.

      • This step is not persistent and has to be re-applied every time you reboot the server.

      Check the adapter.

      Worker Node Console

      # lspci -s 3f:00.0 -vvv | grep ACSCtl
      ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

      If any of the flags is enabled ('+'), disable it by running the following command:

      Worker Node Console

      # setpci -s 3f:00.0 f2a.w=0000

      Run the following script to ensure that PCI Access Control Services (ACS) are disabled on all PCI bridges.

      Worker Node Console

      for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do
          # skip if it doesn't support ACS
          sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
          if [ $? -ne 0 ]; then
              continue
          fi
          sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
      done

  2. Set the GPU clock speed to the maximum value - the example uses A100 GPUs, whose maximum allowed clock is 1410 MHz. Use the nvidia-smi command in the driver containers.

    1. List all the driver pods.

      Master Node Console

      # kubectl get pods -n nvidia-gpu-operator | grep driver
      nvidia-driver-daemonset-8ngqz   2/2   Running   8 (5h20m ago)    13h
      nvidia-driver-daemonset-mt44z   2/2   Running   12 (5h15m ago)   13h
      nvidia-driver-daemonset-ncc8x   2/2   Running   11 (4h40m ago)   13h
      nvidia-driver-daemonset-nw52t   2/2   Running   8 (5h21m ago)    13h

    2. In each one of the pods, set the clock speed to 1410 MHz (the example uses indexes 0 and 1 because there are two GPUs in each worker).

      Master Node Console

      # kubectl exec -it nvidia-driver-daemonset-8ngqz -n nvidia-gpu-operator -- nvidia-smi -i 0 -lgc 1410
      GPU clocks set to "(gpuClkMin 1410, gpuClkMax 1410)" for GPU 00000000:3C:00.0
      All done.
      # kubectl exec -it nvidia-driver-daemonset-8ngqz -n nvidia-gpu-operator -- nvidia-smi -i 1 -lgc 1410
      GPU clocks set to "(gpuClkMin 1410, gpuClkMax 1410)" for GPU 00000000:40:00.0
      All done.
      ...

    3. Verify the new value is set correctly through the dcgm-exporter endpoint: query the service Cluster IP for the DCGM_FI_DEV_SM_CLOCK metric.

      Master Node Console

      # kubectl get svc -n nvidia-gpu-operator
      NAME                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
      gpu-operator           ClusterIP   10.233.29.218   <none>        8080/TCP   7d21h
      nvidia-dcgm-exporter   ClusterIP   10.233.24.139   <none>        9400/TCP   7d21h

      # for i in $(seq 1 4); do curl -s 10.233.24.139:9400/metrics | grep "DCGM_FI_DEV_SM_CLOCK{"; done
      DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-e4031089-4e6f-da56-ba01-fe13a26bd050",device="nvidia0",modelName="NVIDIA A100-PCIE-40GB",Hostname="node2",DCGM_FI_DRIVER_VERSION="550.54.15"} 1410
      DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-c25937b2-2a78-bff9-b213-2359d296300c",device="nvidia1",modelName="NVIDIA A100-PCIE-40GB",Hostname="node2",DCGM_FI_DRIVER_VERSION="550.54.15"} 1410
      DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-9f469254-473d-22ed-c524-57fe28c44f91",device="nvidia0",modelName="NVIDIA A100-PCIE-40GB",Hostname="node3",DCGM_FI_DRIVER_VERSION="550.54.15"} 1410
      DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-dd1fa1c5-fe82-35fb-4df5-55fc19793488",device="nvidia1",modelName="NVIDIA A100-PCIE-40GB",Hostname="node3",DCGM_FI_DRIVER_VERSION="550.54.15"} 1410
      DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-516c8879-9187-bebe-5705-a08a852fc1ba",device="nvidia0",modelName="NVIDIA A100-PCIE-40GB",Hostname="node5",DCGM_FI_DRIVER_VERSION="550.54.15"} 1410
      DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-b422c6d8-5b48-1a26-7d9a-d4a439cdf152",device="nvidia1",modelName="NVIDIA A100-PCIE-40GB",Hostname="node5",DCGM_FI_DRIVER_VERSION="550.54.15"} 1410
      DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-89373171-3f52-c598-ddc0-3b79a6e5cb17",device="nvidia0",modelName="NVIDIA A100-PCIE-40GB",Hostname="node4",DCGM_FI_DRIVER_VERSION="550.54.15"} 1410
      DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-72a692f2-7851-74a2-6ddc-35c81f162821",device="nvidia1",modelName="NVIDIA A100-PCIE-40GB",Hostname="node4",DCGM_FI_DRIVER_VERSION="550.54.15"} 1410

  3. Create a DaemonSet using the following YAML file. The example uses the mellanox/cuda-perftest image so that a GDR-enabled ib_write_bw is available.

    cudaperf-daemon.yaml

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: cudaperf-daemon
      labels:
        app: cudaperf-dae
    spec:
      selector:
        matchLabels:
          app: cudaperf-dae
      template:
        metadata:
          labels:
            app: cudaperf-dae
          annotations:
            k8s.v1.cni.cncf.io/networks: sriov20
        spec:
          containers:
          - image: mellanox/cuda-perftest:latest
            name: cudaperf-dae-pod
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                memory: 16Gi
                cpu: 8
                nvidia.com/sriov_rdma: '1'
                nvidia.com/gpu: '1'
              requests:
                memory: 16Gi
                cpu: 8
                nvidia.com/sriov_rdma: '1'
                nvidia.com/gpu: '1'
            command:
            - sleep
            - inf

  4. Verify the adapter bandwidth with GDR.

    1. Connect to one of the pods in the daemon set.

      Master Node Console

      # kubectl exec -it cudaperf-daemon-8krhz -- bash

    2. Make sure that the network adapter and the GPU are connected on the same PCIe switch (look for either PIX or PXB output).

      First Pod Console

      root@cudaperf-daemon-8krhz:~# nvidia-smi topo -m
              GPU0  NIC0  CPU Affinity  NUMA Affinity  GPU NUMA ID
      GPU0     X    PIX   0-7           0              N/A
      NIC0    PIX    X

      Legend:

        X    = Self
        SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
        NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
        PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
        PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
        PIX  = Connection traversing at most a single PCIe bridge
        NV#  = Connection traversing a bonded set of # NVLinks

      NIC Legend:

        NIC0: rocep63s0f0v5

    3. Start the ib_write_bw server side with the --use_cuda flag.

      First Pod Console

      root@cudaperf-daemon-8krhz:~# ib_write_bw -a -F --report_gbits -q 4 --use_cuda 0

      ************************************
      * Waiting for client to connect... *
      ************************************

    4. Connect to another pod in the daemon set.

      Master Node Console

      # kubectl exec -it cudaperf-daemon-xdchn -- bash

    5. Verify GPU and network adapter affinity.

      Second Pod Console

      root@cudaperf-daemon-xdchn:~# nvidia-smi topo -m
              GPU0  NIC0  CPU Affinity  NUMA Affinity  GPU NUMA ID
      GPU0     X    PIX   0-7           0              N/A
      NIC0    PIX    X

      Legend:

        X    = Self
        SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
        NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
        PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
        PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
        PIX  = Connection traversing at most a single PCIe bridge
        NV#  = Connection traversing a bonded set of # NVLinks

      NIC Legend:

        NIC0: rocep63s0f0v2

    6. Start the ib_write_bw client with the --use_cuda flag and the IP address of the first pod, and verify that the speed is above 190 Gb/sec.

      Second Pod Console

      root@cudaperf-daemon-xdchn:~# ib_write_bw -a -F --report_gbits -q 4 --use_cuda 0 192.168.20.23
      initializing CUDA
      Listing all CUDA devices in system:
      CUDA device 0: PCIe address is 40:00

      Picking device No. 0
      [pid = 20, dev = 0] device name = [NVIDIA A100-PCIE-40GB]
      creating CUDA Ctx
      making it the current CUDA Ctx
      cuMemAlloc() of a 67108864 bytes GPU buffer
      allocated GPU buffer address at 00007fbe28000000 pointer=0x7fbe28000000
      ---------------------------------------------------------------------------------------
                          RDMA_Write BW Test
       Dual-port       : OFF          Device         : rocep63s0f0v2
       Number of qps   : 4            Transport type : IB
       Connection type : RC           Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       TX depth        : 128
       CQ Moderation   : 100
       Mtu             : 4096[B]
       Link type       : Ethernet
       GID index       : 3
       Max inline data : 0[B]
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0000 QPN 0x00f0 PSN 0xe967be RKey 0x010f08 VAddr 0x007fbe2a000000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:18
       local address: LID 0000 QPN 0x00f1 PSN 0x8549d0 RKey 0x010f08 VAddr 0x007fbe2a800000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:18
       local address: LID 0000 QPN 0x00f2 PSN 0x42ceda RKey 0x010f08 VAddr 0x007fbe2b000000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:18
       local address: LID 0000 QPN 0x00f3 PSN 0x688e21 RKey 0x010f08 VAddr 0x007fbe2b800000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:18
       remote address: LID 0000 QPN 0x0270 PSN 0x6fe017 RKey 0x070f08 VAddr 0x007ff498000000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:23
       remote address: LID 0000 QPN 0x0271 PSN 0xf27db5 RKey 0x070f08 VAddr 0x007ff498800000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:23
       remote address: LID 0000 QPN 0x0272 PSN 0x7da55b RKey 0x070f08 VAddr 0x007ff499000000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:23
       remote address: LID 0000 QPN 0x0273 PSN 0x19c90e RKey 0x070f08 VAddr 0x007ff499800000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:23
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
       2          20000          0.045310           0.045232             2.827029
       4          20000          0.091674           0.088883             2.777584
       8          20000          0.18               0.18                 2.866542
       16         20000          0.37               0.37                 2.863911
       32         20000          0.73               0.73                 2.864631
       64         20000          1.47               1.47                 2.864966
       128        20000          2.93               2.93                 2.863358
       256        20000          5.87               5.87                 2.864869
       512        20000          11.74              11.73                2.864219
       1024       20000          23.47              23.45                2.862455
       2048       20000          46.85              46.82                2.857363
       4096       20000          93.96              93.86                2.864305
       8192       20000          187.75             187.71               2.864201
       16384      20000          195.78             195.70               1.493081
       32768      20000          195.77             195.73               0.746669
       65536      20000          195.77             195.76               0.373390
       131072     20000          195.56             195.09               0.186055
       262144     20000          195.78             195.78               0.093354
       524288     20000          195.83             195.83               0.046690
       1048576    20000          195.84             195.84               0.023346
       2097152    20000          195.82             195.81               0.011671
       4194304    20000          198.88             198.88               0.005927
       8388608    20000          195.84             195.84               0.002918
      ---------------------------------------------------------------------------------------
      deallocating RX GPU buffer 00007fbe28000000
      destroying current CUDA Ctx

    7. When the test completes, the server side in the first pod shows a matching summary.

      First Pod Console

      root@cudaperf-daemon-8krhz:~# ib_write_bw -a -F --report_gbits -q 4 --use_cuda 0

      ************************************
      * Waiting for client to connect... *
      ************************************
      initializing CUDA
      Listing all CUDA devices in system:
      CUDA device 0: PCIe address is 40:00

      Picking device No. 0
      [pid = 22, dev = 0] device name = [NVIDIA A100-PCIE-40GB]
      creating CUDA Ctx
      making it the current CUDA Ctx
      cuMemAlloc() of a 67108864 bytes GPU buffer
      allocated GPU buffer address at 00007ff496000000 pointer=0x7ff496000000
      ---------------------------------------------------------------------------------------
                          RDMA_Write BW Test
       Dual-port       : OFF          Device         : rocep63s0f0v5
       Number of qps   : 4            Transport type : IB
       Connection type : RC           Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       CQ Moderation   : 100
       Mtu             : 4096[B]
       Link type       : Ethernet
       GID index       : 3
       Max inline data : 0[B]
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0000 QPN 0x0270 PSN 0x6fe017 RKey 0x070f08 VAddr 0x007ff498000000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:23
       local address: LID 0000 QPN 0x0271 PSN 0xf27db5 RKey 0x070f08 VAddr 0x007ff498800000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:23
       local address: LID 0000 QPN 0x0272 PSN 0x7da55b RKey 0x070f08 VAddr 0x007ff499000000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:23
       local address: LID 0000 QPN 0x0273 PSN 0x19c90e RKey 0x070f08 VAddr 0x007ff499800000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:23
       remote address: LID 0000 QPN 0x00f0 PSN 0xe967be RKey 0x010f08 VAddr 0x007fbe2a000000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:18
       remote address: LID 0000 QPN 0x00f1 PSN 0x8549d0 RKey 0x010f08 VAddr 0x007fbe2a800000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:18
       remote address: LID 0000 QPN 0x00f2 PSN 0x42ceda RKey 0x010f08 VAddr 0x007fbe2b000000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:18
       remote address: LID 0000 QPN 0x00f3 PSN 0x688e21 RKey 0x010f08 VAddr 0x007fbe2b800000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:18
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
       8388608    20000          195.84             195.84               0.002918
      ---------------------------------------------------------------------------------------

DPDK

Note
  • To perform the DPDK test, enable Huge Pages on the worker nodes. Kubernetes supports the allocation and consumption of pre-allocated HugePages by applications in a pod. The nodes automatically discover and report all HugePages resources as schedulable resources. For additional information on K8s HugePages management, see the Kubernetes documentation.

  • Performing an optimal DPDK Benchmark test requires IOMMU in passthrough mode.

  • TRex v3.03 is required to reach link speed performance for 200 Gb/sec adapters.

  1. Prepare the setup for the DPDK test by enabling huge pages, with IOMMU kept in passthrough mode. In addition, tune the adapter firmware to achieve the best results.

    1. Modify the GRUB_CMDLINE_LINUX_DEFAULT parameter in the /etc/default/grub file. The setting below allocates 16 × 1GB = 16GB and 2048 × 2MB = 4GB of HugePages at boot time (the example uses only the 1GB pages).

      Worker Node Console

      # vi /etc/default/grub
      ...
      GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=2048 iommu=pt"
      ...

      Apply the new configuration to grub by running the command below and reboot the host.

      Worker Node Console

      # update-grub
      # reboot
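# reboot

      After the reboot, you can also check the allocation directly on a worker node before verifying it through Kubernetes (an optional sanity check; with default_hugepagesz=1G, the HugePages_* counters in /proc/meminfo refer to the 1GB pages):

      Worker Node Console

      # grep -i huge /proc/meminfo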

    2. Verify that the new configuration is applied correctly. Check the huge pages allocation through the master node.

      Master Node Console

      # for i in $(seq 2 5); do kubectl get node node$i -o json | jq '.status.allocatable."hugepages-1Gi", .status.allocatable."hugepages-2Mi"'; done
      "16Gi"
      "4Gi"
      "16Gi"
      "4Gi"
      "16Gi"
      "4Gi"
      "16Gi"
      "4Gi"

    3. Enable relaxed ordering and CQE compression for the ConnectX-7 adapter, and reboot the host for the changes to take effect.

      Worker Node Console

      # mstconfig -d 3f:00.0 -y set PCI_WR_ORDERING=1 CQE_COMPRESSION=1
      # reboot

    4. Verify that the changes have been applied correctly.

      Worker Node Console

      # mstconfig -d 3f:00.0 q | egrep "PCI_WR_ORDERING|CQE_COMPRESSION"
               CQE_COMPRESSION                     AGGRESSIVE(1)
               PCI_WR_ORDERING                     force_relax(1)

  2. DPDK traffic emulation is shown in the testbed flow diagram below. Traffic is pushed from the TRex pod through its SR-IOV VF net1 interface to the TestPMD pod's SR-IOV net1 interface. The TestPMD pod swaps the MAC addresses and routes the ingress traffic back out through the same net1 interface to the same interface on the TRex pod.

    (Figure: TestPMD/TRex testbed traffic flow diagram)

  3. Create a sample pod dpdk-testpmd.yaml.

    Note
    • TestPMD and TRex image creation is not covered here; however, a brief description is included.

    • TestPMD (based on Ubuntu base image):

      • RDMA Core userspace components: rdma-core, ibverbs-utils.

      • MLNX_OFED: mlnx-ofed-dpdk.

    • TRex (based on CentOS base image):

      • RDMA Core userspace components: rdma-core-devel, libibverbs, libibverbs-devel.

      • Additional packages: hostname, iproute, net-tools, ethtool, nmap, iputils, perf, numactl, sysstat, htop.

    dpdk-testpmd.yaml

    apiVersion: v1
    kind: Pod
    metadata:
      name: dpdk-testpmd
      labels:
        app: dpdk-testpmd
      annotations:
        k8s.v1.cni.cncf.io/networks: '[ { "name": "sriov20", "mac": "40:00:00:00:00:01" } ]'
    spec:
      containers:
      - image: <dpdk_testpmd_container-image>
        name: dpdk-testpmd-pod
        securityContext:
          capabilities:
            add: ["IPC_LOCK"]
        volumeMounts:
        - mountPath: /mnt/huge
          name: hugepage
        resources:
          limits:
            memory: 16Gi
            cpu: 8
            hugepages-1Gi: 2Gi
            nvidia.com/sriov_rdma: '1'
          requests:
            memory: 16Gi
            cpu: 8
            hugepages-1Gi: 2Gi
            nvidia.com/sriov_rdma: '1'
        command: ["sleep", "infinity"]
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages

    Note

    The example assigns a specific MAC address (40:00:00:00:00:01) to the pod for convenience (this is possible because the sriovnetwork is created with the MAC capability enabled).

    1. Apply the following yaml file.

      Master Node Console

      # kubectl apply -f dpdk-testpmd.yaml

    2. Verify that the pod is running successfully.

      Master Node Console

      # kubectl get pods
      NAME           READY   STATUS    RESTARTS   AGE
      dpdk-testpmd   1/1     Running   0          51m

    3. Connect to the pod.

      Master Node Console

      # kubectl exec -it dpdk-testpmd -- bash

      Within the container, check the available network interfaces and the PCI slot to which the high-speed interface (net1) is connected.

      TestPMD Pod Console

      root@dpdk-testpmd:~# ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host
             valid_lft forever preferred_lft forever
      2: eth0@if79: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
          link/ether ea:22:b8:49:88:f1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
          inet 10.233.75.19/32 scope global eth0
             valid_lft forever preferred_lft forever
          inet6 fe80::e822:b8ff:fe49:88f1/64 scope link
             valid_lft forever preferred_lft forever
      74: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq state UP group default qlen 1000
          link/ether 40:00:00:00:00:01 brd ff:ff:ff:ff:ff:ff permaddr 56:80:e7:14:2b:de
          inet 192.168.20.23/24 brd 192.168.20.255 scope global net1
             valid_lft forever preferred_lft forever
          inet6 fe80::4200:ff:fe00:1/64 scope link
             valid_lft forever preferred_lft forever

      root@dpdk-testpmd:~# rdma link
      link rocep63s0f0v7/1 state ACTIVE physical_state LINK_UP netdev net1

      root@dpdk-testpmd:~# ls -l /sys/class/net/ | grep net1
      lrwxrwxrwx 1 root root 0 Jul 23 12:58 net1 -> ../../devices/pci0000:20/0000:20:03.1/0000:25:00.0/0000:26:08.0/0000:3d:00.0/0000:3e:00.0/0000:3f:01.1/net/net1

    4. Because the example uses the single-numa-node Topology Manager policy and deploys the pod in the Guaranteed QoS class, the pod is bound to specific CPU cores on the host. In another console, SSH to the worker node on which TestPMD is running to determine which cores it is assigned.

      Worker Node Console

      # crictl ps | grep dpdk
      8ef92f6d4fcb9   2fc8e5351239d   56 minutes ago   Running   dpdk-testpmd-pod   0   b6d3df7a1b00f   dpdk-testpmd
      # crictl inspect 8ef92f6d4fcb9 | jq '.status.resources.linux.cpusetCpus'
      "0-7"

    5. Start the TestPMD application with the following command.

      TestPMD Pod Console

      root@dpdk-testpmd:~# dpdk-testpmd -l 0-7 -a 3f:01.1,mprq_en=1,rxqs_min_mprq=1,mprq_log_stride_num=9 -- --burst=64 --txd=2048 --rxd=2048 --mbcache=512 --rxq=8 --txq=8 --nb-cores=4 --rss-udp --forward-mode=macswap -a -i
      ...
      Configuring Port 0 (socket 0)
      mlx5_net: Port 0 Rx queue 0 size of a stride for Multi-Packet RQ is adjusted to match the mbuf size (1646)
      mlx5_net: Port 0 Rx queue 1 size of a stride for Multi-Packet RQ is adjusted to match the mbuf size (1646)
      mlx5_net: Port 0 Rx queue 2 size of a stride for Multi-Packet RQ is adjusted to match the mbuf size (1646)
      mlx5_net: Port 0 Rx queue 3 size of a stride for Multi-Packet RQ is adjusted to match the mbuf size (1646)
      mlx5_net: Port 0 Rx queue 4 size of a stride for Multi-Packet RQ is adjusted to match the mbuf size (1646)
      mlx5_net: Port 0 Rx queue 5 size of a stride for Multi-Packet RQ is adjusted to match the mbuf size (1646)
      mlx5_net: Port 0 Rx queue 6 size of a stride for Multi-Packet RQ is adjusted to match the mbuf size (1646)
      mlx5_net: Port 0 Rx queue 7 size of a stride for Multi-Packet RQ is adjusted to match the mbuf size (1646)
      Port 0: 00:00:00:00:00:00
      Checking link statuses...
      Done
      Start automatic packet forwarding
      macswap packet forwarding - ports=1 - cores=4 - streams=8 - NUMA support enabled, MP allocation mode: native
      Logical Core 1 (socket 0) forwards packets on 2 streams:
        RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
        RX P=0/Q=1 (socket 0) -> TX P=0/Q=1 (socket 0) peer=02:00:00:00:00:00
      Logical Core 2 (socket 0) forwards packets on 2 streams:
        RX P=0/Q=2 (socket 0) -> TX P=0/Q=2 (socket 0) peer=02:00:00:00:00:00
        RX P=0/Q=3 (socket 0) -> TX P=0/Q=3 (socket 0) peer=02:00:00:00:00:00
      Logical Core 3 (socket 0) forwards packets on 2 streams:
        RX P=0/Q=4 (socket 0) -> TX P=0/Q=4 (socket 0) peer=02:00:00:00:00:00
        RX P=0/Q=5 (socket 0) -> TX P=0/Q=5 (socket 0) peer=02:00:00:00:00:00
      Logical Core 4 (socket 0) forwards packets on 2 streams:
        RX P=0/Q=6 (socket 0) -> TX P=0/Q=6 (socket 0) peer=02:00:00:00:00:00
        RX P=0/Q=7 (socket 0) -> TX P=0/Q=7 (socket 0) peer=02:00:00:00:00:00

        macswap packet forwarding packets/burst=64
        nb forwarding cores=4 - nb forwarding ports=1
        port 0: RX queue number: 8 Tx queue number: 8
          Rx offloads=0x0 Tx offloads=0x0
          RX queue: 0
            RX desc=2048 - RX free threshold=64
            RX threshold registers: pthresh=0 hthresh=0 wthresh=0
            RX Offloads=0x0
          TX queue: 0
            TX desc=2048 - TX free threshold=0
            TX threshold registers: pthresh=0 hthresh=0 wthresh=0
            TX offloads=0x0 - TX RS bit threshold=0
      testpmd>

      Note
      • DPDK applications split command line arguments into arguments for the DPDK Environment Abstraction Layer (EAL), which can be used by any DPDK application running on Linux, and arguments for the application itself (TestPMD in this case). The two sets of arguments are separated using the standard convention of --.

      • Some of the EAL command line options used:

        • -l: List of cores to run on (0-7 in the examples).

        • -a: PCI device to use (3f:01.1 in the examples).

      • Some of the TestPMD command line options used:

        • --rxq/txq: Number of RX/TX queues per port.

        • --rxd/txd: Number of descriptors in the RX/TX rings.

        • --nb-cores: Number of forwarding cores (the examples do not use all of the listed cores).

        • -a: Start forwarding on initialization.

        • -i: Interactive mode.

  4. Deploy the TRex pod. Create two ConfigMaps, one for configuration and one for the test file.

    1. Create the trex-config.yaml ConfigMap.

      trex-config.yaml

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: trex-config
      data:
        trex_cfg.yaml: |
          - port_limit: 2
            version: 3
            interfaces:
              - "{PCIDEVICE_1}"
              - "{PCIDEVICE_2}"
            port_bandwidth_gb: 200
            port_info:
              - dest_mac: 40:00:00:00:00:01
                src_mac: 30:00:00:00:00:01
              - dest_mac: 40:00:00:00:00:01
                src_mac: 30:00:00:00:00:02
            platform:
              master_thread_id: {MASTER_CPU}
              latency_thread_id: {LATENCY_CPU}
              dual_if:
                - socket: 0
                  threads: [{CPUS}]

      Note
      • The examples assign constant static MAC addresses for TRex interfaces due to MAC capability in the sriovnetwork.yaml. This is done for convenience purposes only.

      • The following variables are not statically configured; their values only become known when the pod itself is deployed:

        • PCIDEVICE_1/2: The PCI slot of the VFs that will be assigned to the pod (TRex requires at least two ports to start - the example simulates the traffic using only one of them).

        • MASTER_CPU: One of the CPUs from the assigned CPUs to the pod that will be used for master/UI.

        • LATENCY_CPU: One of the CPUs from the assigned CPUs to the pod that will be used for latency measurement.

        • CPUS: The rest of the CPUs from the assigned CPUs to the pod that aren't master/latency.

      • The example statically configures the socket to 0 because the network adapter is located on NUMA node 0 and the example uses the single-numa-node topology policy.
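
      Before editing the configuration in step 7, you can usually recover these values from inside the TRex pod itself: the SR-IOV device plugin exposes the allocated VF PCI addresses through a PCIDEVICE_* environment variable (the exact variable name depends on the resource name), and the assigned CPU set can be read from /proc. A quick sketch:

      TRex Pod Console

      [root@trex trex]# env | grep PCIDEVICE
      [root@trex trex]# grep Cpus_allowed_list /proc/self/status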

    2. Create the trex-test.yaml ConfigMap.

      trex-test.yaml

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: trex-test
      data:
        testpmd.py: |
          from trex_stl_lib.api import *

          class STLS1(object):

              def create_stream (self):

                  pkt = Ether()/IP(src="16.0.0.1",dst="48.0.0.1")/UDP(dport=12)/({PAYLOAD_SIZE}*'x')

                  vm = STLScVmRaw( [ STLVmFlowVar(name="v_port", min_value=4337, max_value=5337, size=2, op="inc"),
                                     STLVmWrFlowVar(fv_name="v_port", pkt_offset= "UDP.sport" ),
                                     STLVmFixChecksumHw(l3_offset="IP",l4_offset="UDP",l4_type=CTRexVmInsFixHwCs.L4_TYPE_UDP),
                                   ] )

                  return STLStream(packet = STLPktBuilder(pkt = pkt ,vm = vm ) , mode = STLTXCont(pps = 8000000) )

              def get_streams (self, direction = 0, **kwargs):
                  # create 1 stream
                  return [ self.create_stream() ]

          # dynamic load - used for trex console or simulator
          def register():
              return STLS1()

      Note

      PAYLOAD_SIZE: As the name implies, defines the UDP payload size. For example, a payload of 1472 bytes (the value used later in this guide) produces 1500-byte IP packets: 1472 bytes of payload + 20 bytes of IP header + 8 bytes of UDP header.

    3. Apply the configuration files described above.

      Master Node Console

      # kubectl apply -f trex-config.yaml
      # kubectl apply -f trex-test.yaml

    4. Create the trex.yaml Pod configuration file.

      trex.yaml

      apiVersion: v1
      kind: Pod
      metadata:
        name: trex
        labels:
          app: trex
        annotations:
          k8s.v1.cni.cncf.io/networks: '[ { "name": "sriov20", "mac": "30:00:00:00:00:01" }, { "name": "sriov20", "mac": "30:00:00:00:00:02" } ]'
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - dpdk-testpmd
              topologyKey: kubernetes.io/hostname
        containers:
        - image: <trex_container_image>
          name: trex
          securityContext:
            capabilities:
              add: ["IPC_LOCK", "SYS_RESOURCE", "NET_RAW", "NET_ADMIN"]
          volumeMounts:
          - name: trex-config
            mountPath: /opt/templates/
          - name: trex-test
            mountPath: /opt/tests/
          - mountPath: /mnt/huge
            name: hugepages
          - name: modules
            mountPath: /lib/modules
          resources:
            limits:
              memory: 1Gi
              cpu: 16
              hugepages-1Gi: 8Gi
              nvidia.com/sriov_rdma: '2'
            requests:
              memory: 1Gi
              cpu: 16
              hugepages-1Gi: 8Gi
              nvidia.com/sriov_rdma: '2'
          command: ["/bin/bash", "-c", "sleep INF"]
        volumes:
        - name: modules
          hostPath:
            path: /lib/modules
        - name: trex-config
          configMap:
            name: trex-config
        - name: trex-test
          configMap:
            name: trex-test
        - name: hugepages
          emptyDir:
            medium: HugePages

    5. Apply the configuration file described above.

      Master Node Console

      # kubectl apply -f trex.yaml

    6. Connect to the TRex pod.

      Master Node Console

      # kubectl exec -it trex -- bash

    7. Update the configuration file within the container. Using the methods shown previously, determine which cores the container is running on and which PCI devices are attached.

      TRex Pod Console

      [root@trex trex]# cp /opt/templates/trex_cfg.yaml /etc/trex_cfg.yaml
      [root@trex trex]# cp /opt/tests/testpmd.py ./
      [root@trex trex]# vi /etc/trex_cfg.yaml
      - port_limit: 2
        version: 2
        interfaces:
          - "3f:00.6"
          - "3f:00.5"
        port_bandwidth_gb: 200
        port_info:
          - dest_mac: 40:00:00:00:00:01
            src_mac: 30:00:00:00:00:01
          - dest_mac: 40:00:00:00:00:01
            src_mac: 30:00:00:00:00:02
        platform:
          master_thread_id: 8
          latency_thread_id: 23
          dual_if:
            - socket: 0
              threads: [9,10,11,12,13,14,15,16,17,18,19,20,21,22]
      [root@trex trex]# vi testpmd.py
      ...
      pkt = Ether()/IP(src="16.0.0.1",dst="48.0.0.1")/UDP(dport=12)/(1472*'x')
      ...

    8. Start TRex traffic generator using the following command.

      TRex Pod Console

      [root@trex trex]# ./t-rex-64 --no-ofed-check --no-hw-flow-stat -i -c 14

    9. In another console, connect to the same TRex pod and run trex-console for visualization and traffic generation to the TestPMD pod.

      TRex 2nd Pod Console

      # kubectl exec -it trex -- bash
      [root@trex trex]# ./trex-console

      Using 'python3' as Python interpeter

      Connecting to RPC server on localhost:4501          [SUCCESS]
      Connecting to publisher server on localhost:4500    [SUCCESS]
      Acquiring ports [0, 1]:                             [SUCCESS]

      Server Info:

      Server version:   v3.03 @ STL
      Server mode:      Stateless
      Server CPU:       14 x AMD EPYC 7F72 24-Core Processor
      Ports count:      2 x 200Gbps @ ConnectX Family mlx5Gen Virtual Function

      -=TRex Console v3.0=-

      Type 'help' or '?' for supported actions

      trex>tui

  5. Start generating traffic to the TestPMD pod and verify that you reach line-rate performance.

    TRex 2nd Pod Console

    tui> start -f testpmd.py -m 100% -p 0

    Global Statistics

    connection   : localhost, Port 4501                  total_tx_L2  : 197.15 Gbps
    version      : STL @ v3.03                           total_tx_L1  : 199.74 Gbps
    cpu_util.    : 89.85% @ 14 cores (14 per dual port)  total_rx     : 197.02 Gbps
    rx_cpu_util. : 0.0% / 0 pps                          total_pps    : 16.23 Mpps
    async_util.  : 0% / 16.34 bps                        drop_rate    : 0 bps
    total_cps.   : 0 cps                                 queue_full   : 438,273,395 pkts

    Port Statistics

    port       |         0         |         1         |       total
    -----------+-------------------+-------------------+------------------
    owner      |              root |              root |
    link       |                UP |                UP |
    state      |      TRANSMITTING |              IDLE |
    speed      |          200 Gb/s |          200 Gb/s |
    CPU util.  |            89.85% |              0.0% |
    --         |                   |                   |
    Tx bps L2  |       197.15 Gbps |             0 bps |       197.15 Gbps
    Tx bps L1  |       199.74 Gbps |             0 bps |       199.74 Gbps
    Tx pps     |        16.23 Mpps |             0 pps |        16.23 Mpps
    Line Util. |           99.87 % |               0 % |
    ---        |                   |                   |
    Rx bps     |       197.02 Gbps |             0 bps |       197.02 Gbps
    Rx pps     |        16.22 Mpps |             0 pps |        16.22 Mpps
    ----       |                   |                   |
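
    As a sanity check on these numbers: with a 1472-byte UDP payload, each frame is 1518 bytes at L2 (payload + 8-byte UDP header + 20-byte IP header + 14-byte Ethernet header + 4-byte FCS) and 1538 bytes at L1 (adding the 8-byte preamble and 12-byte inter-frame gap), so 16.23 Mpps × 1538 bytes × 8 ≈ 199.7 Gbps at L1 and 16.23 Mpps × 1518 bytes × 8 ≈ 197.1 Gbps at L2, matching the statistics above.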

Validating the Fabric

To validate the fabric, assign IP addresses to the servers. Each stretched VLAN acts as a local subnet to all the servers connected to it, so all the servers connected to the same VLAN must have IP addresses in the same subnet.
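
For example, an address can be assigned ad-hoc on node1 as follows (the 10.10.0.0/16 addressing follows the scheme used in this guide; adjust it to your own plan, and note that an address added this way does not persist across reboots):

Master Node Console

# ip addr add 10.10.1.1/16 dev bond0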

You can then run arping between them and verify that they see each other on the same local subnet.

This example shows node1:

Master Node Console

$ sudo -i
# for i in $(seq 2 5); do arping -I bond0 -c 2 10.10.1.$i; done
ARPING 10.10.1.2
60 bytes from b8:3f:d2:18:68:1a (10.10.1.2): index=0 time=160.720 usec
60 bytes from b8:3f:d2:18:68:1a (10.10.1.2): index=1 time=128.830 usec

--- 10.10.1.2 statistics ---
2 packets transmitted, 2 packets received, 0% unanswered (0 extra)
rtt min/avg/max/std-dev = 0.129/0.145/0.161/0.016 ms
ARPING 10.10.1.3
60 bytes from b8:3f:d2:18:68:4a (10.10.1.3): index=0 time=152.170 usec
60 bytes from b8:3f:d2:18:68:4a (10.10.1.3): index=1 time=135.760 usec

--- 10.10.1.3 statistics ---
2 packets transmitted, 2 packets received, 0% unanswered (0 extra)
rtt min/avg/max/std-dev = 0.136/0.144/0.152/0.008 ms
ARPING 10.10.1.4
60 bytes from b8:3f:d2:18:68:3a (10.10.1.4): index=0 time=124.230 usec
60 bytes from b8:3f:d2:18:68:3a (10.10.1.4): index=1 time=126.120 usec

--- 10.10.1.4 statistics ---
2 packets transmitted, 2 packets received, 0% unanswered (0 extra)
rtt min/avg/max/std-dev = 0.124/0.125/0.126/0.001 ms
ARPING 10.10.1.5
60 bytes from 94:6d:ae:5f:6a:f8 (10.10.1.5): index=0 time=121.210 usec
60 bytes from 94:6d:ae:5f:6a:f8 (10.10.1.5): index=1 time=133.270 usec

--- 10.10.1.5 statistics ---
2 packets transmitted, 2 packets received, 0% unanswered (0 extra)
rtt min/avg/max/std-dev = 0.121/0.127/0.133/0.006 ms

You can also validate on the switches that:

1) The IP addresses of the VTEPs are successfully propagated by BGP to all the leaf switches, and all the server IP addresses are learned.

Repeat the following command on the leaf switches:

Leaf Switch Console

cumulus@leaf1a:mgmt:~$ sudo vtysh

Hello, this is FRRouting (version 8.4.3).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

leaf1a# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, A - Babel, D - SHARP, F - PBR, f - OpenFabric,
       Z - FRR,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

B>* 10.0.0.1/32 [20/0] via fe80::1e34:daff:feb4:a768, swp31, weight 1, 2d05h24m
B>* 10.0.0.2/32 [20/0] via fe80::1e34:daff:feb4:a968, swp32, weight 1, 2d05h24m
C>* 10.0.0.101/32 is directly connected, lo, 2d05h25m
B>* 10.0.0.102/32 [20/0] via fe80::1e34:daff:feb4:a768, swp31, weight 1, 2d05h24m
  *                      via fe80::1e34:daff:feb4:a968, swp32, weight 1, 2d05h24m
B>* 10.0.0.103/32 [20/0] via fe80::1e34:daff:feb4:a768, swp31, weight 1, 2d05h24m
  *                      via fe80::1e34:daff:feb4:a968, swp32, weight 1, 2d05h24m
B>* 10.0.0.104/32 [20/0] via fe80::1e34:daff:feb4:a768, swp31, weight 1, 2d05h24m
  *                      via fe80::1e34:daff:feb4:a968, swp32, weight 1, 2d05h24m

leaf1a# show ip route vrf RED
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, A - Babel, D - SHARP, F - PBR, f - OpenFabric,
       Z - FRR,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF RED:
S>* 0.0.0.0/0 [1/0] via 10.1.0.254, vlan1, weight 1, 2d05h25m
K * 0.0.0.0/0 [255/8192] unreachable (ICMP unreachable), 2d05h25m
C * 10.1.0.0/24 [0/1024] is directly connected, vlan1-v0, 2d05h25m
C>* 10.1.0.0/24 is directly connected, vlan1, 2d05h25m
C * 10.10.0.0/16 [0/1024] is directly connected, vlan10-v0, 2d05h25m
C>* 10.10.0.0/16 is directly connected, vlan10, 2d05h25m
B>* 10.10.1.2/32 [20/0] via 10.0.0.103, vlan3159_l3 onlink, weight 1, 00:00:01
B>* 10.10.1.3/32 [20/0] via 10.0.0.103, vlan3159_l3 onlink, weight 1, 00:00:01
B>* 10.10.1.4/32 [20/0] via 10.0.0.104, vlan3159_l3 onlink, weight 1, 00:00:01
B>* 10.10.1.5/32 [20/0] via 10.0.0.104, vlan3159_l3 onlink, weight 1, 00:00:01

2) The ARP entries were successfully propagated by EVPN (the example verifies this from leaf3). Also examine the router MAC addresses corresponding to all remote VTEPs, which are used for symmetric routing.

Leaf3 Switch Console

cumulus@leaf3:mgmt:~$ sudo vtysh

Hello, this is FRRouting (version 8.4.3).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

leaf3# show evpn arp-cache vni 10
Number of ARPs (local and remote) known for this VNI: 12
Flags: I=local-inactive, P=peer-active, X=peer-proxy
Neighbor                   Type    Flags  State   MAC                Remote ES/VTEP                 Seq #'s
fe80::90d3:4eff:fe88:6553  remote         active  92:d3:4e:88:65:53  03:44:38:39:be:ef:aa:00:00:03  0/761638
10.10.1.3                  remote         active  b8:3f:d2:18:68:4a  10.0.0.103                     0/761637
10.10.0.250                remote         active  ce:72:b9:66:cb:f1  03:44:38:39:be:ef:aa:00:00:02  0/0
10.10.1.5                  local          active  94:6d:ae:5f:6a:f8                                 761636/0
10.10.1.1                  remote         active  92:d3:4e:88:65:53  03:44:38:39:be:ef:aa:00:00:03  0/761638
fe80::966d:aeff:fe5f:6af8  local          active  94:6d:ae:5f:6a:f8                                 761636/0
fe80::ba3f:d2ff:fe18:683a  local          active  b8:3f:d2:18:68:3a                                 761637/0
10.10.1.4                  local          active  b8:3f:d2:18:68:3a                                 761637/0
10.10.1.2                  remote         active  b8:3f:d2:18:68:1a  10.0.0.103                     0/761636
fe80::cc72:b9ff:fe66:cbf1  remote         active  ce:72:b9:66:cb:f1  03:44:38:39:be:ef:aa:00:00:02  0/0
fe80::ba3f:d2ff:fe18:681a  remote         active  b8:3f:d2:18:68:1a  10.0.0.103                     0/761636
fe80::ba3f:d2ff:fe18:684a  remote         active  b8:3f:d2:18:68:4a  10.0.0.103                     0/761637

leaf3# show evpn rmac vni all

VNI 4001 #RMACs 3

RMAC               Remote VTEP
1c:34:da:b4:ae:fd  10.0.0.101
1c:34:da:b4:a8:fd  10.0.0.103
1c:34:da:b4:ac:fd  10.0.0.102

3) EVPN-MH is configured and functioning properly on the infrastructure rack leaf switches: global information, Ethernet Segment information, Ethernet Segments learned for each VNI, and BGP Ethernet Segment information.

Border Router Switch Console

cumulus@leaf1a:mgmt:~$ nv show evpn multihoming
                     operational  applied
-------------------  -----------  -------
enable                            on
mac-holdtime         1080         1080
neighbor-holdtime    1080         1080
startup-delay        180          180
ead-evi-route
  rx                              on
  tx                              on
segment
  df-preference      32767
startup-delay-timer  --:--:--
uplink-count         2
uplink-active        2

cumulus@leaf1a:mgmt:~$ nv show evpn multihoming esi

ESInterface - Local interface, NHG - Nexthop group ID, DFPref - Designated
forwarder preference, VNICnt - ESI EVPN instances, MacCnt - Mac entries using
this ES as destination, RemoteVTEPs - Remote tunnel Endpoint

ESI                            ESInterface  NHG        DFPref  VNICnt  MacCnt  Flags  RemoteVTEPs
-----------------------------  -----------  ---------  ------  ------  ------  -----  -----------
03:44:38:39:be:ef:aa:00:00:01  bond1        536870913  50000   1       2       local  10.0.0.102
03:44:38:39:be:ef:aa:00:00:02  bond2        536870914  50000   1       1       local  10.0.0.102
03:44:38:39:be:ef:aa:00:00:03  bond3        536870915  50000   1       1       local  10.0.0.102

cumulus@leaf1a:mgmt:~$ sudo vtysh

Hello, this is FRRouting (version 8.4.3).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

leaf1a# show evpn es-evi
Type: L local, R remote
VNI      ESI                            Type
10       03:44:38:39:be:ef:aa:00:00:02  L
10       03:44:38:39:be:ef:aa:00:00:03  L
1        03:44:38:39:be:ef:aa:00:00:01  L

leaf1a# show bgp l2vpn evpn es
ES Flags: B - bypass, L local, R remote, I inconsistent
VTEP Flags: E ESR/Type-4, A active nexthop
ESI                            Flags  RD            #VNIs  VTEPs
03:44:38:39:be:ef:aa:00:00:01  LR     10.0.0.101:3  1      10.0.0.102(EA)
03:44:38:39:be:ef:aa:00:00:02  LR     10.0.0.101:4  1      10.0.0.102(EA)
03:44:38:39:be:ef:aa:00:00:03  LR     10.0.0.101:5  1      10.0.0.102(EA)

Done!



Guy Zilberman

Guy Zilberman is a solution architect at NVIDIA's Networking Solutions Labs, bringing extensive experience from several leadership roles in cloud computing. He specializes in designing and implementing solutions for cloud and containerized workloads, leveraging NVIDIA's advanced networking technologies. His work primarily focuses on open-source cloud infrastructure, with expertise in platforms such as Kubernetes (K8s) and OpenStack.




Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.



Shachar Dor

Shachar Dor joined the Solutions Lab team after working more than ten years as a software architect at NVIDIA Networking (previously Mellanox Technologies), where he was responsible for the architecture of network management products and solutions. Shachar's focus is on networking technologies, especially around fabric bring-up, configuration, monitoring, and life-cycle management.

Shachar has a strong background in software architecture, design, and programming through his work on multiple projects and technologies also prior to joining the company.

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality. NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice. Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete. NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

© Copyright 2024, NVIDIA. Last updated on Sep 12, 2023.