RDG for DPDK Applications on SR-IOV Enabled Kubernetes Cluster with NVIDIA Network Operator

Created on July 7, 2021.

Scope

The following Reference Deployment Guide (RDG) explains how to build a high-performing Kubernetes (K8s) cluster with the containerd container runtime that is capable of running DPDK-based applications over NVIDIA Networking end-to-end Ethernet infrastructure.

This RDG describes a solution with multiple servers connected to a single switch that provides a secondary network for the Kubernetes cluster. A more complex scale-out network topology of multiple L2 domains is beyond the scope of this document.

Abbreviations and Acronyms

Term      Definition                                   Term      Definition
CNI       Container Network Interface                  LLDP      Link Layer Discovery Protocol
CR        Custom Resource                              NFD       Node Feature Discovery
CRD       Custom Resource Definition                   OCI       Open Container Initiative
CRI       Container Runtime Interface                  PF        Physical Function
DHCP      Dynamic Host Configuration Protocol          QSG       Quick Start Guide
DNS       Domain Name System                           RDG       Reference Deployment Guide
DP        Device Plugin                                RDMA      Remote Direct Memory Access
DPDK      Data Plane Development Kit                   RoCE      RDMA over Converged Ethernet
EVPN      Ethernet VPN                                 SR-IOV    Single Root Input/Output Virtualization
HWE       Hardware Enablement                          VF        Virtual Function
IPAM      IP Address Management                        VPN       Virtual Private Network
K8s       Kubernetes                                   VXLAN     Virtual eXtensible Local Area Network

Introduction

Provisioning a Kubernetes cluster with the containerd container runtime for running DPDK-based workloads can be an extremely complicated task.
Proper design and the selection of software and hardware components can become gating factors for a successful deployment.
This guide provides a complete solution cycle, including a technology overview, design, component selection, and deployment steps.
The solution is delivered on top of standard servers over the NVIDIA end-to-end Ethernet infrastructure.
This document uses the new NVIDIA Network Operator, which is in charge of deploying and configuring the SR-IOV Device Plugin and SR-IOV CNI. These components allow DPDK workloads to run on Kubernetes worker nodes.

Solution Architecture

Key Components and Technologies

  • NVIDIA ConnectX SmartNICs
    10/25/40/50/100/200 and 400G Ethernet Network Adapters
    The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offer advanced hardware offloads and accelerations.
    NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.

  • NVIDIA LinkX Cables

    The NVIDIA® LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400GbE in Ethernet and 100, 200 and 400Gb/s InfiniBand products for Cloud, HPC, hyperscale, Enterprise, telco, storage and artificial intelligence, data center applications.

  • NVIDIA Spectrum Ethernet Switches

    Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.
    Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects.
    NVIDIA combines the benefits of NVIDIA Spectrum switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux , SONiC and NVIDIA Onyx®.

  • NVIDIA Cumulus Linux

    NVIDIA® Cumulus® Linux is the industry's most innovative open network operating system that allows you to automate, customize, and scale your data center network like no other.

  • RDMA

    RDMA is a technology that allows computers in a network to exchange data without involving the processor, cache or operating system of either computer.

    Like locally based DMA, RDMA improves throughput and performance and frees up compute resources.

  • Kubernetes
    Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.

  • Kubespray
    Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS and Kubernetes cluster configuration management tasks, and provides:

    • A highly available cluster

    • Composable attributes

    • Support for most popular Linux distributions

  • NVIDIA Network Operator

    An analog to the NVIDIA GPU Operator, the NVIDIA Network Operator simplifies scale-out network design for Kubernetes by automating aspects of network deployment and configuration that would otherwise require manual work. It loads the required drivers, libraries, device plugins, and CNIs on any cluster node with an NVIDIA network interface. Paired with the NVIDIA GPU Operator, the Network Operator enables GPUDirect RDMA, a key technology that accelerates cloud-native AI workloads by orders of magnitude. The NVIDIA Network Operator uses Kubernetes CRD and the Operator Framework to provision the host software needed for enabling accelerated networking.

  • containerd
    containerd is an industry-standard container runtime with an emphasis on simplicity, robustness, and portability. It is available as a daemon for Linux and Windows and manages the complete container lifecycle of its host system, from image transfer and storage, to container execution and supervision, to low-level storage, network attachments, and beyond (see the short crictl sketch after this list).

  • NVIDIA PMDs

    NVIDIA Poll Mode Driver (PMD) is an open-source upstream driver, embedded within dpdk.org releases, designed for fast packet processing and low latency by providing kernel bypass for receive and send and by avoiding the interrupt processing performance overhead.

  • TRex - Realistic Traffic Generator
    TRex is an open-source stateful and stateless traffic generator fueled by DPDK. It generates L3-L7 traffic and combines in one tool the capabilities of commercial traffic generators. TRex can scale up to 200 Gb/sec with one server.
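
Since the cluster in this guide uses containerd rather than Docker, containers on the worker nodes are inspected with the crictl CLI (installed by Kubespray) instead of the docker command. A minimal sketch, assuming the default containerd socket path:

Worker Node console

# crictl --runtime-endpoint unix:///run/containerd/containerd.sock pods   # list pod sandboxes managed by containerd
# crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps     # list running containers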

Logical Design

The logical design includes the following parts:

  1. Deployment node running Kubespray that deploys Kubernetes clusters.

  2. K8s master node running all Kubernetes management components.

  3. K8s worker nodes.

  4. TRex server.

  5. High-speed Ethernet fabric for DPDK tenant network.

  6. Deployment and K8s management network.

(Figure: Logical design diagram)

Fabric Design

The high-performance network is a secondary network for the Kubernetes cluster and requires an L2 network topology.

This RDG describes a solution with multiple servers connected to a single switch that provides the secondary network for the Kubernetes cluster.

A more complex scale-out network topology of multiple L2 domains is beyond the scope of this document.

Software Stack Components

(Figure: Software stack components)

Bill of Materials

The following hardware setup is utilized in this guide.

(Figure: Bill of materials)

Important

The above table does not contain the management network connectivity components.

Deployment and Configuration

Wiring

On each K8s worker node and on the TRex server, the first port of each NVIDIA network adapter is wired to the NVIDIA switch in the high-performance fabric using NVIDIA LinkX DAC cables.

(Figure: Wiring diagram)

Warning

The deployment and management network is part of the IT infrastructure and is not covered in this guide.

Fabric

Prerequisites

  • High-performance Ethernet fabric

    • Single switch

      NVIDIA SN2100

    • Switch OS

      Cumulus Linux v4.2.1

  • Deployment and management network

    DNS and DHCP network services and network topology are part of the IT infrastructure. The component installation and configuration are not covered in this guide.

Network Configuration

Below are the server names with their relevant network configurations.

Server/Switch type    Server/Switch name    High-speed network       Management network (1/25 GbE)
Deployment node       depserver             -                        ens4f0: DHCP, 192.168.222.110
Master node           node1                 -                        ens4f0: DHCP, 192.168.222.111
Worker node           node2                 ens2f0: no IP set        ens4f0: DHCP, 192.168.222.101
Worker node           node3                 ens2f0: no IP set        ens4f0: DHCP, 192.168.222.102
TRex server           node4                 ens2f0: no IP set        ens4f0: DHCP, 192.168.222.103
                                            ens2f1: no IP set
High-speed switch     leaf01                -                        mgmt0: DHCP, 192.168.222.201

Warning

The high-speed network interfaces do not require additional configuration.

Fabric Configuration

This solution is based on the Cumulus Linux v4.2.1 switch operating system.

Intermediate-level Linux knowledge is assumed for this guide. Familiarity with basic text editing, Linux file permissions, and process monitoring is required. A variety of text editors are pre-installed, including vi and nano.

Networking engineers who are unfamiliar with Linux concepts should refer to this reference guide to compare the Cumulus Linux CLI and configuration options and their equivalent Cisco Nexus 3000 NX-OS commands and settings. There is also a series of short videos with an introduction to Linux and Cumulus-Linux-specific concepts.

A Greenfield deployment is assumed for this guide. Please refer to the following guide for Upgrading Cumulus Linux.

Fabric configuration steps:

  1. Administratively enable all physical ports.

  2. Create a bridge and configure one or more front panel ports as members of the bridge.

  3. Commit configuration.

Run the following commands on the switch console:

Switch console

Linux swx-mld-l03 4.19.0-cl-1-amd64 #1 SMP Cumulus 4.19.94-1+cl4.2.1u1 (2020-08-28) x86_64

Welcome to NVIDIA Cumulus (R) Linux (R)

For support and online technical documentation, visit http://www.cumulusnetworks.com/support

The registered trademark Linux (R) is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide basis.

cumulus@leaf01:mgmt:~$ net add interface swp1-16
cumulus@leaf01:mgmt:~$ net add bridge bridge ports swp1-16
cumulus@leaf01:mgmt:~$ net commit

To view link status, use the net show interface all command. The following examples show the output of ports in admin down, down, and up modes.

Switch console

cumulus@leaf01:mgmt:~$ net show interface all
State  Name    Spd   MTU    Mode       LLDP                        Summary
-----  ------  ----  -----  ---------  --------------------------  ---------------------------------
UP     lo      N/A   65536  Loopback                               IP: 127.0.0.1/8
                                                                   lo IP: ::1/128
UP     eth0    1G    1500   Mgmt       mgmt-xxx-xxx-xxx-xxx (8)    Master: mgmt(UP)
                                                                   eth0 IP: 192.168.222.201/24(DHCP)
UP     swp1    100G  9216   Access/L2                              Master: bridge(UP)
UP     swp2    100G  9216   Access/L2  node2 (0c:42:a1:2b:74:ae)   Master: bridge(UP)
UP     swp3    100G  9216   Access/L2                              Master: bridge(UP)
UP     swp4    100G  9216   Access/L2  node3 (0c:42:a1:24:05:4a)   Master: bridge(UP)
UP     swp5    100G  9216   Access/L2                              Master: bridge(UP)
UP     swp6    100G  9216   Access/L2  node4 (0c:42:a1:24:05:1a)   Master: bridge(UP)
UP     swp7    100G  9216   Access/L2                              Master: bridge(UP)
UP     swp8    100G  9216   Access/L2  node4 (0c:42:a1:24:05:1b)   Master: bridge(UP)
DN     swp9    N/A   9216   Access/L2                              Master: bridge(UP)
DN     swp10   N/A   9216   Access/L2                              Master: bridge(UP)
DN     swp11   N/A   9216   Access/L2                              Master: bridge(UP)
DN     swp12   N/A   9216   Access/L2                              Master: bridge(UP)
DN     swp13   N/A   9216   Access/L2                              Master: bridge(UP)
DN     swp14   N/A   9216   Access/L2                              Master: bridge(UP)
DN     swp15   N/A   9216   Access/L2                              Master: bridge(UP)
DN     swp16   N/A   9216   Access/L2                              Master: bridge(UP)
UP     bridge  N/A   9216   Bridge/L2
UP     mgmt    N/A   65536  VRF                                    IP: 127.0.0.1/8
                                                                   mgmt IP: ::1/128

Nodes Configuration

General Prerequisites:

  • Hardware

    All the K8s worker nodes have the same hardware specification (see BoM for details).

  • Host BIOS

    Verify that an SR-IOV-capable server platform is used, and review the BIOS settings in the server platform vendor documentation to enable SR-IOV in the BIOS.

  • Host OS

    Ubuntu Server 20.04 operating system should be installed on all servers with OpenSSH server packages.

  • Experience with Kubernetes

    Familiarization with the Kubernetes Cluster architecture is essential.

Important

Make sure that the BIOS settings on the worker nodes servers have SR-IOV enabled and that the servers are tuned for maximum performance.

All worker nodes must have the same PCIe placement for the NIC and expose the same interface name.
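
A quick way to check these prerequisites from the operating system is sketched below (the PCI address and interface name are the ones used in this setup and may differ on other servers):

Worker Node console

# lspci | grep -i mellanox                         # confirm the NIC has the same PCIe placement on every worker node
# lspci -s 07:00.0 -vvv | grep -i "Single Root"    # verify that the SR-IOV capability is exposed by the adapter
# cat /sys/class/net/ens2f0/device/sriov_totalvfs  # maximum number of VFs supported by the PF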

Host OS Prerequisites

Make sure the Ubuntu Server 20.04 operating system is installed on all servers with the OpenSSH server packages, and create a non-root depuser account with passwordless sudo privileges.
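
If such an account does not exist yet, it can be created as follows (a sketch; any account name works as long as it is used consistently throughout the deployment):

Server console

$ sudo adduser --disabled-password --gecos "" depuser
$ sudo usermod -aG sudo depuser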

Update the Ubuntu software packages by running the following commands:

Server console

$ sudo apt-get update
$ sudo apt-get upgrade -y
$ sudo reboot

In this solution, the following lines were added to the end of /etc/sudoers:

Server console

$ sudo vim /etc/sudoers

#includedir /etc/sudoers.d

#K8s cluster deployment user with sudo privileges without password
depuser ALL=(ALL) NOPASSWD:ALL
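
Since the stock sudoers file already includes /etc/sudoers.d, an equivalent approach is to place the same rule in a drop-in file instead of editing /etc/sudoers directly, for example:

Server console

$ echo "depuser ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/depuser
$ sudo chmod 0440 /etc/sudoers.d/depuser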

NIC Firmware Upgrade

It is recommended to upgrade the NIC firmware on the worker nodes to the latest released version.
Download the mlxup firmware update and query utility to each worker node and update the NIC firmware.
The most recent version of mlxup can be downloaded from the official download page. mlxup can download and update the NIC firmware to the latest version over the Internet.
The utility requires sudo privileges:

Worker Node console

# wget http://www.mellanox.com/downloads/firmware/mlxup/4.15.2/SFX/linux_x64/mlxup
# chmod +x mlxup
# ./mlxup -online -u
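
Before updating, the firmware currently installed on each NIC can be queried first, for example:

Worker Node console

# ./mlxup --query    # list detected NVIDIA NICs and their installed firmware versions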

RDMA Subsystem Configuration

RDMA subsystem configuration is required on each worker node.

  1. Install the LLDP daemon and the RDMA core userspace libraries and daemons.

    Worker Node console


    # apt install -y lldpd rdma-core

    LLDPD is a daemon able to receive and send LLDP frames. The Link Layer Discovery Protocol (LLDP) is a vendor-neutral Layer 2 protocol that allows a network device to advertise its identity and capabilities on the local network.

  2. Identify the name of the RDMA-capable interface for high-performance K8s network.

    In this guide, the ens2f0 network interface was chosen for the high-performance K8s network; it will be activated by the NVIDIA Network Operator deployment:

    Worker Node console

    # rdma link
    link rocep7s0f0/1 state DOWN physical_state DISABLED netdev ens2f0
    link rocep7s0f1/1 state DOWN physical_state DISABLED netdev ens2f1
    link rocep131s0f0/1 state ACTIVE physical_state LINK_UP netdev ens4f0
    link rocep131s0f1/1 state DOWN physical_state DISABLED netdev ens4f1

  3. Set the RDMA subsystem network namespace mode to exclusive mode.
    Setting the RDMA subsystem network namespace mode (the netns parameter of the ib_core module) to exclusive allows network namespace isolation for RDMA workloads on the worker node servers. Create the /etc/modprobe.d/ib_core.conf configuration file to change the ib_core module parameter:

    /etc/modprobe.d/ib_core.conf

    # Set netns to exclusive mode for namespace isolation
    options ib_core netns_mode=0

    Then re-generate the initial RAM disks and reboot the servers:

    Worker Node console

    # update-initramfs -u
    # reboot

    After the server comes back, check the netns mode:

    Worker Node console

    # rdma system
    netns exclusive
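
    The mode can also be switched at runtime with the rdma tool (provided no RDMA resources are currently assigned to other network namespaces); persisting it across reboots still requires the module option above:

    Worker Node console

    # rdma system set netns exclusive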

K8s Cluster Deployment and Configuration

The Kubernetes cluster in this solution will be installed using Kubespray with a non-root depuser account from the deployment node.

SSH Private Key and SSH Passwordless Login

Log in to the deployment node as a deployment user (in this case, depuser) and create an SSH private key for configuring the passwordless authentication on your computer by running the following commands:

Deployment Node console

$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/depuser/.ssh/id_rsa):
Created directory '/home/depuser/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/depuser/.ssh/id_rsa
Your public key has been saved in /home/depuser/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:IfcjdT/spXVHVd3n6wm1OmaWUXGuHnPmvqoXZ6WZYl0 depuser@depserver
The key's randomart image is:
+---[RSA 3072]----+
| *|
| .*|
| . o . . o=|
| o + . o +E|
| S o .**O|
| . .o=OX=|
| . o%*.|
| O.o.|
| .*.ooo|
+----[SHA256]-----+

Copy your SSH private key, such as ~/.ssh/id_rsa, to all nodes in the deployment by running the following command (example):

Deployment Node console

$ ssh-copy-id depuser@192.168.222.111
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/depuser/.ssh/id_rsa.pub"
The authenticity of host '192.168.222.111 (192.168.222.111)' can't be established.
ECDSA key fingerprint is SHA256:6nhUgRlt9gY2Y2ofukUqE0ltH+derQuLsI39dFHe0Ag.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
depuser@192.168.222.111's password:

Number of key(s) added: 1

Now try logging into the machine, with: "ssh 'depuser@192.168.222.111'"
and check to make sure that only the key(s) you wanted were added.

Verify that you have passwordless SSH connectivity to all nodes in your deployment by running the following command (example):

Deployment Node console


$ ssh depuser@192.168.222.111

Kubespray Deployment and Configuration

General Setting

To install the dependencies for running Kubespray with Ansible on the deployment node, run the following commands:

Deployment Node console

$ cd ~
$ sudo apt -y install python3-pip jq
$ wget https://github.com/kubernetes-sigs/kubespray/archive/v2.15.0.tar.gz
$ tar -zxf v2.15.0.tar.gz
$ cd kubespray-2.15.0
$ sudo pip3 install -r requirements.txt

Warning

The default folder for subsequent commands is ~/kubespray-2.15.0.

Deployment Customization

Create a new cluster configuration and a hosts configuration file.
Replace the IP addresses below with your nodes' IP addresses:

Deployment Node console

$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(192.168.222.111 192.168.222.101 192.168.222.102)
$ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

As a result, the inventory/mycluster/hosts.yaml file will be created.
Review and change the host configuration in the file. Below is an example for this deployment:

inventory/mycluster/hosts.yaml

all:
  hosts:
    node1:
      ansible_host: 192.168.222.111
      ip: 192.168.222.111
      access_ip: 192.168.222.111
    node2:
      ansible_host: 192.168.222.101
      ip: 192.168.222.101
      access_ip: 192.168.222.101
    node3:
      ansible_host: 192.168.222.102
      ip: 192.168.222.102
      access_ip: 192.168.222.102
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Review and change cluster installation parameters in the files:

  • inventory/mycluster/group_vars/all/all.yml

  • inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml

In inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml, set the default Kubernetes CNI by setting the desired kube_network_plugin parameter (default: calico).

inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml

...

# Choose network plugin (cilium, calico, contiv, weave or flannel. Use cni for generic cni plugin)
# Can also be set to 'cloud', which lets the cloud provider setup appropriate routing
kube_network_plugin: calico

# Setting multi_networking to true will install Multus: https://github.com/intel/multus-cni
kube_network_plugin_multus: false

...

Choosing the Container Runtime

In this guide, containerd was chosen as the container runtime for the K8s cluster deployment, since Docker support (dockershim) is deprecated in Kubernetes.

To use the containerd container runtime, set the following variables:

  1. In inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml :

    inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml

    ...

    ## Container runtime
    ## docker for docker, crio for cri-o and containerd for containerd.
    container_manager: containerd

    ...

  2. In inventory/mycluster/group_vars/all/all.yml:

    inventory/mycluster/group_vars/all/all.yml

    ...

    ## Experimental kubeadm etcd deployment mode. Available only for new deployment
    etcd_kubeadm_enabled: true

    ...

  3. In inventory/mycluster/group_vars/etcd.yml:

    inventory/mycluster/group_vars/etcd.yml

    ...

    ## Settings for etcd deployment type
    etcd_deployment_type: host

    ...

Deploying the Cluster Using KubeSpray Ansible Playbook

Run the following line to start the deployment process:

Deployment Node console


$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

It takes a while for this deployment to complete. Please make sure no errors are encountered.

A successful result should look something like the following:

Deployment Node console

...
PLAY RECAP ***********************************************************************************************************************************************
localhost                  : ok=3    changed=0   unreachable=0   failed=0   skipped=0     rescued=0   ignored=0
node1                      : ok=554  changed=81  unreachable=0   failed=0   skipped=1152  rescued=0   ignored=2
node2                      : ok=360  changed=42  unreachable=0   failed=0   skipped=633   rescued=0   ignored=1
node3                      : ok=360  changed=42  unreachable=0   failed=0   skipped=632   rescued=0   ignored=1

Sunday 11 July 2021  22:36:04 +0000 (0:00:00.053)       0:06:51.785 ************
===============================================================================
kubernetes/kubeadm : Join to cluster ----------------------------------------------------------------------------------------------------------- 37.24s
kubernetes/control-plane : kubeadm | Initialize first master ----------------------------------------------------------------------------------- 28.29s
download_file | Download item ------------------------------------------------------------------------------------------------------------------ 16.57s
kubernetes/control-plane : Master | wait for kube-scheduler ------------------------------------------------------------------------------------ 14.23s
download_container | Download image if required ------------------------------------------------------------------------------------------------ 11.06s
download_container | Download image if required ------------------------------------------------------------------------------------------------- 9.18s
download_file | Download item ------------------------------------------------------------------------------------------------------------------- 8.61s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources ------------------------------------------------------------------------------------- 7.02s
container-engine/crictl : download_file | Download item ----------------------------------------------------------------------------------------- 5.78s
download_container | Download image if required ------------------------------------------------------------------------------------------------- 5.52s
Configure | Check if etcd cluster is healthy ---------------------------------------------------------------------------------------------------- 5.24s
download_file | Download item ------------------------------------------------------------------------------------------------------------------- 4.89s
download_container | Download image if required ------------------------------------------------------------------------------------------------- 4.81s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates -------------------------------------------------------------------------- 4.68s
reload etcd ------------------------------------------------------------------------------------------------------------------------------------- 4.65s
download_file | Download item ------------------------------------------------------------------------------------------------------------------- 4.24s
kubernetes/preinstall : Get current calico cluster version -------------------------------------------------------------------------------------- 3.70s
network_plugin/calico : Start Calico resources -------------------------------------------------------------------------------------------------- 3.42s
container-engine/crictl : extract_file | Unpacking archive -------------------------------------------------------------------------------------- 3.35s
kubernetes-apps/cluster_roles : Apply workaround to allow all nodes with cert O=system:nodes to register ---------------------------------------- 3.32s

K8s Cluster Customization

Now that the K8s cluster is deployed, connect to the K8s master node with the root user account in order to customize the deployment.

  1. Label the worker nodes.

    Master Node console

    # kubectl label nodes node2 node-role.kubernetes.io/worker=
    # kubectl label nodes node3 node-role.kubernetes.io/worker=

K8S Cluster Deployment Verification

Following is an output example of K8s cluster deployment information using the Calico CNI plugin.

To ensure that the Kubernetes cluster is installed correctly, run the following commands:

Master Node console

# kubectl get nodes -o wide
NAME    STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready    master   44m   v1.19.7   192.168.222.111   <none>        Ubuntu 20.04.2 LTS   5.4.0-72-generic   containerd://1.4.4
node2   Ready    worker   42m   v1.19.7   192.168.222.101   <none>        Ubuntu 20.04.2 LTS   5.4.0-72-generic   containerd://1.4.4
node3   Ready    worker   42m   v1.19.7   192.168.222.102   <none>        Ubuntu 20.04.2 LTS   5.4.0-72-generic   containerd://1.4.4

# kubectl -n kube-system get pods -o wide
NAME                                      READY   STATUS    RESTARTS   AGE   IP                NODE    NOMINATED NODE   READINESS GATES
calico-kube-controllers-8b5ff5d58-ph86x   1/1     Running   0          43m   192.168.222.101   node2   <none>           <none>
calico-node-l48qg                         1/1     Running   0          43m   192.168.222.102   node3   <none>           <none>
calico-node-ldx7w                         1/1     Running   0          43m   192.168.222.111   node1   <none>           <none>
calico-node-x9bh5                         1/1     Running   0          43m   192.168.222.101   node2   <none>           <none>
coredns-85967d65-pslmm                    1/1     Running   0          27m   10.233.96.1       node2   <none>           <none>
coredns-85967d65-qp2rl                    1/1     Running   0          43m   10.233.90.230     node1   <none>           <none>
dns-autoscaler-5b7b5c9b6f-8wb67           1/1     Running   0          43m   10.233.90.229     node1   <none>           <none>
etcd-node1                                1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
kube-apiserver-node1                      1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
kube-controller-manager-node1             1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
kube-proxy-6p4rm                          1/1     Running   0          44m   192.168.222.101   node2   <none>           <none>
kube-proxy-8bj6s                          1/1     Running   0          44m   192.168.222.111   node1   <none>           <none>
kube-proxy-dj4l8                          1/1     Running   0          44m   192.168.222.102   node3   <none>           <none>
kube-scheduler-node1                      1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
nginx-proxy-node2                         1/1     Running   0          44m   192.168.222.101   node2   <none>           <none>
nginx-proxy-node3                         1/1     Running   0          44m   192.168.222.102   node3   <none>           <none>
nodelocaldns-8b6kf                        1/1     Running   0          43m   192.168.222.102   node3   <none>           <none>
nodelocaldns-kzmmh                        1/1     Running   0          43m   192.168.222.101   node2   <none>           <none>
nodelocaldns-zh9fz                        1/1     Running   0          43m   192.168.222.111   node1   <none>           <none>

NVIDIA Network Operator Installation for K8S Cluster

NVIDIA Network Operator leverages Kubernetes CRDs and Operator SDK to manage networking-related components in order to enable fast networking and RDMA for workloads in K8s cluster. The Fast Network is a secondary network of the K8s cluster for applications that require high bandwidth or low latency.

To make it work, several components need to be provisioned and configured. All operator configuration and installation steps should be performed from the K8S master node with the root user account.

Prerequisites

  1. Install Helm.

    Master Node console


    # snap install helm --classic

  2. Install the additional RDMA CNI plugin.
    The RDMA CNI plugin allows network namespace isolation for RDMA workloads in a containerized environment.
    Deploy the CNI using the following YAML file:

    Master Node console


    # kubectl apply -f https://raw.githubusercontent.com/Mellanox/rdma-cni/master/deployment/rdma-cni-daemonset.yaml

    To ensure the plugin is installed correctly, run the following command:

    Master Node console

    # kubectl -n kube-system get pods -o wide | egrep "rdma"
    kube-rdma-cni-ds-5zl8d   1/1   Running   0   11m   192.168.222.102   node3   <none>   <none>
    kube-rdma-cni-ds-q74n5   1/1   Running   0   11m   192.168.222.101   node2   <none>   <none>
    kube-rdma-cni-ds-rnqkr   1/1   Running   0   11m   192.168.222.111   node1   <none>   <none>

Deployment

Add the NVIDIA Network Operator Helm repository:

Master Node console

# helm repo add mellanox https://mellanox.github.io/network-operator
# helm repo update

Create the values.yaml file in the user home folder (example):

values.yaml

nfd:
  enabled: true

sriovNetworkOperator:
  enabled: true

# NicClusterPolicy CR values:
deployCR: true

ofedDriver:
  deploy: false

nvPeerDriver:
  deploy: false

rdmaSharedDevicePlugin:
  deploy: false

sriovDevicePlugin:
  deploy: false

secondaryNetwork:
  deploy: true
  cniPlugins:
    deploy: true
    image: containernetworking-plugins
    repository: mellanox
    version: v0.8.7
    imagePullSecrets: []
  multus:
    deploy: true
    image: multus
    repository: nfvpe
    version: v3.6
    imagePullSecrets: []
    config: ''
  ipamPlugin:
    deploy: true
    image: whereabouts
    repository: mellanox
    version: v0.3
    imagePullSecrets: []

Deploy the operator:

Master Node console

# helm install -f ./values.yaml -n network-operator --create-namespace --wait mellanox/network-operator --generate-name

NAME: network-operator
LAST DEPLOYED: Sun Jul 11 23:06:54 2021
NAMESPACE: network-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get Network Operator deployed resources by running the following commands:

$ kubectl -n network-operator get pods
$ kubectl -n mlnx-network-operator-resources get pods

To ensure that the Operator is deployed correctly, run the following commands:

Master Node console

# kubectl -n network-operator get pods -o wide
NAME                                                               READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
network-operator-1627211751-5bd467cbd9-2hwqx                       1/1     Running   0          29h   10.233.90.5   node1   <none>           <none>
network-operator-1627211751-node-feature-discovery-master-dgs69    1/1     Running   0          29h   10.233.90.6   node1   <none>           <none>
network-operator-1627211751-node-feature-discovery-worker-7n6gs    1/1     Running   0          29h   10.233.90.3   node1   <none>           <none>
network-operator-1627211751-node-feature-discovery-worker-sjdxw    1/1     Running   1          29h   10.233.96.7   node2   <none>           <none>
network-operator-1627211751-node-feature-discovery-worker-vzpvg    1/1     Running   1          29h   10.233.92.5   node3   <none>           <none>
network-operator-1627211751-sriov-network-operator-5f869696sdzp    1/1     Running   0          29h   10.233.90.4   node1   <none>           <none>

High-Speed Network Configuration

After installing the operator, check the SriovNetworkNodeState CRs to see all SR-IOV-enabled devices on each node.
In this deployment, the network interface ens2f0 was chosen. To review the interface status, use the following command:

Master Node console

# kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io node2 -o yaml

...

status:
  interfaces:
  - deviceID: 101d
    driver: mlx5_core
    linkSpeed: 100000 Mb/s
    linkType: ETH
    mac: 0c:42:a1:2b:74:ae
    mtu: 1500
    name: ens2f0
    pciAddress: "0000:07:00.0"
    totalvfs: 8
    vendor: 15b3
  - deviceID: 101d
    driver: mlx5_core
    linkType: ETH
    mac: 0c:42:a1:2b:74:af
    mtu: 1500
    name: ens2f1
    pciAddress: "0000:07:00.1"
    totalvfs: 8
    vendor: 15b3

...

Create a SriovNetworkNodePolicy CR file, policy.yaml, specifying the chosen interface in the nicSelector (in this example, the ens2f0 interface):

policy.yaml

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  resourceName: mlnx2f0
  priority: 98
  mtu: 9000
  numVfs: 8
  nicSelector:
    vendor: "15b3"
    pfNames: [ "ens2f0" ]
  deviceType: netdevice
  isRdma: true

Deploy policy.yaml:

Master Node console


# kubectl apply -f policy.yaml
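
Applying the policy triggers VF creation on the selected worker nodes, which can take a few minutes. One way to follow the progress is the syncStatus field of the SriovNetworkNodeState CR, for example:

Master Node console

# kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io node2 -o jsonpath='{.status.syncStatus}'    # expect Succeeded once the node is configured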

Create a SriovNetwork CR file, network.yaml, which refers to the resourceName defined in the SriovNetworkNodePolicy (in this example, referencing the mlnx2f0 resource and setting 192.168.101.0/24 as the CIDR range for the high-speed network):

network.yaml

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: "netmlnx2f0"
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.101.0/24"
    }
  vlan: 0
  networkNamespace: "default"
  spoofChk: "off"
  resourceName: "mlnx2f0"
  linkState: "enable"
  metaPlugins: |
    {
      "type": "rdma"
    }

Deploy network.yaml:

Master Node console


# kubectl apply -f network.yaml

Validating the Deployment

Check if the deployment is finished successfully:

Master Node console

# kubectl -n nvidia-network-operator-resources get pods -o wide
NAME                   READY   STATUS    RESTARTS   AGE   IP                NODE    NOMINATED NODE   READINESS GATES
cni-plugins-ds-f548q   1/1     Running   1          30m   192.168.222.101   node2   <none>           <none>
cni-plugins-ds-qw7hx   1/1     Running   1          30m   192.168.222.102   node3   <none>           <none>
kube-multus-ds-cjbf9   1/1     Running   1          30m   192.168.222.102   node3   <none>           <none>
kube-multus-ds-rgc95   1/1     Running   1          30m   192.168.222.101   node2   <none>           <none>
whereabouts-gwr7p      1/1     Running   1          30m   192.168.222.101   node2   <none>           <none>
whereabouts-n29nq      1/1     Running   1          30m   192.168.222.102   node3   <none>           <none>

Check deployed network:

Master Node console

# kubectl get network-attachment-definitions.k8s.cni.cncf.io
NAME         AGE
netmlnx2f0   4m56s

Check worker node resources:

Master Node console

# kubectl describe nodes node2

...

Addresses:
  InternalIP:  192.168.222.101
  Hostname:    node2
Capacity:
  cpu:                  24
  ephemeral-storage:    229698892Ki
  hugepages-1Gi:        0
  hugepages-2Mi:        0
  memory:               264030604Ki
  nvidia.com/mlnx2f0:   8
  pods:                 110
Allocatable:
  cpu:                  23900m
  ephemeral-storage:    211690498517
  hugepages-1Gi:        0
  hugepages-2Mi:        0
  memory:               242694540Ki
  nvidia.com/mlnx2f0:   8
  pods:                 110

...
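
The eight VFs reported under nvidia.com/mlnx2f0 can also be observed directly on the worker node itself, for example:

Worker Node console

# cat /sys/class/net/ens2f0/device/sriov_numvfs    # number of VFs currently created on the PF
# ip link show ens2f0                              # lists the VFs with their MAC addresses and spoof-check settings
# lspci | grep -i "virtual function"               # each VF appears as a separate PCI device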

Manage HugePages

Kubernetes supports the allocation and consumption of pre-allocated HugePages by applications in a Pod. The nodes automatically discover and report all HugePages resources as schedulable resources. For additional information about K8s HugePages management, please refer here.

To allocate HugePages, modify the GRUB_CMDLINE_LINUX_DEFAULT parameter in /etc/default/grub on each worker node. The setting below allocates 1GB * 16 pages = 16GB and 2MB * 2048 pages = 4GB of HugePages at boot time:

/etc/default/grub

...

GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=2048"

...

Run update-grub to apply the configuration and reboot the server:

Worker Node console

# update-grub
# reboot

After the server comes back, check the HugePages allocation from the master node with the following command:

Master Node console

# kubectl describe nodes node2
...
Capacity:
  cpu:                  24
  ephemeral-storage:    229698892Ki
  hugepages-1Gi:        16Gi
  hugepages-2Mi:        4Gi
  memory:               264030604Ki
  nvidia.com/mlnx2f0:   8
  pods:                 110
Allocatable:
  cpu:                  23900m
  ephemeral-storage:    211690498517
  hugepages-1Gi:        16Gi
  hugepages-2Mi:        4Gi
  memory:               242694540Ki
  nvidia.com/mlnx2f0:   8
  pods:                 110
...
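
The allocation can also be confirmed locally on the worker node, for example:

Worker Node console

# grep Huge /proc/meminfo                                        # HugePages_Total and Hugepagesize for the default (1G) page size
# cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages     # number of reserved 2M pages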

Enable CPU and Topology Management

CPU Manager manages groups of CPUs and constrains workloads to specific CPUs.

CPU Manager is useful for workloads that have some of these attributes:

  • Require as much CPU time as possible

  • Are sensitive to processor cache misses

  • Are low-latency network applications

  • Coordinate with other processes and benefit from sharing a single processor cache

Topology Manager uses topology information from collected hints to decide if a pod can be accepted or rejected on a node, based on the configured Topology Manager policy and Pod resources requested. In order to extract the best performance, optimizations related to CPU isolation and memory and device locality are required.

Topology Manager is useful for workloads that use hardware accelerators to support latency-critical execution and high throughput parallel computation.

Important

To use Topology Manager, CPU Manager with static policy must be used.

For additional information, please refer to Control CPU Management Policies on the Node and Control Topology Management Policies on a Node.

In order to enable the CPU Manager and Topology Manager, add the following lines to the kubelet configuration file /etc/kubernetes/kubelet-config.yaml on each worker node:

/etc/kubernetes/kubelet-config.yaml

...
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 10s
topologyManagerPolicy: single-numa-node
featureGates:
  CPUManager: true
  TopologyManager: true

Due to the change in cpuManagerPolicy, remove /var/lib/kubelet/cpu_manager_state and restart the kubelet service on each affected K8s worker node.

Worker Node console

# rm -f /var/lib/kubelet/cpu_manager_state
# service kubelet restart
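
With the static CPU Manager policy, exclusive cores are only granted to containers in the Guaranteed QoS class, i.e. with integer CPU requests equal to their limits (as in the test deployments used later in this guide). A minimal hypothetical pod illustrating the idea (the pod name and image are placeholders):

Master Node console

# kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pinned-cpu-example
spec:
  containers:
  - name: app
    image: ubuntu:20.04
    command: ["sleep", "inf"]
    resources:
      requests:
        cpu: 4              # integer CPU count, equal to the limit -> Guaranteed QoS, exclusive cores
        memory: 1Gi
      limits:
        cpu: 4
        memory: 1Gi
EOF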

Application

DPDK traffic emulation is shown in the Testbed Flow Diagram below. Traffic is pushed from the TRex server via the ens2f0 interface to the TESTPMD pod over the SR-IOV network interface net1. The TESTPMD pod swaps the MAC addresses and routes the ingress traffic back via the same net1 interface to the same interface on the TRex server.

(Figure: Testbed flow diagram)

Verification

  1. Create a sample deployment test-deployment.yaml (container image should include InfiniBand userspace drivers and performance tools):

    test-deployment.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mlnx-inbox-pod
      labels:
        app: sriov
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: sriov
      template:
        metadata:
          labels:
            app: sriov
          annotations:
            k8s.v1.cni.cncf.io/networks: netmlnx2f0
        spec:
          containers:
          - image: < Container image >
            name: mlnx-inbox-ctr
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              requests:
                cpu: 4
                nvidia.com/mlnx2f0: 1
              limits:
                cpu: 4
                nvidia.com/mlnx2f0: 1
            command:
            - sh
            - -c
            - sleep inf

  2. Deploy the sample deployment.

    Master Node console


    # kubectl apply -f test-deployment.yaml

  3. Verify the deployment is running.

    Master Node console

    # kubectl get pod -o wide
    NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
    mlnx-inbox-pod-599dc445c8-72x6g   1/1     Running   0          12s   10.233.96.5   node2   <none>           <none>
    mlnx-inbox-pod-599dc445c8-v5lnx   1/1     Running   0          12s   10.233.92.4   node3   <none>           <none>

  4. Check available network interfaces in POD.

    Master Node console

    # kubectl exec -it mlnx-inbox-pod-599dc445c8-72x6g -- bash

    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# rdma link
    link rocep7s0f0v2/1 state ACTIVE physical_state LINK_UP netdev net1

    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# ip a s
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
        link/ipip 0.0.0.0 brd 0.0.0.0
    4: eth0@if208: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
        link/ether 12:51:ab:b3:ef:26 brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet 10.233.96.5/32 brd 10.233.96.5 scope global eth0
           valid_lft forever preferred_lft forever
        inet6 fe80::1051:abff:feb3:ef26/64 scope link
           valid_lft forever preferred_lft forever
    201: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
        link/ether 02:40:7d:5e:5f:af brd ff:ff:ff:ff:ff:ff
        inet 192.168.101.2/24 brd 192.168.101.255 scope global net1
           valid_lft forever preferred_lft forever
        inet6 fe80::40:7dff:fe5e:5faf/64 scope link
           valid_lft forever preferred_lft forever

  5. Run a synthetic RDMA benchmark with ib_write_bw, a bandwidth test that uses RDMA write transactions.

    Server

    ib_write_bw -F -d $IB_DEV_NAME --report_gbits

    Client

    ib_write_bw -F $SERVER_IP -d $IB_DEV_NAME --report_gbits

    Please open two consoles to the K8s master node: one for the server side and one for the client side.

    In the first console (server side), run the following commands:

    Master Node console

    # kubectl exec -it mlnx-inbox-pod-599dc445c8-72x6g -- bash
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# ip a s net1 | grep inet
        inet 192.168.101.2/24 brd 192.168.101.255 scope global net1
        inet6 fe80::40:7dff:fe5e:5faf/64 scope link
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# rdma link
    link rocep7s0f0v2/1 state ACTIVE physical_state LINK_UP netdev net1
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# ib_write_bw -F -d rocep7s0f0v2 --report_gbits

    ************************************
    * Waiting for client to connect... *
    ************************************

    In the second console (client side), run the following commands:

    Master Node console

    # kubectl exec -it mlnx-inbox-pod-599dc445c8-v5lnx -- bash
    root@mlnx-inbox-pod-599dc445c8-v5lnx:/tmp# rdma link
    link rocep7s0f0v3/1 state ACTIVE physical_state LINK_UP netdev net1
    root@mlnx-inbox-pod-599dc445c8-v5lnx:/tmp# ib_write_bw -F -d rocep7s0f0v3 192.168.101.2 --report_gbits
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF          Device         : rocep7s0f0v3
     Number of qps   : 1            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : Ethernet
     GID index       : 2
     Max inline data : 0[B]
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0000 QPN 0x01f2 PSN 0x75e7cf RKey 0x050e26 VAddr 0x007f51e51b9000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:01
     remote address: LID 0000 QPN 0x00f2 PSN 0x13427f RKey 0x010e26 VAddr 0x007f1ecaac8000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:02
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
     65536      5000           94.26              92.87                0.169509
    ---------------------------------------------------------------------------------------
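
    Latency over the same path can be checked in a similar way with the ib_write_lat tool from the same perftest package (a sketch, using the device names shown above):

    Master Node console

    # ib_write_lat -F -d rocep7s0f0v2                    # run inside the first pod (server side)
    # ib_write_lat -F -d rocep7s0f0v3 192.168.101.2      # run inside the second pod (client side)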

TRex Server Deployment

This guide uses TRex package v2.87.
For a detailed TRex installation and configuration guide, please refer to the TRex Documentation.

The TRex installation and configuration steps are performed with the root user account.

Prerequisites

For the TRex server, a standard server with the RDMA subsystem installed is used.

Activate the network interfaces used by the TRex application with netplan.
In this deployment, interfaces ens2f0 and ens2f1 are used:

/etc/netplan/00-installer-config.yaml

# This is the network config written by 'subiquity'
network:
  ethernets:
    ens4f0:
      dhcp4: true
      dhcp-identifier: mac
    ens2f0: {}
    ens2f1: {}
  version: 2

Then re-apply netplan and check the link status of the ens2f0/ens2f1 network interfaces.

TRex server console

# netplan apply
# rdma link
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens2f0
link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev ens2f1
link mlx5_2/1 state ACTIVE physical_state LINK_UP netdev ens4f0
link mlx5_3/1 state DOWN physical_state DISABLED netdev ens4f1

Update the MTU size for interfaces ens2f0 and ens2f1.

TRex server console

# ip link set ens2f0 mtu 9000
# ip link set ens2f1 mtu 9000

Installation

Create a TRex working directory and obtain the TRex package.

TRex server console

# cd /tmp
# wget https://trex-tgn.cisco.com/trex/release/v2.87.tar.gz --no-check-certificate
# mkdir /scratch
# cd /scratch
# tar -zxf /tmp/v2.87.tar.gz
# chmod 777 -R /scratch

First-Time Scripts

The next steps continue from the folder /scratch/v2.87.

Run the TRex configuration script in interactive mode and follow the on-screen instructions to create a basic configuration file, /etc/trex_cfg.yaml:

TRex server console


# ./dpdk_setup_ports.py -i

The /etc/trex_cfg.yaml configuration file is now created. It will be changed later to suit this setup.

Appendix

Performance Testing

Below is a performance test of DPDK traffic emulation between the TRex traffic generator and the TESTPMD application running on a K8s worker node, in accordance with the testbed diagram presented above.

Prerequisites

Before starting the test, update the TRex configuration file /etc/trex_cfg.yaml with the MAC address of the high-performance interface of the TESTPMD pod. Below are the steps to complete this update.

  1. Run a pod with the TESTPMD application on the K8s cluster according to the YAML configuration file testpmd-inbox.yaml presented below (the container image should include InfiniBand userspace drivers and the TESTPMD application):

    testpmd-inbox.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test-deployment
      labels:
        app: test
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: test
      template:
        metadata:
          labels:
            app: test
          annotations:
            k8s.v1.cni.cncf.io/networks: netmlnx2f0
        spec:
          containers:
          - image: < container image >
            name: test-pod
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            volumeMounts:
            - mountPath: /hugepages
              name: hugepage
            resources:
              requests:
                hugepages-1Gi: 2Gi
                memory: 16Gi
                cpu: 8
                nvidia.com/mlnx2f0: 1
              limits:
                hugepages-1Gi: 2Gi
                memory: 16Gi
                cpu: 8
                nvidia.com/mlnx2f0: 1
            command:
            - sh
            - -c
            - sleep inf
          volumes:
          - name: hugepage
            emptyDir:
              medium: HugePages

    Deploy the deployment with the following command:

    Master Node console


    # kubectl apply -f testpmd-inbox.yaml

  2. Get the network information from the deployed pod by running the following:

    Master Node console

    # kubectl get pod -o wide
    NAME                               READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
    test-deployment-676476c78d-glbfs   1/1     Running   0          30s   10.233.92.5   node3   <none>           <none>

    # kubectl exec -it test-deployment-676476c78d-glbfs -- ip a s net1
    193: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
        link/ether 32:f9:3f:e3:dc:89 brd ff:ff:ff:ff:ff:ff
        inet 192.168.101.3/24 brd 192.168.101.255 scope global net1
           valid_lft forever preferred_lft forever
        inet6 fe80::30f9:3fff:fee3:dc89/64 scope link
           valid_lft forever preferred_lft forever

  3. Update the TRex configuration file /etc/trex_cfg.yaml with the MAC address of the net1 network interface (32:f9:3f:e3:dc:89):

    /etc/trex_cfg.yaml

    ### Config file generated by dpdk_setup_ports.py ###

    - version: 2
      interfaces: ['07:00.0', '0d:00.0']
      port_info:
          - dest_mac: 32:f9:3f:e3:dc:89 # MAC OF NET1 INTERFACE
            src_mac:  0c:42:a1:24:05:1a
          - dest_mac: 32:f9:3f:e3:dc:89 # MAC OF NET1 INTERFACE
            src_mac:  0c:42:a1:24:05:1b

      platform:
          master_thread_id: 0
          latency_thread_id: 12
          dual_if:
            - socket: 0
              threads: [1,2,3,4,5,6,7,8,9,10,11]

DPDK Emulation Test

  1. Run the TESTPMD application in the container:

    Master Node console

    # kubectl exec -it test-deployment-676476c78d-glbfs -- bash
    root@test-deployment-676476c78d-glbfs:/tmp# dpdk-testpmd -c 0x1fe -m 1024 -w $PCIDEVICE_NVIDIA_COM_MLNX2F0 -- --burst=64 --txd=1024 --rxd=1024 --mbcache=512 --rxq=8 --txq=8 --nb-cores=4 --rss-udp --forward-mode=macswap -a -i
    ...
    testpmd>

    Warning

    Specific TESTPMD parameters:

    $PCIDEVICE_NVIDIA_COM_MLNX2F0 - an environment variable, injected by the SR-IOV device plugin, that holds the PCI address of the net1 interface

    More information about additional TESTPMD parameters:
    https://doc.dpdk.org/guides/testpmd_app_ug/run_app.html?highlight=testpmd
    https://doc.dpdk.org/guides/linux_gsg/linux_eal_parameters.html

  2. Run the TRex traffic generator on the TRex server:

    TRex server console

    # cd /scratch/v2.87/
    # ./t-rex-64 -v 7 -i -c 11 --no-ofed-check

    Open a second console to the TRex server and create a traffic generation file, mlnx-trex.py, in the folder /scratch/v2.87:

    mlnx-trex.py

    from trex_stl_lib.api import *

    class STLS1(object):

        def create_stream (self):
            pkt = Ether()/IP(src="16.0.0.1",dst="48.0.0.1")/UDP(dport=12)/(22*'x')

            vm = STLScVmRaw( [
                    STLVmFlowVar(name="v_port",
                                 min_value=4337,
                                 max_value=5337,
                                 size=2, op="inc"),
                    STLVmWrFlowVar(fv_name="v_port",
                                   pkt_offset= "UDP.sport" ),
                    STLVmFixChecksumHw(l3_offset="IP",l4_offset="UDP",l4_type=CTRexVmInsFixHwCs.L4_TYPE_UDP),
                ]
            )

            return STLStream(packet = STLPktBuilder(pkt = pkt ,vm = vm ) ,
                             mode = STLTXCont(pps = 8000000) )

        def get_streams (self, direction = 0, **kwargs):
            # create 1 stream
            return [ self.create_stream() ]

    # dynamic load - used for trex console or simulator
    def register():
        return STLS1()

    Then run the TRex console and generate traffic to the TESTPMD pod:

    TRex server console

    # cd /scratch/v2.87/
    # ./trex-console
    Using 'python3' as Python interpeter

    Connecting to RPC server on localhost:4501                   [SUCCESS]

    Connecting to publisher server on localhost:4500             [SUCCESS]

    Acquiring ports [0, 1]:                                      [SUCCESS]

    Server Info:
    Server version:   v2.87 @ STL
    Server mode:      Stateless
    Server CPU:       11 x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
    Ports count:      2 x 100Gbps @ MT2892 Family [ConnectX-6 Dx]

    -=TRex Console v3.0=-

    Type 'help' or '?' for supported actions

    trex> tui<enter>
    ...
    tui> start -f mlnx-trex.py -m 45mpps -p 0
    ...
    Global Statistitcs

    connection   : localhost, Port 4501                   total_tx_L2  : 23.9 Gbps
    version      : STL @ v2.87                            total_tx_L1  : 30.93 Gbps
    cpu_util.    : 82.88% @ 11 cores (11 per dual port)   total_rx     : 25.31 Gbps
    rx_cpu_util. : 0.0% / 0 pps                           total_pps    : 44.84 Mpps
    async_util.  : 0.05% / 11.22 Kbps                     drop_rate    : 0 bps
    total_cps.   : 0 cps                                  queue_full   : 0 pkts
    ...

    Summary

    From the above test, it is evident that the desired traffic rate of 45 Mpps is achieved with an SR-IOV network port in the pod.

    Warning

    In order to get better results, additional application tuning is required for TRex and TESTPMD.

    Done!

Authors


Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference designs guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.


Amir Zeidner

For the past several years, Amir has worked as a Solutions Architect primarily in the Telco space, leading advanced solutions to answer 5G, NFV, and SDN networking infrastructures requirements. Amir’s expertise in data plane acceleration technologies, such as Accelerated Switching and Network Processing (ASAP²) and DPDK, together with a deep knowledge of open source cloud-based infrastructures, allows him to promote and deliver unique end-to-end NVIDIA Networking solutions throughout the Telco world.


Last updated on Sep 12, 2023.