



Created on July 7, 2021.

Scope

The following Reference Deployment Guide (RDG) explains how to build a high performing Kubernetes (K8s) cluster with containerd container runtime that is capable of running DPDK-based applications over NVIDIA Networking end-to-end Ethernet infrastructure. 
This RDG describes a solution with multiple servers connected to a single switch that provides secondary network for the Kubernetes cluster. A more complex scale-out network topology of multiple L2 domains is beyond the scope of this document.

Abbreviations and Acronyms

Term     Definition
CNI      Container Network Interface
CR       Custom Resources
CRD      Custom Resources Definition
CRI      Container Runtime Interface
DHCP     Dynamic Host Configuration Protocol
DNS      Domain Name System
DP       Device Plugin
DPDK     Data Plane Development Kit
EVPN     Ethernet VPN
HWE      Hardware Enablement
IPAM     IP Address Management
K8s      Kubernetes
LLDP     Link Layer Discovery Protocol
NFD      Node Feature Discovery
OCI      Open Container Initiative
PF       Physical Function
QSG      Quick Start Guide
RDG      Reference Deployment Guide
RDMA     Remote Direct Memory Access
RoCE     RDMA over Converged Ethernet
SR-IOV   Single Root Input Output Virtualization
VF       Virtual Function
VPN      Virtual Private Network
VXLAN    Virtual eXtensible Local Area Network

Introduction

Provisioning a Kubernetes cluster with the containerd container runtime for running DPDK-based workloads can become an extremely complicated task.
Proper design and software and hardware component selection may become a gating task toward successful deployment.
This guide provides a complete solution cycle, including technology overview, design, component selection, and deployment steps.
The solution is delivered on top of standard servers over the NVIDIA end-to-end Ethernet infrastructure.
In this document, we use the new NVIDIA Network Operator, which is in charge of deploying and configuring the SR-IOV Device Plugin and SR-IOV CNI. These components allow DPDK workloads to run on a Kubernetes worker node.


Solution Architecture

Key Components and Technologies

  • NVIDIA ConnectX SmartNICs
    10/25/40/50/100/200 and 400G Ethernet Network Adapters
    The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offer advanced hardware offloads and accelerations.
    NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.

  • NVIDIA LinkX Cables 
    The NVIDIA® LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400GbE in Ethernet and 100, 200 and 400Gb/s InfiniBand products for Cloud, HPC, hyperscale, Enterprise, telco, storage and artificial intelligence, data center applications.

  • NVIDIA Spectrum Ethernet Switches
    Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.
    Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects. 
    NVIDIA combines the benefits of NVIDIA Spectrum switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux, SONiC, and NVIDIA Onyx®.

  • NVIDIA Cumulus Linux 
    NVIDIA® Cumulus® Linux is the industry's most innovative open network operating system that allows you to automate, customize, and scale your data center network like no other.

  • RDMA 
    RDMA is a technology that allows computers in a network to exchange data without involving the processor, cache, or operating system of either computer. Like locally based DMA, RDMA improves throughput and performance and frees up compute resources.

  • Kubernetes
    Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.

  • Kubespray 
    Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks and provides:
    • A highly available cluster
    • Composable attributes
    • Support for most popular Linux distributions

  • NVIDIA Network Operator
    An analog to the NVIDIA GPU Operator, the NVIDIA Network Operator simplifies scale-out network design for Kubernetes by automating aspects of network deployment and configuration that would otherwise require manual work. It loads the required drivers, libraries, device plugins, and CNIs on any cluster node with an NVIDIA network interface. Paired with the NVIDIA GPU Operator, the Network Operator enables GPUDirect RDMA, a key technology that accelerates cloud-native AI workloads by orders of magnitude. The NVIDIA Network Operator uses Kubernetes CRDs and the Operator Framework to provision the host software needed for enabling accelerated networking.

  • What is containerd?
    An industry-standard container runtime with an emphasis on simplicity, robustness and portability. containerd is available as a daemon for Linux and Windows. It manages the complete container lifecycle of its host system, from image transfer and storage to container execution and supervision to low-level storage to network attachments and beyond.

  • NVIDIA PMDs
    NVIDIA Poll Mode Driver (PMD) is an open-source upstream driver, embedded within dpdk.org releases, designed for fast packet processing and low latency by providing kernel bypass for receive and send and by avoiding the interrupt processing performance overhead.

  • TRex—Realistic Traffic Generator
    TRex is an open-source stateful and stateless traffic generator fueled by DPDK. It generates L3-7 traffic and provides, in a single tool, capabilities found in commercial traffic generators. TRex can scale up to 200Gb/sec with one server.

Logical Design

The logical design includes the following parts:

  1. Deployment node running Kubespray that deploys Kubernetes clusters.

  2. K8s master node running all Kubernetes management components.

  3. K8s worker nodes. 

  4. TRex server.

  5. High-speed Ethernet fabric for the DPDK tenant network.

  6. Deployment and K8s management network.

Fabric Design

The high-performance network is a secondary network for the Kubernetes cluster and requires an L2 network topology.

This RDG describes a solution with multiple servers connected to a single switch that provides secondary network for the Kubernetes cluster.

A more complex scale-out network topology of multiple L2 domains is beyond the scope of this document. 

Software Stack Components

Bill of Materials

The following hardware setup is utilized in this guide.

The above table does not contain the management network connectivity components.

Deployment and Configuration

Wiring

On each K8s worker node and TRex server, the first port of each NVIDIA Network Adapter is wired to the NVIDIA switch in high-performance fabric using NVIDIA LinkX DAC cables.

Deployment and Management network is part of IT infrastructure and is not covered in this guide.


Fabric

Prerequisites

  • High-performance Ethernet fabric
    • Single switch
      NVIDIA SN2100

    • Switch OS
      Cumulus Linux v4.2.1

  • Deployment and management network
    DNS and DHCP network services and network topology are part of the IT infrastructure. The component installation and configuration are not covered in this guide.

Network Configuration

Below are the server names with their relevant network configurations.


Server/Switch type    Server/Switch name    High-speed network                      Management network (1/25 GbE)
Deployment node       depserver             -                                       ens4f0: DHCP, 192.168.222.110
Master node           node1                 -                                       ens4f0: DHCP, 192.168.222.111
Worker node           node2                 ens2f0: no IP set                       ens4f0: DHCP, 192.168.222.101
Worker node           node3                 ens2f0: no IP set                       ens4f0: DHCP, 192.168.222.102
TRex server           node4                 ens2f0: no IP set, ens2f1: no IP set    ens4f0: DHCP, 192.168.222.103
High-speed switch     leaf01                -                                       mgmt0: DHCP, 192.168.222.201
ensXf0 high-speed network interfaces do not require additional configuration.
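
The management interfaces (ens4f0) obtain their addresses from DHCP. For reference, a minimal netplan sketch for a node's management interface is shown below; the file name 00-installer-config.yaml and the interface name are examples taken from this setup and should be adjusted to your environment.

/etc/netplan/00-installer-config.yaml
network:
  version: 2
  ethernets:
    ens4f0:
      dhcp4: true
      dhcp-identifier: mac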

Fabric Configuration

This solution is based on the Cumulus Linux v4.2.1 switch operating system.

Intermediate-level Linux knowledge is assumed for this guide. Familiarity with basic text editing, Linux file permissions, and process monitoring is required. A variety of text editors are pre-installed, including vi and nano.

Networking engineers who are unfamiliar with Linux concepts should refer to this reference guide to compare the Cumulus Linux CLI and configuration options and their equivalent Cisco Nexus 3000 NX-OS commands and settings. There is also a series of short videos with an introduction to Linux and Cumulus-Linux-specific concepts. 

A Greenfield deployment is assumed for this guide. Please refer to the following guide for Upgrading Cumulus Linux.

Fabric configuration steps:

  1. Administratively enable all physical ports.

  2. Create a bridge and configure one or more front panel ports as members of the bridge.

  3. Commit configuration.

Switch configuration steps.

Switch console
Linux swx-mld-l03 4.19.0-cl-1-amd64 #1 SMP Cumulus 4.19.94-1+cl4.2.1u1 (2020-08-28) x86_64
 
Welcome to NVIDIA Cumulus (R) Linux (R)
 
For support and online technical documentation, visit
http://www.cumulusnetworks.com/support
 
The registered trademark Linux (R) is used pursuant to a sublicense from LMI,
the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide
basis.
 
cumulus@leaf01:mgmt:~$ net add interface swp1-16
cumulus@leaf01:mgmt:~$ net add bridge bridge ports swp1-16
cumulus@leaf01:mgmt:~$ net commit

To view link status, use the net show interface all command. The following example shows the output of ports in admin down, down, and up modes.

Switch console
cumulus@leaf01:mgmt:~$ net show interface all
State  Name    Spd   MTU    Mode        LLDP                             Summary
-----  ------  ----  -----  ----------  -------------------------------  ------------------------
UP     lo      N/A   65536  Loopback                                     IP: 127.0.0.1/8
       lo                                                                IP: ::1/128
UP     eth0    1G    1500   Mgmt        mgmt-xxx-xxx-xxx-xxx (8)         Master: mgmt(UP)
       eth0                                                              IP: 192.168.222.201/24(DHCP)
UP     swp1    100G  9216   Access/L2                                    Master: bridge(UP)
UP     swp2    100G  9216   Access/L2   node2 (0c:42:a1:2b:74:ae)        Master: bridge(UP)
UP     swp3    100G  9216   Access/L2                                    Master: bridge(UP)
UP     swp4    100G  9216   Access/L2   node3 (0c:42:a1:24:05:4a)        Master: bridge(UP)
UP     swp5    100G  9216   Access/L2                                    Master: bridge(UP)
UP     swp6    100G  9216   Access/L2   node4 (0c:42:a1:24:05:1a)        Master: bridge(UP)
UP     swp7    100G  9216   Access/L2                                    Master: bridge(UP)
UP     swp8    100G  9216   Access/L2   node4 (0c:42:a1:24:05:1b)        Master: bridge(UP)
DN     swp9    N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp10   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp11   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp12   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp13   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp14   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp15   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp16   N/A   9216   Access/L2                                    Master: bridge(UP)
UP     bridge  N/A   9216   Bridge/L2
UP     mgmt    N/A   65536  VRF                                          IP: 127.0.0.1/8
       mgmt                                                              IP: ::1/128
 

Nodes Configuration

General Prerequisites:

  • Hardware
    All the K8s worker nodes have the same hardware specification (see BoM for details).

  • Host BIOS
    Verify that SR-IOV supported server platform is being used and review the BIOS settings in the server platform vendor documentation to enable SR-IOV in the BIOS.

  • Host OS
    Ubuntu Server 20.04 operating system should be installed on all servers with OpenSSH server packages.

  • Experience with Kubernetes
    Familiarization with the Kubernetes Cluster architecture is essential. 

Make sure that the BIOS settings on the worker node servers have SR-IOV enabled and that the servers are tuned for maximum performance.

All worker nodes must have the same PCIe placement for the NIC and expose the same interface name.
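
One way to confirm this (a quick sketch, assuming NVIDIA/Mellanox adapters with PCI vendor ID 15b3) is to compare the adapter PCIe addresses and the interface-to-PCIe mapping on each worker node:

Worker Node console
# lspci -d 15b3:
# ls -l /sys/class/net/

The PCIe addresses reported by lspci and the device symlinks under /sys/class/net/ should be identical across all worker nodes.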


Host OS Prerequisites 

Make sure the Ubuntu Server 20.04 operating system is installed on all servers with the OpenSSH server package, and create a non-root depuser account with passwordless sudo privileges.
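
If the depuser account does not exist yet, it can be created as follows (a minimal sketch; adding the user to the sudo group is optional here, since the /etc/sudoers entry shown below already grants passwordless sudo):

Server console
$ sudo adduser depuser
$ sudo usermod -aG sudo depuser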

Update the Ubuntu software packages by running the following commands:

Server console
$ sudo apt-get update
$ sudo apt-get upgrade -y
$ sudo reboot 

In this solution, we added the following line to the end of the /etc/sudoers file:

Server console
$ sudo vim /etc/sudoers
  
#includedir /etc/sudoers.d
  
#K8s cluster deployment user with sudo privileges without password
depuser ALL=(ALL) NOPASSWD:ALL

NIC Firmware Upgrade

It is recommended to upgrade the NIC firmware on the worker nodes to the latest released version.
Download the mlxup firmware update and query utility to each worker node and update the NIC firmware.
The most recent version of mlxup can be downloaded from the official download page. mlxup can download and apply the latest NIC firmware over the Internet.
Running the utility requires sudo privileges:

Worker Node console
# wget http://www.mellanox.com/downloads/firmware/mlxup/4.15.2/SFX/linux_x64/mlxup
# chmod +x mlxup
# ./mlxup -online -u
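
Optionally, the firmware version currently running on each adapter can be cross-checked per interface with ethtool (ens2f0 in this setup):

Worker Node console
# ethtool -i ens2f0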

RDMA Subsystem Configuration

RDMA subsystem configuration is required on each worker node.

  1. Install the LLDP daemon and the RDMA core userspace libraries and daemons.

    Worker Node console
    # apt install -y lldpd rdma-core

    LLDPD is a daemon able to receive and send LLDP frames. The Link Layer Discovery Protocol (LLDP) is a vendor-neutral Layer 2 protocol that allows a network device to advertise its identity and capabilities on the local network. A quick LLDP verification example is shown after this procedure.

  2. Identify the name of the RDMA-capable interface for high-performance K8s network.
    In this guide, ens2f0 network interface for high-performance K8s network was chosen and will be activated by NVIDIA Network Operator deployment:

    Worker Node console
    # rdma link
    link rocep7s0f0/1 state DOWN physical_state DISABLED netdev ens2f0 
    link rocep7s0f1/1 state DOWN physical_state DISABLED netdev ens2f1  
    link rocep131s0f0/1 state ACTIVE physical_state LINK_UP netdev ens4f0 
    link rocep131s0f1/1 state DOWN physical_state DISABLED netdev ens4f1 
  3. Set the RDMA subsystem network namespace mode to exclusive mode.
    Setting the RDMA subsystem network namespace mode (the netns_mode parameter of the ib_core module) to exclusive allows network namespace isolation for RDMA workloads on the worker node servers. Create the /etc/modprobe.d/ib_core.conf configuration file to change the ib_core module parameters:

    /etc/modprobe.d/ib_core.conf
    # Set netns to exclusive mode for namespace isolation
    options ib_core netns_mode=0

    Then re-generate the initial RAM disks and reboot servers:

    Worker Node console
    # update-initramfs -u
    # reboot

    After the server comes back, check netns mode:

    Worker Node console
    # rdma system
     
    netns exclusive
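
Since lldpd was installed in step 1, the cabling to the high-speed switch can optionally be verified from each worker node by listing the LLDP neighbors; the reported switch port should match the output shown in the Fabric Configuration section:

Worker Node console
# lldpcli show neighbors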

K8s Cluster Deployment and Configuration

The Kubernetes cluster in this solution will be installed using Kubespray with a non-root depuser account from the deployment node.

SSH Private Key and SSH Passwordless Login

Log in to the deployment node as the deployment user (in this case, depuser) and generate an SSH key pair for configuring passwordless authentication by running the following command:

Deployment Node console
$ ssh-keygen 
Generating public/private rsa key pair.
Enter file in which to save the key (/home/depuser/.ssh/id_rsa): 
Created directory '/home/depuser/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/depuser/.ssh/id_rsa
Your public key has been saved in /home/depuser/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:IfcjdT/spXVHVd3n6wm1OmaWUXGuHnPmvqoXZ6WZYl0 depuser@depserver
The key's randomart image is:
+---[RSA 3072]----+
|                *|
|               .*|
|      . o . .  o=|
|       o + . o +E|
|        S o  .**O|
|         . .o=OX=|
|           . o%*.|
|             O.o.|
|           .*.ooo|
+----[SHA256]-----+

Copy your SSH public key (for example, ~/.ssh/id_rsa.pub) to all nodes in the deployment by running the following command (example):

Deployment Node console
$ ssh-copy-id depuser@192.168.222.111
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/depuser/.ssh/id_rsa.pub"
The authenticity of host '192.168.222.111 (192.168.222.111)' can't be established.
ECDSA key fingerprint is SHA256:6nhUgRlt9gY2Y2ofukUqE0ltH+derQuLsI39dFHe0Ag.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
depuser@192.168.222.111's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'depuser@192.168.222.111'"
and check to make sure that only the key(s) you wanted were added.

Verify that you have passwordless SSH connectivity to all nodes in your deployment by running the following command (example):

Deployment Node console
$ ssh depuser@192.168.222.111

Kubespray Deployment and Configuration

General Setting

To install the dependencies for running Kubespray with Ansible on the deployment node, run the following commands:

Deployment Node console
$ cd ~
$ sudo apt -y install python3-pip jq
$ wget https://github.com/kubernetes-sigs/kubespray/archive/v2.15.0.tar.gz
$ tar -zxf v2.15.0.tar.gz
$ cd kubespray-2.15.0
$ sudo pip3 install -r requirements.txt

The default folder for subsequent commands is ~/kubespray-2.15.0.

Deployment Customization

Create a new cluster configuration and host configuration file.
Replace the IP addresses below with your nodes' IP addresses:

Deployment Node console
$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(192.168.222.111 192.168.222.101 192.168.222.102)
$ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

As a result, the inventory/mycluster/hosts.yaml file will be created.
Review and change the host configuration in the file. Below is an example for this deployment:

inventory/mycluster/hosts.yaml
all:
  hosts:
    node1:
      ansible_host: 192.168.222.111
      ip: 192.168.222.111
      access_ip: 192.168.222.111
    node2:
      ansible_host: 192.168.222.101
      ip: 192.168.222.101
      access_ip: 192.168.222.101
    node3:
      ansible_host: 192.168.222.102
      ip: 192.168.222.102
      access_ip: 192.168.222.102
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Review and change cluster installation parameters in the files:

  • inventory/mycluster/group_vars/all/all.yml 
  • inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml

In inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml, set the default Kubernetes CNI by setting the kube_network_plugin parameter to the desired value (default: calico).

inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml
...
 
# Choose network plugin (cilium, calico, contiv, weave or flannel. Use cni for generic cni plugin)
# Can also be set to 'cloud', which lets the cloud provider setup appropriate routing
kube_network_plugin: calico
 
# Setting multi_networking to true will install Multus: https://github.com/intel/multus-cni
kube_network_plugin_multus: false
 
...

Choosing the Container Runtime

In this guide, containerd was chosen as the default container runtime for the K8s cluster deployment because Docker support in Kubernetes (dockershim) is deprecated.
To use the containerd container runtime, set the following variables:

  1. In inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml:

    inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml
    ...
     
    ## Container runtime
    ## docker for docker, crio for cri-o and containerd for containerd.
    container_manager: containerd
    
    ...
  2. In inventory/mycluster/group_vars/all/all.yml:

    inventory/mycluster/group_vars/all/all.yml
    ...
    
    ## Experimental kubeadm etcd deployment mode. Available only for new deployment
    etcd_kubeadm_enabled: true
    
    ...


  3. In inventory/mycluster/group_vars/etcd.yml:

    inventory/mycluster/group_vars/etcd.yml
    ...
    
    ## Settings for etcd deployment type
    etcd_deployment_type: host
    
    ...


Deploying the Cluster Using KubeSpray Ansible Playbook

Run the following line to start the deployment process:

Deployment Node console
$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

It takes a while for this deployment to complete. Please make sure no errors are encountered.

A successful result should look something like the following:

Deployment Node console
...
PLAY RECAP ***********************************************************************************************************************************************************************************
localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
node1                      : ok=554  changed=81   unreachable=0    failed=0    skipped=1152 rescued=0    ignored=2   
node2                      : ok=360  changed=42   unreachable=0    failed=0    skipped=633  rescued=0    ignored=1   
node3                      : ok=360  changed=42   unreachable=0    failed=0    skipped=632  rescued=0    ignored=1   

Sunday 11 July 2021  22:36:04 +0000 (0:00:00.053)      0:06:51.785 ************ 
=============================================================================== 
kubernetes/kubeadm : Join to cluster ------------------------------------------------------------------------------------------------------------------------------------------------- 37.24s
kubernetes/control-plane : kubeadm | Initialize first master ------------------------------------------------------------------------------------------------------------------------- 28.29s
download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 16.57s
kubernetes/control-plane : Master | wait for kube-scheduler -------------------------------------------------------------------------------------------------------------------------- 14.23s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 11.06s
download_container | Download image if required --------------------------------------------------------------------------------------------------------------------------------------- 9.18s
download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------- 8.61s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources --------------------------------------------------------------------------------------------------------------------------- 7.02s
container-engine/crictl : download_file | Download item ------------------------------------------------------------------------------------------------------------------------------- 5.78s
download_container | Download image if required --------------------------------------------------------------------------------------------------------------------------------------- 5.52s
Configure | Check if etcd cluster is healthy ------------------------------------------------------------------------------------------------------------------------------------------ 5.24s
download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------- 4.89s
download_container | Download image if required --------------------------------------------------------------------------------------------------------------------------------------- 4.81s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates ---------------------------------------------------------------------------------------------------------------- 4.68s
reload etcd --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.65s
download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------- 4.24s
kubernetes/preinstall : Get current calico cluster version ---------------------------------------------------------------------------------------------------------------------------- 3.70s
network_plugin/calico : Start Calico resources ---------------------------------------------------------------------------------------------------------------------------------------- 3.42s
container-engine/crictl : extract_file | Unpacking archive ---------------------------------------------------------------------------------------------------------------------------- 3.35s
kubernetes-apps/cluster_roles : Apply workaround to allow all nodes with cert O=system:nodes to register ------------------------------------------------------------------------------ 3.32s

K8s Cluster Customization

Now that the K8s cluster is deployed, connect to the K8s master node with the root user account in order to customize the deployment.

  1. Label the worker nodes.

    Master Node console
    # kubectl label nodes node2 node-role.kubernetes.io/worker=
    # kubectl label nodes node3 node-role.kubernetes.io/worker=

K8S Cluster Deployment Verification

Following is an output example of K8s cluster deployment information using the Calico CNI plugin.

To ensure that the Kubernetes cluster is installed correctly, run the following commands:

Master Node console
# kubectl get nodes -o wide
NAME    STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready    master   44m   v1.19.7   192.168.222.111   <none>        Ubuntu 20.04.2 LTS   5.4.0-72-generic   containerd://1.4.4
node2   Ready    worker   42m   v1.19.7   192.168.222.101   <none>        Ubuntu 20.04.2 LTS   5.4.0-72-generic   containerd://1.4.4
node3   Ready    worker   42m   v1.19.7   192.168.222.102   <none>        Ubuntu 20.04.2 LTS   5.4.0-72-generic   containerd://1.4.4

# kubectl -n kube-system get pods -o wide
NAME                                      READY   STATUS    RESTARTS   AGE   IP                NODE    NOMINATED NODE   READINESS GATES
calico-kube-controllers-8b5ff5d58-ph86x   1/1     Running   0          43m   192.168.222.101   node2   <none>           <none>
calico-node-l48qg                         1/1     Running   0          43m   192.168.222.102   node3   <none>           <none>
calico-node-ldx7w                         1/1     Running   0          43m   192.168.222.111   node1   <none>           <none>
calico-node-x9bh5                         1/1     Running   0          43m   192.168.222.101   node2   <none>           <none>
coredns-85967d65-pslmm                    1/1     Running   0          27m   10.233.96.1       node2   <none>           <none>
coredns-85967d65-qp2rl                    1/1     Running   0          43m   10.233.90.230     node1   <none>           <none>
dns-autoscaler-5b7b5c9b6f-8wb67           1/1     Running   0          43m   10.233.90.229     node1   <none>           <none>
etcd-node1                                1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
kube-apiserver-node1                      1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
kube-controller-manager-node1             1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
kube-proxy-6p4rm                          1/1     Running   0          44m   192.168.222.101   node2   <none>           <none>
kube-proxy-8bj6s                          1/1     Running   0          44m   192.168.222.111   node1   <none>           <none>
kube-proxy-dj4l8                          1/1     Running   0          44m   192.168.222.102   node3   <none>           <none>
kube-scheduler-node1                      1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
nginx-proxy-node2                         1/1     Running   0          44m   192.168.222.101   node2   <none>           <none>
nginx-proxy-node3                         1/1     Running   0          44m   192.168.222.102   node3   <none>           <none>
nodelocaldns-8b6kf                        1/1     Running   0          43m   192.168.222.102   node3   <none>           <none>
nodelocaldns-kzmmh                        1/1     Running   0          43m   192.168.222.101   node2   <none>           <none>
nodelocaldns-zh9fz                        1/1     Running   0          43m   192.168.222.111   node1   <none>           <none>
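
Optionally, containerd can be verified directly on a worker node. Kubespray installs the crictl CLI as part of the container runtime setup (visible in the playbook recap above), so the pod sandboxes and containers managed by containerd can be listed; if crictl is not configured on the node, add --runtime-endpoint unix:///run/containerd/containerd.sock to the commands:

Worker Node console
# crictl pods
# crictl ps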

NVIDIA Network Operator Installation for K8S Cluster 

NVIDIA Network Operator leverages Kubernetes CRDs and Operator SDK to manage networking-related components in order to enable fast networking and RDMA for workloads in K8s cluster. The Fast Network is a secondary network of the K8s cluster for applications that require high bandwidth or low latency.

To make it work, several components need to be provisioned and configured. All operator configuration and installation steps should be performed from the K8S master node with the root user account.

Prerequisites

  1. Install Helm.

    Master Node console
    # snap install helm --classic
  2. Install the additional RDMA CNI plugin.
    The RDMA CNI plugin allows network namespace isolation for RDMA workloads in a containerized environment.
    Deploy the CNI using the following YAML file:

    Master Node console
    # kubectl apply -f https://raw.githubusercontent.com/Mellanox/rdma-cni/master/deployment/rdma-cni-daemonset.yaml

    To ensure the plugin is installed correctly, run the following command:

    Master Node console
    # kubectl -n kube-system get pods -o wide | egrep  "rdma"
    
    kube-rdma-cni-ds-5zl8d                    1/1     Running   0          11m    192.168.222.102   node3   <none>           <none>
    kube-rdma-cni-ds-q74n5                    1/1     Running   0          11m    192.168.222.101   node2   <none>           <none>
    kube-rdma-cni-ds-rnqkr                    1/1     Running   0          11m    192.168.222.111   node1   <none>           <none>

Deployment

Add the NVIDIA Network Operator Helm repository:

Master Node console
# helm repo add mellanox https://mellanox.github.io/network-operator
# helm repo update

Create the values.yaml file in the user home folder (example):

values.yaml
nfd:
  enabled: true

sriovNetworkOperator:
  enabled: true

# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: false

nvPeerDriver:
  deploy: false

rdmaSharedDevicePlugin:
  deploy: false

sriovDevicePlugin:
  deploy: false

secondaryNetwork:
  deploy: true
  cniPlugins:
    deploy: true
    image: containernetworking-plugins
    repository: mellanox
    version: v0.8.7
    imagePullSecrets: []
  multus:
    deploy: true
    image: multus
    repository: nfvpe
    version: v3.6
    imagePullSecrets: []
    config: ''
  ipamPlugin:
    deploy: true
    image: whereabouts
    repository: mellanox
    version: v0.3
    imagePullSecrets: []

Deploy the operator: 

Master Node console
# helm install -f ./values.yaml -n network-operator --create-namespace --wait mellanox/network-operator --generate-name

NAME: network-operator
LAST DEPLOYED: Sun Jul 11 23:06:54 2021
NAMESPACE: network-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get Network Operator deployed resources by running the following commands:

$ kubectl -n network-operator get pods
$ kubectl -n mlnx-network-operator-resources get pods

To ensure that the Operator is deployed correctly, run the following commands:

Master Node console
# kubectl -n network-operator get pods -o wide
NAME                                                            READY   STATUS    RESTARTS   AGE   IP                NODE    NOMINATED NODE   READINESS GATES
network-operator-1627211751-5bd467cbd9-2hwqx                      1/1     Running   0          29h   10.233.90.5      node1   <none>           <none>
network-operator-1627211751-node-feature-discovery-master-dgs69   1/1     Running   0          29h   10.233.90.6      node1   <none>           <none>
network-operator-1627211751-node-feature-discovery-worker-7n6gs   1/1     Running   0          29h   10.233.90.3      node1   <none>           <none>
network-operator-1627211751-node-feature-discovery-worker-sjdxw   1/1     Running   1          29h   10.233.96.7      node2   <none>           <none>
network-operator-1627211751-node-feature-discovery-worker-vzpvg   1/1     Running   1          29h   10.233.92.5      node3   <none>           <none>
network-operator-1627211751-sriov-network-operator-5f869696sdzp   1/1     Running   0          29h   10.233.90.4      node1   <none>           <none>


High-Speed Network Configuration

After installing the operator, check the SriovNetworkNodeState CRs to see all SR-IOV-capable devices on your nodes.
In our deployment, the network interface ens2f0 was chosen. To review the interface status, use the following command:

Master Node console
# kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io node2 -o yaml

...

status:
  interfaces:
  - deviceID: 101d
    driver: mlx5_core
    linkSpeed: 100000 Mb/s
    linkType: ETH
    mac: 0c:42:a1:2b:74:ae
    mtu: 1500
    name: ens2f0
    pciAddress: "0000:07:00.0"
    totalvfs: 8
    vendor: 15b3
  - deviceID: 101d
    driver: mlx5_core
    linkType: ETH
    mac: 0c:42:a1:2b:74:af
    mtu: 1500
    name: ens2f1
    pciAddress: "0000:07:00.1"
    totalvfs: 8
    vendor: 15b3

...

Create a SriovNetworkNodePolicy CR file, policy.yaml, specifying the chosen interface in the nicSelector (in this example, the ens2f0 interface):

policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  resourceName: mlnx2f0
  priority: 98
  mtu: 9000
  numVfs: 8
  nicSelector:
    vendor: "15b3"
    pfNames: [ "ens2f0" ]
  deviceType: netdevice
  isRdma: true

Deploy policy.yaml:

Master Node console
# kubectl apply -f policy.yaml
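
Applying the policy triggers the SR-IOV Network Operator to create and configure the requested VFs on every matching worker node, which can take a few minutes. A quick way to check progress (a sketch; syncStatus is a status field of the SriovNetworkNodeState CR) is to watch the node state and then confirm the VFs on the worker's ens2f0 interface:

Master Node console
# kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io node2 -o yaml | grep syncStatus

Worker Node console
# ip link show ens2f0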

Create a SriovNetwork CR file, network.yaml, which refers to the resourceName defined in the SriovNetworkNodePolicy (in this example, referencing the mlnx2f0 resource and setting 192.168.101.0/24 as the CIDR range for the high-speed network):

network.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: "netmlnx2f0"
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
         "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.101.0/24"
    }
  vlan: 0
  networkNamespace: "default"
  spoofChk: "off"
  resourceName: "mlnx2f0"
  linkState: "enable"
  metaPlugins: |
    {
      "type": "rdma"
    }

Deploy network.yaml:

Master Node console
# kubectl apply -f network.yaml

Validating the Deployment

Check if the deployment is finished successfully:

Master Node console
# kubectl -n nvidia-network-operator-resources get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP               NODE    NOMINATED NODE   READINESS GATES
cni-plugins-ds-f548q         1/1     Running   1          30m   192.168.222.101   node2   <none>           <none>
cni-plugins-ds-qw7hx         1/1     Running   1          30m   192.168.222.102   node3   <none>           <none>
kube-multus-ds-cjbf9         1/1     Running   1          30m   192.168.222.102   node3   <none>           <none>
kube-multus-ds-rgc95         1/1     Running   1          30m   192.168.222.101   node2   <none>           <none>
whereabouts-gwr7p            1/1     Running   1          30m   192.168.222.101   node2   <none>           <none>
whereabouts-n29nq            1/1     Running   1          30m   192.168.222.102   node3   <none>           <none>

Check deployed network:

Master Node console
# kubectl get network-attachment-definitions.k8s.cni.cncf.io 
NAME         AGE
netmlnx2f0   4m56s
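
Optionally, inspect the rendered CNI configuration of the created network attachment (including the whereabouts IPAM settings and the rdma meta plugin):

Master Node console
# kubectl get network-attachment-definitions.k8s.cni.cncf.io netmlnx2f0 -o yaml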

Check worker node resources:

Master Node console
# kubectl describe nodes node2

...

Addresses:
  InternalIP:  192.168.222.101
  Hostname:    node2
Capacity:
  cpu:                 24
  ephemeral-storage:   229698892Ki
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              264030604Ki
  nvidia.com/mlnx2f0:  8
  pods:                110
Allocatable:
  cpu:                 23900m
  ephemeral-storage:   211690498517
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              242694540Ki
  nvidia.com/mlnx2f0:  8
  pods:                110

...

Manage HugePages

Kubernetes supports the allocation and consumption of pre-allocated HugePages by applications in a Pod. The nodes automatically discover and report all HugePages resources as schedulable resources. For additional information on HugePages management in K8s, please refer to the Kubernetes documentation.

To allocate HugePages, modify the GRUB_CMDLINE_LINUX_DEFAULT parameter in /etc/default/grub. The setting below allocates 1GB * 16 pages = 16GB of 1Gi HugePages and 2MB * 2048 pages = 4GB of 2Mi HugePages at boot time:

/etc/default/grub
...

GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=2048"

...

Run update-grub to apply the configuration and reboot the server:

Worker Node console
# update-grub
# reboot
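
After the server comes back up, the allocation can first be verified locally on the worker node; /proc/meminfo reports the default-size (1Gi) pool, while the 2Mi pool is visible under sysfs:

Worker Node console
# grep Huge /proc/meminfo
# cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages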

Then check the HugePages allocation from the master node by running the following command:

Master Node console
# kubectl describe nodes node2
...
Capacity:
  cpu:                 24
  ephemeral-storage:   229698892Ki
  hugepages-1Gi:       16Gi
  hugepages-2Mi:       4Gi
  memory:              264030604Ki
  nvidia.com/mlnx2f0:  8
  pods:                110
Allocatable:
  cpu:                 23900m
  ephemeral-storage:   211690498517
  hugepages-1Gi:       16Gi
  hugepages-2Mi:       4Gi
  memory:              242694540Ki
  nvidia.com/mlnx2f0:  8
  pods:                110
...


Enable CPU and Topology Management

CPU Manager manages groups of CPUs and constrains workloads to specific CPUs.

CPU Manager is useful for workloads that have some of these attributes:

  • Require as much CPU time as possible
  • Are sensitive to processor cache misses
  • Are low-latency network applications
  • Coordinate with other processes and benefit from sharing a single processor cache

Topology Manager uses topology information from collected hints to decide if a pod can be accepted or rejected on a node, based on the configured Topology Manager policy and Pod resources requested. In order to extract the best performance, optimizations related to CPU isolation and memory and device locality are required.

Topology Manager is useful for workloads that use hardware accelerators to support latency-critical execution and high throughput parallel computation.

To use Topology Manager, CPU Manager with static policy must be used.

For additional information, please refer to Control CPU Management Policies on the Node and Control Topology Management Policies on a Node.

In order to enable CPU Manager and Topology Manager, add the following lines to the kubelet configuration file /etc/kubernetes/kubelet-config.yaml on each worker node:

/etc/kubernetes/kubelet-config.yaml
...
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 10s
topologyManagerPolicy: single-numa-node
featureGates:
  CPUManager: true
  TopologyManager: true

Due to changes in cpuManagerPolicy, remove /var/lib/kubelet/cpu_manager_state and restart kubelet service on each affected K8s worker node.

Worker Node console
# rm -f /var/lib/kubelet/cpu_manager_state
# service kubelet restart
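
To confirm that the static policy took effect, inspect the regenerated state file on the worker node (its exact JSON content varies by kubelet version):

Worker Node console
# cat /var/lib/kubelet/cpu_manager_state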

Application

DPDK traffic emulation is shown in the Testbed Flow Diagram below. The traffic is pushed from the TRex server via the ens2f0 interface to the TESTPMD pod via the SR-IOV network interface net1. The TESTPMD pod swaps the MAC addresses and re-routes the ingress traffic via the same net1 interface back to the same interface on the TRex server.

Verification

  1. Create a sample deployment test-deployment.yaml (container image should include InfiniBand userspace drivers and performance tools):

    test-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mlnx-inbox-pod
      labels:
        app: sriov
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: sriov
      template:
        metadata:
          labels:
            app: sriov
          annotations:
            k8s.v1.cni.cncf.io/networks: netmlnx2f0
        spec:
          containers:
          - image: < Container image >
            name: mlnx-inbox-ctr
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              requests:
                cpu: 4
                nvidia.com/mlnx2f0: 1
              limits:
                cpu: 4
                nvidia.com/mlnx2f0: 1
            command:
            - sh
            - -c
            - sleep inf

     

  2. Deploy the sample deployment.

    Master Node console
    # kubectl apply -f test-deployment.yaml


  3. Verify the deployment is running.

    Master Node console
    # kubectl get pod -o wide
    NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
    mlnx-inbox-pod-599dc445c8-72x6g   1/1     Running   0          12s   10.233.96.5   node2   <none>           <none>
    mlnx-inbox-pod-599dc445c8-v5lnx   1/1     Running   0          12s   10.233.92.4   node3   <none>           <none>


  4. Check available network interfaces in POD.

    Master Node console
    # kubectl exec -it mlnx-inbox-pod-599dc445c8-72x6g -- bash
    
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# rdma link
    link rocep7s0f0v2/1 state ACTIVE physical_state LINK_UP netdev net1
    
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# ip a s
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host 
           valid_lft forever preferred_lft forever
    2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
        link/ipip 0.0.0.0 brd 0.0.0.0
    4: eth0@if208: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
        link/ether 12:51:ab:b3:ef:26 brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet 10.233.96.5/32 brd 10.233.96.5 scope global eth0
           valid_lft forever preferred_lft forever
        inet6 fe80::1051:abff:feb3:ef26/64 scope link 
           valid_lft forever preferred_lft forever
    201: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
        link/ether 02:40:7d:5e:5f:af brd ff:ff:ff:ff:ff:ff
        inet 192.168.101.2/24 brd 192.168.101.255 scope global net1
           valid_lft forever preferred_lft forever
        inet6 fe80::40:7dff:fe5e:5faf/64 scope link 
           valid_lft forever preferred_lft forever
  5. Run a synthetic RDMA benchmark with ib_write_bw, a bandwidth test that uses RDMA write transactions.

    Server

    ib_write_bw  -F -d $IB_DEV_NAME --report_gbits

    Client

    ib_write_bw  -F $SERVER_IP -d $IB_DEV_NAME --report_gbits

    Please open two consoles to the K8s master node: one for the server side and the other for the client side.
    In the first console (server side), run the following commands:

    Master Node console
    # kubectl exec -it mlnx-inbox-pod-599dc445c8-72x6g -- bash
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# ip a s net1 | grep inet
        inet 192.168.101.2/24 brd 192.168.101.255 scope global net1
        inet6 fe80::40:7dff:fe5e:5faf/64 scope link 
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# rdma link
    link rocep7s0f0v2/1 state ACTIVE physical_state LINK_UP netdev net1 
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# ib_write_bw -F -d rocep7s0f0v2 --report_gbits
    
    ************************************
    * Waiting for client to connect... *
    ************************************

    In the second console (client side), run the following commands:

    Master Node console
    # kubectl exec -it mlnx-inbox-pod-599dc445c8-v5lnx -- bash
    root@mlnx-inbox-pod-599dc445c8-v5lnx:/tmp# rdma link
    link rocep7s0f0v3/1 state ACTIVE physical_state LINK_UP netdev net1 
    root@mlnx-inbox-pod-599dc445c8-v5lnx:/tmp# ib_write_bw  -F -d rocep7s0f0v3 192.168.101.2 --report_gbits
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF		Device         : rocep7s0f0v3
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : Ethernet
     GID index       : 2
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0000 QPN 0x01f2 PSN 0x75e7cf RKey 0x050e26 VAddr 0x007f51e51b9000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:01
     remote address: LID 0000 QPN 0x00f2 PSN 0x13427f RKey 0x010e26 VAddr 0x007f1ecaac8000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:02
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
     65536      5000             94.26              92.87  		         0.169509
    ---------------------------------------------------------------------------------------
    
    


TRex Server Deployment

This guide uses TRex package v2.87.
For a detailed TRex installation and configuration guide, please refer to the TRex Documentation.

TRex installation and configuration steps are performed with the root user account.

Prerequisites

For the TRex server, a standard server with the RDMA subsystem installed is used.

Activate the network interfaces used by the TRex application with netplan.
In our deployment, interfaces ens2f0 and ens2f1 are used:

/etc/netplan/00-installer-config.yaml
# This is the network config written by 'subiquity'
network:
  ethernets:
    ens4f0:
      dhcp4: true
      dhcp-identifier: mac
    ens2f0: {}
    ens2f1: {}
  version: 2

Then re-apply netplan and check link status for ens2f0/ens2f1 network interfaces.

TRex server console
# netplan apply
# rdma link
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens2f0 
link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev ens2f1 
link mlx5_2/1 state ACTIVE physical_state LINK_UP netdev ens4f0 
link mlx5_3/1 state DOWN physical_state DISABLED netdev ens4f1 
 

Update the MTU size for interfaces ens2f0 and ens2f1.

TRex server console
# ip link set ens2f0 mtu 9000
# ip link set ens2f1 mtu 9000

Installation

Create a TRex working directory and obtain the TRex package.

TRex server console
# cd /tmp
# wget https://trex-tgn.cisco.com/trex/release/v2.87.tar.gz --no-check-certificate
# mkdir /scratch
# cd /scratch
# tar -zxf /tmp/v2.87.tar.gz
# chmod 777 -R /scratch


First-Time Scripts 

The next steps continue from the /scratch/v2.87 folder.

Run TRex configuration script in interactive mode. Follow the instructions on the screen to create a basic config file /etc/trex_cfg.yaml:

TRex server console
# ./dpdk_setup_ports.py -i

The /etc/trex_cfg.yaml configuration file is created. Later we'll change it to suit our setup.


Appendix

Performance Testing

Below is a performance test of DPDK traffic emulation between the TRex traffic generator and the TESTPMD application running on the K8s worker node, in accordance with the testbed diagram presented above.

Prerequisites

Before starting the test, update the TRex configuration file /etc/trex_cfg.yaml with the MAC address of the high-performance interface from the TESTPMD pod. Below are the steps to complete this update.

  1. Run a pod on the K8s cluster with the TESTPMD application according to the YAML configuration file testpmd-inbox.yaml presented below (the container image should include InfiniBand userspace drivers and the TESTPMD application):

    testpmd-inbox.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test-deployment
      labels:
        app: test
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: test
      template:
        metadata:
          labels:
            app: test
          annotations:
            k8s.v1.cni.cncf.io/networks: netmlnx2f0
        spec:
          containers:
          - image: < container image >
            name: test-pod
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            volumeMounts:
            - mountPath: /hugepages
              name: hugepage
            resources:
              requests:
                hugepages-1Gi: 2Gi
                memory: 16Gi
                cpu: 8
                nvidia.com/mlnx2f0: 1
              limits:
                hugepages-1Gi: 2Gi
                memory: 16Gi
                cpu: 8
                nvidia.com/mlnx2f0: 1
            command:
            - sh
            - -c
            - sleep inf
          volumes:
          - name: hugepage
            emptyDir:
              medium: HugePages

    Deploy the deployment with the following command:

    Master Node console
    # kubectl apply -f testpmd-inbox.yaml
  2. Get the network information from the deployed pod by running the following:

    Master Node console
    # kubectl get pod -o wide
    NAME                               READY   STATUS        RESTARTS   AGE    IP            NODE    NOMINATED NODE   READINESS GATES
    test-deployment-676476c78d-glbfs   1/1     Running       0          30s    10.233.92.5   node3   <none>           <none>
    
    # kubectl exec -it test-deployment-676476c78d-glbfs -- ip a s net1
    193: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
        link/ether 32:f9:3f:e3:dc:89 brd ff:ff:ff:ff:ff:ff
        inet 192.168.101.3/24 brd 192.168.101.255 scope global net1
           valid_lft forever preferred_lft forever
        inet6 fe80::30f9:3fff:fee3:dc89/64 scope link 
           valid_lft forever preferred_lft forever


  3. Update the TRex configuration file /etc/trex_cfg.yaml with the MAC address of the net1 network interface (32:f9:3f:e3:dc:89):

    /etc/trex_cfg.yaml
    ### Config file generated by dpdk_setup_ports.py ###
    
    - version: 2
      interfaces: ['07:00.0', '0d:00.0']
      port_info:
          - dest_mac: 32:f9:3f:e3:dc:89 # MAC OF NET1 INTERFACE
            src_mac:  0c:42:a1:24:05:1a
          - dest_mac: 32:f9:3f:e3:dc:89 # MAC OF NET1 INTERFACE
            src_mac:  0c:42:a1:24:05:1b
    
      platform:
          master_thread_id: 0
          latency_thread_id: 12
          dual_if:
            - socket: 0
              threads: [1,2,3,4,5,6,7,8,9,10,11]
    
    

DPDK Emulation Test

  1. Run TESTPMD apps in container:

    Master Node console
    # kubectl exec -it test-deployment-676476c78d-glbfs -- bash
    root@test-deployment-676476c78d-glbfs:/tmp# dpdk-testpmd -c 0x1fe  -m 1024 -w $PCIDEVICE_NVIDIA_COM_MLNX2F0 -- --burst=64 --txd=1024 --rxd=1024 --mbcache=512 --rxq=8 --txq=8 --nb-cores=4  --rss-udp --forward-mode=macswap  -a -i
    ...
    testpmd> 

    Specific TESTPMD parameters:

    $PCIDEVICE_NVIDIA_COM_MLNX2F0 - an environment variable injected into the pod, containing the PCI address of the net1 interface

    More information about additional TESTPMD parameters:
    https://doc.dpdk.org/guides/testpmd_app_ug/run_app.html?highlight=testpmd 
    https://doc.dpdk.org/guides/linux_gsg/linux_eal_parameters.html

  2. Run TRex traffic generator on TRex server:

    TRex server console
    # cd /scratch/v2.87/
    # ./t-rex-64 -v 7 -i -c 11 --no-ofed-check

    Open a second console to the TRex server and create a traffic generation file, mlnx-trex.py, in the /scratch/v2.87 folder:

    mlnx-trex.py
    from trex_stl_lib.api import *
     
    class STLS1(object):
     
        def create_stream (self):
            
            pkt = Ether()/IP(src="16.0.0.1",dst="48.0.0.1")/UDP(dport=12)/(22*'x')
                      
            vm = STLScVmRaw( [
                                    STLVmFlowVar(name="v_port",
                                                    min_value=4337,
                                                      max_value=5337,
                                                      size=2, op="inc"),
                                    STLVmWrFlowVar(fv_name="v_port",
                                                pkt_offset= "UDP.sport" ),
                                    STLVmFixChecksumHw(l3_offset="IP",l4_offset="UDP",l4_type=CTRexVmInsFixHwCs.L4_TYPE_UDP),
     
                                ]
                            )
     
            return STLStream(packet = STLPktBuilder(pkt = pkt ,vm = vm ) ,
                                    mode = STLTXCont(pps = 8000000) )
     
     
        def get_streams (self, direction = 0, **kwargs):
            # create 1 stream
            return [ self.create_stream() ]
     
     
    # dynamic load - used for trex console or simulator
    def register():
        return STLS1()


    Then run the TRex console and generate traffic to the TESTPMD pod:

    TRex server console
    # cd /scratch/v2.87/
    # ./trex-console
    Using 'python3' as Python interpeter
    
    Connecting to RPC server on localhost:4501                   [SUCCESS]
    
    Connecting to publisher server on localhost:4500             [SUCCESS]
    
    Acquiring ports [0, 1]:                                      [SUCCESS]
    
    Server Info:
    Server version:   v2.87 @ STL
    Server mode:      Stateless
    Server CPU:       11 x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
    Ports count:      2 x 100Gbps @ MT2892 Family [ConnectX-6 Dx]	
    
    -=TRex Console v3.0=-
    
    Type 'help' or '?' for supported actions
    
    trex> tui<enter>
    ...
    tui> start -f mlnx-trex.py -m 45mpps -p 0
    ...
    Global Statistitcs
    
    connection   : localhost, Port 4501                       total_tx_L2  : 23.9 Gbps                      
    version      : STL @ v2.87                                total_tx_L1  : 30.93 Gbps                     
    cpu_util.    : 82.88% @ 11 cores (11 per dual port)       total_rx     : 25.31 Gbps                     
    rx_cpu_util. : 0.0% / 0 pps                               total_pps    : 44.84 Mpps                     
    async_util.  : 0.05% / 11.22 Kbps                         drop_rate    : 0 bps                          
    total_cps.   : 0 cps                                      queue_full   : 0 pkts  
    ...

    Summary

    From the above test, it is evident that the desired traffic rate of 45 Mpps is achieved with an SR-IOV network port in the pod.

    In order to get better results, additional application tuning is required for TRex and TESTPMD.


    Done!

Authors

Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.

Amir Zeidner

For the past several years, Amir has worked as a Solutions Architect primarily in the Telco space, leading advanced solutions to answer 5G, NFV, and SDN networking infrastructures requirements. Amir’s expertise in data plane acceleration technologies, such as Accelerated Switching and Network Processing (ASAP²) and DPDK, together with a deep knowledge of open source cloud-based infrastructures, allows him to promote and deliver unique end-to-end NVIDIA Networking solutions throughout the Telco world.

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. Neither NVIDIA Corporation nor any of its direct or indirect subsidiaries and affiliates (collectively: “NVIDIA”) make any representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Trademarks
NVIDIA, the NVIDIA logo, and Mellanox are trademarks and/or registered trademarks of NVIDIA Corporation and/or Mellanox Technologies Ltd. in the U.S. and in other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright
© 2023 NVIDIA Corporation & affiliates. All Rights Reserved.