



Created on July 7, 2021.

Scope

The following Reference Deployment Guide (RDG) explains how to build a high performing Kubernetes (K8s) cluster with containerd container runtime that is capable of running DPDK-based applications over NVIDIA Networking end-to-end Ethernet infrastructure. 
This RDG describes a solution with multiple servers connected to a single switch that provides secondary network for the Kubernetes cluster. A more complex scale-out network topology of multiple L2 domains is beyond the scope of this document.

Abbreviations and Acronyms

Term     Definition
CNI      Container Network Interface
CR       Custom Resources
CRD      Custom Resources Definition
CRI      Container Runtime Interface
DHCP     Dynamic Host Configuration Protocol
DNS      Domain Name System
DP       Device Plugin
DPDK     Data Plane Development Kit
EVPN     Ethernet VPN
HWE      Hardware Enablement
IPAM     IP Address Management
K8s      Kubernetes
LLDP     Link Layer Discovery Protocol
NFD      Node Feature Discovery
OCI      Open Container Initiative
PF       Physical Function
QSG      Quick Start Guide
RDG      Reference Deployment Guide
RDMA     Remote Direct Memory Access
RoCE     RDMA over Converged Ethernet
SR-IOV   Single Root Input Output Virtualization
VF       Virtual Function
VPN      Virtual Private Network
VXLAN    Virtual eXtensible Local Area Network

Introduction

Provisioning a Kubernetes cluster with the containerd container runtime for running DPDK-based workloads can become an extremely complicated task.
Proper design and software and hardware component selection may become a gating task toward successful deployment.
This guide provides a complete solution cycle, including technology overview, design, component selection, and deployment steps.
The solution is delivered on top of standard servers over the NVIDIA end-to-end Ethernet infrastructure.
In this document, we use the new NVIDIA Network Operator, which is in charge of deploying and configuring the SR-IOV Device Plugin and SR-IOV CNI. These components allow DPDK workloads to run on a Kubernetes worker node.


Solution Architecture

Key Components and Technologies

  • NVIDIA ConnectX SmartNICs
    10/25/40/50/100/200 and 400G Ethernet Network Adapters
    The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offer advanced hardware offloads and accelerations.
    NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.

  • NVIDIA LinkX Cables 
    The NVIDIA® LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400GbE in Ethernet and 100, 200 and 400Gb/s InfiniBand products for Cloud, HPC, hyperscale, Enterprise, telco, storage and artificial intelligence, data center applications.

  • NVIDIA Spectrum Ethernet Switches
    Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.
    Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects. 
    NVIDIA combines the benefits of NVIDIA Spectrum switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux, SONiC, and NVIDIA Onyx®.

  • NVIDIA Cumulus Linux 
    NVIDIA® Cumulus® Linux is the industry's most innovative open network operating system that allows you to automate, customize, and scale your data center network like no other.

  • RDMA 
    RDMA is a technology that allows computers in a network to exchange data without involving the processor, cache, or operating system of either computer. Like locally based DMA, RDMA improves throughput and performance and frees up compute resources.

  • Kubernetes
    Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.

  • Kubespray 
    Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks and provides:
    • A highly available cluster
    • Composable attributes
    • Support for most popular Linux distributions

  • NVIDIA Network Operator
    An analog to the NVIDIA GPU Operator, the NVIDIA Network Operator simplifies scale-out network design for Kubernetes by automating aspects of network deployment and configuration that would otherwise require manual work. It loads the required drivers, libraries, device plugins, and CNIs on any cluster node with an NVIDIA network interface. Paired with the NVIDIA GPU Operator, the Network Operator enables GPUDirect RDMA, a key technology that accelerates cloud-native AI workloads by orders of magnitude. The NVIDIA Network Operator uses Kubernetes CRDs and the Operator Framework to provision the host software needed for enabling accelerated networking.

  • What is containerd?
    An industry-standard container runtime with an emphasis on simplicity, robustness and portability. containerd is available as a daemon for Linux and Windows. It manages the complete container lifecycle of its host system, from image transfer and storage to container execution and supervision to low-level storage to network attachments and beyond.

  • NVIDIA PMDs
    NVIDIA Poll Mode Driver (PMD) is an open-source upstream driver, embedded within dpdk.org releases, designed for fast packet processing and low latency by providing kernel bypass for receive and send and by avoiding the interrupt processing performance overhead.

  • TRex—Realistic Traffic Generator
    TRex is an open-source stateful and stateless traffic generator fueled by DPDK. It generates L3-7 traffic and provides, in a single tool, capabilities found in commercial traffic generators. TRex can scale up to 200Gb/sec with one server.

Logical Design

The logical design includes the following parts:

  1. Deployment node running Kubespray that deploys Kubernetes clusters.

  2. K8s master node running all Kubernetes management components.

  3. K8s worker nodes. 

  4. TRex server.

  5. High-speed Ethernet fabric for the DPDK tenant network.

  6. Deployment and K8s management network.

Fabric Design

The high-performance network is a secondary network for the Kubernetes cluster and requires an L2 network topology.

This RDG describes a solution with multiple servers connected to a single switch that provides secondary network for the Kubernetes cluster.

A more complex scale-out network topology of multiple L2 domains is beyond the scope of this document. 

Software Stack Components

Bill of Materials

The following hardware setup is utilized in this guide.

The above table does not contain the management network connectivity components.

Deployment and Configuration

Wiring

On each K8s worker node and TRex server, the first port of each NVIDIA Network Adapter is wired to the NVIDIA switch in high-performance fabric using NVIDIA LinkX DAC cables.

Deployment and Management network is part of IT infrastructure and is not covered in this guide.


Fabric

Prerequisites

  • High-performance Ethernet fabric
    • Single switch
      NVIDIA SN2100

    • Switch OS
      Cumulus Linux v4.2.1

  • Deployment and management network
    DNS and DHCP network services and network topology are part of the IT infrastructure. The component installation and configuration are not covered in this guide.

Network Configuration

Below are the server names with their relevant network configurations.


Server/Switch type    Server/Switch name    High-speed network                      Management network (1/25 GbE)
Deployment node       depserver             -                                       ens4f0: DHCP, 192.168.222.110
Master node           node1                 -                                       ens4f0: DHCP, 192.168.222.111
Worker node           node2                 ens2f0: no IP set                       ens4f0: DHCP, 192.168.222.101
Worker node           node3                 ens2f0: no IP set                       ens4f0: DHCP, 192.168.222.102
TRex server           node4                 ens2f0: no IP set, ens2f1: no IP set    ens4f0: DHCP, 192.168.222.103
High-speed switch     leaf01                -                                       mgmt0: DHCP, 192.168.222.201
ensXf0 high-speed network interfaces do not require additional configuration.
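
The management interfaces (ens4f0) obtain their addresses from DHCP. For reference, a minimal netplan sketch for a node's management interface is shown below; the file name 00-installer-config.yaml and the interface name are examples taken from this setup and should be adjusted to your environment.

/etc/netplan/00-installer-config.yaml
network:
  version: 2
  ethernets:
    ens4f0:
      dhcp4: true
      dhcp-identifier: mac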

Fabric Configuration

This solution is based on the Cumulus Linux v4.2.1 switch operating system.

Intermediate-level Linux knowledge is assumed for this guide. Familiarity with basic text editing, Linux file permissions, and process monitoring is required. A variety of text editors are pre-installed, including vi and nano.

Networking engineers who are unfamiliar with Linux concepts should refer to this reference guide to compare the Cumulus Linux CLI and configuration options and their equivalent Cisco Nexus 3000 NX-OS commands and settings. There is also a series of short videos with an introduction to Linux and Cumulus-Linux-specific concepts. 

A Greenfield deployment is assumed for this guide. Please refer to the following guide for Upgrading Cumulus Linux.

Fabric configuration steps:

  1. Administratively enable all physical ports.

  2. Create a bridge and configure one or more front panel ports as members of the bridge.

  3. Commit configuration.

Switch configuration steps.

Switch console
Linux swx-mld-l03 4.19.0-cl-1-amd64 #1 SMP Cumulus 4.19.94-1+cl4.2.1u1 (2020-08-28) x86_64
 
Welcome to NVIDIA Cumulus (R) Linux (R)
 
For support and online technical documentation, visit
http://www.cumulusnetworks.com/support
 
The registered trademark Linux (R) is used pursuant to a sublicense from LMI,
the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide
basis.
 
cumulus@leaf01:mgmt:~$ net add interface swp1-16
cumulus@leaf01:mgmt:~$ net add bridge bridge ports swp1-16
cumulus@leaf01:mgmt:~$ net commit

To view link status, use the net show interface all command. The following example shows the output of ports in admin down, down, and up modes.

Switch console
cumulus@leaf01:mgmt:~$ net show interface all
State  Name    Spd   MTU    Mode        LLDP                             Summary
-----  ------  ----  -----  ----------  -------------------------------  ------------------------
UP     lo      N/A   65536  Loopback                                     IP: 127.0.0.1/8
       lo                                                                IP: ::1/128
UP     eth0    1G    1500   Mgmt        mgmt-xxx-xxx-xxx-xxx (8)         Master: mgmt(UP)
       eth0                                                              IP: 192.168.222.201/24(DHCP)
UP     swp1    100G  9216   Access/L2                                    Master: bridge(UP)
UP     swp2    100G  9216   Access/L2   node2 (0c:42:a1:2b:74:ae)        Master: bridge(UP)
UP     swp3    100G  9216   Access/L2                                    Master: bridge(UP)
UP     swp4    100G  9216   Access/L2   node3 (0c:42:a1:24:05:4a)        Master: bridge(UP)
UP     swp5    100G  9216   Access/L2                                    Master: bridge(UP)
UP     swp6    100G  9216   Access/L2   node4 (0c:42:a1:24:05:1a)        Master: bridge(UP)
UP     swp7    100G  9216   Access/L2                                    Master: bridge(UP)
UP     swp8    100G  9216   Access/L2   node4 (0c:42:a1:24:05:1b)        Master: bridge(UP)
DN     swp9    N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp10   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp11   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp12   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp13   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp14   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp15   N/A   9216   Access/L2                                    Master: bridge(UP)
DN     swp16   N/A   9216   Access/L2                                    Master: bridge(UP)
UP     bridge  N/A   9216   Bridge/L2
UP     mgmt    N/A   65536  VRF                                          IP: 127.0.0.1/8
       mgmt                                                              IP: ::1/128
 

Nodes Configuration

General Prerequisites:

  • Hardware
    All the K8s worker nodes have the same hardware specification (see BoM for details).

  • Host BIOS
    Verify that SR-IOV supported server platform is being used and review the BIOS settings in the server platform vendor documentation to enable SR-IOV in the BIOS.

  • Host OS
    Ubuntu Server 20.04 operating system should be installed on all servers with OpenSSH server packages.

  • Experience with Kubernetes
    Familiarization with the Kubernetes Cluster architecture is essential. 

Make sure that the BIOS settings on the worker node servers have SR-IOV enabled and that the servers are tuned for maximum performance.

All worker nodes must have the same PCIe placement for the NIC and expose the same interface name.
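
One way to confirm this (a quick sketch, assuming NVIDIA/Mellanox adapters with PCI vendor ID 15b3) is to compare the adapter PCIe addresses and the interface-to-PCIe mapping on each worker node:

Worker Node console
# lspci -d 15b3:
# ls -l /sys/class/net/

The PCIe addresses reported by lspci and the device symlinks under /sys/class/net/ should be identical across all worker nodes.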


Host OS Prerequisites 

Make sure the Ubuntu Server 20.04 operating system is installed on all servers with the OpenSSH server package, and create a non-root depuser account with passwordless sudo privileges.
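
If the depuser account does not exist yet, it can be created as follows (a minimal sketch; adding the user to the sudo group is optional here, since the /etc/sudoers entry shown below already grants passwordless sudo):

Server console
$ sudo adduser depuser
$ sudo usermod -aG sudo depuser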

Update the Ubuntu software packages by running the following commands:

Server console
$ sudo apt-get update
$ sudo apt-get upgrade -y
$ sudo reboot 

In this solution, we added the following line to the end of the /etc/sudoers file:

Server console
$ sudo vim /etc/sudoers
  
#includedir /etc/sudoers.d
  
#K8s cluster deployment user with sudo privileges without password
depuser ALL=(ALL) NOPASSWD:ALL

NIC Firmware Upgrade

It is recommended to upgrade the NIC firmware on the worker nodes to the latest released version.
Download the mlxup firmware update and query utility to each worker node and update the NIC firmware.
The most recent version of mlxup can be downloaded from the official download page. mlxup can download and apply the latest NIC firmware over the Internet.
Running the utility requires sudo privileges:

Worker Node console
# wget http://www.mellanox.com/downloads/firmware/mlxup/4.15.2/SFX/linux_x64/mlxup
# chmod +x mlxup
# ./mlxup -online -u
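
Optionally, the firmware version currently running on each adapter can be cross-checked per interface with ethtool (ens2f0 in this setup):

Worker Node console
# ethtool -i ens2f0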

RDMA Subsystem Configuration

RDMA subsystem configuration is required on each worker node.

  1. Install the LLDP daemon and the RDMA core userspace libraries and daemons.

    Worker Node console
    # apt install -y lldpd rdma-core

    LLDPD is a daemon able to receive and send LLDP frames. The Link Layer Discovery Protocol (LLDP) is a vendor-neutral Layer 2 protocol that allows a network device to advertise its identity and capabilities on the local network. A quick LLDP verification example is shown after this procedure.

  2. Identify the name of the RDMA-capable interface for high-performance K8s network.
    In this guide, ens2f0 network interface for high-performance K8s network was chosen and will be activated by NVIDIA Network Operator deployment:

    Worker Node console
    # rdma link
    link rocep7s0f0/1 state DOWN physical_state DISABLED netdev ens2f0 
    link rocep7s0f1/1 state DOWN physical_state DISABLED netdev ens2f1  
    link rocep131s0f0/1 state ACTIVE physical_state LINK_UP netdev ens4f0 
    link rocep131s0f1/1 state DOWN physical_state DISABLED netdev ens4f1 
  3. Set the RDMA subsystem network namespace mode to exclusive mode.
    Setting the RDMA subsystem network namespace mode (the netns_mode parameter of the ib_core module) to exclusive allows network namespace isolation for RDMA workloads on the worker node servers. Create the /etc/modprobe.d/ib_core.conf configuration file to change the ib_core module parameters:

    /etc/modprobe.d/ib_core.conf
    # Set netns to exclusive mode for namespace isolation
    options ib_core netns_mode=0

    Then re-generate the initial RAM disks and reboot servers:

    Worker Node console
    # update-initramfs -u
    # reboot

    After the server comes back, check netns mode:

    Worker Node console
    # rdma system
     
    netns exclusive
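
Since lldpd was installed in step 1, the cabling to the high-speed switch can optionally be verified from each worker node by listing the LLDP neighbors; the reported switch port should match the output shown in the Fabric Configuration section:

Worker Node console
# lldpcli show neighbors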

K8s Cluster Deployment and Configuration

The Kubernetes cluster in this solution will be installed using Kubespray with a non-root depuser account from the deployment node.

SSH Private Key and SSH Passwordless Login

Log in to the deployment node as the deployment user (in this case, depuser) and generate an SSH key pair for configuring passwordless authentication by running the following command:

Deployment Node console
$ ssh-keygen 
Generating public/private rsa key pair.
Enter file in which to save the key (/home/depuser/.ssh/id_rsa): 
Created directory '/home/depuser/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/depuser/.ssh/id_rsa
Your public key has been saved in /home/depuser/.ssh/id_rsa.pub
The key fingerprint is:
SHA256:IfcjdT/spXVHVd3n6wm1OmaWUXGuHnPmvqoXZ6WZYl0 depuser@depserver
The key's randomart image is:
+---[RSA 3072]----+
|                *|
|               .*|
|      . o . .  o=|
|       o + . o +E|
|        S o  .**O|
|         . .o=OX=|
|           . o%*.|
|             O.o.|
|           .*.ooo|
+----[SHA256]-----+

Copy your SSH public key (for example, ~/.ssh/id_rsa.pub) to all nodes in the deployment by running the following command (example):

Deployment Node console
$ ssh-copy-id depuser@192.168.222.111
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/depuser/.ssh/id_rsa.pub"
The authenticity of host '192.168.222.111 (192.168.222.111)' can't be established.
ECDSA key fingerprint is SHA256:6nhUgRlt9gY2Y2ofukUqE0ltH+derQuLsI39dFHe0Ag.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
depuser@192.168.222.111's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'depuser@192.168.222.111'"
and check to make sure that only the key(s) you wanted were added.

Verify that you have passwordless SSH connectivity to all nodes in your deployment by running the following command (example):

Deployment Node console
$ ssh depuser@192.168.222.111

Kubespray Deployment and Configuration

General Setting

To install the dependencies for running Kubespray with Ansible on the deployment node, run the following commands:

Deployment Node console
$ cd ~
$ sudo apt -y install python3-pip jq
$ wget https://github.com/kubernetes-sigs/kubespray/archive/v2.15.0.tar.gz
$ tar -zxf v2.15.0.tar.gz
$ cd kubespray-2.15.0
$ sudo pip3 install -r requirements.txt

The default folder for subsequent commands is ~/kubespray-2.15.0.

Deployment Customization

Create a new cluster configuration and host configuration file.
Replace the IP addresses below with your nodes' IP addresses:

Deployment Node console
$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(192.168.222.111 192.168.222.101 192.168.222.102)
$ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

As a result, the inventory/mycluster/hosts.yaml file will be created.
Review and change the host configuration in the file. Below is an example for this deployment:

inventory/mycluster/hosts.yaml
all:
  hosts:
    node1:
      ansible_host: 192.168.222.111
      ip: 192.168.222.111
      access_ip: 192.168.222.111
    node2:
      ansible_host: 192.168.222.101
      ip: 192.168.222.101
      access_ip: 192.168.222.101
    node3:
      ansible_host: 192.168.222.102
      ip: 192.168.222.102
      access_ip: 192.168.222.102
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

Review and change cluster installation parameters in the files:

  • inventory/mycluster/group_vars/all/all.yml 
  • inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml

In inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml, set the default Kubernetes CNI by setting the kube_network_plugin parameter to the desired value (default: calico).

inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml
...
 
# Choose network plugin (cilium, calico, contiv, weave or flannel. Use cni for generic cni plugin)
# Can also be set to 'cloud', which lets the cloud provider setup appropriate routing
kube_network_plugin: calico
 
# Setting multi_networking to true will install Multus: https://github.com/intel/multus-cni
kube_network_plugin_multus: false
 
...

Choosing the Container Runtime

In this guide, containerd was chosen as the default container runtime for the K8s cluster deployment because Docker support in Kubernetes (dockershim) is deprecated.
To use the containerd container runtime, set the following variables:

  1. In inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml:

    inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml
    ...
     
    ## Container runtime
    ## docker for docker, crio for cri-o and containerd for containerd.
    container_manager: containerd
    
    ...
  2. In inventory/mycluster/group_vars/all/all.yml:

    inventory/mycluster/group_vars/all/all.yml
    ...
    
    ## Experimental kubeadm etcd deployment mode. Available only for new deployment
    etcd_kubeadm_enabled: true
    
    ...


  3. In inventory/mycluster/group_vars/etcd.yml:

    inventory/mycluster/group_vars/etcd.yml
    ...
    
    ## Settings for etcd deployment type
    etcd_deployment_type: host
    
    ...


Deploying the Cluster Using KubeSpray Ansible Playbook

Run the following line to start the deployment process:

Deployment Node console
$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

It takes a while for this deployment to complete. Please make sure no errors are encountered.

A successful result should look something like the following:

Deployment Node console
...
PLAY RECAP ***********************************************************************************************************************************************************************************
localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
node1                      : ok=554  changed=81   unreachable=0    failed=0    skipped=1152 rescued=0    ignored=2   
node2                      : ok=360  changed=42   unreachable=0    failed=0    skipped=633  rescued=0    ignored=1   
node3                      : ok=360  changed=42   unreachable=0    failed=0    skipped=632  rescued=0    ignored=1   

Sunday 11 July 2021  22:36:04 +0000 (0:00:00.053)      0:06:51.785 ************ 
=============================================================================== 
kubernetes/kubeadm : Join to cluster ------------------------------------------------------------------------------------------------------------------------------------------------- 37.24s
kubernetes/control-plane : kubeadm | Initialize first master ------------------------------------------------------------------------------------------------------------------------- 28.29s
download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 16.57s
kubernetes/control-plane : Master | wait for kube-scheduler -------------------------------------------------------------------------------------------------------------------------- 14.23s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 11.06s
download_container | Download image if required --------------------------------------------------------------------------------------------------------------------------------------- 9.18s
download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------- 8.61s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources --------------------------------------------------------------------------------------------------------------------------- 7.02s
container-engine/crictl : download_file | Download item ------------------------------------------------------------------------------------------------------------------------------- 5.78s
download_container | Download image if required --------------------------------------------------------------------------------------------------------------------------------------- 5.52s
Configure | Check if etcd cluster is healthy ------------------------------------------------------------------------------------------------------------------------------------------ 5.24s
download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------- 4.89s
download_container | Download image if required --------------------------------------------------------------------------------------------------------------------------------------- 4.81s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS templates ---------------------------------------------------------------------------------------------------------------- 4.68s
reload etcd --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 4.65s
download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------- 4.24s
kubernetes/preinstall : Get current calico cluster version ---------------------------------------------------------------------------------------------------------------------------- 3.70s
network_plugin/calico : Start Calico resources ---------------------------------------------------------------------------------------------------------------------------------------- 3.42s
container-engine/crictl : extract_file | Unpacking archive ---------------------------------------------------------------------------------------------------------------------------- 3.35s
kubernetes-apps/cluster_roles : Apply workaround to allow all nodes with cert O=system:nodes to register ------------------------------------------------------------------------------ 3.32s

K8s Cluster Customization

Now that the K8s cluster is deployed, connect to the K8s master node with the root user account in order to customize the deployment.

  1. Label the worker nodes.

    Master Node console
    # kubectl label nodes node2 node-role.kubernetes.io/worker=
    # kubectl label nodes node3 node-role.kubernetes.io/worker=

K8S Cluster Deployment Verification

Following is an output example of K8s cluster deployment information using the Calico CNI plugin.

To ensure that the Kubernetes cluster is installed correctly, run the following commands:

Master Node console
# kubectl get nodes -o wide
NAME    STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready    master   44m   v1.19.7   192.168.222.111   <none>        Ubuntu 20.04.2 LTS   5.4.0-72-generic   containerd://1.4.4
node2   Ready    worker   42m   v1.19.7   192.168.222.101   <none>        Ubuntu 20.04.2 LTS   5.4.0-72-generic   containerd://1.4.4
node3   Ready    worker   42m   v1.19.7   192.168.222.102   <none>        Ubuntu 20.04.2 LTS   5.4.0-72-generic   containerd://1.4.4

# kubectl -n kube-system get pods -o wide
NAME                                      READY   STATUS    RESTARTS   AGE   IP                NODE    NOMINATED NODE   READINESS GATES
calico-kube-controllers-8b5ff5d58-ph86x   1/1     Running   0          43m   192.168.222.101   node2   <none>           <none>
calico-node-l48qg                         1/1     Running   0          43m   192.168.222.102   node3   <none>           <none>
calico-node-ldx7w                         1/1     Running   0          43m   192.168.222.111   node1   <none>           <none>
calico-node-x9bh5                         1/1     Running   0          43m   192.168.222.101   node2   <none>           <none>
coredns-85967d65-pslmm                    1/1     Running   0          27m   10.233.96.1       node2   <none>           <none>
coredns-85967d65-qp2rl                    1/1     Running   0          43m   10.233.90.230     node1   <none>           <none>
dns-autoscaler-5b7b5c9b6f-8wb67           1/1     Running   0          43m   10.233.90.229     node1   <none>           <none>
etcd-node1                                1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
kube-apiserver-node1                      1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
kube-controller-manager-node1             1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
kube-proxy-6p4rm                          1/1     Running   0          44m   192.168.222.101   node2   <none>           <none>
kube-proxy-8bj6s                          1/1     Running   0          44m   192.168.222.111   node1   <none>           <none>
kube-proxy-dj4l8                          1/1     Running   0          44m   192.168.222.102   node3   <none>           <none>
kube-scheduler-node1                      1/1     Running   0          45m   192.168.222.111   node1   <none>           <none>
nginx-proxy-node2                         1/1     Running   0          44m   192.168.222.101   node2   <none>           <none>
nginx-proxy-node3                         1/1     Running   0          44m   192.168.222.102   node3   <none>           <none>
nodelocaldns-8b6kf                        1/1     Running   0          43m   192.168.222.102   node3   <none>           <none>
nodelocaldns-kzmmh                        1/1     Running   0          43m   192.168.222.101   node2   <none>           <none>
nodelocaldns-zh9fz                        1/1     Running   0          43m   192.168.222.111   node1   <none>           <none>
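
Optionally, containerd can be verified directly on a worker node. Kubespray installs the crictl CLI as part of the container runtime setup (visible in the playbook recap above), so the pod sandboxes and containers managed by containerd can be listed; if crictl is not configured on the node, add --runtime-endpoint unix:///run/containerd/containerd.sock to the commands:

Worker Node console
# crictl pods
# crictl ps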

NVIDIA Network Operator Installation for K8S Cluster 

NVIDIA Network Operator leverages Kubernetes CRDs and Operator SDK to manage networking-related components in order to enable fast networking and RDMA for workloads in K8s cluster. The Fast Network is a secondary network of the K8s cluster for applications that require high bandwidth or low latency.

To make it work, several components need to be provisioned and configured. All operator configuration and installation steps should be performed from the K8S master node with the root user account.

Prerequisites

  1. Install Helm.

    Master Node console
    # snap install helm --classic
  2. Install the additional RDMA CNI plugin.
    The RDMA CNI plugin allows network namespace isolation for RDMA workloads in a containerized environment.
    Deploy the CNI using the following YAML file:

    Master Node console
    # kubectl apply -f https://raw.githubusercontent.com/Mellanox/rdma-cni/master/deployment/rdma-cni-daemonset.yaml

    To ensure the plugin is installed correctly, run the following command:

    Master Node console
    # kubectl -n kube-system get pods -o wide | egrep  "rdma"
    
    kube-rdma-cni-ds-5zl8d                    1/1     Running   0          11m    192.168.222.102   node3   <none>           <none>
    kube-rdma-cni-ds-q74n5                    1/1     Running   0          11m    192.168.222.101   node2   <none>           <none>
    kube-rdma-cni-ds-rnqkr                    1/1     Running   0          11m    192.168.222.111   node1   <none>           <none>

Deployment

Add the NVIDIA Network Operator Helm repository:

Master Node console
# helm repo add mellanox https://mellanox.github.io/network-operator
# helm repo update

Create the values.yaml file in the user home folder (example):

values.yaml
nfd:
  enabled: true

sriovNetworkOperator:
  enabled: true

# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: false

nvPeerDriver:
  deploy: false

rdmaSharedDevicePlugin:
  deploy: false

sriovDevicePlugin:
  deploy: false

secondaryNetwork:
  deploy: true
  cniPlugins:
    deploy: true
    image: containernetworking-plugins
    repository: mellanox
    version: v0.8.7
    imagePullSecrets: []
  multus:
    deploy: true
    image: multus
    repository: nfvpe
    version: v3.6
    imagePullSecrets: []
    config: ''
  ipamPlugin:
    deploy: true
    image: whereabouts
    repository: mellanox
    version: v0.3
    imagePullSecrets: []

Deploy the operator: 

Master Node console
# helm install -f ./values.yaml -n network-operator --create-namespace --wait mellanox/network-operator --generate-name

NAME: network-operator
LAST DEPLOYED: Sun Jul 11 23:06:54 2021
NAMESPACE: network-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get Network Operator deployed resources by running the following commands:

$ kubectl -n network-operator get pods
$ kubectl -n mlnx-network-operator-resources get pods

To ensure that the Operator is deployed correctly, run the following commands:

Master Node console
# kubectl -n network-operator get pods -o wide
NAME                                                            READY   STATUS    RESTARTS   AGE   IP                NODE    NOMINATED NODE   READINESS GATES
network-operator-1627211751-5bd467cbd9-2hwqx                      1/1     Running   0          29h   10.233.90.5      node1   <none>           <none>
network-operator-1627211751-node-feature-discovery-master-dgs69   1/1     Running   0          29h   10.233.90.6      node1   <none>           <none>
network-operator-1627211751-node-feature-discovery-worker-7n6gs   1/1     Running   0          29h   10.233.90.3      node1   <none>           <none>
network-operator-1627211751-node-feature-discovery-worker-sjdxw   1/1     Running   1          29h   10.233.96.7      node2   <none>           <none>
network-operator-1627211751-node-feature-discovery-worker-vzpvg   1/1     Running   1          29h   10.233.92.5      node3   <none>           <none>
network-operator-1627211751-sriov-network-operator-5f869696sdzp   1/1     Running   0          29h   10.233.90.4      node1   <none>           <none>


High-Speed Network Configuration

After installing the operator, check the SriovNetworkNodeState CRs to see all SR-IOV-capable devices on your nodes.
In our deployment, the network interface ens2f0 was chosen. To review the interface status, use the following command:

Master Node console
# kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io node2 -o yaml

...

status:
  interfaces:
  - deviceID: 101d
    driver: mlx5_core
    linkSpeed: 100000 Mb/s
    linkType: ETH
    mac: 0c:42:a1:2b:74:ae
    mtu: 1500
    name: ens2f0
    pciAddress: "0000:07:00.0"
    totalvfs: 8
    vendor: 15b3
  - deviceID: 101d
    driver: mlx5_core
    linkType: ETH
    mac: 0c:42:a1:2b:74:af
    mtu: 1500
    name: ens2f1
    pciAddress: "0000:07:00.1"
    totalvfs: 8
    vendor: 15b3

...

Create a SriovNetworkNodePolicy CR file, policy.yaml, specifying the chosen interface in the nicSelector (in this example, the ens2f0 interface):

policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  resourceName: mlnx2f0
  priority: 98
  mtu: 9000
  numVfs: 8
  nicSelector:
    vendor: "15b3"
    pfNames: [ "ens2f0" ]
  deviceType: netdevice
  isRdma: true

Deploy policy.yaml:

Master Node console
# kubectl apply -f policy.yaml
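
Applying the policy triggers the SR-IOV Network Operator to create and configure the requested VFs on every matching worker node, which can take a few minutes. A quick way to check progress (a sketch; syncStatus is a status field of the SriovNetworkNodeState CR) is to watch the node state and then confirm the VFs on the worker's ens2f0 interface:

Master Node console
# kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io node2 -o yaml | grep syncStatus

Worker Node console
# ip link show ens2f0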

Create a SriovNetwork CR file, network.yaml, which refers to the resourceName defined in the SriovNetworkNodePolicy (in this example, referencing the mlnx2f0 resource and setting 192.168.101.0/24 as the CIDR range for the high-speed network):

network.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: "netmlnx2f0"
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
         "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.101.0/24"
    }
  vlan: 0
  networkNamespace: "default"
  spoofChk: "off"
  resourceName: "mlnx2f0"
  linkState: "enable"
  metaPlugins: |
    {
      "type": "rdma"
    }

Deploy network.yaml:

Master Node console
# kubectl apply -f network.yaml

Validating the Deployment

Check if the deployment is finished successfully:

Master Node console
# kubectl -n nvidia-network-operator-resources get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP               NODE    NOMINATED NODE   READINESS GATES
cni-plugins-ds-f548q         1/1     Running   1          30m   192.168.222.101   node2   <none>           <none>
cni-plugins-ds-qw7hx         1/1     Running   1          30m   192.168.222.102   node3   <none>           <none>
kube-multus-ds-cjbf9         1/1     Running   1          30m   192.168.222.102   node3   <none>           <none>
kube-multus-ds-rgc95         1/1     Running   1          30m   192.168.222.101   node2   <none>           <none>
whereabouts-gwr7p            1/1     Running   1          30m   192.168.222.101   node2   <none>           <none>
whereabouts-n29nq            1/1     Running   1          30m   192.168.222.102   node3   <none>           <none>

Check deployed network:

Master Node console
# kubectl get network-attachment-definitions.k8s.cni.cncf.io 
NAME         AGE
netmlnx2f0   4m56s
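
Optionally, inspect the rendered CNI configuration of the created network attachment (including the whereabouts IPAM settings and the rdma meta plugin):

Master Node console
# kubectl get network-attachment-definitions.k8s.cni.cncf.io netmlnx2f0 -o yaml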

Check worker node resources:

Master Node console
# kubectl describe nodes node2

...

Addresses:
  InternalIP:  192.168.222.101
  Hostname:    node2
Capacity:
  cpu:                 24
  ephemeral-storage:   229698892Ki
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              264030604Ki
  nvidia.com/mlnx2f0:  8
  pods:                110
Allocatable:
  cpu:                 23900m
  ephemeral-storage:   211690498517
  hugepages-1Gi:       0
  hugepages-2Mi:       0
  memory:              242694540Ki
  nvidia.com/mlnx2f0:  8
  pods:                110

...

Manage HugePages

Kubernetes supports the allocation and consumption of pre-allocated HugePages by applications in a Pod. The nodes automatically discover and report all HugePages resources as schedulable resources. For additional information on HugePages management in K8s, please refer to the Kubernetes documentation.

To allocate HugePages, modify the GRUB_CMDLINE_LINUX_DEFAULT parameter in /etc/default/grub. The setting below allocates 1GB * 16 pages = 16GB of 1Gi HugePages and 2MB * 2048 pages = 4GB of 2Mi HugePages at boot time:

/etc/default/grub
...

GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=2048"

...

Run update-grub to apply the configuration and reboot the server:

Worker Node console
# update-grub
# reboot
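
After the server comes back up, the allocation can first be verified locally on the worker node; /proc/meminfo reports the default-size (1Gi) pool, while the 2Mi pool is visible under sysfs:

Worker Node console
# grep Huge /proc/meminfo
# cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages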

Then check the HugePages allocation from the master node by running the following command:

Master Node console
# kubectl describe nodes node2
...
Capacity:
  cpu:                 24
  ephemeral-storage:   229698892Ki
  hugepages-1Gi:       16Gi
  hugepages-2Mi:       4Gi
  memory:              264030604Ki
  nvidia.com/mlnx2f0:  8
  pods:                110
Allocatable:
  cpu:                 23900m
  ephemeral-storage:   211690498517
  hugepages-1Gi:       16Gi
  hugepages-2Mi:       4Gi
  memory:              242694540Ki
  nvidia.com/mlnx2f0:  8
  pods:                110
...


Enable CPU and Topology Management

CPU Manager manages groups of CPUs and constrains workloads to specific CPUs.

CPU Manager is useful for workloads that have some of these attributes:

  • Require as much CPU time as possible
  • Are sensitive to processor cache misses
  • Are low-latency network applications
  • Coordinate with other processes and benefit from sharing a single processor cache

Topology Manager uses topology information from collected hints to decide if a pod can be accepted or rejected on a node, based on the configured Topology Manager policy and Pod resources requested. In order to extract the best performance, optimizations related to CPU isolation and memory and device locality are required.

Topology Manager is useful for workloads that use hardware accelerators to support latency-critical execution and high throughput parallel computation.

To use Topology Manager, CPU Manager with static policy must be used.

For additional information, please refer to Control CPU Management Policies on the Node and Control Topology Management Policies on a Node.

In order to enable CPU Manager and Topology Manager, add the following lines to the kubelet configuration file /etc/kubernetes/kubelet-config.yaml on each worker node:

/etc/kubernetes/kubelet-config.yaml
...
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 10s
topologyManagerPolicy: single-numa-node
featureGates:
  CPUManager: true
  TopologyManager: true

Due to changes in cpuManagerPolicy, remove /var/lib/kubelet/cpu_manager_state and restart kubelet service on each affected K8s worker node.

Worker Node console
# rm -f /var/lib/kubelet/cpu_manager_state
# service kubelet restart
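
To confirm that the static policy took effect, inspect the regenerated state file on the worker node (its exact JSON content varies by kubelet version):

Worker Node console
# cat /var/lib/kubelet/cpu_manager_state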

Application

DPDK traffic emulation is shown in the Testbed Flow Diagram below. The traffic is pushed from the TRex server via the ens2f0 interface to the TESTPMD pod via the SR-IOV network interface net1. The TESTPMD pod swaps the MAC addresses and re-routes the ingress traffic via the same net1 interface back to the same interface on the TRex server.

Verification

  1. Create a sample deployment test-deployment.yaml (container image should include InfiniBand userspace drivers and performance tools):

    test-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mlnx-inbox-pod
      labels:
        app: sriov
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: sriov
      template:
        metadata:
          labels:
            app: sriov
          annotations:
            k8s.v1.cni.cncf.io/networks: netmlnx2f0
        spec:
          containers:
          - image: < Container image >
            name: mlnx-inbox-ctr
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              requests:
                cpu: 4
                nvidia.com/mlnx2f0: 1
              limits:
                cpu: 4
                nvidia.com/mlnx2f0: 1
            command:
            - sh
            - -c
            - sleep inf

     

  2. Deploy the sample deployment.

    Master Node console
    # kubectl apply -f test-deployment.yaml


  3. Verify the deployment is running.

    Master Node console
    # kubectl get pod -o wide
    NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
    mlnx-inbox-pod-599dc445c8-72x6g   1/1     Running   0          12s   10.233.96.5   node2   <none>           <none>
    mlnx-inbox-pod-599dc445c8-v5lnx   1/1     Running   0          12s   10.233.92.4   node3   <none>           <none>


  4. Check available network interfaces in POD.

    Master Node console
    # kubectl exec -it mlnx-inbox-pod-599dc445c8-72x6g -- bash
    
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# rdma link
    link rocep7s0f0v2/1 state ACTIVE physical_state LINK_UP netdev net1
    
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# ip a s
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host 
           valid_lft forever preferred_lft forever
    2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
        link/ipip 0.0.0.0 brd 0.0.0.0
    4: eth0@if208: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
        link/ether 12:51:ab:b3:ef:26 brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet 10.233.96.5/32 brd 10.233.96.5 scope global eth0
           valid_lft forever preferred_lft forever
        inet6 fe80::1051:abff:feb3:ef26/64 scope link 
           valid_lft forever preferred_lft forever
    201: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
        link/ether 02:40:7d:5e:5f:af brd ff:ff:ff:ff:ff:ff
        inet 192.168.101.2/24 brd 192.168.101.255 scope global net1
           valid_lft forever preferred_lft forever
        inet6 fe80::40:7dff:fe5e:5faf/64 scope link 
           valid_lft forever preferred_lft forever
  5. Run a synthetic RDMA benchmark with ib_write_bw, a bandwidth test that uses RDMA write transactions.

    Server

    ib_write_bw  -F -d $IB_DEV_NAME --report_gbits

    Client

    ib_write_bw  -F $SERVER_IP -d $IB_DEV_NAME --report_gbits

    Please open two consoles to the K8s master node: one for the server side and the other for the client side.
    In the first console (server side), run the following commands:

    Master Node console
    # kubectl exec -it mlnx-inbox-pod-599dc445c8-72x6g -- bash
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# ip a s net1 | grep inet
        inet 192.168.101.2/24 brd 192.168.101.255 scope global net1
        inet6 fe80::40:7dff:fe5e:5faf/64 scope link 
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# rdma link
    link rocep7s0f0v2/1 state ACTIVE physical_state LINK_UP netdev net1 
    root@mlnx-inbox-pod-599dc445c8-72x6g:/tmp# ib_write_bw -F -d rocep7s0f0v2 --report_gbits
    
    ************************************
    * Waiting for client to connect... *
    ************************************

    In the second console (client side), run the following commands:

    Master Node console
    # kubectl exec -it mlnx-inbox-pod-599dc445c8-v5lnx -- bash
    root@mlnx-inbox-pod-599dc445c8-v5lnx:/tmp# rdma link
    link rocep7s0f0v3/1 state ACTIVE physical_state LINK_UP netdev net1 
    root@mlnx-inbox-pod-599dc445c8-v5lnx:/tmp# ib_write_bw  -F -d rocep7s0f0v3 192.168.101.2 --report_gbits
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF		Device         : rocep7s0f0v3
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : Ethernet
     GID index       : 2
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0000 QPN 0x01f2 PSN 0x75e7cf RKey 0x050e26 VAddr 0x007f51e51b9000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:01
     remote address: LID 0000 QPN 0x00f2 PSN 0x13427f RKey 0x010e26 VAddr 0x007f1ecaac8000
     GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:02
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
     65536      5000             94.26              92.87  		         0.169509
    ---------------------------------------------------------------------------------------
    
    


TRex Server Deployment

This guide uses TRex package v2.87.
For a detailed TRex installation and configuration guide, please refer to the TRex Documentation.

TRex installation and configuration steps are performed with the root user account.

Prerequisites

For the TRex server, a standard server with the RDMA subsystem installed is used.

Activate the network interfaces used by the TRex application with netplan.
In our deployment, interfaces ens2f0 and ens2f1 are used:

/etc/netplan/00-installer-config.yaml
# This is the network config written by 'subiquity'
network:
  ethernets:
    ens4f0:
      dhcp4: true
      dhcp-identifier: mac
    ens2f0: {}
    ens2f1: {}
  version: 2

Then re-apply netplan and check link status for ens2f0/ens2f1 network interfaces.

TRex server console
# netplan apply
# rdma link
link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev ens2f0 
link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev ens2f1 
link mlx5_2/1 state ACTIVE physical_state LINK_UP netdev ens4f0 
link mlx5_3/1 state DOWN physical_state DISABLED netdev ens4f1 
 

Update the MTU size for interfaces ens2f0 and ens2f1.

TRex server console
# ip link set ens2f0 mtu 9000
# ip link set ens2f1 mtu 9000

Installation

Create a TRex working directory and obtain the TRex package.

TRex server console
# cd /tmp
# wget https://trex-tgn.cisco.com/trex/release/v2.87.tar.gz --no-check-certificate
# mkdir /scratch
# cd /scratch
# tar -zxf /tmp/v2.87.tar.gz
# chmod 777 -R /scratch


First-Time Scripts 

The next steps continue from the /scratch/v2.87 folder.

Run TRex configuration script in interactive mode. Follow the instructions on the screen to create a basic config file /etc/trex_cfg.yaml:

TRex server console
# ./dpdk_setup_ports.py -i

The /etc/trex_cfg.yaml configuration file is created. Later we'll change it to suit our setup.


Appendix

Performance Testing

Below is a performance test of DPDK traffic emulation between the TRex traffic generator and the TESTPMD application running on the K8s worker node, in accordance with the testbed diagram presented above.

Prerequisites

Before starting the test, update the TRex configuration file /etc/trex_cfg.yaml with the MAC address of the high-performance interface from the TESTPMD pod. Below are the steps to complete this update.

  1. Run a pod on the K8s cluster with the TESTPMD application according to the YAML configuration file testpmd-inbox.yaml presented below (the container image should include InfiniBand userspace drivers and the TESTPMD application):

    testpmd-inbox.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test-deployment
      labels:
        app: test
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: test
      template:
        metadata:
          labels:
            app: test
          annotations:
            k8s.v1.cni.cncf.io/networks: netmlnx2f0
        spec:
          containers:
          - image: < container image >
            name: test-pod
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            volumeMounts:
            - mountPath: /hugepages
              name: hugepage
            resources:
              requests:
                hugepages-1Gi: 2Gi
                memory: 16Gi
                cpu: 8
                nvidia.com/mlnx2f0: 1
              limits:
                hugepages-1Gi: 2Gi
                memory: 16Gi
                cpu: 8
                nvidia.com/mlnx2f0: 1
            command:
            - sh
            - -c
            - sleep inf
          volumes:
          - name: hugepage
            emptyDir:
              medium: HugePages

    Deploy the deployment with the following command:

    Master Node console
    # kubectl apply -f testpmd-inbox.yaml
  2. Get the network information from the deployed pod by running the following:

    Master Node console
    # kubectl get pod -o wide
    NAME                               READY   STATUS        RESTARTS   AGE    IP            NODE    NOMINATED NODE   READINESS GATES
    test-deployment-676476c78d-glbfs   1/1     Running       0          30s    10.233.92.5   node3   <none>           <none>
    
    # kubectl exec -it test-deployment-676476c78d-glbfs -- ip a s net1
    193: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
        link/ether 32:f9:3f:e3:dc:89 brd ff:ff:ff:ff:ff:ff
        inet 192.168.101.3/24 brd 192.168.101.255 scope global net1
           valid_lft forever preferred_lft forever
        inet6 fe80::30f9:3fff:fee3:dc89/64 scope link 
           valid_lft forever preferred_lft forever


  3. Update the TRex configuration file /etc/trex_cfg.yaml with the MAC address of the net1 network interface (32:f9:3f:e3:dc:89):

    /etc/trex_cfg.yaml
    ### Config file generated by dpdk_setup_ports.py ###
    
    - version: 2
      interfaces: ['07:00.0', '0d:00.0']
      port_info:
          - dest_mac: 32:f9:3f:e3:dc:89 # MAC OF NET1 INTERFACE
            src_mac:  0c:42:a1:24:05:1a
          - dest_mac: 32:f9:3f:e3:dc:89 # MAC OF NET1 INTERFACE
            src_mac:  0c:42:a1:24:05:1b
    
      platform:
          master_thread_id: 0
          latency_thread_id: 12
          dual_if:
            - socket: 0
              threads: [1,2,3,4,5,6,7,8,9,10,11]
    
    

DPDK Emulation Test

  1. Run TESTPMD apps in container:

    Master Node console
    # kubectl exec -it test-deployment-676476c78d-glbfs -- bash
    root@test-deployment-676476c78d-glbfs:/tmp# dpdk-testpmd -c 0x1fe  -m 1024 -w $PCIDEVICE_NVIDIA_COM_MLNX2F0 -- --burst=64 --txd=1024 --rxd=1024 --mbcache=512 --rxq=8 --txq=8 --nb-cores=4  --rss-udp --forward-mode=macswap  -a -i
    ...
    testpmd> 

    Specific TESTPMD parameters:

    $PCIDEVICE_NVIDIA_COM_MLNX2F0 - an environment variable injected into the pod, containing the PCI address of the net1 interface

    More information about additional TESTPMD parameters:
    https://doc.dpdk.org/guides/testpmd_app_ug/run_app.html?highlight=testpmd 
    https://doc.dpdk.org/guides/linux_gsg/linux_eal_parameters.html

  2. Run TRex traffic generator on TRex server:

    TRex server console
    # cd /scratch/v2.87/
    # ./t-rex-64 -v 7 -i -c 11 --no-ofed-check

    Open a second console to the TRex server and create a traffic generation file, mlnx-trex.py, in the /scratch/v2.87 folder:

    mlnx-trex.py
    from trex_stl_lib.api import *
     
    class STLS1(object):
     
        def create_stream (self):
            
            pkt = Ether()/IP(src="16.0.0.1",dst="48.0.0.1")/UDP(dport=12)/(22*'x')
                      
            vm = STLScVmRaw( [
                                    STLVmFlowVar(name="v_port",
                                                    min_value=4337,
                                                      max_value=5337,
                                                      size=2, op="inc"),
                                    STLVmWrFlowVar(fv_name="v_port",
                                                pkt_offset= "UDP.sport" ),
                                    STLVmFixChecksumHw(l3_offset="IP",l4_offset="UDP",l4_type=CTRexVmInsFixHwCs.L4_TYPE_UDP),
     
                                ]
                            )
     
            return STLStream(packet = STLPktBuilder(pkt = pkt ,vm = vm ) ,
                                    mode = STLTXCont(pps = 8000000) )
     
     
        def get_streams (self, direction = 0, **kwargs):
            # create 1 stream
            return [ self.create_stream() ]
     
     
    # dynamic load - used for trex console or simulator
    def register():
        return STLS1()


    Then run the TRex console and generate traffic to the TESTPMD pod:

    TRex server console
    # cd /scratch/v2.87/
    # ./trex-console
    Using 'python3' as Python interpeter
    
    Connecting to RPC server on localhost:4501                   [SUCCESS]
    
    Connecting to publisher server on localhost:4500             [SUCCESS]
    
    Acquiring ports [0, 1]:                                      [SUCCESS]
    
    Server Info:
    Server version:   v2.87 @ STL
    Server mode:      Stateless
    Server CPU:       11 x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
    Ports count:      2 x 100Gbps @ MT2892 Family [ConnectX-6 Dx]	
    
    -=TRex Console v3.0=-
    
    Type 'help' or '?' for supported actions
    
    trex> tui<enter>
    ...
    tui> start -f mlnx-trex.py -m 45mpps -p 0
    ...
    Global Statistitcs
    
    connection   : localhost, Port 4501                       total_tx_L2  : 23.9 Gbps                      
    version      : STL @ v2.87                                total_tx_L1  : 30.93 Gbps                     
    cpu_util.    : 82.88% @ 11 cores (11 per dual port)       total_rx     : 25.31 Gbps                     
    rx_cpu_util. : 0.0% / 0 pps                               total_pps    : 44.84 Mpps                     
    async_util.  : 0.05% / 11.22 Kbps                         drop_rate    : 0 bps                          
    total_cps.   : 0 cps                                      queue_full   : 0 pkts  
    ...

    Summary

    From the above test, it is evident that the desired traffic rate of 45 Mpps is achieved with an SR-IOV network port in the pod.

    In order to get better results, additional application tuning is required for TRex and TESTPMD.


    Done!

Authors

Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.

Amir Zeidner

For the past several years, Amir has worked as a Solutions Architect primarily in the Telco space, leading advanced solutions to answer 5G, NFV, and SDN networking infrastructures requirements. Amir’s expertise in data plane acceleration technologies, such as Accelerated Switching and Network Processing (ASAP²) and DPDK, together with a deep knowledge of open source cloud-based infrastructures, allows him to promote and deliver unique end-to-end NVIDIA Networking solutions throughout the Telco world.

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. Neither NVIDIA Corporation nor any of its direct or indirect subsidiaries and affiliates (collectively: “NVIDIA”) make any representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Trademarks
NVIDIA, the NVIDIA logo, and Mellanox are trademarks and/or registered trademarks of NVIDIA Corporation and/or Mellanox Technologies Ltd. in the U.S. and in other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright
© 2023 NVIDIA Corporation & affiliates. All Rights Reserved.