Technology Preview for K8s cluster using NVIDIA DPUs and Host-Based Networking (HBN)


Scope

This technical preview document is intended for previewing NVIDIA Host-Based Networking (HBN) service running on BlueField DPU in a Kubernetes use case.

Abbreviations and Acronyms

  • BGP - Border Gateway Protocol

  • CNI - Container Network Interface

  • DOCA - Datacenter-on-a-Chip Architecture

  • DPU - Data Processing Unit

  • ECMP - Equal-Cost Multi Pathing

  • EVPN - Ethernet Virtual Private Network

  • HBN - Host-Based Networking

  • K8s - Kubernetes

  • LACP - Link Aggregation Control Protocol

  • LAG - Link Aggregation

  • PF - Physical Function

  • SDN - Software Defined Networking

  • SRIOV - Single-Root IO Virtualization

  • VF - Virtual Function

  • VXLAN - Virtual Extensible Local-Area-Network

Introduction

The BlueField®-2 data processing unit (DPU) provides innovative offload, acceleration, security, and efficiency in every host.

BlueField-2 combines the power of ConnectX®-6 with programmable Arm cores and hardware acceleration engines for software-defined storage, networking, security, and management workloads.

NVIDIA HBN is a service running on the DPU which simplifies network provisioning by terminating the Ethernet layer 2 network in the DPU, allowing the physical network to become a "plug-and-play" fabric based on a BGP-managed layer 3 network.

With HBN, the workload servers are connected to the physical switches over router ports using an unnumbered BGP configuration. This configuration is automatic, requires no IP subnet or address allocation for the underlay network, and provides built-in active-active high availability and load balancing based on ECMP.

The DPUs in the servers act as virtual tunnel endpoints (VTEPs) for the host network, providing a stretched layer 2 between all the nodes in the cluster over VXLAN using EVPN technology.

The configuration of HBN on the DPUs is almost identical to the configuration of physical NVIDIA switches, as it uses the same NVUE CLI commands, or NVUE programmatic API.

Note

At the time of publishing this document, there are some throughput limitations with HBN that will be addressed in the upcoming releases. RoCE support by HBN will be added in future releases as well.

Solution Architecture

Logical Design

[Figure: Logical design of the K8s cluster with DPUs and HBN]

Our deployment uses HBN to create two stretched layer 2 networks: one for the primary network (Calico running in IP-in-IP configuration) using VLAN 60, and one for a secondary SR-IOV network using VLAN 50.

The primary network runs over the physical function (PF) of the DPU, using veth-based kernel interfaces (virtual Ethernet interfaces) that appear in the pods as eth0. The secondary network allocates a virtual function (VF) per pod, which appears in the pod as net1 and is taken from a pool of up to 16 VFs supported by HBN per DPU. The secondary network can utilize the DPU's hardware acceleration capabilities, and support for RoCE traffic over it will be added in an upcoming release.
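Once the cluster is deployed (as described later in this document), this split can be observed from inside any pod attached to the secondary network. For example (the pod name below is a placeholder):

    # kubectl exec -it <pod-name> -- ip -br addr show

Here eth0 is the veth-based primary (Calico) interface and net1 is the SR-IOV VF of the secondary network.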

The external access for the deployment (i.e., lab network access and Internet connectivity) is achieved through a gateway node connected to a leaf and acting as the default gateway for the management network 172.60.0.0/24 on VLAN 60.

Traffic between the servers (with the exception of the gateway node) traverses the leaf switches solely at layer 3, with packets routed between BGP neighbors, and utilizes ECMP for high availability and load balancing.

The gateway node used in this setup connects through a VTEP configured on the leaf switch.

Note

The configuration and deployment of the gateway node that provides external access in this example are outside the scope of this document.

The gateway solution used in this example is not highly available. In a production environment, it is recommended to use a highly available gateway deployment (i.e., one connected to more than a single leaf switch).

The presented deployment represents a single rack, but it can easily be scaled out by adding switches to create a large-scale, multi-rack deployment that accommodates hundreds of nodes.

The main advantage of using HBN in this deployment is that very little switch configuration is required: each added switch needs a unique /32 loopback address and, in the case of leaf switches, a unique AS number, but otherwise no additional configuration is needed.

The same applies to the HBN configuration on each DPU, which requires only a unique /32 loopback address and a unique AS number.
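For illustration, the only values that change from one DPU to the next in the HBN configuration shown later in this document are the loopback address, the BGP router ID and AS number, and the VXLAN source address (Worker1 is used as the example):

    nv set interface lo ip address 10.10.10.12/32
    nv set router bgp router-id 10.10.10.12
    nv set router bgp autonomous-system 65112
    nv set nve vxlan source address 10.10.10.12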

BGP peering uses "unnumbered" mode, which automatically discovers neighbors using IPv6 link-local addresses on every link, providing a kind of "plug-and-play" connectivity.

Used IP addresses:

Below are the IP addresses set on various interfaces in the setup.

Note

Please note that the tmfifo_net0 interfaces are virtual interfaces automatically created on top of the RShim connection between the host and the DPU, on both sides.

We can use these interfaces to install DPU software from the host, as an alternative to the out-of-band 1GbE physical port on the DPU.

[Figure: IP addressing of the setup interfaces]

  • Master - K8s master node and deployment node; ens2f0np0: 172.60.0.11/24, tmfifo_net0: 192.168.100.1/30

  • Worker1 - K8s worker node 1; ens2f0np0: 172.60.0.12/24, tmfifo_net0: 192.168.100.1/30

  • Worker2 - K8s worker node 2; ens2f0np0: 172.60.0.13/24, tmfifo_net0: 192.168.100.1/30

  • DPU - any of the used DPUs; tmfifo_net0: 192.168.100.2/30

Switch and HBN configuration and connectivity:

  • Leaf1 - leaf (TOR) switch 1; router ID 10.10.10.1/32, AS 65101; links: swp1-3 to DPUs, swp31-32 to spines, swp30 to gateway node

  • Leaf2 - leaf (TOR) switch 2; router ID 10.10.10.2/32, AS 65102; links: swp1-3 to DPUs, swp31-32 to spines

  • Spine1 - spine switch 1; router ID 10.10.10.101/32, AS 65199; links: swp1-2 to leafs

  • Spine2 - spine switch 2; router ID 10.10.10.102/32, AS 65199; links: swp1-2 to leafs

  • Master HBN - HBN on the master node DPU; router ID 10.10.10.11/32, AS 65111; links: p0_sf, p1_sf to leafs

  • Worker1 HBN - HBN on the Worker1 node DPU; router ID 10.10.10.12/32, AS 65112; links: p0_sf, p1_sf to leafs

  • Worker2 HBN - HBN on the Worker2 node DPU; router ID 10.10.10.13/32, AS 65113; links: p0_sf, p1_sf to leafs

VTEPs configuration:

  • Leaf1: swp30 on VLAN 60, VNI 10060

  • HBN (any DPU): pf0vf0_sf through pf0vf15_sf on VLAN 50, VNI 10050; pf0hpf_sf on VLAN 60, VNI 10060

Software Stack Components

[Figure: Software stack components]

Bill of Materials

[Figure: Bill of materials]

Deployment and Configuration

Configuring the Network Switches

  • The NVIDIA® SN3700 switches are installed with NVIDIA® Cumulus® Linux 5.3 OS

  • Each node is connected to two TOR switches over two 100Gb/s router ports (interfaces swp1-3 on the TOR switches)

  • The TOR switches are connected to the two spine switches using router ports (interfaces swp31-32 on the TORs and swp1-2 on the spines)

  • In addition, a gateway node is connected through a VTEP to one of the TORs, providing external network access and Internet connectivity on the management network (VLAN 60, interface swp30 on Leaf1)

Configure the switches using the following NVUE commands:

Leaf1 Console

nv set interface lo ip address 10.10.10.1/32
nv set interface swp30,swp31-32,swp1-3
nv set interface swp30 link mtu 9000
nv set interface swp30 bridge domain br_default
nv set interface swp30 bridge domain br_default access 60
nv set bridge domain br_default vlan 60
nv set bridge domain br_default vlan 60 vni 10060
nv set nve vxlan source address 10.10.10.1
nv set nve vxlan arp-nd-suppress on
nv set system global anycast-mac 44:38:39:BE:EF:AA
nv set evpn enable on
nv set router bgp autonomous-system 65101
nv set router bgp router-id 10.10.10.1
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31 peer-group underlay
nv set vrf default router bgp neighbor swp32 peer-group underlay
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group hbn remote-as external
nv set vrf default router bgp neighbor swp1 peer-group hbn
nv set vrf default router bgp neighbor swp2 peer-group hbn
nv set vrf default router bgp neighbor swp3 peer-group hbn
nv set vrf default router bgp peer-group hbn address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply -y

Leaf2 Console

nv set interface lo ip address 10.10.10.2/32
nv set interface swp31-32,swp1-3
nv set router bgp autonomous-system 65102
nv set router bgp router-id 10.10.10.2
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp31 peer-group underlay
nv set vrf default router bgp neighbor swp32 peer-group underlay
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group hbn remote-as external
nv set vrf default router bgp neighbor swp1 peer-group hbn
nv set vrf default router bgp neighbor swp2 peer-group hbn
nv set vrf default router bgp neighbor swp3 peer-group hbn
nv set vrf default router bgp peer-group hbn address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply -y

Spine1 Console

nv set interface lo ip address 10.10.10.101/32
nv set interface swp1-2
nv set router bgp autonomous-system 65199
nv set router bgp router-id 10.10.10.101
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp1 peer-group underlay
nv set vrf default router bgp neighbor swp2 peer-group underlay
nv set vrf default router bgp address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply -y

Spine2 Console

nv set interface lo ip address 10.10.10.102/32
nv set interface swp1-4
nv set router bgp autonomous-system 65199
nv set router bgp router-id 10.10.10.102
nv set vrf default router bgp peer-group underlay remote-as external
nv set vrf default router bgp neighbor swp1 peer-group underlay
nv set vrf default router bgp neighbor swp2 peer-group underlay
nv set vrf default router bgp address-family l2vpn-evpn enable on
nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
nv config apply -y
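After applying the configuration, the underlay BGP sessions on each switch can be reviewed with the NVUE show commands, which mirror the set commands used above. A minimal sketch (exact command paths and output depend on the Cumulus Linux release):

    nv config show
    nv show vrf default router bgp neighbor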

Host Preparation

  1. Install Ubuntu 22.04 on the servers and make sure they are up to date:

    Server Console

    $ sudo apt update
    $ sudo apt upgrade
    $ sudo reboot

  2. To allow password-less sudo, add the local user to the sudoers file on each host:

    Server Console

    $ sudo vi /etc/sudoers

    #includedir /etc/sudoers.d
    #K8s cluster deployment user with sudo privileges without password
    user ALL=(ALL) NOPASSWD:ALL

  3. On the deployment node, generate an SSH key and copy it to each node. For example:

    Deployment Node Console (Master node is used)

    $ ssh-keygen

    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/user/.ssh/id_rsa):
    Created directory '/home/user/.ssh'.
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/user/.ssh/id_rsa.
    Your public key has been saved in /home/user/.ssh/id_rsa.pub.
    The key fingerprint is:
    SHA256:PaZkvxV4K/h8q32zPWdZhG1VS0DSisAlehXVuiseLgA user@master
    The key's randomart image is:
    +---[RSA 2048]----+
    |      ...+oo+o..o|
    |      .oo   .o. o|
    |     . .. . o  +.|
    |   E  .  o +  . +|
    |    .   S = +  o |
    |     . o = + o  .|
    |      . o.o +   o|
    |       ..+.*. o+o|
    |        oo*ooo.++|
    +----[SHA256]-----+

    $ ssh-copy-id -i ~/.ssh/id_rsa user@10.10.0.1

    /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/user/.ssh/id_rsa.pub"
    The authenticity of host '10.10.0.1 (10.10.0.1)' can't be established.
    ECDSA key fingerprint is SHA256:uyglY5g0CgPNGDm+XKuSkFAbx0RLaPijpktANgXRlD8.
    Are you sure you want to continue connecting (yes/no)? yes
    /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
    /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
    user@10.10.0.1's password:

    Number of key(s) added: 1

  4. Now try logging into each machine and verify that you can log in without a password, as in the example below.
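For example, from the deployment node (the address below is only an illustration; use your own node addresses):

    $ ssh user@172.60.0.12 hostname

This should print the remote host name without prompting for a password.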

Deploying the BFB

  1. Download the DOCA host drivers package from the DOCA webpage by scrolling down to the bottom of the page and selecting the relevant package. It installs all the necessary software to access the DPU and install its image from the host.

    [Figure: DOCA host package download page]

  2. Install the DOCA host drivers package you downloaded:

    Server Console

    # wget https://www.mellanox.com/downloads/DOCA/DOCA_v1.5.1/doca-host-repo-ubuntu2204_1.5.1-0.1.8.1.5.1007.1.5.8.1.1.2.1_amd64.deb
    # dpkg -i doca-host-repo-ubuntu2204_1.5.1-0.1.8.1.5.1007.1.5.8.1.1.2.1_amd64.deb
    # apt-get update
    # apt install doca-runtime
    # apt install doca-tools

  3. Download BFB 4.0.3 with DOCA 2.0.2v2:

    [Figure: BFB download page]

  4. Create the config file bf.cfg:

    Server Console

    # echo 'ENABLE_SFC_HBN=yes' > bf.cfg

  5. Install the BFB (the DPU's operating system image):

    Server Console

    # bfb-install -c bf.cfg -r rshim0 -b DOCA_2.0.2_BSP_4.0.3_Ubuntu_22.04-10.23-04.prod.bfb

  6. After the installation is complete, perform a full power cycle of the servers to allow the DPU firmware to reboot and upgrade if needed.

  7. After the servers come back up, assign an IP address to the first DPU interface on each host using netplan. The example below is for the master node; the same should be done for the workers (172.60.0.12 and 172.60.0.13). Note the default route via the gateway node (172.60.0.254), which provides external/Internet connectivity:

    Server Console

    # vi /etc/netplan/00-installer-config.yaml

    network:
      ethernets:
        eno1:
          dhcp4: true
        eno2:
          dhcp4: true
        eno3:
          dhcp4: true
        eno4:
          dhcp4: true
        ens2f0np0:
          dhcp4: false
          mtu: 9000
          addresses: [172.60.0.11/24]
          nameservers:
            addresses: [8.8.8.8]
            search: []
          routes:
            - to: default
              via: 172.60.0.254
        ens2f1np1:
          dhcp4: false
      version: 2

  8. Apply the settings:

    Server Console

    # netplan apply
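To confirm that the address and the default route were applied on each host, a quick sanity check can be run:

    # ip -br addr show ens2f0np0
    # ip route show default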

Installing DOCA Container Configs Package

Log into the DPU in one of the following ways:

  • Using the OOB management interface (if it is connected and has obtained an IP address over DHCP)

  • Using the built-in network interface over RShim (tmfifo_net0)

This example uses the RShim option:

  1. Use the following command to assign an IP address to the tmfifo_net0 interface:

    Server Console

    # ip address add 192.168.100.1/30 dev tmfifo_net0

    When you first log into the DPU, use ubuntu/ubuntu as credentials.

  2. You will be asked to modify the password:

    Server Console

    # ssh ubuntu@192.168.100.2
    The authenticity of host '192.168.100.2 (192.168.100.2)' can't be established.
    ED25519 key fingerprint is SHA256:S2gzl4QzVUY0g3GRsl9VLi3tYHQdIe7oQ+8I8tr95c4.
    This key is not known by any other names
    Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
    Warning: Permanently added '192.168.100.2' (ED25519) to the list of known hosts.
    ubuntu@192.168.100.2's password:
    You are required to change your password immediately (administrator enforced)
    Welcome to Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-1049-bluefield aarch64)

     * Documentation:  https://help.ubuntu.com
     * Management:     https://landscape.canonical.com
     * Support:        https://ubuntu.com/advantage

    System information as of Mon Jan 16 12:53:57 UTC 2023

      System load:  0.09
      Usage of /:   6.7% of 58.00GB
      Memory usage: 13%
      Swap usage:   0%
      Processes:    485
      Users logged in: 0
      IPv4 address for docker0:     172.17.0.1
      IPv4 address for mgmt:        127.0.0.1
      IPv6 address for mgmt:        ::1
      IPv4 address for oob_net0:    10.10.7.37
      IPv4 address for tmfifo_net0: 192.168.100.2

    0 updates can be applied immediately.

    The programs included with the Ubuntu system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.

    Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
    applicable law.

    WARNING: Your password has expired. You must change your password now and login again!
    Changing password for ubuntu.
    Current password:
    New password:
    Retype new password:
    passwd: password updated successfully
    Connection to 192.168.100.2 closed.

  3. Once you are logged into the DPU, download the DOCA container configs package and install it. This package includes the necessary scripts and configurations for activating the HBN container on the DPU:

    DPU Console

    # wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/doca/doca_container_configs/versions/2.0.2v2/zip -O doca_container_configs_2.0.2v2.zip --no-check-certificate
    # unzip -o doca_container_configs_2.0.2v2.zip -d doca_container_configs_2.0.2v2
    # cd doca_container_configs_2.0.2v2/scripts/doca_hbn/1.4.0
    # chmod +x hbn-dpu-setup.sh
    # ./hbn-dpu-setup.sh
    # cd ../../../configs/2.0.2/
    # cp doca_hbn.yaml /etc/kubelet.d/

    Warning

    You will not be able to pull the zip file from the Internet if the DPU's out-of-band management interface is not used in your setup.

    In this case, download it on the host and use scp to copy it to the DPU over the RShim network interface (tmfifo_net0).

    On the host console:

    # wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/doca/doca_container_configs/versions/2.0.2v2/zip -O doca_container_configs_2.0.2v2.zip --no-check-certificate
    # scp doca_container_configs_2.0.2v2.zip ubuntu@192.168.100.2:/home/ubuntu/

  4. Reboot the DPU:

    DPU Console

    # reboot

Configuring HBN

After the DPU comes back up, it runs the "doca-hbn" container.

  1. To find its ID (will appear in the first column under CONTAINER), run:

    DPU Console

    # crictl ps

  2. Connect to that container:

    DPU Console

    # crictl exec -it <container-id> bash

  3. Use the following NVUE commands to configure HBN (must be done on each DPU):

    HBN Container Console on Master Node DPU

    nv set bridge domain br_default vlan 50 vni 10050
    nv set bridge domain br_default vlan 60 vni 10060
    nv set evpn enable on
    nv set interface lo ip address 10.10.10.11/32
    nv set interface p0_sf,p1_sf link state up
    nv set interface p0_sf,p1_sf,pf0hpf_sf,pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf type swp
    nv set interface pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf bridge domain br_default access 50
    nv set interface pf0hpf_sf,pf1hpf_sf bridge domain br_default access 60
    nv set nve vxlan arp-nd-suppress on
    nv set nve vxlan enable on
    nv set nve vxlan source address 10.10.10.11
    nv set vrf default router bgp peer-group underlay remote-as external
    nv set vrf default router bgp neighbor p0_sf peer-group underlay
    nv set vrf default router bgp neighbor p1_sf peer-group underlay
    nv set router bgp autonomous-system 65111
    nv set router bgp enable on
    nv set router bgp router-id 10.10.10.11
    nv set router policy route-map LOOPBACK rule 1 action permit
    nv set router policy route-map LOOPBACK rule 1 match interface lo
    nv set system global anycast-mac 44:38:39:BE:EF:AA
    nv set vrf default router bgp address-family ipv4-unicast enable on
    nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
    nv set vrf default router bgp address-family l2vpn-evpn enable on
    nv set vrf default router bgp enable on
    nv set vrf default router bgp neighbor p0_sf type unnumbered
    nv set vrf default router bgp neighbor p1_sf type unnumbered
    nv set vrf default router bgp path-selection multipath aspath-ignore on
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast policy outbound route-map LOOPBACK
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast soft-reconfiguration on
    nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
    nv config apply -y

    HBN Container Console on Worker Node 1 DPU

    nv set bridge domain br_default vlan 50 vni 10050
    nv set bridge domain br_default vlan 60 vni 10060
    nv set evpn enable on
    nv set interface lo ip address 10.10.10.12/32
    nv set interface p0_sf,p1_sf link state up
    nv set interface p0_sf,p1_sf,pf0hpf_sf,pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf type swp
    nv set interface pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf bridge domain br_default access 50
    nv set interface pf0hpf_sf,pf1hpf_sf bridge domain br_default access 60
    nv set nve vxlan arp-nd-suppress on
    nv set nve vxlan enable on
    nv set nve vxlan source address 10.10.10.12
    nv set vrf default router bgp peer-group underlay remote-as external
    nv set vrf default router bgp neighbor p0_sf peer-group underlay
    nv set vrf default router bgp neighbor p1_sf peer-group underlay
    nv set router bgp autonomous-system 65112
    nv set router bgp enable on
    nv set router bgp router-id 10.10.10.12
    nv set router policy route-map LOOPBACK rule 1 action permit
    nv set router policy route-map LOOPBACK rule 1 match interface lo
    nv set system global anycast-mac 44:38:39:BE:EF:AA
    nv set vrf default router bgp address-family ipv4-unicast enable on
    nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
    nv set vrf default router bgp address-family l2vpn-evpn enable on
    nv set vrf default router bgp enable on
    nv set vrf default router bgp neighbor p0_sf type unnumbered
    nv set vrf default router bgp neighbor p1_sf type unnumbered
    nv set vrf default router bgp path-selection multipath aspath-ignore on
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast policy outbound route-map LOOPBACK
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast soft-reconfiguration on
    nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
    nv config apply -y

    HBN Container Console on Worker Node 2 DPU

    nv set bridge domain br_default vlan 50 vni 10050
    nv set bridge domain br_default vlan 60 vni 10060
    nv set evpn enable on
    nv set interface lo ip address 10.10.10.13/32
    nv set interface p0_sf,p1_sf link state up
    nv set interface p0_sf,p1_sf,pf0hpf_sf,pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf type swp
    nv set interface pf0vf0_sf,pf0vf10_sf,pf0vf11_sf,pf0vf12_sf,pf0vf13_sf,pf0vf1_sf,pf0vf2_sf,pf0vf3_sf,pf0vf4_sf,pf0vf5_sf,pf0vf6_sf,pf0vf7_sf,pf0vf8_sf,pf0vf9_sf bridge domain br_default access 50
    nv set interface pf0hpf_sf,pf1hpf_sf bridge domain br_default access 60
    nv set nve vxlan arp-nd-suppress on
    nv set nve vxlan enable on
    nv set nve vxlan source address 10.10.10.13
    nv set vrf default router bgp peer-group underlay remote-as external
    nv set vrf default router bgp neighbor p0_sf peer-group underlay
    nv set vrf default router bgp neighbor p1_sf peer-group underlay
    nv set router bgp autonomous-system 65113
    nv set router bgp enable on
    nv set router bgp router-id 10.10.10.13
    nv set router policy route-map LOOPBACK rule 1 action permit
    nv set router policy route-map LOOPBACK rule 1 match interface lo
    nv set system global anycast-mac 44:38:39:BE:EF:AA
    nv set vrf default router bgp address-family ipv4-unicast enable on
    nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
    nv set vrf default router bgp address-family l2vpn-evpn enable on
    nv set vrf default router bgp enable on
    nv set vrf default router bgp neighbor p0_sf type unnumbered
    nv set vrf default router bgp neighbor p1_sf type unnumbered
    nv set vrf default router bgp path-selection multipath aspath-ignore on
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast policy outbound route-map LOOPBACK
    nv set vrf default router bgp peer-group underlay address-family ipv4-unicast soft-reconfiguration on
    nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
    nv config apply -y

  4. You can exit back to the host and validate the connectivity between the hosts by pinging them over the high-speed interface:

    Server Console

    $ ping 172.60.0.11
    $ ping 172.60.0.12
    $ ping 172.60.0.13
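Optionally, the control-plane state can also be inspected from within the HBN container on each DPU using the NVUE show counterparts of the set commands above. A minimal sketch (exact command paths and output depend on the HBN release):

    nv show vrf default router bgp neighbor
    nv show nve vxlan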

Deploying Kubernetes

Now we will deploy Kubernetes on the hosts using Kubespray.

  1. On the deployment node (master node can also be used as the deployment node), run:

    Deployment Node Console (Master node is used)

    $ cd ~
    $ sudo apt -y install python3-pip jq
    $ wget https://github.com/kubernetes-sigs/kubespray/archive/refs/tags/v2.22.0.tar.gz
    $ tar -zxf v2.22.0.tar.gz
    $ cd kubespray-2.22.0
    $ sudo pip3 install -r requirements.txt

  2. Create the initial cluster configuration for the three nodes. We use the addresses assigned to the DPU's host-facing interface (ens2f0np0). The high-speed network is used for both the primary and the secondary network of our Kubernetes cluster:

    Deployment Node Console (Master Node is used)

    $ cp -rfp inventory/sample inventory/mycluster
    $ declare -a IPS=(172.60.0.11 172.60.0.12 172.60.0.13)
    $ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

  3. Edit the hosts list as follows:

    Deployment Node Console (Master Node is used)

    $ vi inventory/mycluster/hosts.yaml

    all:
      hosts:
        master:
          ansible_host: 172.60.0.11
          ip: 172.60.0.11
          access_ip: 172.60.0.11
        worker1:
          ansible_host: 172.60.0.12
          ip: 172.60.0.12
          access_ip: 172.60.0.12
        worker2:
          ansible_host: 172.60.0.13
          ip: 172.60.0.13
          access_ip: 172.60.0.13
      children:
        kube_control_plane:
          hosts:
            master:
        kube_node:
          hosts:
            worker1:
            worker2:
        etcd:
          hosts:
            master:
        k8s_cluster:
          children:
            kube_control_plane:
            kube_node:
        calico_rr:
          hosts: {}

  4. Edit the Calico configuration to use IP-in-IP:

    Deployment Node Console (Master Node is used)

    $ vi inventory/mycluster/group_vars/k8s_cluster/k8s-net-calico.yml

    calico_network_backend: "bird"
    calico_ipip_mode: "Always"
    calico_vxlan_mode: "Never"

  5. Deploy the cluster:

    Deployment Node Console (Master Node is used)

    $ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

  6. Label the worker nodes as workers:

    Master Node Console

    # kubectl label nodes worker1 node-role.kubernetes.io/worker=
    # kubectl label nodes worker2 node-role.kubernetes.io/worker=
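You can then confirm that all three nodes have joined the cluster and carry the expected roles:

    # kubectl get nodes -o wide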

Installing the NVIDIA Network Operator

  1. On the master node, install Helm and add the Mellanox network-operator repository:

    Master Node Console

    # snap install helm --classic
    # helm repo add mellanox https://mellanox.github.io/network-operator && helm repo update

  2. Create the values.yaml file:

    Master Node Console

    # nano values.yaml

    nfd:
      enabled: true

    sriovNetworkOperator:
      enabled: true

    deployCR: true

    ofedDriver:
      deploy: false

    nvPeerDriver:
      deploy: false

    rdmaSharedDevicePlugin:
      deploy: false

    sriovDevicePlugin:
      deploy: false

    secondaryNetwork:
      deploy: true

  3. Install the operator:

    Master Node Console

    # helm install -f ./values.yaml -n network-operator --create-namespace --wait mellanox/network-operator --generate-name

  4. Validate the installation:

    Master Node Console

    # kubectl -n network-operator get pods

  5. Create the policy.yaml file:

    Master Node Console

    # nano policy.yaml

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: mlnxnic
      namespace: network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/custom-rdma.capable: "true"
      resourceName: mlnxnet
      priority: 99
      mtu: 9000
      numVfs: 16
      nicSelector:
        pfNames: [ "ens2f0np0" ]
      deviceType: netdevice
      isRdma: true

  6. Apply it:

    Master Node Console

    # kubectl apply -f policy.yaml

  7. Create the network.yaml file:

    Master Node Console

    # nano network.yaml

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetwork
    metadata:
      name: mlnx-network
      namespace: network-operator
    spec:
      ipam: |
        {
          "datastore": "kubernetes",
          "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
          },
          "log_file": "/tmp/whereabouts.log",
          "log_level": "debug",
          "type": "whereabouts",
          "range": "172.50.0.0/24"
        }
      networkNamespace: default
      resourceName: mlnxnet

  8. Apply it:

    Master Node Console

    # kubectl apply -f network.yaml

  9. Wait a few minutes for the configuration to complete and then validate the network and its resources:

    Master Node Console

    # kubectl get network-attachment-definitions.k8s.cni.cncf.io
    # kubectl get node worker1 -o json | jq '.status.allocatable'
    # kubectl get node worker2 -o json | jq '.status.allocatable'

    Warning

    It may be necessary to perform a full power cycle on the servers if firmware configuration changes are required.
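As an additional host-level check, complementing the allocatable-resource check above, the VFs created by the policy should appear under the selected PF on each worker node once the policy has been applied; for example:

    $ ip link show ens2f0np0

The output should list the configured VFs (vf 0 through vf 15).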

Validating the Deployment

Running Test Daemon-Set

On the master node:

  1. Create the following yaml file:

    Master Node Console

    # nano testds.yaml

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: example-daemon
      labels:
        app: example-ds
    spec:
      selector:
        matchLabels:
          app: example-ds
      template:
        metadata:
          labels:
            app: example-ds
          annotations:
            k8s.v1.cni.cncf.io/networks: mlnx-network
        spec:
          containers:
          - image: ubuntu
            name: example-ds-pod
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                memory: 16Gi
                cpu: 8
                nvidia.com/mlnxnet: '1'
              requests:
                memory: 16Gi
                cpu: 8
                nvidia.com/mlnxnet: '1'
            command:
            - sleep
            - inf

  2. Then create the DaemonSet:

    Master Node Console

    # kubectl create -f testds.yaml
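To verify that the DaemonSet pods are running on both worker nodes and have received the secondary network resource:

    # kubectl get pods -o wide -l app=example-ds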

Running TCP Throughput Test

Now we will run a TCP throughput test on top of the high-speed secondary network between our pods running on different worker nodes.

Open an additional terminal window to the master node and connect to each of the pods:

  1. Check the names of the pods:

    Master Node Console

    # kubectl get pods

  2. Connect to the desired pod:

    Master Node Console

    # kubectl exec -it <pod-name> -- bash

  3. Install and run an iperf TCP test between the two pods over the high-speed secondary network. First, install the required tools on the first pod:

    First Pod Console

    # apt update
    # apt install iproute2 iperf -y

  4. Check the IP address of the first pod on the high-speed secondary network interface net1:

    First Pod Console

    # ip addr

  5. Run iperf in server mode on it:

    First Pod Console

    # iperf -s

  6. On the second pod, install iperf and run it in client mode, connecting to the first pod:

    Second Pod Console

    # apt update
    # apt install iperf -y
    # iperf -c <server-address> -P 10

    Done!

Authors

[Photo: Shachar Dor]

Shachar Dor

Shachar Dor joined the Solutions Lab team after working more than ten years as a software architect at NVIDIA Networking (previously Mellanox Technologies), where he was responsible for the architecture of network management products and solutions. Shachar's focus is on networking technologies, especially around fabric bring-up, configuration, monitoring, and life-cycle management.

Shachar has a strong background in software architecture, design, and programming through his work on multiple projects and technologies also prior to joining the company.

Last updated on Oct 23, 2023.