Scope

This Reference Deployment Guide (RDG) provides a practical and scalable Ethernet fabric deployment suitable for high-performance workloads in K8s. The fabric carries both the primary K8s network (e.g. Calico) and a secondary high-performance network for RDMA/DPDK, used in conjunction with the SR-IOV and RDMA device plugins and CNIs.

The proposed fabric configuration supports up to 480 workload servers in its maximum scale and provides a non-blocking throughput of up to 200Gbps between pods.

Abbreviations and Acronyms

Term     Definition
BGP      Border Gateway Protocol
CNI      Container Network Interface
DMA      Direct Memory Access
EVPN     Ethernet Virtual Private Network
ISL      Inter-Switch Link
K8s      Kubernetes
LACP     Link Aggregation Control Protocol
MLAG     Multi-Chassis Link Aggregation
RDMA     Remote Direct Memory Access
TOR      Top of Rack
VLAN     Virtual LAN (Local Area Network)
VRR      Virtual Router Redundancy
VTEP     Virtual Tunnel End Point
VXLAN    Virtual Extensible LAN

Introduction

K8s is the industry-standard platform for deploying and orchestrating cloud-native workloads.

The common K8s networking solutions (e.g. the widely used Flannel and Calico CNI plugins) are not optimized for performance and do not take advantage of today's hardware-accelerated networking technologies. Current interconnect solutions from NVIDIA provide up to 200Gbps of throughput at very low latency with minimal load on the server's CPU. To take advantage of these capabilities, an additional, high-speed RDMA-capable network must be provisioned for the pods.

This document demonstrates how to deploy, enable and configure a high-speed, hardware-accelerated network fabric in a K8s cluster, providing both the primary network and a secondary RDMA network on the same wire. The network fabric also includes highly-available border router functionality which provides in/out connectivity to the cluster (e.g. access to the Internet).

This document is intended for K8s administrators who want to enable a high-speed fabric for their applications running on top of K8s, such as big data, machine learning, storage, and database solutions.

The document begins with the design of the fabric and of the K8s deployment, then continues with the actual deployment and configuration steps, concluding with a performance test that demonstrates the benefits of the solution.

Solution Architecture

Key Components and Technologies

  • NVIDIA ConnectX SmartNICs
    10/25/40/50/100/200 and 400G Ethernet Network Adapters
    The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offer advanced hardware offloads and accelerations.
    NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.

  • NVIDIA LinkX Cables 
    The NVIDIA® LinkX® product family of cables and transceivers provides the industry's most complete line of 10, 25, 40, 50, 100, 200, and 400GbE Ethernet and 100, 200 and 400Gb/s InfiniBand products for Cloud, HPC, hyperscale, Enterprise, telco, storage and artificial intelligence data center applications.

  • NVIDIA Spectrum Ethernet Switches
    Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.
    Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects. 
    NVIDIA combines the benefits of NVIDIA Spectrum switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux, SONiC and NVIDIA Onyx®.

  • NVIDIA Cumulus Linux 
    NVIDIA® Cumulus® Linux is the industry's most innovative open network operating system that allows you to automate, customize, and scale your data center network like no other.

  • RDMA 
    RDMA is a technology that allows computers in a network to exchange data without involving the processor, cache or operating system of either computer. Like locally based DMA, RDMA improves throughput and performance and frees up compute resources.

  • Kubernetes
    Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.

  • Kubespray 
    Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes cluster configuration management tasks and provides:
    • A highly available cluster
    • Composable attributes
    • Support for most popular Linux distributions

Logical Design

The physical servers used in this document: 

  •   1 x Deployment Node
  •   1 x Master Node
  •   4 x Worker Nodes; each with 1 x ConnectX-6 NIC

The deployment of the fabric is based on a 2-level leaf-spine topology. 

The deployment includes two separate physical networks:

  1. A high-speed Ethernet fabric
  2. An IPMI/bare-metal management network (not covered in this document)


This document covers a single K8s controller deployment scenario. For high-availability cluster deployment, please refer to https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md

Network / Fabric Design

This document demonstrates a minimalistic scale of 2 workload racks with 2 servers each (as shown in the diagram below):

By using the same design, the fabric can be scaled to accommodate up to 480 workload servers using up to 30 workload racks with up to 16 servers each. Every workload rack uses a single leaf switch (TOR). The infrastructure rack consists of a highly-available border router (an MLAG pair) which provides a connection to an external gateway/router and to a maximum of 14 infrastructure servers.

The high-speed network consists of two logical segments:

  1. The management network and the primary K8s network (used by Calico) - VLAN10
  2. The secondary K8s network which provides RDMA to the pods - VLAN20

The fabric implements a VXLAN overlay network with a BGP EVPN control plane, which enables the "stretching" of the VLANs across all the racks.

Every leaf switch has a VTEP which takes care of VXLAN encapsulation and decapsulation. The communication between the VTEPs is done by routing through the spines, controlled by a BGP control plane.

The infrastructure rack (as seen on the left in the illustration below) has two leaf switches that act as a highly available border router which provides both highly available connectivity for the infrastructure servers (the deployment server and the K8s master node) and redundant routing into and out of the cluster through a gateway node. This high availability is achieved by an MLAG configuration, the use of LACP bonds, and a redundant router mechanism which uses VRR.

Below is a diagram demonstrating the maximum possible scale for a non-blocking deployment that uses 200GbE to the host (30 racks, 16 servers each using 16 spines and 32 leafs).

Please note that in this setup, the MSN2100 switches in the infrastructure rack should be replaced by MSN2700 switches (having 32 ports instead of 16 ports):

In the case of a maximum scale fabric (as shown above), there will be 16 x 200Gbps links going up from each leaf to the spines and therefore a maximum of 16 x 200Gbps links going to servers in each rack.

Software Stack Components


Please make sure to upgrade all the NVIDIA software components to their latest released version.

Bill of Materials


Please note that older MSN2100 switches with hardware revision 0 (zero) do not support the functionality presented in this document. You can verify that your switch is newer by running the "decode-syseeprom" command and checking the "Device Version" field (must be greater than zero).
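
For example, the following command (run on the switch) filters the relevant field from the decode-syseeprom output:

Switch Console
decode-syseeprom | grep 'Device Version'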

Deployment and Configuration

Node and Switch Definitions

These are the definitions and parameters used for deploying the demonstrated fabric:

Spines

hostname            router id       autonomous system   downlinks
spine1 (MSN3700)    10.0.0.1/32     65100               swp1-6
spine2 (MSN3700)    10.0.0.2/32     65100               swp1-6

Leafs

hostname            router id       autonomous system   uplinks     peers on spines
leaf1a (MSN2100)    10.0.0.101/32   65101               swp13-14    swp1
leaf1b (MSN2100)    10.0.0.102/32   65101               swp13-14    swp2
leaf2 (MSN3700)     10.0.0.103/32   65102               swp29-32    swp3-4
leaf3 (MSN3700)     10.0.0.104/32   65103               swp29-32    swp5-6

Workload Server Ports

rack id   vlan id   access ports   trunk ports
2         10        swp1-4         -
2         20        -              swp1-4
3         10        swp1-4         -
3         20        -              swp1-4

Border Routers (Infrastructure Rack TORs)

hostname   isl ports   clag system mac     clag priority   vxlan anycast ip
leaf1a     swp15-16    44:38:39:FF:FF:AA   1000            10.10.11.1
leaf1b     swp15-16    44:38:39:FF:FF:AA   32768           10.10.11.1

Border VLANs

vlan id   virt mac            virt ip        primary router ip   secondary router ip
10        00:00:00:00:00:10   10.10.0.1/16   10.10.0.2/16        10.10.0.3/16
1         00:00:00:00:00:01   10.1.0.1/24    10.1.0.2/24         10.1.0.3/24

Infrastructure Server Ports

vlan id   port names   bond names
1         swp1         bond1
10        swp2, swp3   bond2, bond3
Hosts

Rack                     Server/Switch type   Server/Switch name   IP and NICs                                      Default Gateway
Rack1 (Infrastructure)   Deployment Node      depserver            bond0 (enp197s0f0, enp197s0f1), 10.10.0.250/16   10.10.0.1
Rack1 (Infrastructure)   Master Node          node1                bond0 (enp197s0f0, enp197s0f1), 10.10.1.1/16     10.10.0.1
Rack2                    Worker Node          node2                enp197s0f0, 10.10.1.2/16                         10.10.0.1
Rack2                    Worker Node          node3                enp197s0f0, 10.10.1.3/16                         10.10.0.1
Rack3                    Worker Node          node4                enp197s0f0, 10.10.1.4/16                         10.10.0.1
Rack3                    Worker Node          node5                enp197s0f0, 10.10.1.5/16                         10.10.0.1

Wiring

This is the wiring principle for the workload racks:

  • Each server in the rack is wired to the rack's leaf ("TOR") switch
  • Every leaf is wired to all the spines


This is the wiring principle for the infrastructure rack:

  • Each server in the rack is wired to both leaf ("TOR") switches
  • Every leaf is wired to all the spines

Fabric Configuration

Updating Cumulus Linux

As a best practice, make sure to use the latest released Cumulus Linux NOS version.

Please see this guide on how to upgrade Cumulus Linux.

Configuring the Cumulus Linux Switch

Make sure your Cumulus Linux switch has passed its initial configuration stages (please see the Quick-Start Guide for version 4.3 for additional information):

  1. License installation
  2. Creation of switch interfaces (e.g. swp1-32)
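
If the switch interfaces have not been created yet, step 2 can be performed with NCLU, for example (a minimal sketch for a 32-port switch; adjust the port range to your platform):

Switch Console
net add interface swp1-32
net commit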


Following is the configuration for the switches:

Please note that you can add the command "net del all" before the following commands in order to clear any previous configuration.

Spine1 Console
net add bgp autonomous-system 65100
net add loopback lo ip address 10.0.0.1/32
net add bgp router-id 10.0.0.1 
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add interface swp1 mtu 9216
net add bgp neighbor swp1 interface peer-group underlay
net add interface swp2 mtu 9216
net add bgp neighbor swp2 interface peer-group underlay
net add interface swp3 mtu 9216
net add bgp neighbor swp3 interface peer-group underlay
net add interface swp4 mtu 9216
net add bgp neighbor swp4 interface peer-group underlay
net add interface swp5 mtu 9216
net add bgp neighbor swp5 interface peer-group underlay
net add interface swp6 mtu 9216
net add bgp neighbor swp6 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  advertise-all-vni
net commit
Spine2 Console
net add bgp autonomous-system 65100
net add loopback lo ip address 10.0.0.2/32
net add bgp router-id 10.0.0.2 
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add interface swp1 mtu 9216
net add bgp neighbor swp1 interface peer-group underlay
net add interface swp2 mtu 9216
net add bgp neighbor swp2 interface peer-group underlay
net add interface swp3 mtu 9216
net add bgp neighbor swp3 interface peer-group underlay
net add interface swp4 mtu 9216
net add bgp neighbor swp4 interface peer-group underlay
net add interface swp5 mtu 9216
net add bgp neighbor swp5 interface peer-group underlay
net add interface swp6 mtu 9216
net add bgp neighbor swp6 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  advertise-all-vni
net commit
Leaf1A Console
net add bgp autonomous-system 65101
net add bgp router-id 10.0.0.101
net add loopback lo ip address 10.0.0.101/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  advertise-all-vni
net add bgp l2vpn evpn  advertise ipv4 unicast
net add bridge bridge ports peerlink
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.101
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.101
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.101
net add bridge bridge ports vni20
net add loopback lo clag vxlan-anycast-ip 10.10.11.1
net add bgp l2vpn evpn advertise-default-gw
net add bond peerlink bond slaves swp15,swp16
net add interface peerlink.4094 clag args --initDelay 10
net add interface peerlink.4094 clag backup-ip 10.0.0.102
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag priority 1000
net add interface peerlink.4094 clag sys-mac 44:38:39:FF:FF:AA
net add bgp neighbor peerlink.4094 interface remote-as internal
net add bgp l2vpn evpn neighbor peerlink.4094 activate
net add vlan 10 ip address 10.10.0.2/16
net add vlan 10 ip address-virtual 00:00:00:00:00:10 10.10.0.1/16
net add vlan 1 ip address 10.1.0.2/24
net add vlan 1 ip address-virtual 00:00:00:00:00:01 10.1.0.1/24
net commit
Leaf1B Console
net add bgp autonomous-system 65101
net add bgp router-id 10.0.0.102
net add loopback lo ip address 10.0.0.102/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  advertise-all-vni
net add bgp l2vpn evpn  advertise ipv4 unicast
net add bridge bridge ports peerlink
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.102
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.102
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.102
net add bridge bridge ports vni20
net add loopback lo clag vxlan-anycast-ip 10.10.11.1
net add bgp l2vpn evpn advertise-default-gw
net add bond peerlink bond slaves swp15,swp16
net add interface peerlink.4094 clag args --initDelay 10
net add interface peerlink.4094 clag backup-ip 10.0.0.101
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag priority 32768
net add interface peerlink.4094 clag sys-mac 44:38:39:FF:FF:AA
net add bgp neighbor peerlink.4094 interface remote-as internal
net add bgp l2vpn evpn neighbor peerlink.4094 activate
net add vlan 10 ip address 10.10.0.3/16
net add vlan 10 ip address-virtual 00:00:00:00:00:10 10.10.0.1/16
net add vlan 1 ip address 10.1.0.3/24
net add vlan 1 ip address-virtual 00:00:00:00:00:01 10.1.0.1/24
net commit
Leaf2 Console
net add bgp autonomous-system 65102
net add bgp router-id 10.0.0.103
net add loopback lo ip address 10.0.0.103/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp29 mtu 9216
net add bgp neighbor swp29 interface peer-group underlay
net add interface swp30 mtu 9216
net add bgp neighbor swp30 interface peer-group underlay
net add interface swp31 mtu 9216
net add bgp neighbor swp31 interface peer-group underlay
net add interface swp32 mtu 9216
net add bgp neighbor swp32 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  advertise-all-vni
net add bgp l2vpn evpn  advertise ipv4 unicast
net add bridge bridge ports swp1,swp2,swp3,swp4
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.103
net add interface swp1,swp2,swp3,swp4 bridge pvid 10
net add interface swp1,swp2,swp3,swp4 mtu 8950
net add interface swp1,swp2,swp3,swp4 bridge vids 20
net add interface swp1,swp2,swp3,swp4 mtu 8950
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.103
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.103
net add bridge bridge ports vni20
net commit
Leaf3 Console
net add bgp autonomous-system 65103
net add bgp router-id 10.0.0.104
net add loopback lo ip address 10.0.0.104/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp29 mtu 9216
net add bgp neighbor swp29 interface peer-group underlay
net add interface swp30 mtu 9216
net add bgp neighbor swp30 interface peer-group underlay
net add interface swp31 mtu 9216
net add bgp neighbor swp31 interface peer-group underlay
net add interface swp32 mtu 9216
net add bgp neighbor swp32 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn  neighbor underlay activate
net add bgp l2vpn evpn  advertise-all-vni
net add bgp l2vpn evpn  advertise ipv4 unicast
net add bridge bridge ports swp1,swp2,swp3,swp4
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.104
net add interface swp1,swp2,swp3,swp4 bridge pvid 10
net add interface swp1,swp2,swp3,swp4 mtu 8950
net add interface swp1,swp2,swp3,swp4 bridge vids 20
net add interface swp1,swp2,swp3,swp4 mtu 8950
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.104
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.104
net add bridge bridge ports vni20
net commit
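
After committing the configuration on all switches, the fabric state can be sanity-checked with standard NCLU show commands, for example (run on any switch; the MLAG status is relevant on the border pair only):

Switch Console
net show interface
net show bgp summary
net show evpn vni
net show clag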

Connecting the Infrastructure Servers

Infrastructure servers (deployment and K8s master servers) are placed in the infrastructure rack.

This will require the following additional configuration steps:

  1. Adding the ports connected to the servers to an MLAG bond
  2. Placing the bond in the relevant VLAN

In our case, the servers are connected to ports swp2 and swp3 on both leafs (Leaf1A and Leaf1B) and use VLAN10, which we created on the border leafs. The commands to run on both Leaf1A and Leaf1B are:

Leaf1A and Leaf1B Console
net add interface swp2 mtu 8950
net add bond bond2 bond slaves swp2
net add bond bond2 mtu 8950
net add bond bond2 clag id 2
net add bond bond2 bridge access 10
net add bond bond2 bond lacp-bypass-allow
net add bond bond2 stp bpduguard
net add bond bond2 stp portadminedge
net add interface swp3 mtu 8950
net add bond bond3 bond slaves swp3
net add bond bond3 mtu 8950
net add bond bond3 clag id 3
net add bond bond3 bridge access 10
net add bond bond3 bond lacp-bypass-allow
net add bond bond3 stp bpduguard
net add bond bond3 stp portadminedge
net commit

Connecting an External Gateway to the Infrastructure Rack

In our setup, we will connect an external gateway machine (10.1.0.254/24) over an LACP bond to swp1 of both border leafs (via VLAN1).
This gateway will be used to access any external network (e.g. the Internet). The configuration commands on both border leafs are as follows:

Leaf1A and Leaf1B Console
net add interface swp1 mtu 8950
net add bond bond1 bond slaves swp1
net add bond bond1 mtu 8950
net add bond bond1 clag id 1
net add bond bond1 bridge access 1
net add bond bond1 bond lacp-bypass-allow
net add bond bond1 stp bpduguard
net add bond bond1 stp portadminedge
net add routing route 0.0.0.0/0 10.1.0.254 
net commit

Please note that the gateway machine should be statically configured with a route back to our primary network (10.10.0.0/16) via its interface on VLAN1.
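
For example, on a Linux-based gateway machine (an assumption; adjust to your gateway platform), such a static route could be added with:

Gateway Console
# ip route add 10.10.0.0/16 via 10.1.0.1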

Host Configuration

Make sure that the BIOS settings on the worker node servers have SR-IOV enabled and that the servers are tuned for maximum performance.

All Worker nodes must have the same PCIe placement for the NIC, and expose the same interface name.


Our hosts run Ubuntu Linux; the configuration is as follows:

Installing and Updating the OS

Make sure the Ubuntu Server 20.04 operating system is installed on all servers with the OpenSSH server package, and create a non-root user account with passwordless sudo privileges.

Also make sure to assign the correct network configuration to the hosts (IP addresses, default gateway, DNS server, NTP server) and to create bonds on the nodes in the infrastructure rack (master node and deployment node).
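
For reference, below is a minimal netplan sketch of such a bond for the master node (node1), assuming the interface names, addressing and netplan file name used elsewhere in this document; adjust the values to your environment:

Master Node Console
# vi /etc/netplan/00-installer-config.yaml

network:
  version: 2
  ethernets:
    enp197s0f0: {}
    enp197s0f1: {}
  bonds:
    bond0:
      interfaces: [enp197s0f0, enp197s0f1]
      parameters:
        mode: 802.3ad
      addresses:
        - 10.10.1.1/16
      gateway4: 10.10.0.1
      mtu: 8950

# netplan apply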

Update the Ubuntu software packages on all nodes.
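
A typical sequence, assuming the default Ubuntu repositories are reachable from the nodes, is:

Server Console
$ sudo apt update
$ sudo apt upgrade -y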

Non-root User Account Prerequisites 

In this solution, we added the following line to the end of the /etc/sudoers file:

Server Console
$ sudo vi /etc/sudoers

#includedir /etc/sudoers.d

#K8s cluster deployment user with sudo privileges without password

user ALL=(ALL) NOPASSWD:ALL

SR-IOV Activation and Virtual Functions Configuration

Use the following commands to install the mstflint tool and verify that SRIOV is enabled and that there are enough virtual functions on the NIC:

Worker Node Console
# apt install mstflint

# lspci | grep Mellanox
c5:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
c5:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]

# mstconfig -d c5:00.0 q | grep SRIOV_EN
         SRIOV_EN                            True(1)         
# mstconfig -d c5:00.0 q | grep NUM_OF_VFS
         NUM_OF_VFS                          8

In case SR-IOV is not enabled or the number of VFs is insufficient, configure them using the following commands (and then reboot the machine):

Worker Node Console
# mstconfig -d c5:00.0 -y set SRIOV_EN=True NUM_OF_VFS=8

# reboot

The above operation enables SR-IOV in the NIC firmware and defines the maximum number of supported VFs. The actual activation of the virtual functions is performed below.

Installing rdma-core and Setting RDMA to "Exclusive Mode"

Install the rdma-core package:

Worker Node Console
# apt install rdma-core -y

Set netns to exclusive mode for providing namespace isolation on the high-speed interface. This way, each pod can only see and access its own virtual functions.

Create the following file:

Worker Node Console
# vi /etc/modprobe.d/ib_core.conf

# Set netns to exclusive mode for namespace isolation
options ib_core netns_mode=0

Then run the commands below:

Worker Node Console
# update-initramfs -u
# reboot

After the node comes back, check netns mode:

Worker Node Console
# rdma system

netns exclusive

Setting MTU on the Physical Port

We need to set the MTU on the physical port of the server to allow for optimized throughput.

Since the fabric uses a VXLAN overlay, we will use the maximum MTU of 9216 on the core links and an MTU of 8950 on the edge (server) links. This leaves enough headroom for the roughly 50 bytes of VXLAN encapsulation headers, so that encapsulated packets will not be fragmented.

In order to configure the MTU on the server ports, please edit the netplan config file (in this example on node2):

Worker Node Console
# vi /etc/netplan/00-installer-config.yaml

network:
  ethernets:
    enp197s0f0:
      addresses:
       - 10.10.1.2/16
      gateway4: 10.10.0.1
      mtu: 8950
  version: 2

Please note that you can use the "rdma link" command to identify the name assigned to the high-speed interface, for example:

# rdma link

link rocep197s0f0/1 state ACTIVE physical_state LINK_UP netdev enp197s0f0


Then apply it:

Worker Node Console
# netplan apply

Virtual Function Activation

Now we will activate 8 virtual functions using the following commands:

Worker Node Console
# PF_NAME=enp197s0f0
# echo 8 > /sys/class/net/${PF_NAME}/device/sriov_numvfs
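
You can verify that the virtual functions were created, for example, by listing the Mellanox PCI devices again (the VFs appear as additional "Virtual Function" entries) or by inspecting the physical function with ip link:

Worker Node Console
# lspci | grep Mellanox
# ip link show enp197s0f0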

Please note that the above configuration is not persistent!
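
One possible way to make the VF creation persistent across reboots (a minimal sketch, assuming systemd; the unit file name is an example) is a oneshot service that re-applies the setting at boot:

Worker Node Console
# vi /etc/systemd/system/sriov-vfs.service

[Unit]
Description=Create SR-IOV virtual functions on enp197s0f0
After=network.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 8 > /sys/class/net/enp197s0f0/device/sriov_numvfs'

[Install]
WantedBy=multi-user.target

# systemctl daemon-reload
# systemctl enable sriov-vfs.service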

NIC Firmware Upgrade

It is recommended that you upgrade the NIC firmware on the worker nodes to the latest released version.

Please make sure to use the root account:

Worker Node Console
$ sudo su -

Please make sure to download the "mlxup" program to each Worker Node and install the latest firmware for the NIC (requires Internet connectivity; please check the official download page):

Worker Node Console
# wget http://www.mellanox.com/downloads/firmware/mlxup/4.15.2/SFX/linux_x64/mlxup
# chmod 777 mlxup
# ./mlxup -u --online

K8s Cluster Deployment and Configuration

The K8s cluster in this solution will be installed using Kubespray with a non-root user account from the Deployment Node.

SSH Private Key and SSH Passwordless Login

Login to the Deployment Node as a deployment user (in this case - user) and create an SSH private key for configuring the password-less authentication on your computer by running the following commands:

Deployment Node Console
$ ssh-keygen

Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Created directory '/home/user/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:PaZkvxV4K/h8q32zPWdZhG1VS0DSisAlehXVuiseLgA user@depl-node
The key's randomart image is:
+---[RSA 2048]----+
|      ...+oo+o..o|
|      .oo   .o. o|
|     . .. . o  +.|
|   E  .  o +  . +|
|    .   S = +  o |
|     . o = + o  .|
|      . o.o +   o|
|       ..+.*. o+o|
|        oo*ooo.++|
+----[SHA256]-----+

Copy your SSH private key, such as ~/.ssh/id_rsa, to all nodes in your deployment by running the following command (example):

Deployment Node Console
$ ssh-copy-id -i ~/.ssh/id_rsa user@10.10.1.1

/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/user/.ssh/id_rsa.pub"
The authenticity of host '10.10.1.1 (10.10.1.1)' can't be established.
ECDSA key fingerprint is SHA256:uyglY5g0CgPNGDm+XKuSkFAbx0RLaPijpktANgXRlD8.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
user@10.10.1.1's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'user@10.10.1.1'"
and check to make sure that only the key(s) you wanted were added.

Verify that you have password-less SSH connectivity to all nodes in your deployment by running the following command (example):

Deployment Node Console
$ ssh user@10.10.1.1

Kubespray Deployment and Configuration

To install the dependencies for running Kubespray with Ansible on the Deployment Node, run the following commands:

Deployment Node Console
$ cd ~
$ sudo apt -y install python3-pip jq
$ git clone https://github.com/kubernetes-sigs/kubespray.git
$ cd kubespray
$ sudo pip3 install -r requirements.txt

Create a new cluster configuration. The default folder for subsequent commands is ~/kubespray.

Replace the IP addresses below with your nodes' IP addresses:

Deployment Node Console
$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(10.10.1.1 10.10.1.2 10.10.1.3 10.10.1.4 10.10.1.5)
$ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

As a result, the inventory/mycluster/hosts.yaml file will be created.
Review and change the host configuration in the file. Below is an example for this deployment:

inventory/mycluster/hosts.yaml
$ sudo vi inventory/mycluster/hosts.yaml

all:
  hosts:
    node1:
      ansible_host: 10.10.1.1
      ip: 10.10.1.1
      access_ip: 10.10.1.1
    node2:
      ansible_host: 10.10.1.2
      ip: 10.10.1.2
      access_ip: 10.10.1.2
    node3:
      ansible_host: 10.10.1.3
      ip: 10.10.1.3
      access_ip: 10.10.1.3
    node4:
      ansible_host: 10.10.1.4
      ip: 10.10.1.4
      access_ip: 10.10.1.4
    node5:
      ansible_host: 10.10.1.5
      ip: 10.10.1.5
      access_ip: 10.10.1.5
  children:
    kube_control_plane:
      hosts:
        node1:
    kube_node:
      hosts:
        node2:
        node3:
        node4:
        node5:
    etcd:
      hosts:
        node1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Review and change the cluster installation parameters in the files inventory/mycluster/group_vars/all/all.yml and inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

In inventory/mycluster/group_vars/all/all.yml, uncomment the following line so that metrics can receive data about the use of cluster resources:

Deployment Node Console
$ sudo vi inventory/mycluster/group_vars/all/all.yml

## The read-only port for the Kubelet to serve on with no authentication/authorization. Uncomment to enable.
kube_read_only_port: 10255

In inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml, set kube_version to v1.21.0, set container_manager to containerd, and enable multi-networking by setting kube_network_plugin_multus to true.

Deployment Node Console
$ sudo vi inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

…
## Change this to use another Kubernetes version, e.g. a current beta release
kube_version: v1.21.0
…
## Container runtime
## docker for docker, crio for cri-o and containerd for containerd.
container_manager: containerd
…
# Setting multi_networking to true will install Multus: https://github.com/intel/multus-cni
kube_network_plugin_multus: true
…


In inventory/mycluster/group_vars/etcd.yml set the etcd_deployment_type to host:

Deployment Node Console
$ sudo vi inventory/mycluster/group_vars/etcd.yml

...

## Settings for etcd deployment type
etcd_deployment_type: host

Deploying the cluster using Kubespray Ansible Playbook

Run the following line to start the deployment process:

Deployment Node Console
$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

This deployment takes a while to complete; please make sure no errors are encountered.

A successful result should look something like the following:

Deployment Node Console
PLAY RECAP ***********************************************************************************************************************************************************************************
localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
node1                      : ok=584  changed=133  unreachable=0    failed=0    skipped=1151 rescued=0    ignored=2   
node2                      : ok=387  changed=86   unreachable=0    failed=0    skipped=634  rescued=0    ignored=1   
node3                      : ok=387  changed=86   unreachable=0    failed=0    skipped=633  rescued=0    ignored=1   
node4                      : ok=387  changed=86   unreachable=0    failed=0    skipped=633  rescued=0    ignored=1   
node5                      : ok=387  changed=86   unreachable=0    failed=0    skipped=633  rescued=0    ignored=1   

Thursday 20 May 2021  07:59:23 +0000 (0:00:00.071)       0:11:57.632 ********** 
=============================================================================== 
kubernetes/control-plane : kubeadm | Initialize first master ------------------------------------------------------------------------------------------------------------------------- 77.14s
kubernetes/kubeadm : Join to cluster ------------------------------------------------------------------------------------------------------------------------------------------------- 36.82s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 32.52s
download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 25.75s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.73s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.15s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.00s
download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 20.24s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 16.27s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 15.36s
container-engine/containerd : ensure containerd packages are installed --------------------------------------------------------------------------------------------------------------- 13.29s
download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 12.29s
kubernetes/preinstall : Install packages requirements -------------------------------------------------------------------------------------------------------------------------------- 12.15s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 11.40s
download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 11.05s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 10.19s
kubernetes/control-plane : Master | wait for kube-scheduler -------------------------------------------------------------------------------------------------------------------------- 10.02s
download_container | Download image if required --------------------------------------------------------------------------------------------------------------------------------------- 9.36s
download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------- 9.15s
reload etcd --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 8.65s


Now that the K8s cluster is deployed, connect to the K8s Master Node for the following sections.

Please make sure to use the root account:

Master Node Console
$ sudo su -

K8s Deployment Verification

Below is example output showing the deployment information for a K8s cluster installed with the default Kubespray configuration and the Calico CNI plugin.

To ensure that the K8s cluster is installed correctly, run the following commands:

Master Node Console
# kubectl get nodes -o wide
NAME    STATUS   ROLES                  AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready    control-plane,master   6h40m   v1.21.0   10.10.1.1     <none>        Ubuntu 20.04.2 LTS   5.4.0-73-generic   containerd://1.4.4
node2   Ready    <none>                 6h39m   v1.21.0   10.10.1.2     <none>        Ubuntu 20.04.2 LTS   5.4.0-73-generic   containerd://1.4.4
node3   Ready    <none>                 6h39m   v1.21.0   10.10.1.3     <none>        Ubuntu 20.04.2 LTS   5.4.0-73-generic   containerd://1.4.4
node4   Ready    <none>                 6h39m   v1.21.0   10.10.1.4     <none>        Ubuntu 20.04.2 LTS   5.4.0-73-generic   containerd://1.4.4
node5   Ready    <none>                 6h39m   v1.21.0   10.10.1.5     <none>        Ubuntu 20.04.2 LTS   5.4.0-73-generic   containerd://1.4.4
      
# kubectl get pod -n kube-system -o wide
NAME                                       READY   STATUS    RESTARTS   AGE     IP             NODE    NOMINATED NODE   READINESS GATES
calico-kube-controllers-7797d7b677-4kndh   1/1     Running   0          6h39m   10.10.1.3      node3   <none>           <none>
calico-node-6xqxn                          1/1     Running   1          6h40m   10.10.1.5      node5   <none>           <none>
calico-node-7st5x                          1/1     Running   0          6h40m   10.10.1.2      node2   <none>           <none>
calico-node-8qdpx                          1/1     Running   0          6h40m   10.10.1.1      node1   <none>           <none>
calico-node-qjflr                          1/1     Running   2          6h40m   10.10.1.4      node4   <none>           <none>
calico-node-x68rz                          1/1     Running   0          6h40m   10.10.1.3      node3   <none>           <none>
coredns-7fcf4fd7c7-7p6k5                   1/1     Running   0          6h7m    10.233.92.1    node3   <none>           <none>
coredns-7fcf4fd7c7-mwfd6                   1/1     Running   0          6h39m   10.233.90.1    node1   <none>           <none>
dns-autoscaler-7df78bfcfb-xl48v            1/1     Running   0          6h39m   10.233.90.2    node1   <none>           <none>
kube-apiserver-node1                       1/1     Running   0          6h41m   10.10.1.1      node1   <none>           <none>
kube-controller-manager-node1              1/1     Running   0          6h41m   10.10.1.1      node1   <none>           <none>
kube-multus-ds-amd64-8dmpv                 1/1     Running   0          6h39m   10.10.1.3      node3   <none>           <none>
kube-multus-ds-amd64-b74t4                 1/1     Running   1          6h39m   10.10.1.5      node5   <none>           <none>
kube-multus-ds-amd64-nvrl9                 1/1     Running   2          6h39m   10.10.1.4      node4   <none>           <none>
kube-multus-ds-amd64-s9lr4                 1/1     Running   0          6h39m   10.10.1.2      node2   <none>           <none>
kube-multus-ds-amd64-zrxcs                 1/1     Running   0          6h39m   10.10.1.1      node1   <none>           <none>
kube-proxy-bq9xg                           1/1     Running   2          6h40m   10.10.1.4      node4   <none>           <none>
kube-proxy-bs8br                           1/1     Running   0          6h40m   10.10.1.3      node3   <none>           <none>
kube-proxy-fxs88                           1/1     Running   0          6h40m   10.10.1.1      node1   <none>           <none>
kube-proxy-rts6t                           1/1     Running   1          6h40m   10.10.1.5      node5   <none>           <none>
kube-proxy-vml29                           1/1     Running   0          6h40m   10.10.1.2      node2   <none>           <none>
kube-scheduler-node1                       1/1     Running   0          6h41m   10.10.1.1      node1   <none>           <none>
nginx-proxy-node2                          1/1     Running   0          6h40m   10.10.1.2      node2   <none>           <none>
nginx-proxy-node3                          1/1     Running   0          6h40m   10.10.1.3      node3   <none>           <none>
nginx-proxy-node4                          1/1     Running   2          6h40m   10.10.1.4      node4   <none>           <none>
nginx-proxy-node5                          1/1     Running   1          6h40m   10.10.1.5      node5   <none>           <none>
nodelocaldns-kdsg5                         1/1     Running   2          6h39m   10.10.1.4      node4   <none>           <none>
nodelocaldns-mhh9g                         1/1     Running   0          6h39m   10.10.1.2      node2   <none>           <none>
nodelocaldns-nbhnr                         1/1     Running   0          6h39m   10.10.1.3      node3   <none>           <none>
nodelocaldns-nkj9h                         1/1     Running   0          6h39m   10.10.1.1      node1   <none>           <none>
nodelocaldns-rfnqk                         1/1     Running   1          6h39m   10.10.1.5      node5   <none>           <none>

Installing the Whereabouts CNI

The Whereabouts CNI is an IPAM plugin that assigns cluster-wide unique IP addresses on the secondary network. You can install it with a daemon set, using the following commands:

Master Node Console
# kubectl apply -f https://raw.githubusercontent.com/dougbtv/whereabouts/master/doc/daemonset-install.yaml
# kubectl apply -f https://raw.githubusercontent.com/dougbtv/whereabouts/master/doc/whereabouts.cni.cncf.io_ippools.yaml

 To ensure the plugin is installed correctly, run the following command:

Master Node Console
# kubectl get pods -A | grep whereabouts
kube-system   whereabouts-74nwr                          1/1     Running   0          6h4m
kube-system   whereabouts-7pq2l                          1/1     Running   0          6h4m
kube-system   whereabouts-gbpht                          1/1     Running   0          6h4m
kube-system   whereabouts-slbnj                          1/1     Running   0          6h4m
kube-system   whereabouts-tw7dc                          1/1     Running   0          6h4m

Deploying the SRIOV Device Plugin and CNI

Prepare the following files and apply them:

Master Node Console
# vi configMap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
       "resourceList": [
            {
                "resourceName": "sriov_rdma",
                "resourcePrefix": "nvidia.com",
                "selectors": {
                    "vendors": ["15b3"],
                    "pfNames": ["enp197s0f0"],
                    "isRdma": true
                }
            }
       ]
    }

sriovdp-daemonset.yaml
# vi sriovdp-daemonset.yaml

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sriov-device-plugin
  namespace: kube-system

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-sriov-device-plugin-amd64
  namespace: kube-system
  labels:
    tier: node
    app: sriovdp
spec:
  selector:
    matchLabels:
      name: sriov-device-plugin
  template:
    metadata:
      labels:
        name: sriov-device-plugin
        tier: node
        app: sriovdp
    spec:
      hostNetwork: true
      nodeSelector:
        beta.kubernetes.io/arch: amd64
      serviceAccountName: sriov-device-plugin
      containers:
      - name: kube-sriovdp
        image: docker.io/nfvpe/sriov-device-plugin:v3.3
        imagePullPolicy: IfNotPresent
        args:
        - --log-dir=sriovdp
        - --log-level=10
        securityContext:
          privileged: true
        resources:
          requests:
            cpu: "250m"
            memory: "40Mi"
          limits:
            cpu: 1
            memory: "200Mi"
        volumeMounts:
        - name: devicesock
          mountPath: /var/lib/kubelet/
          readOnly: false
        - name: log
          mountPath: /var/log
        - name: config-volume
          mountPath: /etc/pcidp
        - name: device-info
          mountPath: /var/run/k8s.cni.cncf.io/devinfo/dp
      volumes:
        - name: devicesock
          hostPath:
            path: /var/lib/kubelet/
        - name: log
          hostPath:
            path: /var/log
        - name: device-info
          hostPath:
            path: /var/run/k8s.cni.cncf.io/devinfo/dp
            type: DirectoryOrCreate
        - name: config-volume
          configMap:
            name: sriovdp-config
            items:
            - key: config.json
              path: config.json

sriov-cni-daemonset.yaml
# vi sriov-cni-daemonset.yaml

---                                                                                                                                                                                           
apiVersion: apps/v1                                                                                                                                                                           
kind: DaemonSet                                                                                                                                                                               
metadata:                                                                                                                                                                                     
  name: kube-sriov-cni-ds-amd64                                                                                                                                                               
  namespace: kube-system                                                                                                                                                                      
  labels:                                                                                                                                                                                     
    tier: node                                                                                                                                                                                
    app: sriov-cni                                                                                                                                                                            
spec:                                                                                                                                                                                         
  selector:                                                                                                                                                                                   
    matchLabels:                                                                                                                                                                              
      name: sriov-cni                                                                                                                                                                         
  template:                                                                                                                                                                                   
    metadata:                                                                                                                                                                                 
      labels:
        name: sriov-cni
        tier: node
        app: sriov-cni
    spec:
      nodeSelector:
        beta.kubernetes.io/arch: amd64
      containers:
      - name: kube-sriov-cni
        image: nfvpe/sriov-cni:v2.3
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - ALL
        resources:
          requests:
            cpu: "100m"
            memory: "50Mi"
          limits:
            cpu: "100m"
            memory: "50Mi"
        volumeMounts:
        - name: cnibin
          mountPath: /host/opt/cni/bin
      volumes:
        - name: cnibin
          hostPath:
            path: /opt/cni/bin

Master Node Console
# kubectl apply -f configMap.yaml
# kubectl apply -f sriovdp-daemonset.yaml
# kubectl apply -f sriov-cni-daemonset.yaml 
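
To confirm that the device plugin has registered the virtual functions, check that the nvidia.com/sriov_rdma resource is reported on the worker nodes (the allocatable count should match the number of VFs created earlier, 8 in this setup), for example:

Master Node Console
# kubectl describe node node2 | grep nvidia.com/sriov_rdma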

Deploying the RDMA CNI

The RDMA CNI enables namespace isolation for the virtual functions.

Deploy the RDMA CNI using the following YAML file:

rdma-cni-daemonset.yaml
# vi rdma-cni-daemonset.yaml

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-rdma-cni-ds
  namespace: kube-system
  labels:
    tier: node
    app: rdma-cni
    name: rdma-cni
spec:
  selector:
    matchLabels:
      name: rdma-cni
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        tier: node
        app: rdma-cni
        name: rdma-cni
    spec:
      hostNetwork: true
      containers:
        - name: rdma-cni
          image: mellanox/rdma-cni
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          resources:
            requests:
              cpu: "100m"
              memory: "50Mi"
            limits:
              cpu: "100m"
              memory: "50Mi"
          volumeMounts:
            - name: cnibin
              mountPath: /host/opt/cni/bin
      volumes:
        - name: cnibin
          hostPath:
            path: /opt/cni/bin

Master Node Console
# kubectl apply -f rdma-cni-daemonset.yaml
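
To verify that the RDMA CNI daemon set is running on all nodes, you can run, for example:

Master Node Console
# kubectl get pods -n kube-system -o wide | grep rdma-cni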

Applying Network Attachment Definitions

Apply the following YAML file to configure the network attachment for the pods:

netattdef.yaml
# vi netattdef.yaml

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/sriov_rdma
  name: sriov20
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "sriov-rdma",
      "plugins": [
         {
           "type": "sriov",
           "vlan": 20,
           "spoofchk": "off",
           "vlanQoS": 0,
           "ipam": {
              "datastore": "kubernetes",
              "kubernetes": { "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"},
              "log_file": "/tmp/whereabouts.log",
              "log_level": "debug",
              "type": "whereabouts",
              "range": "192.168.20.0/24"
           }
         },
         {
           "type": "rdma"
         },
         {
           "mtu": 8950,
           "type": "tuning"
         }
      ]
    }

Master Node Console
# kubectl apply -f netattdef.yaml  
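
To verify that the network attachment definition was created, for example:

Master Node Console
# kubectl get network-attachment-definitions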

Creating a Test Deployment

Create a test daemon set using the following YAML. It will create a pod on every worker node that we can use to test RDMA connectivity and performance over the high-speed network.

Please note that it adds an annotation referencing the required network ("sriov20") and requests the SR-IOV virtual function resource ("nvidia.com/sriov_rdma").

The container image specified below should include the NVIDIA user-space drivers and the perftest package.

simple-daemon.yaml
# vi simple-daemon.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-daemon
  labels:
    app: example-dae
spec:
  selector:
    matchLabels:
      app: example-dae
  template:
    metadata:
      labels:
        app: example-dae
      annotations:
        k8s.v1.cni.cncf.io/networks: sriov20
    spec:
      containers:
      - image: < container image >
        name: example-dae-pod
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            memory: 16Gi
            cpu: 8           
            nvidia.com/sriov_rdma: '1'
          requests:
            memory: 16Gi
            cpu: 8
            nvidia.com/sriov_rdma: '1'
        command:
        - sleep
        - inf

Apply the resource:

Master Node Console
# kubectl apply -f simple-daemon.yaml

Validate that the daemon set is running successfully; you should see four pods running, one on each worker node:

Master Node Console
# kubectl get pod -o wide
NAME                           READY   STATUS    RESTARTS   AGE     IP             NODE    NOMINATED NODE   READINESS GATES
example-daemon-2p7t2           1/1     Running   0          5h21m   10.233.92.3    node3   <none>           <none>
example-daemon-g8mcx           1/1     Running   0          5h21m   10.233.96.84   node2   <none>           <none>
example-daemon-kf56h           1/1     Running   0          5h21m   10.233.105.4   node4   <none>           <none>
example-daemon-zdmz8           1/1     Running   0          5h21m   10.233.70.5    node5   <none>           <none>

Please refer to the appendix for running an RDMA performance test between two pods in your test deployment.

Appendix

Performance Testing

Now that we have our test daemonset running, we can run a performance test to check the RDMA performance between two pods running on different worker nodes:

In one console window, connect to the master node and switch to the root account:

Master Node Console
$ sudo su -

Connect to one of the pods in the daemon set (example):

Master Node Console
# kubectl exec -it example-daemon-2p7t2 -- bash

From within the container, check its IP address on the high-speed network interface (net1):

First pod console
# ip address show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default 
    link/ether 0e:e8:a8:d6:f7:3c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.233.92.3/32 brd 10.233.92.3 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::ce8:a8ff:fed6:f73c/64 scope link 
       valid_lft forever preferred_lft forever
26: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq state UP group default qlen 1000
    link/ether ea:fe:9f:4a:28:8e brd ff:ff:ff:ff:ff:ff
    inet 192.168.20.88/24 brd 192.168.20.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::e8fe:9fff:fe4a:288e/64 scope link 

Then, start the ib_write_bw server side:

First pod console
# ib_write_bw -a --report_gbits
************************************
* Waiting for client to connect... *
************************************

In another console window, connect to the master node again and attach to a second pod in the deployment (example):

Master Node Console
$ sudo su -
# kubectl exec -it example-daemon-zdmz8 -- bash

From within the container, start the ib_write_bw client, using the IP address of the net1 interface in the server pod (192.168.20.88 in this example).

Verify that the maximum bandwidth between the containers exceeds 190 Gb/s.

Second pod console
# ib_write_bw -a -F --report_gbits 192.168.20.88
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : rocep197s0f0v0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0122 PSN 0x3fdd80 RKey 0x02031e VAddr 0x007fb2a4731000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:91
 remote address: LID 0000 QPN 0x0164 PSN 0xa38679 RKey 0x03031f VAddr 0x007fe0387d1000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:88
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          5000           0.041157            0.040923            2.557717
 4          5000           0.089667            0.089600            2.799999
 8          5000             0.18               0.18               2.795828
 16         5000             0.36               0.36               2.799164
 32         5000             0.72               0.72               2.801682
 64         5000             1.08               1.07               2.089307
 128        5000             2.15               2.08               2.031467
 256        5000             4.30               4.30               2.097492
 512        5000             8.56               8.56               2.089221
 1024       5000             17.09              17.02              2.077250
 2048       5000             33.89              33.83              2.065115
 4096       5000             85.32              66.30              2.023458
 8192       5000             163.84             136.83             2.087786
 16384      5000             184.12             167.11             1.274956
 32768      5000             190.44             180.83             0.689819
 65536      5000             190.26             182.66             0.348395
 131072     5000             193.71             179.10             0.170803
 262144     5000             192.64             191.31             0.091222
 524288     5000             192.62             191.29             0.045608
 1048576    5000             192.82             192.75             0.022977
 2097152    5000             192.38             192.22             0.011457
 4194304    5000             192.80             192.78             0.005745
 8388608    5000             192.67             192.65             0.002871
---------------------------------------------------------------------------------------
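
The perftest package also includes ib_write_lat, which can be run in the same client/server fashion to measure RDMA write latency between the same two pods (an optional, illustrative check using the same server address):

First pod console
# ib_write_lat

Second pod console
# ib_write_lat -F 192.168.20.88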

Optimizing Worker Nodes for Performance

To accommodate performance-sensitive applications, we can optimize the worker nodes by scheduling pod CPUs on cores that belong to the same NUMA node as the NIC:

On the worker node, switch to the root account:

Worker Node Console
$ sudo su -

Check to which NUMA node the NIC is wired:

Worker Node Console
# cat /sys/class/net/enp197s0f0/device/numa_node
1

In this example, the NIC is wired to NUMA node 1.

Check the CPU NUMA topology to find which cores belong to NUMA node 1:

Worker Node Console
# lscpu | grep NUMA
NUMA node(s):                    2
NUMA node0 CPU(s):               0-23
NUMA node1 CPU(s):               24-47

In this example, cores 24-47 belong to NUMA node 1.

Now we need to configure the kubelet on the worker node:

  • The "cpuManagerPolicy" attribute specifies the selected CPU manager policy (either "none" or "static").
  • The "reservedSystemCPUs" attribute lists the CPU cores that will not be used by K8s (they remain reserved for the Linux system).
  • The "topologyManagerPolicy" attribute specifies the selected topology manager policy ("none", "best-effort", "restricted", or "single-numa-node").

We will reserve a few cores for the system and make sure they belong to NUMA node 0 (in our case):

Worker Node Console
# vi /etc/kubernetes/kubelet-config.yaml
...
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 10s
reservedSystemCPUs: "0,1,2,3"
topologyManagerPolicy: single-numa-node
featureGates:
  CPUManager: true
  TopologyManager: true
...

When changing reservedSystemCPUs or cpuManagerPolicy, the /var/lib/kubelet/cpu_manager_state file must be deleted and the kubelet service restarted:

Worker Node Console
# rm /var/lib/kubelet/cpu_manager_state
# service kubelet restart
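
After the restart, the kubelet recreates the CPU manager state file; its policyName field should now read "static". Note that only pods in the Guaranteed QoS class with integer CPU requests (such as the test daemon set above, where requests equal limits) are granted exclusive cores:

Worker Node Console
# cat /var/lib/kubelet/cpu_manager_state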

Validating the Fabric

To validate the fabric, we need to assign IP addresses to the servers. Each stretched VLAN acts as a local subnet for all the servers connected to it, so all servers connected to the same VLAN must have IP addresses in the same subnet.

Then we can verify that the servers can ping each other.
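
A minimal sketch of such a check, assuming two servers attached to the same stretched VLAN, the enp197s0f0 interface used in the NUMA example above, and illustrative addresses from the 10.10.0.0/16 subnet (adjust the interface name and addresses to your own plan):

Worker Node Console
# ip address add 10.10.1.2/16 dev enp197s0f0
# ip link set enp197s0f0 up
# ping -c 3 10.10.1.4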

We can also validate the following on the switches:

1) That the IP addresses of the VTEPs were successfully propagated by BGP to all the leaf switches.

Please repeat the following command on the leafs: 

Leaf Switch Console
cumulus@leaf1a:mgmt:~$ net show route
show ip route
=============
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure
S>* 0.0.0.0/0 [1/0] via 10.1.0.254, vlan1, weight 1, 00:01:09
B>* 10.0.0.1/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:30
B>* 10.0.0.2/32 [20/0] via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29
C>* 10.0.0.101/32 is directly connected, lo, 5d16h51m
B>* 10.0.0.102/32 [200/0] via fe80::1e34:daff:feb4:620, peerlink.4094, weight 1, 00:01:18
B>* 10.0.0.103/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:29
  *                      via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29
B>* 10.0.0.104/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:29
  *                      via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29
C>* 10.0.1.1/32 is directly connected, lo, 00:01:44
C * 10.1.0.0/24 [0/1024] is directly connected, vlan1-v0, 00:01:43
C>* 10.1.0.0/24 is directly connected, vlan1, 00:01:43
C * 10.10.0.0/16 [0/1024] is directly connected, vlan10-v0, 00:01:43
C>* 10.10.0.0/16 is directly connected, vlan10, 00:01:43

show ipv6 route
===============
Codes: K - kernel route, C - connected, S - static, R - RIPng,
       O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
       v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure
C * fe80::/64 is directly connected, peerlink.4094, 00:01:20
C * fe80::/64 is directly connected, swp14, 00:01:30
C * fe80::/64 is directly connected, swp13, 00:01:31
C * fe80::/64 is directly connected, vlan10-v0, 00:01:43
C * fe80::/64 is directly connected, vlan1-v0, 00:01:43
C * fe80::/64 is directly connected, vlan20, 00:01:43
C * fe80::/64 is directly connected, vlan10, 00:01:43
C * fe80::/64 is directly connected, vlan1, 00:01:43
C>* fe80::/64 is directly connected, bridge, 00:01:43

2) That the ARP entries were successfully propagated by EVPN (best observed on the spine).

Please repeat the following command on the spines:

Spine Switch Console
cumulus@spine1:mgmt:~$ net show bgp evpn route type macip 
BGP table version is 917, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]

   Network          Next Hop            Metric LocPrf Weight Path
                    Extended Community
Route Distinguisher: 10.0.0.101:2
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]
                    10.0.1.1                               0 65101 i
                    RT:65101:20 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[128]:[fe80::1e34:daff:feb4:920]
                    10.0.1.1                               0 65101 i
                    RT:65101:20 ET:8 Default Gateway ND:Router Flag
Route Distinguisher: 10.0.0.101:3
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[32]:[10.10.0.1]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[128]:[fe80::200:ff:fe00:10]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]:[32]:[10.10.0.250]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[32]:[10.10.0.2]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[128]:[fe80::1e34:daff:feb4:920]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]:[32]:[10.10.1.1]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8
Route Distinguisher: 10.0.0.102:2
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[32]:[10.10.0.1]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[128]:[fe80::200:ff:fe00:10]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]:[32]:[10.10.0.250]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[32]:[10.10.0.3]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[128]:[fe80::1e34:daff:feb4:620]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]:[32]:[10.10.1.1]
                    10.0.1.1                               0 65101 i
                    RT:65101:10 ET:8
Route Distinguisher: 10.0.0.102:3
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[128]:[fe80::1e34:daff:feb4:620]
                    10.0.1.1                               0 65101 i
                    RT:65101:20 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]
                    10.0.1.1                               0 65101 i
                    RT:65101:20 ET:8 MM:0, sticky MAC
Route Distinguisher: 10.0.0.103:2
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]
                    10.0.0.103                             0 65102 i
                    RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]
                    10.0.0.103                             0 65102 i
                    RT:65102:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.2]
                    10.0.0.103                             0 65102 i
                    RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.2]
                    10.0.0.103                             0 65102 i
                    RT:65102:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.10]
                    10.0.0.103                             0 65102 i
                    RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.10]
                    10.0.0.103                             0 65102 i
                    RT:65102:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[128]:[fe80::ba59:9fff:fefa:878e]
                    10.0.0.103                             0 65102 i
                    RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[128]:[fe80::ba59:9fff:fefa:878e]
                    10.0.0.103                             0 65102 i
                    RT:65102:10 ET:8
Route Distinguisher: 10.0.0.103:3
*  [2]:[0]:[48]:[5e:60:de:10:be:74]
                    10.0.0.103                             0 65102 i
                    RT:65102:20 ET:8
*> [2]:[0]:[48]:[5e:60:de:10:be:74]
                    10.0.0.103                             0 65102 i
                    RT:65102:20 ET:8
*  [2]:[0]:[48]:[5e:60:de:10:be:74]:[128]:[fe80::5c60:deff:fe10:be74]
                    10.0.0.103                             0 65102 i
                    RT:65102:20 ET:8
*> [2]:[0]:[48]:[5e:60:de:10:be:74]:[128]:[fe80::5c60:deff:fe10:be74]
                    10.0.0.103                             0 65102 i
                    RT:65102:20 ET:8
Route Distinguisher: 10.0.0.104:2
*  [2]:[0]:[48]:[06:e0:ca:50:81:a3]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
*  [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[32]:[192.168.20.91]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[32]:[192.168.20.91]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
*  [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[128]:[fe80::4e0:caff:fe50:81a3]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[128]:[fe80::4e0:caff:fe50:81a3]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
*  [2]:[0]:[48]:[32:98:4b:9b:91:03]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
*  [2]:[0]:[48]:[32:98:4b:9b:91:03]:[32]:[192.168.20.92]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]:[32]:[192.168.20.92]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
*  [2]:[0]:[48]:[32:98:4b:9b:91:03]:[128]:[fe80::3098:4bff:fe9b:9103]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]:[128]:[fe80::3098:4bff:fe9b:9103]
                    10.0.0.104                             0 65103 i
                    RT:65103:20 ET:8
Route Distinguisher: 10.0.0.104:3
*  [2]:[0]:[48]:[b8:59:9f:fa:87:6e]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[32]:[10.10.1.4]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[32]:[10.10.1.4]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[128]:[fe80::ba59:9fff:fefa:876e]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[128]:[fe80::ba59:9fff:fefa:876e]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:be]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[32]:[10.10.1.5]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[32]:[10.10.1.5]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[128]:[fe80::ba59:9fff:fefa:87be]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[128]:[fe80::ba59:9fff:fefa:87be]
                    10.0.0.104                             0 65103 i
                    RT:65103:10 ET:8

Displayed 40 prefixes (58 paths) (of requested type) 
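
Optionally, the MAC and ARP entries learned through EVPN can also be inspected locally on a leaf switch (output omitted here for brevity):

Leaf Switch Console
cumulus@leaf1a:mgmt:~$ net show evpn mac vni all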

3) That the MLAG is functioning properly on the infrastructure rack leafs:

Border Router Switch Console
cumulus@leaf1a:mgmt:~$ net show clag
The peer is alive
     Our Priority, ID, and Role: 1000 1c:34:da:b4:09:20 primary
    Peer Priority, ID, and Role: 32768 1c:34:da:b4:06:20 secondary
          Peer Interface and IP: peerlink.4094 fe80::1e34:daff:feb4:620 (linklocal)
               VxLAN Anycast IP: 10.0.1.1
                      Backup IP: 10.0.0.102 (active)
                     System MAC: 44:38:39:ff:ff:aa

CLAG Interfaces
Our Interface      Peer Interface     CLAG Id   Conflicts              Proto-Down Reason
----------------   ----------------   -------   --------------------   -----------------
           bond1   bond1              1         -                      -              
           bond2   bond2              2         -                      -              
           bond3   bond3              3         -                      -              
           vni10   vni10              -         -                      -              
           vni20   vni20              -         -                      -


Done!

Authors

Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.

Shachar Dor

Shachar Dor joined the Solutions Lab team after working more than ten years as a software architect at NVIDIA Networking (previously Mellanox Technologies), where he was responsible for the architecture of network management products and solutions. Shachar's focus is on networking technologies, especially around fabric bring-up, configuration, monitoring, and life-cycle management. 

Shachar has a strong background in software architecture, design, and programming, gained through his work on multiple projects and technologies, including those prior to joining the company. 



Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. Neither NVIDIA Corporation nor any of its direct or indirect subsidiaries and affiliates (collectively: “NVIDIA”) make any representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Trademarks
NVIDIA, the NVIDIA logo, and Mellanox are trademarks and/or registered trademarks of NVIDIA Corporation and/or Mellanox Technologies Ltd. in the U.S. and in other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright
© 2023 NVIDIA Corporation & affiliates. All Rights Reserved.