RDG for a Scalable, High-performance Kubernetes Cluster over NVIDIA Ethernet Fabric

Scope

This Reference Deployment Guide (RDG) provides a practical and scalable Ethernet fabric deployment suitable for high-performance workloads in K8s. The fabric carries both the primary K8s network (e.g. Calico) and a secondary high-performance network for RDMA/DPDK, in conjunction with the SR-IOV and RDMA device plugins and CNIs.

The proposed fabric configuration supports up to 480 workload servers at its maximum scale and provides non-blocking throughput of up to 200Gbps between pods.

Abbreviations and Acronyms

Term  | Definition                        | Term  | Definition
------|-----------------------------------|-------|----------------------------------
BGP   | Border Gateway Protocol           | MLAG  | Multi-Chassis Link Aggregation
CNI   | Container Network Interface       | RDMA  | Remote Direct Memory Access
DMA   | Direct Memory Access              | TOR   | Top of Rack
EVPN  | Ethernet Virtual Private Network  | VLAN  | Virtual LAN (Local Area Network)
ISL   | Inter-Switch Link                 | VRR   | Virtual Router Redundancy
K8s   | Kubernetes                        | VTEP  | Virtual Tunnel End Point
LACP  | Link Aggregation Control Protocol | VXLAN | Virtual Extensible LAN

Introduction

K8s is the industry-standard platform for deploying and orchestrating cloud-native workloads.

The common K8s networking solutions (e.g. the widely used Flannel and Calico CNI plugins) are not optimized for performance and do not take advantage of today's hardware-accelerated networking technologies. Current NVIDIA interconnect solutions can provide up to 200Gbps of throughput at very low latency with minimal load on the server's CPU. To take advantage of these capabilities, an additional, high-speed RDMA-capable network must be provisioned for the pods.

This document demonstrates how to deploy, enable and configure a high-speed, hardware-accelerated network fabric in a K8s cluster, providing both the primary network and a secondary RDMA network on the same wire. The network fabric also includes highly-available border router functionality which provides in/out connectivity to the cluster (e.g. access to the Internet).

This document is intended for K8s administrators who want to enable a high-speed fabric for applications running on top of K8s, such as big data, machine learning, storage and database solutions.

The document begins with the design of the fabric and of the K8s deployment, then continues with the actual deployment and configuration steps, concluding with a performance test that demonstrates the benefits of the solution.

References

Solution Architecture

Key Components and Technologies

  • NVIDIA ConnectX SmartNICs
    10/25/40/50/100/200 and 400G Ethernet Network Adapters
The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offers advanced hardware offloads and accelerations.
    NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.

  • NVIDIA LinkX Cables

    The NVIDIA® LinkX® product family of cables and transceivers provides the industry's most complete line of 10, 25, 40, 50, 100, 200, and 400GbE Ethernet products and 100, 200 and 400Gb/s InfiniBand products for cloud, HPC, hyperscale, enterprise, telco, storage and artificial intelligence data center applications.

  • NVIDIA Spectrum Ethernet Switches

    Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.
    Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects.
    NVIDIA combines the benefits of NVIDIA Spectrum switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux , SONiC and NVIDIA Onyx®.

  • NVIDIA Cumulus Linux

    NVIDIA® Cumulus® Linux is the industry's most innovative open network operating system that allows you to automate, customize, and scale your data center network like no other.

  • RDMA

    RDMA is a technology that allows computers in a network to exchange data without involving the processor, cache or operating system of either computer.

    Like locally based DMA, RDMA improves throughput and performance and frees up compute resources.

  • Kubernetes
    Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.

  • Kubespray
    Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS and Kubernetes cluster configuration management tasks. It provides:

    • A highly available cluster

    • Composable attributes

    • Support for most popular Linux distributions

Logical Design

The physical servers used in this document:

  • 1 x Deployment Node

  • 1 x Master Node

  • 4 x Worker Nodes; each with 1 x ConnectX-6 NIC

The deployment of the fabric is based on a 2-level leaf-spine topology.

[Figure: Logical design of the K8s deployment over a 2-level leaf-spine topology]

The deployment includes two separate physical networks:

  1. A high-speed Ethernet fabric

  2. An IPMI/bare-metal management network (not covered in this document)

Note

This document covers a single K8s controller deployment scenario. For high-availability cluster deployment, please refer to https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md

Network / Fabric Design

This document demonstrates a minimalistic scale of 2 workload racks with 2 servers each (as shown in the diagram below):

[Figure: Network/fabric design with two workload racks of two servers each]

By using the same design, the fabric can be scaled to accommodate up to 480 workload servers using up to 30 workload racks with up to 16 servers each. Every workload rack uses a single leaf switch (TOR). The infrastructure rack consists of a highly-available border router (an MLAG pair) which provides a connection to an external gateway/router and to a maximum of 14 infrastructure servers.

The high-speed network consists of two logical segments:

  1. The management network and the primary K8s network (used by Calico) - VLAN10

  2. The secondary K8s network which provides RDMA to the pods - VLAN20

The fabric implements a VXLAN overlay network with a BGP EVPN control plane, which enables the "stretching" of the VLANs across all the racks.

Every leaf switch has a VTEP which takes care of VXLAN encapsulation and decapsulation. The communication between the VTEPs is done by routing through the spines, controlled by a BGP control plane.
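For example, once the leaf switches have been configured (see the Fabric Configuration section below), the local VTEP and its VNIs can be inspected on any leaf. The output values shown will depend on your deployment:

Leaf Switch Console

cumulus@leaf2:mgmt:~$ net show evpn vni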

The infrastructure rack (as seen on the left in the illustration below) has two leaf switches that act as a highly available border router which provides both highly available connectivity for the infrastructure servers (the deployment server and the K8s master node) and redundant routing into and out of the cluster through a gateway node. This high availability is achieved by an MLAG configuration, the use of LACP bonds, and a redundant router mechanism which uses VRR.
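Once configured, the MLAG pair state on the border leafs can be verified, for example, with the following command, which reports the peer role, the backup IP state and the dual-connected bonds:

Leaf Switch Console

cumulus@leaf1a:mgmt:~$ net show clag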

Below is a diagram demonstrating the maximum possible scale for a non-blocking deployment that uses 200GbE to the host (30 racks, 16 servers each using 16 spines and 32 leafs).

Please note that in this setup, the MSN2100 switches in the infrastructure rack should be replaced by MSN2700 switches (having 32 ports instead of 16 ports):

[Figure: Maximum-scale fabric - 30 workload racks with 16 servers each, 16 spines and 32 leafs]

In the case of a maximum scale fabric (as shown above), there will be 16 x 200Gbps links going up from each leaf to the spines and therefore a maximum of 16 x 200Gbps links going to servers in each rack.

Software Stack Components

[Figure: Software stack components]

Important

Please make sure to upgrade all the NVIDIA software components to their latest released version.

Bill of Materials

[Figure: Bill of materials]

Warning

Please note that older MSN2100 switches with hardware revision 0 (zero) do not support the functionality presented in this document. You can verify that your switch is a newer revision by running the "decode-syseeprom" command and checking the "Device Version" field (it must be greater than zero).
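For example (a simple way to filter the relevant field; the exact output layout may vary between Cumulus Linux versions):

Switch Console

cumulus@switch:~$ decode-syseeprom | grep -i 'device version'

The value reported for Device Version must be greater than zero.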

Deployment and Configuration

Node and Switch Definitions

These are the definitions and parameters used for deploying the demonstrated fabric:

Spines

hostname         | router id   | autonomous system | downlinks
-----------------|-------------|-------------------|----------
spine1 (MSN3700) | 10.0.0.1/32 | 65100             | swp1-6
spine2 (MSN3700) | 10.0.0.2/32 | 65100             | swp1-6

Leafs

hostname         | router id     | autonomous system | uplinks  | peers on spines
-----------------|---------------|-------------------|----------|----------------
leaf1a (MSN2100) | 10.0.0.101/32 | 65101             | swp13-14 | swp1
leaf1b (MSN2100) | 10.0.0.102/32 | 65101             | swp13-14 | swp2
leaf2 (MSN3700)  | 10.0.0.103/32 | 65102             | swp29-32 | swp3-4
leaf3 (MSN3700)  | 10.0.0.104/32 | 65103             | swp29-32 | swp5-6

Workload Server Ports

rack id | vlan id | access ports | trunk ports
--------|---------|--------------|------------
2       | 10      | swp1-4       |
2       | 20      |              | swp1-4
3       | 10      | swp1-4       |
3       | 20      |              | swp1-4

Border Routers (Infrastructure Rack TORs)

hostname | isl ports | clag system mac   | clag priority | vxlan anycast ip
---------|-----------|-------------------|---------------|-----------------
leaf1a   | swp15-16  | 44:38:39:FF:FF:AA | 1000          | 10.10.11.1
leaf1b   | swp15-16  | 44:38:39:FF:FF:AA | 32768         | 10.10.11.1

Border VLANs

vlan id | virt mac          | virt ip      | primary router ip | secondary router ip
--------|-------------------|--------------|-------------------|--------------------
10      | 00:00:00:00:00:10 | 10.10.0.1/16 | 10.10.0.2/16      | 10.10.0.3/16
1       | 00:00:00:00:00:01 | 10.1.0.1/24  | 10.1.0.2/24       | 10.1.0.3/24

Infrastructure Server Ports

vlan id | port names | bond names
--------|------------|-------------
1       | swp1       | bond1
10      | swp2, swp3 | bond2, bond3

Hosts

Rack                   | Server/Switch type | Server/Switch name | IP and NICs                                      | Default Gateway
-----------------------|--------------------|--------------------|--------------------------------------------------|----------------
Rack1 (Infrastructure) | Deployment Node    | depserver          | 10.10.0.250/16 on bond0 (enp197s0f0, enp197s0f1) | 10.10.0.1
Rack1 (Infrastructure) | Master Node        | node1              | 10.10.1.1/16 on bond0 (enp197s0f0, enp197s0f1)   | 10.10.0.1
Rack2                  | Worker Node        | node2              | 10.10.1.2/16 on enp197s0f0                       | 10.10.0.1
Rack2                  | Worker Node        | node3              | 10.10.1.3/16 on enp197s0f0                       | 10.10.0.1
Rack3                  | Worker Node        | node4              | 10.10.1.4/16 on enp197s0f0                       | 10.10.0.1
Rack3                  | Worker Node        | node5              | 10.10.1.5/16 on enp197s0f0                       | 10.10.0.1

Wiring

This is the wiring principle for the workload racks:

  • Each server in the rack is wired to the leaf (or "TOR") switch

  • Every leaf is wired to all the spines

[Figure: Workload rack wiring]

This is the wiring principle for the infrastructure rack:

  • Each server in the rack is wired to both leaf (or "TOR") switches

  • Every leaf is wired to all the spines

[Figure: Infrastructure rack wiring]

Fabric Configuration

Updating Cumulus Linux

As a best practice, make sure to use the latest released Cumulus Linux NOS version.

Please see this guide on how to upgrade Cumulus Linux.

Configuring the Cumulus Linux Switch

Make sure your Cumulus Linux switch has passed its initial configuration stages (please see the Quick-Start Guide for version 4.3 for additional information):

  1. License installation

  2. Creation of switch interfaces (e.g. swp1-32)

Following is the configuration for the switches:

Note

Please note that you can add the command "net del all" before the following commands in order to clear any previous configuration.

Spine1 Console

net add bgp autonomous-system 65100
net add loopback lo ip address 10.0.0.1/32
net add bgp router-id 10.0.0.1
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add interface swp1 mtu 9216
net add bgp neighbor swp1 interface peer-group underlay
net add interface swp2 mtu 9216
net add bgp neighbor swp2 interface peer-group underlay
net add interface swp3 mtu 9216
net add bgp neighbor swp3 interface peer-group underlay
net add interface swp4 mtu 9216
net add bgp neighbor swp4 interface peer-group underlay
net add interface swp5 mtu 9216
net add bgp neighbor swp5 interface peer-group underlay
net add interface swp6 mtu 9216
net add bgp neighbor swp6 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net commit

Spine2 Console

net add bgp autonomous-system 65100
net add loopback lo ip address 10.0.0.2/32
net add bgp router-id 10.0.0.2
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add interface swp1 mtu 9216
net add bgp neighbor swp1 interface peer-group underlay
net add interface swp2 mtu 9216
net add bgp neighbor swp2 interface peer-group underlay
net add interface swp3 mtu 9216
net add bgp neighbor swp3 interface peer-group underlay
net add interface swp4 mtu 9216
net add bgp neighbor swp4 interface peer-group underlay
net add interface swp5 mtu 9216
net add bgp neighbor swp5 interface peer-group underlay
net add interface swp6 mtu 9216
net add bgp neighbor swp6 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net commit

Leaf1A Console

net add bgp autonomous-system 65101
net add bgp router-id 10.0.0.101
net add loopback lo ip address 10.0.0.101/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise ipv4 unicast
net add bridge bridge ports peerlink
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.101
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.101
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.101
net add bridge bridge ports vni20
net add loopback lo clag vxlan-anycast-ip 10.10.11.1
net add bgp l2vpn evpn advertise-default-gw
net add bond peerlink bond slaves swp15,swp16
net add interface peerlink.4094 clag args --initDelay 10
net add interface peerlink.4094 clag backup-ip 10.0.0.102
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag priority 1000
net add interface peerlink.4094 clag sys-mac 44:38:39:FF:FF:AA
net add bgp neighbor peerlink.4094 interface remote-as internal
net add bgp l2vpn evpn neighbor peerlink.4094 activate
net add vlan 10 ip address 10.10.0.2/16
net add vlan 10 ip address-virtual 00:00:00:00:00:10 10.10.0.1/16
net add vlan 1 ip address 10.1.0.2/24
net add vlan 1 ip address-virtual 00:00:00:00:00:01 10.1.0.1/24
net commit

Leaf1B Console

net add bgp autonomous-system 65101
net add bgp router-id 10.0.0.102
net add loopback lo ip address 10.0.0.102/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise ipv4 unicast
net add bridge bridge ports peerlink
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.102
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.102
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.102
net add bridge bridge ports vni20
net add loopback lo clag vxlan-anycast-ip 10.10.11.1
net add bgp l2vpn evpn advertise-default-gw
net add bond peerlink bond slaves swp15,swp16
net add interface peerlink.4094 clag args --initDelay 10
net add interface peerlink.4094 clag backup-ip 10.0.0.101
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag priority 32768
net add interface peerlink.4094 clag sys-mac 44:38:39:FF:FF:AA
net add bgp neighbor peerlink.4094 interface remote-as internal
net add bgp l2vpn evpn neighbor peerlink.4094 activate
net add vlan 10 ip address 10.10.0.3/16
net add vlan 10 ip address-virtual 00:00:00:00:00:10 10.10.0.1/16
net add vlan 1 ip address 10.1.0.3/24
net add vlan 1 ip address-virtual 00:00:00:00:00:01 10.1.0.1/24
net commit

Leaf2 Console

net add bgp autonomous-system 65102
net add bgp router-id 10.0.0.103
net add loopback lo ip address 10.0.0.103/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp29 mtu 9216
net add bgp neighbor swp29 interface peer-group underlay
net add interface swp30 mtu 9216
net add bgp neighbor swp30 interface peer-group underlay
net add interface swp31 mtu 9216
net add bgp neighbor swp31 interface peer-group underlay
net add interface swp32 mtu 9216
net add bgp neighbor swp32 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise ipv4 unicast
net add bridge bridge ports swp1,swp2,swp3,swp4
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.103
net add interface swp1,swp2,swp3,swp4 bridge pvid 10
net add interface swp1,swp2,swp3,swp4 bridge vids 20
net add interface swp1,swp2,swp3,swp4 mtu 8950
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.103
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.103
net add bridge bridge ports vni20
net commit

Leaf3 Console

net add bgp autonomous-system 65103
net add bgp router-id 10.0.0.104
net add loopback lo ip address 10.0.0.104/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp29 mtu 9216
net add bgp neighbor swp29 interface peer-group underlay
net add interface swp30 mtu 9216
net add bgp neighbor swp30 interface peer-group underlay
net add interface swp31 mtu 9216
net add bgp neighbor swp31 interface peer-group underlay
net add interface swp32 mtu 9216
net add bgp neighbor swp32 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise ipv4 unicast
net add bridge bridge ports swp1,swp2,swp3,swp4
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.104
net add interface swp1,swp2,swp3,swp4 bridge pvid 10
net add interface swp1,swp2,swp3,swp4 bridge vids 20
net add interface swp1,swp2,swp3,swp4 mtu 8950
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.104
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.104
net add bridge bridge ports vni20
net commit

Connecting the Infrastructure Servers

Infrastructure servers (deployment and K8s master servers) are placed in the infrastructure rack.

This will require the following additional configuration steps:

  1. Adding the ports connected to the servers to an MLAG bond

  2. Placing the bond in the relevant VLAN

In our case, the servers are connected to ports swp2 and swp3 on both leafs (Leaf1A and Leaf1B) and use VLAN10, which we created on the border leafs. The commands to run on both Leaf1A and Leaf1B are:

Leaf1A and Leaf1B Console

net add interface swp2 mtu 8950
net add bond bond2 bond slaves swp2
net add bond bond2 mtu 8950
net add bond bond2 clag id 2
net add bond bond2 bridge access 10
net add bond bond2 bond lacp-bypass-allow
net add bond bond2 stp bpduguard
net add bond bond2 stp portadminedge
net add interface swp3 mtu 8950
net add bond bond3 bond slaves swp3
net add bond bond3 mtu 8950
net add bond bond3 clag id 3
net add bond bond3 bridge access 10
net add bond bond3 bond lacp-bypass-allow
net add bond bond3 stp bpduguard
net add bond bond3 stp portadminedge
net commit

Connecting an External Gateway to the Infrastructure Rack

In our setup, we will connect an external gateway machine (10.1.0.254/24) over an LACP bond to swp1 of both border leafs (via VLAN1).
This gateway will be used to access any external network (e.g. the Internet). The configuration commands on both border leafs are as follows:

Leaf1A and Leaf1B Console

net add interface swp1 mtu 8950
net add bond bond1 bond slaves swp1
net add bond bond1 mtu 8950
net add bond bond1 clag id 1
net add bond bond1 bridge access 1
net add bond bond1 bond lacp-bypass-allow
net add bond bond1 stp bpduguard
net add bond bond1 stp portadminedge
net add routing route 0.0.0.0/0 10.1.0.254
net commit

Please note that the gateway machine should be statically configured with a route back to our primary network (10.10.0.0/16) via its interface on VLAN1.
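On a Linux-based gateway, for example, this can be a static route pointing at the border routers' virtual IP on VLAN1 (a sketch; how to persist the route depends on the gateway's distribution):

Gateway Console

# ip route add 10.10.0.0/16 via 10.1.0.1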

Host Configuration

Important

Make sure that the BIOS settings on the worker node servers have SR-IOV enabled and that the servers are tuned for maximum performance.

Important

All Worker nodes must have the same PCIe placement for the NIC, and expose the same interface name.

The hosts in this solution run Ubuntu Linux. The configuration is as follows:

Installing and Updating the OS

Make sure the Ubuntu Server 20.04 operating system is installed on all servers with the OpenSSH server package, and create a non-root user account with passwordless sudo privileges.

Also make sure to assign the correct network configuration to the hosts (IP addresses, default gateway, DNS server, NTP server) and to create bonds on the nodes in the infrastructure rack (master node and deployment node).
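Below is a minimal netplan sketch for such a bond on the master node, assuming the interface names and addressing listed in the Hosts table and an LACP (802.3ad) bond; adjust it to your environment:

Master Node Console

# vi /etc/netplan/00-installer-config.yaml

network:
  version: 2
  ethernets:
    enp197s0f0: {}
    enp197s0f1: {}
  bonds:
    bond0:
      interfaces: [enp197s0f0, enp197s0f1]
      parameters:
        mode: 802.3ad
      addresses: [10.10.1.1/16]
      gateway4: 10.10.0.1
      mtu: 8950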

Update the Ubuntu software packages by running the following commands:
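For example, a typical update sequence (the exact commands are not shown in the original text, so this is a suggested equivalent):

Server Console

$ sudo apt update
$ sudo apt -y upgrade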

Non-root User Account Prerequisites

In this solution, we appended the following lines to the end of /etc/sudoers:

Server Console

$ sudo vi /etc/sudoers

#includedir /etc/sudoers.d

# K8s cluster deployment user with sudo privileges without password
user ALL=(ALL) NOPASSWD:ALL

SR-IOV Activation and Virtual Functions Configuration

Use the following commands to install the mstflint tool and to verify that SR-IOV is enabled and that there are enough virtual functions (VFs) on the NIC:

Worker Node Console

# apt install mstflint

# lspci | grep Mellanox
c5:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
c5:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]

# mstconfig -d c5:00.0 q | grep SRIOV_EN
SRIOV_EN        True(1)
# mstconfig -d c5:00.0 q | grep NUM_OF_VFS
NUM_OF_VFS      8

If SR-IOV is not enabled or the number of VFs is insufficient, configure them using the following commands (and then reboot the machine):

Worker Node Console

# mstconfig -d c5:00.0 -y set SRIOV_EN=True NUM_OF_VFS=8

# reboot

The above operation enables SR-IOV in the NIC firmware and defines the maximum number of supported VFs. The actual activation of the virtual functions is performed later in this section.

Installing rdma-core and Setting RDMA to "Exclusive Mode"

Install the rdma-core package:

Worker Node Console


# apt install rdma-core -y

Set netns to exclusive mode for providing namespace isolation on the high-speed interface. This way, each pod can only see and access its own virtual functions.

Create the following file:

Worker Node Console

# vi /etc/modprobe.d/ib_core.conf

# Set netns to exclusive mode for namespace isolation
options ib_core netns_mode=0

Then run the commands below:

Worker Node Console

# update-initramfs -u
# reboot

After the node comes back, check netns mode:

Worker Node Console

# rdma system

netns exclusive

Setting MTU on the Physical Port

We need to set the MTU on the physical port of the server to allow for optimized throughput.

Since the fabric uses a VXLAN overlay, we use the maximum MTU of 9216 on the core links and an MTU of 8950 on the edge links (server links). This ensures that the roughly 50 bytes of VXLAN/UDP/IP/Ethernet encapsulation added to each packet (8950 + 50 = 9000, well below 9216) will not cause fragmentation.

In order to configure the MTU on the server ports, please edit the netplan config file (in this example on node2):

Worker Node Console

# vi /etc/netplan/00-installer-config.yaml

network:
  ethernets:
    enp197s0f0:
      addresses:
        - 10.10.1.2/16
      gateway4: 10.10.0.1
      mtu: 8950
  version: 2

Note

Please note that you can use the "rdma link" command to identify the name assigned to the high-speed interface, for example:

# rdma link

link rocep197s0f0/1 state ACTIVE physical_state LINK_UP netdev enp197s0f0

Then apply it:

Worker Node Console


# netplan apply
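To verify that the new MTU took effect, check the link, for example:

Worker Node Console

# ip link show enp197s0f0 | grep mtu

The output should report mtu 8950.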

Virtual Function Activation

Now we will activate 8 virtual functions using the following command:

Worker Node Console

# PF_NAME=enp197s0f0
# echo 8 > /sys/class/net/${PF_NAME}/device/sriov_numvfs

Important

Please note that the above configuration is not persistent!
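One possible way to make the VF creation persistent is a small systemd unit that re-applies it at boot (a sketch, not part of the original procedure; the unit name and the interface name are assumptions for this example):

Worker Node Console

# vi /etc/systemd/system/sriov-vfs.service

[Unit]
Description=Create SR-IOV virtual functions on enp197s0f0
After=network.target

[Service]
Type=oneshot
# Re-create 8 VFs on the physical function at every boot
ExecStart=/bin/sh -c 'echo 8 > /sys/class/net/enp197s0f0/device/sriov_numvfs'

[Install]
WantedBy=multi-user.target

# systemctl daemon-reload
# systemctl enable sriov-vfs.service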

NIC Firmware Upgrade

It is recommended that you upgrade the NIC firmware on the worker nodes to the latest released version.

Please make sure to use the root account using:

Worker Node Console


$ sudo su -

Please make sure to download the "mlxup" program to each Worker Node and install the latest firmware for the NIC (this requires Internet connectivity; please check the official download page):

Worker Node Console

# wget http://www.mellanox.com/downloads/firmware/mlxup/4.15.2/SFX/linux_x64/mlxup
# chmod 777 mlxup
# ./mlxup -u --online

K8s Cluster Deployment and Configuration

The K8s cluster in this solution will be installed using Kubespray with a non-root user account from the Deployment Node.

SSH Private Key and SSH Passwordless Login

Log in to the Deployment Node as the deployment user (in this case - user) and create an SSH private key for configuring password-less authentication by running the following commands:

Deployment Node Console

$ ssh-keygen

Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Created directory '/home/user/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:PaZkvxV4K/h8q32zPWdZhG1VS0DSisAlehXVuiseLgA user@depl-node
The key's randomart image is:
+---[RSA 2048]----+
|      ...+oo+o..o|
|      .oo   .o. o|
|     . .. . o  +.|
|   E  .  o +  . +|
|    .   S = +  o |
|     . o = + o  .|
|      . o.o +   o|
|       ..+.*. o+o|
|        oo*ooo.++|
+----[SHA256]-----+

Copy your SSH key to all nodes in your deployment by running the following command; ssh-copy-id installs the corresponding public key (~/.ssh/id_rsa.pub) on each node (example):

Deployment Node Console

$ ssh-copy-id -i ~/.ssh/id_rsa user@10.10.1.1

/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/user/.ssh/id_rsa.pub"
The authenticity of host '10.10.1.1 (10.10.1.1)' can't be established.
ECDSA key fingerprint is SHA256:uyglY5g0CgPNGDm+XKuSkFAbx0RLaPijpktANgXRlD8.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
user@10.10.1.1's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'user@10.10.1.1'"
and check to make sure that only the key(s) you wanted were added.

Verify that you have password-less SSH connectivity to all nodes in your deployment by running the following command (example):

Deployment Node Console


$ ssh user@10.10.1.1

Kubespray Deployment and Configuration

To install the dependencies for running Kubespray with Ansible on the Deployment Node, run the following commands:

Deployment Node Console

$ cd ~
$ sudo apt -y install python3-pip jq
$ git clone https://github.com/kubernetes-sigs/kubespray.git
$ cd kubespray
$ sudo pip3 install -r requirements.txt

Create a new cluster configuration. The default folder for subsequent commands is ~/kubespray.

Replace the IP addresses below with your nodes' IP addresses:

Deployment Node Console

$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(10.10.1.1 10.10.1.2 10.10.1.3 10.10.1.4 10.10.1.5)
$ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

As a result, the inventory/mycluster/hosts.yaml file will be created.
Review and change the host configuration in the file. Below is an example for this deployment:

inventory/mycluster/hosts.yaml

$ sudo vi inventory/mycluster/hosts.yaml

all:
  hosts:
    node1:
      ansible_host: 10.10.1.1
      ip: 10.10.1.1
      access_ip: 10.10.1.1
    node2:
      ansible_host: 10.10.1.2
      ip: 10.10.1.2
      access_ip: 10.10.1.2
    node3:
      ansible_host: 10.10.1.3
      ip: 10.10.1.3
      access_ip: 10.10.1.3
    node4:
      ansible_host: 10.10.1.4
      ip: 10.10.1.4
      access_ip: 10.10.1.4
    node5:
      ansible_host: 10.10.1.5
      ip: 10.10.1.5
      access_ip: 10.10.1.5
  children:
    kube_control_plane:
      hosts:
        node1:
    kube_node:
      hosts:
        node2:
        node3:
        node4:
        node5:
    etcd:
      hosts:
        node1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}

Review and change the cluster installation parameters in the files inventory/mycluster/group_vars/all/all.yml and inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml.

In inventory/mycluster/group_vars/all/all.yml, uncomment the following line so that metrics can be collected about the use of cluster resources:

Deployment Node Console

$ sudo vi inventory/mycluster/group_vars/all/all.yml

## The read-only port for the Kubelet to serve on with no authentication/authorization. Uncomment to enable.
kube_read_only_port: 10255

In inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml set the value of kube_version to v1.21.0, set the container_manager to containerd and enable multi_networking by setting kube_network_plugin_multus: true.

Deployment Node Console

$ sudo vi inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

…
## Change this to use another Kubernetes version, e.g. a current beta release
kube_version: v1.21.0
…
## Container runtime
## docker for docker, crio for cri-o and containerd for containerd.
container_manager: containerd
…
# Setting multi_networking to true will install Multus: https://github.com/intel/multus-cni
kube_network_plugin_multus: true
…

In inventory/mycluster/group_vars/etcd.yml set the etcd_deployment_type to host:

Deployment Node Console

$ sudo vi inventory/mycluster/group_vars/etcd.yml

...

## Settings for etcd deployment type
etcd_deployment_type: host

Deploying the cluster using Kubespray Ansible Playbook

Run the following line to start the deployment process:

Deployment Node Console


$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

The deployment takes a while to complete. Please make sure no errors are encountered.

A successful result should look something like the following:

Deployment Node Console


PLAY RECAP *********************************************************************************************************************************************************************************** localhost : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 node1 : ok=584 changed=133 unreachable=0 failed=0 skipped=1151 rescued=0 ignored=2 node2 : ok=387 changed=86 unreachable=0 failed=0 skipped=634 rescued=0 ignored=1 node3 : ok=387 changed=86 unreachable=0 failed=0 skipped=633 rescued=0 ignored=1 node4 : ok=387 changed=86 unreachable=0 failed=0 skipped=633 rescued=0 ignored=1 node5 : ok=387 changed=86 unreachable=0 failed=0 skipped=633 rescued=0 ignored=1   Thursday 20 May 2021 07:59:23 +0000 (0:00:00.071) 0:11:57.632 ********** =============================================================================== kubernetes/control-plane : kubeadm | Initialize first master ------------------------------------------------------------------------------------------------------------------------- 77.14s kubernetes/kubeadm : Join to cluster ------------------------------------------------------------------------------------------------------------------------------------------------- 36.82s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 32.52s download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 25.75s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.73s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.15s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.00s download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 20.24s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 16.27s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 15.36s container-engine/containerd : ensure containerd packages are installed --------------------------------------------------------------------------------------------------------------- 13.29s download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 12.29s kubernetes/preinstall : Install packages requirements -------------------------------------------------------------------------------------------------------------------------------- 12.15s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 11.40s download_file | Download item 
-------------------------------------------------------------------------------------------------------------------------------------------------------- 11.05s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 10.19s kubernetes/control-plane : Master | wait for kube-scheduler -------------------------------------------------------------------------------------------------------------------------- 10.02s download_container | Download image if required --------------------------------------------------------------------------------------------------------------------------------------- 9.36s download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------- 9.15s reload etcd --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 8.65s

Now that the K8s cluster is deployed, connect to the K8s Master Node for the following sections.

Please make sure to use the root account:

Master Node Console


$ sudo su -

K8s Deployment Verification

Below is an output example of the K8s cluster deployment information, using the default Kubespray configuration with the Calico K8s CNI plugin.

To ensure that the K8s cluster is installed correctly, run the following commands:

Master Node Console


# kubectl get nodes -o wide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME node1 Ready control-plane,master 6h40m v1.21.0 10.10.1.1 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4 node2 Ready <none> 6h39m v1.21.0 10.10.1.2 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4 node3 Ready <none> 6h39m v1.21.0 10.10.1.3 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4 node4 Ready <none> 6h39m v1.21.0 10.10.1.4 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4 node5 Ready <none> 6h39m v1.21.0 10.10.1.5 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4        $ kubectl get pod -n kube-system -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES calico-kube-controllers-7797d7b677-4kndh 1/1 Running 0 6h39m 10.10.1.3 node3 <none> <none> calico-node-6xqxn 1/1 Running 1 6h40m 10.10.1.5 node5 <none> <none> calico-node-7st5x 1/1 Running 0 6h40m 10.10.1.2 node2 <none> <none> calico-node-8qdpx 1/1 Running 0 6h40m 10.10.1.1 node1 <none> <none> calico-node-qjflr 1/1 Running 2 6h40m 10.10.1.4 node4 <none> <none> calico-node-x68rz 1/1 Running 0 6h40m 10.10.1.3 node3 <none> <none> coredns-7fcf4fd7c7-7p6k5 1/1 Running 0 6h7m 10.233.92.1 node3 <none> <none> coredns-7fcf4fd7c7-mwfd6 1/1 Running 0 6h39m 10.233.90.1 node1 <none> <none> dns-autoscaler-7df78bfcfb-xl48v 1/1 Running 0 6h39m 10.233.90.2 node1 <none> <none> kube-apiserver-node1 1/1 Running 0 6h41m 10.10.1.1 node1 <none> <none> kube-controller-manager-node1 1/1 Running 0 6h41m 10.10.1.1 node1 <none> <none> kube-multus-ds-amd64-8dmpv 1/1 Running 0 6h39m 10.10.1.3 node3 <none> <none> kube-multus-ds-amd64-b74t4 1/1 Running 1 6h39m 10.10.1.5 node5 <none> <none> kube-multus-ds-amd64-nvrl9 1/1 Running 2 6h39m 10.10.1.4 node4 <none> <none> kube-multus-ds-amd64-s9lr4 1/1 Running 0 6h39m 10.10.1.2 node2 <none> <none> kube-multus-ds-amd64-zrxcs 1/1 Running 0 6h39m 10.10.1.1 node1 <none> <none> kube-proxy-bq9xg 1/1 Running 2 6h40m 10.10.1.4 node4 <none> <none> kube-proxy-bs8br 1/1 Running 0 6h40m 10.10.1.3 node3 <none> <none> kube-proxy-fxs88 1/1 Running 0 6h40m 10.10.1.1 node1 <none> <none> kube-proxy-rts6t 1/1 Running 1 6h40m 10.10.1.5 node5 <none> <none> kube-proxy-vml29 1/1 Running 0 6h40m 10.10.1.2 node2 <none> <none> kube-scheduler-node1 1/1 Running 0 6h41m 10.10.1.1 node1 <none> <none> nginx-proxy-node2 1/1 Running 0 6h40m 10.10.1.2 node2 <none> <none> nginx-proxy-node3 1/1 Running 0 6h40m 10.10.1.3 node3 <none> <none> nginx-proxy-node4 1/1 Running 2 6h40m 10.10.1.4 node4 <none> <none> nginx-proxy-node5 1/1 Running 1 6h40m 10.10.1.5 node5 <none> <none> nodelocaldns-kdsg5 1/1 Running 2 6h39m 10.10.1.4 node4 <none> <none> nodelocaldns-mhh9g 1/1 Running 0 6h39m 10.10.1.2 node2 <none> <none> nodelocaldns-nbhnr 1/1 Running 0 6h39m 10.10.1.3 node3 <none> <none> nodelocaldns-nkj9h 1/1 Running 0 6h39m 10.10.1.1 node1 <none> <none> nodelocaldns-rfnqk 1/1 Running 1 6h39m 10.10.1.5 node5 <none> <none>

Installing the Whereabouts CNI

You can install this plugin with a daemon set, using the following commands:

Master Node Console

# kubectl apply -f https://raw.githubusercontent.com/dougbtv/whereabouts/master/doc/daemonset-install.yaml
# kubectl apply -f https://raw.githubusercontent.com/dougbtv/whereabouts/master/doc/whereabouts.cni.cncf.io_ippools.yaml

To ensure the plugin is installed correctly, run the following command:

Master Node Console

# kubectl get pods -A | grep whereabouts
kube-system   whereabouts-74nwr   1/1   Running   0   6h4m
kube-system   whereabouts-7pq2l   1/1   Running   0   6h4m
kube-system   whereabouts-gbpht   1/1   Running   0   6h4m
kube-system   whereabouts-slbnj   1/1   Running   0   6h4m
kube-system   whereabouts-tw7dc   1/1   Running   0   6h4m

Deploying the SRIOV Device Plugin and CNI

Prepare the following files and apply them:

Master Node Console

# vi configMap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [
        {
          "resourceName": "sriov_rdma",
          "resourcePrefix": "nvidia.com",
          "selectors": {
            "vendors": ["15b3"],
            "pfNames": ["enp197s0f0"],
            "isRdma": true
          }
        }
      ]
    }

sriovdp-daemonset.yaml


# vi sriovdp-daemonset.yaml   --- apiVersion: v1 kind: ServiceAccount metadata: name: sriov-device-plugin namespace: kube-system   --- apiVersion: apps/v1 kind: DaemonSet metadata: name: kube-sriov-device-plugin-amd64 namespace: kube-system labels: tier: node app: sriovdp spec: selector: matchLabels: name: sriov-device-plugin template: metadata: labels: name: sriov-device-plugin tier: node app: sriovdp spec: hostNetwork: true nodeSelector: beta.kubernetes.io/arch: amd64 serviceAccountName: sriov-device-plugin containers: - name: kube-sriovdp image: docker.io/nfvpe/sriov-device-plugin:v3.3 imagePullPolicy: IfNotPresent args: - --log-dir=sriovdp - --log-level=10 securityContext: privileged: true resources: requests: cpu: "250m" memory: "40Mi" limits: cpu: 1 memory: "200Mi" volumeMounts: - name: devicesock mountPath: /var/lib/kubelet/ readOnly: false - name: log mountPath: /var/log - name: config-volume mountPath: /etc/pcidp - name: device-info mountPath: /var/run/k8s.cni.cncf.io/devinfo/dp volumes: - name: devicesock hostPath: path: /var/lib/kubelet/ - name: log hostPath: path: /var/log - name: device-info hostPath: path: /var/run/k8s.cni.cncf.io/devinfo/dp type: DirectoryOrCreate - name: config-volume configMap: name: sriovdp-config items: - key: config.json path: config.json

sriov-cni-daemonset.yaml


# vi sriov-cni-daemonset.yaml   --- apiVersion: apps/v1 kind: DaemonSet metadata: name: kube-sriov-cni-ds-amd64 namespace: kube-system labels: tier: node app: sriov-cni spec: selector: matchLabels: name: sriov-cni template: metadata: labels: name: sriov-cni tier: node app: sriov-cni spec: nodeSelector: beta.kubernetes.io/arch: amd64 containers: - name: kube-sriov-cni image: nfvpe/sriov-cni:v2.3 imagePullPolicy: IfNotPresent securityContext: allowPrivilegeEscalation: false privileged: false readOnlyRootFilesystem: true capabilities: drop: - ALL resources: requests: cpu: "100m" memory: "50Mi" limits: cpu: "100m" memory: "50Mi" volumeMounts: - name: cnibin mountPath: /host/opt/cni/bin volumes: - name: cnibin hostPath: path: /opt/cni/bin

Master Node Console

# kubectl apply -f configMap.yaml
# kubectl apply -f sriovdp-daemonset.yaml
# kubectl apply -f sriov-cni-daemonset.yaml

Deploying the RDMA CNI

The RDMA CNI enables namespace isolation for the virtual functions.

Deploy the RDMA CNI using the following YAML file:

rdma-cni-daemonset.yaml


# vi rdma-cni-daemonset.yaml   --- apiVersion: apps/v1 kind: DaemonSet metadata: name: kube-rdma-cni-ds namespace: kube-system labels: tier: node app: rdma-cni name: rdma-cni spec: selector: matchLabels: name: rdma-cni updateStrategy: type: RollingUpdate template: metadata: labels: tier: node app: rdma-cni name: rdma-cni spec: hostNetwork: true containers: - name: rdma-cni image: mellanox/rdma-cni imagePullPolicy: IfNotPresent securityContext: privileged: true resources: requests: cpu: "100m" memory: "50Mi" limits: cpu: "100m" memory: "50Mi" volumeMounts: - name: cnibin mountPath: /host/opt/cni/bin volumes: - name: cnibin hostPath: path: /opt/cni/bin

Master Node Console


# kubectl apply -f rdma-cni-daemonset.yaml

Applying Network Attachment Definitions

Apply the following YAML file to configure the network attachment for the pods:

netattdef.yaml

# vi netattdef.yaml

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/sriov_rdma
  name: sriov20
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "sriov-rdma",
      "plugins": [
        {
          "type": "sriov",
          "vlan": 20,
          "spoofchk": "off",
          "vlanQoS": 0,
          "ipam": {
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "log_file": "/tmp/whereabouts.log",
            "log_level": "debug",
            "type": "whereabouts",
            "range": "192.168.20.0/24"
          }
        },
        {
          "type": "rdma"
        },
        {
          "mtu": 8950,
          "type": "tuning"
        }
      ]
    }

Master Node Console


# kubectl apply -f netattdef.yaml

Creating a Test Deployment

Create a test daemon set using the following YAML file. It will create a pod on every worker node that we can use to test RDMA connectivity and performance over the high-speed network.

Please note that it adds an annotation referencing the required network ("sriov20") and requests the SR-IOV virtual function resource ("nvidia.com/sriov_rdma").

The container image specified below should include the NVIDIA user-space drivers and the perftest package.
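As an illustration only (not part of this solution's software stack), such an image could be built from an Ubuntu base with the distribution's RDMA user-space packages and perftest; alternatively, use a pre-built NVIDIA container image:

Example Dockerfile (sketch)

FROM ubuntu:20.04
# RDMA user-space libraries, basic networking tools and the perftest benchmarks
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
        rdma-core ibverbs-providers ibverbs-utils perftest iproute2 iputils-ping && \
    rm -rf /var/lib/apt/lists/*
# Keep the container running so it can be exec'd into for testing
CMD ["sleep", "inf"]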

simple-daemon.yaml

# vi simple-daemon.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-daemon
  labels:
    app: example-dae
spec:
  selector:
    matchLabels:
      app: example-dae
  template:
    metadata:
      labels:
        app: example-dae
      annotations:
        k8s.v1.cni.cncf.io/networks: sriov20
    spec:
      containers:
      - image: < container image >
        name: example-dae-pod
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            memory: 16Gi
            cpu: 8
            nvidia.com/sriov_rdma: '1'
          requests:
            memory: 16Gi
            cpu: 8
            nvidia.com/sriov_rdma: '1'
        command:
        - sleep
        - inf

Apply the resource:

Master Node Console


# kubectl apply -f simple-daemon.yaml

Validate that the daemon set is running successfully. You should see four pods running, one on each worker node:

Master Node Console

# kubectl get pod -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP             NODE    NOMINATED NODE   READINESS GATES
example-daemon-2p7t2   1/1     Running   0          5h21m   10.233.92.3    node3   <none>           <none>
example-daemon-g8mcx   1/1     Running   0          5h21m   10.233.96.84   node2   <none>           <none>
example-daemon-kf56h   1/1     Running   0          5h21m   10.233.105.4   node4   <none>           <none>
example-daemon-zdmz8   1/1     Running   0          5h21m   10.233.70.5    node5   <none>           <none>

Please refer to the appendix for running an RDMA performance test between the two pods in your test deployment.

Appendix

Performance Testing

Now that the test daemon set is running, we can run a performance test to check the RDMA performance between two pods running on two different worker nodes:

In one console window, connect to the master node and make sure to use the root account by using:

Master Node Console


$ sudo su -

Connect to one of the pods in the daemonset (example):

Master Node Console


# kubectl exec -it example-daemon-2p7t2 -- bash

From within the container, check its IP address on the high-speed network interface (net1):

First pod console


# ip address show 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000 link/ipip 0.0.0.0 brd 0.0.0.0 4: eth0@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default link/ether 0e:e8:a8:d6:f7:3c brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.233.92.3/32 brd 10.233.92.3 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::ce8:a8ff:fed6:f73c/64 scope link valid_lft forever preferred_lft forever 26: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq state UP group default qlen 1000 link/ether ea:fe:9f:4a:28:8e brd ff:ff:ff:ff:ff:ff inet 192.168.20.88/24 brd 192.168.20.255 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::e8fe:9fff:fe4a:288e/64 scope link

Then, start the ib_write_bw server side:

First pod console

# ib_write_bw -a --report_gbits

************************************
* Waiting for client to connect... *
************************************

Using another console window, connect again to the master node and connect to the second pod in the deployment (example):

Master Node Console

$ sudo su -
# kubectl exec -it example-daemon-zdmz8 -- bash

From within the container, start the ib_write_bw client (using the IP address taken from the receiving container).

Please verify that the maximum bandwidth between containers reaches more than 190 Gb/s:

Second pod console


# ib_write_bw -a -F --report_gbits 192.168.20.88 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : rocep197s0f0v0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF TX depth : 128 CQ Moderation : 100 Mtu : 4096[B] Link type : Ethernet GID index : 2 Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x0122 PSN 0x3fdd80 RKey 0x02031e VAddr 0x007fb2a4731000 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:91 remote address: LID 0000 QPN 0x0164 PSN 0xa38679 RKey 0x03031f VAddr 0x007fe0387d1000 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:88 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 2 5000 0.041157 0.040923 2.557717 4 5000 0.089667 0.089600 2.799999 8 5000 0.18 0.18 2.795828 16 5000 0.36 0.36 2.799164 32 5000 0.72 0.72 2.801682 64 5000 1.08 1.07 2.089307 128 5000 2.15 2.08 2.031467 256 5000 4.30 4.30 2.097492 512 5000 8.56 8.56 2.089221 1024 5000 17.09 17.02 2.077250 2048 5000 33.89 33.83 2.065115 4096 5000 85.32 66.30 2.023458 8192 5000 163.84 136.83 2.087786 16384 5000 184.12 167.11 1.274956 32768 5000 190.44 180.83 0.689819 65536 5000 190.26 182.66 0.348395 131072 5000 193.71 179.10 0.170803 262144 5000 192.64 191.31 0.091222 524288 5000 192.62 191.29 0.045608 1048576 5000 192.82 192.75 0.022977 2097152 5000 192.38 192.22 0.011457 4194304 5000 192.80 192.78 0.005745 8388608 5000 192.67 192.65 0.002871 ---------------------------------------------------------------------------------------

Optimizing worker nodes for performance

In order to accommodate performance-sensitive applications, we can optimize the worker nodes for better performance by enabling pod scheduling on cores that are mapped to the same NUMA node as the NIC:

On the worker node, please make sure to use the root account by using:

Worker Node Console


$ sudo su -

Check to which NUMA node the NIC is wired:

Worker Node Console

# cat /sys/class/net/enp197s0f0/device/numa_node
1

In this example, the NIC is wired to NUMA node 1.

Check the NUMA nodes of the CPU and which cores are in NUMA node 1:

Worker Node Console

# lscpu | grep NUMA
NUMA node(s):        2
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47

In this example case, the cores that are in NUMA node 1 are: 24-47.

Now we need to configure K8s on the worker node (kubelet):

  • The "cpuManagerPolicy" attribute specifies the selected CPU manger policy (which can be either "none" or "static").

  • The "reservedSystemCPUs" attribute lists the CPU cores that will not be used by K8S (will stay reserved for the Linux system).

  • The "topologyManagerPolicy" attribute specifies the selected policy for the topology manager (which can be either "none", "best-effort", "restricted" or "single-numa-node").

We will reserve some cores for the system, and make sure they belong to NUMA 0 (for our case):

Worker Node Console

# vi /etc/kubernetes/kubelet-config.yaml
...
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 10s
reservedSystemCPUs: "0,1,2,3"
topologyManagerPolicy: single-numa-node
featureGates:
  CPUManager: true
  TopologyManager: true
...

When changing reservedSystemCPUs or cpuManagerPolicy, the /var/lib/kubelet/cpu_manager_state file must be deleted and the kubelet service must be restarted:

Worker Node Console


# rm /var/lib/kubelet/cpu_manager_state
# service kubelet restart
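
After the restart, it is worth confirming that the new policy took effect by inspecting the state file that kubelet recreates (the exact contents differ per node):

Worker Node Console

# cat /var/lib/kubelet/cpu_manager_state

The output is expected to contain "policyName":"static" and a defaultCpuSet that shrinks as exclusive cores are handed out to pods.

Keep in mind that exclusive, NUMA-aligned cores are only granted to pods in the Guaranteed QoS class (integer CPU count, with requests equal to limits). The following is a minimal sketch of such a pod; the image, the secondary network annotation ("sriov-net") and the SR-IOV resource name ("nvidia.com/sriov_rdma") are placeholders and should be replaced with the names defined earlier in this guide:

Master Node Console

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: numa-pinned-test
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-net      # placeholder secondary network name
spec:
  containers:
  - name: test
    image: ubuntu:22.04                         # illustrative image
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "8"                                # integer CPU count -> exclusive cores
        memory: 8Gi
        nvidia.com/sriov_rdma: "1"              # placeholder SR-IOV resource name
      limits:
        cpu: "8"
        memory: 8Gi
        nvidia.com/sriov_rdma: "1"
EOF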

Validating the Fabric

To validate the fabric, we need to assign IP addresses to the servers. Each stretched VLAN acts as a local subnet for all the servers connected to it, so all servers on the same VLAN must have IP addresses in the same subnet.

Then we can verify that the servers can ping each other (a sketch follows).
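
For example, on two servers attached to VLAN 10 (subnet 10.10.0.0/16 in this deployment), the check could look like the following sketch; the interface name and host addresses are illustrative and should match your own addressing plan:

First Server Console

# ip addr add 10.10.1.2/16 dev enp197s0f0

Second Server Console

# ip addr add 10.10.1.4/16 dev enp197s0f0

Then, from the first server, verify reachability across the stretched VLAN:

First Server Console

# ping -c 3 10.10.1.4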

We can also perform the following validations on the switches:

1) That the IP addresses of the VTEPs were successfully propagated by BGP to all the leaf switches.

Repeat the following command on each leaf switch:

Leaf Switch Console


cumulus@leaf1a:mgmt:~$ net show route
show ip route
=============
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

S>* 0.0.0.0/0 [1/0] via 10.1.0.254, vlan1, weight 1, 00:01:09
B>* 10.0.0.1/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:30
B>* 10.0.0.2/32 [20/0] via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29
C>* 10.0.0.101/32 is directly connected, lo, 5d16h51m
B>* 10.0.0.102/32 [200/0] via fe80::1e34:daff:feb4:620, peerlink.4094, weight 1, 00:01:18
B>* 10.0.0.103/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:29
  *                      via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29
B>* 10.0.0.104/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:29
  *                      via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29
C>* 10.0.1.1/32 is directly connected, lo, 00:01:44
C * 10.1.0.0/24 [0/1024] is directly connected, vlan1-v0, 00:01:43
C>* 10.1.0.0/24 is directly connected, vlan1, 00:01:43
C * 10.10.0.0/16 [0/1024] is directly connected, vlan10-v0, 00:01:43
C>* 10.10.0.0/16 is directly connected, vlan10, 00:01:43

show ipv6 route
===============
Codes: K - kernel route, C - connected, S - static, R - RIPng,
       O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
       v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

C * fe80::/64 is directly connected, peerlink.4094, 00:01:20
C * fe80::/64 is directly connected, swp14, 00:01:30
C * fe80::/64 is directly connected, swp13, 00:01:31
C * fe80::/64 is directly connected, vlan10-v0, 00:01:43
C * fe80::/64 is directly connected, vlan1-v0, 00:01:43
C * fe80::/64 is directly connected, vlan20, 00:01:43
C * fe80::/64 is directly connected, vlan10, 00:01:43
C * fe80::/64 is directly connected, vlan1, 00:01:43
C>* fe80::/64 is directly connected, bridge, 00:01:43

2) That the ARP entries were successfully propagated by EVPN (best observed on the spine).

Repeat the following command on each spine switch:

Spine Switch Console


cumulus@spine1:mgmt:~$ net show bgp evpn route type macip
BGP table version is 917, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]

   Network  Next Hop  Metric LocPrf Weight Path  Extended Community
Route Distinguisher: 10.0.0.101:2
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]  10.0.1.1  0 65101 i  RT:65101:20 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[128]:[fe80::1e34:daff:feb4:920]  10.0.1.1  0 65101 i  RT:65101:20 ET:8 Default Gateway ND:Router Flag
Route Distinguisher: 10.0.0.101:3
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[32]:[10.10.0.1]  10.0.1.1  0 65101 i  RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[128]:[fe80::200:ff:fe00:10]  10.0.1.1  0 65101 i  RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]  10.0.1.1  0 65101 i  RT:65101:10 ET:8
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]:[32]:[10.10.0.250]  10.0.1.1  0 65101 i  RT:65101:10 ET:8
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]  10.0.1.1  0 65101 i  RT:65101:10 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[32]:[10.10.0.2]  10.0.1.1  0 65101 i  RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[128]:[fe80::1e34:daff:feb4:920]  10.0.1.1  0 65101 i  RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]  10.0.1.1  0 65101 i  RT:65101:10 ET:8
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]:[32]:[10.10.1.1]  10.0.1.1  0 65101 i  RT:65101:10 ET:8
Route Distinguisher: 10.0.0.102:2
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[32]:[10.10.0.1]  10.0.1.1  0 65101 i  RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[128]:[fe80::200:ff:fe00:10]  10.0.1.1  0 65101 i  RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]  10.0.1.1  0 65101 i  RT:65101:10 ET:8
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]:[32]:[10.10.0.250]  10.0.1.1  0 65101 i  RT:65101:10 ET:8
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[32]:[10.10.0.3]  10.0.1.1  0 65101 i  RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[128]:[fe80::1e34:daff:feb4:620]  10.0.1.1  0 65101 i  RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]  10.0.1.1  0 65101 i  RT:65101:10 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]  10.0.1.1  0 65101 i  RT:65101:10 ET:8
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]:[32]:[10.10.1.1]  10.0.1.1  0 65101 i  RT:65101:10 ET:8
Route Distinguisher: 10.0.0.102:3
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[128]:[fe80::1e34:daff:feb4:620]  10.0.1.1  0 65101 i  RT:65101:20 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]  10.0.1.1  0 65101 i  RT:65101:20 ET:8 MM:0, sticky MAC
Route Distinguisher: 10.0.0.103:2
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]  10.0.0.103  0 65102 i  RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]  10.0.0.103  0 65102 i  RT:65102:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.2]  10.0.0.103  0 65102 i  RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.2]  10.0.0.103  0 65102 i  RT:65102:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.10]  10.0.0.103  0 65102 i  RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.10]  10.0.0.103  0 65102 i  RT:65102:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[128]:[fe80::ba59:9fff:fefa:878e]  10.0.0.103  0 65102 i  RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[128]:[fe80::ba59:9fff:fefa:878e]  10.0.0.103  0 65102 i  RT:65102:10 ET:8
Route Distinguisher: 10.0.0.103:3
*  [2]:[0]:[48]:[5e:60:de:10:be:74]  10.0.0.103  0 65102 i  RT:65102:20 ET:8
*> [2]:[0]:[48]:[5e:60:de:10:be:74]  10.0.0.103  0 65102 i  RT:65102:20 ET:8
*  [2]:[0]:[48]:[5e:60:de:10:be:74]:[128]:[fe80::5c60:deff:fe10:be74]  10.0.0.103  0 65102 i  RT:65102:20 ET:8
*> [2]:[0]:[48]:[5e:60:de:10:be:74]:[128]:[fe80::5c60:deff:fe10:be74]  10.0.0.103  0 65102 i  RT:65102:20 ET:8
Route Distinguisher: 10.0.0.104:2
*  [2]:[0]:[48]:[06:e0:ca:50:81:a3]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
*  [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[32]:[192.168.20.91]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[32]:[192.168.20.91]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
*  [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[128]:[fe80::4e0:caff:fe50:81a3]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[128]:[fe80::4e0:caff:fe50:81a3]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
*  [2]:[0]:[48]:[32:98:4b:9b:91:03]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
*  [2]:[0]:[48]:[32:98:4b:9b:91:03]:[32]:[192.168.20.92]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]:[32]:[192.168.20.92]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
*  [2]:[0]:[48]:[32:98:4b:9b:91:03]:[128]:[fe80::3098:4bff:fe9b:9103]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]:[128]:[fe80::3098:4bff:fe9b:9103]  10.0.0.104  0 65103 i  RT:65103:20 ET:8
Route Distinguisher: 10.0.0.104:3
*  [2]:[0]:[48]:[b8:59:9f:fa:87:6e]  10.0.0.104  0 65103 i  RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]  10.0.0.104  0 65103 i  RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[32]:[10.10.1.4]  10.0.0.104  0 65103 i  RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[32]:[10.10.1.4]  10.0.0.104  0 65103 i  RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[128]:[fe80::ba59:9fff:fefa:876e]  10.0.0.104  0 65103 i  RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[128]:[fe80::ba59:9fff:fefa:876e]  10.0.0.104  0 65103 i  RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:be]  10.0.0.104  0 65103 i  RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]  10.0.0.104  0 65103 i  RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[32]:[10.10.1.5]  10.0.0.104  0 65103 i  RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[32]:[10.10.1.5]  10.0.0.104  0 65103 i  RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[128]:[fe80::ba59:9fff:fefa:87be]  10.0.0.104  0 65103 i  RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[128]:[fe80::ba59:9fff:fefa:87be]  10.0.0.104  0 65103 i  RT:65103:10 ET:8

Displayed 40 prefixes (58 paths) (of requested type)

3) That MLAG is functioning properly on the infrastructure rack leaf switches:

Border Router Switch Console


cumulus@leaf1a:mgmt:~$ net show clag
The peer is alive
     Our Priority, ID, and Role: 1000 1c:34:da:b4:09:20 primary
    Peer Priority, ID, and Role: 32768 1c:34:da:b4:06:20 secondary
          Peer Interface and IP: peerlink.4094 fe80::1e34:daff:feb4:620 (linklocal)
               VxLAN Anycast IP: 10.0.1.1
                      Backup IP: 10.0.0.102 (active)
                     System MAC: 44:38:39:ff:ff:aa

CLAG Interfaces
Our Interface      Peer Interface     CLAG Id   Conflicts              Proto-Down Reason
----------------   ----------------   -------   --------------------   -----------------
           bond1   bond1              1         -                      -
           bond2   bond2              2         -                      -
           bond3   bond3              3         -                      -
           vni10   vni10              -         -                      -
           vni20   vni20              -         -                      -

Done!

Authors


Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for the research and design of complex Kubernetes/OpenShift and Microsoft solutions. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA-accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.


Shachar Dor

Shachar Dor joined the Solutions Lab team after working more than ten years as a software architect at NVIDIA Networking (previously Mellanox Technologies), where he was responsible for the architecture of network management products and solutions. Shachar's focus is on networking technologies, especially around fabric bring-up, configuration, monitoring, and life-cycle management.

Shachar has a strong background in software architecture, design, and programming, gained through work on multiple projects and technologies, including before joining the company.

Last updated on Sep 12, 2023.