Scope
This Reference Deployment Guide (RDG) provides a practical and scalable Ethernet fabric deployment suitable for high-performance workloads in K8s. The fabric carries both the primary K8s network (e.g. Calico) and a secondary high-performance network for RDMA/DPDK, in conjunction with the SR-IOV and RDMA plugins and CNIs.
The proposed fabric configuration supports up to 480 workload servers at its maximum scale and provides non-blocking throughput of up to 200Gbps between pods.
Abbreviations and Acronyms
| Term | Definition | Term | Definition |
|---|---|---|---|
| BGP | Border Gateway Protocol | MLAG | Multi-Chassis Link Aggregation |
| CNI | Container Network Interface | RDMA | Remote Direct Memory Access |
| DMA | Direct Memory Access | TOR | Top of Rack |
| EVPN | Ethernet Virtual Private Network | VLAN | Virtual LAN (Local Area Network) |
| ISL | Inter-Switch Link | VRR | Virtual Router Redundancy |
| K8s | Kubernetes | VTEP | Virtual Tunnel End Point |
| LACP | Link Aggregation Control Protocol | VXLAN | Virtual Extensible LAN |
Introduction
K8s is the industry-standard platform for deploying and orchestrating cloud-native workloads.
Common K8s networking solutions (e.g. the widely used Flannel and Calico CNI plugins) are not optimized for performance and do not take advantage of today's state-of-the-art, hardware-accelerated networking technologies. Current interconnect solutions from NVIDIA can provide up to 200Gbps of throughput at very low latency with minimal load on the server's CPU. To take advantage of these capabilities, an additional network must be provisioned for the pods - a high-speed RDMA-capable network.
This document demonstrates how to deploy, enable and configure a high-speed, hardware-accelerated network fabric in a K8s cluster, providing both the primary network and a secondary RDMA network on the same wire. The network fabric also includes highly-available border router functionality which provides in/out connectivity to the cluster (e.g. access to the Internet).
This document is intended for K8s administrators who want to enable a high-speed fabric for applications running on top of K8s, such as big data, machine learning, storage, and database solutions.
The document begins with the design of the fabric and of the K8s deployment, then continues with the actual deployment and configuration steps, concluding with a performance test that demonstrates the benefits of the solution.
References
Solution Architecture
Key Components and Technologies
NVIDIA ConnectX SmartNICs
10/25/40/50/100/200 and 400G Ethernet Network Adapters
The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offers advanced hardware offloads and accelerations.
NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.
The NVIDIA® LinkX® product family of cables and transceivers provides the industry's most complete line of 10, 25, 40, 50, 100, 200, and 400GbE Ethernet and 100, 200, and 400Gb/s InfiniBand products for cloud, HPC, hyperscale, enterprise, telco, storage, and artificial intelligence data center applications.
NVIDIA Spectrum Ethernet Switches
Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.
Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects.
NVIDIA combines the benefits of NVIDIA Spectrum™ switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux , SONiC and NVIDIA Onyx®.
NVIDIA® Cumulus® Linux is the industry's most innovative open network operating system that allows you to automate, customize, and scale your data center network like no other.
RDMA
RDMA is a technology that allows computers in a network to exchange data without involving the processor, cache or operating system of either computer.
Like locally based DMA, RDMA improves throughput and performance and frees up compute resources.
Kubernetes
Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.
Kubespray
Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes cluster configuration management tasks, and provides:
A highly available cluster
Composable attributes
Support for most popular Linux distributions
Logical Design
The physical servers used in this document:
1 x Deployment Node
1 x Master Node
4 x Worker Nodes; each with 1 x ConnectX-6 NIC
The deployment of the fabric is based on a 2-level leaf-spine topology.
The deployment includes two separate physical networks:
A high-speed Ethernet fabric
An IPMI/bare-metal management network (not covered in this document)
This document covers a single K8s controller deployment scenario. For high-availability cluster deployment, please refer to https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md
Network / Fabric Design
This document demonstrates a minimalistic scale of 2 workload racks with 2 servers each (as shown in the diagram below):
By using the same design, the fabric can be scaled to accommodate up to 480 workload servers using up to 30 workload racks with up to 16 servers each. Every workload rack uses a single leaf switch (TOR). The infrastructure rack consists of a highly-available border router (an MLAG pair) which provides a connection to an external gateway/router and to a maximum of 14 infrastructure servers.
The high-speed network consists of two logical segments:
The management network and the primary K8s network (used by Calico) - VLAN10
The secondary K8s network which provides RDMA to the pods - VLAN20
The fabric implements a VXLAN overlay network with a BGP EVPN control plane, which enables the "stretching" of the VLANs across all the racks.
Every leaf switch has a VTEP which takes care of VXLAN encapsulation and decapsulation. The communication between the VTEPs is done by routing through the spines, controlled by a BGP control plane.
The infrastructure rack (as seen on the left in the illustration below) has two leaf switches that act as a highly available border router which provides both highly available connectivity for the infrastructure servers (the deployment server and the K8s master node) and redundant routing into and out of the cluster through a gateway node. This high availability is achieved by an MLAG configuration, the use of LACP bonds, and a redundant router mechanism which uses VRR.
Below is a diagram demonstrating the maximum possible scale for a non-blocking deployment that uses 200GbE to the host (30 racks, 16 servers each using 16 spines and 32 leafs).
Please note that in this setup, the MSN2100 switches in the infrastructure rack should be replaced by MSN2700 switches (having 32 ports instead of 16 ports):
In the case of a maximum scale fabric (as shown above), there will be 16 x 200Gbps links going up from each leaf to the spines and therefore a maximum of 16 x 200Gbps links going to servers in each rack.
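The scale and non-blocking claims above can be sanity-checked with a quick calculation (all numbers are taken from the maximum-scale design described in this section):

```shell
# Maximum-scale fabric arithmetic (values from the design above)
racks=30                 # workload racks, one leaf (TOR) each
servers_per_rack=16      # 200GbE-attached servers per rack
uplinks_per_leaf=16      # 200Gbps links from each leaf to the spines

total_servers=$((racks * servers_per_rack))
echo "total workload servers: ${total_servers}"          # 480

# Non-blocking: per-rack uplink capacity equals per-rack server capacity
downlink_gbps=$((servers_per_rack * 200))
uplink_gbps=$((uplinks_per_leaf * 200))
echo "per-rack downlink/uplink Gbps: ${downlink_gbps}/${uplink_gbps}"
```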
Software Stack Components
Please make sure to upgrade all the NVIDIA software components to their latest released version.
Bill of Materials
Please note that older MSN2100 switches with hardware revision 0 (zero) do not support the functionality presented in this document. You can verify that your switch is newer by running the "decode-syseeprom" command and checking the "Device Version" field (must be greater than zero).
Deployment and Configuration
Node and Switch Definitions
These are the definitions and parameters used for deploying the demonstrated fabric:
Spines

| hostname | router id | autonomous system | downlinks |
|---|---|---|---|
| spine1 (MSN3700) | 10.0.0.1/32 | 65100 | swp1-6 |
| spine2 (MSN3700) | 10.0.0.2/32 | 65100 | swp1-6 |
Leafs

| hostname | router id | autonomous system | uplinks | peers on spines |
|---|---|---|---|---|
| leaf1a (MSN2100) | 10.0.0.101/32 | 65101 | swp13-14 | swp1 |
| leaf1b (MSN2100) | 10.0.0.102/32 | 65101 | swp13-14 | swp2 |
| leaf2 (MSN3700) | 10.0.0.103/32 | 65102 | swp29-32 | swp3-4 |
| leaf3 (MSN3700) | 10.0.0.104/32 | 65103 | swp29-32 | swp5-6 |
Workload Server Ports

| rack id | vlan id | access ports | trunk ports |
|---|---|---|---|
| 2 | 10 | swp1-4 | |
| 2 | 20 | | swp1-4 |
| 3 | 10 | swp1-4 | |
| 3 | 20 | | swp1-4 |
Border Routers (Infrastructure Rack TORs)

| hostname | isl ports | clag system mac | clag priority | vxlan anycast ip |
|---|---|---|---|---|
| leaf1a | swp15-16 | 44:38:39:FF:FF:AA | 1000 | 10.10.11.1 |
| leaf1b | swp15-16 | 44:38:39:FF:FF:AA | 32768 | 10.10.11.1 |
Border VLANs

| vlan id | virt mac | virt ip | primary router ip | secondary router ip |
|---|---|---|---|---|
| 10 | 00:00:00:00:00:10 | 10.10.0.1/16 | 10.10.0.2/16 | 10.10.0.3/16 |
| 1 | 00:00:00:00:00:01 | 10.1.0.1/24 | 10.1.0.2/24 | 10.1.0.3/24 |
Infrastructure Server Ports

| vlan id | port names | bond names |
|---|---|---|
| 1 | swp1 | bond1 |
| 10 | swp2, swp3 | bond2, bond3 |
Hosts

| Rack | Server/Switch type | Server/Switch name | IP and NICs | Default Gateway |
|---|---|---|---|---|
| Rack1 (Infrastructure) | Deployment Node | depserver | bond0 (enp197s0f0, enp197s0f1) 10.10.0.250/16 | 10.10.0.1 |
| Rack1 (Infrastructure) | Master Node | node1 | bond0 (enp197s0f0, enp197s0f1) 10.10.1.1/16 | 10.10.0.1 |
| Rack2 | Worker Node | node2 | enp197s0f0 10.10.1.2/16 | 10.10.0.1 |
| Rack2 | Worker Node | node3 | enp197s0f0 10.10.1.3/16 | 10.10.0.1 |
| Rack3 | Worker Node | node4 | enp197s0f0 10.10.1.4/16 | 10.10.0.1 |
| Rack3 | Worker Node | node5 | enp197s0f0 10.10.1.5/16 | 10.10.0.1 |
Wiring
This is the wiring principle for the workload racks:
Each server in the rack is wired to the leaf (or "TOR") switch
Every leaf is wired to all the spines
This is the wiring principle for the infrastructure rack:
Each server in the rack is wired to two leaf (or "TOR") switches
Every leaf is wired to all the spines
Fabric Configuration
Updating Cumulus Linux
As a best practice, make sure to use the latest released Cumulus Linux NOS version.
Please see this guide on how to upgrade Cumulus Linux.
Configuring the Cumulus Linux Switch
Make sure your Cumulus Linux switch has passed its initial configuration stages (please see the Quick-Start Guide for version 4.3 for additional information):
License installation
Creation of switch interfaces (e.g. swp1-32)
Following is the configuration for the switches:
Please note that you can add the command "net del all" before the following commands in order to clear any previous configuration.
Spine1 Console
net add bgp autonomous-system 65100
net add loopback lo ip address 10.0.0.1/32
net add bgp router-id 10.0.0.1
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add interface swp1 mtu 9216
net add bgp neighbor swp1 interface peer-group underlay
net add interface swp2 mtu 9216
net add bgp neighbor swp2 interface peer-group underlay
net add interface swp3 mtu 9216
net add bgp neighbor swp3 interface peer-group underlay
net add interface swp4 mtu 9216
net add bgp neighbor swp4 interface peer-group underlay
net add interface swp5 mtu 9216
net add bgp neighbor swp5 interface peer-group underlay
net add interface swp6 mtu 9216
net add bgp neighbor swp6 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net commit
Spine2 Console
net add bgp autonomous-system 65100
net add loopback lo ip address 10.0.0.2/32
net add bgp router-id 10.0.0.2
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add interface swp1 mtu 9216
net add bgp neighbor swp1 interface peer-group underlay
net add interface swp2 mtu 9216
net add bgp neighbor swp2 interface peer-group underlay
net add interface swp3 mtu 9216
net add bgp neighbor swp3 interface peer-group underlay
net add interface swp4 mtu 9216
net add bgp neighbor swp4 interface peer-group underlay
net add interface swp5 mtu 9216
net add bgp neighbor swp5 interface peer-group underlay
net add interface swp6 mtu 9216
net add bgp neighbor swp6 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net commit
Leaf1A Console
net add bgp autonomous-system 65101
net add bgp router-id 10.0.0.101
net add loopback lo ip address 10.0.0.101/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise ipv4 unicast
net add bridge bridge ports peerlink
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.101
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.101
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.101
net add bridge bridge ports vni20
net add loopback lo clag vxlan-anycast-ip 10.10.11.1
net add bgp l2vpn evpn advertise-default-gw
net add bond peerlink bond slaves swp15,swp16
net add interface peerlink.4094 clag args --initDelay 10
net add interface peerlink.4094 clag backup-ip 10.0.0.102
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag priority 1000
net add interface peerlink.4094 clag sys-mac 44:38:39:FF:FF:AA
net add bgp neighbor peerlink.4094 interface remote-as internal
net add bgp l2vpn evpn neighbor peerlink.4094 activate
net add vlan 10 ip address 10.10.0.2/16
net add vlan 10 ip address-virtual 00:00:00:00:00:10 10.10.0.1/16
net add vlan 1 ip address 10.1.0.2/24
net add vlan 1 ip address-virtual 00:00:00:00:00:01 10.1.0.1/24
net commit
Leaf1B Console
net add bgp autonomous-system 65101
net add bgp router-id 10.0.0.102
net add loopback lo ip address 10.0.0.102/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp13 mtu 9216
net add bgp neighbor swp13 interface peer-group underlay
net add interface swp14 mtu 9216
net add bgp neighbor swp14 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise ipv4 unicast
net add bridge bridge ports peerlink
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.102
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.102
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.102
net add bridge bridge ports vni20
net add loopback lo clag vxlan-anycast-ip 10.10.11.1
net add bgp l2vpn evpn advertise-default-gw
net add bond peerlink bond slaves swp15,swp16
net add interface peerlink.4094 clag args --initDelay 10
net add interface peerlink.4094 clag backup-ip 10.0.0.101
net add interface peerlink.4094 clag peer-ip linklocal
net add interface peerlink.4094 clag priority 32768
net add interface peerlink.4094 clag sys-mac 44:38:39:FF:FF:AA
net add bgp neighbor peerlink.4094 interface remote-as internal
net add bgp l2vpn evpn neighbor peerlink.4094 activate
net add vlan 10 ip address 10.10.0.3/16
net add vlan 10 ip address-virtual 00:00:00:00:00:10 10.10.0.1/16
net add vlan 1 ip address 10.1.0.3/24
net add vlan 1 ip address-virtual 00:00:00:00:00:01 10.1.0.1/24
net commit
Leaf2 Console
net add bgp autonomous-system 65102
net add bgp router-id 10.0.0.103
net add loopback lo ip address 10.0.0.103/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp29 mtu 9216
net add bgp neighbor swp29 interface peer-group underlay
net add interface swp30 mtu 9216
net add bgp neighbor swp30 interface peer-group underlay
net add interface swp31 mtu 9216
net add bgp neighbor swp31 interface peer-group underlay
net add interface swp32 mtu 9216
net add bgp neighbor swp32 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise ipv4 unicast
net add bridge bridge ports swp1,swp2,swp3,swp4
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.103
net add interface swp1,swp2,swp3,swp4 bridge pvid 10
net add interface swp1,swp2,swp3,swp4 mtu 8950
net add interface swp1,swp2,swp3,swp4 bridge vids 20
net add interface swp1,swp2,swp3,swp4 mtu 8950
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.103
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.103
net add bridge bridge ports vni20
net commit
Leaf3 Console
net add bgp autonomous-system 65103
net add bgp router-id 10.0.0.104
net add loopback lo ip address 10.0.0.104/32
net add routing defaults datacenter
net add routing log syslog informational
net add routing service integrated-vtysh-config
net add bgp bestpath as-path multipath-relax
net add bgp neighbor underlay peer-group
net add bgp neighbor underlay remote-as external
net add bgp neighbor underlay capability extended-nexthop
net add interface swp29 mtu 9216
net add bgp neighbor swp29 interface peer-group underlay
net add interface swp30 mtu 9216
net add bgp neighbor swp30 interface peer-group underlay
net add interface swp31 mtu 9216
net add bgp neighbor swp31 interface peer-group underlay
net add interface swp32 mtu 9216
net add bgp neighbor swp32 interface peer-group underlay
net add bgp ipv4 unicast redistribute connected
net add bgp ipv6 unicast neighbor underlay activate
net add bgp l2vpn evpn neighbor underlay activate
net add bgp l2vpn evpn advertise-all-vni
net add bgp l2vpn evpn advertise ipv4 unicast
net add bridge bridge ports swp1,swp2,swp3,swp4
net add bridge bridge vlan-aware
net add loopback lo vxlan local-tunnelip 10.0.0.104
net add interface swp1,swp2,swp3,swp4 bridge pvid 10
net add interface swp1,swp2,swp3,swp4 mtu 8950
net add interface swp1,swp2,swp3,swp4 bridge vids 20
net add interface swp1,swp2,swp3,swp4 mtu 8950
net add bridge bridge vids 10
net add vlan 10 vlan-id 10
net add vlan 10 vlan-raw-device bridge
net add vxlan vni10 vxlan id 10
net add vxlan vni10 bridge access 10
net add vxlan vni10 bridge arp-nd-suppress on
net add vxlan vni10 bridge learning off
net add vxlan vni10 stp bpduguard
net add vxlan vni10 stp portbpdufilter
net add vxlan vni10 vxlan local-tunnelip 10.0.0.104
net add bridge bridge ports vni10
net add bridge bridge vids 20
net add vlan 20 vlan-id 20
net add vlan 20 vlan-raw-device bridge
net add vxlan vni20 vxlan id 20
net add vxlan vni20 bridge access 20
net add vxlan vni20 bridge arp-nd-suppress on
net add vxlan vni20 bridge learning off
net add vxlan vni20 stp bpduguard
net add vxlan vni20 stp portbpdufilter
net add vxlan vni20 vxlan local-tunnelip 10.0.0.104
net add bridge bridge ports vni20
net commit
Connecting the Infrastructure Servers
Infrastructure servers (deployment and K8s master servers) are placed in the infrastructure rack.
This will require the following additional configuration steps:
Adding the ports connected to the servers to an MLAG bond
Placing the bond in the relevant VLAN
In our case, the servers are connected to ports swp2 and swp3 on both leafs (Leaf1A and Leaf1B) and use VLAN10, which we created on the border leafs. The commands to run on both Leaf1A and Leaf1B are:
Leaf1A and Leaf1B Console
net add interface swp2 mtu 8950
net add bond bond2 bond slaves swp2
net add bond bond2 mtu 8950
net add bond bond2 clag id 2
net add bond bond2 bridge access 10
net add bond bond2 bond lacp-bypass-allow
net add bond bond2 stp bpduguard
net add bond bond2 stp portadminedge
net add interface swp3 mtu 8950
net add bond bond3 bond slaves swp3
net add bond bond3 mtu 8950
net add bond bond3 clag id 3
net add bond bond3 bridge access 10
net add bond bond3 bond lacp-bypass-allow
net add bond bond3 stp bpduguard
net add bond bond3 stp portadminedge
net commit
Connecting an External Gateway to the Infrastructure Rack
In our setup, we will connect an external gateway machine (10.1.0.254/24) over an LACP bond to swp1 of both border leafs (via VLAN1).
This gateway will be used to access any external network (e.g. the Internet). The configuration commands on both border leafs are as follows:
Leaf1A and Leaf1B Console
net add interface swp1 mtu 8950
net add bond bond1 bond slaves swp1
net add bond bond1 mtu 8950
net add bond bond1 clag id 1
net add bond bond1 bridge access 1
net add bond bond1 bond lacp-bypass-allow
net add bond bond1 stp bpduguard
net add bond bond1 stp portadminedge
net add routing route 0.0.0.0/0 10.1.0.254
net commit
Please note that the gateway machine should be statically configured with a route to our primary network (10.10.0.0/16) via its relevant interface.
Host Configuration
Make sure that SR-IOV is enabled in the BIOS settings of the worker node servers and that the servers are tuned for maximum performance.
All worker nodes must have the same PCIe placement for the NIC and must expose the same interface name.
Our hosts run Ubuntu Linux; the configuration is as follows:
Installing and Updating the OS
Make sure the Ubuntu Server 20.04 operating system is installed on all servers with the OpenSSH server package, and create a non-root user account with passwordless sudo privileges.
Also make sure to assign the correct network configuration to the hosts (IP addresses, default gateway, DNS server, NTP server) and to create bonds on the nodes in the infrastructure rack (master node and deployment node).
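On the infrastructure nodes, the bond can be created with netplan, matching the MLAG/LACP bond configured on the border leafs. The following is a minimal sketch only; the interface names (enp197s0f0/enp197s0f1), the address, and the file name are assumptions that must be adapted to your environment:

```yaml
# /etc/netplan/00-installer-config.yaml (illustrative - adapt names and addresses)
network:
  version: 2
  ethernets:
    enp197s0f0: {}
    enp197s0f1: {}
  bonds:
    bond0:
      interfaces: [enp197s0f0, enp197s0f1]
      parameters:
        mode: 802.3ad        # LACP, matching the MLAG bond on the leafs
      addresses:
        - 10.10.1.1/16
      gateway4: 10.10.0.1
      mtu: 8950
```

Apply the configuration with "sudo netplan apply".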
Update the Ubuntu software packages by running the following commands:
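A typical update sequence (assuming the standard apt tooling) is:

```shell
# Refresh the package lists and upgrade all installed packages
sudo apt update
sudo apt upgrade -y
sudo reboot   # reboot if a new kernel was installed
```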
Non-root User Account Prerequisites
In this solution, we appended the following line to the end of the /etc/sudoers file:
Server Console
$ sudo vi /etc/sudoers
#includedir /etc/sudoers.d
#K8s cluster deployment user with sudo privileges without password
user ALL=(ALL) NOPASSWD:ALL
SR-IOV Activation and Virtual Functions Configuration
Use the following commands to install the mstflint tool and verify that SRIOV is enabled and that there are enough virtual functions on the NIC:
Worker Node Console
# apt install mstflint
# lspci | grep Mellanox
c5:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
c5:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
# mstconfig -d c5:00.0 q | grep SRIOV_EN
SRIOV_EN True(1)
# mstconfig -d c5:00.0 q | grep NUM_OF_VFS
NUM_OF_VFS 8
In case SR-IOV is not enabled or the number of VFs is insufficient, configure them using the following commands (and then reboot the machine):
Worker Node Console
# mstconfig -d c5:00.0 -y set SRIOV_EN=True NUM_OF_VFS=8
# reboot
The above operation enables SR-IOV and sets the maximum number of supported VFs. The actual activation of the virtual functions is performed below.
Installing rdma-core and Setting RDMA to "Exclusive Mode"
Install the rdma-core package:
Worker Node Console
# apt install rdma-core -y
Set netns to exclusive mode for providing namespace isolation on the high-speed interface. This way, each pod can only see and access its own virtual functions.
Create the following file:
Worker Node Console
# vi /etc/modprobe.d/ib_core.conf
# Set netns to exclusive mode for namespace isolation
options ib_core netns_mode=0
Then run the commands below:
Worker Node Console
# update-initramfs -u
# reboot
After the node comes back, check netns mode:
Worker Node Console
# rdma system
netns exclusive
Setting MTU on the Physical Port
We need to set the MTU on the physical port of the server to allow for optimized throughput.
Since the fabric is using VXLAN overlay, we will use the maximum MTU of 9216 on the core links and an MTU of 8950 on the edge links (servers links), making sure that the VXLAN header added to the packets will not cause fragmentation.
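The MTU choice can be sanity-checked with a quick calculation: VXLAN encapsulation over an IPv4 underlay adds roughly 50 bytes (outer Ethernet 14 + outer IPv4 20 + UDP 8 + VXLAN 8), so an 8950-byte edge MTU keeps encapsulated frames well under the 9216-byte core MTU:

```shell
# VXLAN encapsulation overhead (IPv4 underlay, untagged outer frame)
outer_eth=14; outer_ip=20; udp=8; vxlan=8
overhead=$((outer_eth + outer_ip + udp + vxlan))   # 50 bytes
echo "VXLAN overhead: ${overhead} bytes"

edge_mtu=8950
inner_eth=14   # inner Ethernet header carried inside the tunnel
encap_frame=$((edge_mtu + inner_eth + overhead))
echo "encapsulated frame: ${encap_frame} bytes (core MTU is 9216)"
```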
In order to configure the MTU on the server ports, please edit the netplan config file (in this example on node2):
Worker Node Console
# vi /etc/netplan/00-installer-config.yaml
network:
ethernets:
enp197s0f0:
addresses:
- 10.10.1.2
gateway4: 10.10.0.1
mtu: 8950
version: 2
Please note that you can use the "rdma link" command to identify the name assigned to the high-speed interface, for example:
# rdma link
link rocep197s0f0/1 state ACTIVE physical_state LINK_UP netdev enp197s0f0
Then apply it:
Worker Node Console
# netplan apply
Virtual Function Activation
Now we will activate 8 virtual functions using the following command:
Worker Node Console
# PF_NAME=enp197s0f0
# echo 8 > /sys/class/net/${PF_NAME}/device/sriov_numvfs
Please note that the above configuration is not persistent!
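One way to make the VF configuration persistent across reboots is a small systemd oneshot unit. This is a sketch under the assumption that the PF is named enp197s0f0 and 8 VFs are wanted; adapt the interface name, VF count, and unit name to your nodes:

```shell
# Create an illustrative systemd unit that recreates the VFs at boot
cat <<'EOF' | sudo tee /etc/systemd/system/sriov-vfs.service
[Unit]
Description=Create SR-IOV virtual functions on the high-speed PF
After=network-pre.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo 8 > /sys/class/net/enp197s0f0/device/sriov_numvfs'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable sriov-vfs.service
```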
NIC Firmware Upgrade
It is recommended that you upgrade the NIC firmware on the worker nodes to the latest released version.
Please make sure to use the root account using:
Worker Node Console
$ sudo su -
Please make sure to download the "mlxup" program to each worker node and install the latest firmware for the NIC (requires Internet connectivity; please check the official download page):
Worker Node Console
# wget http://www.mellanox.com/downloads/firmware/mlxup/4.15.2/SFX/linux_x64/mlxup
# chmod 777 mlxup
# ./mlxup -u --online
K8s Cluster Deployment and Configuration
The K8s cluster in this solution will be installed using Kubespray with a non-root user account from the Deployment Node.
SSH Private Key and SSH Passwordless Login
Log in to the Deployment Node as the deployment user (in this case - user) and create an SSH key pair for configuring passwordless authentication by running the following commands:
Deployment Node Console
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Created directory '/home/user/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:PaZkvxV4K/h8q32zPWdZhG1VS0DSisAlehXVuiseLgA user@depl-node
The key's randomart image is:
+---[RSA 2048]----+
| ...+oo+o..o|
| .oo .o. o|
| . .. . o +.|
| E . o + . +|
| . S = + o |
| . o = + o .|
| . o.o + o|
| ..+.*. o+o|
| oo*ooo.++|
+----[SHA256]-----+
Copy your SSH public key (e.g. ~/.ssh/id_rsa.pub) to all nodes in your deployment by running the following command (example):
Deployment Node Console
$ ssh-copy-id -i ~/.ssh/id_rsa user@10.10.1.1
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/user/.ssh/id_rsa.pub"
The authenticity of host '10.10.1.1 (10.10.1.1)' can't be established.
ECDSA key fingerprint is SHA256:uyglY5g0CgPNGDm+XKuSkFAbx0RLaPijpktANgXRlD8.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
user@10.10.1.1's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'user@10.10.1.1'"
and check to make sure that only the key(s) you wanted were added.
Verify that you have password-less SSH connectivity to all nodes in your deployment by running the following command (example):
Deployment Node Console
$ ssh user@10.10.1.1
Kubespray Deployment and Configuration
To install the dependencies for running Kubespray with Ansible on the Deployment Node, run the following commands:
Deployment Node Console
$ cd ~
$ sudo apt -y install python3-pip jq
$ git clone https://github.com/kubernetes-sigs/kubespray.git
$ cd kubespray
$ sudo pip3 install -r requirements.txt
Create a new cluster configuration. The default folder for subsequent commands is ~/kubespray.
Replace the IP addresses below with your nodes' IP addresses:
Deployment Node Console
$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(10.10.1.1 10.10.1.2 10.10.1.3 10.10.1.4 10.10.1.5)
$ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
As a result, the inventory/mycluster/hosts.yaml file will be created.
Review and change the host configuration in the file. Below is an example for this deployment:
inventory/mycluster/hosts.yaml
$ sudo vi inventory/mycluster/hosts.yaml
all:
  hosts:
    node1:
      ansible_host: 10.10.1.1
      ip: 10.10.1.1
      access_ip: 10.10.1.1
    node2:
      ansible_host: 10.10.1.2
      ip: 10.10.1.2
      access_ip: 10.10.1.2
    node3:
      ansible_host: 10.10.1.3
      ip: 10.10.1.3
      access_ip: 10.10.1.3
    node4:
      ansible_host: 10.10.1.4
      ip: 10.10.1.4
      access_ip: 10.10.1.4
    node5:
      ansible_host: 10.10.1.5
      ip: 10.10.1.5
      access_ip: 10.10.1.5
  children:
    kube_control_plane:
      hosts:
        node1:
    kube_node:
      hosts:
        node2:
        node3:
        node4:
        node5:
    etcd:
      hosts:
        node1:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
    calico_rr:
      hosts: {}
Review and change the cluster installation parameters in the files inventory/mycluster/group_vars/all/all.yml and inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml.
In inventory/mycluster/group_vars/all/all.yml, uncomment the following line so that metrics services can collect data about cluster resource usage:
Deployment Node Console
$ sudo vi inventory/mycluster/group_vars/all/all.yml
## The read-only port for the Kubelet to serve on with no authentication/authorization. Uncomment to enable.
kube_read_only_port: 10255
In inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml, set kube_version to v1.21.0, set container_manager to containerd, and enable multi-networking by setting kube_network_plugin_multus: true.
Deployment Node Console
$ sudo vi inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
…
## Change this to use another Kubernetes version, e.g. a current beta release
kube_version: v1.21.0
…
## Container runtime
## docker for docker, crio for cri-o and containerd for containerd.
container_manager: containerd
…
# Setting multi_networking to true will install Multus: https://github.com/intel/multus-cni
kube_network_plugin_multus: true
…
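If you prefer not to edit the file by hand, the same three settings can be applied non-interactively; the following is a sketch that assumes the default Kubespray inventory layout used in this guide and keeps a backup of the original file:

```shell
# Apply the three k8s-cluster.yml settings in place (backup kept as *.bak).
F=inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
if [ -f "$F" ]; then
  sed -i.bak \
    -e 's/^kube_version:.*/kube_version: v1.21.0/' \
    -e 's/^container_manager:.*/container_manager: containerd/' \
    -e 's/^kube_network_plugin_multus:.*/kube_network_plugin_multus: true/' \
    "$F"
  # Show the resulting values for a quick sanity check.
  grep -E '^(kube_version|container_manager|kube_network_plugin_multus):' "$F"
fi
```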
In inventory/mycluster/group_vars/etcd.yml set the etcd_deployment_type to host:
Deployment Node Console
$ sudo vi inventory/mycluster/group_vars/etcd.yml
...
## Settings for etcd deployment type
etcd_deployment_type: host
Deploying the cluster using Kubespray Ansible Playbook
Run the following line to start the deployment process:
Deployment Node Console
$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
The deployment takes a while to complete. Please make sure no errors are encountered.
A successful result should look something like the following:
Deployment Node Console
PLAY RECAP ***********************************************************************************************************************************************************************************
localhost : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
node1 : ok=584 changed=133 unreachable=0 failed=0 skipped=1151 rescued=0 ignored=2
node2 : ok=387 changed=86 unreachable=0 failed=0 skipped=634 rescued=0 ignored=1
node3 : ok=387 changed=86 unreachable=0 failed=0 skipped=633 rescued=0 ignored=1
node4 : ok=387 changed=86 unreachable=0 failed=0 skipped=633 rescued=0 ignored=1
node5 : ok=387 changed=86 unreachable=0 failed=0 skipped=633 rescued=0 ignored=1
Thursday 20 May 2021 07:59:23 +0000 (0:00:00.071) 0:11:57.632 **********
===============================================================================
kubernetes/control-plane : kubeadm | Initialize first master ------------------------------------------------------------------------------------------------------------------------- 77.14s
kubernetes/kubeadm : Join to cluster ------------------------------------------------------------------------------------------------------------------------------------------------- 36.82s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 32.52s
download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 25.75s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.73s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.15s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.00s
download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 20.24s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 16.27s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 15.36s
container-engine/containerd : ensure containerd packages are installed --------------------------------------------------------------------------------------------------------------- 13.29s
download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 12.29s
kubernetes/preinstall : Install packages requirements -------------------------------------------------------------------------------------------------------------------------------- 12.15s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 11.40s
download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 11.05s
download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 10.19s
kubernetes/control-plane : Master | wait for kube-scheduler -------------------------------------------------------------------------------------------------------------------------- 10.02s
download_container | Download image if required --------------------------------------------------------------------------------------------------------------------------------------- 9.36s
download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------- 9.15s
reload etcd --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 8.65s
Now that the K8s cluster is deployed, connect to the K8s Master Node for the following sections.
Please make sure to use the root account:
Master Node Console
$ sudo su -
K8s Deployment Verification
Below is example output from a K8s cluster deployed with the default Kubespray configuration, which uses the Calico CNI plugin for the primary K8s network.
To ensure that the K8s cluster is installed correctly, run the following commands:
Master Node Console
# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node1 Ready control-plane,master 6h40m v1.21.0 10.10.1.1 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4
node2 Ready <none> 6h39m v1.21.0 10.10.1.2 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4
node3 Ready <none> 6h39m v1.21.0 10.10.1.3 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4
node4 Ready <none> 6h39m v1.21.0 10.10.1.4 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4
node5 Ready <none> 6h39m v1.21.0 10.10.1.5 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4
# kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-7797d7b677-4kndh 1/1 Running 0 6h39m 10.10.1.3 node3 <none> <none>
calico-node-6xqxn 1/1 Running 1 6h40m 10.10.1.5 node5 <none> <none>
calico-node-7st5x 1/1 Running 0 6h40m 10.10.1.2 node2 <none> <none>
calico-node-8qdpx 1/1 Running 0 6h40m 10.10.1.1 node1 <none> <none>
calico-node-qjflr 1/1 Running 2 6h40m 10.10.1.4 node4 <none> <none>
calico-node-x68rz 1/1 Running 0 6h40m 10.10.1.3 node3 <none> <none>
coredns-7fcf4fd7c7-7p6k5 1/1 Running 0 6h7m 10.233.92.1 node3 <none> <none>
coredns-7fcf4fd7c7-mwfd6 1/1 Running 0 6h39m 10.233.90.1 node1 <none> <none>
dns-autoscaler-7df78bfcfb-xl48v 1/1 Running 0 6h39m 10.233.90.2 node1 <none> <none>
kube-apiserver-node1 1/1 Running 0 6h41m 10.10.1.1 node1 <none> <none>
kube-controller-manager-node1 1/1 Running 0 6h41m 10.10.1.1 node1 <none> <none>
kube-multus-ds-amd64-8dmpv 1/1 Running 0 6h39m 10.10.1.3 node3 <none> <none>
kube-multus-ds-amd64-b74t4 1/1 Running 1 6h39m 10.10.1.5 node5 <none> <none>
kube-multus-ds-amd64-nvrl9 1/1 Running 2 6h39m 10.10.1.4 node4 <none> <none>
kube-multus-ds-amd64-s9lr4 1/1 Running 0 6h39m 10.10.1.2 node2 <none> <none>
kube-multus-ds-amd64-zrxcs 1/1 Running 0 6h39m 10.10.1.1 node1 <none> <none>
kube-proxy-bq9xg 1/1 Running 2 6h40m 10.10.1.4 node4 <none> <none>
kube-proxy-bs8br 1/1 Running 0 6h40m 10.10.1.3 node3 <none> <none>
kube-proxy-fxs88 1/1 Running 0 6h40m 10.10.1.1 node1 <none> <none>
kube-proxy-rts6t 1/1 Running 1 6h40m 10.10.1.5 node5 <none> <none>
kube-proxy-vml29 1/1 Running 0 6h40m 10.10.1.2 node2 <none> <none>
kube-scheduler-node1 1/1 Running 0 6h41m 10.10.1.1 node1 <none> <none>
nginx-proxy-node2 1/1 Running 0 6h40m 10.10.1.2 node2 <none> <none>
nginx-proxy-node3 1/1 Running 0 6h40m 10.10.1.3 node3 <none> <none>
nginx-proxy-node4 1/1 Running 2 6h40m 10.10.1.4 node4 <none> <none>
nginx-proxy-node5 1/1 Running 1 6h40m 10.10.1.5 node5 <none> <none>
nodelocaldns-kdsg5 1/1 Running 2 6h39m 10.10.1.4 node4 <none> <none>
nodelocaldns-mhh9g 1/1 Running 0 6h39m 10.10.1.2 node2 <none> <none>
nodelocaldns-nbhnr 1/1 Running 0 6h39m 10.10.1.3 node3 <none> <none>
nodelocaldns-nkj9h 1/1 Running 0 6h39m 10.10.1.1 node1 <none> <none>
nodelocaldns-rfnqk 1/1 Running 1 6h39m 10.10.1.5 node5 <none> <none>
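The Ready column shown above can also be checked in one scripted step. The following is a sketch: it prints any node whose status is not Ready and exits non-zero in that case.

```shell
# Report any node that is not in Ready state (requires cluster access).
if command -v kubectl >/dev/null; then
  kubectl get nodes --no-headers \
    | awk '$2 != "Ready" { bad = 1; print $1 " is " $2 } END { exit bad }'
fi
```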
Installing the Whereabouts CNI
The Whereabouts CNI plugin provides cluster-wide IP address management (IPAM) for the secondary network. You can install it with a daemon set, using the following commands:
Master Node Console
# kubectl apply -f https://raw.githubusercontent.com/dougbtv/whereabouts/master/doc/daemonset-install.yaml
# kubectl apply -f https://raw.githubusercontent.com/dougbtv/whereabouts/master/doc/whereabouts.cni.cncf.io_ippools.yaml
To ensure the plugin is installed correctly, run the following command:
Master Node Console
# kubectl get pods -A | grep whereabouts
kube-system whereabouts-74nwr 1/1 Running 0 6h4m
kube-system whereabouts-7pq2l 1/1 Running 0 6h4m
kube-system whereabouts-gbpht 1/1 Running 0 6h4m
kube-system whereabouts-slbnj 1/1 Running 0 6h4m
kube-system whereabouts-tw7dc 1/1 Running 0 6h4m
Deploying the SRIOV Device Plugin and CNI
Prepare the following files and apply them:
Master Node Console
# vi configMap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sriovdp-config
  namespace: kube-system
data:
  config.json: |
    {
      "resourceList": [
        {
          "resourceName": "sriov_rdma",
          "resourcePrefix": "nvidia.com",
          "selectors": {
            "vendors": ["15b3"],
            "pfNames": ["enp197s0f0"],
            "isRdma": true
          }
        }
      ]
    }
sriovdp-daemonset.yaml
# vi sriovdp-daemonset.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sriov-device-plugin
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-sriov-device-plugin-amd64
  namespace: kube-system
  labels:
    tier: node
    app: sriovdp
spec:
  selector:
    matchLabels:
      name: sriov-device-plugin
  template:
    metadata:
      labels:
        name: sriov-device-plugin
        tier: node
        app: sriovdp
    spec:
      hostNetwork: true
      nodeSelector:
        beta.kubernetes.io/arch: amd64
      serviceAccountName: sriov-device-plugin
      containers:
      - name: kube-sriovdp
        image: docker.io/nfvpe/sriov-device-plugin:v3.3
        imagePullPolicy: IfNotPresent
        args:
        - --log-dir=sriovdp
        - --log-level=10
        securityContext:
          privileged: true
        resources:
          requests:
            cpu: "250m"
            memory: "40Mi"
          limits:
            cpu: 1
            memory: "200Mi"
        volumeMounts:
        - name: devicesock
          mountPath: /var/lib/kubelet/
          readOnly: false
        - name: log
          mountPath: /var/log
        - name: config-volume
          mountPath: /etc/pcidp
        - name: device-info
          mountPath: /var/run/k8s.cni.cncf.io/devinfo/dp
      volumes:
      - name: devicesock
        hostPath:
          path: /var/lib/kubelet/
      - name: log
        hostPath:
          path: /var/log
      - name: device-info
        hostPath:
          path: /var/run/k8s.cni.cncf.io/devinfo/dp
          type: DirectoryOrCreate
      - name: config-volume
        configMap:
          name: sriovdp-config
          items:
          - key: config.json
            path: config.json
sriov-cni-daemonset.yaml
# vi sriov-cni-daemonset.yaml
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-sriov-cni-ds-amd64
  namespace: kube-system
  labels:
    tier: node
    app: sriov-cni
spec:
  selector:
    matchLabels:
      name: sriov-cni
  template:
    metadata:
      labels:
        name: sriov-cni
        tier: node
        app: sriov-cni
    spec:
      nodeSelector:
        beta.kubernetes.io/arch: amd64
      containers:
      - name: kube-sriov-cni
        image: nfvpe/sriov-cni:v2.3
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        resources:
          requests:
            cpu: "100m"
            memory: "50Mi"
          limits:
            cpu: "100m"
            memory: "50Mi"
        volumeMounts:
        - name: cnibin
          mountPath: /host/opt/cni/bin
      volumes:
      - name: cnibin
        hostPath:
          path: /opt/cni/bin
Master Node Console
# kubectl apply -f configMap.yaml
# kubectl apply -f sriovdp-daemonset.yaml
# kubectl apply -f sriov-cni-daemonset.yaml
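Once the device plugin is running, you can confirm that the nodes advertise the SR-IOV RDMA virtual functions as an allocatable resource. The following is a sketch using jq (installed earlier in this guide); the resource name comes from the ConfigMap above:

```shell
# Show how many nvidia.com/sriov_rdma VFs each node advertises
# (requires access to the cluster).
if command -v kubectl >/dev/null; then
  kubectl get nodes -o json \
    | jq -r '.items[] | "\(.metadata.name): \(.status.allocatable["nvidia.com/sriov_rdma"] // "0")"'
fi
```

A node reporting 0 usually means the device plugin did not match the selectors in the ConfigMap (vendor ID, PF name, or RDMA capability).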
Deploying the RDMA CNI
The RDMA CNI enables namespace isolation for the virtual functions.
Deploy the RDMA CNI using the following YAML file:
rdma-cni-daemonset.yaml
# vi rdma-cni-daemonset.yaml
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-rdma-cni-ds
  namespace: kube-system
  labels:
    tier: node
    app: rdma-cni
    name: rdma-cni
spec:
  selector:
    matchLabels:
      name: rdma-cni
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        tier: node
        app: rdma-cni
        name: rdma-cni
    spec:
      hostNetwork: true
      containers:
      - name: rdma-cni
        image: mellanox/rdma-cni
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        resources:
          requests:
            cpu: "100m"
            memory: "50Mi"
          limits:
            cpu: "100m"
            memory: "50Mi"
        volumeMounts:
        - name: cnibin
          mountPath: /host/opt/cni/bin
      volumes:
      - name: cnibin
        hostPath:
          path: /opt/cni/bin
Master Node Console
# kubectl apply -f rdma-cni-daemonset.yaml
Applying Network Attachment Definitions
Apply the following YAML file to configure the network attachment for the pods:
netattdef.yaml
# vi netattdef.yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/sriov_rdma
  name: sriov20
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "sriov-rdma",
      "plugins": [
        {
          "type": "sriov",
          "vlan": 20,
          "spoofchk": "off",
          "vlanQoS": 0,
          "ipam": {
            "datastore": "kubernetes",
            "kubernetes": { "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig" },
            "log_file": "/tmp/whereabouts.log",
            "log_level": "debug",
            "type": "whereabouts",
            "range": "192.168.20.0/24"
          }
        },
        {
          "type": "rdma"
        },
        {
          "mtu": 8950,
          "type": "tuning"
        }
      ]
    }
Master Node Console
# kubectl apply -f netattdef.yaml
Creating a Test Deployment
Create a test daemon set using the following YAML file. It will create a pod on every worker node, which we can use to test RDMA connectivity and performance over the high-speed network.
Note that it adds an annotation referencing the required network ("sriov20") and includes resource requests for the SR-IOV virtual function resource ("nvidia.com/sriov_rdma").
The container image specified below should include the NVIDIA user-space drivers and the perftest package.
simple-daemon.yaml
# vi simple-daemon.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-daemon
  labels:
    app: example-dae
spec:
  selector:
    matchLabels:
      app: example-dae
  template:
    metadata:
      labels:
        app: example-dae
      annotations:
        k8s.v1.cni.cncf.io/networks: sriov20
    spec:
      containers:
      - image: < container image >
        name: example-dae-pod
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            memory: 16Gi
            cpu: 8
            nvidia.com/sriov_rdma: '1'
          requests:
            memory: 16Gi
            cpu: 8
            nvidia.com/sriov_rdma: '1'
        command:
        - sleep
        - inf
Apply the resource:
Master Node Console
# kubectl apply -f simple-daemon.yaml
Validate that the daemon set is running successfully. You should see four pods running, one on each worker node:
Master Node Console
# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
example-daemon-2p7t2 1/1 Running 0 5h21m 10.233.92.3 node3 <none> <none>
example-daemon-g8mcx 1/1 Running 0 5h21m 10.233.96.84 node2 <none> <none>
example-daemon-kf56h 1/1 Running 0 5h21m 10.233.105.4 node4 <none> <none>
example-daemon-zdmz8 1/1 Running 0 5h21m 10.233.70.5 node5 <none> <none>
Please refer to the appendix for running an RDMA performance test between two pods in your test deployment.
Appendix
Performance Testing
Now that our test daemon set is running, we can run a performance test to check the RDMA performance between two pods running on different worker nodes.
In one console window, connect to the Master Node and switch to the root account:
Master Node Console
$ sudo su -
Connect to one of the pods in the daemonset (example):
Master Node Console
# kubectl exec -it example-daemon-2p7t2 -- bash
From within the container, check its IP address on the high-speed network interface (net1):
First pod console
# ip address show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default
link/ether 0e:e8:a8:d6:f7:3c brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.233.92.3/32 brd 10.233.92.3 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::ce8:a8ff:fed6:f73c/64 scope link
valid_lft forever preferred_lft forever
26: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq state UP group default qlen 1000
link/ether ea:fe:9f:4a:28:8e brd ff:ff:ff:ff:ff:ff
inet 192.168.20.88/24 brd 192.168.20.255 scope global net1
valid_lft forever preferred_lft forever
inet6 fe80::e8fe:9fff:fe4a:288e/64 scope link
Then, start the ib_write_bw server side:
First pod console
# ib_write_bw -a --report_gbits
************************************
* Waiting for client to connect... *
************************************
In another console window, connect to the Master Node again and exec into a second pod in the daemon set (example):
Master Node Console
$ sudo su -
# kubectl exec -it example-daemon-zdmz8 -- bash
From within the container, start the ib_write_bw client, using the net1 IP address of the server pod noted above.
Verify that the maximum bandwidth between the containers exceeds 190 Gb/s:
Second pod console
# ib_write_bw -a -F --report_gbits 192.168.20.88
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : rocep197s0f0v0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : Ethernet
GID index : 2
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0122 PSN 0x3fdd80 RKey 0x02031e VAddr 0x007fb2a4731000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:91
remote address: LID 0000 QPN 0x0164 PSN 0xa38679 RKey 0x03031f VAddr 0x007fe0387d1000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:88
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
2 5000 0.041157 0.040923 2.557717
4 5000 0.089667 0.089600 2.799999
8 5000 0.18 0.18 2.795828
16 5000 0.36 0.36 2.799164
32 5000 0.72 0.72 2.801682
64 5000 1.08 1.07 2.089307
128 5000 2.15 2.08 2.031467
256 5000 4.30 4.30 2.097492
512 5000 8.56 8.56 2.089221
1024 5000 17.09 17.02 2.077250
2048 5000 33.89 33.83 2.065115
4096 5000 85.32 66.30 2.023458
8192 5000 163.84 136.83 2.087786
16384 5000 184.12 167.11 1.274956
32768 5000 190.44 180.83 0.689819
65536 5000 190.26 182.66 0.348395
131072 5000 193.71 179.10 0.170803
262144 5000 192.64 191.31 0.091222
524288 5000 192.62 191.29 0.045608
1048576 5000 192.82 192.75 0.022977
2097152 5000 192.38 192.22 0.011457
4194304 5000 192.80 192.78 0.005745
8388608 5000 192.67 192.65 0.002871
---------------------------------------------------------------------------------------
Optimizing worker nodes for performance
To accommodate performance-sensitive applications, we can optimize the worker nodes by restricting pod scheduling to cores that belong to the same NUMA node as the NIC.
On the worker node, switch to the root account:
Worker Node Console
$ sudo su -
Check to which NUMA node the NIC is wired:
Worker Node Console
# cat /sys/class/net/enp197s0f0/device/numa_node
1
In this example, the NIC is wired to NUMA node 1.
Check the NUMA nodes of the CPU and which cores are in NUMA node 1:
Worker Node Console
# lscpu | grep NUMA
NUMA node(s): 2
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
In this example case, the cores that are in NUMA node 1 are: 24-47.
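The two checks above can be combined into one step. The following is a sketch; the interface name is the one used throughout this guide, so adjust it for your NIC:

```shell
# Print the CPU cores that share a NUMA node with the NIC.
IFACE=enp197s0f0
if [ -e "/sys/class/net/$IFACE/device/numa_node" ]; then
  NODE=$(cat "/sys/class/net/$IFACE/device/numa_node")
  echo "NIC $IFACE is on NUMA node $NODE"
  # lscpu -p=CPU,NODE emits "cpu,node" pairs; keep the CPUs on that node.
  lscpu -p=CPU,NODE | awk -F, -v n="$NODE" '$0 !~ /^#/ && $2 == n { print $1 }'
fi
```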
Now we need to configure the kubelet on the worker node:
The cpuManagerPolicy attribute specifies the selected CPU manager policy (either "none" or "static").
The reservedSystemCPUs attribute lists the CPU cores that will not be used by K8s (they remain reserved for the Linux system).
The topologyManagerPolicy attribute specifies the selected topology manager policy ("none", "best-effort", "restricted" or "single-numa-node").
We will reserve some cores for the system and make sure they belong to NUMA node 0 (in our case):
Worker Node Console
# vi /etc/kubernetes/kubelet-config.yaml
...
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 10s
reservedSystemCPUs: "0,1,2,3"
topologyManagerPolicy: single-numa-node
featureGates:
  CPUManager: true
  TopologyManager: true
...
When changing reservedSystemCPUs or cpuManagerPolicy, the /var/lib/kubelet/cpu_manager_state file should be deleted and the kubelet service restarted:
Worker Node Console
# rm /var/lib/kubelet/cpu_manager_state
# service kubelet restart
Validating the fabric
To validate the fabric, we first need to assign IP addresses to the servers. Each stretched VLAN acts as a local subnet for all the servers connected to it, so all servers on the same VLAN must have IP addresses in the same subnet.
Then we can verify that the servers can ping each other.
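For example, from one of the servers (a sketch using the example node IPs from this guide; substitute the addresses you assigned on the stretched VLAN):

```shell
# Ping every other node on the VLAN; any UNREACHABLE line indicates
# a fabric or addressing problem.
for ip in 10.10.1.2 10.10.1.3 10.10.1.4 10.10.1.5; do
  ping -c 2 -W 2 "$ip" >/dev/null 2>&1 \
    && echo "$ip reachable" || echo "$ip UNREACHABLE"
done
```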
We can also validate on the switches:
1) That the IP addresses of the VTEPs were successfully propagated by BGP to all the leaf switches.
Please repeat the following command on the leafs:
Leaf Switch Console
cumulus@leaf1a:mgmt:~$ net show route
show ip route
=============
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
S>* 0.0.0.0/0 [1/0] via 10.1.0.254, vlan1, weight 1, 00:01:09
B>* 10.0.0.1/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:30
B>* 10.0.0.2/32 [20/0] via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29
C>* 10.0.0.101/32 is directly connected, lo, 5d16h51m
B>* 10.0.0.102/32 [200/0] via fe80::1e34:daff:feb4:620, peerlink.4094, weight 1, 00:01:18
B>* 10.0.0.103/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:29
* via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29
B>* 10.0.0.104/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:29
* via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29
C>* 10.0.1.1/32 is directly connected, lo, 00:01:44
C * 10.1.0.0/24 [0/1024] is directly connected, vlan1-v0, 00:01:43
C>* 10.1.0.0/24 is directly connected, vlan1, 00:01:43
C * 10.10.0.0/16 [0/1024] is directly connected, vlan10-v0, 00:01:43
C>* 10.10.0.0/16 is directly connected, vlan10, 00:01:43
show ipv6 route
===============
Codes: K - kernel route, C - connected, S - static, R - RIPng,
O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
C * fe80::/64 is directly connected, peerlink.4094, 00:01:20
C * fe80::/64 is directly connected, swp14, 00:01:30
C * fe80::/64 is directly connected, swp13, 00:01:31
C * fe80::/64 is directly connected, vlan10-v0, 00:01:43
C * fe80::/64 is directly connected, vlan1-v0, 00:01:43
C * fe80::/64 is directly connected, vlan20, 00:01:43
C * fe80::/64 is directly connected, vlan10, 00:01:43
C * fe80::/64 is directly connected, vlan1, 00:01:43
C>* fe80::/64 is directly connected, bridge, 00:01:43
2) That the ARP entries were successfully propagated by EVPN (best observed on the spine).
Please repeat the following command on the spines:
Spine Switch Console
cumulus@spine1:mgmt:~$ net show bgp evpn route type macip
BGP table version is 917, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]
Network Next Hop Metric LocPrf Weight Path
Extended Community
Route Distinguisher: 10.0.0.101:2
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]
10.0.1.1 0 65101 i
RT:65101:20 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[128]:[fe80::1e34:daff:feb4:920]
10.0.1.1 0 65101 i
RT:65101:20 ET:8 Default Gateway ND:Router Flag
Route Distinguisher: 10.0.0.101:3
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[32]:[10.10.0.1]
10.0.1.1 0 65101 i
RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[128]:[fe80::200:ff:fe00:10]
10.0.1.1 0 65101 i
RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]
10.0.1.1 0 65101 i
RT:65101:10 ET:8
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]:[32]:[10.10.0.250]
10.0.1.1 0 65101 i
RT:65101:10 ET:8
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]
10.0.1.1 0 65101 i
RT:65101:10 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[32]:[10.10.0.2]
10.0.1.1 0 65101 i
RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[128]:[fe80::1e34:daff:feb4:920]
10.0.1.1 0 65101 i
RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]
10.0.1.1 0 65101 i
RT:65101:10 ET:8
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]:[32]:[10.10.1.1]
10.0.1.1 0 65101 i
RT:65101:10 ET:8
Route Distinguisher: 10.0.0.102:2
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[32]:[10.10.0.1]
10.0.1.1 0 65101 i
RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[128]:[fe80::200:ff:fe00:10]
10.0.1.1 0 65101 i
RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]
10.0.1.1 0 65101 i
RT:65101:10 ET:8
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]:[32]:[10.10.0.250]
10.0.1.1 0 65101 i
RT:65101:10 ET:8
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[32]:[10.10.0.3]
10.0.1.1 0 65101 i
RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[128]:[fe80::1e34:daff:feb4:620]
10.0.1.1 0 65101 i
RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]
10.0.1.1 0 65101 i
RT:65101:10 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]
10.0.1.1 0 65101 i
RT:65101:10 ET:8
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]:[32]:[10.10.1.1]
10.0.1.1 0 65101 i
RT:65101:10 ET:8
Route Distinguisher: 10.0.0.102:3
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[128]:[fe80::1e34:daff:feb4:620]
10.0.1.1 0 65101 i
RT:65101:20 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]
10.0.1.1 0 65101 i
RT:65101:20 ET:8 MM:0, sticky MAC
Route Distinguisher: 10.0.0.103:2
* [2]:[0]:[48]:[b8:59:9f:fa:87:8e]
10.0.0.103 0 65102 i
RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]
10.0.0.103 0 65102 i
RT:65102:10 ET:8
* [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.2]
10.0.0.103 0 65102 i
RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.2]
10.0.0.103 0 65102 i
RT:65102:10 ET:8
* [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.10]
10.0.0.103 0 65102 i
RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.10]
10.0.0.103 0 65102 i
RT:65102:10 ET:8
* [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[128]:[fe80::ba59:9fff:fefa:878e]
10.0.0.103 0 65102 i
RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[128]:[fe80::ba59:9fff:fefa:878e]
10.0.0.103 0 65102 i
RT:65102:10 ET:8
Route Distinguisher: 10.0.0.103:3
* [2]:[0]:[48]:[5e:60:de:10:be:74]
10.0.0.103 0 65102 i
RT:65102:20 ET:8
*> [2]:[0]:[48]:[5e:60:de:10:be:74]
10.0.0.103 0 65102 i
RT:65102:20 ET:8
* [2]:[0]:[48]:[5e:60:de:10:be:74]:[128]:[fe80::5c60:deff:fe10:be74]
10.0.0.103 0 65102 i
RT:65102:20 ET:8
*> [2]:[0]:[48]:[5e:60:de:10:be:74]:[128]:[fe80::5c60:deff:fe10:be74]
10.0.0.103 0 65102 i
RT:65102:20 ET:8
Route Distinguisher: 10.0.0.104:2
* [2]:[0]:[48]:[06:e0:ca:50:81:a3]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
* [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[32]:[192.168.20.91]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[32]:[192.168.20.91]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
* [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[128]:[fe80::4e0:caff:fe50:81a3]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[128]:[fe80::4e0:caff:fe50:81a3]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
* [2]:[0]:[48]:[32:98:4b:9b:91:03]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
* [2]:[0]:[48]:[32:98:4b:9b:91:03]:[32]:[192.168.20.92]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]:[32]:[192.168.20.92]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
* [2]:[0]:[48]:[32:98:4b:9b:91:03]:[128]:[fe80::3098:4bff:fe9b:9103]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]:[128]:[fe80::3098:4bff:fe9b:9103]
10.0.0.104 0 65103 i
RT:65103:20 ET:8
Route Distinguisher: 10.0.0.104:3
* [2]:[0]:[48]:[b8:59:9f:fa:87:6e]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
* [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[32]:[10.10.1.4]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[32]:[10.10.1.4]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
* [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[128]:[fe80::ba59:9fff:fefa:876e]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[128]:[fe80::ba59:9fff:fefa:876e]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
* [2]:[0]:[48]:[b8:59:9f:fa:87:be]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
* [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[32]:[10.10.1.5]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[32]:[10.10.1.5]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
* [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[128]:[fe80::ba59:9fff:fefa:87be]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[128]:[fe80::ba59:9fff:fefa:87be]
10.0.0.104 0 65103 i
RT:65103:10 ET:8
Displayed 40 prefixes (58 paths) (of requested type)
3) That the MLAG is functioning properly on the infrastructure rack leafs:
Border Router Switch Console
cumulus@leaf1a:mgmt:~$ net show clag
The peer is alive
Our Priority, ID, and Role: 1000 1c:34:da:b4:09:20 primary
Peer Priority, ID, and Role: 32768 1c:34:da:b4:06:20 secondary
Peer Interface and IP: peerlink.4094 fe80::1e34:daff:feb4:620 (linklocal)
VxLAN Anycast IP: 10.0.1.1
Backup IP: 10.0.0.102 (active)
System MAC: 44:38:39:ff:ff:aa
CLAG Interfaces
Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
bond1 bond1 1 - -
bond2 bond2 2 - -
bond3 bond3 3 - -
vni10 vni10 - - -
vni20 vni20 - - -
Done!
Authors
Vitaliy Razinkov
Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA-accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.
Shachar Dor
Shachar Dor joined the Solutions Lab team after working more than ten years as a software architect at NVIDIA Networking (previously Mellanox Technologies), where he was responsible for the architecture of network management products and solutions. Shachar's focus is on networking technologies, especially fabric bring-up, configuration, monitoring, and life-cycle management. He has a strong background in software architecture, design, and programming through his work on multiple projects and technologies prior to joining the company.