Scope
This Reference Deployment Guide (RDG) aims at providing a practical and scalable Ethernet fabric deployment that is suitable for high-performance workloads in K8s. This fabric provides both primary K8s network (e.g. Calico) and a secondary high-performance network for RDMA/DPDK, in conjunction with the SRIOV and RDMA plugins and CNIs.
The proposed fabric configuration supports up to 480 workload servers in its maximum scale and provides a non-blocking throughput of up to 200Gbps between pods.
Abbreviations and Acronyms
Term | Definition | Term | Definition |
---|---|---|---|
BGP | Border Gateway Protocol | MLAG | Multi-Chassis Link Aggregation |
CNI | Container Network Interface | RDMA | Remote Direct Memory Access |
DMA | Direct Memory Access | TOR | Top of Rack |
EVPN | Ethernet Virtual Private Network | VLAN | Virtual LAN (Local Area Network) |
ISL | Inter-Switch Link | VRR | Virtual Router Redundancy |
K8s | Kubernetes | VTEP | Virtual Tunnel End Point |
LACP | Link Aggregation Control Protocol | VXLAN | Virtual Extensible LAN |
Introduction
K8s is the industry-standard platform for deploying and orchestrating cloud-native workloads.
The common K8s networking solutions (e.g. the commonly used Flannel and Calico CNI plugins) are not optimized for performance and do not utilize the current state-of-the-art networking technologies that are hardware-accelerated. Today's interconnect solutions from NVIDIA can provide up to 200Gbps of throughput at a very low latency with a minimal load on the server's CPU. To take advantage of these capabilities, provisioning of an additional network for the pods is needed - a high-speed RDMA-capable network.
This document demonstrates how to deploy, enable and configure a high-speed, hardware-accelerated network fabric in a K8s cluster, providing both the primary network and a secondary RDMA network on the same wire. The network fabric also includes highly-available border router functionality which provides in/out connectivity to the cluster (e.g. access to the Internet).
This document is intended for K8s administrators that want to enable a high-speed fabric for their applications running on top of K8s, such as big-data, machine learning, storage and database solutions, etc.
The document begins with the design of the fabric and of the K8s deployment, then continues with the actual deployment and configuration steps, concluding with a performance test that demonstrates the benefits of the solution.
References
Solution Architecture
Key Components and Technologies
NVIDIA ConnectX SmartNICs
10/25/40/50/100/200 and 400G Ethernet Network Adapters
The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offer advanced hardware offloads and accelerations.
NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.
NVIDIA LinkX Cables
The NVIDIA® LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400GbE in Ethernet and 100, 200 and 400Gb/s InfiniBand products for Cloud, HPC, hyperscale, Enterprise, telco, storage and artificial intelligence, data center applications.
Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.
Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects.
NVIDIA combines the benefits of NVIDIA Spectrum™ switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux, SONiC and NVIDIA Onyx®.
NVIDIA® Cumulus® Linux is the industry's most innovative open network operating system that allows you to automate, customize, and scale your data center network like no other.
RDMA
RDMA is a technology that allows computers in a network to exchange data without involving the processor, cache or operating system of either computer.
Like locally based DMA, RDMA improves throughput and performance and frees up compute resources.
Kubernetes
Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.
Logical Design
The physical servers used in this document:
- 1 x Deployment Node
- 1 x Master Node
- 4 x Worker Nodes; each with 1 x ConnectX-6 NIC
The deployment of the fabric is based on a 2-level leaf-spine topology.
The deployment includes two separate physical networks:
- A high-speed Ethernet fabric
- An IPMI/bare-metal management network (not covered in this document)
This document covers a single K8s controller deployment scenario. For high-availability cluster deployment, please refer to https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md
Network / Fabric Design
This document demonstrates a minimalistic scale of 2 workload racks with 2 servers each (as shown in the diagram below):
By using the same design, the fabric can be scaled to accommodate up to 480 workload servers using up to 30 workload racks with up to 16 servers each. Every workload rack uses a single leaf switch (TOR). The infrastructure rack consists of a highly-available border router (an MLAG pair) which provides a connection to an external gateway/router and to a maximum of 14 infrastructure servers.
The high-speed network consists of two logical segments:
- The management network and the primary K8s network (used by Calico) - VLAN10
- The secondary K8s network which provides RDMA to the pods - VLAN20
The fabric implements a VXLAN overlay network with a BGP EVPN control plane, which enables the "stretching" of the VLANs across all the racks.
Every leaf switch has a VTEP which takes care of VXLAN encapsulation and decapsulation. The communication between the VTEPs is done by routing through the spines, controlled by a BGP control plane.
The infrastructure rack (as seen on the left in the illustration below) has two leaf switches that act as a highly available border router which provides both highly available connectivity for the infrastructure servers (the deployment server and the K8s master node) and redundant routing into and out of the cluster through a gateway node. This high availability is achieved by an MLAG configuration, the use of LACP bonds, and a redundant router mechanism which uses VRR.
Below is a diagram demonstrating the maximum possible scale for a non-blocking deployment that uses 200GbE to the host (30 racks, 16 servers each using 16 spines and 32 leafs).
Please note that in this setup, the MSN2100 switches in the infrastructure rack should be replaced by MSN2700 switches (having 32 ports instead of 16 ports):
In the case of a maximum scale fabric (as shown above), there will be 16 x 200Gbps links going up from each leaf to the spines and therefore a maximum of 16 x 200Gbps links going to servers in each rack.
Software Stack Components
Please make sure to upgrade all the NVIDIA software components to their latest released version.
Bill of Materials
Please note that older MSN2100 switches with hardware revision 0 (zero) do not support the functionality presented in this document. You can verify that your switch is newer by running the "decode-syseeprom" command and checking the "Device Version" field (must be greater than zero).
Deployment and Configuration
Node and Switch Definitions
These are the definitions and parameters used for deploying the demonstrated fabric:
Spines | |||
hostname | router id | autonomous system | downlinks |
spine1 (MSN3700) | 10.0.0.1/32 | 65100 | swp1-6 |
spine2 (MSN3700) | 10.0.0.2/32 | 65100 | swp1-6 |
Leafs | ||||
hostname | router id | autonomous system | uplinks | peers on spines |
leaf1a (MSN2100) | 10.0.0.101/32 | 65101 | swp13-14 | swp1 |
leaf1b (MSN2100) | 10.0.0.102/32 | 65101 | swp13-14 | swp2 |
leaf2 (MSN3700) | 10.0.0.103/32 | 65102 | swp29-32 | swp3-4 |
leaf3 (MSN3700) | 10.0.0.104/32 | 65103 | swp29-32 | swp5-6 |
Workload Server Ports | |||
rack id | vlan id | access ports | trunk ports |
2 | 10 | swp1-4 | |
2 | 20 | swp1-4 | |
3 | 10 | swp1-4 | |
3 | 20 | swp1-4 |
Border Routers (Infrastructure Rack TORs) | ||||
hostname | isl ports | clag system mac | clag priority | vxlan anycast ip |
leaf1a | swp15-16 | 44:38:39:FF:FF:AA | 1000 | 10.10.11.1 |
leaf1b | swp15-16 | 44:38:39:FF:FF:AA | 32768 | 10.10.11.1 |
Border VLANs | ||||
vlan id | virt mac | virt ip | primary router ip | secondary router ip |
10 | 00:00:00:00:00:10 | 10.10.0.1/16 | 10.10.0.2/16 | 10.10.0.3/16 |
1 | 00:00:00:00:00:01 | 10.1.0.1/24 | 10.1.0.2/24 | 10.1.0.3/24 |
Infrastructure Server Ports | ||
vlan id | port names | bond names |
1 | swp1 | bond1 |
10 | swp2, swp3 | bond2, bond3 |
Hosts | ||||
Rack | Server/Switch type | Server/Switch name | IP and NICs | Default Gateway |
Rack1 (Infrastructure) | Deployment Node | depserver | bond0 (enp197s0f0, enp197s0f1) 10.10.0.250/16 | 10.10.0.1 |
Rack1 (Infrastructure) | Master Node | node1 | bond0 (enp197s0f0, enp197s0f1) 10.10.1.1/16 | 10.10.0.1 |
Rack2 | Worker Node | node2 | enp197s0f0 10.10.1.2/16 | 10.10.0.1 |
Rack2 | Worker Node | node3 | enp197s0f0 10.10.1.3/16 | 10.10.0.1 |
Rack3 | Worker Node | node4 | enp197s0f0 10.10.1.4/16 | 10.10.0.1 |
Rack3 | Worker Node | node5 | enp197s0f0 10.10.1.5/16 | 10.10.0.1 |
Wiring
This is the wiring principal for the workload racks:
- Each server in the racks is wired to the leaf (or "TOR") switch
- Every leaf is wired to all the spines
This is the wiring principal for the infrastructure rack:
- Each server in the racks is wired to two leafs (or "TORs") switches
- Every leaf is wired to all the spines
Fabric Configuration
Updating Cumulus Linux
As a best practice, make sure to use the latest released Cumulus Linux NOS version.
Please see this guide on how to upgrade Cumulus Linux.
Configuring the Cumulus Linux Switch
Make sure your Cumulus Linux switch has passed its initial configuration stages (please see the Quick-Start Guide for version 4.3 for additional information):
- License installation
- Creation of switch interfaces (e.g. swp1-32)
Following is the configuration for the switches:
Please note that you can add the command "net del all" before the following commands in order to clear any previous configuration.
net add bgp autonomous-system 65100 net add loopback lo ip address 10.0.0.1/32 net add bgp router-id 10.0.0.1 net add routing defaults datacenter net add routing log syslog informational net add routing service integrated-vtysh-config net add bgp neighbor underlay peer-group net add bgp neighbor underlay remote-as external net add interface swp1 mtu 9216 net add bgp neighbor swp1 interface peer-group underlay net add interface swp2 mtu 9216 net add bgp neighbor swp2 interface peer-group underlay net add interface swp3 mtu 9216 net add bgp neighbor swp3 interface peer-group underlay net add interface swp4 mtu 9216 net add bgp neighbor swp4 interface peer-group underlay net add interface swp5 mtu 9216 net add bgp neighbor swp5 interface peer-group underlay net add interface swp6 mtu 9216 net add bgp neighbor swp6 interface peer-group underlay net add bgp ipv4 unicast redistribute connected net add bgp ipv6 unicast neighbor underlay activate net add bgp l2vpn evpn neighbor underlay activate net add bgp l2vpn evpn advertise-all-vni net commit
net add bgp autonomous-system 65100 net add loopback lo ip address 10.0.0.2/32 net add bgp router-id 10.0.0.2 net add routing defaults datacenter net add routing log syslog informational net add routing service integrated-vtysh-config net add bgp neighbor underlay peer-group net add bgp neighbor underlay remote-as external net add interface swp1 mtu 9216 net add bgp neighbor swp1 interface peer-group underlay net add interface swp2 mtu 9216 net add bgp neighbor swp2 interface peer-group underlay net add interface swp3 mtu 9216 net add bgp neighbor swp3 interface peer-group underlay net add interface swp4 mtu 9216 net add bgp neighbor swp4 interface peer-group underlay net add interface swp5 mtu 9216 net add bgp neighbor swp5 interface peer-group underlay net add interface swp6 mtu 9216 net add bgp neighbor swp6 interface peer-group underlay net add bgp ipv4 unicast redistribute connected net add bgp ipv6 unicast neighbor underlay activate net add bgp l2vpn evpn neighbor underlay activate net add bgp l2vpn evpn advertise-all-vni net commit
net add bgp autonomous-system 65101 net add bgp router-id 10.0.0.101 net add loopback lo ip address 10.0.0.101/32 net add routing defaults datacenter net add routing log syslog informational net add routing service integrated-vtysh-config net add bgp bestpath as-path multipath-relax net add bgp neighbor underlay peer-group net add bgp neighbor underlay remote-as external net add bgp neighbor underlay capability extended-nexthop net add interface swp13 mtu 9216 net add bgp neighbor swp13 interface peer-group underlay net add interface swp14 mtu 9216 net add bgp neighbor swp14 interface peer-group underlay net add bgp ipv4 unicast redistribute connected net add bgp ipv6 unicast neighbor underlay activate net add bgp l2vpn evpn neighbor underlay activate net add bgp l2vpn evpn advertise-all-vni net add bgp l2vpn evpn advertise ipv4 unicast net add bridge bridge ports peerlink net add bridge bridge vlan-aware net add loopback lo vxlan local-tunnelip 10.0.0.101 net add bridge bridge vids 10 net add vlan 10 vlan-id 10 net add vlan 10 vlan-raw-device bridge net add vxlan vni10 vxlan id 10 net add vxlan vni10 bridge access 10 net add vxlan vni10 bridge arp-nd-suppress on net add vxlan vni10 bridge learning off net add vxlan vni10 stp bpduguard net add vxlan vni10 stp portbpdufilter net add vxlan vni10 vxlan local-tunnelip 10.0.0.101 net add bridge bridge ports vni10 net add bridge bridge vids 20 net add vlan 20 vlan-id 20 net add vlan 20 vlan-raw-device bridge net add vxlan vni20 vxlan id 20 net add vxlan vni20 bridge access 20 net add vxlan vni20 bridge arp-nd-suppress on net add vxlan vni20 bridge learning off net add vxlan vni20 stp bpduguard net add vxlan vni20 stp portbpdufilter net add vxlan vni20 vxlan local-tunnelip 10.0.0.101 net add bridge bridge ports vni20 net add loopback lo clag vxlan-anycast-ip 10.10.11.1 net add bgp l2vpn evpn advertise-default-gw net add bond peerlink bond slaves swp15,swp16 net add interface peerlink.4094 clag args --initDelay 10 net add interface peerlink.4094 clag backup-ip 10.0.0.102 net add interface peerlink.4094 clag peer-ip linklocal net add interface peerlink.4094 clag priority 1000 net add interface peerlink.4094 clag sys-mac 44:38:39:FF:FF:AA net add bgp neighbor peerlink.4094 interface remote-as internal net add bgp l2vpn evpn neighbor peerlink.4094 activate net add vlan 10 ip address 10.10.0.2/16 net add vlan 10 ip address-virtual 00:00:00:00:00:10 10.10.0.1/16 net add vlan 1 ip address 10.1.0.2/24 net add vlan 1 ip address-virtual 00:00:00:00:00:01 10.1.0.1/24 net commit
net add bgp autonomous-system 65101 net add bgp router-id 10.0.0.102 net add loopback lo ip address 10.0.0.102/32 net add routing defaults datacenter net add routing log syslog informational net add routing service integrated-vtysh-config net add bgp bestpath as-path multipath-relax net add bgp neighbor underlay peer-group net add bgp neighbor underlay remote-as external net add bgp neighbor underlay capability extended-nexthop net add interface swp13 mtu 9216 net add bgp neighbor swp13 interface peer-group underlay net add interface swp14 mtu 9216 net add bgp neighbor swp14 interface peer-group underlay net add bgp ipv4 unicast redistribute connected net add bgp ipv6 unicast neighbor underlay activate net add bgp l2vpn evpn neighbor underlay activate net add bgp l2vpn evpn advertise-all-vni net add bgp l2vpn evpn advertise ipv4 unicast net add bridge bridge ports peerlink net add bridge bridge vlan-aware net add loopback lo vxlan local-tunnelip 10.0.0.102 net add bridge bridge vids 10 net add vlan 10 vlan-id 10 net add vlan 10 vlan-raw-device bridge net add vxlan vni10 vxlan id 10 net add vxlan vni10 bridge access 10 net add vxlan vni10 bridge arp-nd-suppress on net add vxlan vni10 bridge learning off net add vxlan vni10 stp bpduguard net add vxlan vni10 stp portbpdufilter net add vxlan vni10 vxlan local-tunnelip 10.0.0.102 net add bridge bridge ports vni10 net add bridge bridge vids 20 net add vlan 20 vlan-id 20 net add vlan 20 vlan-raw-device bridge net add vxlan vni20 vxlan id 20 net add vxlan vni20 bridge access 20 net add vxlan vni20 bridge arp-nd-suppress on net add vxlan vni20 bridge learning off net add vxlan vni20 stp bpduguard net add vxlan vni20 stp portbpdufilter net add vxlan vni20 vxlan local-tunnelip 10.0.0.102 net add bridge bridge ports vni20 net add loopback lo clag vxlan-anycast-ip 10.10.11.1 net add bgp l2vpn evpn advertise-default-gw net add bond peerlink bond slaves swp15,swp16 net add interface peerlink.4094 clag args --initDelay 10 net add interface peerlink.4094 clag backup-ip 10.0.0.101 net add interface peerlink.4094 clag peer-ip linklocal net add interface peerlink.4094 clag priority 32768 net add interface peerlink.4094 clag sys-mac 44:38:39:FF:FF:AA net add bgp neighbor peerlink.4094 interface remote-as internal net add bgp l2vpn evpn neighbor peerlink.4094 activate net add vlan 10 ip address 10.10.0.3/16 net add vlan 10 ip address-virtual 00:00:00:00:00:10 10.10.0.1/16 net add vlan 1 ip address 10.1.0.3/24 net add vlan 1 ip address-virtual 00:00:00:00:00:01 10.1.0.1/24 net commit
net add bgp autonomous-system 65102 net add bgp router-id 10.0.0.102 net add loopback lo ip address 10.0.0.103/32 net add routing defaults datacenter net add routing log syslog informational net add routing service integrated-vtysh-config net add bgp bestpath as-path multipath-relax net add bgp neighbor underlay peer-group net add bgp neighbor underlay remote-as external net add bgp neighbor underlay capability extended-nexthop net add interface swp29 mtu 9216 net add bgp neighbor swp29 interface peer-group underlay net add interface swp30 mtu 9216 net add bgp neighbor swp30 interface peer-group underlay net add interface swp31 mtu 9216 net add bgp neighbor swp31 interface peer-group underlay net add interface swp32 mtu 9216 net add bgp neighbor swp32 interface peer-group underlay net add bgp ipv4 unicast redistribute connected net add bgp ipv6 unicast neighbor underlay activate net add bgp l2vpn evpn neighbor underlay activate net add bgp l2vpn evpn advertise-all-vni net add bgp l2vpn evpn advertise ipv4 unicast net add bridge bridge ports swp1,swp2,swp3,swp4 net add bridge bridge vlan-aware net add loopback lo vxlan local-tunnelip 10.0.0.103 net add interface swp1,swp2,swp3,swp4 bridge pvid 10 net add interface swp1,swp2,swp3,swp4 mtu 8950 net add interface swp1,swp2,swp3,swp4 bridge vids 20 net add interface swp1,swp2,swp3,swp4 mtu 8950 net add bridge bridge vids 10 net add vlan 10 vlan-id 10 net add vlan 10 vlan-raw-device bridge net add vxlan vni10 vxlan id 10 net add vxlan vni10 bridge access 10 net add vxlan vni10 bridge arp-nd-suppress on net add vxlan vni10 bridge learning off net add vxlan vni10 stp bpduguard net add vxlan vni10 stp portbpdufilter net add vxlan vni10 vxlan local-tunnelip 10.0.0.103 net add bridge bridge ports vni10 net add bridge bridge vids 20 net add vlan 20 vlan-id 20 net add vlan 20 vlan-raw-device bridge net add vxlan vni20 vxlan id 20 net add vxlan vni20 bridge access 20 net add vxlan vni20 bridge arp-nd-suppress on net add vxlan vni20 bridge learning off net add vxlan vni20 stp bpduguard net add vxlan vni20 stp portbpdufilter net add vxlan vni20 vxlan local-tunnelip 10.0.0.103 net add bridge bridge ports vni20 net commit
net add bgp autonomous-system 65103 net add bgp router-id 10.0.0.103 net add loopback lo ip address 10.0.0.104/32 net add routing defaults datacenter net add routing log syslog informational net add routing service integrated-vtysh-config net add bgp bestpath as-path multipath-relax net add bgp neighbor underlay peer-group net add bgp neighbor underlay remote-as external net add bgp neighbor underlay capability extended-nexthop net add interface swp29 mtu 9216 net add bgp neighbor swp29 interface peer-group underlay net add interface swp30 mtu 9216 net add bgp neighbor swp30 interface peer-group underlay net add interface swp31 mtu 9216 net add bgp neighbor swp31 interface peer-group underlay net add interface swp32 mtu 9216 net add bgp neighbor swp32 interface peer-group underlay net add bgp ipv4 unicast redistribute connected net add bgp ipv6 unicast neighbor underlay activate net add bgp l2vpn evpn neighbor underlay activate net add bgp l2vpn evpn advertise-all-vni net add bgp l2vpn evpn advertise ipv4 unicast net add bridge bridge ports swp1,swp2,swp3,swp4 net add bridge bridge vlan-aware net add loopback lo vxlan local-tunnelip 10.0.0.104 net add interface swp1,swp2,swp3,swp4 bridge pvid 10 net add interface swp1,swp2,swp3,swp4 mtu 8950 net add interface swp1,swp2,swp3,swp4 bridge vids 20 net add interface swp1,swp2,swp3,swp4 mtu 8950 net add bridge bridge vids 10 net add vlan 10 vlan-id 10 net add vlan 10 vlan-raw-device bridge net add vxlan vni10 vxlan id 10 net add vxlan vni10 bridge access 10 net add vxlan vni10 bridge arp-nd-suppress on net add vxlan vni10 bridge learning off net add vxlan vni10 stp bpduguard net add vxlan vni10 stp portbpdufilter net add vxlan vni10 vxlan local-tunnelip 10.0.0.104 net add bridge bridge ports vni10 net add bridge bridge vids 20 net add vlan 20 vlan-id 20 net add vlan 20 vlan-raw-device bridge net add vxlan vni20 vxlan id 20 net add vxlan vni20 bridge access 20 net add vxlan vni20 bridge arp-nd-suppress on net add vxlan vni20 bridge learning off net add vxlan vni20 stp bpduguard net add vxlan vni20 stp portbpdufilter net add vxlan vni20 vxlan local-tunnelip 10.0.0.104 net add bridge bridge ports vni20 net commit
Connecting the Infrastructure Servers
Infrastructure servers (deployment and K8s master servers) are placed in the infrastructure rack.
This will require the following additional configuration steps:
- Adding the ports connected to the servers to an MLAG bond
- Placing the bond in the relevant VLAN
In our case, the servers are connected to ports swp2 and swp3 on both leafs (Leaf1A and Leaf1B), and will be using VLAN10 that we created on the border leafs, the commands on both Leaf1A and Leaf1B will be:
net add interface swp2 mtu 8950 net add bond bond2 bond slaves swp2 net add bond bond2 mtu 8950 net add bond bond2 clag id 2 net add bond bond2 bridge access 10 net add bond bond2 bond lacp-bypass-allow net add bond bond2 stp bpduguard net add bond bond2 stp portadminedge net add interface swp3 mtu 8950 net add bond bond3 bond slaves swp3 net add bond bond3 mtu 8950 net add bond bond3 clag id 3 net add bond bond3 bridge access 10 net add bond bond3 bond lacp-bypass-allow net add bond bond3 stp bpduguard net add bond bond3 stp portadminedge net commit
Connecting an External Gateway to the Infrastructure Rack
In our setup, we will connect an external gateway machine (10.1.0.254/24) over an LACP bond to swp1 of both border leafs (via VLAN1).
This gateway will be used to access any external network (e.g. the Internet). The configuration commands on both border leafs are as follows:
net add interface swp1 mtu 8950 net add bond bond1 bond slaves swp1 net add bond bond1 mtu 8950 net add bond bond1 clag id 1 net add bond bond1 bridge access 1 net add bond bond1 bond lacp-bypass-allow net add bond bond1 stp bpduguard net add bond bond1 stp portadminedge net add routing route 0.0.0.0/0 10.1.0.254 net commit
Please note that the gateway machine should be configured statically to access our primary network (10.1.0.0/16) via its relevant interface.
Host Configuration
Make sure that the BIOS settings on the worker nodes servers have SR-IOV enabled and that the servers are tuned for maximum performance.
All Worker nodes must have the same PCIe placement for the NIC, and expose the same interface name.
Our host will be running Ubuntu Linux, the configuration is as follows:
Installing and Updating the OS
Make sure Ubuntu Server 20.04 operating system is installed on all servers with OpenSSH server packages, and create a non-root user account with sudo privileges without password.
Also make sure to assign the correct network configuration to the hosts (IP addresses, default gateway, DNS server, NTP server) and to create bonds on the nodes in the infrastructure rack (master node and deployment node).
Update the Ubuntu software packages by running the following commands:
Non-root User Account Prerequisites
In this solution we added the following line to the EOF /etc/sudoers:
$ sudo vi /etc/sudoers #includedir /etc/sudoers.d #K8s cluster deployment user with sudo privileges without password user ALL=(ALL) NOPASSWD:ALL
SR-IOV Activation and Virtual Functions Configuration
Use the following commands to install the mstflint tool and verify that SRIOV is enabled and that there are enough virtual functions on the NIC:
# apt install mstflint # lspci | grep Mellanox c5:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6] c5:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6] # mstconfig -d c5:00.0 q | grep SRIOV_EN SRIOV_EN True(1) # mstconfig -d c5:00.0 q | grep NUM_OF_VFS NUM_OF_VFS 8
In case SRIOV is not configured or the number of VFs is insufficient, please configure using the following commands (and then reboot the machine):
# mstconfig -d c5:00.0 -y set SRIOV_EN=True NUM_OF_VFS=8 # reboot
The above operation activated SRIOV and defined the maximum number of VFs supported. Below we will perform the actual activation of the virtual functions.
Installing rdma-core and Setting RDMA to "Exclusive Mode"
Install the rdma-core package:
# apt install rdma-core -y
Set netns to exclusive mode for providing namespace isolation on the high-speed interface. This way, each pod can only see and access its own virtual functions.
Create the following file:
# vi /etc/modprobe.d/ib_core.conf # Set netns to exclusive mode for namespace isolation options ib_core netns_mode=0
Then run the commands below:
# update-initramfs -u # reboot
After the node comes back, check netns mode:
# rdma system netns exclusive
Setting MTU on the Physical Port
We need to set the MTU on the physical port of the server to allow for optimized throughput.
Since the fabric is using VXLAN overlay, we will use the maximum MTU of 9216 on the core links and an MTU of 8950 on the edge links (servers links), making sure that the VXLAN header added to the packets will not cause fragmentation.
In order to configure the MTU on the server ports, please edit the netplan config file (in this example on node2):
# vi /etc/netplan/00-installer-config.yaml network: ethernets: enp197s0f0: addresses: - 10.10.1.2 gateway4: 10.10.0.1 mtu: 8950 version: 2
Please note that you can use the "rdma link" command to identify the name assigned to the high-speed interface, for example:
# rdma link
link rocep197s0f0/1 state ACTIVE physical_state LINK_UP netdev enp197s0f0
Then apply it:
# netplan apply
Virtual Function Activation
Now we will activate 8 virtual functions using the following command:
# PF_NAME=enp197s0f0 # echo 8 > /sys/class/net/${PF_NAME}/device/sriov_numvfs
Please note that the above configuration is not persistent!
NIC Firmware Upgrade
It is recommended that you upgrade the NIC firmware on the worker nodes to the latest released version.
Please make sure to use the root account using:
$ sudo su -
Please make sure to download the "mlxup" program to each Worker Node and install the latest firmware for the NIC (requires Internet connectivity, please check the official download page)
# wget http://www.mellanox.com/downloads/firmware/mlxup/4.15.2/SFX/linux_x64/mlxup # chmod 777 mlxup # ./mlxup -u --online
K8s Cluster Deployment and Configuration
The K8s cluster in this solution will be installed using Kubespray with a non-root user account from the Deployment Node.
SSH Private Key and SSH Passwordless Login
Login to the Deployment Node as a deployment user (in this case - user) and create an SSH private key for configuring the password-less authentication on your computer by running the following commands:
$ ssh-keygen Generating public/private rsa key pair. Enter file in which to save the key (/home/user/.ssh/id_rsa): Created directory '/home/user/.ssh'. Enter passphrase (empty for no passphrase): Enter same passphrase again: Your identification has been saved in /home/user/.ssh/id_rsa. Your public key has been saved in /home/user/.ssh/id_rsa.pub. The key fingerprint is: SHA256:PaZkvxV4K/h8q32zPWdZhG1VS0DSisAlehXVuiseLgA user@depl-node The key's randomart image is: +---[RSA 2048]----+ | ...+oo+o..o| | .oo .o. o| | . .. . o +.| | E . o + . +| | . S = + o | | . o = + o .| | . o.o + o| | ..+.*. o+o| | oo*ooo.++| +----[SHA256]-----+
Copy your SSH private key, such as ~/.ssh/id_rsa, to all nodes in your deployment by running the following command (example):
$ ssh-copy-id -i ~/.ssh/id_rsa user@10.10.1.1 /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/user/.ssh/id_rsa.pub" The authenticity of host '10.10.1.1 (10.10.1.1)' can't be established. ECDSA key fingerprint is SHA256:uyglY5g0CgPNGDm+XKuSkFAbx0RLaPijpktANgXRlD8. Are you sure you want to continue connecting (yes/no)? yes /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys user@10.10.1.1's password: Number of key(s) added: 1 Now try logging into the machine, with: "ssh 'user@10.10.1.1'" and check to make sure that only the key(s) you wanted were added.
Verify that you have password-less SSH connectivity to all nodes in your deployment by running the following command (example):
$ ssh user@10.10.1.1
Kubespray Deployment and Configuration
To install dependencies for running Kubespray with Ansible on the Deployment server please run following commands:
$ cd ~ $ sudo apt -y install python3-pip jq $ git clone https://github.com/kubernetes-sigs/kubespray.git $ cd kubespray $ sudo pip3 install -r requirements.txt
Create a new cluster configuration. The default folder for subsequent commands is ~/kubespray.
Replace the IP addresses below with your nodes' IP addresses:
$ cp -rfp inventory/sample inventory/mycluster $ declare -a IPS=(10.10.1.1 10.10.1.2 10.10.1.3 10.10.1.4 10.10.1.5) $ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
As a result, the inventory/mycluster/hosts.yaml file will be created.
Review and change the host configuration in the file. Below is an example for this deployment:
$ sudo vi inventory/mycluster/hosts.yaml all: hosts: node1: ansible_host: 10.10.1.1 ip: 10.10.1.1 access_ip: 10.10.1.1 node2: ansible_host: 10.10.1.2 ip: 10.10.1.2 access_ip: 10.10.1.2 node3: ansible_host: 10.10.1.3 ip: 10.10.1.3 access_ip: 10.10.1.3 node4: ansible_host: 10.10.1.4 ip: 10.10.1.4 access_ip: 10.10.1.4 node5: ansible_host: 10.10.1.5 ip: 10.10.1.5 access_ip: 10.10.1.5 children: kube_control_plane: hosts: node1: kube_node: hosts: node2: node3: node4: node5: etcd: hosts: node1: k8s_cluster: children: kube_control_plane: kube_node: calico_rr: hosts: {}
Review and change cluster installation parameters in the files inventory/mycluster/group_vars/all/all.yml and inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml
In inventory/mycluster/group_vars/all/all.yml remove the comment from following line so the metrics can receive data about the use of cluster resources:
$ sudo vi inventory/mycluster/group_vars/all/all.yml ## The read-only port for the Kubelet to serve on with no authentication/authorization. Uncomment to enable. kube_read_only_port: 10255
In inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml set the value of kube_version to v1.21.0, set the container_manager to containerd and enable multi_networking by setting kube_network_plugin_multus: true.
$ sudo vi inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml … ## Change this to use another Kubernetes version, e.g. a current beta release kube_version: v1.21.0 … ## Container runtime ## docker for docker, crio for cri-o and containerd for containerd. container_manager: containerd … # Setting multi_networking to true will install Multus: https://github.com/intel/multus-cni kube_network_plugin_multus: true …
In inventory/mycluster/group_vars/etcd.yml set the etcd_deployment_type to host:
$ sudo vi inventory/mycluster/group_vars/etcd.yml ... ## Settings for etcd deployment type etcd_deployment_type: host
Deploying the cluster using Kubespray Ansible Playbook
Run the following line to start the deployment process:
$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
It takes a while for this deployment to complete, please make sure no errors are encountered.
A successful result should look something like the following:
PLAY RECAP *********************************************************************************************************************************************************************************** localhost : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 node1 : ok=584 changed=133 unreachable=0 failed=0 skipped=1151 rescued=0 ignored=2 node2 : ok=387 changed=86 unreachable=0 failed=0 skipped=634 rescued=0 ignored=1 node3 : ok=387 changed=86 unreachable=0 failed=0 skipped=633 rescued=0 ignored=1 node4 : ok=387 changed=86 unreachable=0 failed=0 skipped=633 rescued=0 ignored=1 node5 : ok=387 changed=86 unreachable=0 failed=0 skipped=633 rescued=0 ignored=1 Thursday 20 May 2021 07:59:23 +0000 (0:00:00.071) 0:11:57.632 ********** =============================================================================== kubernetes/control-plane : kubeadm | Initialize first master ------------------------------------------------------------------------------------------------------------------------- 77.14s kubernetes/kubeadm : Join to cluster ------------------------------------------------------------------------------------------------------------------------------------------------- 36.82s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 32.52s download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 25.75s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.73s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.15s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 22.00s download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 20.24s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 16.27s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 15.36s container-engine/containerd : ensure containerd packages are installed --------------------------------------------------------------------------------------------------------------- 13.29s download_file | Download item -------------------------------------------------------------------------------------------------------------------------------------------------------- 12.29s kubernetes/preinstall : Install packages requirements -------------------------------------------------------------------------------------------------------------------------------- 12.15s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 11.40s download_file | Download item 
-------------------------------------------------------------------------------------------------------------------------------------------------------- 11.05s download_container | Download image if required -------------------------------------------------------------------------------------------------------------------------------------- 10.19s kubernetes/control-plane : Master | wait for kube-scheduler -------------------------------------------------------------------------------------------------------------------------- 10.02s download_container | Download image if required --------------------------------------------------------------------------------------------------------------------------------------- 9.36s download_file | Download item --------------------------------------------------------------------------------------------------------------------------------------------------------- 9.15s reload etcd --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 8.65s
Now that the K8s cluster is deployed, connect to the K8s Master Node for the following sections.
Please make sure to use the root account:
$ sudo su -
K8s Deployment Verification
Below is an output example of a K8s cluster with the deployment information, with default Kubespray configuration using the Calico K8s CNI plugin.
To ensure that the K8s cluster is installed correctly, run the following commands:
# kubectl get nodes -o wide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME node1 Ready control-plane,master 6h40m v1.21.0 10.10.1.1 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4 node2 Ready <none> 6h39m v1.21.0 10.10.1.2 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4 node3 Ready <none> 6h39m v1.21.0 10.10.1.3 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4 node4 Ready <none> 6h39m v1.21.0 10.10.1.4 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4 node5 Ready <none> 6h39m v1.21.0 10.10.1.5 <none> Ubuntu 20.04.2 LTS 5.4.0-73-generic containerd://1.4.4 $ kubectl get pod -n kube-system -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES calico-kube-controllers-7797d7b677-4kndh 1/1 Running 0 6h39m 10.10.1.3 node3 <none> <none> calico-node-6xqxn 1/1 Running 1 6h40m 10.10.1.5 node5 <none> <none> calico-node-7st5x 1/1 Running 0 6h40m 10.10.1.2 node2 <none> <none> calico-node-8qdpx 1/1 Running 0 6h40m 10.10.1.1 node1 <none> <none> calico-node-qjflr 1/1 Running 2 6h40m 10.10.1.4 node4 <none> <none> calico-node-x68rz 1/1 Running 0 6h40m 10.10.1.3 node3 <none> <none> coredns-7fcf4fd7c7-7p6k5 1/1 Running 0 6h7m 10.233.92.1 node3 <none> <none> coredns-7fcf4fd7c7-mwfd6 1/1 Running 0 6h39m 10.233.90.1 node1 <none> <none> dns-autoscaler-7df78bfcfb-xl48v 1/1 Running 0 6h39m 10.233.90.2 node1 <none> <none> kube-apiserver-node1 1/1 Running 0 6h41m 10.10.1.1 node1 <none> <none> kube-controller-manager-node1 1/1 Running 0 6h41m 10.10.1.1 node1 <none> <none> kube-multus-ds-amd64-8dmpv 1/1 Running 0 6h39m 10.10.1.3 node3 <none> <none> kube-multus-ds-amd64-b74t4 1/1 Running 1 6h39m 10.10.1.5 node5 <none> <none> kube-multus-ds-amd64-nvrl9 1/1 Running 2 6h39m 10.10.1.4 node4 <none> <none> kube-multus-ds-amd64-s9lr4 1/1 Running 0 6h39m 10.10.1.2 node2 <none> <none> kube-multus-ds-amd64-zrxcs 1/1 Running 0 6h39m 10.10.1.1 node1 <none> <none> kube-proxy-bq9xg 1/1 Running 2 6h40m 10.10.1.4 node4 <none> <none> kube-proxy-bs8br 1/1 Running 0 6h40m 10.10.1.3 node3 <none> <none> kube-proxy-fxs88 1/1 Running 0 6h40m 10.10.1.1 node1 <none> <none> kube-proxy-rts6t 1/1 Running 1 6h40m 10.10.1.5 node5 <none> <none> kube-proxy-vml29 1/1 Running 0 6h40m 10.10.1.2 node2 <none> <none> kube-scheduler-node1 1/1 Running 0 6h41m 10.10.1.1 node1 <none> <none> nginx-proxy-node2 1/1 Running 0 6h40m 10.10.1.2 node2 <none> <none> nginx-proxy-node3 1/1 Running 0 6h40m 10.10.1.3 node3 <none> <none> nginx-proxy-node4 1/1 Running 2 6h40m 10.10.1.4 node4 <none> <none> nginx-proxy-node5 1/1 Running 1 6h40m 10.10.1.5 node5 <none> <none> nodelocaldns-kdsg5 1/1 Running 2 6h39m 10.10.1.4 node4 <none> <none> nodelocaldns-mhh9g 1/1 Running 0 6h39m 10.10.1.2 node2 <none> <none> nodelocaldns-nbhnr 1/1 Running 0 6h39m 10.10.1.3 node3 <none> <none> nodelocaldns-nkj9h 1/1 Running 0 6h39m 10.10.1.1 node1 <none> <none> nodelocaldns-rfnqk 1/1 Running 1 6h39m 10.10.1.5 node5 <none> <none>
Installing the Whereabouts CNI
You can install this plugin with a daemon set, using the following commands:
# kubectl apply -f https://raw.githubusercontent.com/dougbtv/whereabouts/master/doc/daemonset-install.yaml # kubectl apply -f https://raw.githubusercontent.com/dougbtv/whereabouts/master/doc/whereabouts.cni.cncf.io_ippools.yaml
To ensure the plugin is installed correctly, run the following command:
# kubectl get pods -A | grep whereabouts kube-system whereabouts-74nwr 1/1 Running 0 6h4m kube-system whereabouts-7pq2l 1/1 Running 0 6h4m kube-system whereabouts-gbpht 1/1 Running 0 6h4m kube-system whereabouts-slbnj 1/1 Running 0 6h4m kube-system whereabouts-tw7dc 1/1 Running 0 6h4m
Deploying the SRIOV Device Plugin and CNI
Prepare the following files and apply them:
# vi configMap.yaml apiVersion: v1 kind: ConfigMap metadata: name: sriovdp-config namespace: kube-system data: config.json: | { "resourceList": [ { "resourceName": "sriov_rdma", "resourcePrefix": "nvidia.com", "selectors": { "vendors": ["15b3"], "pfNames": ["enp197s0f0"], "isRdma": true } } ] }
# vi sriovdp-daemonset.yaml --- apiVersion: v1 kind: ServiceAccount metadata: name: sriov-device-plugin namespace: kube-system --- apiVersion: apps/v1 kind: DaemonSet metadata: name: kube-sriov-device-plugin-amd64 namespace: kube-system labels: tier: node app: sriovdp spec: selector: matchLabels: name: sriov-device-plugin template: metadata: labels: name: sriov-device-plugin tier: node app: sriovdp spec: hostNetwork: true nodeSelector: beta.kubernetes.io/arch: amd64 serviceAccountName: sriov-device-plugin containers: - name: kube-sriovdp image: docker.io/nfvpe/sriov-device-plugin:v3.3 imagePullPolicy: IfNotPresent args: - --log-dir=sriovdp - --log-level=10 securityContext: privileged: true resources: requests: cpu: "250m" memory: "40Mi" limits: cpu: 1 memory: "200Mi" volumeMounts: - name: devicesock mountPath: /var/lib/kubelet/ readOnly: false - name: log mountPath: /var/log - name: config-volume mountPath: /etc/pcidp - name: device-info mountPath: /var/run/k8s.cni.cncf.io/devinfo/dp volumes: - name: devicesock hostPath: path: /var/lib/kubelet/ - name: log hostPath: path: /var/log - name: device-info hostPath: path: /var/run/k8s.cni.cncf.io/devinfo/dp type: DirectoryOrCreate - name: config-volume configMap: name: sriovdp-config items: - key: config.json path: config.json
# vi sriov-cni-daemonset.yaml --- apiVersion: apps/v1 kind: DaemonSet metadata: name: kube-sriov-cni-ds-amd64 namespace: kube-system labels: tier: node app: sriov-cni spec: selector: matchLabels: name: sriov-cni template: metadata: labels: name: sriov-cni tier: node app: sriov-cni spec: nodeSelector: beta.kubernetes.io/arch: amd64 containers: - name: kube-sriov-cni image: nfvpe/sriov-cni:v2.3 imagePullPolicy: IfNotPresent securityContext: allowPrivilegeEscalation: false privileged: false readOnlyRootFilesystem: true capabilities: drop: - ALL resources: requests: cpu: "100m" memory: "50Mi" limits: cpu: "100m" memory: "50Mi" volumeMounts: - name: cnibin mountPath: /host/opt/cni/bin volumes: - name: cnibin hostPath: path: /opt/cni/bin
# kubectl apply -f configMap.yaml # kubectl apply -f sriovdp-daemonset.yaml # kubectl apply -f sriov-cni-daemonset.yaml
Deploying the RDMA CNI
The RDMA CNI enables namespace isolation for the virtual functions.
Deploy the RDMA CNI using the following YAML file:
# vi rdma-cni-daemonset.yaml --- apiVersion: apps/v1 kind: DaemonSet metadata: name: kube-rdma-cni-ds namespace: kube-system labels: tier: node app: rdma-cni name: rdma-cni spec: selector: matchLabels: name: rdma-cni updateStrategy: type: RollingUpdate template: metadata: labels: tier: node app: rdma-cni name: rdma-cni spec: hostNetwork: true containers: - name: rdma-cni image: mellanox/rdma-cni imagePullPolicy: IfNotPresent securityContext: privileged: true resources: requests: cpu: "100m" memory: "50Mi" limits: cpu: "100m" memory: "50Mi" volumeMounts: - name: cnibin mountPath: /host/opt/cni/bin volumes: - name: cnibin hostPath: path: /opt/cni/bin
# kubectl apply -f rdma-cni-daemonset.yaml
Applying Network Attachment Definitions
Apply the following YAML file to configure the network attachment for the pods:
# vi netattdef.yaml apiVersion: k8s.cni.cncf.io/v1 kind: NetworkAttachmentDefinition metadata: annotations: k8s.v1.cni.cncf.io/resourceName: nvidia.com/sriov_rdma name: sriov20 namespace: default spec: config: |- { "cniVersion": "0.3.1", "name": "sriov-rdma", "plugins": [ { "type": "sriov", "vlan": 20, "spoofchk": "off", "vlanQoS": 0, "ipam": { "datastore": "kubernetes", "kubernetes": { "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"}, "log_file": "/tmp/whereabouts.log", "log_level": "debug", "type": "whereabouts", "range": "192.168.20.0/24" } }, { "type": "rdma" }, { "mtu": 8950, "type": "tuning" } ] }
# kubectl apply -f netattdef.yaml
Creating a Test Deployment
Create a test daemon set using the following YAML. It will create a pod on every node that we can use to test RDMA connectivity and performance over the high-speed network.
Please notice that it adds an annotation referencing the required network ("sriov20") and has resource requests for the sriov virtual function resource ("nvidia.com/sriov_rdma").
Container image specified below should include NVIDIA user space drivers and perftest.
# vi simple-daemon.yaml apiVersion: apps/v1 kind: DaemonSet metadata: name: example-daemon labels: app: example-dae spec: selector: matchLabels: app: example-dae template: metadata: labels: app: example-dae annotations: k8s.v1.cni.cncf.io/networks: sriov20 spec: containers: - image: < container image > name: example-dae-pod securityContext: capabilities: add: [ "IPC_LOCK" ] resources: limits: memory: 16Gi cpu: 8 nvidia.com/sriov_rdma: '1' requests: memory: 16Gi cpu: 8 nvidia.com/sriov_rdma: '1' command: - sleep - inf
Apply the resource:
# kubectl apply -f simple-daemon.yaml
Validate daemon set is running successfully, you should see four pods running, one on each worker node:
# kubectl get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES example-daemon-2p7t2 1/1 Running 0 5h21m 10.233.92.3 node3 <none> <none> example-daemon-g8mcx 1/1 Running 0 5h21m 10.233.96.84 node2 <none> <none> example-daemon-kf56h 1/1 Running 0 5h21m 10.233.105.4 node4 <none> <none> example-daemon-zdmz8 1/1 Running 0 5h21m 10.233.70.5 node5 <none> <none>
Please refer to the appendix for running an RDMA performance test between the two pods in your test deployment.
Appendix
Performance Testing
Now that we have our test daemonset running, we can run a performance test to check the RDMA performance between the two pods running on two different worker nodes:
In one console window, connect to the master node and make sure to use the root account by using:
$ sudo su -
Connect to one of the pods in the daemonset (example):
# kubectl exec -it example-daemon-2p7t2 -- bash
From within the container, check its IP address on the high-speed network interface (net1):
# ip address show 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000 link/ipip 0.0.0.0 brd 0.0.0.0 4: eth0@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default link/ether 0e:e8:a8:d6:f7:3c brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.233.92.3/32 brd 10.233.92.3 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::ce8:a8ff:fed6:f73c/64 scope link valid_lft forever preferred_lft forever 26: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq state UP group default qlen 1000 link/ether ea:fe:9f:4a:28:8e brd ff:ff:ff:ff:ff:ff inet 192.168.20.88/24 brd 192.168.20.255 scope global net1 valid_lft forever preferred_lft forever inet6 fe80::e8fe:9fff:fe4a:288e/64 scope link
Then, start the ib_write_bw server side:
# ib_write_bw -a --report_gbits ************************************ * Waiting for client to connect... * ************************************
Using another console window, connect again to the master node and connect to the second pod in the deployment (example):
$ sudo su - # kubectl exec -it example-daemon-zdmz8 -- bash
From within the container, start the ib_write_bw client (using the IP address taken from the receiving container).
Please verify that the maximum bandwidth between containers reaches more than 190 Gb/s:
# ib_write_bw -a -F --report_gbits 192.168.20.88 --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : rocep197s0f0v0 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF TX depth : 128 CQ Moderation : 100 Mtu : 4096[B] Link type : Ethernet GID index : 2 Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0000 QPN 0x0122 PSN 0x3fdd80 RKey 0x02031e VAddr 0x007fb2a4731000 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:91 remote address: LID 0000 QPN 0x0164 PSN 0xa38679 RKey 0x03031f VAddr 0x007fe0387d1000 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:20:88 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 2 5000 0.041157 0.040923 2.557717 4 5000 0.089667 0.089600 2.799999 8 5000 0.18 0.18 2.795828 16 5000 0.36 0.36 2.799164 32 5000 0.72 0.72 2.801682 64 5000 1.08 1.07 2.089307 128 5000 2.15 2.08 2.031467 256 5000 4.30 4.30 2.097492 512 5000 8.56 8.56 2.089221 1024 5000 17.09 17.02 2.077250 2048 5000 33.89 33.83 2.065115 4096 5000 85.32 66.30 2.023458 8192 5000 163.84 136.83 2.087786 16384 5000 184.12 167.11 1.274956 32768 5000 190.44 180.83 0.689819 65536 5000 190.26 182.66 0.348395 131072 5000 193.71 179.10 0.170803 262144 5000 192.64 191.31 0.091222 524288 5000 192.62 191.29 0.045608 1048576 5000 192.82 192.75 0.022977 2097152 5000 192.38 192.22 0.011457 4194304 5000 192.80 192.78 0.005745 8388608 5000 192.67 192.65 0.002871 ---------------------------------------------------------------------------------------
Optimizing worker nodes for performance
In order to accommodate performance-sensitive applications, we can optimize the worker nodes for better performance by enabling pod scheduling on cores that are mapped to the same NUMA node of the NIC:
On the worker node, please make sure to use the root account by using:
$ sudo su -
Check to which NUMA node the NIC is wired:
# cat /sys/class/net/enp197s0f0/device/numa_node 1
In this example, the NIC is wired to NUMA node 1.
Check the NUMA nodes of the CPU and which cores are in NUMA node 1:
# lscpu | grep NUMA NUMA node(s): 2 NUMA node0 CPU(s): 0-23 NUMA node1 CPU(s): 24-47
In this example case, the cores that are in NUMA node 1 are: 24-47.
Now we need to configure K8s on the worker node (kubelet):
- The "cpuManagerPolicy" attribute specifies the selected CPU manger policy (which can be either "none" or "static").
- The "reservedSystemCPUs" attribute lists the CPU cores that will not be used by K8S (will stay reserved for the Linux system).
- The "topologyManagerPolicy" attribute specifies the selected policy for the topology manager (which can be either "none", "best-effort", "restricted" or "single-numa-node").
We will reserve some cores for the system, and make sure they belong to NUMA 0 (for our case):
# vi /etc/kubernetes/kubelet-config.yaml ... cpuManagerPolicy: static cpuManagerReconcilePeriod: 10s reservedSystemCPUs: "0,1,2,3" topologyManagerPolicy: single-numa-node featureGates: CPUManager: true TopologyManager: true ...
When changing reservedSystemCPUs or cpuManagerPolicy, the file: /var/lib/kubelet/cpu_manager_state should be deleted and kubelet service should be restarted:
# rm /var/lib/kubelet/cpu_manager_state # service kubelet restart
Validating the fabric
To validate the fabric, we will need to assign IP addresses to the servers. Each stretched VLAN acts as a local subnet to all the servers connected to it so all the servers connected to the same VLAN must have IP addresses in the same subnet.
Then we can check that we can ping between the servers.
We can also validate on the switches:
1) That the IP addresses of the VTEPs were successfully propagated by BGP to all the leaf switches.
Please repeat the following command on the leafs:
cumulus@leaf1a:mgmt:~$ net show route show ip route ============= Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP, T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR, f - OpenFabric, > - selected route, * - FIB route, q - queued, r - rejected, b - backup t - trapped, o - offload failure S>* 0.0.0.0/0 [1/0] via 10.1.0.254, vlan1, weight 1, 00:01:09 B>* 10.0.0.1/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:30 B>* 10.0.0.2/32 [20/0] via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29 C>* 10.0.0.101/32 is directly connected, lo, 5d16h51m B>* 10.0.0.102/32 [200/0] via fe80::1e34:daff:feb4:620, peerlink.4094, weight 1, 00:01:18 B>* 10.0.0.103/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:29 * via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29 B>* 10.0.0.104/32 [20/0] via fe80::1e34:daff:feb3:ff70, swp13, weight 1, 00:01:29 * via fe80::1e34:daff:feb4:70, swp14, weight 1, 00:01:29 C>* 10.0.1.1/32 is directly connected, lo, 00:01:44 C * 10.1.0.0/24 [0/1024] is directly connected, vlan1-v0, 00:01:43 C>* 10.1.0.0/24 is directly connected, vlan1, 00:01:43 C * 10.10.0.0/16 [0/1024] is directly connected, vlan10-v0, 00:01:43 C>* 10.10.0.0/16 is directly connected, vlan10, 00:01:43 show ipv6 route =============== Codes: K - kernel route, C - connected, S - static, R - RIPng, O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR, f - OpenFabric, > - selected route, * - FIB route, q - queued, r - rejected, b - backup t - trapped, o - offload failure C * fe80::/64 is directly connected, peerlink.4094, 00:01:20 C * fe80::/64 is directly connected, swp14, 00:01:30 C * fe80::/64 is directly connected, swp13, 00:01:31 C * fe80::/64 is directly connected, vlan10-v0, 00:01:43 C * fe80::/64 is directly connected, vlan1-v0, 00:01:43 C * fe80::/64 is directly connected, vlan20, 00:01:43 C * fe80::/64 is directly connected, vlan10, 00:01:43 C * fe80::/64 is directly connected, vlan1, 00:01:43 C>* fe80::/64 is directly connected, bridge, 00:01:43
2) That the ARP entries were successfully propagated by EVPN (best observed on the spine).
Please repeat the following command on the spines:
cumulus@spine1:mgmt:~$ net show bgp evpn route type macip
BGP table version is 917, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]

   Network   Next Hop   Metric LocPrf Weight Path   Extended Community
Route Distinguisher: 10.0.0.101:2
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]   10.0.1.1   0 65101 i   RT:65101:20 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[128]:[fe80::1e34:daff:feb4:920]   10.0.1.1   0 65101 i   RT:65101:20 ET:8 Default Gateway ND:Router Flag
Route Distinguisher: 10.0.0.101:3
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[32]:[10.10.0.1]   10.0.1.1   0 65101 i   RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[128]:[fe80::200:ff:fe00:10]   10.0.1.1   0 65101 i   RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]   10.0.1.1   0 65101 i   RT:65101:10 ET:8
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]:[32]:[10.10.0.250]   10.0.1.1   0 65101 i   RT:65101:10 ET:8
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]   10.0.1.1   0 65101 i   RT:65101:10 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[32]:[10.10.0.2]   10.0.1.1   0 65101 i   RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]:[128]:[fe80::1e34:daff:feb4:920]   10.0.1.1   0 65101 i   RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]   10.0.1.1   0 65101 i   RT:65101:10 ET:8
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]:[32]:[10.10.1.1]   10.0.1.1   0 65101 i   RT:65101:10 ET:8
Route Distinguisher: 10.0.0.102:2
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[32]:[10.10.0.1]   10.0.1.1   0 65101 i   RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[00:00:00:00:00:10]:[128]:[fe80::200:ff:fe00:10]   10.0.1.1   0 65101 i   RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]   10.0.1.1   0 65101 i   RT:65101:10 ET:8
*> [2]:[0]:[48]:[12:a3:e7:7f:18:c1]:[32]:[10.10.0.250]   10.0.1.1   0 65101 i   RT:65101:10 ET:8
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[32]:[10.10.0.3]   10.0.1.1   0 65101 i   RT:65101:10 ET:8 Default Gateway
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[128]:[fe80::1e34:daff:feb4:620]   10.0.1.1   0 65101 i   RT:65101:10 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]   10.0.1.1   0 65101 i   RT:65101:10 ET:8 MM:0, sticky MAC
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]   10.0.1.1   0 65101 i   RT:65101:10 ET:8
*> [2]:[0]:[48]:[6a:1f:17:28:21:9b]:[32]:[10.10.1.1]   10.0.1.1   0 65101 i   RT:65101:10 ET:8
Route Distinguisher: 10.0.0.102:3
*> [2]:[0]:[48]:[1c:34:da:b4:06:20]:[128]:[fe80::1e34:daff:feb4:620]   10.0.1.1   0 65101 i   RT:65101:20 ET:8 Default Gateway ND:Router Flag
*> [2]:[0]:[48]:[1c:34:da:b4:09:20]   10.0.1.1   0 65101 i   RT:65101:20 ET:8 MM:0, sticky MAC
Route Distinguisher: 10.0.0.103:2
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]   10.0.0.103   0 65102 i   RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]   10.0.0.103   0 65102 i   RT:65102:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.2]   10.0.0.103   0 65102 i   RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.2]   10.0.0.103   0 65102 i   RT:65102:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.10]   10.0.0.103   0 65102 i   RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[32]:[10.10.1.10]   10.0.0.103   0 65102 i   RT:65102:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[128]:[fe80::ba59:9fff:fefa:878e]   10.0.0.103   0 65102 i   RT:65102:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:8e]:[128]:[fe80::ba59:9fff:fefa:878e]   10.0.0.103   0 65102 i   RT:65102:10 ET:8
Route Distinguisher: 10.0.0.103:3
*  [2]:[0]:[48]:[5e:60:de:10:be:74]   10.0.0.103   0 65102 i   RT:65102:20 ET:8
*> [2]:[0]:[48]:[5e:60:de:10:be:74]   10.0.0.103   0 65102 i   RT:65102:20 ET:8
*  [2]:[0]:[48]:[5e:60:de:10:be:74]:[128]:[fe80::5c60:deff:fe10:be74]   10.0.0.103   0 65102 i   RT:65102:20 ET:8
*> [2]:[0]:[48]:[5e:60:de:10:be:74]:[128]:[fe80::5c60:deff:fe10:be74]   10.0.0.103   0 65102 i   RT:65102:20 ET:8
Route Distinguisher: 10.0.0.104:2
*  [2]:[0]:[48]:[06:e0:ca:50:81:a3]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
*  [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[32]:[192.168.20.91]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[32]:[192.168.20.91]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
*  [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[128]:[fe80::4e0:caff:fe50:81a3]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
*> [2]:[0]:[48]:[06:e0:ca:50:81:a3]:[128]:[fe80::4e0:caff:fe50:81a3]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
*  [2]:[0]:[48]:[32:98:4b:9b:91:03]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
*  [2]:[0]:[48]:[32:98:4b:9b:91:03]:[32]:[192.168.20.92]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]:[32]:[192.168.20.92]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
*  [2]:[0]:[48]:[32:98:4b:9b:91:03]:[128]:[fe80::3098:4bff:fe9b:9103]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
*> [2]:[0]:[48]:[32:98:4b:9b:91:03]:[128]:[fe80::3098:4bff:fe9b:9103]   10.0.0.104   0 65103 i   RT:65103:20 ET:8
Route Distinguisher: 10.0.0.104:3
*  [2]:[0]:[48]:[b8:59:9f:fa:87:6e]   10.0.0.104   0 65103 i   RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]   10.0.0.104   0 65103 i   RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[32]:[10.10.1.4]   10.0.0.104   0 65103 i   RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[32]:[10.10.1.4]   10.0.0.104   0 65103 i   RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[128]:[fe80::ba59:9fff:fefa:876e]   10.0.0.104   0 65103 i   RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:6e]:[128]:[fe80::ba59:9fff:fefa:876e]   10.0.0.104   0 65103 i   RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:be]   10.0.0.104   0 65103 i   RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]   10.0.0.104   0 65103 i   RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[32]:[10.10.1.5]   10.0.0.104   0 65103 i   RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[32]:[10.10.1.5]   10.0.0.104   0 65103 i   RT:65103:10 ET:8
*  [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[128]:[fe80::ba59:9fff:fefa:87be]   10.0.0.104   0 65103 i   RT:65103:10 ET:8
*> [2]:[0]:[48]:[b8:59:9f:fa:87:be]:[128]:[fe80::ba59:9fff:fefa:87be]   10.0.0.104   0 65103 i   RT:65103:10 ET:8

Displayed 40 prefixes (58 paths) (of requested type)
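The type-2 table grows quickly as nodes and pods are added. As an optional convenience (not part of the verification steps above), the output can be narrowed to a single host with a plain grep, or the same table can be queried directly from the FRR shell that Cumulus Linux ships with. The MAC address below is simply one of the entries from the output above and is used only as an example:

# Filter the type-2 routes for a single host MAC
cumulus@spine1:mgmt:~$ net show bgp evpn route type macip | grep "b8:59:9f:fa:87:8e"

# Equivalent query through FRR's vtysh
cumulus@spine1:mgmt:~$ sudo vtysh -c "show bgp l2vpn evpn route type macip"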
3) That MLAG is functioning properly on the infrastructure rack leaf switches:
cumulus@leaf1a:mgmt:~$ net show clag
The peer is alive
     Our Priority, ID, and Role: 1000 1c:34:da:b4:09:20 primary
    Peer Priority, ID, and Role: 32768 1c:34:da:b4:06:20 secondary
          Peer Interface and IP: peerlink.4094 fe80::1e34:daff:feb4:620 (linklocal)
               VxLAN Anycast IP: 10.0.1.1
                      Backup IP: 10.0.0.102 (active)
                     System MAC: 44:38:39:ff:ff:aa

CLAG Interfaces
Our Interface      Peer Interface     CLAG Id   Conflicts              Proto-Down Reason
----------------   ----------------   -------   --------------------   -----------------
           bond1   bond1              1         -                      -
           bond2   bond2              2         -                      -
           bond3   bond3              3         -                      -
           vni10   vni10              -         -                      -
           vni20   vni20              -         -                      -
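The items to look for in this output are the first line ("The peer is alive"), a VxLAN Anycast IP that is identical on both MLAG peers, and empty Conflicts and Proto-Down Reason columns for every CLAG interface. As an optional cross-check (not part of the original procedure), the same checks can be repeated on the peer leaf; the leaf1b hostname below is assumed to be the MLAG peer of leaf1a:

# First line should again read "The peer is alive" (leaf1b hostname is an assumption)
cumulus@leaf1b:mgmt:~$ net show clag | head -1

# The anycast VTEP address should match the one reported by leaf1a
cumulus@leaf1b:mgmt:~$ net show clag | grep "VxLAN Anycast IP"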
Done!
Authors
Vitaliy Razinkov
Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for the research and design of complex Kubernetes/OpenShift and Microsoft-based solutions. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA-accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.
Shachar Dor
Shachar Dor joined the Solutions Lab team after working for more than ten years as a software architect at NVIDIA Networking (previously Mellanox Technologies), where he was responsible for the architecture of network management products and solutions. Shachar's focus is on networking technologies, especially fabric bring-up, configuration, monitoring, and life-cycle management. He brings a strong background in software architecture, design, and programming from his work on multiple projects and technologies, both at the company and earlier in his career.