Created on Nov 16, 2020 by Boris Kovalev, Vitaliy Razinkov
Scope
This Reference Deployment Guide (RDG) explains how to build a high-performing Kubernetes (K8s) cluster capable of hosting the most demanding distributed workloads, running on top of NVIDIA GPUs and an NVIDIA end-to-end InfiniBand fabric.
Abbreviations and Acronyms
| Term | Definition | Term | Definition |
|---|---|---|---|
| AOC | Active Optical Cable | IB | InfiniBand |
| AI | Artificial Intelligence | K8s | Kubernetes |
| CNI | Container Network Interface | ML | Machine Learning |
| CR | Custom Resource | MOFED | Mellanox OpenFabrics Enterprise Distribution |
| DAC | Direct Attach Copper cable | PF | Physical Function |
| DHCP | Dynamic Host Configuration Protocol | RDMA | Remote Direct Memory Access |
| EDR | Enhanced Data Rate - 100Gb/s | QSG | Quick Start Guide |
| GPU | Graphics Processing Unit | SR-IOV | Single Root Input/Output Virtualization |
| HDR | High Data Rate - 200Gb/s | VF | Virtual Function |
| HPC | High Performance Computing | | |
References
- NVIDIA T4 GPU
- NVIDIA OpenFabrics Enterprise Distribution for Linux (MLNX_OFED)
- What is Kubernetes?
- NVIDIA GPU Operator
- Whereabouts-CNI
- SR-IOV Network Operator
Introduction
Provisioning Machine Learning (ML) and High Performance Computing (HPC) cloud solutions can be a complicated task, and proper design as well as software and hardware component selection can become gating factors for a successful deployment.
This document will guide you through a complete solution cycle including design, component selection, technology overview and deployment steps.
The solution will be provisioned on top of GPU enabled servers over an NVIDIA end-to-end InfiniBand fabric.
The NVIDIA GPU and SR-IOV Network Operators make it possible to run GPU-accelerated and native RDMA workloads over the InfiniBand fabric, such as HPC, Big Data, ML, AI and other applications.
The following processes are described below:
- K8s cluster deployment by Kubespray over bare metal nodes with Ubuntu 20.04 OS.
- NVIDIA GPU Operator deployment.
- InfiniBand fabric configuration.
- POD deployment example.
This document covers a single Kubernetes controller deployment scenario.
For high-availability cluster deployment, please refer to https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md
Solution Architecture
Key Components and Technologies
- NVIDIA® T4 GPU
  The NVIDIA® T4 GPU is based on the NVIDIA Turing™ architecture and packaged in an energy-efficient, 70-watt, small PCIe form factor. T4 is optimized for mainstream computing environments and features multi-precision Turing Tensor Cores and RT Cores. Combined with accelerated containerized software stacks from NGC, T4 delivers revolutionary performance at scale to accelerate cloud workloads such as high-performance computing, deep learning training and inference, machine learning, data analytics, and graphics.
- NVIDIA MLNX-OS®
  NVIDIA MLNX-OS is the Mellanox InfiniBand/VPI switch operating system for data centers with storage, enterprise, high-performance, machine learning, Big Data computing and cloud fabrics.
- NVIDIA ConnectX InfiniBand adapters
  NVIDIA® ConnectX® InfiniBand smart adapters with acceleration engines deliver best-in-class network performance and efficiency, enabling low latency, high throughput and high message rates for applications at SDR, QDR, DDR, FDR, EDR and HDR InfiniBand speeds.
- NVIDIA smart InfiniBand switch systems
  NVIDIA smart InfiniBand switch systems deliver the highest performance and port density for high performance computing (HPC), AI, Web 2.0, Big Data, cloud, and enterprise data centers. Support for 36 to 800-port configurations at up to 200Gb/s per port allows compute clusters and converged data centers to operate at any scale, reducing operational costs and infrastructure complexity.
- NVIDIA LinkX® InfiniBand Cables
  NVIDIA Mellanox LinkX cables and transceivers are designed to maximize the performance of High Performance Computing networks, which require high-bandwidth, low-latency connections between compute nodes and switch nodes. DACs are available in lengths of up to 7m; AOCs are available in the lowest-cost <30m lengths on OM2 fiber and up to 100m on OM3/OM4 multimode fiber. DACs and AOCs support data rates of QDR (40G), FDR10 (40G), FDR (56G), EDR (100G), HDR100 (100G) and HDR (200G).
- Kubernetes
  Kubernetes (K8s) is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.
- Kubespray (from Kubernetes.io)
  Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes cluster configuration management tasks, and provides:
  - A highly available cluster
  - Composable attributes
  - Support for most popular Linux distributions
- NVIDIA GPU Operator
  The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring and others.
- RDMA
  Remote Direct Memory Access (RDMA) is a technology that allows computers in a network to exchange data without involving the processor, cache or operating system of either computer. Like locally based Direct Memory Access (DMA), RDMA improves throughput and performance and frees up compute resources.
- SR-IOV Network Operator
  The SR-IOV Network Operator is designed to help the user provision and configure the SR-IOV CNI plugin and device plugin in OpenShift and Kubernetes clusters.
Logical Design
The logical design includes the following layers:
- One compute layer:
- Deployment node
- K8s Master node
- 2 x K8s Worker nodes, each with two NVIDIA T4 GPUs and one Mellanox ConnectX adapter
- Two separate networking layers:
- Management network
- High-speed InfiniBand (IB) fabric
Fabric Design
In this RDG we will describe a small scale solution with only one switch.
Simple Setup with One Switch
In the single-switch case, an NVIDIA QM8700 InfiniBand HDR switch system can connect up to 40 servers using NVIDIA LinkX HDR 200Gb/s QSFP56 DAC cables.
Scaled Setup for InfiniBand Fabric
For assistance in designing the scaled InfiniBand topology, use the NVIDIA InfiniBand Topology Generator, an online cluster configuration tool that offers flexible cluster configurations and sizes.
For a scaled setup we recommend using NVIDIA Unified Fabric Manager (UFM®).
Bill of Materials (BoM)
The following hardware setup is utilized in the distributed K8s configuration described in this guide:
The above table does not contain Kubernetes Management network connectivity components.
Deployment and Configuration
The deployment is validated using Ubuntu 20.04 OS and Kubespray v2.14.2.
Wiring
The first port of each NVIDIA HCA on each Worker node is wired to the NVIDIA switch using NVIDIA LinkX HDR 200Gb/s QSFP56 DAC cables.
Network
Prerequisites
- InfiniBand fabric
  - Switch: NVIDIA QM8700
  - Switch OS: NVIDIA MLNX-OS®
- Management Network
  DHCP and DNS services are part of the IT infrastructure. The component installation and configuration are not covered in this guide.
Network Configuration
Below are the server names with their relevant network configurations.
| Server/Switch | Name | High-speed network (HDR InfiniBand) | Management network (1 GigE) |
|---|---|---|---|
| Master Node | node1 | - | eno0: DHCP |
| Worker Node | node2 | ibs6f0: none | eno0: DHCP 192.168.1.10 |
| Worker Node | node3 | ibs6f0: none | eno0: DHCP |
| Deployment Node | sl-depl-node | - | eno0: DHCP 192.168.1.43 |
| High-speed switch | swx-mld-ib67 | none | mgmt0: DHCP 192.168.1.38 |
InfiniBand Fabric Configuration
Below is a list of recommendations and prerequisites that are important for the configuration process:
- Refer to the MLNX-OS User Manual to become familiar with the switch software (located at enterprise-support.nvidia.com/s/)
- Upgrade the switch software to the latest MLNX-OS version
- InfiniBand Subnet Manager (SM) is required to configure InfiniBand fabric properly
There are three ways to run an InfiniBand SM in the InfiniBand fabric:
- Start the SM on one or more managed switches. This is a very convenient and quick option which allows for easier InfiniBand "plug & play".
- Run the OpenSM daemon on one or more servers by executing the /etc/init.d/opensmd command (see the sketch after this list). Running the SM on a server is recommended for fabrics of 648 nodes or more.
- Use Unified Fabric Management (UFM®).
  UFM is a powerful platform for managing scale-out computing environments. It eliminates the complexity of fabric management, provides deep visibility into traffic, and optimizes fabric performance.
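Below is a minimal sketch of the second method (running OpenSM on a server), assuming the opensm package is installed on a fabric-connected host; the exact package and service names depend on whether the Ubuntu inbox package or the MLNX_OFED bundle is used:
```
# Hypothetical example: run OpenSM on a fabric-connected server
$ sudo apt-get install -y opensm        # Ubuntu inbox package; MLNX_OFED ships its own opensm
$ sudo systemctl enable --now opensm    # or: sudo /etc/init.d/opensmd start (MLNX_OFED service name)
$ sudo systemctl status opensm          # verify that the subnet manager is running
```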
In this guide, we will launch the InfiniBand SM on the InfiniBand switch (Method num. 1). Below are the configuration steps for the chosen method.
To enable the SM on one of the managed switches:
Login to the switch and enter the next configuration commands (swx-mld-ib67 is our switch name):
IB switch configuration
```
Mellanox MLNX-OS Switch Management
switch login: admin
Password:
swx-mld-ib67 [standalone: master] > enable
swx-mld-ib67 [standalone: master] # configure terminal
swx-mld-ib67 [standalone: master] (config) # ib smnode swx-mld-ib67 enable
swx-mld-ib67 [standalone: master] (config) # ib smnode swx-mld-ib67 sm-priority 0
swx-mld-ib67 [standalone: master] (config) # ib sm virt enable
swx-mld-ib67 [standalone: master] (config) # write memory
swx-mld-ib67 [standalone: master] (config) # reload
```
Once the switch reboots, check the switch configuration. It should look like the following:
Switch config example Expand sourceMellanox MLNX-OS Switch Management switch login: admin Password: swx-mld-ib67 [standalone: master] > enable swx-mld-ib67 [standalone: master] # configure terminal swx-mld-ib67 [standalone: master] (config) # show running-config ## ## Running database "initial" ## Generated at 2020/12/16 17:40:41 +0000 ## Hostname: swx-mld-ib67 ## Product release: 3.9.1600 ## ## ## Running-config temporary prefix mode setting ## no cli default prefix-modes enable ## ## Subnet Manager configuration ## ib sm virt enable ## ## Other IP configuration ## hostname swx-mld-ib67 ## ## Other IPv6 configuration ## no ipv6 enable ## ## Local user account configuration ## username admin password 7 $6$6GZ8Q0RF$FZW9pc23JJkwwOJTq85xZe1BJgqQV/m6APQNPkagZlTEUgKMWLr5X3Jq2hsUyB.K5nrGdDNUaSLiK2xupnIJo1 username monitor password 7 $6$z1.r4Kl7$TIwaNf7uXNxZ9UdGdUpOO9kVug0shRqGtu75s3dSrY/wY1v1mGjrqQLNPHvHYh5HAhVuUz5wKzD6H/beYeEqL. ## ## AAA remote server configuration ## # ldap bind-password ******** # radius-server key ******** # tacacs-server key ******** ## ## Network management configuration ## # web proxy auth basic password ******** ## ## X.509 certificates configuration ## # # Certificate name system-self-signed, ID 12d0989d8623825b71bc25f9bc02de813fc9fe2a # (public-cert config omitted since private-key config is hidden) ## ## IB nodename to GUID mapping ## ib smnode swx-mld-ib67 create ib smnode swx-mld-ib67 enable ib smnode swx-mld-ib67 sm-priority 0 ## ## Persistent prefix mode setting ## cli default prefix-modes enable
Nodes Configuration
General Prerequisites:
- Hardware
  All the K8s Worker nodes have the same hardware specification (see the BoM for details).
- Host BIOS
  Verify that you are using an SR-IOV supported server platform for the K8s Worker nodes, and review the BIOS settings in the hardware documentation to enable SR-IOV in the BIOS.
- Host OS
  The Ubuntu Server 20.04 operating system should be installed on all servers with the OpenSSH server packages.
- Experience with Kubernetes
  Make sure to familiarize yourself with the Kubernetes cluster architecture.
Host OS Prerequisites
Make sure the Ubuntu Server 20.04 operating system is installed on all servers with the OpenSSH server packages, and create a non-root user account with passwordless sudo privileges.
Update the Ubuntu software packages by running the following commands:
```
$ sudo apt-get update
$ sudo apt-get upgrade -y
$ sudo reboot
```
Non-root User Account Prerequisites
In this solution, the following lines were added to the end of /etc/sudoers:
```
$ sudo vim /etc/sudoers
#includedir /etc/sudoers.d
#K8s cluster deployment user with sudo privileges without password
user ALL=(ALL) NOPASSWD:ALL
```
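If the deployment user does not exist yet, it can be created first; below is a minimal sketch (the user name "user" simply follows this guide and should be adapted to your environment):
```
# Run on every node (hypothetical; the user name "user" matches this guide)
$ sudo adduser user            # interactive; the password set here is used later by ssh-copy-id
$ sudo usermod -aG sudo user   # optional, in addition to the NOPASSWD sudoers entry above
```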
Software Prerequisites
Disable/blacklist the Nouveau NVIDIA driver on the Worker node servers by running the commands below (or pasting each line into the terminal):
Server Console
```
$ sudo su -
# lsmod |grep nouv
# bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
# bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
# update-initramfs -u
# reboot
$ lsmod |grep nouv
```
Install NVIDIA MLNX_OFED and upgrade the adapter firmware on the Worker node servers by running the commands below (or pasting each line into the terminal):
Server Console
```
$ sudo su -
# apt-get install rdma-core
# wget -qO - https://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox | sudo apt-key add -
# curl https://linux.mellanox.com/public/repo/mlnx_ofed/latest/ubuntu20.04/mellanox_mlnx_ofed.list --output /etc/apt/sources.list.d/mellanox_mlnx_ofed.list
# apt update
# apt install -y mlnx-ofed-kernel-only
# wget http://www.mellanox.com/downloads/firmware/mlxup/4.15.2/SFX/linux_x64/mlxup
# chmod +x mlxup
# ./mlxup --online -u
# reboot
```
Set up the IB port link on the Worker node servers.
Server Console
```
root@node2:~# ibdev2netdev
...
mlx5_2 port 1 ==> ibs6f0 (Down)
mlx5_3 port 1 ==> ibs6f1 (Down)
...
root@node2:~# vim /etc/netplan/00-installer-config.yaml
# This is the network config written by 'subiquity'
network:
  ethernets:
    ibs6f0: {}
    eno1:
      dhcp4: true
  version: 2
root@node2:~# netplan apply
root@node2:~# ibdev2netdev
...
mlx5_2 port 1 ==> ibs6f0 (Up)
mlx5_3 port 1 ==> ibs6f1 (Down)
...
```
Set netns to exclusive mode to allow network namespace isolation for RDMA workloads on the Worker node servers.
Server Console
```
root@node2:~# vim /etc/modprobe.d/ib_core.conf
# Set netns to exclusive mode for namespace isolation
options ib_core netns_mode=0
root@node2:~# update-initramfs -u
root@node2:~# reboot
```
Check netns mode and InfiniBand devices on the Worker node servers.
Server Console Expand source$ rdma system netns exclusive $ ls -la /dev/infiniband/ total 0 drwxr-xr-x 2 root root 300 Jan 26 16:26 . drwxr-xr-x 22 root root 5100 Jan 26 16:55 .. crw------- 1 root root 231, 64 Jan 26 16:26 issm0 crw------- 1 root root 231, 65 Jan 26 16:26 issm1 crw------- 1 root root 231, 66 Jan 26 16:26 issm2 crw------- 1 root root 231, 67 Jan 26 16:26 issm3 crw-rw-rw- 1 root root 10, 57 Jan 26 16:26 rdma_cm crw------- 1 root root 231, 0 Jan 26 16:26 umad0 crw------- 1 root root 231, 1 Jan 26 16:26 umad1 crw------- 1 root root 231, 2 Jan 26 16:26 umad2 crw------- 1 root root 231, 3 Jan 26 16:26 umad3 crw-rw-rw- 1 root root 231, 192 Jan 26 16:26 uverbs0 crw-rw-rw- 1 root root 231, 193 Jan 26 16:26 uverbs1 crw-rw-rw- 1 root root 231, 194 Jan 26 16:26 uverbs2 crw-rw-rw- 1 root root 231, 195 Jan 26 16:26 uverbs3 $ ls -la /sys/class/infiniband total 0 drwxr-xr-x 2 root root 0 Jan 11 13:52 . drwxr-xr-x 82 root root 0 Jan 11 13:52 .. lrwxrwxrwx 1 root root 0 Jan 11 13:53 mlx5_0 -> ../../devices/pci0000:11/0000:11:02.0/0000:13:00.0/infiniband/mlx5_0 lrwxrwxrwx 1 root root 0 Jan 11 13:53 mlx5_1 -> ../../devices/pci0000:11/0000:11:02.0/0000:13:00.1/infiniband/mlx5_1 lrwxrwxrwx 1 root root 0 Jan 11 13:52 mlx5_2 -> ../../devices/pci0000:ae/0000:ae:00.0/0000:af:00.0/infiniband/mlx5_2 lrwxrwxrwx 1 root root 0 Jan 11 13:52 mlx5_3 -> ../../devices/pci0000:ae/0000:ae:00.0/0000:af:00.1/infiniband/mlx5_3
All Worker nodes must have the same configuration and the same PCIe card placement.
Check that the IB interface is up.
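As an additional check, the port state can be inspected with the InfiniBand diagnostics that ship with MLNX_OFED (a sketch; the device name mlx5_2 corresponds to ibs6f0 in this setup):
```
# Run on a Worker node
$ ibstat mlx5_2
```
Once the subnet manager has configured the fabric, the output should show "State: Active" and "Physical state: LinkUp".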
K8s Cluster Deployment and Configuration
The Kubernetes cluster in this solution will be installed using Kubespray with a non-root user account from a Deployment node.
SSH Private Key and SSH Passwordless Login
Log in to the Deployment node as the deployment user (in this case, user) and create an SSH key pair for configuring passwordless authentication by running the following commands:
Deployment Node Console$ ssh-keygen Generating public/private rsa key pair. Enter file in which to save the key (/home/user/.ssh/id_rsa): Created directory '/home/user/.ssh'. Enter passphrase (empty for no passphrase): Enter same passphrase again: Your identification has been saved in /home/user/.ssh/id_rsa. Your public key has been saved in /home/user/.ssh/id_rsa.pub. The key fingerprint is: SHA256:PaZkvxV4K/h8q32zPWdZhG1VS0DSisAlehXVuiseLgA user@sl-depl-node The key's randomart image is: +---[RSA 2048]----+ | ...+oo+o..o| | .oo .o. o| | . .. . o +.| | E . o + . +| | . S = + o | | . o = + o .| | . o.o + o| | ..+.*. o+o| | oo*ooo.++| +----[SHA256]-----+
Copy your SSH key (such as ~/.ssh/id_rsa) to all nodes in your deployment by running the following command; ssh-copy-id installs the corresponding public key (~/.ssh/id_rsa.pub) on the target node. Sample:
Deployment Node ConsoleSample: $ ssh-copy-id -i ~/.ssh/id_rsa user@192.168.1.40 /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/user/.ssh/id_rsa.pub" The authenticity of host '192.168.1.40 (192.168.1.40)' can't be established. ECDSA key fingerprint is SHA256:uyglY5g0CgPNGDm+XKuSkFAbx0RLaPijpktANgXRlD8. Are you sure you want to continue connecting (yes/no)? yes /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys user@192.168.1.40's password: Number of key(s) added: 1 Now try logging into the machine, with: "ssh 'user@192.168.1.40'" and check to make sure that only the key(s) you wanted were added.
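To avoid repeating the command for every node, a simple loop can be used (a sketch; the IP addresses match this deployment and should be adjusted to yours):
```
# Copy the key to the Master and both Worker nodes
$ for ip in 192.168.1.40 192.168.1.10 192.168.1.11; do ssh-copy-id -i ~/.ssh/id_rsa user@${ip}; done
```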
Check SSH connectivity to all nodes in your deployment by running the following command:
Deployment Node ConsoleSample: $ ssh user@192.168.1.40 Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-52-generic x86_64) * Documentation: https://help.ubuntu.com * Management: https://landscape.canonical.com * Support: https://ubuntu.com/advantage System information as of Mon Jan 11 17:23:23 IST 2021 System load: 0.0 Processes: 216 Usage of /: 6.5% of 68.40GB Users logged in: 1 Memory usage: 2% IP address for ens160: 192.168.1.40 Swap usage: 0% * Introducing self-healing high availability clusters in MicroK8s. Simple, hardened, Kubernetes for production, from RaspberryPi to DC. https://microk8s.io/high-availability 8 packages can be updated. 8 of these updates are security updates. To see these additional updates run: apt list --upgradable New release '20.04.1 LTS' available. Run 'do-release-upgrade' to upgrade to it. Your Hardware Enablement Stack (HWE) is supported until April 2023. Last login: Mon Jan 11 17:04:04 2021 from 192.168.1.43 user@node1:~$ exit
Install dependencies for running Kubespray with Ansible on the Deployment server.
Deployment Node Console
```
$ cd ~
$ sudo apt -y install python3-pip jq
$ wget https://github.com/kubernetes-sigs/kubespray/archive/v2.14.2.tar.gz
$ tar -zxf v2.14.2.tar.gz
$ cd kubespray-2.14.2
$ sudo pip3 install -r requirements.txt
```
The default folder for subsequent commands is ~/kubespray-2.14.2.
Create a new cluster configuration.
Deployment Node Console
```
$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(192.168.1.40 192.168.1.10 192.168.1.11)
$ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
```
As a result, the inventory/mycluster/hosts.yaml file will be created.
Review and change the host configuration file - inventory/mycluster/hosts.yaml.
Below is an example for this deployment.
Deployment Node Console
```
$ sudo vim inventory/mycluster/hosts.yaml
all:
  hosts:
    node1:
      ansible_host: 192.168.1.40
      ip: 192.168.1.40
      access_ip: 192.168.1.40
    node2:
      ansible_host: 192.168.1.10
      ip: 192.168.1.10
      access_ip: 192.168.1.10
    node3:
      ansible_host: 192.168.1.11
      ip: 192.168.1.11
      access_ip: 192.168.1.11
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}
```
Review and change cluster installation parameters in the files:
> inventory/mycluster/group_vars/all/all.yml
> inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml
In inventory/mycluster/group_vars/all/all.yml, uncomment the following line so that metrics can receive data about the use of cluster resources:
Deployment Node Console
```
$ sudo vim inventory/mycluster/group_vars/all/all.yml
## The read-only port for the Kubelet to serve on with no authentication/authorization. Uncomment to enable.
kube_read_only_port: 10255
```
In inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml, set the default Kubernetes CNI by setting the desired kube_network_plugin value (default: calico), and enable multi-networking by setting kube_network_plugin_multus: true.
Deployment Node Console
```
$ sudo vim inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml
...
# Choose network plugin (cilium, calico, contiv, weave or flannel. Use cni for generic cni plugin)
# Can also be set to 'cloud', which lets the cloud provider setup appropriate routing
kube_network_plugin: calico

# Setting multi_networking to true will install Multus: https://github.com/intel/multus-cni
kube_network_plugin_multus: true
...
```
Deploy K8s Cluster by Kubespray Ansible Playbook
$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
An example of a successful completion of the playbook looks like this:
PLAY RECAP *************************************************************************************************************************************** localhost : ok=1 changed=0 unreachable=0 failed=0 node1 : ok=617 changed=101 unreachable=0 failed=0 node2 : ok=453 changed=58 unreachable=0 failed=0 node3 : ok=410 changed=53 unreachable=0 failed=0 Monday 30 November 2020 10:48:14 +0300 (0:00:00.265) 0:13:49.321 ********** =============================================================================== kubernetes/master : kubeadm | Initialize first master ------------------------------------------------------------------------------------ 55.94s kubernetes/kubeadm : Join to cluster ----------------------------------------------------------------------------------------------------- 37.65s kubernetes/master : Master | wait for kube-scheduler ------------------------------------------------------------------------------------- 21.97s download : download_container | Download image if required ------------------------------------------------------------------------------- 21.34s kubernetes-apps/ansible : Kubernetes Apps | Start Resources ------------------------------------------------------------------------------ 14.85s kubernetes/preinstall : Update package management cache (APT) ---------------------------------------------------------------------------- 12.49s download : download_file | Download item ------------------------------------------------------------------------------------------------- 11.45s etcd : Install | Copy etcdctl binary from docker container ------------------------------------------------------------------------------- 10.57s download : download_file | Download item -------------------------------------------------------------------------------------------------- 9.37s kubernetes/preinstall : Install packages requirements ------------------------------------------------------------------------------------- 9.18s etcd : wait for etcd up ------------------------------------------------------------------------------------------------------------------- 8.78s etcd : Configure | Check if etcd cluster is healthy --------------------------------------------------------------------------------------- 8.62s download : download_file | Download item -------------------------------------------------------------------------------------------------- 8.24s kubernetes-apps/network_plugin/multus : Multus | Start resources -------------------------------------------------------------------------- 7.32s download : download_container | Download image if required -------------------------------------------------------------------------------- 6.61s policy_controller/calico : Start of Calico kube controllers ------------------------------------------------------------------------------- 4.92s download : download_file | Download item -------------------------------------------------------------------------------------------------- 4.76s kubernetes-apps/cluster_roles : Apply workaround to allow all nodes with cert O=system:nodes to register ---------------------------------- 4.56s download : download_container | Download image if required -------------------------------------------------------------------------------- 4.48s download : download | Download files / images --------------------------------------------------------------------------------------------- 4.28s
Label the Worker nodes with the node-role.kubernetes.io/worker label by running the following commands on the K8s Master node:
```
# kubectl label nodes node2 node-role.kubernetes.io/worker=
# kubectl label nodes node3 node-role.kubernetes.io/worker=
```
K8s Deployment Verification
The Kubernetes cluster deployment can be verified from the root user account on the K8s Master node.
Below is an example output showing the K8s cluster deployment information, with the default Kubespray configuration using the Calico Kubernetes CNI plugin.
To ensure that the Kubernetes cluster is installed correctly, run the following commands:
root@node1:~# kubectl get nodes -o wide NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME node1 Ready master 16d v1.19.2 192.168.1.40 <none> Ubuntu 18.04.5 LTS 5.4.0-52-generic docker://19.3.12 node2 Ready worker 16d v1.19.2 192.168.1.10 <none> Ubuntu 20.04.1 LTS 5.4.0-56-generic docker://19.3.12 node3 Ready worker 16d v1.19.2 192.168.1.11 <none> Ubuntu 20.04.1 LTS 5.4.0-56-generic docker://19.3.12 root@node1:~# kubectl get pod -n kube-system -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES calico-kube-controllers-b885f5f4-8cr8s 1/1 Running 0 23m 192.168.1.10 node2 <none> <none> calico-node-8bb6p 1/1 Running 1 24m 192.168.1.10 node2 <none> <none> calico-node-9hnd4 1/1 Running 0 24m 192.168.1.40 node1 <none> <none> calico-node-qm7z9 1/1 Running 1 24m 192.168.1.11 node3 <none> <none> coredns-dff8fc7d-5n645 1/1 Running 0 30s 10.233.92.4 node3 <none> <none> coredns-dff8fc7d-6qqcc 1/1 Running 0 32s 10.233.96.1 node2 <none> <none> dns-autoscaler-66498f5c5f-vhz22 1/1 Running 0 23m 10.233.90.2 node1 <none> <none> kube-apiserver-node1 1/1 Running 0 25m 192.168.1.40 node1 <none> <none> kube-controller-manager-node1 1/1 Running 0 25m 192.168.1.40 node1 <none> <none> kube-multus-ds-amd64-cgz57 1/1 Running 0 50s 192.168.1.40 node1 <none> <none> kube-multus-ds-amd64-jwhwj 1/1 Running 0 50s 192.168.1.10 node2 <none> <none> kube-multus-ds-amd64-qj4dh 1/1 Running 0 50s 192.168.1.11 node3 <none> <none> kube-proxy-ddjjm 1/1 Running 0 24m 192.168.1.11 node3 <none> <none> kube-proxy-j4228 1/1 Running 0 24m 192.168.1.10 node2 <none> <none> kube-proxy-qsb2g 1/1 Running 0 25m 192.168.1.40 node1 <none> <none> kube-scheduler-node1 1/1 Running 0 25m 192.168.1.40 node1 <none> <none> kubernetes-dashboard-667c4c65f8-7xdxf 1/1 Running 0 23m 10.233.92.1 node3 <none> <none> kubernetes-metrics-scraper-54fbb4d595-6mtgd 1/1 Running 0 23m 10.233.92.2 node3 <none> <none> nginx-proxy-node2 1/1 Running 0 23m 192.168.1.10 node2 <none> <none> nginx-proxy-node3 1/1 Running 0 24m 192.168.1.11 node3 <none> <none> nodelocaldns-67s2w 1/1 Running 0 23m 192.168.1.10 node2 <none> <none> nodelocaldns-mmb2r 1/1 Running 0 23m 192.168.1.11 node3 <none> <none> nodelocaldns-zxlzl 1/1 Running 0 23m 192.168.1.40 node1 <none> <none>
NVIDIA GPU Operator Installation for K8s cluster
The preferred method to deploy the NVIDIA GPU Operator is using Helm from the K8s Master node. Install Helm using the official installer script:
K8s Master Node Console
```
# curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
# chmod 700 get_helm.sh
# ./get_helm.sh
```
Add the NVIDIA Helm repository.
K8s Master Node Console
```
# helm repo add nvidia https://nvidia.github.io/gpu-operator
# helm repo update
```
Deploy NVIDIA GPU Operator.
K8s Master Node Console
```
# helm install --wait --generate-name nvidia/gpu-operator

"nvidia" has been added to your repositories
root@sl-k8s-master:~# helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
root@sl-k8s-master:~# helm install --wait --generate-name nvidia/gpu-operator
NAME: gpu-operator-1610381204
LAST DEPLOYED: Mon Jan 11 18:06:50 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
K8s Master Node Console
```
# helm ls
NAME                      NAMESPACE  REVISION  UPDATED                                  STATUS    CHART               APP VERSION
gpu-operator-1610381204   default    1         2021-01-11 18:06:50.465874914 +0200 IST  deployed  gpu-operator-1.4.0  1.4.0
```
Verify the NVIDIA GPU Operator installation (wait ~5-10 minutes for the operator installation to finish).
K8s Master Node Console# kubectl get pod -A -o wide NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES default gpu-operator-1610455631-node-feature-discovery-master-c8dbgrnpf 1/1 Running 0 6m52s 10.233.90.5 node1 <none> <none> default gpu-operator-1610455631-node-feature-discovery-worker-24zlr 1/1 Running 0 6m52s 10.233.92.4 node3 <none> <none> default gpu-operator-1610455631-node-feature-discovery-worker-47mbw 1/1 Running 0 6m52s 10.233.90.4 node1 <none> <none> default gpu-operator-1610455631-node-feature-discovery-worker-qmnmj 1/1 Running 0 6m52s 10.233.96.1 node2 <none> <none> default gpu-operator-7d4649d96c-2d2xj 1/1 Running 4 6m52s 10.233.90.3 node1 <none> <none> gpu-operator-resources gpu-feature-discovery-4h8dh 1/1 Running 0 75s 10.233.92.11 node3 <none> <none> gpu-operator-resources gpu-feature-discovery-c4fzh 1/1 Running 0 75s 10.233.96.5 node2 <none> <none> gpu-operator-resources nvidia-container-toolkit-daemonset-5hpng 1/1 Running 0 4m19s 10.233.96.2 node2 <none> <none> gpu-operator-resources nvidia-container-toolkit-daemonset-n7mkv 1/1 Running 0 4m19s 10.233.92.5 node3 <none> <none> gpu-operator-resources nvidia-dcgm-exporter-mjpg7 1/1 Running 0 2m5s 10.233.92.10 node3 <none> <none> gpu-operator-resources nvidia-dcgm-exporter-smmpp 1/1 Running 0 2m5s 10.233.96.4 node2 <none> <none> gpu-operator-resources nvidia-device-plugin-daemonset-7tvqh 1/1 Running 0 3m5s 10.233.92.7 node3 <none> <none> gpu-operator-resources nvidia-device-plugin-daemonset-p9djf 1/1 Running 0 3m5s 10.233.96.3 node2 <none> <none> gpu-operator-resources nvidia-device-plugin-validation 0/1 Completed 0 2m8s 10.233.92.8 node3 <none> <none> gpu-operator-resources nvidia-driver-daemonset-5cxb7 1/1 Running 0 5m41s 192.168.1.10 node2 <none> <none> gpu-operator-resources nvidia-driver-daemonset-b5dlv 1/1 Running 0 5m41s 192.168.1.11 node3 <none> <none> gpu-operator-resources nvidia-driver-validation 0/1 Completed 2 3m54s 10.233.92.6 node3 <none> <none> ...
To run a Sample GPU Application: https://github.com/NVIDIA/gpu-operator#running-a-sample-gpu-application
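As a quick illustration of the linked sample, a minimal GPU test pod could look like the following (a sketch; the nvidia/samples:vectoradd-cuda10.2 image name is an assumption based on the public NVIDIA samples repository):
K8s Master Node Console
```
# vim cuda-vectoradd.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda10.2"
    resources:
      limits:
        nvidia.com/gpu: 1
```
Apply it with kubectl apply -f cuda-vectoradd.yaml and check the pod log with kubectl logs cuda-vectoradd; a successful run should report that the vector addition test passed.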
For GPU monitoring: https://github.com/NVIDIA/gpu-operator#gpu-monitoring
SR-IOV Network Operator Installation for K8s Cluster
SR-IOV network is an additional feature of a Kubernetes cluster.
To make it work, you need to provision and configure different components.
SR-IOV Network Operator Deployment Steps
- Initialize the supported SR-IOV NIC types on selected nodes.
- Provision SR-IOV device plugin executable on selected nodes.
- Provision SR-IOV CNI plugin executable on selected nodes.
- Manage configuration of SR-IOV device plugin on host.
- Generate net-att-def CRs for SR-IOV CNI plugin.
Prerequisites
Install general dependencies on the Master node server by running the commands below:
```
# apt-get install jq make gcc -y
# snap install skopeo --edge --devmode
# snap install go --classic
# export GOPATH=$HOME/go
# export PATH=$GOPATH/bin:$PATH
```
Below is a detailed step-by-step description of an SR-IOV Network Operator installation.
Install Whereabouts CNI.
You can install this plugin with a Daemonset, using the following commands:
K8s Master Node Console
```
# kubectl apply -f https://raw.githubusercontent.com/openshift/whereabouts-cni/master/doc/daemonset-install.yaml
# kubectl apply -f https://raw.githubusercontent.com/openshift/whereabouts-cni/master/doc/whereabouts.cni.cncf.io_ippools.yaml
# kubectl apply -f https://raw.githubusercontent.com/openshift/whereabouts-cni/master/doc/whereabouts.cni.cncf.io_overlappingrangeipreservations.yaml
```
To ensure the plugin is installed correctly, run the following command:
K8s Master Node Console
```
# kubectl get pods -A
NAMESPACE     NAME                READY   STATUS    RESTARTS   AGE
.......
kube-system   whereabouts-nsw6x   1/1     Running   0          22d
kube-system   whereabouts-pnhvn   1/1     Running   1          27d
kube-system   whereabouts-pv694   1/1     Running   0          27d
```
Clone the sriov-network-operator GitHub repository.
K8s Master Node Console
```
# cd /root
# go get github.com/k8snetworkplumbingwg/sriov-network-operator
```
Deploy the operator.
By default, the operator will be deployed in the 'sriov-network-operator' namespace of the Kubernetes cluster. You can check whether the deployment finished successfully, as shown in the sketch after the commands below.
K8s Master Node Console
```
# cd go/src/github.com/k8snetworkplumbingwg/sriov-network-operator/
# make deploy-setup-k8s
```
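A quick way to check that the operator components came up (a sketch; pod names will differ in your cluster):
```
# kubectl -n sriov-network-operator get pods -o wide
```
The sriov-network-operator and sriov-network-config-daemon pods should reach the Running state before you continue.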
Check the status of the SriovNetworkNodeState CRs to find all the SR-IOV capable devices in the cluster.
In our deployment, we choose the IB interface named ibs6f0.
K8s Master Node Console
```
# kubectl -n sriov-network-operator get sriovnetworknodestates.sriovnetwork.openshift.io node2 -o yaml
...
    deviceID: 101b
    driver: mlx5_core
    linkType: IB
    mac: 00:00:03:87:fe:80:00:00:00:00:00:00:98:03:9b:03:00:9f:cd:b6
    mtu: 4092
    name: ibs6f0
    numVfs: 8
    pciAddress: 0000:af:00.0
    totalvfs: 8
    vendor: 15b3
  - deviceID: 101b
    driver: mlx5_core
    linkType: IB
    mac: 00:00:0b:0f:fe:80:00:00:00:00:00:00:98:03:9b:03:00:9f:cd:b7
    mtu: 4092
    name: ibs6f1
    pciAddress: 0000:af:00.1
    totalvfs: 8
    vendor: 15b3
...
```
With the chosen IB interface, create a SriovNetworkNodePolicy CR.
K8s Master Node Console
```
# cd /root
# mkdir YAMLs
# cd YAMLs/
# vim policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-ib0
  namespace: sriov-network-operator
spec:
  resourceName: "mlnx_ib0"
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.available: "true"
  priority: 10
  numVfs: 8
  nicSelector:
    vendor: "15b3"
    deviceID: "101b"
    pfNames: [ "ibs6f0" ]
  isRdma: true
  linkType: ib
```
Apply the SriovNetworkNodePolicy.
K8s Master Node Console# kubectl apply -f policy.yaml
Check the Operator deployment after the policy activation.
K8s Master Node Console# kubectl -n sriov-network-operator get all NAME READY STATUS RESTARTS AGE pod/sriov-cni-bzdsv 2/2 Running 0 59s pod/sriov-cni-vsjbt 2/2 Running 0 9m6s pod/sriov-device-plugin-9ghjx 1/1 Running 0 9m6s pod/sriov-device-plugin-hkzct 1/1 Running 0 12s pod/sriov-network-config-daemon-8x749 1/1 Running 0 22m pod/sriov-network-config-daemon-k7plr 1/1 Running 0 61s pod/sriov-network-operator-79b8bb586f-ptgr6 1/1 Running 0 22m NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE daemonset.apps/sriov-cni 2 2 2 2 2 beta.kubernetes.io/os=linux,node-role.kubernetes.io/worker= 9m6s daemonset.apps/sriov-device-plugin 2 2 2 2 2 beta.kubernetes.io/os=linux,node-role.kubernetes.io/worker= 9m6s daemonset.apps/sriov-network-config-daemon 2 2 2 2 2 beta.kubernetes.io/os=linux,node-role.kubernetes.io/worker= 22m NAME READY UP-TO-DATE AVAILABLE AGE deployment.apps/sriov-network-operator 1/1 1 1 22m NAME DESIRED CURRENT READY AGE replicaset.apps/sriov-network-operator-79b8bb586f 1 1 1 22m
Create a Network Attachment Definition in a file named sriov-ib0.yaml.
File sample
```
# vim sriov-ib0.yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/mlnx_ib0
  name: sriovib0
  namespace: default
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "sriovib0",
      "plugins": [
        {
          "type": "ib-sriov",
          "link_state": "enable",
          "rdmaIsolation": true,
          "ibKubernetesEnabled": false,
          "ipam": {
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "log_file": "/tmp/whereabouts.log",
            "log_level": "debug",
            "type": "whereabouts",
            "range": "192.168.101.0/24"
          }
        }
      ]
    }
```
Apply the Network Attachment Definition.
K8s Master Node Console# kubectl apply -f sriov-ib0.yaml
Verify the Network Attachment Definition installation.
K8s Master Node Console
```
# kubectl get network-attachment-definitions.k8s.cni.cncf.io
NAME       AGE
sriovib0   28d
```
Check Worker node 2.
Worker Node 2 Expand source# kubectl describe nodes node2 Name: node2 Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux feature.node.kubernetes.io/cpu-cpuid.ADX=true feature.node.kubernetes.io/cpu-cpuid.AESNI=true feature.node.kubernetes.io/cpu-cpuid.AVX=true feature.node.kubernetes.io/cpu-cpuid.AVX2=true feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true feature.node.kubernetes.io/cpu-cpuid.AVX512F=true feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true feature.node.kubernetes.io/cpu-cpuid.FMA3=true feature.node.kubernetes.io/cpu-cpuid.HLE=true feature.node.kubernetes.io/cpu-cpuid.IBPB=true feature.node.kubernetes.io/cpu-cpuid.MPX=true feature.node.kubernetes.io/cpu-cpuid.RTM=true feature.node.kubernetes.io/cpu-cpuid.STIBP=true feature.node.kubernetes.io/cpu-cpuid.VMX=true feature.node.kubernetes.io/cpu-rdt.RDTCMT=true feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true feature.node.kubernetes.io/cpu-rdt.RDTMBA=true feature.node.kubernetes.io/cpu-rdt.RDTMBM=true feature.node.kubernetes.io/cpu-rdt.RDTMON=true feature.node.kubernetes.io/custom-rdma.available=true feature.node.kubernetes.io/custom-rdma.capable=true feature.node.kubernetes.io/kernel-config.NO_HZ=true feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true feature.node.kubernetes.io/kernel-version.full=5.4.0-56-generic feature.node.kubernetes.io/kernel-version.major=5 feature.node.kubernetes.io/kernel-version.minor=4 feature.node.kubernetes.io/kernel-version.revision=0 feature.node.kubernetes.io/memory-numa=true feature.node.kubernetes.io/pci-0300_102b.present=true feature.node.kubernetes.io/pci-0302_10de.present=true feature.node.kubernetes.io/pci-0302_10de.sriov.capable=true feature.node.kubernetes.io/storage-nonrotationaldisk=true feature.node.kubernetes.io/system-os_release.ID=ubuntu feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04 feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20 feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04 kubernetes.io/arch=amd64 kubernetes.io/hostname=node2 kubernetes.io/os=linux node-role.kubernetes.io/worker= nvidia.com/gpu.present=true Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock nfd.node.kubernetes.io/extended-resources: nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-... 
nfd.node.kubernetes.io/worker.version: v0.6.0 node.alpha.kubernetes.io/ttl: 0 sriovnetwork.openshift.io/state: Idle volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Tue, 01 Dec 2020 17:22:46 +0200 Taints: <none> Unschedulable: false Lease: HolderIdentity: node2 AcquireTime: <unset> RenewTime: Wed, 30 Dec 2020 14:30:59 +0200 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- NetworkUnavailable False Mon, 07 Dec 2020 16:00:20 +0200 Mon, 07 Dec 2020 16:00:20 +0200 CalicoIsUp Calico is running on this node MemoryPressure False Wed, 30 Dec 2020 14:31:06 +0200 Mon, 07 Dec 2020 15:59:48 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Wed, 30 Dec 2020 14:31:06 +0200 Mon, 07 Dec 2020 15:59:48 +0200 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Wed, 30 Dec 2020 14:31:06 +0200 Mon, 07 Dec 2020 15:59:48 +0200 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Wed, 30 Dec 2020 14:31:06 +0200 Mon, 07 Dec 2020 15:59:53 +0200 KubeletReady kubelet is posting ready status. AppArmor enabled Addresses: InternalIP: 192.168.1.10 Hostname: node2 Capacity: cpu: 32 ephemeral-storage: 229700940Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 197754972Ki nvidia.com/gpu: 2 openshift.io/mlnx_ib0: 8 pods: 110 Allocatable: cpu: 31900m ephemeral-storage: 211692385954 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 197402572Ki nvidia.com/gpu: 2 openshift.io/mlnx_ib0: 8 pods: 110 System Info: Machine ID: 646aa8cc13d14c47ac112babe9daf77c System UUID: 37383638-3330-5a43-3238-3435304d3647 Boot ID: 049266f0-98a5-48a8-b225-e118d1508ae1 Kernel Version: 5.4.0-56-generic OS Image: Ubuntu 20.04.1 LTS Operating System: linux Architecture: amd64 Container Runtime Version: docker://19.3.12 Kubelet Version: v1.19.2 Kube-Proxy Version: v1.19.2 PodCIDR: 10.233.65.0/24 PodCIDRs: 10.233.65.0/24 Non-terminated Pods: (15 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE --------- ---- ------------ ---------- --------------- ------------- --- default gpu-operator-1606837056-node-feature-discovery-worker-sjh9c 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d gpu-operator-resources gpu-feature-discovery-lfzpz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22d gpu-operator-resources nvidia-container-toolkit-daemonset-z8zbj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22d gpu-operator-resources nvidia-dcgm-exporter-d4s79 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22d gpu-operator-resources nvidia-device-plugin-daemonset-jm4sm 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22d gpu-operator-resources nvidia-driver-daemonset-7kj6c 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22d kube-system calico-node-9st8w 150m (0%) 300m (0%) 64M (0%) 500M (0%) 28d kube-system kube-multus-ds-amd64-xhzwv 100m (0%) 100m (0%) 90Mi (0%) 90Mi (0%) 28d kube-system kube-proxy-l45cj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d kube-system nginx-proxy-node2 25m (0%) 0 (0%) 32M (0%) 0 (0%) 28d kube-system nodelocaldns-lvqwb 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 28d kube-system whereabouts-nsw6x 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 22d sriov-network-operator sriov-cni-r2vlr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9m40s sriov-network-operator sriov-device-plugin-zschv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4m50s sriov-network-operator sriov-network-config-daemon-7qtmt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22d Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) 
Resource Requests Limits -------- -------- ------ cpu 475m (1%) 500m (1%) memory 316200960 (0%) 825058560 (0%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) hugepages-2Mi 0 (0%) 0 (0%) nvidia.com/gpu 0 0 openshift.io/mlnx_ib0 0 0 Events: <none>
Check Worker node 3.
# kubectl describe nodes node3 Name: node3 Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux feature.node.kubernetes.io/cpu-cpuid.ADX=true feature.node.kubernetes.io/cpu-cpuid.AESNI=true feature.node.kubernetes.io/cpu-cpuid.AVX=true feature.node.kubernetes.io/cpu-cpuid.AVX2=true feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true feature.node.kubernetes.io/cpu-cpuid.AVX512F=true feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true feature.node.kubernetes.io/cpu-cpuid.FMA3=true feature.node.kubernetes.io/cpu-cpuid.HLE=true feature.node.kubernetes.io/cpu-cpuid.IBPB=true feature.node.kubernetes.io/cpu-cpuid.MPX=true feature.node.kubernetes.io/cpu-cpuid.RTM=true feature.node.kubernetes.io/cpu-cpuid.STIBP=true feature.node.kubernetes.io/cpu-cpuid.VMX=true feature.node.kubernetes.io/cpu-hardware_multithreading=true feature.node.kubernetes.io/cpu-rdt.RDTCMT=true feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true feature.node.kubernetes.io/cpu-rdt.RDTMBA=true feature.node.kubernetes.io/cpu-rdt.RDTMBM=true feature.node.kubernetes.io/cpu-rdt.RDTMON=true feature.node.kubernetes.io/custom-rdma.available=true feature.node.kubernetes.io/custom-rdma.capable=true feature.node.kubernetes.io/kernel-config.NO_HZ=true feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true feature.node.kubernetes.io/kernel-version.full=5.4.0-56-generic feature.node.kubernetes.io/kernel-version.major=5 feature.node.kubernetes.io/kernel-version.minor=4 feature.node.kubernetes.io/kernel-version.revision=0 feature.node.kubernetes.io/memory-numa=true feature.node.kubernetes.io/pci-0300_102b.present=true feature.node.kubernetes.io/pci-0302_10de.present=true feature.node.kubernetes.io/pci-0302_10de.sriov.capable=true feature.node.kubernetes.io/storage-nonrotationaldisk=true feature.node.kubernetes.io/system-os_release.ID=ubuntu feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04 feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20 feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04 kubernetes.io/arch=amd64 kubernetes.io/hostname=node3 kubernetes.io/os=linux node-role.kubernetes.io/worker= nvidia.com/gpu.present=true Annotations: kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock nfd.node.kubernetes.io/extended-resources: nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-... 
nfd.node.kubernetes.io/worker.version: v0.6.0 node.alpha.kubernetes.io/ttl: 0 sriovnetwork.openshift.io/state: Idle volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Tue, 01 Dec 2020 17:22:53 +0200 Taints: <none> Unschedulable: false Lease: HolderIdentity: node3 AcquireTime: <unset> RenewTime: Wed, 30 Dec 2020 14:36:15 +0200 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- NetworkUnavailable False Mon, 07 Dec 2020 15:40:51 +0200 Mon, 07 Dec 2020 15:40:51 +0200 CalicoIsUp Calico is running on this node MemoryPressure False Wed, 30 Dec 2020 14:36:19 +0200 Mon, 07 Dec 2020 15:40:42 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Wed, 30 Dec 2020 14:36:19 +0200 Mon, 07 Dec 2020 15:40:42 +0200 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Wed, 30 Dec 2020 14:36:19 +0200 Mon, 07 Dec 2020 15:40:42 +0200 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Wed, 30 Dec 2020 14:36:19 +0200 Mon, 07 Dec 2020 15:40:44 +0200 KubeletReady kubelet is posting ready status. AppArmor enabled Addresses: InternalIP: 192.168.1.11 Hostname: node3 Capacity: cpu: 64 ephemeral-storage: 229698892Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 197747532Ki nvidia.com/gpu: 2 openshift.io/mlnx_ib0: 7 pods: 110 Allocatable: cpu: 63900m ephemeral-storage: 211690498517 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 197395132Ki nvidia.com/gpu: 2 openshift.io/mlnx_ib0: 0 pods: 110 System Info: Machine ID: c9f34445383f445eb44cd27fb90634e8 System UUID: 37383638-3330-5a43-3238-3435304d3643 Boot ID: 20be7b74-ce7d-4180-b904-48135f823819 Kernel Version: 5.4.0-56-generic OS Image: Ubuntu 20.04.1 LTS Operating System: linux Architecture: amd64 Container Runtime Version: docker://19.3.12 Kubelet Version: v1.19.2 Kube-Proxy Version: v1.19.2 PodCIDR: 10.233.66.0/24 PodCIDRs: 10.233.66.0/24 Non-terminated Pods: (17 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE --------- ---- ------------ ---------- --------------- ------------- --- default gpu-operator-1606837056-node-feature-discovery-worker-4mxl8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d default rdma-test-pod 0 (0%) 0 (0%) 0 (0%) 0 (0%) 22d gpu-operator-resources gpu-feature-discovery-kbkwt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d gpu-operator-resources nvidia-container-toolkit-daemonset-fmvgk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d gpu-operator-resources nvidia-dcgm-exporter-nlwhx 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d gpu-operator-resources nvidia-device-plugin-daemonset-k7c99 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d gpu-operator-resources nvidia-driver-daemonset-sslgt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d kube-system calico-node-hmsml 150m (0%) 300m (0%) 64M (0%) 500M (0%) 28d kube-system coredns-84646c885d-zh86b 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 22d kube-system kube-multus-ds-amd64-4r25b 100m (0%) 100m (0%) 90Mi (0%) 90Mi (0%) 28d kube-system kube-proxy-bbd99 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d kube-system nginx-proxy-node3 25m (0%) 0 (0%) 32M (0%) 0 (0%) 28d kube-system nodelocaldns-xw9tq 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 28d kube-system whereabouts-pnhvn 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 28d sriov-network-operator sriov-cni-kcz44 0 (0%) 0 (0%) 0 (0%) 0 (0%) 10m sriov-network-operator sriov-device-plugin-gmmdr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4s sriov-network-operator sriov-network-config-daemon-xq9c9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28d Allocated resources: (Total limits may be over 100 
percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 575m (0%) 500m (0%) memory 389601280 (0%) 1003316480 (0%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) hugepages-2Mi 0 (0%) 0 (0%) nvidia.com/gpu 0 0 openshift.io/mlnx_ib0 1 1 Events: <none>
Deployment Verification
Create a sample Deployment (the container image must include CUDA and the InfiniBand performance tools):
K8s Master Node Console
```
# vim sample-depl.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-pod
  labels:
    app: sriov
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sriov
  template:
    metadata:
      labels:
        app: sriov
      annotations:
        k8s.v1.cni.cncf.io/networks: sriovib0
    spec:
      containers:
      - image: <Container Image Name>
        name: mlnx-inbox-ctr
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          requests:
            openshift.io/mlnx_ib0: '1'
            nvidia.com/gpu: 1
          limits:
            openshift.io/mlnx_ib0: '1'
            nvidia.com/gpu: 1
        command:
        - sh
        - -c
        - sleep inf
```
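The image name above is intentionally left as a placeholder. One possible way to build such an image is sketched below, under assumptions: the nvidia/cuda:11.2.2-base-ubuntu20.04 base tag and the Ubuntu inbox RDMA packages (perftest provides ib_write_bw); the image used in this guide also includes the MLNX_OFED user-space tools such as ibdev2netdev.
```
# Hypothetical image build on a host with Docker and registry access
cat > Dockerfile <<'EOF'
FROM nvidia/cuda:11.2.2-base-ubuntu20.04
# perftest provides ib_write_bw; infiniband-diags and ibverbs-utils provide basic IB diagnostics
RUN apt-get update && \
    apt-get install -y --no-install-recommends perftest infiniband-diags ibverbs-utils && \
    rm -rf /var/lib/apt/lists/*
CMD ["sleep", "inf"]
EOF
docker build -t <registry>/mlnx-cuda-perftest:latest .
docker push <registry>/mlnx-cuda-perftest:latest
```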
Deploy the sample POD.
K8s Master Node Console# kubectl apply -f sample-depl.yaml
Verify that the PODs are running.
K8s Master Node Console# kubectl get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES gpu-operator-1610455631-node-feature-discovery-master-c8dbgrnpf 1/1 Running 0 20h 10.233.90.5 node1 <none> <none> gpu-operator-1610455631-node-feature-discovery-worker-24zlr 1/1 Running 4 20h 10.233.92.31 node3 <none> <none> gpu-operator-1610455631-node-feature-discovery-worker-47mbw 1/1 Running 1 20h 10.233.90.4 node1 <none> <none> gpu-operator-1610455631-node-feature-discovery-worker-qmnmj 1/1 Running 2 20h 10.233.96.20 node2 <none> <none> gpu-operator-7d4649d96c-2d2xj 1/1 Running 4 20h 10.233.90.3 node1 <none> <none> sample-pod-65b94586b4-8k784 1/1 Running 0 17h 10.233.92.37 node3 <none> <none> sample-pod-65b94586b4-8xn6m 1/1 Running 0 17h 10.233.96.27 node2 <none> <none>
Check GPU in a container.
K8s Master Node Console# kubectl exec -it sample-pod-65b94586b4-8k784 -- bash root@sample-pod-65b94586b4-8k784:/tmp# nvidia-smi Wed Jan 13 09:38:49 2021 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: N/A | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 Tesla T4 On | 00000000:37:00.0 Off | 0 | | N/A 48C P8 16W / 70W | 0MiB / 15109MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+ root@sample-pod-65b94586b4-8k784:/# exit exit
Check network adapters.
K8s Master Node Console# kubectl exec -it sample-pod-65b94586b4-8k784 -- bash root@sample-pod-65b94586b4-8k784:/tmp# ip a s 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever 2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000 link/ipip 0.0.0.0 brd 0.0.0.0 4: eth0@if48: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link/ether 8a:87:13:3b:bd:c4 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 10.233.92.37/32 scope global eth0 valid_lft forever preferred_lft forever 49: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256 link/infiniband 00:00:0e:e3:fe:80:00:00:00:00:00:00:60:cc:fa:35:1d:14:a4:cc brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff inet 192.168.101.1/24 brd 192.168.101.255 scope global net1 valid_lft forever preferred_lft forever root@sample-pod-65b94586b4-8k784:/tmp# ibdev2netdev mlx5_9 port 1 ==> net1 (Up) root@sample-pod-65b94586b4-8k784:/# exit exit
Run an RDMA Write (ib_write_bw) bandwidth stress benchmark over IB.
Server
ib_write_bw -a -d mlx5_0 &
Client
ib_write_bw -a -F $server_IP -d mlx5_0 --report_gbits
Open two consoles to the K8s Master node. In the first console (server side), run the following commands:
K8s Master Node Console
```
# kubectl exec -it sample-pod-65b94586b4-8k784 -- bash
root@sample-pod-65b94586b4-8k784:/tmp# ibdev2netdev
mlx5_9 port 1 ==> net1 (Up)
root@sample-pod-65b94586b4-8k784:/tmp# ib_write_bw -a -d mlx5_9 &
[1] 1081
root@sample-pod-65b94586b4-8k784:/tmp#
************************************
* Waiting for client to connect... *
************************************
```
In the second console (client side), run the following commands:
K8s Master Node Console
```
# kubectl exec -it sample-pod-65b94586b4-8xn6m -- bash
root@sample-pod-65b94586b4-8xn6m:/tmp# ibdev2netdev
mlx5_7 port 1 ==> net1 (Up)
root@sample-pod-65b94586b4-8xn6m:/tmp# ib_write_bw -a -F 192.168.101.1 -d mlx5_7 --report_gbits
```
Results:
K8s Master Node ConsoleServer: --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_9 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF CQ Moderation : 100 Mtu : 4096[B] Link type : IB Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0x0d QPN 0x0ec6 PSN 0xa49cc3 RKey 0x0e0400 VAddr 0x007fa17b1ef000 remote address: LID 0x0c QPN 0x0bac PSN 0xa6c47a RKey 0x0a0400 VAddr 0x007f54c7554000 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 8388608 5000 96.58 96.54 0.001439 --------------------------------------------------------------------------------------- Client: --------------------------------------------------------------------------------------- RDMA_Write BW Test Dual-port : OFF Device : mlx5_7 Number of qps : 1 Transport type : IB Connection type : RC Using SRQ : OFF TX depth : 128 CQ Moderation : 100 Mtu : 4096[B] Link type : IB Max inline data : 0[B] rdma_cm QPs : OFF Data ex. method : Ethernet --------------------------------------------------------------------------------------- local address: LID 0x0c QPN 0x0ba9 PSN 0xf563c8 RKey 0x0a0400 VAddr 0x007fd3ff9cb000 remote address: LID 0x0d QPN 0x0ec3 PSN 0x9445de RKey 0x0e0400 VAddr 0x007fac5f879000 --------------------------------------------------------------------------------------- #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps] 2 5000 0.11 0.11 6.680528 4 5000 0.23 0.22 6.961376 8 5000 0.48 0.43 6.746956 16 5000 0.96 0.86 6.703131 32 5000 1.90 1.80 7.021876 64 5000 3.82 3.59 7.014856 128 5000 7.45 7.02 6.853576 256 5000 14.69 14.38 7.019255 512 5000 28.51 27.75 6.774283 1024 5000 54.31 48.41 5.909477 2048 5000 82.91 80.73 4.927545 4096 5000 95.75 95.62 2.918237 8192 5000 95.88 95.88 1.462960 16384 5000 96.18 96.15 0.733546 32768 5000 96.49 96.37 0.367604 65536 5000 96.54 96.53 0.184124 131072 5000 96.56 96.55 0.092081 262144 5000 96.56 96.56 0.046041 524288 5000 96.57 96.57 0.023024 1048576 5000 96.58 96.57 0.011512 2097152 5000 96.58 96.52 0.005753 4194304 5000 96.58 96.56 0.002878 8388608 5000 96.58 96.56 0.001439 ---------------------------------------------------------------------------------------
Delete the sample deployment by running:
K8s Master Node Console
```
root@sample-pod-65b94586b4-lz5x9:/tmp# exit
# kubectl delete -f sample-depl.yaml
```
Done!
Boris Kovalev Boris Kovalev has worked for the past several years as a Solutions Architect, focusing on NVIDIA Networking/Mellanox technology, and is responsible for complex machine learning, Big Data and advanced VMware-based cloud research and design. Boris previously spent more than 20 years as a senior consultant and solutions architect at multiple companies, most recently at VMware. He has written multiple reference designs covering VMware, machine learning, Kubernetes, and container solutions which are available at the Mellanox Documents website. |
Vitaliy Razinkov Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference designs guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website. |
Related Documents