RDG for Accelerated K8s Cluster over NVIDIA DGX A100 Servers and 200Gbps Ethernet Network Fabric

Created on May 15, 2022.

Scope

The following Reference Deployment Guide (RDG) guides you through setting up a highly available, GPU- and network-accelerated Kubernetes (K8s) cluster over a 200Gb/s NVIDIA network. A high-availability cluster consists of multiple control plane nodes (K8s master nodes), multiple worker nodes (DGX A100 servers) and a load balancer application (HAProxy). This guide provides examples of how to run ML/DL applications on the NVIDIA DGX A100 server platform with Kubeflow training operators.

Abbreviations and Acronyms

| Term | Definition |
|------|------------|
| CNI | Container Network Interface |
| CR | Custom Resource |
| CRD | Custom Resource Definition |
| CRI | Container Runtime Interface |
| DHCP | Dynamic Host Configuration Protocol |
| DNS | Domain Name System |
| DL | Deep Learning |
| DP | Device Plugin |
| IPAM | IP Address Management |
| K8s | Kubernetes |
| LLDP | Link Layer Discovery Protocol |
| ML | Machine Learning |
| NFD | Node Feature Discovery |
| NCCL | NVIDIA Collective Communications Library |
| OCI | Open Container Initiative |
| PF | Physical Function |
| QSG | Quick Start Guide |
| RDG | Reference Deployment Guide |
| RDMA | Remote Direct Memory Access |
| RoCE | RDMA over Converged Ethernet |
| SR-IOV | Single Root Input/Output Virtualization |
| TF | TensorFlow |
| VF | Virtual Function |

Introduction

Provisioning a highly available Kubernetes cluster to run ML/DL application workloads can be an extremely complicated task.

This guide provides a complete solution cycle of K8s cluster deployment, including technology overview, design, component selection, deployment steps and ML/DL workload examples. The solution is delivered on top of standard servers for the control plane and DGX A100 servers as K8s worker nodes. The NVIDIA 200Gb/s end-to-end Ethernet infrastructure handles the workload traffic, while a 100Gb/s network serves as the primary network.
In this guide, we use the NVIDIA GPU Operator and the NVIDIA Network Operator, which are responsible for deploying and configuring GPU and network components in the K8s cluster. These components allow you to accelerate ML/DL tasks using CUDA, RDMA and GPUDirect technologies.

A Greenfield deployment is assumed for this guide.
This guide shows the design of a K8s cluster with two to eight Worker Nodes and supplies detailed instructions for deploying a cluster with four K8s Worker Nodes.

References

Solution Architecture

Key Components and Technologies

  • NVIDIA DGX A100

    NVIDIA DGX™ A100 is the universal system for all AI workloads, offering unprecedented compute density, performance, and flexibility in the world’s first 5 petaFLOPS AI system. NVIDIA DGX A100 features the world’s most advanced accelerator, the NVIDIA A100 Tensor Core GPU, enabling enterprises to consolidate training, inference, and analytics into a unified, easy-to-deploy AI infrastructure that includes direct access to NVIDIA AI experts.

  • NVIDIA ConnectX SmartNICs
    10/25/40/50/100/200 and 400G Ethernet Network Adapters
    The industry-leading NVIDIA® ConnectX® family of smart network interface cards (SmartNICs) offers advanced hardware offloads and accelerations.
    NVIDIA Ethernet adapters enable the highest ROI and lowest Total Cost of Ownership for hyperscale, public and private clouds, storage, machine learning, AI, big data, and telco platforms.

  • NVIDIA LinkX Cables

    The NVIDIA® LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400GbE in Ethernet and 100, 200 and 400Gb/s InfiniBand products for Cloud, HPC, hyperscale, Enterprise, telco, storage and artificial intelligence, data center applications.

  • NVIDIA Spectrum Ethernet Switches

    Flexible form-factors with 16 to 128 physical ports, supporting 1GbE through 400GbE speeds.
    Based on a ground-breaking silicon technology optimized for performance and scalability, NVIDIA Spectrum switches are ideal for building high-performance, cost-effective, and efficient Cloud Data Center Networks, Ethernet Storage Fabric, and Deep Learning Interconnects.
    NVIDIA combines the benefits of NVIDIA Spectrum switches, based on an industry-leading application-specific integrated circuit (ASIC) technology, with a wide variety of modern network operating system choices, including NVIDIA Cumulus® Linux, SONiC and NVIDIA Onyx®.

  • NVIDIA Cumulus Linux

    NVIDIA® Cumulus® Linux is the industry's most innovative open network operating system that allows you to automate, customize, and scale your data center network like no other.

  • Kubernetes
    Kubernetes is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.

  • Kubespray
    Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks and provides:

    • A highly available cluster

    • Composable attributes

    • Support for most popular Linux distributions

  • NVIDIA GPU Operator

    The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring, and more.

  • NVIDIA Network Operator

    An analog to the NVIDIA GPU Operator, the NVIDIA Network Operator simplifies scale-out network design for Kubernetes by automating aspects of network deployment and configuration that would otherwise require manual work. It loads the required drivers, libraries, device plugins, and CNIs on any cluster node with an NVIDIA network interface. Paired with the NVIDIA GPU Operator, the Network Operator enables GPUDirect RDMA, a key technology that accelerates cloud-native AI workloads by orders of magnitude. The NVIDIA Network Operator uses Kubernetes CRD and the Operator Framework to provision the host software needed for enabling accelerated networking.

  • NVIDIA CUDA

    CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs. In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute-intensive portion of the application runs on thousands of GPU cores in parallel.

  • RDMA

    RDMA is a technology that allows computers in a network to exchange data without involving the processor, cache or operating system of either computer.

    Like locally based DMA, RDMA improves throughput and performance and frees up compute resources.

  • GPUDirect RDMA

    GPUDirect RDMA (GDR) provides a direct peer-to-peer (P2P) data path between GPU memory and NVIDIA HCA devices. This reduces GPU-to-GPU communication latency and completely offloads the CPU, removing it from all GPU-to-GPU communications across the network.

gpu.png

Logical Design

The logical design includes the following parts:

  • Deployment node running Kubespray that deploys Kubernetes clusters and HAProxy load-balancer

  • K8s Master nodes running all Kubernetes management components

  • NVIDIA DGX A100 K8s Worker nodes

  • High-speed Ethernet fabric (Secondary K8s network with RoCE support)

  • Deployment and K8s Management networks

image2022-3-13_21-40-13.png

Network / Fabric Design

The high-performance network is a secondary network for the Kubernetes cluster and requires an L2 network topology.

This RDG describes two options with multiple K8s Worker Nodes:

  • Design for 2-4 Worker Nodes
    In this solution, all K8s Worker Nodes are connected to a single switch, which provides a K8s secondary network.

  • Design for 5-8 Worker Nodes
    In this solution, all K8s Worker Nodes are connected to two independent switches, which provide a K8s secondary network.

The Deployment and Kubernetes Management networks are parts of the IT infrastructure and are beyond the scope of this document.

Design for K8s Cluster of 2-4 Worker Nodes

Each node is connected to the MGMT switch by a single 100GbE cable, and all Data ports of the K8s worker nodes are connected to the Data switch by 200GbE cables. All server remote management ports and switch management ports are connected to a 1GbE switch.

4.png

Design for K8s Cluster of 5-8 Worker Nodes

Each node is connected to the MGMT switch by a single 100GbE cable, and the Data ports of each K8s worker node are split between the two Data switches using 200GbE cables: the first four data ports are connected to Data Switch1, and the remaining four data ports are connected to Data Switch2. See the Worker Node4 connections as an example. All server remote management ports and switch management ports are connected to a 1GbE switch.

8.png

Software Stack Components

sw.png

Bill of Materials

The following hardware setup is utilized in this guide to build a K8s cluster with 4 K8s Worker nodes.

bom.png

The following hardware setup is utilized in this guide to build a K8s cluster with 8 K8s Worker nodes.

bom2.png

Note

Server remote management and switch management BOM for 1GbE network is beyond the scope of this document.

Deployment and Configuration

Wiring

On each K8s Worker Node, all the networking ports of each NVIDIA Network Adapter are wired to an NVIDIA switch in the high-performance fabric using NVIDIA LinkX DAC cables.

The below figure illustrates the required wiring for building a K8s cluster with 4 K8s Worker nodes.

w4.png

The below figure illustrates the required wiring for building a K8s cluster with 8 K8s Worker nodes.

w8.png

Note

Server remote management and switch management wiring over 1GbE network is beyond the scope of this document.

Network / Fabric

General Prerequisites

Deployment/Management network topology and DNS/DHCP network services are part of the IT infrastructure. Installation and configuration of these components are not covered in this guide.

Network and Fabric Configuration for Clusters up to 4 DGX A100 Worker Nodes

Prerequisites

  • High-performance Ethernet fabric

    • Single switch - NVIDIA SN3700

    • Switch OS - Cumulus Linux v4.3 and above

Network Configuration

Below are the server names with their relevant network configurations.

| Server/Switch Type | Server/Switch Name | High-Speed Network 200GbE (IP and NICs) | Management Network 100GbE (IP and NICs) |
|--------------------|--------------------|------------------------------------------|------------------------------------------|
| Deployment node | depserver | N/A | eth0: DHCP 192.168.222.110 |
| Master node1 | Node1 | N/A | eth0: DHCP 192.168.222.111 |
| Master node2 | Node2 | N/A | eth0: DHCP 192.168.222.112 |
| Master node3 | Node3 | N/A | eth0: DHCP 192.168.222.113 |
| Worker node1 | clx-host-081 | enp12s0, enp18s0, enp75s0, enp84s0, enp141s0, enp148s0, enp186s0, enp204s0: no IP set | enp225s0f0: DHCP 192.168.222.101 |
| Worker node2 | clx-host-082 | enp12s0, enp18s0, enp75s0, enp84s0, enp141s0, enp148s0, enp186s0, enp204s0: no IP set | enp225s0f0: DHCP 192.168.222.102 |
| Worker node3 | clx-host-083 | enp12s0, enp18s0, enp75s0, enp84s0, enp141s0, enp148s0, enp186s0, enp204s0: no IP set | enp225s0f0: DHCP 192.168.222.103 |
| Worker node4 | clx-host-084 | enp12s0, enp18s0, enp75s0, enp84s0, enp141s0, enp148s0, enp186s0, enp204s0: no IP set | enp225s0f0: DHCP 192.168.222.104 |
| High-speed switch | hs-sw01 | N/A | mgmt0: DHCP 192.168.222.201 |

enpXXXs0 high-speed network interfaces do not require additional configuration.

Fabric Configuration

This solution is based on the Cumulus Linux v4.3 switch operating system.


As a best practice, make sure to use the latest released Cumulus Linux NOS version. Please see this guide on how to upgrade Cumulus Linux.

Ensure that your Cumulus Linux switch has passed its initial configuration stages (see the Quick-Start Guide for version 4.3 for more information).

Fabric configuration steps:

  1. Administratively enable all physical ports

  2. Create a bridge and configure front panel ports as members of the bridge

  3. Create VLANs

  4. Add VLANs to bridge

  5. Commit configuration

Switch configuration steps:

    Linux hs-sw01 4.19.0-cl-1-amd64 #1 SMP Cumulus 4.19.149-1+cl4.3u1 (2021-01-28) x86_64

    Welcome to NVIDIA Cumulus (R) Linux (R)

    For support and online technical documentation, visit
    http://www.cumulusnetworks.com/support

    The registered trademark Linux (R) is used pursuant to a sublicense from LMI,
    the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide
    basis.

    cumulus@hs-sw01:mgmt:~$ net show version
    NCLU_VERSION=1.0-cl4.3.0u4
    DISTRIB_ID="Cumulus Linux"
    DISTRIB_RELEASE=4.3.0
    DISTRIB_DESCRIPTION="Cumulus Linux 4.3.0"

    cumulus@hs-sw01:mgmt:~$ net add interface swp1-32
    cumulus@hs-sw01:mgmt:~$ net add bridge bridge ports swp1-32
    cumulus@hs-sw01:mgmt:~$ net add vlan 11 vlan-id 11
    cumulus@hs-sw01:mgmt:~$ net add vlan 12 vlan-id 12
    cumulus@hs-sw01:mgmt:~$ net add vlan 13 vlan-id 13
    cumulus@hs-sw01:mgmt:~$ net add vlan 14 vlan-id 14
    cumulus@hs-sw01:mgmt:~$ net add vlan 15 vlan-id 15
    cumulus@hs-sw01:mgmt:~$ net add vlan 16 vlan-id 16
    cumulus@hs-sw01:mgmt:~$ net add vlan 17 vlan-id 17
    cumulus@hs-sw01:mgmt:~$ net add vlan 18 vlan-id 18
    cumulus@hs-sw01:mgmt:~$ net add bridge bridge vids 11-18
    cumulus@hs-sw01:mgmt:~$ net commit
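The eight VLAN commands in the switch configuration follow a single pattern, so they can be generated rather than typed by hand. The following is a minimal sketch and not part of the original procedure: gen_vlan_cmds is a hypothetical helper name, and it only prints the NCLU commands already used in this guide so that they can be reviewed before being run on the switch.

```shell
#!/bin/sh
# Hypothetical helper: print the NCLU commands that create VLANs FIRST..LAST
# and add them to the bridge. It only prints; nothing is sent to the switch.
gen_vlan_cmds() {
    first="$1"
    last="$2"
    vid="$first"
    while [ "$vid" -le "$last" ]; do
        echo "net add vlan ${vid} vlan-id ${vid}"
        vid=$((vid + 1))
    done
    echo "net add bridge bridge vids ${first}-${last}"
}

# VLAN range used in this guide
gen_vlan_cmds 11 18
```

The printed lines can be pasted into the switch session, followed by net commit.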

To view link status, use the net show interface all command. The following example shows the output with all ports in the up state.

    cumulus@hs-sw01:mgmt:~$ net show interface

    State  Name     Spd   MTU    Mode       LLDP          Summary
    -----  -------  ----  -----  ---------  ------------  ----------------------------
    UP     lo       N/A   65536  Loopback                 IP: 127.0.0.1/8
           lo                                             IP: ::1/128
    UP     eth0     1G    1500   Mgmt                     Master: mgmt(UP)
           eth0                                           IP: 192.168.222.201/24(DHCP)
    UP     swp1     200G  9216   Trunk/L2   clx-host-081  Master: bridge(UP)
    UP     swp2     200G  9216   Trunk/L2   clx-host-082  Master: bridge(UP)
    UP     swp3     200G  9216   Trunk/L2   clx-host-081  Master: bridge(UP)
    UP     swp4     200G  9216   Trunk/L2   clx-host-082  Master: bridge(UP)
    UP     swp5     200G  9216   Trunk/L2   clx-host-081  Master: bridge(UP)
    UP     swp6     200G  9216   Trunk/L2   clx-host-082  Master: bridge(UP)
    UP     swp7     200G  9216   Trunk/L2   clx-host-081  Master: bridge(UP)
    UP     swp8     200G  9216   Trunk/L2   clx-host-082  Master: bridge(UP)
    UP     swp9     200G  9216   Trunk/L2   clx-host-083  Master: bridge(UP)
    UP     swp10    200G  9216   Trunk/L2   clx-host-084  Master: bridge(UP)
    UP     swp11    200G  9216   Trunk/L2   clx-host-083  Master: bridge(UP)
    UP     swp12    200G  9216   Trunk/L2   clx-host-084  Master: bridge(UP)
    UP     swp13    200G  9216   Trunk/L2   clx-host-083  Master: bridge(UP)
    UP     swp14    200G  9216   Trunk/L2   clx-host-084  Master: bridge(UP)
    UP     swp15    200G  9216   Trunk/L2   clx-host-083  Master: bridge(UP)
    UP     swp16    200G  9216   Trunk/L2   clx-host-084  Master: bridge(UP)
    UP     swp17    200G  9216   Trunk/L2   clx-host-083  Master: bridge(UP)
    UP     swp18    200G  9216   Trunk/L2   clx-host-084  Master: bridge(UP)
    UP     swp19    200G  9216   Trunk/L2   clx-host-083  Master: bridge(UP)
    UP     swp20    200G  9216   Trunk/L2   clx-host-084  Master: bridge(UP)
    UP     swp21    200G  9216   Trunk/L2   clx-host-083  Master: bridge(UP)
    UP     swp22    200G  9216   Trunk/L2   clx-host-084  Master: bridge(UP)
    UP     swp23    200G  9216   Trunk/L2   clx-host-083  Master: bridge(UP)
    UP     swp24    200G  9216   Trunk/L2   clx-host-084  Master: bridge(UP)
    UP     swp25    200G  9216   Trunk/L2   clx-host-081  Master: bridge(UP)
    UP     swp26    200G  9216   Trunk/L2   clx-host-082  Master: bridge(UP)
    UP     swp27    200G  9216   Trunk/L2   clx-host-081  Master: bridge(UP)
    UP     swp28    200G  9216   Trunk/L2   clx-host-082  Master: bridge(UP)
    UP     swp29    200G  9216   Trunk/L2   clx-host-081  Master: bridge(UP)
    UP     swp30    200G  9216   Trunk/L2   clx-host-082  Master: bridge(UP)
    UP     swp31    200G  9216   Trunk/L2   clx-host-081  Master: bridge(UP)
    UP     swp32    200G  9216   Trunk/L2   clx-host-082  Master: bridge(UP)
    UP     bridge   N/A   9216   Bridge/L2
    UP     mgmt     N/A   65536  VRF                      IP: 127.0.0.1/8
           mgmt                                           IP: ::1/128
    UP     vlan11   N/A   9216   Default
    UP     vlan12   N/A   9216   Default
    UP     vlan13   N/A   9216   Default
    UP     vlan14   N/A   9216   Default
    UP     vlan15   N/A   9216   Default
    UP     vlan16   N/A   9216   Default
    UP     vlan17   N/A   9216   Default
    UP     vlan18   N/A   9216   Default
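With 32 ports in the listing, a missing link is easy to overlook. As a rough sketch (the helper name count_up_200g is hypothetical, and the column order State, Name, Spd is assumed to match the output above), the following counts the switch ports that are reported as UP at 200G:

```shell
#!/bin/sh
# Hypothetical helper: count swpN ports reported as UP at 200G.
# Reads "net show interface" output on stdin and assumes the column order
# State, Name, Spd, ... as in the example output.
count_up_200g() {
    awk '$1 == "UP" && $2 ~ /^swp[0-9]+$/ && $3 == "200G" { n++ }
         END { print n + 0 }'
}

# Usage on the switch (the example output above contains 32 such ports):
#   net show interface | count_up_200g
```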

Nodes Configuration

General Prerequisites

  • Deployment Server and K8s Master Nodes

    Ubuntu Server 20.04 operating system should be installed on all servers with OpenSSH server packages.

  • K8s Worker Nodes

    • All the K8s Worker Nodes have the same hardware specification (see BoM for details).

    • Verify that an SR-IOV supported server platform is being used and review the BIOS settings in the server platform vendor documentation to enable SR-IOV in the BIOS.

    • For AMD processors, NUMA Nodes per Socket (NPS) should be configured in NPS1.

    • All high-speed 200Gb/s ConnectX-6 single-port Adapter Cards should be configured in Ethernet mode.

Host OS Prerequisites

Ensure that the Ubuntu Server 20.04 operating system is installed on all servers with OpenSSH server packages, and create a non-root depuser account with passwordless sudo privileges.

Update the Ubuntu software packages by running the following commands:

    sudo apt-get update
    sudo apt-get upgrade -y
    sudo reboot

In this solution, we added the following lines to the end of /etc/sudoers:

    sudo vim /etc/sudoers

    #includedir /etc/sudoers.d

    #K8s cluster deployment user with sudo privileges without password
    depuser ALL=(ALL) NOPASSWD:ALL
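Before moving on, it can help to confirm that the sudoers change took effect. The following is a minimal sketch, assuming it is run as depuser on each node; the function name is hypothetical. It relies on sudo -n, which never prompts and therefore fails immediately if a password would still be required.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the current user can run sudo
# without a password. "sudo -n" (non-interactive) fails instead of prompting.
check_passwordless_sudo() {
    if sudo -n true 2>/dev/null; then
        echo "passwordless sudo: OK"
    else
        echo "passwordless sudo: NOT configured" >&2
        return 1
    fi
}

# Usage (as depuser):
#   check_passwordless_sudo
```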

NVIDIA DGX A100 Server Firmware Update

It is recommended to update the DGX A100 server firmware to the latest GA release.

If you are unfamiliar with the server firmware update procedure, please contact the NVIDIA Support team or visit the DGX System Documentation page.

Deployment External Load-Balancer

In this deployment, the topology of the highly available (HA) Kubernetes cluster is configured with stacked control plane nodes, where etcd nodes are colocated with control plane nodes. More information about the HA topology options for a Kubernetes cluster deployment can be found in the reference below.

The high availability cluster is built across multiple K8s control plane nodes (K8s master nodes), multiple Worker Nodes and a load balancer.

Adding load balancer to K8s cluster deployment makes the system more robust, since any K8s master node can fail without the application going offline or data being lost.

An illustration of this setup is shown below.

The etcd cluster ensures that all data is synchronized across the master nodes, and the load balancer regulates the traffic distribution. The cluster can therefore be accessed through a single entry point (the load balancer), and each request is passed to an arbitrary node.

proxy.png

Reference: https://kubernetes.io/docs/setup/independent/ha-topology/#stacked-etcd-topology

An HAProxy standard package is used.

Install HAProxy on the Deployment Node with the root user account:

    apt-get -y install haproxy

Update /etc/haproxy/haproxy.cfg with the following:

    global
        log /dev/log local0
        log /dev/log local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
        stats timeout 30s
        user haproxy
        group haproxy
        daemon

        # Default SSL material locations
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private

        # See: https://ssl-config.mozilla.org/#server=haproxy&server-version=2.0.3&config=intermediate
        ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
        ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
        ssl-default-bind-options ssl-min-ver TLSv1.2 no-tls-tickets

    defaults
        log global
        mode http
        option httplog
        option dontlognull
        timeout connect 5000
        timeout client 50000
        timeout server 50000
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http

    frontend stats
        bind *:8404
        stats enable
        stats uri /stats
        stats refresh 10s
        stats admin if LOCALHOST

    listen kubernetes-apiserver-https
        bind 192.168.222.110:6443
        mode tcp
        option log-health-checks
        timeout client 3h
        timeout server 3h
        server node1 192.168.222.111:6443 check check-ssl verify none inter 10000
        server node2 192.168.222.112:6443 check check-ssl verify none inter 10000
        server node3 192.168.222.113:6443 check check-ssl verify none inter 10000
        balance roundrobin

After updating the configuration file, restart the haproxy service.

    service haproxy restart
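A syntax error in haproxy.cfg would leave the API endpoint without a load balancer after the restart. As a sketch that is not part of the original procedure (the function name is hypothetical), the configuration can be checked first with haproxy -c, which parses the file and exits without starting the proxy:

```shell
#!/bin/sh
# Hypothetical helper: validate the HAProxy configuration and restart the
# service only if the check passes. "haproxy -c -f FILE" is check-only mode.
restart_haproxy_if_valid() {
    cfg="${1:-/etc/haproxy/haproxy.cfg}"
    if haproxy -c -f "$cfg"; then
        service haproxy restart
    else
        echo "haproxy config check failed: $cfg" >&2
        return 1
    fi
}

# Usage (as root on the Deployment Node):
#   restart_haproxy_if_valid
```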

K8s Cluster Deployment and Configuration

The Kubernetes cluster in this solution is installed using Kubespray with a non-root depuser account from the Deployment Node.

SSH Private Key and SSH Passwordless Login

Log in to the Deployment Node as a deployment user (in this case, depuser) and create an SSH key pair for passwordless authentication by running the following command:

    ssh-keygen

    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/depuser/.ssh/id_rsa):
    Created directory '/home/depuser/.ssh'.
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/depuser/.ssh/id_rsa
    Your public key has been saved in /home/depuser/.ssh/id_rsa.pub
    The key fingerprint is:
    SHA256:IfcjdT/spXVHVd3n6wm1OmaWUXGuHnPmvqoXZ6WZYl0 depuser@depserver
    The key's randomart image is:
    +---[RSA 3072]----+
    | *|
    | .*|
    | . o . . o=|
    | o + . o +E|
    | S o .**O|
    | . .o=OX=|
    | . o%*.|
    | O.o.|
    | .*.ooo|
    +----[SHA256]-----+

Copy your SSH public key, such as ~/.ssh/id_rsa.pub, to all nodes in the deployment by running the following command (example):

    ssh-copy-id depuser@192.168.222.111

    /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/depuser/.ssh/id_rsa.pub"
    The authenticity of host '192.168.222.111 (192.168.222.111)' can't be established.
    ECDSA key fingerprint is SHA256:6nhUgRlt9gY2Y2ofukUqE0ltH+derQuLsI39dFHe0Ag.
    Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
    /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
    /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
    depuser@192.168.222.111's password:

    Number of key(s) added: 1

    Now try logging into the machine, with: "ssh 'depuser@192.168.222.111'"
    and check to make sure that only the key(s) you wanted were added.

Verify that you have passwordless SSH connectivity to all nodes in your deployment by running the following command (example):

    ssh depuser@192.168.222.111
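The same check can be repeated for every node in one pass. The following is a sketch only: the function name and the SSH_BIN override are hypothetical, and the IP list in the usage comment is the one used in this deployment. BatchMode=yes makes ssh fail rather than prompt, so any node that still asks for a password is reported as FAILED.

```shell
#!/bin/sh
# Hypothetical helper: report nodes that do not accept key-based login.
# SSH_BIN can be overridden for testing; by default the real ssh is used.
check_passwordless_ssh() {
    user="$1"
    shift
    rc=0
    for ip in "$@"; do
        if "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "${user}@${ip}" true; then
            echo "${ip}: ok"
        else
            echo "${ip}: FAILED"
            rc=1
        fi
    done
    return "$rc"
}

# Usage with the node addresses from this guide:
#   check_passwordless_ssh depuser 192.168.222.111 192.168.222.112 192.168.222.113 \
#       192.168.222.101 192.168.222.102 192.168.222.103 192.168.222.104
```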

Kubespray Deployment and Configuration

General Setting

To install dependencies for running Kubespray with Ansible on the Deployment Node, run the following commands:

    cd ~
    sudo apt -y install python3-pip jq
    wget https://github.com/kubernetes-sigs/kubespray/archive/refs/tags/v2.18.0.tar.gz
    tar -zxf v2.18.0.tar.gz
    cd kubespray-2.18.0
    sudo pip3 install -r requirements.txt

Warning

The default folder for subsequent commands is ~/kubespray-2.18.0.

Deployment Customization

Create a new cluster configuration and host configuration file.
Replace the IP addresses below with your nodes' IP addresses:

    cp -rfp inventory/sample inventory/mycluster
    declare -a IPS=(192.168.222.111 192.168.222.112 192.168.222.113 192.168.222.101 192.168.222.102 192.168.222.103 192.168.222.104)
    CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

As a result, the inventory/mycluster/hosts.yaml file will be created.
Review and change the host configuration in the file. Below is an example for this deployment:

inventory/mycluster/hosts.yaml

    all:
      hosts:
        node1:
          ansible_host: 192.168.222.111
          ip: 192.168.222.111
          access_ip: 192.168.222.111
        node2:
          ansible_host: 192.168.222.112
          ip: 192.168.222.112
          access_ip: 192.168.222.112
        node3:
          ansible_host: 192.168.222.113
          ip: 192.168.222.113
          access_ip: 192.168.222.113
        clx-host-081:
          ansible_host: 192.168.222.101
          ip: 192.168.222.101
          access_ip: 192.168.222.101
        clx-host-082:
          ansible_host: 192.168.222.102
          ip: 192.168.222.102
          access_ip: 192.168.222.102
        clx-host-083:
          ansible_host: 192.168.222.103
          ip: 192.168.222.103
          access_ip: 192.168.222.103
        clx-host-084:
          ansible_host: 192.168.222.104
          ip: 192.168.222.104
          access_ip: 192.168.222.104
      children:
        kube_control_plane:
          hosts:
            node1:
            node2:
            node3:
        kube_node:
          hosts:
            clx-host-081:
            clx-host-082:
            clx-host-083:
            clx-host-084:
        etcd:
          hosts:
            node1:
            node2:
            node3:
        k8s_cluster:
          children:
            kube_control_plane:
            kube_node:
        calico_rr:
          hosts: {}

Review and change cluster installation parameters in the files:

  • inventory/mycluster/group_vars/all/all.yml

In inventory/mycluster/group_vars/all/all.yml, set the following parameters to use an external load balancer and disable the internal one:

inventory/mycluster/group_vars/all/all.yml

    ...

    ## External LB example config
    apiserver_loadbalancer_domain_name: "ha-k8s.clx.labs.mlnx"
    loadbalancer_apiserver:
      address: 192.168.222.110
      port: 6443

    ## Internal loadbalancers for apiservers
    loadbalancer_apiserver_localhost: false

    ...
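Before starting the playbook, it may be worth confirming that Ansible can reach every host in the new inventory. This is an optional sketch rather than a step from the original procedure; the function name is hypothetical, and it relies on the standard Ansible ad-hoc ping module.

```shell
#!/bin/sh
# Hypothetical pre-flight check: make sure the inventory file exists and that
# every host in it answers Ansible's ping module (an SSH and Python check,
# not an ICMP ping).
preflight_inventory() {
    inv="${1:-inventory/mycluster/hosts.yaml}"
    if [ ! -f "$inv" ]; then
        echo "inventory not found: $inv" >&2
        return 1
    fi
    ansible -i "$inv" all -m ping
}

# Usage from ~/kubespray-2.18.0:
#   preflight_inventory
```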

Deploying the Cluster Using KubeSpray Ansible Playbook

Run the following line to start the deployment process:

    ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml

This deployment takes a while to complete. Make sure that no errors are encountered.

A successful result should look something like the following:

    ...
    PLAY RECAP ***********************************************************************
    clx-host-081 : ok=401 changed=31 unreachable=0 failed=0 skipped=718  rescued=0 ignored=1
    clx-host-082 : ok=401 changed=31 unreachable=0 failed=0 skipped=718  rescued=0 ignored=1
    clx-host-083 : ok=401 changed=31 unreachable=0 failed=0 skipped=718  rescued=0 ignored=1
    clx-host-084 : ok=401 changed=30 unreachable=0 failed=0 skipped=718  rescued=0 ignored=1
    localhost    : ok=4   changed=0  unreachable=0 failed=0 skipped=0    rescued=0 ignored=0
    node1        : ok=556 changed=62 unreachable=0 failed=0 skipped=1235 rescued=0 ignored=3
    node2        : ok=505 changed=74 unreachable=0 failed=0 skipped=1080 rescued=0 ignored=2
    node3        : ok=507 changed=53 unreachable=0 failed=0 skipped=1078 rescued=0 ignored=2

    Thursday 17 February 2022 23:11:54 +0000 (0:00:00.265) 0:29:39.691 *****
    ===============================================================================
    kubernetes/control-plane : Joining control plane node to the cluster. ---------------------- 810.38s
    kubernetes/control-plane : kubeadm | Initialize first master --------------------------------- 41.98s
    kubernetes/control-plane : Master | wait for kube-scheduler ---------------------------------- 21.27s
    kubernetes-apps/ansible : Kubernetes Apps | Start Resources ---------------------------------- 15.54s
    policy_controller/calico : Start of Calico kube controllers ---------------------------------- 14.76s
    kubernetes/control-plane : Master | Remove controller manager container containerd/crio ------ 11.30s
    kubernetes/control-plane : Master | Remove scheduler container containerd/crio --------------- 11.25s
    kubernetes/preinstall : Update package management cache (APT) -------------------------------- 10.33s
    kubernetes/node : install | Copy kubelet binary from download dir ----------------------------- 9.83s
    network_plugin/calico : Start Calico resources ------------------------------------------------ 8.96s
    download : download | Download files / images ------------------------------------------------- 8.52s
    kubernetes/kubeadm : Join to cluster ---------------------------------------------------------- 8.39s
    container-engine/crictl : extract_file | Unpacking archive ------------------------------------ 8.35s
    container-engine/runc : download_file | Download item ----------------------------------------- 8.17s
    container-engine/crictl : download_file | Download item --------------------------------------- 7.84s
    container-engine/containerd : download_file | Download item ----------------------------------- 7.80s
    container-engine/nerdctl : extract_file | Unpacking archive ----------------------------------- 7.63s
    network_plugin/calico : Calico | Create Calico Kubernetes datastore resources ----------------- 7.57s
    container-engine/nerdctl : extract_file | Unpacking archive ----------------------------------- 7.55s
    container-engine/nerdctl : download_file | Download item -------------------------------------- 7.51s
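On a long run, a single failed or unreachable host is easy to miss in the recap. As a sketch (the helper name recap_failures and the log file name deploy.log are hypothetical), the recap lines can be scanned for non-zero unreachable or failed counters:

```shell
#!/bin/sh
# Hypothetical helper: read ansible-playbook output on stdin, print every host
# whose PLAY RECAP line has unreachable>0 or failed>0, and return non-zero
# if any such host was found.
recap_failures() {
    awk '/unreachable=[0-9]+/ && /failed=[0-9]+/ {
             u = $0; sub(/.*unreachable=/, "", u); sub(/[^0-9].*/, "", u)
             f = $0; sub(/.*failed=/, "", f);      sub(/[^0-9].*/, "", f)
             if (u + 0 > 0 || f + 0 > 0) { print $1; bad = 1 }
         }
         END { exit bad + 0 }'
}

# Usage:
#   ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml | tee deploy.log
#   recap_failures < deploy.log && echo "no failed or unreachable hosts"
```

The playbook's own exit status remains the primary signal; this check is only a convenience when reviewing a saved log.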

K8s Cluster Customization and Verification

Now that the K8s cluster is deployed, you can connect to it from any K8s Master Node with the root user account, or from another server that has the kubectl command installed and KUBECONFIG=<path-to-config-file> configured, to customize the deployment.

In our guide we continue the deployment from depserver with the root user account:

    ## Install KUBECTL
    snap install kubectl --channel=1.22/stable --classic

To start using your cluster, you need to run the following commands as a regular user:

    mkdir -p $HOME/.kube
    scp depuser@node1:/etc/kubernetes/admin.conf $HOME/.kube/config
    sudo chown $(id -u):$(id -g) $HOME/.kube/config

Label the Worker Nodes:

Master Node console

kubectl label nodes clx-host-081 node-role.kubernetes.io/worker=
kubectl label nodes clx-host-082 node-role.kubernetes.io/worker=
kubectl label nodes clx-host-083 node-role.kubernetes.io/worker=
kubectl label nodes clx-host-084 node-role.kubernetes.io/worker=
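For clusters with more worker nodes, the same labeling can be scripted. The following is a minimal sketch, not part of the original procedure: the helper name label_workers is hypothetical, and the --overwrite flag is added on the assumption that you want the command to be safe to re-run.

```shell
#!/bin/sh
# Label every node passed as an argument with the worker role.
# label_workers is an illustrative helper name, not a kubectl feature.
label_workers() {
    for node in "$@"; do
        # --overwrite keeps the call idempotent if the label already exists
        kubectl label nodes "$node" node-role.kubernetes.io/worker= --overwrite
    done
}
```

With the hostnames used in this guide, it could be called as: label_workers clx-host-081 clx-host-082 clx-host-083 clx-host-084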

Important

K8s Worker Node labeling is required for a proper installation of the NVIDIA Network Operator.

Below is an output example of the K8s cluster deployment information using the Calico CNI plugin.

To ensure that the Kubernetes cluster is installed correctly, run the following commands:

## Get cluster node status

kubectl get node -o wide

NAME           STATUS   ROLES                  AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
clx-host-081   Ready    worker                 26h   v1.22.5   192.168.222.101   <none>        Ubuntu 20.04.4 LTS   5.4.0-100-generic   containerd://1.5.8
clx-host-082   Ready    worker                 26h   v1.22.5   192.168.222.102   <none>        Ubuntu 20.04.4 LTS   5.4.0-100-generic   containerd://1.5.8
clx-host-083   Ready    worker                 26h   v1.22.5   192.168.222.103   <none>        Ubuntu 20.04.4 LTS   5.4.0-100-generic   containerd://1.5.8
clx-host-084   Ready    worker                 26h   v1.22.5   192.168.222.104   <none>        Ubuntu 20.04.4 LTS   5.4.0-100-generic   containerd://1.5.8
node1          Ready    control-plane,master   26h   v1.22.5   192.168.222.111   <none>        Ubuntu 20.04.4 LTS   5.4.0-100-generic   containerd://1.5.8
node2          Ready    control-plane,master   26h   v1.22.5   192.168.222.112   <none>        Ubuntu 20.04.3 LTS   5.4.0-100-generic   containerd://1.5.8
node3          Ready    control-plane,master   26h   v1.22.5   192.168.222.113   <none>        Ubuntu 20.04.3 LTS   5.4.0-100-generic   containerd://1.5.8

## Get system pods status

kubectl -n kube-system get pods -o wide

NAME                                      READY   STATUS    RESTARTS   AGE   IP                NODE           NOMINATED NODE   READINESS GATES
calico-kube-controllers-5788f6558-d9zcd   1/1     Running   6          26h   192.168.222.103   clx-host-083   <none>           <none>
calico-node-7gdzm                         1/1     Running   1          26h   192.168.222.104   clx-host-084   <none>           <none>
calico-node-f6wz4                         1/1     Running   1          26h   192.168.222.103   clx-host-083   <none>           <none>
calico-node-fgtl7                         1/1     Running   1          26h   192.168.222.102   clx-host-082   <none>           <none>
calico-node-tb7hg                         1/1     Running   1          26h   192.168.222.113   node3          <none>           <none>
calico-node-v2hwz                         1/1     Running   1          26h   192.168.222.101   clx-host-081   <none>           <none>
calico-node-v7w7m                         1/1     Running   0          26h   192.168.222.111   node1          <none>           <none>
calico-node-vh984                         1/1     Running   1          26h   192.168.222.112   node2          <none>           <none>
coredns-8474476ff8-5rkrd                  1/1     Running   0          26h   10.233.74.1       clx-host-082   <none>           <none>
coredns-8474476ff8-crqh5                  1/1     Running   0          26h   10.233.112.1      clx-host-084   <none>           <none>
coredns-8474476ff8-n567s                  1/1     Running   0          26h   10.233.111.1      clx-host-081   <none>           <none>
coredns-8474476ff8-vr2ls                  1/1     Running   0          26h   10.233.90.1       node1          <none>           <none>
coredns-8474476ff8-wmcgv                  1/1     Running   0          26h   10.233.78.1       clx-host-083   <none>           <none>
dns-autoscaler-5ffdc7f89d-7fx8d           1/1     Running   0          26h   10.233.90.2       node1          <none>           <none>
etcd-node1                                1/1     Running   2          26h   192.168.222.111   node1          <none>           <none>
etcd-node2                                1/1     Running   1          26h   192.168.222.112   node2          <none>           <none>
etcd-node3                                1/1     Running   1          26h   192.168.222.113   node3          <none>           <none>
kube-apiserver-node1                      1/1     Running   4          26h   192.168.222.111   node1          <none>           <none>
kube-apiserver-node2                      1/1     Running   1          26h   192.168.222.112   node2          <none>           <none>
kube-apiserver-node3                      1/1     Running   1          26h   192.168.222.113   node3          <none>           <none>
kube-controller-manager-node1             1/1     Running   4          26h   192.168.222.111   node1          <none>           <none>
kube-controller-manager-node2             1/1     Running   3          26h   192.168.222.112   node2          <none>           <none>
kube-controller-manager-node3             1/1     Running   3          26h   192.168.222.113   node3          <none>           <none>
kube-proxy-7hrqw                          1/1     Running   0          26h   192.168.222.101   clx-host-081   <none>           <none>
kube-proxy-9n5lh                          1/1     Running   0          26h   192.168.222.111   node1          <none>           <none>
kube-proxy-b8mxv                          1/1     Running   1          26h   192.168.222.113   node3          <none>           <none>
kube-proxy-bq6zs                          1/1     Running   1          26h   192.168.222.112   node2          <none>           <none>
kube-proxy-cz7pz                          1/1     Running   0          26h   192.168.222.104   clx-host-084   <none>           <none>
kube-proxy-jrrw2                          1/1     Running   0          26h   192.168.222.103   clx-host-083   <none>           <none>
kube-proxy-rnt6g                          1/1     Running   0          26h   192.168.222.102   clx-host-082   <none>           <none>
kube-scheduler-node1                      1/1     Running   2          26h   192.168.222.111   node1          <none>           <none>
kube-scheduler-node2                      1/1     Running   2          26h   192.168.222.112   node2          <none>           <none>
kube-scheduler-node3                      1/1     Running   2          26h   192.168.222.113   node3          <none>           <none>
nodelocaldns-jf62n                        1/1     Running   0          26h   192.168.222.104   clx-host-084   <none>           <none>
nodelocaldns-lpmn7                        1/1     Running   1          26h   192.168.222.113   node3          <none>           <none>
nodelocaldns-pkhht                        1/1     Running   0          26h   192.168.222.103   clx-host-083   <none>           <none>
nodelocaldns-rr6b2                        1/1     Running   1          26h   192.168.222.112   node2          <none>           <none>
nodelocaldns-s2vnx                        1/1     Running   0          26h   192.168.222.102   clx-host-082   <none>           <none>
nodelocaldns-sngtb                        1/1     Running   0          26h   192.168.222.111   node1          <none>           <none>
nodelocaldns-x8nsf                        1/1     Running   0          26h   192.168.222.101   clx-host-081   <none>           <none>

NVIDIA GPU Operator Installation

The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision the GPU. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring, and others. For information on platform support and getting started, visit the official documentation repository.

Helm is required for the GPU Operator deployment:

## Install HELM
snap install helm --classic

Add the NVIDIA Helm repository:

## Add REPO
helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update

Install the GPU Operator in the K8s cluster on the DGX server platform:

## Install GPU Operator
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false --set dcgm.enabled=false

## Review installation
helm ls -n gpu-operator

NAME                      NAMESPACE      REVISION   UPDATED                                   STATUS     CHART                 APP VERSION
gpu-operator-1646920855   gpu-operator   1          2022-03-10 14:01:05.942790618 +0000 UTC   deployed   gpu-operator-v1.9.1   v1.9.1

Once the Helm chart is installed, check the status of the pods to ensure all the containers are running and the validation is complete:

kubectl -n gpu-operator get pod -o wide

NAME                                                              READY   STATUS      RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
gpu-feature-discovery-25csp                                       1/1     Running     0          2m14s   10.233.74.5     clx-host-082   <none>           <none>
gpu-feature-discovery-4j5x2                                       1/1     Running     0          2m14s   10.233.78.7     clx-host-083   <none>           <none>
gpu-feature-discovery-dthsq                                       1/1     Running     0          2m14s   10.233.112.4    clx-host-084   <none>           <none>
gpu-feature-discovery-p7spz                                       1/1     Running     0          2m14s   10.233.111.4    clx-host-081   <none>           <none>
gpu-operator-1646920855-node-feature-discovery-master-58cdc4vsk   1/1     Running     0          4m2s    10.233.96.2     node2          <none>           <none>
gpu-operator-1646920855-node-feature-discovery-worker-24ws8       1/1     Running     0          4m2s    10.233.92.4     node3          <none>           <none>
gpu-operator-1646920855-node-feature-discovery-worker-4xhkb       1/1     Running     0          4m2s    10.233.78.3     clx-host-083   <none>           <none>
gpu-operator-1646920855-node-feature-discovery-worker-ct6r7       1/1     Running     0          4m2s    10.233.111.2    clx-host-081   <none>           <none>
gpu-operator-1646920855-node-feature-discovery-worker-pf2bx       1/1     Running     0          4m2s    10.233.74.2     clx-host-082   <none>           <none>
gpu-operator-1646920855-node-feature-discovery-worker-ppwq7       1/1     Running     0          4m2s    10.233.90.3     node1          <none>           <none>
gpu-operator-1646920855-node-feature-discovery-worker-qv8k9       1/1     Running     0          4m2s    10.233.96.3     node2          <none>           <none>
gpu-operator-1646920855-node-feature-discovery-worker-sqgww       1/1     Running     0          4m3s    10.233.112.2    clx-host-084   <none>           <none>
gpu-operator-84b88fc49c-98wb7                                     1/1     Running     0          4m2s    10.233.92.3     node3          <none>           <none>
nvidia-container-toolkit-daemonset-4mtwz                          1/1     Running     0          2m13s   10.233.74.3     clx-host-082   <none>           <none>
nvidia-container-toolkit-daemonset-h9xzm                          1/1     Running     0          2m13s   10.233.112.3    clx-host-084   <none>           <none>
nvidia-container-toolkit-daemonset-kqnsr                          1/1     Running     0          2m13s   10.233.78.4     clx-host-083   <none>           <none>
nvidia-container-toolkit-daemonset-zwvd9                          1/1     Running     0          2m12s   10.233.111.3    clx-host-081   <none>           <none>
nvidia-cuda-validator-c5lmr                                       0/1     Completed   0          110s    10.233.112.8    clx-host-084   <none>           <none>
nvidia-cuda-validator-qlj4z                                       0/1     Completed   0          100s    10.233.78.9     clx-host-083   <none>           <none>
nvidia-cuda-validator-rfdsd                                       0/1     Completed   0          98s     10.233.111.8    clx-host-081   <none>           <none>
nvidia-cuda-validator-xqh28                                       0/1     Completed   0          104s    10.233.74.8     clx-host-082   <none>           <none>
nvidia-dcgm-exporter-9rjqv                                        1/1     Running     0          2m16s   10.233.111.5    clx-host-081   <none>           <none>
nvidia-dcgm-exporter-bl24c                                        1/1     Running     0          2m16s   10.233.112.6    clx-host-084   <none>           <none>
nvidia-dcgm-exporter-nbn8z                                        1/1     Running     0          2m15s   10.233.74.7     clx-host-082   <none>           <none>
nvidia-dcgm-exporter-trclg                                        1/1     Running     0          2m16s   10.233.78.5     clx-host-083   <none>           <none>
nvidia-device-plugin-daemonset-72b9c                              1/1     Running     0          2m14s   10.233.112.7    clx-host-084   <none>           <none>
nvidia-device-plugin-daemonset-cz89s                              1/1     Running     0          2m15s   10.233.111.6    clx-host-081   <none>           <none>
nvidia-device-plugin-daemonset-nfrsr                              1/1     Running     0          2m14s   10.233.78.8     clx-host-083   <none>           <none>
nvidia-device-plugin-daemonset-rrpxg                              1/1     Running     0          2m14s   10.233.74.4     clx-host-082   <none>           <none>
nvidia-device-plugin-validator-2n686                              0/1     Completed   0          89s     10.233.78.10    clx-host-083   <none>           <none>
nvidia-device-plugin-validator-bt55c                              0/1     Completed   0          87s     10.233.111.9    clx-host-081   <none>           <none>
nvidia-device-plugin-validator-dczfx                              0/1     Completed   0          103s    10.233.112.9    clx-host-084   <none>           <none>
nvidia-device-plugin-validator-kssds                              0/1     Completed   0          93s     10.233.74.9     clx-host-082   <none>           <none>
nvidia-mig-manager-2wtr9                                          1/1     Running     0          79s     10.233.78.11    clx-host-083   <none>           <none>
nvidia-mig-manager-49vpk                                          1/1     Running     0          83s     10.233.74.10    clx-host-082   <none>           <none>
nvidia-mig-manager-4dktw                                          1/1     Running     0          79s     10.233.112.10   clx-host-084   <none>           <none>
nvidia-mig-manager-kh8qd                                          1/1     Running     0          80s     10.233.111.10   clx-host-081   <none>           <none>
nvidia-operator-validator-6dnpw                                   1/1     Running     0          2m16s   10.233.74.6     clx-host-082   <none>           <none>
nvidia-operator-validator-gztcz                                   1/1     Running     0          2m15s   10.233.112.5    clx-host-084   <none>           <none>
nvidia-operator-validator-vk98p                                   1/1     Running     0          2m16s   10.233.111.7    clx-host-081   <none>           <none>
nvidia-operator-validator-wdz79                                   1/1     Running     0          2m16s   10.233.78.6     clx-host-083   <none>           <none>
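Scanning a long pod listing by eye is error-prone, so the check can be automated. The following is a minimal sketch, not part of the original procedure: it assumes the default column layout of kubectl get pod --no-headers (STATUS in the third column), and the helper name unhealthy_pods is hypothetical.

```shell
#!/bin/sh
# Read `kubectl get pod --no-headers` output on stdin and print the name and
# status of every pod that is neither Running nor Completed.
# Exit status: 0 if all pods are healthy, 1 otherwise.
# unhealthy_pods is an illustrative helper name.
unhealthy_pods() {
    awk '$3 != "Running" && $3 != "Completed" { print $1, $3; bad = 1 }
         END { exit bad + 0 }'
}
```

A possible invocation is: kubectl -n gpu-operator get pod --no-headers | unhealthy_pods. Validator pods are expected to finish in the Completed state, which is why that status is treated as healthy here.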

NVIDIA Network Operator Installation

The NVIDIA Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components in order to enable fast networking and RDMA for workloads in a K8s cluster. The fast network is a secondary network of the K8s cluster for applications that require high bandwidth or low latency.

To make it work, several components need to be provisioned and configured. Helm is required for the Network Operator deployment.

Add the NVIDIA Network Operator Helm repository:

## Add REPO
helm repo add mellanox https://mellanox.github.io/network-operator \
  && helm repo update

Create the values.yaml file to customize the Network Operator deployment (example):

values.yaml

nfd:
  enabled: true

sriovNetworkOperator:
  enabled: true

deployCR: true
ofedDriver:
  deploy: false

nvPeerDriver:
  deploy: false

rdmaSharedDevicePlugin:
  deploy: false

sriovDevicePlugin:
  deploy: false

secondaryNetwork:
  deploy: true
  cniPlugins:
    deploy: true
  multus:
    deploy: true
  ipamPlugin:
    deploy: true

Deploy the operator:

helm install -f ./values.yaml -n network-operator --create-namespace --wait mellanox/network-operator --generate-name

NAME: network-operator-1646925670
LAST DEPLOYED: Thu Mar 10 15:21:22 2022
NAMESPACE: network-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Get Network Operator deployed resources by running the following commands:

$ kubectl -n network-operator get pods
$ kubectl -n nvidia-network-operator-resources get pods

Once the Helm chart is installed, check the status of the pods to ensure all the containers are running:

## POD status in namespace - network-operator
kubectl -n network-operator get pods -o wide

NAME                                                              READY   STATUS    RESTARTS   AGE     IP                NODE           NOMINATED NODE   READINESS GATES
network-operator-1646925670-68d8f875f9-bzl4t                      1/1     Running   0          3m36s   10.233.90.5       node1          <none>           <none>
network-operator-1646925670-node-feature-discovery-master-mrzvc   1/1     Running   0          3m36s   10.233.96.5       node2          <none>           <none>
network-operator-1646925670-node-feature-discovery-worker-2hszv   1/1     Running   0          3m36s   10.233.78.12      clx-host-083   <none>           <none>
network-operator-1646925670-node-feature-discovery-worker-4xtct   1/1     Running   0          3m36s   10.233.96.4       node2          <none>           <none>
network-operator-1646925670-node-feature-discovery-worker-62lhk   1/1     Running   0          3m36s   10.233.112.11     clx-host-084   <none>           <none>
network-operator-1646925670-node-feature-discovery-worker-8vbhk   1/1     Running   0          3m36s   10.233.74.11      clx-host-082   <none>           <none>
network-operator-1646925670-node-feature-discovery-worker-8vrqt   1/1     Running   0          3m36s   10.233.111.11     clx-host-081   <none>           <none>
network-operator-1646925670-node-feature-discovery-worker-cv9rc   1/1     Running   0          3m36s   10.233.90.4       node1          <none>           <none>
network-operator-1646925670-node-feature-discovery-worker-hbr7k   1/1     Running   0          3m36s   10.233.92.5       node3          <none>           <none>
network-operator-1646925670-sriov-network-operator-6b75fd8ng66c   1/1     Running   0          3m36s   10.233.90.6       node1          <none>           <none>
sriov-network-config-daemon-85dq5                                 3/3     Running   0          3m30s   192.168.222.103   clx-host-083   <none>           <none>
sriov-network-config-daemon-8hn6g                                 3/3     Running   0          3m20s   192.168.222.104   clx-host-084   <none>           <none>
sriov-network-config-daemon-9jb2j                                 3/3     Running   0          3m20s   192.168.222.101   clx-host-081   <none>           <none>
sriov-network-config-daemon-kd6bp                                 3/3     Running   0          3m10s   192.168.222.102   clx-host-082   <none>           <none>

## POD status in namespace - nvidia-network-operator-resources
kubectl -n nvidia-network-operator-resources get pods -o wide

NAME                   READY   STATUS    RESTARTS   AGE     IP                NODE           NOMINATED NODE   READINESS GATES
cni-plugins-ds-9mg2g   1/1     Running   0          3m27s   192.168.222.101   clx-host-081   <none>           <none>
cni-plugins-ds-lwzkn   1/1     Running   0          3m26s   192.168.222.103   clx-host-083   <none>           <none>
cni-plugins-ds-w4pvx   1/1     Running   0          3m26s   192.168.222.104   clx-host-084   <none>           <none>
cni-plugins-ds-w5hm8   1/1     Running   0          3m26s   192.168.222.102   clx-host-082   <none>           <none>
kube-multus-ds-2xwws   1/1     Running   0          3m26s   192.168.222.102   clx-host-082   <none>           <none>
kube-multus-ds-85cxw   1/1     Running   0          3m27s   192.168.222.101   clx-host-081   <none>           <none>
kube-multus-ds-vk6hq   1/1     Running   0          3m26s   192.168.222.103   clx-host-083   <none>           <none>
kube-multus-ds-xjx6x   1/1     Running   0          3m26s   192.168.222.104   clx-host-084   <none>           <none>
whereabouts-6ftfb      1/1     Running   0          3m25s   192.168.222.103   clx-host-083   <none>           <none>
whereabouts-89f2h      1/1     Running   0          3m25s   192.168.222.101   clx-host-081   <none>           <none>
whereabouts-k6w4s      1/1     Running   0          3m24s   192.168.222.102   clx-host-082   <none>           <none>
whereabouts-nqlb9      1/1     Running   0          3m25s   192.168.222.104   clx-host-084   <none>           <none>

High-Speed Network Configuration

After installing the operator, check the SriovNetworkNodeState CRs to see all SR-IOV-enabled devices in your node.
In this deployment, the network interfaces have been chosen with the following names: enp12s0, enp18s0, enp75s0, enp84s0, enp141s0, enp148s0, enp186s0 and enp204s0.

To review the interface status please use the following command:

NICs status

## NIC status
kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io clx-host-081 -o yaml

...
status:
  interfaces:
  - deviceID: 101b
    driver: mlx5_core
    linkSpeed: 200000 Mb/s
    linkType: ETH
    mac: 04:3f:72:b1:f4:fc
    mtu: 1500
    name: enp12s0
    pciAddress: 0000:0c:00.0
    totalvfs: 4
    vendor: 15b3
  - deviceID: 101b
    driver: mlx5_core
    linkSpeed: 200000 Mb/s
    linkType: ETH
    mac: 04:3f:72:c0:02:b2
    mtu: 1500
    name: enp18s0
    pciAddress: "0000:12:00.0"
    totalvfs: 4
    vendor: 15b3
  - deviceID: 101b
    driver: mlx5_core
    linkSpeed: 200000 Mb/s
    linkType: ETH
    mac: 04:3f:72:b1:f6:c8
    mtu: 1500
    name: enp75s0
    pciAddress: 0000:4b:00.0
    totalvfs: 4
    vendor: 15b3
  - deviceID: 101b
    driver: mlx5_core
    linkSpeed: 200000 Mb/s
    linkType: ETH
    mac: 04:3f:72:b1:f5:08
    mtu: 1500
    name: enp84s0
    pciAddress: "0000:54:00.0"
    totalvfs: 4
    vendor: 15b3
  - deviceID: 101b
    driver: mlx5_core
    linkSpeed: 200000 Mb/s
    linkType: ETH
    mac: 04:3f:72:b1:f2:d4
    mtu: 1500
    name: enp141s0
    pciAddress: 0000:8d:00.0
    totalvfs: 4
    vendor: 15b3
  - deviceID: 101b
    driver: mlx5_core
    linkSpeed: 200000 Mb/s
    linkType: ETH
    mac: 04:3f:72:c0:00:e2
    mtu: 1500
    name: enp148s0
    pciAddress: 0000:94:00.0
    totalvfs: 4
    vendor: 15b3
  - deviceID: 101b
    driver: mlx5_core
    linkSpeed: 200000 Mb/s
    linkType: ETH
    mac: 04:3f:72:b1:f6:f0
    mtu: 1500
    name: enp186s0
    pciAddress: 0000:ba:00.0
    totalvfs: 4
    vendor: 15b3
  - deviceID: 101b
    driver: mlx5_core
    linkSpeed: 200000 Mb/s
    linkType: ETH
    mac: 04:3f:72:b1:f6:bc
    mtu: 1500
    name: enp204s0
    pciAddress: 0000:cc:00.0
    totalvfs: 4
    vendor: 15b3
  - deviceID: 101b
    driver: mlx5_core
    linkSpeed: 100000 Mb/s
    linkType: ETH
    mac: 04:3f:72:c1:cb:f0
    mtu: 1500
    name: enp225s0f0
    pciAddress: 0000:e1:00.0
    vendor: 15b3
  - deviceID: 101b
    driver: mlx5_core
    linkType: ETH
    mac: 04:3f:72:c1:cb:f1
    mtu: 1500
    name: enp225s0f1
    pciAddress: 0000:e1:00.1
    vendor: 15b3
  - deviceID: "1533"
    driver: igb
    linkType: ETH
    mac: 5c:ff:35:e2:1e:41
    mtu: 1500
    name: enp226s0
    pciAddress: 0000:e2:00.0
    vendor: "8086"
  syncStatus: Succeeded

Create a SriovNetworkNodePolicy CR for each chosen network interface in a policy.yaml file, specifying the interface in the 'nicSelector' (in this example, the first policy selects the enp12s0 interface):

policy.yaml

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw1
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: roce_sw1
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp12s0" ]
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw2
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: roce_sw2
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp18s0" ]
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw3
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: roce_sw3
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp75s0" ]
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw4
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: roce_sw4
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp84s0" ]
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw5
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: roce_sw5
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp141s0" ]
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw6
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: roce_sw6
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp148s0" ]
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw7
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: roce_sw7
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp186s0" ]
  deviceType: netdevice
  isRdma: true
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw8
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: roce_sw8
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "enp204s0" ]
  deviceType: netdevice
  isRdma: true
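Because the eight policies differ only in their index and interface name, the file can also be generated rather than written by hand. The following is a minimal sketch under that assumption; the helper name gen_sriov_policies is hypothetical, and the field values are copied from the policy.yaml example in this guide.

```shell
#!/bin/sh
# Emit one SriovNetworkNodePolicy per physical function name given as an
# argument. Policies are numbered sw1, sw2, ... in argument order.
# gen_sriov_policies is an illustrative helper name.
gen_sriov_policies() {
    i=1
    for pf in "$@"; do
        # Separate YAML documents with '---' (not before the first one)
        [ "$i" -gt 1 ] && echo "---"
        cat <<EOF
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlnxnics-sw$i
  namespace: network-operator
spec:
  nodeSelector:
    feature.node.kubernetes.io/custom-rdma.capable: "true"
  resourceName: roce_sw$i
  priority: 99
  mtu: 9000
  numVfs: 8
  nicSelector:
    pfNames: [ "$pf" ]
  deviceType: netdevice
  isRdma: true
EOF
        i=$((i + 1))
    done
}
```

For the interfaces chosen in this deployment, a possible call is: gen_sriov_policies enp12s0 enp18s0 enp75s0 enp84s0 enp141s0 enp148s0 enp186s0 enp204s0 > policy.yaml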

Deploy policy.yaml:

kubectl apply -f policy.yaml

sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw1 created
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw2 created
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw3 created
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw4 created
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw5 created
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw6 created
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw7 created
sriovnetworknodepolicy.sriovnetwork.openshift.io/mlnxnics-sw8 created

Important

This step may take a while, depending on the number of K8s Worker Nodes to which the configuration is applied and the number of VFs for each selected network interface.
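Rather than re-running the status command by hand, the wait can be scripted by polling the syncStatus field shown in the NIC status output. The following is a minimal sketch, not part of the original procedure: the helper names all_synced and wait_for_sriov_sync are hypothetical, and the 10-second interval and default of 60 attempts are arbitrary assumptions.

```shell
#!/bin/sh
# Succeed only if every whitespace-separated word in $1 is "Succeeded"
# and at least one status is present. all_synced is an illustrative name.
all_synced() {
    [ -n "$1" ] || return 1
    for s in $1; do
        [ "$s" = "Succeeded" ] || return 1
    done
    return 0
}

# Poll the syncStatus of all SriovNetworkNodeState CRs until they all report
# Succeeded, or give up after the given number of attempts (default 60).
wait_for_sriov_sync() {
    tries=${1:-60}
    while [ "$tries" -gt 0 ]; do
        statuses=$(kubectl -n network-operator \
            get sriovnetworknodestates.sriovnetwork.openshift.io \
            -o jsonpath='{.items[*].status.syncStatus}')
        all_synced "$statuses" && return 0
        tries=$((tries - 1))
        sleep 10   # assumed polling interval
    done
    return 1
}
```

Calling wait_for_sriov_sync with no arguments would then block for up to about ten minutes before reporting failure.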

Create an SriovNetwork CR for each chosen network interface in a network.yaml file. Each CR refers to the 'resourceName' defined in the corresponding SriovNetworkNodePolicy (in this example, the roce_swX resources) and sets the CIDR range for the high-speed network:

network.yaml

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: network-sw1
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.101.0/24"
    }
  networkNamespace: default
  resourceName: roce_sw1
  vlan: 11
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: network-sw2
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.102.0/24"
    }
  networkNamespace: default
  resourceName: roce_sw2
  vlan: 12
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: network-sw3
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.103.0/24"
    }
  networkNamespace: default
  resourceName: roce_sw3
  vlan: 13
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: network-sw4
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.104.0/24"
    }
  networkNamespace: default
  resourceName: roce_sw4
  vlan: 14
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: network-sw5
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.105.0/24"
    }
  networkNamespace: default
  resourceName: roce_sw5
  vlan: 15
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: network-sw6
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.106.0/24"
    }
  networkNamespace: default
  resourceName: roce_sw6
  vlan: 16
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: network-sw7
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.107.0/24"
    }
  networkNamespace: default
  resourceName: roce_sw7
  vlan: 17
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: network-sw8
  namespace: network-operator
spec:
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.108.0/24"
    }
  networkNamespace: default
  resourceName: roce_sw8
  vlan: 18

Deploy network.yaml:

kubectl apply -f network.yaml

sriovnetwork.sriovnetwork.openshift.io/network-sw1 created
sriovnetwork.sriovnetwork.openshift.io/network-sw2 created
sriovnetwork.sriovnetwork.openshift.io/network-sw3 created
sriovnetwork.sriovnetwork.openshift.io/network-sw4 created
sriovnetwork.sriovnetwork.openshift.io/network-sw5 created
sriovnetwork.sriovnetwork.openshift.io/network-sw6 created
sriovnetwork.sriovnetwork.openshift.io/network-sw7 created
sriovnetwork.sriovnetwork.openshift.io/network-sw8 created

Validating the Deployment

Check the deployed network:

kubectl get network-attachment-definitions.k8s.cni.cncf.io

NAME          AGE
network-sw1   33m
network-sw2   33m
network-sw3   33m
network-sw4   33m
network-sw5   33m
network-sw6   33m
network-sw7   33m
network-sw8   33m

Check the Worker Node resources:

kubectl get node clx-host-081 -o json | jq '.status.allocatable'

{
  "cpu": "255900m",
  "ephemeral-storage": "1698708802820",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "1056271380Ki",
  "nvidia.com/gpu": "8",
  "nvidia.com/roce_sw1": "8",
  "nvidia.com/roce_sw2": "8",
  "nvidia.com/roce_sw3": "8",
  "nvidia.com/roce_sw4": "8",
  "nvidia.com/roce_sw5": "8",
  "nvidia.com/roce_sw6": "8",
  "nvidia.com/roce_sw7": "8",
  "nvidia.com/roce_sw8": "8",
  "pods": "110"
}

kubectl get node clx-host-082 -o json | jq '.status.allocatable'

{
  "cpu": "255900m",
  "ephemeral-storage": "1698708802820",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "1056271428Ki",
  "nvidia.com/gpu": "8",
  "nvidia.com/roce_sw1": "8",
  "nvidia.com/roce_sw2": "8",
  "nvidia.com/roce_sw3": "8",
  "nvidia.com/roce_sw4": "8",
  "nvidia.com/roce_sw5": "8",
  "nvidia.com/roce_sw6": "8",
  "nvidia.com/roce_sw7": "8",
  "nvidia.com/roce_sw8": "8",
  "pods": "110"
}

kubectl get node clx-host-083 -o json | jq '.status.allocatable'

{
  "cpu": "255900m",
  "ephemeral-storage": "1698708802820",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "1056275120Ki",
  "nvidia.com/gpu": "8",
  "nvidia.com/roce_sw1": "8",
  "nvidia.com/roce_sw2": "8",
  "nvidia.com/roce_sw3": "8",
  "nvidia.com/roce_sw4": "8",
  "nvidia.com/roce_sw5": "8",
  "nvidia.com/roce_sw6": "8",
  "nvidia.com/roce_sw7": "8",
  "nvidia.com/roce_sw8": "8",
  "pods": "110"
}

kubectl get node clx-host-084 -o json | jq '.status.allocatable'

{
  "cpu": "255900m",
  "ephemeral-storage": "1698708802820",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "1056270348Ki",
  "nvidia.com/gpu": "8",
  "nvidia.com/roce_sw1": "8",
  "nvidia.com/roce_sw2": "8",
  "nvidia.com/roce_sw3": "8",
  "nvidia.com/roce_sw4": "8",
  "nvidia.com/roce_sw5": "8",
  "nvidia.com/roce_sw6": "8",
  "nvidia.com/roce_sw7": "8",
  "nvidia.com/roce_sw8": "8",
  "pods": "110"
}
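The per-node check above can be reduced to a pass/fail test. The following is a minimal sketch, assuming the allocatable JSON is supplied on standard input and that every nvidia.com/roce_swN resource should report the same VF count (8 in this guide); the helper name check_roce_vfs is hypothetical.

```shell
#!/bin/sh
# Read a node's allocatable JSON on stdin and verify that every
# nvidia.com/roce_swN resource advertises the expected number of VFs.
# Prints the number of roce_swN resources found; fails if any value differs
# from the expected one or if no such resource exists.
# check_roce_vfs is an illustrative helper name.
check_roce_vfs() {
    expected=$1
    grep -o '"nvidia.com/roce_sw[0-9]*": *"[0-9]*"' |
        awk -F'"' -v want="$expected" '
            { n++; if ($4 != want) bad = 1 }
            END { print n + 0; if (bad || n == 0) exit 1 }'
}
```

A possible invocation, reusing the command from this section, is: kubectl get node clx-host-081 -o json | jq '.status.allocatable' | check_roce_vfs 8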

Run synthetic RDMA benchmark tests with ib_write_bw between two pods that are running on different K8s Worker Nodes.

This step includes the following:

  • Create a container image and push it to your repository

  • Deploy K8s deployment apps

  • Run test

RDMA benchmark Dockerfile:

FROM ubuntu:20.04
# Ubuntu 20.04 docker container with inbox Mellanox drivers

# LABEL about the custom image
LABEL maintainer=vitaliyra@nvidia.com
LABEL description="This is custom Container Image with inbox perftest package."

WORKDIR /tmp/
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get clean -y && apt-get -y update && apt-get install -y apt-utils udev vim bash && apt-get -y upgrade
RUN apt-get install -y iproute2 rdma-core libibmad5 ibutils ibverbs-utils infiniband-diags perftest \
    mstflint strace iputils-ping
RUN ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime
RUN dpkg-reconfigure --frontend noninteractive tzdata && apt-get clean all -y
CMD bash

Note

Please use your favorite container building tools (docker, podman, etc.) to create a container image from Dockerfile for use in the below Deployment.

After creating the image, push it to the container registry.

Create a sample deployment test-deployment.yaml (container image should include InfiniBand userspace drivers and performance tools):

test-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlnx-inbox-pod
  labels:
    app: sriov
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sriov
  template:
    metadata:
      labels:
        app: sriov
      annotations:
        k8s.v1.cni.cncf.io/networks: network-sw1
    spec:
      containers:
      - image: < Container image >
        name: mlnx-inbox-ctr
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          requests:
            cpu: 4
            nvidia.com/roce_sw1: 1
          limits:
            cpu: 4
            nvidia.com/roce_sw1: 1
        command:
        - sh
        - -c
        - sleep inf

Deploy the sample deployment.

kubectl apply -f test-deployment.yaml

deployment.apps/mlnx-inbox-pod created

kubectl get pod -o wide

NAME                              READY   STATUS    RESTARTS   AGE     IP              NODE           NOMINATED NODE   READINESS GATES
mlnx-inbox-pod-6586dcc7b9-2b9nm   1/1     Running   0          2m14s   10.233.112.35   clx-host-084   <none>           <none>
mlnx-inbox-pod-6586dcc7b9-xs7wx   1/1     Running   0          2m14s   10.233.111.34   clx-host-081   <none>           <none>

Check available network interfaces in each POD.

## First POD
kubectl exec -it mlnx-inbox-pod-6586dcc7b9-2b9nm -- ip a s

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if95: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default
    link/ether 26:1f:c8:a8:e2:8d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.233.112.35/32 brd 10.233.112.35 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::241f:c8ff:fea8:e28d/64 scope link
       valid_lft forever preferred_lft forever
36: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether e6:5a:bd:85:35:15 brd ff:ff:ff:ff:ff:ff
    inet 192.168.101.1/24 brd 192.168.101.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::e45a:bdff:fe85:3515/64 scope link
       valid_lft forever preferred_lft forever

## Second POD
kubectl exec -it mlnx-inbox-pod-6586dcc7b9-xs7wx -- ip a s

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if94: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default
    link/ether 52:76:f4:e7:a2:9b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.233.111.34/32 brd 10.233.111.34 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5076:f4ff:fee7:a29b/64 scope link
       valid_lft forever preferred_lft forever
28: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 72:72:6d:1d:84:5a brd ff:ff:ff:ff:ff:ff
    inet 192.168.101.2/24 brd 192.168.101.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::7072:6dff:fe1d:845a/64 scope link
       valid_lft forever preferred_lft forever

Run synthetic RDMA benchmark tests.

Server

ib_write_bw -a -F -d $IB_DEV_NAME --report_gbits

Client

ib_write_bw -a -F $SERVER_IP -d $IB_DEV_NAME --report_gbits

Open a console session to each POD: one for the server side of the application and one for the client side.

In a first console (on the server side), run the following commands:


kubectl exec -it mlnx-inbox-pod-6586dcc7b9-2b9nm -- bash
root@mlnx-inbox-pod-6586dcc7b9-2b9nm:/tmp# rdma link | grep net1
link mlx5_13/1 state ACTIVE physical_state LINK_UP netdev net1
root@mlnx-inbox-pod-6586dcc7b9-2b9nm:/tmp# ib_write_bw -a -F -d mlx5_13 --report_gbits

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_13
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0069 PSN 0xaa30eb RKey 0x010e00 VAddr 0x007fb3a9d52000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:01
 remote address: LID 0000 QPN 0x00e9 PSN 0x32bd22 RKey 0x030e00 VAddr 0x007ff245361000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 8388608    5000             185.70             185.44             0.002763
---------------------------------------------------------------------------------------
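The device name passed to `-d` is taken from the `rdma link` line that mentions `net1`. As a minimal sketch, that lookup could be scripted in Python as shown here. The helper name `rdma_device_for_netdev` is hypothetical, and the parser assumes the `link <dev>/<port> ... netdev <name>` line format seen in the output above; other iproute2 versions may print a different layout.

```python
def rdma_device_for_netdev(rdma_link_output, netdev):
    """Return the RDMA device bound to `netdev`, or None if no line matches.

    Assumes lines shaped like:
    "link mlx5_13/1 state ACTIVE physical_state LINK_UP netdev net1"
    """
    for line in rdma_link_output.splitlines():
        fields = line.split()
        # Skip anything that is not a "link <dev>/<port> ... netdev <name>" line.
        if len(fields) < 2 or fields[0] != "link" or "netdev" not in fields:
            continue
        idx = fields.index("netdev")
        if idx + 1 < len(fields) and fields[idx + 1] == netdev:
            # "mlx5_13/1" -> "mlx5_13" (drop the port number)
            return fields[1].split("/")[0]
    return None


sample = "link mlx5_13/1 state ACTIVE physical_state LINK_UP netdev net1"
device = rdma_device_for_netdev(sample, "net1")
print(device)
```

The returned name can then be substituted for `$IB_DEV_NAME` in the `ib_write_bw` commands.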

In a second console (on the client side), run the following commands:


root@node1:~/YAMLs/8port/example# kubectl exec -it mlnx-inbox-pod-6586dcc7b9-xs7wx -- bash
root@mlnx-inbox-pod-6586dcc7b9-xs7wx:/tmp# rdma link | grep net1
link mlx5_15/1 state ACTIVE physical_state LINK_UP netdev net1
root@mlnx-inbox-pod-6586dcc7b9-xs7wx:/tmp# ib_write_bw -a -F 192.168.101.1 -d mlx5_15 --report_gbits
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_15
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 2
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x00e9 PSN 0x32bd22 RKey 0x030e00 VAddr 0x007ff245361000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:02
 remote address: LID 0000 QPN 0x0069 PSN 0xaa30eb RKey 0x010e00 VAddr 0x007fb3a9d52000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:101:01
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          5000             0.044858           0.044364           2.772772
 4          5000             0.089829           0.089793           2.806042
 8          5000             0.18               0.18               2.788396
 16         5000             0.36               0.36               2.801705
 32         5000             0.72               0.72               2.801529
 64         5000             1.10               1.05               2.056373
 128        5000             2.17               2.16               2.107263
 256        5000             4.32               4.32               2.110149
 512        5000             8.65               8.64               2.110166
 1024       5000             17.29              17.24              2.104959
 2048       5000             34.32              34.23              2.089381
 4096       5000             68.14              65.74              2.006262
 8192       5000             170.15             139.82             2.133420
 16384      5000             188.33             169.84             1.295812
 32768      5000             190.95             180.36             0.688024
 65536      5000             191.23             181.41             0.327763
 131072     5000             192.34             190.78             0.181938
 262144     5000             191.26             185.41             0.083644
 524288     5000             191.15             183.44             0.043735
 1048576    5000             190.31             187.27             0.022325
 2097152    5000             187.04             185.88             0.011079
 4194304    5000             189.42             185.82             0.005538
 8388608    5000             185.70             185.44             0.002763
---------------------------------------------------------------------------------------
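To relate these figures to the 200Gb/s fabric, the result rows can be parsed and the best average bandwidth expressed as a fraction of the nominal line rate. The following Python sketch is illustrative only: `parse_perftest_rows` and `LINE_RATE_GBPS` are hypothetical names, and the parser assumes the five-column row layout shown above. The sample embeds three rows copied from the client output.

```python
LINE_RATE_GBPS = 200.0  # nominal port speed assumed from the 200Gb/s fabric in this guide


def parse_perftest_rows(text):
    """Return (bytes, iterations, peak, average, msg_rate) tuples from a results table."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # header, separator, or unrelated line
        try:
            size, iters = int(fields[0]), int(fields[1])
            peak, avg, rate = (float(f) for f in fields[2:])
        except ValueError:
            continue  # five tokens, but not a numeric result row
        rows.append((size, iters, peak, avg, rate))
    return rows


sample = """\
 #bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
 65536 5000 191.23 181.41 0.327763
 131072 5000 192.34 190.78 0.181938
 8388608 5000 185.70 185.44 0.002763
"""

rows = parse_perftest_rows(sample)
best = max(rows, key=lambda r: r[3])  # row with the highest average bandwidth
utilization = best[3] / LINE_RATE_GBPS
print(f"best average {best[3]} Gb/s at {best[0]} bytes, {utilization:.1%} of line rate")
```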

Kubeflow Training Operator

Kubeflow is a machine learning toolkit for Kubernetes.

The Kubeflow training operators are part of Kubeflow. They are a group of Kubernetes operators that add support for distributed training of machine learning models using different frameworks.

The training operator provides Kubernetes custom resources (CRs) that make it easy to run distributed or non-distributed TensorFlow, PyTorch, Apache MXNet, XGBoost and MPI jobs on Kubernetes.

In the example below we deploy the Kubeflow training operators stable release v1.4.0:


kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.4.0"
namespace/kubeflow created
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/mxjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/pytorchjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/xgboostjobs.kubeflow.org created
serviceaccount/training-operator created
clusterrole.rbac.authorization.k8s.io/training-operator created
clusterrolebinding.rbac.authorization.k8s.io/training-operator created
service/training-operator created
deployment.apps/training-operator created
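One quick sanity check after installation is to confirm that all five job CRDs appear in the `kubectl apply` output. The Python sketch below does this by parsing captured output; `applied_crds` and `EXPECTED_JOB_CRDS` are hypothetical names introduced for illustration, and the expected set is taken from the v1.4.0 output shown above.

```python
CRD_PREFIX = "customresourcedefinition.apiextensions.k8s.io/"
EXPECTED_JOB_CRDS = {"mpijobs", "mxjobs", "pytorchjobs", "tfjobs", "xgboostjobs"}


def applied_crds(apply_output):
    """Collect short CRD names (e.g. 'mpijobs') reported by `kubectl apply`."""
    found = set()
    for line in apply_output.splitlines():
        parts = line.split()
        # Each result line is "<kind>/<name> <status>".
        if len(parts) != 2 or not parts[0].startswith(CRD_PREFIX):
            continue
        if parts[1] in ("created", "configured", "unchanged"):
            full_name = parts[0][len(CRD_PREFIX):]   # mpijobs.kubeflow.org
            found.add(full_name.split(".")[0])       # mpijobs
    return found


sample = """\
namespace/kubeflow created
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/mxjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/pytorchjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/xgboostjobs.kubeflow.org created
deployment.apps/training-operator created
"""

missing = EXPECTED_JOB_CRDS - applied_crds(sample)
print("missing CRDs:", sorted(missing))
```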

Appendix

Job Testing Results

Below are Dockerfile and MPIJob examples with different network configurations.

Dockerfile

Dockerfile example for using MPIJob:


FROM nvcr.io/nvidia/tensorflow:21.10-tf1-py3
RUN apt-get update && apt-get install -y --no-install-recommends openssh-client openssh-server && \
    mkdir -p /var/run/sshd

# Allow OpenSSH to talk to containers without asking for confirmation
# by disabling StrictHostKeyChecking.
# mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
# to disable UserKnownHostsFile to avoid write permissions.
# Disabling StrictModes avoids directory and files read permission checks.

RUN sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
    echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config

RUN mkdir /tensorflow
WORKDIR "/tensorflow"
RUN git clone https://github.com/tensorflow/benchmarks
WORKDIR "/tensorflow/benchmarks"

CMD ["/bin/bash"]

This Dockerfile is based on the TensorFlow NGC Container image. The TensorFlow NGC Container is optimized for GPU acceleration and contains a validated set of libraries that enable and optimize GPU performance. This container may also contain modifications to the TensorFlow source code in order to maximize performance and compatibility. This container also contains software for accelerating ETL (DALI, RAPIDS), training (cuDNN, NCCL), and inference (TensorRT) workloads.

For supported versions, see the Framework Containers Support Matrix and the NVIDIA Container Toolkit Documentation.

Note

Use your preferred container build tool (docker, podman, etc.) to create a container image from the Dockerfile for use in the deployment below.

After creating the image, push it to the container registry.

MPIJob Examples

Below is an MPIJob example with network configuration over K8s management network:


# TF MPIJob over MGMT network
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: < Container image >
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "32"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --batch_size=64
            - --model=resnet152
            - --variable_update=horovod
            - --xla=true
            - --use_fp16=true
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - image: < Container image >
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 8
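The value passed to `-np` in this manifest (32) equals the Worker `replicas` (4) multiplied by `slotsPerWorker` (8), which gives one MPI rank per requested GPU. When the worker count changes, the two values need to stay in step. A minimal Python sketch of that arithmetic follows; `mpi_process_count` is a hypothetical helper, not part of the training operator.

```python
def mpi_process_count(worker_replicas, slots_per_worker):
    """Total MPI ranks for an MPIJob: one rank per slot on every worker."""
    if worker_replicas < 1 or slots_per_worker < 1:
        raise ValueError("replicas and slots must both be at least 1")
    return worker_replicas * slots_per_worker


# Values from the manifest above: 4 DGX A100 workers, 8 slots (GPUs) each.
np_value = mpi_process_count(worker_replicas=4, slots_per_worker=8)
print(f'-np "{np_value}"')
```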

Below is an MPIJob example with network configuration over the secondary K8s network:


# TF MPIJob over high-perf TCP network
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: < Container image >
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "32"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - NCCL_IB_DISABLE=1
            - -x
            - NCCL_NET_GDR_LEVEL=0
            - -x
            - NCCL_NET_PLUGIN=none
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --batch_size=64
            - --model=resnet152
            - --variable_update=horovod
            - --xla=true
            - --use_fp16=true
    Worker:
      replicas: 4
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: network-sw1,network-sw2,network-sw3,network-sw4,network-sw5,network-sw6,network-sw7,network-sw8
        spec:
          containers:
          - image: < Container image >
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 8
                nvidia.com/roce_sw1: 1
                nvidia.com/roce_sw2: 1
                nvidia.com/roce_sw3: 1
                nvidia.com/roce_sw4: 1
                nvidia.com/roce_sw5: 1
                nvidia.com/roce_sw6: 1
                nvidia.com/roce_sw7: 1
                nvidia.com/roce_sw8: 1
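The Worker template pairs each secondary network in the `k8s.v1.cni.cncf.io/networks` annotation with one matching `nvidia.com/roce_sw<N>` resource limit. Writing eight of each by hand is error-prone, so the two can be generated together. The sketch below is one possible way to do so in Python; `secondary_network_spec` is a hypothetical helper, and the `network-sw<N>` and `roce_sw<N>` naming is taken from the manifest above.

```python
def secondary_network_spec(count, gpus=8):
    """Build the pod annotation and resource limits for `count` secondary networks."""
    if count < 1:
        raise ValueError("count must be at least 1")
    indexes = range(1, count + 1)
    annotation = {
        # Comma-separated list of network attachment names
        "k8s.v1.cni.cncf.io/networks": ",".join(f"network-sw{i}" for i in indexes)
    }
    limits = {"nvidia.com/gpu": gpus}
    for i in indexes:
        limits[f"nvidia.com/roce_sw{i}"] = 1  # one device per network
    return annotation, limits


annotation, limits = secondary_network_spec(8)
print(annotation["k8s.v1.cni.cncf.io/networks"])
for name, amount in limits.items():
    print(f"{name}: {amount}")
```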

Below is an MPIJob example with network configuration over the RDMA-enabled secondary K8s network:


# TF MPIJob over RDMA network
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: < Container image >
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "32"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - NCCL_IB_DISABLE=0
            - -x
            - NCCL_NET_GDR_LEVEL=2
            - -x
            - TF_ALLOW_IOLIBS=1
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --batch_size=64
            - --model=resnet152
            - --variable_update=horovod
            - --xla=true
            - --use_fp16=true
    Worker:
      replicas: 4
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: network-sw1,network-sw2,network-sw3,network-sw4,network-sw5,network-sw6,network-sw7,network-sw8
        spec:
          containers:
          - image: < Container image >
            name: tensorflow-benchmarks
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                nvidia.com/gpu: 8
                nvidia.com/roce_sw1: 1
                nvidia.com/roce_sw2: 1
                nvidia.com/roce_sw3: 1
                nvidia.com/roce_sw4: 1
                nvidia.com/roce_sw5: 1
                nvidia.com/roce_sw6: 1
                nvidia.com/roce_sw7: 1
                nvidia.com/roce_sw8: 1
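The TCP and RDMA manifests differ mainly in the environment variables exported through `mpirun -x` (the RDMA variant also adds the `IPC_LOCK` capability to the worker container). The Python sketch below puts the two variable sets side by side and renders them as `-x` arguments. The values are copied from the two manifests, while the dictionary and function names are hypothetical.

```python
# Variables exported in the TCP manifest (secondary network, RDMA disabled)
TCP_ENV = {
    "NCCL_DEBUG": "INFO",
    "NCCL_IB_DISABLE": "1",
    "NCCL_NET_GDR_LEVEL": "0",
    "NCCL_NET_PLUGIN": "none",
}

# Variables exported in the RDMA manifest
RDMA_ENV = {
    "NCCL_DEBUG": "INFO",
    "NCCL_IB_DISABLE": "0",
    "NCCL_NET_GDR_LEVEL": "2",
    "TF_ALLOW_IOLIBS": "1",
}


def mpirun_env_args(env):
    """Render a dict as the flat ['-x', 'KEY=VALUE', ...] list used in the command."""
    args = []
    for key, value in env.items():
        args += ["-x", f"{key}={value}"]
    return args


def env_diff(a, b):
    """Map each variable that differs to its (a, b) values; None means unset."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


for name, (tcp_value, rdma_value) in env_diff(TCP_ENV, RDMA_ENV).items():
    print(f"{name}: tcp={tcp_value} rdma={rdma_value}")
print(mpirun_env_args(RDMA_ENV))
```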

Test Results

perGPU.png

total.png

Warning

The performance results listed in this document are indicative and should not be considered as formal performance targets for NVIDIA products.

Authors


Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for research and design of complex Kubernetes/OpenShift and Microsoft-based solutions. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.

Related Documents

Last updated on Sep 12, 2023.