



Created on May 20, 2020 by Vitaliy Razinkov

Introduction

This Quick Start Guide (QSG) explains how to build a high-performance Kubernetes cluster, capable of hosting the most demanding distributed workloads, on top of an NVIDIA Mellanox end-to-end InfiniBand EDR interconnect. The RDMA Shared Device Plugin exposes RDMA devices to pods and allows native RDMA applications, such as HPC, ML, and AI workloads, to run over the InfiniBand fabric. In this guide we show how to deploy Kubernetes with the RDMA Shared Device Plugin and test network performance between pods running on different Worker Nodes.

In this document, we describe the process of building a Kubernetes cluster using Kubespray with the following setup:

  • 1 x Deployment server with Ansible apps (a single virtual or physical machine)
  • 1 x Kubernetes master node (a single virtual or physical machine)
  • 4 x Worker Nodes

The deployment is validated using Ubuntu 18.04 OS and Kubespray v2.13.0.

Abbreviation and Acronym List

  • CNI - Container Network Interface
  • DHCP - Dynamic Host Configuration Protocol
  • DNS - Domain Name System
  • DP - Device Plugin
  • EDR - Enhanced Data Rate (100Gb/s)
  • HCA - Host Channel Adapter
  • HPC - High-Performance Computing
  • HWE - Hardware Enablement
  • K8s - Kubernetes
  • NFD - Node Feature Discovery
  • QSG - Quick Start Guide
  • RDMA - Remote Direct Memory Access


Key Components and Technologies

  • Mellanox InfiniBand Smart Adapters achieve extreme performance and scale, lowering cost per operation and increasing ROI for high performance computing, machine learning, advanced storage, clustered databases, low-latency embedded I/O applications, and more.
  • Mellanox InfiniBand Switches deliver the highest performance and port density with complete fabric management solutions, enabling compute clusters and converged data centers to operate at any scale while reducing operational costs and infrastructure complexity. The Mellanox switch family includes a broad portfolio of Edge and Director switches supporting 40, 56, 100 and 200Gb/s port speeds.
  • The Mellanox LinkX cables and transceivers product family provides the industry’s most complete line of 10, 25, 40, 50, 100 and 200Gb/s interconnect products for Cloud, HPC, Web 2.0, Enterprise, Telco, and storage data center applications. The LinkX portfolio contains high-speed copper cables for host-to-switch connections within a rack, and optical cables for switch-to-switch connections.
  • RDMA Shared Device Plugin is designed on top of the generic Kubernetes framework which allows vendors to advertise their resources to the kubelet without changing the core code.
  • Kubespray v2.13.0 (from Kubernetes.io)
    Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes clusters configuration management tasks and provides:
    • Highly available cluster
    • Composable attributes
    • Support for the most popular Linux distributions
  • Kubernetes v1.17.5
    Kubernetes (K8s) is an open-source container orchestration system for deployment automation, scaling, and management of containerized applications.

Solution Overview

Solution Logical Design

The logical design includes the following layers:

  • Two separate networking layers: 
    1. Management network
    2. High-speed InfiniBand network
  • Compute layer: 
    1. Deployment node (VM based node in our guide)
    2. Kubernetes Master node (VM based node in our guide)
    3. 4 x Kubernetes Worker nodes with NVIDIA Mellanox ConnectX-6 HCA. 
      NVIDIA Kubernetes RDMA shared device plugin (Mellanox K8s Shared DP) must be installed on all Worker Nodes.


InfiniBand Fabric Topology  

Simple Setup with One Switch

In a single-switch setup, you can connect up to 36 servers using the Mellanox SB7700 InfiniBand EDR 100Gb/s Switch System.

Scaled Setup for InfiniBand Fabric

For assistance in designing the scaled InfiniBand topology, use the Mellanox InfiniBand Topology Generator. It is an online cluster configuration tool that offers flexible cluster configurations and sizes.

For a scaled setup, we recommend using the Mellanox Unified Fabric Manager (UFM®).

Bill of Materials

The table below specifies the hardware components for the single-switch topology:


          

Fabric Configuration

Physical Wiring

In this setup:

  • Deployment and Master Nodes are wired only to the Management network switch.
  • Only a single port of the Mellanox HCA on each Worker Node is wired to the Mellanox InfiniBand switch, using EDR copper cables.

 

In our physical deployment, we selected port P1 on the HCA. With the default Host OS configuration, port P1 corresponds to the network interface IB1.


InfiniBand Fabric Configuration

Below is a list of recommendations and prerequisites that are important for the configuration process:

  • Refer to the MLNX-OS User Manual to become familiar with the switch software (located at support.mellanox.com)
  • Upgrade the switch software to the latest MLNX-OS version
  • InfiniBand Subnet Manager (SM) is required to configure InfiniBand fabric properly

There are three ways to run InfiniBand Subnet Manager (SM) in the InfiniBand fabric:

  1. Start the SM on one or more managed switches. This is a very convenient and quick option that allows for an easier InfiniBand 'plug & play'.
  2. Run the OpenSM daemon on one or more servers by executing the /etc/init.d/opensmd command. It is recommended to run the SM on a server when the fabric has 648 nodes or more.
  3. Use Unified Fabric Management (UFM®). 
    Mellanox’s Unified Fabric Manager (UFM®) is a powerful platform for managing scale-out computing environments. It eliminates the complexity of fabric management, provides deep visibility into traffic, and optimizes fabric performance.

In this guide, we launch the InfiniBand SM on the InfiniBand switch (Method 1). Below are the configuration steps for the chosen method.

To enable the SM on one of the managed switches:

  1. Log in to the switch and enter the following configuration commands (swx-mld-s01 is our switch name):

    IB switch configuration
    Mellanox MLNX-OS Switch Management
    
    switch login: admin
    Password: 
     
    swx-mld-s01 [standalone: master] > enable 
    swx-mld-s01 [standalone: master] # configure terminal
    swx-mld-s01 [standalone: master] (config) # ib smnode swx-mld-s01 enable 
    swx-mld-s01 [standalone: master] (config) # ib smnode swx-mld-s01 sm-priority 0
    
    swx-mld-s01 [standalone: master] (config) # ib sm virt enable
    swx-mld-s01 [standalone: master] (config) # write memory
    swx-mld-s01 [standalone: master] (config) # reload
     
  2. Once the switch reboots, check the switch configuration. It should look like the following:

    Switch config example
    Mellanox MLNX-OS Switch Management
    
    switch login: admin
    Password: 
    
    swx-mld-s01 [standalone: master] > enable 
    swx-mld-s01 [standalone: master] # configure terminal
    swx-mld-s01 [standalone: master] (config) # show running-config 
    ##
    ## Running database "initial"
    ## Generated at 2019/03/19 17:58:53 +0200
    ## Hostname: swx-mld-s01
    ##
    
    ##
    ## Running-config temporary prefix mode setting
    ##
    no cli default prefix-modes enable
    
    ##
    ## Subnet Manager configuration
    ##
       ib sm virt enable
       
    ##
    ## Other IPv6 configuration
    ##
    no ipv6 enable
       
    ##
    ## AAA remote server configuration
    ##
    # ldap bind-password ********
    # radius-server key ********
    # tacacs-server key ********
       
    ##
    ## Network management configuration
    ##
    # web proxy auth basic password ********
       clock timezone Asia Middle_East Jerusalem
    no ntp server 192.114.62.250 disable
       ntp server 192.114.62.250 keyID 0
    no ntp server 192.114.62.250 trusted-enable
       ntp server 192.114.62.250 version 4
       
    ##
    ## X.509 certificates configuration
    ##
    #
    # Certificate name system-self-signed, ID 0cd5b6a0da88a0e68b8f3b49408b361afc73289d
    # (public-cert config omitted since private-key config is hidden)
    
       
    ##
    ## IB nodename to GUID mapping
    ##
       ib smnode swx-mld-s01 create
       ib smnode swx-mld-s01 enable
       ib smnode swx-mld-s01 sm-priority 0
    ##
    ## Persistent prefix mode setting
    ##
    cli default prefix-modes enable

Deployment Guide

Host Configuration

Below is a table detailing the server and switch names:

The IB1 interface does not require any configuration.

For the Management network (eno0/mgmt0), the IP address is assigned by the DHCP service.

Host OS prerequisites 

  1. Ensure that the Mellanox Adapter ports are configured in InfiniBand mode. For more information, refer to this guide.
  2. Install the Ubuntu Server 18.04 operating system on all servers with the OpenSSH server package, and create a non-root user account with passwordless sudo privileges.
    Update the Ubuntu software packages and install the latest HWE kernel by running the following commands:

    # apt-get update
    # apt-get -y install linux-image-generic-hwe-18.04
    # reboot
  3. Once the OS installation is complete, check that each Kubernetes Worker Node has an IB1 network interface (a minimal check is shown below).
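
A minimal way to check this on a Worker Node is shown below; the interface name ib1 follows the port mapping described in the Physical Wiring section, and the exact output varies by system:

# ip link show ib1
# cat /sys/class/infiniband/*/ports/*/state

The ib1 link should be present, and the HCA port connected to the fabric should report 4: ACTIVE once the Subnet Manager has configured it.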

Non-root user account prerequisites: 

In this deployment, we added the following lines to the end of /etc/sudoers:

#includedir /etc/sudoers.d


#K8s cluster deployment user with sudo privileges without password
user ALL=(ALL) NOPASSWD:ALL
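
To confirm that passwordless sudo works for the deployment user, you can run a quick sanity check on each node while logged in as that user (this check is not part of the original procedure):

$ sudo -n whoami
root

If sudo prompts for a password or fails with the -n option, re-check the /etc/sudoers entry above.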

K8s Cluster Deployment and Configuration

For this deployment, the Kubernetes cluster is installed by Kubespray from the Deployment server, using the non-root user account.

SSH Private Key and SSH Passwordless Login

Log in to the Deployment server as the deployment user (in our case, user) and create an SSH key pair for passwordless authentication by running the following command:

$ ssh-keygen

Copy your SSH public key (e.g., ~/.ssh/id_rsa.pub) to all nodes in your deployment by running the following command:

$ ssh-copy-id user@nodename
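
If you are using the example node addresses from this guide (192.168.222.111 for the Master and 192.168.222.101-104 for the Workers), the key can be copied to all nodes in one pass. The loop below is only a convenience sketch:

$ for ip in 192.168.222.111 192.168.222.101 192.168.222.102 192.168.222.103 192.168.222.104; do ssh-copy-id user@${ip}; done

Afterwards, ssh user@<node-ip> should log you in without a password prompt.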

Kubespray Configuration

  1. Install dependencies to run Kubespray with Ansible from Deployment server:

    $ cd ~
    $ sudo apt -y install python3-pip jq
    $ wget https://github.com/kubernetes-sigs/kubespray/archive/v2.13.0.tar.gz
    $ tar -zxf v2.13.0.tar.gz
    $ cd kubespray-2.13.0
    $ sudo pip3 install -r requirements.txt

    The default folder for the following commands is ~/kubespray-2.13.0.

  2. Create a new cluster configuration:

    $ cp -rfp inventory/sample inventory/mycluster
    $ declare -a IPS=(192.168.222.111 192.168.222.101 192.168.222.102 192.168.222.103 192.168.222.104)
    $ CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

    As a result, the inventory/mycluster/hosts.yaml file will be created.
    Review and change the host configuration in the file. Below is an example for this deployment:

    all:
      hosts:
        node1:
          ansible_host: 192.168.222.111
          ip: 192.168.222.111
          access_ip: 192.168.222.111
        node2:
          ansible_host: 192.168.222.101
          ip: 192.168.222.101
          access_ip: 192.168.222.101
        node3:
          ansible_host: 192.168.222.102
          ip: 192.168.222.102
          access_ip: 192.168.222.102
        node4:
          ansible_host: 192.168.222.103
          ip: 192.168.222.103
          access_ip: 192.168.222.103
        node5:
          ansible_host: 192.168.222.104
          ip: 192.168.222.104
          access_ip: 192.168.222.104
      children:
        kube-master:
          hosts:
            node1:
        kube-node:
          hosts:
            node2:
            node3:
            node4:
            node5:
        etcd:
          hosts:
            node1:
        k8s-cluster:
          children:
            kube-master:
            kube-node:
        calico-rr:
          hosts: 
  3. Review and change cluster installation parameters in the files inventory/mycluster/group_vars/all/all.yml and inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml

    1. In inventory/mycluster/group_vars/all/all.yml, uncomment the following line so that metrics can be collected about cluster resource usage:

      ## The read-only port for the Kubelet to serve on with no authentication/authorization. Uncomment to enable.
      kube_read_only_port: 10255
    2. The Kubespray version used in this deployment suffers from an inconsistency when installing Docker components (see issue details here: https://github.com/kubernetes-sigs/kubespray/issues/6160).
      To avoid related issues, pin the Docker version by adding the variable "docker_version: 19.03" to inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml:

      ## Container runtime
      ## docker for docker, crio for cri-o and containerd for containerd.
      container_manager: docker
      docker_version: 19.03
    3. The default Kubernetes CNI can be changed by setting the desired kube_network_plugin value (default: calico) in inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml, as shown in the example below.
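
A minimal example of the relevant setting in inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml, left at the default used in this guide (the descriptive comment is ours):

# Kubernetes network plugin used by the cluster; this guide keeps the default
kube_network_plugin: calico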

Install K8s Cluster Using Ansible Playbook

Deploy K8s cluster with Kubespray Ansible Playbook:

$ ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml 
This step may take a while to complete.

A successful completion of the playbook looks like the following:

PLAY RECAP ***************************************************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
node1                      : ok=574  changed=95   unreachable=0    failed=0    skipped=1043 rescued=0    ignored=1   
node2                      : ok=350  changed=52   unreachable=0    failed=0    skipped=564  rescued=0    ignored=1   
node3                      : ok=350  changed=51   unreachable=0    failed=0    skipped=563  rescued=0    ignored=1   
node4                      : ok=350  changed=52   unreachable=0    failed=0    skipped=563  rescued=0    ignored=1   
node5                      : ok=350  changed=52   unreachable=0    failed=0    skipped=563  rescued=0    ignored=1   

Tuesday 14 July 2020  11:43:56 +0300 (0:00:00.101)       0:08:27.568 ********** 
=============================================================================== 
kubernetes/kubeadm : Join to cluster ----------------------------------------------------------------------------------------------------- 47.65s
kubernetes/master : kubeadm | Initialize first master ------------------------------------------------------------------------------------ 43.91s
download : download_file | Download item ------------------------------------------------------------------------------------------------- 28.59s
download : download_file | Download item ------------------------------------------------------------------------------------------------- 26.39s
kubernetes/master : Master | wait for kube-scheduler ------------------------------------------------------------------------------------- 22.01s
download : download_file | Download item ------------------------------------------------------------------------------------------------- 19.47s
kubernetes/preinstall : Update package management cache (APT) ----------------------------------------------------------------------------- 9.48s
kubernetes/master : Master | reload kubelet ----------------------------------------------------------------------------------------------- 9.02s
download : download_file | Download item -------------------------------------------------------------------------------------------------- 8.41s
kubernetes/preinstall : Install packages requirements ------------------------------------------------------------------------------------- 7.62s
etcd : wait for etcd up ------------------------------------------------------------------------------------------------------------------- 6.44s
etcd : Configure | Wait for etcd cluster to be healthy ------------------------------------------------------------------------------------ 6.32s
kubernetes-apps/ansible : Kubernetes Apps | Start Resources ------------------------------------------------------------------------------- 5.49s
download : download_file | Download item -------------------------------------------------------------------------------------------------- 4.86s
Gather necessary facts -------------------------------------------------------------------------------------------------------------------- 4.77s
download : download_file | Download item -------------------------------------------------------------------------------------------------- 4.40s
download : download | Download files / images --------------------------------------------------------------------------------------------- 4.36s
kubernetes-apps/ansible : Kubernetes Apps | Lay Down CoreDNS Template --------------------------------------------------------------------- 3.09s
container-engine/docker : ensure docker packages are installed ---------------------------------------------------------------------------- 2.84s
container-engine/docker : Docker | reload docker ------------------------------------------------------------------------------------------ 2.82s

Deployment Verification

The Kubernetes cluster deployment can be verified from the root user account on the K8s Master node.

Below is example output from a K8s cluster deployed with the default Kubespray configuration, using the Calico Kubernetes CNI plugin.

To ensure that the Kubernetes cluster is installed correctly, run the following commands:

root@node1:~# kubectl get nodes -o wide
NAME    STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node1   Ready    master   28m   v1.17.5   192.168.222.111   <none>        Ubuntu 18.04.4 LTS   5.3.0-62-generic   docker://19.3.7
node2   Ready    <none>   26m   v1.17.5   192.168.222.101   <none>        Ubuntu 18.04.4 LTS   5.3.0-62-generic   docker://19.3.7
node3   Ready    <none>   26m   v1.17.5   192.168.222.102   <none>        Ubuntu 18.04.4 LTS   5.3.0-62-generic   docker://19.3.7
node4   Ready    <none>   26m   v1.17.5   192.168.222.103   <none>        Ubuntu 18.04.4 LTS   5.3.0-62-generic   docker://19.3.7
node5   Ready    <none>   26m   v1.17.5   192.168.222.104   <none>        Ubuntu 18.04.4 LTS   5.3.0-62-generic   docker://19.3.7
 
root@node1:~# kubectl get pod -n kube-system -o wide
NAME                                          READY   STATUS    RESTARTS   AGE    IP                NODE    NOMINATED NODE   READINESS GATES
calico-kube-controllers-78974cdb4d-ltb2n      1/1     Running   0          26m    192.168.222.102   node3   <none>           <none>
calico-node-7mwxg                             1/1     Running   1          26m    192.168.222.101   node2   <none>           <none>
calico-node-gzsjq                             1/1     Running   1          26m    192.168.222.104   node5   <none>           <none>
calico-node-p5ssx                             1/1     Running   1          26m    192.168.222.102   node3   <none>           <none>
calico-node-rxnn6                             1/1     Running   1          26m    192.168.222.103   node4   <none>           <none>
calico-node-zr9lw                             1/1     Running   1          26m    192.168.222.111   node1   <none>           <none>
coredns-76798d84dd-f64sq                      1/1     Running   0          26m    10.233.92.3       node3   <none>           <none>
coredns-76798d84dd-tgfgp                      1/1     Running   0          26m    10.233.90.1       node1   <none>           <none>
dns-autoscaler-85f898cd5c-zcd7w               1/1     Running   0          26m    10.233.90.2       node1   <none>           <none>
kube-apiserver-node1                          1/1     Running   0          27m    192.168.222.111   node1   <none>           <none>
kube-controller-manager-node1                 1/1     Running   0          27m    192.168.222.111   node1   <none>           <none>
kube-proxy-5f4vl                              1/1     Running   0          26m    192.168.222.101   node2   <none>           <none>
kube-proxy-cszc6                              1/1     Running   0          26m    192.168.222.102   node3   <none>           <none>
kube-proxy-hn62h                              1/1     Running   0          26m    192.168.222.111   node1   <none>           <none>
kube-proxy-kwh25                              1/1     Running   0          26m    192.168.222.103   node4   <none>           <none>
kube-proxy-vzp9w                              1/1     Running   0          26m    192.168.222.104   node5   <none>           <none>
kube-scheduler-node1                          1/1     Running   0          27m    192.168.222.111   node1   <none>           <none>
kubernetes-dashboard-77475cf576-4q42g         1/1     Running   0          26m    10.233.92.2       node3   <none>           <none>
kubernetes-metrics-scraper-747b4fd5cd-qs9r2   1/1     Running   0          26m    10.233.92.1       node3   <none>           <none>
nginx-proxy-node2                             1/1     Running   0          26m    192.168.222.101   node2   <none>           <none>
nginx-proxy-node3                             1/1     Running   0          26m    192.168.222.102   node3   <none>           <none>
nginx-proxy-node4                             1/1     Running   0          26m    192.168.222.103   node4   <none>           <none>
nginx-proxy-node5                             1/1     Running   0          26m    192.168.222.104   node5   <none>           <none>
nodelocaldns-bvkp5                            1/1     Running   0          26m    192.168.222.101   node2   <none>           <none>
nodelocaldns-fskdn                            1/1     Running   0          26m    192.168.222.102   node3   <none>           <none>
nodelocaldns-nqxdn                            1/1     Running   0          26m    192.168.222.111   node1   <none>           <none>
nodelocaldns-svfw6                            1/1     Running   0          26m    192.168.222.104   node5   <none>           <none>
nodelocaldns-v24fk                            1/1     Running   0          26m    192.168.222.103   node4   <none>           <none>

Install RDMA Shared Device Plugin in InfiniBand mode for K8s cluster 

The RDMA shared device plugin presented in the deployment scenario below exposes native InfiniBand devices to applications.

During the installation process, the role uses the Kubespray inventory/mycluster/hosts.yaml file for deployment and provisioning of the Kubernetes components.

The RDMA shared device plugin components are configured by a separate Ansible role (Role) in the installation. 
The Role will install the following components:

  1. Node Feature Discovery (NFD)
  2. Specific configuration for RDMA shared device plugin (configmap)
  3. RDMA shared device plugin (DaemonSet mode)

Prerequisites

  1. A K8s cluster deployed by Kubespray.
  2. The InfiniBand driver must be installed on each Kubernetes Worker Node.
    For InfiniBand inbox driver installation, refer to the Ubuntu 18.04 Inbox Driver User Manual (a minimal sketch follows this list).
  3. Copy the Role components to the Kubespray deployment folder, as described in Role Deployment below.
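
If you use the inbox driver (install_mofed: false, the default in the Role variables below), a minimal preparation sketch for a Worker Node could look like the following. The package list is an assumption based on the standard Ubuntu 18.04 repositories; refer to the Inbox Driver User Manual for the authoritative procedure:

# apt-get -y install rdma-core ibverbs-utils infiniband-diags perftest
# modprobe -a mlx5_ib ib_ipoib ib_umad

rdma-core provides the user-space verbs stack, while the mlx5_ib, ib_ipoib and ib_umad kernel modules are shipped with the HWE kernel installed earlier.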


Role Deployment

Deploy the role to the Kubespray deployment folder:

$ cd ~
$ git clone https://github.com/Mellanox/Kubespray-role-for-RDMA-shared-DP.git
$ cd ~/kubespray-2.13.0
$ cp -r ~/Kubespray-role-for-RDMA-shared-DP/* .

Customize Role Variables

Set the Role's variables in the YAML file roles/mlnx-hca-shared/vars/main.yml.
Lines 7-10 define the ib1 device as the shared resource hca_shared_devices_a.

vars/main.yml
---
# vars file for mlnx-shared-hca role
# Each physical adapter name listed must be connected to a separate InfiniBand fabric
# The max_hca variable in shared_resources supports values 1-1000
# Example resource configuration: a single-device pool, with a multi-device pool shown commented out below
shared_resources:
- res_name: hca_shared_devices_a
  max_hca: 50
  devices:
  - ib1
#- res_name: hca_shared_devices_b
#  max_hca: 500
#  devices:
#  - ib2
#  - ib3

# If you need only a single resource pool, remove the unused resource entries

# Use the Ubuntu LTS enablement (HWE, Hardware Enablement) kernel for the Host OS (requires a node reboot)
HWE_kernel: true

# MOFED installation
# If install_mofed is false, the kernel InfiniBand inbox driver will be used

install_mofed: false
mlnx_ofed_package: "mlnx-ofed-kernel-only"
mlnx_ofed_version: "latest"
upstream_libs: true


# Install Kubeflow MPI-Operator - https://github.com/kubeflow/mpi-operator

kubeflow_mpi_operator: false
mpi_operator_dep: "https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v1/mpi-operator.yaml"

# DP and CNI's URL's
shared_dp_ds: "https://raw.githubusercontent.com/Mellanox/k8s-rdma-shared-dev-plugin/master/images/k8s-rdma-shared-dev-plugin-ds.yaml"
nfd_release: "https://github.com/kubernetes-sigs/node-feature-discovery/archive/v0.6.0.tar.gz"

Role Execution

Run the playbook from the Kubespray deployment folder using the following command:

$ ansible-playbook -i inventory/mycluster/hosts.yaml  --become --become-user=root mlnx-hca-shared.yaml

This step may take a while to complete.

A successful completion of the playbook looks like the following output:

PLAY RECAP ***************************************************************************************************************************************
node1                      : ok=17   changed=9    unreachable=0    failed=0    skipped=2    rescued=0    ignored=0   
node2                      : ok=6    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
node3                      : ok=6    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
node4                      : ok=6    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
node5                      : ok=6    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   

Tuesday 14 July 2020  16:22:55 +0300 (0:00:00.014)       0:00:31.228 ********** 
=============================================================================== 
mlnx-hca-shared : Install aptitude -------------------------------------------------------------------------------------------------------- 7.76s
mlnx-hca-shared : Update additional packages ---------------------------------------------------------------------------------------------- 7.75s
mlnx-hca-shared : OS update. It takes a while. -------------------------------------------------------------------------------------------- 2.89s
mlnx-hca-shared : Extract NFD daemonset's ------------------------------------------------------------------------------------------------- 2.31s
mlnx-hca-shared : Install Openshift pip module -------------------------------------------------------------------------------------------- 2.25s
Gathering Facts --------------------------------------------------------------------------------------------------------------------------- 1.39s
mlnx-hca-shared : Download DP-ds.yaml ----------------------------------------------------------------------------------------------------- 1.16s
mlnx-hca-shared : create configmap for Shared DP ------------------------------------------------------------------------------------------ 1.08s
mlnx-hca-shared : Create a NFD worker Deployment by reading the definition from a local file ---------------------------------------------- 1.05s
mlnx-hca-shared : Create a NFD Master Deployment by reading the definition from a local file ---------------------------------------------- 0.56s
mlnx-hca-shared : Install Mellanox shared device plugin ----------------------------------------------------------------------------------- 0.52s
mlnx-hca-shared : check if a reboot is required ------------------------------------------------------------------------------------------- 0.40s
mlnx-hca-shared : check for module openshift ---------------------------------------------------------------------------------------------- 0.35s
mlnx-hca-shared : Remove NFD directory ---------------------------------------------------------------------------------------------------- 0.32s
mlnx-hca-shared : check if a reboot is required ------------------------------------------------------------------------------------------- 0.32s
mlnx-hca-shared : Insert NFD path to DP-ds.yaml ------------------------------------------------------------------------------------------- 0.28s
Install prerequisites for ALL Nodes ------------------------------------------------------------------------------------------------------- 0.27s
mlnx-hca-shared : MOFED installation ------------------------------------------------------------------------------------------------------ 0.22s
include_role : mlnx-hca-shared ------------------------------------------------------------------------------------------------------------ 0.15s
mlnx-hca-shared : Remove DP-ds.yaml ------------------------------------------------------------------------------------------------------- 0.14s

Role Installation Summary

After installing the Role with its default variable parameters in your K8s cluster, the following will also be completed:

  1. Node Feature Discovery (NFD) for Kubernetes is installed.
  2. A configmap for the RDMA shared device plugin is configured to create the shared resources.
  3. A DaemonSet with the RDMA shared device plugin is installed.

Role Deployment Verification

The role deployment verification must be done from the K8s Master node. Execute the following commands to initiate the verification process:

Key components
root@node1:~# kubectl get pod -n node-feature-discovery -o wide
NAME                          READY   STATUS    RESTARTS   AGE   IP             NODE    NOMINATED NODE   READINESS GATES
nfd-master-786cfcc58f-wjsq8   1/1     Running   0          72s   10.233.90.3    node1   <none>           <none>
nfd-worker-5t84f              1/1     Running   0          71s   10.233.105.1   node4   <none>           <none>
nfd-worker-jb9rd              1/1     Running   0          71s   10.233.70.1    node5   <none>           <none>
nfd-worker-kjm27              1/1     Running   0          71s   10.233.92.4    node3   <none>           <none>
nfd-worker-ljkkl              1/1     Running   0          71s   10.233.96.1    node2   <none>           <none>


root@node1:~# kubectl get pod -n kube-system -o wide | grep "rdma"
NAME                                          READY   STATUS    RESTARTS   AGE   IP                NODE    NOMINATED NODE   READINESS GATES
rdma-shared-dp-ds-4hkdv                       1/1     Running   0          45s   192.168.222.101   node2   <none>           <none>
rdma-shared-dp-ds-qprbd                       1/1     Running   0          45s   192.168.222.104   node5   <none>           <none>
rdma-shared-dp-ds-qzj2h                       1/1     Running   0          42s   192.168.222.102   node3   <none>           <none>
rdma-shared-dp-ds-tmv7f                       1/1     Running   0          42s   192.168.222.103   node4   <none>           <none>
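
In addition, you can confirm from the Master node that the shared RDMA resource is advertised by the Worker Nodes. The resource name below assumes the default hca_shared_devices_a defined in the Role variables:

root@node1:~# kubectl describe node node2 | grep hca_shared_devices_a

The resource should appear under both Capacity and Allocatable, with the count equal to the max_hca value (50 in this guide's vars file).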

Deployment verification

  1. Create a dummy deployment to test the installation.
    Below is a YAML configuration file for a workload deployment that requests the RDMA shared device:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test-deployment
      labels:
        app: test
    spec:
      replicas: 4
      selector: 
        matchLabels:
          app: test
      template:
        metadata:
          labels:
            app: test
        spec:
          containers:
          - image: IMAGE_NAME
            name: test-pod
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                rdma/hca_shared_devices_a: 1
            command:
            - sh
            - -c
            - sleep inf

    This requires a Docker image (IMAGE_NAME) with the InfiniBand user-space drivers installed (a minimal example Dockerfile is sketched at the end of this section).

  2. Start the deployment and make sure that all pods are running:

    root@node1:~# kubectl apply -f test-dep.yaml
    
    
    root@node1:~# kubectl get pod -o wide
    NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE    NOMINATED NODE   READINESS GATES
    test-deployment-d5d789464-8gmml   1/1     Running   0          88s   10.233.70.2    node5   <none>           <none>
    test-deployment-d5d789464-fz787   1/1     Running   0          88s   10.233.105.2   node4   <none>           <none>
    test-deployment-d5d789464-hftjs   1/1     Running   0          88s   10.233.92.6    node3   <none>           <none>
    test-deployment-d5d789464-q7djk   1/1     Running   0          88s   10.233.96.2    node2   <none>           <none>
  3. Check the Pod InfiniBand components:

    root@node1:~# kubectl exec -it test-deployment-d5d789464-hftjs -- bash
    root@test-deployment-d5d789464-hftjs:/tmp# ls -la /dev/infiniband /sys/class/net
    /dev/infiniband:
    total 0
    drwxr-xr-x 2 root root      140 Jul 15 04:42 .
    drwxr-xr-x 6 root root      380 Jul 15 04:42 ..
    crw------- 1 root root 231,  67 Jul 15 04:42 issm3
    crw-rw-rw- 1 root root  10,  54 Jul 15 04:42 rdma_cm
    crw-rw-rw- 1 root root 231, 227 Jul 15 04:42 ucm3
    crw------- 1 root root 231,   3 Jul 15 04:42 umad3
    crw-rw-rw- 1 root root 231, 195 Jul 15 04:42 uverbs3
    
    /sys/class/net:
    total 0
    drwxr-xr-x  2 root root 0 Jul 15 04:43 .
    drwxr-xr-x 78 root root 0 Jul 15 04:43 ..
    lrwxrwxrwx  1 root root 0 Jul 15 04:43 eth0 -> ../../devices/virtual/net/eth0
    lrwxrwxrwx  1 root root 0 Jul 15 04:43 lo -> ../../devices/virtual/net/lo
    lrwxrwxrwx  1 root root 0 Jul 15 04:43 tunl0 -> ../../devices/virtual/net/tunl0
    
    
    root@test-deployment-d5d789464-hftjs:/tmp# ibv_devinfo                   
    hca_id:	mlx5_3
    	transport:			InfiniBand (0)
    	fw_ver:				20.24.0246
    	node_guid:			9803:9b03:0085:56a7
    	sys_image_guid:		9803:9b03:0085:56a6
    	vendor_id:			0x02c9
    	vendor_part_id:		4123
    	hw_ver:				0x0
    	board_id:			MT_0000000224
    	phys_port_cnt:		1
    		port:	1
    			state:			PORT_ACTIVE (4)
    			max_mtu:		4096 (5)
    			active_mtu:		4096 (5)
    			sm_lid:			1
    			port_lid:		4
    			port_lmc:		0x00
    			link_layer:		InfiniBand

    From the above example, we see that the deployed container has only one InfiniBand device (uverbs3), which can be used by any application with native InfiniBand support.

  4. The following is a post-deployment sanity check using the ib_write_bw command between two Pods running on different K8s Worker Nodes:

    First Pod - Server side
    root@node1:~# kubectl exec -it test-deployment-d5d789464-hftjs -- bash
    root@test-deployment-d5d789464-hftjs:/tmp# ib_write_bw -d mlx5_3 -a -F
    
    ************************************
    * Waiting for client to connect... *
    ************************************
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF		Device         : mlx5_3
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x04 QPN 0x088d PSN 0xa1ff0c RKey 0x08a785 VAddr 0x007fae2d365000
     remote address: LID 0x03 QPN 0x088c PSN 0x57b931 RKey 0x08a282 VAddr 0x007f74e44b0000
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     65536      1097100          0.00               95.87  		        0.182849
    ---------------------------------------------------------------------------------------
    Second Pod - Client side
    root@node1:~# kubectl exec -it test-deployment-d5d789464-8gmml -- bash
    root@test-deployment-d5d789464-8gmml:/tmp# ib_write_bw  -F -d mlx5_3 10.233.92.6 -D 10 --cpu_util --report_gbits
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF		Device         : mlx5_3
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x03 QPN 0x088c PSN 0x57b931 RKey 0x08a282 VAddr 0x007f74e44b0000
     remote address: LID 0x04 QPN 0x088d PSN 0xa1ff0c RKey 0x08a785 VAddr 0x007fae2d365000
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]    CPU_Util[%]
     65536      1097100          0.00               95.87  		        0.182849	    4.57
    ---------------------------------------------------------------------------------------
    

    The above example indicates that we achieve an average transfer rate of ~96 Gb/s, close to the EDR line rate of 100Gb/s.
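
As referenced in step 1 above, the test deployment needs a Docker image with the InfiniBand user-space drivers installed. Below is a minimal example Dockerfile; the package selection is an assumption based on the standard Ubuntu 18.04 repositories and is not part of the original guide:

FROM ubuntu:18.04
# Inbox InfiniBand user-space libraries and benchmark tools (ib_write_bw comes from perftest)
RUN apt-get update && \
    apt-get -y install --no-install-recommends \
        rdma-core libibverbs1 ibverbs-providers ibverbs-utils \
        infiniband-diags perftest && \
    rm -rf /var/lib/apt/lists/*

Build the image, push it to a registry reachable by the Worker Nodes, and substitute its name for IMAGE_NAME in the deployment YAML above.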

Done!

About the Authors

Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for the research and design of complex Kubernetes/OpenShift and Microsoft-based solutions. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA-accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.


Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. Neither NVIDIA Corporation nor any of its direct or indirect subsidiaries and affiliates (collectively: “NVIDIA”) make any representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Trademarks
NVIDIA, the NVIDIA logo, and Mellanox are trademarks and/or registered trademarks of NVIDIA Corporation and/or Mellanox Technologies Ltd. in the U.S. and in other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright
© 2022 NVIDIA Corporation & affiliates. All Rights Reserved.