Created on Nov 16, 2020 by Boris Kovalev, Vitaliy Razinkov
This Reference Deployment Guide (RDG) explains how to build the highest performing Kubernetes (K8s) cluster capable of hosting the most demanding distributed workloads, running on top of NVIDIA GPUs and an NVIDIA end-to-end InfiniBand fabric.
Abbreviations and Acronyms

|Term|Definition|Term|Definition|
|---|---|---|---|
|AOC|Active Optical Cable|HPC|High Performance Computing|
|CNI|Container Network Interface|IB|InfiniBand|
|CR|Custom Resource|MOFED|Mellanox OpenFabrics Enterprise Distribution|
|DAC|Direct Attach Copper cable|QSG|Quick Start Guide|
|DHCP|Dynamic Host Configuration Protocol|RDMA|Remote Direct Memory Access|
|EDR|Enhanced Data Rate - 100Gb/s|SR-IOV|Single Root Input Output Virtualization|
|GPU|Graphics Processing Unit|HDR|High Data Rate - 200Gb/s|
References
- NVIDIA T4 GPU
- NVIDIA OpenFabrics Enterprise Distribution for Linux (MLNX_OFED)
- What is Kubernetes?
- NVIDIA GPU Operator
- SR-IOV Network Operator
Introduction
Provisioning Machine Learning (ML) and High Performance Computing (HPC) cloud solutions can be a very complicated task. Proper design and careful software and hardware component selection are gating factors for a successful deployment.
This document will guide you through a complete solution cycle including design, component selection, technology overview and deployment steps.
The solution will be provisioned on top of GPU enabled servers over an NVIDIA end-to-end InfiniBand fabric.
The NVIDIA GPU and SR-IOV Network Operators allow you to run GPU-accelerated and native RDMA workloads on the InfiniBand fabric, such as HPC, Big Data, ML, AI, and other applications.
The following processes are described below:
- K8s cluster deployment by Kubespray over bare metal nodes with Ubuntu 20.04 OS.
- NVIDIA GPU Operator deployment.
- InfiniBand fabric configuration.
- POD deployment example.
This document covers a single Kubernetes controller deployment scenario.
For high-availability cluster deployment, please refer to https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md
Key Components and Technologies
- NVIDIA® T4 GPU
The NVIDIA® T4 GPU is based on the NVIDIA Turing™ architecture and packaged in an energy-efficient, 70-watt, small PCIe form factor. T4 is optimized for mainstream computing environments, and features multi-precision Turing Tensor Cores and RT Cores. Combined with accelerated containerized software stacks from NGC, T4 delivers revolutionary performance at scale to accelerate cloud workloads, such as high-performance computing, deep learning training and inference, machine learning, data analytics, and graphics.
- NVIDIA MLNX-OS®
NVIDIA MLNX-OS is Mellanox's InfiniBand/VPI switch operating system for data centers with storage, enterprise, high-performance, machine learning, Big Data computing and cloud fabrics.
- NVIDIA ConnectX InfiniBand adapters
NVIDIA® ConnectX® InfiniBand smart adapters with acceleration engines deliver best-in-class network performance and efficiency, enabling low-latency, high throughput and high message rates for applications at SDR, QDR, DDR, FDR, EDR and HDR InfiniBand speeds.
- NVIDIA smart InfiniBand switch systems
NVIDIA smart InfiniBand switch systems deliver the highest performance and port density for high performance computing (HPC), AI, Web 2.0, big data, clouds, and enterprise data centers. Support for 36 to 800-port configurations at up to 200Gb/s per port allows compute clusters and converged data centers to operate at any scale, reducing operational costs and infrastructure complexity.
- NVIDIA LinkX® InfiniBand Cables
NVIDIA Mellanox LinkX cables and transceivers are designed to maximize the performance of High Performance Computing networks, which require high-bandwidth, low-latency connections between compute nodes and switch nodes. DACs are available in lengths up to 7m. AOCs are available in lowest-cost OM2 fiber for lengths under 30m, and in OM3/OM4 multimode fiber for lengths up to 100m. DACs and AOCs support data rates of QDR (40G), FDR10 (40G), FDR (56G), EDR (100G), HDR100 (100G), and HDR (200G).
- Kubernetes
Kubernetes (K8s) is an open-source container orchestration platform for deployment automation, scaling, and management of containerized applications.
- Kubespray (From Kubernetes.io)
Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes cluster configuration management tasks, and provides:
- A highly available cluster
- Composable attributes
- Support for most popular Linux distributions
- NVIDIA GPU Operator
NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs.
These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labeling, DCGM-based monitoring, and others.
- RDMA
Remote Direct Memory Access (RDMA) is a technology that allows computers in a network to exchange data without involving the processor, cache, or operating system of either computer.
Like locally based Direct Memory Access (DMA), RDMA improves throughput and performance and frees up compute resources.
- SR-IOV Network Operator
The SR-IOV Network Operator is designed to help the user provision and configure the SR-IOV CNI plugin and device plugin in OpenShift and Kubernetes clusters.
Solution Logical Design
The logical design includes the following layers:
- One compute layer:
- Deployment node
- K8s Master node
- 2 x K8s Worker nodes, each with two NVIDIA T4 GPUs and one NVIDIA ConnectX adapter
- Two separate networking layers:
- Management network
- High-speed InfiniBand (IB) fabric
In this RDG, we describe a small-scale solution with only one switch.
Simple Setup with One Switch
In the single-switch case, using an NVIDIA QM8700 InfiniBand HDR switch system, you can connect up to 40 servers with NVIDIA LinkX HDR 200Gb/s QSFP56 DAC cables.
Scaled Setup for InfiniBand Fabric
For assistance in designing the scaled InfiniBand topology, use the NVIDIA InfiniBand Topology Generator, an online cluster configuration tool that offers flexible cluster configurations and sizes.
For a scaled setup we recommend using NVIDIA Unified Fabric Manager (UFM®).
Bill of Materials (BoM)
The following hardware setup is utilized in the distributed K8s configuration described in this guide:
The above table does not contain Kubernetes Management network connectivity components.
Deployment and Configuration
The deployment is validated using Ubuntu 20.04 OS and Kubespray v2.14.2.
The first port of each NVIDIA HCA on each Worker node is wired to the NVIDIA switch using NVIDIA LinkX HDR 200Gb/s QSFP56 DAC cables.
- InfiniBand fabric
- Switch OS
- Management Network
DHCP and DNS services are part of the IT infrastructure. The component installation and configuration are not covered in this guide.
Below are the server names with their relevant network configurations.
|Server/Switch Type|Name|High-Speed Network|Management Network (IP and NICs)|
|---|---|---|---|
|Deployment Node|sl-depl-node|-|eno0: DHCP|
|High-speed switch|swx-mld-ib67|none|mgmt0: DHCP|
InfiniBand Fabric Configuration
Below is a list of recommendations and prerequisites that are important for the configuration process:
- Refer to the MLNX-OS User Manual to become familiar with the switch software (located at enterprise-support.nvidia.com/s/)
- Upgrade the switch software to the latest MLNX-OS version
- InfiniBand Subnet Manager (SM) is required to configure InfiniBand fabric properly
There are three ways to run an InfiniBand SM in the InfiniBand fabric:
- Start the SM on one or more managed switches. This is a very convenient and quick operation which allows for easier InfiniBand ‘plug & play'.
- Run the OpenSM daemon on one or more servers by executing the /etc/init.d/opensmd command. It is recommended to run the SM on a server when the fabric includes 648 nodes or more.
- Use Unified Fabric Management (UFM®).
UFM is a powerful platform for managing scale-out computing environments. It eliminates the complexity of fabric management, provides deep visibility into traffic, and optimizes fabric performance.
In this guide, we launch the InfiniBand SM on the InfiniBand switch (method 1). Below are the configuration steps for the chosen method.
To enable the SM on one of the managed switches:
Login to the switch and enter the next configuration commands (swx-mld-ib67 is our switch name):
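A sketch of the MLNX-OS command sequence, assuming the SM is enabled on this switch together with SR-IOV virtualization support (needed later by the SR-IOV Network Operator); verify the exact syntax against your MLNX-OS version:

```
enable
configure terminal
ib smnode swx-mld-ib67 enable
ib smnode swx-mld-ib67 sm-priority 0
ib sm virt enable
configuration write
reload
```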
Once the switch reboots, check the switch configuration. It should look like the following:
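Illustrative verification output, assuming the MLNX-OS `show ib sm` and `show ib sm virt` commands:

```
swx-mld-ib67 [standalone: master] # show ib sm
enable
swx-mld-ib67 [standalone: master] # show ib sm virt
enable
```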
All the K8s Worker nodes have the same hardware specification (see BoM for details).
- Host BIOS
Verify that you are using a SR-IOV supported server platform for K8s Worker nodes, and review the BIOS settings in the hardware documentation to enable SR-IOV in the BIOS.
- Host OS
Ubuntu Server 20.04 operating system should be installed on all servers with OpenSSH server packages.
- Experience with Kubernetes
Make sure to familiarize yourself with the Kubernetes Cluster architecture.
Host OS Prerequisites
Make sure the Ubuntu Server 20.04 operating system is installed on all servers with OpenSSH server packages, and create a non-root user account with passwordless sudo privileges.
Update the Ubuntu software packages by running the following commands:
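A typical update sequence (the reboot picks up a new kernel, if one was installed):

```bash
sudo apt-get update
sudo apt-get upgrade -y
sudo reboot
```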
Non-root User Account Prerequisites
In this solution, we appended the following line to the end of /etc/sudoers:
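Assuming the deployment user account is named user, as used later in this guide:

```
user ALL=(ALL) NOPASSWD:ALL
```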
Disable/blacklist the Nouveau NVIDIA driver on the Worker node servers by running the commands below, or paste each line into the terminal:
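A common way to blacklist the nouveau driver (a sketch; the reboot is required for it to take effect):

```bash
cat << EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
sudo reboot
```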
Install NVIDIA MLNX_OFED and upgrade the firmware on the Worker node servers by running the commands below, or paste each line into the terminal:
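A sketch of the MLNX_OFED installation; the version and URL below are illustrative, so download the package matching your environment from the NVIDIA website (the installer upgrades the adapter firmware by default):

```bash
wget https://content.mellanox.com/ofed/MLNX_OFED-5.1-2.5.8.0/MLNX_OFED_LINUX-5.1-2.5.8.0-ubuntu20.04-x86_64.tgz
tar -xzf MLNX_OFED_LINUX-5.1-2.5.8.0-ubuntu20.04-x86_64.tgz
cd MLNX_OFED_LINUX-5.1-2.5.8.0-ubuntu20.04-x86_64
sudo ./mlnxofedinstall --add-kernel-support
sudo reboot
```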
Set Up IB port link on the Worker node servers.
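For example (the interface name ibs6f0 is taken from this deployment; adjust to yours):

```bash
sudo ip link set dev ibs6f0 up
```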
Set netns to exclusive mode, which allows network namespace isolation for RDMA workloads, on the Worker node servers.
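A sketch using the iproute2 rdma tool, with a module option to persist the mode across reboots (netns_mode=0 selects exclusive mode for the ib_core module):

```bash
sudo rdma system set netns exclusive
echo 'options ib_core netns_mode=0' | sudo tee /etc/modprobe.d/ib_core.conf
sudo update-initramfs -u
sudo reboot
```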
Check netns mode and InfiniBand devices on the Worker node servers.
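For example (ibdev2netdev is provided by MLNX_OFED; the device and interface names are from this deployment):

```bash
rdma system show
# Expected: netns exclusive
ibdev2netdev
# Example: mlx5_0 port 1 ==> ibs6f0 (Up)
```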
All Worker nodes must have the same configuration and the same PCIe card placement.
Check that the IB interface is UP.
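For example (interface name from this deployment):

```bash
ip link show ibs6f0
ibstat
# The port state should be "Active" with physical state "LinkUp"
```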
K8s Cluster Deployment and Configuration
The Kubernetes cluster in this solution will be installed using Kubespray with a non-root user account from a Deployment node.
SSH Private Key and SSH Passwordless Login
Login to the Deployment node as a deployment user (in this case - user) and create an SSH private key for configuring the password-less authentication on your computer by running the following commands:
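For example, accepting the default key location and an empty passphrase:

```bash
ssh-keygen
```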
Copy your SSH key to all nodes in your deployment with ssh-copy-id, which installs the public key (for example, ~/.ssh/id_rsa.pub) on each node. Sample:
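A sketch using ssh-copy-id (node hostnames are illustrative):

```bash
ssh-copy-id user@node1
ssh-copy-id user@node2
ssh-copy-id user@node3
```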
Check SSH connectivity to all nodes in your deployment by running the following command:
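For example (hostnames are illustrative); each command should return the node's hostname without prompting for a password:

```bash
ssh user@node1 hostname
ssh user@node2 hostname
ssh user@node3 hostname
```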
Install dependencies for running Kubespray with Ansible on the Deployment server. The default folder for subsequent commands is ~/kubespray-2.14.2.
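A sketch of the dependency installation, assuming Kubespray v2.14.2 is downloaded from its GitHub release archive:

```bash
cd ~
sudo apt-get install -y python3-pip
wget https://github.com/kubernetes-sigs/kubespray/archive/v2.14.2.tar.gz
tar -xzf v2.14.2.tar.gz
cd kubespray-2.14.2
sudo pip3 install -r requirements.txt
```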
Create a new cluster configuration.
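The standard Kubespray inventory-builder flow (node IPs are illustrative; the first IP becomes the Master node):

```bash
cp -rfp inventory/sample inventory/mycluster
declare -a IPS=(192.168.222.111 192.168.222.112 192.168.222.113)
CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
```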
As a result, the inventory/mycluster/hosts.yaml file will be created. Review and change the host configuration file, inventory/mycluster/hosts.yaml.
Below is an example for this deployment.
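A sketch of hosts.yaml for one Master node (node1) and two Worker nodes (node2, node3); names and IPs are illustrative, and the group names follow the Kubespray 2.14 inventory layout:

```yaml
all:
  hosts:
    node1:
      ansible_host: 192.168.222.111
      ip: 192.168.222.111
      access_ip: 192.168.222.111
    node2:
      ansible_host: 192.168.222.112
      ip: 192.168.222.112
      access_ip: 192.168.222.112
    node3:
      ansible_host: 192.168.222.113
      ip: 192.168.222.113
      access_ip: 192.168.222.113
  children:
    kube-master:
      hosts:
        node1:
    kube-node:
      hosts:
        node2:
        node3:
    etcd:
      hosts:
        node1:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}
```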
Review and change cluster installation parameters in the files:
In inventory/mycluster/group_vars/all/all.yml, uncomment the following line so that metrics can receive data about cluster resource usage:
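The line in question (commented out by default in all.yml):

```yaml
kube_read_only_port: 10255
```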
In inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml set a default Kubernetes CNI by setting the desired kube_network_plugin value (default: calico) parameter and enable multi_networking by setting kube_network_plugin_multus: true.
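The resulting settings in k8s-cluster.yml:

```yaml
kube_network_plugin: calico
kube_network_plugin_multus: true
```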
Deploy K8s Cluster by Kubespray Ansible Playbook
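Run the playbook from the Kubespray directory (the standard Kubespray invocation):

```bash
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
```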
Example of a successful completion of the playbooks looks like:
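An illustrative PLAY RECAP (task counts will vary); what matters is failed=0 and unreachable=0 for every node:

```
PLAY RECAP *********************************************************************
node1  : ok=583  changed=80  unreachable=0  failed=0  skipped=1159  rescued=0  ignored=2
node2  : ok=365  changed=40  unreachable=0  failed=0  skipped=628   rescued=0  ignored=1
node3  : ok=365  changed=40  unreachable=0  failed=0  skipped=628   rescued=0  ignored=1
```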
Label the Worker nodes with the node-role.kubernetes.io/worker label by running the following on the K8s Master node.
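For example, for the two Worker nodes in this deployment (node names are illustrative):

```bash
kubectl label nodes node2 node-role.kubernetes.io/worker=
kubectl label nodes node3 node-role.kubernetes.io/worker=
```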
K8s Deployment Verification
Verifying the Kubernetes cluster deployment can be done from the root user account on the K8s Master node.
Below is an example of the K8s cluster deployment information for a default Kubespray configuration using the Calico Kubernetes CNI plugin.
To ensure that the Kubernetes cluster is installed correctly, run the following commands:
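For example:

```bash
kubectl get nodes -o wide
kubectl get pods -A -o wide
```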
NVIDIA GPU Operator Installation for K8s Cluster
The preferred method to deploy the device plugin is as a daemonset using Helm from the K8s Master node. Install Helm using the official installer script.
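The official Helm 3 installer script:

```bash
curl -fsSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
```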
Add the NVIDIA Helm repository.
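Assuming the GPU Operator Helm repository location at the time of writing:

```bash
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
```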
Deploy NVIDIA GPU Operator.
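A minimal installation with the chart defaults:

```bash
helm install --wait --generate-name nvidia/gpu-operator
```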
Verify the NVIDIA GPU Operator installation (wait ~5-10 minutes for the operator installation to finish).
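For example:

```bash
kubectl get pods -A | grep -i -E 'nvidia|gpu-operator'
# All listed pods should reach Running or Completed status
```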
To run a sample GPU application: https://github.com/NVIDIA/gpu-operator#running-a-sample-gpu-application
For GPU monitoring: https://github.com/NVIDIA/gpu-operator#gpu-monitoring
SR-IOV Network Operator Installation for K8s Cluster
SR-IOV network is an additional feature of a Kubernetes cluster.
To make it work, you need to provision and configure different components.
SR-IOV Network Operator Deployment Steps
The SR-IOV Network Operator performs the following functions:
- Initialize the supported SR-IOV NIC types on selected nodes.
- Provision SR-IOV device plugin executable on selected nodes.
- Provision SR-IOV CNI plugin executable on selected nodes.
- Manage configuration of SR-IOV device plugin on host.
- Generate net-att-def CRs for SR-IOV CNI plugin.
Install general dependencies on the Master node server by running the commands below.
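A sketch, assuming the operator is built and deployed from source with make and Go, as in the steps below:

```bash
sudo apt-get update
sudo apt-get install -y git make golang-go jq
```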
Below is a detailed step-by-step description of an SR-IOV Network Operator installation.
Install Whereabouts CNI.
You can install this plugin with a Daemonset, using the following commands:
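A sketch following the Whereabouts project README; the manifest paths may differ between releases, so check the repository for the current locations:

```bash
kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/whereabouts/master/doc/crds/daemonset-install.yaml
kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/whereabouts/master/doc/crds/whereabouts.cni.cncf.io_ippools.yaml
kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/whereabouts/master/doc/crds/whereabouts.cni.cncf.io_overlappingrangeipreservations.yaml
```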
To ensure the plugin is installed correctly, run the following command:
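For example; a whereabouts pod should be Running on every node:

```bash
kubectl get pods -A | grep whereabouts
```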
Clone the SR-IOV Network Operator GitHub repository.
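Assuming the upstream repository location:

```bash
git clone https://github.com/k8snetworkplumbingwg/sriov-network-operator.git
cd sriov-network-operator
```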
Deploy the operator.
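Using the Makefile target for plain (non-OpenShift) Kubernetes clusters; the target name follows the operator's README, so verify it against the cloned revision:

```bash
make deploy-setup-k8s
```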
By default, the operator will be deployed in the 'sriov-network-operator' namespace for a Kubernetes cluster. You can check that the deployment finished successfully.
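For example:

```bash
kubectl -n sriov-network-operator get pods
```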
- Check the status of the SriovNetworkNodeState CRs to find all SR-IOV capable devices in the cluster.
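For example:

```bash
kubectl -n sriov-network-operator get sriovnetworknodestates -o yaml
```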
In our deployment, we chose the IB interface named ibs6f0.
Using the chosen IB interface, we create a SriovNetworkNodePolicy CR.
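A sketch of the policy; the resource name (mlnxnics), VF count, MTU, and selectors are this deployment's choices, not fixed values:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-ib0
  namespace: sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 2044
  nicSelector:
    pfNames: ["ibs6f0"]
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  linkType: ib
  isRdma: true
  numVfs: 8
  priority: 90
  resourceName: mlnxnics
```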
Apply the SriovNetworkNodePolicy.
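Assuming the policy above was saved as policy-ib0.yaml:

```bash
kubectl apply -f policy-ib0.yaml
```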
Check the Operator deployment after the policy activation.
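For example; the SR-IOV device plugin and CNI pods should now be running on the Worker nodes:

```bash
kubectl -n sriov-network-operator get pods -o wide
```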
Create a Network Attachment Definition with file name sriov-ib0.yaml.
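A sketch of the definition, assuming the ib-sriov CNI and the Whereabouts IPAM installed earlier; the resource prefix (openshift.io, the operator's default) and the IP range are assumptions to adjust for your environment:

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-ib0
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/mlnxnics
spec:
  config: '{
    "cniVersion": "0.3.1",
    "name": "sriov-ib0",
    "type": "ib-sriov",
    "link_state": "enable",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.101.0/24"
    }
  }'
```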
Apply the Network Attachment Definition.
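```bash
kubectl apply -f sriov-ib0.yaml
```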
Verify the Network Attachment Definition installation.
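For example:

```bash
kubectl get network-attachment-definitions.k8s.cni.cncf.io
```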
- Check Worker node 2 (see the command sketch after this list).
- Check Worker node 3.
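A sketch of the per-node check; the node names and resource name follow the earlier assumptions:

```bash
kubectl describe node node2 | grep -A 10 'Allocatable:'
kubectl describe node node3 | grep -A 10 'Allocatable:'
# Both nodes should advertise nvidia.com/gpu and openshift.io/mlnxnics resources
```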
Create a sample Deployment (the container image must include CUDA and InfiniBand performance tools):
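A sketch of such a Deployment; the image is a placeholder, and the network and resource names follow the earlier assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-gpu-rdma
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-gpu-rdma
  template:
    metadata:
      labels:
        app: sample-gpu-rdma
      annotations:
        k8s.v1.cni.cncf.io/networks: sriov-ib0
    spec:
      containers:
        - name: sample
          # Placeholder image: must include CUDA and InfiniBand perftest tools
          image: <cuda-and-perftest-image>
          command: ["/bin/bash", "-c", "sleep infinity"]
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]   # required for RDMA memory registration
          resources:
            requests:
              nvidia.com/gpu: 1
              openshift.io/mlnxnics: 1
            limits:
              nvidia.com/gpu: 1
              openshift.io/mlnxnics: 1
```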
Deploy the sample POD.
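Assuming the manifest above was saved as sample-deployment.yaml:

```bash
kubectl apply -f sample-deployment.yaml
```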
Verify the POD is running.
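For example:

```bash
kubectl get pods -o wide
```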
Check GPU in a container.
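For example (use a pod name from the previous command):

```bash
kubectl exec -it <pod-name> -- nvidia-smi
```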
Check network adapters.
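For example; ibv_devinfo reports the InfiniBand device visible inside the POD:

```bash
kubectl exec -it <pod-name> -- ip addr show
kubectl exec -it <pod-name> -- ibv_devinfo
```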
Run an RDMA Write (ib_write_bw) bandwidth stress benchmark over IB.
Open two consoles to the K8s Master node.
In the first console (server side), run the following command in the first sample POD:
```
ib_write_bw -a -d mlx5_0 &
```
In the second console (client side), run the following command in the second sample POD, pointing at the server POD's IB address:
```
ib_write_bw -a -F $server_IP -d mlx5_0 --report_gbits
```
Delete the sample deployment by running:
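Assuming the manifest file name used earlier:

```bash
kubectl delete -f sample-deployment.yaml
```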
About the Authors
Boris Kovalev has worked for the past several years as a Solutions Architect, focusing on NVIDIA Networking/Mellanox technology, and is responsible for complex machine learning, Big Data and advanced VMware-based cloud research and design. Boris previously spent more than 20 years as a senior consultant and solutions architect at multiple companies, most recently at VMware. He has written multiple reference designs covering VMware, machine learning, Kubernetes, and container solutions which are available at the Mellanox Documents website.
Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference designs guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.