Created Oct 24, 2022.

Scope

The following Reference Deployment Guide (RDG) demonstrates the deployment process of a Red Hat OpenShift Container Platform (RH OCP v4.10.x) over NVIDIA DGX A100 servers and an NVIDIA HDR InfiniBand fabric for accelerated AI applications.

Abbreviations and Acronyms

Term      Definition
AI        Artificial Intelligence
CNI       Container Network Interface
CR        Custom Resource
CRD       Custom Resource Definition
CRI       Container Runtime Interface
DHCP      Dynamic Host Configuration Protocol
DL        Deep Learning
DNS       Domain Name System
DP        Device Plugin
HDR       InfiniBand High Data Rate (200Gb/s)
IPAM      IP Address Management
K8s       Kubernetes
ML        Machine Learning
NCCL      NVIDIA Collective Communication Library
NFD       Node Feature Discovery
OCI       Open Container Initiative
OCP       OpenShift Container Platform
PF        Physical Function
QSG       Quick Start Guide
RDG       Reference Deployment Guide
RDMA      Remote Direct Memory Access
SR-IOV    Single Root Input/Output Virtualization
TF        TensorFlow
VF        Virtual Function

Introduction

Preparing a Red Hat OpenShift Container Platform (OCP) infrastructure to run AI workloads efficiently is challenging.
This document provides a complete reference deployment guide for such a system, including a technology overview, design, component selection, deployment steps and AI workload examples. The solution is deployed on NVIDIA DGX A100 servers for the OCP worker nodes and on standard x86 servers for the OCP control plane nodes. The NVIDIA end-to-end HDR (200Gb/s) InfiniBand fabric handles the workload networking needs, while a 100Gb/s Ethernet network is used as the Deployment/Management network.
In this guide, we use two OpenShift operators, the NVIDIA GPU Operator and the NVIDIA Network Operator, which are responsible for deploying and configuring the GPU and network components in the OCP cluster. These components accelerate AI tasks using CUDA, RDMA and GPUDirect technologies.

A Greenfield deployment is assumed for this guide.

Solution Architecture

Key Components and Technologies

  • NVIDIA DGX A100 
    NVIDIA DGX™ A100 is the universal system for all AI workloads, offering unprecedented compute density, performance, and flexibility in the world’s first 5 petaFLOPS AI system. NVIDIA DGX A100 features the world’s most advanced accelerator, the NVIDIA A100 Tensor Core GPU, enabling enterprises to consolidate training, inference, and analytics into a unified, easy-to-deploy AI infrastructure that includes direct access to NVIDIA AI experts.

  • NVIDIA InfiniBand Adapters
    NVIDIA ConnectX InfiniBand adapters provide ultra-low latency, extreme throughput, and innovative NVIDIA In-Network Computing engines to deliver the acceleration, scalability, and feature-rich technology needed for today's modern workloads. 

  • NVIDIA LinkX Cables 
    The NVIDIA® LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400GbE Ethernet and 100, 200, and 400Gb/s InfiniBand products for cloud, HPC, hyperscale, enterprise, telco, storage, and artificial intelligence data center applications.

  • NVIDIA InfiniBand Switches
    NVIDIA Quantum InfiniBand switch systems deliver the highest performance and port density available. Innovative capabilities such as NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ and advanced management features such as self-healing network capabilities, quality of service, enhanced virtual lane mapping, and NVIDIA In-Network Computing acceleration engines provide a performance boost for industrial, AI, and scientific applications.

  • OpenShift
    Red Hat OpenShift Container Platform is an enterprise-ready Kubernetes container platform with full-stack automated operations to manage hybrid cloud and multi-cloud deployments. It helps you deliver applications faster and makes developers more productive. Automate life-cycle management to get increased security, tailored operations solutions, easy-to-manage cluster operations, and application portability.

  • NVIDIA GPU Operator
    The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring, and more.

  • NVIDIA CUDA 
    CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs. In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute-intensive portion of the application runs on thousands of GPU cores in parallel.

  • NVIDIA Network Operator
    Analogous to the NVIDIA GPU Operator, the NVIDIA Network Operator simplifies scale-out network design for Kubernetes by automating aspects of network deployment and configuration that would otherwise require manual work. It loads the required drivers, libraries, device plugins, and CNIs on any cluster node with an NVIDIA network interface. Paired with the NVIDIA GPU Operator, the Network Operator enables GPUDirect RDMA, a key technology that accelerates cloud-native AI workloads by orders of magnitude. The NVIDIA Network Operator uses Kubernetes CRDs and the Operator Framework to provision the host software needed for enabling accelerated networking.

  • RDMA 
    RDMA is a technology that allows computers in a network to exchange data without involving the processor, cache, or operating system of either computer. Like locally based DMA, RDMA improves throughput and performance and frees up compute resources.

  • GPUDirect RDMA
    GPUDirect (GDR) RDMA provides a direct P2P (Peer-to-Peer) data path between the GPU memory directly to and from NVIDIA HCA devices. This reduces GPU-to-GPU communication latency and completely offloads the CPU, removing it from all GPU-to-GPU communications across the network.



Logical Design

The logical design includes the following parts:

  • Jump node used for initial deployment and debugging

  • OCP control plane nodes

  • NVIDIA DGX A100 servers - OCP worker nodes

  • NVIDIA Quantum InfiniBand fabric

  • UFM Enterprise Node

  • Deployment and OCP management networks

OpenShift Container Platform Networking

In this guide, an OCP cluster is deployed in a non-air-gapped environment, and Internet access is required.

The OCP cluster is deployed on a dedicated Deployment/Management network (CIDR 192.168.77.0/24) that is part of the IT infrastructure which also includes DNS/DHCP services. 
The installation and configuration procedures for these components are not covered in this guide.


Network/Fabric Diagram

In this RDG we will describe a small-scale solution with only one InfiniBand switch.

All OCP cluster nodes are connected to the management (MGMT) switch by a single 100GbE cable.
All InfiniBand ports of each DGX A100 server are connected to a single NVIDIA Quantum HDR 200Gb/s InfiniBand smart edge switch with NVIDIA LinkX HDR 200Gb/s QSFP56 DAC cables.
In addition, we used a UFM Enterprise Node with similar connectivity.
All server remote management ports and switch management ports are connected to a 1GbE switch.
The setup diagram is presented in the picture below. 

For assistance in designing the scaled InfiniBand topology, use the InfiniBand Topology Generator, an online cluster configuration tool that offers flexible cluster configurations and sizes.

Software Stack Components

In this guide, the following software components have been used to deploy the system:

  • Red Hat OCP 4.10.30
  • AlmaLinux v8.5 for Jump-Node. Installation mode "Server with GUI"
  • Ubuntu server 18.04 for UFM Enterprise node
  • NVIDIA GPU Operator v1.11.1
  • NVIDIA Network Operator v1.3.0
  • Red Hat OpenShift Data Foundation
  • Red Hat SR-IOV Network Operator

Bill of Materials

The following hardware setup is utilized in this guide to build an OCP cluster with four Worker nodes.

Deployment and Configuration

Network and Fabric Configuration for an OCP Cluster

Below are the server names with their relevant network configurations.


Server/Switch Type   Server/Switch Name   InfiniBand Network   Management Network
Jump node            ocp4-jump            N/A                  eth0: DHCP 192.168.77.201
OCP Master node1     control1             N/A                  eth0: DHCP 192.168.77.11
OCP Master node2     control2             N/A                  eth0: DHCP 192.168.77.12
OCP Master node3     control3             N/A                  eth0: DHCP 192.168.77.13
OCP Worker node1     worker1              ib0-ib7: no IP set   enp225s0f0: DHCP 192.168.77.21
OCP Worker node2     worker2              ib0-ib7: no IP set   enp225s0f0: DHCP 192.168.77.22
OCP Worker node3     worker3              ib0-ib7: no IP set   enp225s0f0: DHCP 192.168.77.23
OCP Worker node4     worker4              ib0-ib7: no IP set   enp225s0f0: DHCP 192.168.77.24
InfiniBand switch    ib-sw01              N/A                  mgmt0: DHCP 192.168.77.222
UFM                  ufm                  ib0: no IP set       eth0: DHCP 192.168.77.223


Wiring

On each OCP Worker Node, all the networking ports of each NVIDIA Network Adapter are wired to the Ethernet (SN2700) and InfiniBand (QM8700) NVIDIA switches using NVIDIA LinkX DAC cables.

The below figure illustrates the required wiring for building an OCP cluster with four Worker nodes and a UFM Enterprise Node.



InfiniBand Fabric Configuration

Overview 

Below is a list of recommendations and prerequisites that are important for the configuration process:

  • Refer to the NVIDIA MLNX-OS User Manual to become familiar with the switch software (located at enterprise-support.nvidia.com/s/)
  • Upgrade the switch software to the latest NVIDIA MLNX-OS version
  • An InfiniBand Subnet Manager (SM) is required to configure the InfiniBand fabric properly

There are three ways to run an InfiniBand SM in the InfiniBand fabric:

  1. Start the SM on one or more managed switches. This is a very convenient and quick operation which allows for easier InfiniBand 'plug & play'.
  2. Run an OpenSM daemon on one or more servers by executing the /etc/init.d/opensmd command.
  3. Use a Unified Fabric Manager (UFM®). 
    UFM is a powerful platform for scale-out computing, which eliminates the complexity of fabric management, provides deep visibility into traffic and optimizes fabric performance.

Below are the configuration steps for method #1 and method #3.

In this deployment, the InfiniBand SM is launched with the Unified Fabric Manager (method #3).
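
Method #2 is not used in this deployment, but for completeness, the minimal sketch below shows how an OpenSM daemon might be started and verified on a host with MLNX_OFED installed. The exact service invocation may differ per distribution, so treat it as an assumption.

/etc/init.d/opensmd start      # or, on systemd-based hosts: systemctl start opensmd
sminfo                         # queries the fabric and reports the active SM GUID, priority and state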

Enable the SM on the Managed Switch

  1. Log into the switch and enter the following configuration commands (ib-sw01 is our switch name):

    IB switch configuration
    NVIDIA MLNX-OS Switch Management
    
    switch login: admin
    Password: 
     
    ib-sw01 [standalone: master] > enable 
    ib-sw01 [standalone: master] # configure terminal
    ib-sw01 [standalone: master] (config) # ib smnode ib-sw01 enable 
    ib-sw01 [standalone: master] (config) # ib smnode ib-sw01 sm-priority 0
    
    ib-sw01 [standalone: master] (config) # ib sm virt enable
    ib-sw01 [standalone: master] (config) # write memory
    ib-sw01 [standalone: master] (config) # reload  
  2. Once the switch reboots, check the switch configuration. It should look like the following:

    Switch config example
    NVIDIA MLNX-OS Switch Management
    
    switch login: admin
    Password: 
    
    ib-sw01 [standalone: master] > enable 
    ib-sw01 [standalone: master] # configure terminal
    ib-sw01 [standalone: master] (config) # show running-config 
    ##
    ## Running database "initial"
    ## Generated at 2022/11/16 17:40:41 +0000
    ## Hostname: ib-sw01
    ## Product release: 3.9.1600
    ##
    
    ##
    ## Running-config temporary prefix mode setting
    ##
    no cli default prefix-modes enable
    
    ##
    ## Subnet Manager configuration
    ##
       ib sm virt enable
    
    ##
    ## Other IP configuration
    ##
       hostname ib-sw01
    
    ##
    ## Other IPv6 configuration
    ##
    no ipv6 enable
    
    ##
    ## Local user account configuration
    ##
       username admin password 7 $6$6GZ8Q0RF$FZW9pc23JJkwwOJTq85xZe1BJgqQV/m6APQNPkagZlTEUgKMWLr5X3Jq2hsUyB.K5nrGdDNUaSLiK2xupnIJo1
       username monitor password 7 $6$z1.r4Kl7$TIwaNf7uXNxZ9UdGdUpOO9kVug0shRqGtu75s3dSrY/wY1v1mGjrqQLNPHvHYh5HAhVuUz5wKzD6H/beYeEqL.
    
    ##
    ## AAA remote server configuration
    ##
    # ldap bind-password ********
    # radius-server key ********
    # tacacs-server key ********
    
    ##
    ## Network management configuration
    ##
    # web proxy auth basic password ********
    
    ##
    ## X.509 certificates configuration
    ##
    #
    # Certificate name system-self-signed, ID 12d0989d8623825b71bc25f9bc02de813fc9fe2a
    # (public-cert config omitted since private-key config is hidden)
    
    
    ##
    ## IB nodename to GUID mapping
    ##
       ib smnode ib-sw01 create
       ib smnode ib-sw01 enable
       ib smnode ib-sw01 sm-priority 0
    ##
    ## Persistent prefix mode setting
    ##
    cli default prefix-modes enable

UFM Deployment and Configuration

NVIDIA® UFM® is a host-based solution, providing all management functionality required for managing InfiniBand fabrics.

In this guide, UFM is deployed as a stand-alone UFM Enterprise application container. A High Availability deployment is possible but is not covered here.

  • For the UFM Enterprise User Manual, refer to this link.
  • For the UFM Enterprise Docker Container Installation Guide, refer to this link.
  • Using the NVIDIA UFM Enterprise Software requires a license. To download the UFM software and license, please visit NVIDIA's Licensing Portal.
  • If you do not have a valid license, please fill out the NVIDIA Enterprise Account Registration form to get a UFM evaluation license.


UFM Node OS

  1. Install the OS on the UFM Node (in this solution we have used Ubuntu 18.04 OS).

  2. Install the NVIDIA MLNX_OFED network drivers. For further information refer to this link.
  3. Install and enable Docker service—Ubuntu Docker Installation.

  4. Use the "ibstat" command to make sure that the UFM Node is connected to the InfiniBand Fabric, and the link is up. 
  5. Make sure that the UFM Node is connected to the OpenShift Management network and allocate an IP Address. In our example we have assigned IP 192.168.77.223 to this node.
  6. Set a dummy IP address on the InfiniBand ib0 interface, and make sure it is in the "up" state. This step is a prerequisite for UFM application installation.

    ib0 is the default fabric interface used by the UFM installer. If you have connected ib1 to the InfiniBand fabric, make sure to specify the interface during UFM installer execution.

  7. Make sure that access to the Internet is available, as it will be used to pull the UFM application container. It is also possible to use local images without Internet connectivity.
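
The minimal sketch below illustrates steps 4-6 on the UFM node; the dummy address 192.168.100.1/24 is an arbitrary assumption and any unused address can be used instead.

ibstat | grep -E "State|Rate"          # expect "State: Active" on the fabric-facing port
ip addr add 192.168.100.1/24 dev ib0   # dummy IP address required by the UFM installer
ip link set ib0 up
ip -br addr show ib0                   # confirm that ib0 is UP with the address assigned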


UFM Stand-Alone Installation

  1. Create a directory on the host to mount and sync the UFM Enterprise files with read/write permissions. For example: /opt/ufm/files/.
  2. Copy only your UFM license file(s) to a temporary directory which we are going to use in the installation command. For example: /tmp/license_file/

  3. Run the UFM installation command according to the following example, which will also configure UFM fabric interface to be ib0:

    docker run -it --name=ufm_installer --rm \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /etc/systemd/system/:/etc/systemd_files/ \
    -v /opt/ufm/files/:/installation/ufm_files/ \
    -v /tmp/license_file/:/installation/ufm_licenses/ \
    mellanox/ufm-enterprise:latest \
    --install \
    --fabric-interface ib0

    The following paths in the command can be updated per your needs:

    • /opt/ufm/files/
    • /tmp/license_file/

    For example, if you want the UFM files to be mounted in another location on your server, create that directory and replace the path in the command.
  4. Reload the system:

    systemctl daemon-reload
  5. Configure general settings in the /opt/ufm/files/conf/gv.cfg file to enable Virtualization.

    This allows supporting virtual ports in UFM. 

    [Virtualization]
    # By enabling this flag, UFM will discover all the virtual ports assigned for all hypervisors in the fabric
    enable = true
    # Interval for checking whether any virtual ports were changed in the fabric
    interval = 60

    This step is required and cannot be skipped!


  6. To start the UFM Enterprise service, run:

    systemctl start ufm-enterprise
  7. Connect from any web browser to the UFM WebUI. Use the following URL - https://192.168.77.223/ufm/

    Default Login Credentials: admin/123456
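
Before moving on, it can be useful to confirm that the UFM Enterprise service and its container are running. The sketch below assumes the service name from step 6 and a container name that includes "ufm"; adjust it to your environment.

systemctl status ufm-enterprise --no-pager
docker ps --filter "name=ufm" --format "{{.Names}}: {{.Status}}"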


OCP Cluster Installation

General Prerequisites 

DHCP and DNS services are required for OCP installation.

The OCP cluster installation requires two VIP addresses:

  • The API address is used to access the cluster API.
  • The Ingress address is used for cluster ingress traffic.

These IP addresses must be provided to the installation program when installing the OpenShift Container Platform cluster.
In our environment, we use the following IPs and DNS records:

Kubernetes API
    IP: 192.168.77.211
    Record: api.ocp4.clx.labs.mlnx
    Description: A DNS A/AAAA or CNAME record and a DNS PTR record, to identify the API load balancer. These records must be resolvable by both clients external to the cluster and from all the nodes within the cluster.

Ingress
    IP: 192.168.77.212
    Record: *.apps.ocp4.clx.labs.mlnx
    Description: A wildcard DNS A/AAAA or CNAME record that refers to the application ingress load balancer. The application ingress load balancer targets the machines that run the Ingress Controller PODs, which run on the compute machines by default. These records must be resolvable by both clients external to the cluster and from all the nodes within the cluster. For example, console-openshift-console.apps.ocp4.clx.labs.mlnx is used as a wildcard route to the OCP console.

OpenShift Container Platform nodes in the cluster should have access to an NTP server. OpenShift Container Platform nodes use NTP to synchronize their clocks.
NTP synchronization servers can be configured as a specific option of the DHCP service.

DNS/DHCP network services are part of the IT infrastructure. The components installation procedure and configuration are not covered in this guide. For more information, see Networking Requirements for User-provisioned Infrastructure.
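
As a quick sanity check from the Jump Node, the API and Ingress records listed above can be resolved before starting the installation. This is a minimal sketch and assumes the dig utility (bind-utils package) is available:

dig +short api.ocp4.clx.labs.mlnx                               # expect 192.168.77.211
dig +short console-openshift-console.apps.ocp4.clx.labs.mlnx    # expect 192.168.77.212 (wildcard record)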


Jump Node Prerequisites 

In this guide, a Jump Node is used for CLI and GUI access to an OCP cluster.

A standard X86 server has been used for the Jump Node. AlmaLinux OS 8.6 is installed on the server in a "Server with GUI" mode.

Generating an SSH Public Key for Discovery ISO

"Discovery ISO" is part of the OCP Assisted installed procedure. Instructions for the generation of "Discovery ISO" are provided below.
To generate a key, please run the following command on the Jump Node:

[root@jump-node ~]# ssh-keygen

For all prompts, press ENTER to use the default values. The public key will be saved to /root/.ssh/id_rsa.pub.
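
The resulting public key can be displayed and copied, as it will be pasted into the Assisted Installer later in this guide:

[root@jump-node ~]# cat /root/.ssh/id_rsa.pub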

OCP Cluster Deployment with Assisted Installer

The Assisted Installer deploys and configures the infrastructure that an OpenShift Container Platform cluster runs on, directly on bare-metal nodes.
This guide provides a methodology to achieve a successful assisted bare-metal installation.

Follow the steps outlined in the section to install an OCP cluster.

  1. Open a web browser on the Jump Node (in our case, AlmaLinux with UI is used), connect to the Red Hat Hybrid Cloud site and log into the console with your credentials.



  2. Using the left menu bar, select OpenShift:

  3. Click on Create Cluster.


  4. Select Datacenter, and click on "Bare Metal (x86_64)" in "Infrastructure provider".



  5. Select the Assisted Installer installation type.

  6. In the Cluster details window, provide a Cluster Name and a Base Domain. From the drop-down list, select the OpenShift version that you wish to install, and click on Next. Example:

  7. Add an OpenShift Data Foundation operator to use DGX server local storage.
  8. In the Host discovery window, click on Add hosts.

  9. In the Add hosts window, select the Full Image file option, paste your SSH public key into the "SSH public key" field (the key was generated on the Jump Node - /root/.ssh/id_rsa.pub), and click on Generate Discovery ISO.


  10. In the opened window, click on Download Discovery ISO.

  11. Boot all servers in your cluster from the downloaded Discovery ISO.
  12. Go back to the Red Hat Hybrid Cloud Console web page. After a few minutes, you will be able to see all your hosts in "Ready" status.
    Review the CPU, Memory and Disk configuration of the hosts.
    Select a Role for each Host. Example:

  13. In each Worker Node, please choose "Installation disk" and click Next. Example:


  14. In the Networking window, provide the Machine network, API IP and Ingress IP. Click Next. Example:

  15. Review the configuration and click on Install cluster.

  16. The OCP Installation process will start and will take a while.


  17. Once the installation process is complete, you will be able to download the kubeconfig file and obtain the Web Console URL, Username and Password. Example:


  18. In our case, we copied the kubeconfig file to the /root/.kube/config file on the Jump Node.
  19. To ensure that the OCP cluster is installed properly, verify it via a CLI or a Web Console.

    Via CLI:

    [root@jump-node ~]# oc get nodes
    NAME                          STATUS   ROLES                 AGE   VERSION
    control1.ocp4.clx.labs.mlnx   Ready    master                 1d   v1.23.5+012e945
    control2.ocp4.clx.labs.mlnx   Ready    master                 1d   v1.23.5+012e945
    control3.ocp4.clx.labs.mlnx   Ready    master                 1d   v1.23.5+012e945
    worker1.ocp4.clx.labs.mlnx    Ready    worker                 1d   v1.23.5+012e945
    worker2.ocp4.clx.labs.mlnx    Ready    worker                 1d   v1.23.5+012e945
    worker3.ocp4.clx.labs.mlnx    Ready    worker                 1d   v1.23.5+012e945
    worker4.ocp4.clx.labs.mlnx    Ready    worker                 1d   v1.23.5+012e945

    To interact with the OpenShift Container Platform from a command-line interface, the OpenShift CLI (oc) should be installed. You can install oc on Linux, Windows or macOS. For OpenShift CLI installation guide, please refer to this link.


    Via Web Console:
    Please open a web browser on the Jump Node and use the provided Web Console URL, Username and Password.

    Make sure that the Cluster Status is ok.

    Using the left menu bar, expand the Compute section and select the Nodes Status.

    By clicking on each node, you can see the Node status with detailed information. Example:

    Additional status information is available via the OpenShift CLI. Example:

    [root@jump-node ~]# oc get nodes -o wide
    NAME                          STATUS   ROLES                 AGE   VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
    control1.ocp4.clx.labs.mlnx   Ready    master                1d    v1.23.5+012e945   192.168.77.11   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    control2.ocp4.clx.labs.mlnx   Ready    master                1d    v1.23.5+012e945   192.168.77.12   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    control3.ocp4.clx.labs.mlnx   Ready    master                1d    v1.23.5+012e945   192.168.77.13   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    worker1.ocp4.clx.labs.mlnx    Ready    worker                1d    v1.23.5+012e945   192.168.77.21   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    worker2.ocp4.clx.labs.mlnx    Ready    worker                1d    v1.23.5+012e945   192.168.77.22   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    worker3.ocp4.clx.labs.mlnx    Ready    worker                1d    v1.23.5+012e945   192.168.77.23   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    worker4.ocp4.clx.labs.mlnx    Ready    worker                1d    v1.23.5+012e945   192.168.77.24   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    [root@jump-node ~]# oc get co
    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.10.30   True        False         False      20m
    baremetal                                  4.10.30   True        False         False      1d 
    cloud-controller-manager                   4.10.30   True        False         False      1d 
    cloud-credential                           4.10.30   True        False         False      1d 
    cluster-autoscaler                         4.10.30   True        False         False      1d 
    config-operator                            4.10.30   True        False         False      1d 
    console                                    4.10.30   True        False         False      1d
    csi-snapshot-controller                    4.10.30   True        False         False      1d 
    dns                                        4.10.30   True        False         False      1d 
    etcd                                       4.10.30   True        False         False      1d 
    image-registry                             4.10.30   True        False         False      1d 
    ingress                                    4.10.30   True        False         False      1d 
    insights                                   4.10.30   True        False         False      1d 
    kube-apiserver                             4.10.30   True        False         False      1d 
    kube-controller-manager                    4.10.30   True        False         False      1d 
    kube-scheduler                             4.10.30   True        False         False      1d 
    kube-storage-version-migrator              4.10.30   True        False         False      1d 
    machine-api                                4.10.30   True        False         False      1d 
    machine-approver                           4.10.30   True        False         False      1d 
    machine-config                             4.10.30   True        False         False      1d 
    marketplace                                4.10.30   True        False         False      1d 
    monitoring                                 4.10.30   True        False         False      1d 
    network                                    4.10.30   True        False         False      1d 
    node-tuning                                4.10.30   True        False         False      1d 
    openshift-apiserver                        4.10.30   True        False         False      1d
    openshift-controller-manager               4.10.30   True        False         False      1d
    openshift-samples                          4.10.30   True        False         False      1d 
    operator-lifecycle-manager                 4.10.30   True        False         False      1d 
    operator-lifecycle-manager-catalog         4.10.30   True        False         False      1d 
    operator-lifecycle-manager-packageserver   4.10.30   True        False         False      1d 
    service-ca                                 4.10.30   True        False         False      1d 
    storage                                    4.10.30   True        False         False      1d 



Post-installation Configuration

In the OpenShift 4.x environment, each running container will be limited to the default maximum PID value of 1024. To properly run an AI application on the OCP cluster, more than 1024 processes are required within a single container.
The cluster administrator must raise the default maximum PID value to a higher number - 4096 in this guide. This can be done as part of "Day 2 operation for OCP". For additional information about post-installation configuration, please refer to "Day 2 operation for OCP".

  1. Create ContainerRuntimeConfig custom resource in order to configure the cri-o pidsLimit - mco-pidup.yaml:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: ContainerRuntimeConfig
    metadata:
      name: 01-worker-scale-increase-pid-limit
    spec:
      containerRuntimeConfig:
        pidsLimit: 4096
      machineConfigPoolSelector:
        matchLabels:
          pools.operator.machineconfiguration.openshift.io/worker: ""
  2. Apply the following configuration:

    oc create -f mco-pidup.yaml
  3. Please verify by checking that the latest rendered-worker machine-config has been rolled out to the pools successfully:

    [root@jump-node ~]# oc get mcp
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-dc4c25c725418932d2678b0a174057b6   True      False      False      3              3                   3                     0                      1d
    worker   rendered-worker-0f6b49419faed3fb46a74259d570896f   True      False      False      4              4                   4                     0                      1d
  4. Once all Worker Nodes are rebooted, you may login and confirm the current setting:

    [root@jump-node ~]# oc debug node/worker1.ocp4.clx.labs.mlnx
    Starting pod/worker1ocp4clxlabsmlnx-debug ...
    To use host binaries, run `chroot /host`
    Pod IP: 192.168.77.21
    If you don't see a command prompt, try pressing enter.
    sh-4.4# chroot /host
    sh-4.4# cat /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
    [crio]
      [crio.runtime]
        pids_limit = 4096


Installing OpenShift Operators

To run AI applications on the OCP cluster, the following operators are required:

  • Node Feature Discovery Operator (installed as part of the initial cluster deployment)
  • Local Storage (installed as part of the initial cluster deployment)
  • OpenShift Data Foundation (installed as part of the initial cluster deployment)
  • NVIDIA Network Operator
  • NVIDIA GPU Operator
  • SR-IOV Network Operator

For additional information about adding Operators to a cluster, see the Red Hat OpenShift Container Platform documentation.


Installing NVIDIA Network Operator

The NVIDIA Network Operator must be installed on the OCP cluster to enable the compilation and installation of the GPUDirect RDMA module as part of the NVIDIA GPU Operator.

For the NVIDIA Network Operator Installation Guide for OCP Cluster, please refer to this link.

A cluster-wide entitlement must be applied before installing the NVIDIA Network Operator. This step is described here.

NVIDIA Network Operator installation and configuration steps via Web Console:

  1. Expand the Operators section in the left menu bar and select the OperatorHub.
  2. Search for "NVIDIA" in the search bar. Two items should appear as results.
  3. Select the operator that is tagged as "NVIDIA Network Operator". This is the version supported by NVIDIA.

  4. In the opened pop-up window, click "Install".
  5. Following the completed installation, in the left menu bar go to the Operators section and click on "Installed Operators". Then, select the "NVIDIA Network Operator".
  6. On the NVIDIA Network Operator details screen click on "Create instance" in the "NicClusterPolicy" section.
  7. In the "NicClusterPolicy" tab, set the desired values in the "Ofed Driver" section, or leave the default values. In the "RDMA Shared Device Plugin" section please remove all values in the following subsections: "Config", "Image", "Repository" and "Version", then click on the "Create" button.

    Applying the "NicClusterPolicy" is dependent on the server platform hardware configuration and may take some time. 


To ensure that the NVIDIA Network Operator is deployed properly, please run the following commands:

[root@jump-node ~]# oc -n nvidia-network-operator get pod -o wide
NAME                                                          READY   STATUS    RESTARTS      AGE   IP              NODE                         NOMINATED NODE   READINESS GATES
mofed-rhcos4.10-ds-gwt6w                                      1/1     Running   0             48m   192.168.77.21   worker1.ocp4.clx.labs.mlnx   <none>           <none>
mofed-rhcos4.10-ds-qrn2b                                      1/1     Running   0             48m   192.168.77.24   worker4.ocp4.clx.labs.mlnx   <none>           <none>
mofed-rhcos4.10-ds-tj695                                      1/1     Running   0             48m   192.168.77.23   worker3.ocp4.clx.labs.mlnx   <none>           <none>
mofed-rhcos4.10-ds-z742t                                      1/1     Running   0             48m   192.168.77.22   worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-network-operator-controller-manager-86bdf7bdd5-tc5s5   2/2     Running   7 (41m ago)   72m   10.129.2.35     worker1.ocp4.clx.labs.mlnx   <none>           <none>


Installing the NVIDIA GPU Operator

The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, automatic node labelling using GFD (GPU Feature Discovery), DCGM-based monitoring, and others.

NVIDIA GPU Operator installation and configuration steps via Web Console:

  1. In the OpenShift Container Platform web console from the side menu, navigate to Operators > OperatorHub, and select All Projects.
  2. In Operators > OperatorHub, search for the NVIDIA GPU Operator
  3. Select the NVIDIA GPU Operator and click Install in the subsequent screen.
  4. Once the installation is completed, in the left menu bar go to the Operators section, and click on "Installed Operators", then select the "NVIDIA GPU Operator".
  5. On the NVIDIA GPU Operator details screen, click on "Create instance" in the "ClusterPolicy" section.
  6. In the "ClusterPolicy" tab, please open the subsection "NVIDIA GPU/VGPU Driver Config" and scroll down. Mark the checkbox "Enabled" in the "RDMA" subsection and click on the "Create" button.


For deep-dive step-by-step installation instructions, please visit GPU Operator on OpenShift.

To ensure that the NVIDIA GPU Operator is deployed properly, please run the following commands:

GPU Operator deployment pod's status
[root@jump-node ~]# oc -n nvidia-gpu-operator get pod -o wide
NAME                                                  READY   STATUS      RESTARTS      AGE    IP              NODE                         NOMINATED NODE   READINESS GATES
gpu-feature-discovery-2dzjv                           1/1     Running     0             39m    10.130.0.36     worker3.ocp4.clx.labs.mlnx   <none>           <none>
gpu-feature-discovery-lk4jz                           1/1     Running     0             165m   10.130.2.3      worker2.ocp4.clx.labs.mlnx   <none>           <none>
gpu-feature-discovery-ndpbw                           1/1     Running     0             165m   10.131.0.14     worker4.ocp4.clx.labs.mlnx   <none>           <none>
gpu-feature-discovery-wl5fs                           1/1     Running     0             165m   10.129.2.15     worker1.ocp4.clx.labs.mlnx   <none>           <none>
gpu-operator-66bd74b4f4-q2c9q                         1/1     Running     3 (42m ago)   62m    10.129.2.48     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-container-toolkit-daemonset-5vjrk              1/1     Running     0             165m   10.130.2.9      worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-container-toolkit-daemonset-jh2xf              1/1     Running     0             165m   10.129.2.12     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-container-toolkit-daemonset-s4pbf              1/1     Running     0             165m   10.131.0.2      worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-container-toolkit-daemonset-tdkns              1/1     Running     0             39m    10.130.0.37     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-cuda-validator-fqbxg                           0/1     Completed   0             37m    10.131.0.36     worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-cuda-validator-kdbbh                           0/1     Completed   0             35m    10.130.0.44     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-cuda-validator-nv7vz                           0/1     Completed   0             36m    10.129.2.68     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-27ghv                                     1/1     Running     0             39m    192.168.77.23   worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-2tz55                                     1/1     Running     0             165m   192.168.77.22   worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-exporter-hkzxw                            1/1     Running     0             165m   192.168.77.21   worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-exporter-kk4lp                            1/1     Running     0             165m   192.168.77.24   worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-exporter-nk5j7                            1/1     Running     0             165m   192.168.77.22   worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-exporter-vgjr6                            1/1     Running     0             39m    192.168.77.23   worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-nqj7s                                     1/1     Running     0             165m   192.168.77.24   worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-tlf2p                                     1/1     Running     0             165m   192.168.77.21   worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-daemonset-7cvps                  1/1     Running     0             165m   10.129.2.16     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-daemonset-bjjdv                  1/1     Running     0             165m   10.131.0.10     worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-daemonset-qndb6                  1/1     Running     0             165m   10.130.2.14     worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-daemonset-t7t5t                  1/1     Running     0             39m    10.130.0.39     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-validator-gphhw                  0/1     Completed   0             36m    10.131.0.39     worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-validator-mqc82                  0/1     Completed   0             34m    10.130.0.45     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-validator-vw92q                  0/1     Completed   0             35m    10.129.2.69     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-driver-daemonset-410.84.202208161501-0-r6dwb   3/3     Running     4 (38m ago)   165m   10.131.0.11     worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-driver-daemonset-410.84.202208161501-0-r9gzw   3/3     Running     3 (36m ago)   165m   10.130.0.11     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-driver-daemonset-410.84.202208161501-0-vgwrz   3/3     Running     4 (37m ago)   165m   10.129.2.8      worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-driver-daemonset-410.84.202208161501-0-zzb58   3/3     Running     2 (38m ago)   165m   10.130.2.12     worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-mig-manager-4grl2                              1/1     Running     0             39m    10.130.0.40     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-mig-manager-nhbh5                              1/1     Running     0             165m   10.129.2.10     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-mig-manager-v852l                              1/1     Running     0             165m   10.130.2.10     worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-mig-manager-xkjxf                              1/1     Running     0             165m   10.131.0.6      worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-node-status-exporter-pmtfw                     1/1     Running     1             165m   10.130.2.7      worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-node-status-exporter-r7m4j                     1/1     Running     3             165m   10.130.0.3      worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-node-status-exporter-szzsg                     1/1     Running     1             165m   10.131.0.8      worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-node-status-exporter-zwgld                     1/1     Running     2             165m   10.129.2.4      worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-operator-validator-52z9h                       1/1     Running     0             39m    10.130.0.38     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-operator-validator-7gnwz                       1/1     Running     0             41m    10.131.0.24     worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-operator-validator-ndhll                       1/1     Running     0             41m    10.130.2.33     worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-operator-validator-zvsgc                       1/1     Running     0             40m    10.129.2.58     worker1.ocp4.clx.labs.mlnx   <none>           <none>


Installing the SR-IOV Network Operator

The SR-IOV Network Operator is generally responsible for configuring the SR-IOV components in an OpenShift cluster.

SR-IOV Network Operator installation and configuration steps via Web Console and CLI:

  1. In the OpenShift Container Platform web console from the side menu, navigate to Operators > OperatorHub, and select All Projects.
  2. In Operators > OperatorHub, search for the SR-IOV Network Operator.
  3. Select the SR-IOV Network Operator and click Install on the subsequent screen.
  4. To configure the InfiniBand network, the following components should be created: "SR-IOV Network Node Policies" and "SriovIBNetworks". Both components are created via YAML configuration files and applied via the CLI.
  5. To configure the "SR-IOV Network Node Policies", use policy.yaml:

    Multi policy configuration YAML file - policy.yaml
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib0
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib0
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib0
      isRdma: true
      linkType: ib
      
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib1
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib1
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib1
      isRdma: true
      linkType: ib
      
    ---  
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib2
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib2
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib2
      isRdma: true
      linkType: ib
      
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib3
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib3
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib3
      isRdma: true
      linkType: ib
      
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib4
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib4
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib4
      isRdma: true
      linkType: ib
      
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib5
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib5
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib5
      isRdma: true
      linkType: ib
      
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib6
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib6
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib6
      isRdma: true
      linkType: ib
      
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib7
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib7
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib7
      isRdma: true
      linkType: ib
    
  6. Deploy policy.yaml:

    oc apply -f policy.yaml

    This step may take a while to complete, depending on the number of Worker Nodes to which the configuration is applied and on the number of VFs for each selected network interface. A verification sketch is shown after step 8 below.

  7. To configure the "SriovIBNetworks", ib-net.yaml has been used:

    Multi SriovIBNetwork configuration YAML file - ib-net.yaml
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib0
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: "ib0"
      networkNamespace: default
      linkState: enable
      ipam: |-
        {
            "type": "whereabouts",
            "range": "192.168.0.0/24"
        }
    
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib1
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: "ib1"
      networkNamespace: default
      linkState: enable
      ipam: |-
        {
            "type": "whereabouts",
            "range": "192.168.1.0/24"
        }
    
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib2
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib2
      networkNamespace: default
      linkState: enable
      ipam: |-
        {
            "type": "whereabouts",
            "range": "192.168.2.0/24"
        }
    
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib3
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib3
      networkNamespace: default
      linkState: auto
      ipam: |-
        {
            "type": "whereabouts",
            "range": "192.168.3.0/24"
        }
    
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib4
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib4
      networkNamespace: default
      linkState: auto
      ipam: |-
        {
            "type": "whereabouts",
            "range": "192.168.4.0/24"
        }
    
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib5
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib5
      networkNamespace: default
      linkState: auto
      ipam: |-
        {
            "type": "whereabouts",
            "range": "192.168.5.0/24"
        }
    
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib6
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib6
      networkNamespace: default
      linkState: auto
      ipam: |-
        {
            "type": "whereabouts",
            "range": "192.168.6.0/24"
        }
    
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib7
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib7
      networkNamespace: default
      linkState: auto
      ipam: |-
        {
            "type": "whereabouts",
            "range": "192.168.7.0/24"
        }
  8. Deploy ib-net.yaml:

    oc apply -f ib-net.yaml
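
As noted in step 6, the SR-IOV Network Operator reconfigures the selected interfaces on the Worker Nodes in the background. Its progress can be followed through the SriovNetworkNodeState objects; a minimal sketch (the syncStatus field should report "Succeeded" once a node is fully configured):

[root@jump-node ~]# oc -n openshift-sriov-network-operator get sriovnetworknodestates -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.syncStatus}{"\n"}{end}'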


Validating the Operators Deployment

  1. Check the deployed InfiniBand networks:

    oc get network-attachment-definitions.k8s.cni.cncf.io
    NAME      AGE
    net-ib0   1d
    net-ib1   1d
    net-ib2   1d
    net-ib3   1d
    net-ib4   1d
    net-ib5   1d
    net-ib6   1d
    net-ib7   1d
  2. Check the Worker Node resources:

    oc get node worker1.ocp4.clx.labs.mlnx -o json | jq '.status.allocatable'
    {
      "cpu": "255500m",
      "ephemeral-storage": "1727851483143",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "1054987908Ki",
      "nvidia.com/gpu": "8",
      "openshift.io/ib0": "8",
      "openshift.io/ib1": "8",
      "openshift.io/ib2": "8",
      "openshift.io/ib3": "8",
      "openshift.io/ib4": "8",
      "openshift.io/ib5": "8",
      "openshift.io/ib6": "8",
      "openshift.io/ib7": "8",
      "pods": "250"
    }
    
    oc get node worker2.ocp4.clx.labs.mlnx -o json | jq '.status.allocatable'
    {
      "cpu": "255500m",
      "ephemeral-storage": "1727851483143",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "1054987908Ki",
      "nvidia.com/gpu": "8",
      "openshift.io/ib0": "8",
      "openshift.io/ib1": "8",
      "openshift.io/ib2": "8",
      "openshift.io/ib3": "8",
      "openshift.io/ib4": "8",
      "openshift.io/ib5": "8",
      "openshift.io/ib6": "8",
      "openshift.io/ib7": "8",
      "pods": "250"
    }
    
    oc get node worker3.ocp4.clx.labs.mlnx -o json | jq '.status.allocatable'
    {
      "cpu": "255500m",
      "ephemeral-storage": "1727851483143",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "1054987908Ki",
      "nvidia.com/gpu": "8",
      "openshift.io/ib0": "8",
      "openshift.io/ib1": "8",
      "openshift.io/ib2": "8",
      "openshift.io/ib3": "8",
      "openshift.io/ib4": "8",
      "openshift.io/ib5": "8",
      "openshift.io/ib6": "8",
      "openshift.io/ib7": "8",
      "pods": "250"
    }
    
    oc get node worker4.ocp4.clx.labs.mlnx -o json | jq '.status.allocatable'
    {
      "cpu": "255500m",
      "ephemeral-storage": "1727851483143",
      "hugepages-1Gi": "0",
      "hugepages-2Mi": "0",
      "memory": "1054987908Ki",
      "nvidia.com/gpu": "8",
      "openshift.io/ib0": "8",
      "openshift.io/ib1": "8",
      "openshift.io/ib2": "8",
      "openshift.io/ib3": "8",
      "openshift.io/ib4": "8",
      "openshift.io/ib5": "8",
      "openshift.io/ib6": "8",
      "openshift.io/ib7": "8",
      "pods": "250"
    }
    
    


Synthetic RDMA Benchmark

Run synthetic RDMA benchmark tests with ib_write_bw between two pods that are running on different Worker Nodes.

This step includes the following:

  • Create a container image and push it to your repository
  • Deploy test deployment apps 
  • Run test
  1. Create a container image from Dockerfile:

    FROM ubuntu:20.04
    # Ubuntu 20.04 docker container with inbox Mellanox drivers
    
    # LABEL about the custom image
    LABEL maintainer=vitaliyra@nvidia.com
    LABEL description="This is custom Container Image with inbox perftest package."
    
    WORKDIR /tmp/
    ENV DEBIAN_FRONTEND=noninteractive
    RUN apt-get clean -y && apt-get -y update && apt-get install -y apt-utils udev vim bash && apt-get -y upgrade
    RUN apt-get install -y iproute2 rdma-core libibmad5 ibutils ibverbs-utils infiniband-diags perftest \
                mstflint strace iputils-ping
    RUN ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime
    RUN dpkg-reconfigure --frontend noninteractive tzdata && apt-get clean all -y
    CMD bash

    Please use your favorite container building tools (docker, podman, etc.) to create a container image from Dockerfile for use in the below deployment.

    After creating the image, push it to the container registry.
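
    For example, with podman (the registry and image names below are placeholders for your own repository):

    podman build -t registry.example.com/lab/inbox-perftest:latest .
    podman push registry.example.com/lab/inbox-perftest:latest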


  2. Create a sample deployment test-deployment.yaml (the container image should include InfiniBand userspace drivers and performance tools):

    test-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mlnx-inbox-pod
      labels:
        app: sriov
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: sriov
      template:
        metadata:
          labels:
            app: sriov
          annotations:
            k8s.v1.cni.cncf.io/networks: net-ib0
        spec:
          containers:
          - image: < Container image >
            name: mlnx-inbox-ctr
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              requests:
                openshift.io/ib0: 1
              limits:
                openshift.io/ib0: 1
            command:
            - sh
            - -c
            - sleep inf
  3. Deploy the sample deployment.

    oc apply -f test-deployment.yaml
    deployment.apps/mlnx-inbox-pod created  
    
    oc get pod -o wide
    NAME                              READY   STATUS      RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
    mlnx-inbox-pod-6948fd6d54-9s66q   1/1     Running     0          36s   10.130.2.231   worker2.ocp4.clx.labs.mlnx   <none>           <none>
    mlnx-inbox-pod-6948fd6d54-t4hpx   1/1     Running     0          37s   10.131.1.33    worker4.ocp4.clx.labs.mlnx   <none>           <none>
  4. Check available network interfaces in each POD.

    ## First POD 
    
    oc exec -it mlnx-inbox-pod-6948fd6d54-9s66q -- bash
    root@mlnx-inbox-pod-6948fd6d54-9s66q:/tmp# ip a s
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host 
           valid_lft forever preferred_lft forever
    3: eth0@if645: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
        link/ether 0a:58:0a:82:02:e7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet 10.130.2.231/23 brd 10.130.3.255 scope global eth0
           valid_lft forever preferred_lft forever
        inet6 fe80::c4e:5fff:feb1:c036/64 scope link 
           valid_lft forever preferred_lft forever
    430: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
        link/infiniband 00:00:00:88:fe:80:00:00:00:00:00:00:39:71:61:ca:7b:e2:c4:3b brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
        inet 192.168.0.2/24 brd 192.168.0.255 scope global net1
           valid_lft forever preferred_lft forever
        inet6 fe80::3b71:61ca:7be2:c43b/64 scope link 
           valid_lft forever preferred_lft forever
    root@mlnx-inbox-pod-6948fd6d54-9s66q:/tmp# ibv_devinfo 
    hca_id:	mlx5_29
    	transport:			InfiniBand (0)
    	fw_ver:				20.31.2006
    	node_guid:			3971:61ca:7be2:c43b
    	sys_image_guid:			043f:7203:009c:6800
    	vendor_id:			0x02c9
    	vendor_part_id:			4124
    	hw_ver:				0x0
    	board_id:			MT_0000000223
    	phys_port_cnt:			1
    		port:	1
    			state:			PORT_ACTIVE (4)
    			max_mtu:		4096 (5)
    			active_mtu:		4096 (5)
    			sm_lid:			1
    			port_lid:		376
    			port_lmc:		0x00
    			link_layer:		InfiniBand
    
    
    
    ## Second POD 
    
    oc exec -it mlnx-inbox-pod-6948fd6d54-t4hpx -- bash
    root@mlnx-inbox-pod-6948fd6d54-t4hpx:/tmp# ibv_devinfo 
    hca_id:	mlx5_11
    	transport:			InfiniBand (0)
    	fw_ver:				20.31.2006
    	node_guid:			19aa:44e1:73a8:70c1
    	sys_image_guid:			043f:7203:00c0:017e
    	vendor_id:			0x02c9
    	vendor_part_id:			4124
    	hw_ver:				0x0
    	board_id:			MT_0000000223
    	phys_port_cnt:			1
    		port:	1
    			state:			PORT_ACTIVE (4)
    			max_mtu:		4096 (5)
    			active_mtu:		4096 (5)
    			sm_lid:			1
    			port_lid:		474
    			port_lmc:		0x00
    			link_layer:		InfiniBand
  5. Run synthetic RDMA benchmark tests.

    Server

    ib_write_bw   -F -d $IB_DEV_NAME --report_gbits

    Client

    ib_write_bw  -D 20 -F $SERVER_IP -d $IB_DEV_NAME --report_gbits

    Open a console session to each POD - one for the server side and one for the client side.
    On the first console (on the server side), run the following commands:

    oc exec -it mlnx-inbox-pod-6948fd6d54-9s66q -- bash
    root@mlnx-inbox-pod-6948fd6d54-9s66q:/tmp# ib_write_bw -d mlx5_29  -F --report_gbits
    
    ************************************
    * Waiting for client to connect... *
    ************************************
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF		Device         : mlx5_29
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     CQ Moderation   : 1
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x178 QPN 0x007d PSN 0x1347c5 RKey 0x01053c VAddr 0x007f287d861000
     remote address: LID 0x1da QPN 0x00bd PSN 0x1347c5 RKey 0x02053c VAddr 0x007fd8c7404000
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
     65536      3728317          0.00               195.47 		   0.372828
    ---------------------------------------------------------------------------------------

    On the second console (on the client side), run the following commands:

    oc exec -it mlnx-inbox-pod-6948fd6d54-t4hpx -- bash
    root@mlnx-inbox-pod-6948fd6d54-t4hpx:/tmp# ib_write_bw -d mlx5_11  -F 10.130.2.231 --report_gbits -D 20
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF		Device         : mlx5_11
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     TX depth        : 128
     CQ Moderation   : 1
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x1da QPN 0x00bd PSN 0x1347c5 RKey 0x02053c VAddr 0x007fd8c7404000
     remote address: LID 0x178 QPN 0x007d PSN 0x1347c5 RKey 0x01053c VAddr 0x007f287d861000
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
     65536      3728317          0.00               195.47 		   0.372828
    ---------------------------------------------------------------------------------------

    The synthetic RDMA benchmark with ib_write_bw achieved ~195 Gb/s, which is close to the expected line rate for HDR InfiniBand (200 Gb/s).



Kubeflow Training Operator

Kubeflow is a machine learning toolkit for Kubernetes.

The Kubeflow training operators are a group of Kubernetes operators, part of Kubeflow, that add support for distributed training of machine learning models using different frameworks.

The training operator provides Kubernetes CRs that make it easier to run distributed or non-distributed TensorFlow/PyTorch/Apache MXNet/XGBoost/MPI jobs on Kubernetes.

In the example below we deploy the latest stable release of the Kubeflow training operators:

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
namespace/kubeflow created
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/mxjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/pytorchjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/xgboostjobs.kubeflow.org created
serviceaccount/training-operator created
clusterrole.rbac.authorization.k8s.io/training-operator created
clusterrolebinding.rbac.authorization.k8s.io/training-operator created
service/training-operator created
deployment.apps/training-operator created
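
To verify that the training operator is up and running, you can check the pods in the kubeflow namespace (pod names will differ in your environment):

oc get pods -n kubeflow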

Configure the namespace to allow the default service account to run pods as root:

# oc new-project $MY_PROJECT
# oc adm policy add-scc-to-user privileged -z default   # run from the $MY_PROJECT namespace
# oc adm policy add-scc-to-user anyuid -z default       # run from the $MY_PROJECT namespace


Appendix

Job Testing Results

Below are the Dockerfile and MPIJob examples with different network configurations.

Dockerfile

Dockerfile example for the MPIJob container image:

FROM nvcr.io/nvidia/tensorflow:22.08-tf2-py3
RUN apt-get update && apt-get install -y --no-install-recommends openssh-client openssh-server && \
    mkdir -p /var/run/sshd

# Allow OpenSSH to talk to containers without asking for confirmation
# by disabling StrictHostKeyChecking.
# mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
# to disable UserKnownHostsFile to avoid write permissions.
# Disabling StrictModes avoids directory and files read permission checks.

RUN sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
    echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config

RUN mkdir /tensorflow
WORKDIR "/tensorflow"
RUN git clone https://github.com/tensorflow/benchmarks
WORKDIR "/tensorflow/benchmarks"

CMD ["/bin/bash"] 

This Dockerfile is based on the TensorFlow NGC Container image. The TensorFlow NGC Container is optimized for GPU acceleration, and contains a validated set of libraries that enable and optimize GPU performance. This container may also contain modifications to the TensorFlow source code in order to maximize performance and compatibility. It also contains software for accelerating ETL (DALI, RAPIDS), training (cuDNN, NCCL) and inference (TensorRT) workloads.

For supported versions, see the Framework Containers Support Matrix and the NVIDIA Container Toolkit Documentation.

Please use your preferred container build tool (docker, podman, etc.) to build a container image from this Dockerfile for use in the deployment below.

After creating the image, push it to the container registry.


MPIJob Examples

Below is an MPIJob example with network configuration over InfiniBand, using the OCP secondary network with GPUDirect disabled:

MPIJob with GDR disabled
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: < container image >
            name: tensorflow-benchmarks
            command:
              - mpirun
              - --allow-run-as-root
              - -np
              - "32"
              - -bind-to
              - none
              - -map-by
              - slot
              - -x
              - NCCL_DEBUG=INFO
              - -x
              - NCCL_IB_DISABLE=0
              - -x
              - NCCL_NET_GDR_LEVEL=0
              - -x
              - TF_ALLOW_IOLIBS=1
              - -x
              - LD_LIBRARY_PATH
              - -x
              - PATH
              - -mca
              - pml
              - ob1
              - -mca
              - btl
              - ^openib
              - -mca
              - btl_tcp_if_include
              - eth0
              - python
              - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              - --batch_size=64
              - --model=resnet152
              - --variable_update=horovod
              - --use_fp16=true
    Worker:
      replicas: 4
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: net-ib0,net-ib1,net-ib2,net-ib3,net-ib4,net-ib5,net-ib6,net-ib7
        spec:
          containers:
          - image: < container image>
            name: tensorflow-benchmarks
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                nvidia.com/gpu: 8
                openshift.io/ib0: 1
                openshift.io/ib1: 1
                openshift.io/ib2: 1
                openshift.io/ib3: 1
                openshift.io/ib4: 1
                openshift.io/ib5: 1
                openshift.io/ib6: 1
                openshift.io/ib7: 1


Below is an MPIJob example with network configuration over InfiniBand, using the OCP secondary network with GPUDirect enabled:

MPIJob with GDR enabled
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: < container image >
            name: tensorflow-benchmarks
            command:
              - mpirun
              - --allow-run-as-root
              - -np
              - "32"
              - -bind-to
              - none
              - -map-by
              - slot
              - -x
              - NCCL_DEBUG=INFO
              - -x
              - NCCL_IB_DISABLE=0
              - -x
              - NCCL_NET_GDR_LEVEL=2
              - -x
              - TF_ALLOW_IOLIBS=1
              - -x
              - LD_LIBRARY_PATH
              - -x
              - PATH
              - -mca
              - pml
              - ob1
              - -mca
              - btl
              - ^openib
              - -mca
              - btl_tcp_if_include
              - eth0
              - python
              - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              - --batch_size=64
              - --model=resnet152
              - --variable_update=horovod
              - --use_fp16=true
    Worker:
      replicas: 4
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: net-ib0,net-ib1,net-ib2,net-ib3,net-ib4,net-ib5,net-ib6,net-ib7
        spec:
          containers:
          - image: < container image>
            name: tensorflow-benchmarks
            securityContext:
              capabilities:
                add: [ "IPC_LOCK" ]
            resources:
              limits:
                nvidia.com/gpu: 8
                openshift.io/ib0: 1
                openshift.io/ib1: 1
                openshift.io/ib2: 1
                openshift.io/ib3: 1
                openshift.io/ib4: 1
                openshift.io/ib5: 1
                openshift.io/ib6: 1
                openshift.io/ib7: 1
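
To confirm that NCCL actually selects GPUDirect RDMA paths in the GDR-enabled run, inspect the NCCL_DEBUG=INFO output in the launcher logs. The exact wording varies by NCCL version, and the pod name below is illustrative:

oc logs tensorflow-benchmarks-launcher | grep -i "NET/IB"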


Test Results

Summary

From the tests above, it can be seen that enabling GPUDirect resulted in a ~17% performance advantage in our setup.
The number of processed images depends on the chosen model and batch size in the TF benchmark.
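
The model and batch size are controlled by the tf_cnn_benchmarks arguments passed to mpirun in the Launcher spec. For example, to test a lighter model with a larger batch size (illustrative values, not tested here), the benchmark command would become:

python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --batch_size=256 --model=resnet50 --variable_update=horovod --use_fp16=true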

The performance results listed in this document are indicative and should not be considered as formal performance targets for NVIDIA products.


Authors

Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference design guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.


Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. Neither NVIDIA Corporation nor any of its direct or indirect subsidiaries and affiliates (collectively: “NVIDIA”) make any representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Trademarks
NVIDIA, the NVIDIA logo, and Mellanox are trademarks and/or registered trademarks of NVIDIA Corporation and/or Mellanox Technologies Ltd. in the U.S. and in other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright
© 2023 NVIDIA Corporation & affiliates. All Rights Reserved.