RDG for Accelerating AI Workloads in Red Hat OCP with NVIDIA DGX A100 Servers and NVIDIA InfiniBand Fabric

Created Oct 24, 2022.

Scope

The following Reference Deployment Guide (RDG) demonstrates the deployment process of Red Hat OpenShift Container Platform (RH OCP v4.10.x) over NVIDIA DGX A100 servers and an NVIDIA HDR InfiniBand fabric for accelerated AI applications.

Abbreviations and Acronyms

| Term | Definition |
|------|------------|
| AI | Artificial Intelligence |
| CNI | Container Network Interface |
| CR | Custom Resource |
| CRD | Custom Resource Definition |
| CRI | Container Runtime Interface |
| DHCP | Dynamic Host Configuration Protocol |
| DL | Deep Learning |
| DNS | Domain Name System |
| DP | Device Plugin |
| HDR | InfiniBand High Data Rate (200Gb/s) |
| IPAM | IP Address Management |
| K8s | Kubernetes |
| ML | Machine Learning |
| NCCL | NVIDIA Collective Communication Library |
| NFD | Node Feature Discovery |
| OCI | Open Container Initiative |
| OCP | OpenShift Container Platform |
| PF | Physical Function |
| QSG | Quick Start Guide |
| RDG | Reference Deployment Guide |
| RDMA | Remote Direct Memory Access |
| SR-IOV | Single Root Input/Output Virtualization |
| TF | TensorFlow |
| VF | Virtual Function |

Introduction

Preparing a Red Hat OpenShift Container Platform (OCP) infrastructure to run AI workloads efficiently is challenging.

This document provides a complete reference deployment guide for such a system, including a technology overview, design, component selection, deployment steps, and AI workload examples. The solution is deployed on NVIDIA DGX A100 servers for the OCP worker nodes and on standard x86 servers for the OCP control plane nodes. An NVIDIA end-to-end HDR (200Gb/s) InfiniBand fabric handles the workload networking needs, while a 100Gb/s Ethernet network is used as the Deployment/Management network.

In this guide, we use the OpenShift operators, the NVIDIA GPU Operator and the NVIDIA Network Operator, which are responsible for deploying and configuring GPU and Network components in the OCP cluster. These components accelerate AI tasks using CUDA, RDMA and GPUDirect technologies.

A Greenfield deployment is assumed for this guide.

References

Solution Architecture

Key Components and Technologies

  • NVIDIA DGX A100

    NVIDIA DGX™ A100 is the universal system for all AI workloads, offering unprecedented compute density, performance, and flexibility in the world’s first 5 petaFLOPS AI system. NVIDIA DGX A100 features the world’s most advanced accelerator, the NVIDIA A100 Tensor Core GPU, enabling enterprises to consolidate training, inference, and analytics into a unified, easy-to-deploy AI infrastructure that includes direct access to NVIDIA AI experts.

  • NVIDIA InfiniBand Adapters

    NVIDIA ConnectX InfiniBand adapters provide ultra-low latency, extreme throughput, and innovative NVIDIA In-Network Computing engines to deliver the acceleration, scalability, and feature-rich technology needed for today's modern workloads.

  • NVIDIA LinkX Cables

    The NVIDIA® LinkX® product family of cables and transceivers provides the industry’s most complete line of 10, 25, 40, 50, 100, 200, and 400Gb/s Ethernet and 100, 200, and 400Gb/s InfiniBand products for cloud, HPC, hyperscale, enterprise, telco, storage, and artificial intelligence data center applications.

  • NVIDIA InfiniBand Switches

    NVIDIA Quantum InfiniBand switch systems deliver the highest performance and port density available. Innovative capabilities such as NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ and advanced management features such as self-healing network capabilities, quality of service, enhanced virtual lane mapping, and NVIDIA In-Network Computing acceleration engines provide a performance boost for industrial, AI, and scientific applications.

  • OpenShift

    Red Hat OpenShift Container Platform is an enterprise-ready Kubernetes container platform with full-stack automated operations to manage hybrid cloud and multi-cloud deployments. It helps you deliver applications faster and makes developers more productive. Automate life-cycle management to get increased security, tailored operations solutions, easy-to-manage cluster operations, and application portability.

  • NVIDIA GPU Operator

    The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPU. These components include the NVIDIA drivers (to enable CUDA), Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring, and more.

  • NVIDIA CUDA

    CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs. In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute-intensive portion of the application runs on thousands of GPU cores in parallel.

  • NVIDIA Network Operator

    An analog to the NVIDIA GPU Operator, the NVIDIA Network Operator simplifies scale-out network design for Kubernetes by automating aspects of network deployment and configuration that would otherwise require manual work. It loads the required drivers, libraries, device plugins, and CNIs on any cluster node with an NVIDIA network interface. Paired with the NVIDIA GPU Operator, the Network Operator enables GPUDirect RDMA, a key technology that accelerates cloud-native AI workloads by orders of magnitude. The NVIDIA Network Operator uses Kubernetes CRD and the Operator Framework to provision the host software needed for enabling accelerated networking.

  • RDMA

    RDMA is a technology that allows computers in a network to exchange data without involving the processor, cache or operating system of either computer.

    Like locally based DMA, RDMA improves throughput and performance and frees up compute resources.

  • GPUDirect RDMA

    GPUDirect (GDR) RDMA provides a direct P2P (Peer-to-Peer) data path between the GPU memory directly to and from NVIDIA HCA devices. This reduces GPU-to-GPU communication latency and completely offloads the CPU, removing it from all GPU-to-GPU communications across the network.

gpu.png

Logical Design

The logical design includes the following parts:

  • Jump node used for initial deployment and debugging

  • OCP control plane nodes

  • NVIDIA DGX A100 servers - OCP worker nodes

  • NVIDIA Quantum InfiniBand fabric

  • UFM Enterprise Node

  • Deployment and OCP management networks

    21.png

OpenShift Container Platform Networking

In this guide, the OCP cluster is deployed in a non-air-gapped environment, so Internet access is required.

The OCP cluster is deployed on a dedicated Deployment/Management network (CIDR 192.168.77.0/24) that is part of the IT infrastructure which also includes DNS/DHCP services.

The installation and configuration procedures for these components are not covered in this guide.

network.png

Network/Fabric Diagram

This RDG describes a small-scale solution with a single InfiniBand switch.

All OCP cluster nodes are connected to the MGMT switch with a single 100GbE cable.

All InfiniBand ports of the DGX A100 servers are connected to a single NVIDIA Quantum HDR 200Gb/s InfiniBand Smart Edge Switch with NVIDIA LinkX HDR 200Gb/s QSFP56 DAC cables.

In addition, a UFM Enterprise Node is connected in a similar way.

All server remote management ports and switch management ports are connected to a 1GbE switch.

The setup diagram is presented in the picture below.

22.png

For assistance in designing the scaled InfiniBand topology, use the InfiniBand Topology Generator, an online cluster configuration tool that offers flexible cluster configurations and sizes.

Software Stack Components

In this guide, the following software components have been used to deploy the system:

  • Red Hat OCP 4.10.30

  • AlmaLinux v8.5 for the Jump Node, installed in "Server with GUI" mode

  • Ubuntu Server 18.04 for the UFM Enterprise node

  • NVIDIA GPU Operator v1.11.1

  • NVIDIA Network Operator v1.3.0

  • Red Hat OpenShift Data Foundation

  • Red Hat SR-IOV Network Operator

Bill of Materials

The following hardware setup is utilized in this guide to build an OCP cluster with four Worker nodes.

bom.png

Deployment and Configuration

Network and Fabric Configuration for an OCP Cluster

Below are the server names with their relevant network configurations.

| Server/Switch Type | Server/Switch Name | InfiniBand Network | Management Network |
|---|---|---|---|
| Jump node | ocp4-jump | N/A | eth0: DHCP 192.168.77.201 |
| OCP Master node1 | control1 | N/A | eth0: DHCP 192.168.77.11 |
| OCP Master node2 | control2 | N/A | eth0: DHCP 192.168.77.12 |
| OCP Master node3 | control3 | N/A | eth0: DHCP 192.168.77.13 |
| OCP Worker node1 | worker1 | ib0-ib7: no IP set | enp225s0f0: DHCP 192.168.77.21 |
| OCP Worker node2 | worker2 | ib0-ib7: no IP set | enp225s0f0: DHCP 192.168.77.22 |
| OCP Worker node3 | worker3 | ib0-ib7: no IP set | enp225s0f0: DHCP 192.168.77.23 |
| OCP Worker node4 | worker4 | ib0-ib7: no IP set | enp225s0f0: DHCP 192.168.77.24 |
| InfiniBand switch | ib-sw01 | N/A | mgmt0: DHCP 192.168.77.222 |
| UFM | ufm | ib0: no IP set | eth0: DHCP 192.168.77.223 |

Wiring

On each OCP Worker Node, all networking ports of each NVIDIA network adapter are wired to the NVIDIA Ethernet (SN2700) and InfiniBand (QM8700) switches using NVIDIA LinkX DAC cables.

The below figure illustrates the required wiring for building an OCP cluster with four Worker nodes and a UFM Enterprise Node.

wiring.png

InfiniBand Fabric Configuration

Overview

Below is a list of recommendations and prerequisites that are important for the configuration process:

  • Refer to the NVIDIA MLNX-OS User Manual to become familiar with the switch software (located at enterprise-support.nvidia.com/s/)

  • Upgrade the switch software to the latest NVIDIA MLNX-OS version

  • An InfiniBand Subnet Manager (SM) is required to configure the InfiniBand fabric properly

There are three ways to run an InfiniBand SM in the InfiniBand fabric:

  1. Start the SM on one or more managed switches. This is a very convenient and quick operation which allows for easier InfiniBand ‘plug & play'.

  2. Run an OpenSM daemon on one or more servers by executing the /etc/init.d/opensmd command.

  3. Use a Unified Fabric Manager (UFM®).

    UFM is a powerful platform for scale-out computing, which eliminates the complexity of fabric management, provides deep visibility into traffic and optimizes fabric performance.

Below are the configuration steps for method #1 and method #3. This guide uses a Unified Fabric Manager (method #3) to run the InfiniBand SM.

Enable the SM on the Managed Switch

  1. Log into the switch and enter the following configuration commands (ib-sw01 is our switch name):

    IB switch configuration

    NVIDIA MLNX-OS Switch Management

    switch login: admin
    Password:

    ib-sw01 [standalone: master] > enable
    ib-sw01 [standalone: master] # configure terminal
    ib-sw01 [standalone: master] (config) # ib smnode ib-sw01 enable
    ib-sw01 [standalone: master] (config) # ib smnode ib-sw01 sm-priority 0
    ib-sw01 [standalone: master] (config) # ib sm virt enable
    ib-sw01 [standalone: master] (config) # write memory
    ib-sw01 [standalone: master] (config) # reload

  2. Once the switch reboots, check the switch configuration. It should look like the following:

    Switch config example

    NVIDIA MLNX-OS Switch Management

    switch login: admin
    Password:

    ib-sw01 [standalone: master] > enable
    ib-sw01 [standalone: master] # configure terminal
    ib-sw01 [standalone: master] (config) # show running-config
    ##
    ## Running database "initial"
    ## Generated at 2022/11/16 17:40:41 +0000
    ## Hostname: ib-sw01
    ## Product release: 3.9.1600
    ##

    ##
    ## Running-config temporary prefix mode setting
    ##
    no cli default prefix-modes enable

    ##
    ## Subnet Manager configuration
    ##
    ib sm virt enable

    ##
    ## Other IP configuration
    ##
    hostname ib-sw01

    ##
    ## Other IPv6 configuration
    ##
    no ipv6 enable

    ##
    ## Local user account configuration
    ##
    username admin password 7 $6$6GZ8Q0RF$FZW9pc23JJkwwOJTq85xZe1BJgqQV/m6APQNPkagZlTEUgKMWLr5X3Jq2hsUyB.K5nrGdDNUaSLiK2xupnIJo1
    username monitor password 7 $6$z1.r4Kl7$TIwaNf7uXNxZ9UdGdUpOO9kVug0shRqGtu75s3dSrY/wY1v1mGjrqQLNPHvHYh5HAhVuUz5wKzD6H/beYeEqL.

    ##
    ## AAA remote server configuration
    ##
    # ldap bind-password ********
    # radius-server key ********
    # tacacs-server key ********

    ##
    ## Network management configuration
    ##
    # web proxy auth basic password ********

    ##
    ## X.509 certificates configuration
    ##
    #
    # Certificate name system-self-signed, ID 12d0989d8623825b71bc25f9bc02de813fc9fe2a
    # (public-cert config omitted since private-key config is hidden)

    ##
    ## IB nodename to GUID mapping
    ##
    ib smnode ib-sw01 create
    ib smnode ib-sw01 enable
    ib smnode ib-sw01 sm-priority 0

    ##
    ## Persistent prefix mode setting
    ##
    cli default prefix-modes enable

UFM Deployment and Configuration

NVIDIA® UFM® is a host-based solution, providing all management functionality required for managing InfiniBand fabrics.

In this guide, a UFM will be deployed as a Stand Alone UFM Enterprise application container. High Availability deployment is possible, yet not covered.

Info
  • For the UFM Enterprise User Manual, refer to this link.

  • For the UFM Enterprise Docker Container Installation Guide, refer to this link.

  • Using the NVIDIA UFM Enterprise Software requires a license. To download the UFM software and license, please visit NVIDIA's Licensing Portal.

  • If you do not have a valid license, please fill out the NVIDIA Enterprise Account Registration form to get a UFM evaluation license.

UFM Node OS

  1. Install the OS on the UFM Node (in this solution we have used Ubuntu 18.04 OS).

  2. Install the NVIDIA MLNX_OFED network drivers. For further information refer to this link.

  3. Install and enable Docker service—Ubuntu Docker Installation.

  4. Use the "ibstat" command to make sure that the UFM Node is connected to the InfiniBand Fabric, and the link is up.

  5. Make sure that the UFM Node is connected to the OpenShift Management network and allocate an IP Address. In our example we have assigned IP 192.168.77.223 to this node.

  6. Set a dummy IP address on the InfiniBand ib0 interface, and make sure it is in the "up" state (see the example after this list). This step is a prerequisite for the UFM application installation.

    Info

    ib0 is the default fabric interface used by the UFM installer. If you have connected ib1 to the InfiniBand fabric, make sure to specify the interface during UFM installer execution.

  7. Make sure that access to the Internet is available, as it will be used to pull the UFM application container. It is also possible to use local images without Internet connectivity.
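A minimal sketch of steps 4 and 6 on the UFM Node is shown below; it assumes the fabric interface is ib0, and the dummy address is an arbitrary placeholder chosen for this example:

    # Check that the InfiniBand link on the UFM Node is up (step 4)
    ibstat | grep -E 'State|Rate'

    # Assign a dummy address to ib0 and bring the interface up (step 6)
    # 192.168.200.1/24 is an arbitrary placeholder - any unused subnet will do
    ip addr add 192.168.200.1/24 dev ib0
    ip link set ib0 up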

UFM Stand-alone installation

  1. Create a directory on the host to mount and sync the UFM Enterprise files with read/write permissions. For example: /opt/ufm/files/.

  2. Copy only your UFM license file(s) to a temporary directory which we are going to use in the installation command. For example: /tmp/license_file/

  3. Run the UFM installation command according to the following example, which will also configure UFM fabric interface to be ib0:

    docker run -it --name=ufm_installer --rm \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v /etc/systemd/system/:/etc/systemd_files/ \
      -v /opt/ufm/files/:/installation/ufm_files/ \
      -v /tmp/license_file/:/installation/ufm_licenses/ \
      mellanox/ufm-enterprise:latest \
      --install \
      --fabric-interface ib0

    The following paths in the command can be adjusted to your environment:

    • /opt/ufm/files/

    • /tmp/license_file/

    For example, if you want the UFM files to be mounted in another location on your server, create that directory and replace the path in the command.

  4. Reload the systemd manager configuration:

    systemctl daemon-reload

  5. Configure general settings in the /opt/ufm/files/conf/gv.cfg file to enable Virtualization.

    This allows supporting virtual ports in UFM.

    [Virtualization]
    # By enabling this flag, UFM will discover all the virtual ports assigned for all hypervisors in the fabric
    enable = true
    # Interval for checking whether any virtual ports were changed in the fabric
    interval = 60

    Warning

    This step is required and cannot be skipped!

  6. To start the UFM Enterprise service, run the following (a verification sketch follows this list):

    systemctl start ufm-enterprise

  7. Connect from any web browser to the UFM WebUI. Use the following URL - https://192.168.77.223/ufm/

    Info

    Default Login Credentials: admin/123456

    ufm-gui.png
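To confirm that the UFM Enterprise service came up correctly before logging into the WebUI, a quick check such as the following can be used (the container name filter is an assumption and may differ in your deployment):

    systemctl status ufm-enterprise --no-pager
    docker ps --filter "name=ufm"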

OCP Cluster Installation

General Prerequisites

DHCP and DNS services are required for OCP installation.

The OCP cluster installation requires two VIP addresses:

  • The API address is used to access the cluster API.

  • The Ingress address is used for cluster ingress traffic.

These IP addresses must be provided to the installation program when installing the OpenShift Container Platform cluster.

In our environment, we use the following IPs and DNS records:

| Component | IP | Record | Description |
|---|---|---|---|
| Kubernetes API | 192.168.77.211 | api.ocp4.clx.labs.mlnx | A DNS A/AAAA or CNAME record and a DNS PTR record to identify the API load balancer. These records must be resolvable both by clients external to the cluster and from all the nodes within the cluster. |
| Ingress | 192.168.77.212 | *.apps.ocp4.clx.labs.mlnx | A wildcard DNS A/AAAA or CNAME record that refers to the application ingress load balancer. The application ingress load balancer targets the machines that run the Ingress Controller pods, which run on the compute machines by default. These records must be resolvable both by clients external to the cluster and from all the nodes within the cluster. |

For example, console-openshift-console.apps.ocp4.clx.labs.mlnx is used as a wildcard route to the OCP console.

OpenShift Container Platform nodes in the cluster should have access to an NTP server. OpenShift Container Platform nodes use NTP to synchronize their clocks.

NTP synchronization servers can be configured as a specific option of the DHCP service.
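As an illustration only, if dnsmasq happens to provide DHCP on the Deployment/Management network, the NTP server can be advertised with DHCP option 42 as sketched below. The DHCP server choice and the 192.168.77.1 address are assumptions; any DHCP server that supports option 42 can be used instead.

    # On the DHCP server (assuming dnsmasq): advertise an NTP server to all clients
    echo 'dhcp-option=option:ntp-server,192.168.77.1' >> /etc/dnsmasq.d/ocp.conf
    systemctl restart dnsmasq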

Note

DNS/DHCP network services are part of the IT infrastructure. The components installation procedure and configuration are not covered in this guide. For more information, see Networking Requirements for User-provisioned Infrastructure.

Jump Node Prerequisites

In this guide, a Jump Node is used for CLI and GUI access to an OCP cluster.

A standard X86 server has been used for the Jump Node. AlmaLinux OS 8.6 is installed on the server in a "Server with GUI" mode.

Generating an SSH Public Key for Discovery ISO

"Discovery ISO" is part of the OCP Assisted installed procedure. Instructions for the generation of "Discovery ISO" are provided below.

To generate a key, please run the following command on the Jump Node:


[root@jump-node ~]# ssh-keygen

At each prompt, press ENTER to accept the default values. The public key is saved to /root/.ssh/id_rsa.pub.
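The resulting public key can then be printed and copied; it will be pasted into the "SSH public key" field when generating the Discovery ISO later in this guide:

    [root@jump-node ~]# cat /root/.ssh/id_rsa.pub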

OCP Cluster Deployment with Assisted Installer

The Assisted Installer deploys and configures the infrastructure that an OpenShift Container Platform cluster runs on bare-metal nodes.

This guide provides a methodology to achieve a successful Assisted Installer bare-metal installation.

Follow the steps outlined in this section to install the OCP cluster.

  1. Open a web browser on the Jump Node (in our case, AlmaLinux with UI is used), connect to the Red Hat Hybrid Cloud site and log into the console with your credentials.

    ocp1.PNG

  2. Using the left menu bar, select OpenShift :

    ocp3.PNG

  3. Click on Create Cluster .

    ocp4.PNG

  4. Select Datacenter , and click on "Bare Metal (x86_64)" in "Infrastructure provider".

    31.jpg

  5. Select the Assisted Installer installation type.

    32.jpg

  6. In the Cluster details window, provide a Cluster Name and a Base Domain. From the drop-down list, select the OpenShift version you wish to install, and click Next. Example:

    34.jpg

  7. Add an OpenShift Data Foundation operator to use DGX server local storage.

    35.jpg

  8. In the Host discovery window, click on Add hosts.

    36.jpg

  9. In the Add hosts window, select the Full Image file option, paste your SSH public key (generated on the Jump Node: /root/.ssh/id_rsa.pub) into the "SSH public key" field, and click Generate Discovery ISO.

    37.jpg

  10. In the opened window, click on Download Discovery ISO.

    disco.PNG

  11. Boot all servers in your cluster from the downloaded Discovery ISO.

  12. Go back to the Red Hat Hybrid Cloud Console web page. After a few minutes, all your hosts will appear with a "Ready" status.

    Review the CPU, Memory, and Disk configuration of the hosts.

    Select a Role for each host. Example:

    2.png

  13. For each Worker Node, choose the "Installation disk" and click Next. Example:

    3.png

  14. In the Networking window, provide the Machine network, API IP, and Ingress IP. Click Next. Example:

    4.png

  15. Review the configuration and click on Install cluster.

    5.png

  16. The OCP Installation process will start and will take a while.

    6.png

  17. Once the installation process is complete, you will be able to download the kubeconfig file and view the Web Console URL, Username and Password. Example:

    10_-_small.png

  18. In our case, we copied the downloaded kubeconfig file to /root/.kube/config on the Jump Node (a minimal sketch is provided after this procedure).

  19. To ensure that the OCP cluster is installed properly, verify it via a CLI or a Web Console.

    Via CLI:

    [root@jump-node ~]# oc get nodes
    NAME                          STATUS   ROLES    AGE   VERSION
    control1.ocp4.clx.labs.mlnx   Ready    master   1d    v1.23.5+012e945
    control2.ocp4.clx.labs.mlnx   Ready    master   1d    v1.23.5+012e945
    control3.ocp4.clx.labs.mlnx   Ready    master   1d    v1.23.5+012e945
    worker1.ocp4.clx.labs.mlnx    Ready    worker   1d    v1.23.5+012e945
    worker2.ocp4.clx.labs.mlnx    Ready    worker   1d    v1.23.5+012e945
    worker3.ocp4.clx.labs.mlnx    Ready    worker   1d    v1.23.5+012e945
    worker4.ocp4.clx.labs.mlnx    Ready    worker   1d    v1.23.5+012e945

    Note

    To interact with the OpenShift Container Platform from a command-line interface, the OpenShift CLI (oc) must be installed. You can install oc on Linux, Windows, or macOS. For the OpenShift CLI installation guide, please refer to this link (a minimal sketch is also provided after this procedure).

    Via Web Console:

    Open a web browser on the Jump Node and use the provided Web Console URL, Username, and Password.

    console.jpg

    Make sure that the Cluster Status is ok.

    console2.jpg

    Using the left menu bar, expand the Compute section and select the Nodes Status.

    node-status.jpg

    By clicking on each node, you can see the Node status with detailed information. Example:

    nodeinfo.jpg

    Additional status information is available via the OpenShift CLI. Example:

    [root@jump-node ~]# oc get nodes -o wide
    NAME                          STATUS   ROLES    AGE   VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                         KERNEL-VERSION                 CONTAINER-RUNTIME
    control1.ocp4.clx.labs.mlnx   Ready    master   1d    v1.23.5+012e945   192.168.77.11   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    control2.ocp4.clx.labs.mlnx   Ready    master   1d    v1.23.5+012e945   192.168.77.12   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    control3.ocp4.clx.labs.mlnx   Ready    master   1d    v1.23.5+012e945   192.168.77.13   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    worker1.ocp4.clx.labs.mlnx    Ready    worker   1d    v1.23.5+012e945   192.168.77.21   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    worker2.ocp4.clx.labs.mlnx    Ready    worker   1d    v1.23.5+012e945   192.168.77.22   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    worker3.ocp4.clx.labs.mlnx    Ready    worker   1d    v1.23.5+012e945   192.168.77.23   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8
    worker4.ocp4.clx.labs.mlnx    Ready    worker   1d    v1.23.5+012e945   192.168.77.24   <none>        Red Hat Enterprise Linux CoreOS 410.84.202208161501-0 (Ootpa)   4.18.0-305.57.1.el8_4.x86_64   cri-o://1.23.3-15.rhaos4.10.git6af791c.el8

    [root@jump-node ~]# oc get co
    NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    authentication                             4.10.30   True        False         False      20m
    baremetal                                  4.10.30   True        False         False      1d
    cloud-controller-manager                   4.10.30   True        False         False      1d
    cloud-credential                           4.10.30   True        False         False      1d
    cluster-autoscaler                         4.10.30   True        False         False      1d
    config-operator                            4.10.30   True        False         False      1d
    console                                    4.10.30   True        False         False      1d
    csi-snapshot-controller                    4.10.30   True        False         False      1d
    dns                                        4.10.30   True        False         False      1d
    etcd                                       4.10.30   True        False         False      1d
    image-registry                             4.10.30   True        False         False      1d
    ingress                                    4.10.30   True        False         False      1d
    insights                                   4.10.30   True        False         False      1d
    kube-apiserver                             4.10.30   True        False         False      1d
    kube-controller-manager                    4.10.30   True        False         False      1d
    kube-scheduler                             4.10.30   True        False         False      1d
    kube-storage-version-migrator              4.10.30   True        False         False      1d
    machine-api                                4.10.30   True        False         False      1d
    machine-approver                           4.10.30   True        False         False      1d
    machine-config                             4.10.30   True        False         False      1d
    marketplace                                4.10.30   True        False         False      1d
    monitoring                                 4.10.30   True        False         False      1d
    network                                    4.10.30   True        False         False      1d
    node-tuning                                4.10.30   True        False         False      1d
    openshift-apiserver                        4.10.30   True        False         False      1d
    openshift-controller-manager               4.10.30   True        False         False      1d
    openshift-samples                          4.10.30   True        False         False      1d
    operator-lifecycle-manager                 4.10.30   True        False         False      1d
    operator-lifecycle-manager-catalog         4.10.30   True        False         False      1d
    operator-lifecycle-manager-packageserver   4.10.30   True        False         False      1d
    service-ca                                 4.10.30   True        False         False      1d
    storage                                    4.10.30   True        False         False      1d

Post-installation Configuration

In the OpenShift 4.x environment, each running container is limited to a default maximum PID value of 1024. To properly run AI applications on the OCP cluster, more than 1024 processes are required within a single container.

The cluster administrator must therefore raise the default maximum PID value to a higher number, such as 4096. This can be done as part of "Day 2 operation for OCP". For additional information about post-installation configuration, please refer to "Day 2 operation for OCP".

  1. Create a ContainerRuntimeConfig custom resource to configure the CRI-O pidsLimit (mco-pidup.yaml):

    apiVersion: machineconfiguration.openshift.io/v1
    kind: ContainerRuntimeConfig
    metadata:
      name: 01-worker-scale-increase-pid-limit
    spec:
      containerRuntimeConfig:
        pidsLimit: 4096
      machineConfigPoolSelector:
        matchLabels:
          pools.operator.machineconfiguration.openshift.io/worker: ""

  2. Apply the following configuration:


    oc create -f mco-pidup.yaml

  3. Verify that the latest rendered-worker machine config has been rolled out to the pool successfully:

    [root@jump-node ~]# oc get mcp
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-dc4c25c725418932d2678b0a174057b6   True      False      False      3              3                   3                     0                      1d
    worker   rendered-worker-0f6b49419faed3fb46a74259d570896f   True      False      False      4              4                   4                     0                      1d

  4. Once all Worker Nodes are rebooted, you may log in and confirm the current setting:

    [root@jump-node ~]# oc debug node/worker1.ocp4.clx.labs.mlnx
    Starting pod/worker1ocp4clxlabsmlnx-debug ...
    To use host binaries, run `chroot /host`
    Pod IP: 192.168.77.21
    If you don't see a command prompt, try pressing enter.
    sh-4.4# chroot /host
    sh-4.4# cat /etc/crio/crio.conf.d/01-ctrcfg-pidsLimit
    [crio]
      [crio.runtime]
        pids_limit = 4096

Installing OpenShift Operators

To run AI applications on the OCP cluster, the following operators are required:

  • Node Feature Discovery Operator (installed as part of the initial cluster deployment)

  • Local Storage (installed as part of the initial cluster deployment)

  • OpenShift Data Foundation (installed as part of the initial cluster deployment)

  • NVIDIA Network Operator

  • NVIDIA GPU Operator

  • SR-IOV Network Operator

Info

For additional information about adding Operators to a cluster, see the Red Hat OpenShift Container Platform documentation.

Installing NVIDIA Network Operator

The NVIDIA Network Operator must be installed on the OCP cluster to enable compilation and installation of the GPUDirect RDMA module as part of the NVIDIA GPU Operator.

For the NVIDIA Network Operator Installation Guide for OCP Cluster, please refer to this link.

Warning

A cluster-wide entitlement must be applied before installing the NVIDIA Network Operator. This step is described here.

NVIDIA Network Operator installation and configuration steps via Web Console:

  1. Expand the Operators section in the left menu bar and select the OperatorHub.

  2. Search for "NVIDIA" in the search bar. Two items should appear as results.

  3. Select the operator that is tagged as "NVIDIA Network Operator". This is the version supported by NVIDIA.

    nvidia-net.jpg

  4. In the opened pop-up window, click "Install" .

    nvidia-net2.jpg

  5. Following the completed installation, in the left menu bar go to the Operators section and click on "Installed Operators". Then, select the "NVIDIA Network Operator".

  6. On the NVIDIA Network Operator details screen click on "Create instance" in the "NicClusterPolicy" section.

    nvidia-net3.jpg

  7. In the "NicClusterPolicy" tab, set the desired values in the "Ofed Driver" section, or leave the default values. In the "RDMA Shared Device Plugin" section please remove all values in the following subsections: "Config", "Image", "Repository" and "Version", then click on the "Create" button.

    nvidia-net4.jpg

    nvidia-net5.jpg

    Info

    Applying the "NicClusterPolicy" is dependent on the server platform hardware configuration and may take some time.

To ensure that the NVIDIA Network Operator is deployed properly, please run the following commands:

[root@jump-node ~]# oc -n nvidia-network-operator get pod -o wide
NAME                                                           READY   STATUS    RESTARTS      AGE   IP              NODE                         NOMINATED NODE   READINESS GATES
mofed-rhcos4.10-ds-gwt6w                                       1/1     Running   0             48m   192.168.77.21   worker1.ocp4.clx.labs.mlnx   <none>           <none>
mofed-rhcos4.10-ds-qrn2b                                       1/1     Running   0             48m   192.168.77.24   worker4.ocp4.clx.labs.mlnx   <none>           <none>
mofed-rhcos4.10-ds-tj695                                       1/1     Running   0             48m   192.168.77.23   worker3.ocp4.clx.labs.mlnx   <none>           <none>
mofed-rhcos4.10-ds-z742t                                       1/1     Running   0             48m   192.168.77.22   worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-network-operator-controller-manager-86bdf7bdd5-tc5s5    2/2     Running   7 (41m ago)   72m   10.129.2.35     worker1.ocp4.clx.labs.mlnx   <none>           <none>
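In addition to the pod status, the NicClusterPolicy state can be checked. A minimal sketch, assuming the custom columns below match the operator's status fields (the state is expected to report ready once the OFED driver rollout completes):

    oc get nicclusterpolicy -o custom-columns=NAME:.metadata.name,STATE:.status.state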

Installing the NVIDIA GPU Operator

The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, automatic node labelling using GFD, DCGM-based monitoring, and others.

NVIDIA GPU Operator installation and configuration steps via Web Console:

  1. In the OpenShift Container Platform web console from the side menu, navigate to Operators > OperatorHub, and select All Projects.

  2. In Operators > OperatorHub, search for the NVIDIA GPU Operator.

  3. Select the NVIDIA GPU Operator and click Install in the subsequent screen.

    nvidia-net.jpg

  4. Once the installation is completed, in the left menu bar go to the Operators section, and click on "Installed Operators", then select the "NVIDIA GPU Operator".

  5. On the NVIDIA GPU Operator details screen click on "Create instance" in the "ClusterPolicy".

    gpu1.jpg

  6. In the "ClusterPolicy" tab, please open the subsection "NVIDIA GPU/VGPU Driver Config" and scroll down. Mark the checkbox "Enabled" in the "RDMA" subsection and click on the "Create" button.

    gpu2.jpg

    gpu3.jpg

    gpu4.jpg

Note

For deep-dive, step-by-step installation instructions, please visit GPU Operator on OpenShift.

To ensure that the NVIDIA GPU Operator is deployed properly, please run the following commands:

GPU Operator deployment pod's status

[root@jump-node ~]# oc -n nvidia-gpu-operator get pod -o wide
NAME                                                   READY   STATUS      RESTARTS      AGE    IP              NODE                         NOMINATED NODE   READINESS GATES
gpu-feature-discovery-2dzjv                            1/1     Running     0             39m    10.130.0.36     worker3.ocp4.clx.labs.mlnx   <none>           <none>
gpu-feature-discovery-lk4jz                            1/1     Running     0             165m   10.130.2.3      worker2.ocp4.clx.labs.mlnx   <none>           <none>
gpu-feature-discovery-ndpbw                            1/1     Running     0             165m   10.131.0.14     worker4.ocp4.clx.labs.mlnx   <none>           <none>
gpu-feature-discovery-wl5fs                            1/1     Running     0             165m   10.129.2.15     worker1.ocp4.clx.labs.mlnx   <none>           <none>
gpu-operator-66bd74b4f4-q2c9q                          1/1     Running     3 (42m ago)   62m    10.129.2.48     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-container-toolkit-daemonset-5vjrk               1/1     Running     0             165m   10.130.2.9      worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-container-toolkit-daemonset-jh2xf               1/1     Running     0             165m   10.129.2.12     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-container-toolkit-daemonset-s4pbf               1/1     Running     0             165m   10.131.0.2      worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-container-toolkit-daemonset-tdkns               1/1     Running     0             39m    10.130.0.37     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-cuda-validator-fqbxg                            0/1     Completed   0             37m    10.131.0.36     worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-cuda-validator-kdbbh                            0/1     Completed   0             35m    10.130.0.44     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-cuda-validator-nv7vz                            0/1     Completed   0             36m    10.129.2.68     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-27ghv                                      1/1     Running     0             39m    192.168.77.23   worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-2tz55                                      1/1     Running     0             165m   192.168.77.22   worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-exporter-hkzxw                             1/1     Running     0             165m   192.168.77.21   worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-exporter-kk4lp                             1/1     Running     0             165m   192.168.77.24   worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-exporter-nk5j7                             1/1     Running     0             165m   192.168.77.22   worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-exporter-vgjr6                             1/1     Running     0             39m    192.168.77.23   worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-nqj7s                                      1/1     Running     0             165m   192.168.77.24   worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-dcgm-tlf2p                                      1/1     Running     0             165m   192.168.77.21   worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-daemonset-7cvps                   1/1     Running     0             165m   10.129.2.16     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-daemonset-bjjdv                   1/1     Running     0             165m   10.131.0.10     worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-daemonset-qndb6                   1/1     Running     0             165m   10.130.2.14     worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-daemonset-t7t5t                   1/1     Running     0             39m    10.130.0.39     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-validator-gphhw                   0/1     Completed   0             36m    10.131.0.39     worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-validator-mqc82                   0/1     Completed   0             34m    10.130.0.45     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-device-plugin-validator-vw92q                   0/1     Completed   0             35m    10.129.2.69     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-driver-daemonset-410.84.202208161501-0-r6dwb   3/3     Running     4 (38m ago)   165m   10.131.0.11     worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-driver-daemonset-410.84.202208161501-0-r9gzw   3/3     Running     3 (36m ago)   165m   10.130.0.11     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-driver-daemonset-410.84.202208161501-0-vgwrz   3/3     Running     4 (37m ago)   165m   10.129.2.8      worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-driver-daemonset-410.84.202208161501-0-zzb58   3/3     Running     2 (38m ago)   165m   10.130.2.12     worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-mig-manager-4grl2                               1/1     Running     0             39m    10.130.0.40     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-mig-manager-nhbh5                               1/1     Running     0             165m   10.129.2.10     worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-mig-manager-v852l                               1/1     Running     0             165m   10.130.2.10     worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-mig-manager-xkjxf                               1/1     Running     0             165m   10.131.0.6      worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-node-status-exporter-pmtfw                      1/1     Running     1             165m   10.130.2.7      worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-node-status-exporter-r7m4j                      1/1     Running     3             165m   10.130.0.3      worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-node-status-exporter-szzsg                      1/1     Running     1             165m   10.131.0.8      worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-node-status-exporter-zwgld                      1/1     Running     2             165m   10.129.2.4      worker1.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-operator-validator-52z9h                        1/1     Running     0             39m    10.130.0.38     worker3.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-operator-validator-7gnwz                        1/1     Running     0             41m    10.131.0.24     worker4.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-operator-validator-ndhll                        1/1     Running     0             41m    10.130.2.33     worker2.ocp4.clx.labs.mlnx   <none>           <none>
nvidia-operator-validator-zvsgc                        1/1     Running     0             40m    10.129.2.58     worker1.ocp4.clx.labs.mlnx   <none>           <none>
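As an optional spot check, nvidia-smi can be run inside one of the driver daemonset pods listed above. The pod name comes from the listing, and the container name nvidia-driver-ctr is an assumption (omit -c to use the pod's default container):

    oc -n nvidia-gpu-operator exec nvidia-driver-daemonset-410.84.202208161501-0-r6dwb -c nvidia-driver-ctr -- nvidia-smi -L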

Installing the SR-IOV Network Operator

The SR-IOV Network Operator is generally responsible for configuring the SR-IOV components in an OpenShift cluster.

SR-IOV Network Operator installation and configuration steps via Web Console and CLI:

  1. In the OpenShift Container Platform web console from the side menu, navigate to Operators > OperatorHub, and select All Projects.

  2. In Operators > OperatorHub, search for the SR-IOV Network Operator.

  3. Select the SR-IOV Network Operator and click Install on the subsequent screen.

    sriov1.jpg

  4. To configure the InfiniBand network, the following components should be created: "SR-IOV Network Node Policies" and "SriovIBNetworks". Both components were created via YAML configuration files and applied via CLI.

  5. To configure "SR-IOV Network Node Policies", use - policy.yaml.

    Multi policy configuration YAML file - policy.yaml

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib0
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib0
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib0
      isRdma: true
      linkType: ib
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib1
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib1
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib1
      isRdma: true
      linkType: ib
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib2
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib2
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib2
      isRdma: true
      linkType: ib
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib3
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib3
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib3
      isRdma: true
      linkType: ib
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib4
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib4
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib4
      isRdma: true
      linkType: ib
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib5
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib5
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib5
      isRdma: true
      linkType: ib
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib6
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib6
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib6
      isRdma: true
      linkType: ib
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovNetworkNodePolicy
    metadata:
      name: policy-ib7
      namespace: openshift-sriov-network-operator
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: 'true'
      nicSelector:
        pfNames:
          - ib7
      deviceType: netdevice
      numVfs: 8
      priority: 99
      resourceName: ib7
      isRdma: true
      linkType: ib

  6. Deploy policy.yaml:


    oc apply -f policy.yaml

    Warning

    This step may take a while to complete, depending on the number of Worker Nodes the configuration is applied to and the number of VFs created for each selected network interface. Progress can be tracked as shown in the sketch after this procedure.

  7. To configure "SR-IOV Network Node Policies", ib-net.yaml has been used.

    Multi SriovIBNetwork configuration YAML file - ib-net.yaml

    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib0
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: "ib0"
      networkNamespace: default
      linkState: enable
      ipam: |-
        {
          "type": "whereabouts",
          "range": "192.168.0.0/24"
        }
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib1
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: "ib1"
      networkNamespace: default
      linkState: enable
      ipam: |-
        {
          "type": "whereabouts",
          "range": "192.168.1.0/24"
        }
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib2
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib2
      networkNamespace: default
      linkState: enable
      ipam: |-
        {
          "type": "whereabouts",
          "range": "192.168.2.0/24"
        }
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib3
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib3
      networkNamespace: default
      linkState: auto
      ipam: |-
        {
          "type": "whereabouts",
          "range": "192.168.3.0/24"
        }
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib4
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib4
      networkNamespace: default
      linkState: auto
      ipam: |-
        {
          "type": "whereabouts",
          "range": "192.168.4.0/24"
        }
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib5
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib5
      networkNamespace: default
      linkState: auto
      ipam: |-
        {
          "type": "whereabouts",
          "range": "192.168.5.0/24"
        }
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib6
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib6
      networkNamespace: default
      linkState: auto
      ipam: |-
        {
          "type": "whereabouts",
          "range": "192.168.6.0/24"
        }
    ---
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovIBNetwork
    metadata:
      name: net-ib7
      namespace: openshift-sriov-network-operator
    spec:
      resourceName: ib7
      networkNamespace: default
      linkState: auto
      ipam: |-
        {
          "type": "whereabouts",
          "range": "192.168.7.0/24"
        }

  8. Deploy ib-net.yaml:


    oc apply -f ib-net.yaml
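As referenced in the warning above, the rollout of the node policies can be tracked through the SriovNetworkNodeState objects. A minimal sketch (syncStatus is expected to reach Succeeded on every Worker Node):

    oc -n openshift-sriov-network-operator get sriovnetworknodestates -o custom-columns=NODE:.metadata.name,SYNC:.status.syncStatus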

Validating the Operators Deployment

  1. Check the deployed InfiniBand networks:

    oc get network-attachment-definitions.k8s.cni.cncf.io
    NAME      AGE
    net-ib0   1d
    net-ib1   1d
    net-ib2   1d
    net-ib3   1d
    net-ib4   1d
    net-ib5   1d
    net-ib6   1d
    net-ib7   1d

  2. Check the Worker Node resources:


    oc get node worker1.ocp4.clx.labs.mlnx -o json | jq '.status.allocatable' { "cpu": "255500m", "ephemeral-storage": "1727851483143", "hugepages-1Gi": "0", "hugepages-2Mi": "0", "memory": "1054987908Ki", "nvidia.com/gpu": "8", "openshift.io/ib0": "8", "openshift.io/ib1": "8", "openshift.io/ib2": "8", "openshift.io/ib3": "8", "openshift.io/ib4": "8", "openshift.io/ib5": "8", "openshift.io/ib6": "8", "openshift.io/ib7": "8", "pods": "250" }   oc get node worker2.ocp4.clx.labs.mlnx -o json | jq '.status.allocatable' { "cpu": "255500m", "ephemeral-storage": "1727851483143", "hugepages-1Gi": "0", "hugepages-2Mi": "0", "memory": "1054987908Ki", "nvidia.com/gpu": "8", "openshift.io/ib0": "8", "openshift.io/ib1": "8", "openshift.io/ib2": "8", "openshift.io/ib3": "8", "openshift.io/ib4": "8", "openshift.io/ib5": "8", "openshift.io/ib6": "8", "openshift.io/ib7": "8", "pods": "250" }   oc get node worker3.ocp4.clx.labs.mlnx -o json | jq '.status.allocatable' { "cpu": "255500m", "ephemeral-storage": "1727851483143", "hugepages-1Gi": "0", "hugepages-2Mi": "0", "memory": "1054987908Ki", "nvidia.com/gpu": "8", "openshift.io/ib0": "8", "openshift.io/ib1": "8", "openshift.io/ib2": "8", "openshift.io/ib3": "8", "openshift.io/ib4": "8", "openshift.io/ib5": "8", "openshift.io/ib6": "8", "openshift.io/ib7": "8", "pods": "250" }   oc get node worker4.ocp4.clx.labs.mlnx -o json | jq '.status.allocatable' { "cpu": "255500m", "ephemeral-storage": "1727851483143", "hugepages-1Gi": "0", "hugepages-2Mi": "0", "memory": "1054987908Ki", "nvidia.com/gpu": "8", "openshift.io/ib0": "8", "openshift.io/ib1": "8", "openshift.io/ib2": "8", "openshift.io/ib3": "8", "openshift.io/ib4": "8", "openshift.io/ib5": "8", "openshift.io/ib6": "8", "openshift.io/ib7": "8", "pods": "250" }

Synthetic RDMA Benchmark

Run synthetic RDMA benchmark tests with ib_write_bw between two pods that are running on different Worker Nodes.

This step includes the following:

  • Create a container image and push it to your repository

  • Deploy test deployment apps

  • Run test

  1. Create a container image from Dockerfile:

    FROM ubuntu:20.04
    # Ubuntu 20.04 docker container with inbox Mellanox drivers

    # LABEL about the custom image
    LABEL maintainer=vitaliyra@nvidia.com
    LABEL description="This is custom Container Image with inbox perftest package."

    WORKDIR /tmp/
    ENV DEBIAN_FRONTEND=noninteractive

    RUN apt-get clean -y && apt-get -y update && apt-get install -y apt-utils udev vim bash && apt-get -y upgrade
    RUN apt-get install -y iproute2 rdma-core libibmad5 ibutils ibverbs-utils infiniband-diags perftest \
        mstflint strace iputils-ping
    RUN ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime
    RUN dpkg-reconfigure --frontend noninteractive tzdata && apt-get clean all -y

    CMD bash

    Info

    Please use your favorite container building tools (docker, podman, etc.) to create a container image from Dockerfile for use in the below deployment.

    After creating the image, push it to the container registry.

  2. Create a sample deployment test-deployment.yaml (the container image should include InfiniBand userspace drivers and performance tools):

    test-deployment.yaml

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mlnx-inbox-pod
      labels:
        app: sriov
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: sriov
      template:
        metadata:
          labels:
            app: sriov
          annotations:
            k8s.v1.cni.cncf.io/networks: net-ib0
        spec:
          containers:
            - image: < Container image >
              name: mlnx-inbox-ctr
              securityContext:
                capabilities:
                  add: [ "IPC_LOCK" ]
              resources:
                requests:
                  openshift.io/ib0: 1
                limits:
                  openshift.io/ib0: 1
              command:
                - sh
                - -c
                - sleep inf

  3. Deploy the sample deployment.

    oc apply -f test-deployment.yaml
    deployment.apps/mlnx-inbox-pod created

    oc get pod -o wide
    NAME                              READY   STATUS    RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
    mlnx-inbox-pod-6948fd6d54-9s66q   1/1     Running   0          36s   10.130.2.231   worker2.ocp4.clx.labs.mlnx   <none>           <none>
    mlnx-inbox-pod-6948fd6d54-t4hpx   1/1     Running   0          37s   10.131.1.33    worker4.ocp4.clx.labs.mlnx   <none>           <none>

  4. Check available network interfaces in each POD.

    ## First POD

    oc exec -it mlnx-inbox-pod-6948fd6d54-9s66q -- bash
    root@mlnx-inbox-pod-6948fd6d54-9s66q:/tmp# ip a s
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    3: eth0@if645: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
        link/ether 0a:58:0a:82:02:e7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet 10.130.2.231/23 brd 10.130.3.255 scope global eth0
           valid_lft forever preferred_lft forever
        inet6 fe80::c4e:5fff:feb1:c036/64 scope link
           valid_lft forever preferred_lft forever
    430: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
        link/infiniband 00:00:00:88:fe:80:00:00:00:00:00:00:39:71:61:ca:7b:e2:c4:3b brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
        inet 192.168.0.2/24 brd 192.168.0.255 scope global net1
           valid_lft forever preferred_lft forever
        inet6 fe80::3b71:61ca:7be2:c43b/64 scope link
           valid_lft forever preferred_lft forever

    root@mlnx-inbox-pod-6948fd6d54-9s66q:/tmp# ibv_devinfo
    hca_id: mlx5_29
            transport:                      InfiniBand (0)
            fw_ver:                         20.31.2006
            node_guid:                      3971:61ca:7be2:c43b
            sys_image_guid:                 043f:7203:009c:6800
            vendor_id:                      0x02c9
            vendor_part_id:                 4124
            hw_ver:                         0x0
            board_id:                       MT_0000000223
            phys_port_cnt:                  1
                    port:   1
                            state:                  PORT_ACTIVE (4)
                            max_mtu:                4096 (5)
                            active_mtu:             4096 (5)
                            sm_lid:                 1
                            port_lid:               376
                            port_lmc:               0x00
                            link_layer:             InfiniBand

    ## Second POD

    oc exec -it mlnx-inbox-pod-6948fd6d54-t4hpx -- bash
    root@mlnx-inbox-pod-6948fd6d54-t4hpx:/tmp# ibv_devinfo
    hca_id: mlx5_11
            transport:                      InfiniBand (0)
            fw_ver:                         20.31.2006
            node_guid:                      19aa:44e1:73a8:70c1
            sys_image_guid:                 043f:7203:00c0:017e
            vendor_id:                      0x02c9
            vendor_part_id:                 4124
            hw_ver:                         0x0
            board_id:                       MT_0000000223
            phys_port_cnt:                  1
                    port:   1
                            state:                  PORT_ACTIVE (4)
                            max_mtu:                4096 (5)
                            active_mtu:             4096 (5)
                            sm_lid:                 1
                            port_lid:               474
                            port_lmc:               0x00
                            link_layer:             InfiniBand

  5. Run synthetic RDMA benchmark tests.

    Server

    ib_write_bw -F -d $IB_DEV_NAME --report_gbits

    Client

    ib_write_bw -D 20 -F $SERVER_IP -d $IB_DEV_NAME --report_gbits

    Open a console session to each POD - one for the server side and one for the client side.

    On the first console (on the server side), run the following commands:

    oc exec -it mlnx-inbox-pod-6948fd6d54-9s66q -- bash
    root@mlnx-inbox-pod-6948fd6d54-9s66q:/tmp# ib_write_bw -d mlx5_29 -F --report_gbits

    ************************************
    * Waiting for client to connect... *
    ************************************
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF
     Device          : mlx5_29
     Number of qps   : 1
     Transport type  : IB
     Connection type : RC
     Using SRQ       : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     CQ Moderation   : 1
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address:  LID 0x178 QPN 0x007d PSN 0x1347c5 RKey 0x01053c VAddr 0x007f287d861000
     remote address: LID 0x1da QPN 0x00bd PSN 0x1347c5 RKey 0x02053c VAddr 0x007fd8c7404000
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
     65536      3728317        0.00               195.47               0.372828
    ---------------------------------------------------------------------------------------

    On the second console (on the client side), run the following commands:

    oc exec -it mlnx-inbox-pod-6948fd6d54-t4hpx -- bash
    root@mlnx-inbox-pod-6948fd6d54-t4hpx:/tmp# ib_write_bw -d mlx5_11 -F 10.130.2.231 --report_gbits -D 20
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF
     Device          : mlx5_11
     Number of qps   : 1
     Transport type  : IB
     Connection type : RC
     Using SRQ       : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     TX depth        : 128
     CQ Moderation   : 1
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address:  LID 0x1da QPN 0x00bd PSN 0x1347c5 RKey 0x02053c VAddr 0x007fd8c7404000
     remote address: LID 0x178 QPN 0x007d PSN 0x1347c5 RKey 0x01053c VAddr 0x007f287d861000
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
     65536      3728317        0.00               195.47               0.372828
    ---------------------------------------------------------------------------------------

    For the synthetic RDMA benchmark with ib_write_bw, we measured about 195 Gb/s, which is the expected line rate for InfiniBand HDR.

Kubeflow Training Operator

Kubeflow is a machine learning toolkit for Kubernetes.

Kubeflow training operators are a group of Kubernetes operators, part of Kubeflow, that add support for distributed training of machine learning models using different frameworks.

The training operator provides Kubernetes custom resources (CRs) that make it easier to run distributed or non-distributed TensorFlow/PyTorch/Apache MXNet/XGBoost/MPI jobs on Kubernetes.

In the example below we deploy the latest stable release of the Kubeflow training operators:

kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
namespace/kubeflow created
customresourcedefinition.apiextensions.k8s.io/mpijobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/mxjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/pytorchjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/tfjobs.kubeflow.org created
customresourcedefinition.apiextensions.k8s.io/xgboostjobs.kubeflow.org created
serviceaccount/training-operator created
clusterrole.rbac.authorization.k8s.io/training-operator created
clusterrolebinding.rbac.authorization.k8s.io/training-operator created
service/training-operator created
deployment.apps/training-operator created

Info

Configure the namespace to allow the default service account to run pods as root:

# oc new-project $MY_PROJECT

# oc adm policy add-scc-to-user privileged -z default # (from $MY_PROJECT namespace)

# oc adm policy add-scc-to-user anyuid -z default # (from $MY_PROJECT namespace)
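A quick way to confirm that the training operator is running and that the MPIJob CRD used later in this guide is registered:

    oc -n kubeflow get pods
    oc get crd | grep kubeflow.org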

Appendix

Job Testing Results

Below are the Dockerfile and MPIJob examples with different network configurations.

Dockerfile

Dockerfile example for using MPIJob:

FROM nvcr.io/nvidia/tensorflow:22.08-tf2-py3

RUN apt-get update && apt-get install -y --no-install-recommends openssh-client openssh-server && \
    mkdir -p /var/run/sshd

# Allow OpenSSH to talk to containers without asking for confirmation
# by disabling StrictHostKeyChecking.
# mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
# to disable UserKnownHostsFile to avoid write permissions.
# Disabling StrictModes avoids directory and files read permission checks.

RUN sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config && \
    echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config

RUN mkdir /tensorflow
WORKDIR "/tensorflow"
RUN git clone https://github.com/tensorflow/benchmarks
WORKDIR "/tensorflow/benchmarks"

CMD ["/bin/bash"]

This Dockerfile is based on the TensorFlow NGC Container image. The TensorFlow NGC Container is optimized for GPU acceleration and contains a validated set of libraries that enable and optimize GPU performance. This container may also contain modifications to the TensorFlow source code in order to maximize performance and compatibility. It also contains software for accelerating ETL (DALI, RAPIDS), training (cuDNN, NCCL), and inference (TensorRT) workloads.

For supported versions, see the Framework Containers Support Matrix and the NVIDIA Container Toolkit Documentation.

Info

Please use your favorite container building tools (docker, podman, etc.) to create a container image from Dockerfile for use in the below deployment.

After creating the image, push it to the container registry.
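A minimal build-and-push sketch with podman; the registry, repository, and tag below are placeholders to be replaced with your own:

    podman build -t registry.example.com/demo/tf-mpi-benchmarks:22.08 -f Dockerfile .
    podman push registry.example.com/demo/tf-mpi-benchmarks:22.08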

MPIJob Examples

Below is an MPIJob example with network configuration over InfiniBand. It is based on OCP secondary networks with the GPUDirect options disabled:

MPIJOB with disabled GDR

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: < container image >
              name: tensorflow-benchmarks
              command:
                - mpirun
                - --allow-run-as-root
                - -np
                - "32"
                - -bind-to
                - none
                - -map-by
                - slot
                - -x
                - NCCL_DEBUG=INFO
                - -x
                - NCCL_IB_DISABLE=0
                - -x
                - NCCL_NET_GDR_LEVEL=0
                - -x
                - TF_ALLOW_IOLIBS=1
                - -x
                - LD_LIBRARY_PATH
                - -x
                - PATH
                - -mca
                - pml
                - ob1
                - -mca
                - btl
                - ^openib
                - -mca
                - btl_tcp_if_include
                - eth0
                - python
                - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
                - --batch_size=64
                - --model=resnet152
                - --variable_update=horovod
                - --use_fp16=true
    Worker:
      replicas: 4
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: net-ib0,net-ib1,net-ib2,net-ib3,net-ib4,net-ib5,net-ib6,net-ib7
        spec:
          containers:
            - image: < container image >
              name: tensorflow-benchmarks
              securityContext:
                capabilities:
                  add: [ "IPC_LOCK" ]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  openshift.io/ib0: 1
                  openshift.io/ib1: 1
                  openshift.io/ib2: 1
                  openshift.io/ib3: 1
                  openshift.io/ib4: 1
                  openshift.io/ib5: 1
                  openshift.io/ib6: 1
                  openshift.io/ib7: 1

Below is an MPIJob example with network configuration over InfiniBand. It is based on OCP secondary networks with the GPUDirect options enabled:

MPIJOB with enabled GDR

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: < container image >
              name: tensorflow-benchmarks
              command:
                - mpirun
                - --allow-run-as-root
                - -np
                - "32"
                - -bind-to
                - none
                - -map-by
                - slot
                - -x
                - NCCL_DEBUG=INFO
                - -x
                - NCCL_IB_DISABLE=0
                - -x
                - NCCL_NET_GDR_LEVEL=2
                - -x
                - TF_ALLOW_IOLIBS=1
                - -x
                - LD_LIBRARY_PATH
                - -x
                - PATH
                - -mca
                - pml
                - ob1
                - -mca
                - btl
                - ^openib
                - -mca
                - btl_tcp_if_include
                - eth0
                - python
                - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
                - --batch_size=64
                - --model=resnet152
                - --variable_update=horovod
                - --use_fp16=true
    Worker:
      replicas: 4
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: net-ib0,net-ib1,net-ib2,net-ib3,net-ib4,net-ib5,net-ib6,net-ib7
        spec:
          containers:
            - image: < container image >
              name: tensorflow-benchmarks
              securityContext:
                capabilities:
                  add: [ "IPC_LOCK" ]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  openshift.io/ib0: 1
                  openshift.io/ib1: 1
                  openshift.io/ib2: 1
                  openshift.io/ib3: 1
                  openshift.io/ib4: 1
                  openshift.io/ib5: 1
                  openshift.io/ib6: 1
                  openshift.io/ib7: 1
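Either example can be submitted and monitored as sketched below; the YAML file name is an example, and the launcher pod name should be taken from the pod listing:

    oc apply -f mpijob-gdr.yaml        # file name is an example
    oc get mpijobs
    oc get pods | grep tensorflow-benchmarks
    oc logs -f <launcher pod name from the previous command>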

Test Results

results_gdr.jpg

Summary

From the tests above, it can be seen that using GPUDirect resulted in a ~17% performance advantage in our setup.

The number of processed images depends on the chosen model and batch size in the TF benchmark.

Note

The performance results listed in this document are indicative and should not be considered as formal performance targets for NVIDIA products.

Authors

ID-2.jpg

Vitaliy Razinkov

Over the past few years, Vitaliy Razinkov has been working as a Solutions Architect on the NVIDIA Networking team, responsible for complex Kubernetes/OpenShift and Microsoft's leading solutions, research and design. He previously spent more than 25 years in senior positions at several companies. Vitaliy has written several reference designs guides on Microsoft technologies, RoCE/RDMA accelerated machine learning in Kubernetes/OpenShift, and container solutions, all of which are available on the NVIDIA Networking Documentation website.

Related Documents

Last updated on Sep 12, 2023.