NVIDIA AI Enterprise with OpenShift
NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software, optimized, certified, and supported by NVIDIA with NVIDIA-Certified Systems. Additional information can be found at the NVIDIA AI Enterprise web page.
The following methods of installation are supported:
OpenShift Container Platform on bare metal or VMware vSphere with GPU Passthrough
OpenShift Container Platform on VMware vSphere with NVIDIA vGPU
OpenShift Container Platform on bare metal or VMware vSphere with GPU Passthrough
For OpenShift Container Platform on bare metal or VMware vSphere with GPU Passthrough, you do not need to make changes to the ClusterPolicy. Follow the guidance in Installing the NVIDIA GPU Operator to install the NVIDIA GPU Operator.
Note
The option exists to use the vGPU driver with bare metal and VMware vSphere VMs with GPU Passthrough. In this case, follow the guidance in the section “OpenShift Container Platform on VMware vSphere with NVIDIA vGPU”.
OpenShift Container Platform on VMware vSphere with NVIDIA vGPU
Overview
This section describes how to deploy NVIDIA AI Enterprise for VMware vSphere with Red Hat OpenShift Container Platform.
The steps involved are:
Step 1: Install Node Feature Discovery (NFD) Operator
Step 2: Install NVIDIA GPU Operator
Step 3: Create the NGC secret
Step 4: Create the ConfigMap
Step 5: Create the Cluster Policy
Introduction
When NVIDIA AI Enterprise is running on VMware vSphere based virtualized infrastructure, a key component is NVIDIA virtual GPU. The NVIDIA AI Enterprise Host Software vSphere Installation Bundle (VIB) is installed on the VMware ESXi host server and it is responsible for communicating with the NVIDIA vGPU guest driver which is installed on the guest VM. Guest VMs use the NVIDIA vGPUs in the same manner as a physical GPU that has been passed through by the hypervisor. In the VM itself, vGPU drivers are installed which support the different license levels that are available.
Note
Installing the NVIDIA vGPU Host Driver VIB on the ESXi host is out of the scope of this document. See the NVIDIA AI Enterprise Deployment Guide for detailed instructions.
Red Hat OpenShift on VMware vSphere
Follow the steps outlined in the Installing on vSphere section of the Red Hat OpenShift documentation to install OpenShift on vSphere.
Note
When using virtualized GPUs you must change the boot method of each VM that is deployed as a worker, and of the VM template, to EFI. This requires powering down running worker VMs. The template must be converted to a VM, the boot method changed to EFI, and the VM then converted back to a template.
Secure boot also needs to be disabled.
When using the UPI install method, after Step 8 of the “Installing RHCOS and starting the OpenShift Container Platform bootstrap process” change the boot method to EFI before continuing to Step 9.
When using the IPI method, each VM’s boot method can be changed to EFI after VM deployment.
In addition to the EFI boot setting, ensure that the VM has the following configuration parameters set:
VM Settings > VM options > Advanced > Configuration Parameters > Edit Configuration
pciPassthru.use64bitMMIO TRUE
pciPassthru.64bitMMIOSizeGB 512
To support GPUDirect RDMA, ensure that the VM has the following configuration parameters set:
VM Settings > VM options > Advanced > Configuration Parameters > Edit Configuration
pciPassthru.allowP2P = true
pciPassthru.RelaxACSforP2P = true
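If you manage the worker VMs from the command line, these advanced configuration parameters can also be applied with the VMware govc CLI instead of the vSphere Client. The following is a minimal sketch, assuming govc is installed and authenticated against your vCenter, and that ocp-worker-0 is a placeholder worker VM name:

# Set the 64-bit MMIO parameters listed above
govc vm.change -vm ocp-worker-0 \
  -e pciPassthru.use64bitMMIO=TRUE \
  -e pciPassthru.64bitMMIOSizeGB=512

# Optionally set the peer-to-peer parameters used for GPUDirect RDMA
govc vm.change -vm ocp-worker-0 \
  -e pciPassthru.allowP2P=true \
  -e pciPassthru.RelaxACSforP2P=true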
It is also recommended that you refer to the Running Red Hat OpenShift Container Platform on VMware Cloud Foundation documentation for deployment best practices, system configuration, and reference architecture.
Install Node Feature Discovery Operator
Follow the guidance in Installing the Node Feature Discovery (NFD) Operator to install the Node Feature Discovery Operator.
Install NVIDIA GPU Operator
Follow the guidance in Installing the NVIDIA GPU Operator to install the NVIDIA GPU Operator.
Note
Skip the guidance associated with creating the cluster policy; instead, follow the guidance in the subsequent sections.
Create the NGC secret
OpenShift has a secret object type which provides a mechanism for holding sensitive information such as passwords and private source repository credentials. Next you will create a secret object for storing your NGC API key (the mechanism used to authenticate your access to the NGC container registry).
Note
Before you begin you will need to generate or use an existing API key.
Navigate to Home > Projects and ensure the nvidia-gpu-operator project is selected.
In the OpenShift Container Platform web console, click Secrets from the Workloads drop down.
Click the Create drop down.
Select Image Pull Secret.
Enter the following into each field:
Secret name: ngc-secret
Authentication type: Image registry credentials
Registry server address: nvcr.io/nvaie
Username: $oauthtoken
Password: <API-KEY>
Email: <YOUR-EMAIL>
Click Create.
A pull secret is created.
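If you prefer to create the pull secret from the CLI, the equivalent oc command is shown below. This is a sketch assuming the nvidia-gpu-operator project and the same placeholder values used above:

# Create an image pull secret for the NGC registry in the GPU Operator namespace
oc create secret docker-registry ngc-secret \
  --docker-server=nvcr.io/nvaie \
  --docker-username='$oauthtoken' \
  --docker-password=<API-KEY> \
  --docker-email=<YOUR-EMAIL> \
  -n nvidia-gpu-operator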
Create the ConfigMap for NLS Token
Prerequisites
Generate and download a NLS client license token. See Section 4.6 of the NVIDIA License System User Guide for instructions.
Procedure
Navigate to Home > Projects and ensure the nvidia-gpu-operator project is selected.
Select the Workloads drop down menu.
Select ConfigMaps.
Click Create ConfigMap.
Enter the details for your ConfigMap. The name must be licensing-config. Copy and paste the information for your NLS client token into the client_configuration_token.tok parameter.
Click Create.
The created ConfigMap should resemble the following example output:

kind: ConfigMap
apiVersion: v1
metadata:
  name: licensing-config
data:
  client_configuration_token.tok: >-
    tJ8EKOD5-rN7sSUWyHKsrvVSgfRYucvKo-lg<SNIP>
  gridd.conf: '# empty file'
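The same ConfigMap can also be created from the CLI. This is a sketch assuming the downloaded NLS token is saved as client_configuration_token.tok in the current directory:

# Create the licensing ConfigMap from the NLS client token file
oc create configmap licensing-config \
  -n nvidia-gpu-operator \
  --from-file=client_configuration_token.tok=./client_configuration_token.tok \
  --from-literal=gridd.conf='# empty file'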
Create the Cluster Policy Instance
Now create the cluster policy, which is responsible for maintaining policy resources to create pods in a cluster.
In the OpenShift Container Platform web console, from the side menu, select Operators > Installed Operators, and click NVIDIA GPU Operator.
Select the ClusterPolicy tab, then click Create ClusterPolicy.
Note
The platform assigns the default name gpu-cluster-policy.
Expand the drop down for NVIDIA GPU/vGPU Driver config and then licensingConfig. In the text box labeled configMapName, enter the name of the licensing ConfigMap that was created previously (for example, licensing-config). Check the nlsEnabled checkbox. Refer to the following parameter examples and modify values accordingly.
Note
This ConfigMap was created in the previous section, “Create the ConfigMap for NLS Token”.
configMapName: licensing-config
nlsEnabled: Enabled
Expand the rdma menu and check enabled if you want to deploy GPUDirect RDMA.
Scroll down to specify the repository path under the NVIDIA GPU/vGPU Driver config section. See the parameter examples below and modify values accordingly.
enabled: Enabled
repository: nvcr.io/nvaie
Scroll down further to image name and specify the NVIDIA vGPU driver version under the NVIDIA GPU/vGPU Driver config section.
version: 525.60.13
image: vgpu-guest-driver-3-0
Note
The above version and image are examples for NVIDIA AI Enterprise 3.0. Please update the vGPU driver version and image for the appropriate OpenShift Container Platform version:
OpenShift 4.9: nvcr.io/nvaie/vgpu-guest-driver-3-0:525.60.13-rhcos4.9
OpenShift 4.10: nvcr.io/nvaie/vgpu-guest-driver-3-0:525.60.13-rhcos4.10
OpenShift 4.11: nvcr.io/nvaie/vgpu-guest-driver-3-0:525.60.13-rhcos4.11
Expand the Advanced configuration menu and specify the imagePullSecret.
Note
This secret was created in the section “Create the NGC secret”.
Click Create.
The GPU Operator proceeds to install all the required components to set up the NVIDIA GPUs in the OpenShift Container Platform cluster.
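For reference, the selections made in the form above correspond to fields in the resulting ClusterPolicy spec. The following partial sketch shows how those fields typically appear; the exact layout can vary between GPU Operator versions, so treat it as an illustration rather than a complete manifest:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true
    repository: nvcr.io/nvaie          # NVIDIA AI Enterprise registry
    image: vgpu-guest-driver-3-0       # vGPU guest driver image
    version: 525.60.13                 # example version for NVIDIA AI Enterprise 3.0
    imagePullSecrets:
      - ngc-secret                     # NGC pull secret created earlier
    licensingConfig:
      configMapName: licensing-config  # NLS token ConfigMap created earlier
      nlsEnabled: true
    rdma:
      enabled: true                    # only if GPUDirect RDMA is required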
Note
The installation may take some time to finish, so wait at least 10-20 minutes before starting any troubleshooting.
The status of the newly deployed ClusterPolicy gpu-cluster-policy for the NVIDIA GPU Operator changes to State: ready when the installation succeeds.
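One way to check the ClusterPolicy state from the CLI is sketched below; the ClusterPolicy resource is cluster-scoped, so no namespace needs to be specified:

# Print the ClusterPolicy state; expect "ready" once the installation has completed
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'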
To verify the ClusterPolicy installation from the CLI, run:
$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'
This lists each node and the number of GPUs.
Example output
$ oc get nodes -o=custom-columns='Node:metadata.name,GPUs:status.capacity.nvidia\.com/gpu'
Node                           GPUs
nvaie-ocp-7rfr8-master-0       <none>
nvaie-ocp-7rfr8-master-1       <none>
nvaie-ocp-7rfr8-master-2       <none>
nvaie-ocp-7rfr8-worker-7x5km   1
nvaie-ocp-7rfr8-worker-9jgmk   <none>
nvaie-ocp-7rfr8-worker-jntsp   1
nvaie-ocp-7rfr8-worker-zkggt   <none>
Verify the successful installation of the NVIDIA GPU Operator
Perform the following steps to verify the successful installation of the NVIDIA GPU Operator.
In the OpenShift Container Platform web console, from the side menu, select Workloads > Pods.
Under the Project drop down select the nvidia-gpu-operator project.
Verify the pods are successfully deployed.
Alternatively, from the command line, run the following command:
$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS    RESTARTS   AGE
pod/gpu-feature-discovery-hlpgs                          1/1     Running   0          91m
pod/gpu-operator-8dc8d6648-jzhnr                         1/1     Running   0          94m
pod/nvidia-dcgm-exporter-ds9xd                           1/1     Running   0          91m
pod/nvidia-dcgm-k7tz6                                    1/1     Running   0          91m
pod/nvidia-device-plugin-daemonset-nqxmc                 1/1     Running   0          91m
pod/nvidia-driver-daemonset-49.84.202202081504-0-9df9j   2/2     Running   0          91m
pod/nvidia-node-status-exporter-7bhdk                    1/1     Running   0          91m