Appendix

Install GPU Operator in Proxy Environments

Introduction

This page describes how to successfully deploy the GPU Operator in clusters behind an HTTP proxy. By default, the GPU Operator requires internet access for the following reasons:

  1. Container images need to be pulled during GPU Operator installation.

  2. The driver container needs to download several OS packages prior to driver installation.

To address these requirements, all Kubernetes nodes, as well as the driver container, need to be configured to direct traffic through the proxy.

This document demonstrates how to configure the GPU Operator so that the driver container can successfully download packages behind an HTTP proxy. Since configuring Kubernetes/container runtime components to use a proxy is not specific to the GPU Operator, we do not include those instructions here.

The instructions for OpenShift are different, so skip the section titled HTTP Proxy Configuration for OpenShift if you are not running OpenShift.

Prerequisites

  • The Kubernetes cluster is configured with HTTP proxy settings (the container runtime on each node must be configured to use the HTTP proxy; a minimal sketch is shown below)
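
For reference, a minimal sketch of one common way to satisfy this prerequisite, assuming containerd managed by systemd (the proxy addresses are placeholders; adapt this to your container runtime and environment):

# /etc/systemd/system/containerd.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://<example.proxy.com:port>"
Environment="HTTPS_PROXY=http://<example.proxy.com:port>"
Environment="NO_PROXY=localhost,127.0.0.1,<example.com>"

Apply the change on each node:

$ sudo systemctl daemon-reload
$ sudo systemctl restart containerd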

HTTP Proxy Configuration for OpenShift

For OpenShift, it is recommended to use the cluster-wide Proxy object to provide proxy information for the cluster. Please follow the procedure described in Configuring the cluster-wide proxy from the Red Hat OpenShift public documentation. The GPU Operator automatically injects proxy-related environment variables into the driver container based on the information present in the cluster-wide Proxy object.
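
For illustration, a cluster-wide Proxy object generally looks like the following sketch (placeholder values; refer to the Red Hat documentation referenced above for the authoritative procedure):

apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster
spec:
  httpProxy: http://<example.proxy.com:port>
  httpsProxy: http://<example.proxy.com:port>
  noProxy: <example.com>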

Note

  • GPU Operator v1.8.0 does not work well on Red Hat OpenShift when a cluster-wide Proxy object is configured; it causes constant restarts of the driver container. This will be fixed in the upcoming v1.8.2 patch release.

HTTP Proxy Configuration

First, get the values.yaml file used for GPU Operator configuration:

$ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator/v1.7.0/deployments/gpu-operator/values.yaml

Note

Replace v1.7.0 in the above command with the version you want to use.

Specify driver.env in values.yaml with appropriate HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables (in both uppercase and lowercase).

driver:
   env:
   - name: HTTPS_PROXY
     value: http://<example.proxy.com:port>
   - name: HTTP_PROXY
     value: http://<example.proxy.com:port>
   - name: NO_PROXY
     value: <example.com>
   - name: https_proxy
     value: http://<example.proxy.com:port>
   - name: http_proxy
     value: http://<example.proxy.com:port>
   - name: no_proxy
     value: <example.com>

Note

  • Proxy-related environment variables are automatically injected by the GPU Operator into the driver container so that the proxy is used when downloading the necessary packages.

  • If an HTTPS proxy server is set up, change the values of HTTPS_PROXY and https_proxy to use https instead.

Deploy GPU Operator

Download and deploy the GPU Operator Helm chart with the updated values.yaml.

Fetch the chart from the NGC repository. Version v1.10.0 is used as an example in the command below:

$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v1.10.0.tgz

Install the GPU Operator with updated values.yaml:

$ helm install --wait gpu-operator \
     -n gpu-operator --create-namespace \
     gpu-operator-v1.10.0.tgz \
     -f values.yaml

Check the status of the pods to ensure all the containers are running:

$ kubectl get pods -n gpu-operator

Install GPU Operator in Air-gapped Environments

Introduction

This page describes how to successfully deploy the GPU Operator in clusters with restricted internet access. By default, the GPU Operator requires internet access for the following reasons:

  1. Container images need to be pulled during GPU Operator installation.

  2. The driver container needs to download several OS packages prior to driver installation.

To address these requirements, it may be necessary to create a local image registry and/or a local package repository so that the necessary images and packages are available for your cluster. In subsequent sections, we detail how to configure the GPU Operator to use local image registries and local package repositories. If your cluster is behind a proxy, also follow the steps from Install GPU Operator in Proxy Environments.

Different steps are required for environments with varying levels of internet connectivity. The supported use cases/environments and their network flows are listed below:

  1. HTTP Proxy with full Internet access
     Pulling images:   K8s node -> HTTP Proxy -> Internet Image Registry
     Pulling packages: Driver container -> HTTP Proxy -> Internet Package Repository

  2. HTTP Proxy with limited Internet access
     Pulling images:   K8s node -> HTTP Proxy -> Internet Image Registry
     Pulling packages: Driver container -> HTTP Proxy -> Local Package Repository

  3a. Full Air-Gapped (with HTTP Proxy)
      Pulling images:   K8s node -> Local Image Registry
      Pulling packages: Driver container -> HTTP Proxy -> Local Package Repository

  3b. Full Air-Gapped (without HTTP Proxy)
      Pulling images:   K8s node -> Local Image Registry
      Pulling packages: Driver container -> Local Package Repository

Note

For Red Hat OpenShift deployments in air-gapped environments (use cases 2, 3a, and 3b), see the documentation here.

Note

Ensure that Kubernetes nodes can successfully reach the local DNS server(s). Public name resolution for the image registry and package repositories is mandatory for use cases 1 and 2.
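
As a quick check, verify name resolution from a node, for example (the local hostname is a placeholder):

$ nslookup nvcr.io
$ nslookup <local-registry-or-package-repository-hostname>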

Before proceeding to the next sections, get the values.yaml file used for GPU Operator configuration.

$ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator/v1.7.0/deployments/gpu-operator/values.yaml

Note

Replace v1.7.0 in the above command with the version you want to use.

Local Image Registry

Without internet access, the GPU Operator requires all images to be hosted in a local image registry that is accessible to all nodes in the cluster. To allow the GPU Operator to work with a local registry, users can specify the local repository, image, and tag, along with pull secrets, in values.yaml.

Pulling and pushing container images to local registry

To pull the correct images from the NVIDIA registry, use the repository, image, and version fields specified in values.yaml.

The general syntax for the container image is <repository>/<image>:<version>.

If the version is not specified, you can retrieve the information from the NVIDIA NGC catalog (https://ngc.nvidia.com/catalog) by checking the available tags for an image.

An example is shown below with the gpu-operator container image:

operator:
    repository: nvcr.io/nvidia
    image: gpu-operator
    version: "v1.9.0"

For instance, to pull the gpu-operator image version v1.9.0, use the following command:

$ docker pull nvcr.io/nvidia/gpu-operator:v1.9.0

There is one caveat with regard to the driver image: the version field must be appended with the name of the OS running on the worker node.

driver:
    repository: nvcr.io/nvidia
    image: driver
    version: "470.82.01"

To pull the driver image for Ubuntu 20.04:

$ docker pull nvcr.io/nvidia/driver:470.82.01-ubuntu20.04

To pull the driver image for CentOS 8:

$ docker pull nvcr.io/nvidia/driver:470.82.01-centos8

To push the images to the local registry, tag the pulled images by prefixing them with the local registry information.

Using the above examples, this will result in:

$ docker tag nvcr.io/nvidia/gpu-operator:v1.9.0 <local-registry>/<local-path>/gpu-operator:v1.9.0
$ docker tag nvcr.io/nvidia/driver:470.82.01-ubuntu20.04 <local-registry>/<local-path>/driver:470.82.01-ubuntu20.04

Finally, push the images to the local registry:

$ docker push <local-registry>/<local-path>/gpu-operator:v1.9.0
$ docker push <local-registry>/<local-path>/driver:470.82.01-ubuntu20.04

Update values.yaml with local registry information in the repository field.

Note

Replace <repo.example.com:port> below with your local image registry URL and port.

Sample of values.yaml for GPU Operator v1.9.0:

operator:
  repository: <repo.example.com:port>
  image: gpu-operator
  version: "v1.9.0"
  imagePullSecrets: []
  initContainer:
    image: cuda
    repository: <repo.example.com:port>
    version: 11.4.2-base-ubi8

validator:
  image: gpu-operator-validator
  repository: <repo.example.com:port>
  version: "v1.9.0"
  imagePullSecrets: []

driver:
  repository: <repo.example.com:port>
  image: driver
  version: "470.82.01"
  imagePullSecrets: []
  manager:
    image: k8s-driver-manager
    repository: <repo.example.com:port>
    version: v0.2.0

toolkit:
  repository: <repo.example.com:port>
  image: container-toolkit
  version: 1.7.2-ubuntu18.04
  imagePullSecrets: []

devicePlugin:
  repository: <repo.example.com:port>
  image: k8s-device-plugin
  version: v0.10.0-ubi8
  imagePullSecrets: []

dcgmExporter:
  repository: <repo.example.com:port>
  image: dcgm-exporter
  version: 2.3.1-2.6.0-ubuntu20.04
  imagePullSecrets: []

gfd:
  repository: <repo.example.com:port>
  image: gpu-feature-discovery
  version: v0.4.1
  imagePullSecrets: []

nodeStatusExporter:
  enabled: false
  repository: <repo.example.com:port>
  image: gpu-operator-validator
  version: "v1.9.0"

migManager:
  enabled: true
  repository: <repo.example.com:port>
  image: k8s-mig-manager
  version: v0.2.0-ubuntu20.04

node-feature-discovery:
  image:
    repository: <repo.example.com:port>
    pullPolicy: IfNotPresent
    # tag, if defined will use the given image tag, else Chart.AppVersion will be used
    # tag:
  imagePullSecrets: []

Local Package Repository

The driver container deployed as part of the GPU operator requires certain packages to be available as part of the driver installation. In restricted internet access or air-gapped installations, users are required to create a local mirror repository for their OS distribution and make the following packages available:

Note

KERNEL_VERSION is the version of the kernel running on the GPU node. GCC_VERSION is the gcc version matching the one used to build that kernel. See the example after the package lists below for one way to determine these values.

ubuntu:
   linux-headers-${KERNEL_VERSION}
   linux-image-${KERNEL_VERSION}
   linux-modules-${KERNEL_VERSION}

centos:
   elfutils-libelf.x86_64
   elfutils-libelf-devel.x86_64
   kernel-headers-${KERNEL_VERSION}
   kernel-devel-${KERNEL_VERSION}
   kernel-core-${KERNEL_VERSION}
   gcc-${GCC_VERSION}

rhel/rhcos:
   kernel-headers-${KERNEL_VERSION}
   kernel-devel-${KERNEL_VERSION}
   kernel-core-${KERNEL_VERSION}
   gcc-${GCC_VERSION}
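
One way to determine these values on the GPU node (the gcc version used to build the running kernel is reported in /proc/version):

$ uname -r           # KERNEL_VERSION
$ cat /proc/version  # includes the gcc version used to build the running kernel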

For example, for Ubuntu these packages can be found at archive.ubuntu.com, so this is the mirror that needs to be replicated locally for your cluster. Using apt-mirror, these packages can be automatically mirrored to your local package repository server, as sketched below.
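
A minimal apt-mirror sketch for Ubuntu 20.04 (/etc/apt/mirror.list; the base path and component list are assumptions, adjust them for your environment):

set base_path /var/spool/apt-mirror
deb http://archive.ubuntu.com/ubuntu focal main universe
deb http://archive.ubuntu.com/ubuntu focal-updates main universe
deb http://archive.ubuntu.com/ubuntu focal-security main universe
clean http://archive.ubuntu.com/ubuntu

Run the mirror job with:

$ sudo apt-mirror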

For CentOS, reposync can be used to create the local mirror, as shown below.
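
A minimal reposync sketch for CentOS 8 (the repository IDs and download path are assumptions, adjust them for your environment):

$ sudo dnf reposync --repoid=baseos --repoid=appstream --repoid=extras \
     --download-metadata --download-path=/var/www/html/repos/centos/8/x86_64/os/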

Once all of the above required packages are mirrored to the local repository, repo lists need to be created following distribution-specific documentation. A ConfigMap containing the repo list file needs to be created in the namespace where the GPU Operator is deployed.

An example repo list for Ubuntu 20.04 (access to the local package repository over HTTP) is shown below:

custom-repo.list:

deb [arch=amd64] http://<local pkg repository>/ubuntu/mirror/archive.ubuntu.com/ubuntu focal main universe
deb [arch=amd64] http://<local pkg repository>/ubuntu/mirror/archive.ubuntu.com/ubuntu focal-updates main universe
deb [arch=amd64] http://<local pkg repository>/ubuntu/mirror/archive.ubuntu.com/ubuntu focal-security main universe

An example repo list for CentOS 8 (access to the local package repository over HTTP) is shown below:

custom-repo.repo:

[baseos]
name=CentOS Linux $releasever - BaseOS
baseurl=http://<local pkg repository>/repos/centos/$releasever/$basearch/os/baseos/
gpgcheck=0
enabled=1

[appstream]
name=CentOS Linux $releasever - AppStream
baseurl=http://<local pkg repository>/repos/centos/$releasever/$basearch/os/appstream/
gpgcheck=0
enabled=1

[extras]
name=CentOS Linux $releasever - Extras
baseurl=http://<local pkg repository>/repos/centos/$releasever/$basearch/os/extras/
gpgcheck=0
enabled=1

Create the ConfigMap:

$ kubectl create configmap repo-config -n gpu-operator --from-file=<path-to-repo-list-file>

Once the ConfigMap is created using the above command, update values.yaml with this information so that the GPU Operator mounts the repo configuration within the driver container to pull the required packages. Based on the OS distribution, the GPU Operator automatically mounts this ConfigMap into the appropriate directory.

driver:
   repoConfig:
      configMapName: repo-config

If self-signed certificates are used for an HTTPS-based internal repository, create a ConfigMap containing those certificates and provide it during the GPU Operator installation. Based on the OS distribution, the GPU Operator automatically mounts this ConfigMap into the appropriate directory.

$ kubectl create configmap cert-config -n gpu-operator --from-file=<path-to-pem-file1> --from-file=<path-to-pem-file2>

driver:
   certConfig:
      name: cert-config

Deploy GPU Operator

Download and deploy the GPU Operator Helm chart with the updated values.yaml.

Fetch the chart from the NGC repository. Version v1.9.0 is used as an example in the command below:

$ helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v1.9.0.tgz

Install the GPU Operator with updated values.yaml:

$ helm install --wait gpu-operator \
     -n gpu-operator --create-namespace \
     gpu-operator-v1.9.0.tgz \
     -f values.yaml

Check the status of the pods to ensure all the containers are running:

$ kubectl get pods -n gpu-operator

Considerations when Installing with Outdated Kernels in Cluster

The driver container deployed as part of the GPU Operator requires certain packages to be available as part of the driver installation. On GPU nodes where the running kernel is not the latest, the driver container may fail to find the right version of these packages (e.g. kernel-headers, kernel-devel) that correspond to the running kernel version. In the driver container logs, you will most likely see the following error message: Could not resolve Linux kernel version.
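
To confirm this failure mode, inspect the logs of the driver pod, for example (the pod name below is a placeholder; find it with the first command):

$ kubectl get pods -n gpu-operator
$ kubectl logs -n gpu-operator <nvidia-driver-daemonset-pod-name>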

In general, upgrading your system to the latest kernel should fix this issue. If this is not an option, the following workaround can be used to successfully deploy the GPU Operator when GPU nodes in your cluster are not running the latest kernel.

Add Archived Package Repositories

The workaround is to find the package archive containing packages for your outdated kernel and to add this repository to the package manager running inside the driver container. To achieve this, we can simply mount a repository list file into the driver container using a ConfigMap. The ConfigMap containing the repository list file needs to be created in the gpu-operator namespace.

Let us demonstrate this workaround via an example. The system used in this example is running CentOS 7 with an outdated kernel:

$ uname -r
3.10.0-1062.12.1.el7.x86_64

The official archive for older CentOS packages is https://vault.centos.org/. Typically, most archived CentOS repositories are found in /etc/yum.repos.d/CentOS-Vault.repo but they are disabled by default. If the appropriate archive repository was enabled, then the driver container would resolve the kernel version and be able to install the correct versions of the prerequisite packages.

We can simply drop in a replacement of /etc/yum.repos.d/CentOS-Vault.repo to ensure the appropriate CentOS archive is enabled. For the kernel running in this example, the CentOS-7.7.1908 archive contains the kernel-headers version we are looking for. Here is our example drop-in replacement file:

[C7.7.1908-base]
name=CentOS-7.7.1908 - Base
baseurl=http://vault.centos.org/7.7.1908/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=1

[C7.7.1908-updates]
name=CentOS-7.7.1908 - Updates
baseurl=http://vault.centos.org/7.7.1908/updates/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
enabled=1

Once the repo list file is created, we can create a ConfigMap for it:

$ kubectl create configmap repo-config -n gpu-operator --from-file=<path-to-repo-list-file>

Once the ConfigMap is created using the above command, update values.yaml with this information so that the GPU Operator mounts the repo configuration within the driver container to pull the required packages.

For Ubuntu:

driver:
   repoConfig:
      configMapName: repo-config
      destinationDir: /etc/apt/sources.list.d

For RHEL/Centos/RHCOS:

driver:
   repoConfig:
      configMapName: repo-config
      destinationDir: /etc/yum.repos.d

Deploy GPU Operator with updated values.yaml:

$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     -f values.yaml

Check the status of the pods to ensure all the containers are running:

$ kubectl get pods -n gpu-operator

Customizing NVIDIA GPU Driver Parameters during Installation

The NVIDIA Driver kernel modules accept a number of parameters which can be used to customize the behavior of the driver. Most of the parameters are documented in the NVIDIA Driver README. By default, the GPU Operator loads the kernel modules with default values. Starting with v1.10, the GPU Operator provides the ability to pass custom parameters to the kernel modules that get loaded as part of the NVIDIA Driver installation (e.g. nvidia, nvidia-modeset, nvidia-uvm, and nvidia-peermem).

To pass custom parameters, execute the following steps.

Create a configuration file named <module>.conf, where <module> is the name of the kernel module the parameters are for. The file should contain parameters as key-value pairs, one parameter per line. In the example below, we pass one parameter to the nvidia module to disable the use of GSP firmware.

$ cat nvidia.conf
NVreg_EnableGpuFirmware=0

Create a ConfigMap for the configuration file. If multiple modules are being configured, pass multiple files when creating the ConfigMap.

$ kubectl create configmap kernel-module-params -n gpu-operator --from-file=nvidia.conf=./nvidia.conf

Install the GPU Operator and set driver.kernelModuleConfig.name to the name of the ConfigMap containing the kernel module parameters.

$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.kernelModuleConfig.name="kernel-module-params"

Installing Precompiled and Canonical Signed Drivers on Ubuntu 20.04 and 22.04

The GPU Operator supports deploying NVIDIA precompiled and signed drivers from Canonical on Ubuntu 20.04 and 22.04 (x86 platform only). This is required when Secure Boot is enabled on the nodes. To use these drivers, install the GPU Operator with the option --set driver.version=<DRIVER_BRANCH>-signed.

$ helm install --wait gpu-operator \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.version=<DRIVER_BRANCH>-signed

The currently supported DRIVER_BRANCH values are 470, 510, and 515, which install the latest drivers available on that branch for the currently running kernel version.

The following packages are used by the driver container in this case:

  • linux-objects-nvidia-${DRIVER_BRANCH}-server-${KERNEL_VERSION} - Linux kernel nvidia modules.

  • linux-signatures-nvidia-${KERNEL_VERSION} - Linux kernel signatures for nvidia modules.

  • linux-modules-nvidia-${DRIVER_BRANCH}-server-${KERNEL_VERSION} - Meta package for nvidia driver modules, signatures and kernel interfaces.

  • nvidia-utils-${DRIVER_BRANCH}-server - NVIDIA driver support binaries.

  • nvidia-compute-utils-${DRIVER_BRANCH}-server - NVIDIA compute utilities (includes nvidia-persistenced).

Note

  • Before upgrading the kernel on the worker nodes, please ensure that the above packages are available for that kernel version; otherwise the upgrade will cause driver installation failures.

To check whether the above packages are available for a specific kernel version, use the following commands (in this example, we use the 515 branch):

$ KERNEL_VERSION=$(uname -r)
$ DRIVER_BRANCH=515
$ sudo apt-get update
$ sudo apt-cache show linux-modules-nvidia-${DRIVER_BRANCH}-server-${KERNEL_VERSION}

A successful output is shown below:

$ sudo apt-cache show linux-modules-nvidia-${DRIVER_BRANCH}-server-${KERNEL_VERSION}

Package: linux-modules-nvidia-515-server-5.15.0-56-generic
Architecture: amd64
Version: 5.15.0-56.62+1
Priority: optional
Section: restricted/kernel
Source: linux-restricted-modules
Origin: Ubuntu
Maintainer: Canonical Kernel Team <kernel-team@lists.ubuntu.com>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 34
Depends: debconf (>= 0.5) | debconf-2.0, linux-image-5.15.0-56-generic | linux-image-unsigned-5.15.0-56-generic, linux-signatures-nvidia-5.15.0-56-generic  (= 5.15.0-56.62+1), linux-objects-nvidia-515-server-5.15.0-56-generic (= 5.15.0-56.62+1), nvidia-kernel-common-515-server (<= 515.86.01-1), nvidia-kernel-common-515-server (>= 515.86.01)
Filename: pool/restricted/l/linux-restricted-modules/linux-modules-nvidia-515-server-5.15.0-56-generic_5.15.0-56.62+1_amd64.deb
Size: 7040
MD5sum: 530d817653545eaac63ec64d6edc115c
SHA1: e2fd492c06a9be7d0a603d6861d7e35a267d2943
SHA256: 999477c5bd0b213196ed93fd6340a0183dc3d8202be2aa5a008d50f3ba184a3a
SHA512: 45b2bd3f377449742c92b5f7c89c09540edd3e5162cc8d87e3a2c5c45939595c9972982a1ff4e520a56b1525b9f208a2b791619f028f6b420faec0892c430632
Description-en: Linux kernel nvidia modules for version 5.15.0-56
This package pulls together the Linux kernel nvidia modules for
version 5.15.0-56 with the appropriate signatures.
.
You likely do not want to install this package directly. Instead, install the
one of the linux-modules-nvidia-515-server-generic* meta-packages,
which will ensure that upgrades work correctly, and that supporting packages are
also installed.
Description-md5: 13391e5f98aed1dde0d387f22d097bed