Getting Started
This document provides instructions, including prerequisites, for getting started with the NVIDIA GPU Operator.
Red Hat OpenShift 4
For installing the GPU Operator on clusters with Red Hat OpenShift using RHCOS worker nodes, follow the user guide.
VMware vSphere with Tanzu
For installing the GPU Operator on VMware vSphere with Tanzu leveraging NVIDIA AI Enterprise, follow the NVIDIA AI Enterprise document.
Google Cloud Anthos
For getting started with NVIDIA GPUs for Google Cloud Anthos, follow the getting started document.
Prerequisites
Before installing the GPU Operator, ensure that the Kubernetes cluster meets the following prerequisites.
Nodes must be configured with a container engine such as Docker CE/EE, cri-o, or containerd. For Docker, follow the official install instructions.
Node Feature Discovery (NFD) is a dependency for the Operator on each node. By default, NFD master and worker are automatically deployed by the Operator. If NFD is already running in the cluster prior to the deployment of the Operator, the Operator can be configured not to install NFD (see the Helm example after the note below).
For monitoring in Kubernetes 1.13 and 1.14, enable the kubelet KubeletPodResources feature gate. From Kubernetes 1.15 onwards, it's enabled by default.
Note
To enable the KubeletPodResources feature gate, run the following command:
$ echo -e "KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true" | sudo tee /etc/default/kubelet
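For example, if NFD is already running in the cluster, the chart's own NFD deployment can be turned off at install time. A minimal sketch, assuming the nfd.enabled chart parameter described in the options table later in this document:
$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set nfd.enabled=false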
Before installing the GPU Operator on NVIDIA vGPU, ensure the following.
The NVIDIA vGPU Host Driver version 12.0 (or later) is pre-installed on all hypervisors hosting NVIDIA vGPU accelerated Kubernetes worker node virtual machines. Please refer to NVIDIA vGPU Documentation for details.
An NVIDIA vGPU License Server is installed and reachable from all Kubernetes worker node virtual machines.
A private registry is available to upload the NVIDIA vGPU specific driver container image.
Each Kubernetes worker node in the cluster has access to the private registry. Private registry access is usually managed through imagePullSecrets. See the Kubernetes Documentation for more information. You are required to provide these secrets to the NVIDIA GPU Operator in the driver section of the values.yaml file (a minimal example follows the note below).
Git and Docker/Podman are required to build the vGPU driver image from the source repository and push it to the local registry.
Note
Uploading the NVIDIA vGPU driver to a publicly available repository or otherwise publicly sharing the driver is a violation of the NVIDIA vGPU EULA.
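As a minimal sketch of the registry prerequisites above (the registry host registry.example.com and the secret name registry-secret are hypothetical placeholders), the pull secret can be created in the operator namespace and then referenced in the driver section of values.yaml:
$ kubectl create namespace gpu-operator
$ kubectl create secret docker-registry registry-secret \
     --docker-server=registry.example.com \
     --docker-username=<username> \
     --docker-password=<password> \
     -n gpu-operator

driver:
  repository: registry.example.com/nvidia
  imagePullSecrets:
    - registry-secret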
The rest of this document includes instructions for installing the GPU Operator on supported Linux distributions.
Install NVIDIA GPU Operator
Install Helm
The preferred method to deploy the GPU Operator is using helm.
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
Now, add the NVIDIA Helm repository:
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
Install the GPU Operator
The GPU Operator Helm chart offers a number of customizable options that can be configured depending on your environment.

Chart Customization Options
The following options are available when using the Helm chart. These options can be used with --set when installing via Helm. The table below lists commonly used parameters; run helm show values nvidia/gpu-operator to see the complete list of parameters and their defaults.

Parameter | Description | Default
---|---|---
daemonsets.annotations | Map of custom annotations to add to all GPU Operator managed pods. | {}
daemonsets.labels | Map of custom labels to add to all GPU Operator managed pods. | {}
driver.enabled | By default, the Operator deploys NVIDIA drivers as a container on the system. Set this value to false when using the Operator on systems with pre-installed drivers. | true
driver.repository | The images are downloaded from NGC. Specify another image repository when using custom driver images. | nvcr.io/nvidia
driver.rdma.enabled | Controls whether the driver daemonset should build and load the nvidia-peermem kernel module. | false
driver.rdma.useHostMofed | Indicates if MOFED is directly pre-installed on the host. This is used to build and load the nvidia-peermem kernel module. | false
driver.version | Version of the NVIDIA datacenter driver supported by the Operator. | Depends on the version of the Operator. See the Component Matrix for more information on supported drivers.
mig.strategy | Controls the strategy to be used with MIG on supported NVIDIA GPUs. Options are either mixed or single. | single
migManager.enabled | The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed. By default, the MIG manager only runs on nodes with GPUs that support MIG (for example, A100). | true
nfd.enabled | Deploys the Node Feature Discovery plugin as a daemonset. Set this variable to false if NFD is already running in the cluster. | true
operator.defaultRuntime | DEPRECATED as of v1.9. | docker
operator.labels | DEPRECATED as of v1.9. Map of custom labels that will be added to all GPU Operator managed pods. | {}
psp.enabled | The GPU Operator deploys PodSecurityPolicies when enabled. | false
toolkit.enabled | By default, the Operator deploys the NVIDIA Container Toolkit (nvidia-docker2 stack) as a container on the system. Set this value to false when using the Operator on systems with pre-installed NVIDIA runtimes. | true
Namespace
Prior to GPU Operator v1.9, the operator was installed in the default
namespace while all operands were
installed in the gpu-operator-resources
namespace.
Starting with GPU Operator v1.9, both the operator and operands get installed in the same namespace.
The namespace is configurable and is determined during installation. For example, to install the GPU Operator
in the gpu-operator
namespace:
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
If a namespace is not specified during installation, all GPU Operator components will be installed in the
default
namespace.
Operands
By default, the GPU Operator operands are deployed on all GPU worker nodes in the cluster.
GPU worker nodes are identified by the presence of the label feature.node.kubernetes.io/pci-10de.present=true
,
where 0x10de
is the PCI vendor ID assigned to NVIDIA.
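To list the nodes that carry this label, and will therefore receive the operands, you can use a standard label selector:
$ kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true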
To disable operands from getting deployed on a GPU worker node, label the node with nvidia.com/gpu.deploy.operands=false
.
$ kubectl label nodes $NODE nvidia.com/gpu.deploy.operands=false
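To re-enable operands on that node later, remove the label using the standard kubectl syntax for label removal:
$ kubectl label nodes $NODE nvidia.com/gpu.deploy.operands-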
Common Deployment Scenarios
In this section, we present some common deployment recipes when using the Helm chart to install the GPU Operator.
Bare-metal/Passthrough with default configurations on Ubuntu
In this scenario, the default configuration options are used:
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
Note
For installing on Secure Boot systems or using Precompiled modules refer to Precompiled Driver Containers.
Bare-metal/Passthrough with default configurations on Red Hat Enterprise Linux
In this scenario, the default configuration options are used:
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
Note
When using RHEL8 with Kubernetes, SELinux has to be enabled (either in permissive or enforcing mode) for use with the GPU Operator. Additionally, network restricted environments are not supported.
Bare-metal/Passthrough with default configurations on CentOS
In this scenario, the CentOS toolkit image is used:
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set toolkit.version=1.7.1-centos7
Note
For CentOS 8 systems, use toolkit.version=1.7.1-centos8.
Replace the 1.7.1 toolkit version used here with the latest available release.
NVIDIA vGPU
Note
The GPU Operator with NVIDIA vGPUs requires additional steps to build a private driver image prior to install. Refer to the document NVIDIA vGPU for detailed instructions on the workflow and required values of the variables used in this command.
The command below will install the GPU Operator with its default configuration for vGPU:
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.repository=$PRIVATE_REGISTRY \
--set driver.version=$VERSION \
--set driver.imagePullSecrets={$REGISTRY_SECRET_NAME} \
--set driver.licensingConfig.configMapName=licensing-config
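The licensing-config ConfigMap referenced above must already exist in the operator namespace. As a rough sketch, assuming your vGPU licensing settings live in a file named gridd.conf (and, for NLS-based licensing, a client configuration token as described in the NVIDIA vGPU document), it could be created as follows:
$ kubectl create configmap licensing-config \
     -n gpu-operator --from-file=gridd.conf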
NVIDIA AI Enterprise
Refer to GPU Operator with NVIDIA AI Enterprise.
Bare-metal/Passthrough with pre-installed NVIDIA drivers
In this example, the user has already pre-installed NVIDIA drivers as part of the system image:
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false
Bare-metal/Passthrough with pre-installed drivers and NVIDIA Container Toolkit
In this example, the user has already pre-installed the NVIDIA drivers and NVIDIA Container Toolkit (nvidia-docker2
)
as part of the system image.
Note
These steps should be followed when using the GPU Operator v1.9+ on DGX A100 systems with DGX OS 5.1+.
Before installing the operator, ensure that the following configurations are modified depending on the container runtime configured in your cluster.
Docker:
Update the Docker configuration to add nvidia as the default runtime. The nvidia runtime should be set up as the default container runtime for Docker on GPU nodes. This can be done by adding the default-runtime line to the Docker daemon config file, which is usually located on the system at /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart the Docker daemon to complete the installation after setting the default runtime:
$ sudo systemctl restart docker
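As an optional sanity check, you can confirm that Docker now reports nvidia as its default runtime:
$ docker info | grep -i "default runtime"
The output should show nvidia as the default runtime.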
Containerd:
Update containerd to use nvidia as the default runtime and add the nvidia runtime configuration. This can be done by adding the config below to /etc/containerd/config.toml and restarting the containerd service.
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
Restart the containerd daemon to complete the installation after setting the default runtime:
$ sudo systemctl restart containerd
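As an optional sanity check, confirm that the default runtime name in the containerd config is now nvidia:
$ grep default_runtime_name /etc/containerd/config.toml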
Install the GPU operator with the following options:
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false \
--set toolkit.enabled=false
Bare-metal/Passthrough with pre-installed NVIDIA Container Toolkit (but no drivers)
In this example, the user has already pre-installed the NVIDIA Container Toolkit (nvidia-docker2
) as part of the system image.
Before installing the operator, ensure that the following configurations are modified depending on the container runtime configured in your cluster.
Docker:
Update the Docker configuration to add nvidia as the default runtime. The nvidia runtime should be set up as the default container runtime for Docker on GPU nodes. This can be done by adding the default-runtime line to the Docker daemon config file, which is usually located on the system at /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart the Docker daemon to complete the installation after setting the default runtime:
$ sudo systemctl restart docker
Containerd:
Update containerd to use nvidia as the default runtime and add the nvidia runtime configuration. This can be done by adding the config below to /etc/containerd/config.toml and restarting the containerd service.
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
Restart the containerd daemon to complete the installation after setting the default runtime:
$ sudo systemctl restart containerd
Configure the toolkit to use the root directory of the driver installation as /run/nvidia/driver, which is the path mounted by the driver container.
$ sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
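After this edit, and assuming the default config layout shipped with the toolkit, the [nvidia-container-cli] section of /etc/nvidia-container-runtime/config.toml should contain an uncommented root entry similar to the following:
[nvidia-container-cli]
root = "/run/nvidia/driver"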
Once these steps are complete, install the GPU Operator with the following options (which will provision a driver):
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set toolkit.enabled=false
Custom driver image (based on a specific driver version)
If you want to use custom driver container images (for example, version 465.27), then you need to build a new driver container image. Follow these steps:
Rebuild the driver container by specifying the $DRIVER_VERSION argument when building the Docker image. For reference, the driver container Dockerfiles are available in the driver Git repository. Build the container using the appropriate Dockerfile. For example:
$ docker build --pull \
     --build-arg DRIVER_VERSION=465.27 \
     -t nvidia/driver:465.27-ubuntu20.04 \
     --file Dockerfile .
Ensure that the driver container is tagged as shown in the example by using the driver:<version>-<os> schema.
Specify the new driver image and repository by overriding the defaults in the Helm install command. For example:
$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.repository=docker.io/nvidia \
     --set driver.version="465.27"
Note that these instructions are provided for reference and evaluation purposes. Not using the standard releases of the GPU Operator from NVIDIA would mean limited support for such custom configurations.
Custom configuration for runtime containerd
When containerd is used as the container runtime, the following configuration options are used with the container-toolkit deployed by the GPU Operator:
toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
These options are defined as follows:
- CONTAINERD_CONFIG: The path on the host to the containerd config you would like to have updated with support for the nvidia-container-runtime. By default this will point to /etc/containerd/config.toml (the default location for containerd). It should be customized if your containerd installation is not in the default location.
- CONTAINERD_SOCKET: The path on the host to the socket file used to communicate with containerd. The Operator will use this to send a SIGHUP signal to the containerd daemon to reload its config. By default this will point to /run/containerd/containerd.sock (the default location for containerd). It should be customized if your containerd installation is not in the default location.
- CONTAINERD_RUNTIME_CLASS: The name of the Runtime Class you would like to associate with the nvidia-container-runtime. Pods launched with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS will always run with the nvidia-container-runtime. The default CONTAINERD_RUNTIME_CLASS is nvidia.
- CONTAINERD_SET_AS_DEFAULT: A flag indicating whether you want to set nvidia-container-runtime as the default runtime used to launch all containers. When set to false, only containers in pods with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS will be run with the nvidia-container-runtime. The default value is true.
For Rancher Kubernetes Engine 2 (RKE2), set the following in the ClusterPolicy.
toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
These options can be passed to the GPU Operator at install time, as shown below.
helm install -n gpu-operator --create-namespace \
nvidia/gpu-operator $HELM_OPTIONS \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true
Proxy Environments
Refer to the section Install GPU Operator in Proxy Environments for more information on how to install the Operator on clusters behind a HTTP proxy.
Air-gapped Environments
Refer to the section Install NVIDIA GPU Operator in Air-Gapped Environments for more information on how to install the Operator in air-gapped environments.
Multi-Instance GPU (MIG)
Refer to the document GPU Operator with MIG for more information on how to use the Operator with Multi-Instance GPU (MIG) on NVIDIA Ampere products. For guidance on configuring MIG support for the NVIDIA GPU Operator in an OpenShift Container Platform cluster, see the user guide.
KubeVirt / OpenShift Virtualization
Refer to the document GPU Operator with KubeVirt for more information on how to use the GPU Operator to provision GPU nodes for running KubeVirt virtual machines with access to GPUs. For guidance on using the GPU Operator with OpenShift Virtualization, refer to the document NVIDIA GPU Operator with OpenShift Virtualization.
Outdated Kernels
Refer to the section Considerations when Installing with Outdated Kernels in Cluster for more information on how to install the Operator successfully when nodes in the cluster are not running the latest kernel.
Verify GPU Operator Install
Once the Helm chart is installed, check the status of the pods to ensure all the containers are running and the validation is complete:
$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-crrsq 1/1 Running 0 60s
gpu-operator-7fb75556c7-x8spj 1/1 Running 0 5m13s
gpu-operator-node-feature-discovery-master-58d884d5cc-w7q7b 1/1 Running 0 5m13s
gpu-operator-node-feature-discovery-worker-6rht2 1/1 Running 0 5m13s
gpu-operator-node-feature-discovery-worker-9r8js 1/1 Running 0 5m13s
nvidia-container-toolkit-daemonset-lhgqf 1/1 Running 0 4m53s
nvidia-cuda-validator-rhvbb 0/1 Completed 0 54s
nvidia-dcgm-5jqzg 1/1 Running 0 60s
nvidia-dcgm-exporter-h964h 1/1 Running 0 60s
nvidia-device-plugin-daemonset-d9ntc 1/1 Running 0 60s
nvidia-device-plugin-validator-cm2fd 0/1 Completed 0 48s
nvidia-driver-daemonset-5xj6g 1/1 Running 0 4m53s
nvidia-mig-manager-89z9b 1/1 Running 0 4m53s
nvidia-operator-validator-bwx99 1/1 Running 0 58s
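As an additional check, you can verify that GPUs are now advertised as allocatable resources on a worker node (replace the node name with one of your GPU nodes):
$ kubectl describe node <gpu-node-name> | grep nvidia.com/gpu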
We can now proceed to running some sample GPU workloads to verify that the Operator (and its components) are working correctly.
Running Sample GPU Applications
CUDA VectorAdd
In the first example, let’s run a simple CUDA sample, which adds two vectors together:
Create a file, such as cuda-vectoradd.yaml, with contents like the following:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
Run the pod:
$ kubectl apply -f cuda-vectoradd.yaml
The pod starts, runs the vectorAdd command, and then exits.
View the logs from the container:
$ kubectl logs pod/cuda-vectoradd
Example Output
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Remove the stopped pod:
$ kubectl delete -f cuda-vectoradd.yaml
Example Output
pod "cuda-vectoradd" deleted
Jupyter Notebook
In the next example, let’s try running a TensorFlow Jupyter notebook.
First, deploy the pods:
$ kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml
Check to determine if the pod has successfully started:
$ kubectl get pod tf-notebook
NAMESPACE NAME READY STATUS RESTARTS AGE
default tf-notebook 1/1 Running 0 3m45s
Since the example also includes a service, let’s obtain the external port at which the notebook is accessible:
$ kubectl get svc -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default tf-notebook NodePort 10.106.229.20 <none> 80:30001/TCP 4m41s
..
And the token for the Jupyter notebook:
$ kubectl logs tf-notebook
[I 21:50:23.188 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 21:50:23.390 NotebookApp] Serving notebooks from local directory: /tf
[I 21:50:23.391 NotebookApp] The Jupyter Notebook is running at:
[I 21:50:23.391 NotebookApp] http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
[I 21:50:23.391 NotebookApp] or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
[I 21:50:23.391 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 21:50:23.394 NotebookApp]
To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
Or copy and paste one of these URLs:
http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
The notebook should now be accessible from your browser at this URL: http://<your-machine-ip>:30001/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
Demo
Check out the demo below where we scale GPU nodes in a K8s cluster using the GPU Operator:

GPU Telemetry
To gather GPU telemetry in Kubernetes, the GPU Operator deploys the dcgm-exporter. dcgm-exporter, based on DCGM, exposes GPU metrics for Prometheus and can be visualized using Grafana. dcgm-exporter is architected to take advantage of the KubeletPodResources API and exposes GPU metrics in a format that can be scraped by Prometheus.
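As a quick way to see the raw metrics before setting up Prometheus, you can port-forward the nvidia-dcgm-exporter service (which listens on port 9400, as shown in the service listing later in this section) and query it with curl:
$ kubectl port-forward -n gpu-operator svc/nvidia-dcgm-exporter 9400:9400 &
$ curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL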
Custom Metrics Config
With the GPU Operator, users can customize the metrics to be collected by dcgm-exporter. Below are the steps for this:
Fetch the metrics file and save it as dcgm-metrics.csv:
$ curl https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/main/etc/dcp-metrics-included.csv > dcgm-metrics.csv
Edit the metrics file as required to add/remove any metrics to be collected.
Create a namespace gpu-operator if one is not already present:
$ kubectl create namespace gpu-operator
Create a ConfigMap using the file edited above.
$ kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv
Install the GPU Operator with the additional options --set dcgmExporter.config.name=metrics-config, --set dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS, and --set dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv, as shown in the example below.
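For example, combined with an otherwise default installation, the command could look like the following sketch (merge these flags into whatever other options you already pass to helm install):
$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set dcgmExporter.config.name=metrics-config \
     --set dcgmExporter.env[0].name=DCGM_EXPORTER_COLLECTORS \
     --set dcgmExporter.env[0].value=/etc/dcgm-exporter/dcgm-metrics.csv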
Collecting Metrics on NVIDIA DGX A100 with DGX OS
NVIDIA DGX systems running DGX OS bundle the drivers, DCGM, and related components in the system image and already have nv-hostengine running. To avoid any compatibility issues, it is recommended to have dcgm-exporter connect to the existing nv-hostengine daemon to gather and publish GPU telemetry data.
Warning
The dcgm-exporter container image includes a DCGM client library (libdcgm.so) to communicate with nv-hostengine. In this deployment scenario we have dcgm-exporter (or rather libdcgm.so) connect to an existing nv-hostengine running on the host. The DCGM client library uses an internal protocol to exchange information with nv-hostengine. To avoid any potential incompatibilities between the container image's DCGM client library and the host's nv-hostengine, it is strongly recommended that the version of DCGM that dcgm-exporter is based on is greater than or equal to (and not less than) the version of DCGM running on the host. This can easily be determined by comparing the version tag of the dcgm-exporter image with the output of nv-hostengine --version on the host.
In this scenario, we need to set DCGM_REMOTE_HOSTENGINE_INFO
to localhost:5555
for dcgm-exporter to connect to nv-hostengine running on the host.
$ kubectl patch clusterpolicy/cluster-policy --type='json' -p='[{"op": "add", "path": "/spec/dcgmExporter/env/-", "value":{"name":"DCGM_REMOTE_HOSTENGINE_INFO", "value":"localhost:5555"}}]'
Verify that the dcgm-exporter pod is running after this change:
$ kubectl get pods -l app=nvidia-dcgm-exporter --all-namespaces
The rest of this section walks through how to set up Prometheus and Grafana using operators, and how to use Prometheus with dcgm-exporter.
Setting up Prometheus
Implementing a Prometheus stack can be complicated but can be managed by taking advantage of the Helm
package manager and
the Prometheus Operator and kube-prometheus projects.
The Operator uses standard configurations and dashboards for Prometheus and Grafana, and the Helm prometheus-operator chart allows you to get a full cluster monitoring solution up and running by installing Prometheus Operator and the rest of the monitoring stack.
First, add the helm
repo:
$ helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
Now, search for the available prometheus
charts:
$ helm search repo kube-prometheus
Once you’ve located the version of the chart to use, inspect the chart so we can modify the settings:
$ helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values
Next, we’ll need to edit the values file to change the port at which the Prometheus server service is available. In the prometheus instance section of the chart, change the service type from ClusterIP to NodePort. This will allow the Prometheus server to be accessible at your machine’s IP address at port 30090, as http://<machine-ip>:30090/
From:
## Port to expose on each node
## Only used if service.type is 'NodePort'
##
nodePort: 30090
## Loadbalancer IP
## Only use if service.type is "loadbalancer"
loadBalancerIP: ""
loadBalancerSourceRanges: []
## Service type
##
type: ClusterIP
To:
## Port to expose on each node
## Only used if service.type is 'NodePort'
##
nodePort: 30090
## Loadbalancer IP
## Only use if service.type is "loadbalancer"
loadBalancerIP: ""
loadBalancerSourceRanges: []
## Service type
##
type: NodePort
Also, modify the prometheusSpec.serviceMonitorSelectorNilUsesHelmValues setting to false as shown below:
## If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector will cause the
## prometheus resource to be created with selectors based on values in the helm deployment,
## which will also match the servicemonitors created
##
serviceMonitorSelectorNilUsesHelmValues: false
Add the following configMap
to the section on additionalScrapeConfigs
in the Helm chart.
## AdditionalScrapeConfigs allows specifying additional Prometheus scrape configurations. Scrape configurations
## are appended to the configurations generated by the Prometheus Operator. Job configurations must have the form
## as specified in the official Prometheus documentation:
## https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config. As scrape configs are
## appended, the user is responsible to make sure it is valid. Note that using this feature may expose the possibility
## to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible
## scrape configs are going to break Prometheus after the upgrade.
##
## The scrape configuration example below will find master nodes, provided they have the name .*mst.*, relabel the
## port to 2379 and allow etcd scraping provided it is running on all Kubernetes master nodes
##
additionalScrapeConfigs:
- job_name: gpu-metrics
scrape_interval: 1s
metrics_path: /metrics
scheme: http
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- gpu-operator
relabel_configs:
- source_labels: [__meta_kubernetes_pod_node_name]
action: replace
target_label: kubernetes_node
Finally, we can deploy the Prometheus and Grafana pods using the kube-prometheus-stack
via Helm:
$ helm install prometheus-community/kube-prometheus-stack \
--create-namespace --namespace prometheus \
--generate-name \
--values /tmp/kube-prometheus-stack.values
Note
You can also override values in the Prometheus chart directly on the Helm command line:
$ helm install prometheus-community/kube-prometheus-stack \
--create-namespace --namespace prometheus \
--generate-name \
--set prometheus.service.type=NodePort \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
You should see console output similar to the following:
NAME: kube-prometheus-stack-1637791640
LAST DEPLOYED: Wed Nov 24 22:07:22 2021
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1637791640"
Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
Now you can see the Prometheus and Grafana pods:
$ kubectl get pods -n prometheus
NAME READY STATUS RESTARTS AGE
alertmanager-kube-prometheus-stack-1637-alertmanager-0 2/2 Running 0 23s
kube-prometheus-stack-1637-operator-7bd6d6455c-pcv6n 1/1 Running 0 25s
kube-prometheus-stack-1637791640-grafana-f99f499df-kwm4f 2/2 Running 0 25s
kube-prometheus-stack-1637791640-kube-state-metrics-65bf4526xnl 1/1 Running 0 25s
kube-prometheus-stack-1637791640-prometheus-node-exporter-8pwc4 1/1 Running 0 25s
kube-prometheus-stack-1637791640-prometheus-node-exporter-nvzhq 1/1 Running 0 25s
prometheus-kube-prometheus-stack-1637-prometheus-0 2/2 Running 0 23s
You can view the services setup as part of the operator and dcgm-exporter
:
$ kubectl get svc -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 34d
gpu-operator gpu-operator ClusterIP 10.106.165.20 <none> 8080/TCP 29m
gpu-operator gpu-operator-node-feature-discovery-master ClusterIP 10.102.207.205 <none> 8080/TCP 30m
gpu-operator nvidia-dcgm-exporter ClusterIP 10.108.99.82 <none> 9400/TCP 29m
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 34d
kube-system kube-prometheus-stack-1637-coredns ClusterIP None <none> 9153/TCP 56s
kube-system kube-prometheus-stack-1637-kube-controller-manager ClusterIP None <none> 10252/TCP 56s
kube-system kube-prometheus-stack-1637-kube-etcd ClusterIP None <none> 2379/TCP 56s
kube-system kube-prometheus-stack-1637-kube-proxy ClusterIP None <none> 10249/TCP 56s
kube-system kube-prometheus-stack-1637-kube-scheduler ClusterIP None <none> 10251/TCP 56s
kube-system kube-prometheus-stack-1637-kubelet ClusterIP None <none> 10250/TCP,10255/TCP,4194/TCP 6m42s
prometheus alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 54s
prometheus kube-prometheus-stack-1637-alertmanager ClusterIP 10.99.137.105 <none> 9093/TCP 56s
prometheus kube-prometheus-stack-1637-operator ClusterIP 10.101.198.43 <none> 443/TCP 56s
prometheus kube-prometheus-stack-1637-prometheus NodePort 10.105.175.245 <none> 9090:30090/TCP 56s
prometheus kube-prometheus-stack-1637791640-grafana ClusterIP 10.111.115.192 <none> 80/TCP 56s
prometheus kube-prometheus-stack-1637791640-kube-state-metrics ClusterIP 10.105.66.181 <none> 8080/TCP 56s
prometheus kube-prometheus-stack-1637791640-prometheus-node-exporter ClusterIP 10.108.72.70 <none> 9100/TCP 56s
prometheus prometheus-operated ClusterIP None <none> 9090/TCP 54s
You can observe that the Prometheus server is available at port 30090 on the node’s IP address. Open your browser to http://<machine-ip-address>:30090
.
It may take a few minutes for DCGM to start publishing the metrics to Prometheus. The metrics availability can be verified by typing DCGM_FI_DEV_GPU_UTIL
in the event bar to determine if the GPU metrics are visible:

Using Grafana
You can also launch the Grafana tools for visualizing the GPU metrics.
There are two mechanisms for dealing with the ports on which Grafana is available - the service can be patched or port-forwarding can be used to reach the home page. Either option can be chosen based on preference.
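If you prefer not to patch the service, port-forwarding is a quick alternative. Assuming the Grafana service name from the listing shown earlier, something like the following makes the dashboard available at http://localhost:3000 (the service listens on port 80):
$ kubectl port-forward -n prometheus \
     svc/kube-prometheus-stack-1637791640-grafana 3000:80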
Patching the Grafana Service
By default, Grafana uses a ClusterIP
to expose the ports on which the service is accessible. This can be changed to a NodePort
instead, so the page is accessible
from the browser, similar to the Prometheus dashboard.
You can use kubectl patch to update the service API
object to expose a NodePort
instead.
First, modify the spec to change the service type:
$ cat << EOF | tee grafana-patch.yaml
spec:
type: NodePort
nodePort: 32322
EOF
And now use kubectl patch
:
$ kubectl patch svc kube-prometheus-stack-1637791640-grafana -n prometheus --patch "$(cat grafana-patch.yaml)"
service/kube-prometheus-stack-1637791640-grafana patched
You can verify that the service is now exposed at an externally accessible port:
$ kubectl get svc -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
<snip>
prometheus kube-prometheus-stack-1637791640-grafana NodePort 10.111.115.192 <none> 80:32258/TCP 2m2s
Open your browser to http://<machine-ip-address>:32258
and view the Grafana login page. Access Grafana home using the admin
username.
The password credentials for the login are available in the kube-prometheus-stack.values file we edited in the earlier section of the doc:
## Deploy default dashboards.
##
defaultDashboardsEnabled: true
adminPassword: prom-operator

Upgrading the GPU Operator
Using Helm
The GPU Operator supports dynamic updates to existing resources. This ability enables the GPU Operator to ensure settings from the cluster policy specification are always applied and current.
Because Helm does not support automatic upgrade of existing CRDs, you can upgrade the GPU Operator chart manually or by enabling a Helm hook.
Option 1 - manually upgrade CRD

With this workflow, all existing GPU operator resources are updated inline and the cluster policy resource is patched with updates from values.yaml
.
Specify the Operator release tag in an environment variable:
$ export RELEASE_TAG=v23.3.1
Apply the custom resource definition for the cluster policy:
$ kubectl apply -f \
    https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$RELEASE_TAG/deployments/gpu-operator/crds/nvidia.com_clusterpolicies_crd.yaml
Example Output
customresourcedefinition.apiextensions.k8s.io/clusterpolicies.nvidia.com configured
Apply the custom resource definition for Node Feature Discovery:
$ kubectl apply -f \
    https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$RELEASE_TAG/deployments/gpu-operator/charts/node-feature-discovery/crds/nfd-api-crds.yaml
Example Output
customresourcedefinition.apiextensions.k8s.io/nodefeaturerules.nfd.k8s-sigs.io configured
Update the information about the Operator chart:
$ helm repo update nvidia
Example Output
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Fetch the values from the chart:
$ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml
Update the values file as needed.
Upgrade the Operator:
$ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator -f values-$RELEASE_TAG.yaml
Example Output
Release "gpu-operator" has been upgraded. Happy Helming! NAME: gpu-operator LAST DEPLOYED: Thu Apr 20 15:05:52 2023 NAMESPACE: gpu-operator STATUS: deployed REVISION: 2 TEST SUITE: None
Option 2 - auto upgrade CRD using Helm hook
Starting with GPU Operator v22.09, a pre-upgrade Helm hook is utilized to automatically upgrade to the latest CRD. A new parameter, operator.upgradeCRD, was added to trigger this hook during GPU Operator upgrade using Helm. It is disabled by default. This parameter needs to be set with the --set operator.upgradeCRD=true option in the upgrade command, as shown below.
Specify the Operator release tag in an environment variable:
$ export RELEASE_TAG=v23.3.1
Update the information about the Operator chart:
$ helm repo update nvidia
Example Output
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Fetch the values from the chart:
$ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml
Update the values file as needed.
Upgrade the Operator:
$ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --set operator.upgradeCRD=true --disable-openapi-validation -f values-$RELEASE_TAG.yaml
Note
The --disable-openapi-validation option is required in this case so that Helm does not try to validate whether the CR instance from the new chart is valid according to the old CRD. Since the CR instance in the chart is valid for the upgraded CRD, this is compatible.
Helm hooks used with the GPU Operator use the operator image itself. If the operator image cannot be pulled successfully (either due to a network error or an invalid NGC registry secret in the case of NVAIE), the hooks will fail. In this case, the chart needs to be deleted using the --no-hooks option so that deletion does not hang on hook failures.
Cluster Policy Updates
The GPU Operator also supports dynamic updates to the ClusterPolicy
CustomResource using kubectl
:
$ kubectl edit clusterpolicy
After the edits are complete, Kubernetes automatically applies the updates to the cluster.
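For scripted, non-interactive changes, kubectl patch can be used instead of kubectl edit. The example below is a hypothetical sketch that disables the dcgm-exporter component; the /spec/dcgmExporter/enabled path is an assumption, so check your ClusterPolicy spec before applying it:
$ kubectl patch clusterpolicy/cluster-policy --type='json' \
    -p='[{"op": "replace", "path": "/spec/dcgmExporter/enabled", "value": false}]'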
Additional Controls for Driver Upgrades
While most of the GPU Operator managed daemonsets can be upgraded seamlessly, the NVIDIA driver daemonset has special considerations. Refer to GPU Driver Upgrades for more information.
Using OLM in OpenShift
For upgrading the GPU Operator when running in OpenShift, refer to the official documentation on upgrading installed operators: https://docs.openshift.com/container-platform/4.8/operators/admin/olm-upgrading-operators.html
Uninstall
To uninstall the operator:
$ helm delete -n gpu-operator $(helm list -n gpu-operator | grep gpu-operator | awk '{print $1}')
You should now see all the pods being deleted:
$ kubectl get pods -n gpu-operator
No resources found.
By default, Helm does not support deletion of existing CRDs when the chart is deleted.
Thus the clusterpolicies.nvidia.com CRD will still remain by default.
$ kubectl get crds -A | grep -i clusterpolicies.nvidia.com
To overcome this, a post-delete hook is used in the GPU Operator to perform the CRD cleanup. A new parameter, operator.cleanupCRD, was added to enable this hook. It is disabled by default. This parameter needs to be enabled with --set operator.cleanupCRD=true during install or upgrade for automatic CRD cleanup to happen on chart deletion.
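Alternatively, the leftover CRD can be removed manually with a standard kubectl delete; note that this also removes any remaining ClusterPolicy resources:
$ kubectl delete crd clusterpolicies.nvidia.com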
Note
After uninstalling the GPU Operator, the NVIDIA driver modules might still be loaded. Either reboot the node or unload them using the following command:
$ sudo rmmod nvidia_modeset nvidia_uvm nvidia
Helm hooks used with the GPU Operator use the operator image itself. If the operator image cannot be pulled successfully (either due to a network error or an invalid NGC registry secret in the case of NVAIE), the hooks will fail. In this case, the chart needs to be deleted using the --no-hooks option so that deletion does not hang on hook failures.