Getting Started

This document provides instructions, including prerequisites, for getting started with the NVIDIA GPU Operator.

Prerequisites

Before installing the GPU Operator, ensure that the Kubernetes cluster meets the following prerequisites.

  1. Nodes must not be pre-configured with NVIDIA components (driver, container runtime, device plugin).

  2. Nodes must be configured with Docker CE/EE, cri-o, or containerd. For Docker, follow the official install instructions.

  3. If the HWE kernel (e.g. kernel 5.x) is used with Ubuntu 18.04 LTS, then the nouveau driver for NVIDIA GPUs must be blacklisted before starting the GPU Operator. Follow the steps in the CUDA installation guide to disable the nouveau driver and update the initramfs (a sketch of these steps appears after the note below).

  4. Node Feature Discovery (NFD) is required on each node. By default, NFD master and worker are automatically deployed. If NFD is already running in the cluster prior to the deployment of the operator, set the Helm chart variable nfd.enabled to false during the Helm install step.

  5. For monitoring in Kubernetes 1.13 and 1.14, enable the kubelet KubeletPodResources feature gate. From Kubernetes 1.15 onwards, it is enabled by default.

Note

To enable the KubeletPodResources feature gate, run the following command: echo -e "KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true" | sudo tee /etc/default/kubelet
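
As referenced in step 3 above, the nouveau driver can be disabled as follows. This is a minimal sketch of the steps from the CUDA installation guide for Ubuntu; consult that guide for your specific distribution.

$ cat << EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
$ sudo update-initramfs -u
$ sudo reboot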

Before installing the GPU Operator on NVIDIA vGPU, ensure the following.

  1. The NVIDIA vGPU Host Driver version 12.0 (or later) is pre-installed on all hypervisors hosting NVIDIA vGPU accelerated Kubernetes worker node virtual machines. Please refer to NVIDIA vGPU Documentation for details.

  2. An NVIDIA vGPU License Server is installed and reachable from all Kubernetes worker node virtual machines.

  3. A private registry is available to upload the NVIDIA vGPU specific driver container image.

  4. Each Kubernetes worker node in the cluster has access to the private registry. Private registry access is usually managed through imagePullSecrets. See the Kubernetes Documentation for more information. You are required to provide these secrets to the NVIDIA GPU Operator in the driver section of the values.yaml file.

  5. Git and Docker/Podman are required to build the vGPU driver image from the source repository and push it to a local registry.

Note

Uploading the NVIDIA vGPU driver to a publicly available repository or otherwise publicly sharing the driver is a violation of the NVIDIA vGPU EULA.


Red Hat OpenShift 4

To install the GPU Operator on clusters with Red Hat OpenShift 4.5 and 4.6 using RHCOS worker nodes, follow the user guide.


Google Cloud Anthos

For getting started with NVIDIA GPUs for Google Cloud Anthos, follow the getting started document.


The rest of this document includes instructions for installing the GPU Operator on supported Linux distributions.

Install Kubernetes

Refer to Install Kubernetes for getting started with setting up a Kubernetes cluster.

Install NVIDIA GPU Operator

Install Helm

The preferred method to deploy the GPU Operator is using Helm. First, install Helm:

$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
   && chmod 700 get_helm.sh \
   && ./get_helm.sh

Now, add the NVIDIA Helm repository:

$ helm repo add nvidia https://nvidia.github.io/gpu-operator \
   && helm repo update
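
You can optionally confirm that the chart is now visible in the repository:

$ helm search repo nvidia/gpu-operator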

Install the GPU Operator

Now set up the operator using the Helm chart:

Note

If NFD is already running in the cluster prior to the deployment of the operator, pass the --set nfd.enabled=false Helm chart variable during the install.

The command below will install the GPU Operator with its default configuration:

$ helm install --wait --generate-name \
   nvidia/gpu-operator

For NVIDIA vGPU, the command below will install the GPU Operator with a vGPU driver image hosted in a private registry:

$ helm install --wait --generate-name \
   nvidia/gpu-operator --set driver.repository=$PRIVATE_REGISTRY \
   --set driver.version=$VERSION \
   --set driver.imagePullSecrets={$REGISTRY_SECRET_NAME}

Note

The GPU Operator with NVIDIA vGPUs requires additional steps to build a private driver image prior to install. Please refer to section Considerations to Install GPU Operator with NVIDIA vGPU Driver for detailed instructions and required values of PRIVATE_REGISTRY, VERSION and REGISTRY_SECRET_NAME.

Note

This command assumes that docker is the default container runtime. For containerd, pass the --set operator.defaultRuntime=containerd option. For additional containerd configuration, refer to the notes below.

Note

By default, the operator assumes your Kubernetes deployment is running with docker as its container runtime. If your Kubernetes deployment is instead using cri-o or containerd, you can update the defaultRuntime assumed by the operator when you deploy it.

For cri-o:

$ helm install --wait --generate-name \
   nvidia/gpu-operator \
   --set operator.defaultRuntime=crio

For containerd:

$ helm install --wait --generate-name \
   nvidia/gpu-operator \
   --set operator.defaultRuntime=containerd

Furthermore, when setting containerd as the defaultRuntime, the following options are also available:

toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"

CONTAINERD_CONFIG : The path on the host to the containerd config you would like to have updated with support for the nvidia-container-runtime. By default this will point to /etc/containerd/config.toml (the default location for containerd). It should be customized if your containerd installation is not in the default location.

CONTAINERD_SOCKET : The path on the host to the socket file used to communicate with containerd. The operator will use this to send a SIGHUP signal to the containerd daemon to reload its config. By default this will point to /run/containerd/containerd.sock (the default location for containerd). It should be customized if your containerd installation is not in the default location.

CONTAINERD_RUNTIME_CLASS : The name of the Runtime Class you would like to associate with the nvidia-container-runtime. Pods launched with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS will always run with the nvidia-container-runtime. The default CONTAINERD_RUNTIME_CLASS is nvidia.

CONTAINERD_SET_AS_DEFAULT : A flag indicating whether you want to set nvidia-container-runtime as the default runtime used to launch all containers. When set to false, only containers in pods with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS will be run with the nvidia-container-runtime. The default value is true.
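
These toolkit environment variables can also be passed directly on the Helm command line instead of through values.yaml. As an illustration, the following sketch keeps the existing default runtime for ordinary pods and relies on the nvidia runtime class only; it uses Helm's index-based --set syntax, and --set-string keeps the value a string as required for container environment variables:

$ helm install --wait --generate-name \
   nvidia/gpu-operator \
   --set operator.defaultRuntime=containerd \
   --set toolkit.env[0].name=CONTAINERD_SET_AS_DEFAULT \
   --set-string toolkit.env[0].value=false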

Note

If you want to use custom driver container images (for example, version 455.28), then you would need to build a new driver container image. Follow these steps:

  • Modify the Dockerfile (for example, by specifying the driver version in the Ubuntu 20.04 container here)

  • Build the container (e.g. docker build --pull -t nvidia/driver:455.28-ubuntu20.04 --file Dockerfile .). Ensure that the driver container is tagged as shown in the example by using the driver:<version>-<os> schema. A build-arg alternative is sketched after this note.

  • Specify the new driver image and repository by overriding the defaults in the Helm install command. For example:

    $ helm install --wait --generate-name \
       nvidia/gpu-operator \
       --set driver.repository=docker.io/nvidia \
       --set driver.version="455.28"
    

Note that these instructions are provided for reference and evaluation purposes. Not using the standard releases of the GPU Operator from NVIDIA would mean limited support for such custom configurations.
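
As an alternative to editing the Dockerfile, the desired version can usually be supplied through the DRIVER_VERSION build argument, the same argument used for the vGPU build later in this document. This is a sketch, assuming the driver container repository exposes that argument for your OS:

$ docker build --pull \
   --build-arg DRIVER_VERSION=455.28 \
   -t nvidia/driver:455.28-ubuntu20.04 \
   --file Dockerfile .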

Check the status of the pods to ensure all the containers are running:

$ kubectl get pods -A
NAMESPACE                NAME                                                              READY   STATUS      RESTARTS   AGE
default                  gpu-operator-1597953523-node-feature-discovery-master-5bcfgvtzn   1/1     Running     0          2m18s
default                  gpu-operator-1597953523-node-feature-discovery-worker-fx9xc       1/1     Running     0          2m18s
default                  gpu-operator-774ff7994c-nwpvz                                     1/1     Running     0          2m18s
gpu-operator-resources   nvidia-container-toolkit-daemonset-tt9zh                          1/1     Running     0          2m7s
gpu-operator-resources   nvidia-dcgm-exporter-zpprv                                        1/1     Running     0          2m7s
gpu-operator-resources   nvidia-device-plugin-daemonset-5ztkl                              1/1     Running     3          2m7s
gpu-operator-resources   nvidia-device-plugin-validation                                   0/1     Completed   0          2m7s
gpu-operator-resources   nvidia-driver-daemonset-qtn6p                                     1/1     Running     0          2m7s
gpu-operator-resources   nvidia-driver-validation                                          0/1     Completed   0          2m7s
kube-system              calico-kube-controllers-578894d4cd-pv5kw                          1/1     Running     0          5m36s
kube-system              calico-node-ffhdd                                                 1/1     Running     0          5m36s
kube-system              coredns-66bff467f8-nwdrx                                          1/1     Running     0          9m4s
kube-system              coredns-66bff467f8-srg8d                                          1/1     Running     0          9m4s
kube-system              etcd-ip-172-31-80-124                                             1/1     Running     0          9m19s
kube-system              kube-apiserver-ip-172-31-80-124                                   1/1     Running     0          9m19s
kube-system              kube-controller-manager-ip-172-31-80-124                          1/1     Running     0          9m19s
kube-system              kube-proxy-kj5qb                                                  1/1     Running     0          9m4s
kube-system              kube-scheduler-ip-172-31-80-124                                   1/1     Running     0          9m18s
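
Once the validation pods have completed, one way to confirm that the GPUs are advertised to the scheduler is to inspect the node's capacity and allocatable resources (replace <node-name> with one of your GPU nodes):

$ kubectl describe node <node-name> | grep nvidia.com/gpu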

Considerations to Install in Air-Gapped Clusters

Local Image Registry

With air-gapped installs, the GPU Operator requires all images to be hosted in a local image registry accessible to each node in the cluster. To allow the GPU Operator to work with a local registry, specify the local repository, image, and tag, along with pull secrets, in values.yaml.

Get the values.yaml

$ curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/deployments/gpu-operator/values.yaml

Update values.yaml with the repository and image details as applicable.

Note

Replace <repo.example.com:port> below with your local image registry URL and port.

Note

Some pods use an initContainer with the CUDA image 11.0-base-ubi8, referenced as nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59. Make sure to push this image to the local repository as well.
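
As an illustration, one possible way to mirror that pinned CUDA image is with docker pull, tag, and push; the target repository is a placeholder, podman can be used instead, and your registry may assign a different digest to the pushed image:

$ docker pull nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59
$ docker tag nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59 \
   <repo.example.com:port>/nvidia/cuda:11.0-base-ubi8
$ docker push <repo.example.com:port>/nvidia/cuda:11.0-base-ubi8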

operator:
  repository: <repo.example.com:port>
  image: gpu-operator
  version: 1.4.0
  imagePullSecrets: []
  validator:
    image: cuda-sample
    repository: <repo.example.com:port>
    version: vectoradd-cuda10.2
    imagePullSecrets: []

driver:
  repository: <repo.example.com:port>
  image: driver
  version: "450.80.02"
  imagePullSecrets: []

toolkit:
  repository: <repo.example.com:port>
  image: container-toolkit
  version: 1.4.0-ubuntu18.04
  imagePullSecrets: []

devicePlugin:
  repository: <repo.example.com:port>
  image: k8s-device-plugin
  version: v0.7.1
  imagePullSecrets: []

dcgmExporter:
  repository: <repo.example.com:port>
  image: dcgm-exporter
  version: 2.0.13-2.1.2-ubuntu20.04
  imagePullSecrets: []

gfd:
  repository: <repo.example.com:port>
  image: gpu-feature-discovery
  version: v0.2.2
  imagePullSecrets: []

node-feature-discovery:
  imagePullSecrets: []
  image:
    repository: <repo.example.com:port>
    tag: "v0.6.0"

Local Package Repository

The driver container deployed as part of the GPU Operator requires certain packages to be available as part of the driver installation. In air-gapped installations, users are required to create a mirror repository for their OS distribution and make the following packages available:

Note

KERNEL_VERSION is the kernel version running on the GPU node. GCC_VERSION is the gcc version matching the one used to build that kernel.

ubuntu:
   linux-headers-${KERNEL_VERSION}
   linux-image-${KERNEL_VERSION}
   linux-modules-${KERNEL_VERSION}

centos:
   elfutils-libelf.x86_64
   elfutils-libelf-devel.x86_64
   kernel-headers-${KERNEL_VERSION}
   kernel-devel-${KERNEL_VERSION}
   kernel-core-${KERNEL_VERSION}
   gcc-${GCC_VERSION}

rhel/rhcos:
   kernel-headers-${KERNEL_VERSION}
   kernel-devel-${KERNEL_VERSION}
   kernel-core-${KERNEL_VERSION}
   gcc-${GCC_VERSION}

Once all of the required packages above are mirrored to the local repository, create repo lists following the distribution-specific documentation. Then create a ConfigMap from the repo list file in the gpu-operator-resources namespace:

$ kubectl create configmap repo-config -n gpu-operator-resources --from-file=<path-to-repo-list-file>
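
As an illustration only, an Ubuntu repo list file for the command above might look like the following; the mirror URL and distribution codename are placeholders for your environment:

$ cat << EOF > custom-repo.list
deb [arch=amd64] http://<repo.example.com:port>/ubuntu focal main universe
EOF
$ kubectl create configmap repo-config -n gpu-operator-resources --from-file=custom-repo.list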

Once the ConfigMap is created using the above command, update values.yaml with this information so that the GPU Operator can mount the repo configuration within the driver container to pull the required packages.

Ubuntu:

driver:
   repoConfig:
      configMapName: repo-config
      destinationDir: /etc/apt/sources.list.d

CentOS/RHEL/RHCOS:

driver:
   repoConfig:
      configMapName: repo-config
      destinationDir: /etc/yum.repos.d

If the mirror repository is configured behind a proxy, specify driver.env in values.yaml with the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables.

driver:
   env:
   - name: HTTPS_PROXY
     value: <example.proxy.com:port>
   - name: HTTP_PROXY
     value: <example.proxy.com:port>
   - name: NO_PROXY
     value: .example.com

Deploy the GPU Operator with the updated values.yaml:

$ helm install --wait --generate-name \
   nvidia/gpu-operator -f values.yaml

Check the status of the pods to ensure all the containers are running:

$ kubectl get pods -n gpu-operator-resources

Considerations to Install GPU Operator with NVIDIA vGPU Driver

High Level Workflow

The following section outlines the high level workflow to use the GPU Operator with NVIDIA vGPUs.

  1. Download the vGPU Software and latest NVIDIA vGPU Driver Catalog file.

  2. Clone driver container source repository for building private driver image.

  3. Create vGPU license configuration file.

  4. Build the driver container image.

  5. Push the driver container image to your private repository.

  6. Install the GPU Operator.

Detailed Workflow

Download the vGPU Software and the latest NVIDIA vGPU driver catalog file from the NVIDIA Licensing Portal.

  1. Login to the NVIDIA Licensing Portal and navigate to the “Software Downloads” section.

  2. The NVIDIA vGPU Software is located in the Software Downloads section of the NVIDIA Licensing Portal.

  3. The NVIDIA vGPU driver catalog file is located in the “Additional Software” section.

  4. The vGPU Software bundle is packaged as a zip file. Extract the zip file.

Clone the driver container repository and build the driver image

  • Open a terminal and clone the driver container image repository:

$ git clone https://gitlab.com/nvidia/container-images/driver
$ cd driver

  • Change to the OS directory under the driver directory:

$ cd ubuntu20.04

  • Copy the NVIDIA vGPU guest driver from your extracted zip file and the NVIDIA vGPU driver catalog file:

$ cp <local-driver-download-directory>/*-grid.run drivers
$ cp vgpuDriverCatalog.yaml drivers

  • Create the NVIDIA vGPU license configuration file:

Create an NVIDIA vGPU license file named gridd.conf in the drivers/ folder with the content below.

# Description: Set License Server Address
# Data type: string
# Format:  "<address>"
ServerAddress=<license server address>

Enter the address of your NVIDIA vGPU license server.

  • Build the driver container image

Set the private registry name using the command below on the terminal:

$ export PRIVATE_REGISTRY=<private registry name>

Set the OS_TAG. The OS_TAG has to match the guest OS version. Supported values are ubuntu20.04 and rhcos4.6.

$ export OS_TAG=ubuntu20.04

Set the driver container image version to a user-defined version number, for example 1.0.0:

$ export VERSION=1.0.0

Set VGPU_DRIVER_VERSION to the guest vGPU GRID driver version downloaded from the NVIDIA software portal:

$ export VGPU_DRIVER_VERSION=460.16-grid

Note

VERSION can be any user-defined value. Note this value for use during the operator installation command.

Build the driver container image

$ sudo docker build \
  --build-arg DRIVER_TYPE=vgpu \
  --build-arg DRIVER_VERSION=$VGPU_DRIVER_VERSION \
  -t ${PRIVATE_REGISTRY}/driver:${VERSION}-${OS_TAG} .

  • Push the driver container image to your private repository:

$ sudo docker login ${PRIVATE_REGISTRY} --username=<username>   # enter the password at the prompt
$ sudo docker push ${PRIVATE_REGISTRY}/driver:${VERSION}-${OS_TAG}

  • Install the GPU Operator.

Create an image pull secret:

$ kubectl create namespace gpu-operator-resources
$ export REGISTRY_SECRET_NAME=registry-secret
$ kubectl create secret docker-registry ${REGISTRY_SECRET_NAME} \
  --docker-server=${PRIVATE_REGISTRY} --docker-username=<username> \
  --docker-password=<password> \
  --docker-email=<email-id> -n gpu-operator-resources
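
You can confirm that the secret was created before proceeding:

$ kubectl get secret ${REGISTRY_SECRET_NAME} -n gpu-operator-resources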

Note

Note the secret name REGISTRY_SECRET_NAME for use during the operator installation command.

  • Install the GPU Operator Helm chart

Refer to the Install NVIDIA GPU Operator section for the GPU Operator installation command and options for vGPU.

Demo

Check out the demo below where we scale GPU nodes in a K8s cluster using the GPU Operator:

[Animation: scaling GPU nodes in a Kubernetes cluster using the GPU Operator]

Running Sample GPU Applications

CUDA FP16 Matrix multiply

In the first example, let’s try running a quick CUDA load generator, which does an FP16 matrix-multiply on the GPU:

$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
   name: dcgmproftester
spec:
   restartPolicy: OnFailure
   containers:
   - name: dcgmproftester11
     image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
     args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
     resources:
        limits:
           nvidia.com/gpu: 1
     securityContext:
        capabilities:
           add: ["SYS_ADMIN"]
EOF

and then view the logs of the dcgmproftester pod:

$ kubectl logs -f dcgmproftester

You should see the FP16 GEMM being run on the GPU:

Skipping CreateDcgmGroups() since DCGM validation is disabled
CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR: 1024
CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT: 40
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 65536
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR: 7
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR: 5
CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH: 256
CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE: 5001000
Max Memory bandwidth: 320064000000 bytes (320.06 GiB)
CudaInit completed successfully.

Skipping WatchFields() since DCGM validation is disabled
TensorEngineActive: generated ???, dcgm 0.000 (26096.4 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (26344.4 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (26351.2 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (26359.9 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (26750.7 gflops)
TensorEngineActive: generated ???, dcgm 0.000 (25378.8 gflops)
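
The pod exits once the load test finishes. When you are done inspecting the logs, the test pod can be removed:

$ kubectl delete pod dcgmproftester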

Jupyter Notebook

In the next example, let’s try running a TensorFlow Jupyter notebook.

First, deploy the pods:

$ kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml

Check whether the pod has successfully started:

$ kubectl get pod tf-notebook
NAMESPACE                NAME                                                              READY   STATUS      RESTARTS   AGE
default                  tf-notebook                                                       1/1     Running     0          3m45s

Since the example also includes a service, let’s obtain the external port at which the notebook is accessible:

$ kubectl get svc -A
NAMESPACE                NAME                                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
default                  tf-notebook                                             NodePort    10.106.229.20   <none>        80:30001/TCP             4m41s
..

And the token for the Jupyter notebook:

$ kubectl logs tf-notebook
[I 21:50:23.188 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 21:50:23.390 NotebookApp] Serving notebooks from local directory: /tf
[I 21:50:23.391 NotebookApp] The Jupyter Notebook is running at:
[I 21:50:23.391 NotebookApp] http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
[I 21:50:23.391 NotebookApp]  or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
[I 21:50:23.391 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 21:50:23.394 NotebookApp]

   To access the notebook, open this file in a browser:
      file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
   Or copy and paste one of these URLs:
      http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
   or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9

The notebook should now be accessible from your browser at this URL: http://<your-machine-ip>:30001/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
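
If the node's IP address is not directly reachable from your browser, port-forwarding to the service is an alternative (a sketch; this forwards local port 8888 to the notebook service, so the notebook is then available at http://localhost:8888/?token=<token>):

$ kubectl port-forward service/tf-notebook 8888:80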

GPU Telemetry

To gather GPU telemetry in Kubernetes, the GPU Operator deploys dcgm-exporter. dcgm-exporter, based on DCGM, exposes GPU metrics for Prometheus, which can be visualized using Grafana. dcgm-exporter is architected to take advantage of the KubeletPodResources API and exposes GPU metrics in a format that can be scraped by Prometheus.

The rest of this section walks through how to set up Prometheus and Grafana using operators, and how to use Prometheus with dcgm-exporter.

Setting up Prometheus

Implementing a Prometheus stack can be complicated, but it can be managed by taking advantage of the Helm package manager and the Prometheus Operator and kube-prometheus projects. The operator uses standard configurations and dashboards for Prometheus and Grafana, and the Helm kube-prometheus-stack chart allows you to get a full cluster monitoring solution up and running by installing the Prometheus Operator and the rest of the components.

First, add the helm repo:

$ helm repo add prometheus-community \
   https://prometheus-community.github.io/helm-charts

Now, search for the available prometheus charts:

$ helm search repo kube-prometheus

Once you've located the version of the chart to use, inspect the chart so we can modify the settings:

$ helm inspect values prometheus-community/kube-prometheus-stack > /tmp/kube-prometheus-stack.values

Next, we'll need to edit the values file to change the port at which the Prometheus server service is available. In the prometheus instance section of the chart, change the service type from ClusterIP to NodePort. This will allow the Prometheus server to be accessible at your machine's IP address on port 30090, i.e. http://<machine-ip>:30090/.

From:
 ## Port to expose on each node
 ## Only used if service.type is 'NodePort'
 ##
 nodePort: 30090

 ## Loadbalancer IP
 ## Only use if service.type is "loadbalancer"
 loadBalancerIP: ""
 loadBalancerSourceRanges: []
 ## Service type
 ##
 type: ClusterIP

To:
 ## Port to expose on each node
 ## Only used if service.type is 'NodePort'
 ##
 nodePort: 30090

 ## Loadbalancer IP
 ## Only use if service.type is "loadbalancer"
 loadBalancerIP: ""
 loadBalancerSourceRanges: []
 ## Service type
 ##
 type: NodePort

Also, set prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false as shown below:

## If true, a nil or {} value for prometheus.prometheusSpec.serviceMonitorSelector will cause the
## prometheus resource to be created with selectors based on values in the helm deployment,
## which will also match the servicemonitors created
##
serviceMonitorSelectorNilUsesHelmValues: false

Add the following scrape configuration to the additionalScrapeConfigs section in the Helm chart values:

## AdditionalScrapeConfigs allows specifying additional Prometheus scrape configurations. Scrape configurations
## are appended to the configurations generated by the Prometheus Operator. Job configurations must have the form
## as specified in the official Prometheus documentation:
## https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config. As scrape configs are
## appended, the user is responsible to make sure it is valid. Note that using this feature may expose the possibility
## to break upgrades of Prometheus. It is advised to review Prometheus release notes to ensure that no incompatible
## scrape configs are going to break Prometheus after the upgrade.
##
## The scrape configuration example below will find master nodes, provided they have the name .*mst.*, relabel the
## port to 2379 and allow etcd scraping provided it is running on all Kubernetes master nodes
##
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator-resources
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node

Finally, we can deploy the Prometheus and Grafana pods using the kube-prometheus-stack via Helm:

$ helm install prometheus-community/kube-prometheus-stack \
   --create-namespace --namespace prometheus \
   --generate-name \
   --values /tmp/kube-prometheus-stack.values

Note

You can also override values in the Prometheus chart directly on the Helm command line:

$ helm install prometheus-community/kube-prometheus-stack \
   --create-namespace --namespace prometheus \
   --generate-name \
   --set prometheus.service.type=NodePort \
   --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

You should see console output as below:

NAME: kube-prometheus-stack-1603211794
LAST DEPLOYED: Tue Oct 20 16:36:39 2020
NAMESPACE: prometheus
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace prometheus get pods -l "release=kube-prometheus-stack-1603211794"

Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.

Now you can see the Prometheus and Grafana pods:

$ kubectl get pods -A
NAMESPACE                NAME                                                              READY   STATUS      RESTARTS   AGE
default                  gpu-operator-1597965115-node-feature-discovery-master-fbf9rczx5   1/1     Running     1          6h57m
default                  gpu-operator-1597965115-node-feature-discovery-worker-n58pm       1/1     Running     1          6h57m
default                  gpu-operator-774ff7994c-xh62d                                     1/1     Running     1          6h57m
default                  gpu-operator-test                                                 0/1     Completed   0          8h
gpu-operator-resources   nvidia-container-toolkit-daemonset-grnnd                          1/1     Running     1          6h57m
gpu-operator-resources   nvidia-dcgm-exporter-nv5z7                                        1/1     Running     7          6h57m
gpu-operator-resources   nvidia-device-plugin-daemonset-qq6lq                              1/1     Running     7          6h57m
gpu-operator-resources   nvidia-device-plugin-validation                                   0/1     Completed   0          6h57m
gpu-operator-resources   nvidia-driver-daemonset-vwzvq                                     1/1     Running     1          6h57m
gpu-operator-resources   nvidia-driver-validation                                          0/1     Completed   3          6h57m
kube-system              calico-kube-controllers-578894d4cd-pv5kw                          1/1     Running     1          10h
kube-system              calico-node-ffhdd                                                 1/1     Running     1          10h
kube-system              coredns-66bff467f8-nwdrx                                          1/1     Running     1          10h
kube-system              coredns-66bff467f8-srg8d                                          1/1     Running     1          10h
kube-system              etcd-ip-172-31-80-124                                             1/1     Running     1          10h
kube-system              kube-apiserver-ip-172-31-80-124                                   1/1     Running     1          10h
kube-system              kube-controller-manager-ip-172-31-80-124                          1/1     Running     1          10h
kube-system              kube-proxy-kj5qb                                                  1/1     Running     1          10h
kube-system              kube-scheduler-ip-172-31-80-124                                   1/1     Running     1          10h
prometheus               alertmanager-prometheus-operator-159799-alertmanager-0            2/2     Running     0          12s
prometheus               prometheus-operator-159799-operator-78f95fccbd-hcl76              2/2     Running     0          16s
prometheus               prometheus-operator-1597990146-grafana-5c7db4f7d4-qcjbj           2/2     Running     0          16s
prometheus               prometheus-operator-1597990146-kube-state-metrics-645c57c8x28nv   1/1     Running     0          16s
prometheus               prometheus-operator-1597990146-prometheus-node-exporter-6lchc     1/1     Running     0          16s
prometheus               prometheus-prometheus-operator-159799-prometheus-0                2/3     Running     0          2s

You can view the services set up as part of the operator and dcgm-exporter:

$ kubectl get svc -A
NAMESPACE                NAME                                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                        AGE
default                  gpu-operator-1597965115-node-feature-discovery-master     ClusterIP   10.110.46.7      <none>        8080/TCP                       6h57m
default                  kubernetes                                                ClusterIP   10.96.0.1        <none>        443/TCP                        10h
default                  tf-notebook                                               NodePort    10.106.229.20    <none>        80:30001/TCP                   8h
gpu-operator-resources   nvidia-dcgm-exporter                                      ClusterIP   10.99.250.100    <none>        9400/TCP                       6h57m
kube-system              kube-dns                                                  ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP         10h
kube-system              prometheus-operator-159797-kubelet                        ClusterIP   None             <none>        10250/TCP,10255/TCP,4194/TCP   4h50m
kube-system              prometheus-operator-159799-coredns                        ClusterIP   None             <none>        9153/TCP                       32s
kube-system              prometheus-operator-159799-kube-controller-manager        ClusterIP   None             <none>        10252/TCP                      32s
kube-system              prometheus-operator-159799-kube-etcd                      ClusterIP   None             <none>        2379/TCP                       32s
kube-system              prometheus-operator-159799-kube-proxy                     ClusterIP   None             <none>        10249/TCP                      32s
kube-system              prometheus-operator-159799-kube-scheduler                 ClusterIP   None             <none>        10251/TCP                      32s
kube-system              prometheus-operator-159799-kubelet                        ClusterIP   None             <none>        10250/TCP,10255/TCP,4194/TCP   18s
prometheus               alertmanager-operated                                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP     28s
prometheus               prometheus-operated                                       ClusterIP   None             <none>        9090/TCP                       18s
prometheus               prometheus-operator-159799-alertmanager                   ClusterIP   10.106.93.161    <none>        9093/TCP                       32s
prometheus               prometheus-operator-159799-operator                       ClusterIP   10.100.116.170   <none>        8080/TCP,443/TCP               32s
prometheus               prometheus-operator-159799-prometheus                     NodePort    10.102.169.42    <none>        9090:30090/TCP                 32s
prometheus               prometheus-operator-1597990146-grafana                    ClusterIP   10.104.40.69     <none>        80/TCP                         32s
prometheus               prometheus-operator-1597990146-kube-state-metrics         ClusterIP   10.100.204.91    <none>        8080/TCP                       32s
prometheus               prometheus-operator-1597990146-prometheus-node-exporter   ClusterIP   10.97.64.60      <none>        9100/TCP                       32s

You can observe that the Prometheus server is available at port 30090 on the node's IP address. Open your browser to http://<machine-ip-address>:30090. It may take a few minutes for DCGM to start publishing the metrics to Prometheus. The metrics availability can be verified by typing DCGM_FI_DEV_GPU_UTIL into the query bar to determine if the GPU metrics are visible.
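
You can also verify the metric directly against the Prometheus HTTP API, for example (replace the address with your node's IP):

$ curl "http://<machine-ip-address>:30090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL"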

[Screenshot: DCGM_FI_DEV_GPU_UTIL GPU metrics shown in the Prometheus dashboard]

Using Grafana

You can also launch the Grafana tools for visualizing the GPU metrics.

There are two mechanisms for dealing with the ports on which Grafana is available: the service can be patched, or port-forwarding can be used to reach the home page. Either option can be chosen based on preference.
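
For port-forwarding, a minimal sketch is shown below; the Grafana service name matches the Helm release from the earlier output and will differ in your cluster. Grafana is then reachable at http://localhost:3000.

$ kubectl port-forward -n prometheus \
   service/prometheus-operator-1597990146-grafana 3000:80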

Patching the Grafana Service

By default, Grafana uses a ClusterIP to expose the ports on which the service is accessible. This can be changed to a NodePort instead, so the page is accessible from the browser, similar to the Prometheus dashboard.

You can use kubectl patch to update the service API object to expose a NodePort instead.

First, modify the spec to change the service type:

$ cat << EOF | tee grafana-patch.yaml
spec:
  type: NodePort
  nodePort: 32322
EOF

And now use kubectl patch:

$ kubectl patch svc prometheus-operator-1597990146-grafana -n prometheus --patch "$(cat grafana-patch.yaml)"
service/prometheus-operator-1597990146-grafana patched

You can verify that the service is now exposed at an externally accessible port:

$ kubectl get svc -A
NAMESPACE     NAME                                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                        AGE
<snip>
prometheus    prometheus-operator-1597990146-grafana                    NodePort    10.108.187.141   <none>        80:32258/TCP                   17h

Open your browser to http://<machine-ip-address>:32258 and view the Grafana login page. Access the Grafana home page using the admin username. The password credentials for the login are available in the kube-prometheus-stack.values file we edited in the earlier section of this document:

## Deploy default dashboards.
##
defaultDashboardsEnabled: true

adminPassword: prom-operator

[Screenshot: Grafana dashboard showing GPU metrics]

Uninstalling GPU Operator

To uninstall the operator:

$ helm delete $(helm list | grep gpu-operator | awk '{print $1}')

You should now see all the pods being deleted:

$ kubectl get pods -n gpu-operator-resources
No resources found.