Upgrading the NVIDIA GPU Operator
Prerequisites
If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods, label the namespace for the Operator to set the enforcement policy to privileged:
$ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
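To confirm that the label took effect, a check along the following lines may help. This is a minimal sketch, not part of the official procedure: the function name psa_enforce_level is hypothetical, and it assumes that kubectl is configured for the cluster and that the Operator namespace is gpu-operator, as in the command above.

```shell
# Hypothetical helper: print the PSA enforce level set on a namespace.
# Assumes kubectl is configured for the cluster; the namespace defaults to
# gpu-operator, as in the label command above. Prints nothing if the label
# is absent.
psa_enforce_level() {
  kubectl get namespace "${1:-gpu-operator}" \
    -o jsonpath='{.metadata.labels.pod-security\.kubernetes\.io/enforce}'
}

# Example call, left commented out because it needs cluster access:
#   [ "$(psa_enforce_level)" = "privileged" ] && echo "PSA label is set"
```

The backslashes in the JSONPath expression escape the dots inside the label key so that kubectl treats the key as a single field name.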
Using Helm
The GPU Operator supports dynamic updates to existing resources. This ability enables the GPU Operator to ensure settings from the cluster policy specification are always applied and current.
Because Helm does not automatically upgrade existing CRDs, upgrade the GPU Operator chart in one of two ways: update the CRDs manually before the Helm upgrade, or enable a Helm hook that updates them during the upgrade.
Option 1: Manually Upgrading CRDs
Workflow: update the CRDs from the latest chart, then upgrade by using Helm.
With this procedure, all existing GPU Operator resources are updated in place, and the cluster policy resource is patched with updates from values.yaml.
Specify the Operator release tag in an environment variable:
$ export RELEASE_TAG=v23.9.0
Apply the custom resource definitions for the cluster policy and NVIDIA driver:
$ kubectl apply -f \
    https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$RELEASE_TAG/deployments/gpu-operator/crds/nvidia.com_clusterpolicies_crd.yaml
$ kubectl apply -f \
    https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$RELEASE_TAG/deployments/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml
Example Output
customresourcedefinition.apiextensions.k8s.io/clusterpolicies.nvidia.com configured
customresourcedefinition.apiextensions.k8s.io/nvidiadrivers.nvidia.com created
Apply the custom resource definition for Node Feature Discovery:
$ kubectl apply -f \
    https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$RELEASE_TAG/deployments/gpu-operator/charts/node-feature-discovery/crds/nfd-api-crds.yaml
Example Output
customresourcedefinition.apiextensions.k8s.io/nodefeaturerules.nfd.k8s-sigs.io configured
Update the information about the Operator chart:
$ helm repo update nvidia
Example Output
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Fetch the values from the chart:
$ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml
Update the values file as needed.
Upgrade the Operator:
$ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator -f values-$RELEASE_TAG.yaml
Example Output
Release "gpu-operator" has been upgraded. Happy Helming!
NAME: gpu-operator
LAST DEPLOYED: Thu Apr 20 15:05:52 2023
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 2
TEST SUITE: None
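The CRD steps in this procedure repeat the same kubectl apply command for three manifests, so they could be wrapped in a loop. The following is a sketch under stated assumptions rather than an official script: the function names crd_urls and apply_crds are hypothetical, and the URLs are the ones used in the steps above.

```shell
# Sketch: apply every CRD manifest for one release tag in a single loop.
# The base URL and file paths come from the manual steps in this procedure;
# the function names are hypothetical.

# Print one CRD manifest URL per line for the release tag given as $1.
crd_urls() {
  base="https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$1/deployments/gpu-operator"
  printf '%s\n' \
    "$base/crds/nvidia.com_clusterpolicies_crd.yaml" \
    "$base/crds/nvidia.com_nvidiadrivers.yaml" \
    "$base/charts/node-feature-discovery/crds/nfd-api-crds.yaml"
}

# Apply each manifest in order and stop at the first failure.
apply_crds() {
  crd_urls "$1" | while IFS= read -r url; do
    kubectl apply -f "$url" || exit 1
  done
}

# Example call, left commented out because it needs cluster access:
#   apply_crds "$RELEASE_TAG"
```

Stopping at the first failed apply keeps the later helm upgrade from running against a partially updated set of CRDs.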
Option 2: Automatically Upgrading CRDs Using a Helm Hook
Starting with GPU Operator v22.09, a pre-upgrade Helm hook can automatically upgrade the CRDs to the latest version.
The operator.upgradeCRD parameter triggers this hook during a GPU Operator upgrade with Helm. The parameter is disabled by default.
To enable it, pass the --set operator.upgradeCRD=true option to the upgrade command, as shown in the following steps.
Specify the Operator release tag in an environment variable:
$ export RELEASE_TAG=v23.9.0
Update the information about the Operator chart:
$ helm repo update nvidia
Example Output
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Fetch the values from the chart:
$ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml
Update the values file as needed.
Upgrade the Operator:
$ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --set operator.upgradeCRD=true --disable-openapi-validation -f values-$RELEASE_TAG.yaml
Note
The --disable-openapi-validation option is required in this case so that Helm does not validate the custom resource (CR) instance from the new chart against the old CRD. Because the CR instance in the chart is valid for the upgraded CRD, the two are compatible.
Helm hooks used with the GPU Operator run the Operator image itself. If the Operator image cannot be pulled (because of a network error or, with NVIDIA AI Enterprise (NVAIE), an invalid NGC registry secret), the hooks fail. In that case, delete the chart with the --no-hooks option so that the deletion does not hang on hook failures.
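The note above does not show a command for deleting the chart without hooks. A minimal sketch follows, assuming Helm 3 and the release name and namespace used throughout this page (both gpu-operator); the function name uninstall_without_hooks is hypothetical.

```shell
# Sketch: remove the release without running any Helm hooks.
# Assumes Helm 3. The release name ($1) and namespace ($2) both default to
# gpu-operator, matching the upgrade commands on this page.
uninstall_without_hooks() {
  helm uninstall "${1:-gpu-operator}" -n "${2:-gpu-operator}" --no-hooks
}

# Example call, left commented out because it removes the release:
#   uninstall_without_hooks
```

Because hooks are skipped, any cleanup that they would normally perform does not run, so it may be worth checking the namespace for leftover resources afterward.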
Cluster Policy Updates
The GPU Operator also supports dynamic updates to the ClusterPolicy custom resource by using kubectl:
$ kubectl edit clusterpolicy
After you save the edits, Kubernetes automatically applies the updates to the cluster.
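For scripted changes, a non-interactive patch can stand in for kubectl edit. The sketch below is illustrative only: it assumes that the ClusterPolicy resource is named cluster-policy (the name that the Helm chart typically creates) and uses the spec.mig.strategy field purely as an example. The function name set_mig_strategy is hypothetical.

```shell
# Hypothetical helper: set spec.mig.strategy on the ClusterPolicy resource
# with a JSON merge patch instead of an interactive edit.
# Assumes the resource is named cluster-policy; adjust if yours differs.
set_mig_strategy() {
  kubectl patch clusterpolicies.nvidia.com/cluster-policy --type=merge \
    -p "{\"spec\":{\"mig\":{\"strategy\":\"$1\"}}}"
}

# Example call, left commented out because it changes cluster state:
#   set_mig_strategy mixed
```

A merge patch changes only the fields named in the payload and leaves the rest of the specification untouched, which makes it a reasonable fit for single-field updates.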
Additional Controls for Driver Upgrades
While most of the GPU Operator managed daemonsets can be upgraded seamlessly, the NVIDIA driver daemonset has special considerations. Refer to GPU Driver Upgrades for more information.
Using OLM in OpenShift
For upgrading the GPU Operator when running in OpenShift, refer to the official documentation on upgrading installed operators: https://docs.openshift.com/container-platform/4.8/operators/admin/olm-upgrading-operators.html