Upgrading the NVIDIA GPU Operator

Prerequisites

  • If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods, label the namespace for the Operator to set the enforcement policy to privileged:

    $ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
    
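    To confirm that the label was applied, view the namespace labels. This check assumes the Operator namespace is gpu-operator, as in the preceding command:

    $ kubectl get namespace gpu-operator --show-labels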

Using Helm

The GPU Operator supports dynamic updates to existing resources. This enables the Operator to ensure that settings from the cluster policy specification are always applied and kept current.

Because Helm does not automatically upgrade existing CRDs, you must handle the CRD upgrade yourself: either apply the updated CRDs manually before upgrading the chart, or enable a Helm hook that upgrades them as part of the chart upgrade.
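
Before you choose an option, you can check which Operator-related CRDs are currently installed. The following check assumes the default CRD names used by the GPU Operator and Node Feature Discovery:

$ kubectl get crds | grep -E 'nvidia.com|nfd.k8s-sigs.io'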

Option 1: Manually Upgrading CRDs

The manual procedure has two steps: update the CRDs from the latest chart, and then upgrade the Operator by using Helm.

With this procedure, all existing GPU Operator resources are updated in place and the cluster policy resource is patched with the updates from values.yaml.

  1. Specify the Operator release tag in an environment variable:

    $ export RELEASE_TAG=v23.9.0
    
  2. Apply the custom resource definitions for the cluster policy and NVIDIA driver:

    $ kubectl apply -f \
        https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$RELEASE_TAG/deployments/gpu-operator/crds/nvidia.com_clusterpolicies_crd.yaml
    
    $ kubectl apply -f \
        https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$RELEASE_TAG/deployments/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml
    

    Example Output

    customresourcedefinition.apiextensions.k8s.io/clusterpolicies.nvidia.com configured
    customresourcedefinition.apiextensions.k8s.io/nvidiadrivers.nvidia.com created
    
  3. Apply the custom resource definition for Node Feature Discovery:

    $ kubectl apply -f \
        https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$RELEASE_TAG/deployments/gpu-operator/charts/node-feature-discovery/crds/nfd-api-crds.yaml
    

    Example Output

    customresourcedefinition.apiextensions.k8s.io/nodefeaturerules.nfd.k8s-sigs.io configured
    
  4. Update the information about the Operator chart:

    $ helm repo update nvidia
    

    Example Output

    Hang tight while we grab the latest from your chart repositories...
    ...Successfully got an update from the "nvidia" chart repository
    Update Complete. ⎈Happy Helming!⎈
    
  5. Fetch the values from the chart:

    $ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml
    
  6. Update the values file as needed.

  7. Upgrade the Operator:

    $ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator -f values-$RELEASE_TAG.yaml
    

    Example Output

    Release "gpu-operator" has been upgraded. Happy Helming!
    NAME: gpu-operator
    LAST DEPLOYED: Thu Apr 20 15:05:52 2023
    NAMESPACE: gpu-operator
    STATUS: deployed
    REVISION: 2
    TEST SUITE: None
    
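    After the upgrade completes, you can optionally confirm that the release was updated and that the Operator pods are running. These checks assume the gpu-operator release name and namespace used in the preceding command:

    $ helm list -n gpu-operator
    $ kubectl get pods -n gpu-operator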

Option 2: Automatically Upgrading CRDs Using a Helm Hook

Starting with GPU Operator v22.09, a pre-upgrade Helm hook can automatically upgrade the CRDs to the latest version.

Starting with GPU Operator v24.9.0, the CRD upgrade Helm hook is enabled by default and runs a CRD upgrade job when you upgrade by using Helm.

  1. Specify the Operator release tag in an environment variable:

    $ export RELEASE_TAG=v23.9.0
    
  2. Update the information about the Operator chart:

    $ helm repo update nvidia
    

    Example Output

    Hang tight while we grab the latest from your chart repositories...
    ...Successfully got an update from the "nvidia" chart repository
    Update Complete. ⎈Happy Helming!⎈
    
  3. Fetch the values from the chart:

    $ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml
    
  4. Update the values file as needed.

  5. Upgrade the Operator:

    $ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
        --disable-openapi-validation -f values-$RELEASE_TAG.yaml
    

    Note

• The --disable-openapi-validation option is required so that Helm does not validate the custom resource instance from the new chart against the CRD that is currently installed. Because the instance in the chart is valid for the upgraded CRD, the upgrade is compatible once the hook updates the CRDs.

• The Helm hooks used with the GPU Operator run the Operator image itself. If the Operator image cannot be pulled, for example because of a network error or an invalid NGC registry secret in the case of NVIDIA AI Enterprise, the hooks fail. In that case, delete the chart with the --no-hooks option so that the deletion does not hang on hook failures, as shown in the following example.
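
      The following sketch removes the release without running the hooks; it assumes the gpu-operator release name and namespace used in the preceding steps:

      $ helm uninstall gpu-operator -n gpu-operator --no-hooks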

Cluster Policy Updates

The GPU Operator also supports dynamic updates to the ClusterPolicy custom resource by using kubectl:

$ kubectl edit clusterpolicy

After the edits are complete, Kubernetes automatically applies the updates to the cluster.
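
You can also apply updates non-interactively with kubectl patch. The following sketch assumes the ClusterPolicy instance is named cluster-policy, the default name created by the Helm chart, and uses the dcgmExporter.enabled field purely as an illustration; substitute the field you actually want to change:

$ kubectl patch clusterpolicy/cluster-policy --type merge \
    -p '{"spec": {"dcgmExporter": {"enabled": true}}}'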

Additional Controls for Driver Upgrades

While most of the daemonsets managed by the GPU Operator can be upgraded seamlessly, the NVIDIA driver daemonset requires special consideration. Refer to GPU Driver Upgrades for more information.

Using OLM in OpenShift

To upgrade the GPU Operator when it runs on Red Hat OpenShift, refer to the official documentation on upgrading installed Operators: https://docs.openshift.com/container-platform/4.8/operators/admin/olm-upgrading-operators.html