Upgrading the NVIDIA GPU Operator

Prerequisites

  • If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods, label the namespace for the Operator to set the enforcement policy to privileged:

    $ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
    
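    To confirm that the label was applied, view the namespace labels. This check assumes the Operator namespace is gpu-operator, as in the preceding command:

    $ kubectl get namespace gpu-operator --show-labels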

Using Helm

The GPU Operator supports dynamic updates to existing resources. This enables the Operator to ensure that settings from the cluster policy specification are always applied and kept current.

Because Helm does not automatically upgrade existing CRDs, you must handle the CRD upgrade yourself: either apply the updated CRDs manually before upgrading the chart, or enable a Helm hook that upgrades them as part of the chart upgrade.
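
Before you choose an option, you can check which Operator-related CRDs are currently installed. The following check assumes the default CRD names used by the GPU Operator and Node Feature Discovery:

$ kubectl get crds | grep -E 'nvidia.com|nfd.k8s-sigs.io'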

Option 1: Manually Upgrading CRDs

The manual procedure has two steps: update the CRDs from the latest chart, and then upgrade the Operator by using Helm.

With this procedure, all existing GPU Operator resources are updated in place and the cluster policy resource is patched with the updates from values.yaml.

  1. Specify the Operator release tag in an environment variable:

    $ export RELEASE_TAG=v23.9.0
    
  2. Apply the custom resource definitions for the cluster policy and NVIDIA driver:

    $ kubectl apply -f \
        https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$RELEASE_TAG/deployments/gpu-operator/crds/nvidia.com_clusterpolicies_crd.yaml
    
    $ kubectl apply -f \
        https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$RELEASE_TAG/deployments/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml
    

    Example Output

    customresourcedefinition.apiextensions.k8s.io/clusterpolicies.nvidia.com configured
    customresourcedefinition.apiextensions.k8s.io/nvidiadrivers.nvidia.com created
    
  3. Apply the custom resource definition for Node Feature Discovery:

    $ kubectl apply -f \
        https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/$RELEASE_TAG/deployments/gpu-operator/charts/node-feature-discovery/crds/nfd-api-crds.yaml
    

    Example Output

    customresourcedefinition.apiextensions.k8s.io/nodefeaturerules.nfd.k8s-sigs.io configured
    
  4. Update the information about the Operator chart:

    $ helm repo update nvidia
    

    Example Output

    Hang tight while we grab the latest from your chart repositories...
    ...Successfully got an update from the "nvidia" chart repository
    Update Complete. ⎈Happy Helming!⎈
    
  5. Fetch the values from the chart:

    $ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml
    
  6. Update the values file as needed.

  7. Upgrade the Operator:

    $ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator -f values-$RELEASE_TAG.yaml
    

    Example Output

    Release "gpu-operator" has been upgraded. Happy Helming!
    NAME: gpu-operator
    LAST DEPLOYED: Thu Apr 20 15:05:52 2023
    NAMESPACE: gpu-operator
    STATUS: deployed
    REVISION: 2
    TEST SUITE: None
    
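    After the upgrade completes, you can optionally confirm that the release was updated and that the Operator pods are running. These checks assume the gpu-operator release name and namespace used in the preceding command:

    $ helm list -n gpu-operator
    $ kubectl get pods -n gpu-operator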

Option 2: Automatically Upgrading CRDs Using a Helm Hook

Starting with GPU Operator v22.09, a pre-upgrade Helm hook can automatically upgrade the CRDs to the latest version.

Starting with GPU Operator v24.9.0, the CRD upgrade Helm hook is enabled by default and runs a CRD upgrade job when you upgrade by using Helm.

  1. Specify the Operator release tag in an environment variable:

    $ export RELEASE_TAG=v23.9.0
    
  2. Update the information about the Operator chart:

    $ helm repo update nvidia
    

    Example Output

    Hang tight while we grab the latest from your chart repositories...
    ...Successfully got an update from the "nvidia" chart repository
    Update Complete. ⎈Happy Helming!⎈
    
  3. Fetch the values from the chart:

    $ helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml
    
  4. Update the values file as needed.

  5. Upgrade the Operator:

    $ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
        --disable-openapi-validation -f values-$RELEASE_TAG.yaml
    

    Note

• The --disable-openapi-validation option is required so that Helm does not validate the custom resource instance from the new chart against the CRD that is currently installed. Because the instance in the chart is valid for the upgraded CRD, the upgrade is compatible once the hook updates the CRDs.

• The Helm hooks used with the GPU Operator run the Operator image itself. If the Operator image cannot be pulled, for example because of a network error or an invalid NGC registry secret in the case of NVIDIA AI Enterprise, the hooks fail. In that case, delete the chart with the --no-hooks option so that the deletion does not hang on hook failures, as shown in the following example.
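
      The following sketch removes the release without running the hooks; it assumes the gpu-operator release name and namespace used in the preceding steps:

      $ helm uninstall gpu-operator -n gpu-operator --no-hooks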

Cluster Policy Updates

The GPU Operator also supports dynamic updates to the ClusterPolicy custom resource by using kubectl:

$ kubectl edit clusterpolicy

After the edits are complete, Kubernetes automatically applies the updates to the cluster.
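
You can also apply updates non-interactively with kubectl patch. The following sketch assumes the ClusterPolicy instance is named cluster-policy, the default name created by the Helm chart, and uses the dcgmExporter.enabled field purely as an illustration; substitute the field you actually want to change:

$ kubectl patch clusterpolicy/cluster-policy --type merge \
    -p '{"spec": {"dcgmExporter": {"enabled": true}}}'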

Additional Controls for Driver Upgrades

While most of the daemonsets managed by the GPU Operator can be upgraded seamlessly, the NVIDIA driver daemonset requires special consideration. Refer to GPU Driver Upgrades for more information.

Using OLM in OpenShift

To upgrade the GPU Operator when it runs on Red Hat OpenShift, refer to the official documentation on upgrading installed Operators: https://docs.openshift.com/container-platform/4.8/operators/admin/olm-upgrading-operators.html