Release Notes#

This document describes the new features, improvements, fixed issues, and known issues for the NVIDIA GPU Operator.

Refer to the GPU Operator Component Matrix for a list of software components and versions included in each release.

Note

GPU Operator beta releases are documented on GitHub. NVIDIA AI Enterprise builds are not posted on GitHub.

26.3.3#

New Features#

Updated software component versions:
- NVIDIA Kubernetes Device Plugin v0.19.3
- NVIDIA GPU Feature Discovery for Kubernetes v0.19.3

Fixed Issues#

Fixed a regression where feature flags such as MOFED_ENABLED and GDS_ENABLED were enabled by default for the device plugin operand. This caused all ibverbs device nodes on a node to be injected into GPU workload containers, disrupting RDMA and NCCL workloads by exposing network interfaces that were not intended for the workload. The GPU Operator now infers these feature flags dynamically from the kernel modules that are loaded on each node, rather than enabling them unconditionally. (PR #2525, k8s-device-plugin PR #1837)

26.3.2#

New Features#

Updated software component versions:
- NVIDIA Driver Manager for Kubernetes v0.11.0
- NVIDIA Container Toolkit v1.19.1
- NVIDIA Kubernetes Device Plugin v0.19.2
- NVIDIA GPU Feature Discovery for Kubernetes v0.19.2
- NVIDIA MIG Manager for Kubernetes v0.14.2
- NVIDIA DCGM Exporter v4.5.3-4.8.2
Added support for Kubernetes 1.36.
Added support for including Kubernetes pod metadata in DCGM Exporter GPU metrics. Configure this using new fields under spec.dcgmExporter in the ClusterPolicy custom resource:
- enablePodLabels adds pod labels as Prometheus label dimensions on the GPU metrics.
- enablePodUID adds the pod UID as a Prometheus label dimension on the GPU metrics.
When either field is set to true, the GPU Operator provisions a cluster-scoped ClusterRole and ClusterRoleBinding (nvidia-dcgm-exporter-read-pods) that grants the DCGM Exporter service account get, list, and watch access to pods across the cluster, and sets the corresponding DCGM_EXPORTER_KUBERNETES_* environment variables on the exporter container. This removes the need to manually set those environment variables or hand-create the corresponding RBAC resources for the DCGM Exporter service account.

Use the podLabelAllowlistRegex field (a list of regular expressions) to limit which pod labels are emitted as Prometheus dimensions. It is recommended to configure this allowlist in clusters with many pod labels to reduce Prometheus cardinality. (PR #2406)

Note

DRA resourceSlices enrichment, which the upstream DCGM Exporter Helm chart exposes, is not supported through the GPU Operator in this release.
Added support for setting custom annotations on the DCGM Exporter DaemonSet through the new spec.dcgmExporter.annotations field in the ClusterPolicy custom resource (or dcgmExporter.annotations in the Helm chart). (PR #2292)
Added support for the Node Resource Interface (NRI) Plugin with CRI-O v1.34 or later. Refer to Container Device Interface (CDI) and Node Resource Interface (NRI) Plugin Support for more information.

Fixed Issues#

Fixed an issue in the NVIDIA Driver Manager for Kubernetes where Kubernetes API server connectivity interruptions could leave nodes with stale paused-for-driver-upgrade label values during cluster bring-up or driver upgrades. The stale labels prevented operand DaemonSets from matching the node, leaving the ClusterPolicy in a not-ready state. The driver manager now retries label updates with a longer exponential backoff and treats Kubernetes API errors as fatal so failures surface immediately instead of being silently dropped. (PR #176)

26.3.1#

New Features#

Updated software component versions:
- NVIDIA GDRCopy Driver v2.5.2
- NVIDIA Kata Sandbox Device Plugin v0.0.3
- NVIDIA Confidential Computing Manager for Kubernetes v0.4.0
The ClusterPolicy and NVIDIADriver custom resources now support hostNetwork for all GPU Operator operands. Previously, only DCGM Exporter supported the hostNetwork field. Setting hostNetwork: true for a component causes its pods to share the host’s network namespace, binding directly to the host’s network interfaces and IP address rather than using the cluster’s virtual network. This is useful in environments where GPU Operator component pods need to expose ports directly on the host network, such as when a Prometheus instance scrapes metrics from the host network namespace, or in bare-metal and HPC environments where cluster network overhead or non-standard network configuration makes host networking preferable. (PR #2246)
Added support for mounting /lib/modules from the host when using precompiled drivers. This is required for precompiled driver containers on SUSE Linux Enterprise Server (SLES) 15 SP7 and SLES 16, which use host kernel modules without requiring the full kernel to be bundled in the driver container. (PR #2252)

Fixed Issues#

Fixed an issue in the OLM bundle where the NVIDIA KubeVirt GPU Device Plugin referenced an amd64-only image digest instead of a multi-arch digest. On ARM servers, this caused pods to fail with an Exec format error. (PR #2265)
Fixed an issue where the operating system release name was recalculated from the node label tag rather than being stored when it was first retrieved. This could cause errors when the tag format was not recognized. (PR #2244)

Known Issues#

Pod specifications that set spec.hostUsers: false to enable Kubernetes user namespaces are not supported. When a pod runs in its own user namespace, the NVIDIA Container Toolkit createContainer hook (nvidia-cdi-hook) runs as the remapped user inside the container’s user namespace and cannot read the OCI bundle’s config.json to determine the container root. As a result, container creation fails with an error such as:
```
Error: container create failed: read status from sync socket: No such process
```
As a workaround, omit the hostUsers field or set spec.hostUsers: true for any pods that request GPUs or that are managed by the GPU Operator.

Refer to NVIDIA Container Toolkit issue #648 for more information.

26.3.0#

New Features#

Updated software component versions:
- NVIDIA Driver Manager for Kubernetes v0.10.0
- NVIDIA Container Toolkit v1.19.0
- NVIDIA DCGM v4.5.2-1
- NVIDIA DCGM Exporter v4.5.1-4.8.0
- NVIDIA GDS Driver v2.27.3
- NVIDIA Kubernetes Device Plugin v0.19.0
- NVIDIA MIG Manager for Kubernetes v0.14.0
- NVIDIA GPU Feature Discovery for Kubernetes v0.19.0
- NVIDIA vGPU Device Manager v0.4.2
- NVIDIA KubeVirt GPU Device Plugin v1.5.0
- NVIDIA Kata Sandbox Device Plugin v0.0.2
- NVIDIA Confidential Computing Manager for Kubernetes v0.3.0
Added support for these NVIDIA Data Center GPU Driver versions:
- 580.126.20 (default)
Added support for Node Resource Interface (NRI) Plugin. The NRI Plugin offers a new way of injecting GPUs into GPU management containers, without needing the nvidia runtime class. Enable by setting the cdi.nriPluginEnabled field to true in the ClusterPolicy custom resource or by setting the cdi.nriPluginEnabled flag in the Helm chart.

When the NRI Plugin is enabled, no nvidia runtime class gets created and no modifications are made to the container runtime configuration, e.g. no modifications are made to containerd’s config.toml file. This is particularly advantageous for platforms like k3s, k0s, and Rancher Kubernetes Engine 2 that configure containerd in a non-standard way. On such platforms, when the NRI plugin is enabled users no longer need to configure environment variables like CONTAINERD_CONFIG, CONTAINERD_SOCKET, or RUNTIME_CONFIG_SOURCE.

This feature requires containerd v1.7.30+, v2.1.x, or v2.2.x.

To learn more, refer to Container Device Interface (CDI) and Node Resource Interface (NRI) Plugin Support.

Note

Enabling the NRI plugin is not supported with cri-o.
Added support for dynamic MIG config generation. By default, the MIG Manager will automatically generate a per-node ConfigMap with the default MIG profiles for the available GPUs on the node. This replaces the previous static ConfigMap. You are still able to use a custom MIG configuration if you have specific requirements. Refer to the MIG Manager documentation for more information.

There is a known issue with MIG configurations on RHEL 8 with pre-installed NVIDIA drivers, refer to the Known Issues section for more information.
Added support for the NVIDIA Driver Custom Resource Definition (CRD). Use this feature on new cluster installations to configure multiple driver types and versions on different nodes or multiple operating system versions on nodes. Refer to the NVIDIA Driver Custom Resource Definition documentation for more information.

Note

This feature does not support an upgrade from an earlier version of the NVIDIA GPU Operator or switching from ClusterPolicy to the NVIDIA Driver CRD. It is recommended that you only use this feature from new installations.
Added support for KubeVirt with GPU passthrough on Ubuntu 24.04 LTS
Added support for K3s.
Added support for containerd 2.2.
Added support for new operating systems:
- Rocky Linux 9.7
- Red Hat Enterprise Linux 10.0, 10.1
- Red Hat Enterprise Linux 9.7
Added support for NVIDIA GB200 NVL4
Added support for NVIDIA RTX Pro 4500 Blackwell Server Edition.
Added support for NVIDIA Network Operator v26.1.0.
Added support for including extra manifests with the Helm chart in the extraObjects field.
Added support for the DCGM Exporter to expose a metric port on the host network namespace. Enabled by setting hostNetwork: true in the ClusterPolicy custom resource, or passing --set dcgmExporter.hostNetwork=true to the Helm chart. (PR #1962)
Added liveness and readiness probes for the DCGM and DCGM Exporter pods. The probes ensure that pods are not marked as ready until DCGM is ready to serve traffic. (PR #2175)
Added PodSecurityContext support for DaemonSets (PR #2120). In ClusterPolicy, set spec.daemonsets.podSecurityContext; in NVIDIADriver, set spec.podSecurityContext.
Validated Operator government-ready component support with Rancher Kubernetes Engine 2 using Ubuntu 24.04.
The following components are now available as government ready components: NVIDIA sandbox device plugin, Kubevirt Device Plugin, and vGPU Device Manager.

Improvements#

Improved NVIDIA Driver resiliency when the driver container is removed. In previous versions, the NVIDIA Driver would unload the kernel modules and perform the driver compilation process, which could take several minutes to complete, delaying the driver container from restarting. In v26.3.0, if there is no change to the CUDA driver version (or other driver configuration) in the ClusterPolicy, the NVIDIA Driver will reuse the kernel modules that are available on the node. This reduces the time to recover from the driver container removal from minutes to seconds.
Reduced unnecessary API calls and decreased reconciliation time on large GPU clusters by improving node label logic (PR #2113).
Improved the GPU Operator to now use operating system version labels from GPU worker nodes (added by NFD) when determining OS-specific paths for repository configuration files. (PR #562, PR #2138)
Driver validation now waits for all enabled additional drivers (such as GDS and GDRCopy) to be installed before proceeding, and each node records a node-local view of enabled features when using multiple NVIDIADriver CRs or optional components. (PR #2014)
Improved support for Kata Containers. Changes in this release include:
- Deprecating the NVIDIA Kata Manager. You now use kata-deploy to install the Kata Container and the Kata runtime class
- Adding support for the NVIDIA Kata Sandbox Device Plugin.
- Configure sandboxWorkload.mode=kata during installation or in the ClusterPolicy to enable Kata Containers.
Refer to the Kata Containers documentation for full details on configuring the GPU Operator to use Kata Containers.
Improved support for Confidential Containers. The GPU Operator now supports deploying Confidential Containers using Kata Containers and NVIDIA Reference Architecture for Confidential Containers. Refer to the Confidential Containers documentation for full details the Confidential Contaienrs reference architecture and on configuring the GPU Operator to use Confidential Containers.

Fixed Issues#

Fixed an issue where driver installations can fail because cached packages were incorrectly referenced. (PR #592)
Fixed a shared state issue that caused incorrect driver images in multi-node-pool clusters. (PR #1952)
Fixed an issue where the GPU Operator was applying driver upgrade annotations when the driver is disabled. (PR #1968)
Fixed an issue where an empty value in the Helm chart for device.plugin was incorrectly causing an error. (PR #1999)
Fixed an issue on OpenShift clusters where the dcgm-exporter pod gets bound to another Security Context Constraint (SCC) object instead of the nvidia-dcgm-exporter SCC that the GPU Operator creates. (PR #2122)
Fixed an issue where the GPU Operator was not correctly cleaning up DaemonSets (PR #2081).
Fixed an issue where the GPU Operator was not adding a namespace to ServiceAccount objects. (PR #2039)

Known Issues#

When GPUDirect RDMA is enabled, the nvidia-peermem container may fail to restart after the driver pod restarts without a node reboot and without any driver configuration changes. In this scenario, the driver uses a fast-path optimization that skips recompilation, but the nvidia-peermem sidecar does not detect that its module is already loaded and fails to start. This occurs because the kernel state is not cleared when the driver pod restarts.

To work around this issue, set the FORCE_REINSTALL=true environment variable in the ClusterPolicy.
```
$ kubectl patch clusterpolicy cluster-policy --type=json \
    -p='[{"op": "add", "path": "/spec/driver/manager/env/-", "value": {"name": "FORCE_REINSTALL", "value": "true"}}]'
```
Setting FORCE_REINSTALL=true forces full driver recompilation, node drain, and GPU workload disruption on every restart. Alternatively, rebooting the node clears the kernel state and allows the nvidia-peermem module to load successfully, though this may disrupt running workloads.
On RHEL 8 nodes with pre-installed NVIDIA drivers (driver.enabled=false), MIG configuration can fail when using NVIDIA MIG Manager v0.13.1 or later. NVIDIA MIG Manager copies the nvidia-mig-parted binary to the host and runs it in the host userspace by using chroot. Recent versions of the binary were compiled against a UBI9 base image and require GLIBC 2.32 and GLIBC 2.34 which are not available on RHEL 8, causing the following errors in the MIG Manager pod logs:
```
/usr/local/nvidia/mig-manager/nvidia-mig-parted: /lib64/libc.so.6: version `GLIBC_2.32' not found
/usr/local/nvidia/mig-manager/nvidia-mig-parted: /lib64/libc.so.6: version `GLIBC_2.34' not found
```
To work around this issue, downgrade the NVIDIA MIG Manager component to v0.12.3. After downgrading, automatically generated per-node MIG configuration ConfigMaps will not be available. MIG configuration information will be available in the default-mig-parted-config ConfigMap instead. Refer to the MIG Manager documentation for more information on MIG configuration.

Refer to the MIG Controller issue #329 for more information.
After you delete the default NVIDIADriver custom resource, any custom NVIDIADriver custom resources that you created might not become active automatically. The custom resources remain in a pending state because the Operator controller does not re-evaluate them after the conflicting default custom resource is removed.

To work around this issue, restart the GPU Operator controller by deleting the controller pod:
```
$ kubectl delete pod -n gpu-operator -l app=gpu-operator
```
Restarting the controller pod does not disrupt running GPU workloads or driver pods on nodes.

Refer to issue #2259 for more information.

Removals and Deprecations#

Marked unused field defaultRuntime as optional in the ClusterPolicy. (PR #2000)
NVIDIA Kata Manager is now deprecated. Refer to the Kata Containers documentation for more information on using Kata Containers without this component.

25.10.1#

New Features#

Updated software component versions:
- NVIDIA Container Toolkit v1.18.1
- NVIDIA DCGM v4.4.2-1
- NVIDIA DCGM Exporter v4.4.2-4.7.0
- NVIDIA Kubernetes Device Plugin v0.18.1
- NVIDIA GPU Feature Discovery v0.18.1
- NVIDIA MIG Manager for Kubernetes 0.13.1
- NVIDIA Driver Manager for Kubernetes v0.9.1
Added support for this NVIDIA Data Center GPU Driver version:
- 580.105.08 (default)
Add HPC job mapping support to DCGM Exporter to collect metrics for HPC jobs running on the cluster.

Configure the HPC job mapping by setting the dcgmExporter.hpcJobMapping.enabled field to true in the ClusterPolicy custom resource. Set dcgmExporter.hpcJobMapping.directory with the directory path where HPC job mapping files are created by the workload manager. The default directory is /var/lib/dcgm-exporter/job-mapping.
Improved the cluster policy reconciler to be more resilient to race conditions during node updates.

Fixed Issues#

Fixed the following known issue introduced in GPU Operator v25.10.0:
- When using cri-o as the container runtime, several GPU Operator pods can be stuck in the Init:RunContainerError or Init:CreateContainerError state during GPU Operator installation or upgrade, or during GPU driver daemonset upgrade.
- NVIDIA Container Toolkit 1.18.0 overwrites the imports field in the top-level containerd configuration file, so any previously imported paths are lost. This was fixed in NVIDIA Container Toolkit v1.18.1.
Fixed a race condition where user-supplied NVIDIA kernel module parameters were sometimes not being applied by the driver daemonset. For more information, refer to PR #1939.
Fixed a bug where driver images were being incorrectly assigned in multi-nodepool clusters. For more information, refer to Issue #1622.
Fixed a bug where the GPU Operator Helm chart template was not assigning the correct namespace to resources it created.
Fixed a bug where the k8s-driver-manager would wait indefinitely when MOFED is enabled and USE_HOST_MOFED is set to true despite the MOFED being pre-installed on the host.

Known Issues#

When deploying the GPU Operator on systems with SELinux in enforcing mode, the MIG Manager does not get scheduled on GPU nodes. This happens because the GPU Feature Discovery pod has insufficient permissions on Node Feature Discovery’s feature-file drop-in directory, so it cannot add the label that indicates a MIG-capable GPU is present. To work around this issue, configure NVIDIA GPU Feature Discovery to use the Node Feature API instead of feature files in ClusterPolicy:
```
gfd:
  env:
  - name: USE_NODE_FEATURE_API
    value: "true"
```

25.10.0#

New Features#

The NVIDIA GPU Operator now supports government ready components for NVIDIA AI Enterprise customers. Government ready is NVIDIA’s designation for software that meets applicable security requirements for deployment in your FedRAMP High or equivalent sovereign use case. For more information on NVIDIA’s government ready support, refer to the NVIDIA GPU Operator Government Ready deployment guide or the AI Software for Regulated Environments White Paper.
Updated software component versions:
- NVIDIA Driver Manager for Kubernetes v0.9.0
- NVIDIA Container Toolkit v1.18.0
- NVIDIA DCGM v4.4.1
- NVIDIA DCGM Exporter v4.4.1-4.6.0
- Node Feature Discovery v0.18.2
- NVIDIA GDS Driver v2.26.6
- NVIDIA Kubernetes Device Plugin v0.18.0
- NVIDIA MIG Manager for Kubernetes v0.13.0
- NVIDIA vGPU Device Manager v0.4.1
Added support for these NVIDIA Data Center GPU Driver versions:
- 580.95.05 (default)
- 570.195.03
- 535.274.02
Container Device Interface (CDI) is now enabled by default when installing or upgrading (via helm) the GPU Operator to 25.10.0. The cdi.enabled field in the ClusterPolicy is now set to true by default. The cdi.default field is now deprecated and will be ignored.
- When cdi.enabled is true the GPU Operator now leverages CDI support in container runtimes, such as containerd and cri-o, for injecting GPU support into workload containers. This differs from prior releases where CDI support in container runtimes was not used, and instead, an nvidia runtime class configured in CDI mode was used.
- When CDI is enabled, no configuration changes are required for standard workloads using GPU allocation through the Device Plugin. Setting runtimeClassName is not required for standard workloads. For workloads that already have runtimeClassName: nvidia set in their pod spec YAML, no change is necessary.
- GPU Management Containers that use the NVIDIA_VISIBLE_DEVICES environment variable to get GPU access, bypassing GPU allocation via the Device Plugin or DRA Driver for GPUs, must set runtimeClassName: nvidia in the pod specification. It’s recommended that NVIDIA_VISIBLE_DEVICES only be used by GPU Management Containers. A GPU Management Container is a container that requires access to all GPUs without them being allocated by Kubernetes. Examples include monitoring agents and device plugins.
- For OpenShift users upgrading to v25.10.0, we recommend updating the cdi.enabled field in ClusterPolicy to true post-upgrade. This field will not automatically be updated to true since the Operator Lifecycle Manager (OLM) does not mutate custom resources on operator upgrades.
When using NVIDIA vGPU with KubeVirt / OpenShift Virtualization, on GPUs that support MIG, you now have the option to select MIG-backed vGPU instances instead of time-sliced vGPU instances. To select a MIG-backed vGPU profile, label the node with the name of the MIG-backed vGPU profile.
Added support for NVIDIA Network Operator 25.7.0 integration. Refer to Support for GPUDirect RDMA and Support for GPUDirect Storage.
Added support for Mirantis k0s.
Added support for Red Hat OpenShift GPU dashboard integration.
Added support for Red Hat OpenShift Container Platform 4.20.
Added support for Red Hat OpenShift with HGX GB200 NVL72.
Added support for Kubernetes v1.34.
Added support for NVIDIA HGX B300 and NVIDIA HGX GB300 NVL72.
Added support for new MIG profiles with NVIDIA HGX B300.
- Supports these profiles:
  - 1g.34gb
  - 1g.34gb+me
  - 1g.67gb
  - 2g.67gb
  - 3g.135gb
  - 4g.135gb
  - 7g.269gb
- Added an all-balanced profile that creates the following GPU instances:
  - 1g.34gb $\times$ 2
  - 2g.67gb $\times$ 1
  - 3g.135gb $\times$ 1
Added support for new MIG profiles with NVIDIA HGX GB300 NVL72.
- Supports these profiles:
  - 1g.35gb
  - 1g.35gb+me
  - 1g.70gb
  - 2g.70gb
  - 3g.139gb
  - 4g.139gb
  - 7g.278gb
- Added an all-balanced profile that creates the following GPU instances:
  - 1g.35gb $\times$ 2
  - 2g.70gb $\times$ 1
  - 3g.139gb $\times$ 1

Improvements#

The GPU Operator now configures containerd and cri-o to use drop-in files for container runtime config overrides by default. As a consequence of this change, some of the install procedures for Kubernetes distributions that use custom containerd installations have changed.
- The install procedure for microk8s has changed. Refer to the latest MicroK8s install procedure.
Hardened the GPU Operator container image by using a distroless image as a base image.
Validator for NVIDIA GPU Operator is now included as part of the GPU Operator container image. It is no longer a separate image.
The GPU Operator now supports passing the vGPU licensing token as a secret. It is recommended that you migrate to using secrets instead of a configMap for improved security.
Enhanced the driver pod to allow resource requests and limits to be configurable for all containers in the driver pod.
Added support for specifying hostPID via the GPU Operator Helm charts

Fixed Issues#

Fixed an issue where the vGPU Manager pod was terminated before it finished disabling VFs on all GPUs. The terminationGracePeriodSeconds is now set to 120 seconds to ensure the vGPU Manager has enough time to finish its cleanup logic when the pod is terminated.
Added GDRCopy validation to validator daemonset. When GDRCopy is enabled, this ensures that the GDRCopy driver is loaded prior to the k8s-device-plugin from starting up.
Added required permissions when GPU Feature Discovery is configured to use the Node Feature API instead of feature files.

Known Issues#

When using cri-o as the container runtime, several of the GPU Operator pods may be stuck in the Init:RunContainerError or Init:CreateContainerError state during installation of GPU Operator, upgrade of GPU Operator, or upgrade of the GPU driver daemonset. The pods may be in this state for several minutes and restart several times. The pods will recover from this state as soon as the container toolkit pod starts running.
NVIDIA Container Toolkit 1.18.0 will overwrite the imports field in the top-level containerd configuration file, so any previously imported paths will be lost.
When using MIG-backed vGPU on the RTX Pro 6000 Blackwell Server Edition, the vgpu-device-manager will fail to configure nodes with the default vgpu-device-manager configuration. To work around this, create a custom ConfigMap that adds the GFX suffix to the vGPU profile name. All of the MIG-backed vGPU profiles are only supported on MIG instances created with the +gfx attribute. Refer to the following example:
```
version: v1
vgpu-configs:
  DC-1-2Q:
    - devices: all
      vgpu-devices:
        DC-1-2QGFX: 48
```
Create the ConfigMap, then update the ClusterPolicy with the name of the configMap in the vgpuDeviceManager.config.name, and restart the vgpu-device-manager pod.

When using GKE 1.33+, there is a known issue where NVIDIA Container Toolkit will misconfigure the containerd config.toml file and prevent GPU Operator containers from starting up correctly. To resolve this issue, set the RUNTIME_CONFIG_SOURCE=file environment variable in the toolkit container. You can set this environment variable by setting the below in the ClusterPolicy CR:
```
toolkit:
  env:
  - name: RUNTIME_CONFIG_SOURCE
    value: "file"
```

25.3.4#

New Features#

Supports these NVIDIA Data Center GPU Driver versions:
- 580.82.07 (default)
Added support for additional features:
- RTX Pro 6000 Blackwell Server Edition
  - MIG profiles support
  - KubeVirt and OpenShift Virtualization: VM with GPU passthrough (Ubuntu 22.04 only)
  - KubeVirt and OpenShift Virtualization: VM with time-slice vGPU (Ubuntu 22.04 only)
- RTX Pro 6000D
  - KubeVirt and OpenShift Virtualization: VM with GPU passthrough (Ubuntu 22.04 only)

Fixed Issues#

Fixed an issue where user-supplied environment variables configured in ClusterPolicy were not getting set in the rendered DaemonSet. User-supplied environment variables now take precedence over environment variables set by the ClusterPolicy controller.

25.3.3#

Fixed Issues#

Fixed an issue where the GPU Operator failed to render the nvidia-container-toolkit DaemonSet correctly when a custom value for CONTAINERD_SOCKET was provided as input. Specifically, the hostPath volumes were not included in the DaemonSet. Refer to GitHub issue NVIDIA/gpu-operator#1694 for more details.

25.3.2#

New Features#

Updated software component versions:
- NVIDIA Kubernetes Device Plugin/NVIDIA GPU Feature Discovery v0.17.3
- NVIDIA MIG Manager for Kubernetes v0.12.2
Supports these NVIDIA Data Center GPU Driver versions:
- 580.65.06 (recommended)
- 570.172.08 (default)
- 535.261.03

Known Issues#

Starting with version 580.65.06, the driver container has Coherent Driver Memory Management (CDMM) enabled by default to support GB200 on Kubernetes. For more information about CDMM, refer to the release notes.

Note

Currently, CDMM is not compatible with the Multi-Instance GPUs (MIG) sharing. CDMM is also not compatible with GPU Direct Storage. CDMM support for these features is planned for future driver updates. However, these limitations will remain in place until a future driver update removes them.

CDMM enablement applies only to Grace-based systems such as GH200 and GB200 and is ignored on other GPU platforms. NVIDIA strongly recommends keeping CDMM enabled with Kubernetes on supported systems to prevent memory over-reporting and uncontrolled GPU memory access.
For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01, GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs. This manifests as GPU pods getting stuck indefinitely in the Pending state. NVIDIA recommends that you upgrade the driver to version 570.172.08 to avoid this issue. For more detailed information, see GitHub issue NVIDIA/gpu-operator#1361.
Configuring the Operator to enable CDI is not supported on Rancher Kubernetes Engine 2 (RKE2).
The nouveau driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error Failed to enable MSI-X in the system journal logs. Additionally, all GPU Operator pods become stuck in the Init state.

Fixed Issues#

Fixed security vulnerabilities in NVIDIA Container Toolkit and related components. This release addresses CVE-2025-23266 (Critical) and CVE-2025-23267 (High) that could allow arbitrary code execution and link following attacks in container environments. For complete details, refer to the NVIDIA Security Bulletin.

25.3.1#

New Features#

Includes these software component versions:
- NVIDIA Container Toolkit version v1.17.8
- NVIDIA DCGM v4.2.3
- NVIDIA DCGM Exporter v4.2.3-4.1.3
- NVIDIA Kubernetes Device Plugin v0.17.2
- Node Feature Discovery v0.17.3
- NVIDIA GDRCopy Driver v2.5.0
Supports these NVIDIA Data Center GPU Driver versions:
- 580.65.06 (recommended)
- 570.172.08 (default)
- 570.148.08
- 570.133.20
- 550.163.01
- 535.261.03
- 535.247.01
Added support for Red Hat Enterprise Linux 9. Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, 9.5, and 9.6 versions are available for x86 based platforms only. They are not available for ARM based systems.
Added support for Kubernetes v1.33.
Added support for setting the internalTrafficPolicy for the DCGM Exporter service. You can configure this in the Helm chart value by setting dcgmexporter.service.internalTrafficPolicy to Local or Cluster (default). Choose Local if you want to route internal traffic within the node only.

Known Issues#

For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01, GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs. This manifests as GPU pods getting stuck indefinitely in the Pending state. NVIDIA recommends that you upgrade the driver to version 570.172.08 to avoid this issue. For more detailed information, see GitHub issue NVIDIA/gpu-operator#1361.
GPU Operator in CDI mode is not operational with RKE2.

Fixed Issues#

Fixed an issue where the NVIDIADriver controller may enter an endless loop of creating and deleting a DaemonSet. This could occur when the NVIDIADriver DaemonSet does not tolerate a taint present on all nodes matching its configured nodeSelector, or when none of the DaemonSet pods have been scheduled yet. Refer to GitHub pull request #1416 for more details.

25.3.0#

New Features#

Includes these software component versions:
- NVIDIA Container Toolkit version v1.17.5
- NVIDIA Driver Manager for Kubernetes v0.8.0
- NVIDIA Kubernetes Device Plugin v0.17.1
- NVIDIA DCGM Exporter v4.1.1-4.0.4
- NVIDIA DCGM v4.1.1-2
- Node Feature Discovery v0.17.2
- NVIDIA GPU Feature Discovery for Kubernetes v0.17.1
- NVIDIA MIG Manager for Kubernetes v0.12.1
- NVIDIA KubeVirt GPU Device Plugin v1.3.1
- NVIDIA vGPU Device Manager v0.3.0
- NVIDIA Kata Manager for Kubernetes v0.2.3
- NVIDIA GDRCopy Driver v2.4.4
Added support for the NVIDIA GPU DRA Driver v25.3.0 component (coming soon) which enables Multi-Node NVLink through Kubernetes Dynamic Resource Allocation (DRA) and IMEX support.

This component can be installed alongside the GPU Operator. It is supported on Kubernetes v1.32 clusters, running on NVIDIA HGX GB200 NVL, and with CDI enabled on your GPU Operator.
Transitioned to installing the open kernel modules by default starting with R570 driver containers.
Added a new parameter, kernelModuleType, to the ClusterPolicy and NVIDIADriver APIs which specifies how the GPU Operator and driver containers will choose kernel modules to use.

Valid values include:
- auto: Default and recommended option. auto means that the recommended kernel module type (open or proprietary) is chosen based on the GPU devices on the host and the driver branch used.
- open: Use the NVIDIA Open GPU kernel module driver.
- proprietary: Use the NVIDIA Proprietary GPU kernel module driver.
Currently, auto is only supported with the 570.86.15 and 570.124.06 or later driver containers. 550 and 535 branch drivers do not yet support this mode.

In previous versions, the useOpenKernelModules field specified the driver containers to install the NVIDIA Open GPU kernel module driver. This field is now deprecated and will be removed in a future release. If you were using the useOpenKernelModules field, NVIDIA recommends that you update your configuration to use the kernelModuleType field instead.
Added support for Ubuntu 24.04 LTS.
Added support for NVIDIA HGX GB200 NVL and NVIDIA HGX B200. Note that HGX B200 requires a driver container version of 570.133.20 or later.
Added support for the NVIDIA Data Center GPU Driver version 570.124.06.
Added support for KubeVirt and OpenShift Virtualization with vGPU v18 on H200NVL.
Added support for NVIDIA Network Operator v25.1.0. Refer to Support for GPUDirect RDMA and Support for GPUDirect Storage.
Added support for OpenShift v4.18.
Added support for Containerd v2.0.
Added support for Kubernetes v1.32. Note that the minimum supported Kubernetes versions has been updated to v1.29.
Added support for new MIG profiles with HGX B200.
- Supports these profiles:
  - 1g.23gb
  - 1g.23gb+me
  - 1g.45gb
  - 2g.45gb
  - 3g.90gb
  - 4g.90gb
  - 7g.180gb
- Added an all-balanced profile that creates the following GPU instances:
  - 1g.23gb $\times$ 2
  - 2g.45gb $\times$ 1
  - 3g.90gb $\times$ 1
Added support for new MIG profiles with HGX GB200.
- Supports these profiles:
  - 1g.24gb
  - 1g.24gb+me
  - 1g.47gb
  - 2g.47gb
  - 3g.95gb
  - 4g.95gb
  - 7g.189gb
- Added an all-balanced profile that creates the following GPU instances:
  - 1g.24gb $\times$ 2
  - 2g.47gb $\times$ 1
  - 3g.95gb $\times$ 1

Improvements#

Improved security by removing unnecessary permissions in the GPU Operator ClusterRole.
Improved GPU Operator metrics to include a operatorMetricsNamespace field that sets the metrics namespace to gpu_operator.
Improved error handling in Driver Manager for Kubernetes by adding pod watch permissions.

Fixed Issues#

Removed default liveness probe from the nvidia-fs-ctr and nvidia-gdrcopy-ctr containers of the GPU driver daemonset. Long response times of the lsmod commands were causing timeout errors in the probe and unnecessary restarts of the container, resulting in the DaemonSet being in a bad state.
Fixed an issue where the GPU Operator failed to create a valid DaemonSet name on OpenShift Container Platform when using 64 kernel page size. Refer to GitHub issue #1207 for more details.
Removed deprecated operator.defaultRuntime` parameter.

24.9.2#

New Features#

Added support for the NVIDIA Data Center GPU Driver version 570.86.15.
The default driver in this version is now 550.144.03. Refer to the GPU Operator Component Matrix on the platform support page for more details on supported drivers.
Added support for NVIDIA Container Toolkit 1.17.4. This version includes updates for NVIDIA CVE-2025-23359.

To view any published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/.

For more information regarding NVIDIA security vulnerability remediation policies, refer to https://www.nvidia.com/en-us/security/psirt-policies/.

24.9.1#

New Features#

Added support for the NVIDIA Data Center GPU Driver versions 550.127.08 and 535.216.03. Refer to the GPU Operator Component Matrix on the platform support page.
Added support for NVIDIA Container Toolkit 1.17.3. This version includes updates for:
To view any published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/.

For more information regarding NVIDIA security vulnerability remediation policies, refer to https://www.nvidia.com/en-us/security/psirt-policies/.
Includes these software component versions:
- NVIDIA Container Toolkit v1.17.3
- NVIDIA DCGM v3.3.9-1
- NVIDIA DCGM Exporter v3.3.9-3.6.1
Added support for NVIDIA Network Operator v24.10.0. Refer to Support for GPUDirect RDMA and Support for GPUDirect Storage.
Added an all-balanced MIG profile for H200 NVL which creates the following GPU instances:
- 1g.18gb $\times$ 2
- 2g.35gb $\times$ 1
- 3g.71gb $\times$ 1

Fixed Issues#

Fixed an issue where NVIDIA Container Toolkit would fail to start on Rancher RKE2, K3s, and Canonical MicroK8s. Refer to GitHub issue #1109 for more details.
Fixed an issue where events were not being generated by the NVIDIA driver upgrade controller. Refer to GitHub issue #1101 for more details.

24.9.0#

New Features#

Added support for NVIDIA Container Toolkit 1.17.0. This version includes updates for the following CVEs:
- NVIDIA CVE-2024-0134
To view any published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/.

For more information regarding NVIDIA security vulnerability remediation policies, refer to https://www.nvidia.com/en-us/security/psirt-policies/.

For Rancher RKE2 and K3s, refer to the Known Limitations.
Added support for the NVIDIA Data Center GPU Driver version 550.127.05. Refer to the GPU Operator Component Matrix on the platform support page.
Includes these software component versions:
- NVIDIA Container Toolkit v1.17.0
- NVIDIA Driver Manager for Kubernetes v0.7.0
- NVIDIA Kubernetes Device Plugin v0.17.0
- NVIDIA DCGM Exporter v3.3.8-3.6.0
- NVIDIA DCGM v3.3.8-1
- Node Feature Discovery v0.16.6
- NVIDIA GPU Feature Discovery for Kubernetes v0.17.0
- NVIDIA MIG Manager for Kubernetes v0.10.0
- NVIDIA KubeVirt GPU Device Plugin v1.2.10
- NVIDIA vGPU Device Manager v0.2.8
- NVIDIA GDS Driver v2.20.5
- NVIDIA Kata Manager for Kubernetes v0.2.2
Added support for NVIDIA Network Operator v24.7.0. Refer to Support for GPUDirect RDMA and Support for GPUDirect Storage.
Added generally available (GA) support for precompiled driver containers. This feature was previously a technical preview feature. For more information, refer to Precompiled Driver Containers.
Enabled automatic upgrade of Operator and Node Feature Discovery CRDs by default. In previous releases, the operator.upgradeCRD field was false. This release sets the default value to true and automatically runs a Helm hook when you upgrade the Operator. For more information, refer to Option 2: Automatically Upgrading CRDs Using a Helm Hook.
Added support for new MIG profiles with GH200 NVL2 144GB HBM3e.
- Supports these profiles:
  - 1g.18gb
  - 1g.18gb+me
  - 1g.36gb
  - 2g.36gb
  - 3g.72gb
  - 4g.72gb
  - 7g.144gb
- Added an all-balanced profile that creates the following GPU instances:
  - 1g.18gb $\times$ 2
  - 2g.36gb $\times$ 1
  - 3g.72gb $\times$ 1
Added support for KubeVirt and OpenShift Virtualization with vGPU v17.4 for A30, A100, and H100 GPUs. These GPUs are supported with an NVIDIA AI Enterprise subscription only and require building the NVIDIA vGPU Manager container image with the aie .run file.
Revised roles and role-based access controls for the Operator. The Operator is revised to use Kubernetes controller-runtime caching that is limited to the Operator namespace and the OpenShift namespace, openshift. The OpenShift namespace is required for the Operator to monitor for changes to image stream objects. Limiting caching to specific namespaces enables the Operator to use the namespace-scoped role, gpu-operator, instead of a cluster role for monitoring changes to resources in the Operator namespace. This change follows the principle of least privilege and improves the security posture of the Operator.
Enhanced the GPU Driver Container to set the NODE_NAME environment variable from the node host name and the NODE_IP environment variable from the node host IP address.

Fixed Issues#

Fixed an issue with the cleanup CRD and upgrade CRD jobs that are triggered by Helm hooks. On clusters that have nodes with taints, even when operator.tolerations includes tolerations, the jobs are not scheduled. In this release, the tolerations that you specify for the Operator are applied to the jobs. For more information about the hooks, refer to Option 2: Automatically Upgrading CRDs Using a Helm Hook.
Fixed an issue with configuring NVIDIA Container Toolkit to use CDI on nodes that use CRI-O. Previously, the toolkit could configure the runc handler with the nvidia runtime handler even if runc was not the default runtime and cause CRI-O to crash. In this release, the toolkit determines the default runtime by running crio status config and configures that runtime with the nvidia runtime handler.

Known Limitations#

On Rancher RKE2 and K3s, NVIDIA Container Toolkit v1.17.0 fails to start. The toolkit attempts to run containerd config dump to determine the container runtime configuration on the host. On these platforms, the containerd executable is not on the PATH and results in an error.

NVIDIA recommends installing v1.17.1 of the toolkit when you install or upgrade the Operator. You can specify the --set toolkit.version=v1.17.1-ubuntu20.04 or v1.17.1-ubi8 argument to Helm.

24.6.2#

New Features#

This release provides critical security updates and is recommended for all users.

Added support for NVIDIA Container Toolkit 1.16.2. This version includes updates for the following CVEs:

To view any published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/.

For more information regarding NVIDIA security vulnerability remediation policies, refer to https://www.nvidia.com/en-us/security/psirt-policies/.

24.6.1#

New Features#

Includes these software component versions:
- NVIDIA Kubernetes Device Plugin v0.16.2
- NVIDIA GPU Feature Discovery for Kubernetes v0.16.2
Refer to the GPU Operator Component Matrix on the platform support page.

Fixed Issues#

Fixed an issue with role-based access controls that prevented a service account from accessing config maps. Refer to GitHub issue #883 for more details.
Fixed an issue with role-based access controls in the GPU Operator validator that prevented retrieving NVIDIA Driver daemon set information. On OpenShift Container Platform, this issue triggered GPUOperatorNodeDeploymentDriverFailed alerts. Refer to GitHub issue #892 for more details.

24.6.0#

New Features#

Added support for the NVIDIA Data Center GPU Driver version 550.90.07. Refer to the GPU Operator Component Matrix on the platform support page.
Includes these software component versions:
- NVIDIA Container Toolkit v1.16.1
- NVIDIA Driver Manager for Kubernetes v0.6.10
- NVIDIA Kubernetes Device Plugin v0.16.1
- NVIDIA DCGM Exporter v3.3.7-3.5.0
- NVIDIA DCGM v3.3.7-1
- Node Feature Discovery v0.16.3
- NVIDIA GPU Feature Discovery for Kubernetes v0.16.1
- NVIDIA MIG Manager for Kubernetes v0.8.0
- NVIDIA KubeVirt GPU Device Plugin v1.2.9
- NVIDIA vGPU Device Manager v0.2.7
- NVIDIA GDS Driver v2.17.5
- NVIDIA Kata Manager for Kubernetes v0.2.1
- NVIDIA GDRCopy Driver v2.4.1-1
Added support for NVIDIA Network Operator v24.4.0. Refer to Support for GPUDirect RDMA and Support for GPUDirect Storage.
Added support for using the Operator with Container-Optimized OS on Google Kubernetes Engine (GKE). The process uses the Google driver installer to manage the NVIDIA GPU Driver. For Ubuntu on GKE, you can use the Google driver installer or continue to use the NVIDIA Driver Manager as with previous releases. Refer to NVIDIA GPU Operator with Google GKE for more information.
Added support for precompiled driver containers with Open Kernel module drivers. Specify --set driver.useOpenKernelModules=true --set driver.usePrecompiled=true --set driver.version=<driver-branch> when you install or upgrade the Operator. Support remains limited to Ubuntu 22.04. Refer to Precompiled Driver Containers for more information.

NVIDIA began publishing driver containers with this support on July 15, 2024. The tags for the first containers with this support are as follows:
- <driver-branch>-5.15.0-116-generic-ubuntu22.04
- <driver-branch>-5.15.0-1060-nvidia-ubuntu22.04
- <driver-branch>-5.15.0-1063-oracle-ubuntu22.04
- <driver-branch>-5.15.0-1068-azure-ubuntu22.04
- <driver-branch>-5.15.0-1065-aws-ubuntu22.04
Precompiled driver containers built after July 15 include support for the Open Kernel module drivers.
Added support for new MIG profiles.
- For H200 devices:
  - 1g.18gb
  - 1g.18gb+me
  - 1g.35gb
  - 2g.35gb
  - 3g.71gb
  - 4g.71gb
  - 7g.141gb
- Added an all-balanced profile for H200 devices that creates the following GPU instances:
  - 1g.12gb $\times$ 2
  - 2g.24gb $\times$ 1
  - 3g.48gb $\times$ 1
Added support for creating a config map with custom MIG profiles during installation or upgrade with Helm. Refer to Example: Custom MIG Configuration During Installation for more information.

Fixed Issues#

Role-based access controls for the following components were reviewed and revised to use least-required privileges:
- GPU Operator
- Operator Validator
- MIG Manager
- GPU Driver Manager
- GPU Feature Discovery
- Kubernetes Device Plugin
- KubeVirt Device Plugin
- vGPU Host Manager
In previous releases, the permissions were more permissive than necessary.
Fixed an issue with Node Feature Discovery (NFD). When an NFD pod was deleted or restarted, all NFD node labels were removed from the node and GPU Operator operands were restarted. The v0.16.2 release of NFD fixes the issue. Refer to GitHub issue #782 for more details.
Fixed an issue with NVIDIA vGPU Manager not working correctly on nodes with GPUs that require Open Kernel module drivers and GPU System Processor (GSP) firmware. Refer to GitHub issue #761 for more details.
DGCM is revised to use a cluster IP and a service with the internal traffic policy set to Local. In previous releases, DCGM was a host networked pod. The dcgm.hostPort field of the NVIDIA cluster policy resource is now deprecated.
Fixed an issue that prevented enabling GDRCopy and additional volume mounts with the NVIDIA Driver custom resource. Previously, the driver daemon set did not update with the change and the Operator logs included an error message. Refer to GitHub issue #713 for more details.
Fixed an issue with deleting GPU Driver daemon sets due to having misscheduled pods rather than zero pods. Previously, if a node had an untolerated taint such as node.kubernetes.io/unreachable:NoSchedule, the Operator could repeatedly delete and recreate the driver daemon sets. Refer to GitHub issue #715 for more details.
Fixed an issue with reporting the correct GPU capacity and allocatable resources from the KubeVirt GPU Device Plugin. Previously, if a GPU became unavailable, the reported GPU capacity and allocatable resources remained unchanged. Refer to GitHub issue #97 for more details.

Known Limitations#

The 1g.12gb MIG profile does not operate as expected on the NVIDIA GH200 GPU when the MIG configuration is set to all-balanced.
The GPU Driver container does not run on hosts that have a custom kernel with the SEV-SNP CPU feature because of the missing kernel-headers package within the container. With a custom kernel, NVIDIA recommends pre-installing the NVIDIA drivers on the host if you want to run traditional container workloads with NVIDIA GPUs.
If you cordon a node while the GPU driver upgrade process is already in progress, the Operator uncordons the node and upgrades the driver on the node. You can determine if an upgrade is in progress by checking the node label nvidia.com/gpu-driver-upgrade-state != upgrade-done.
NVIDIA vGPU is incompatible with KubeVirt v0.58.0, v0.58.1, and v0.59.0, as well as OpenShift Virtualization 4.12.0—4.12.2.
Using NVIDIA vGPU on bare metal nodes and NVSwitch is not supported.
All worker nodes in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems. The technical preview feature that provides NVIDIA GPU Driver Custom Resource Definition is also an alternative.
NVIDIA GPUDirect Storage (GDS) is not supported with secure boot enabled systems.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster. The technical preview feature that provides NVIDIA GPU Driver Custom Resource Definition is an alternative.
The nouveau driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error Failed to enable MSI-X in the system journal logs. Additionally, all GPU Operator pods become stuck in the Init state.
When using RHEL 8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, such as setting the enable_selinux=true configuration option. Additionally, network-restricted environments are not supported.

24.3.0#

New Features#

Added support to enable NVIDIA GDRCopy v2.4.1.

When you enable support for GDRCopy, the Operator configures the GDRCopy Driver container image as a sidecar container in the GPU driver pod. The sidecar container compiles and installs the gdrdrv Linux kernel module. This feature is supported on Ubuntu 22.04 and RHCOS operating systems and on X86_64 and ARM64 architectures.

Refer to Common Chart Customization Options for more information about the driver.gdrcopy field.
Added support for the NVIDIA Data Center GPU Driver version 550.54.15. Refer to the GPU Operator Component Matrix on the platform support page.
Includes these software component versions:
- NVIDIA Container Toolkit version v1.15.0
- NVIDIA MIG Manager version v0.7.0
- NVIDIA Driver Manager for K8s v0.6.8
- NVIDIA Kubernetes Device Plugin v0.15.0
- DCGM 3.3.5-1
- DCGM Exporter 3.3.5-3.4.1
- Node Feature Discovery v0.15.4
- NVIDIA GPU Feature Discovery for Kubernetes v0.15.0
- NVIDIA KubeVirt GPU Device Plugin v1.2.7
- NVIDIA vGPU Device Manager v0.2.6
- NVIDIA Kata Manager for Kubernetes v0.2.0
Added support for Kubernetes v1.29 and v1.30. Refer to Supported Operating Systems and Kubernetes Platforms.
Added support for NVIDIA GH200 Grace Hopper Superchip as a generally available feature. Refer to Supported NVIDIA Data Center GPUs and Systems.

The following prerequisites are required for using the Operator with GH200:
- Run Ubuntu 22.04, the 550.54.15 GPU driver, and an NVIDIA Linux kernel, such as one provided with a linux-nvidia-<x.x> package.
- Add init_on_alloc=0 and memhp_default_state=online_movable as Linux kernel boot parameters.
- Run the NVIDIA Open GPU Kernel module driver.
Added support for NVIDIA Network Operator v24.1.1. Refer to Support for GPUDirect RDMA and Support for GPUDirect Storage.
Added support for the NVIDIA IGX Orin platform when configured to use the discrete GPU. Refer to Supported ARM Based Platforms.
Removed support for Kubernetes pod security policy (PSP). PSP was deprecated in the Kubernetes v1.21 release and removed in v1.25.

Fixed Issues#

Installation on Red Hat OpenShift Container Platform 4.15 no longer requires a workaround related to secrets and storage for the integrated image registry.
Previously, the vGPU Device Manager would panic if no NVIDIA devices were found in /sys/class/mdev_bus.
Previously, the MOFED validation init container would run for the GPU driver pod. In this release, the init container no longer runs because the MOFED installation check is performed by the Kubernetes Driver Manager init container.
Previously, for Red Hat OpenShift Container Platform, the GPU driver installation would fail when the Linux kernel version did not match the /etc/os-release file. In this release, the Kernel version is determined from the running kernel to prevent the issue. Refer to GitHub issue #617 for more details.
Previously, if the metrics for DCGM Exporter were configured in a config map and the cluster policy specified the name of the config map as <namespace>:<config-map> in the DCGM_EXPORTER_CONFIGMAP_DATA environment variable, the exporter pods could not read the configuration from the config map. In this release, the role used by the exporter is granted access to read from config maps.
Previously, under load, the Operator could fail with the message fatal error: concurrent map read and map write. In this release, the Operator controller is refactored to prevent the race condition. Refer to GitHub issue #689 for more details.
Previously, if any node in the cluster was in the NotReady state, the GPU driver upgrade controller failed to make progress. In this release, the upgrade library is updated and skips unhealthy nodes. Refer to GitHub issue #688 for more details.

Known Limitations#

NVIDIA vGPU Manager does not work correctly on nodes with GPUs that require Open Kernel module drivers and GPU System Processor (GSP) firmware. The logs for vGPU Device Manager pods include lines like the following example:

time="2024-07-23T08:50:11Z" level=fatal msg="error setting VGPU config: no parent devices found for GPU at index '1'"
time="2024-07-23T08:50:11Z" level=error msg="Failed to apply vGPU config: unable to apply config 'default': exit status 1"
time="2024-07-23T08:50:11Z" level=info msg="Setting node label: nvidia.com/vgpu.config.state=failed"
time="2024-07-23T08:50:11Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"

The output of the kubectl exec -it nvidia-vgpu-manager-daemonset-xxxxx -n gpu-operator -- bash -c 'dmesg | grep -i nvrm' command resembles the following example:

kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  550.90.05  Release Build  (dvs-builder@U16-I1-N08-05-1)
kernel: NVRM: RmFetchGspRmImages: No firmware image found
kernel: NVRM: GPU 0000:ae:00.0: RmInitAdapter failed! (0x61:0x56:1697)
kernel: NVRM: GPU 0000:ae:00.0: rm_init_adapter failed, device minor number 0

The vGPU Manager pods do not mount the /sys/module/firmware_class/parameters/path and /lib/firmware paths on the host and the pods fail to copy the GSP firmware files on the host.

As a workaround, you can add the following volume mounts to the vGPU Manager daemon set, for the nvidia-vgpu-manager-ctr container:

- name: firmware-search-path
  mountPath: /sys/module/firmware_class/parameters/path
- name: nv-firmware
  mountPath: /lib/firmware

This issue is fixed in the next release of the GPU Operator.

The 1g.12gb MIG profile does not operate as expected on the NVIDIA GH200 GPU when the MIG configuration is set to all-balanced.
The GPU Driver container does not run on hosts that have a custom kernel with the SEV-SNP CPU feature because of the missing kernel-headers package within the container. With a custom kernel, NVIDIA recommends pre-installing the NVIDIA drivers on the host if you want to run traditional container workloads with NVIDIA GPUs.
If you cordon a node while the GPU driver upgrade process is already in progress, the Operator uncordons the node and upgrades the driver on the node. You can determine if an upgrade is in progress by checking the node label nvidia.com/gpu-driver-upgrade-state != upgrade-done.
NVIDIA vGPU is incompatible with KubeVirt v0.58.0, v0.58.1, and v0.59.0, as well as OpenShift Virtualization 4.12.0—4.12.2.
Using NVIDIA vGPU on bare metal nodes and NVSwitch is not supported.
When installing the Operator on Amazon EKS and using Kubernetes versions lower than 1.25, specify the --set psp.enabled=true Helm argument because EKS enables pod security policy (PSP). If you use Kubernetes version 1.25 or higher, do not specify the psp.enabled argument so that the default value, false, is used.
All worker nodes in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems. The technical preview feature that provides NVIDIA GPU Driver Custom Resource Definition is also an alternative.
NVIDIA GPUDirect Storage (GDS) is not supported with secure boot enabled systems.
Driver Toolkit images are broken with Red Hat OpenShift version 4.11.12 and require cluster-level entitlements to be enabled in this case for the driver installation to succeed.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster. The technical preview feature that provides NVIDIA GPU Driver Custom Resource Definition is an alternative.
The nouveau driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error Failed to enable MSI-X in the system journal logs. Additionally, all GPU Operator pods become stuck in the Init state.
When using RHEL 8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, such as setting the enable_selinux=true configuration option. Additionally, network-restricted environments are not supported.

23.9.2#

New Features#

Added support for the NVIDIA Data Center GPU Driver version 550.54.14. Refer to the GPU Operator Component Matrix on the platform support page.
Added support for Kubernetes v1.29. Refer to Supported Operating Systems and Kubernetes Platforms on the platform support page.
Added support for Red Hat OpenShift Container Platform 4.15. Refer to Supported Operating Systems and Kubernetes Platforms on the platform support page.
Includes these software component versions:
- NVIDIA Data Center GPU Driver version 550.54.14
- NVIDIA Container Toolkit version v1.14.6
- NVIDIA Kubernetes Device Plugin version v1.14.5
- NVIDIA MIG Manager version v0.6.0
Added support for NVIDIA AI Enterprise release 5.0. Refer to NVIDIA AI Enterprise for information about installing the Operator with a Bash script.

Fixed issues#

Previously, duplicate image pull secrets were added to some daemon sets and caused an error like the following when a node is deleted and the controller manager deleted the pods.

I1031 00:09:44.553742       1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-driver-daemonset-k69f2"
E1031 00:09:44.556500       1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-driver-daemonset-k69f2; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="ngc-secret"]

Previously, common daemon set labels, annotations, and tolerations configured in ClusterPolicy were not also applied to the default NVIDIADriver CR instance. Refer to GitHub issue #665 for more details.
Previously, the technical preview NVIDIA driver custom resource was failing to render the licensing-config volume mount that is required for licensing a vGPU guest driver. Refer to GitHub issue #672 for more details.
Previously, the technical preview NVIDIA driver custom resource was broken when GDS was enabled. An OS suffix was not appended to the image path of the GDS driver container image. Refer to GitHub issue #608 for more details.
Previously, the technical preview NVIDIA driver custom resource failed to render daemon sets when additionalConfig volumes were configured that were host path volumes. This issue prevented users from mounting entitlements on RHEL systems.
Previously, it was not possible to disable the CUDA workload validation pod that the operator-validator pod deploys. You can now disable this pod by setting the following environment variable in ClusterPolicy:
```
validator:
  cuda:
    env:
    - name: "WITH_WORKLOAD"
      value: "false"
```

Known Limitations#

When installing on Red Hat OpenShift Container Platform 4.15 clusters that disable the integrated image registry, secrets are no longer automatically generated and this change causes installation of the Operator to stall. Refer to Special Considerations for OpenShift 4.15 for more information.
The 1g.12gb MIG profile does not operate as expected on the NVIDIA GH200 GPU when the MIG configuration is set to all-balanced.
The GPU Driver container does not run on hosts that have a custom kernel with the SEV-SNP CPU feature because of the missing kernel-headers package within the container. With a custom kernel, NVIDIA recommends pre-installing the NVIDIA drivers on the host if you want to run traditional container workloads with NVIDIA GPUs.
If you cordon a node while the GPU driver upgrade process is already in progress, the Operator uncordons the node and upgrades the driver on the node. You can determine if an upgrade is in progress by checking the node label nvidia.com/gpu-driver-upgrade-state != upgrade-done.
NVIDIA vGPU is incompatible with KubeVirt v0.58.0, v0.58.1, and v0.59.0, as well as OpenShift Virtualization 4.12.0—4.12.2.
Using NVIDIA vGPU on bare metal nodes and NVSwitch is not supported.
When installing the Operator on Amazon EKS and using Kubernetes versions lower than 1.25, specify the --set psp.enabled=true Helm argument because EKS enables pod security policy (PSP). If you use Kubernetes version 1.25 or higher, do not specify the psp.enabled argument so that the default value, false, is used.
All worker nodes in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems. The technical preview feature that provides NVIDIA GPU Driver Custom Resource Definition is also an alternative.
NVIDIA GPUDirect Storage (GDS) is not supported with secure boot enabled systems.
Driver Toolkit images are broken with Red Hat OpenShift version 4.11.12 and require cluster-level entitlements to be enabled in this case for the driver installation to succeed.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster. The technical preview feature that provides NVIDIA GPU Driver Custom Resource Definition is an alternative.
The nouveau driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error Failed to enable MSI-X in the system journal logs. Additionally, all GPU Operator pods become stuck in the Init state.
When using RHEL 8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, such as setting the enable_selinux=true configuration option. Additionally, network-restricted environments are not supported.

23.9.1#

New Features#

Added support for NVIDIA GH200 Grace Hopper Superchip as a technology preview feature. Refer to Supported NVIDIA Data Center GPUs and Systems.

The following prerequisites are required for using the Operator with GH200:
- Run Ubuntu 22.04 and an NVIDIA Linux kernel, such as one provided with a linux-nvidia-<x.x> package.
- Add init_on_alloc=0 and memhp_default_state=online_movable as Linux kernel boot parameters.
- Run the NVIDIA Open GPU Kernel module driver.
Added support for configuring the driver container to use the NVIDIA Open GPU Kernel module driver. Support is limited to installation using the runfile installer. Support for precompiled driver containers with open kernel modules is not available.

For clusters that use GPUDirect Storage (GDS), beginning with CUDA toolkit 12.2.2 and the NVIDIA GPUDirect Storage kernel driver version v2.17.5, are only supported with the open kernel modules.

NVIDIA GH200 Grace Hopper Superchip systems are only supported with the open kernel modules.
- Refer to Common Chart Customization Options for information about setting useOpenKernelModules if you manage the driver containers with the NVIDIA cluster policy custom resource definition.
- Refer to NVIDIA GPU Driver Custom Resource Definition for information about setting spec.useOpenKernelModules if you manage the driver containers with the technology preview NVIDIA driver custom resource.
Includes these software component versions:
- NVIDIA Data Center GPU Driver version 535.129.03
- NVIDIA Driver Manager for Kubernetes v0.6.5
- NVIDIA Kubernetes Device Plugin v1.14.3
- NVIDIA DCGM Exporter 3.3.0-3.2.0
- NVIDIA Data Center GPU Manager (DCGM) v3.3.0-1
- NVIDIA KubeVirt GPU Device Plugin v1.2.4
- NVIDIA GPUDirect Storage (GDS) Driver v2.17.5
  
  Important
  
  This version, and newer versions of the NVIDIA GDS kernel driver, require that you use the NVIDIA open kernel modules.
Refer to the GPU Operator Component Matrix on the platform support page.
Added support for NVIDIA Network Operator v23.10.0.

Improvements#

The must-gather.sh script that is used for support is enhanced to collect logs from NVIDIA vGPU Manager pods.

Fixed issues#

Previously, the technical preview NVIDIA driver custom resource did not support adding custom labels, annotations, or tolerations to the pods that run as part of the driver daemon set. This limitation prevented scheduling the driver daemon set in some environments. Refer to GitHub issue #602 for more details.
Previously, when you specified the operator.upgradeCRD=true argument to the helm upgrade command, the pre-upgrade hook ran with the gpu-operator service account that is added by running helm install. This dependency is a known issue for Argo CD users. Argo CD treats pre-install and pre-upgrade hooks the same as pre-sync hooks and leads to failures because the hook depends on the gpu-operator service account that does not exist on an initial installation.

Now, the Operator is enhanced to run the hook with a new service account, gpu-operator-upgrade-crd-hook-sa. This fix creates the new service account, a new cluster role, and a new cluster role binding. The update prevents failures with Argo CD.

Previously, adding an NVIDIA driver custom resource with a node selector that conflicts with another driver custom resource, the controller failed to set the error condition in the custom resource status. The issue produced an error message like the following example:

{"level":"error","ts":1698702848.8472972,"msg":"NVIDIADriver.nvidia.com \"<conflicting-cr-name>"\" is invalid: state: Unsupported value: \"\": supported values: \"ignored\", \"ready\", \"notReady\"","controller":"nvidia-driver-\
controller","object":{"name":"<conflicting-cr-name>"},"namespace":"","name":"<conflicting-cr-name>","reconcileID":"78d58d7b-cd94-4849-a292-391da9a0b049"}

Previously, the NVIDIA KubeVirt GPU Device Plugin could have a GLIBC mismatch error and produce a log message like the following example:
```
nvidia-kubevirt-gpu-device-plugin: /lib64/libc.so.6: version `GLIBC_2.32` not found (required by nvidia-kubevirt-gpu-device-plugin)
```
This issue is fixed by including v1.2.4 of the plugin in this release.
Previously, on some machines and Linux kernel versions, GPU Feature Discovery was unable to determine the machine type because the /sys/class/dmi/id/product_name file did not exist on the host. Now, accessing the file is performed by mounting /sys instead of the fully-qualified path and if the file does not exist, GPU Feature Discovery is able to label the node with nvidia.com/gpu.machine=unknown.
Previously, enabling GPUDirect RDMA on Red Hat OpenShift Container Platform clusters could experience an error with the nvidia-peermem container. The error was related to the RHEL_VERSION variable being unbound.

Known Limitations#

The 1g.12gb MIG profile does not operate as expected on the NVIDIA GH200 GPU when the MIG configuration is set to all-balanced.
The GPU Driver container does not run on hosts that have a custom kernel with the SEV-SNP CPU feature because of the missing kernel-headers package within the container. With a custom kernel, NVIDIA recommends pre-installing the NVIDIA drivers on the host if you want to run traditional container workloads with NVIDIA GPUs.
If you cordon a node while the GPU driver upgrade process is already in progress, the Operator uncordons the node and upgrades the driver on the node. You can determine if an upgrade is in progress by checking the node label nvidia.com/gpu-driver-upgrade-state != upgrade-done.
NVIDIA vGPU is incompatible with KubeVirt v0.58.0, v0.58.1, and v0.59.0, as well as OpenShift Virtualization 4.12.0—4.12.2.
Using NVIDIA vGPU on bare metal nodes and NVSwitch is not supported.
When installing the Operator on Amazon EKS and using Kubernetes versions lower than 1.25, specify the --set psp.enabled=true Helm argument because EKS enables pod security policy (PSP). If you use Kubernetes version 1.25 or higher, do not specify the psp.enabled argument so that the default value, false, is used.
All worker nodes in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems. The technical preview feature that provides NVIDIA GPU Driver Custom Resource Definition is also an alternative.
NVIDIA GPUDirect Storage (GDS) is not supported with secure boot enabled systems.
Driver Toolkit images are broken with Red Hat OpenShift version 4.11.12 and require cluster-level entitlements to be enabled in this case for the driver installation to succeed.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster. The technical preview feature that provides NVIDIA GPU Driver Custom Resource Definition is an alternative.
The nouveau driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error Failed to enable MSI-X in the system journal logs. Additionally, all GPU Operator pods become stuck in the Init state.
When using RHEL 8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, such as setting the enable_selinux=true configuration option. Additionally, network-restricted environments are not supported.

23.9.0#

New Features#

Added support for an NVIDIA driver custom resource definition that enables running multiple GPU driver types and versions on the same cluster and adds support for multiple operating system versions. This feature is a technology preview. Refer to NVIDIA GPU Driver Custom Resource Definition for more information.
Added support for additional Linux kernel variants for precompiled driver containers.
- driver:535-5.15.0-xxxx-nvidia-ubuntu22.04
- driver:535-5.15.0-xxxx-azure-ubuntu22.04
- driver:535-5.15.0-xxxx-aws-ubuntu22.04
Refer to the Tags tab of the NVIDIA GPU Driver page in the NGC catalog to determine if a container for your kernel is built. Refer to Precompiled Driver Containers for information about using precompiled driver containers and steps to build your own driver container.
The API for the NVIDIA cluster policy custom resource definition is enhanced to include the current state of the cluster policy. When you view the cluster policy with a command like kubectl get cluster-policy, the response now includes a Status.Conditions field.
Includes these software component versions:
- NVIDIA Data Center GPU Driver version 535.104.12.
- NVIDIA Driver Manager for Kubernetes v0.6.4
- NVIDIA Container Toolkit v1.14.3
- NVIDIA Kubernetes Device Plugin v1.14.2
- NVIDIA DCGM Exporter 3.2.6-3.1.9
- NVIDIA GPU Feature Discovery for Kubernetes v0.8.2
- NVIDIA MIG Manager for Kubernetes v0.5.5
- NVIDIA Data Center GPU Manager (DCGM) v3.2.6-1
- NVIDIA KubeVirt GPU Device Plugin v1.2.3
- NVIDIA vGPU Device Manager v0.2.4
- NVIDIA Kata Manager for Kubernetes v0.1.2
- NVIDIA Confidential Computing Manager for Kubernetes v0.1.1
- Node Feature Discovery v0.14.2
Refer to the GPU Operator Component Matrix on the platform support page.

Fixed issues#

Previously, if the RHEL_VERSION environment variable was set for the Operator, the variable was propagated to the driver container and used in the --releasever argument to the dnf command. With this release, you can specify the DNF_RELEASEVER environment variable for the driver container to override the value of the --releasever argument.
Previously, stale node feature and node feature topology objects could remain in the Kubernetes API server after a node is deleted from the cluster. The upgrade to Node Feature Discovery v0.14.2 includes an enhancement to garbage collection that ensures the objects are removed after a node is deleted.

Known Limitations#

The GPU Driver container does not run on hosts that have a custom kernel with the SEV-SNP CPU feature because of the missing kernel-headers package within the container. With a custom kernel, NVIDIA recommends pre-installing the NVIDIA drivers on the host if you want to run traditional container workloads with NVIDIA GPUs.
If you cordon a node while the GPU driver upgrade process is already in progress, the Operator uncordons the node and upgrades the driver on the node. You can determine if an upgrade is in progress by checking the node label nvidia.com/gpu-driver-upgrade-state != upgrade-done.
NVIDIA vGPU is incompatible with KubeVirt v0.58.0, v0.58.1, and v0.59.0, as well as OpenShift Virtualization 4.12.0—4.12.2.
Using NVIDIA vGPU on bare metal nodes and NVSwitch is not supported.
When installing the Operator on Amazon EKS and using Kubernetes versions lower than 1.25, specify the --set psp.enabled=true Helm argument because EKS enables pod security policy (PSP). If you use Kubernetes version 1.25 or higher, do not specify the psp.enabled argument so that the default value, false, is used.
All worker nodes in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems. The technical preview feature that provides NVIDIA GPU Driver Custom Resource Definition is also an alternative.
NVIDIA GPUDirect Storage (GDS) is not supported with secure boot enabled systems.
Driver Toolkit images are broken with Red Hat OpenShift version 4.11.12 and require cluster-level entitlements to be enabled in this case for the driver installation to succeed.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster. The technical preview feature that provides NVIDIA GPU Driver Custom Resource Definition is an alternative.
The nouveau driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error Failed to enable MSI-X in the system journal logs. Additionally, all GPU Operator pods become stuck in the Init state.
When using RHEL 8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, such as setting the enable_selinux=true configuration option. Additionally, network-restricted environments are not supported.

23.6.2#

This patch release back ports a fix that was introduced in the v23.9.1 release.

Fixed Issues#

Previously, when you specified the operator.upgradeCRD=true argument to the helm upgrade command, the pre-upgrade hook ran with the gpu-operator service account that is added by running helm install. This dependency is a known issue for Argo CD users. Argo CD treats pre-install and pre-upgrade hooks the same as pre-sync hooks and leads to failures because the hook depends on the gpu-operator service account that does not exist on an initial installation.

Now, the Operator is enhanced to run the hook with a new service account, gpu-operator-upgrade-crd-hook-sa. This fix creates the new service account, a new cluster role, and a new cluster role binding. The update prevents failures with Argo CD.

23.6.1#

New Features#

Added support for NVIDIA L40S GPUs.
Added support for the NVIDIA Data Center GPU Driver version 535.104.05. Refer to the GPU Operator Component Matrix on the platform support page.

Fixed issues#

Previously, the NVIDIA Container Toolkit daemon set could fail when running on nodes with certain types of GPUs. The driver-validation init container would fail when iterating over NVIDIA PCI devices if the device PCI ID was not in the PCI database. The error message is similar to the following example:

Error: error validating driver installation: error creating symlinks:
failed to get device nodes: failed to get GPU information: error getting
all NVIDIA devices: error constructing NVIDIA PCI device 0000:21:00.0:
unable to get device name: failed to find device with id '26b9'\n\n
Failed to create symlinks under /dev/char that point to all possible NVIDIA
character devices.\nThe existence of these symlinks is required to address
the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\n
This bug impacts container runtimes configured with systemd cgroup management
enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n
validator:\n    driver:\n     env:\n  - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\""

23.6.0#

New Features#

Added support for configuring Kata Containers for GPU workloads as a technology preview feature. This feature introduces NVIDIA Kata Manager for Kubernetes as an operand of GPU Operator.
Added support for configuring Confidential Containers for GPU workloads as a technology preview feature. This feature builds on the work for configuring Kata Containers and introduces NVIDIA Confidential Computing Manager for Kubernetes as an operand of GPU Operator. Refer to gpu-operator-confidential-containers for more information.
Added support for the NVIDIA Data Center GPU Driver version 535.86.10. Refer to the GPU Operator Component Matrix on the platform support page.
Added support for NVIDIA vGPU 16.0.
Added support for NVIDIA Network Operator 23.7.0.
Added support for new MIG profiles with the 535 driver.
- For H100 NVL and H800 NVL devices:
  - 1g.12gb.me
  - 1g.24gb
  - 2g.24gb
  - 3g.47gb
  - 4g.47gb
  - 7g.94gb

Improvements#

The Operator is updated to use the node-role.kubernetes.io/control-plane label that is the default label for Kubernetes version 1.27. As a fallback for older Kubernetes versions, the Operator runs on nodes with the master label if the control-plane label is not available.
Added support for setting Pod Security Admission for the GPU Operator namespace. Pod Security Admission applies to Kubernetes versions 1.25 and higher. You can specify --set psa.enabled=true when you install or upgrade the Operator, or you can patch the cluster-policy instance of the ClusterPolicy object. The Operator sets the following standards:
```
pod-security.kubernetes.io/audit=privileged
pod-security.kubernetes.io/enforce=privileged
pod-security.kubernetes.io/warn=privileged
```
The Operator performs plugin validation when the Operator is installed or upgraded. Previously, the plugin validation ran a workload pod that requires access to a GPU. On a busy node with the GPUs consumed by other workloads, the validation can falsely report failure because it was not scheduled. The plugin validation still confirms that GPUs are advertised to kubelet, but it no longer runs a workload. To override the new behavior and run a plugin validation workload, specify --set validator.plugin.env.WITH_WORKLOAD=true when you install or upgrade the Operator.

Fixed issues#

In clusters that use a network proxy and configure GPU Direct Storage, the nvidia-fs-ctr container can use the network proxy and any other environment variable that you specify with the --set gds.env=key1=val1,key2=val2 option when you install or upgrade the Operator.
In previous releases, when you performed a GPU driver upgrade with the OnDelete strategy, the status reported in the cluster-policy instance of the ClusterPolicy object could indicate Ready even though the driver daemon set has not completed the upgrade of pods on all nodes. In this release, the status is reported as notReady until the upgrade is complete.

Known Limitations#

The GPU Driver container does not run on hosts that have a custom kernel with the SEV-SNP CPU feature because of the missing kernel-headers package within the container. With a custom kernel, NVIDIA recommends pre-installing the NVIDIA drivers on the host if you want to run traditional container workloads with NVIDIA GPUs.
If you cordon a node while the GPU driver upgrade process is already in progress, the Operator uncordons the node and upgrades the driver on the node. You can determine if an upgrade is in progress by checking the node label nvidia.com/gpu-driver-upgrade-state != upgrade-done.
NVIDIA vGPU is incompatible with KubeVirt v0.58.0, v0.58.1, and v0.59.0, as well as OpenShift Virtualization 4.12.0—4.12.2.
Using NVIDIA vGPU on bare metal nodes and NVSwitch is not supported.
When installing the Operator on Amazon EKS and using Kubernetes versions lower than 1.25, specify the --set psp.enabled=true Helm argument because EKS enables pod security policy (PSP). If you use Kubernetes version 1.25 or higher, do not specify the psp.enabled argument so that the default value, false, is used.
All worker nodes in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container.
Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems.
NVIDIA GPUDirect Storage (GDS) is not supported with secure boot enabled systems.
Driver Toolkit images are broken with Red Hat OpenShift version 4.11.12 and require cluster-level entitlements to be enabled in this case for the driver installation to succeed.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster.
The nouveau driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error Failed to enable MSI-X in the system journal logs. Additionally, all GPU Operator pods become stuck in the Init state.
When using RHEL 8 with Kubernetes, SELinux must be enabled (either in permissive or enforcing mode) for use with the GPU Operator. Additionally, network-restricted environments are not supported.

23.3.2#

New Features#

Added support for Kubernetes v1.27. Refer to Supported Operating Systems and Kubernetes Platforms on the platform support page.
Added support for Red Hat OpenShift Container Platform 4.13. Refer to Supported Operating Systems and Kubernetes Platforms on the platform support page.
Added support for KubeVirt v0.59 and Red Hat OpenShift Virtualization 4.13. Starting with KubeVirt versions v0.58.2 and v0.59.1 and OpenShift Virtualization 4.12.3 and 4.13.0, you must set the DisableMDEVConfiguration feature gate to use NVIDIA vGPU. Refer to GPU Operator with KubeVirt or NVIDIA GPU Operator with OpenShift Virtualization.
Added support for running the Operator with Microsoft Azure Kubernetes Service (AKS). You must use an AKS image with a preinstalled NVIDIA GPU driver and a preinstalled NVIDIA Container Toolkit. Refer to NVIDIA GPU Operator with Azure Kubernetes Service for more information.
Added support for VMWare vSphere 8.0 U1 with Tanzu.
Added support for CRI-0 v1.26 with Red Hat Enterprise Linux 8.7 and support for CRI-0 v1.27 with Ubuntu 20.04.

Improvements#

Increased the default timeout for the nvidia-smi command that is used by the NVIDIA Driver Container startup probe and made the timeout configurable. Previously, the timeout duration for the startup probe was 30s. In this release, the default duration is 60s. This change reduces the frequency of container restarts when nvidia-smi runs slowly. Refer to Common Chart Customization Options for more information.

Fixed issues#

Fixed an issue with NVIDIA GPU Direct Storage (GDS) and Ubuntu 22.04. The Operator was not able to deploy GDS and other daemon sets.

Previously, the Operator produced the following error log:

{"level":"error","ts":1681889507.829097,"msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"cluster-policy"},"namespace":"","name":"cluster-policy","reconcileID":"c5d55183-3ce9-4376-9d20-e3d53dc441cb","error":"ERROR: failed to transform the Driver Toolkit Container: could not find the 'openshift-driver-toolkit-ctr' container"}

Known Limitations#

If you cordon a node while the GPU driver upgrade process is already in progress, the Operator uncordons the node and upgrades the driver on the node. You can determine if an upgrade is in progress by checking the node label nvidia.com/gpu-driver-upgrade-state != upgrade-done.
NVIDIA vGPU is incompatible with KubeVirt v0.58.0, v0.58.1, and v0.59.0, as well as OpenShift Virtualization 4.12.0—4.12.2.
Using NVIDIA vGPU on bare metal nodes and NVSwitch is not supported.
When installing the Operator on Amazon EKS and using Kubernetes versions lower than 1.25, specify the --set psp.enabled=true Helm argument because EKS enables pod security policy (PSP). If you use Kubernetes version 1.25 or higher, do not specify the psp.enabled argument so that the default value, false, is used.
Ubuntu 18.04 is scheduled to reach end of standard support in May of 2023. When Ubuntu transitions it to end of life (EOL), the NVIDIA GPU Operator and related projects plan to cease building containers for 18.04 and to cease providing support.
All worker nodes within the Kubernetes cluster must use the same operating system version.
NVIDIA GPUDirect Storage (GDS) is not supported with secure boot enabled systems.
Driver Toolkit images are broken with Red Hat OpenShift version 4.11.12 and require cluster-level entitlements to be enabled in this case for the driver installation to succeed.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster.
The nouveau driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error Failed to enable MSI-X in the system journal logs. Additionally, all GPU Operator pods become stuck in the Init state.
When using RHEL 8 with Kubernetes, SELinux must be enabled (either in permissive or enforcing mode) for use with the GPU Operator. Additionally, network-restricted environments are not supported.

23.3.1#

This release provides a packaging-only update to the 23.3.0 release to fix installation on Red Hat OpenShift Container Platform. Refer to GitHub issue #513.

23.3.0#

New Features#

Added support for the NVIDIA Data Center GPU Driver version 525.105.17. Refer to the GPU Operator Component Matrix on the platform support page.
Added support for GPUDirect Storage with Red Hat OpenShift Container Platform 4.11. Refer to Support for GPUDirect Storage on the platform support page.
Added support for Canonical MicroK8s v1.26. Refer to Supported Operating Systems and Kubernetes Platforms on the platform support page.
Added support for containerd v1.7. Refer to Supported Container Runtimes on the platform support page.
Added support for Node Feature Discovery v0.12.1. Added support for using the NodeFeature API CRD for labeling nodes instead of labeling nodes over gRPC. The documentation for upgrading the Operator manually is updated to include applying the custom resource definitions for Node Feature Discovery.
Added support for running the NVIDIA GPU Operator in Amazon EKS and Google GKE. You must configure the cluster with custom nodes that run a supported operating system, such as Ubuntu 22.04.
Added support for the Container Device Interface (CDI) that is implemented by the NVIDIA Container Toolkit v1.13.0. Refer to Common Chart Customization Options for information about the cdi.enable and cdi.default options to enable CDI during installation or Container Device Interface (CDI) and Node Resource Interface (NRI) Plugin Support for post-installation configuration information.
[Technology Preview] Added support for precompiled driver containers for select operating systems. This feature removes the dynamic dependencies to build the driver during installation in the cluster such as downloading kernel header packages and GCC tooling. Sites with isolated networks that cannot access the internet can benefit. Sites with machines that are resource constrained can also benefit by removing the computational demand to compile the driver. For more information, see Precompiled Driver Containers.
Added support for the NVIDIA H800 GPU in the Supported NVIDIA Data Center GPUs and Systems table on the Platform Support page.

Improvements#

The upgrade process for the GPU driver is enhanced. This release introduces a maxUnavailable field that you can use to specify the number of nodes that can be unavailable during an upgrade. The value can be an integer or a string that specifies a percentage. If you specify a percentage, the number of nodes is calculated by rounding up. The default value is 25%.

If you specify a value for maxUnavailable and also specify maxParallelUpgrades, the maxUnavailable value applies an additional constraint on the value of maxParallelUpgrades to ensure that the number of parallel upgrades does not cause more than the intended number of nodes to become unavailable during the upgrade. For example, if you specify maxUnavailable=100% and maxParallelUpgrades=1, one node at a time is upgraded.
In previous releases, when you upgrade the GPU driver, the Operator validator pod could fail to complete all the validation checks. As a result, the node could remain in the validation required state indefinitely and prevent performing the driver upgrade on the other nodes in the cluster. This release adds a 600 second timeout for the validation process. If the validation does not complete successfully within the duration, the node is labelled upgrade-failed and the upgrade process proceeds on other nodes.
The Multi-Instance GPU (MIG) manager is enhanced to support setting an initial value for the nvidia.com/mig.config node annotation. On nodes with MIG-capable GPUs that do not already have the annotation set, the value is set to all-disabled and the MIG manager does not create MIG devices. The value is overwritten when you label the node with a MIG profile. For configuration information, see GPU Operator with MIG.

Fixed issues#

Fixed an issue that prevented building the GPU driver container when a Local Package Repository is used. Previously, if you needed to provide CA certificates, the certificates were not installed correctly. The certificates are now installed in the correct directories. Refer to GitHub issue #299 for more details.
Fixed an issue that created audit log records related to deprecated API requests for pod security policy. on Red Hat OpenShift Container Platform. Refer to GitHub issue #451 and issue #490 for more details.
Fixed an issue that caused the Operator to attempt to add a pod security policy on pre-release versions of Kubernetes v1.25. Refer to GitHub issue #484 for more details.
Fixed a race condition that is related to preinstalled GPU drivers, validator pods, and the device plugin pods. The race condition can cause the device plugin pods to set the wrong path to the GPU driver. Refer to GitHub issue #508 for more details.
Fixed an issue with the driver manager that prevented the manager from accurately detecting whether a node has preinstalled GPU drivers. This issue can appear if preinstalled GPU drivers were initially installed and later removed. The resolution is for the manager to check that the nvidia-smi file exists on the host and to check the output from executing the file.
Fixed an issue that prevented adding custom annotations to daemon sets that the Operator starts. Refer to GitHub issue #499 for more details.
Fixed an issue that is related to not starting the GPU Feature Discovery (GFD) pods when the DCGM Exporter service monitor is enabled, but a service monitor custom resource definition does not exist. Previously, there was no log record to describe why the GFD pods were not started. In this release, the Operator logs the error Couldn't find ServiceMonitor CRD and the message Install Prometheus and necessary CRDs for gathering GPU metrics to indicate the reason.
Fixed a race condition that prevented the GPU driver containers from loading the nvidia-peermem Linux kernel module and caused the driver daemon set pods to crash loop back off. The condition could occur when both GPUDirect RDMA and GPUDirect Storage are enabled. In this release, the start script for the driver containers confirm that Operator validator indicates the driver container is ready before attempting to load the kernel module.
Fixed an issue that caused upgrade of the GPU driver to fail when GPUDirect Storage is enabled. In this release, the driver manager unloads the nvidia-fs Linux kernel module before performing the upgrade.
Added support for new MIG profiles with the 525 driver.
- For A100-40GB devices:
  - 1g.5gb.me
  - 1g.10gb
  - 4g.20gb
- For H100-80GB and A100-80GB devices:
  - 1g.10gb
  - 1g.10gb.me
  - 1g.20gb
  - 4g.40gb
- For A30-24GB devices:
  - 1g.6gb.me
  - 2g.12gb.me

Common Vulnerabilities and Exposures (CVEs)#

The gpu-operator:v23.3.0 and gpu-operator-validator:v23.3.0 images have the following known high-vulnerability CVEs. These CVEs are from the base images and are not in libraries that are used by the GPU Operator:

openssl-libs - CVE-2023-0286
platform-python and python3-libs - CVE-2023-24329

Known Limitations#

Using NVIDIA vGPU on bare metal nodes and NVSwitch is not supported.
When installing the Operator on Amazon EKS and using Kubernetes versions lower than 1.25, specify the --set psp.enabled=true Helm argument because EKS enables pod security policy (PSP). If you use Kubernetes version 1.25 or higher, do not specify the psp.enabled argument so that the default value, false, is used.
Ubuntu 18.04 is scheduled to reach end of standard support in May of 2023. When Ubuntu transitions it to end of life (EOL), the NVIDIA GPU Operator and related projects plan to cease building containers for 18.04 and to cease providing support.
All worker nodes within the Kubernetes cluster must use the same operating system version.
NVIDIA GPUDirect Storage (GDS) is not supported with secure boot enabled systems.
Driver Toolkit images are broken with Red Hat OpenShift version 4.11.12 and require cluster-level entitlements to be enabled in this case for the driver installation to succeed.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster.
The nouveau driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error Failed to enable MSI-X in the system journal logs. Additionally, all GPU Operator pods become stuck in the Init state.
When using RHEL 8 with Kubernetes, SELinux must be enabled (either in permissive or enforcing mode) for use with the GPU Operator. Additionally, network-restricted environments are not supported.

22.9.2#

New Features#

Added support for Kubernetes v1.26 and Red Hat OpenShift 4.12. Refer to Platform Support for more details.
Added a new controller that is responsible for managing NVIDIA driver upgrades. Refer to GPU Driver Upgrades for more details.
Added the ability to apply custom labels and annotations for all of the GPU Operator pods. Refer to Common Chart Customization Options for how to configure custom labels and annotations.
Added support for NVIDIA vGPU 15.1. Refer to the NVIDIA Virtual GPU Software Documentation.
Added support for the NVIDIA HGX H100 System in the Supported NVIDIA Data Center GPUs and Systems table on the Platform Support page.
Added 525.85.12 as the recommended driver version and 3.1.6 as the recommended DCGM version in the GPU Operator Component Matrix. These updates enable support for the NVIDIA HGX H100 System.

Improvements#

Enhanced the driver validation logic to make sure that the current instance of the driver container has successfully finished installing drivers. This enhancement prevents other operands from incorrectly starting with previously loaded drivers.
Increased overall driver startup probe timeout from 10 to 20 minutes. The increased timeout improves the installation experience for clusters with slow networks by avoiding unnecessary driver container restarts.

Fixed issues#

Fixed an issue where containers allocated GPU lose access to them when systemd is triggered to run some reevaluation of the cgroups it manages. The issue affects systems using runc configured with systemd cgroups. Refer to GitHub issue #430 for more details.
Fixed an issue that prevented the GPU Operator from applying PSA labels on the namespace when no prior labels existed.

Common Vulnerabilities and Exposures (CVEs)#

The gpu-operator:v22.9.2 and gpu-operator:v22.9.2-ubi8 images have the following known high-vulnerability CVEs. These CVEs are from the base images and are not in libraries that are used by the GPU Operator:

libksba - CVE-2022-47629

Known Limitations#

All worker nodes within the Kubernetes cluster must use the same operating system version.
NVIDIA GPUDirect Storage (GDS) is not supported with secure boot enabled systems.
Driver Toolkit images are broken with Red Hat OpenShift version 4.11.12 and require cluster-level entitlements to be enabled in this case for the driver installation to succeed.
No support for newer MIG profiles 1g.10gb, 1g.20gb, 2.12gb+me with R525 drivers.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster.
The nouveau driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error Failed to enable MSI-X in the system journal logs. Additionally, all GPU Operator pods become stuck in the Init state.
When using RHEL 8 with Kubernetes, SELinux must be enabled (either in permissive or enforcing mode) for use with the GPU Operator. Additionally, network-restricted environments are not supported.

22.9.1#

New Features#

Support for CUDA 12.0 / R525 Data Center drivers on x86 / ARM servers.
Support for RHEL 8.7 with Kubernetes and Containerd or CRI-O.
Support for Ubuntu 20.4 and 22.04 with Kubernetes and CRI-O.
Support for NVIDIA GPUDirect Storage using Ubuntu 20.04 and Ubuntu 22.04 with Kubernetes.
Support for RTX 6000 ADA GPU
Support for A800 GPU
Support for vSphere 8.0 with Tanzu
Support for vGPU 15.0
Support for HPE Ezmeral Runtime Enterprise. Version 5.5 - with RHEL 8.4 and 8.5

Improvements#

Added helm parameters to control operator logging levels and time encoding.
When using CRI-O runtime with Kubernetes, it is no longer required to update the CRI-O config file to include /run/containers/oci/hooks.d as an additional path for OCI hooks. By default, the NVIDIA OCI runtime hook gets installed at /usr/share/containers/oci/hooks.d which is the default path configured with CRI-O.
Allow per node configurations for NVIDIA Device Plugin using a custom ConfigMap and node label nvidia.com/device-plugin.config=<config-name>.
Support for OnDelete upgrade strategy for all Daemonsets deployed by the GPU Operator. This can be configured using daemonsets.upgradeStrategy parameter in the ClusterPolicy. This prevents pods managed by the GPU Operator from being restarted automatically on spec updates.
Enable eviction of GPU Pods only during driver container upgrades with ENABLE_GPU_POD_EVICTION env (default: “true”) set under driver.manager.env in the ClusterPolicy. This eliminates the requirement to drain the entire node currently.

Fixed issues#

Fix repeated restarts of container-toolkit when used with containerd versions v1.6.9 and above. Refer to GitHub issue #432 for more details.
Disable creation of PodSecurityPolicies (PSP) with K8s versions 1.25 and above as it is removed.

Common Vulnerabilities and Exposures (CVEs)#

Fixed - Updated driver images for 515.86.01, 510.108.03, 470.161.03, 450.216.04 to address CVEs noted here.
The gpu-operator:v22.9.1 and gpu-operator:v22.9.1-ubi8 images have been released with the following known HIGH Vulnerability CVEs. These are from the base images and are not in libraries used by GPU Operator:
- krb5-libs - CVE-2022-42898

Known Limitations#

All worker nodes within the Kubernetes cluster must use the same operating system version.
NVIDIA GPUDirect Storage (GDS) is not supported with secure boot enabled systems.
Driver Toolkit images are broken with Red Hat OpenShift version 4.11.12 and require cluster level entitlements to be enabled in this case for the driver installation to succeed.
No support for newer MIG profiles 1g.10gb, 1g.20gb, 2.12gb+me with R525 drivers. It will be added in the following release.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster.
nouveau driver has to be blacklisted when using NVIDIA vGPU. Otherwise the driver will fail to initialize the GPU with the error Failed to enable MSI-X in the system journal logs and all GPU Operator pods will be stuck in Init state.
When using RHEL8 with Kubernetes, SELinux has to be enabled (either in permissive or enforcing mode) for use with the GPU Operator. Additionally, network restricted environments are not supported.

22.9.0#

New Features#

Support for Hopper (H100) GPU with CUDA 11.8 / R520 Data Center drivers on x86 servers.
Support for RHEL 8 with Kubernetes and Containerd or CRI-O.
Support with Kubernetes 1.25.
Support for RKE2 (Rancher Kubernetes Engine 2) with Ubuntu 20.04 and RHEL8.
Support for GPUDirect RDMA with NVIDIA Network Operator 1.3.
Support for Red Hat OpenShift with Cloud Service Providers (CSPs) Amazon AWS, Google GKE and Microsoft Azure.
[General Availability] - Support for KubeVirt and Red Hat OpenShift Virtualization with GPU Passthrough and NVIDIA vGPU based products.
[General Availability] - OCP and Upstream Kubernetes on ARM with supported platforms.
Support for Pod Security Admission (PSA) through the psp.enabled flag. If enabled, the namespace where the operator is installed in will be labeled with the privileged pod security level.

Improvements#

Support automatic upgrade and cleanup of clusterpolicies.nvidia.com CRD using Helm hooks. Refer to Operator upgrades for more info.
Support for dynamically enabling/disabling GFD, MIG Manager, DCGM and DCGM-Exporter.
Switched to calendar versioning starting from this release for better life cycle management and support. Refer to NVIDIA GPU Operator Versioning for more info.

Fixed issues#

Remove CUDA compat libs from the operator and all operand images to avoid mismatch with installed CUDA driver version. More info here and here.
Migrate to node.k8s.io/v1 API for creation of RuntimeClass objects. More info here.
Remove PodSecurityPolicy (PSP) starting with Kubernetes v1.25. Setting psp.enabled will now enable Pod Security Admission (PSA) instead.

Known Limitations#

All worker nodes within the Kubernetes cluster must use the same operating system version.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster.
nouveau driver has to be blacklisted when using NVIDIA vGPU. Otherwise the driver will fail to initialize the GPU with the error Failed to enable MSI-X in the system journal logs and all GPU Operator pods will be stuck in Init state.
When using CRI-O runtime with Kubernetes, the config file /etc/crio/crio.conf has to include /run/containers/oci/hooks.d as path for hooks_dir. Refer Specifying Configuration Options for containerd for steps to configure this.
When using RHEL8 with Kubernetes, SELinux has to be enabled (either in permissive or enforcing mode) for use with the GPU Operator. Additionally, network restricted environments are not supported.
The gpu-operator:v22.9.0 and gpu-operator:v22.9.0-ubi8 images have been released with the following known HIGH Vulnerability CVEs. These are from the base images and are not in libraries used by GPU Operator:
- expat - CVE-2022-40674
- systemd-pam - CVE-2022-2526
- systemd - CVE-2022-2526
- systemd-libs - CVE-2022-2526

1.11.1#

Improvements#

Added startupProbe to NVIDIA driver container to allow RollingUpgrades to progress to other nodes only after driver modules are successfully loaded on current one.
Added support for driver.rollingUpdate.maxUnavailable parameter to specify maximum nodes for simultaneous driver upgrades. Default is 1.
NVIDIA driver container will auto-disable itself on the node with pre-installed drivers by applying label nvidia.com/gpu.deploy.driver=pre-installed. This is useful for heterogeneous clusters where only some GPU nodes have pre-installed drivers(e.g. DGX OS).

Fixed issues#

Apply tolerations to cuda-validator and device-plugin-validator Pods based on deamonsets.tolerations in ClusterPolicy. For more info refer here.
Fixed an issue causing cuda-validator Pod to fail when accept-nvidia-visible-devices-envvar-when-unprivileged = false is set with NVIDIA Container Toolkit. For more info refer here.
Fixed an issue which caused recursive mounts under /run/nvidia/driver when both driver.rdma.enabled and driver.rdma.useHostMofed are set to true. This caused other GPU Pods to fail to start.

1.11.0#

New Features#

Support for NVIDIA Data Center GPU Driver version 515.48.07.
Support for NVIDIA AI Enterprise 2.1.
Support for NVIDIA Virtual Compute Server 14.1 (vGPU).
Support for Ubuntu 22.04 LTS.
Support for secure boot with GPU Driver version 515 and Ubuntu Server 20.04 LTS and 22.04 LTS.
Support for Kubernetes 1.24.
Support for Time-Slicing GPUs in Kubernetes.
Support for Red Hat OpenShift on AWS, Azure and GCP instances. Refer to the Platform Support Matrix for the supported instances.
Support for Red Hat Openshift 4.10 on AWS EC2 G5g instances(ARM).
Support for Kubernetes 1.24 on AWS EC2 G5g instances(ARM).
Support for use with the NVIDIA Network Operator 1.2.
[Technical Preview] - Support for KubeVirt and Red Hat OpenShift Virtualization with GPU Passthrough and NVIDIA vGPU based products.
[Technical Preview] - Kubernetes on ARM with Server Base System Architecture (SBSA).

Improvements#

GPUDirect RDMA is now supported with CentOS using MOFED installed on the node.
The NVIDIA vGPU Manager can now be upgraded to a newer branch while using an older, compatible guest driver.
DGX A100 and non-DGX servers can now be used within the same cluster.
Improved user interface while deploying a ClusterPolicy instance(CR) for the GPU Operator through Red Hat OpenShift Console.
Improved the container-toolkit to handle v1 containerd configurations.

Fixed issues#

Fix for incorrect reporting of DCGM_FI_DEV_FB_USED where reserved memory is reported as used memory. For more details refer to GitHub issue.
Fixed nvidia-peermem sidecar container to correctly load the nvidia-peermem module when MOFED is directly installed on the node.
Fixed duplicate mounts of /run/mellanox/drivers within the driver container which caused driver cleanup or re-install to fail.
Fixed uncordoning of the node with k8s-driver-manager whenever ENABLE_AUTO_DRAIN env is disabled.
Fixed readiness check for MOFED driver installation by the NVIDIA Network Operator. This will avoid the GPU driver containers to be in CrashLoopBackOff while waiting for MOFED drivers to be ready.

Known Limitations#

All worker nodes within the Kubernetes cluster must use the same operating system version.
The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster.
See the limitations sections for the [Technical Preview] of GPU Operator support for KubeVirt.
The clusterpolicies.nvidia.com CRD has to be manually deleted after the GPU Operator is uninstalled using Helm.
nouveau driver has to be blacklisted when using the NVIDIA vGPU. Otherwise the driver will fail to initialize the GPU with the error Failed to enable MSI-X in the system journal logs and all GPU Operator pods will be stuck in init state.
The gpu-operator:v1.11.0 and gpu-operator:v1.11.0-ubi8 images have been released with the following known HIGH Vulnerability CVEs. These are from the base images and are not in libraries used by GPU Operator:
- xz-libs - CVE-2022-1271

1.10.1#

Improvements#

Validated secure boot with signed NVIDIA Data Center Driver R510.
Validated cgroup v2 with Ubuntu Server 20.04 LTS.

Fixed issues#

Fixed an issue when GPU Operator was installed and MIG was already enabled on a GPU. The GPU Operator will now install successfully and MIG can either be disabled via the label nvidia.com/mig.config=all-disabled or configured with the required MIG profiles.

Known Limitations#

The gpu-operator:v1.10.1 and gpu-operator:v1.10.1-ubi8 images have been released with the following known HIGH Vulnerability CVEs. These are from the base images and are not in libraries used by GPU Operator:
- openssl-libs - CVE-2022-0778
- zlib - CVE-2018-25032
- gzip - CVE-2022-1271

1.10.0#

New Features#

Support for NVIDIA Data Center GPU Driver version 510.47.03.
Support NVIDIA A2, A100X and A30X
Support for A100X and A30X on the DPU’s Arm processor.
Support for secure boot with Ubuntu Server 20.04 and NVIDIA Data Center GPU Driver version R470.
Support for Red Hat OpenShift 4.10.
Support for GPUDirect RDMA with Red Hat OpenShift.
Support for NVIDIA AI Enterprise 2.0.
Support for NVIDIA Virtual Compute Server 14 (vGPU).

Improvements#

Enabling/Disabling of GPU System Processor (GSP) Mode through NVIDIA driver module parameters.
Ability to avoid deploying GPU Operator Operands on certain worker nodes through labels. Useful for running VMs with GPUs using KubeVirt.

Fixed issues#

Increased lease duration of GPU Operator to 60s to avoid restarts during etcd defrag. More details here.
Avoid spurious alerts generated of type GPUOperatorOpenshiftDriverToolkitEnabledNfdTooOld on RedHat OpenShift when there are no GPU nodes in the cluster.
Avoid uncordoning nodes during driver pod startup when ENABLE_AUTO_DRAIN is set to false.
Collection of GPU metrics in MIG mode is now supported with 470+ drivers.
Fabric Manager (required for NVSwitch based systems) with CentOS 7 is now supported.

Known Limitations#

Upgrading to a new NVIDIA AI Enterprise major branch:

Upgrading the vGPU host driver to a newer major branch than the vGPU guest driver will result in GPU driver pod transitioning to a failed state. This happens for instance when the Host is upgraded to vGPU version 14.x while the Kubernetes nodes are still running with vGPU version 13.x.

To overcome this situation, before upgrading the host driver to the new vGPU branch, apply the following steps:
1. kubectl edit clusterpolicy
2. modify the policy and set the environment variable DISABLE_VGPU_VERSION_CHECK to true as shown below:
  driver: env: - name: DISABLE_VGPU_VERSION_CHECK value: "true"
3. write and quit the clusterpolicy edit
The gpu-operator:v1.10.0 and gpu-operator:v1.10.0-ubi8 images have been released with the following known HIGH Vulnerability CVEs. These are from the base images and are not in libraries used by GPU Operator:
- openssl-libs - CVE-2022-0778

1.9.1#

Improvements#

Improved logic in the driver container for waiting on MOFED driver readiness. This ensures that nvidia-peermem is built and installed correctly.

Fixed issues#

Allow driver container to fallback to using cluster entitlements on Red Hat OpenShift on build failures. This issue exposed itself when using GPU Operator with some Red Hat OpenShift 4.8.z versions and Red Hat OpenShift 4.9.8. GPU Operator 1.9+ with Red Hat OpenShift 4.9.9+ doesn’t require entitlements.
Fixed an issue when DCGM-Exporter didn’t work correctly with using the separate DCGM host engine that is part of the standalone DCGM pod. Fixed the issue and changed the default behavior to use the DCGM Host engine that is embedded in DCGM-Exporter. The standalone DCGM pod will not be launched by default but can be enabled for use with DGX A100.
Update to latest Go vendor packages to avoid any CVE’s.
Fixed an issue to allow GPU Operator to work with CRI-O runtime on Kubernetes.
Mount correct source path for Mellanox OFED 5.x drivers for enabling GPUDirect RDMA.

1.9.0#

New Features#

Support for NVIDIA Data Center GPU Driver version 470.82.01.
Support for DGX A100 with DGX OS 5.1+.
Support for preinstalled GPU Driver with MIG Manager.
Removed dependency to maintain active Red Hat OpenShift entitlements to build the GPU Driver. Introduce entitlement free driver builds starting with Red Hat OpenShift 4.9.9.
Support for GPUDirect RDMA with preinstalled Mellanox OFED drivers.
Support for GPU Operator and operands upgrades using Red Hat OpenShift Lifecycle Manager (OLM).
Support for NVIDIA Virtual Compute Server 13.1 (vGPU).

Improvements#

Automatic detection of default runtime used in the cluster. Deprecate the operator.defaultRuntime parameter.
GPU Operator and its operands are installed into a single user specified namespace.
A loaded Nouveau driver is automatically detected and unloaded as part of the GPU Operator install.
Added an option to mount a ConfigMap of self-signed certificates into the driver container. Enables SSL connections to private package repositories.

Fixed issues#

Fixed an issue when DCGM Exporter was in CrashLoopBackOff as it could not connect to the DCGM port on the same node.

Known Limitations#

GPUDirect RDMA is only supported with R470 drivers on Ubuntu 20.04 LTS and is not supported on other distributions (e.g. CoreOS, CentOS etc.)
The GPU Operator supports GPUDirect RDMA only in conjunction with the Network Operator. The Mellanox OFED drivers can be installed by the Network Operator or pre-installed on the host.
Upgrades from v1.8.x to v1.9.x are not supported due to GPU Operator 1.9 installing the GPU Operator and its operands into a single namespace. Previous GPU Operator versions installed them into different namespaces. Upgrading to GPU Operator 1.9 requires uninstalling pre 1.9 GPU Operator versions prior to installing GPU Operator 1.9
Collection of GPU metrics in MIG mode is not supported with 470+ drivers.
The GPU Operator requires all MIG related configurations to be executed by MIG Manager. Enabling/Disabling MIG and other MIG related configurations directly on the host is discouraged.
Fabric Manager (required for NVSwitch based systems) with CentOS 7 is not supported.

1.8.2#

Fixed issues#

Fixed an issue where Driver Daemonset was spuriously updated on RedHat OpenShift causing repeated restarts in Proxy environments.
The MIG Manager version was bumped to v0.1.3 to fix an issue when checking whether a GPU was in MIG mode or not. Previously, it would always check for MIG mode directly over the PCIe bus instead of using NVML. Now it checks with NVML when it can, only falling back to the PCIe bus when NVML is not available. Please refer to the Release notes for a complete list of fixed issues.
Container Toolkit bumped to version v1.7.1 to fix an issue when using A100 80GB.

Improvements#

Added support for user-defined MIG partition configuration via a ConfigMap.

1.8.1#

Fixed issues#

Fixed an issue with using the NVIDIA License System in NVIDIA AI Enterprise deployments.

1.8.0#

New Features#

Support for NVIDIA Data Center GPU Driver version 470.57.02.
Added support for NVSwitch systems such as HGX A100. The driver container detects the presence of NVSwitches in the system and automatically deploys the Fabric Manager for setting up the NVSwitch fabric.
The driver container now builds and loads the nvidia-peermem kernel module when GPUDirect RDMA is enabled and Mellanox devices are present in the system. This allows the GPU Operator to complement the NVIDIA Network Operator to enable GPUDirect RDMA in the Kubernetes cluster. Refer to the RDMA documentation on getting started.

Note

This feature is available only when used with R470 drivers on Ubuntu 20.04 LTS.
Added support for upgrades of the GPU Operator components. A new k8s-driver-manager component handles upgrades of the NVIDIA drivers on nodes in the cluster.
NVIDIA DCGM is now deployed as a component of the GPU Operator. The standalone DCGM container allows multiple clients such as DCGM-Exporter and NVSM to be deployed and connect to the existing DCGM container.
Added a nodeStatusExporter component that exports operator and node metrics in a Prometheus format. The component provides information on the status of the operator (e.g. reconciliation status, number of GPU enabled nodes).

Improvements#

Reduced the size of the ClusterPolicy CRD by removing duplicates and redundant fields.
The GPU Operator now supports detection of the virtual PCIe topology of the system and makes the topology available to vGPU drivers via a configuration file. The driver container starts the nvidia-topologyd daemon in vGPU configurations.
Added support for specifying the RuntimeClass variable via Helm.
Added nvidia-container-toolkit images to support CentOS 7 and CentOS 8.
nvidia-container-toolkit now supports configuring containerd correctly for RKE2.
Added new debug options (logging, verbosity levels) for nvidia-container-toolkit

Fixed issues#

The driver container now loads ipmi_devintf by default. This allows tools such as ipmitool that rely on ipmi char devices to be created and available.

Known Limitations#

GPUDirect RDMA is only supported with R470 drivers on Ubuntu 20.04 LTS and is not supported on other distributions (e.g. CoreOS, CentOS etc.)
The operator supports building and loading of nvidia-peermem only in conjunction with the Network Operator. Use with pre-installed MOFED drivers on the host is not supported. This capability will be added in a future release.
Support for DGX A100 with GPU Operator 1.8 will be available in an upcoming patch release.
This version of GPU Operator does not work well on RedHat OpenShift when a cluster-wide proxy is configured and causes constant restarts of driver container. This will be fixed in an upcoming patch release v1.8.2.

1.7.1#

Fixed issues#

NFD version bumped to v0.8.2 to support correct kernel version labeling on Anthos nodes. See NFD issue for more details.

1.7.0#

New Features#

Support for NVIDIA Data Center GPU Driver version 460.73.01.
Added support for automatic configuration of MIG geometry on NVIDIA Ampere products (e.g. A100) using the k8s-mig-manager.
GPU Operator can now be deployed on systems with pre-installed NVIDIA drivers and the NVIDIA Container Toolkit.
DCGM-Exporter now supports telemetry for MIG devices on supported Ampere products (e.g. A100).
Added support for a new nvidia RuntimeClass with containerd.
The Operator now supports PodSecurityPolicies when enabled in the cluster.

Improvements#

Changed the label selector used by the DaemonSets of the different states of the GPU Operator. Instead of having a global label nvidia.com/gpu.present=true, each DaemonSet now has its own label, nvidia.com/gpu.deploy.<state>=true. This new behavior allows a finer grain of control over the components deployed on each of the GPU nodes.
Migrated to using the latest operator-sdk for building the GPU Operator.
The operator components are deployed with node-critical PriorityClass to minimize the possibility of eviction.
Added a spec for the initContainer image, to allow flexibility to change the base images as required.
Added the ability to configure the MIG strategy to be applied by the Operator.
The driver container now auto-detects OpenShift/RHEL versions to better handle node/cluster upgrades.
Validations of the container-toolkit and device-plugin installations are done on all GPU nodes in the cluster.
Added an option to skip plugin validation workload pod during the Operator deployment.

Fixed issues#

The gpu-operator-resources namespace is now created by the Operator so that they can be used by both Helm and OpenShift installations.

Known Limitations#

DCGM does not support profiling metrics on RTX 6000 and RTX 8000. Support will be added in a future release of DCGM Exporter.
After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using sudo rmmod nvidia nvidia_modeset nvidia_uvm command before re-installing GPU Operator.
When MIG strategy of mixed is configured, device-plugin-validation may stay in Pending state due to incorrect GPU resource request type. User would need to modify the pod spec to apply correct resource type to match the MIG devices configured in the cluster.

1.6.2#

Fixed issues#

Fixed an issue with NVIDIA Container Toolkit 1.4.6 which causes an error with containerd as Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused. NVIDIA Container Toolkit 1.4.7 now sets version as an integer to fix this error.
Fixed an issue with NVIDIA Container Toolkit which causes nvidia-container-runtime settings to be persistent across node reboot and causes driver pod to fail. Now nvidia-container-runtime will fallback to using runc when driver modules are not yet loaded during node reboot.
GPU Operator now mounts runtime hook configuration for CRIO under /run/containers/oci/hooks.d.

1.6.1#

Fixed issues#

Fixed an issue with NVIDIA Container Toolkit 1.4.5 when used with containerd and an empty containerd configuration which file causes error Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused. NVIDIA Container Toolkit 1.4.6 now explicitly sets the version=2 along with other changes when the default containerd configuration file is empty.

1.6.0#

New Features#

Support for Red Hat OpenShift 4.7.
Support for NVIDIA Data Center GPU Driver version 460.32.03.
Automatic injection of Proxy settings and custom CA certificates into driver container for Red Hat OpenShift.

DCGM-Exporter support includes the following:

Updated DCGM to v2.1.4
Increased reporting interval to 30s instead of 2s to reduce overhead
Report NVIDIA vGPU licensing status and row-remapping metrics for Ampere GPUs

Improvements#

NVIDIA vGPU licensing configuration (gridd.conf) can be provided as a ConfigMap
ClusterPolicy CRD has been updated from v1beta1 to v1. As a result minimum supported Kubernetes version is 1.16 from GPU Operator 1.6.0 onwards.

Fixed issues#

Fixes for DCGM Exporter to work with CPU Manager.
nvidia-gridd daemon logs are now collected on host by rsyslog.

Known Limitations#

DCGM does not support profiling metrics on RTX 6000 and RTX 8000. Support will be added in a future release of DCGM Exporter.
After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using sudo rmmod nvidia nvidia_modeset nvidia_uvm command before re-installing GPU Operator.
When MIG strategy of mixed is configured, device-plugin-validation may stay in Pending state due to incorrect GPU resource request type. User would need to modify the pod spec to apply correct resource type to match the MIG devices configured in the cluster.
gpu-operator-resources project in Red Hat OpenShift requires label openshift.io/cluster-monitoring=true for Prometheus to collect DCGM metrics. User will need to add this label manually when project is created.

1.5.2#

Improvements#

Allow mig.strategy=single on nodes with non-MIG GPUs.
Pre-create MIG related nvcaps at startup.
Updated device-plugin and toolkit validation to work with CPU Manager.

Fixed issues#

Fixed issue which causes GFD pods to fail with error Failed to load NVML error even after driver is loaded.

1.5.1#

Improvements#

Kubelet’s cgroup driver as systemd is now supported.

Fixed issues#

Device-Plugin stuck in init phase on node reboot or when new node is added to the cluster.

1.5.0#

New Features#

Added support for NVIDIA vGPU

Improvements#

Driver Validation container is run as an initContainer within device-plugin Daemonset pods. Thus driver installation on each NVIDIA GPU/vGPU node will be validated.
GFD will label vGPU nodes with driver version and branch name of NVIDIA vGPU installed on Hypervisor.
Driver container will perform automatic compatibility check of NVIDIA vGPU driver with the version installed on the underlying Hypervisor.

Fixed issues#

GPU Operator will no longer crash when no GPU nodes are found.
Container Toolkit pods wait for drivers to be loaded on the system before setting the default container runtime as nvidia.
On host reboot, ordering of pods is maintained to ensure that drivers are always loaded first.
Fixed device-plugin issue causing symbol lookup error: nvidia-device-plugin: undefined symbol: nvmlEventSetWait_v2 error.

Known Limitations#

The GPU Operator v1.5.x does not support mixed types of GPUs in the same cluster. All GPUs within a cluster need to be either NVIDIA vGPUs, GPU Passthrough GPUs or Bare Metal GPUs.
GPU Operator v1.5.x with NVIDIA vGPUs support Turing and newer GPU architectures.
DCGM does not support profiling metrics on RTX 6000 and RTX 8000. Support will be added in a future release of DCGM Exporter.
After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using sudo rmmod nvidia nvidia_modeset nvidia_uvm command before re-installing GPU Operator.
When MIG strategy of mixed is configured, device-plugin-validation may stay in Pending state due to incorrect GPU resource request type. User would need to modify the pod spec to apply correct resource type to match the MIG devices configured in the cluster.
gpu-operator-resources project in Red Hat OpenShift requires label openshift.io/cluster-monitoring=true for Prometheus to collect DCGM metrics. User will need to add this label manually when project is created.

1.4.0#

New Features#

Added support for CentOS 7 and 8.
Note

Due to a known limitation with the GPU Operator’s default values on CentOS, install the operator on CentOS 7/8 using the following Helm command:
```
$ helm install --wait --generate-name \
  nvidia/gpu-operator \
  --set toolkit.version=1.4.0-ubi8
```
This issue will be fixed in the next release.
Added support for airgapped enterprise environments.
Added support for containerd as a container runtime under Kubernetes.

Improvements#

Updated DCGM-Exporter to 2.1.2, which uses DCGM 2.0.13.
Added the ability to pass arguments to the NVIDIA device plugin to enable migStrategy and deviceListStrategy flags that allow additional configuration of the plugin.
Added more resiliency to dcgm-exporter- dcgm-exporter would not check whether GPUs support profiling metrics and would result in a CrashLoopBackOff state at launch in these configurations.

Fixed issues#

Fixed the issue where the removal of the GPU Operator from the cluster required a restart of the Docker daemon (since the Operator sets the nvidia as the default runtime).
Fixed volume mounts for dcgm-exporter under the GPU Operator to allow pod<->device metrics attribution.
Fixed an issue where the GFD and dcgm-exporter container images were artificially limited to R450+ (CUDA 11.0+) drivers.

Known Limitations#

After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using sudo rmmod nvidia nvidia_modeset nvidia_uvm command before re-installing GPU Operator.

1.3.0#

New Features#

Integrated GPU Feature Discovery to automatically generate labels for GPUs leveraging NFD.
Added support for Red Hat OpenShift 4.4+ (i.e. 4.4.29+, 4.5 and 4.6). The GPU Operator can be deployed from OpenShift OperatorHub. See the catalog listing for more information.

Improvements#

Updated DCGM-Exporter to 2.1.0 and added profiling metrics by default.
Added further capabilities to configure tolerations, node affinity, node selectors, pod security context, resource requirements through the ClusterPolicy.
Optimized the footprint of the validation containers images - the image sizes are now down to ~200MB.
Validation images are now configurable for air-gapped installations.

Fixed issues#

Fixed the ordering of the state machine to ensure that the driver daemonset is deployed before the other components. This fix addresses the issue where the NVIDIA container toolkit would be set up as the default runtime, causing the driver container initialization to fail.

Known Limitations#

After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using sudo rmmod nvidia nvidia_modeset nvidia_uvm command before re-installing GPU Operator.

1.2.0#

New Features#

Added support for Ubuntu 20.04.z LTS.
Added support for the NVIDIA A100 GPU (and appropriate updates to the underlying components of the operator).

Improvements#

Updated Node Feature Discovery (NFD) to 0.6.0.
Container images are now hosted (and mirrored) on both DockerHub and NGC.

Fixed issues#

Fixed an issue where the GPU Operator would not correctly detect GPU nodes due to inconsistent PCIe node labels.
Fixed a race condition where some of the NVIDIA pods would start out of order resulting in some pods in RunContainerError state.
Fixed an issue in the driver container where the container would fail to install on systems with the linux-gke kernel due to not finding the kernel headers.

Known Limitations#

After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using sudo rmmod nvidia nvidia_modeset nvidia_uvm command before re-installing GPU Operator.

1.1.0#

New features#

DCGM is now deployed as part of the GPU Operator on OpenShift 4.3.

Improvements#

The operator CRD has been renamed to ClusterPolicy.
The operator image is now based on UBI8.
Helm chart has been refactored to fix issues and follow some best practices.

Fixed issues#

Fixed an issue with the toolkit container which would set up the NVIDIA runtime under /run/nvidia with a symlink to /usr/local/nvidia. If a node was rebooted, this would prevent any containers from being run with Docker as the container runtime configured in /etc/docker/daemon.json would not be available after reboot.
Fixed a race condition with the creation of the CRD and registration.

1.0.0#

New Features#

Added support for Helm v3. Note that installing the GPU Operator using Helm v2 is no longer supported.
Added support for Red Hat OpenShift 4 (4.1, 4.2 and 4.3) using Red Hat Enterprise Linux Core OS (RHCOS) and CRI-O runtime on GPU worker nodes.
GPU Operator now deploys NVIDIA DCGM for GPU telemetry on Ubuntu 18.04 LTS

Fixed Issues#

The driver container now sets up the required dependencies on i2c and ipmi_msghandler modules.
Fixed an issue with the validation steps (for the driver and device plugin) taking considerable time. Node provisioning times are now improved by 5x.
The SRO custom resource definition is set up as part of the operator.
Fixed an issue with the cleanup of driver mount files when deleting the operator from the cluster. This issue previously required a reboot of the node.

Known Limitations#

After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using sudo rmmod nvidia nvidia_modeset nvidia_uvm command before re-installing GPU Operator.