Installation and Upgrade Issues
If the GPU Operator was installed using Helm, always remove it with the helm uninstall command so that all GPU Operator artifacts are cleaned up properly. Manually deleting components with kubectl can leave stale entries, which cause later installations to fail with errors about invalid ownership metadata or namespace mismatches.
Next Steps
Automated Cleanup:
Run:
helm uninstall --wait gpu-operator -n gpu-operator
Replace the release name and namespace if they differ in your cluster.
Manual Cleanup:
Delete Custom Resource Definitions and deployments.
Remove namespace, ClusterRoles, and ClusterRoleBindings.
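The manual cleanup steps above can be sketched roughly as follows. This is a minimal sketch, not an official procedure: the function names filter_gpu_operator and list_leftovers are hypothetical, and the match pattern (nvidia or gpu-operator) is an assumption about how leftover resources are named in a given cluster. It only lists candidates so they can be reviewed before anything is deleted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only resource names that look GPU Operator related.
# The pattern is an assumption; adjust it to your release and namespace names.
filter_gpu_operator() {
  grep -E 'nvidia|gpu-operator' || true   # "|| true": no match is not an error
}

# List (do not delete) cluster-scoped leftovers so they can be reviewed first.
list_leftovers() {
  kubectl get crd,clusterrole,clusterrolebinding -o name | filter_gpu_operator
}

# After reviewing the list, delete the leftovers explicitly, for example:
#   list_leftovers | xargs -r kubectl delete
#   kubectl delete namespace gpu-operator
```

Listing before deleting is a deliberate choice here: a broad pattern can match resources that belong to other NVIDIA components, so the output should be checked by hand.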
For more detailed instructions and additional information, visit the full article here.
The GPU Operator installation fails on Kubernetes versions 1.25 and later due to the removal of the PodSecurityPolicy (PSP) API. Users attempting to install GPU Operator versions below 22.09 with PSP enabled encounter the following error:
Error: INSTALLATION FAILED: unable to build Kubernetes objects from release manifest: resource mapping not found for name: "gpu-operator-restricted" namespace: "" from "": no matches for kind "PodSecurityPolicy" in version "policy/v1beta1"
Next Steps
Disable PSP during installation using the following command:
helm install --wait gpu-operator nvaie/gpu-operator-1-3 -n gpu-operator --set driver.image=vgpu-guest-driver-1-3 --set psp.enabled=false
Alternatively, upgrade to a GPU Operator version that supports Kubernetes 1.25+ (the error above affects versions below 22.09).
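Whether the PSP workaround applies depends on the cluster's Kubernetes version. The following is a minimal sketch of that check; psp_removed is a hypothetical helper name, not part of any NVIDIA or Kubernetes tooling, and it assumes a version string of the form 1.25.3 or v1.25.3.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the given Kubernetes version
# is 1.25 or later, i.e. when the PodSecurityPolicy API is no longer served.
psp_removed() {
  local ver="${1#v}"              # drop a leading "v": v1.25.3 -> 1.25.3
  local major="${ver%%.*}"        # text before the first dot
  local rest="${ver#*.}"          # text after the first dot
  local minor="${rest%%[!0-9]*}"  # leading digits only ("25.3" or "25+" -> 25)
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 25 ]; }
}

# Example use (the version string would normally come from the cluster):
#   psp_removed "v1.26.1" && echo "install with --set psp.enabled=false"
```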
The Linux GRID driver fails to install on Ubuntu 22.04.2 LTS with kernel version 6.5.0-41 due to:
There is a mismatch between the GCC version used to build the kernel and the installed GCC version.
An unsupported GCC option, -ftrivial-auto-var-init=zero, in GCC-11.
Next Steps
Fix GCC version mismatch:
Install GCC-12
Run the NVIDIA installer
Alternatively, update symbolic links for GCC to point to GCC-12
Resolve GCC option issues: Ensure the compiler version matches the kernel build version to avoid unsupported options.
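Before running the NVIDIA installer, the mismatch described above can be detected with a short check. This is a sketch under stated assumptions: the helper names kernel_gcc_major and gcc_matches are hypothetical, and the parsing assumes the kernel's compiler appears in /proc/version as a string containing "gcc" followed by its version, which is common but not guaranteed on every distribution.

```shell
#!/usr/bin/env bash
# Print the GCC major version found in a /proc/version-style string on stdin.
# Assumes the compiler is reported as "...gcc-12 ..." or "gcc (...) 11.4.0".
kernel_gcc_major() {
  sed -n 's/.*gcc[^0-9]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed when the kernel's GCC major version equals the installed one.
gcc_matches() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Example wiring (commented out so the sketch has no side effects):
#   kernel_major="$(kernel_gcc_major < /proc/version)"
#   installed_major="$(gcc -dumpversion | cut -d. -f1)"
#   gcc_matches "$kernel_major" "$installed_major" \
#     || echo "GCC mismatch: kernel built with GCC $kernel_major"
```

If the check reports a mismatch, install the matching compiler (GCC-12 in the case described here) before running the installer.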
The DLS 3.3.0 in-place upgrade process gets stuck at Step 3 due to a certificate mismatch between the DLS UI and the In-Place Upgrade Service.
Next Steps
Access the upgrade page using the DLS IP (e.g., https://<DLS_IP>:8443).
Use a browser with lower security settings (e.g., Firefox with Standard settings).