Runbook: GPU Driver and GPU Operator Upgrades
This runbook provides the procedure to safely upgrade GPU drivers and GPU Operator components while preventing NVSentinel from interfering with the upgrade process.
During GPU Operator or driver upgrades, DCGM on affected nodes becomes temporarily disabled or unhealthy. NVSentinel uses DCGM as a health indicator for the GPU driver. When DCGM connectivity fails, NVSentinel:
GpuDcgmConnectivityFailure node condition
Potential Issues:
Solution: Temporarily disable NVSentinel management on nodes undergoing GPU driver or GPU Operator upgrades.
Apply the k8saas.nvidia.com/ManagedByNVSentinel=false label to all nodes that will be upgraded.
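The labeling step above can be sketched as a small shell helper. This is a minimal sketch, not the runbook's original command: it assumes `kubectl` is on the PATH and configured for the target cluster. The function name `disable_nvsentinel` and the `KUBECTL` override variable are hypothetical conveniences, and `--overwrite` is added on the assumption that the label may already be set to a different value.

```shell
# Assumption: kubectl is installed and points at the target cluster.
# KUBECTL may be overridden (for example KUBECTL=echo) to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: mark nodes as not managed by NVSentinel.
# Pass --all for every node, or one or more node names.
disable_nvsentinel() {
  $KUBECTL label nodes "$@" k8saas.nvidia.com/ManagedByNVSentinel=false --overwrite
}

# Usage:
#   disable_nvsentinel --all
#   disable_nvsentinel node-a node-b
```

Setting `KUBECTL=echo` before calling the helper prints the command instead of running it, which can be handy for reviewing the node selection first.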
Note: Replace --all with specific node names if only upgrading a subset of nodes.
Execute the GPU driver or GPU Operator upgrade using your organization’s standard upgrade procedure.
Verify that all pods in the gpu-operator namespace are running and healthy:
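One way to perform this check is sketched below. It assumes the GPU Operator is installed in the `gpu-operator` namespace named above and that `kubectl` is configured for the cluster. The function name `check_gpu_operator_pods` and the `KUBECTL` override are hypothetical.

```shell
# Assumption: kubectl is installed and points at the target cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: list pods in the gpu-operator namespace,
# including the node each pod is scheduled on (-o wide).
check_gpu_operator_pods() {
  $KUBECTL get pods -n gpu-operator -o wide
}

# Usage:
#   check_gpu_operator_pods
```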
Ensure all pods show Running status and are ready before proceeding.
Remove the k8saas.nvidia.com/ManagedByNVSentinel label from the upgraded nodes:
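The label removal can be sketched as follows, using the standard `kubectl label` convention that a trailing `-` after the key removes the label. This is an illustrative sketch under the same assumptions as before (`kubectl` configured for the cluster). The function name `enable_nvsentinel` and the `KUBECTL` override are hypothetical.

```shell
# Assumption: kubectl is installed and points at the target cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: remove the label so NVSentinel manages the nodes again.
# The trailing "-" after the label key tells kubectl to delete the label.
enable_nvsentinel() {
  $KUBECTL label nodes "$@" k8saas.nvidia.com/ManagedByNVSentinel-
}

# Usage:
#   enable_nvsentinel --all
#   enable_nvsentinel node-a node-b
```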
Note: Replace --all with specific node names if only a subset of nodes was upgraded.
After re-enabling NVSentinel management, monitor the nodes to ensure:
Nodes are in the Ready state
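A minimal sketch of the monitoring step, assuming `kubectl` access to the cluster. The helper names `node_status` and `node_conditions` and the `KUBECTL` override are hypothetical. The first lists node readiness, and the second shows a single node's details, including its conditions, where a lingering GpuDcgmConnectivityFailure condition would be visible.

```shell
# Assumption: kubectl is installed and points at the target cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: show the STATUS column (Ready / NotReady) for nodes.
# With no arguments it lists every node; node names narrow the output.
node_status() {
  $KUBECTL get nodes "$@"
}

# Hypothetical helper: print one node's full description, including
# its Conditions section and any cordon (unschedulable) state.
node_conditions() {
  $KUBECTL describe node "$1"
}

# Usage:
#   node_status
#   node_conditions node-a
```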