
Copyright © 2026, NVIDIA Corporation.


Runbook: GPU Driver and GPU Operator Upgrades


Overview

This runbook provides the procedure to safely upgrade GPU drivers and GPU Operator components while preventing NVSentinel from interfering with the upgrade process.

Background

During GPU Operator or driver upgrades, DCGM on affected nodes becomes temporarily disabled or unhealthy. NVSentinel uses DCGM as a health indicator for the GPU driver. When DCGM connectivity fails, NVSentinel:

  1. Marks the node as unhealthy
  2. Applies the GpuDcgmConnectivityFailure node condition
  3. Cordons the node to prevent new workload scheduling

Potential Issues:

  • When multiple nodes are upgraded in parallel, the circuit breaker may trip, preventing NVSentinel from processing new events and leaving nodes in a cordoned state.
  • When the circuit breaker resets after maintenance, NVSentinel may process queued events, potentially causing the circuit breaker to trip again or nodes to be cordoned and uncordoned repeatedly (tracked in #450).
  • Either behavior can reduce cluster availability, because cordoned nodes accept no new workloads.
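To see which nodes NVSentinel has already flagged before or during maintenance, a small helper along these lines may be useful. This is a minimal sketch, not part of NVSentinel: the function name nodes_with_dcgm_failure is hypothetical, and it assumes the GpuDcgmConnectivityFailure condition is reported with status True while DCGM is unreachable.

```shell
# Hypothetical helper (not part of NVSentinel): list nodes whose
# GpuDcgmConnectivityFailure condition is True, and whether each is cordoned.
# Assumption: the condition status is "True" while DCGM is unreachable.
nodes_with_dcgm_failure() {
  # One line per node: name <TAB> condition status <TAB> spec.unschedulable
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="GpuDcgmConnectivityFailure")].status}{"\t"}{.spec.unschedulable}{"\n"}{end}' |
    awk -F'\t' '$2 == "True" { print $1, ($3 == "true" ? "cordoned" : "schedulable") }'
}
```

Calling nodes_with_dcgm_failure prints one line per affected node, for example a node name followed by cordoned. An empty result means no node currently reports the condition as True.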

Solution: Temporarily disable NVSentinel management on nodes undergoing GPU driver or GPU Operator upgrades.

Procedure

1. Disable NVSentinel Management on Target Nodes

Apply the k8saas.nvidia.com/ManagedByNVSentinel=false label to all nodes that will be upgraded.

$ kubectl label node --all k8saas.nvidia.com/ManagedByNVSentinel=false

Note: Replace --all with specific node names if you are upgrading only a subset of nodes.
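When only a subset of nodes is being upgraded, the labeling step could be wrapped as follows. This is a sketch under assumptions: the function name disable_nvsentinel and the node names in the usage note are hypothetical, and --overwrite is added here so that the command also succeeds if the label already exists with a different value.

```shell
# Hypothetical helper: mark the named nodes as not managed by NVSentinel.
# Usage: disable_nvsentinel NODE [NODE...]
disable_nvsentinel() {
  if [ "$#" -eq 0 ]; then
    echo "usage: disable_nvsentinel NODE [NODE...]" >&2
    return 1
  fi
  local node
  for node in "$@"; do
    # --overwrite replaces any existing value of the label.
    kubectl label node "$node" k8saas.nvidia.com/ManagedByNVSentinel=false --overwrite || return 1
  done
  # Show the label value as an extra column so the result can be checked by eye.
  kubectl get nodes -L k8saas.nvidia.com/ManagedByNVSentinel
}
```

For example, disable_nvsentinel gpu-node-1 gpu-node-2 labels those two nodes and then lists all nodes with the label value in the last column.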

2. Perform the GPU Driver or GPU Operator Upgrade

Execute the GPU driver or GPU Operator upgrade using your organization’s standard upgrade procedure.

3. Validate GPU Component Health

Verify that all pods in the gpu-operator namespace are running and healthy:

$ kubectl get po -n gpu-operator

Ensure every pod shows Running status and is ready before proceeding. Pods created by one-shot validation jobs may instead show Completed, which is expected.
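The health check can also be scripted by asking the API server for pods in any other phase. The sketch below is illustrative only: the function name gpu_operator_pods_not_ok is hypothetical, and it assumes that pods in the Succeeded phase (shown as Completed by kubectl) are acceptable. It checks the pod phase only, so the READY column still needs a manual look.

```shell
# Hypothetical helper: report gpu-operator pods that are neither Running nor
# Succeeded. Returns 0 if none, 1 if some exist, 2 if kubectl itself failed.
# Note: this inspects status.phase only; it does not verify container readiness.
gpu_operator_pods_not_ok() {
  local bad
  bad="$(kubectl get pods -n gpu-operator --no-headers \
    --field-selector='status.phase!=Running,status.phase!=Succeeded')" || return 2
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad"
    return 1
  fi
  echo "all gpu-operator pods are Running or Succeeded"
}
```

A non-zero return code means the upgrade has not settled yet, or that kubectl could not reach the cluster, and NVSentinel management should stay disabled.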

4. Re-enable NVSentinel Management

Remove the k8saas.nvidia.com/ManagedByNVSentinel label from the upgraded nodes:

$ kubectl label node --all k8saas.nvidia.com/ManagedByNVSentinel-

Note: Replace --all with specific node names if only a subset of nodes was upgraded.
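For a subset of nodes, removing the label and confirming that it is gone might look like this. It is a sketch only: the function name enable_nvsentinel is hypothetical, and the final command relies on a label selector that matches any node still carrying the label key, whatever its value.

```shell
# Hypothetical helper: remove the ManagedByNVSentinel label from the named
# nodes, then print any node that still carries the label key.
# Usage: enable_nvsentinel NODE [NODE...]
enable_nvsentinel() {
  if [ "$#" -eq 0 ]; then
    echo "usage: enable_nvsentinel NODE [NODE...]" >&2
    return 1
  fi
  local node
  for node in "$@"; do
    # A trailing "-" after the key tells kubectl to remove the label.
    kubectl label node "$node" k8saas.nvidia.com/ManagedByNVSentinel- || return 1
  done
  # Nodes listed here (as node/NAME) are still excluded from NVSentinel management.
  kubectl get nodes -l k8saas.nvidia.com/ManagedByNVSentinel -o name
}
```

If the last command prints nothing on standard output, no node in the cluster still has the label.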

Verification

After re-enabling NVSentinel management, monitor the nodes to ensure:

  • Nodes remain in Ready state
  • No unexpected cordoning occurs
  • NVSentinel metrics show normal operation
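The first two checks can be approximated from the STATUS column of kubectl get nodes, which reads Ready for a healthy schedulable node and gains values such as NotReady or SchedulingDisabled otherwise. The helper below is a minimal sketch with a hypothetical name, nodes_needing_attention. It does not cover the metrics check.

```shell
# Hypothetical helper: print nodes whose STATUS is anything other than plain
# "Ready" (for example NotReady or Ready,SchedulingDisabled).
nodes_needing_attention() {
  # Column 1 is the node name, column 2 is the STATUS column.
  kubectl get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
}
```

Running nodes_needing_attention a few times over the minutes after re-enabling management gives a quick view of whether any node is being cordoned unexpectedly.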