
Copyright © 2026, NVIDIA Corporation.

State Manager

Overview

The State Manager is a Go library (commons/pkg/statemanager) that manages the dgxc.nvidia.com/nvsentinel-state node label lifecycle across NVSentinel modules. It provides a state machine implementation with transition validation and observability for coordinating Fault Quarantine, Node Drainer, and Fault Remediation operations.

Purpose

The State Manager coordinates node lifecycle state across three modules that operate on the same node:

  • fault-quarantine: Detects faults, applies quarantined state
  • node-drainer: Evacuates workloads, applies draining → drain-succeeded or drain-failed
  • fault-remediation: Executes recovery, applies remediating → remediation-succeeded or remediation-failed

It provides:

  • Single source of truth for node remediation status
  • State transition validation with metrics for unexpected flows
  • Terminal state detection (drain-failed, remediation-failed, remediation-succeeded)
  • Cancellation support (label removal from any state without validation)
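The last two points can be illustrated with a minimal in-memory sketch. Everything here except the label key is hypothetical: the Tracker type, its fields, and the choice to count an unexpected transition while still writing the label are assumptions made for illustration, not the library's actual API or behavior.

```go
package main

import "fmt"

// StateLabel is the node label key named in the overview.
const StateLabel = "dgxc.nvidia.com/nvsentinel-state"

// Tracker is a hypothetical stand-in for the library. It edits a node's
// label map and counts transitions that the validator did not expect.
type Tracker struct {
	// IsExpected is the validation rule, supplied by the caller.
	IsExpected func(from, to string) bool
	// Unexpected stands in for a metrics counter.
	Unexpected int
}

// Apply sets the state label and reports whether the transition was expected.
// Assumption: an unexpected transition is counted but still applied; the
// source does not say whether the real library rejects it.
func (t *Tracker) Apply(labels map[string]string, to string) bool {
	from := labels[StateLabel] // "" when the label is absent
	ok := t.IsExpected(from, to)
	if !ok {
		t.Unexpected++
	}
	labels[StateLabel] = to
	return ok
}

// Cancel removes the label from any state without consulting the validator.
func (t *Tracker) Cancel(labels map[string]string) {
	delete(labels, StateLabel)
}

func main() {
	// A deliberately tiny rule: only (no label) -> quarantined is expected.
	tr := &Tracker{IsExpected: func(from, to string) bool {
		return from == "" && to == "quarantined"
	}}
	labels := map[string]string{}
	tr.Apply(labels, "quarantined")
	tr.Cancel(labels)
	fmt.Println(labels, tr.Unexpected)
}
```

Passing the validation rule in as a function keeps the sketch independent of any particular transition table.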

State Machine

               ┌──────────────────┐
               │    [NO LABEL]    │  Healthy node
               └────────┬─────────┘
                        │
                        │ Fault detected
                        ▼
               ┌──────────────────┐
               │   quarantined    ├──────────────────────┐
               └────────┬─────────┘                      │
                        │                                │
                        │ Start drain                    │ No pods to drain
                        ▼                                │
               ┌──────────────────┐                      │
  Healthy ◄────┤     draining     │                      │
  event        └────────┬─────────┘                      │
  (cancel)              │                                │
     │                  │ Drain completed                │
     │         ┌────────┴────────┐                       │
     │         │                 │                       │
     │         ▼                 ▼                       │
     │ ┌───────────────┐  ┌────────────────┐             │
     │ │ drain-failed  │  │drain-succeeded │◄────────────┘
     │ └───────────────┘  └──────┬─────────┘
     │    [TERMINAL]             │
     │                           │ Start remediation
     │                           ▼
     │                    ┌──────────────┐   Note: fault-remediation
     │                    │ remediating  │   only consumes drain-succeeded.
     │                    └──────┬───────┘   drain-failed is a terminal state.
     │                           │
     │                  ┌────────┴────────┐
     │                  │                 │
     │                  ▼                 ▼
     │            ┌─────────────┐  ┌──────────────┐
     │            │ remediation-│  │ remediation- │  [TERMINAL]
     │            │ succeeded   │  │ failed       │
     │            └─────┬───────┘  └──────────────┘
     │                  │
     │                  │ Healthy event
     ▼                  ▼
┌──────────────────────────────┐
│          [NO LABEL]          │
└──────────────────────────────┘

State Label Values

| State                 | Applied By        | Description                            | Terminal |
|-----------------------|-------------------|----------------------------------------|----------|
| (no label)            | Any               | Healthy node, no active fault handling | No       |
| quarantined           | fault-quarantine  | Fault detected, node cordoned/tainted  | No       |
| draining              | node-drainer      | Workload evacuation in progress        | No       |
| drain-succeeded       | node-drainer      | All workloads evacuated successfully   | No       |
| drain-failed          | node-drainer      | Workload evacuation failed             | Yes      |
| remediating           | fault-remediation | Remediation action in progress         | No       |
| remediation-succeeded | fault-remediation | Remediation completed successfully     | Yes*     |
| remediation-failed    | fault-remediation | Remediation action failed              | Yes      |

* Terminal until a healthy event triggers label removal.
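As a rough sketch, the label values and the Terminal column above could be modeled in Go as follows. The State type, the constant names, and IsTerminal are illustrative names chosen here; the source does not show the identifiers that commons/pkg/statemanager actually exports.

```go
package main

import "fmt"

// State is a value of the dgxc.nvidia.com/nvsentinel-state node label.
// The type and constant names are illustrative, not the library's API.
type State string

const (
	StateNone                 State = "" // label absent
	StateQuarantined          State = "quarantined"
	StateDraining             State = "draining"
	StateDrainSucceeded       State = "drain-succeeded"
	StateDrainFailed          State = "drain-failed"
	StateRemediating          State = "remediating"
	StateRemediationSucceeded State = "remediation-succeeded"
	StateRemediationFailed    State = "remediation-failed"
)

// IsTerminal mirrors the Terminal column of the table. remediation-succeeded
// counts as terminal until a healthy event removes the label.
func IsTerminal(s State) bool {
	switch s {
	case StateDrainFailed, StateRemediationSucceeded, StateRemediationFailed:
		return true
	}
	return false
}

func main() {
	for _, s := range []State{StateQuarantined, StateDrainFailed, StateRemediationSucceeded} {
		fmt.Printf("%-22s terminal=%v\n", s, IsTerminal(s))
	}
}
```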

State Transitions

| From State      | To State              | Trigger                      |
|-----------------|-----------------------|------------------------------|
| (no label)      | quarantined           | Fault detected               |
| quarantined     | draining              | Drain initiated              |
| quarantined     | drain-succeeded       | No pods to drain             |
| draining        | drain-succeeded       | All pods evacuated           |
| draining        | drain-failed          | Evacuation timeout/failure   |
| drain-succeeded | remediating           | Remediation initiated        |
| remediating     | remediation-succeeded | Remediation completed        |
| remediating     | remediation-failed    | Remediation error            |
| (any state)     | (no label)            | Healthy event (cancellation) |
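The transition table lends itself to a lookup-based check. The sketch below encodes the rows as a map and treats label removal as always allowed, matching the last row. The names expected and IsExpectedTransition are hypothetical, and the empty string is used here to represent "no label".

```go
package main

import "fmt"

// expected maps each state to the states it may move to, per the table.
// The empty string stands for "no label". Names here are hypothetical.
var expected = map[string][]string{
	"":                {"quarantined"},
	"quarantined":     {"draining", "drain-succeeded"},
	"draining":        {"drain-succeeded", "drain-failed"},
	"drain-succeeded": {"remediating"},
	"remediating":     {"remediation-succeeded", "remediation-failed"},
}

// IsExpectedTransition reports whether from -> to appears in the table.
// Removing the label (to == "") is always allowed, matching the
// "(any state) -> (no label)" cancellation row.
func IsExpectedTransition(from, to string) bool {
	if to == "" {
		return true
	}
	for _, next := range expected[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(IsExpectedTransition("quarantined", "drain-succeeded"))
	fmt.Println(IsExpectedTransition("drain-failed", "remediating"))
}
```

Terminal states have no entry in the map, so any forward move out of them is reported as unexpected, while cancellation still passes.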

Example Flows

Successful remediation (in these flows, none means the label is absent):

none → quarantined → draining → drain-succeeded → remediating → remediation-succeeded → none

No pods to drain:

none → quarantined → drain-succeeded → remediating → remediation-succeeded → none

Failed drain:

none → quarantined → draining → drain-failed [TERMINAL]

Canceled drain (healthy event):

none → quarantined → draining → none

Failed remediation:

none → quarantined → draining → drain-succeeded → remediating → remediation-failed [TERMINAL]
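To see which module is responsible at each step of a flow, the arrow notation above can be combined with the Applied By column of the State Label Values table. This is only an illustrative helper: appliedBy and Handoffs are hypothetical names, and the parsing assumes flows are written exactly as in the examples above.

```go
package main

import (
	"fmt"
	"strings"
)

// appliedBy maps each label value to the module that sets it, taken from
// the Applied By column. "none" has no entry because no label is set.
var appliedBy = map[string]string{
	"quarantined":           "fault-quarantine",
	"draining":              "node-drainer",
	"drain-succeeded":       "node-drainer",
	"drain-failed":          "node-drainer",
	"remediating":           "fault-remediation",
	"remediation-succeeded": "fault-remediation",
	"remediation-failed":    "fault-remediation",
}

// Handoffs splits a flow written in the arrow notation and returns
// "state (module)" for every step that sets a label. Suffixes such as
// "[TERMINAL]" and steps without an owner (like "none") are ignored.
func Handoffs(flow string) []string {
	var out []string
	for _, step := range strings.Split(flow, "→") {
		fields := strings.Fields(step)
		if len(fields) == 0 {
			continue
		}
		state := fields[0]
		if owner, ok := appliedBy[state]; ok {
			out = append(out, state+" ("+owner+")")
		}
	}
	return out
}

func main() {
	for _, h := range Handoffs("none → quarantined → draining → drain-failed [TERMINAL]") {
		fmt.Println(h)
	}
}
```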