State Manager
Overview
The State Manager is a Go library (commons/pkg/statemanager) that manages the dgxc.nvidia.com/nvsentinel-state node label lifecycle across NVSentinel modules. It provides a state machine implementation with transition validation and observability for coordinating Fault Quarantine, Node Drainer, and Fault Remediation operations.
Purpose
Coordinates node lifecycle state across three modules operating on the same node:
- fault-quarantine: Detects faults, applies quarantined state
- node-drainer: Evacuates workloads, applies draining → drain-succeeded or drain-failed
- fault-remediation: Executes recovery, applies remediating → remediation-succeeded or remediation-failed
Provides:
- Single source of truth for node remediation status
- State transition validation with metrics for unexpected flows
- Terminal state detection (drain-failed, remediation-failed, remediation-succeeded)
- Cancellation support (label removal from any state without validation)
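The behavior listed above can be illustrated with a minimal sketch. It is not the actual commons/pkg/statemanager API. The type, constant, and function names are hypothetical, and the transition table is an assumption inferred from the module descriptions above. A plain integer counter stands in for the metric, and the sketch assumes an unexpected transition is counted but still applied.

```go
package main

import "fmt"

// State is a value of the nvsentinel-state node label.
type State string

// LabelKey is taken from the overview; every other name here is hypothetical.
const LabelKey = "dgxc.nvidia.com/nvsentinel-state"

const (
	None                 State = "" // label absent
	Quarantined          State = "quarantined"
	Draining             State = "draining"
	DrainSucceeded       State = "drain-succeeded"
	DrainFailed          State = "drain-failed"
	Remediating          State = "remediating"
	RemediationSucceeded State = "remediation-succeeded"
	RemediationFailed    State = "remediation-failed"
)

// expected is an ASSUMED transition table inferred from the module
// descriptions; the real library may allow additional transitions.
var expected = map[State][]State{
	None:           {Quarantined},
	Quarantined:    {Draining},
	Draining:       {DrainSucceeded, DrainFailed},
	DrainSucceeded: {Remediating},
	Remediating:    {RemediationSucceeded, RemediationFailed},
}

// unexpectedTransitions is a stand-in for a real metric counter.
var unexpectedTransitions int

// IsTerminal reports whether s is one of the terminal states listed above.
func IsTerminal(s State) bool {
	switch s {
	case DrainFailed, RemediationFailed, RemediationSucceeded:
		return true
	}
	return false
}

// IsExpected reports whether from -> to is in the assumed table.
// Label removal (to == None) is accepted from any state without validation.
func IsExpected(from, to State) bool {
	if to == None {
		return true
	}
	for _, next := range expected[from] {
		if next == to {
			return true
		}
	}
	return false
}

// Apply sets or removes the state label in labels and reports whether the
// transition was expected. Unexpected transitions are counted, then applied.
func Apply(labels map[string]string, to State) bool {
	from := State(labels[LabelKey]) // missing key yields None
	ok := IsExpected(from, to)
	if !ok {
		unexpectedTransitions++
	}
	if to == None {
		delete(labels, LabelKey)
	} else {
		labels[LabelKey] = string(to)
	}
	return ok
}

func main() {
	labels := map[string]string{}
	flow := []State{Quarantined, Draining, DrainSucceeded, Remediating, RemediationSucceeded}
	for _, s := range flow {
		fmt.Println(s, "expected:", Apply(labels, s), "terminal:", IsTerminal(s))
	}
	Apply(labels, None) // cancellation or healthy event: remove the label
	fmt.Println("labels left:", len(labels), "unexpected:", unexpectedTransitions)
}
```

Running the sketch walks one node's label through a full flow and then removes it. Keeping the table as data rather than branching logic makes it straightforward to adjust if the real state machine permits other transitions.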
State Machine
State Label Values
*Terminal until healthy event triggers label removal
State Transitions
Example Flows
Successful remediation:
No pods to drain:
Failed drain:
Canceled drain (healthy event):
Failed remediation: