State Manager


Overview

The State Manager is a Go library (commons/pkg/statemanager) that manages the dgxc.nvidia.com/nvsentinel-state node label lifecycle across NVSentinel modules. It provides a state machine implementation with transition validation and observability for coordinating Fault Quarantine, Node Drainer, and Fault Remediation operations.

Purpose

Coordinates node lifecycle state across three modules operating on the same node:

  • fault-quarantine: Detects faults, applies quarantined state
  • node-drainer: Evacuates workloads, applies draining, then drain-succeeded or drain-failed
  • fault-remediation: Executes recovery, applies remediating, then remediation-succeeded or remediation-failed

Provides:

  • Single source of truth for node remediation status
  • State transition validation with metrics for unexpected flows
  • Terminal state detection (drain-failed, remediation-failed, remediation-succeeded)
  • Cancellation support (label removal from any state without validation)
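
Cancellation amounts to removing the state label from the node. How the library performs the removal is not described here; the sketch below only shows one common Kubernetes idiom, a JSON merge patch that sets the label to null. The helper name removeLabelPatch is hypothetical, and sending the patch to the API server is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stateLabelKey is the label key named in the overview.
const stateLabelKey = "dgxc.nvidia.com/nvsentinel-state"

// removeLabelPatch is a hypothetical helper (not part of the library's
// documented API). It builds a JSON merge patch that deletes one label:
// in a merge patch, a null value removes the key.
func removeLabelPatch(key string) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"labels": map[string]any{key: nil},
		},
	}
	return json.Marshal(patch)
}

func main() {
	patch, err := removeLabelPatch(stateLabelKey)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

Because cancellation skips validation, the same patch applies regardless of the label's current value.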

State Machine

             ┌──────────────────┐
             │    [NO LABEL]    │  Healthy node
             └────────┬─────────┘
                      │ Fault detected
                      ▼
             ┌──────────────────┐
             │   quarantined    ├──────────────────────┐
             └────────┬─────────┘                      │
                      │                                │
                      │ Start drain                    │ No pods to drain
                      ▼                                │
             ┌──────────────────┐                      │
Healthy ◄────┤     draining     │                      │
event        └────────┬─────────┘                      │
(cancel)              │                                │
    │                 │ Drain completed                │
    │        ┌────────┴────────┐                       │
    │        │                 │                       │
    │        ▼                 ▼                       │
    │ ┌───────────────┐ ┌────────────────┐             │
    │ │ drain-failed  │ │drain-succeeded │◄────────────┘
    │ └───────────────┘ └──────┬─────────┘
    │    [TERMINAL]            │
    │                          │ Start remediation
    │                          ▼
    │                   ┌──────────────┐   Note: fault-remediation
    │                   │ remediating  │   only consumes drain-succeeded.
    │                   └──────┬───────┘   drain-failed is a terminal state.
    │                          │
    │                 ┌────────┴────────┐
    │                 │                 │
    │                 ▼                 ▼
    │           ┌─────────────┐ ┌──────────────┐
    │           │ remediation-│ │ remediation- │ [TERMINAL]
    │           │  succeeded  │ │    failed    │
    │           └─────┬───────┘ └──────────────┘
    │                 │
    │                 │ Healthy event
    ▼                 ▼
┌──────────────────────────────┐
│          [NO LABEL]          │
└──────────────────────────────┘

State Label Values

| State | Applied By | Description | Terminal |
|---|---|---|---|
| (no label) | Any | Healthy node, no active fault handling | No |
| quarantined | fault-quarantine | Fault detected, node cordoned/tainted | No |
| draining | node-drainer | Workload evacuation in progress | No |
| drain-succeeded | node-drainer | All workloads evacuated successfully | No |
| drain-failed | node-drainer | Workload evacuation failed | Yes |
| remediating | fault-remediation | Remediation action in progress | No |
| remediation-succeeded | fault-remediation | Remediation completed successfully | Yes* |
| remediation-failed | fault-remediation | Remediation action failed | Yes |

*Terminal until healthy event triggers label removal
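
The label values and the terminal column can be sketched in Go as below. Only the label key and the state strings come from this page; the State type, the constant names, and IsTerminal are illustrative assumptions and may not match what commons/pkg/statemanager actually exports.

```go
package main

import "fmt"

// StateLabelKey is the label key named in the overview.
const StateLabelKey = "dgxc.nvidia.com/nvsentinel-state"

// State is a hypothetical type for the label value.
// The empty string stands for "no label".
type State string

// Constant names are assumptions; the string values are from the table.
const (
	StateNone                 State = ""
	StateQuarantined          State = "quarantined"
	StateDraining             State = "draining"
	StateDrainSucceeded       State = "drain-succeeded"
	StateDrainFailed          State = "drain-failed"
	StateRemediating          State = "remediating"
	StateRemediationSucceeded State = "remediation-succeeded"
	StateRemediationFailed    State = "remediation-failed"
)

// IsTerminal reports whether the table marks s as terminal.
// remediation-succeeded counts as terminal until a healthy event
// removes the label.
func IsTerminal(s State) bool {
	switch s {
	case StateDrainFailed, StateRemediationSucceeded, StateRemediationFailed:
		return true
	}
	return false
}

func main() {
	for _, s := range []State{StateQuarantined, StateDrainFailed, StateRemediationSucceeded} {
		fmt.Printf("%s=%s terminal=%t\n", StateLabelKey, s, IsTerminal(s))
	}
}
```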

State Transitions

| From State | To State | Trigger |
|---|---|---|
| (no label) | quarantined | Fault detected |
| quarantined | draining | Drain initiated |
| quarantined | drain-succeeded | No pods to drain |
| draining | drain-succeeded | All pods evacuated |
| draining | drain-failed | Evacuation timeout/failure |
| drain-succeeded | remediating | Remediation initiated |
| remediating | remediation-succeeded | Remediation completed |
| remediating | remediation-failed | Remediation error |
| (any state) | (no label) | Healthy event (cancellation) |
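
A minimal sketch of how the table could drive validation, assuming a plain lookup map. ValidateTransition and the unexpected counter are hypothetical names; the counter stands in for whatever metric the library records. This sketch returns an error for an unexpected transition, whereas the real library may only record the metric and let the transition proceed.

```go
package main

import "fmt"

// allowed maps each state to its permitted successors, copied from the
// transition table. The empty string stands for "no label".
var allowed = map[string][]string{
	"":                {"quarantined"},
	"quarantined":     {"draining", "drain-succeeded"},
	"draining":        {"drain-succeeded", "drain-failed"},
	"drain-succeeded": {"remediating"},
	"remediating":     {"remediation-succeeded", "remediation-failed"},
}

// unexpected counts transitions missing from the table.
// It is a stand-in for a real metric (an assumption of this sketch).
var unexpected int

// ValidateTransition is a hypothetical helper. Label removal (to == "")
// is accepted from any state, mirroring the cancellation rule.
func ValidateTransition(from, to string) error {
	if to == "" {
		return nil
	}
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	unexpected++
	return fmt.Errorf("unexpected transition %q -> %q", from, to)
}

func main() {
	fmt.Println(ValidateTransition("quarantined", "drain-succeeded")) // in the table
	fmt.Println(ValidateTransition("drain-failed", "remediating"))    // not in the table
	fmt.Println(ValidateTransition("draining", ""))                   // cancellation
	fmt.Println("unexpected transitions:", unexpected)
}
```

Terminal states have no entry in the map, so every transition out of them other than label removal is reported as unexpected.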

Example Flows

Successful remediation:

none → quarantined → draining → drain-succeeded → remediating → remediation-succeeded → none

No pods to drain:

none → quarantined → drain-succeeded → remediating → remediation-succeeded → none

Failed drain:

none → quarantined → draining → drain-failed [TERMINAL]

Canceled drain (healthy event):

none → quarantined → draining → none

Failed remediation:

none → quarantined → draining → drain-succeeded → remediating → remediation-failed [TERMINAL]