# State Manager

## Overview

The State Manager is a Go library (`commons/pkg/statemanager`) that manages the `dgxc.nvidia.com/nvsentinel-state` node label lifecycle across NVSentinel modules. It provides a state machine implementation with transition validation and observability for coordinating Fault Quarantine, Node Drainer, and Fault Remediation operations.

## Purpose

The State Manager coordinates node lifecycle state across three modules that operate on the same node:
- **fault-quarantine**: Detects faults, applies `quarantined` state
- **node-drainer**: Evacuates workloads, applies `draining` → `drain-succeeded` or `drain-failed`
- **fault-remediation**: Executes recovery, applies `remediating` → `remediation-succeeded` or `remediation-failed`

It provides:
- Single source of truth for node remediation status
- State transition validation with metrics for unexpected flows
- Terminal state detection (`drain-failed`, `remediation-failed`, `remediation-succeeded`)
- Cancellation support (label removal from any state without validation)

## State Machine

```
                ┌──────────────────┐
                │   [NO LABEL]     │  Healthy node
                └────────┬─────────┘
                         │
                         │ Fault detected
                         ▼
                ┌──────────────────┐
                │   quarantined    ├──────────────────────┐
                └────────┬─────────┘                      │
                         │                                │
                         │ Start drain                    │ No pods to drain
                         ▼                                │
                ┌──────────────────┐                      │
   Healthy ◄────┤     draining     │                      │
   event        └────────┬─────────┘                      │
   (cancel)              │                                │
      │                  │ Drain completed                │
      │         ┌────────┴────────┐                       │
      │         │                 │                       │
      │         ▼                 ▼                       │
      │  ┌───────────────┐  ┌────────────────┐            │
      │  │drain-failed   │  │drain-succeeded │◄───────────┘
      │  └───────────────┘  └──────┬─────────┘
      │    [TERMINAL]              │
      │                            │ Start remediation
      │                            ▼
      │                     ┌──────────────┐  Note: fault-remediation
      │                     │ remediating  │  only consumes drain-succeeded.
      │                     └──────┬───────┘  drain-failed is a terminal state.
      │                            │
      │                   ┌────────┴────────┐
      │                   │                 │
      │                   ▼                 ▼
      │            ┌─────────────┐  ┌──────────────┐
      │            │ remediation-│  │ remediation- │ [TERMINAL]
      │            │ succeeded   │  │   failed     │
      │            └─────┬───────┘  └──────────────┘
      │                  │
      │                  │ Healthy event
      ▼                  ▼
┌──────────────────────────────┐
│        [NO LABEL]            │
└──────────────────────────────┘
```

## State Label Values

| State                      | Applied By           | Description                            | Terminal |
|:---------------------------|:---------------------|:---------------------------------------|:---------|
| (no label)                 | Any                  | Healthy node, no active fault handling | No       |
| `quarantined`              | fault-quarantine     | Fault detected, node cordoned/tainted  | No       |
| `draining`                 | node-drainer         | Workload evacuation in progress        | No       |
| `drain-succeeded`          | node-drainer         | All workloads evacuated successfully   | No       |
| `drain-failed`             | node-drainer         | Workload evacuation failed             | Yes      |
| `remediating`              | fault-remediation    | Remediation action in progress         | No       |
| `remediation-succeeded`    | fault-remediation    | Remediation completed successfully     | Yes*     |
| `remediation-failed`       | fault-remediation    | Remediation action failed              | Yes      |

*Terminal until a healthy event triggers label removal.
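
The label values and terminal flags in the table above can be sketched in Go as follows. This is a minimal illustration, not the actual `commons/pkg/statemanager` API: the `State` type, the constant names, `IsTerminal`, and `StateFromLabels` are all hypothetical names chosen for this example. Only the label key and the label values come from this document.

```go
package main

import "fmt"

// StateLabelKey is the node label key named in this document.
const StateLabelKey = "dgxc.nvidia.com/nvsentinel-state"

// State is a hypothetical type for the label value; the real library
// may model states differently.
type State string

// Label values taken from the State Label Values table.
// The constant names are assumptions made for this sketch.
const (
	StateNone                 State = "" // no label present (healthy node)
	StateQuarantined          State = "quarantined"
	StateDraining             State = "draining"
	StateDrainSucceeded       State = "drain-succeeded"
	StateDrainFailed          State = "drain-failed"
	StateRemediating          State = "remediating"
	StateRemediationSucceeded State = "remediation-succeeded"
	StateRemediationFailed    State = "remediation-failed"
)

// IsTerminal reports whether the table marks a state as terminal.
// remediation-succeeded counts as terminal here, although a healthy
// event can still remove the label afterwards.
func IsTerminal(s State) bool {
	switch s {
	case StateDrainFailed, StateRemediationFailed, StateRemediationSucceeded:
		return true
	}
	return false
}

// StateFromLabels reads the state from a node's label map.
// A missing label yields StateNone.
func StateFromLabels(labels map[string]string) State {
	return State(labels[StateLabelKey])
}

func main() {
	labels := map[string]string{StateLabelKey: "drain-failed"}
	s := StateFromLabels(labels)
	fmt.Printf("state=%q terminal=%t\n", s, IsTerminal(s))
}
```

Representing "no label" as the empty string keeps the lookup simple, because reading a missing key from a Go map returns the zero value.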

## State Transitions

| From State            | To State                   | Trigger                     |
|:----------------------|:---------------------------|:----------------------------|
| (no label)            | `quarantined`              | Fault detected              |
| `quarantined`         | `draining`                 | Drain initiated             |
| `quarantined`         | `drain-succeeded`          | No pods to drain            |
| `draining`            | `drain-succeeded`          | All pods evacuated          |
| `draining`            | `drain-failed`             | Evacuation timeout/failure  |
| `drain-succeeded`     | `remediating`              | Remediation initiated       |
| `remediating`         | `remediation-succeeded`    | Remediation completed       |
| `remediating`         | `remediation-failed`       | Remediation error           |
| (any state)           | (no label)                 | Healthy event (cancellation) |
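
The transition table above can be encoded as a lookup map, as in the sketch below. It is an illustration under stated assumptions, not the library's implementation: `validTransitions`, `IsExpectedTransition`, `Observe`, and the plain integer counter standing in for a metric are all hypothetical. This document does not say whether the real library rejects unexpected transitions or only records them, so the sketch only counts them.

```go
package main

import "fmt"

// noLabel is a hypothetical representation of "(no label)".
const noLabel = ""

// validTransitions mirrors the State Transitions table:
// each key lists the states that may follow it.
var validTransitions = map[string][]string{
	noLabel:           {"quarantined"},
	"quarantined":     {"draining", "drain-succeeded"},
	"draining":        {"drain-succeeded", "drain-failed"},
	"drain-succeeded": {"remediating"},
	"remediating":     {"remediation-succeeded", "remediation-failed"},
}

// unexpectedTransitions stands in for a metrics counter (assumption).
var unexpectedTransitions int

// IsExpectedTransition reports whether from -> to appears in the table.
// Label removal (to == noLabel) is allowed from any state without
// validation, matching the cancellation row.
func IsExpectedTransition(from, to string) bool {
	if to == noLabel {
		return true
	}
	for _, next := range validTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// Observe validates a transition and counts it if it is unexpected.
func Observe(from, to string) bool {
	ok := IsExpectedTransition(from, to)
	if !ok {
		unexpectedTransitions++
	}
	return ok
}

func main() {
	// The "no pods to drain" flow, followed by label removal.
	flow := []string{noLabel, "quarantined", "drain-succeeded",
		"remediating", "remediation-succeeded", noLabel}
	for i := 1; i < len(flow); i++ {
		fmt.Printf("%q -> %q expected=%t\n",
			flow[i-1], flow[i], Observe(flow[i-1], flow[i]))
	}
	// A transition that skips draining and remediation start is counted.
	Observe("quarantined", "remediation-succeeded")
	fmt.Println("unexpected transitions:", unexpectedTransitions)
}
```

Terminal states such as `drain-failed` have no entry in the map, so the only transition this sketch accepts out of them is label removal.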

### Example Flows

**Successful remediation:**
```
none → quarantined → draining → drain-succeeded → remediating → remediation-succeeded → none
```

**No pods to drain:**
```
none → quarantined → drain-succeeded → remediating → remediation-succeeded → none
```

**Failed drain:**
```
none → quarantined → draining → drain-failed [TERMINAL]
```

**Canceled drain (healthy event):**
```
none → quarantined → draining → none
```

**Failed remediation:**
```
none → quarantined → draining → drain-succeeded → remediating → remediation-failed [TERMINAL]
```