Rack State Machine interaction with Machine, Switch, and Power Shelf
This document defines the combined state machines for Machine (each compute tray / managed host lifecycle), Switch (each NVLink switch), Power Shelf (each power shelf), and Rack (collection of machines, switches, and power shelves). The diagram below shows all four and the transitions between the Rack state machine and the child device state machines.
For switch-only and power-shelf-only diagrams, see Switch State Diagram and Power Shelf State Diagram. For on-demand rack maintenance API details, see On-Demand Rack Maintenance.
Combined State Diagram (Machine, Switch, Power Shelf, Rack)
Rack State Machine Flow
The Rack state machine represents a collection of machines, switches, and power shelves. The rack lifecycle runs in coordination with child device state machines: the rack tracks when its children are created and ready, drives maintenance (firmware upgrade, NVOS update, NMX cluster configuration, optional power sequencing), validates the rack via RVS, and reaches Ready when validation completes and all trays are healthy.
High-Level Rack Diagram
Rack State Definitions
Created (R_Created)
- Entry: Site operator enters the expected rack; Site Explorer creates the rack entity when a machine, switch, or power shelf with the same rack_id is discovered.
- Exit: When expected devices are present and discovery can proceed, the rack moves to Discovering.
Discovering (R_Discovering)
- Entry: From Created when discovery starts. Also re-entered from Ready when topology_changed is set (tray replacement).
- Exit: When every machine in the rack is ManagedHostState::Ready or Assigned, every switch is SwitchControllerState::Ready, and every power shelf is Ready per the rack profile capability counts, the rack transitions to Maintenance at FirmwareUpgrade(Start).
The rack waits until all child devices reach ready before starting the first maintenance cycle.
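The exit condition above can be sketched as a single predicate. This is a minimal illustration, not the rack controller's code: the enums are simplified stand-ins for ManagedHostState and SwitchControllerState, and ExpectedCounts and all_children_ready are hypothetical names. It also assumes that "per the rack profile capability counts" means the number of discovered devices must equal the expected number.

```rust
// Hypothetical, simplified stand-ins for the real child state types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MachineState { Created, Ready, Assigned }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SwitchState { Initializing, Ready }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PowerShelfState { Initializing, Ready }

/// Hypothetical view of the device counts taken from the rack profile.
pub struct ExpectedCounts {
    pub machines: usize,
    pub switches: usize,
    pub power_shelves: usize,
}

/// Returns true when the rack may leave Discovering for Maintenance.
/// Assumption: every expected device must be present and ready.
pub fn all_children_ready(
    expected: &ExpectedCounts,
    machines: &[MachineState],
    switches: &[SwitchState],
    power_shelves: &[PowerShelfState],
) -> bool {
    // Machines count as ready in either Ready or Assigned.
    let machines_ok = machines.len() == expected.machines
        && machines
            .iter()
            .all(|m| matches!(m, MachineState::Ready | MachineState::Assigned));
    let switches_ok = switches.len() == expected.switches
        && switches.iter().all(|s| *s == SwitchState::Ready);
    let shelves_ok = power_shelves.len() == expected.power_shelves
        && power_shelves.iter().all(|p| *p == PowerShelfState::Ready);
    machines_ok && switches_ok && shelves_ok
}
```

Under these assumptions, a rack with one machine still in Created, or with a tray missing from the expected count, stays in Discovering.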
Maintenance (R_Maintenance)
- Entry: From Discovering after all children are ready, from Ready when reprovision_requested or maintenance_requested is set, or from Error when on-demand maintenance is requested.
- Exit:
  - To Validating (Pending) when maintenance reaches Completed.
  - To Error on timeout or unrecoverable failure.
Sub-state flow (activities may be skipped based on MaintenanceScope.activities):
ConfigureNmxCluster sub-states:
During ConfigureCertificates, the rack configures ScaleUpFabric mTLS services on the primary switch via component manager / RMS. During WaitForFabricStatus, the rack polls fabric manager status and persists per-switch fabric_manager_status while switches wait in ReProvisioning::WaitingForNMXCConfigure.
Validating (R_Validating)
- Entry: From Maintenance(Completed).
- Exit:
  - To Ready when validation reaches Validated and all child devices are ready.
  - To Error on terminal validation failure.
Sub-states: Pending → InProgress → Partial / FailedPartial → Validated or Failed. RVS drives transitions by writing rv.run-id and partition result labels on rack machines.
Ready (R_Ready)
- Entry: From Validating when validation completes and every tray is healthy.
- Exit:
  - To Maintenance when reprovision_requested or maintenance_requested is set.
  - To Discovering when topology_changed is set (tray replacement).
  - To Error when any child switch, power shelf, or machine enters a terminal failure state.
The rack is fully operational. While ready, it monitors child health and accepts reprovisioning or on-demand maintenance requests.
Error (R_Error)
- Entry: From Maintenance, Validating, or Ready on failure.
- Exit:
  - To Ready when all child devices are healthy again.
  - To Maintenance when on-demand maintenance is requested from error state.
Deleting (R_Deleting)
- Entry: When the rack is marked deleted.
- Exit: Terminal delete.
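The entry and exit rules above can be condensed into one transition function. The sketch below is illustrative only: RackInputs and next_state are hypothetical names, the real RackState carries sub-states that are omitted here, and the precedence between conditions that hold at the same time (for example, a child failure and a maintenance request while Ready) is an assumption, since the definitions above do not specify it.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RackState { Created, Discovering, Maintenance, Validating, Ready, Error, Deleting }

/// Hypothetical snapshot of the conditions the rack handler evaluates.
#[derive(Debug, Clone, Copy, Default)]
pub struct RackInputs {
    pub deleted: bool,
    pub discovery_can_proceed: bool,
    pub all_children_ready: bool,
    pub all_children_healthy: bool,
    pub maintenance_completed: bool,
    pub maintenance_failed: bool, // timeout or unrecoverable failure
    pub validated: bool,
    pub validation_failed: bool,
    pub reprovision_requested: bool,
    pub maintenance_requested: bool,
    pub topology_changed: bool,
    pub child_terminal_failure: bool,
}

/// Top-level rack transitions only; sub-states are not modeled.
pub fn next_state(current: RackState, i: &RackInputs) -> RackState {
    use RackState::*;
    // Assumption: a rack marked deleted moves to Deleting from any state.
    if i.deleted {
        return Deleting;
    }
    match current {
        Created if i.discovery_can_proceed => Discovering,
        Discovering if i.all_children_ready => Maintenance,
        Maintenance if i.maintenance_failed => Error,
        Maintenance if i.maintenance_completed => Validating,
        Validating if i.validation_failed => Error,
        Validating if i.validated && i.all_children_ready => Ready,
        // Assumed precedence while Ready: failure, then topology, then maintenance.
        Ready if i.child_terminal_failure => Error,
        Ready if i.topology_changed => Discovering,
        Ready if i.reprovision_requested || i.maintenance_requested => Maintenance,
        Error if i.maintenance_requested => Maintenance,
        Error if i.all_children_healthy => Ready,
        // No condition met: stay in the current state.
        other => other,
    }
}
```

Keeping the function pure, with state and inputs in and the next state out, makes each arrow in the definitions checkable with a one-line assertion.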
Switch Interaction with Rack
The Rack state machine drives or observes the Switch state machine as follows:
These cross-state dependencies are shown in the Combined State Diagram.
Switch State Machine Flow (summary)
The Switch state machine runs on each switch. The lifecycle runs from creation through initialization, certificate configuration, password rotation, slot/tray fetch, validation, BOM validation, and Ready. From Ready a switch can enter operator Maintenance, rack-driven ReProvisioning, Deleting, or Error.
Bring-up flow:
Rack reprovisioning flow (when continue_after_firmware_upgrade is true):
See Switch State Diagram for the full switch FSM.
Machine Interaction with Rack
The Rack state machine drives or observes the Machine (compute) state machine as follows:
These cross-state dependencies are shown in the Combined State Diagram.
Power Shelf Interaction with Rack
See Power Shelf State Diagram for the power-shelf FSM.
Recovering from M_HostReprovision::FailedFirmwareUpgrade
A compute machine that fails its host firmware upgrade lands in the
M_HostReprovision::FailedFirmwareUpgrade substate. There are two ways out:
- Automatic retry. While retry_count < MAX_FIRMWARE_UPGRADE_RETRIES and the configured host_firmware_upgrade_retry_interval has elapsed since the failure, the machine state handler automatically transitions back to CheckingFirmwareV2 and re-attempts the upgrade.
- Fresh Host Reprovision request. The Rack state machine (or an operator via trigger_host_reprovisioning) can issue a brand-new Host Reprovision request at any time. The new request overwrites host_reprovisioning_requested with started_at = None; the FailedFirmwareUpgrade handler detects this fresh request and restarts the upgrade flow from CheckingFirmwareV2 with retry_count reset to 0, mirroring the way ManagedHostState::Ready kicks off a Host Reprovision (including the host-fw-update health-report alert merge). Rack-level requests (initiator prefixed with rack-) instead enter WaitingForRackFirmwareUpgrade.
This guarantees the Rack can always drive a stuck compute machine out of
FailedFirmwareUpgrade without waiting for the retry backoff, which is
important during R_Maintenance, where the Rack must converge all computes back
to M_Ready before progressing to R_Validating.
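The two recovery paths can be sketched as one decision function. This is a hedged illustration rather than the machine state handler itself: ReprovisionRequest, NextStep, and next_step are hypothetical names, the value of MAX_FIRMWARE_UPGRADE_RETRIES is a placeholder, started_at is modeled as an optional timestamp in seconds, and incrementing retry_count on an automatic retry is an assumption the text above does not state.

```rust
use std::time::Duration;

/// Placeholder value; the real constant is defined by the machine controller.
pub const MAX_FIRMWARE_UPGRADE_RETRIES: u32 = 3;

/// Hypothetical view of host_reprovisioning_requested.
pub struct ReprovisionRequest {
    pub initiator: String,
    pub started_at: Option<u64>, // None marks a fresh, not-yet-started request
}

#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    StayFailed,
    CheckingFirmwareV2 { retry_count: u32 },
    WaitingForRackFirmwareUpgrade,
}

/// Decide how a machine leaves FailedFirmwareUpgrade.
pub fn next_step(
    retry_count: u32,
    since_failure: Duration,
    retry_interval: Duration,
    request: Option<&ReprovisionRequest>,
) -> NextStep {
    // A fresh request wins regardless of retry backoff or retry budget.
    if let Some(req) = request {
        if req.started_at.is_none() {
            return if req.initiator.starts_with("rack-") {
                NextStep::WaitingForRackFirmwareUpgrade
            } else {
                NextStep::CheckingFirmwareV2 { retry_count: 0 }
            };
        }
    }
    // Automatic retry: budget remaining and the interval has elapsed.
    if retry_count < MAX_FIRMWARE_UPGRADE_RETRIES && since_failure >= retry_interval {
        // Assumption: each automatic retry increments the counter.
        return NextStep::CheckingFirmwareV2 { retry_count: retry_count + 1 };
    }
    NextStep::StayFailed
}
```

Checking for the fresh request first is what lets a rack-initiated request bypass the backoff, even when the retry budget is already exhausted.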
Tray Replacement (External Event)
When a tray (compute machine) in a rack is physically replaced, the rack topology changes. This is an external event that triggers the following state machine transitions:
- Old machine: The replaced machine is removed from the rack. Its Machine state machine terminates (deletion path).
- New machine: Site Explorer detects the new tray and creates a new machine entity in M_Created. The new machine progresses through its ingestion states toward M_Ready.
- Rack: The rack detects the topology change (topology_changed) and transitions from R_Ready → R_Discovering. In R_Discovering the rack waits until the new machine reaches M_Ready (and all other machines, switches, and power shelves remain ready), then proceeds through R_Maintenance → R_Validating → R_Ready as in the normal flow.
This ensures that any replaced hardware is fully discovered, provisioned, and validated before the rack returns to an operational ready state.
How the data is organized
A rack is the top-level entity. Every machine (compute tray), every switch, and every power shelf belongs to exactly one rack.
- Each rack has a unique identifier (the rack ID).
- Each child device stores the rack ID of the rack it belongs to.
- Each switch has switch_reprovisioning_requested and per-cycle status fields (firmware_upgrade_status, nvos_update_status, fabric_manager_status) that the rack state machine sets during maintenance.
When a site operator enters an expected rack (with a rack ID and rack type), Site Explorer creates the rack entity as soon as it discovers machines, switches, or power shelves that share the same rack ID. From that point the rack tracks its children through their respective state machines until the entire rack reaches a ready state.
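The ownership model described here (every child stores the ID of exactly one rack) can be sketched with a few plain types. All names below (RackId, DeviceKind, ChildDevice, SwitchMaintenanceFields, children_by_rack) and all field types are hypothetical; the actual definitions live in the api-model crate and are likely richer.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct RackId(pub String);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum DeviceKind { Machine, Switch, PowerShelf }

/// Each child device stores the ID of the one rack it belongs to.
#[derive(Debug, Clone)]
pub struct ChildDevice {
    pub id: String,
    pub kind: DeviceKind,
    pub rack_id: RackId,
}

/// Per-switch fields the rack state machine sets during maintenance.
/// Field types are assumptions; only the field names come from this document.
#[derive(Debug, Clone, Default)]
pub struct SwitchMaintenanceFields {
    pub switch_reprovisioning_requested: bool,
    pub firmware_upgrade_status: Option<String>,
    pub nvos_update_status: Option<String>,
    pub fabric_manager_status: Option<String>,
}

/// Count the devices of each kind per rack, keyed by the stored rack ID.
pub fn children_by_rack(
    devices: &[ChildDevice],
) -> HashMap<RackId, HashMap<DeviceKind, usize>> {
    let mut out: HashMap<RackId, HashMap<DeviceKind, usize>> = HashMap::new();
    for d in devices {
        *out.entry(d.rack_id.clone())
            .or_default()
            .entry(d.kind)
            .or_insert(0) += 1;
    }
    out
}
```

Because membership is stored on the child, a rack's children are found by filtering or grouping on rack_id, as children_by_rack does, rather than by walking a list held on the rack.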
Implementation
- Rack state type: RackState in crates/api-model/src/rack.rs.
- Rack handlers: crates/rack-controller/src/.
- Switch state type: SwitchControllerState in crates/api-model/src/switch/mod.rs.
- Switch handlers: crates/switch-controller/src/.
- On-demand maintenance API: On-Demand Rack Maintenance.