On-Demand Rack Maintenance
On-demand maintenance allows an operator to trigger a maintenance cycle on a rack that is in the Ready or Error state. It supports both full-rack and partial-rack scoping — the caller can optionally specify which machines, switches, or power shelves to maintain, and which maintenance activities to perform.
Scope: Full Rack vs Partial Rack
The maintenance request carries an optional MaintenanceScope that specifies which devices to include:
When no device IDs are specified (all three lists empty), the scope is treated as full rack — identical to the existing reprovision_requested behavior.
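The full-rack rule can be sketched as a small predicate. This is a minimal illustration rather than the real message definition: the struct layout and the field names `switch_ids` and `power_shelf_ids` are assumptions (only `machine_ids` appears elsewhere in this document), and the helper methods are hypothetical.

```rust
/// Hypothetical mirror of the MaintenanceScope message.
/// Field names other than `machine_ids` are assumed for illustration.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct MaintenanceScope {
    pub machine_ids: Vec<String>,
    pub switch_ids: Vec<String>,
    pub power_shelf_ids: Vec<String>,
}

impl MaintenanceScope {
    /// The scope is full rack when all three device-ID lists are empty.
    pub fn is_full_rack(&self) -> bool {
        self.machine_ids.is_empty()
            && self.switch_ids.is_empty()
            && self.power_shelf_ids.is_empty()
    }

    /// A machine is in scope if the scope is full rack or the ID is listed.
    pub fn includes_machine(&self, id: &str) -> bool {
        self.is_full_rack() || self.machine_ids.iter().any(|m| m == id)
    }
}
```

With this shape, a default (all-empty) scope covers every machine, while a scope that lists only `machine-001` excludes `machine-002`.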
Activities
The request also carries an optional list of maintenance activities to perform. When the list is empty, all activities are performed (the default). Available activities:
Each activity is represented as a MaintenanceActivityConfig message with a oneof activity field. The oneof discriminant identifies the activity type, and each variant carries only the configuration fields relevant to that activity.
Activities that are not in the list are skipped during the maintenance cycle. The state machine always starts at FirmwareUpgrade(Start) (to consume the scope), but immediately advances to the next requested activity if firmware upgrade was not requested.
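One way to picture the skipping behavior is a lookup over a fixed activity order. The sketch below is an assumption-laden illustration: only `FirmwareUpgrade` is named in this document, so the other variants are placeholders, and the `ORDER` constant and both helper functions are hypothetical.

```rust
/// Only FirmwareUpgrade is taken from the text; the other two variants
/// are placeholders standing in for the remaining activities.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Activity {
    FirmwareUpgrade,
    PlaceholderSecond,
    PlaceholderThird,
}

/// Assumed fixed order in which the state machine visits activities.
pub const ORDER: [Activity; 3] = [
    Activity::FirmwareUpgrade,
    Activity::PlaceholderSecond,
    Activity::PlaceholderThird,
];

/// An empty request list means every activity is performed.
pub fn is_requested(requested: &[Activity], activity: Activity) -> bool {
    requested.is_empty() || requested.contains(&activity)
}

/// Returns the first requested activity at or after index `from` in ORDER,
/// or None when nothing further was requested.
pub fn next_requested(requested: &[Activity], from: usize) -> Option<Activity> {
    ORDER
        .iter()
        .copied()
        .skip(from)
        .find(|a| is_requested(requested, *a))
}
```

Starting at index 0 models entering at FirmwareUpgrade(Start): if firmware upgrade was not requested, the lookup falls through to the next requested activity.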
Firmware Version Resolution
The firmware used during the upgrade depends on how the maintenance was triggered:
When a firmware_version is supplied through the CLI, the maintenance handler resolves it by looking up the firmware record by ID (db_rack_firmware::find_by_id). If the ID is not found, the rack transitions to Error. If the firmware exists but is not marked as available, the firmware upgrade is skipped. When no version is specified (or the maintenance was triggered through ingestion/reprovision), the default firmware for the rack’s hardware type is used as before.
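The resolution rules above reduce to a three-way decision. The following sketch assumes a simplified firmware record and passes the lookup in as a closure standing in for db_rack_firmware::find_by_id; the `RackFirmware` and `FirmwareDecision` types and the `resolve_firmware` function are hypothetical names, not the handler's actual API.

```rust
/// Simplified stand-in for a rack firmware record (fields are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct RackFirmware {
    pub id: String,
    pub available: bool,
}

/// Outcome of resolving the firmware for a maintenance cycle.
#[derive(Debug, PartialEq)]
pub enum FirmwareDecision {
    /// Run the upgrade with this firmware.
    Use(RackFirmware),
    /// Firmware exists but is not marked available: skip the upgrade.
    SkipUpgrade,
    /// Requested ID was not found: the rack transitions to Error.
    RackError(String),
}

/// `find_by_id` stands in for the database lookup; `default_for_hardware`
/// is the default firmware for the rack's hardware type.
pub fn resolve_firmware<F>(
    requested: Option<&str>,
    find_by_id: F,
    default_for_hardware: RackFirmware,
) -> FirmwareDecision
where
    F: Fn(&str) -> Option<RackFirmware>,
{
    match requested {
        // No explicit version: fall back to the hardware-type default.
        None => FirmwareDecision::Use(default_for_hardware),
        Some(id) => match find_by_id(id) {
            None => FirmwareDecision::RackError(format!("rack firmware {id} not found")),
            Some(fw) if !fw.available => FirmwareDecision::SkipUpgrade,
            Some(fw) => FirmwareDecision::Use(fw),
        },
    }
}
```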
Flow
- The caller invokes the OnDemandRackMaintenance gRPC method with a rack_id and optional device-ID lists.
- The handler validates that the rack is in Ready or Error state and no maintenance is already pending.
- It writes a MaintenanceScope to RackConfig.maintenance_requested and persists the config.
- On the next state-handler tick, handle_ready detects maintenance_requested and transitions the rack to Maintenance(FirmwareUpgrade(Start)).
- The start_firmware_upgrade function consumes the scope. If a firmware_version was specified, it resolves the firmware by ID; otherwise it uses the default firmware for the rack hardware type. It then filters device reprovisioning and firmware-upgrade operations to only the specified devices (or all devices if the scope is full-rack).
- After maintenance completes, the rack flows through Validating back to Ready as usual.
gRPC API
Service method (in Forge service):
Messages:
Component Manager Integration
The UpdateComponentFirmware gRPC endpoint provides a higher-level interface for firmware updates. For compute trays and switches, it delegates to the rack state machine by internally calling on_demand_rack_maintenance, rather than managing firmware directly.
How it works
When update_component_firmware receives a request targeting compute trays or switches:
- Resolve the rack: looks up the first device (machine or switch) in the database to find its rack_id.
- Build a maintenance request: constructs a RackMaintenanceOnDemandRequest with:
  - The resolved rack_id
  - The machine IDs and/or switch IDs from the request
  - A FirmwareUpgrade activity carrying the target_version as firmware_version
- Delegate: calls on_demand_rack_maintenance, which writes the MaintenanceScope to the rack config and lets the rack state machine handle the actual firmware upgrade.
- Return success: once the maintenance is scheduled, returns a success ComponentResult for each device.
Power shelves continue to use the component manager backend directly (they do not go through the rack state machine).
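The delegation steps can be sketched as a function that turns a firmware-update target into a maintenance request. Everything here is illustrative: the `ComponentTarget` enum (apart from the `ComputeTrays { machine_ids }` shape used in this document), the flattened request struct, the `build_request` function, and the `rack_of` closure that stands in for the database lookup are all assumptions.

```rust
/// Assumed shape of the firmware-update target. Only the ComputeTrays
/// variant is taken from the text; the others are illustrative.
#[derive(Debug, Clone)]
pub enum ComponentTarget {
    ComputeTrays { machine_ids: Vec<String> },
    Switches { switch_ids: Vec<String> },
    PowerShelves { power_shelf_ids: Vec<String> },
}

/// Flattened stand-in for RackMaintenanceOnDemandRequest.
#[derive(Debug, Clone, PartialEq)]
pub struct RackMaintenanceOnDemandRequest {
    pub rack_id: String,
    pub machine_ids: Vec<String>,
    pub switch_ids: Vec<String>,
    /// Carried by the FirmwareUpgrade activity in the real message.
    pub firmware_version: Option<String>,
}

/// Returns None for power shelves (handled by the component manager
/// backend directly) or when the first device's rack cannot be resolved.
pub fn build_request<F>(
    target: &ComponentTarget,
    target_version: &str,
    rack_of: F,
) -> Option<RackMaintenanceOnDemandRequest>
where
    F: Fn(&str) -> Option<String>,
{
    let (machine_ids, switch_ids) = match target {
        ComponentTarget::ComputeTrays { machine_ids } => (machine_ids.clone(), Vec::new()),
        ComponentTarget::Switches { switch_ids } => (Vec::new(), switch_ids.clone()),
        ComponentTarget::PowerShelves { .. } => return None,
    };
    // Resolve the rack from the first device in the request.
    let first = machine_ids.first().or_else(|| switch_ids.first())?;
    let rack_id = rack_of(first.as_str())?;
    Some(RackMaintenanceOnDemandRequest {
        rack_id,
        machine_ids,
        switch_ids,
        firmware_version: Some(target_version.to_string()),
    })
}
```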
Example: CLI triggers compute tray firmware upgrade
This results in:
- UpdateComponentFirmware is called with ComputeTrays { machine_ids: [machine-001, machine-002] } and target_version: "fw-v2.1".
- Machine machine-001 is looked up to discover rack_id = rack-42.
- OnDemandRackMaintenance is called with rack_id = rack-42, machine_ids = [machine-001, machine-002], and FirmwareUpgradeActivity { firmware_version: "fw-v2.1" }.
- The rack state machine picks up the request, resolves firmware fw-v2.1 from the rack_firmware table, and runs the firmware upgrade via RMS for the specified machines.
Preconditions
The gRPC handler rejects the request with an error if:
- The rack is not in Ready or Error state; maintenance can only be triggered from these two states.
- A maintenance request is already pending (maintenance_requested is already set).
- Any provided device ID is malformed (cannot be parsed).
RBAC
The OnDemandRackMaintenance permission is granted to the ForgeAdminCLI role.
Failure Handling
If the maintenance state machine transitions to Error (for example, a
firmware upgrade fails, the requested rack firmware cannot be found, or RMS is
unreachable), the handler clears maintenance_requested while persisting the
Error transition.
This is important because handle_error re-enters Maintenance whenever
maintenance_requested is set; without clearing it, the rack would loop
between Maintenance and Error on the same failing request. The user must
issue a new OnDemandRackMaintenance call to retry.
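The loop-avoidance argument can be made concrete with a small model. The types and function signatures below are simplified assumptions (the scope is reduced to a string, and `handle_error` here only reports whether it would re-enter Maintenance); they are not the real handler code.

```rust
/// Simplified rack states for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum RackState {
    Maintenance,
    Error { cause: String },
}

/// Stand-in for RackConfig; the scope is reduced to a string here.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct RackConfig {
    pub maintenance_requested: Option<String>,
}

/// Clears the pending request together with the Error transition, so the
/// failed request is not picked up again.
pub fn fail_maintenance(config: &mut RackConfig, cause: &str) -> RackState {
    config.maintenance_requested = None;
    RackState::Error { cause: cause.to_string() }
}

/// Models the Error-state handler: re-enter Maintenance only if a request
/// is pending; None means the rack stays in Error.
pub fn handle_error(config: &RackConfig) -> Option<RackState> {
    if config.maintenance_requested.is_some() {
        Some(RackState::Maintenance)
    } else {
        None
    }
}
```

If `fail_maintenance` left `maintenance_requested` set, `handle_error` would return `Some(RackState::Maintenance)` on every tick, which is the loop described above.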
Restarting a compute tray stuck in FailedFirmwareUpgrade
A compute tray whose host firmware upgrade fails lands in
M_HostReprovision::FailedFirmwareUpgrade. From there it normally retries
automatically (bounded by MAX_FIRMWARE_UPGRADE_RETRIES and
host_firmware_upgrade_retry_interval).
When an on-demand maintenance call (or, equivalently, the rack maintenance
flow) issues a fresh trigger_host_reprovisioning_request against such a
machine, host_reprovisioning_requested is overwritten with
started_at = None. The FailedFirmwareUpgrade arm in
HostUpgradeState::handle_host_reprovision detects this fresh request
(started_at.is_none()) and:
- For rack-initiated requests (initiator prefixed rack-), transitions to M_HostReprovision::WaitingForRackFirmwareUpgrade so the rack-level RMS flow can drive the upgrade.
- Otherwise transitions to M_HostReprovision::CheckingFirmwareV2 with retry_count reset to 0, mirroring the way ManagedHostState::Ready kicks off a Host Reprovision (including the host-fw-update health-report alert merge).
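The branch taken from FailedFirmwareUpgrade can be sketched as a pure function of the reprovisioning request. The types below are simplified assumptions (the timestamp is a plain integer, and the state enum carries only what the sketch needs), and the function name is hypothetical; the real arm also performs the health-report alert merge, which is omitted here.

```rust
/// Simplified stand-in for the host reprovisioning request.
#[derive(Debug, Clone, PartialEq)]
pub struct ReprovisionRequest {
    pub initiator: String,
    /// None marks a fresh request that has not been started yet.
    pub started_at: Option<u64>,
}

/// Subset of the host reprovision states relevant to this sketch.
#[derive(Debug, PartialEq)]
pub enum HostReprovisionState {
    FailedFirmwareUpgrade { retry_count: u32 },
    WaitingForRackFirmwareUpgrade,
    CheckingFirmwareV2 { retry_count: u32 },
}

/// Decides the next state for a machine sitting in FailedFirmwareUpgrade.
pub fn on_failed_firmware_upgrade(
    request: &ReprovisionRequest,
    retry_count: u32,
) -> HostReprovisionState {
    if request.started_at.is_some() {
        // Not a fresh request: stay put and leave it to the auto-retry path.
        return HostReprovisionState::FailedFirmwareUpgrade { retry_count };
    }
    if request.initiator.starts_with("rack-") {
        // Rack-initiated: let the rack-level RMS flow drive the upgrade.
        HostReprovisionState::WaitingForRackFirmwareUpgrade
    } else {
        // Any other initiator: restart the firmware check with retries reset.
        HostReprovisionState::CheckingFirmwareV2 { retry_count: 0 }
    }
}
```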
This means an on-demand maintenance call moves a stuck compute tray back
toward M_Ready without waiting for the auto-retry interval, which is what
allows the rack to progress out of R_Maintenance into R_Validation.
Ready State Priority
When the rack is in Ready, three config flags are checked in order. The first match wins:
- topology_changed → transition to Discovering
- reprovision_requested → transition to Maintenance(FirmwareUpgrade/Start) (clears any pending maintenance_requested)
- maintenance_requested → transition to Maintenance(FirmwareUpgrade/Start) with device scope
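The first-match-wins ordering can be sketched as a chain of early returns. The config struct, the `Transition` enum, and the representation of the scope as a list of device IDs are assumptions made for illustration; the sketch also does not model whether `topology_changed` or `reprovision_requested` are reset, since that is not described here.

```rust
/// Stand-in for the rack config flags checked in the Ready state.
#[derive(Debug, Default)]
pub struct RackConfig {
    pub topology_changed: bool,
    pub reprovision_requested: bool,
    /// Scope reduced to a list of device IDs for this sketch.
    pub maintenance_requested: Option<Vec<String>>,
}

/// Hypothetical summary of the transition chosen by the Ready handler.
#[derive(Debug, PartialEq)]
pub enum Transition {
    Discovering,
    MaintenanceFullRack,
    MaintenanceScoped(Vec<String>),
    Stay,
}

/// Checks the flags in priority order; the first match wins.
pub fn handle_ready(config: &mut RackConfig) -> Transition {
    if config.topology_changed {
        return Transition::Discovering;
    }
    if config.reprovision_requested {
        // A reprovision supersedes and clears any pending on-demand request.
        config.maintenance_requested = None;
        return Transition::MaintenanceFullRack;
    }
    // The scope is left in place; the firmware-upgrade start step consumes it.
    if let Some(scope) = config.maintenance_requested.clone() {
        return Transition::MaintenanceScoped(scope);
    }
    Transition::Stay
}
```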