Firmware Updates
This page covers when NICo updates firmware, how automated updates are scheduled, how to control update windows, and where NICo hands off to platform-specific firmware procedures.
NICo updates firmware at two points in a machine's lifecycle:
- Initial ingestion and pre-ingestion. NICo can update firmware before a host is fully ingested when the current firmware is too old for safe discovery, host-to-DPU pairing, or the configured minimum firmware policy.
- Approved service windows. After NICo manages a machine, firmware updates are scheduled during approved maintenance windows and are limited by health and concurrency policy.
For related background, see:
- Architecture Overview
- Managed Host State Diagrams
- DPU Lifecycle Management
- Redfish Workflow
- Core Metrics
Workflow Boundaries
NICo has several firmware paths. Use the path that matches the component being updated.
Use nico-admin-cli for CLI examples. These commands talk to the NICo Core
gRPC API:
Machine Update Manager
Machine Update Manager is a scheduler. It does not flash firmware directly. On each run it:
- Clears completed host and DPU update markers.
- Counts machines already in maintenance or firmware update states.
- Counts healthy and unhealthy hosts.
- Computes how many additional updates can start.
- Asks each enabled update module to start work until the site limit is reached.
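The counting and budgeting steps above can be sketched as follows. This is a minimal illustration, not NICo's implementation: the `SiteCounts` shape, the `max_concurrent` and `max_out_pct` parameters, and the percentage-based health policy are all assumptions made for the example, and the marker-clearing step is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SiteCounts:
    """Counts gathered at the start of one scheduler run (hypothetical shape)."""
    updating: int   # machines already in maintenance or firmware update states
    healthy: int    # healthy hosts, not counting the ones in `updating`
    unhealthy: int  # unhealthy hosts, not counting the ones in `updating`


def update_budget(counts: SiteCounts, max_concurrent: int, max_out_pct: int) -> int:
    """Return how many additional updates may start in this run."""
    total = counts.updating + counts.healthy + counts.unhealthy
    # Assumed health policy: at most max_out_pct percent of hosts out of service.
    allowed_out = total * max_out_pct // 100
    health_headroom = allowed_out - counts.unhealthy - counts.updating
    concurrency_headroom = max_concurrent - counts.updating
    return max(0, min(health_headroom, concurrency_headroom))


# A module receives the remaining budget and returns the machine IDs it started.
UpdateModule = Callable[[int], List[str]]


def run_once(counts: SiteCounts, modules: List[UpdateModule],
             max_concurrent: int, max_out_pct: int) -> List[str]:
    """Ask each module in turn to start work until the budget is used up."""
    budget = update_budget(counts, max_concurrent, max_out_pct)
    started: List[str] = []
    for start in modules:
        if budget <= 0:
            break
        newly_started = start(budget)[:budget]  # never exceed the remaining budget
        started.extend(newly_started)
        budget -= len(newly_started)
    return started
```

The scheduler only hands out a budget in this sketch; the modules decide which machines to start, which mirrors the point that Machine Update Manager does not flash firmware itself.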
Automated updates are selected only when the target is eligible. The normal eligibility gates are:
- The machine is known to NICo and is in a state that can be updated.
- The relevant update module is enabled.
- The machine needs firmware according to the configured baseline.
- The machine is not already in maintenance or another update flow.
- Site health and concurrency policy allow another host to leave service.
- If the firmware entry requires explicit start, an update window is currently active for the machine.
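The gates above can be read as a single predicate that reports the first gate blocking a machine. The sketch below assumes a hypothetical `Candidate` record and gate names; the real checks live in NICo and its update modules, and their order there may differ.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass
class Candidate:
    """Hypothetical view of one machine as seen by the scheduler."""
    in_updatable_state: bool       # known to NICo and in an updatable state
    module_enabled: bool           # the relevant update module is enabled
    needs_firmware: bool           # observed firmware differs from the baseline
    busy: bool                     # in maintenance or another update flow
    explicit_start_required: bool  # taken from the firmware entry
    window: Optional[Tuple[datetime, datetime]] = None  # (start, end), if set


def blocking_gate(c: Candidate, budget: int, now: datetime) -> Optional[str]:
    """Return the name of the first gate that blocks the update, or None."""
    if not c.in_updatable_state:
        return "state"
    if not c.module_enabled:
        return "module"
    if not c.needs_firmware:
        return "baseline"
    if c.busy:
        return "busy"
    if budget <= 0:
        return "capacity"
    if c.explicit_start_required:
        # The window gate applies only to entries that require explicit start.
        if c.window is None or not (c.window[0] <= now < c.window[1]):
            return "window"
    return None
```

Returning the blocking gate rather than a bare boolean makes it easier to explain why a machine was skipped in a given run.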
The actual firmware work runs inside the managed-host lifecycle. This keeps host power, BMC operations, DPU reprovisioning, health reports, and state transitions under one lifecycle controller.
Configuration
Firmware behavior is controlled by site configuration and firmware metadata.
Machine Update Scheduling
Example:
Host Firmware Settings
Example:
DPU Firmware Settings
Firmware Baselines
NICo compares observed firmware inventory with configured firmware baselines.
Host baselines are loaded from embedded configuration and from
metadata.toml files under firmware_global.firmware_directory. Metadata can
define vendor and model matching, component ordering, known firmware versions,
default versions, minimum pre-ingestion versions, and whether explicit start is
required.
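As an illustration of how a default version and a minimum pre-ingestion version can drive different decisions, the sketch below compares one observed component version against both. It assumes dotted numeric version strings and treats any difference from the default as drift; real firmware version formats vary by vendor, and the function and return values are hypothetical names, not NICo identifiers.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a dotted numeric version such as '1.10.0' (assumed format)."""
    return tuple(int(part) for part in version.split("."))


def classify_component(observed: str, default: str,
                       minimum_preingestion: Optional[str] = None) -> str:
    """Decide what a baseline comparison implies for one component."""
    seen = parse_version(observed)
    if minimum_preingestion is not None and seen < parse_version(minimum_preingestion):
        # Too old for safe ingestion: update before the host is fully ingested.
        return "update-before-ingestion"
    if seen != parse_version(default):
        # Differs from the default: candidate for a scheduled update window.
        return "update-in-window"
    return "current"
```

Comparing parsed tuples instead of raw strings avoids ordering `1.10.0` before `1.9.0`.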
List host firmware entries known to NICo:
The output includes vendor, model, component type, inventory-name match, version, and whether the update needs explicit start.
Keep live “latest firmware version” tables in a single baseline source and link to that source from site runbooks. Do not copy live version tables into this operations page; version tables change independently from the update workflow. If a site mirrors firmware versions into a wiki or dashboard, make the mirror point back to the same source-controlled baseline or approved version catalog.
To update a host firmware baseline:
- Add or update the firmware metadata for the target vendor and model.
- Place the firmware binary where firmware_global.firmware_directory points, or provide the approved URL/script metadata used by the site.
- Mark the intended version as the default for the component.
- Verify the new baseline with firmware show.
- Apply the update through a service window or machine-specific update window.
Example host firmware metadata:
Host Firmware Updates
Host firmware updates use Redfish inventory and firmware metadata. Common host components include BMC, BIOS, UEFI, HGX BMC, GPU, NIC, and platform-specific firmware, depending on the platform metadata.
During a host firmware update, NICo can:
- Upload firmware through Redfish or run an approved firmware script.
- Poll Redfish tasks and firmware inventory.
- Reset the BMC or host when required by the platform.
- Re-check firmware versions after activation.
- Apply a HostUpdateInProgress health report that prevents allocation while update work is active.
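The task-polling step can be sketched with the standard Redfish Task resource, whose TaskState property takes values such as Running, Completed, Exception, Killed, and Cancelled. How NICo itself interprets these states is not described here, so the mapping below is an assumption, and `fetch` is a hypothetical callable that returns the task's JSON body as a dict.

```python
import time
from typing import Callable, Dict

# Redfish TaskState values treated as terminal failures in this sketch.
FAILED_STATES = {"Exception", "Killed", "Cancelled"}


def classify_task(task: Dict) -> str:
    """Map a Redfish Task resource to 'succeeded', 'failed', or 'running'."""
    state = task.get("TaskState")
    if state == "Completed":
        # A completed task can still report a critical status.
        return "failed" if task.get("TaskStatus") == "Critical" else "succeeded"
    if state in FAILED_STATES:
        return "failed"
    return "running"


def wait_for_task(fetch: Callable[[], Dict], max_polls: int = 60,
                  interval_s: float = 5.0,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the task reaches a terminal state or the poll limit is hit."""
    for _ in range(max_polls):
        outcome = classify_task(fetch())
        if outcome != "running":
            return outcome
        sleep(interval_s)
    return "timed-out"
```

A successful task is only part of the picture; the firmware inventory still has to be re-read after activation to confirm that the reported version changed.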
Enable, disable, or clear machine-specific auto-update behavior:
Set an explicit firmware update window for one or more machines:
Cancel pending start windows:
Request host reprovisioning when the host must be put back through the managed-host firmware path:
DPU Firmware Updates
DPU firmware is managed as part of the managed-host lifecycle. NICo tracks:
Machine Update Manager uses DPU NIC firmware drift as the automatic trigger. During DPU reprovisioning, NICo verifies and updates the DPU firmware set against the configured baseline.
Inspect DPU firmware status:
For the full DPU firmware flow, see DPU Lifecycle Management.
Manual Platform Updates
Some platforms require a manual field procedure before NICo can complete the firmware workflow.
NVIDIA Platforms
For NVIDIA-managed platforms, NICo normally runs the automated firmware lifecycle. Use the manual procedure only when the site runbook or platform support path requires a manual update gate; GB200 is the common example of this flow. When a manual gate applies, follow the approved NVIDIA service procedure for the exact platform and firmware package.
A GB200 firmware service procedure can cover several component groups:
The exact order, package names, credentials, BMC addresses, and validation commands are site- and release-specific. Keep those details in the approved GB200 service procedure. This Operations guide documents where NICo waits and how to resume the lifecycle after that procedure is complete.
When manual firmware upgrade is required, NICo moves the managed host to a manual waiting state. Complete the approved GB200 firmware procedure first. After the field procedure is complete, mark the manual firmware upgrade complete so NICo can resume automatic checks:
OEM Platforms
For OEM platforms, consult the OEM support site for the exact platform and firmware package before starting work. The OEM support site is the final authority for vendor-specific prerequisites, package selection, activation steps, and recovery.
SMC/Supermicro is one example of this path. NICo can use the Redfish firmware workflow when firmware metadata identifies the target component and package. The Supermicro Redfish implementation supports:
NICo can schedule, track, or hand off the update when the site has integrated an approved firmware package or script. Use the OEM procedure to confirm the exact package, update order, activation requirements, and recovery steps.
Rack and Component Firmware
Use the NVIDIA Forge REST API for rack and tray firmware updates. The API accepts a site ID and optional target firmware version, starts the update workflow, and returns task IDs for tracking.
REST API entry points:
Request fields:
Batch tray filters follow these constraints:
- Use either rackId or rackName, not both.
- Do not combine a rack filter with ids or componentIds.
- Use type when filtering by componentIds.
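A client can check the filter constraints above before sending a batch request. The sketch below assumes the filter is available as a flat dict keyed by the field names used in this section; the actual request layout may nest these fields differently, and `validate_tray_filter` is a hypothetical helper, not part of the API.

```python
from typing import Dict, List


def validate_tray_filter(tray_filter: Dict) -> List[str]:
    """Return the list of constraint violations; an empty list means valid."""
    errors: List[str] = []
    has_rack_id = bool(tray_filter.get("rackId"))
    has_rack_name = bool(tray_filter.get("rackName"))
    has_ids = bool(tray_filter.get("ids"))
    has_component_ids = bool(tray_filter.get("componentIds"))

    if has_rack_id and has_rack_name:
        errors.append("use either rackId or rackName, not both")
    if (has_rack_id or has_rack_name) and (has_ids or has_component_ids):
        errors.append("do not combine a rack filter with ids or componentIds")
    if has_component_ids and not tray_filter.get("type"):
        errors.append("type is required when filtering by componentIds")
    return errors
```

Collecting every violation in one pass lets the caller fix the whole request at once instead of discovering errors one rejection at a time.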
The API requires provider-admin authorization.
Single-rack example:
Batch tray example for a rack:
The response contains task IDs:
Use component-manager only for lower-level switch and power shelf firmware workflows when a REST workflow is not available for that component:
Check component firmware status:
Monitor Progress
Start with the object being updated, then move to logs and metrics when progress is unclear.
Useful metrics:
Recovery
If a firmware update does not progress:
- Inspect the managed-host state, REST task IDs, or component status for the failing object.
- Check the health reports and handler outcome for the reason NICo is waiting.
- Check nico-api logs for Redfish task failures, BMC reachability errors, script failures, REST firmware task failures, RMS backend errors, or component-manager job errors.
- Follow the platform or OEM recovery procedure before retrying a failed firmware operation.
- Retry through the same NICo workflow after the underlying platform condition is corrected.
A host firmware failure can place the host in a failed firmware-upgrade state.
NICo retries according to firmware_global.host_firmware_upgrade_retry_interval
where retry is supported. A rack or component firmware failure should be
tracked through the REST task IDs returned by the rack or tray firmware API,
or through component-manager status for switch and power shelf targets.
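To estimate when a failed host becomes eligible for its next automatic attempt, the retry interval can be added to the time of the last failure. This is a simplified reading of the setting: it assumes a fixed interval measured from the failure time, which this page does not state, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def next_retry_at(failed_at: datetime, retry_interval: timedelta) -> datetime:
    """Earliest time a retry could start under a fixed-interval assumption."""
    return failed_at + retry_interval


def retry_due(failed_at: datetime, retry_interval: timedelta, now: datetime) -> bool:
    """True once the assumed retry interval has elapsed since the failure."""
    return now >= next_retry_at(failed_at, retry_interval)
```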
After the platform condition is corrected, reset a failed host reprovisioning flow only when the site runbook calls for it: