DPU Lifecycle Management
DPU management is NICo’s primary value differentiator. NICo treats every BlueField DPU as a first-class managed component: installing its OS, configuring host networking, monitoring health, upgrading firmware, and reprovisioning the DPU automatically when it drifts from the desired state. The DPU is the enforcement boundary for host isolation and network security; NICo manages it end-to-end so operators do not have to.
This page covers the full DPU lifecycle: what NICo installs, how it installs it, how it keeps the DPU healthy, and how to intervene when something goes wrong. For the full host ingestion flow, which includes DPU provisioning, see Ingesting Hosts. For the exact state transitions and retry paths, see the Managed Host State Diagrams. For DPU network configuration details, see DPU Configuration.
Lifecycle at a Glance
- Discovery and Pairing — Site Explorer discovers the DPU BMC, collects Redfish inventory, and pairs the DPU with its host.
- OS Install (BFB) — NICo installs the DPU OS either by pushing the BFB image via Redfish (preferred) or by booting the DPU over UEFI HTTP for a network install, then power-cycles the host.
- Network Config and Health — The dpu-agent starts, fetches desired configuration from NICo Core, applies HBN/NVUE configuration, and reports healthy.
- Ready / Serving — The DPU is synchronized and the managed host proceeds to host initialization and eventually becomes available for tenant allocation.
- Health Monitoring — The dpu-agent continuously checks DPU health and reports back to NICo Core. NICo uses these reports to gate lifecycle transitions and allocation.
- Reprovisioning — When firmware must be updated, health cannot be recovered, or an operator requests it, NICo reinstalls the DPU OS and cycles back through configuration.
What NICo Installs and Manages
NICo treats each managed host as a host server paired with one or more BlueField DPUs. During ingestion, NICo installs the DPU OS, configures the DPU for host networking, and starts the services that let the site controller manage the host without trusting the host operating system.
dpu-agent
The DPU agent runs as a daemon on the DPU. In service names and logs it appears as forge-dpu-agent; in the documentation it is usually referred to as dpu-agent.
The agent periodically calls GetManagedHostNetworkConfig to fetch the desired configuration from NICo Core. It applies the configuration locally, runs health checks, and reports status back with RecordDpuNetworkStatus. The report includes applied configuration versions and DPU health.
The agent is responsible for:
- Applying DPU network configuration (HBN/NVUE).
- Configuring the DPU-local DHCP server.
- Running periodic health checks for required services, BGP peering, disk utilization, restricted mode, and related DPU conditions.
- Running the Forge Metadata Service (FMDS).
- Supporting auto-updates of the agent itself.
- Applying selected DPU OS hotfixes without requiring a full DPU OS reinstall.
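The fetch-apply-report cycle described above can be sketched as a small reconcile loop. This is an illustration only: the RPC names GetManagedHostNetworkConfig and RecordDpuNetworkStatus come from this page, but the Python class, field names, and helpers are hypothetical.

```python
class DpuAgent:
    """Minimal sketch of the dpu-agent reconcile loop (illustrative only)."""

    def __init__(self, core_client):
        self.core = core_client      # stand-in for the NICo Core RPC client
        self.applied_versions = {}   # last successfully applied config versions

    def run_once(self):
        # Fetch desired state from NICo Core (GetManagedHostNetworkConfig).
        desired = self.core.get_managed_host_network_config()
        if desired["versions"] != self.applied_versions:
            self.apply_nvue_config(desired)  # apply HBN/NVUE configuration
            self.applied_versions = dict(desired["versions"])
        health = self.run_health_checks()    # services, BGP, disk, ...
        # Report applied versions and health back (RecordDpuNetworkStatus).
        self.core.record_dpu_network_status(
            versions=self.applied_versions, health=health)

    def apply_nvue_config(self, desired):
        pass  # placeholder: translate desired state into NVUE and apply it

    def run_health_checks(self):
        return {"healthy": True}  # placeholder result
```

In practice the agent runs this loop on a fixed interval; the sketch shows a single iteration.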
DHCP Server
NICo runs a custom DHCP server on the DPU. The DPU-local DHCP server handles DHCP requests from the attached host, so DHCP traffic from the host's primary networking interfaces does not leave the DPU and does not appear directly on the underlay network.
This is a security benefit: the DPU enforces host isolation before the host receives any network configuration. A compromised host cannot broadcast DHCP traffic onto the underlay to discover or interfere with other hosts. It also makes DHCP behavior part of the declarative DPU configuration that dpu-agent receives from NICo Core.
Forge Metadata Service
The Forge Metadata Service (FMDS) exposes instance metadata to tenants from the DPU. Tenants can use FMDS to retrieve information such as the Machine ID and boot or operating system metadata for their instance. FMDS runs on the DPU rather than on the host, so its responses are trusted independently of the host OS.
HBN and Containerized Cumulus
NICo uses HBN (Host-Based Networking), backed by containerized Cumulus, to provide the host networking behavior that the site controller expects. The dpu-agent converts desired network state from NICo Core into NVUE configuration and applies it through the NVUE CLI. After applying configuration, the agent checks that HBN and related services are healthy before NICo advances lifecycle state.
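For illustration, the shape of that translation might look like the sketch below. The desired-state fields are hypothetical, and the NVUE commands shown are generic examples of the NVUE CLI style, not the actual command set the agent generates.

```python
def render_nvue_commands(desired):
    """Render a (hypothetical) desired-state dict into NVUE CLI commands."""
    cmds = []
    for iface in desired.get("interfaces", []):
        # Generic NVUE interface addressing command (illustrative).
        cmds.append(f"nv set interface {iface['name']} ip address {iface['address']}")
    for peer in desired.get("bgp_peers", []):
        # Generic NVUE BGP neighbor command (illustrative).
        cmds.append(f"nv set vrf default router bgp neighbor {peer} remote-as external")
    # NVUE changes are staged and then applied as a unit.
    cmds.append("nv config apply")
    return cmds
```

The key design point survives the simplification: the agent stages a complete configuration and applies it atomically, then checks service health before NICo advances lifecycle state.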
For the detailed configuration model, versioning behavior, and isolation semantics, see DPU Configuration.
Note: DPF-managed DPU installation and reprovisioning follow a separate flow and will be documented in a follow-up page. This guide describes the non-DPF DPU lifecycle unless a section explicitly says otherwise.
DPU OS Installation
DPU OS installation happens as part of the managed host state machine after Site Explorer has discovered and paired the host with its DPU or DPUs. NICo supports two installation methods and selects the method automatically based on DPU BMC firmware capabilities and site configuration.
NICo BFB vs. Preingestion BFB
NICo uses two different BFB images. They are not interchangeable:
- NICo BFB: The image installed during the managed host state machine and reprovisioning. It is built from the vanilla DOCA BFB and customized with NICo services: dpu-agent, the DPU DHCP server, FMDS, HBN installer and configuration, NICo root CA, and scout. This is the image that makes the DPU a fully managed component. For build instructions, see Building NICo Containers.
- Preingestion BFB (preingestion.bfb): The unmodified vanilla DOCA BFB, saved as-is during the build process before any NICo customization is applied. It does not contain dpu-agent, HBN, FMDS, or any other NICo services. This image is used only for pre-ingestion recovery via rshim (copy-bfb-to-dpu-rshim) to return a DPU to a clean factory state so that NICo can discover and pair it. After the preingestion BFB is installed, the normal state machine installs the NICo BFB.
How NICo Chooses the Install Method
dpu_enable_secure_boot defaults to false. When disabled, all DPUs use the UEFI HTTP Boot path regardless of BMC firmware version. To use Redfish BFB install, operators must explicitly enable it in the site configuration.
NICo checks supports_bfb_install against every DPU on the host. All DPUs on a host must support Redfish BFB install for NICo to use that path; if any DPU does not, the host falls back to UEFI HTTP Boot.
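The selection rule above reduces to a small predicate. The sketch below is illustrative: the setting names mirror dpu_enable_secure_boot and supports_bfb_install from this page, but the function itself is hypothetical.

```python
def choose_install_method(dpu_enable_secure_boot, dpus):
    """Pick the DPU OS install path for a host (illustrative sketch).

    dpus: list of dicts carrying the supports_bfb_install flag from
    BMC inventory. Redfish BFB install is used only when secure boot
    is enabled in site configuration AND every DPU on the host
    supports it; otherwise the host falls back to UEFI HTTP Boot.
    """
    if not dpu_enable_secure_boot:
        return "uefi_http_boot"
    if all(d["supports_bfb_install"] for d in dpus):
        return "redfish_bfb"
    return "uefi_http_boot"
```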
Redfish BFB Install
This is the preferred method for DPUs with recent BMC firmware. NICo pushes the BFB image directly to the DPU BMC over Redfish, which gives the state machine explicit progress tracking and error reporting.
- Site Explorer discovers the host BMC and DPU BMC over the out-of-band network, collects Redfish inventory, and validates the DPU pairing.
- The managed host enters DpuDiscoveringState.
- NICo enables rshim access and configures DPU Secure Boot (the EnableSecureBoot sub-flow).
- Once Secure Boot is confirmed enabled, the state machine enters DPUInit/InstallDpuOs/InstallingBFB.
- NICo calls the DPU BMC Redfish UpdateService SimpleUpdate action, pointing it at the NICo BFB hosted by carbide-pxe, with the target DPU_OS.
- NICo polls the Redfish task and waits in DPUInit/InstallDpuOs/WaitForInstallComplete.
- When the task completes, NICo power-cycles the host so the new DPU image and platform configuration take effect.
- NICo waits for DPU discovery, DPU network configuration, and a healthy dpu-agent report before moving to host initialization.
While the BFB task is running, the handler outcome includes messages like Waiting for BFB install to complete: <percent>%. If the Redfish task fails, the state moves to InstallationError and the task messages are stored in the state handler outcome and logs.
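The WaitForInstallComplete polling behavior can be sketched as follows. The task-state values follow the standard Redfish Task schema; everything else, including the function itself, is a hypothetical illustration.

```python
def wait_for_bfb_install(get_task):
    """Poll a Redfish task until the BFB install finishes (sketch).

    get_task: callable returning a dict shaped like a Redfish Task
    resource, e.g. {"TaskState": "Running", "PercentComplete": 40,
    "Messages": [...]}. Returns the final task dict on success and
    raises on a failed install. A real handler would sleep between
    polls and surface the percentage in the handler outcome as
    "Waiting for BFB install to complete: <percent>%".
    """
    while True:
        task = get_task()
        state = task["TaskState"]
        if state == "Completed":
            return task
        if state in ("Exception", "Killed", "Cancelled"):
            # Mirrors InstallationError: keep the Redfish task messages.
            raise RuntimeError(f"BFB install failed: {task.get('Messages')}")
```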
UEFI HTTP Boot (Network Install)
For DPUs whose BMC firmware does not support Redfish-based BFB install, NICo falls back to a network install via UEFI HTTP Boot. In this path the DPU downloads and installs its OS from carbide-pxe during boot rather than receiving a Redfish push.
- Site Explorer discovers and pairs the host and DPU as above.
- The managed host enters DpuDiscoveringState.
- NICo enables rshim access and disables DPU Secure Boot (the DisableSecureBoot sub-flow). Secure Boot must be off because the network boot image is not signed for the DPU Secure Boot chain.
- NICo configures the DPU to boot once from UEFI HTTP (the SetUefiHttpBoot state) and reboots all DPUs.
- The DPU boots via HTTP and requests PXE instructions from carbide-pxe. NICo serves a DPU-specific boot payload: a carbide.efi kernel, a carbide.root initrd, and a BlueField Kickstart script (bfks) delivered via cloud-init user-data. The kickstart script drives the BFB installation on the DPU.
- After boot, NICo enters DPUInit/Init, restarts all DPUs, power-cycles the host, and waits for the DPU to come up with the new image.
- NICo proceeds through WaitingForPlatformConfiguration and WaitingForNetworkConfig, waiting for the dpu-agent to apply configuration and report healthy, before moving to host initialization.
Because there is no Redfish task to poll, NICo monitors the network install indirectly: it watches for the DPU to become reachable and for dpu-agent to report in. If the DPU does not come up within the SLA, the state machine triggers a reboot to retry.
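That indirect monitoring amounts to a deadline check. The sketch below is a hypothetical condensation; only the reachability/report signals and the SLA-triggered reboot retry come from the text.

```python
def check_network_install(dpu_reachable, agent_reported, elapsed_secs, sla_secs):
    """Decide the next action while waiting on a UEFI HTTP network install.

    With no Redfish task to poll, progress is inferred from whether the
    DPU has become reachable and whether dpu-agent has reported in.
    """
    if dpu_reachable and agent_reported:
        return "advance"            # move on toward host initialization
    if elapsed_secs > sla_secs:
        return "reboot_and_retry"   # SLA breached: reboot the DPU to retry
    return "keep_waiting"
```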
Note: During reprovisioning, this same distinction applies. If BFB install is supported, NICo enters ReprovisionState::InstallDpuOs. If not, it enters ReprovisionState::WaitingForNetworkInstall, which boots the DPU via UEFI HTTP and waits for it to complete the network install and become healthy.
Monitoring Installation Progress
During normal ingestion no manual action is required. Operators can monitor progress through the managed host state output (managed-host show). For Redfish BFB installs, the handler outcome reports install percentage. For UEFI HTTP Boot installs, the handler outcome reports DPU discovery and reboot status.
Common Installation Failures
Most DPU OS installation failures are diagnosed from the managed host state, carbide-api logs, and (for Redfish installs) the Redfish task messages returned by the DPU BMC.
For the manual rshim recovery command (which installs the preingestion BFB, not the NICo BFB) and additional pairing troubleshooting, see DPU-Related Issues: Installing a Fresh DPU OS. For the full DPU troubleshooting workflow, see WaitingForNetworkConfig and DPU health.
Firmware Upgrades
NICo manages DPU firmware as part of the same managed host lifecycle. DPU firmware inventory comes from Redfish and hardware discovery. The configured firmware baseline is stored in the site configuration under dpu_config.
Managed Firmware Components
NICo tracks the following DPU firmware components:
- DPU BMC firmware.
- CEC (ERoT) firmware.
- DPU NIC firmware.
What Triggers a Firmware Upgrade
Firmware upgrades can be triggered in two ways:
- Automatic selection by Machine Update Manager: Machine Update Manager monitors DPU NIC firmware versions. If a healthy Ready managed host has DPU NIC firmware outside the configured dpu_nic_firmware_update_versions, it queues a DPU reprovisioning request. During reprovisioning, NICo verifies and updates all DPU firmware components (BMC, CEC/ERoT, NIC) against the configured baseline, but only NIC firmware version drift triggers the automatic reprovisioning.
- Manual operator request: An operator triggers DPU reprovisioning using the CLI (see DPU Reprovisioning below). Firmware is always verified and updated during any reprovisioning flow.
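The automatic trigger condition can be written as a single predicate. This is a sketch: the setting name dpu_nic_firmware_update_versions and the Ready/healthy precondition come from the text, while the function and its arguments are hypothetical.

```python
def needs_nic_firmware_update(host_state, healthy, nic_fw_version, allowed_versions):
    """Machine Update Manager trigger (sketch): only healthy Ready hosts
    whose DPU NIC firmware falls outside the configured
    dpu_nic_firmware_update_versions are queued for reprovisioning."""
    return (host_state == "Ready"
            and healthy
            and nic_fw_version not in allowed_versions)
```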
How Upgrades Are Staged
Machine Update Manager stages upgrades so the site does not take too many hosts out of service at once. Before scheduling an additional update, it evaluates:
- How many managed hosts are already in a maintenance or update state.
- How many managed hosts are currently unhealthy.
- The configured maximum concurrent update policy for the site.
A DPU update is treated as a host-level maintenance event because the host and its DPU or DPUs are updated together. During an update, NICo applies a HostUpdateInProgress health alert with the PreventAllocations classification, which keeps tenants from acquiring the host while work is in progress.
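One plausible condensation of the staging checks is shown below. The exact formula NICo uses is not documented here; this sketch simply counts hosts already out of service against the site's concurrency budget, with entirely hypothetical names.

```python
def can_schedule_update(hosts_in_maintenance, hosts_unhealthy, max_concurrent):
    """Gate (sketch) that Machine Update Manager might apply before
    queueing another DPU update: hosts already in maintenance or
    unhealthy both consume the site's maximum-concurrent-update budget."""
    return (hosts_in_maintenance + hosts_unhealthy) < max_concurrent
```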
Operators can inspect DPU firmware status through the managed host state output in the CLI.
Containerized Cumulus and NVUE
After the DPU OS is installed, the dpu-agent keeps HBN configured by applying NVUE configuration generated from NICo Core state. The configuration covers:
- Host admin network and tenant interfaces.
- VPC/VNI assignments and route server peering.
- DHCP server settings for the attached host.
- Network Security Group rules and isolation behavior.
Configuration Versioning
Configuration is versioned. NICo maintains separate version numbers for managedhost_network_config (site controller lifecycle changes) and instance_network_config (tenant-driven changes). NICo only considers the DPU synchronized when the DPU reports the expected versions for both and reports itself healthy.
After any configuration change, the dpu-agent raises a PostConfigCheckWait alert for approximately 30 seconds. This brief hold gives the DPU time to verify that the new configuration is stable (BGP sessions re-establish, services restart) before NICo treats it as applied.
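The synchronization rule from the preceding paragraphs can be expressed as a predicate over the DPU's status report. The version keys managedhost_network_config and instance_network_config come from this page; the dict shape and function are hypothetical.

```python
def dpu_synchronized(report, expected):
    """A DPU counts as synchronized (sketch) only when BOTH config
    version numbers match the expected values AND the DPU reports
    itself healthy."""
    return (
        report["managedhost_network_config"] == expected["managedhost_network_config"]
        and report["instance_network_config"] == expected["instance_network_config"]
        and report["healthy"]
    )
```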
Isolation Behavior
If the dpu-agent calls GetManagedHostNetworkConfig and receives a NotFound error (the site controller does not recognize this DPU), the agent automatically configures the DPU into an isolated mode. This prevents unknown or removed DPUs from consuming network resources.
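That fallback is a simple try/except around the config fetch. The sketch below uses hypothetical names; only the NotFound-to-isolation behavior comes from the text.

```python
class NotFoundError(Exception):
    """Stand-in for the NotFound error returned when NICo Core
    does not recognize this DPU."""

def fetch_or_isolate(get_config, isolate):
    """Fetch desired config; on NotFound, drop the DPU into isolated
    mode so an unknown or removed DPU cannot consume network resources."""
    try:
        return get_config()
    except NotFoundError:
        isolate()   # configure the DPU into isolated mode
        return None
```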
DPU Health Monitoring
DPU health is part of aggregate host health. NICo combines reports from dpu-agent, BMC health monitoring, inventory monitoring, validation, and operator overrides. For the full health model, see Health Checks and Health Aggregation.
What dpu-agent Checks
The dpu-agent runs periodic health checks and includes the results in every RecordDpuNetworkStatus report. The checks cover:
- Required services running on the DPU.
- BGP peering status.
- Disk utilization.
- Restricted mode and related DPU conditions.
How Health Drives Lifecycle Decisions
NICo uses DPU health to gate state transitions and allocation:
- If the DPU has not recently reported that it is up, healthy, and synchronized to the desired configuration, the managed host state does not advance.
- If the health report contains alerts with the PreventAllocations classification, the host is not available for new tenant allocation.
- If the dpu-agent stops sending reports entirely, NICo records a HeartbeatTimeout health alert against forge-dpu-agent.
Investigating Unhealthy DPUs
When a DPU becomes unhealthy, inspect the managed host state and DPU health report (for example, via managed-host show).
Key fields to check in the output:
- Health Probe Alerts: which specific check failed (e.g., HeartbeatTimeout, BgpStats, ServiceRunning).
- Last seen: when the DPU last reported to NICo. A stale timestamp suggests the DPU agent has crashed or the DPU is offline.
- State SLA: if the host has been in its current state longer than the SLA, the output shows In State > SLA: true with the breach reason.
For the full troubleshooting workflow, including how to check logs via Grafana/Loki, verify DPU liveliness, restart the agent, and diagnose specific health probe alerts, see WaitingForNetworkConfig and DPU health.
DPU Reprovisioning
DPU reprovisioning reinstalls the DPU OS and then waits for discovery, network configuration, and DPU health to converge again. It is used for planned firmware updates, DPU recovery, and cases where a DPU must be returned to a known clean state.
What Happens During Reprovisioning
The reprovisioning state machine runs through the following stages:
- Check whether BFB install via Redfish is supported for the DPU (BMC firmware >= 24.10 and secure boot enabled).
- If supported, install the DPU OS via Redfish UpdateService (ReprovisionState::InstallDpuOs). If not, boot the DPU via UEFI HTTP for a network install (ReprovisionState::WaitingForNetworkInstall).
- Power the host off and back on so the DPU image and firmware take effect.
- Verify DPU firmware versions (BMC, CEC/ERoT, NIC) against the configured baseline and update any that do not match.
- Wait for DPU network configuration and health to synchronize.
- Clear the DPU reprovisioning request and return the managed host to the appropriate state.
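The stages above can be sketched as a linear sequence. Only InstallDpuOs and WaitingForNetworkInstall are state names taken from this page; the other stage labels and the function are hypothetical shorthand for the steps listed.

```python
def reprovision_path(supports_bfb_install):
    """Return the ordered reprovisioning stages for a host (sketch).

    The install stage depends on whether the DPU BMC supports Redfish
    BFB install; the remaining stages run in both cases.
    """
    install = ("InstallDpuOs" if supports_bfb_install
               else "WaitingForNetworkInstall")
    return [
        "CheckBfbInstallSupport",        # BMC firmware >= 24.10, secure boot on
        install,
        "PowerCycleHost",                # image and firmware take effect
        "VerifyAndUpdateFirmware",       # BMC, CEC/ERoT, NIC vs baseline
        "WaitForNetworkConfigAndHealth", # dpu-agent applies config, reports healthy
        "ClearReprovisionRequest",
    ]
```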
When Automatic Reprovisioning Is Triggered
Automatic DPU reprovisioning is triggered when Machine Update Manager selects an eligible Ready host whose DPU NIC firmware is outside the configured baseline. It queues a DPU reprovisioning request for the host.
Triggering Reprovisioning Manually
The API requires a HostUpdateInProgress health alert on the host before it accepts a reprovisioning request. Use the --update-message option to apply this alert when triggering reprovisioning from the CLI.
Firmware is always verified and updated during reprovisioning regardless of whether --update-firmware is passed. The --update-firmware flag is accepted but deprecated.
Monitoring Reprovisioning Progress
The managed-host show output displays the current reprovisioning substate, percent complete for BFB installation (when available), and any handler errors.
Additional Reprovisioning Commands
The CLI also supports restarting a DPU reprovisioning flow for all DPUs on a host, and clearing a pending reprovisioning request that has not yet started.
For the complete reprovisioning state machine, see DPU Reprovision State Details.