> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/infra-controller/llms.txt.
> For full documentation content, see https://docs.nvidia.com/infra-controller/llms-full.txt.

# DPU Lifecycle Management

DPU management is NICo's primary value differentiator. NICo treats every BlueField DPU as a first-class managed component: installing its OS, configuring host networking, monitoring health, upgrading firmware, and reprovisioning the DPU automatically when it drifts from the desired state. The DPU is the enforcement boundary for host isolation and network security; NICo manages it end-to-end so operators do not have to.

This page covers the full DPU lifecycle: what NICo installs, how it installs it, how it keeps the DPU healthy, and how to intervene when something goes wrong. For the full host ingestion flow, which includes DPU provisioning, see [Ingesting Hosts](/infra-controller/documentation/provisioning-day-0-operations/ingesting-hosts). For the exact state transitions and retry paths, see the [Managed Host State Diagrams](/infra-controller/documentation/architecture/state-machines/managed-host). For DPU network configuration details, see [DPU Configuration](/infra-controller/documentation/dpu-management/dpu-configuration).

## Lifecycle at a Glance

1. **Discovery and Pairing** — Site Explorer discovers the DPU BMC, collects Redfish inventory, and pairs the DPU with its host.
2. **OS Install (BFB)** — NICo installs the DPU OS either by pushing the BFB image via Redfish (preferred) or by booting the DPU over UEFI HTTP for a network install, then power-cycles the host.
3. **Network Config and Health** — The `dpu-agent` starts, fetches desired configuration from NICo Core, applies HBN/NVUE configuration, and reports healthy.
4. **Ready / Serving** — The DPU is synchronized and the managed host proceeds to host initialization and eventually becomes available for tenant allocation.
5. **Health Monitoring** — The `dpu-agent` continuously checks DPU health and reports back to NICo Core. NICo uses these reports to gate lifecycle transitions and allocation.
6. **Reprovisioning** — When firmware must be updated, health cannot be recovered, or an operator requests it, NICo reinstalls the DPU OS and cycles back through configuration.

## What NICo Installs and Manages

NICo treats each managed host as a host server paired with one or more BlueField DPUs. During ingestion, NICo installs the DPU OS, configures the DPU for host networking, and starts the services that let the site controller manage the host without trusting the host operating system.

### `dpu-agent`

The DPU agent runs as a daemon on the DPU. In service names and logs it appears as `forge-dpu-agent`; in the documentation it is usually referred to as `dpu-agent`.

The agent periodically calls `GetManagedHostNetworkConfig` to fetch the desired configuration from NICo Core. It applies the configuration locally, runs health checks, and reports status back with `RecordDpuNetworkStatus`. The report includes applied configuration versions and DPU health.

The agent is responsible for:

- Applying DPU network configuration (HBN/NVUE).
- Configuring the DPU-local DHCP server.
- Running periodic health checks for required services, BGP peering, disk utilization, restricted mode, and related DPU conditions.
- Running the Forge Metadata Service (FMDS).
- Supporting auto-updates of the agent itself.
- Applying selected DPU OS hotfixes without requiring a full DPU OS reinstall.

### DHCP Server

NICo runs a custom DHCP server on the DPU. The DPU-local DHCP server handles DHCP requests from the attached host, so DHCP traffic from the host primary networking interfaces does not leave the DPU and does not appear directly on the underlay network.

This is a security benefit: the DPU enforces host isolation before the host receives any network configuration. A compromised host cannot broadcast DHCP traffic onto the underlay to discover or interfere with other hosts. It also makes DHCP behavior part of the declarative DPU configuration that `dpu-agent` receives from NICo Core.

### Forge Metadata Service

The Forge Metadata Service (FMDS) exposes instance metadata to tenants from the DPU. Tenants can use FMDS to retrieve information such as the Machine ID and boot or operating system metadata for their instance. FMDS runs on the DPU rather than on the host, so its responses are trusted independently of the host OS.

### HBN and Containerized Cumulus

NICo uses HBN (Host-Based Networking), backed by containerized Cumulus, to provide the host networking behavior that the site controller expects. The `dpu-agent` converts desired network state from NICo Core into NVUE configuration and applies it through the NVUE CLI. After applying configuration, the agent checks that HBN and related services are healthy before NICo advances lifecycle state.

For the detailed configuration model, versioning behavior, and isolation semantics, see [DPU Configuration](/infra-controller/documentation/dpu-management/dpu-configuration).

> **Note:** DPF-managed DPU installation and reprovisioning follow a separate flow and will be documented in a follow-up page. This guide describes the non-DPF DPU lifecycle unless a section explicitly says otherwise.

## DPU OS Installation

DPU OS installation happens as part of the managed host state machine after Site Explorer has discovered and paired the host with its DPU or DPUs. NICo supports two installation methods and selects the method automatically based on DPU BMC firmware capabilities and site configuration.

### NICo BFB vs. Preingestion BFB

NICo uses two different BFB images. They are not interchangeable:

- **NICo BFB**: The image installed during the managed host state machine and reprovisioning. It is built from the vanilla DOCA BFB and customized with NICo services: `dpu-agent`, the DPU DHCP server, FMDS, HBN installer and configuration, NICo root CA, and scout. This is the image that makes the DPU a fully managed component. For build instructions, see [Building NICo Containers](/infra-controller/documentation/getting-started/building-nico-containers#building-the-dpu-bfb).
- **Preingestion BFB** (`preingestion.bfb`): The unmodified vanilla DOCA BFB, saved as-is during the build process before any NICo customization is applied. It does **not** contain `dpu-agent`, HBN, FMDS, or any other NICo services. This image is used only for pre-ingestion recovery via rshim (`copy-bfb-to-dpu-rshim`) to return a DPU to a clean factory state so that NICo can discover and pair it. After the preingestion BFB is installed, the normal state machine installs the NICo BFB.

### How NICo Chooses the Install Method

| Condition | Method |
|---|---|
| DPU BMC firmware >= 24.10 **and** `dpu_enable_secure_boot` is enabled in site config | Redfish BFB Install |
| DPU BMC firmware < 24.10 **or** secure boot is not enabled | UEFI HTTP Boot |

`dpu_enable_secure_boot` defaults to `false`. When disabled, all DPUs use the UEFI HTTP Boot path regardless of BMC firmware version. To use Redfish BFB install, operators must explicitly enable it in the site configuration.

NICo checks `supports_bfb_install` against every DPU on the host. All DPUs on a host must support Redfish BFB install for NICo to use that path; if any DPU does not, the host falls back to UEFI HTTP Boot.

### Redfish BFB Install

This is the preferred method for DPUs with recent BMC firmware. NICo pushes the BFB image directly to the DPU BMC over Redfish, which gives the state machine explicit progress tracking and error reporting.

1. Site Explorer discovers the host BMC and DPU BMC over the out-of-band network, collects Redfish inventory, and validates the DPU pairing.
2. The managed host enters `DpuDiscoveringState`.
3. NICo enables rshim access and configures DPU Secure Boot (the `EnableSecureBoot` sub-flow).
4. Once Secure Boot is confirmed enabled, the state machine enters `DPUInit/InstallDpuOs/InstallingBFB`.
5. NICo calls the DPU BMC Redfish `UpdateService` `SimpleUpdate` action, pointing it at the NICo BFB hosted by `carbide-pxe`, with the target `DPU_OS`.
6. NICo polls the Redfish task and waits in `DPUInit/InstallDpuOs/WaitForInstallComplete`.
7. When the task completes, NICo power-cycles the host so the new DPU image and platform configuration take effect.
8. NICo waits for DPU discovery, DPU network configuration, and a healthy `dpu-agent` report before moving to host initialization.

While the BFB task is running, the handler outcome includes messages like `Waiting for BFB install to complete: <percent>%`. If the Redfish task fails, the state moves to `InstallationError` and the task messages are stored in the state handler outcome and logs.

### UEFI HTTP Boot (Network Install)

For DPUs whose BMC firmware does not support Redfish-based BFB install, NICo falls back to a network install via UEFI HTTP Boot. In this path the DPU downloads and installs its OS from `carbide-pxe` during boot rather than receiving a Redfish push.

1. Site Explorer discovers and pairs the host and DPU as above.
2. The managed host enters `DpuDiscoveringState`.
3. NICo enables rshim access and disables DPU Secure Boot (the `DisableSecureBoot` sub-flow). Secure Boot must be off because the network boot image is not signed for the DPU Secure Boot chain.
4. NICo configures the DPU to boot once from UEFI HTTP (`SetUefiHttpBoot` state) and reboots all DPUs.
5. The DPU boots via HTTP and requests PXE instructions from `carbide-pxe`. NICo serves a DPU-specific boot payload: a `carbide.efi` kernel, a `carbide.root` initrd, and a BlueField Kickstart script (`bfks`) delivered via cloud-init user-data. The kickstart script drives the BFB installation on the DPU.
6. After boot, NICo enters `DPUInit/Init`, restarts all DPUs, power-cycles the host, and waits for the DPU to come up with the new image.
7. NICo proceeds through `WaitingForPlatformConfiguration` and `WaitingForNetworkConfig`, waiting for the `dpu-agent` to apply configuration and report healthy, before moving to host initialization.

Because there is no Redfish task to poll, NICo monitors the network install indirectly: it watches for the DPU to become reachable and for `dpu-agent` to report in. If the DPU does not come up within the SLA, the state machine triggers a reboot to retry.

> **Note:** During reprovisioning, this same distinction applies. If BFB install is supported, NICo enters `ReprovisionState::InstallDpuOs`. If not, it enters `ReprovisionState::WaitingForNetworkInstall`, which boots the DPU via UEFI HTTP and waits for it to complete the network install and become healthy.

### Monitoring Installation Progress

During normal ingestion no manual action is required. Operators can monitor the state with:

```bash
carbide-admin-cli -c <api-url> managed-host show --all
carbide-admin-cli -c <api-url> managed-host show <machine-id>
```

For Redfish BFB installs, the handler outcome reports install percentage. For UEFI HTTP Boot installs, the handler outcome reports DPU discovery and reboot status.

### Common Installation Failures

Most DPU OS installation failures are diagnosed from the managed host state, `carbide-api` logs, and (for Redfish installs) the Redfish task messages returned by the DPU BMC.

| Symptom | Install method | Likely cause | Resolution |
|---|---|---|---|
| `Invalid FW Package` | Both | The BFB was built incorrectly or for the wrong DPU platform. | Verify the DPU model from Redfish inventory or DPU firmware output, rebuild the BFB for the correct platform, and retry. |
| Redfish unavailable | Redfish | DPU BMC is unreachable or not responding to Redfish requests. | Check DPU BMC network reachability and credentials. NICo retries automatically. |
| Task exception or unknown state | Redfish | Unexpected Redfish task status. | Inspect the Redfish task messages in `carbide-api` logs and confirm the BFB URL served by `carbide-pxe`. |
| rshim ownership conflict | rshim (SCP) | Host holds rshim and the DPU BMC cannot initiate the copy. | Use `--pre-copy-powercycle` when installing a fresh BFB via rshim to release host control first. |
| DPU never becomes reachable after reboot | UEFI HTTP | DPU failed to PXE boot or kickstart failed. | Check `carbide-pxe` logs for the DPU's PXE request. Verify the DPU boot order is set to UEFI HTTP. Check `carbide-api` logs for the DPU BMC IP. |
| Stuck in `WaitingForNetworkInstall` | UEFI HTTP | DPU booted but did not install the OS or `dpu-agent` did not start. | SSH to the DPU via its BMC/rshim and check `journalctl -fu forge-dpu-agent`. NICo reboots the DPU automatically if it does not appear within the reboot timeout. |

For the manual rshim recovery command (which installs the preingestion BFB, not the NICo BFB) and additional pairing troubleshooting, see [DPU-Related Issues: Installing a Fresh DPU OS](/infra-controller/documentation/provisioning-day-0-operations/ingesting-hosts#dpu-related-issues-installing-a-fresh-dpu-os). For the full DPU troubleshooting workflow, see [`WaitingForNetworkConfig` and DPU health](/infra-controller/documentation/playbooks/stuck-objects/stuck-in-waiting-for-network-config-and-dpu-health).

## Firmware Upgrades

NICo manages DPU firmware as part of the same managed host lifecycle. DPU firmware inventory comes from Redfish and hardware discovery. The configured firmware baseline is stored in the site configuration under `dpu_config`.

### Managed Firmware Components

NICo tracks the following DPU firmware components:

| Component | Inventory name | Notes |
|---|---|---|
| DPU NIC firmware | `DPU_NIC` | Primary NIC firmware on the BlueField. |
| DPU BMC firmware | `BMC_Firmware` | Controls the DPU management controller. |
| DPU UEFI firmware | `DPU_UEFI` | DPU boot firmware. |
| ATF / ERoT firmware | `Bluefield_FW_ERoT` | Arm Trusted Firmware or External Root of Trust. |

### What Triggers a Firmware Upgrade

Firmware upgrades can be triggered in two ways:

- **Automatic selection by Machine Update Manager**: Machine Update Manager monitors **DPU NIC firmware** versions. If a healthy `Ready` managed host has DPU NIC firmware outside the configured `dpu_nic_firmware_update_versions`, it queues a DPU reprovisioning request. During reprovisioning, NICo verifies and updates all DPU firmware components (BMC, CEC/ERoT, NIC) against the configured baseline, but only NIC firmware version drift triggers the automatic reprovisioning.
- **Manual operator request**: An operator triggers DPU reprovisioning using the CLI (see [DPU Reprovisioning](#dpu-reprovisioning) below). Firmware is always verified and updated during any reprovisioning flow.

### How Upgrades Are Staged

Machine Update Manager stages upgrades so the site does not take too many hosts out of service at once. Before scheduling an additional update, it evaluates:

- How many managed hosts are already in a maintenance or update state.
- How many managed hosts are currently unhealthy.
- The configured maximum concurrent update policy for the site.

A DPU update is treated as a host-level maintenance event because the host and its DPU or DPUs are updated together. During an update, NICo applies a `HostUpdateInProgress` health alert with the `PreventAllocations` classification, which keeps tenants from acquiring the host while work is in progress.

Operators can inspect DPU firmware status with:

```bash
carbide-admin-cli -c <api-url> dpu versions
```

## Containerized Cumulus and NVUE

After the DPU OS is installed, the `dpu-agent` keeps HBN configured by applying NVUE configuration generated from NICo Core state. The configuration covers:

- Host admin network and tenant interfaces.
- VPC/VNI assignments and route server peering.
- DHCP server settings for the attached host.
- Network Security Group rules and isolation behavior.

### Configuration Versioning

Configuration is versioned. NICo maintains separate version numbers for `managedhost_network_config` (site controller lifecycle changes) and `instance_network_config` (tenant-driven changes). NICo only considers the DPU synchronized when the DPU reports the expected versions for both and reports itself healthy.

After any configuration change, the `dpu-agent` raises a `PostConfigCheckWait` alert for approximately 30 seconds. This brief hold gives the DPU time to verify that the new configuration is stable (BGP sessions re-establish, services restart) before NICo treats it as applied.

### Isolation Behavior

If the `dpu-agent` calls `GetManagedHostNetworkConfig` and receives a `NotFound` error (the site controller does not recognize this DPU), the agent automatically configures the DPU into an isolated mode. This prevents unknown or removed DPUs from consuming network resources.

## DPU Health Monitoring

DPU health is part of aggregate host health. NICo combines reports from `dpu-agent`, BMC health monitoring, inventory monitoring, validation, and operator overrides. For the full health model, see [Health Checks and Health Aggregation](/infra-controller/documentation/architecture/health-checks-and-health-aggregation).

### What `dpu-agent` Checks

The `dpu-agent` runs periodic health checks and includes the results in every `RecordDpuNetworkStatus` report. The checks cover:

| Health probe | What it checks |
|---|---|
| BGP peering | Sessions established to all configured TOR and route server peers. |
| Required services | Mandatory DPU services (HBN container, DHCP, etc.) are running. |
| Restricted mode | DPU is not in an unexpected restricted mode. |
| Disk utilization | DPU filesystem usage is below the configured threshold. |
| DHCP server/relay | The host-facing DHCP server or relay is responding. |
| HBN/NVUE health | Containerized Cumulus configuration applied and functional. |

### How Health Drives Lifecycle Decisions

NICo uses DPU health to gate state transitions and allocation:

- If the DPU has not recently reported that it is up, healthy, and synchronized to the desired configuration, the managed host state does not advance.
- If the health report contains alerts with the `PreventAllocations` classification, the host is not available for new tenant allocation.
- If the `dpu-agent` stops sending reports entirely, NICo records a `HeartbeatTimeout` health alert against `forge-dpu-agent`.

### Investigating Unhealthy DPUs

When a DPU becomes unhealthy, inspect the managed host state and DPU health report:

```bash
carbide-admin-cli -c <api-url> managed-host show <machine-id>
carbide-admin-cli -c <api-url> machine network status
```

Key fields to check in the output:

- **Health Probe Alerts**: which specific check failed (e.g., `HeartbeatTimeout`, `BgpStats`, `ServiceRunning`).
- **Last seen**: when the DPU last reported to NICo. A stale timestamp suggests the DPU agent has crashed or the DPU is offline.
- **State SLA**: if the host has been in its current state longer than the SLA, the output shows `In State > SLA: true` with the breach reason.

For the full troubleshooting workflow, including how to check logs via Grafana/Loki, verify DPU liveliness, restart the agent, and diagnose specific health probe alerts, see [`WaitingForNetworkConfig` and DPU health](/infra-controller/documentation/playbooks/stuck-objects/stuck-in-waiting-for-network-config-and-dpu-health).

## DPU Reprovisioning

DPU reprovisioning reinstalls the DPU OS and then waits for discovery, network configuration, and DPU health to converge again. It is used for planned firmware updates, DPU recovery, and cases where a DPU must be returned to a known clean state.

### What Happens During Reprovisioning

The reprovisioning state machine runs through the following stages:

1. Check whether BFB install via Redfish is supported for the DPU (BMC firmware >= 24.10 and secure boot enabled).
2. If supported, install the DPU OS via Redfish `UpdateService` (`ReprovisionState::InstallDpuOs`). If not, boot the DPU via UEFI HTTP for a network install (`ReprovisionState::WaitingForNetworkInstall`).
3. Power the host off and back on so the DPU image and firmware take effect.
4. Verify DPU firmware versions (BMC, CEC/ERoT, NIC) against the configured baseline and update any that do not match.
5. Wait for DPU network configuration and health to synchronize.
6. Clear the DPU reprovisioning request and return the managed host to the appropriate state.

### When Automatic Reprovisioning Is Triggered

Automatic DPU reprovisioning is triggered when Machine Update Manager selects an eligible `Ready` host whose DPU NIC firmware is outside the configured baseline. It queues a DPU reprovisioning request for the host.

### Triggering Reprovisioning Manually

The API requires a `HostUpdateInProgress` health alert on the host before it accepts a reprovisioning request. Use `--update-message` to apply this alert:

```bash
carbide-admin-cli -c <api-url> dpu reprovision set \
  --id <host-or-dpu-machine-id> \
  --update-message "<maintenance-reference>"
```

Firmware is always verified and updated during reprovisioning regardless of whether `--update-firmware` is passed. The `--update-firmware` flag is accepted but deprecated.

### Monitoring Reprovisioning Progress

```bash
carbide-admin-cli -c <api-url> dpu reprovision list
carbide-admin-cli -c <api-url> managed-host show <machine-id>
```

The `managed-host show` output displays the current reprovisioning substate, percent complete for BFB installation (when available), and any handler errors.

### Additional Reprovisioning Commands

To restart a DPU reprovisioning flow for all DPUs on a host:

```bash
carbide-admin-cli -c <api-url> dpu reprovision restart --id <host-machine-id>
```

To clear a pending reprovisioning request that has not started:

```bash
carbide-admin-cli -c <api-url> dpu reprovision clear --id <host-or-dpu-machine-id>
```

For the complete reprovisioning state machine, see [DPU Reprovision State Details](/infra-controller/documentation/architecture/state-machines/managed-host#dpu-reprovision-state-details-dpureprovisionstate).