For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
GitHub
DocumentationREST API Reference
DocumentationREST API Reference
    • Home
  • Overview
    • What is NICo?
    • Key Capabilities
    • Operational Principles
    • Day 0 / Day 1 / Day 2 Lifecycle
    • Scope and Boundaries
  • Getting Started
    • Building NICo Containers
    • Quick Start Guide
  • Architecture
    • Overview and Components
    • Reliable State Handling
    • Networking Integrations
    • Key Group Synchronization
  • Provisioning (Day 0)
    • Ingesting Hosts
    • Ingesting Hosts (REST API)
    • Machine Validation
    • SKU Validation
    • Measured Boot Attestation
  • Configuration (Day 1)
    • Network Isolation
    • Tenant Management
    • Organization & Permissions
  • Operations (Day 2)
    • Tenant Lifecycle Cleanup
    • Network Isolation
    • Network Security Groups
    • InfiniBand Partitioning
    • NVLink Partitioning
    • Rack-Level Administration (RLA)
    • IP Resource Pools
    • BGP Peering
    • nicocli Reference
      • Azure OIDC for Infra Controller Web UI
      • Force Deleting and Rebuilding Hosts
      • Rebooting a Machine
      • InfiniBand Setup
        • Overview and General Troubleshooting
        • Common Mitigations
        • Stuck in WaitingForNetworkConfig and DPU Health
        • Adding New Machines to an Existing Site
        • Troubleshooting noDpuLogsWarning Alerts
  • Reference
    • Hardware Compatibility List
    • Release Notes
    • FAQs
    • Glossary
GitHub
NVIDIANVIDIA
Developer-friendly docs for your API
Privacy Policy | Your Privacy Choices | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2026, NVIDIA Corporation.

LogoLogo
On this page
  • Follow-up investigation steps
  • Checking DPU liveliness
  • Checking DPU agent logs
  • Checking logs via Grafana & Loki
  • Checking logs on the DPU:
  • Checking additional logs
  • Potential Mitigations
  • Power Cycling the Host
  • Restarting nico-dpu-agent
  • Reloading nico-dpu-agent configurations
  • Mitigations for specific Health Probe Alerts
  • BgpStats
  • ServiceRunning
  • DhcpRelay/DhcpServer
  • PostConfigCheckWait
Operations (Day 2)PlaybooksStuck Objects

WaitingForNetworkConfig and DPU health

||View as Markdown|
Previous

Common Mitigations

Next

Adding New Machines to an Existing Site

Whenever the configuration of a ManagedHost changes (Instance gets created, Instance gets deleted, Provisioning), NICo requires the nico-dpu-agent to acknowledge that the desired DPU configuration is applied and that the DPU and services running on it (like HBN) are in a healthy state.

This feedback mechanism works in the following fashion:

  1. nico-dpu-agent periodically calls GetManagedHostNetworkConfig. It thereby obtains the latest configuration for all interfaces, including the configuration which states whether the Host should get attached to an admin or tenant network. The configuration includes Version numbers, which increase whenever the configuration changes.
  2. nico-dpu-agent reports the version numbers of the currently applied configurations back to NICo using the RecordDpuNetworkStatus API. This report also includes the DPUs health in the form of a HealthReport.

If the DPU has not recently reported that it is up, healthy and that the latest desired configuration is applied, the state will not be advanced.

If a ManagedHost is stuck due to this check, you can inspect which condition is not met by inspecting the last report from the Host and DPUs

  • via nico-admin-cli:
    • nico-admin-cli managed-host show fm100psa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg
    • nico-admin-cli machine show fm100psa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg
    • nico-admin-cli machine show fm100dsa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg
    • nico-admin-cli machine network status

E.g. in the following report

/opt/nico/nico-admin-cli managed-host show fm100psa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg
Hostname : 192-168-18-95
State : DPUInitializing/WaitingForNetworkConfig
Time in State : 296 days and 29 minutes
State SLA : 30 minutes
In State > SLA: true
Reason : The object is in the state for longer than defined by the SLA. Handler outcome: Wait("Waiting for DPU agent to apply network config and report healthy network for DPU fm100dsa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg")
Host:
----------------------------------------
ID : fm100psa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg
Memory : Unknown
Admin IP : 192.168.18.95
Admin MAC : B8:3F:D2:B7:70:64
Health
Probe Alerts : HeartbeatTimeout [Target: nico-dpu-agent]:
Overrides
BMC
Version : Unknown
Firmware Version : Unknown
IP : Unknown
MAC : Unknown
DPU0:
----------------------------------------
ID : fm100dsa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg
State : DPUInitializing/WaitingForNetworkConfig
Primary : true
Failure details : Unknown
Last reboot : 2023-12-13 16:38:08.180734 UTC
Last reboot requested : Unknown/
Last seen : 2023-12-13 17:24:15.454965 UTC
Serial Number : MT2244XZ022R
BIOS Version : BlueField:3.9.3-7-g8f2d8ca
Admin IP : 192.168.134.233
Admin MAC : B8:3F:D2:B7:70:72
BMC
Version : 1
Firmware Version : 2.08
IP : 192.168.134.234
MAC : B8:3F:D2:B7:70:66
Health
Probe Alerts : HeartbeatTimeout [Target: nico-dpu-agent]: No health data was received from DPU
  • The Health field indicates whether any of the health checks failed. In this case we can see an alert of the HeartbeatTimeout probe - with target nico-dpu-agent. That indicates no HealthReport had been received from nico-dpu-agent via a RecordDpuNetworkStatus API call for a certain amount of time.
  • The aggregate Health of a Host is the aggregation of Health states from monitoring by nico-dpu-agent, out of band BMC monitoring (hardware-health), and the results of validation tests. If the health check failure also shows up in the Health field of the DPU, then the failure is related to the DPU, and/or has been reported by nico-dpu-agent. If a health-check has failed, then the root-caused for the failed health-check needs to be remediated.
  • “Last seen” indicates whether the DPU (and nico-dpu-agent) is up and running. If the timestamp is too old, it might indicate the DPU agent has crashed or the whole DPU is no longer online. In such a case a HeartbeatTimeout alert on the DPU and Host would be raised too.

The network status details show:

/opt/nico/nico-admin-cli machine network status
+-------------------------+-------------------------------------------------------------+------------------------+----------+--------------------------------------------+---------------------------------+
| Observed at | DPU machine ID | Network config version | Healthy? | Health Probe Alerts | Agent version |
+=========================+=============================================================+========================+==========+============================================+=================================+
| 2023-12-13 17:24:15.454 | fm100dsa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg | V2-T1702485344893918 | false | HeartbeatTimeout [Target: nico-dpu-agent] | v2023.12-rc1-43-g3322d125f |
+-------------------------+-------------------------------------------------------------+------------------------+----------+--------------------------------------------+---------------------------------+

In this case we learn that the DPU was alive before, and acknowledged network config version V2-T1702485344893918. This is still the desired network configuration version for this DPU. The target configuration for a DPU can be found on the Network Config block the DPU page in the admin Web UI.

The summary for this example is that the Machine is stuck because the DPU

  • is either not healthy at all (e.g. not booted)
  • is not running nico-dpu-agent
  • nico-dpu-agent is not reporting back to NICo

Follow-up investigation steps

Checking DPU liveliness

Operators can try SSHing to the DPU, using the DPU OOB address that is shown on ManagedHost pages and DPU details pages. If SSH fails, the DPU might not be up and running.

If directly SSHing to the DPU does not work, it can be accessed via its BMC and rshim to investigate its state.

Checking DPU agent logs

In case the DPU is running, nico-dpu-agent logs can be inspected in order to learn why it can not communicate with nico, or why the configuration application might have failed. There are various options for this.

Checking logs via Grafana & Loki

nico-dpu-agent logs are forwarded via OpenTelemetry to the site controller logging infrastructure. They can be queried from there via Loki.

Search strings for DPU can be:

{systemd_unit="nico-dpu-agent.service", machine_id="fm100ds006eliqt3u4h65ou9ebrqfq9th2jf39qqki68k9ueu2amearv47g"}
{systemd_unit="nico-dpu-agent.service", host_name="192-168-155-135.nico.example.org"}

Note that the query using the MachineId will only work if the DPU once had been fully ingested and is aware of its Machine ID. Otherwise only searches by host_name will work.

In case the DPU problem affects log forwarding, DPU logs need to be checked directly on the DPU.

Checking logs on the DPU:

The dpu agent logs are stored in the systemd journal on the DPU. They can be queried using

journalctl -u nico-dpu-agent.service -e --no-pager

Checking additional logs

Depending on the problems that are found in dpu-agent logs, it can be useful to check other logs that are available on the DPU. Examples are

  • nl2doca logs: {machine_id="fm100ds02e5g65099ov37rmho1gnge0c99ihdisvluo4fls1ba3br9bksg0", log_file_path="/var/log/doca/hbn/nl2docad.log"}
  • syslog: {machine_id="fm100ds02e5g65099ov37rmho1gnge0c99ihdisvluo4fls1ba3br9bksg0", log_file_path="/var/log/doca/hbn/syslog"}
  • nvue logs
  • frr logs

Potential Mitigations

Power Cycling the Host

⚠️ Note that while a tenant uses a Machine as an instance, powercycling the Host will interrupt their workloads. Only perform these step if its clear that the Tenant no longer requires the Machine (is stuck in termination), or if the Tenant agrees with this action.

If the DPU is unresponsive, powering off the Host and back on can help. This will restart the DPU.

The Host can be powercycled using the Explored-Endpoint view in the Admin Web UI, The DPU Machine details page will link to the explored endpoint by clicking on the DPU BMC IP.

Restarting nico-dpu-agent

If nico-dpu-agent is not even started, then it needs to be started (systemctl enable nico-dpu-agent.service). This should however never be necessary, since the agent gets restarted on all crashes.

If nico-dpu-agent should just be restarted, use

systemctl restart nico-dpu-agent.service

Reloading nico-dpu-agent configurations

In rare situations, it might be useful to restart nico-dpu-agent using latest dpu-agent systemd config files. To do so:

systemctl daemon-reload
systemctl restart nico-dpu-agent.service

Mitigations for specific Health Probe Alerts

BgpStats

The BgpStats health probe indicates that BGP peering with the TOR or route server is not successful. This might either indicate a link issue or a configuration issue. The BGP details can be checked on the DPU using

sudo crictl exec -ti $(sudo crictl ps |grep doca-hbn |awk '{print $1}') vtysh -c 'show bgp summary'

ServiceRunning

Indicates that mandatory DPU services are not running. Next steps in the investigation can be to check whether the HBN container is running on the DPU (crictl ps should show doca-hbn container), and to search for associated logs.

DhcpRelay/DhcpServer

Indicates that the DHCP Relay or Server that NICo deploys on the DPU in order to respond to the DHCP requests from the Host are not running as intended. In these conditions, the Host would not be able to boot since nothing would respond to the DHCP request.

Next steps in the investigation would be to check nico-dpu-agent logs for details.

PostConfigCheckWait

This alert is only raised for a brief time after each configuration change in order to wait for the configuration to settle on the DPU. The alert should always settle down after less than a minute. In case the alert keeps raised, it can indicate that new configurations are applied in every dpu-agent eventloop iteration. In this case it would need to be debugged what in the configurations changed, and the source of the unnecessary configuration changes would need to be fixed.