Whenever the configuration of a ManagedHost changes (Instance gets created,
Instance gets deleted, Provisioning), NICo requires the nico-dpu-agent to
acknowledge that the desired DPU configuration is applied and that the DPU and
services running on it (like HBN) are in a healthy state.
This feedback mechanism works in the following fashion:
nico-dpu-agent periodically calls GetManagedHostNetworkConfig. It thereby
obtains the latest configuration for all interfaces, including the configuration
which states whether the Host should get attached to an admin or tenant network.
The configuration includes version numbers, which increase whenever the
configuration changes.

nico-dpu-agent reports the version numbers of the currently applied
configurations back to NICo using the RecordDpuNetworkStatus API.
This report also includes the DPU's health in the form of a HealthReport.

If the DPU has not recently reported that it is up, healthy, and that the latest desired configuration is applied, the state will not be advanced.
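The check described above can be sketched as a small Python model. This is only an illustration of the three conditions (recent report, healthy, desired version applied), not NICo's actual implementation. The `DpuStatus` type, the `blocking_conditions` function, and the five-minute `max_report_age` default are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DpuStatus:
    """Hypothetical snapshot of what NICo last heard from one DPU."""
    last_report_at: datetime      # when the last RecordDpuNetworkStatus call arrived
    applied_config_version: str   # version the agent says it has applied
    health_alerts: list           # names of failing probes, empty when healthy


def blocking_conditions(status, desired_version, now,
                        max_report_age=timedelta(minutes=5)):
    """Return the reasons the state must not advance; an empty list means it may.

    max_report_age is an assumed threshold, not a documented NICo value.
    """
    reasons = []
    if now - status.last_report_at > max_report_age:
        reasons.append("no recent report from nico-dpu-agent")
    if status.health_alerts:
        reasons.append("health alerts: " + ", ".join(status.health_alerts))
    if status.applied_config_version != desired_version:
        reasons.append(
            f"applied {status.applied_config_version}, desired {desired_version}"
        )
    return reasons


if __name__ == "__main__":
    now = datetime(2023, 12, 13, 17, 0, tzinfo=timezone.utc)
    stale = DpuStatus(now - timedelta(hours=1), "V2-T1702485344893918", [])
    # Versions match and no alerts, but the report is an hour old.
    print(blocking_conditions(stale, "V2-T1702485344893918", now))
```

Returning a list of reasons instead of a single boolean mirrors the troubleshooting flow: an operator needs to know which condition is not met, not only that one of them failed.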
If a ManagedHost is stuck due to this check, you can determine which condition is not met by inspecting the last report from the Host and DPUs.
E.g. in the following report:

The Health field indicates whether any of the health checks failed. In this case we can see an alert of the HeartbeatTimeout probe, with target nico-dpu-agent. That indicates that no
HealthReport had been received from nico-dpu-agent via a RecordDpuNetworkStatus
API call for a certain amount of time.

Health of a Host is the aggregation of Health states from
monitoring by nico-dpu-agent, out of band BMC monitoring (hardware-health),
and the results of validation tests. If the health check failure also shows up
in the Health field of the DPU, then the failure is related to the DPU,
and/or has been reported by nico-dpu-agent.
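The aggregation and the "does it also show up on the DPU" check can be illustrated with a short Python sketch. The data shapes and the function names `aggregate_host_health` and `dpu_related` are hypothetical; only the source names (nico-dpu-agent, hardware-health, validation tests) come from the description above.

```python
def aggregate_host_health(sources):
    """Merge per-source alert lists into one host-level view.

    `sources` maps a source name to the probe names currently alerting there.
    The result maps each alerting probe to the sources that reported it.
    """
    merged = {}
    for source, alerts in sources.items():
        for probe in alerts:
            merged.setdefault(probe, []).append(source)
    return merged


def dpu_related(host_alerts, dpu_alerts):
    """Alerts present on both the Host and the DPU point at the DPU."""
    return sorted(set(host_alerts) & set(dpu_alerts))


if __name__ == "__main__":
    host = aggregate_host_health({
        "nico-dpu-agent": ["HeartbeatTimeout"],
        "hardware-health": [],
        "validation-tests": [],
    })
    print(host)
    print(dpu_related(host, ["HeartbeatTimeout"]))
```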
If a health check has failed, then the root cause of the failed health check
needs to be remediated.

The timestamp of the last report shows whether the DPU (nico-dpu-agent) is up and running.
If the timestamp is too old, it might indicate the DPU agent has crashed or the
whole DPU is no longer online. In such a case a HeartbeatTimeout alert on the
DPU and Host would be raised too.

The network status details show:

In this case we learn that the DPU was alive before, and acknowledged network config version V2-T1702485344893918. This is still the desired network configuration version for
this DPU. The target configuration for a DPU can be found in the Network Config block of the DPU page in the Admin Web UI.
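When comparing an acknowledged version with the desired one, it can help to split the version string into its parts. The sketch below assumes the format is `V<number>-T<timestamp>` and that the timestamp is microseconds since the Unix epoch. Both are assumptions inferred from the example value, not documented behavior, and `parse_config_version` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta, timezone

# ASSUMPTION: versions look like "V<number>-T<microseconds since Unix epoch>".
VERSION_RE = re.compile(r"^V(\d+)-T(\d+)$")


def parse_config_version(text):
    """Split a version string into (version number, assumed change time in UTC)."""
    match = VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"unexpected version format: {text!r}")
    number = int(match.group(1))
    micros = int(match.group(2))
    # Integer arithmetic avoids float rounding of the microsecond part.
    seconds, remainder = divmod(micros, 1_000_000)
    changed_at = datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(
        microseconds=remainder
    )
    return number, changed_at


if __name__ == "__main__":
    number, changed_at = parse_config_version("V2-T1702485344893918")
    print(number, changed_at.isoformat())
```

Under these assumptions, the version from the example is version number 2 with a change time in December 2023. For the stuck-state check itself only equality with the desired version matters.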
The summary for this example is that the Machine is stuck because nico-dpu-agent on the DPU is not reporting back to NICo.

Operators can try SSHing to the DPU, using the DPU OOB address that is shown on ManagedHost pages and DPU details pages. If SSH fails, the DPU might not be up and running.
If directly SSHing to the DPU does not work, it can be accessed via its BMC and rshim to investigate its state.
In case the DPU is running, nico-dpu-agent logs can be inspected in order to
learn why it cannot communicate with NICo, or why applying the configuration
might have failed. There are various options for this.
nico-dpu-agent logs are forwarded via OpenTelemetry to the site controller logging infrastructure. They can be queried from there via Loki.
Search strings for DPU can be:
Note that the query using the MachineId will only work if the DPU has been fully ingested at some point
and is aware of its Machine ID. Otherwise only searches by host_name will work.
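The log selectors used in this document follow the usual Loki stream-selector shape of `{label="value", ...}`. A small Python helper can build them for either label. `loki_selector` is a hypothetical helper, and the `host_name` value in the usage example is a made-up placeholder. The label names `machine_id`, `host_name`, and `log_file_path` are the ones mentioned in this document.

```python
def loki_selector(**labels):
    """Build a LogQL stream selector such as {machine_id="..."}.

    Backslashes and double quotes in values are escaped so the selector stays valid.
    """
    parts = []
    for name, value in labels.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "{" + ", ".join(parts) + "}"


if __name__ == "__main__":
    # Search by Machine ID (only works once the DPU knows its Machine ID).
    print(loki_selector(
        machine_id="fm100ds02e5g65099ov37rmho1gnge0c99ihdisvluo4fls1ba3br9bksg0"
    ))
    # Fallback: search by host name (placeholder value).
    print(loki_selector(host_name="example-dpu-hostname"))
```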
In case the DPU problem affects log forwarding, DPU logs need to be checked directly on the DPU.
The dpu agent logs are stored in the systemd journal on the DPU. They can be queried using
Depending on the problems that are found in dpu-agent logs, it can be useful to check other logs that are available on the DPU. Examples are
nl2doca logs: {machine_id="fm100ds02e5g65099ov37rmho1gnge0c99ihdisvluo4fls1ba3br9bksg0", log_file_path="/var/log/doca/hbn/nl2docad.log"}
syslog: {machine_id="fm100ds02e5g65099ov37rmho1gnge0c99ihdisvluo4fls1ba3br9bksg0", log_file_path="/var/log/doca/hbn/syslog"}
nvue logs
frr logs

⚠️ Note that while a Tenant uses a Machine as an instance, powercycling the Host will interrupt their workloads. Only perform these steps if it is clear that the Tenant no longer requires the Machine (it is stuck in termination), or if the Tenant agrees with this action.
If the DPU is unresponsive, powering off the Host and back on can help. This will restart the DPU.
The Host can be powercycled using the Explored-Endpoint view in the Admin Web UI. The DPU Machine details page links to the explored endpoint: click on the DPU BMC IP.
If nico-dpu-agent is not even started, then it needs to be enabled and started (systemctl enable --now nico-dpu-agent.service; enable on its own only configures the unit to start at boot).
This should however never be necessary, since the agent gets restarted on all
crashes.
If nico-dpu-agent should just be restarted, use
In rare situations, it might be useful to restart nico-dpu-agent using the latest nico-dpu-agent systemd config files. To do so:
BgpStats
The BgpStats health probe indicates that BGP peering with the TOR or route server is not successful. This might either indicate a link issue or a configuration issue. The BGP details can be checked on the DPU using
ServiceRunning
Indicates that mandatory DPU services are not running. Next steps in the investigation
can be to check whether the HBN container is running on the DPU (crictl ps should
show doca-hbn container), and to search for associated logs.
DhcpRelay/DhcpServer
Indicates that the DHCP Relay or Server that NICo deploys on the DPU in order to respond to DHCP requests from the Host is not running as intended. In this condition, the Host would not be able to boot, since nothing would respond to its DHCP requests.
Next steps in the investigation would be to check nico-dpu-agent logs for details.
PostConfigCheckWait
This alert is only raised for a brief time after each configuration change in order to wait for the configuration to settle on the DPU. The alert should always clear after less than a minute. If the alert stays raised, it can indicate that new configurations are applied in every nico-dpu-agent event loop iteration. In this case, determine what changed in the configurations and fix the source of the unnecessary configuration changes.
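To find out what changes between two consecutive configurations, a generic nested diff is often enough. The Python sketch below assumes the two configurations have been captured as nested dictionaries (for example from logged or dumped JSON). How they are captured is outside this sketch, and `changed_keys` and the key names in the usage example are hypothetical.

```python
def changed_keys(previous, current, prefix=""):
    """Return dotted paths whose values differ between two nested dict configs."""
    paths = []
    for key in sorted(set(previous) | set(current)):
        path = f"{prefix}{key}"
        old, new = previous.get(key), current.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse so the report names the exact leaf that changed.
            paths.extend(changed_keys(old, new, prefix=path + "."))
        elif old != new:
            paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical configs from two consecutive event loop iterations.
    first = {"interfaces": {"pf0": {"mtu": 9000}}, "routes": ["a", "b"]}
    second = {"interfaces": {"pf0": {"mtu": 9000}}, "routes": ["b", "a"]}
    # Lists are compared by order, so a reordered list is reported as a change.
    print(changed_keys(first, second))
```

A list whose element order is not stable between iterations is one generic way a configuration can look different every time without any real change. Whether that applies here has to be confirmed from the actual configurations.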