The noDpuLogsWarning alert fires under the following conditions:
The format of the alert name is “<NICo site ID>-noDpuLogsWarning (<NICo site ID> <DPU ARM OS hostname> nico-monitoring/nico-monitoring-<NICo site ID>-prometheus warning)”.
The machine is currently being re-provisioned and provisioning is taking longer than expected.
The machine is being worked on by another SRE team member. It might be powered off, undergoing maintenance, or force-deleted.
Issues with systemd services on the DPU ARM OS.
On the DPU ARM OS, check that the node-exporter, otelcol-contrib, and nico-dpu-otel-agent services are running and not reporting errors.
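A minimal sketch of that check, assuming the systemd unit names match the service names given above (this has not been confirmed against a real DPU image). The helper names `check_services` and `recent_logs` are hypothetical and exist only for illustration.

```shell
# Unit names are assumed to match the service names in the text.
SERVICES="node-exporter otelcol-contrib nico-dpu-otel-agent"

# Print one status line per service; return non-zero if any is not active.
check_services() {
  failed=0
  for svc in $SERVICES; do
    if systemctl is-active --quiet "$svc"; then
      echo "OK      $svc"
    else
      echo "FAILED  $svc"
      failed=1
    fi
  done
  return "$failed"
}

# Show recent log lines for one service so that errors are visible.
recent_logs() {
  journalctl -u "$1" --since "15 minutes ago" --no-pager
}
```

For example, `check_services || recent_logs otelcol-contrib` prints the status of all three services and, if any is not active, the last 15 minutes of collector logs.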
If the hostname used by the otelcol-contrib service is reported as localhost (host_name="localhost"), the service is misconfigured: host_name should be set to the hostname of the DPU ARM OS. To resolve this issue, restart the OpenTelemetry Collector service (otelcol-contrib).
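The restart can be sketched as follows, assuming the collector runs as a systemd unit named otelcol-contrib and that the commands are run as root or via sudo. The function name `restart_collector` is hypothetical.

```shell
# Restart the collector, then confirm that systemd reports it as active.
# The unit name otelcol-contrib is assumed to match the service name in the text.
restart_collector() {
  systemctl restart otelcol-contrib || return 1
  systemctl is-active --quiet otelcol-contrib
}
```

A non-zero exit status from `restart_collector` means either the restart itself failed or the unit did not come back as active.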
Wait 5 minutes after restarting the service, then check the metrics again.
The host_name should now be set to the hostname of the DPU ARM OS, for example 192-168-134-165.nico.example.org.
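One way to make the before/after comparison less error-prone is to filter the output for the host_name label. The sketch below assumes the label appears in the form host_name="<value>", as in the localhost case described earlier. It reads text from standard input, so it makes no assumption about where that output comes from. Both function names are hypothetical.

```shell
# Succeeds if the text on stdin still reports localhost as the host name.
reports_localhost() {
  grep -q 'host_name="localhost"'
}

# Succeeds if the text on stdin reports the expected host name ($1).
reports_hostname() {
  grep -qF "host_name=\"$1\""
}
```

Pipe in whichever output shows the label. For example, if it appears in the collector logs, `journalctl -u otelcol-contrib --since "5 minutes ago" --no-pager | reports_localhost` would succeed only while the problem persists.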
If errors are still being reported for the endpoint even though it is reachable on the network (you can ping it, SSH to the DPU ARM OS, and all services appear to be running without errors), try restarting the nico-hardware-health pod to see whether that resolves the issue.
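A hedged sketch of the pod restart, assuming the pod name contains nico-hardware-health and that a controller (such as a Deployment) recreates the pod after deletion. Neither assumption is confirmed here, and the namespace is discovered rather than hard-coded because it is not stated. The function names are hypothetical.

```shell
# List pods whose name contains nico-hardware-health as "namespace name" pairs.
find_health_pods() {
  kubectl get pods --all-namespaces --no-headers |
    awk '$2 ~ /nico-hardware-health/ {print $1, $2}'
}

# Delete each matching pod; this assumes a controller recreates it afterwards.
restart_health_pods() {
  find_health_pods | while read -r ns pod; do
    kubectl delete pod "$pod" --namespace "$ns" </dev/null
  done
}
```

Run `find_health_pods` on its own first to confirm that exactly the expected pod is matched before calling `restart_health_pods`.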