Monitoring and Health
This page covers monitoring and health workflows for NICo sites after deployment: hardware health, DPU health, aggregate host health, health overrides, Prometheus scraping, Grafana dashboards, and Loki queries.
Use aggregate host health as the starting point for operational decisions. NICo combines hardware health, DPU health, validation and discovery checks, rack health, and health overrides into a single host-level result. Component health explains which source is responsible for the aggregate result.
Use this page as the entry point for health triage. It gives the primary inspection path, commands, metrics, dashboards, and log queries needed to start an investigation. For subsystem-specific behavior, follow the linked hardware, DPU, health aggregation, and classification references rather than treating this page as a replacement for those manuals.
For reference, see:
Health Sources
NICo builds health from health reports. A health report contains successes and alerts from a reporting source. Common health sources are:
Each alert has an ID, an optional target, a message, a start time, and one or
more classifications. Classifications define operational impact. For example,
PreventAllocations blocks new allocations while the alert is active, and
ExcludeFromStateMachineSla excludes the host from state-machine SLA
evaluation.
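The alert fields and the effect of a classification can be sketched as a small data model. This is an illustration only, not NICo's actual types: the `Alert` class, its field names, and the `blocks_allocations` helper are assumptions made for the example. Only the field list and the `PreventAllocations` classification name come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Alert:
    """Hypothetical model of a health alert (field names are illustrative)."""
    id: str
    message: str
    in_alert_since: datetime
    classifications: FrozenSet[str]
    target: Optional[str] = None  # the target is optional

    def __post_init__(self):
        # The text states that an alert carries one or more classifications.
        if not self.classifications:
            raise ValueError("an alert needs at least one classification")


def blocks_allocations(alerts: Iterable[Alert]) -> bool:
    """True while any active alert carries PreventAllocations."""
    return any("PreventAllocations" in a.classifications for a in alerts)


powered_off = Alert(
    id="PoweredOff",
    message="example message",
    in_alert_since=datetime(2024, 1, 1, tzinfo=timezone.utc),
    classifications=frozenset({"PreventAllocations"}),
    target="<bmc-ip>",
)
print(blocks_allocations([powered_off]))
```

The same pattern applies to any other classification: the alert ID says what happened, and the classifications say what the alert does operationally.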
Hardware Health Monitoring
NICo monitors hardware through the hardware health service. The Helm chart is
nico-hardware-health.
The service discovers BMC endpoints from NICo and queries them through Redfish. It monitors host BMCs, DPU BMCs, and configured switch or power-shelf BMCs. The primary monitoring path is sensor collection. Additional collectors can gather firmware, log, NMX-T, NVUE REST, and leak-related data when configured.
Helm Configuration
Enable hardware health in Helm values:
Enable metrics scraping with its ServiceMonitor:
By default, the chart exposes hardware health metrics on port 9009. Log
collection is disabled by default:
Enable log collection only through the target site’s deployment values.
Hardware Health Service Configuration
The hardware health service config example,
crates/health/example/config.example.toml, documents endpoint discovery,
sinks, rate limits, collectors, processors, and metrics.
Production endpoint discovery uses the NICo API source. The checked-in
hardware-health example config currently names that source
[endpoint_sources.nico_api]:
Static BMC endpoints are supported for local, mock, or special deployments:
Collector defaults from the example config:
BMC Proxy
The full Helm example enables nico-bmc-proxy, the authenticating proxy for BMC
Redfish access. The proxy chart exposes proxy traffic on port 1079 and metrics
on port 1080.
Example proxy settings:
The BMC proxy ServiceMonitor follows the same serviceMonitor.enabled,
interval, and scrapeTimeout pattern as other NICo services.
Sensor Alerts
Hardware sensor alerts are derived from BMC-reported health, sensor readings, and thresholds. Sensor classifications include:
If numeric threshold data indicates a problem but the BMC reports the sensor as healthy, NICo treats the sensor as healthy. In that case the BMC health state is the authority.
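The precedence rule above can be written as a short sketch. The function name, the `SensorVerdict` type, and the threshold parameter names are hypothetical. The sketch also assumes that a sensor the BMC reports as unhealthy is treated as unhealthy, which the text implies but does not state.

```python
from typing import NamedTuple, Optional


class SensorVerdict(NamedTuple):
    healthy: bool
    threshold_breached: bool  # informational only; does not change `healthy`


def evaluate_sensor(
    bmc_healthy: bool,
    reading: float,
    upper_critical: Optional[float] = None,
    lower_critical: Optional[float] = None,
) -> SensorVerdict:
    """Apply the rule that the BMC health state is the authority."""
    breached = (
        (upper_critical is not None and reading > upper_critical)
        or (lower_critical is not None and reading < lower_critical)
    )
    # The numeric comparison is recorded but never overrides the BMC state.
    return SensorVerdict(healthy=bmc_healthy, threshold_breached=breached)


# A reading above its threshold on a sensor the BMC calls healthy stays healthy.
print(evaluate_sensor(True, 95.0, upper_critical=90.0))
```

Keeping the threshold result as a separate field makes a disagreement between the BMC state and the raw reading visible during triage without raising an alert.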
Hardware Health Logs
Use Loki or Grafana Explore to inspect hardware health logs for a host:
Health report events include fields such as:
For leak-related events, look for:
DPU Health Checks
dpu-agent runs on managed DPUs and reports DPU health to NICo. The BlueField
chart is named nico-dpu-agent. In service names and logs, the DPU agent
currently appears as forge-dpu-agent.service.
The agent checks DPU service health, networking state, HBN/NVUE configuration, DHCP behavior, BGP status, and heartbeat. DPU health is part of aggregate host health, so an otherwise healthy host can still be unavailable when its DPU is unhealthy.
See also:
DPU Agent Configuration
Key nico-dpu-agent chart values:
The DaemonSet renders these core arguments:
If dhcp_server.interface_prepend is set, the chart also adds:
The pod sets these runtime environment variables:
Common DPU Alerts
Common DPU alert IDs include:
- ContainerExists
- ServiceRunning
- DhcpServer
- BgpStats
- BgpPeeringTor
- BgpPeeringRouteServer
- Ifreload
- BgpDaemonEnabled
- PostConfigCheckWait
- DpuDiskUtilizationCritical
- HeartbeatTimeout
HeartbeatTimeout means NICo has not received a recent health report
from the DPU agent. Check whether the DPU is powered, the agent is running, DPU
time is correct, and the DPU can reach NICo.
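A staleness check of this kind can be sketched as follows. The timeout and skew tolerance are placeholder values chosen for the example, not NICo's configured settings, and the function is an assumption about the general technique rather than NICo's implementation. The skew branch shows why DPU time matters: a report timestamp that lies in the future points to a clock problem rather than a silent agent.

```python
from datetime import datetime, timedelta, timezone

# Placeholder values for illustration; not taken from NICo.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)
CLOCK_SKEW_TOLERANCE = timedelta(seconds=30)


def heartbeat_status(
    last_report_at: datetime,
    now: datetime,
    timeout: timedelta = HEARTBEAT_TIMEOUT,
    skew: timedelta = CLOCK_SKEW_TOLERANCE,
) -> str:
    """Classify the age of the most recent DPU health report."""
    if last_report_at - now > skew:
        # The report claims to come from the future: check DPU time first.
        return "clock-skew-suspected"
    if now - last_report_at > timeout:
        return "timed-out"
    return "ok"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(heartbeat_status(now - timedelta(minutes=10), now))
```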
DPU Logs
Use Loki to inspect DPU-agent logs:
Alternative labels can be used when available:
On the DPU, use journalctl for direct service logs:
Restart the agent when required:
Health Alert Lifecycle
NICo health alerts are source-based. A health source submits a fresh report, and NICo uses the latest report from each source to calculate aggregate health.
A typical alert flow:
- A source reports an alert such as PoweredOff with target <bmc-ip>.
- NICo adds the alert to aggregate host health.
- Classifications such as PreventAllocations define the operational effect.
- The health view shows the alert ID, target, message, start time, and classifications.
- Metrics and logs identify the responsible source.
- After remediation, the source submits a fresh report that marks the check successful or omits the previous alert.
- NICo merges the fresh report and aggregate host health returns to healthy.
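The "latest report per source" model behind this flow can be sketched in a few lines. The `HostHealth` class and the source names are hypothetical, and real reports carry more than alert IDs. The sketch shows only why a fresh report from the same source is what clears an alert.

```python
from typing import Dict, Iterable, List, Tuple


class HostHealth:
    """Keeps only the most recent report from each health source."""

    def __init__(self) -> None:
        self._latest: Dict[str, List[str]] = {}

    def submit(self, source: str, alert_ids: Iterable[str]) -> None:
        # A fresh report replaces the previous report from the same source.
        self._latest[source] = list(alert_ids)

    def aggregate(self) -> List[Tuple[str, str]]:
        # Each aggregate alert keeps its source so triage can find the owner.
        return [
            (source, alert_id)
            for source, alerts in sorted(self._latest.items())
            for alert_id in alerts
        ]

    def healthy(self) -> bool:
        return not self.aggregate()


host = HostHealth()
host.submit("hardware-health", ["PoweredOff"])  # source name is illustrative
host.submit("dpu-agent", [])
print(host.aggregate(), host.healthy())

host.submit("hardware-health", [])  # fresh report after remediation
print(host.aggregate(), host.healthy())
```

Because each source overwrites only its own entry, an alert raised by one source cannot be cleared by a healthy report from another.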
If a health override created the alert, remove or replace the override after the operational reason ends.
Inspect Current Health
Start with the host health page in the Admin Web UI:
Inspect the aggregate health table first. For each alert, note:
- ID
- Target
- In Alert Since
- Message
- Tenant Message
- Classifications
Then inspect component health to identify the source: hardware health, DPU health, validation, discovery, rack health, or health override.
Use the health history table to review recent transitions. This helps identify whether an alert is new, recurring, or already cleared by a later health report.
Admin CLI examples:
Health Overrides
Health overrides add manual or service-created health reports into the same aggregate health model as automated checks. Overrides are a shared health mechanism that applies to every health source; they are not specific to hardware health.
Use overrides for controlled states such as maintenance, validation, repair, break-fix, or temporary automation control. Do not use an override as a substitute for resolving the underlying condition.
Merge and Replace
DPU replace overrides are rejected by the API.
Override Templates
The nico-admin-cli machine health-override add command supports templates
for common workflows:
Examples:
Before creating an override, identify the current aggregate health, choose the smallest effect that matches the workflow, include a clear message, and define the removal condition. After remediation, remove or replace the override and verify aggregate health.
Prometheus Metrics
NICo charts expose Prometheus scraping through ServiceMonitor resources.
ServiceMonitors are disabled by default in chart values and enabled in the full
example for selected services.
Example:
ServiceMonitors from the charts:
Check rendered ServiceMonitors:
Core health metrics currently use carbide_* metric names. Some dashboards and
site configurations also expose host-health rollups with the forge_* prefix.
Use the literal metric name that exists in the target site.
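One way to find out which names a site actually exposes is to list all metric names and filter by prefix. The sketch below shows only the filtering step. It assumes the list was fetched separately, for example from the Prometheus HTTP API label-values endpoint for `__name__`. The sample metric names are placeholders invented for the example, not real NICo metrics.

```python
from typing import Dict, Iterable, List, Sequence


def group_by_prefix(
    metric_names: Iterable[str],
    prefixes: Sequence[str] = ("carbide_", "forge_"),
) -> Dict[str, List[str]]:
    """Group metric names by the first matching prefix; drop the rest."""
    grouped: Dict[str, List[str]] = {p: [] for p in prefixes}
    for name in sorted(set(metric_names)):
        for prefix in prefixes:
            if name.startswith(prefix):
                grouped[prefix].append(name)
                break
    return grouped


# Placeholder names for illustration only.
sample = ["carbide_example_a", "forge_example_b", "up", "carbide_example_c"]
print(group_by_prefix(sample))
```

An empty group for one prefix is a quick signal that dashboards written against that prefix will show no data on the site.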
Useful core metric families:
Use the Host Health dashboard panels for fleet-level rollups, including host health status, health overrides, probe alerts, and alert classifications. Example dashboard queries:
DPU metrics:
API Health and Availability
The NICo API is required for health inspection, health report ingestion, administrative workflows, and state-machine visibility. Check API health before debugging a host-specific health issue.
Check Kubernetes status:
Check API metrics scraping:
Use Loki or Grafana Explore to inspect API logs:
Grafana, Loki, and Logs
Use Grafana dashboards for fleet-level triage and Loki for source-specific logs.
Start from aggregate host health, identify the alert source and inAlertSince,
then query logs around that time.
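Turning inAlertSince into a query window can be sketched as below. The sketch assumes the value is an ISO 8601 / RFC 3339 timestamp, which is an assumption about its format, and it uses an arbitrary 15-minute margin on each side. It returns nanosecond Unix timestamps, the form that Loki range queries accept for their start and end parameters.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def _to_unix_ns(moment: datetime) -> int:
    # Integer arithmetic avoids float rounding on nanosecond values.
    delta = moment - _EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000_000 + delta.microseconds * 1_000


def loki_window(
    in_alert_since: str,
    before: timedelta = timedelta(minutes=15),
    after: timedelta = timedelta(minutes=15),
) -> Tuple[int, int]:
    """Return (start_ns, end_ns) surrounding the alert start time."""
    moment = datetime.fromisoformat(in_alert_since.replace("Z", "+00:00"))
    if moment.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return _to_unix_ns(moment - before), _to_unix_ns(moment + after)


start_ns, end_ns = loki_window("2024-01-01T00:00:00Z")
print(start_ns, end_ns)
```

Widen the `before` margin when the cause is likely to precede the alert, for example a reboot that happened several minutes before a DPU stopped reporting.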
Common Loki patterns:
Some sites expose machine identity as a log label. When that label is present, prefer a label filter over a free-text match:
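The preference can be expressed as a small helper that builds a LogQL query string. This is a sketch: the helper name, the candidate label order, and the `nico-api` container name in the usage line are assumptions made for the example. It selects on a machine-identity label when the site exposes one and otherwise falls back to a `|=` line filter.

```python
from typing import Dict, Iterable, Sequence


def _quote(value: str) -> str:
    # Escape backslashes and double quotes for a LogQL string literal.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def machine_query(
    base_labels: Dict[str, str],
    machine_id: str,
    available_labels: Iterable[str],
    candidates: Sequence[str] = ("machine_id", "machineid", "host_machine_id"),
) -> str:
    """Build a LogQL query, preferring a label matcher over a text match."""
    labels = dict(base_labels)
    available = set(available_labels)
    line_filter = ""
    for name in candidates:
        if name in available:
            labels[name] = machine_id  # indexed label match
            break
    else:
        line_filter = " |= " + _quote(machine_id)  # free-text fallback
    matchers = ", ".join(f"{key}={_quote(val)}" for key, val in labels.items())
    return "{" + matchers + "}" + line_filter


print(machine_query({"k8s_container_name": "nico-api"}, "abc123", {"machine_id"}))
print(machine_query({"k8s_container_name": "nico-api"}, "abc123", set()))
```

A label matcher narrows the set of streams Loki has to read, while a line filter scans every line in the selected streams, so the label form is usually both faster and more precise.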
Console logs are shipped by the nico-ssh-console-rs OpenTelemetry Collector
sidecar when enabled:
The sidecar tails:
It labels console logs with machineid and an SSH console exporter label:
Labels vary by log source. Use the Loki label browser to choose the most specific label available. Common labels include:
- k8s_container_name
- k8s_namespace_name
- k8s_pod_name
- machine_id
- machineid
- host_machine_id
- host_name
- systemd_unit
- exporter
- level
logcli can be used for repeatable terminal-based Loki queries when direct Loki
access is configured for the site. Use the same LogQL selectors shown above.
For example:
Dashboard Starting Points
Use the site-level health dashboard for fleet triage before drilling into logs. Start with these panels when they are available:
For host-health triage, the highest-value panels are Healthy Host Percentage, Health Status, Health Overrides, Health Probe Alerts, and Health Alert Classifications.
Troubleshooting
Triage Workflow
- Open aggregate host health.
- Record the alert ID, target, message, inAlertSince, and classifications.
- Identify the source: hardware health, DPU health, validation, discovery, rack health, or override.
- Use the source-specific metrics and logs for that alert.
- Remediate the underlying condition.
- Wait for a fresh health report from the responsible source.
- Confirm aggregate host health returns to healthy.
- Remove temporary overrides used during the investigation.