Monitoring and Health

This page covers monitoring and health workflows for NICo sites after deployment: hardware health, DPU health, aggregate host health, health overrides, Prometheus scraping, Grafana dashboards, and Loki queries.

Use aggregate host health as the starting point for operational decisions. NICo combines hardware health, DPU health, validation and discovery checks, rack health, and health overrides into a single host-level result. Component health explains which source is responsible for the aggregate result.

Use this page as the entry point for health triage. It gives the primary inspection path, commands, metrics, dashboards, and log queries needed to start an investigation. For subsystem-specific behavior, follow the linked hardware, DPU, health aggregation, and classification references rather than treating this page as a replacement for those manuals.

Health Sources

NICo builds health from health reports. A health report contains successes and alerts from a reporting source. Common health sources are:

| Source | What it reports |
| --- | --- |
| Hardware health | BMC and Redfish hardware state, including sensors, chassis status, and leak-related signals when configured. |
| DPU agent | DPU service health, DPU networking health, BGP state, DHCP service health, and agent heartbeat. |
| Validation and discovery | SKU validation, host validation, endpoint discovery, and inventory checks. |
| Rack health | Rack-level health input when rack health reporting is configured. |
| Health overrides | Manual or service-created health reports used for maintenance, repair, validation, or other controlled workflows. |

Each alert has an ID, an optional target, a message, a start time, and one or more classifications. Classifications define operational impact. For example, PreventAllocations blocks new allocations while the alert is active, and ExcludeFromStateMachineSla excludes the host from state-machine SLA evaluation.
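As a rough illustration of the alert shape described above, the following Python sketch models an alert and checks its classifications. The class and field names are assumptions chosen for readability, not the NICo schema; only the classification names PreventAllocations and ExcludeFromStateMachineSla come from this page.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class HealthAlert:
    """Illustrative alert shape. Field names are assumptions, not the NICo schema."""
    id: str
    message: str
    in_alert_since: datetime
    classifications: FrozenSet[str]  # one or more classification names
    target: Optional[str] = None     # optional, for example a BMC IP


def blocks_allocations(alerts: List[HealthAlert]) -> bool:
    """True if any active alert carries the PreventAllocations classification."""
    return any("PreventAllocations" in a.classifications for a in alerts)


def excluded_from_sla(alerts: List[HealthAlert]) -> bool:
    """True if any active alert carries ExcludeFromStateMachineSla."""
    return any("ExcludeFromStateMachineSla" in a.classifications for a in alerts)


# Hypothetical alert used only to exercise the helpers.
powered_off = HealthAlert(
    id="PoweredOff",
    message="BMC reports the host as powered off",
    in_alert_since=datetime(2024, 1, 1, tzinfo=timezone.utc),
    classifications=frozenset({"PreventAllocations"}),
    target="10.0.0.1",
)
print(blocks_allocations([powered_off]))
print(excluded_from_sla([powered_off]))
```

The point of the sketch is that operational impact is read from the classifications, not from the alert ID.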

Hardware Health Monitoring

NICo monitors hardware through the hardware health service, which is deployed by the nico-hardware-health Helm chart.

The service discovers BMC endpoints from NICo and queries them through Redfish. It monitors host BMCs, DPU BMCs, and configured switch or power-shelf BMCs. The primary monitoring path is sensor collection. Additional collectors can gather firmware, log, NMX-T, NVUE REST, and leak-related data when configured.

Helm Configuration

Enable hardware health in Helm values:

nico-hardware-health:
  enabled: true

Enable metrics scraping with its ServiceMonitor:

nico-hardware-health:
  enabled: true
  replicas: 1

  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 25s

By default, the chart exposes hardware health metrics on port 9009. Log collection is disabled by default:

env:
  NICO_HEALTH__COLLECTORS__LOGS__ENABLED: "false"

Enable log collection only through the target site’s deployment values.

Hardware Health Service Configuration

The hardware health service config example, crates/health/example/config.example.toml, documents endpoint discovery, sinks, rate limits, collectors, processors, and metrics.

Production endpoint discovery uses the NICo API source. The checked-in hardware-health example config currently names that source [endpoint_sources.nico_api]:

[endpoint_sources.nico_api]
root_ca = "/var/run/secrets/spiffe.io/ca.crt"
client_cert = "/var/run/secrets/spiffe.io/tls.crt"
client_key = "/var/run/secrets/spiffe.io/tls.key"
api_url = "https://nico-api.forge-system.svc.cluster.local:1079"

Static BMC endpoints are supported for local, mock, or special deployments:

[[endpoint_sources.static_bmc_endpoints]]
ip = "10.0.0.1"
port = 443
mac = "aa:bb:cc:dd:ee:ff"
username = "admin"
password = "secret"

Collector defaults from the example config:

| Area | Parameter | Example value | Meaning |
| --- | --- | --- | --- |
| Rate limiting | bucket_burst | 200 | Burst size for outbound requests. |
| Rate limiting | bucket_replenish | "35ms" | Token replenish interval. |
| Sensor collector | sensor_fetch_interval | "1m" | Sensor polling cadence. |
| Sensor collector | rediscover_interval | "5m" | Sensor inventory rediscovery cadence. |
| Sensor collector | state_refresh_interval | "30m" | Broader BMC state refresh cadence. |
| Sensor collector | sensor_fetch_concurrency | 10 | Concurrent sensor fetch limit. |
| Sensor collector | include_sensor_thresholds | true | Include BMC threshold data when available. |
| Firmware collector | firmware_refresh_interval | "30m" | Firmware refresh cadence. |
| Logs collector | mode | "sse" | Preferred BMC log collection mode. |
| NMX-T collector | scrape_interval | "1m" | Switch telemetry scrape cadence. |
| NVUE REST collector | poll_interval | "1m" | NVUE REST polling cadence. |
| Leak processor | minimum_alerts_per_report | 1 | Leak alert threshold for health reports. |
| Rack leak processor | leaking_tray_threshold | 2 | Rack-level leak threshold. |
| Metrics | endpoint | "0.0.0.0:9009" | Metrics listener. |
| Metrics | prefix | "carbide_hardware_health" | Hardware-health metric prefix. |
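To get a feel for what the rate-limiting values imply, the arithmetic can be sketched in Python. This assumes a conventional token bucket in which one token is added per bucket_replenish interval and up to bucket_burst tokens can accumulate; the service's actual limiter may differ, and the helper names here are hypothetical.

```python
def parse_duration_seconds(value: str) -> float:
    """Parse durations such as "35ms", "1m", or "30m" into seconds."""
    units = (("ms", 0.001), ("s", 1.0), ("m", 60.0), ("h", 3600.0))
    for suffix, factor in units:  # "ms" is checked before "s" and "m"
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    raise ValueError(f"unsupported duration: {value!r}")


def sustained_rate_per_second(replenish: str) -> float:
    """Steady-state request rate, assuming one token per replenish interval."""
    return 1.0 / parse_duration_seconds(replenish)


def seconds_to_send(requests: int, burst: int, replenish: str) -> float:
    """Approximate time to issue `requests` starting from a full bucket."""
    beyond_burst = max(0, requests - burst)
    return beyond_burst * parse_duration_seconds(replenish)


# Example values from the table above.
print(round(sustained_rate_per_second("35ms"), 1))
print(seconds_to_send(300, 200, "35ms"))
```

Under these assumptions the example values allow an initial burst of 200 requests and then a little under 29 requests per second.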

BMC Proxy

The full Helm example enables nico-bmc-proxy, the authenticating proxy for BMC Redfish access. The proxy chart exposes proxy traffic on port 1079 and metrics on port 1080.

Example proxy settings:

listen = "[::]:1079"
metrics_endpoint = "[::]:1080"
allowed_principals = ["spiffe-service-id/nico-api", "spiffe-service-id/nv-dps"]

The BMC proxy ServiceMonitor follows the same serviceMonitor.enabled, interval, and scrapeTimeout pattern as other NICo services.

Sensor Alerts

Hardware sensor alerts are derived from BMC-reported health, sensor readings, and thresholds. Sensor classifications include:

| Classification | Meaning |
| --- | --- |
| SensorWarning | Sensor crossed a caution threshold. |
| SensorCritical | Sensor crossed a critical threshold. |
| SensorFailure | Sensor value is outside the valid range or otherwise invalid. |

If numeric threshold data indicates a problem but the BMC reports the sensor as healthy, NICo treats the sensor as healthy. In that case the BMC health state is the authority.
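The precedence rule above can be sketched in Python as follows. This is a minimal illustration, not the service's implementation: the function names are hypothetical, only upper thresholds are modeled, and the case where the BMC reports a sensor as unhealthy without any threshold being crossed is not covered.

```python
from typing import Optional


def classify_by_thresholds(
    reading: float,
    caution: Optional[float],
    critical: Optional[float],
    valid_min: float,
    valid_max: float,
) -> Optional[str]:
    """Classify a reading against upper thresholds (an assumed, simplified model)."""
    if not (valid_min <= reading <= valid_max):  # also catches NaN readings
        return "SensorFailure"
    if critical is not None and reading >= critical:
        return "SensorCritical"
    if caution is not None and reading >= caution:
        return "SensorWarning"
    return None


def effective_classification(
    bmc_reports_healthy: bool, threshold_result: Optional[str]
) -> Optional[str]:
    """BMC health state is the authority: a healthy BMC verdict wins over thresholds."""
    if bmc_reports_healthy:
        return None
    return threshold_result


by_threshold = classify_by_thresholds(95.0, caution=80.0, critical=90.0,
                                      valid_min=0.0, valid_max=150.0)
print(by_threshold)
print(effective_classification(True, by_threshold))
```

In the example, the numeric reading alone would be classified as critical, but the BMC's healthy verdict suppresses the alert.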

Hardware Health Logs

Use Loki or Grafana Explore to inspect hardware health logs for a host:

{k8s_container_name="nico-hardware-health"} |= "<machine-id>"

Health report events include fields such as:

collector=sensor_collector
report_source=bmc-sensors
machine_id=Some(<machine-id>)
alert_count=<n>
success_count=<n>

For leak-related events, look for:

report_source=tray-leak-detection
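When reviewing exported log lines offline, the key=value fields listed above can be pulled out with a small parser. The sketch below assumes the fields appear space-separated on a single line and that machine_id is wrapped as Some(...), as shown in the field list; the sample line and the function name are hypothetical.

```python
import re
from typing import Dict

FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_report_fields(line: str) -> Dict[str, str]:
    """Extract key=value fields from a health report log line."""
    fields = dict(FIELD_RE.findall(line))
    machine_id = fields.get("machine_id")
    # Unwrap the Some(<machine-id>) form shown in the field list.
    if machine_id and machine_id.startswith("Some(") and machine_id.endswith(")"):
        fields["machine_id"] = machine_id[len("Some("):-1]
    return fields


# Hypothetical log line assembled from the fields listed above.
sample = ("collector=sensor_collector report_source=bmc-sensors "
          "machine_id=Some(machine-abc123) alert_count=2 success_count=14")
fields = parse_report_fields(sample)
print(fields["report_source"], fields["machine_id"], int(fields["alert_count"]))
```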

DPU Health Checks

dpu-agent runs on managed DPUs and reports DPU health to NICo. The BlueField chart is named nico-dpu-agent. In service names and logs, the DPU agent currently appears as forge-dpu-agent.service.

The agent checks DPU service health, networking state, HBN/NVUE configuration, DHCP behavior, BGP status, and heartbeat. DPU health is part of aggregate host health, so an otherwise healthy host can still be unavailable when its DPU is unhealthy.

DPU Agent Configuration

Key nico-dpu-agent chart values:

| Config area | Value | Meaning |
| --- | --- | --- |
| certsDir | site certificate directory | Host certificate directory mounted into the agent container. |
| securityContext.capabilities.add | NET_ADMIN, SYS_ADMIN, DAC_OVERRIDE, NET_RAW | Linux capabilities used for network and system operations. |
| hbn.nvue_https_address | nvue | NVUE service name used by the agent. |
| hbn.nvue_credentials_secret_name | site value | Secret containing NVUE credentials. |
| hbn.nvue_password_key | password | Secret key for the NVUE password. |
| dhcp_server.interface_prepend | empty by default | Optional DHCP interface prefix argument. |
| dhcp_server.service_name | set by DPF service integration | DHCP gRPC service name. |
| fmds.service_name | set by DPF service integration | FMDS gRPC service name. |

The DaemonSet renders these core arguments:

nico-dpu-agent run
--hbn-config-mode=nvue-rest
--agent-platform-type=containerized
--dhcp-grpc-server=http://<dhcp-grpc-service>:10079
--fmds-grpc-server=http://<fmds-grpc-service>:50052

If dhcp_server.interface_prepend is set, the chart also adds:

--dhcp-server-interface-prepend=<prefix>

The pod sets these runtime environment variables:

| Environment variable | Source / value |
| --- | --- |
| POD_IP | Kubernetes pod IP field reference. |
| NODE_NAME | Kubernetes node name field reference. |
| POD_NAME | Kubernetes pod name field reference. |
| POD_NAMESPACE | Kubernetes namespace field reference. |
| IGNORE_MGMT_VRF | 1. |
| NVUE_HTTPS_ADDRESS | DPF cluster NVUE endpoint. |
| NVUE_USERNAME | NVUE user configured for the deployment. |
| NVUE_PASSWORD | Secret key from hbn.nvue_credentials_secret_name. |
| RUST_LOG | info. |

Common DPU Alerts

Common DPU alert IDs include:

  • ContainerExists
  • ServiceRunning
  • DhcpServer
  • BgpStats
  • BgpPeeringTor
  • BgpPeeringRouteServer
  • Ifreload
  • BgpDaemonEnabled
  • PostConfigCheckWait
  • DpuDiskUtilizationCritical
  • HeartbeatTimeout

HeartbeatTimeout means NICo has not received a recent health report from the DPU agent. Check whether the DPU is powered, the agent is running, DPU time is correct, and the DPU can reach NICo.
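The staleness check behind a heartbeat alert, and the reason DPU time matters, can be sketched in Python. The timeout value and the assumption that reports carry a timestamp taken from the DPU clock are illustrative only; NICo's actual threshold and timestamp handling are not described on this page.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_timed_out(last_report_at: datetime, now: datetime,
                        timeout: timedelta) -> bool:
    """True when the newest report is older than the allowed window."""
    return now - last_report_at > timeout


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
timeout = timedelta(minutes=5)  # hypothetical threshold

# A report stamped one minute ago is within the window.
print(heartbeat_timed_out(now - timedelta(minutes=1), now, timeout))

# If the DPU clock runs 10 minutes slow and reports are stamped with DPU
# time (an assumption), a report sent just now already looks stale.
skewed_stamp = now - timedelta(minutes=10)
print(heartbeat_timed_out(skewed_stamp, now, timeout))
```

Under that assumption, a running agent on a DPU with a skewed clock is indistinguishable from an agent that stopped reporting, which is why time sync is part of the checklist.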

DPU Logs

Use Loki to inspect DPU-agent logs:

{systemd_unit="forge-dpu-agent.service", machine_id="<machine-id>"}

Alternative labels can be used when available:

{systemd_unit="forge-dpu-agent.service", host_name="<host-name>"}

On the DPU, use journalctl for direct service logs:

$ journalctl -u forge-dpu-agent.service -e --no-pager

Restart the agent when required:

$ systemctl restart forge-dpu-agent.service

Health Alert Lifecycle

NICo health alerts are source-based. A health source submits a fresh report, and NICo uses the latest report from each source to calculate aggregate health.

A typical alert flow:

  1. A source reports an alert such as PoweredOff with target <bmc-ip>.
  2. NICo adds the alert to aggregate host health.
  3. Classifications such as PreventAllocations define the operational effect.
  4. The health view shows the alert ID, target, message, start time, and classifications.
  5. Metrics and logs identify the responsible source.
  6. After remediation, the source submits a fresh report that marks the check successful or omits the previous alert.
  7. NICo merges the fresh report and aggregate host health returns to healthy.
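The flow above can be modeled with a short Python sketch in which the aggregate keeps only the latest report per source. The class names and the reduction of an alert to its ID are simplifications for illustration, not the NICo data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class HealthReport:
    source: str        # reporting source, for example "hardware-health"
    alerts: List[str]  # alert IDs only, simplified for the sketch


class AggregateHealth:
    """Keeps the latest report per source and derives host health from them."""

    def __init__(self) -> None:
        self._latest: Dict[str, HealthReport] = {}

    def submit(self, report: HealthReport) -> None:
        # A fresh report replaces the previous report from the same source.
        self._latest[report.source] = report

    def alerts(self) -> List[Tuple[str, str]]:
        return sorted((r.source, a) for r in self._latest.values() for a in r.alerts)

    def healthy(self) -> bool:
        return not self.alerts()


host = AggregateHealth()
host.submit(HealthReport("dpu-agent", []))
host.submit(HealthReport("hardware-health", ["PoweredOff"]))
print(host.healthy(), host.alerts())

# After remediation, the same source submits a fresh report without the alert.
host.submit(HealthReport("hardware-health", []))
print(host.healthy(), host.alerts())
```

Because each source owns its own latest report, an alert clears only when the source that raised it reports again.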

If a health override created the alert, remove or replace the override after the operational reason ends.

Inspect Current Health

Start with the host health page in the Admin Web UI:

https://<nico-api-hostname>/admin/machine/<machine-id>/health

Inspect the aggregate health table first. For each alert, note:

  • ID
  • Target
  • In Alert Since
  • Message
  • Tenant Message
  • Classifications

Then inspect component health to identify the source: hardware health, DPU health, validation, discovery, rack health, or health override.

Use the health history table to review recent transitions. This helps identify whether an alert is new, recurring, or already cleared by a later health report.

Admin CLI examples:

$ nico-admin-cli machine show <machine-id>
$ nico-admin-cli machine health-override show <machine-id>

Health Overrides

Health overrides add manual or service-created health reports to the same aggregate health model that automated checks use. Overrides are a shared health mechanism and are not specific to hardware health.

Use overrides for controlled states such as maintenance, validation, repair, break-fix, or temporary automation control. Do not use an override as a substitute for resolving the underlying condition.

Merge and Replace

| Override mode | Use |
| --- | --- |
| Merge | Adds a specific health condition while preserving automated health sources. Use this for most manual workflows. |
| Replace | Replaces aggregate health for the target. Use only as a tightly controlled exception because it can hide automated health sources. |

DPU replace overrides are rejected by the API.
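The difference between the two modes can be illustrated with a small Python sketch over sets of alert IDs. The function, its parameters, and the alert ID used for the override are hypothetical; the sketch only mirrors the behavior described above, including the rejection of replace overrides for DPUs.

```python
from typing import Set


def apply_override(automated: Set[str], override: Set[str], mode: str,
                   target_is_dpu: bool = False) -> Set[str]:
    """Return the effective alert IDs after applying one override."""
    if mode == "merge":
        # Automated alerts stay visible; the override adds to them.
        return automated | override
    if mode == "replace":
        if target_is_dpu:
            raise ValueError("replace overrides are rejected for DPUs")
        # Automated alerts are hidden; only the override content remains.
        return set(override)
    raise ValueError(f"unknown override mode: {mode!r}")


automated = {"PoweredOff"}
print(sorted(apply_override(automated, {"Maintenance"}, "merge")))
print(sorted(apply_override(automated, {"Maintenance"}, "replace")))
print(sorted(apply_override(automated, set(), "replace")))
```

The last call shows why replace needs tight control: an empty replace override makes the target look healthy while PoweredOff is still being reported.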

Override Templates

The nico-admin-cli machine health-override add command supports templates for common workflows:

| Template | Use |
| --- | --- |
| HostUpdate | Mark host as in DPU reprovision or host update. |
| InternalMaintenance | Internal maintenance window. |
| OutForRepair | Host removed from service for repair. |
| Degraded | Mark host as degraded. |
| Validation | Mark host for validation. |
| SuppressExternalAlerting | Suppress external alerting behavior. |
| MarkHealthy | Force healthy. |
| StopRebootForAutomaticRecoveryFromStateMachine | Block automatic recovery reboots during manual work. |
| TenantReportedIssue | Tenant-reported issue while releasing an instance. |
| RequestRepair | Tenant-reported issue requiring repair. |

Examples:

$ nico-admin-cli machine health-override show <machine-id>

$ nico-admin-cli machine health-override add <machine-id> \
    --template RequestRepair \
    --message "Manual repair trigger for tenant-reported issue"

$ nico-admin-cli machine health-override add <machine-id> \
    --template OutForRepair \
    --message "Automated repair failed, requires manual investigation"

$ nico-admin-cli machine health-override remove <machine-id> repair-request
$ nico-admin-cli machine health-override remove <machine-id> tenant-reported-issue

Before creating an override, identify the current aggregate health, choose the smallest effect that matches the workflow, include a clear message, and define the removal condition. After remediation, remove or replace the override and verify aggregate health.

Prometheus Metrics

NICo charts expose Prometheus scraping through ServiceMonitor resources. ServiceMonitors are disabled by default in chart values and enabled in the full example for selected services.

Example:

nico-api:
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 25s

nico-hardware-health:
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 25s

ServiceMonitors from the charts:

| Component | ServiceMonitor | Metrics port |
| --- | --- | --- |
| nico-api | nico-api-metrics | 1080 |
| nico-dsx-exchange-consumer | nico-dsx-exchange-consumer-metrics | 9009 |
| nico-hardware-health | nico-hardware-health-metrics | 9009 |
| nico-bmc-proxy | nico-bmc-proxy-metrics | 1080 |
| nico-dhcp | nico-dhcp-metrics | 1089 |
| nico-pxe | nico-pxe-metrics | 8080 |
| nico-ssh-console-rs | nico-ssh-console-rs-metrics | 9009 |
| unbound | unbound-metrics | 9167 |

Check rendered ServiceMonitors:

$ kubectl get servicemonitor -n nico-system
$ kubectl get servicemonitor -n nico-system nico-api-metrics
$ kubectl get servicemonitor -n nico-system nico-hardware-health-metrics
$ kubectl get servicemonitor -n nico-system nico-dsx-exchange-consumer-metrics

Core health metrics currently use carbide_* metric names. Some dashboards and site configurations also expose host-health rollups with the forge_* prefix. Use the literal metric name that exists in the target site.
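One way to handle the two prefixes in scripts is to resolve the literal name against the metric names the site actually exposes. The Python sketch below assumes the set of available names has already been fetched, for example from the Prometheus label-values endpoint for `__name__`; the function name and the sample set are hypothetical.

```python
from typing import Iterable, Optional, Set


def resolve_metric(suffix: str, available: Set[str],
                   prefixes: Iterable[str] = ("carbide_", "forge_")) -> Optional[str]:
    """Return the first prefixed metric name that exists in the target site."""
    for prefix in prefixes:
        name = prefix + suffix
        if name in available:
            return name
    return None


# Hypothetical list of metric names exposed by one site.
available = {"forge_hosts_health_status_count", "carbide_machines_per_state"}

print(resolve_metric("hosts_health_status_count", available))
print(resolve_metric("machines_per_state", available))
print(resolve_metric("hosts_health_overrides_count", available))
```

A None result means neither prefix is exposed for that metric, which usually points at scraping rather than naming.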

Useful core metric families:

| Metric | Use |
| --- | --- |
| carbide_hosts_health_status_count | Host health counts split by healthy and in_use. |
| carbide_hosts_health_overrides_count | Active merge and replace health overrides. |
| carbide_hosts_unhealthy_by_probe_id_count | Active unhealthy hosts by probe ID and probe target. |
| carbide_hosts_unhealthy_by_classification_count | Active unhealthy hosts by health-alert classification. |
| carbide_machines_per_state | Fleet distribution by machine state. |
| carbide_machines_per_state_above_sla | Machines above state-machine SLA. |

Use the Host Health dashboard panels for fleet-level rollups, including host health status, health overrides, probe alerts, and alert classifications. Example dashboard queries:

sum by (healthy, in_use) (
  max by (healthy, in_use) (
    forge_hosts_health_status_count{fresh="true"}
  )
)

sum by (override_type) (
  max by (override_type, in_use) (
    forge_hosts_health_overrides_count{fresh="true"}
  )
)

sum by (probe_id) (
  max by (probe_id, probe_target) (
    forge_hosts_unhealthy_by_probe_id_count{fresh="true"}
  )
)

sum by (classification) (
  max by (classification, in_use) (
    forge_hosts_unhealthy_by_classification_count{fresh="true"}
  )
)
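The dashboard queries above can also be run outside Grafana through the Prometheus HTTP API instant-query endpoint, /api/v1/query. The Python sketch below only builds the request URL; the Prometheus base URL is a placeholder for the site's own endpoint, and authentication is out of scope.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder base URL; substitute the site's Prometheus endpoint.
PROMETHEUS = "http://prometheus.example.internal:9090"

promql = ('sum by (probe_id) (max by (probe_id, probe_target) '
          '(forge_hosts_unhealthy_by_probe_id_count{fresh="true"}))')
url = instant_query_url(PROMETHEUS, promql)
print(url)

# Round-trip check: the encoded parameter decodes back to the original query.
print(parse_qs(urlsplit(url).query)["query"][0] == promql)
```

Letting urlencode handle the braces and quotes avoids hand-escaping PromQL in shell one-liners.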

DPU metrics:

| Metric | Use |
| --- | --- |
| carbide_dpus_up_count | DPUs with health reports newer than the DPU up threshold. |
| carbide_dpus_healthy_count | DPUs whose latest health report is healthy. |
| carbide_dpu_health_check_failed_count | Failed DPU health checks by probe. |
| carbide_dpu_agent_version_count | DPU-agent version distribution. |
| carbide_dpu_firmware_version_count | DPU firmware version distribution. |
| forge_dpu_agent_network_reachable | DPU-to-DPU reachability. |
| forge_dpu_agent_network_latency | DPU-to-DPU latency. |
| forge_dpu_agent_network_loss_percentage | Packet loss in a DPU network check cycle. |
| forge_dpu_agent_network_monitor_error | Network monitor errors unrelated to connectivity. |
| forge_dpu_agent_network_communication_error | Communication errors to a destination DPU. |

API Health and Availability

The NICo API is required for health inspection, health report ingestion, administrative workflows, and state-machine visibility. Check API health before debugging a host-specific health issue.

Check Kubernetes status:

$ kubectl get deploy -n nico-system nico-api
$ kubectl get pods -n nico-system -l app.kubernetes.io/name=nico-api
$ kubectl get svc -n nico-system nico-api

Check API metrics scraping:

$ kubectl get servicemonitor -n nico-system nico-api-metrics

Use Loki or Grafana Explore to inspect API logs:

{k8s_container_name="nico-api"} |= "<machine-id>"
{k8s_container_name="nico-api"} |= "<bmc-ip>" != "SPAN"

Grafana, Loki, and Logs

Use Grafana dashboards for fleet-level triage and Loki for source-specific logs. Start from aggregate host health, identify the alert source and inAlertSince, then query logs around that time.

Common Loki patterns:

{systemd_unit="forge-dpu-agent.service", machine_id="<machine-id>"}
{k8s_container_name="nico-hardware-health"} |= "<machine-id>"

Some sites expose machine identity as a log label. When that label is present, prefer a label filter over a free-text match:

{machine_id="<machine-id>"}
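For scripted queries, the choice between a label filter and a free-text match can be made in one place. The Python sketch below builds LogQL strings only; the helper names are hypothetical, and whether a machine_id label exists for a given stream is something to confirm in the Loki label browser.

```python
from typing import Dict, Optional


def _quote(value: str) -> str:
    """Quote a value for LogQL, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_logql(labels: Dict[str, str], contains: Optional[str] = None) -> str:
    """Build a stream selector with an optional |= line filter."""
    if not labels:
        raise ValueError("a LogQL stream selector needs at least one label matcher")
    matchers = ", ".join(f"{key}={_quote(value)}" for key, value in labels.items())
    query = "{" + matchers + "}"
    if contains is not None:
        query += " |= " + _quote(contains)
    return query


def machine_query(machine_id: str, base_labels: Dict[str, str],
                  has_machine_id_label: bool) -> str:
    """Prefer a machine_id label filter; fall back to a free-text match."""
    if has_machine_id_label:
        return build_logql({**base_labels, "machine_id": machine_id})
    return build_logql(base_labels, contains=machine_id)


print(machine_query("<machine-id>", {"systemd_unit": "forge-dpu-agent.service"}, True))
print(machine_query("<machine-id>", {"k8s_container_name": "nico-hardware-health"}, False))
```

The two printed selectors match the common Loki patterns shown earlier in this section.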

Console logs are shipped by the nico-ssh-console-rs OpenTelemetry Collector sidecar when enabled:

nico-ssh-console-rs:
  lokiLogCollector:
    enabled: true
    image:
      repository: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib
      tag: "0.81.0"

The sidecar tails:

/var/log/consoles/{machineid}_{bmc_ip}.log

It labels console logs with machineid and an SSH console exporter label:

{machineid="<machine-id>", exporter="nico-ssh-console-rs"}
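When working directly with the console log files, the {machineid}_{bmc_ip}.log naming pattern can be built and parsed with a short Python sketch. It assumes the BMC IP contains no underscore, so the last underscore separates the two parts; the function names are hypothetical.

```python
from typing import Tuple

CONSOLE_LOG_DIR = "/var/log/consoles"


def console_log_path(machine_id: str, bmc_ip: str) -> str:
    """Build the console log path for a machine and its BMC IP."""
    return f"{CONSOLE_LOG_DIR}/{machine_id}_{bmc_ip}.log"


def parse_console_log_path(path: str) -> Tuple[str, str]:
    """Split a console log path back into (machine_id, bmc_ip)."""
    name = path.rsplit("/", 1)[-1]
    if not name.endswith(".log"):
        raise ValueError(f"not a console log file: {path!r}")
    # Split on the last underscore, assuming the BMC IP contains none.
    machine_id, sep, bmc_ip = name[: -len(".log")].rpartition("_")
    if not sep or not machine_id or not bmc_ip:
        raise ValueError(f"unexpected console log name: {name!r}")
    return machine_id, bmc_ip


path = console_log_path("machine-abc123", "10.0.0.1")
print(path)
print(parse_console_log_path(path))
```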

Labels vary by log source. Use the Loki label browser to choose the most specific label available. Common labels include:

  • k8s_container_name
  • k8s_namespace_name
  • k8s_pod_name
  • machine_id
  • machineid
  • host_machine_id
  • host_name
  • systemd_unit
  • exporter
  • level

logcli can be used for repeatable terminal-based Loki queries when direct Loki access is configured for the site. Use the same LogQL selectors shown above. For example:

$ logcli query --since=1h '{k8s_container_name="nico-hardware-health"} |= "<machine-id>"'
$ logcli query --since=1h '{systemd_unit="forge-dpu-agent.service"} |= "<machine-id>"'

Dashboard Starting Points

Use the site-level health dashboard for fleet triage before drilling into logs. Start with these panels when they are available:

| Dashboard area | Use |
| --- | --- |
| Host Health | Site-level host health, probe alerts, classifications, and overrides. |
| DPU Status | DPU health, heartbeat, version, and firmware distribution. |
| Hardware Health Monitor Service Metrics | Hardware-health scrape and collector behavior. |
| Site Explorer | Endpoint discovery and exploration behavior. |
| Machine Update Manager | Update workflow health and state-machine interaction. |

For host-health triage, the highest-value panels are Healthy Host Percentage, Health Status, Health Overrides, Health Probe Alerts, and Health Alert Classifications.

Troubleshooting

| Symptom | Check | Next action |
| --- | --- | --- |
| Host is unhealthy with PoweredOff | Admin Web UI health page and hardware-health logs around inAlertSince. | Confirm BMC power state and whether the alert target is the expected BMC IP. |
| Host is unhealthy with HeartbeatTimeout for forge-dpu-agent | journalctl -u forge-dpu-agent.service -e --no-pager and Loki query for the DPU agent. | Confirm the DPU is powered, time-synced, and able to reach NICo. Restart forge-dpu-agent.service only when service-level remediation requires it. |
| Host has active overrides | nico-admin-cli machine health-override show <machine-id> and the Health Overrides dashboard panel. | Verify the override reason is still valid. Remove temporary overrides after the condition ends. |
| Health metrics are missing | kubectl get servicemonitor -n nico-system and the component-specific ServiceMonitor. | Enable the chart serviceMonitor block or fix the Prometheus selector/namespace match. |
| Hardware-health logs do not show reports for a host | Loki query for k8s_container_name="nico-hardware-health" and the machine ID. | Confirm hardware-health is running, BMC discovery found the endpoint, and the collector is enabled for the source. |
| DPU health probes fail for BGP, DHCP, or ifreload | DPU-agent logs and DPU Status dashboard panels. | Use the DPU health alert ID to choose the subsystem-specific runbook or service check. |
| API health inspection or admin pages are unavailable | kubectl get deploy, pods, and svc for nico-api; query API logs. | Restore API availability before debugging host-specific health state. |

Triage Workflow

  1. Open aggregate host health.
  2. Record the alert ID, target, message, inAlertSince, and classifications.
  3. Identify the source: hardware health, DPU health, validation, discovery, rack health, or override.
  4. Use the source-specific metrics and logs for that alert.
  5. Remediate the underlying condition.
  6. Wait for a fresh health report from the responsible source.
  7. Confirm aggregate host health returns to healthy.
  8. Remove temporary overrides used during the investigation.