> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/infra-controller/llms.txt.
> For full documentation content, see https://docs.nvidia.com/infra-controller/llms-full.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/infra-controller/_mcp/server.

# Monitoring and Health

This page covers monitoring and health workflows for NICo sites after
deployment: hardware health, DPU health, aggregate host health, health
overrides, Prometheus scraping, Grafana dashboards, and Loki queries.

Use aggregate host health as the starting point for operational decisions. NICo
combines hardware health, DPU health, validation and discovery checks, rack
health, and health overrides into a single host-level result. Component health
explains which source is responsible for the aggregate result.

Use this page as the entry point for health triage. It gives the primary
inspection path, commands, metrics, dashboards, and log queries needed to start
an investigation. For subsystem-specific behavior, follow the linked hardware,
DPU, health aggregation, and classification references rather than treating this
page as a replacement for those manuals.

For reference, see:

- [Health Checks and Health Aggregation](/infra-controller/documentation/architecture/health-reporting/health-checks-and-health-aggregation)
- [Health Probe IDs](/infra-controller/documentation/architecture/health-reporting/health-probe-i-ds)
- [Health Alert Classifications](/infra-controller/documentation/architecture/health-reporting/health-alert-classifications)
- [Redfish Workflow](/infra-controller/documentation/architecture/redfish/redfish-workflow)

## Health Sources

NICo builds health from health reports. A health report contains successes and
alerts from a reporting source. Common health sources are:

| Source | What it reports |
|---|---|
| Hardware health | BMC and Redfish hardware state, including sensors, chassis status, and leak-related signals when configured. |
| DPU agent | DPU service health, DPU networking health, BGP state, DHCP service health, and agent heartbeat. |
| Validation and discovery | SKU validation, host validation, endpoint discovery, and inventory checks. |
| Rack health | Rack-level health input when rack health reporting is configured. |
| Health overrides | Manual or service-created health reports used for maintenance, repair, validation, or other controlled workflows. |

Each alert has an ID, an optional target, a message, a start time, and one or
more classifications. Classifications define operational impact. For example,
`PreventAllocations` blocks new allocations while the alert is active, and
`ExcludeFromStateMachineSla` excludes the host from state-machine SLA
evaluation.

## Hardware Health Monitoring

NICo monitors hardware through the hardware health service. The Helm chart is
`nico-hardware-health`.

The service discovers BMC endpoints from NICo and queries them through Redfish.
It monitors host BMCs, DPU BMCs, and configured switch or power-shelf BMCs. The
primary monitoring path is sensor collection. Additional collectors can gather
firmware, log, NMX-T, NVUE REST, and leak-related data when configured.

### Helm Configuration

Enable hardware health in Helm values:

```yaml
nico-hardware-health:
  enabled: true
```

Enable metrics scraping with its ServiceMonitor:

```yaml
nico-hardware-health:
  enabled: true
  replicas: 1

  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 25s
```

By default, the chart exposes hardware health metrics on port `9009`. Log
collection is disabled by default:

```yaml
env:
  NICO_HEALTH__COLLECTORS__LOGS__ENABLED: "false"
```

Enable log collection only through the target site's deployment values.

### Hardware Health Service Configuration

The hardware health service config example,
`crates/health/example/config.example.toml`, documents endpoint discovery,
sinks, rate limits, collectors, processors, and metrics.

Production endpoint discovery uses the NICo API source. The checked-in
hardware-health example config currently names that source
`[endpoint_sources.nico_api]`:

```toml
[endpoint_sources.nico_api]
root_ca = "/var/run/secrets/spiffe.io/ca.crt"
client_cert = "/var/run/secrets/spiffe.io/tls.crt"
client_key = "/var/run/secrets/spiffe.io/tls.key"
api_url = "https://nico-api.forge-system.svc.cluster.local:1079"
```

Static BMC endpoints are supported for local, mock, or special deployments:

```toml
[[endpoint_sources.static_bmc_endpoints]]
ip = "10.0.0.1"
port = 443
mac = "aa:bb:cc:dd:ee:ff"
username = "admin"
password = "secret"
```

Collector defaults from the example config:

| Area | Parameter | Example value | Meaning |
|---|---:|---|---|
| Rate limiting | `bucket_burst` | `200` | Burst size for outbound requests. |
| Rate limiting | `bucket_replenish` | `"35ms"` | Token replenish interval. |
| Sensor collector | `sensor_fetch_interval` | `"1m"` | Sensor polling cadence. |
| Sensor collector | `rediscover_interval` | `"5m"` | Sensor inventory rediscovery cadence. |
| Sensor collector | `state_refresh_interval` | `"30m"` | Broader BMC state refresh cadence. |
| Sensor collector | `sensor_fetch_concurrency` | `10` | Concurrent sensor fetch limit. |
| Sensor collector | `include_sensor_thresholds` | `true` | Include BMC threshold data when available. |
| Firmware collector | `firmware_refresh_interval` | `"30m"` | Firmware refresh cadence. |
| Logs collector | `mode` | `"sse"` | Preferred BMC log collection mode. |
| NMX-T collector | `scrape_interval` | `"1m"` | Switch telemetry scrape cadence. |
| NVUE REST collector | `poll_interval` | `"1m"` | NVUE REST polling cadence. |
| Leak processor | `minimum_alerts_per_report` | `1` | Leak alert threshold for health reports. |
| Rack leak processor | `leaking_tray_threshold` | `2` | Rack-level leak threshold. |
| Metrics | `endpoint` | `"0.0.0.0:9009"` | Metrics listener. |
| Metrics | `prefix` | `"carbide_hardware_health"` | Hardware-health metric prefix. |

### BMC Proxy

The full Helm example enables `nico-bmc-proxy`, the authenticating proxy for BMC
Redfish access. The proxy chart exposes proxy traffic on port `1079` and metrics
on port `1080`.

Example proxy settings:

```toml
listen = "[::]:1079"
metrics_endpoint = "[::]:1080"
allowed_principals = ["spiffe-service-id/nico-api", "spiffe-service-id/nv-dps"]
```

The BMC proxy ServiceMonitor follows the same `serviceMonitor.enabled`,
`interval`, and `scrapeTimeout` pattern as other NICo services.

### Sensor Alerts

Hardware sensor alerts are derived from BMC-reported health, sensor readings,
and thresholds. Sensor classifications include:

| Classification | Meaning |
|---|---|
| `SensorWarning` | Sensor crossed a caution threshold. |
| `SensorCritical` | Sensor crossed a critical threshold. |
| `SensorFailure` | Sensor value is outside the valid range or otherwise invalid. |

If numeric threshold data indicates a problem but the BMC reports the sensor as
healthy, NICo treats the sensor as healthy. In that case the BMC health state is
the authority.

### Hardware Health Logs

Use Loki or Grafana Explore to inspect hardware health logs for a host:

```logql
{k8s_container_name="nico-hardware-health"} |= "<machine-id>"
```

Health report events include fields such as:

```text
collector=sensor_collector
report_source=bmc-sensors
machine_id=Some(<machine-id>)
alert_count=<n>
success_count=<n>
```

For leak-related events, look for:

```text
report_source=tray-leak-detection
```

## DPU Health Checks

`dpu-agent` runs on managed DPUs and reports DPU health to NICo. The BlueField
chart is named `nico-dpu-agent`. In service names and logs, the DPU agent
currently appears as `forge-dpu-agent.service`.

The agent checks DPU service health, networking state, HBN/NVUE configuration,
DHCP behavior, BGP status, and heartbeat. DPU health is part of aggregate host
health, so an otherwise healthy host can still be unavailable when its DPU is
unhealthy.

See also:

- [DPU Lifecycle Management](/infra-controller/documentation/operations-day-2/dpu-management/dpu-lifecycle-management)
- [DPU Configuration](/infra-controller/documentation/operations-day-2/dpu-management/dpu-configuration)
- [Stuck in WaitingForNetworkConfig and DPU Health](/infra-controller/documentation/operations-day-2/playbooks/stuck-objects/stuck-in-waiting-for-network-config-and-dpu-health)

### DPU Agent Configuration

Key `nico-dpu-agent` chart values:

| Config area | Value | Meaning |
|---|---|---|
| `certsDir` | site certificate directory | Host certificate directory mounted into the agent container. |
| `securityContext.capabilities.add` | `NET_ADMIN`, `SYS_ADMIN`, `DAC_OVERRIDE`, `NET_RAW` | Linux capabilities used for network and system operations. |
| `hbn.nvue_https_address` | `nvue` | NVUE service name used by the agent. |
| `hbn.nvue_credentials_secret_name` | site value | Secret containing NVUE credentials. |
| `hbn.nvue_password_key` | `password` | Secret key for the NVUE password. |
| `dhcp_server.interface_prepend` | empty by default | Optional DHCP interface prefix argument. |
| `dhcp_server.service_name` | set by DPF service integration | DHCP gRPC service name. |
| `fmds.service_name` | set by DPF service integration | FMDS gRPC service name. |

The DaemonSet renders these core arguments:

```text
nico-dpu-agent run
  --hbn-config-mode=nvue-rest
  --agent-platform-type=containerized
  --dhcp-grpc-server=http://<dhcp-grpc-service>:10079
  --fmds-grpc-server=http://<fmds-grpc-service>:50052
```

If `dhcp_server.interface_prepend` is set, the chart also adds:

```text
--dhcp-server-interface-prepend=<prefix>
```

The pod sets these runtime environment variables:

| Environment variable | Source / value |
|---|---|
| `POD_IP` | Kubernetes pod IP field reference. |
| `NODE_NAME` | Kubernetes node name field reference. |
| `POD_NAME` | Kubernetes pod name field reference. |
| `POD_NAMESPACE` | Kubernetes namespace field reference. |
| `IGNORE_MGMT_VRF` | `1`. |
| `NVUE_HTTPS_ADDRESS` | DPF cluster NVUE endpoint. |
| `NVUE_USERNAME` | NVUE user configured for the deployment. |
| `NVUE_PASSWORD` | Secret key from `hbn.nvue_credentials_secret_name`. |
| `RUST_LOG` | `info`. |

### Common DPU Alerts

Common DPU alert IDs include:

- `ContainerExists`
- `ServiceRunning`
- `DhcpServer`
- `BgpStats`
- `BgpPeeringTor`
- `BgpPeeringRouteServer`
- `Ifreload`
- `BgpDaemonEnabled`
- `PostConfigCheckWait`
- `DpuDiskUtilizationCritical`
- `HeartbeatTimeout`

`HeartbeatTimeout` means NICo has not received a recent health report
from the DPU agent. Check whether the DPU is powered, the agent is running, DPU
time is correct, and the DPU can reach NICo.

### DPU Logs

Use Loki to inspect DPU-agent logs:

```logql
{systemd_unit="forge-dpu-agent.service", machine_id="<machine-id>"}
```

Alternative labels can be used when available:

```logql
{systemd_unit="forge-dpu-agent.service", host_name="<host-name>"}
```

On the DPU, use `journalctl` for direct service logs:

```bash
journalctl -u forge-dpu-agent.service -e --no-pager
```

Restart the agent when required:

```bash
systemctl restart forge-dpu-agent.service
```

## Health Alert Lifecycle

NICo health alerts are source-based. A health source submits a fresh report, and
NICo uses the latest report from each source to calculate aggregate health.

A typical alert flow:

1. A source reports an alert such as `PoweredOff` with target `<bmc-ip>`.
2. NICo adds the alert to aggregate host health.
3. Classifications such as `PreventAllocations` define the operational effect.
4. The health view shows the alert ID, target, message, start time, and
   classifications.
5. Metrics and logs identify the responsible source.
6. After remediation, the source submits a fresh report that marks the check
   successful or omits the previous alert.
7. NICo merges the fresh report and aggregate host health returns to healthy.

If a health override created the alert, remove or replace the override after the
operational reason ends.

## Inspect Current Health

Start with the host health page in the Admin Web UI:

```text
https://<nico-api-hostname>/admin/machine/<machine-id>/health
```

Inspect the aggregate health table first. For each alert, note:

- `ID`
- `Target`
- `In Alert Since`
- `Message`
- `Tenant Message`
- `Classifications`

Then inspect component health to identify the source: hardware health, DPU
health, validation, discovery, rack health, or health override.

Use the health history table to review recent transitions. This helps identify
whether an alert is new, recurring, or already cleared by a later health report.

Admin CLI examples:

```bash
nico-admin-cli machine show <machine-id>
nico-admin-cli machine health-override show <machine-id>
```

## Health Overrides

Health overrides add manual or service-created health reports into the same
aggregate health model as automated checks. Overrides are shared health
mechanisms; they are not specific to hardware health.

Use overrides for controlled states such as maintenance, validation, repair,
break-fix, or temporary automation control. Do not use an override as a
substitute for resolving the underlying condition.

### Merge and Replace

| Override mode | Use |
|---|---|
| Merge | Adds a specific health condition while preserving automated health sources. Use this for most manual workflows. |
| Replace | Replaces aggregate health for the target. Use only as a tightly controlled exception because it can hide automated health sources. |

DPU replace overrides are rejected by the API.

### Override Templates

The `nico-admin-cli machine health-override add` command supports templates
for common workflows:

| Template | Use |
|---|---|
| `HostUpdate` | Mark host as in DPU reprovision or host update. |
| `InternalMaintenance` | Internal maintenance window. |
| `OutForRepair` | Host removed from service for repair. |
| `Degraded` | Mark host as degraded. |
| `Validation` | Mark host for validation. |
| `SuppressExternalAlerting` | Suppress external alerting behavior. |
| `MarkHealthy` | Force healthy. |
| `StopRebootForAutomaticRecoveryFromStateMachine` | Block automatic recovery reboots during manual work. |
| `TenantReportedIssue` | Tenant-reported issue while releasing an instance. |
| `RequestRepair` | Tenant-reported issue requiring repair. |

Examples:

```bash
nico-admin-cli machine health-override show <machine-id>

nico-admin-cli machine health-override add <machine-id> \
  --template RequestRepair \
  --message "Manual repair trigger for tenant-reported issue"

nico-admin-cli machine health-override add <machine-id> \
  --template OutForRepair \
  --message "Automated repair failed, requires manual investigation"

nico-admin-cli machine health-override remove <machine-id> repair-request
nico-admin-cli machine health-override remove <machine-id> tenant-reported-issue
```

Before creating an override, identify the current aggregate health, choose the
smallest effect that matches the workflow, include a clear message, and define
the removal condition. After remediation, remove or replace the override and
verify aggregate health.

## Prometheus Metrics

NICo charts expose Prometheus scraping through `ServiceMonitor` resources.
ServiceMonitors are disabled by default in chart values and enabled in the full
example for selected services.

Example:

```yaml
nico-api:
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 25s

nico-hardware-health:
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 25s
```

ServiceMonitors from the charts:

| Component | ServiceMonitor | Metrics port |
|---|---|---:|
| `nico-api` | `nico-api-metrics` | `1080` |
| `nico-dsx-exchange-consumer` | `nico-dsx-exchange-consumer-metrics` | `9009` |
| `nico-hardware-health` | `nico-hardware-health-metrics` | `9009` |
| `nico-bmc-proxy` | `nico-bmc-proxy-metrics` | `1080` |
| `nico-dhcp` | `nico-dhcp-metrics` | `1089` |
| `nico-pxe` | `nico-pxe-metrics` | `8080` |
| `nico-ssh-console-rs` | `nico-ssh-console-rs-metrics` | `9009` |
| `unbound` | `unbound-metrics` | `9167` |

Check rendered ServiceMonitors:

```bash
kubectl get servicemonitor -n nico-system
kubectl get servicemonitor -n nico-system nico-api-metrics
kubectl get servicemonitor -n nico-system nico-hardware-health-metrics
kubectl get servicemonitor -n nico-system nico-dsx-exchange-consumer-metrics
```

Core health metrics currently use `carbide_*` metric names. Some dashboards and
site configurations also expose host-health rollups with the `forge_*` prefix.
Use the literal metric name that exists in the target site.

Useful core metric families:

| Metric | Use |
|---|---|
| `carbide_hosts_health_status_count` | Host health counts split by `healthy` and `in_use`. |
| `carbide_hosts_health_overrides_count` | Active merge and replace health overrides. |
| `carbide_hosts_unhealthy_by_probe_id_count` | Active unhealthy hosts by probe ID and probe target. |
| `carbide_hosts_unhealthy_by_classification_count` | Active unhealthy hosts by health-alert classification. |
| `carbide_machines_per_state` | Fleet distribution by machine state. |
| `carbide_machines_per_state_above_sla` | Machines above state-machine SLA. |

Use the Host Health dashboard panels for fleet-level rollups, including host
health status, health overrides, probe alerts, and alert classifications.
Example dashboard queries:

```promql
sum by (healthy, in_use) (
  max by(healthy, in_use) (
    forge_hosts_health_status_count{fresh="true"}
  )
)
```

```promql
sum by(override_type) (
  max by(override_type, in_use) (
    forge_hosts_health_overrides_count{fresh="true"}
  )
)
```

```promql
sum by(probe_id) (
  max by(probe_id, probe_target) (
    forge_hosts_unhealthy_by_probe_id_count{fresh="true"}
  )
)
```

```promql
sum by(classification) (
  max by(classification, in_use) (
    forge_hosts_unhealthy_by_classification_count{fresh="true"}
  )
)
```

DPU metrics:

| Metric | Use |
|---|---|
| `carbide_dpus_up_count` | DPUs with health reports newer than the DPU up threshold. |
| `carbide_dpus_healthy_count` | DPUs whose latest health report is healthy. |
| `carbide_dpu_health_check_failed_count` | Failed DPU health checks by probe. |
| `carbide_dpu_agent_version_count` | DPU-agent version distribution. |
| `carbide_dpu_firmware_version_count` | DPU firmware version distribution. |
| `forge_dpu_agent_network_reachable` | DPU-to-DPU reachability. |
| `forge_dpu_agent_network_latency` | DPU-to-DPU latency. |
| `forge_dpu_agent_network_loss_percentage` | Packet loss in a DPU network check cycle. |
| `forge_dpu_agent_network_monitor_error` | Network monitor errors unrelated to connectivity. |
| `forge_dpu_agent_network_communication_error` | Communication errors to a destination DPU. |

## API Health and Availability

The NICo API is required for health inspection, health report ingestion,
administrative workflows, and state-machine visibility. Check API health before
debugging a host-specific health issue.

Check Kubernetes status:

```bash
kubectl get deploy -n nico-system nico-api
kubectl get pods -n nico-system -l app.kubernetes.io/name=nico-api
kubectl get svc -n nico-system nico-api
```

Check API metrics scraping:

```bash
kubectl get servicemonitor -n nico-system nico-api-metrics
```

Use Loki or Grafana Explore to inspect API logs:

```logql
{k8s_container_name="nico-api"} |= "<machine-id>"
```

```logql
{k8s_container_name="nico-api"} |= "<bmc-ip>" != "SPAN"
```

## Grafana, Loki, and Logs

Use Grafana dashboards for fleet-level triage and Loki for source-specific logs.
Start from aggregate host health, identify the alert source and `inAlertSince`,
then query logs around that time.

Common Loki patterns:

```logql
{systemd_unit="forge-dpu-agent.service", machine_id="<machine-id>"}
```

```logql
{k8s_container_name="nico-hardware-health"} |= "<machine-id>"
```

Some sites expose machine identity as a log label. When that label is present,
prefer a label filter over a free-text match:

```logql
{machine_id="<machine-id>"}
```

Console logs are shipped by the `nico-ssh-console-rs` OpenTelemetry Collector
sidecar when enabled:

```yaml
nico-ssh-console-rs:
  lokiLogCollector:
    enabled: true
    image:
      repository: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib
      tag: "0.81.0"
```

The sidecar tails:

```text
/var/log/consoles/{machineid}_{bmc_ip}.log
```

It labels console logs with `machineid` and an SSH console exporter label:

```logql
{machineid="<machine-id>", exporter="nico-ssh-console-rs"}
```

Labels vary by log source. Use the Loki label browser to choose the most
specific label available. Common labels include:

- `k8s_container_name`
- `k8s_namespace_name`
- `k8s_pod_name`
- `machine_id`
- `machineid`
- `host_machine_id`
- `host_name`
- `systemd_unit`
- `exporter`
- `level`

`logcli` can be used for repeatable terminal-based Loki queries when direct Loki
access is configured for the site. Use the same LogQL selectors shown above.
For example:

```bash
logcli query --since=1h '{k8s_container_name="nico-hardware-health"} |= "<machine-id>"'
logcli query --since=1h '{systemd_unit="forge-dpu-agent.service"} |= "<machine-id>"'
```

### Dashboard Starting Points

Use the site-level health dashboard for fleet triage before drilling into logs.
Start with these panels when they are available:

| Dashboard area | Use |
|---|---|
| Host Health | Site-level host health, probe alerts, classifications, and overrides. |
| DPU Status | DPU health, heartbeat, version, and firmware distribution. |
| Hardware Health Monitor Service Metrics | Hardware-health scrape and collector behavior. |
| Site Explorer | Endpoint discovery and exploration behavior. |
| Machine Update Manager | Update workflow health and state-machine interaction. |

For host-health triage, the highest-value panels are Healthy Host Percentage,
Health Status, Health Overrides, Health Probe Alerts, and Health Alert
Classifications.

## Troubleshooting

| Symptom | Check | Next action |
|---|---|---|
| Host is unhealthy with `PoweredOff` | Admin Web UI health page and hardware-health logs around `inAlertSince`. | Confirm BMC power state and whether the alert target is the expected BMC IP. |
| Host is unhealthy with `HeartbeatTimeout` for `forge-dpu-agent` | `journalctl -u forge-dpu-agent.service -e --no-pager` and Loki query for the DPU agent. | Confirm the DPU is powered, time-synced, and able to reach NICo. Restart `forge-dpu-agent.service` only when service-level remediation requires it. |
| Host has active overrides | `nico-admin-cli machine health-override show <machine-id>` and the Health Overrides dashboard panel. | Verify the override reason is still valid. Remove temporary overrides after the condition ends. |
| Health metrics are missing | `kubectl get servicemonitor -n nico-system` and the component-specific ServiceMonitor. | Enable the chart `serviceMonitor` block or fix the Prometheus selector/namespace match. |
| Hardware-health logs do not show reports for a host | Loki query for `k8s_container_name="nico-hardware-health"` and the machine ID. | Confirm hardware-health is running, BMC discovery found the endpoint, and the collector is enabled for the source. |
| DPU health probes fail for BGP, DHCP, or ifreload | DPU-agent logs and DPU Status dashboard panels. | Use the DPU health alert ID to choose the subsystem-specific runbook or service check. |
| API health inspection or admin pages are unavailable | `kubectl get deploy`, `pods`, and `svc` for `nico-api`; query API logs. | Restore API availability before debugging host-specific health state. |

## Triage Workflow

1. Open aggregate host health.
2. Record the alert ID, target, message, `inAlertSince`, and classifications.
3. Identify the source: hardware health, DPU health, validation, discovery, rack
   health, or override.
4. Use the source-specific metrics and logs for that alert.
5. Remediate the underlying condition.
6. Wait for a fresh health report from the responsible source.
7. Confirm aggregate host health returns to healthy.
8. Remove temporary overrides used during the investigation.