Machine and DPU Logs


This document covers log collection from NICo-managed devices: machine/DPU serial console output and DPU system/application logs. For NICo service logs (nico-api, nico-dns, etc.), see /infra-controller/documentation/operations-day-2/observability/logging.


1. Overview

NICo manages two categories of device logs:

| Source | What it captures | Collection method |
| --- | --- | --- |
| Machine/DPU console logs | BMC serial console output (boot messages, kernel output, crash dumps) | nico-ssh-console captures to files, sidecar ships to backend |
| DPU logs | System logs (journald), DOCA/HBN, nico-dpu-agent, auth logs | otelcol-contrib on DPU forwards via OTLP to site collector |

2. Machine console logs (nico-ssh-console)

2.1 How it works

When a user connects to a machine’s BMC console through nico-ssh-console, the proxy:

  1. Establishes an SSH session to the BMC
  2. Captures all serial console output (stdout from the BMC session)
  3. Strips ANSI escape sequences for cleaner logs
  4. Writes timestamped output to a local file
  5. Rotates files when they exceed the configured size

Console logging runs for the duration of each BMC session. When the session ends, the logger writes a closing timestamp and flushes the file.
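
The capture steps above can be approximated in a few lines of shell. This is a minimal sketch, not the nico-ssh-console implementation: `strip_ansi` and `log_session` are hypothetical helper names, and the sketch removes only CSI-style escape sequences (colors, cursor movement) rather than every possible terminal control code.

```shell
# Hypothetical helper: remove ANSI CSI escape sequences from stdin.
strip_ansi() {
  esc=$(printf '\033')
  # ESC [ <parameters> <final letter>, e.g. ESC[1;31m or ESC[0m
  sed "s/${esc}\[[0-9;?]*[A-Za-z]//g"
}

# Hypothetical helper: wrap a captured stream in opening and closing
# session markers with UTC timestamps, similar to the markers that
# appear in the console log files.
log_session() {
  printf '%s\n' "--- ssh-console started at $(date -u +%Y-%m-%dT%H:%M:%S+00:00) ---"
  strip_ansi
  printf '%s\n' "--- ssh-console shutting down at $(date -u +%Y-%m-%dT%H:%M:%S+00:00) ---"
}
```

Piping any captured stream through `log_session` and appending the result to a file reproduces steps 3 and 4 in miniature; size-based rotation (step 5) is a separate concern.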

2.2 Log file location and naming

Console logs are written to:

/var/log/consoles/<machine-id>_<bmc-ip>.log

For example:

/var/log/consoles/fm100ds..0042_10.0.1.50.log

The filename encodes both the NICo machine ID and the BMC IP address, making it easy to identify which machine produced the logs.
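
Because the naming scheme is regular, the two fields can be recovered from a path with plain parameter expansion. The sketch below assumes the machine ID and BMC IP are separated by the last underscore in the name; `parse_console_log_name` is a hypothetical helper, and `abc123` in the usage note is a made-up ID for illustration.

```shell
# Hypothetical helper: print "<machine-id> <bmc-ip>" for a console log path.
# Accepts both the active file (.log) and rotated files (.log.N).
parse_console_log_name() {
  name=$(basename "$1")
  case "$name" in
    *_*.log|*_*.log.[0-9]*) ;;
    *) return 1 ;;                 # not a console log file name
  esac
  base=${name%%.log*}              # drop ".log" and any rotation suffix
  machine_id=${base%_*}            # everything before the last underscore
  bmc_ip=${base##*_}               # everything after the last underscore
  printf '%s %s\n' "$machine_id" "$bmc_ip"
}
```

For instance, `parse_console_log_name /var/log/consoles/abc123_10.0.1.50.log` prints `abc123 10.0.1.50`.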

2.3 Log content

Console logs contain raw serial output with session markers:

--- ssh-console started at 2026-06-12T10:15:30+00:00 ---
[ 0.000000] Linux version 5.15.0-generic ...
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0
[ 2.341892] pci 0000:00:1f.0: [8086:a0c8] type 00 class 0x060100
...
[ 45.123456] systemd[1]: Reached target Multi-User System.
--- ssh-console shutting down at 2026-06-12T10:45:12+00:00 ---

These logs are useful for:

  • Debugging boot failures
  • Capturing kernel panics and oops messages
  • Reviewing BIOS/UEFI output
  • Diagnosing hardware initialization issues

2.4 Configuration

Console logging is controlled by nico-ssh-console configuration:

| Setting | Default | Description |
| --- | --- | --- |
| console_logs_path | /var/log/consoles | Directory for console log files |
| console_logging_enabled | true | Enable/disable console capture |
| log_rotate_max_size | 10 MiB | Rotate when file exceeds this size |
| log_rotate_max_rotated_files | 4 | Keep up to N rotated files (.log.0 through .log.3) |

When rotation occurs, the current .log file becomes .log.0, previous .log.0 becomes .log.1, and so on. The oldest file beyond the limit is deleted.
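
The rotation scheme can be modeled as follows. This is an illustrative sketch of the renaming order only, not the code nico-ssh-console runs; `needs_rotation` and `rotate_log` are hypothetical names, and the size limit is passed in bytes (10 MiB is 10485760).

```shell
# Hypothetical helper: succeed if the file is larger than max_bytes.
needs_rotation() {
  size=$(($(wc -c < "$1")))
  [ "$size" -gt "$2" ]
}

# Hypothetical helper: shift <log>.N to <log>.N+1, move <log> to <log>.0,
# and drop the oldest file so that at most `keep` rotated files remain.
rotate_log() {
  log=$1
  keep=$2
  i=$((keep - 1))
  rm -f "$log.$i"                  # oldest rotated file falls off the end
  while [ "$i" -gt 0 ]; do
    prev=$((i - 1))
    if [ -e "$log.$prev" ]; then
      mv "$log.$prev" "$log.$i"
    fi
    i=$prev
  done
  if [ -e "$log" ]; then
    mv "$log" "$log.0"
  fi
  : > "$log"                       # start a fresh, empty active file
}
```

With the defaults above, a caller would run `needs_rotation "$file" 10485760 && rotate_log "$file" 4` after each write.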

2.5 Centralizing console logs

The nico-ssh-console Helm chart includes an optional OpenTelemetry Collector sidecar for shipping console logs to a backend.

Enable the sidecar:

# values.yaml for nico-ssh-console-rs
lokiLogCollector:
  enabled: true
  image:
    repository: otel/opentelemetry-collector-contrib
    tag: "0.102.0"

Default sidecar configuration:

The sidecar reads from /var/log/consoles/*.log and extracts machine metadata from filenames:

receivers:
  filelog:
    include:
      - /var/log/consoles/*.log
      - /var/log/consoles/*.log.*
    start_at: beginning
    storage: file_storage
    operators:
      # Matches <machine-id>_<bmc-ip>.log and rotated <machine-id>_<bmc-ip>.log.N
      - type: regex_parser
        regex: '^(?P<machineid>[a-z0-9]+)_(?P<bmc_ip>[^_]+)\.log(\.\d+)?$'
        parse_from: attributes["log.file.name"]

processors:
  attributes:
    actions:
      - key: exporter
        value: nico-ssh-console-rs
        action: upsert
      - key: loki.attribute.labels
        value: machineid,exporter
        action: insert
      - key: loki.format
        value: raw
        action: insert
  batch: {}
  memory_limiter:
    check_interval: 5s
    limit_mib: 4096
    spike_limit_mib: 1024

exporters:
  loki:
    endpoint: "http://loki.loki.svc.cluster.local:3100/loki/api/v1/push"
    headers:
      "X-Scope-OrgID": nico

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [filelog]
      processors: [attributes, batch]
      exporters: [loki]

Alternative: stdout to DaemonSet collector:

To follow the standard Kubernetes pattern where the DaemonSet collector picks up all pod logs, configure the sidecar to write to stdout instead of directly to a backend:

configFiles:
  otelcolConfig: |
    extensions:
      file_storage:
        directory: /var/lib/otelcol/filelog-checkpoints
    receivers:
      filelog:
        include:
          - /var/log/consoles/*.log
          - /var/log/consoles/*.log.*
        start_at: beginning
        storage: file_storage
        operators:
          # Matches <machine-id>_<bmc-ip>.log and rotated <machine-id>_<bmc-ip>.log.N
          - type: regex_parser
            regex: '^(?P<machineid>[a-z0-9]+)_(?P<bmc_ip>[^_]+)\.log(\.\d+)?$'
            parse_from: attributes["log.file.name"]
    processors:
      attributes:
        actions:
          - key: component
            value: nico-ssh-console-rs
            action: upsert
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_mib: 4096
        spike_limit_mib: 1024
    exporters:
      debug:
        # Emits each record to the collector's own log output, which Kubernetes
        # captures as pod logs. "basic" would print only per-batch record counts.
        verbosity: detailed
    service:
      extensions: [file_storage]
      pipelines:
        logs:
          receivers: [filelog]
          processors: [memory_limiter, attributes, batch]
          exporters: [debug]

The DaemonSet collector on each node reads /var/log/pods/ (including the sidecar's container output) and forwards all logs to your backend. This keeps the architecture simple: console logs flow through the same pipeline as all other pod logs.

2.6 Querying console logs

Once centralized, query console logs by machine ID:

Loki (LogQL):

{machineid="fm100ds..0042"}

VictoriaLogs (LogsQL):

machineid:fm100ds..0042

To find boot failures or kernel panics:

{machineid="fm100ds..0042"} |~ "(?i)panic|oops|failed|error"
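
The same queries can be issued from a shell through Loki's HTTP API. The sketch below is an assumption-laden example rather than a supported tool: `console_logql` and `loki_query` are hypothetical helpers, `LOKI_URL` is a hypothetical environment variable, and the default host and the `X-Scope-OrgID: nico` tenant header are taken from the sidecar exporter settings shown in section 2.5.

```shell
# Hypothetical helper: build a LogQL query for one machine's console logs,
# optionally filtered by a regular expression.
console_logql() {
  machine_id=$1
  pattern=${2:-}
  if [ -n "$pattern" ]; then
    printf '{machineid="%s"} |~ "%s"\n' "$machine_id" "$pattern"
  else
    printf '{machineid="%s"}\n' "$machine_id"
  fi
}

# Hypothetical helper: run a LogQL query against Loki's query_range endpoint.
# curl -G sends the url-encoded parameters as a GET query string.
loki_query() {
  curl -sG "${LOKI_URL:-http://loki.loki.svc.cluster.local:3100}/loki/api/v1/query_range" \
    -H 'X-Scope-OrgID: nico' \
    --data-urlencode "query=$1" \
    --data-urlencode 'limit=100'
}
```

A typical call would be `loki_query "$(console_logql <machine-id> 'panic|oops')"`, run from a host that can reach the Loki service.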

3. DPU logs

DPUs run an OpenTelemetry Collector (otelcol-contrib) deployed via the nico-otelcol Helm chart (bluefield/charts/nico-otelcol/). The chart deploys a DaemonSet that runs on DPU nodes managed by DPF (DOCA Platform Framework). For non-Kubernetes DPU deployments, a systemd service (otelcol-contrib.service) provides the same functionality.

The collector gathers logs from multiple sources and forwards them to the site controller over mTLS.

3.1 Log sources on the DPU

The following logs are collected from the DPU Arm OS. All log files are physically located on the DPU’s local filesystem.

| Source | Receiver | Component label | Description |
| --- | --- | --- | --- |
| Kernel/dmesg | journald/kernel | journald | Kernel messages, hardware events |
| DOCA/HBN | filelog/doca | hbn | FRR, nl2docad, nvued, supervisord, syslog |
| nico-dpu-agent | filelog/nico-dpu-agent | journald | NICo agent logs (logfmt format) |
| Auth logs | filelog/auth | dpu-auth-filelog | /var/log/auth.log for security auditing |

DOCA/HBN log paths:

These paths are on the DPU Arm OS filesystem. The HBN container writes logs to host-mounted volumes, making them accessible to the otelcol-contrib collector running on the host.

/var/log/doca/hbn/frr/frr.log
/var/log/doca/hbn/nl2docad.log
/var/log/doca/hbn/nvued.log
/var/log/doca/hbn/supervisor/supervisord.log
/var/log/doca/hbn/syslog

nico-dpu-agent logs:

The agent can run in two modes depending on the deployment:

  • Systemd service (forge-dpu-agent.service): Logs go to journald. Query with journalctl -u forge-dpu-agent.service.
  • Containerized DaemonSet (via DPF): Logs go to stdout, captured at /var/log/pods/*/nico-dpu-agent/*.log. The collector parses the CRI log format.

In both cases, the agent emits logfmt output. The otelcol-contrib collector extracts log levels from the logfmt level= field.
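
When reading agent logs by hand, the same level extraction can be done with awk. This is a rough sketch, not the collector's parser: `logfmt_level` is a hypothetical helper, it assumes the level value is a single token without spaces, and the sample line in the usage note is invented for illustration.

```shell
# Hypothetical helper: print the value of the level= field for each
# logfmt line on stdin, or "unknown" when the field is absent.
logfmt_level() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^level=/) {
        sub(/^level=/, "", $i)     # keep only the value
        gsub(/"/, "", $i)          # tolerate level="warn"
        print $i
        next
      }
    }
    print "unknown"
  }'
}
```

For example, feeding it the line `level=warn msg="link down"` prints `warn`. Combined with `journalctl -u forge-dpu-agent.service -o cat`, this gives a quick per-line view of severities on the DPU.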

3.2 DPU collector configuration

The DPU runs otelcol-contrib with configuration from /etc/otelcol-contrib/config.yaml. Key aspects:

Resource attributes added to all logs:

  • host.name — DPU hostname (from resourcedetection processor)
  • machine.id — NICo machine ID (from file at /run/otelcol-contrib/machine-id)
  • host.machine.id — Host machine ID (from /run/otelcol-contrib/host-machine-id)
  • component — Log source identifier (journald, hbn, dpu-auth-filelog)

Export to site controller:

exporters:
  otlp/site:
    endpoint: site-otel-receiver.nico:443
    tls:
      ca_file: /opt/forge/forge_root.pem
      cert_file: /opt/forge/machine_cert.pem
      key_file: /opt/forge/machine_cert.key
      reload_interval: 1h
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 1h

DPU logs are sent over mTLS using machine certificates provisioned by NICo. The forge-dpu-otel-agent service handles certificate renewal.
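
Since export depends on a valid machine certificate, checking its expiry on the DPU is a useful first step when logs stop arriving. The sketch below relies on the standard `openssl x509` options `-enddate` and `-checkend`; `check_machine_cert` is a hypothetical helper, and the one-day default window is an arbitrary choice.

```shell
# Hypothetical helper: report the certificate's expiry date and check that
# it stays valid for at least `window` more seconds.
# Returns 0 if valid beyond the window, 1 if expired or expiring within it,
# 2 if the file is missing or unreadable.
check_machine_cert() {
  cert=${1:-/opt/forge/machine_cert.pem}
  window=${2:-86400}
  if [ ! -r "$cert" ]; then
    echo "missing or unreadable: $cert" >&2
    return 2
  fi
  openssl x509 -in "$cert" -noout -enddate
  openssl x509 -in "$cert" -noout -checkend "$window" >/dev/null
}
```

Running `check_machine_cert` with no arguments on the DPU checks the certificate path used by the exporter configuration above.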

3.3 Site controller receiver

The site controller’s otel-collector receives DPU logs via OTLP and routes them through processing pipelines:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
      http:
        endpoint: ${env:MY_POD_IP}:4318

connectors:
  routing/otlp-logs:
    default_pipelines:
      - logs/dpu
    table:
      # Route console logs separately
      - statement: route() where attributes["component"] == "nico-ssh-console-rs"
        pipelines:
          - logs/console

service:
  pipelines:
    logs/otlp-in:
      receivers: [otlp]
      exporters: [routing/otlp-logs]

    logs/dpu:
      receivers: [routing/otlp-logs]
      processors:
        - memory_limiter
        - resource/dpu-logs-loki
        - transform/dpu-logs-loki
        - batch
      exporters: [loki]

Resource labels for Loki indexing:

processors:
  resource/dpu-logs-loki:
    attributes:
      - action: insert
        key: loki.resource.labels
        value: exporter, machine.id, host.machine.id, host.name, component, site
      - action: insert
        key: loki.format
        value: raw

3.4 Querying DPU logs

By DPU hostname:

{host_name="dpu-node-01"}

By machine ID:

{machine_id="fm100ds..0042"}

By component:

{component="journald"} |~ "kernel"
{component="hbn"} |~ "frr|bgp"
{component="dpu-auth-filelog"}

Kernel errors on a specific DPU:

{host_name="dpu-node-01", component="journald"} | json | PRIORITY <= 3
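
The queries above use `host_name` and `machine_id` even though the resource attributes are named `host.name` and `machine.id`. Loki label names cannot contain dots, so dotted attribute names appear with underscores once indexed. The helper below is a small sketch of that mapping for building selectors by hand; `loki_label_name` is a hypothetical name, and it models only the dot-to-underscore substitution.

```shell
# Hypothetical helper: convert an OpenTelemetry attribute name into the
# label name to use in a LogQL selector.
loki_label_name() {
  printf '%s\n' "$1" | tr '.' '_'
}
```

For example, `loki_label_name host.machine.id` prints `host_machine_id`, which is the label to query when filtering DPU logs by their host machine.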

4. Troubleshooting

Console logs not appearing

| Symptom | Cause | Fix |
| --- | --- | --- |
| No files in /var/log/consoles/ | console_logging_enabled: false | Enable in config |
| Files exist but empty | No active BMC sessions | Connect to a machine console |
| Sidecar not shipping | lokiLogCollector.enabled: false | Enable sidecar in Helm values |
| Sidecar shipping but no data in backend | Wrong exporter endpoint | Check sidecar config and backend connectivity |

Verify console logging is working:

# Check for console log files
kubectl exec -it deploy/nico-ssh-console-rs -- ls -la /var/log/consoles/

# Tail a console log
kubectl exec -it deploy/nico-ssh-console-rs -- tail -f /var/log/consoles/<machine-id>_<bmc-ip>.log

DPU logs not appearing

| Symptom | Cause | Fix |
| --- | --- | --- |
| No logs from DPU | otelcol-contrib not running | Check systemctl status otelcol-contrib on DPU |
| Connection refused | Site collector not listening | Verify OTLP receiver is enabled |
| TLS errors | Certificate issues | Check cert paths and renewal (forge-dpu-otel-agent) |
| Logs arriving but missing labels | Processor misconfiguration | Check resource/dpu-logs-loki processor |

Verify DPU collector is running:

# SSH to DPU and check service
systemctl status otelcol-contrib

# Check collector logs
journalctl -u otelcol-contrib -f

# Verify certificate files exist
ls -la /opt/forge/machine_cert.pem /opt/forge/machine_cert.key

Verify site collector is receiving:

# Check site collector logs for incoming connections
kubectl logs -l app.kubernetes.io/name=opentelemetry-collector -f | grep -i "otlp\|dpu"

Accessing DPU logs directly

When centralized logging isn’t available or you need to debug on the DPU itself:

# SSH to DPU (if SSH works)
ssh <dpu-oob-ip>

# Check nico-dpu-agent logs
journalctl -u forge-dpu-agent.service -e --no-pager

# Check otelcol-contrib logs
journalctl -u otelcol-contrib -e --no-pager

# Check HBN/DOCA services
sudo crictl ps
sudo crictl logs <container-id>

# Check BGP status (inside DOCA HBN container)
sudo crictl exec -ti $(sudo crictl ps | grep doca-hbn | awk '{print $1}') \
  vtysh -c 'show bgp summary'

If SSH to the DPU fails, use DPU BMC or rshim console access to check whether the DPU OS booted.

Other useful log locations

| Location | Description |
| --- | --- |
| /var/log/nico/nico-scout.log | Host discovery scout logs during machine ingestion |
| journalctl -u forge-dpu-agent.service | DPU agent: heartbeat, network config, BGP, HBN, service health |
| /var/log/doca/hbn/* | DOCA HBN component logs (FRR, nvued, nl2docad, etc.) |
| /var/log/auth.log | DPU authentication/security events |

5. References