How NICo component tracing works, what it covers, how to turn it on and off, and what it costs.
- nico-api (the carbide-api binary) is NICo’s primary tracing source and the subject of this document. nico-dns also emits traces, but with a separate, simpler, always-on setup.
- No other NICo component emits traces.
- The exporter endpoint comes from [tracing] otlp_endpoint or from OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, which overrides the TOML value.
- Tracing is turned on at startup with [tracing] enabled = true, or at runtime with nico-admin-cli set tracing-enabled true when tracing.allow_runtime_changes = true.

Two binaries build an OTLP span exporter:
- nico-api (crates/api-core/src/logging/setup.rs) - the rich, control-plane tracing this document is mostly about, off by default behind endpoint plus enabled-flag configuration.
- nico-dns (crates/dns/src/main.rs) - a separate, much simpler always-on setup.

The other binaries (nico-pxe, nico-dhcp, nico-bmc-proxy, nico-hardware-health, nico-ssh-console-rs, nico-dsx-exchange-consumer) carry the OpenTelemetry crates in the workspace but do not build a span exporter, so they emit no traces.
Unless noted otherwise, the rest of this document describes nico-api tracing. nico-dns differs as described in 1.5.
nico-api links many library crates in-process and the #[tracing::instrument] spans live in
those crates. When tracing is enabled, the instrumented operations are:
There is also a metric, carbide_api_tracing_spans_open, that reports the number of currently
open spans (exported by the spancounter crate) - useful for spotting span leaks or runaway
trace volume.
These cover the control-plane paths an operator most often needs to debug: machine provisioning/reconcile loops, power control and firmware updates against the BMC/power/rack backends, plus the database work underneath them - which maps directly to the EPIC’s “time on a given state of the machine, nodes stuck” need.
nico-api uses a custom CarbideSpanSampler wrapped as ParentBased:
- Root spans are sampled only when the tracing_enabled flag is on, from [tracing] enabled = true at startup or from the dynamic tracing-enabled setting.
- The root span’s code.namespace must also begin with carbide:: for it to be sampled.
- Child spans follow their parent’s decision (that is what ParentBased means), so once a trace is sampled the whole call tree beneath it is captured - except tokio spans, which are always dropped (they leak and would exhaust memory).
- service.name = carbide-api; the tracer is named carbide.

nico-api pushes spans over OTLP/gRPC to a collector endpoint you configure. It does not
discover or get injected with anything - it simply connects out to the endpoint from
[tracing] otlp_endpoint or, if set, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT. The environment
variable overrides the TOML value. The transport details: gRPC-only, plaintext.
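The sampling rules described for CarbideSpanSampler can be modeled in a few lines. This is a rough Python sketch, not the Rust implementation: SpanInfo, should_sample and both prefix constants are hypothetical names, and identifying tokio spans by a namespace prefix is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed prefixes for this sketch; the real matching logic lives in nico-api.
ROOT_NAMESPACE_PREFIX = "carbide::"
TOKIO_NAMESPACE_PREFIX = "tokio"


@dataclass
class SpanInfo:
    code_namespace: str
    # None marks a root span; otherwise this is the parent's sampling decision.
    parent_sampled: Optional[bool] = None


def should_sample(span: SpanInfo, tracing_enabled: bool) -> bool:
    """Model of the sampling decision described in the text."""
    # tokio spans are always dropped, even under a sampled parent.
    if span.code_namespace.startswith(TOKIO_NAMESPACE_PREFIX):
        return False
    # ParentBased: a child span inherits its parent's decision.
    if span.parent_sampled is not None:
        return span.parent_sampled
    # Root span: needs the flag on and a carbide namespace.
    return tracing_enabled and span.code_namespace.startswith(ROOT_NAMESPACE_PREFIX)
```

The ordering matters in this model: the tokio check runs first so that a sampled parent cannot pull tokio spans back in, and the tracing_enabled flag is consulted only for root spans.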
nico-dns has its own tracing setup (crates/dns/src/main.rs), independent of and simpler than
nico-api’s:
- There is no tracing-enabled switch. If the process runs, it is exporting.
- The endpoint comes from the otlp_endpoint config field (crates/dns/src/config.rs), which defaults to http://opentelemetry-collector.otel.svc.cluster.local:4317. Because of that default, nico-dns tries to export out of the box.
- It has no custom sampler (no CarbideSpanSampler), so it records broadly, filtered only by the log-level directives in its EnvFilter. It instruments retrieve_records, among others.
- service.name = carbide-dns; logs are JSON on stdout (not logfmt).
- Transport is OTLP/gRPC, plaintext (with_tonic, no tls feature).

Enabling tracing has two parts: startup configuration for the exporter endpoint, and an enabled flag that can come from startup config or, when allowed, the runtime switch. An endpoint without the enabled flag emits no traces. The enabled flag without an endpoint also emits no traces because no OTLP exporter is built.
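The two-part rule for nico-api can be written as a small truth function. The sketch below is an illustrative Python model only: resolve_endpoint and emits_traces are hypothetical names, and treating an empty environment variable as unset is an assumption, not documented behavior.

```python
from typing import Mapping, Optional

ENV_VAR = "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"


def resolve_endpoint(toml_endpoint: Optional[str], env: Mapping[str, str]) -> Optional[str]:
    """Pick the exporter endpoint; the environment variable overrides the TOML value."""
    env_value = env.get(ENV_VAR)
    if env_value:  # assumption: an empty string counts as unset
        return env_value
    return toml_endpoint or None


def emits_traces(toml_endpoint: Optional[str], env: Mapping[str, str], enabled: bool) -> bool:
    """Traces flow only when an exporter was built AND the enabled flag is on."""
    exporter_built = resolve_endpoint(toml_endpoint, env) is not None
    return exporter_built and enabled
```

For example, emits_traces(None, {}, True) is False because no exporter is built, and emits_traces("http://otel-collector.observability.svc.cluster.local:4317", {}, False) is False because the flag is off.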
(a) A traces backend. Anything that accepts OTLP traces: e.g. Tempo, Jaeger, Grafana Cloud, Datadog, Elastic APM or another OTEL collector acting as a gateway.
(b) A collector to receive OTLP from nico-api. nico-api should send to a collector, not straight to the backend - the collector is where you do sampling, batching, attribute normalization and (importantly) TLS for anything leaving the cluster. There are two common ways to give nico-api a collector to talk to:
Option A - a shared collector (Deployment or DaemonSet) that many workloads send to. A minimal
otel-collector traces pipeline:
With Option A, nico-api’s endpoint is the collector’s in-cluster Service, e.g.
http://otel-collector.observability.svc.cluster.local:4317.
Option B - a per-pod sidecar collector injected by the OpenTelemetry Operator. If your cluster
runs the OpenTelemetry Operator, you can have it inject a collector container into the nico-api
pod via a pod annotation. nico-api then talks to the collector over localhost (same pod, same network namespace).
The annotation value follows the form <namespace>/<collector-name>:
(c) Point nico-api at the collector. nico-api builds its OTLP span exporter only if an endpoint is configured at startup. If no endpoint is configured, no tracing layer is constructed at all and nothing is ever emitted - regardless of the enabled flag.
Preferred config-file form:
The deployment environment variable form is still supported and takes precedence over the TOML endpoint:
Notes:
- OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is the only trace-related setting nico-api reads from the environment. Other standard OTEL env vars are ignored.
- The endpoint must be plaintext gRPC (http://…, h2c); 4317 is the default OTLP/gRPC port. Do not point it at a 4318 HTTP receiver and do not use https://.
- Setting the endpoint alone emits nothing; [tracing] enabled must also be true.

With the endpoint configured, emission is controlled by [tracing] enabled, which defaults off:
When allow_runtime_changes = true, toggle tracing live without a restart:
Under the hood this sets the dynamic config ConfigSetting::TracingEnabled, which flips the
in-process tracing_enabled flag that CarbideSpanSampler reads. If
allow_runtime_changes = false, the SetDynamicConfig call is rejected with PermissionDenied;
the startup value from [tracing] enabled remains authoritative until nico-api restarts with a new
config.
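The gating of the runtime switch can be pictured with a short model. This is a hedged Python sketch rather than the nico-api code: TracingState and its method are hypothetical stand-ins for the SetDynamicConfig handling, and the exception class only mimics the PermissionDenied status named above.

```python
class PermissionDenied(Exception):
    """Stand-in for the PermissionDenied status returned by SetDynamicConfig."""


class TracingState:
    def __init__(self, enabled: bool, allow_runtime_changes: bool) -> None:
        # Both values are read once, from the [tracing] section at startup.
        self.tracing_enabled = enabled
        self.allow_runtime_changes = allow_runtime_changes

    def set_tracing_enabled(self, value: bool) -> None:
        """Model of the runtime toggle for the tracing-enabled setting."""
        if not self.allow_runtime_changes:
            # The startup value stays authoritative until a restart.
            raise PermissionDenied("runtime tracing changes are not allowed")
        # Flip the in-process flag that the sampler reads.
        self.tracing_enabled = value
```

In this model a rejected call leaves tracing_enabled untouched, which matches the statement that the startup value remains authoritative.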
Leaving tracing off in steady state is the intended operating mode. If you need startup-only
control, set allow_runtime_changes = false and change [tracing] enabled through the config file
plus a pod roll.
It depends on which part you are changing:
Why: [tracing] otlp_endpoint, [tracing] enabled, [tracing] allow_runtime_changes, and
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT are read at process startup (crates/api-core/src/logging/setup.rs).
If no endpoint was configured when nico-api started, the OTLP exporter and tracing layer were never
constructed and there is no way to add them at runtime. The runtime switch, when allowed, only flips
an in-process flag and never needs a restart.
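As a compact restatement of the reasoning above, the restart question can be modeled as a function of what is being changed. The sketch is illustrative Python with hypothetical names (needs_restart, STARTUP_ONLY); it encodes only the rules stated in this section and assumes "enabled" refers to changing whether traces are emitted.

```python
# Settings that are read only at process startup, per the explanation above.
STARTUP_ONLY = {
    "otlp_endpoint",
    "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
    "allow_runtime_changes",
}


def needs_restart(setting: str, exporter_built: bool, allow_runtime_changes: bool) -> bool:
    """Return True if changing `setting` requires restarting nico-api to take effect."""
    if setting in STARTUP_ONLY:
        return True
    if setting == "enabled":
        # The runtime switch only helps if an exporter exists and changes are allowed.
        return not (exporter_built and allow_runtime_changes)
    raise ValueError(f"unknown tracing setting: {setting}")
```

Here exporter_built stands for "an endpoint was configured when nico-api started"; without it, flipping the flag at runtime cannot produce traces.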
Recommendation: set [tracing] otlp_endpoint at deploy time and leave it in place permanently -
the plumbing is cheap while tracing is toggled off. Keep enabled = false and
allow_runtime_changes = true for debug-on-demand environments, or set
allow_runtime_changes = false when the config file should be the only control plane for tracing.
- [tracing] otlp_endpoint or OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is set on nico-api and points at the collector’s gRPC endpoint.
- The collector has a traces pipeline and its logs show the OTLP receiver listening on 4317.
- [tracing] enabled = true is configured, or nico-admin-cli set tracing-enabled true has been run while tracing.allow_runtime_changes = true.
- In the traces backend, search for service.name = carbide-api.
- Check carbide_api_tracing_spans_open to confirm spans are being opened.

Tracing has real cost, which is the reason it defaults off. The cost depends on which of three states nico-api is in:
This is the expensive mode the dev team warns about:
- Because the sampler is ParentBased, a sampled root span pulls in its entire child subtree (the component-manager, controller, and DB spans beneath it). A single traced operation can therefore produce many spans.
- Mitigate with tail_sampling at the collector (keep errors/slow traces, sample the rest) and - most importantly - only enable it during an active investigation, then turn it back off.

This is the common steady state if you follow the recommendation to leave the endpoint configured
with [tracing] enabled = false, or after disabling tracing dynamically. The overhead here is
near-zero but not exactly zero:

- Leave [tracing] otlp_endpoint configured and keep tracing off in steady state - cheap and avoids a pod roll when you need traces.
- Watch carbide_api_tracing_spans_open and nico-api CPU/latency while it is on.
- carbide_api_tracing_spans_open.