How NICo component tracing works, what it covers, how to turn it on and off, and what it costs.
nico-api (the carbide-api binary) is NICo’s primary tracing source and the subject of this
document. nico-dns also emits traces, but with a separate, simpler, always-on setup.
No other NICo component emits traces.
Enabling nico-api tracing takes two steps:
- Set OTEL_EXPORTER_OTLP_TRACES_ENDPOINT at deploy time.
- Run nico-admin-cli set tracing-enabled true at runtime.
Two binaries build an OTLP span exporter:
- nico-api (crates/api-core/src/logging/setup.rs) - the rich, control-plane tracing this
document is mostly about, off by default behind the two-part enablement.
- nico-dns (crates/dns/src/main.rs) - a separate, much simpler always-on setup.
The other binaries (nico-pxe, nico-dhcp, nico-bmc-proxy, nico-hardware-health, nico-ssh-console-rs, nico-dsx-exchange-consumer) carry the OpenTelemetry crates in the workspace but do not build a span exporter, so they emit no traces.
Unless noted otherwise, the rest of this document describes nico-api tracing. nico-dns differs as described in 1.5.
nico-api links many library crates in-process and the #[tracing::instrument] spans live in
those crates. When tracing is enabled, the instrumented operations are:
There is also a metric, carbide_api_tracing_spans_open, that reports the number of currently
open spans (exported by the spancounter crate) - useful for spotting span leaks or runaway
trace volume.
These cover the control-plane paths an operator most often needs to debug: machine provisioning/reconcile loops, power control and firmware updates against the BMC/power/rack backends, plus the database work underneath them - which maps directly to the EPIC’s “time on a given state of the machine, nodes stuck” need.
nico-api uses a custom CarbideSpanSampler wrapped as ParentBased:
- A root span is sampled only when its code.namespace begins with carbide:: and the runtime
tracing-enabled flag is on.
- Child spans inherit their parent’s sampling decision (that is what ParentBased means), so once a
trace is sampled the whole call tree beneath it is captured - except tokio spans, which are
always dropped (they leak and would exhaust memory).
- service.name = carbide-api; the tracer is named carbide.
nico-api pushes spans over OTLP/gRPC to a collector endpoint you configure. It does not
discover or get injected with anything - it simply connects out to whatever
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT points at. The transport details: gRPC-only, plaintext.
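The sampling rules described above can be sketched in a few lines of std-only Rust. This is an illustrative model, not the real CarbideSpanSampler: the type and function names are hypothetical, and recognising tokio spans by a namespace prefix is an assumption made for the sketch.

```rust
/// Outcome of a sampling decision for one span.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Decision {
    Sample,
    Drop,
}

/// What this sketch knows about a span when it starts (hypothetical type).
struct SpanStart<'a> {
    /// Decision already taken for the parent span; None for a root span.
    parent: Option<Decision>,
    /// Value of the span's code.namespace attribute.
    code_namespace: &'a str,
}

/// Hypothetical stand-in for CarbideSpanSampler wrapped as ParentBased.
fn should_sample(span: &SpanStart, tracing_enabled: bool) -> Decision {
    // Assumption: tokio spans are recognised by their namespace prefix.
    // They are dropped even when the parent was sampled.
    if span.code_namespace.starts_with("tokio") {
        return Decision::Drop;
    }
    match span.parent {
        // ParentBased: a child follows its parent's decision.
        Some(parent_decision) => parent_decision,
        // Root span: the runtime flag and the namespace filter must both pass.
        None => {
            if tracing_enabled && span.code_namespace.starts_with("carbide::") {
                Decision::Sample
            } else {
                Decision::Drop
            }
        }
    }
}
```

Note that a child outside the carbide:: namespace is still captured when its parent was sampled, which is why one sampled root pulls in the whole call tree.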
nico-dns has its own tracing setup (crates/dns/src/main.rs), independent of and simpler than
nico-api’s:
- Always on: there is no tracing-enabled switch. If the process runs, it is exporting.
- The endpoint comes from the otlp_endpoint config field
(crates/dns/src/config.rs), which defaults to
http://opentelemetry-collector.otel.svc.cluster.local:4317. Because of that default, nico-dns
tries to export out of the box.
- There is no custom sampler (no CarbideSpanSampler),
so it records broadly, filtered only by the log-level directives in its EnvFilter. It
instruments retrieve_records, among others.
- service.name = carbide-dns; logs are JSON on stdout (not logfmt).
- The transport is gRPC without TLS (with_tonic, no tls feature).
Enabling tracing has two parts: a one-time deploy-time configuration and a runtime switch. Both must be in place; satisfying only one produces no traces.
(a) A traces backend. Anything that accepts OTLP traces: e.g. Tempo, Jaeger, Grafana Cloud, Datadog, Elastic APM or another OTEL collector acting as a gateway.
(b) A collector to receive OTLP from nico-api. nico-api should send to a collector, not straight to the backend - the collector is where you do sampling, batching, attribute normalization and (importantly) TLS for anything leaving the cluster. There are two common ways to give nico-api a collector to talk to:
Option A - a shared collector (Deployment or DaemonSet) that many workloads send to. A minimal
otel-collector traces pipeline:
With Option A, nico-api’s endpoint is the collector’s in-cluster Service, e.g.
http://otel-collector.observability.svc.cluster.local:4317.
Option B - a per-pod sidecar collector injected by the OpenTelemetry Operator. If your cluster
runs the OpenTelemetry Operator, you can have it inject a collector container into the nico-api
pod via a pod annotation. nico-api then talks to the collector over localhost (same pod, same network namespace).
The annotation value follows the form <namespace>/<collector-name>:
(c) Point nico-api at the collector. nico-api builds its OTLP span exporter only if
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is set. If it is unset, no tracing layer is constructed at
all and nothing is ever emitted - regardless of the runtime switch.
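A minimal sketch of that startup gate, in std-only Rust. The function names are hypothetical and the real logic lives in crates/api-core/src/logging/setup.rs; the second function is an illustrative lint that encodes the scheme and port guidance from the notes that follow, not something nico-api is known to run.

```rust
/// Name of the variable nico-api reads at startup.
const ENDPOINT_VAR: &str = "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT";

/// Returns the collector endpoint if the tracing plumbing should be built.
/// `lookup` abstracts the environment so the sketch is easy to test; in a
/// real process it would be `|k| std::env::var(k).ok()`.
fn traces_endpoint(lookup: impl Fn(&str) -> Option<String>) -> Option<String> {
    // Unset variable: None, so no exporter or tracing layer is constructed.
    lookup(ENDPOINT_VAR)
}

/// Hypothetical lint for an endpoint value: plaintext gRPC only, and not the
/// OTLP/HTTP port.
fn endpoint_warnings(endpoint: &str) -> Vec<&'static str> {
    let mut warnings = Vec::new();
    if endpoint.starts_with("https://") {
        warnings.push("https:// is not supported; use plaintext http://");
    } else if !endpoint.starts_with("http://") {
        warnings.push("expected an http:// URL");
    }
    // 4318 is the default OTLP/HTTP port; the exporter speaks gRPC (default 4317).
    if endpoint.trim_end_matches('/').ends_with(":4318") {
        warnings.push("port 4318 is an OTLP/HTTP receiver; use the gRPC port");
    }
    warnings
}
```

Because the lookup happens once at startup, a value added to the environment later has no effect until the process is restarted.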
Notes:
- The endpoint is plaintext gRPC (http://…, h2c); 4317 is the default
OTLP/gRPC port. Do not point it at a 4318 HTTP receiver and do not use https://.
With the endpoint configured, emission is still controlled by a runtime flag that defaults off. Toggle it live without a restart by running nico-admin-cli set tracing-enabled true.
Under the hood this sets the dynamic config ConfigSetting::TracingEnabled, which flips the
in-process tracing_enabled flag that CarbideSpanSampler reads. Leaving it off in steady
state is the intended operating mode.
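The mechanism is cheap because the switch is only a shared boolean. A minimal sketch, assuming an atomic flag; the names below are hypothetical stand-ins, not the actual nico-api symbols:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Stand-in for the in-process flag; it starts off, matching the default.
static TRACING_ENABLED: AtomicBool = AtomicBool::new(false);

/// What a handler for the dynamic setting might do (hypothetical name).
fn set_tracing_enabled(on: bool) {
    TRACING_ENABLED.store(on, Ordering::Relaxed);
}

/// What a sampler would consult on each root-span decision.
fn tracing_enabled() -> bool {
    TRACING_ENABLED.load(Ordering::Relaxed)
}
```

Flipping the flag changes the next sampling decision immediately, with no exporter or layer being rebuilt, which is why no restart is involved.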
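Whether a restart is needed depends on which part you are changing:
- Setting OTEL_EXPORTER_OTLP_TRACES_ENDPOINT for the first time: the pod must be restarted/rolled.
- Flipping the runtime switch: no restart.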
Why: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is read exactly once, at process startup
(crates/api-core/src/logging/setup.rs). If it was unset when nico-api started, the OTLP
exporter and tracing layer were never constructed and there is no way to add them at runtime -
so the first time you set the endpoint you must restart/roll the pod. The runtime switch,
by contrast, only flips an in-process flag and never needs a restart.
Recommendation: set OTEL_EXPORTER_OTLP_TRACES_ENDPOINT at deploy time and leave it in place
permanently - the plumbing is cheap while tracing is toggled off. Enabling/disabling then
never requires a restart, which is the whole point of separating the configuration from the switch.
- OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is set on the nico-api pod and points at the collector’s
gRPC endpoint.
- The collector has a traces pipeline and its logs show the OTLP receiver listening on 4317.
- nico-admin-cli set tracing-enabled true has been run.
- The backend shows traces with service.name = carbide-api.
- Check carbide_api_tracing_spans_open to confirm spans are being opened.
Tracing has real cost, which is the reason it defaults off. The cost depends on which of three states nico-api is in:
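The three states follow from the two-part enablement. A minimal sketch, with hypothetical names, of how the endpoint setting and the runtime switch combine:

```rust
/// The three states, derived from the two enablement parts (names are illustrative).
#[derive(Debug, PartialEq)]
enum TracingState {
    /// Endpoint unset: no exporter or tracing layer exists.
    NotConfigured,
    /// Endpoint set, runtime switch off: plumbing exists, nothing is sampled.
    ConfiguredOff,
    /// Endpoint set, runtime switch on: spans are sampled and exported.
    ConfiguredOn,
}

fn tracing_state(endpoint_set: bool, switch_on: bool) -> TracingState {
    match (endpoint_set, switch_on) {
        // Without the endpoint the switch has nothing to act on.
        (false, _) => TracingState::NotConfigured,
        (true, false) => TracingState::ConfiguredOff,
        (true, true) => TracingState::ConfiguredOn,
    }
}

/// Traces are emitted only when both parts are in place.
fn emits_traces(state: &TracingState) -> bool {
    matches!(state, TracingState::ConfiguredOn)
}
```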
Endpoint configured with the switch on is the expensive mode the dev team warns about:
- Because the sampler is ParentBased, a sampled root span pulls in its entire child subtree
(the component-manager, machine-a-tron, controller and DB spans beneath it). A single traced
operation can therefore produce many spans.
- To contain the volume, use tail_sampling at the collector (keep errors/slow traces, sample the rest) and -
most importantly - only enable it during an active investigation, then turn it back off.
Endpoint configured with the switch off is the common steady state if you follow the recommendation to leave the endpoint configured. The overhead here is near-zero but not exactly zero:
- Leave OTEL_EXPORTER_OTLP_TRACES_ENDPOINT configured and keep the runtime switch off in steady
state - cheap and avoids a pod roll when you need traces.
- Watch carbide_api_tracing_spans_open and nico-api CPU/latency while it is on.
- carbide_api_tracing_spans_open.