NVSentinel uses OpenTelemetry distributed tracing to provide end-to-end visibility into the lifecycle of every health event as it moves through the breakfix pipeline. A single trace follows one health event from detection through quarantine, drain, and remediation across all NVSentinel modules — platform-connector, fault-quarantine, node-drainer, fault-remediation, health-events-analyzer, janitor, janitor-provider, and event-exporter.
Instead of manually correlating logs across multiple services to debug an issue (e.g., “Why was a node not remediated?” or “Why did fault-quarantine take longer than expected to handle an event?”), you can look up a single trace and see the complete journey, timing, and outcome of any health event in one view.
NVSentinel’s breakfix pipeline spans many modules. When something goes wrong (a health event takes longer than expected to move from detection to remediation, a remediation fails, or a node isn’t recovered), understanding what happened requires piecing together information from multiple services.
Distributed tracing solves this problem by giving you a structured, request-scoped view of every health event’s path through the system, showing what each step did, where it ran, and how long it took.
When a health event enters NVSentinel, the system starts a trace, identified by a unique trace ID that follows the event through every module. Each module adds spans (units of work with start and end times) to the trace as it processes the event. Spans are nested to show the call hierarchy, and they carry attributes (key-value metadata) describing what happened, including any errors that occurred.
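To make the vocabulary concrete, the sketch below models a trace as a tree of spans in plain Python. It only illustrates the concepts and is not NVSentinel's or OpenTelemetry's actual data structure: the `Span` class, the root span name, the trace ID, and the attribute values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical model for illustration)."""
    name: str
    trace_id: str  # shared by every span in the same trace
    start_ms: int
    end_ms: int
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

    def child(self, name: str, start_ms: int, end_ms: int,
              attributes: Optional[Dict[str, str]] = None) -> "Span":
        """Create a nested span that inherits this span's trace ID."""
        span = Span(name, self.trace_id, start_ms, end_ms, dict(attributes or {}))
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> List[str]:
    """Return one indented line per span, showing the call hierarchy."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms} ms)"]
    for nested in span.children:
        lines.extend(render(nested, depth + 1))
    return lines


# Hypothetical example: one root span with a database call and an API call.
root = Span("process_event", "example-trace-id", 0, 120)
update = root.child("HealthEvents.update", 10, 40, {"db.operation.name": "update"})
put = root.child("HTTP PUT", 50, 110, {"error.type": "timeout"})
print("\n".join(render(root)))
```

Every span carries the same trace ID, which is what lets a tracing backend reassemble the tree from spans emitted by different modules.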
All NVSentinel modules export traces via OTLP gRPC to an OpenTelemetry Collector.
The traces capture:
- Database operations, at the HealthEventStore layer. Filter slow queries with TraceQL (e.g., {db.duration_ms > 50}).
- K8s API calls, which appear as HTTP spans. Find slow calls with TraceQL (e.g., {name =~ "HTTP.*" && duration > 500ms}).
- Errors, recorded as error.type and error.message attributes on spans when something fails, so you can see exactly which step failed and what error occurred. Find all error spans with TraceQL: {status = error}

NVSentinel modules don’t communicate with each other over HTTP/gRPC (except janitor → janitor-provider). Instead, trace context is propagated through three mechanisms, depending on the boundary.
Each module reads the upstream trace context, creates a linked span in the same trace, and writes its own span ID for the next module. The result is a single trace that spans all modules.
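The hand-off described above can be sketched in a few lines of Python. This is a simplified illustration, not NVSentinel's implementation: the `trace_context` field on the event, the helper names, and the module order are hypothetical, and the ID widths (128-bit trace ID, 64-bit span ID) follow the usual OpenTelemetry convention.

```python
import secrets


def new_span_id() -> str:
    """Random 64-bit span ID rendered as 16 hex characters."""
    return secrets.token_hex(8)


def start_trace(event: dict, module: str) -> dict:
    """First module: create the trace and record its own span ID on the event."""
    span = {
        "module": module,
        "trace_id": secrets.token_hex(16),  # 128-bit trace ID
        "span_id": new_span_id(),
        "upstream_span_id": None,
    }
    # Hypothetical carrier field; the real carrier depends on the boundary.
    event["trace_context"] = {"trace_id": span["trace_id"], "span_id": span["span_id"]}
    return span


def continue_trace(event: dict, module: str) -> dict:
    """Later module: read upstream context, join the same trace, write own span ID."""
    upstream = event["trace_context"]
    span = {
        "module": module,
        "trace_id": upstream["trace_id"],          # same trace as upstream
        "span_id": new_span_id(),                  # new span for this module
        "upstream_span_id": upstream["span_id"],   # link back to the previous module
    }
    event["trace_context"] = {"trace_id": span["trace_id"], "span_id": span["span_id"]}
    return span


# Illustrative hand-off across three modules.
event = {"id": "example-event"}
spans = [
    start_trace(event, "platform-connector"),
    continue_trace(event, "fault-quarantine"),
    continue_trace(event, "node-drainer"),
]
for s in spans:
    print(s["module"], s["trace_id"], s["span_id"], s["upstream_span_id"])
```

The trace ID never changes after the first module, while each span points back at the span that handed the event over, so the backend can stitch the spans into one trace.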
When tracing is enabled and a module uses trace-correlation logging, JSON logs emitted within an active span context include trace_id and span_id fields. This lets you jump from a trace to corresponding logs (or vice versa) in your log aggregation system.
Search by trace_id in Loki, Kibana, or any log backend to find all logs associated with a specific trace.
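If your logs are available as JSON lines (for example, saved from a container's output), the same lookup can be sketched locally. The sample log lines, the `msg` field, and the ID values below are made up for illustration; only the `trace_id` and `span_id` field names come from the description above.

```python
import json


def logs_for_trace(lines, trace_id):
    """Return the parsed JSON log records whose trace_id matches."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as plain-text startup output
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matches.append(record)
    return matches


# Hypothetical log lines; real messages and IDs will differ.
sample_lines = [
    '{"msg": "processing event", "trace_id": "aaaa", "span_id": "0001"}',
    '{"msg": "unrelated work", "trace_id": "bbbb", "span_id": "0002"}',
    'plain text line without JSON',
    '{"msg": "event handled", "trace_id": "aaaa", "span_id": "0003"}',
    '{"msg": "housekeeping outside any span"}',
]

for record in logs_for_trace(sample_lines, "aaaa"):
    print(record["span_id"], record["msg"])
```

Records without a `trace_id` field are simply skipped, which matches the behavior described above: logs emitted outside an active span context do not carry these fields.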
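Configure tracing through your Helm values under global.tracing: set global.tracing.enabled to true and global.tracing.endpoint to the OTLP gRPC address of your collector.
When tracing is enabled, Helm injects tracing environment variables, including OTEL_EXPORTER_OTLP_ENDPOINT, into every NVSentinel module.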
You can look up traces in your tracing UI (Grafana Tempo, Jaeger) by:
- Trace ID: if you have a trace_id from logs, a health event document, or an exported CloudEvent, search for it directly.
- Health event ID: use TraceQL, e.g., {span.health_event.id = "69e1c5b487beb8cfe7bc440e"}. The health_event.id attribute is present on spans from fault-quarantine, node-drainer, and fault-remediation.
- Service: filter by service.name (e.g., fault-quarantine, node-drainer) to see all traces through a specific module.
- Span attributes: search by module-specific attributes (e.g., node_drainer.drain.scope, fault_remediation.log_collector.outcome).

To find out why a module was slow:
- Look at K8s API calls (HTTP * spans) or database operations (db.* spans) within the slow module to identify whether the delay is from an API call or a DB query.
- Copy the trace_id or span_id from the span and search for it in the corresponding service’s container logs to get the full context.

To find out why a node was not remediated:
- Check the drain step (node_drainer.drain_session span).
- Look for a fault_remediation.skip_event span and read the fault_remediation.skip.reason attribute to understand why the event was skipped.
- Look for janitor.reset_job.failed = true or janitor.error.type = "reboot_timeout" attributes on janitor spans to see if the GPU reset or reboot failed on the node.

Database operations appear as spans named <collection>.<operation> (e.g., HealthEvents.update, HealthEvents.insert, HealthEvents.find), with attributes like db.operation.name, db.collection.name, db.system.name (mongodb), and network.peer.address (the MongoDB host). To find slow database operations, use TraceQL:
{db.duration_ms > 100}
This returns all spans where a database operation took longer than 100ms, helping you identify slow queries across all modules.
K8s API calls appear as HTTP PUT, HTTP DELETE, and HTTP GET spans within a trace. Each HTTP span includes url.full (e.g., https://10.96.0.1:443/api/v1/nodes/<node>/status), http.response.status_code, and http.request.method, so you can identify exactly which API call was slow and which resource it targeted. To find slow calls across all traces, use TraceQL:
{name =~ "HTTP.*" && duration > 500ms}
This returns all spans where an HTTP call took longer than 500ms, helping you identify slow API calls across all modules.
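The two slow-span filters described above (database operations over 100 ms, HTTP calls over 500 ms) can also be approximated offline if you have span data as plain records. The sketch below is illustrative only: the record shape (`name`, `duration_ms`), the sample values, and the `HealthEvents\..*` name pattern are assumptions, not the format Tempo or Jaeger returns.

```python
import re


def slow_spans(spans, name_pattern, threshold_ms):
    """Return spans whose name matches the pattern and whose duration exceeds the threshold."""
    regex = re.compile(name_pattern)
    return [
        span for span in spans
        if regex.fullmatch(span["name"]) and span["duration_ms"] > threshold_ms
    ]


# Hypothetical span records for illustration.
sample_spans = [
    {"name": "HTTP PUT", "duration_ms": 740},
    {"name": "HTTP GET", "duration_ms": 35},
    {"name": "HealthEvents.update", "duration_ms": 180},
    {"name": "HealthEvents.find", "duration_ms": 12},
]

slow_http = slow_spans(sample_spans, r"HTTP.*", 500)
slow_db = slow_spans(sample_spans, r"HealthEvents\..*", 100)
print([s["name"] for s in slow_http])
print([s["name"] for s in slow_db])
```

In day-to-day use the TraceQL queries are the better tool, since they run against the full trace store; a local filter like this is mainly useful for scripting reports over exported data.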
Q: Tracing is enabled but I don’t see any traces in my backend
- Check that global.tracing.endpoint is set to the correct OTLP gRPC address of your collector.

Q: Log lines don’t have trace_id and span_id fields
- Check that global.tracing.enabled is set to true in your Helm values and that OTEL_EXPORTER_OTLP_ENDPOINT is present in the module’s environment. You can also confirm tracing is active by checking the module’s container logs for the startup message "OpenTelemetry tracing initialized"; if it is present, tracing was successfully initialized for that module.
- trace_id and span_id are only injected into log lines emitted during an active span context. Logs outside of event processing won’t have these fields.

Q: Can I use tracing without Grafana Tempo?
Q: I don’t see any database spans in my traces
Q: What is the performance impact of enabling tracing?