Trace Incident Runbook

Use this runbook when a NeMo Relay application has missing traces, partial traces, incorrect scope parentage, exporter failures, duplicate events, or sensitive data in telemetry. It assumes that the application already has baseline scope and call instrumentation in place.

For first-time setup problems, start with the Troubleshooting Guide. For conceptual grounding, refer to Agent Runtime Primer, Scopes, Events, and Subscribers.

Protect Sensitive Data First

Do not collect raw prompts, model responses, authorization headers, tokens, customer records, tool arguments, or provider payloads while triaging an incident. Capture the smallest sanitized event sample that proves the failure.

Before exporting incident artifacts outside the current trust boundary, verify that sanitize guardrails or exporter filters remove sensitive fields. Sanitize guardrails change emitted telemetry payloads only; they do not change the live request or response passed to the tool, model provider, or application. Refer to Middleware and Add Middleware for the guardrail boundary.
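The split between the live payload and the recorded telemetry can be sketched as follows. This is a minimal illustration and not the NeMo Relay guardrail API: the function names, the key list, the redaction marker, and the payload shape are all assumptions made for the example.

```python
import copy

# Hypothetical key list; a real deployment would define its own.
SENSITIVE_KEYS = {"authorization", "token", "prompt", "response", "arguments"}
REDACTED = "[REDACTED]"  # hypothetical marker


def _redact_in_place(node):
    """Walk nested dicts and lists, replacing values under sensitive keys."""
    if isinstance(node, dict):
        for key in node:
            if str(key).lower() in SENSITIVE_KEYS:
                node[key] = REDACTED
            else:
                _redact_in_place(node[key])
    elif isinstance(node, list):
        for item in node:
            _redact_in_place(item)


def sanitize_for_telemetry(payload):
    """Return a redacted deep copy; the live payload is never modified."""
    recorded = copy.deepcopy(payload)
    _redact_in_place(recorded)
    return recorded


live_request = {
    "headers": {"Authorization": "Bearer abc123"},
    "body": {"model": "example-model", "prompt": "hello"},
}
recorded = sanitize_for_telemetry(live_request)
print(recorded)
```

Because the function works on a deep copy, the request that reaches the tool or provider keeps its original values while the recorded copy carries only the redaction marker.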

Triage By Symptom

Use this table to choose the first check for the symptom you see.

Symptom	Likely Area	Start With
No traces	Missing instrumentation boundary or inactive exporter	Confirm Instrumentation Boundary
Partial traces	Unwrapped calls, dropped streams, or late subscriber registration	Confirm Managed Calls
Wrong parent or child scope	Scope propagation or shared scope stack issue	Confirm Active Scope
Events appear in process but export fails elsewhere	Exporter config, endpoint, environment, or flush path	Confirm Exporter Setup
Duplicate events	Duplicate subscribers, duplicate wrappers, or mixed manual and managed lifecycle calls	Check For Duplicate Event Sources
Sensitive data appears in telemetry	Missing sanitize guardrails before subscribers or exporters	Confirm Sanitization Before Export

Run The Ordered Checks

Run these checks in order before changing exporter or application code.

1. Confirm the instrumentation boundary.
2. Confirm the active scope and root scope ownership.
3. Confirm managed tool and LLM calls.
4. Confirm subscriber or exporter registration timing.
5. Confirm exporter endpoint, environment, and flush behavior.
6. Confirm sanitization before export.

Confirm Instrumentation Boundary

Start with the code path that owns the real work.

If application code calls the tool or model provider directly, verify that the call path follows the Instrument Applications guidance.
If a framework owns scheduling, retries, callbacks, or provider payloads, verify that the integration follows the Integrate into Frameworks guidance.
If a plugin installs runtime behavior, verify that the plugin is activated before the request path starts.

Do not debug an exporter first if no in-process subscriber sees events. Add or enable a sanitized in-process subscriber at the same boundary and confirm that scope, tool, or LLM events exist before investigating external export.
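A sanitized in-process check of this kind might look like the sketch below. It assumes the runtime can deliver each event to a callback as a dictionary with category and name keys. The class is hypothetical, not a NeMo Relay API, and it records only event shape, never payloads.

```python
from collections import Counter
from typing import Any, Mapping


class CountingSubscriber:
    """Hypothetical in-process subscriber that records only event shape.

    Assumption: events arrive as mappings with "category" and "name" keys.
    Payload fields are ignored so nothing sensitive is retained.
    """

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def __call__(self, event: Mapping[str, Any]) -> None:
        key = (event.get("category", "unknown"), event.get("name", "unknown"))
        self.counts[key] += 1

    def saw_any(self, *categories: str) -> bool:
        """Return True if at least one event in the given categories was seen."""
        return any(category in categories for category, _ in self.counts)


# Stand-in events; in a real check the runtime would invoke the subscriber.
subscriber = CountingSubscriber()
subscriber({"category": "scope", "name": "request", "input": "<ignored>"})
subscriber({"category": "tool", "name": "search", "input": "<ignored>"})

print(subscriber.saw_any("scope", "tool", "llm"))
print(dict(subscriber.counts))
```

If a counter like this stays empty for a request that did real work, the gap is at the instrumentation boundary and not in the exporter.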

Confirm Active Scope

Trace gaps and wrong parent-child relationships usually start with scope ownership. Verify these conditions:

Each request, agent run, or workflow starts under the intended top-level scope.
Detached tasks, worker threads, callbacks, and async jobs receive the intended scope stack when they should remain part of the same logical run.
Independent requests receive fresh isolated scope stacks.
Scope-local middleware and subscribers are registered on the owning scope or an ancestor scope.
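The difference between carrying a scope stack into a worker and starting from an isolated one can be illustrated with the Python standard library contextvars module. This sketch assumes nothing about how NeMo Relay stores its scope stack: the scope_stack variable and the helper names are hypothetical stand-ins.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the runtime's active scope stack.
scope_stack = contextvars.ContextVar("scope_stack", default=())


def current_scopes():
    """Return the scope stack visible to the calling code."""
    return scope_stack.get()


def run_on_worker(fn, same_logical_run):
    """Run fn on a worker thread with an explicit context choice."""
    if same_logical_run:
        # Same logical run: carry a copy of the caller's context.
        ctx = contextvars.copy_context()
    else:
        # Independent work: start from a fresh, empty context.
        ctx = contextvars.Context()
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(ctx.run, fn).result()


scope_stack.set(("request-1", "agent-run"))
inherited = run_on_worker(current_scopes, same_logical_run=True)
isolated = run_on_worker(current_scopes, same_logical_run=False)
print(inherited)
print(isolated)
```

Making the choice explicit at each handoff point avoids relying on whatever a thread pool, callback, or job queue happens to do by default.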

Use Adding Scopes and Marks, together with Scopes, to check that the uuid and parent_uuid values on emitted events trace back to the intended root scope.
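Given a sanitized event sample, the parentage comparison can be automated along these lines. The sketch assumes each event is a dictionary carrying uuid and parent_uuid and that the intended root uuid is known. The function name and the sample data are hypothetical.

```python
from typing import Any, Iterable, Mapping


def find_parentage_problems(events: Iterable[Mapping[str, Any]], root_uuid: str) -> dict:
    """Report events whose parent is missing and roots other than the intended one."""
    events = list(events)
    known = {event["uuid"] for event in events}

    # A parent_uuid that points at no captured event suggests a propagation gap.
    orphans = sorted(
        {
            event["uuid"]
            for event in events
            if event.get("parent_uuid") and event["parent_uuid"] not in known
        }
    )
    # An event with no parent that is not the intended root started its own tree.
    extra_roots = sorted(
        {
            event["uuid"]
            for event in events
            if not event.get("parent_uuid") and event["uuid"] != root_uuid
        }
    )
    return {"orphans": orphans, "extra_roots": extra_roots}


sample = [
    {"uuid": "a", "parent_uuid": None, "name": "request"},
    {"uuid": "b", "parent_uuid": "a", "name": "search"},
    {"uuid": "c", "parent_uuid": "zz", "name": "chat"},
    {"uuid": "d", "parent_uuid": None, "name": "background-job"},
]
report = find_parentage_problems(sample, root_uuid="a")
print(report)
```

An orphan can also mean the parent event simply was not captured in the sample, so widen the sample before concluding that propagation is broken.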

Confirm Managed Calls

Partial traces often mean some work bypasses the runtime helpers. Check these areas:

Tool calls that should emit tool start and end events use the managed tool call path.
Model calls that should emit LLM start and end events use the managed LLM call path or an integration wrapper that emits equivalent lifecycle events.
Manual lifecycle calls emit matched start and end events with the same lifecycle UUID.
Streaming LLM responses are drained until completion so final events, collectors, and subscribers can observe the completed output.
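The streaming condition is easy to miss, so here is a minimal sketch of why an abandoned stream yields a partial trace. The wrapper below is hypothetical and only mimics a managed stream that emits its end event after the last chunk. It is not the runtime's actual implementation.

```python
def managed_stream(chunks, emitted):
    """Hypothetical wrapper: emit start, yield chunks, emit end once exhausted."""
    emitted.append("llm_start")
    for chunk in chunks:
        yield chunk
    # Reached only when the consumer reads the stream to completion.
    emitted.append("llm_end")


# Caller stops reading after the first chunk: no end event is recorded.
abandoned_events = []
stream = managed_stream(["Hel", "lo"], abandoned_events)
first_chunk = next(stream)

# Caller drains the stream: both lifecycle events are recorded.
drained_events = []
text = "".join(managed_stream(["Hel", "lo"], drained_events))

print(abandoned_events)
print(drained_events)
```

Early returns, timeouts, and client disconnects are common places where a stream is abandoned in this way.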

Refer to Instrument a Tool Call, Instrument an LLM Call, Wrap Tool Calls, and Wrap LLM Calls.
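To spot unmatched lifecycle events in a sanitized sample, a small check such as the following can help. It assumes each event carries the lifecycle uuid plus a phase field with the values start and end. The phase field name and its values are assumptions made for this sketch, not documented event fields.

```python
from collections import defaultdict


def unmatched_lifecycles(events):
    """Return lifecycle UUIDs that do not have exactly one start and one end."""
    phases = defaultdict(list)
    for event in events:
        phases[event["uuid"]].append(event["phase"])
    return {
        uuid: seen
        for uuid, seen in phases.items()
        if sorted(seen) != ["end", "start"]
    }


sample = [
    {"uuid": "t1", "phase": "start"},
    {"uuid": "t1", "phase": "end"},
    {"uuid": "l1", "phase": "start"},  # never ended, for example an undrained stream
    {"uuid": "x9", "phase": "end"},    # end emitted with a different UUID
]
problems = unmatched_lifecycles(sample)
print(problems)
```

A start without an end and an end without a start that appear close together often indicate a manual lifecycle call that generated a new UUID instead of reusing the original one.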

Confirm Subscriber And Exporter Registration

Events are not buffered for subscribers that register after the event has already been emitted. Verify these conditions:

Plugin-managed observability components are loaded before the request path.
Manual subscribers are registered before the scope, tool, or LLM events they need to observe.
Scope-local subscribers are registered on a scope that is active for the work they should observe.
Exporter filters match the intended root scope or event category.
Shutdown, teardown, or request completion calls flush owned exporters before the process exits or the container stops.

Use Observability, Observability Configuration, and Subscribers to verify the registration lifecycle.
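The effect of late registration can be reproduced with a tiny stand-in event bus. The class below is hypothetical and exists only to show the no-buffering behavior described above. It does not represent the runtime's subscriber API.

```python
class TinyBus:
    """Hypothetical event bus that delivers to current subscribers only."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def emit(self, event):
        # No buffering: an event is seen only by subscribers registered now.
        for callback in list(self._subscribers):
            callback(event)


bus = TinyBus()
early, late = [], []

bus.subscribe(early.append)
bus.emit("scope_start")

bus.subscribe(late.append)  # registered after the first event was emitted
bus.emit("tool_start")

print(early)
print(late)
```

A trace that consistently lacks its first events while later events arrive is a typical sign of this ordering problem.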

Confirm Exporter Setup

If in-process event inspection works but export fails elsewhere, isolate exporter transport and configuration from runtime instrumentation.

For file or trajectory export, confirm these settings:

Output paths are writable by the running process.
The application shuts down or clears the exporter through a code path that flushes partial output.
ATIF export is scoped to the intended agent root and does not mix concurrent root scopes.
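The writable-path check is best run as the same user and in the same container as the application. A minimal sketch using only the Python standard library follows. The function name and the example file name are hypothetical.

```python
import tempfile
from pathlib import Path


def output_dir_is_writable(output_path: str) -> bool:
    """Return True if a file can be created next to the given output path."""
    directory = Path(output_path).resolve().parent
    if not directory.is_dir():
        return False
    try:
        # Creating and removing a temporary file tests real write access.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


candidate = str(Path(tempfile.gettempdir()) / "trace-example.jsonl")
print(output_dir_is_writable(candidate))
```

Creating a real temporary file catches read-only mounts and quota limits that a permission-bit check alone can miss.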

For ATOF streaming endpoints, confirm these settings:

Endpoint URLs, transports, headers, and timeouts match the downstream collector.
nemo-relay doctor can send the synthetic nemo_relay.doctor.atof_probe mark event to each configured endpoint.
A failed streaming endpoint is isolated from file output and from other configured endpoints.

For OpenTelemetry or OpenInference export, confirm these settings:

The OpenTelemetry Protocol (OTLP) endpoint, headers, credentials, and network egress are available in the target environment.
The exporter is enabled in the active configuration file or plugin document.
The backend receives spans with nemo_relay.uuid and nemo_relay.parent_uuid attributes.
The application flushes and shuts down the subscriber during graceful termination.
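One way to make the flush-on-termination path explicit in a Python service is sketched below. The FakeExporter class is a hypothetical stand-in with a flush method. A real exporter or subscriber would expose its own shutdown call, which this sketch does not attempt to name.

```python
import atexit
import signal
import sys
import threading


class FakeExporter:
    """Hypothetical exporter stand-in that buffers events until flushed."""

    def __init__(self):
        self.buffer = []
        self.flushed = []

    def export(self, event):
        self.buffer.append(event)

    def flush(self):
        self.flushed.extend(self.buffer)
        self.buffer.clear()


def install_flush_on_exit(exporter):
    """Flush on normal interpreter exit and on SIGTERM."""
    atexit.register(exporter.flush)

    def _on_sigterm(signum, frame):
        # Raising SystemExit lets atexit handlers run; the default SIGTERM
        # action would end the process without calling them.
        sys.exit(128 + signum)

    # Signal handlers can only be installed from the main thread.
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGTERM, _on_sigterm)


exporter = FakeExporter()
install_flush_on_exit(exporter)
exporter.export({"category": "scope", "name": "request"})
```

Container orchestrators usually send SIGTERM and then wait a grace period before killing the process, so the flush must complete within that window.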

Refer to Agent Trajectory Observability Format (ATOF), Agent Trajectory Interchange Format (ATIF), OpenTelemetry, and OpenInference.

Check For Duplicate Event Sources

Duplicate events usually mean the same boundary is instrumented more than once. Check these areas:

The application does not wrap a call that a framework integration already wraps.
Manual lifecycle calls are not emitted around the same call that already uses managed tool or LLM helpers.
Plugin-managed exporters and manually registered exporters are not both active for the same output path or backend.
Retries issued by the framework or application are recognized as separate attempts and are not mistaken for duplicate telemetry of a single real call.

If duplicate events are expected because a retry or fallback actually executed more than once, preserve the events and add stable names or metadata that let the downstream backend distinguish attempts.
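Two kinds of duplication look similar in a backend but have different causes, and a sanitized sample can separate them. The sketch below assumes events are dictionaries with uuid, parent_uuid, and name, plus a phase field whose name and values are hypothetical.

```python
from collections import Counter


def duplicate_report(events):
    """Separate repeated deliveries of one event from repeated calls."""
    events = list(events)

    # The same (uuid, phase) seen more than once: one event delivered twice,
    # which points at duplicate subscribers or exporters.
    deliveries = Counter((e["uuid"], e["phase"]) for e in events)
    same_event_twice = [key for key, count in deliveries.items() if count > 1]

    # Distinct UUIDs with the same parent and name: the call itself ran, or
    # was wrapped, more than once (duplicate wrappers or a genuine retry).
    starts = Counter()
    seen = set()
    for e in events:
        if e["phase"] == "start" and e["uuid"] not in seen:
            seen.add(e["uuid"])
            starts[(e["parent_uuid"], e["name"])] += 1
    repeated_calls = [key for key, count in starts.items() if count > 1]

    return {"same_event_twice": same_event_twice, "repeated_calls": repeated_calls}


sample = [
    {"uuid": "t1", "parent_uuid": "r", "name": "search", "phase": "start"},
    {"uuid": "t1", "parent_uuid": "r", "name": "search", "phase": "start"},
    {"uuid": "t1", "parent_uuid": "r", "name": "search", "phase": "end"},
    {"uuid": "t2", "parent_uuid": "r", "name": "search", "phase": "start"},
    {"uuid": "t2", "parent_uuid": "r", "name": "search", "phase": "end"},
]
report = duplicate_report(sample)
print(report)
```

Repeated calls under the same parent are not necessarily a fault. Compare them against the retry and fallback behavior of the application before removing any instrumentation.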

Confirm Sanitization Before Export

Sensitive data in telemetry is an incident. Use this order:

1. Stop or disable the affected exporter if sensitive data is leaving the intended trust boundary.
2. Keep the application path stable unless the live request itself is unsafe.
3. Add or fix sanitize-request and sanitize-response guardrails before subscribers and exporters receive events.
4. Validate the sanitized event with ATOF JSONL or an in-process subscriber before re-enabling external export.
5. Re-enable one exporter at a time and confirm the downstream backend no longer receives sensitive fields.
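The validation step can be partly automated by scanning a local JSONL capture for fields that should have been redacted. This sketch assumes one JSON object per line. The forbidden key list and the redaction marker are hypothetical and must be adapted to the guardrail configuration in use.

```python
import io
import json

FORBIDDEN_KEYS = {"authorization", "token", "prompt", "response"}  # hypothetical
REDACTED = "[REDACTED]"  # hypothetical marker


def leaked_paths(node, path=""):
    """Yield dotted paths of forbidden keys whose values are not redacted."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else str(key)
            if str(key).lower() in FORBIDDEN_KEYS and value != REDACTED:
                yield here
            else:
                yield from leaked_paths(value, here)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from leaked_paths(item, f"{path}[{index}]")


def scan_jsonl(lines):
    """Return (line_number, path) pairs for every unredacted forbidden field."""
    findings = []
    for number, line in enumerate(lines, start=1):
        if line.strip():
            for found in leaked_paths(json.loads(line)):
                findings.append((number, found))
    return findings


sample = io.StringIO(
    '{"name": "search", "data": {"token": "[REDACTED]"}}\n'
    '{"name": "chat", "data": {"prompt": "raw user text"}}\n'
)
findings = scan_jsonl(sample)
print(findings)
```

The scan reports only line numbers and key paths, so its output can be shared without repeating the sensitive values. A key-based scan does not detect sensitive text stored under unexpected keys, so treat an empty result as necessary but not sufficient.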

Use a request intercept only when the real request to the tool or provider must change. Use a sanitize guardrail when only the recorded telemetry should change.

Escalation Capture Checklist

Collect this information before escalating an incident:

NeMo Relay version and binding package version.
Language binding and runtime version.
Whether instrumentation is direct application code, a framework integration, or plugin-managed behavior.
Exporter type, configuration source, and activation path.
Sanitized event sample that shows uuid, parent_uuid, category, scope_category, name, and redacted metadata.
Runtime shape, such as single process, worker pool, async tasks, sidecar, job queue, or container orchestration.
Reproduction scope, including whether the failure occurs for one request, one tenant, one service, or all requests.
Recent changes to instrumentation, plugin configuration, exporter endpoints, runtime environment, or tracing backend configuration.

Do not attach raw prompts, model responses, credentials, customer records, authorization headers, or unredacted tool arguments to escalation artifacts.
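A sanitized event sample for escalation can be produced with an allowlist so that unexpected fields never reach the artifact. The sketch below keeps only the fields named in the checklist and replaces every metadata value with a marker. The function name, the marker, and the input event shape are assumptions.

```python
# Fields named in the escalation checklist above.
ALLOWED_FIELDS = ("uuid", "parent_uuid", "category", "scope_category", "name")
REDACTED = "[REDACTED]"  # hypothetical marker


def escalation_sample(event):
    """Project an event onto allowlisted fields and redact metadata values."""
    sample = {field: event.get(field) for field in ALLOWED_FIELDS}
    metadata = event.get("metadata") or {}
    # Keep metadata keys for shape, never the values.
    sample["metadata"] = {key: REDACTED for key in metadata}
    return sample


raw_event = {
    "uuid": "b",
    "parent_uuid": "a",
    "category": "tool",
    "scope_category": "agent",
    "name": "search",
    "input": {"query": "customer lookup"},
    "metadata": {"tenant": "example-tenant"},
}
safe = escalation_sample(raw_event)
print(safe)
```

Metadata keys are preserved as written, so review them as well if key names themselves could identify a customer or tenant.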