Use this runbook when a NeMo Relay application has missing traces, partial
traces, incorrect scope parentage, exporter failures, duplicate events, or
sensitive data in telemetry. It assumes that the application already has a
baseline scope and call instrumentation path.
For first-time setup problems, start with the
Troubleshooting Guide. For conceptual grounding,
refer to Agent Runtime Primer,
Scopes, Events,
and Subscribers.
Protect Sensitive Data First
Do not collect raw prompts, model responses, authorization headers, tokens,
customer records, tool arguments, or provider payloads while triaging an
incident. Capture the smallest sanitized event sample that proves the failure.
Before exporting incident artifacts outside the current trust boundary, verify
that sanitize guardrails or exporter filters remove sensitive fields. Sanitize
guardrails change emitted telemetry payloads only; they do not change the live
request or response passed to the tool, model provider, or application. Refer to
Middleware and
Add Middleware for the
guardrail boundary.
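The telemetry-only boundary can be pictured with a small sketch. This is not the NeMo Relay guardrail API: `sanitize_for_telemetry`, `SENSITIVE_KEYS`, and the payload shape are hypothetical names chosen for illustration. The point it shows is that the sanitizer returns a redacted copy, so the live request is left untouched.

```python
import copy

# Hypothetical denylist; a real deployment defines its own sensitive fields.
SENSITIVE_KEYS = {"prompt", "authorization", "token", "tool_arguments"}

def sanitize_for_telemetry(payload: dict) -> dict:
    """Return a redacted deep copy; the input payload is never mutated."""
    def redact(value):
        if isinstance(value, dict):
            return {
                key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(item)
                for key, item in value.items()
            }
        if isinstance(value, list):
            return [redact(item) for item in value]
        return value
    return redact(copy.deepcopy(payload))

# The live request keeps its real values; only the telemetry copy is redacted.
live_request = {
    "model": "example-model",
    "prompt": "secret text",
    "headers": {"Authorization": "Bearer abc"},
}
telemetry_payload = sanitize_for_telemetry(live_request)
```

After this runs, `telemetry_payload` carries placeholders while `live_request` still holds the original prompt and header.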
Triage By Symptom
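Match the symptom you see to the first check to run:
- Missing traces: Confirm Instrumentation Boundary, then Confirm Subscriber And Exporter Registration.
- Partial traces: Confirm Managed Calls.
- Incorrect scope parentage: Confirm Active Scope.
- Exporter failures: Confirm Exporter Setup.
- Duplicate events: Check For Duplicate Event Sources.
- Sensitive data in telemetry: Confirm Sanitization Before Export.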
Run The Ordered Checks
Run these checks in order before changing exporter or application code.
1. Confirm the instrumentation boundary.
2. Confirm the active scope and root scope ownership.
3. Confirm managed tool and LLM calls.
4. Confirm subscriber or exporter registration timing.
5. Confirm exporter endpoint, environment, and flush behavior.
6. Confirm sanitization before export.
Confirm Instrumentation Boundary
Start with the code path that owns the real work.
- If application code calls the tool or model provider directly, verify that the
call path follows the Instrument Applications
guidance.
- If a framework owns scheduling, retries, callbacks, or provider payloads,
verify that the integration follows the
Integrate into Frameworks guidance.
- If a plugin installs runtime behavior, verify that the plugin is activated
before the request path starts.
Do not debug an exporter first if no in-process subscriber sees events. Add or
enable a sanitized in-process subscriber at the same boundary and confirm that
scope, tool, or LLM events exist before investigating external export.
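A sanitized in-process check can be as small as a counter. The sketch below is an assumption-heavy illustration, not the NeMo Relay subscriber interface: the `CountingSubscriber` class, the dictionary event shape, and the category values `"scope"`, `"tool"`, and `"llm"` are hypothetical. It records only the category and name of each event and never stores payload contents.

```python
from collections import Counter

class CountingSubscriber:
    """Hypothetical in-process subscriber that keeps only categories and names."""

    def __init__(self):
        self.seen = Counter()

    def __call__(self, event: dict) -> None:
        # Record identifying fields only; payload contents are never stored.
        self.seen[(event.get("category"), event.get("name"))] += 1

    def saw_any(self, *categories: str) -> bool:
        """Report whether at least one event of the given categories arrived."""
        return any(category in categories for category, _ in self.seen)

# Feed a few hand-written events to show the bookkeeping.
subscriber = CountingSubscriber()
for event in [
    {"category": "scope", "name": "agent_run"},
    {"category": "tool", "name": "search"},
]:
    subscriber(event)
```

If `saw_any` returns `False` for every category at the boundary under test, the problem is upstream of any exporter.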
Confirm Active Scope
Trace gaps and wrong parent-child relationships usually start with scope
ownership. Verify these conditions:
- Each request, agent run, or workflow starts under the intended top-level scope.
- Detached tasks, worker threads, callbacks, and async jobs receive the intended
scope stack when they should remain part of the same logical run.
- Independent requests receive fresh isolated scope stacks.
- Scope-local middleware and subscribers are registered on the owning scope or
an ancestor scope.
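The worker-thread condition has a close analogue in Python's standard `contextvars` module, which is used here only as an illustration. NeMo Relay's own scope stack propagation may work differently, and `current_scope` is a hypothetical stand-in for the active scope. In standard CPython builds a new thread does not see context values set by its creator unless the caller's context is copied and run explicitly.

```python
import contextvars
import threading

# Stand-in for a runtime's active scope; not a NeMo Relay API.
current_scope = contextvars.ContextVar("current_scope", default=None)

def read_scope(out: list) -> None:
    """Append whatever scope is visible in the calling context."""
    out.append(current_scope.get())

current_scope.set("request-root")

# A plain thread: in standard CPython builds it starts with a fresh context.
bare = []
worker = threading.Thread(target=read_scope, args=(bare,))
worker.start()
worker.join()

# Copy the caller's context and run the worker inside it to carry the scope over.
carried = []
ctx = contextvars.copy_context()
worker = threading.Thread(target=ctx.run, args=(read_scope, carried))
worker.start()
worker.join()
```

The second worker observes `"request-root"` because it runs inside the copied context. The same question applies to any detached task: what hands the scope to the worker?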
Use Adding Scopes and Marks
and Scopes to compare the intended root scope
with the emitted event uuid and parent_uuid values.
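The comparison of `uuid` and `parent_uuid` values can be automated over a sanitized event sample. The sketch below assumes events are dictionaries and that a root event has `parent_uuid` set to `None`. Both are assumptions about the sample format, and `check_parentage` is a hypothetical helper.

```python
def check_parentage(events: list[dict]) -> dict:
    """Classify events by whether their parent_uuid resolves within the sample."""
    known = {event["uuid"] for event in events}
    roots = [event["uuid"] for event in events if event.get("parent_uuid") is None]
    orphans = [
        event["uuid"]
        for event in events
        if event.get("parent_uuid") is not None and event["parent_uuid"] not in known
    ]
    return {"roots": roots, "orphans": orphans}

sample = [
    {"uuid": "a", "parent_uuid": None, "name": "agent_run"},
    {"uuid": "b", "parent_uuid": "a", "name": "tool_call"},
    {"uuid": "c", "parent_uuid": "zz", "name": "llm_call"},  # parent missing from sample
]
report = check_parentage(sample)
```

More than one root for a single logical run, or any orphan, points to work that started under a different scope stack than intended.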
Confirm Managed Calls
Partial traces often mean some work bypasses the runtime helpers. Check these
areas:
- Tool calls that should emit tool start and end events use the managed tool
call path.
- Model calls that should emit LLM start and end events use the managed LLM call
path or an integration wrapper that emits equivalent lifecycle events.
- Manual lifecycle calls emit matched start and end events with the same
lifecycle UUID.
- Streaming LLM responses are drained until completion so final events,
collectors, and subscribers can observe the completed output.
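The streaming condition is easy to miss, so here is a minimal sketch of why draining matters. It uses a plain Python generator and a hypothetical `stream_with_end_event` wrapper, not a NeMo Relay helper. The end event is recorded only after the consumer has read the last chunk.

```python
def stream_with_end_event(chunks, events):
    """Hypothetical wrapper: records an end event only after the last chunk."""
    events.append("llm_start")
    for chunk in chunks:
        yield chunk
    events.append("llm_end")  # reached only when the consumer drains the stream

# A consumer that reads one chunk and stops never triggers the end event.
abandoned = []
stream = stream_with_end_event(["Hel", "lo"], abandoned)
first_chunk = next(stream)

# A consumer that reads to completion does.
drained = []
text = "".join(stream_with_end_event(["Hel", "lo"], drained))
```

Here `abandoned` holds only the start marker, while `drained` holds both markers.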
Refer to Instrument a Tool Call,
Instrument an LLM Call,
Wrap Tool Calls, and
Wrap LLM Calls.
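Matched start and end events can be verified mechanically on a sanitized sample. The sketch below assumes each event carries its lifecycle UUID in `uuid` and a hypothetical `phase` field with the values `"start"` or `"end"`. Real event samples may encode the phase differently.

```python
def unmatched_lifecycles(events: list[dict]) -> dict:
    """Find starts that never ended and ends that had no start."""
    open_calls = {}
    stray_ends = []
    for event in events:
        if event["phase"] == "start":
            open_calls[event["uuid"]] = event["name"]
        elif event["phase"] == "end":
            if event["uuid"] in open_calls:
                del open_calls[event["uuid"]]
            else:
                stray_ends.append(event["uuid"])
    return {"never_ended": sorted(open_calls), "stray_ends": stray_ends}

events = [
    {"uuid": "t1", "phase": "start", "name": "search"},
    {"uuid": "t1", "phase": "end", "name": "search"},
    {"uuid": "m1", "phase": "start", "name": "chat"},   # never ended
    {"uuid": "x9", "phase": "end", "name": "lookup"},   # end without a start
]
result = unmatched_lifecycles(events)
```

A start that never ends often indicates an undrained stream or an exception path that skips the end call. A stray end usually means the start was emitted with a different UUID.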
Confirm Subscriber And Exporter Registration
Events are not buffered: a subscriber that registers after an event has been
emitted never receives that event. Verify these conditions:
- Plugin-managed observability components are loaded before the request path.
- Manual subscribers are registered before the scope, tool, or LLM events they
need to observe.
- Scope-local subscribers are registered on a scope that is active for the work
they should observe.
- Exporter filters match the intended root scope or event category.
- Shutdown, teardown, or request completion calls flush owned exporters before
the process exits or the container stops.
Use Observability,
Observability Configuration, and
Subscribers to verify the registration
lifecycle.
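The registration-timing rule can be reproduced with a toy emitter that has no replay buffer. The `Emitter` class is a hypothetical model of the behavior described above, not the NeMo Relay implementation.

```python
class Emitter:
    """Toy event source with no replay buffer."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback) -> None:
        self.subscribers.append(callback)

    def emit(self, event) -> None:
        # Deliver only to subscribers registered at the moment of emission.
        for callback in list(self.subscribers):
            callback(event)

emitter = Emitter()
early, late = [], []

emitter.subscribe(early.append)
emitter.emit("scope_start")
emitter.subscribe(late.append)  # registered after the first event
emitter.emit("tool_start")
```

The late subscriber sees `tool_start` but not `scope_start`. A trace whose opening events are missing while later events arrive fits this pattern.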
Confirm Exporter Setup
If in-process event inspection works but exported data is missing or incomplete,
isolate exporter transport and configuration from runtime instrumentation.
For file or trajectory export, confirm these settings:
- Output paths are writable by the running process.
- The application shuts down or clears the exporter in a path that flushes
partial output.
- ATIF export is scoped to the intended agent root and does not mix concurrent
root scopes.
For OpenTelemetry or OpenInference export, confirm these settings:
- The OpenTelemetry Protocol (OTLP) endpoint, headers, credentials, and network
egress are available in the target environment.
- The exporter is enabled in the active configuration file or plugin document.
- The backend receives spans with
nemo_relay.uuid and
nemo_relay.parent_uuid attributes.
- The application flushes and shuts down the subscriber during graceful
termination.
Refer to Agent Trajectory Observability Format (ATOF),
Agent Trajectory Interchange Format (ATIF),
OpenTelemetry, and
OpenInference.
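Two of the checks above, a writable output path and a reachable endpoint, can be tested from the target environment with the Python standard library alone. The helper names are hypothetical, and a successful TCP connection proves only network reachability. It does not validate headers, credentials, or the OTLP protocol itself.

```python
import os
import socket
import tempfile
from urllib.parse import urlparse

def output_dir_is_writable(path: str) -> bool:
    """Check that the directory holding an output file exists and is writable."""
    directory = os.path.dirname(os.path.abspath(path))
    return os.path.isdir(directory) and os.access(directory, os.W_OK)

def endpoint_host_port(endpoint: str) -> tuple:
    """Split an endpoint URL into host and port, defaulting the port by scheme."""
    parsed = urlparse(endpoint)
    default_port = 443 if parsed.scheme == "https" else 80
    return parsed.hostname, parsed.port or default_port

def can_connect(endpoint: str, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection to the endpoint; reachability only."""
    host, port = endpoint_host_port(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example with a temporary directory and a placeholder collector URL.
with tempfile.TemporaryDirectory() as tmp:
    writable = output_dir_is_writable(os.path.join(tmp, "trace.jsonl"))
target = endpoint_host_port("http://collector.example:4318/v1/traces")
```

Run `can_connect` from inside the same container or host as the application, because egress rules often differ between a developer machine and the deployed environment.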
Check For Duplicate Event Sources
Duplicate events usually mean the same boundary is instrumented more than once.
Check these areas:
- The application does not wrap a call that a framework integration already
wraps.
- Manual lifecycle calls are not emitted around the same call that already uses
managed tool or LLM helpers.
- Plugin-managed exporters and manually registered exporters are not both
active for the same output path or backend.
- Retries issued by the framework or application are real, separate calls and
are not being mistaken for duplicate telemetry of a single call.
If duplicate events are expected because a retry or fallback actually executed
more than once, preserve the events and add stable names or metadata that let
the downstream backend distinguish attempts.
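Double instrumentation and real retries look different in a sanitized sample, and a short script can separate them. The sketch assumes hypothetical `phase` and `attempt` fields and assumes that each real attempt is emitted with its own `uuid`. Check these assumptions against the actual events before relying on the result.

```python
from collections import Counter

def duplicate_keys(events: list[dict]) -> list:
    """Return (uuid, phase) pairs that occur more than once in the sample."""
    counts = Counter((event["uuid"], event["phase"]) for event in events)
    return sorted(key for key, count in counts.items() if count > 1)

events = [
    {"uuid": "t1", "phase": "start", "name": "search", "attempt": 1},
    {"uuid": "t1", "phase": "start", "name": "search", "attempt": 1},  # emitted twice
    {"uuid": "t2", "phase": "start", "name": "search", "attempt": 2},  # real retry
]
duplicates = duplicate_keys(events)
```

The same `uuid` and phase appearing twice suggests that two sources emit events for one call. Distinct UUIDs with the same name suggest separate executions, which should be preserved.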
Confirm Sanitization Before Export
Sensitive data in telemetry is an incident. Use this order:
1. Stop or disable the affected exporter if sensitive data is leaving the
intended trust boundary.
2. Keep the application path stable unless the live request itself is unsafe.
3. Add or fix sanitize-request and sanitize-response guardrails before
subscribers and exporters receive events.
4. Validate the sanitized event with ATOF JSONL or an in-process subscriber
before re-enabling external export.
5. Re-enable one exporter at a time and confirm the downstream backend no
longer receives sensitive fields.
Use a request intercept only when the real request to the tool or provider must
change. Use a sanitize guardrail when only the recorded telemetry should change.
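Validation before re-enabling export can include a scan of the JSONL output for field names that should never carry real values. The sketch treats the file as generic JSON Lines and does not assume anything about the ATOF schema. `SUSPECT_KEYS`, the `[REDACTED]` placeholder, and `find_suspect_fields` are hypothetical. It matches on key names only, so sensitive text embedded in a free-form string still needs manual review.

```python
import json

# Hypothetical list of field names that must not carry real values.
SUSPECT_KEYS = {"authorization", "api_key", "token", "password", "prompt"}

def find_suspect_fields(jsonl_text: str) -> list:
    """Return (line number, field path) pairs for unredacted suspect fields."""
    hits = []

    def walk(value, path, line_no):
        if isinstance(value, dict):
            for key, item in value.items():
                child = f"{path}.{key}" if path else key
                if key.lower() in SUSPECT_KEYS and item != "[REDACTED]":
                    hits.append((line_no, child))
                walk(item, child, line_no)
        elif isinstance(value, list):
            for index, item in enumerate(value):
                walk(item, f"{path}[{index}]", line_no)

    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if line.strip():
            walk(json.loads(line), "", line_no)
    return hits

sample = (
    '{"uuid": "a", "metadata": {"token": "abc"}}\n'
    '{"uuid": "b", "metadata": {"token": "[REDACTED]"}}\n'
)
hits = find_suspect_fields(sample)
```

An empty result is a necessary condition for re-enabling an exporter, not a sufficient one.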
Escalation Capture Checklist
Collect this information before escalating an incident:
- NeMo Relay version and binding package version.
- Language binding and runtime version.
- Whether instrumentation is direct application code, a framework integration,
or plugin-managed behavior.
- Exporter type, configuration source, and activation path.
- Sanitized event sample that shows
uuid, parent_uuid, category,
scope_category, name, and redacted metadata.
- Runtime shape, such as single process, worker pool, async tasks, sidecar, job
queue, or container orchestration.
- Reproduction scope, including whether the failure occurs for one request, one
tenant, one service, or all requests.
- Recent changes to instrumentation, plugin configuration, exporter endpoints,
runtime environment, or tracing backend configuration.
Do not attach raw prompts, model responses, credentials, customer records,
authorization headers, or unredacted tool arguments to escalation artifacts.
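One way to build the sanitized event sample is an allowlist projection: keep only the fields named in the checklist and replace every metadata value with a placeholder. The `escalation_sample` helper and the dictionary event shape are assumptions for illustration.

```python
# Fields the checklist asks for; everything else is dropped.
ALLOWED_FIELDS = ("uuid", "parent_uuid", "category", "scope_category", "name")

def escalation_sample(event: dict) -> dict:
    """Project an event onto the allowlist and redact all metadata values."""
    sample = {field: event.get(field) for field in ALLOWED_FIELDS}
    # Keep metadata keys so the shape is visible, but drop every value.
    sample["metadata"] = {key: "[REDACTED]" for key in event.get("metadata", {})}
    return sample

raw_event = {
    "uuid": "b",
    "parent_uuid": "a",
    "category": "tool",
    "scope_category": "agent",
    "name": "search",
    "input": {"query": "customer record 123"},  # never copied into the sample
    "metadata": {"tenant": "acme", "attempt": 2},
}
safe_event = escalation_sample(raw_event)
```

An allowlist fails closed: a field added to events later stays out of escalation artifacts until someone deliberately adds it to `ALLOWED_FIELDS`.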