# Trace Incident Runbook

Use this runbook when a NeMo Relay application has missing traces, partial
traces, incorrect scope parentage, exporter failures, duplicate events, or
sensitive data in telemetry. It assumes that the application already has a
baseline scope and call instrumentation path.

For first-time setup problems, start with the
[Troubleshooting Guide](/resources/troubleshooting). For conceptual grounding,
refer to [Agent Runtime Primer](/getting-started/agent-runtime-primer),
[Scopes](/about-nemo-relay/concepts/scopes), [Events](/about-nemo-relay/concepts/events),
and [Subscribers](/about-nemo-relay/concepts/subscribers).

## Protect Sensitive Data First

Do not collect raw prompts, model responses, authorization headers, tokens,
customer records, tool arguments, or provider payloads while triaging an
incident. Capture the smallest sanitized event sample that proves the failure.

Before exporting incident artifacts outside the current trust boundary, verify
that sanitize guardrails or exporter filters remove sensitive fields. Sanitize
guardrails change emitted telemetry payloads only; they do not change the live
request or response passed to the tool, model provider, or application. Refer to
[Middleware](/about-nemo-relay/concepts/middleware) and
[Add Middleware](/instrument-applications/advanced-guide) for the
guardrail boundary.
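
As an illustration of capturing a sanitized sample, the following sketch redacts values stored under sensitive keys in an event that has been copied into a plain dictionary. It is not the NeMo Relay sanitize guardrail API. The key names in `SENSITIVE_KEYS` and the shape of the sample event are assumptions chosen for the example.

```python
from typing import Any

# Hypothetical key names; adjust them to the fields your telemetry carries.
SENSITIVE_KEYS = {"prompt", "response", "authorization", "token", "arguments", "payload"}
REDACTED = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of value with anything under a sensitive key replaced."""
    if isinstance(value, dict):
        return {
            key: REDACTED
            if isinstance(key, str) and key.lower() in SENSITIVE_KEYS
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


# Assumed event shape, for illustration only.
event = {
    "uuid": "a1",
    "parent_uuid": None,
    "category": "tool",
    "name": "lookup_order",
    "metadata": {"arguments": {"customer_id": "12345"}, "attempt": 1},
}
sanitized = redact(event)
```

The function builds a new structure instead of mutating its input, so the original event is left unchanged for the live code path.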

## Triage By Symptom

Use this table to choose the first check for the symptom you see.

| Symptom                                             | Likely Area                                                                            | Start With                                                                |
| --------------------------------------------------- | -------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| No traces                                           | Missing instrumentation boundary or inactive exporter                                  | [Confirm Instrumentation Boundary](#confirm-instrumentation-boundary)     |
| Partial traces                                      | Unwrapped calls, dropped streams, or late subscriber registration                      | [Confirm Managed Calls](#confirm-managed-calls)                           |
| Wrong parent or child scope                         | Scope propagation or shared scope stack issue                                          | [Confirm Active Scope](#confirm-active-scope)                             |
| Events appear in process but export fails elsewhere | Exporter config, endpoint, environment, or flush path                                  | [Confirm Exporter Setup](#confirm-exporter-setup)                         |
| Duplicate events                                    | Duplicate subscribers, duplicate wrappers, or mixed manual and managed lifecycle calls | [Check For Duplicate Event Sources](#check-for-duplicate-event-sources)   |
| Sensitive data appears in telemetry                 | Missing sanitize guardrails before subscribers or exporters                            | [Confirm Sanitization Before Export](#confirm-sanitization-before-export) |

## Run The Ordered Checks

Run these checks in order before changing exporter or application code.

1. Confirm the instrumentation boundary.
2. Confirm the active scope and root scope ownership.
3. Confirm managed tool and LLM calls.
4. Confirm subscriber or exporter registration timing.
5. Confirm exporter endpoint, environment, and flush behavior.
6. Confirm sanitization before export.

## Confirm Instrumentation Boundary

Start with the code path that owns the real work.

* If application code calls the tool or model provider directly, verify that the
  call path uses [Instrument Applications](/instrument-applications/about)
  guidance.
* If a framework owns scheduling, retries, callbacks, or provider payloads,
  verify that the integration uses
  [Integrate into Frameworks](/integrate-into-frameworks/about) guidance.
* If a plugin installs runtime behavior, verify that the plugin is activated
  before the request path starts.

If no in-process subscriber sees events, do not start by debugging an exporter.
Add or enable a sanitized in-process subscriber at the same boundary and confirm
that scope, tool, or LLM events exist before investigating external export.
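
The in-process check described above can be as small as a counter that records how many events of each category arrive and stores no payloads. The sketch below assumes that a subscriber can be expressed as a callable that receives a mapping with a `category` key. That callback shape is an assumption, and registration is not shown because it depends on the binding in use.

```python
from collections import Counter
from typing import Any, Mapping


class CategoryCounter:
    """Counts events per category without retaining any payload."""

    def __init__(self) -> None:
        self.counts = Counter()

    def __call__(self, event: Mapping[str, Any]) -> None:
        # Only the category label is read; no other field is kept.
        self.counts[str(event.get("category", "unknown"))] += 1


# Simulated delivery; in a real application the runtime would invoke the callback.
counter = CategoryCounter()
for sample in ({"category": "scope"}, {"category": "tool"}, {"category": "tool"}, {}):
    counter(sample)
```

If every count stays at zero for a request that should be traced, the problem is in instrumentation or registration, not in the exporter.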

## Confirm Active Scope

Trace gaps and wrong parent-child relationships usually start with scope
ownership. Verify these conditions:

* Each request, agent run, or workflow starts under the intended top-level scope.
* Detached tasks, worker threads, callbacks, and async jobs receive the intended
  scope stack when they should remain part of the same logical run.
* Independent requests receive fresh isolated scope stacks.
* Scope-local middleware and subscribers are registered on the owning scope or
  an ancestor scope.

Use [Adding Scopes and Marks](/instrument-applications/adding-scopes-and-marks)
and [Scopes](/about-nemo-relay/concepts/scopes) to compare the intended root scope
with the emitted event `uuid` and `parent_uuid` values.
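
A quick way to compare `uuid` and `parent_uuid` values in a sanitized sample is to list the roots and any events whose parent never appears. This is a minimal sketch that assumes the events have been loaded as dictionaries carrying those two fields; the helper names are hypothetical.

```python
from typing import Any, Iterable, List, Mapping


def find_roots(events: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return uuids of events that have no parent."""
    return sorted({e["uuid"] for e in events if e.get("parent_uuid") is None})


def find_orphans(events: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return uuids whose parent_uuid is set but never appears as a uuid."""
    events = list(events)
    known = {e["uuid"] for e in events}
    return sorted(
        {
            e["uuid"]
            for e in events
            if e.get("parent_uuid") is not None and e["parent_uuid"] not in known
        }
    )


events = [
    {"uuid": "root", "parent_uuid": None},
    {"uuid": "tool-1", "parent_uuid": "root"},
    {"uuid": "llm-1", "parent_uuid": "missing-scope"},
]
roots = find_roots(events)
orphans = find_orphans(events)
```

More than one root in a sample that should cover a single run, or any orphan, points at scope propagation rather than at the exporter. An orphan can also mean the sample was truncated, so confirm the capture window before drawing conclusions.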

## Confirm Managed Calls

Partial traces often mean some work bypasses the runtime helpers. Check these
areas:

* Tool calls that should emit tool start and end events use the managed tool
  call path.
* Model calls that should emit LLM start and end events use the managed LLM call
  path or an integration wrapper that emits equivalent lifecycle events.
* Manual lifecycle calls emit matched start and end events with the same
  lifecycle UUID.
* Streaming LLM responses are drained until completion so final events,
  collectors, and subscribers can observe the completed output.

Refer to [Instrument a Tool Call](/instrument-applications/instrument-tool-call),
[Instrument an LLM Call](/instrument-applications/instrument-llm-call),
[Wrap Tool Calls](/integrate-into-frameworks/wrap-tool-calls), and
[Wrap LLM Calls](/integrate-into-frameworks/wrap-llm-calls).
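
To spot lifecycles that never closed, group a sanitized event sample by UUID and flag any UUID that does not have exactly one start and one end. The sketch assumes each event dictionary has a `uuid` and a `phase` field with the values `start` or `end`. The `phase` field name is hypothetical; substitute whatever your events use to distinguish start from end.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def find_unmatched(events: Iterable[Mapping[str, Any]]) -> Dict[str, List[str]]:
    """Map each uuid to its observed phases when they are not one start and one end."""
    phases = defaultdict(list)
    for event in events:
        phases[event["uuid"]].append(event["phase"])
    return {
        uuid: seen
        for uuid, seen in phases.items()
        if sorted(seen) != ["end", "start"]
    }


events = [
    {"uuid": "t1", "phase": "start"},
    {"uuid": "t1", "phase": "end"},
    {"uuid": "l1", "phase": "start"},  # stream was never drained
]
unmatched = find_unmatched(events)
```

A start without an end on an LLM call is a common signature of a streaming response that was abandoned before completion.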

## Confirm Subscriber And Exporter Registration

Events are not buffered for subscribers that register after the event has
already been emitted. Verify these conditions:

* Plugin-managed observability components are loaded before the request path.
* Manual subscribers are registered before the scope, tool, or LLM events they
  need to observe.
* Scope-local subscribers are registered on a scope that is active for the work
  they should observe.
* Exporter filters match the intended root scope or event category.
* Shutdown, teardown, or request completion calls flush owned exporters before
  the process exits or the container stops.

Use [Observability](/observability-plugin/about),
[Observability Configuration](/observability-plugin/configuration), and
[Subscribers](/about-nemo-relay/concepts/subscribers) to verify the registration
lifecycle.
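
The effect of late registration can be reproduced with a toy event source. `ToyBus` below is an illustrative stand-in written for this example and is not part of NeMo Relay; it only demonstrates that a subscriber added after an event was emitted never sees that event.

```python
class ToyBus:
    """Illustrative event source that delivers immediately and buffers nothing."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def emit(self, event):
        # Deliver only to subscribers that exist at emission time.
        for callback in list(self._subscribers):
            callback(event)


bus = ToyBus()
early, late = [], []

bus.subscribe(early.append)
bus.emit("scope_start")
bus.subscribe(late.append)  # registered after the first event
bus.emit("scope_end")
```

Here `early` holds both events while `late` holds only `scope_end`, which is the same pattern a partial trace shows when a subscriber is attached inside the request path.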

## Confirm Exporter Setup

If in-process event inspection works but export fails elsewhere, isolate
exporter transport and configuration from runtime instrumentation.

For file or trajectory export, confirm these settings:

* Output paths are writable by the running process.
* The application shuts down or clears the exporter through a code path that
  flushes partial output.
* ATIF export is scoped to the intended agent root and does not mix concurrent
  root scopes.

For OpenTelemetry or OpenInference export, confirm these settings:

* The OpenTelemetry Protocol (OTLP) endpoint, headers, credentials, and network
  egress are available in the target environment.
* The exporter is enabled in the active configuration file or plugin document.
* The backend receives spans with `nemo_relay.uuid` and
  `nemo_relay.parent_uuid` attributes.
* The application flushes and shuts down the subscriber during graceful
  termination.

Refer to [Agent Trajectory Observability Format (ATOF)](/observability-plugin/atof),
[Agent Trajectory Interchange Format (ATIF)](/observability-plugin/atif),
[OpenTelemetry](/observability-plugin/opentelemetry), and
[OpenInference](/observability-plugin/openinference).
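
Two of the checks above, writable output paths and network reachability, can be tested from inside the target environment with the Python standard library. This is a rough preflight sketch with hypothetical helper names. A successful TCP connection only shows that the host and port are reachable; it does not validate headers, credentials, or TLS configuration.

```python
import os
import socket
from urllib.parse import urlparse


def output_dir_is_writable(path: str) -> bool:
    """Check that the directory that would hold path exists and is writable."""
    directory = os.path.dirname(os.path.abspath(path))
    return os.path.isdir(directory) and os.access(directory, os.W_OK)


def endpoint_is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Attempt a plain TCP connection to the host and port named in url."""
    try:
        parsed = urlparse(url)
        if parsed.hostname is None:
            return False
        # Fall back to the scheme's default port when none is given.
        port = parsed.port or (443 if parsed.scheme == "https" else 80)
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except (OSError, ValueError):
        return False
```

Run both helpers as the same user and in the same container or network namespace as the application, because permissions and egress rules often differ from those on a developer machine.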

## Check For Duplicate Event Sources

Duplicate events usually mean the same boundary is instrumented more than once.
Check these areas:

* The application does not wrap a call that a framework integration already
  wraps.
* Manual lifecycle calls are not emitted around the same call that already uses
  managed tool or LLM helpers.
* Plugin-managed exporters and manually registered exporters are not both
  active for the same output path or backend.
* Retries that the framework or application actually executes are separate real
  calls and are not mistaken for duplicate telemetry of a single call.

If duplicate events are expected because a retry or fallback actually executed
more than once, preserve the events and add stable names or metadata that let
the downstream backend distinguish attempts.
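
To separate true duplicates from repeated attempts, count how often the same UUID and lifecycle phase appear in a sanitized sample. The sketch assumes event dictionaries with a `uuid` and a hypothetical `phase` field, and it assumes that a genuine retry is emitted under its own UUID.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping, Tuple


def find_duplicates(events: Iterable[Mapping[str, Any]]) -> Dict[Tuple[str, str], int]:
    """Return (uuid, phase) pairs that occur more than once, with their counts."""
    counts = Counter((event["uuid"], event["phase"]) for event in events)
    return {key: n for key, n in counts.items() if n > 1}


events = [
    {"uuid": "t1", "phase": "start"},
    {"uuid": "t1", "phase": "start"},  # same event delivered twice
    {"uuid": "t1", "phase": "end"},
    {"uuid": "t2", "phase": "start"},
    {"uuid": "t2", "phase": "end"},
]
duplicates = find_duplicates(events)
```

An identical pair seen twice suggests duplicate subscribers or exporters. Two different UUIDs for what looks like one call suggest duplicate wrappers or a real retry.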

## Confirm Sanitization Before Export

Sensitive data in telemetry is an incident. Use this order:

1. Stop or disable the affected exporter if sensitive data is leaving the
   intended trust boundary.
2. Keep the application path stable unless the live request itself is unsafe.
3. Add or fix sanitize-request and sanitize-response guardrails before
   subscribers and exporters receive events.
4. Validate the sanitized event with ATOF JSONL or an in-process subscriber
   before re-enabling external export.
5. Re-enable one exporter at a time and confirm the downstream backend no
   longer receives sensitive fields.

Use a request intercept only when the real request to the tool or provider must
change. Use a sanitize guardrail when only the recorded telemetry should change.
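
Before re-enabling an exporter, a JSONL capture can be scanned for key names that suggest unredacted secrets. The sketch below makes no assumption about the ATOF schema beyond one JSON object per line. The key pattern and the `[REDACTED]` marker are assumptions, and a clean report does not prove that free-text values are safe.

```python
import json
import re
from typing import Any, Dict, Iterable, List

# Hypothetical pattern; extend it with the field names your services use.
SENSITIVE_KEY = re.compile(r"authorization|token|password|secret|api[_-]?key", re.I)
REDACTED = "[REDACTED]"


def sensitive_paths(value: Any, path: str = "") -> List[str]:
    """Return dotted paths of sensitive-looking keys that are not redacted."""
    found = []
    if isinstance(value, dict):
        for key, item in value.items():
            child = f"{path}.{key}" if path else str(key)
            if SENSITIVE_KEY.search(str(key)) and item != REDACTED:
                found.append(child)
            found.extend(sensitive_paths(item, child))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            found.extend(sensitive_paths(item, f"{path}[{index}]"))
    return found


def scan_jsonl(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Map 1-based line numbers to the sensitive paths found on that line."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if line.strip():
            hits = sensitive_paths(json.loads(line))
            if hits:
                report[number] = hits
    return report


sample = [
    '{"uuid": "a", "metadata": {"headers": {"Authorization": "Bearer x"}}}',
    '{"uuid": "b", "metadata": {"token": "[REDACTED]"}}',
]
report = scan_jsonl(sample)
```

The report lists paths only, never values, so it can be shared during triage. Expect false positives on benign fields such as token counts, and review those by hand.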

## Escalation Capture Checklist

Collect this information before escalating an incident:

* NeMo Relay version and binding package version.
* Language binding and runtime version.
* Whether instrumentation is direct application code, a framework integration,
  or plugin-managed behavior.
* Exporter type, configuration source, and activation path.
* Sanitized event sample that shows `uuid`, `parent_uuid`, `category`,
  `scope_category`, name, and redacted metadata.
* Runtime shape, such as single process, worker pool, async tasks, sidecar, job
  queue, or container orchestration.
* Reproduction scope, including whether the failure occurs for one request, one
  tenant, one service, or all requests.
* Recent changes to instrumentation, plugin configuration, exporter endpoints,
  runtime environment, or tracing backend configuration.

Do not attach raw prompts, model responses, credentials, customer records,
authorization headers, or unredacted tool arguments to escalation artifacts.