# Middleware

This page explains the runtime behavior that runs around managed tool and LLM calls.

## What Middleware Is

Middleware is the runtime behavior that runs around tool and LLM execution.
NeMo Relay uses middleware to control, transform, or observe work at specific
lifecycle points.

Middleware is organized by lifecycle meaning: each kind of middleware is tied to
one lifecycle point and one purpose, rather than all behavior sharing a single
undifferentiated hook system.

## Registration Levels

Middleware and subscribers can be registered at different levels depending on their
lifetime and visibility.

### Global Registrations

Global registrations stay active for the whole process until they are removed.
Use them for defaults that should apply broadly.

### Scope-Local Registrations

Scope-local registrations are owned by one active scope and disappear
automatically when that scope closes.

Use them when behavior should stay local to one request, workflow, or nested
unit of work.
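
The difference between global and scope-local lifetime can be sketched with a small stand-in registry. This is only an illustration of the lifetime rules described above. `Registry`, `register_global`, `register_local`, and `scope` are hypothetical names, not the NeMo Relay API.

```python
from contextlib import contextmanager


class Registry:
    """Hypothetical registry used only to illustrate registration lifetimes."""

    def __init__(self):
        self.global_items = []  # stay active until explicitly removed
        self.scope_stack = []   # one list of registrations per active scope

    def register_global(self, item):
        self.global_items.append(item)

    def register_local(self, item):
        if not self.scope_stack:
            raise RuntimeError("no active scope")
        self.scope_stack[-1].append(item)

    @contextmanager
    def scope(self):
        self.scope_stack.append([])
        try:
            yield self
        finally:
            # Closing the scope drops everything it owned.
            self.scope_stack.pop()

    def active(self):
        items = list(self.global_items)
        for frame in self.scope_stack:
            items.extend(frame)
        return items


registry = Registry()
registry.register_global("audit-log")

with registry.scope():
    registry.register_local("per-request-budget")
    inside = registry.active()   # global and scope-local registrations

outside = registry.active()      # only the global registration remains
```

In this sketch, nested scopes stack, so an inner scope sees its own registrations plus those of every enclosing scope.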

### Plugin-Installed Registrations

Plugins can install middleware during initialization. This is the reusable,
configuration-driven path for shipping middleware bundles without hand-registering
everything in application code.

## Middleware Families

NeMo Relay has two major middleware families:

* **Intercepts** change the real execution path
* **Guardrails** block work or rewrite emitted observability payloads

## Intercepts

Intercepts are middleware that change the real request or execution path.

### Request Intercepts

Request intercepts rewrite the real request before execution continues.

Use them when the next stage of execution should receive changed input, such as:

* Header injection
* Request normalization
* Argument enrichment
* Provider-specific request rewriting
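
As a minimal sketch of the idea, assuming a request is a plain dictionary and an intercept is a function that returns a rewritten copy, a chain of request intercepts might look like the following. The function names and the request shape are assumptions made for illustration, not NeMo Relay signatures.

```python
def inject_header(request):
    # Header injection: add a trace header without mutating the input.
    headers = {**request.get("headers", {}), "x-trace-id": "abc123"}
    return {**request, "headers": headers}


def normalize_model(request):
    # Request normalization: trim and lowercase the model name.
    return {**request, "model": request["model"].strip().lower()}


def apply_request_intercepts(request, intercepts):
    # Each intercept receives the output of the previous one, so the next
    # stage of execution sees the fully rewritten request.
    for intercept in intercepts:
        request = intercept(request)
    return request


original = {"model": "  My-Model ", "headers": {}}
rewritten = apply_request_intercepts(original, [inject_header, normalize_model])
```

Returning a new dictionary from each intercept keeps the caller's original object untouched, which makes the rewrite easier to reason about.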

### Execution Intercepts

Execution intercepts wrap or replace the real callback.

Use them when behavior belongs around the invocation boundary itself, such as:

* Retries
* Timing
* Routing
* Wrapper logic
* Framework integration
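
The wrapping pattern can be sketched as follows, assuming each execution intercept receives the request plus a `call_next` callable that invokes the rest of the chain. The `(request, call_next)` signature and the `build_chain` helper are hypothetical and chosen for illustration only.

```python
import time
from functools import partial


def retry_intercept(max_attempts):
    # Retries: call the rest of the chain again on a transient failure.
    def intercept(request, call_next):
        last_error = None
        for _ in range(max_attempts):
            try:
                return call_next(request)
            except RuntimeError as exc:
                last_error = exc
        raise last_error
    return intercept


def timing_intercept(timings):
    # Timing: measure everything that runs inside this wrapper.
    def intercept(request, call_next):
        start = time.perf_counter()
        try:
            return call_next(request)
        finally:
            timings.append(time.perf_counter() - start)
    return intercept


def build_chain(intercepts, callback):
    # The first intercept in the list becomes the outermost wrapper.
    handler = callback
    for intercept in reversed(intercepts):
        handler = partial(intercept, call_next=handler)
    return handler


attempts = []

def flaky_callback(request):
    # Stand-in for the real callback: fails twice, then succeeds.
    attempts.append(request)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return f"ok:{request}"


timings = []
handler = build_chain([timing_intercept(timings), retry_intercept(3)], flaky_callback)
result = handler("ping")
```

Because the timing wrapper is outermost here, it records one measurement that covers all three attempts. Swapping the order would time each attempt separately.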

### Stream Execution Intercepts

LLM streaming has a stream execution path for wrappers that need to run around
chunk delivery and finalization rather than only around a single response
object.
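
A rough sketch of a stream wrapper, assuming the streaming callback is a Python generator and the intercept is itself a generator that re-yields chunks, is shown below. The names and the `(request, call_next)` shape are illustrative assumptions rather than the NeMo Relay interface.

```python
def counting_stream_intercept(stats):
    def intercept(request, call_next):
        stats["chunks"] = 0
        stats["finalized"] = False
        try:
            for chunk in call_next(request):
                # Runs around each chunk delivery.
                stats["chunks"] += 1
                yield chunk.upper()
        finally:
            # Runs once when the stream ends, fails, or is closed early.
            stats["finalized"] = True
    return intercept


def provider_stream(request):
    # Stand-in for a streaming provider callback.
    for word in request.split():
        yield word


stats = {}
intercept = counting_stream_intercept(stats)
chunks = list(intercept("hello streaming world", provider_stream))
```

The `finally` block is what distinguishes this from a wrapper around a single response object: cleanup happens at the end of the stream, not when the call returns its iterator.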

## Guardrails

Guardrails are middleware that block execution or sanitize observability payloads.

### Conditional Execution

Conditional-execution guardrails run before the real callback. They decide
whether execution may proceed.

Use them when the runtime should block work based on policy, budget, or context.
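
A budget-based check could be sketched like this, assuming a guardrail returns an `(allowed, reason)` pair and the runtime raises when any guardrail refuses. `ExecutionBlocked`, `budget_guardrail`, and `run_guarded` are hypothetical names invented for this example.

```python
class ExecutionBlocked(Exception):
    """Raised in this sketch when a guardrail rejects the call."""


def budget_guardrail(max_calls):
    state = {"calls": 0}

    def guardrail(request):
        if state["calls"] >= max_calls:
            return False, "call budget exhausted"
        state["calls"] += 1
        return True, ""

    return guardrail


def run_guarded(request, guardrails, callback):
    # Guardrails run before the real callback and may stop it entirely.
    for guardrail in guardrails:
        allowed, reason = guardrail(request)
        if not allowed:
            raise ExecutionBlocked(reason)
    return callback(request)


guard = budget_guardrail(max_calls=2)
results = [run_guarded(n, [guard], lambda x: x * 2) for n in (1, 2)]

try:
    run_guarded(3, [guard], lambda x: x * 2)
    blocked_reason = None
except ExecutionBlocked as exc:
    blocked_reason = str(exc)
```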

### Sanitize Request

Sanitize-request guardrails rewrite the payload recorded on emitted start events.

Use them when the event stream should hide or reduce sensitive request data.

### Sanitize Response

Sanitize-response guardrails rewrite the payload recorded on emitted end events.

Use them when the event stream should hide or reduce sensitive response data.

Sanitize guardrails are observability-oriented. They do not rewrite the real
arguments passed to the callback or the real value returned to the caller.
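
The following sketch illustrates that separation under simplified assumptions: events are appended to a plain list instead of being dispatched to subscribers, and the sanitizer functions and payload fields are invented for the example.

```python
import copy


def redact_api_key(payload):
    # Sanitize-request: hide a sensitive field in the recorded payload.
    cleaned = copy.deepcopy(payload)
    if "api_key" in cleaned:
        cleaned["api_key"] = "***"
    return cleaned


def truncate_text(payload):
    # Sanitize-response: reduce the recorded response body.
    cleaned = copy.deepcopy(payload)
    cleaned["text"] = cleaned["text"][:8]
    return cleaned


events = []  # stand-in for the emitted event stream


def run_with_sanitizers(request, callback):
    events.append(("start", redact_api_key(request)))
    response = callback(request)  # the callback receives the real request
    events.append(("end", truncate_text(response)))
    return response               # the caller receives the real response


seen_by_callback = []

def callback(request):
    seen_by_callback.append(request["api_key"])
    return {"text": "a long completion body"}


response = run_with_sanitizers({"prompt": "hi", "api_key": "secret"}, callback)
```

Only the entries in `events` are rewritten. The callback still sees `"secret"` and the caller still receives the full response text.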

## Managed Execution Order

For managed execution, NeMo Relay applies middleware and emits lifecycle events
in this order:

```mermaid
sequenceDiagram
    autonumber
    actor Caller as Application / Framework
    participant Runtime as NeMo Relay Runtime
    participant Cond as Conditional Guardrails
    participant Req as Request Intercepts
    participant Exec as Execution Intercepts
    participant Callback as Real Callback
    participant San as Sanitize Guardrails
    participant Dispatch as Async Subscriber Dispatcher
    participant Subs as Subscribers

    Caller->>Runtime: managed tool or LLM call
    Runtime->>Cond: decide whether work may proceed

    alt blocked
        Cond-->>Caller: reject execution
    else allowed
        Runtime->>Req: rewrite the real request
        Runtime->>San: sanitize emitted start payload
        Runtime->>Dispatch: enqueue start event before execution
        Dispatch-->>Subs: deliver start event later
        Runtime->>Exec: wrap execution
        Exec->>Callback: invoke callback
        Callback-->>Exec: return real result
        Exec-->>Runtime: continue
        Runtime->>San: sanitize emitted end payload
        Runtime->>Dispatch: enqueue end event
        Dispatch-->>Subs: deliver end event later
        Runtime-->>Caller: return real result
    end
```

1. Conditional-execution guardrails
2. Request intercepts
3. Sanitize-request guardrails, then emission of the start event
4. Execution intercepts
5. The real callback, unless an execution intercept replaces it
6. Sanitize-response guardrails, then emission of the end event

For streaming LLM flows, the same pre-execution order applies: the runtime
applies `sanitize-request` guardrails and emits the LLM start event before the
stream execution intercept chain runs. Stream execution intercepts fill the
role of execution intercepts for streaming provider callbacks. The runtime then
collects chunks and finalizes the stream before `sanitize-response` guardrails
rewrite the emitted end-event payload (item 6 in the list above).

This ordering is what makes the semantic split between intercepts and
guardrails important:

* If you need to change the real execution path, use an intercept.
* If you need to change only the emitted payload, use a sanitize guardrail.
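
The non-streaming order can be sketched end to end with a toy pipeline. This is a simplified model, not the runtime's implementation: events are emitted synchronously through a plain `emit` function rather than enqueued for an async dispatcher, and `managed_call` and all of its parameters are hypothetical.

```python
from functools import partial


def managed_call(request, *, conditionals, request_intercepts, sanitize_request,
                 execution_intercepts, callback, sanitize_response, emit):
    # 1. Conditional-execution guardrails decide whether work may proceed.
    for allow in conditionals:
        if not allow(request):
            raise PermissionError("blocked by guardrail")
    # 2. Request intercepts rewrite the real request.
    for rewrite in request_intercepts:
        request = rewrite(request)
    # 3. The start event carries a sanitized copy of the request.
    emit("start", sanitize_request(request))
    # 4-5. Execution intercepts wrap the real callback.
    handler = callback
    for intercept in reversed(execution_intercepts):
        handler = partial(intercept, call_next=handler)
    result = handler(request)
    # 6. The end event carries a sanitized copy of the result.
    emit("end", sanitize_response(result))
    return result


trace = []

def conditional(request):
    trace.append("conditional")
    return True

def add_suffix(request):
    trace.append("request-intercept")
    return request + "!"

def wrap(request, call_next):
    trace.append("execution-intercept")
    return call_next(request)

def callback(request):
    trace.append("callback")
    return request.upper()

def emit(kind, payload):
    trace.append(f"{kind}:{payload}")


result = managed_call(
    "hi",
    conditionals=[conditional],
    request_intercepts=[add_suffix],
    sanitize_request=lambda payload: "<redacted>",
    execution_intercepts=[wrap],
    callback=callback,
    sanitize_response=lambda payload: "<redacted>",
    emit=emit,
)
```

In this sketch the caller receives `"HI!"`, which reflects the request intercept, while both emitted events carry only `"<redacted>"`.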

## Detailed Execution Flow

The simplified sequence above is the right mental model for most readers. The
diagram below expands the same flow to show where guardrail rejections, event
subscribers, execution-intercept chaining, and streaming collection/finalization
fit into the runtime path.

```mermaid
flowchart TB
    Request([Request])

    subgraph Execution
        direction TB
        ConditionalExecutionGuardrails{{Conditional-Execution Guardrail}}
        RequestIntercepts[/Request Intercepts/]
        RaiseException[Raise Exception]
        subgraph Invocation
            direction TB
            HasExecutionIntercept{{Has Valid Execution Intercept}}
            ExecutionIntercepts[/Execution Intercepts/]
            DefaultCallable[Default Callable]
            InterceptResult[Execution Result]
        end

        subgraph Streaming
            direction TB
            Finalizer[Finalizer]
            Collector[Collector]
        end

        subgraph Observability
            direction TB
            SanitizeRequestGuardrails[/Sanitize Request Guardrail/]
            SanitizeResponseGuardrails[/Sanitize Response Guardrail/]
            StartEvent[Emit Start Event]
            EndEvent[Emit End Event]
            Dispatcher[["Async Subscriber Dispatcher"]]
            EventSubscribers[["Event Subscribers"]]
        end
    end

    Response([Response])

    Request --> ConditionalExecutionGuardrails
    RequestIntercepts -->|Transformed Request| SanitizeRequestGuardrails
    ConditionalExecutionGuardrails -->|"(rejected event)"| Dispatcher
    ConditionalExecutionGuardrails -->|"(rejected)"| RaiseException
    ConditionalExecutionGuardrails -->|"(passed)"| RequestIntercepts
    SanitizeRequestGuardrails -->|Sanitized Start Payload| StartEvent
    StartEvent --> Dispatcher
    Dispatcher --> EventSubscribers
    StartEvent -->|Before Execution Intercepts| HasExecutionIntercept
    RequestIntercepts -.->|Real Request| HasExecutionIntercept

    HasExecutionIntercept -->|No| DefaultCallable
    HasExecutionIntercept -->|Yes| ExecutionIntercepts
    ExecutionIntercepts -.->|calls next| HasExecutionIntercept
    ExecutionIntercepts -->|returns or replaces| InterceptResult
    DefaultCallable -->|returns| InterceptResult

    InterceptResult -->|Response| SanitizeResponseGuardrails
    InterceptResult -->|Response| Response

    InterceptResult -.->|stream chunks| Collector
    Collector -..->|stream chunks| Response
    InterceptResult -.->|"(stream ends)"| Finalizer
    Finalizer -.->|Aggregated Response| SanitizeResponseGuardrails
    Finalizer o--o|shared state| Collector

    SanitizeResponseGuardrails -->|Sanitized End Payload| EndEvent
    EndEvent --> Dispatcher

    class Execution,Invocation,Streaming,Observability,Request,Response grey-lightest;
    class Dispatcher,EventSubscribers,StartEvent,EndEvent teal-lightest;
    class RequestIntercepts,HasExecutionIntercept,ExecutionIntercepts yellow-lightest;
    class ConditionalExecutionGuardrails,SanitizeRequestGuardrails,SanitizeResponseGuardrails green-lightest;
    class RaiseException red-lightest;
    class DefaultCallable,InterceptResult,Collector,Finalizer magenta-lightest;
```

## Choosing the Right Surface

Use these comparisons to pick the middleware surface that matches the behavior you need.

* Use a **conditional-execution guardrail** when the work should be allowed or
  rejected.
* Use a **request intercept** when the real request must change before the call.
* Use an **execution intercept** when behavior belongs around the invocation
  boundary.
* Use a **sanitize guardrail** when only subscribers and exporters should see
  rewritten data.
* Use a **stream execution intercept** when behavior must span the lifecycle of
  a long-lived or chunked response, such as per-chunk transformation,
  incremental authorization, per-event logging or metrics, backpressure
  handling, or cancellation and cleanup. An execution intercept, by contrast,
  only surrounds a single call boundary.

## Practical Guidance

Apply these practices when registering middleware in application or integration code.

* Keep process-wide defaults global.
* Keep request-local policy scope-local.
* Use plugins when the middleware bundle should be reusable and
  configuration-driven.
* Treat execution intercepts as the preferred wrapper point for framework
  integrations.