
Copyright © 2026, NVIDIA Corporation.


Middleware


This page explains the runtime behavior that runs around managed tool and LLM calls.

What Middleware Is

Middleware is the runtime behavior that runs around tool and LLM execution. NeMo Relay uses middleware to control, transform, or observe work at specific lifecycle points.

Middleware is organized by the role each kind plays in the call lifecycle, rather than exposed as one undifferentiated hook system.

Registration Levels

Middleware and subscribers can be registered at different levels depending on their lifetime and visibility.

Global Registrations

Global registrations stay active for the whole process until they are removed. Use them for defaults that should apply broadly.

Scope-Local Registrations

Scope-local registrations are owned by one active scope and disappear automatically when that scope closes.

Use them when behavior should stay local to one request, workflow, or nested unit of work.
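
The lifetime difference between global and scope-local registrations can be illustrated with a small, self-contained sketch. This is not the NeMo Relay API: `Registry`, `register_global`, `scope`, and `active` are hypothetical names invented for illustration, and the choice to list global registrations before scope-local ones is an assumption.

```python
from contextlib import contextmanager


class Registry:
    """Hypothetical registry (not the NeMo Relay API) showing registration lifetimes."""

    def __init__(self):
        self.global_items = []  # active until explicitly removed
        self.scope_stack = []   # one list of registrations per open scope

    def register_global(self, fn):
        self.global_items.append(fn)

    def unregister_global(self, fn):
        self.global_items.remove(fn)

    @contextmanager
    def scope(self):
        local = []
        self.scope_stack.append(local)
        try:
            yield local
        finally:
            # Scope-local registrations disappear when the scope closes.
            self.scope_stack.pop()

    def active(self):
        # Assumption: global registrations come first, then open scopes in order.
        items = list(self.global_items)
        for local in self.scope_stack:
            items.extend(local)
        return items


def audit(request):
    return request


def tenant_policy(request):
    return request


registry = Registry()
registry.register_global(audit)

with registry.scope() as local:
    local.append(tenant_policy)
    inside = [fn.__name__ for fn in registry.active()]

outside = [fn.__name__ for fn in registry.active()]
```

In this sketch `tenant_policy` is visible only while its scope is open, while `audit` stays active until `unregister_global` is called.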

Plugin-Installed Registrations

Plugins can install middleware during initialization. This is the reusable, configuration-driven path for shipping middleware bundles without hand-registering everything in application code.

Middleware Families

NeMo Relay has two major middleware families:

  • Intercepts change the real execution path
  • Guardrails block work or rewrite emitted observability payloads

Intercepts

Intercepts are middleware that change the real request or execution path.

Request Intercepts

Request intercepts rewrite the real request before execution continues.

Use them when the next stage of execution should receive changed input, such as:

  • Header injection
  • Request normalization
  • Argument enrichment
  • Provider-specific request rewriting
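
As a rough illustration of the uses listed above, a request intercept can be modeled as a function that takes a request and returns the request the next stage should see. The dictionary-shaped request and the helpers `inject_header`, `normalize_model`, and `apply_request_intercepts` are assumptions made for this sketch, not NeMo Relay signatures.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
RequestIntercept = Callable[[Request], Request]


def inject_header(name: str, value: str) -> RequestIntercept:
    """Hypothetical intercept factory: adds one header to the real request."""
    def intercept(request: Request) -> Request:
        headers = {**request.get("headers", {}), name: value}
        return {**request, "headers": headers}
    return intercept


def normalize_model(request: Request) -> Request:
    """Hypothetical intercept: normalizes the model name before execution."""
    return {**request, "model": request["model"].strip().lower()}


def apply_request_intercepts(request: Request, intercepts: List[RequestIntercept]) -> Request:
    # Each intercept receives the output of the previous one.
    for intercept in intercepts:
        request = intercept(request)
    return request


original = {"model": "  Example-Model ", "headers": {}}
rewritten = apply_request_intercepts(
    original,
    [inject_header("x-trace-id", "abc123"), normalize_model],
)
```

Here the callback would receive `rewritten`, which is the defining property of a request intercept: the change reaches the real execution path, not only the emitted events.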

Execution Intercepts

Execution intercepts wrap or replace the real callback.

Use them when behavior belongs around the invocation boundary itself, such as:

  • Retries
  • Timing
  • Routing
  • Wrapper logic
  • Framework integration
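
The wrapping behavior described above can be sketched with plain higher-order functions. Everything here is hypothetical: `with_retries`, `with_timing`, and `chain` are illustrative helpers, and the convention that the first intercept in the list is the outermost wrapper is an assumption, not documented NeMo Relay behavior.

```python
import time


def with_retries(max_attempts):
    """Hypothetical execution intercept: retries the wrapped call on RuntimeError."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def intercept(next_call):
        def wrapped(request):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return next_call(request)
                except RuntimeError as exc:
                    last_error = exc
            raise last_error
        return wrapped
    return intercept


def with_timing(timings):
    """Hypothetical execution intercept: records wall-clock duration per invocation."""
    def intercept(next_call):
        def wrapped(request):
            start = time.perf_counter()
            try:
                return next_call(request)
            finally:
                timings.append(time.perf_counter() - start)
        return wrapped
    return intercept


def chain(callback, intercepts):
    # Assumption: the first intercept in the list becomes the outermost wrapper.
    for intercept in reversed(intercepts):
        callback = intercept(callback)
    return callback


attempts = {"count": 0}


def flaky_tool(request):
    # Fails twice, then succeeds, to exercise the retry wrapper.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return {"echo": request["query"]}


timings = []
managed = chain(flaky_tool, [with_timing(timings), with_retries(3)])
result = managed({"query": "ping"})
```

Because timing wraps retries in this sketch, one timing entry covers all three attempts. Reversing the list would instead record one entry per attempt.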

Stream Execution Intercepts

LLM streaming has a stream execution path for wrappers that need to run around chunk delivery and finalization rather than only around a single response object.
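
A minimal way to picture a wrapper that runs around chunk delivery and finalization is a generator that wraps another generator. The names `fake_stream` and `count_chunks` are hypothetical, and real streaming callbacks in NeMo Relay may be asynchronous or use a different chunk type.

```python
from typing import Any, Callable, Dict, Iterator

StreamCallback = Callable[[Dict[str, Any]], Iterator[str]]


def fake_stream(request: Dict[str, Any]) -> Iterator[str]:
    """Stand-in for a streaming provider callback: yields one chunk per word."""
    for word in request["prompt"].split():
        yield word


def count_chunks(stats: Dict[str, Any]) -> Callable[[StreamCallback], StreamCallback]:
    """Hypothetical stream execution intercept: observes every chunk and finalization."""
    def intercept(next_stream: StreamCallback) -> StreamCallback:
        def wrapped(request: Dict[str, Any]) -> Iterator[str]:
            stats["chunks"] = 0
            stats["finalized"] = False
            try:
                for chunk in next_stream(request):
                    stats["chunks"] += 1  # runs around each chunk delivery
                    yield chunk
            finally:
                stats["finalized"] = True  # runs when the stream ends or is closed
        return wrapped
    return intercept


stats: Dict[str, Any] = {}
stream = count_chunks(stats)(fake_stream)
chunks = list(stream({"prompt": "hello streaming world"}))
```

A wrapper around a single response object would run only once, whereas this one sees each chunk and also the end of the stream.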

Guardrails

Guardrails are middleware that block execution or sanitize observability payloads.

Conditional Execution

Conditional-execution guardrails run before the real callback. They decide whether execution may proceed.

Use them when the runtime should block work based on policy, budget, or context.
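
The allow-or-block decision can be sketched as a check that runs before the callback. The `ExecutionBlocked` exception, the `(allowed, reason)` return shape, and the helpers `budget_guardrail` and `run_guarded` are assumptions for illustration. How NeMo Relay actually reports a rejection is not described on this page.

```python
class ExecutionBlocked(Exception):
    """Hypothetical error raised when a guardrail rejects the work."""


def budget_guardrail(max_calls):
    """Hypothetical conditional-execution guardrail enforcing a call budget."""
    state = {"calls": 0}

    def check(request):
        if state["calls"] >= max_calls:
            return False, "call budget exhausted"
        state["calls"] += 1
        return True, ""
    return check


def run_guarded(request, guardrails, callback):
    # Guardrails run before the real callback and can stop it entirely.
    for guardrail in guardrails:
        allowed, reason = guardrail(request)
        if not allowed:
            raise ExecutionBlocked(reason)
    return callback(request)


calls = []


def tool(request):
    calls.append(request)
    return "ok"


guard = budget_guardrail(max_calls=1)
first = run_guarded({"n": 1}, [guard], tool)

try:
    run_guarded({"n": 2}, [guard], tool)
    blocked_reason = None
except ExecutionBlocked as exc:
    blocked_reason = str(exc)
```

The second call never reaches `tool`, which is the point of the guardrail: the decision is made before any real work happens.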

Sanitize Request

Sanitize-request guardrails rewrite the payload recorded on emitted start events.

Use them when the event stream should hide or reduce sensitive request data.

Sanitize Response

Sanitize-response guardrails rewrite the payload recorded on emitted end events.

Use them when the event stream should hide or reduce sensitive response data.

Sanitize guardrails are observability-oriented. They do not rewrite the real arguments passed to the callback or the real value returned to the caller.
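
The distinction between emitted payloads and real values can be made concrete with a small sketch. The `events` list stands in for the event stream, and `redact_keys` and `run_with_sanitizers` are hypothetical helpers rather than NeMo Relay functions.

```python
import copy


def redact_keys(*keys):
    """Hypothetical sanitize guardrail: masks selected keys in a copy of the payload."""
    def sanitize(payload):
        clean = copy.deepcopy(payload)  # never mutate the real value
        for key in keys:
            if key in clean:
                clean[key] = "[REDACTED]"
        return clean
    return sanitize


events = []  # stand-in for the emitted event stream


def run_with_sanitizers(request, callback, sanitize_request, sanitize_response):
    events.append(("start", sanitize_request(request)))
    response = callback(request)  # the callback receives the real request
    events.append(("end", sanitize_response(response)))
    return response               # the caller receives the real response


def lookup_user(request):
    return {"user": request["user"], "ssn": "000-00-0000"}


result = run_with_sanitizers(
    {"user": "ada", "api_key": "secret"},
    lookup_user,
    redact_keys("api_key"),
    redact_keys("ssn"),
)
```

Only the recorded events carry the redacted values. The callback and the caller still see the original data.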

Managed Execution Order

For managed execution, NeMo Relay applies middleware and emits lifecycle events in this order:

  1. Conditional-execution guardrails
  2. Request intercepts
  3. Sanitize-request guardrails and emit the start event
  4. Execution intercepts
  5. The real callback, unless an execution intercept replaces it
  6. Sanitize-response guardrails and emit the end event

For streaming LLM flows, the same pre-execution order applies: the runtime applies sanitize-request guardrails and emits the LLM start event before the stream execution intercept chain runs. Stream execution intercepts fill the execution-intercept role (step 4) for streaming provider callbacks. The runtime then collects chunks and finalizes the stream before sanitize-response guardrails rewrite the emitted end-event payload in step 6.

This ordering is what makes the semantic split between intercepts and guardrails important:

  • If you need to change the real execution path, use an intercept
  • If you need to change only the emitted payload, use a sanitize guardrail
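
The six-step order listed earlier in this section can be traced with a toy, non-streaming pipeline. This is a sketch under stated assumptions: `run_managed`, its keyword parameters, the `emit` callback, and the use of `PermissionError` for rejections are all hypothetical and do not reflect the NeMo Relay implementation.

```python
def run_managed(request, callback, *, conditional=(), request_intercepts=(),
                sanitize_request=(), execution_intercepts=(),
                sanitize_response=(), emit=print):
    """Hypothetical runner that mirrors the documented six-step order."""
    # 1. Conditional-execution guardrails decide whether work may proceed.
    for guardrail in conditional:
        if not guardrail(request):
            raise PermissionError("blocked by conditional-execution guardrail")
    # 2. Request intercepts rewrite the real request.
    for intercept in request_intercepts:
        request = intercept(request)
    # 3. Sanitize-request guardrails rewrite only the emitted start payload.
    start_payload = request
    for sanitize in sanitize_request:
        start_payload = sanitize(start_payload)
    emit("start", start_payload)
    # 4. Execution intercepts wrap the callback (first in list is outermost).
    call = callback
    for intercept in reversed(execution_intercepts):
        call = intercept(call)
    # 5. The real callback runs with the real (unsanitized) request.
    response = call(request)
    # 6. Sanitize-response guardrails rewrite only the emitted end payload.
    end_payload = response
    for sanitize in sanitize_response:
        end_payload = sanitize(end_payload)
    emit("end", end_payload)
    return response


trace, events = [], []

def allow(request):
    trace.append("conditional"); return True

def add_region(request):
    trace.append("request_intercept"); return {**request, "region": "us"}

def hide_token(payload):
    trace.append("sanitize_request")
    return {k: v for k, v in payload.items() if k != "token"}

def log_call(next_call):
    def wrapped(request):
        trace.append("execution_intercept")
        return next_call(request)
    return wrapped

def tool(request):
    trace.append("callback"); return {"answer": 42, "debug": "internal"}

def hide_debug(payload):
    trace.append("sanitize_response")
    return {k: v for k, v in payload.items() if k != "debug"}

def emit(kind, payload):
    trace.append(f"emit_{kind}"); events.append((kind, payload))

result = run_managed(
    {"token": "t", "q": "hi"}, tool,
    conditional=[allow], request_intercepts=[add_region],
    sanitize_request=[hide_token], execution_intercepts=[log_call],
    sanitize_response=[hide_debug], emit=emit,
)
```

In this sketch the request intercept's `region` field reaches both the callback and the start event, while the removed `token` and `debug` fields are hidden only from the emitted events.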

Detailed Execution Flow

The simplified sequence above is the right mental model for most readers. The diagram below expands the same flow to show where guardrail rejections, event subscribers, execution-intercept chaining, and streaming collection/finalization fit into the runtime path.

Choosing the Right Surface

Use these comparisons to pick the middleware surface that matches the behavior you need.

  • Use a conditional-execution guardrail when the work should be allowed or rejected.
  • Use a request intercept when the real request must change before the call.
  • Use an execution intercept when behavior belongs around the invocation boundary.
  • Use a sanitize guardrail when only subscribers and exporters should see rewritten data.
  • Use a stream execution intercept, rather than an execution intercept that surrounds a single call boundary, when behavior must apply across the lifecycle of a long-lived or chunked response: per-chunk transformation, incremental authorization, per-event logging or metrics, backpressure handling, or cancellation and cleanup.

Practical Guidance

Use these practices when applying the concept in application or integration code.

  • Keep process-wide defaults global.
  • Keep request-local policy scope-local.
  • Use plugins when the middleware bundle should be reusable and configuration-driven.
  • Treat execution intercepts as the preferred wrapper point for framework integrations.