This page explains middleware: the runtime behavior that runs around managed tool and LLM calls.
NeMo Relay uses middleware to control, transform, or observe work at specific lifecycle points.
Middleware is organized by lifecycle meaning rather than as one undifferentiated hook system.
Middleware and subscribers can be registered at different levels depending on their lifetime and visibility.
Global registrations stay active for the whole process until they are removed. Use them for defaults that should apply broadly.
Scope-local registrations are owned by one active scope and disappear automatically when that scope closes.
Use them when behavior should stay local to one request, workflow, or nested unit of work.
Plugins can install middleware during initialization. This is the reusable, configuration-driven path for shipping middleware bundles without hand-registering everything in application code.
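The three registration levels above can be illustrated with a small stand-in registry. This is a minimal sketch, not the NeMo Relay API: `MiddlewareRegistry`, `register_global`, `register_local`, `scope`, and `install_plugin` are hypothetical names chosen for illustration, and middleware is represented by plain strings.

```python
from contextlib import contextmanager


class MiddlewareRegistry:
    """Hypothetical registry used only to illustrate lifetimes."""

    def __init__(self):
        self._global = []   # stays active until explicitly removed
        self._scopes = []   # stack of per-scope registration lists

    def register_global(self, middleware):
        self._global.append(middleware)

    def remove_global(self, middleware):
        self._global.remove(middleware)

    def register_local(self, middleware):
        if not self._scopes:
            raise RuntimeError("no active scope")
        self._scopes[-1].append(middleware)

    @contextmanager
    def scope(self):
        self._scopes.append([])
        try:
            yield self
        finally:
            # Scope-local registrations disappear when the scope closes.
            self._scopes.pop()

    def active(self):
        # Global registrations first, then outer-to-inner scopes.
        return list(self._global) + [m for s in self._scopes for m in s]


def install_plugin(registry, config):
    """Hypothetical plugin initializer: registers a bundle named in config."""
    for name in config.get("middleware", []):
        registry.register_global(name)


registry = MiddlewareRegistry()
install_plugin(registry, {"middleware": ["audit-log"]})

with registry.scope():
    registry.register_local("request-budget")
    inside = registry.active()

outside = registry.active()
```

Inside the `with` block both registrations are active; after it closes, only the plugin-installed global one remains, with no explicit cleanup call.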
NeMo Relay has two major middleware families: intercepts and guardrails.
Intercepts are middleware that change the real request or execution path.
Request intercepts rewrite the real request before execution continues.
Use them when the next stage of execution should receive changed input, such as:
Execution intercepts wrap or replace the real callback.
Use them when behavior belongs around the invocation boundary itself, such as:
LLM streaming has a stream execution path for wrappers that need to run around chunk delivery and finalization rather than only around a single response object.
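To make the difference between request intercepts and execution intercepts concrete, here is a minimal sketch for the non-streaming case. All names (`add_default_temperature`, `retry_once`, `run`) are hypothetical and the signatures are assumptions for illustration; the actual NeMo Relay registration API may look different.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
Callback = Callable[[Request], Any]


def add_default_temperature(request: Request) -> Request:
    # Request intercept: returns the changed input that the next stage receives.
    return {**request, "temperature": request.get("temperature", 0.2)}


def retry_once(next_call: Callback) -> Callback:
    # Execution intercept: wraps the real callback at the invocation boundary.
    def wrapped(request: Request) -> Any:
        try:
            return next_call(request)
        except TimeoutError:
            return next_call(request)
    return wrapped


def run(request: Request, callback: Callback,
        request_intercepts: List[Callable[[Request], Request]],
        execution_intercepts: List[Callable[[Callback], Callback]]) -> Any:
    for intercept in request_intercepts:
        request = intercept(request)
    call = callback
    # Wrap in reverse so the first listed intercept ends up outermost.
    for wrap in reversed(execution_intercepts):
        call = wrap(call)
    return call(request)


attempts: List[Request] = []


def flaky_llm(request: Request) -> str:
    # Simulated provider callback that times out on its first call.
    attempts.append(request)
    if len(attempts) == 1:
        raise TimeoutError("simulated timeout")
    return f"answer at T={request['temperature']}"


result = run({"prompt": "hi"}, flaky_llm,
             [add_default_temperature], [retry_once])
```

The request intercept changes what the callback actually receives, while the execution intercept never touches the payload and only controls how the callback is invoked.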
Guardrails are middleware that block execution or sanitize observability payloads.
Conditional-execution guardrails run before the real callback. They decide whether execution may proceed.
Use them when the runtime should block work based on policy, budget, or context.
Sanitize-request guardrails rewrite the payload recorded on emitted start events.
Use them when the event stream should hide or reduce sensitive request data.
Sanitize-response guardrails rewrite the payload recorded on emitted end events.
Use them when the event stream should hide or reduce sensitive response data.
Sanitize guardrails are observability-oriented. They do not rewrite the real arguments passed to the callback or the real value returned to the caller.
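The split between blocking and sanitizing can be sketched as follows. This is an illustrative model only: `Blocked`, `within_budget`, `redact_request`, `redact_response`, `run_guarded`, and the `events` list are hypothetical stand-ins, and placing the budget check before the start event is an assumption of the sketch rather than a documented guarantee.

```python
from typing import Any, Callable, Dict, List, Tuple


class Blocked(Exception):
    """Raised in this sketch when a conditional-execution guardrail rejects work."""


events: List[Tuple[str, Any]] = []  # stand-in for the emitted event stream


def within_budget(request: Dict[str, Any]) -> bool:
    # Conditional-execution guardrail: decides whether execution may proceed.
    return request.get("tokens", 0) <= 100


def redact_request(payload: Dict[str, Any]) -> Dict[str, Any]:
    # Sanitize-request guardrail: changes only the recorded start payload.
    return {**payload, "api_key": "***"}


def redact_response(payload: str) -> str:
    # Sanitize-response guardrail: changes only the recorded end payload.
    return "[redacted]"


def run_guarded(request: Dict[str, Any], callback: Callable[[Dict[str, Any]], str]) -> str:
    if not within_budget(request):
        raise Blocked("over budget")           # assumed to happen before any event
    events.append(("start", redact_request(request)))
    result = callback(request)                 # the callback sees the real arguments
    events.append(("end", redact_response(result)))
    return result                              # the caller gets the real value


seen: List[Dict[str, Any]] = []


def tool(request: Dict[str, Any]) -> str:
    seen.append(request)
    return "secret answer"


result = run_guarded({"api_key": "sk-123", "tokens": 10}, tool)

try:
    run_guarded({"api_key": "sk-123", "tokens": 1000}, tool)
    blocked = False
except Blocked:
    blocked = True
```

After the first call, `seen` holds the unredacted request and `result` holds the unredacted response, while `events` holds only the sanitized copies. The second call is rejected before the tool runs.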
For managed execution, NeMo Relay applies middleware and emits lifecycle events in this order:
For streaming LLM flows, the same pre-execution order applies: the runtime
applies sanitize-request guardrails and emits the LLM start event before the
stream execution intercept chain runs. Stream execution intercepts are the
execution-intercept family for streaming provider callbacks. The runtime then
collects chunks and finalizes the stream before sanitize-response guardrails
rewrite the emitted end-event payload at item 6.
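The streaming order described in this paragraph can be modeled with plain generators. This is a sketch under stated assumptions: `run_stream`, `uppercase_chunks`, the event names `llm_start` and `llm_end`, and the idea that the finalized value is the joined chunk text are all hypothetical choices made for illustration.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

Request = Dict[str, Any]
Stream = Callable[[Request], Iterator[str]]

events: List[Tuple[str, Any]] = []  # stand-in for the emitted event stream


def uppercase_chunks(next_stream: Stream) -> Stream:
    # Stream execution intercept: runs around chunk delivery and finalization.
    def wrapped(request: Request) -> Iterator[str]:
        try:
            for chunk in next_stream(request):
                yield chunk.upper()
        finally:
            events.append(("intercept_finalized", None))
    return wrapped


def run_stream(request: Request, provider: Stream,
               stream_intercepts: List[Callable[[Stream], Stream]],
               sanitize_request: Callable[[Request], Any],
               sanitize_response: Callable[[str], Any]) -> Iterator[str]:
    # Sanitize the recorded payload and emit the start event first.
    events.append(("llm_start", sanitize_request(request)))
    stream = provider
    for wrap in reversed(stream_intercepts):
        stream = wrap(stream)
    # Deliver chunks to the caller while collecting them for finalization.
    chunks: List[str] = []
    for chunk in stream(request):
        chunks.append(chunk)
        yield chunk
    final = "".join(chunks)
    # Only the recorded end payload is sanitized; the chunks were delivered as-is.
    events.append(("llm_end", sanitize_response(final)))


def provider(request: Request) -> Iterator[str]:
    yield from ["he", "llo"]


delivered = list(run_stream(
    {"prompt": "hi", "user": "alice"},
    provider,
    [uppercase_chunks],
    lambda r: {**r, "user": "<hidden>"},
    lambda text: f"<{len(text)} chars>",
))
```

The caller receives the intercepted chunks, and the event list records the sanitized start payload, the intercept's finalization, and then the sanitized end payload, in that order.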
This ordering is what makes the semantic split between intercepts and guardrails important:
The simplified sequence above is the right mental model for most readers. The diagram below expands the same flow to show where guardrail rejections, event subscribers, execution-intercept chaining, and streaming collection/finalization fit into the runtime path.
Use these comparisons to pick the middleware surface that matches the behavior you need.
Use these practices when applying middleware in application or integration code.