nemoguardrails.guardrails.iorails
Optimized IORails Engine for specific guardrail configurations.
This module provides an optimized inference path for guardrail configurations that only use specific supported flows (input/output content safety). For configurations outside this supported set, the standard LLMRails engine should be used instead.
Module Contents
Classes
Functions
Data
API
Bases: BaseGuardrails
Workflow engine for accelerated Input/Output rails inference.
True when output rails are configured and streaming is enabled for them.
Context manager (used for testing rather than long-lived instance)
Context manager (used for testing rather than long-lived instance)
Core pipeline: input rails -> LLM call -> output rails.
Sequential path: input rails block before LLM generation starts.
Speculative path: input rails and LLM generation race concurrently.
Race input rails against LLM generation, return LLMResponse or None (rejected).
Runs inside a queue worker task. Wraps the pipeline in
traced_request so each request gets its own span + request ID,
then delegates to _do_generate for the actual input rails →
LLM → output rails flow. Metrics are emitted at the outer
lifecycle scope by generate_async, not here.
Buffer streamed chunks and run output rails on each batch.
Uses the same RollingBuffer and stream_first semantics as
LLMRails:
stream_first=True: yield chunks immediately, then run output rails. If unsafe, inject an error and stop.stream_first=False: run output rails first, only yield chunks if safe.
Raise if output rails exist but streaming is not enabled for them.
Return True iff IORails can handle the given config and llm argument.
Synchronous version of generate_async.
Telemetry is disabled for the ephemeral IORails object used for
the generate() call. For production use, use the asynchronous
generate_async() and stream_async() methods for non-streaming
and streaming requests respectively.
Public entry: submit the request to the internal work queue.
The queue enforces non-streaming concurrency limits
(NONSTREAM_MAX_CONCURRENCY workers draining up to
NONSTREAM_QUEUE_DEPTH pending items). Callers receive
asyncio.QueueFull when the admission buffer is full and
guardrails.nonstream.rejections increments if metrics are enabled.
Request-level metrics (guardrails.requests,
guardrails.request.duration, guardrails.requests.errors)
wrap the queue submission, so duration includes queue-wait time
(OTEL HTTP semconv). A QueueFull rejection shows up in BOTH
requests.errors{error.type=QueueFull} and
nonstream.rejections — honest dual-signal reporting.
Start the IORails engine. Call this during service startup.
Stop the IORails engine. Call this during service shutdown.
Stream LLM response tokens with input/output rails applied.
Returns an async iterator that yields string chunks (or dicts when
include_metadata=True). Input rails run before any tokens are
streamed. If output rails are configured and streaming is enabled,
tokens are buffered and checked using the same RollingBuffer /
stream_first semantics as LLMRails.
Parameters:
Conversation messages in OpenAI format.
Optional GenerationOptions (llm_params are forwarded to the main LLM call).
When True, chunks are dicts with text and
metadata keys instead of plain strings.
Returns: AsyncIterator[Union[str, dict]]
An async iterator of string chunks (or dicts).
Raises:
StreamingNotSupportedError: If output rails are present butrails.output.streaming.enabledis False.ValueError: Ifinclude_metadata=Truewith output rails streaming enabled (BufferStrategy requires plain string chunks).asyncio.QueueFull: If the streaming concurrency limit is reached (load shedding).
Return None if IORails can handle (config, llm), else a human-readable reason.
Build the assistant message returned by generate.
Without tool calls this is the existing {"role", "content"} shape. With
tool calls present, the calls are serialized to OpenAI shape and content
is set to None when empty, matching the OpenAI assistant-message contract.
Serialize ToolCall objects to OpenAI /chat/completions shape.
function.arguments is emitted as a JSON string (OpenAI-native) rather
than the canonical dict carried internally, so the output round-trips
through OpenAI-compatible clients.