nemoguardrails.guardrails.model_engine
Model engine for IORails.
Wraps a single Model config and makes raw HTTP calls to its OpenAI-compatible /v1/chat/completions endpoint via aiohttp. Retries are handled by aiohttp-retry (ExponentialRetry).
Module Contents
Classes
Functions
Data
API
Bases: BaseEngine
Wraps a single Model config and makes HTTP calls to its endpoint.
Each ModelEngine owns its own RetryClient with per-model timeout, retry, and connection pool settings.
Raise if the engine has not been started.
Return the value stored in environment variable variable_name.
Build the client, URL, headers, and body common to every request.
Raise ModelEngineError if the HTTP status indicates an error.
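A minimal sketch of such a status check, for illustration only. The `ModelEngineError` class below is a local stand-in for the one documented on this page, the helper name `raise_for_status` is hypothetical, and treating every status of 400 or above as an error is an assumption rather than the module's confirmed rule.

```python
class ModelEngineError(Exception):
    """Local stand-in for the exception documented on this page."""


def raise_for_status(status: int, body: str = "") -> None:
    """Raise ModelEngineError for 4xx/5xx statuses (assumed threshold)."""
    if status >= 400:
        # Keep only the start of the body so the message stays readable.
        raise ModelEngineError(f"HTTP {status}: {body[:200]}")
```

A 2xx status returns without raising, so the caller can go on to parse the response body.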
Resolve the API key from model config or environment.
Resolve the base URL from model parameters or engine type.
Strips an optional trailing “/v1” so users can follow the OpenAI / LLMRails convention of including “/v1” in base_url without producing a doubled “/v1/v1/chat/completions” path when _CHAT_COMPLETIONS_ENDPOINT is appended.
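The normalization described above can be sketched as follows. The function names are hypothetical, and the value of `CHAT_COMPLETIONS_ENDPOINT` is an assumption inferred from the doubled path mentioned in the text; the real helper may handle more cases.

```python
# Assumed value of the module's _CHAT_COMPLETIONS_ENDPOINT constant.
CHAT_COMPLETIONS_ENDPOINT = "/v1/chat/completions"


def normalize_base_url(base_url: str) -> str:
    """Drop trailing slashes and one optional trailing "/v1" segment."""
    url = base_url.rstrip("/")
    if url.endswith("/v1"):
        url = url[: -len("/v1")]
    return url


def chat_completions_url(base_url: str) -> str:
    """Append the endpoint to the normalized base URL."""
    return normalize_base_url(base_url) + CHAT_COMPLETIONS_ENDPOINT
```

With this sketch, a base URL given with or without "/v1" resolves to the same request URL.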
Wrap an unexpected exception in a ModelEngineError.
Make a POST request to the /v1/chat/completions endpoint.
Retries on transient failures (429, 5xx, connection errors) are handled automatically by the RetryClient with exponential backoff.
Parameters:
List of message dicts in OpenAI format.
Additional parameters for the request body (temperature, max_tokens, etc.)
Returns: dict
The parsed JSON response dict from the API.
Raises:
ModelEngineError: If the request fails after all retries.
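To make the retry policy above concrete, here is a standalone sketch of the two decisions involved: which statuses count as transient, and how long to wait between attempts. It does not call aiohttp-retry; the function names and the default `start`, `factor`, and `max_delay` values are illustrative assumptions, not the engine's actual settings.

```python
from typing import List


def is_retryable(status: int) -> bool:
    """Treat 429 and any 5xx status as transient, as described above."""
    return status == 429 or 500 <= status <= 599


def backoff_delays(
    attempts: int,
    start: float = 0.1,
    factor: float = 2.0,
    max_delay: float = 30.0,
) -> List[float]:
    """Exponential backoff: start * factor**i, capped at max_delay."""
    return [min(start * factor**i, max_delay) for i in range(attempts)]
```

For example, `backoff_delays(4, start=0.5, factor=2.0, max_delay=3.0)` gives waits of 0.5, 1.0, 2.0, and 3.0 seconds, with the last value clipped by the cap.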
Generate a chat completion and return a structured LLMResponse.
Calls the /v1/chat/completions endpoint and parses the OpenAI-format
response into an LLMResponse carrying content, reasoning (when the
provider exposes reasoning_content), usage, finish reason, and
request id.
Raises:
ModelEngineError: If the request fails or the response format is unexpected.
Make a streaming POST request to the /v1/chat/completions endpoint.
Sends stream=True and yields one LLMResponseChunk per SSE
event that carries a content delta, a reasoning delta, or a
usage payload. Role-only, finish-only, and empty-choices
events without usage are skipped. Retries are handled by the
RetryClient (same as call()).
Note: when the upstream payload includes
stream_options.include_usage=true (default for the
OpenAI-compatible client), the provider sends a final
usage-only chunk with empty choices after the last content
chunk. That terminal chunk is yielded as
LLMResponseChunk(usage=...) with both delta_content
and delta_reasoning unset. Callers that only care about
content should gate on chunk.delta_content rather than
assuming every yielded chunk carries one.
Parameters:
List of message dicts in OpenAI format.
Additional parameters for the request body (temperature, max_tokens, etc.)
Raises:
ModelEngineError: If the request fails after all retries.
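The gating advice in the note above can be illustrated with a small consumer. The `Chunk` dataclass is a hypothetical stand-in for LLMResponseChunk that uses the three field names mentioned in the text, and `fake_stream` simulates the engine so the sketch runs without a network call.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional, Tuple


@dataclass
class Chunk:
    """Hypothetical stand-in for LLMResponseChunk."""
    delta_content: Optional[str] = None
    delta_reasoning: Optional[str] = None
    usage: Optional[dict] = None


async def fake_stream() -> AsyncIterator[Chunk]:
    """Simulated stream: reasoning, content, then a usage-only chunk."""
    yield Chunk(delta_reasoning="thinking")
    yield Chunk(delta_content="Hel")
    yield Chunk(delta_content="lo")
    yield Chunk(usage={"total_tokens": 7})


async def collect(stream: AsyncIterator[Chunk]) -> Tuple[str, Optional[dict]]:
    """Join content deltas and keep the last usage payload seen."""
    parts = []
    usage = None
    async for chunk in stream:
        if chunk.delta_content:  # skip reasoning-only and usage-only chunks
            parts.append(chunk.delta_content)
        if chunk.usage is not None:
            usage = chunk.usage
    return "".join(parts), usage


text, usage = asyncio.run(collect(fake_stream()))
```

Checking `chunk.delta_content` before appending keeps the terminal usage-only chunk from being treated as content.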
Stream a chat completion and yield LLMResponseChunk objects.
Thin pass-through over stream_call. See that method’s
docstring for the contract, including the terminal usage-only
chunk emitted when stream_options.include_usage is on.
Raises:
ModelEngineError: If the request fails after all retries.
Bases: Exception
Raised when a model engine call fails.
Bases: NamedTuple
Pre-built parameters for an HTTP request to the completions endpoint.
Convert a /v1/chat/completions response dict into an LLMResponse.
Reasoning is read from message.reasoning_content when the provider
exposes it (NIM, DeepSeek-style). Tool calls are out of scope for this
PR series and are not currently surfaced.
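A rough sketch of this conversion, under stated assumptions. `ParsedResponse` is a hypothetical stand-in for LLMResponse, reading the request id from the top-level `id` field is an assumption, and the sketch raises `ValueError` where the module raises ModelEngineError.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedResponse:
    """Hypothetical stand-in for LLMResponse."""
    content: Optional[str]
    reasoning: Optional[str]
    finish_reason: Optional[str]
    request_id: Optional[str]
    usage: Optional[dict]


def parse_chat_response(data: dict) -> ParsedResponse:
    """Extract the first choice from an OpenAI-format response dict."""
    try:
        choice = data["choices"][0]
        message = choice["message"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError("unexpected response format") from exc
    return ParsedResponse(
        content=message.get("content"),
        # Only some providers send reasoning_content; None otherwise.
        reasoning=message.get("reasoning_content"),
        finish_reason=choice.get("finish_reason"),
        request_id=data.get("id"),  # assumed source of the request id
        usage=data.get("usage"),
    )
```

Providers that do not send `reasoning_content` simply yield `reasoning=None` in this sketch.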
Build an LLMResponseChunk from an SSE chunk dict.
Returns None for chunks without one of: content delta, reasoning delta, or a usage payload. Role-only first events and finish-only events with empty deltas map to None.
The last chunk from OpenAI-compatible providers carries a usage field when
stream_options.include_usage=true. It is passed through so that the
token usage metadata is captured.
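The filtering rule above can be sketched as a small function. `Chunk` is a hypothetical stand-in for LLMResponseChunk, `parse_chunk` is an illustrative name, and treating an empty-string delta the same as a missing one is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for LLMResponseChunk."""
    delta_content: Optional[str] = None
    delta_reasoning: Optional[str] = None
    usage: Optional[dict] = None


def parse_chunk(event: dict) -> Optional[Chunk]:
    """Return a Chunk, or None when the event carries nothing useful."""
    choices = event.get("choices") or []
    delta = (choices[0].get("delta") or {}) if choices else {}
    content = delta.get("content") or None  # empty string counts as absent
    reasoning = delta.get("reasoning_content") or None
    usage = event.get("usage") or None
    if content is None and reasoning is None and usage is None:
        return None  # role-only, finish-only, or empty-choices event
    return Chunk(content, reasoning, usage)
```

The final usage-only event has an empty `choices` list, so it comes back as a chunk with only `usage` set.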
Build UsageInfo from an OpenAI-format usage dict.
Picks up reasoning_tokens from completion_tokens_details (OpenAI reasoning models) and cached_tokens from prompt_tokens_details when present.
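As an illustration of where those nested fields sit in an OpenAI-format usage dict, here is a minimal sketch. `Usage` is a hypothetical stand-in for UsageInfo; its field names and the zero defaults for missing counts are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Usage:
    """Hypothetical stand-in for UsageInfo."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    reasoning_tokens: Optional[int] = None
    cached_tokens: Optional[int] = None


def build_usage(usage: dict) -> Usage:
    """Flatten an OpenAI-format usage dict, including nested details."""
    completion_details = usage.get("completion_tokens_details") or {}
    prompt_details = usage.get("prompt_tokens_details") or {}
    return Usage(
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        total_tokens=usage.get("total_tokens", 0),
        reasoning_tokens=completion_details.get("reasoning_tokens"),
        cached_tokens=prompt_details.get("cached_tokens"),
    )
```

When a provider omits the detail objects, both optional fields stay `None` instead of raising.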