nemoguardrails.guardrails.model_engine

View as Markdown

Model engine for IORails.

Wraps a single Model config and makes raw HTTP calls to its OpenAI-compatible /v1/chat/completions endpoint via aiohttp. Retries are handled by aiohttp-retry (ExponentialRetry).

Module Contents

Classes

NameDescription
ModelEngineWraps a single Model config and makes HTTP calls to its endpoint.
ModelEngineErrorRaised when a model engine call fails.
_RequestParamsPre-built parameters for an HTTP request to the completions endpoint.

Functions

NameDescription
_parse_chat_completionConvert a /v1/chat/completions response dict into an LLMResponse.
_parse_chat_completion_chunkBuild an LLMResponseChunk from an SSE chunk dict.
_parse_usageBuild UsageInfo from an OpenAI-format usage dict.

Data

_CHAT_COMPLETIONS_ENDPOINT

_ENGINE_BASE_URLS

log

API

class nemoguardrails.guardrails.model_engine.ModelEngine(
model_config: nemoguardrails.rails.llm.config.Model
)

Bases: BaseEngine

Wraps a single Model config and makes HTTP calls to its endpoint.

Each ModelEngine owns its own RetryClient with per-model timeout, retry, and connection pool settings.

api_key
Optional[str] = self._resolve_api_key(model_config.engine)
base_url
str = self._resolve_base_url()
model_name
str = model_config.model or ''
nemoguardrails.guardrails.model_engine.ModelEngine._ensure_running() -> None

Raise if the engine has not been started.

nemoguardrails.guardrails.model_engine.ModelEngine._get_environment_variable(
variable_name: str
) -> str | None

Return the value stored in environment variable variable_name.

nemoguardrails.guardrails.model_engine.ModelEngine._prepare_request(
messages: nemoguardrails.guardrails.guardrails_types.LLMMessages,
kwargs: typing.Any = {}
) -> nemoguardrails.guardrails.model_engine._RequestParams

Build the client, URL, headers, and body common to every request.

nemoguardrails.guardrails.model_engine.ModelEngine._raise_for_status(
response: aiohttp.ClientResponse,
req_id: str,
t0: float
) -> None
async

Raise ModelEngineError if the HTTP status indicates an error.

nemoguardrails.guardrails.model_engine.ModelEngine._resolve_api_key(
engine: str | None
) -> typing.Optional[str]

Resolve the API key from model config or environment.

nemoguardrails.guardrails.model_engine.ModelEngine._resolve_base_url() -> str

Resolve the base URL from model parameters or engine type.

Strips an optional trailing “/v1” so users can follow the OpenAI / LLMRails convention of including “/v1” in base_url without producing a doubled “/v1/v1/chat/completions” path when _CHAT_COMPLETIONS_ENDPOINT is appended.

nemoguardrails.guardrails.model_engine.ModelEngine._wrap_exception(
exc: Exception,
req_id: str,
t0: float,
label: str = 'Request'
) -> nemoguardrails.guardrails.model_engine.ModelEngineError

Wrap an unexpected exception in a ModelEngineError.

nemoguardrails.guardrails.model_engine.ModelEngine.call(
messages: nemoguardrails.guardrails.guardrails_types.LLMMessages,
kwargs: typing.Any = {}
) -> dict
async

Make a POST request to the /v1/chat/completions endpoint.

Retries on transient failures (429, 5xx, connection errors) are handled automatically by the RetryClient with exponential backoff.

Parameters:

messages
LLMMessages

List of message dicts in OpenAI format.

**kwargs
AnyDefaults to {}

Additional parameters for the request body (temperature, max_tokens, etc.)

Returns: dict

The parsed JSON response dict from the API.

Raises:

  • ModelEngineError: If the request fails after all retries.
nemoguardrails.guardrails.model_engine.ModelEngine.chat_completion(
messages: nemoguardrails.guardrails.guardrails_types.LLMMessages,
kwargs: typing.Any = {}
) -> nemoguardrails.types.LLMResponse
async

Generate a chat completion and return a structured LLMResponse.

Calls the /v1/chat/completions endpoint and parses the OpenAI-format response into an LLMResponse carrying content, reasoning (when the provider exposes reasoning_content), usage, finish reason, and request id.

Raises:

  • ModelEngineError: If the request fails or the response format is unexpected.
nemoguardrails.guardrails.model_engine.ModelEngine.stream_call(
messages: nemoguardrails.guardrails.guardrails_types.LLMMessages,
kwargs: typing.Any = {}
) -> collections.abc.AsyncGenerator[nemoguardrails.types.LLMResponseChunk, None]
async

Make a streaming POST request to the /v1/chat/completions endpoint.

Sends stream=True and yields one LLMResponseChunk per SSE event that carries a content delta, reasoning delta, OR a usage payload. Role-only, finish-only, and empty-choices events without usage are skipped. Retries are handled by the RetryClient (same as call()).

Note: when the upstream payload includes stream_options.include_usage=true (default for the OpenAI-compatible client), the provider sends a final usage-only chunk with empty choices after the last content chunk. That terminal chunk is yielded as LLMResponseChunk(usage=...) with both delta_content and delta_reasoning unset — callers that only care about content should gate on chunk.delta_content rather than assuming every yielded chunk carries one.

Parameters:

messages
LLMMessages

List of message dicts in OpenAI format.

**kwargs
AnyDefaults to {}

Additional parameters for the request body (temperature, max_tokens, etc.)

Raises:

  • ModelEngineError: If the request fails after all retries.
nemoguardrails.guardrails.model_engine.ModelEngine.stream_chat_completion(
messages: nemoguardrails.guardrails.guardrails_types.LLMMessages,
kwargs: typing.Any = {}
) -> collections.abc.AsyncGenerator[nemoguardrails.types.LLMResponseChunk, None]
async

Stream a chat completion and yield LLMResponseChunk objects.

Thin pass-through over stream_call — see that method’s docstring for the contract, including the terminal usage-only chunk emitted when stream_options.include_usage is on.

Raises:

  • ModelEngineError: If the request fails after all retries.
class nemoguardrails.guardrails.model_engine.ModelEngineError(
message: str,
model_name: str,
status: int | None = None
)
Exception

Bases: Exception

Raised when a model engine call fails.

class nemoguardrails.guardrails.model_engine._RequestParams()

Bases: NamedTuple

Pre-built parameters for an HTTP request to the completions endpoint.

body
dict[str, Any]
client
RetryClient
headers
dict[str, str]
url
str
nemoguardrails.guardrails.model_engine._parse_chat_completion(
response: dict
) -> nemoguardrails.types.LLMResponse

Convert a /v1/chat/completions response dict into an LLMResponse.

Reasoning is read from message.reasoning_content when the provider exposes it (NIM, DeepSeek-style). Tool calls are out of scope for this PR series and are not currently surfaced.

nemoguardrails.guardrails.model_engine._parse_chat_completion_chunk(
chunk: dict
) -> typing.Optional[nemoguardrails.types.LLMResponseChunk]

Build an LLMResponseChunk from an SSE chunk dict.

Returns None for chunks without one of: content delta, reasoning delta, or a usage payload. Role-only first events and finish-only events with empty deltas map to None.

Last chunk from OpenAI-compatible providers has a usage field when stream_options.include_usage=true. This is passed through to capture the token usage metadata.

nemoguardrails.guardrails.model_engine._parse_usage(
usage_dict: dict
) -> nemoguardrails.types.UsageInfo

Build UsageInfo from an OpenAI-format usage dict.

Picks up reasoning_tokens from completion_tokens_details (OpenAI reasoning models) and cached_tokens from prompt_tokens_details when present.

nemoguardrails.guardrails.model_engine._CHAT_COMPLETIONS_ENDPOINT = '/v1/chat/completions'
nemoguardrails.guardrails.model_engine._ENGINE_BASE_URLS = {'nim': 'https://integrate.api.nvidia.com', 'openai': 'https://api.openai.com'}
nemoguardrails.guardrails.model_engine.log = logging.getLogger(__name__)