Custom LLM Models for the NVIDIA NeMo Guardrails library

The NVIDIA NeMo Guardrails library defines a small LLMModel protocol that every backend implements. The built-in DefaultFramework ships an OpenAIChatModel for any OpenAI-compatible HTTP endpoint, and the optional LangChainFramework ships a LangChainLLMAdapter that wraps any LangChain BaseChatModel or BaseLLM. When neither matches your backend, you can implement LLMModel directly.

This guide covers when to do that, the contract you must satisfy, a minimal worked example, and pointers to the reference implementations and to the testing helpers.

When to Use a Custom LLMModel

There are three options for connecting a backend to the NVIDIA NeMo Guardrails library. Pick the best fit.

| Backend shape | Recommended path | Where it lives |
| --- | --- | --- |
| OpenAI-compatible HTTP endpoint, such as vLLM, TGI, OpenRouter, self-hosted, NIM, and other endpoints | Use engine: openai (or the matching built-in engine) and set parameters.base_url | Custom LLM Providers and the configuration reference |
| You already have a LangChain BaseChatModel or BaseLLM | Use engine: langchain and register the LangChain class with register_chat_provider | Custom LLM Providers |
| Custom HTTP API that is not OpenAI-shaped, and you do not want a LangChain dependency | Implement LLMModel and register it with the active framework | This guide |

Concretely, choose a custom LLMModel when:

  • Your provider speaks a non-OpenAI wire format and you do not want to depend on LangChain.
  • You want full control over retries, headers, streaming parsing, and tool-call accumulation.
  • You want a lean install footprint (no langchain-* packages) and you control the HTTP layer yourself.

The LLMModel Contract

The protocol is nemoguardrails.types.LLMModel. It is @runtime_checkable, so the framework registry can verify with isinstance(model, LLMModel).

A custom model class must implement two async methods and three properties.

from typing import AsyncIterator, List, Optional, Union

from nemoguardrails import (
    ChatMessage,
    LLMResponse,
    LLMResponseChunk,
)

class LLMModel:
    async def generate_async(
        self,
        prompt: Union[str, List[ChatMessage]],
        *,
        stop: Optional[List[str]] = None,
        **kwargs,
    ) -> LLMResponse: ...

    async def stream_async(
        self,
        prompt: Union[str, List[ChatMessage]],
        *,
        stop: Optional[List[str]] = None,
        **kwargs,
    ) -> AsyncIterator[LLMResponseChunk]:
        yield ...  # async generator: implementations use `yield`, not `return`

    @property
    def model_name(self) -> str: ...

    @property
    def provider_name(self) -> Optional[str]: ...

    @property
    def provider_url(self) -> Optional[str]: ...
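
A @runtime_checkable protocol makes isinstance confirm only that the required members exist; it does not validate signatures or whether a method is actually async. The sketch below illustrates this with a reduced stand-in protocol so that it runs without the library installed. ModelLike, Complete, and MissingMethod are hypothetical names used for illustration, not library classes.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ModelLike(Protocol):
    """Reduced stand-in for the real protocol (hypothetical, illustration only)."""

    @property
    def model_name(self) -> str: ...

    async def generate_async(self, prompt, *, stop=None, **kwargs): ...


class Complete:
    """Provides every member the stand-in protocol names."""

    @property
    def model_name(self) -> str:
        return "demo-model"

    async def generate_async(self, prompt, *, stop=None, **kwargs):
        return "ok"


class MissingMethod:
    """Has model_name but no generate_async, so the check fails."""

    model_name = "demo-model"


complete_ok = isinstance(Complete(), ModelLike)
missing_ok = isinstance(MissingMethod(), ModelLike)
print(complete_ok, missing_ok)
```

Because the check is structural, a passing isinstance is a necessary condition rather than proof of a correct implementation; unit tests still need to exercise the methods.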

prompt

Adapters must accept either a plain string or a list of nemoguardrails.types.ChatMessage objects. ChatMessage is a stdlib dataclass with role, content, optional tool_calls, optional tool_call_id, optional name, and a provider_metadata dict for non-standard fields. Convert messages to whatever shape your SDK expects.
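
As a rough illustration of the conversion step, the sketch below normalizes both prompt forms into a list of role/content dictionaries. The ChatMessage class here is a local stand-in that mirrors the fields listed above, and the output shape and the to_wire_messages helper are assumptions; adapt them to whatever your provider's SDK expects.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class ChatMessage:
    """Local stand-in mirroring the documented fields (not the library class)."""

    role: str
    content: str
    tool_calls: Optional[List[Any]] = None
    tool_call_id: Optional[str] = None
    name: Optional[str] = None
    provider_metadata: Dict[str, Any] = field(default_factory=dict)


def to_wire_messages(prompt: Union[str, List[ChatMessage]]) -> List[Dict[str, Any]]:
    """Normalize either prompt form into a list of role/content dicts."""
    if isinstance(prompt, str):
        # Assumption: a bare string is treated as a single user turn.
        return [{"role": "user", "content": prompt}]
    wire: List[Dict[str, Any]] = []
    for message in prompt:
        item: Dict[str, Any] = {"role": message.role, "content": message.content}
        # Include optional fields only when they are set.
        if message.name is not None:
            item["name"] = message.name
        if message.tool_call_id is not None:
            item["tool_call_id"] = message.tool_call_id
        wire.append(item)
    return wire


from_string = to_wire_messages("hi")
from_list = to_wire_messages(
    [
        ChatMessage(role="system", content="Be brief."),
        ChatMessage(role="tool", content="42", tool_call_id="call_1"),
    ]
)
print(from_string)
print(from_list)
```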

stop and **kwargs

stop is the canonical name for stop sequences; keep it as a keyword-only argument. **kwargs carries everything the caller passed under parameters in config.yml plus any per-call overrides, such as temperature, max_tokens, and top_p. Forward these to the underlying SDK.
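
One possible way to combine these inputs, assuming your class stores constructor-time defaults and lets per-call values override them, is sketched below. The build_request_params helper is hypothetical and not part of the library.

```python
from typing import Any, Dict, List, Optional


def build_request_params(
    defaults: Dict[str, Any],
    *,
    stop: Optional[List[str]] = None,
    **overrides: Any,
) -> Dict[str, Any]:
    """Merge stored defaults with per-call overrides; per-call values win."""
    params = {**defaults, **overrides}  # builds a new dict, defaults stay untouched
    if stop is not None:
        params["stop"] = list(stop)
    return params


config_defaults = {"temperature": 0.2, "max_tokens": 256}
params = build_request_params(
    config_defaults, stop=["\n\n"], temperature=0.7, top_p=0.9
)
print(params)
```

Unknown keys such as top_p pass through unchanged, which is what lets new SDK options work without changes to the adapter.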

generate_async returns LLMResponse

LLMResponse is a dataclass in nemoguardrails/types.py:

@dataclass
class LLMResponse:
    content: str
    reasoning: Optional[str] = None
    tool_calls: Optional[List[ToolCall]] = None
    model: Optional[str] = None
    finish_reason: Optional[FinishReason] = None
    stop_sequence: Optional[str] = None
    request_id: Optional[str] = None
    usage: Optional[UsageInfo] = None
    provider_metadata: Optional[Dict[str, Any]] = None

content is required and must be a string (use the empty string when the model only produced tool calls). finish_reason is one of "stop", "length", "tool_calls", "content_filter", "error", or "other". Populate tool_calls only when the response is a function-calling/tool-calling response.

stream_async is an async generator

Implementations must be async def generator functions that yield LLMResponseChunk objects. The protocol’s return type is AsyncIterator[LLMResponseChunk]. Each chunk has the shape:

@dataclass
class LLMResponseChunk:
    delta_content: Optional[str] = None
    delta_reasoning: Optional[str] = None
    delta_tool_calls: Optional[List[ToolCall]] = None
    model: Optional[str] = None
    finish_reason: Optional[FinishReason] = None
    request_id: Optional[str] = None
    usage: Optional[UsageInfo] = None
    provider_metadata: Optional[Dict[str, Any]] = None

Follow these conventions so the rest of the pipeline works:

  • Yield text deltas in delta_content as soon as they arrive.
  • Yield delta_reasoning for chain-of-thought tokens emitted before the visible answer (for example, OpenAI reasoning models or the NIM reasoning_content field).
  • Tool-call streaming is incremental on the wire: provider chunks usually carry argument fragments. Accumulate them and emit a single completed delta_tool_calls list on the chunk whose finish_reason == "tool_calls". The reference OpenAIChatModel._finalize_tool_calls shows the pattern.
  • Set finish_reason only on the final chunk that carries it. Earlier chunks should leave it None.
  • Emit a final usage-only chunk (no delta_content, only usage and request_id) when the provider sends an end-of-stream usage record. The pipeline tolerates either inline or trailing usage.
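
The conventions above can be sketched as a small async generator. This is not the library's implementation: Chunk and ToolCall are simplified local stand-ins, and the event dictionaries (text, tool_fragments, finish_reason, usage) are a hypothetical wire format invented for the example.

```python
import asyncio
import json
from dataclasses import dataclass
from typing import Any, AsyncIterator, Dict, Iterable, List, Optional


@dataclass
class ToolCall:  # simplified stand-in with flattened name/arguments
    id: str
    name: str
    arguments: Dict[str, Any]


@dataclass
class Chunk:  # stand-in for a subset of LLMResponseChunk fields
    delta_content: Optional[str] = None
    delta_tool_calls: Optional[List[ToolCall]] = None
    finish_reason: Optional[str] = None
    usage: Optional[Dict[str, int]] = None


async def to_chunks(events: Iterable[Dict[str, Any]]) -> AsyncIterator[Chunk]:
    pending: Dict[int, Dict[str, str]] = {}  # tool-call index -> partial fields
    for event in events:
        if event.get("text"):
            yield Chunk(delta_content=event["text"])  # forward text immediately
        for frag in event.get("tool_fragments", []):
            slot = pending.setdefault(
                frag["index"], {"id": "", "name": "", "arguments": ""}
            )
            slot["id"] = frag.get("id") or slot["id"]
            slot["name"] = frag.get("name") or slot["name"]
            slot["arguments"] += frag.get("arguments", "")  # accumulate JSON text
        reason = event.get("finish_reason")
        if reason == "tool_calls":
            calls = [
                ToolCall(
                    id=pending[i]["id"],
                    name=pending[i]["name"],
                    arguments=json.loads(pending[i]["arguments"] or "{}"),
                )
                for i in sorted(pending)
            ]
            yield Chunk(delta_tool_calls=calls, finish_reason=reason)
        elif reason:
            yield Chunk(finish_reason=reason)
        if event.get("usage"):
            yield Chunk(usage=event["usage"])  # trailing usage-only chunk


async def collect(events: Iterable[Dict[str, Any]]) -> List[Chunk]:
    return [chunk async for chunk in to_chunks(events)]


events = [
    {"text": "Checking. "},
    {"tool_fragments": [{"index": 0, "id": "call_1", "name": "get_weather", "arguments": '{"city": '}]},
    {"tool_fragments": [{"index": 0, "arguments": '"Paris"}'}]},
    {"finish_reason": "tool_calls"},
    {"usage": {"input_tokens": 3, "output_tokens": 5}},
]
chunks = asyncio.run(collect(events))
print(len(chunks), chunks[1].delta_tool_calls)
```

Only the chunk that carries finish_reason == "tool_calls" includes the completed tool calls; fragment-only events produce no chunk at all.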

Tool calling

ToolCall and ToolCallFunction are dataclasses:

@dataclass
class ToolCallFunction:
    name: str
    arguments: Dict[str, Any]

@dataclass
class ToolCall:
    id: str
    type: str = "function"
    function: ToolCallFunction = field(default_factory=lambda: ToolCallFunction(name="", arguments={}))

function.arguments is a Dict[str, Any], not a JSON string. If your provider returns arguments as a JSON string, json.loads() it before constructing the ToolCall. If parsing fails for a streamed response, fall back to an empty dict; the tool layer will surface the real error when the function is invoked.
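
A minimal sketch of that parsing rule follows. The two dataclasses are local copies of the shapes shown above so the example runs on its own, and parse_arguments is a hypothetical helper name.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ToolCallFunction:  # local copy of the documented shape
    name: str
    arguments: Dict[str, Any]


@dataclass
class ToolCall:  # local copy of the documented shape
    id: str
    type: str = "function"
    function: ToolCallFunction = field(
        default_factory=lambda: ToolCallFunction(name="", arguments={})
    )


def parse_arguments(raw: Any) -> Dict[str, Any]:
    """Return a dict of arguments, falling back to {} when parsing fails."""
    if isinstance(raw, dict):
        return raw  # provider already returned a parsed object
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError subclass
        return {}
    # Valid JSON that is not an object (for example a list) is also rejected.
    return parsed if isinstance(parsed, dict) else {}


good = ToolCall(
    id="call_1",
    function=ToolCallFunction("get_weather", parse_arguments('{"city": "Paris"}')),
)
truncated = parse_arguments('{"city": "Par')
print(good.function.arguments, truncated)
```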

Properties

  • model_name returns the concrete model identifier (for example gpt-4o-mini, meta/llama-3.1-70b-instruct). Used in logs and error contexts.
  • provider_name returns the engine name as it appears in config.yml (for example openai, nim, my_engine). Return None only if you genuinely cannot determine it.
  • provider_url returns the base URL for HTTP backends, or None for backends that do not have one (for example a SageMaker endpoint addressed by ARN).

Error handling

The pipeline expects errors to be normalized. Raise the exception classes defined in nemoguardrails.exceptions:

  • LLMConnectionError for network or DNS failures.
  • LLMTimeoutError for read or connect timeouts.
  • LLMAuthenticationError for 401 or 403.
  • LLMRateLimitError for 429.
  • LLMResponseValidationError for malformed provider responses.
  • LLMClientError is the common base if you need a generic fallback.

Populate model_name, provider_name, and base_url on the exception when you raise it so downstream logs are usable. The reference OpenAIChatModel._enrich shows the pattern.
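
The mapping above might be applied to HTTP status codes roughly as follows. The exception classes here are local stand-ins that reuse the documented names; their constructor signature, the fallback to LLMClientError for other error statuses, and the raise_for_status helper are all assumptions, so check nemoguardrails.exceptions for the real definitions.

```python
from typing import Optional, Type


class LLMClientError(Exception):
    """Stand-in base class; the constructor signature is an assumption."""

    def __init__(
        self,
        message: str,
        *,
        model_name: Optional[str] = None,
        provider_name: Optional[str] = None,
        base_url: Optional[str] = None,
    ):
        super().__init__(message)
        self.model_name = model_name
        self.provider_name = provider_name
        self.base_url = base_url


class LLMAuthenticationError(LLMClientError):
    pass


class LLMRateLimitError(LLMClientError):
    pass


def error_class_for_status(status: int) -> Type[LLMClientError]:
    """Pick the normalized exception class for an HTTP error status."""
    if status in (401, 403):
        return LLMAuthenticationError
    if status == 429:
        return LLMRateLimitError
    return LLMClientError  # generic fallback (assumption for other statuses)


def raise_for_status(status: int, body: str, **context: Optional[str]) -> None:
    """Raise a normalized, context-enriched error for status codes >= 400."""
    if status >= 400:
        raise error_class_for_status(status)(f"HTTP {status}: {body}", **context)


try:
    raise_for_status(
        429,
        "slow down",
        model_name="my-model",
        provider_name="my_engine",
        base_url="https://llm.example.invalid/v1",
    )
except LLMClientError as exc:
    caught = exc

print(type(caught).__name__, caught.provider_name)
```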

Minimal Working Example

Below is a short EchoLLMModel (under 60 lines) that returns canned responses without making any network call. It is useful as a starting skeleton and as a sanity check for new framework wiring.

Create a config directory my_config/ next to your smoke-test script with two files:

my_config/
├── config.py # EchoLLMModel + register_provider call, run at import time
└── config.yml # references the registered engine name

my_config/config.py:

import asyncio
from typing import Any, AsyncIterator, List, Optional, Union

from nemoguardrails import (
    ChatMessage,
    LLMResponse,
    LLMResponseChunk,
    UsageInfo,
    register_provider,
)

class EchoLLMModel:
    """Returns a canned response. Useful as a skeleton or in offline tests."""

    def __init__(self, model: str, response: str = "echo", **kwargs: Any):
        self._model = model
        self._response = response
        self._default_kwargs = kwargs

    @property
    def model_name(self) -> str:
        return self._model

    @property
    def provider_name(self) -> Optional[str]:
        return "echo"

    @property
    def provider_url(self) -> Optional[str]:
        return None

    async def generate_async(
        self,
        prompt: Union[str, List[ChatMessage]],
        *,
        stop: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> LLMResponse:
        return LLMResponse(
            content=self._response,
            model=self._model,
            finish_reason="stop",
            usage=UsageInfo(input_tokens=0, output_tokens=len(self._response)),
        )

    async def stream_async(
        self,
        prompt: Union[str, List[ChatMessage]],
        *,
        stop: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> AsyncIterator[LLMResponseChunk]:
        for token in self._response.split():
            await asyncio.sleep(0)
            yield LLMResponseChunk(delta_content=token + " ", model=self._model)
        yield LLMResponseChunk(model=self._model, finish_reason="stop")

register_provider("echo", EchoLLMModel)

The register_provider call attaches EchoLLMModel as the echo engine on whichever framework is currently active. By default, that is DefaultFramework. For the framework layer, refer to Custom LLM Framework.

my_config/config.yml:

models:
  - type: main
    engine: echo
    model: echo-v1
    parameters:
      response: "Hello from echo"

Trying it out

Run a smoke test from the parent directory of my_config/. LLMRails imports config.py automatically, which triggers the register_provider call at the bottom of that file:

# smoke.py (next to my_config/)
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./my_config")
rails = LLMRails(config)

result = rails.generate(messages=[{"role": "user", "content": "hi"}])
print(result["content"])  # -> "Hello from echo"

If the smoke test prints Hello from echo, your provider is registered correctly. From there, replace EchoLLMModel.generate_async and stream_async with real backend calls.

What register_provider does

register_provider(name, cls) from nemoguardrails.llm.providers resolves the active framework with get_default_framework() and calls framework.register_provider(name, cls) on it. For DefaultFramework, that adds name to its in-memory dict. Subsequent create_model("echo", ...) calls use your class as the factory. The active framework is selected once per process by NEMOGUARDRAILS_LLM_FRAMEWORK or set_default_framework() from config.py. You do not register on multiple frameworks.

Calling-convention contract for your __init__

framework.create_model(model_name, provider_name, model_kwargs) calls your class as EchoLLMModel(model=model_name, **model_kwargs). Make model a required keyword and accept additional **kwargs so that future configuration keys do not break instantiation.
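
To see why the trailing **kwargs matters, the sketch below mimics the documented call shape with a hypothetical create helper and a hypothetical DemoModel class. Neither is library code, and the timeout key stands in for any configuration key your class does not know about.

```python
from typing import Any, Dict


class DemoModel:
    """Hypothetical model class following the documented __init__ convention."""

    def __init__(self, model: str, response: str = "echo", **kwargs: Any):
        self.model = model
        self.response = response
        self.extra = kwargs  # unknown keys are kept instead of raising TypeError


def create(cls: type, model_name: str, model_kwargs: Dict[str, Any]) -> Any:
    """Mimics the documented call: cls(model=model_name, **model_kwargs)."""
    return cls(model=model_name, **model_kwargs)


instance = create(
    DemoModel, "echo-v1", {"response": "Hello from echo", "timeout": 30}
)
print(instance.model, instance.response, instance.extra)
```

Without **kwargs in the signature, the extra timeout key would make instantiation fail with a TypeError.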

Reference Implementations

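Review these production-grade LLMModel implementations:

  • OpenAIChatModel, shipped with the built-in DefaultFramework.
  • LangChainLLMAdapter, shipped with the optional LangChainFramework.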

Both files import their types directly from nemoguardrails.types. Custom models should do the same.

Testing Your Model

The NVIDIA NeMo Guardrails library ships a pytest-friendly FakeLLMModel under nemoguardrails.testing that is shaped exactly like the protocol and accepts a list of canned strings or LLMResponse objects:

from nemoguardrails.testing import FakeLLMModel

The two recommended approaches:

  1. Write unit tests for your LLMModel class in isolation: instantiate it, call await model.generate_async(prompt), and assert on the returned LLMResponse. No framework needed.
  2. Write end-to-end tests with a real LLMRails instance by registering a FakeLLMModel (or FakeLLMModel-style class) as a custom provider in the test’s config.py, then driving the full pipeline. For the full set of helpers (FakeLLMModel, TestChat, fixtures), refer to Testing Your Guardrails Configuration.

The contract is small enough that property-based tests are straightforward: any string prompt and any list of ChatMessage objects must produce a non-None LLMResponse.content, and stream_async must always yield a final chunk with a non-None finish_reason.
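
A minimal sketch of such a check, using only the standard library, is shown below. Response, Chunk, and CannedModel are hypothetical stand-ins, the prompts are a fixed list rather than generated inputs, and the message list uses plain dictionaries in place of ChatMessage objects. A property-based testing library such as Hypothesis could supply the prompts instead.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, AsyncIterator, List, Optional


@dataclass
class Response:  # stand-in for LLMResponse (content only)
    content: str


@dataclass
class Chunk:  # stand-in for a subset of LLMResponseChunk fields
    delta_content: Optional[str] = None
    finish_reason: Optional[str] = None


class CannedModel:
    """Hypothetical model under test."""

    async def generate_async(self, prompt: Any, *, stop=None, **kwargs) -> Response:
        return Response(content="canned")

    async def stream_async(
        self, prompt: Any, *, stop=None, **kwargs
    ) -> AsyncIterator[Chunk]:
        yield Chunk(delta_content="canned")
        yield Chunk(finish_reason="stop")


async def check_contract(model: Any, prompts: List[Any]) -> int:
    """Assert both invariants for every prompt; return how many were checked."""
    for prompt in prompts:
        response = await model.generate_async(prompt)
        assert response.content is not None
        chunks = [chunk async for chunk in model.stream_async(prompt)]
        # Accept a trailing usage-only chunk: some chunk must carry finish_reason.
        assert any(chunk.finish_reason is not None for chunk in chunks)
    return len(prompts)


prompts = ["", "hi", [{"role": "user", "content": "hi"}]]
checked = asyncio.run(check_contract(CannedModel(), prompts))
print(checked)
```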

Best Practices

  1. Implement both methods even if your backend has no native streaming. A simple stream_async that yields a single chunk built from generate_async keeps the streaming consumer paths working.
  2. Pre-flight validate provider responses. The reference OpenAIChatModel._validate_response rejects non-dict bodies and missing choices entries before parsing. This keeps user-facing errors actionable.
  3. Forward **kwargs to the SDK. Anything the user wrote under parameters in config.yml lands here. Letting unknown keys pass through means new SDK options work without a library release.
  4. Pool shared backend clients on the framework. create_model is called once per models: entry at LLMRails startup. After that, your model handles every request. If multiple models: entries point at the same backend, the framework, not the model, should hold the underlying client so they share one connection pool. DefaultFramework._get_or_create_client keys clients by (base_url, api_key, ...) for exactly this reason. Single-model configs can build the client directly in __init__.
  5. Do not raise vanilla Exception. Use the nemoguardrails.exceptions hierarchy so retries and structured logging behave correctly.
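
For the first practice, a single-chunk stream_async built on generate_async might look like the sketch below. Response and Chunk are local stand-ins for the library dataclasses, and NonStreamingModel is a hypothetical backend without native streaming.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, AsyncIterator, List, Optional


@dataclass
class Response:  # stand-in for a subset of LLMResponse fields
    content: str
    finish_reason: Optional[str] = None


@dataclass
class Chunk:  # stand-in for a subset of LLMResponseChunk fields
    delta_content: Optional[str] = None
    finish_reason: Optional[str] = None


class NonStreamingModel:
    """Hypothetical backend that only supports whole-response generation."""

    async def generate_async(self, prompt: Any, *, stop=None, **kwargs) -> Response:
        return Response(content=f"reply to: {prompt}", finish_reason="stop")

    async def stream_async(
        self, prompt: Any, *, stop=None, **kwargs
    ) -> AsyncIterator[Chunk]:
        # Reuse the non-streaming path, then emit the whole answer as one chunk.
        response = await self.generate_async(prompt, stop=stop, **kwargs)
        yield Chunk(
            delta_content=response.content,
            finish_reason=response.finish_reason or "stop",
        )


async def collect(model: NonStreamingModel, prompt: Any) -> List[Chunk]:
    return [chunk async for chunk in model.stream_async(prompt)]


chunks = asyncio.run(collect(NonStreamingModel(), "hi"))
print(chunks)
```

The single chunk carries both the text and the finish_reason, so streaming consumers see a complete, terminated stream.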