Model Configuration for NeMo Guardrails | NVIDIA NeMo Guardrails Library Developer Guide

This page explains how to configure the models section of your Guardrails config.yml file. For a complete reference of all configuration options, refer to the Configuration YAML Schema Reference.

NVIDIA NIM Configuration

The NVIDIA NeMo Guardrails library integrates with NVIDIA NIM microservices:

models:
  - type: main
    engine: nim
    model: meta/llama-3.1-8b-instruct

The nim engine provides access to:

Locally deployed NIMs. You can run models on your own infrastructure with optimized inference.
NVIDIA API Catalog. You can access hosted models on build.nvidia.com.
Specialized NIMs. These include NemoGuard Content Safety, Topic Control, and Jailbreak Detect.

Local NIM Deployment

For locally deployed NIMs, specify the base URL:

models:
  - type: main
    engine: nim
    model: meta/llama-3.1-8b-instruct
    parameters:
      base_url: http://localhost:8000/v1
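
Before pointing Guardrails at a local NIM, it can help to confirm that the server answers at the configured base_url and serves the model name used in config.yml. The following is a minimal sketch using only the Python standard library. It assumes the NIM exposes the OpenAI-style GET /models route under the base URL. The helper names models_url and list_served_models are hypothetical and are not part of the Guardrails library.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model IDs reported by the server (performs a network call).

    Assumes an OpenAI-style payload of the form {"data": [{"id": ...}, ...]}.
    """
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    return [item["id"] for item in payload.get("data", [])]


# Example (requires a running server, so it is left as a comment):
# served = list_served_models("http://localhost:8000/v1")
# print("meta/llama-3.1-8b-instruct" in served)
```

If the configured model name is missing from the returned list, the mismatch is usually easier to fix here than to diagnose from a failed Guardrails request.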

Task-Specific Models

Configure different models for specific tasks:

models:
  - type: main
    engine: nim
    model: meta/llama-3.1-8b-instruct

  - type: self_check_input
    engine: nim
    model: meta/llama3-8b-instruct

  - type: self_check_output
    engine: nim
    model: meta/llama-3.1-70b-instruct

  - type: generate_user_intent
    engine: nim
    model: meta/llama-3.1-8b-instruct
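
One way to picture task-specific models is as a lookup by type. The sketch below is illustrative only: it assumes that a task without its own entry falls back to the main model, which is an assumption about the runtime and not something this page states. The function model_for_task is hypothetical.

```python
from typing import Optional


def model_for_task(models: list[dict], task: str) -> Optional[dict]:
    """Return the entry whose type matches the task, else the main entry.

    The fallback to "main" is an assumption made for this sketch.
    """
    by_type = {entry["type"]: entry for entry in models}
    return by_type.get(task, by_type.get("main"))


# A trimmed-down version of the configuration above, as Python dicts.
models = [
    {"type": "main", "engine": "nim", "model": "meta/llama-3.1-8b-instruct"},
    {"type": "self_check_output", "engine": "nim", "model": "meta/llama-3.1-70b-instruct"},
]

print(model_for_task(models, "self_check_output")["model"])
print(model_for_task(models, "generate_user_intent")["model"])
```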

Configuration Examples

OpenAI Chat Completions

The following example shows how to configure the OpenAI model as the main application LLM using the chat completions endpoint:

models:
  - type: main
    engine: openai
    model: gpt-4o

OpenAI Responses API

Use the OpenAI Responses API (/v1/responses) when an OpenAI model or option requires it. For example, some models, such as gpt-5-pro, require the Responses API, and recent OpenAI models support options such as reasoning_effort through the Responses API.

The NVIDIA NeMo Guardrails library supports this option through the LangChain framework. Set NEMOGUARDRAILS_LLM_FRAMEWORK=langchain, install langchain-openai, and set use_responses_api: true in the model’s parameters block:

models:
  - type: main
    engine: openai
    model: gpt-4o
    parameters:
      use_responses_api: true

The built-in default framework does not support the Responses API. It calls /v1/chat/completions and forwards use_responses_api as an unknown request field, so the flag has no effect. For background, refer to Migrating to 0.22.
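
Because the flag is silently ignored on the default framework, a small preflight check can catch the mismatch early. The sketch below is a hypothetical helper, not a library function. It assumes the framework is selected only through the NEMOGUARDRAILS_LLM_FRAMEWORK environment variable and that an unset variable means the built-in default.

```python
import os
from typing import Mapping


def responses_api_warnings(models: list[dict], env: Mapping[str, str] = os.environ) -> list[str]:
    """Warn for each model that sets use_responses_api without the LangChain framework."""
    framework = env.get("NEMOGUARDRAILS_LLM_FRAMEWORK", "")
    warnings = []
    for entry in models:
        params = entry.get("parameters") or {}
        if params.get("use_responses_api") and framework != "langchain":
            warnings.append(
                f"{entry.get('type')}: use_responses_api is set, but "
                "NEMOGUARDRAILS_LLM_FRAMEWORK is not 'langchain'; "
                "the flag will have no effect."
            )
    return warnings


config_models = [
    {
        "type": "main",
        "engine": "openai",
        "model": "gpt-4o",
        "parameters": {"use_responses_api": True},
    }
]

for message in responses_api_warnings(config_models):
    print(message)
```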

Function tool calls work over the Responses API in streaming and non-streaming modes. The adapter surfaces function calls on LLMResponse.tool_calls and LLMResponseChunk.delta_tool_calls, and reports finish_reason as tool_calls.

Built-in Responses API tools, such as web_search, file_search, code_interpreter, and computer_use, are not surfaced as tool calls. The API returns those tools as response output items, not function calls, so only function-tool passthrough is supported.

Self-hosted servers that expose an OpenAI-compatible /v1/responses endpoint, such as vLLM, work with the same flag plus parameters.base_url. Confirm that your server implements /v1/responses before you enable the flag. Models that use the Harmony response format, such as gpt-oss or gpt-oss-safeguard, should use the Responses API.

models:
  - type: main
    engine: openai
    model: openai/gpt-oss-120b
    api_key_env_var: THIS_CAN_BE_ANY_KEY
    parameters:
      base_url: http://localhost:8000/v1
      use_responses_api: true

A complete example is available in examples/configs/llm/openai-responses-api.

Azure OpenAI

The following example shows how to configure the Azure OpenAI model as the main application LLM using the Azure OpenAI API:

models:
  - type: main
    engine: azure
    model: gpt-4
    parameters:
      azure_endpoint: https://my-resource.openai.azure.com/
      azure_deployment: my-gpt4-deployment
      api_version: "2024-02-15-preview"

You can supply the resource endpoint as azure_endpoint (preferred, matches the OpenAI Python SDK) or base_url (v0.21-compatibility alias). Both fields accept only the resource URL. The framework composes the deployment path. Setting both raises an error.

Set AZURE_OPENAI_API_KEY in the environment, set api_key_env_var on the model entry, or pass parameters.api_key directly. The framework constructs the deployment URL, sets api-version as a query parameter, and authenticates with the api-key header.
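
To make the URL composition concrete, the sketch below builds a request URL and headers from the same fields. It follows the public Azure OpenAI REST layout (/openai/deployments/<deployment>/chat/completions). The library's internal implementation may differ, and the function names are hypothetical.

```python
from urllib.parse import urlencode


def resolve_endpoint(parameters: dict) -> str:
    """Pick azure_endpoint or base_url; reject configurations that set both."""
    azure_endpoint = parameters.get("azure_endpoint")
    base_url = parameters.get("base_url")
    if azure_endpoint and base_url:
        raise ValueError("Set azure_endpoint or base_url, not both.")
    endpoint = azure_endpoint or base_url
    if not endpoint:
        raise ValueError("An Azure resource endpoint is required.")
    return endpoint.rstrip("/")


def azure_chat_request(parameters: dict, api_key: str) -> tuple[str, dict]:
    """Return (url, headers) for a chat completions call against a deployment."""
    endpoint = resolve_endpoint(parameters)
    query = urlencode({"api-version": parameters["api_version"]})
    url = (
        f"{endpoint}/openai/deployments/"
        f"{parameters['azure_deployment']}/chat/completions?{query}"
    )
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    return url, headers


params = {
    "azure_endpoint": "https://my-resource.openai.azure.com/",
    "azure_deployment": "my-gpt4-deployment",
    "api_version": "2024-02-15-preview",
}
url, headers = azure_chat_request(params, api_key="<your-key>")  # placeholder key
print(url)
```

Note that the endpoint contains only the resource URL. The deployment path and the api-version query parameter are appended separately, which is why supplying a full deployment URL in azure_endpoint would produce a malformed request.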

Azure OpenAI is supported natively on the default framework in v0.22 with key-based authentication. For Azure AD or token-based authentication, configure engine: openai manually or use LangChain with NEMOGUARDRAILS_LLM_FRAMEWORK=langchain. Refer to Migrating to 0.22 for both alternatives.

Anthropic

The following example shows how to configure the Anthropic model as the main application LLM:

models:
  - type: main
    engine: anthropic
    model: claude-3-5-sonnet-20241022

Anthropic’s API is not OpenAI-compatible, so this engine is opt-in. Set NEMOGUARDRAILS_LLM_FRAMEWORK=langchain and install langchain-anthropic. For background, refer to Migrating to 0.22.

vLLM (OpenAI-Compatible)

vLLM exposes an OpenAI-compatible API, so the recommended configuration uses engine: openai pointed at the vLLM endpoint. The built-in client handles it with no LangChain dependency.

models:
  - type: main
    engine: openai
    model: meta-llama/Llama-3.1-8B-Instruct
    parameters:
      base_url: http://localhost:5000/v1
      api_key: EMPTY

The following example shows how to configure Llama Guard as a guardrail model using the same pattern:

models:
  - type: llama_guard
    engine: openai
    model: meta-llama/LlamaGuard-7b
    parameters:
      base_url: http://localhost:5000/v1
      api_key: EMPTY

When self-hosted vLLM does not enforce authentication, set parameters.api_key to a non-empty placeholder such as EMPTY. If your deployment requires a real token, replace parameters.api_key with the literal token, or omit it and set api_key_env_var at the top level of the model entry, not inside parameters:

- type: main
  engine: openai
  model: meta-llama/Llama-3.1-8B-Instruct
  api_key_env_var: MY_VLLM_API_KEY
  parameters:
    base_url: http://localhost:5000/v1

Set the referenced environment variable before calling RailsConfig.from_content or RailsConfig.from_path. Otherwise, config loading fails with the error "Model API Key environment variable 'X' not set.", where X is the name of the missing variable. A Pydantic validator on the model schema performs this check eagerly.
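
Since the check happens at load time, it can be convenient to report every missing variable at once before loading the configuration. The following sketch is a hypothetical preflight helper that works on the models list as plain dicts. It treats an empty value as unset, which is a choice made for this sketch and may not match the library's validator.

```python
import os
from typing import Mapping


def unset_api_key_vars(models: list[dict], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the api_key_env_var names that are missing or empty in env."""
    missing = []
    for entry in models:
        var = entry.get("api_key_env_var")
        if var and not env.get(var):
            missing.append(var)
    return missing


models = [
    {
        "type": "main",
        "engine": "openai",
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "api_key_env_var": "MY_VLLM_API_KEY",
        "parameters": {"base_url": "http://localhost:5000/v1"},
    }
]

# Run this before RailsConfig.from_path(...) or RailsConfig.from_content(...).
missing = unset_api_key_vars(models)
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```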

The legacy engine: vllm_openai with parameters.openai_api_base form is only needed when running under NEMOGUARDRAILS_LLM_FRAMEWORK=langchain. For new configurations, prefer the form above.

Other OpenAI-Compatible Endpoints

The same engine: openai plus parameters.base_url pattern works for any provider whose wire protocol is OpenAI-compatible. Examples include OpenRouter, Together.ai, Fireworks.ai, Groq, DeepSeek’s hosted API at https://api.deepseek.com/v1, TGI deployments that expose /v1/chat/completions, and the llama.cpp server with --api. Provide parameters.base_url and either parameters.api_key or a top-level api_key_env_var.
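
The pattern above can be captured in a small builder that produces a model entry with the same structure as the YAML. This is a sketch under assumptions: the function openai_compatible_entry is hypothetical, the model name and environment variable name are placeholders, and requiring exactly one credential source is a choice made for this sketch.

```python
from typing import Optional


def openai_compatible_entry(
    model: str,
    base_url: str,
    *,
    api_key: Optional[str] = None,
    api_key_env_var: Optional[str] = None,
    model_type: str = "main",
) -> dict:
    """Build a models-list entry for an OpenAI-compatible endpoint."""
    if (api_key is None) == (api_key_env_var is None):
        raise ValueError("Provide exactly one of api_key or api_key_env_var.")
    entry = {
        "type": model_type,
        "engine": "openai",
        "model": model,
        "parameters": {"base_url": base_url},
    }
    if api_key is not None:
        # A literal key lives inside the parameters block.
        entry["parameters"]["api_key"] = api_key
    else:
        # The environment variable reference sits at the top level of the entry.
        entry["api_key_env_var"] = api_key_env_var
    return entry


entry = openai_compatible_entry(
    "deepseek-chat",  # placeholder model name, not taken from this guide
    "https://api.deepseek.com/v1",
    api_key_env_var="DEEPSEEK_API_KEY",  # hypothetical variable name
)
print(entry)
```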

Google Vertex AI

The following example shows how to configure the Google Vertex AI model as the main application LLM:

models:
  - type: main
    engine: vertexai
    model: gemini-1.0-pro

Vertex AI’s API is not OpenAI-compatible, so this engine is opt-in. Set NEMOGUARDRAILS_LLM_FRAMEWORK=langchain and install langchain-google-vertexai. For background, refer to Migrating to 0.22.

Complete Example

The following example shows how to configure the main application LLM, embeddings model, and a dedicated NemoGuard model for input and output checking:

models:
  # Main application LLM
  - type: main
    engine: nim
    model: meta/llama-3.1-70b-instruct
    parameters:
      temperature: 0.7
      max_tokens: 2000

  # Embeddings for knowledge base
  - type: embeddings
    engine: FastEmbed
    model: all-MiniLM-L6-v2

  # Dedicated model for input checking
  - type: self_check_input
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety

  # Dedicated model for output checking
  - type: self_check_output
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety

Model Parameters

Pass additional parameters to the underlying LLM client. For engines served by the built-in client, such as any OpenAI-compatible endpoint, the runtime forwards parameters to the OpenAI-compatible HTTP request. Examples include temperature, max_tokens, base_url, api_key, default_query, and default_headers. For LangChain engines, parameters follow the conventions of the underlying LangChain class.

models:
  - type: main
    engine: openai
    model: gpt-4
    parameters:
      temperature: 0.7
      max_tokens: 1000
      top_p: 0.9
8       top_p: 0.9

Common parameters vary by provider. For built-in engines, see the OpenAI-compatible client options. For LangChain engines, refer to the corresponding LangChain provider documentation.
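
The parameter names listed in this section fall into two groups in the OpenAI Python SDK: options that configure the client (base_url, api_key, default_query, default_headers) and options that travel with each request (such as temperature, max_tokens, and top_p). The sketch below illustrates that split. It is an assumption about how to think of the parameters block, not a description of the runtime's actual code, and split_parameters is a hypothetical helper.

```python
# Client-level option names, following OpenAI Python SDK conventions.
CLIENT_OPTIONS = {"base_url", "api_key", "default_query", "default_headers"}


def split_parameters(parameters: dict) -> tuple[dict, dict]:
    """Split a parameters block into (client options, per-request options)."""
    client = {k: v for k, v in parameters.items() if k in CLIENT_OPTIONS}
    request = {k: v for k, v in parameters.items() if k not in CLIENT_OPTIONS}
    return client, request


client_opts, request_opts = split_parameters(
    {
        "base_url": "http://localhost:5000/v1",
        "api_key": "EMPTY",
        "temperature": 0.7,
        "max_tokens": 1000,
        "top_p": 0.9,
    }
)
print(client_opts)
print(request_opts)
```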