LLMs#
Large language models (LLMs) are deep learning algorithms that can recognize, summarize, translate, predict, and generate content using very large datasets.
Supported LLM Providers#
NVIDIA NeMo Agent Toolkit supports the following LLM providers:
| Provider | Type | Description |
|---|---|---|
| NVIDIA NIM | `nim` | NVIDIA Inference Microservice (NIM) |
| OpenAI | `openai` | OpenAI API |
| AWS Bedrock | `aws_bedrock` | AWS Bedrock API |
| Azure OpenAI | `azure_openai` | Azure OpenAI API |
| LiteLLM | `litellm` | LiteLLM API |
| Hugging Face | `huggingface` | Hugging Face API |
| Hugging Face Inference | `huggingface_inference` | Hugging Face Inference API, Endpoints, and TGI |
LLM Configuration#
The LLM configuration is defined in the `llms` section of the workflow configuration file. The `_type` value selects the LLM provider, and the `model_name` value refers to the name of the model to use (Azure OpenAI uses `azure_deployment` instead).
```yaml
llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
  openai_llm:
    _type: openai
    model_name: gpt-4o-mini
  aws_bedrock_llm:
    _type: aws_bedrock
    model_name: meta/llama-3.1-70b-instruct
    region_name: us-east-1
  azure_openai_llm:
    _type: azure_openai
    azure_deployment: gpt-4o-mini
  litellm_llm:
    _type: litellm
    model_name: gpt-4o
  huggingface_llm:
    _type: huggingface
    model_name: Qwen/Qwen3Guard-Gen-0.6B
```
NVIDIA NIM#
You can use the following environment variables to configure the NVIDIA NIM LLM provider:
- `NVIDIA_API_KEY` - The API key to access NVIDIA NIM resources

The NIM LLM provider is defined by the `NIMModelConfig` class.

- `model_name` - The name of the model to use
- `temperature` - The temperature to use for the model
- `top_p` - The top-p value to use for the model
- `max_tokens` - The maximum number of tokens to generate
- `api_key` - The API key to use for the model
- `base_url` - The base URL to use for the model
- `max_retries` - The maximum number of retries for the request
Note
`temperature` and `top_p` are model-gated fields and may not be supported by all models. If unsupported and explicitly set, validation will fail. See Gated Fields for details.
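As a sketch, a NIM configuration that exercises the optional fields might look like the following. The values are illustrative, and `temperature`/`top_p` should be omitted for models that gate them:

```yaml
llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    temperature: 0.2      # omit if the model gates this field
    top_p: 0.9            # omit if the model gates this field
    max_tokens: 1024
    max_retries: 3
```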
OpenAI#
You can use the following environment variables to configure the OpenAI LLM provider:
- `OPENAI_API_KEY` - The API key to access OpenAI resources

The OpenAI LLM provider is defined by the `OpenAIModelConfig` class.

- `model_name` - The name of the model to use
- `temperature` - The temperature to use for the model
- `top_p` - The top-p value to use for the model
- `max_tokens` - The maximum number of tokens to generate
- `seed` - The seed to use for the model
- `api_key` - The API key to use for the model
- `base_url` - The base URL to use for the model
- `max_retries` - The maximum number of retries for the request
- `request_timeout` - HTTP request timeout in seconds
Note
`temperature` and `top_p` are model-gated fields and may not be supported by all models. If unsupported and explicitly set, validation will fail. See Gated Fields for details.
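A sketch of an OpenAI configuration using the optional fields (the values are illustrative, and `API_KEY` here assumes the toolkit's `${VAR}` environment-variable interpolation shown in later examples):

```yaml
llms:
  openai_llm:
    _type: openai
    model_name: gpt-4o-mini
    api_key: ${OPENAI_API_KEY}
    temperature: 0.0
    seed: 42              # for more reproducible sampling
    max_tokens: 2048
    request_timeout: 60
```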
AWS Bedrock#
The AWS Bedrock LLM provider is defined by the `AWSBedrockModelConfig` class.

- `model_name` - The name of the model to use
- `temperature` - The temperature to use for the model
- `top_p` - The top-p value to use for the model. This field is ignored for LlamaIndex.
- `max_tokens` - The maximum number of tokens to generate
- `context_size` - The maximum number of tokens available for input. This is only required for LlamaIndex and is ignored for LangChain/LangGraph.
- `region_name` - The region to use for the model
- `base_url` - The base URL to use for the model
- `credentials_profile_name` - The credentials profile name to use for the model
- `max_retries` - The maximum number of retries for the request
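A sketch of an AWS Bedrock configuration combining these fields (the model, region, and profile name are illustrative):

```yaml
llms:
  aws_bedrock_llm:
    _type: aws_bedrock
    model_name: meta/llama-3.1-70b-instruct
    region_name: us-east-1
    credentials_profile_name: default   # illustrative AWS profile name
    max_tokens: 1024
    context_size: 8192    # required for LlamaIndex only; ignored for LangChain/LangGraph
```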
Azure OpenAI#
You can use the following environment variables to configure the Azure OpenAI LLM provider:
- `AZURE_OPENAI_API_KEY` - The API key to access Azure OpenAI resources
- `AZURE_OPENAI_ENDPOINT` - The Azure OpenAI endpoint to access Azure OpenAI resources

The Azure OpenAI LLM provider is defined by the `AzureOpenAIModelConfig` class.

- `api_key` - The API key to use for the model
- `api_version` - The API version to use for the model
- `azure_endpoint` - The Azure OpenAI endpoint to use for the model
- `azure_deployment` - The name of the Azure OpenAI deployment to use
- `temperature` - The temperature to use for the model
- `top_p` - The top-p value to use for the model
- `seed` - The seed to use for the model
- `max_retries` - The maximum number of retries for the request
- `request_timeout` - HTTP request timeout in seconds
Note
`temperature` is model-gated and may not be supported by all models. See Gated Fields for details.
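A sketch of an Azure OpenAI configuration (the endpoint and API version are illustrative placeholders; the endpoint can also be supplied via `AZURE_OPENAI_ENDPOINT` instead):

```yaml
llms:
  azure_openai_llm:
    _type: azure_openai
    azure_endpoint: https://your-resource.openai.azure.com   # placeholder; or set AZURE_OPENAI_ENDPOINT
    azure_deployment: gpt-4o-mini
    api_version: 2024-06-01   # illustrative version string; use one your resource supports
    temperature: 0.0          # omit if the model gates this field
    max_retries: 3
```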
LiteLLM#
LiteLLM is a general-purpose LLM provider that can be used with any model provider supported by LiteLLM. See the LiteLLM provider documentation for more information on how to use LiteLLM.

The LiteLLM LLM provider is defined by the `LiteLlmModelConfig` class.

- `model_name` - The name of the model to use (dependent on the model provider)
- `api_key` - The API key to use for the model (dependent on the model provider)
- `base_url` - The base URL to use for the model
- `seed` - The seed to use for the model
- `temperature` - The temperature to use for the model
- `top_p` - The top-p value to use for the model
- `max_retries` - The maximum number of retries for the request
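A sketch of a LiteLLM configuration (the model identifier and proxy URL are illustrative; both depend on which underlying model provider you route through):

```yaml
llms:
  litellm_llm:
    _type: litellm
    model_name: gpt-4o                           # provider-dependent model identifier
    base_url: https://your-proxy.example.com     # placeholder, e.g. a LiteLLM proxy; optional
    temperature: 0.0
    seed: 42
    max_retries: 3
```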
Hugging Face#
Hugging Face is a general-purpose LLM provider that can be used with any model supported by the Hugging Face API. See the Hugging Face documentation for more information.
The Hugging Face LLM provider is defined by the `HuggingFaceConfig` class.

- `model_name` - The Hugging Face model name or path (for example, `Qwen/Qwen3Guard-Gen-0.6B`)
- `device` - Device for model execution: `cpu`, `cuda`, `cuda:0`, or `auto` (default: `auto`)
- `dtype` - Torch data type: `float16`, `bfloat16`, `float32`, or `auto` (default: `auto`)
- `max_new_tokens` - Maximum number of new tokens to generate (default: `128`)
- `temperature` - Sampling temperature (default: `0.0`)
- `trust_remote_code` - Whether to trust remote code when loading the model (default: `false`)
Note
Hugging Face is a built-in NeMo Agent Toolkit LLM provider, but it requires extra dependencies to run. Install them with:

```bash
pip install "transformers[torch,accelerate]~=4.57"
```
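A sketch of a local Hugging Face configuration combining these fields (the device and dtype choices are illustrative; `auto` is the default for both):

```yaml
llms:
  huggingface_llm:
    _type: huggingface
    model_name: Qwen/Qwen3Guard-Gen-0.6B
    device: cuda            # or cpu, cuda:0, auto
    dtype: bfloat16         # or float16, float32, auto
    max_new_tokens: 256
    trust_remote_code: false
```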
Hugging Face Inference#
Hugging Face Inference is an LLM provider for remote model inference via the Hugging Face Serverless Inference API, Dedicated Inference Endpoints, or self-hosted TGI servers.
You can use the following environment variables to configure the Hugging Face Inference LLM provider:
- `HF_TOKEN` - The API token to access Hugging Face Inference resources

The Hugging Face Inference LLM provider is defined by the `HuggingFaceInferenceLLMConfig` class.

- `model_name` - The Hugging Face model identifier (for example, `meta-llama/Llama-3.2-8B-Instruct`)
- `api_key` - The Hugging Face API token for authentication
- `endpoint_url` - Custom endpoint URL for Inference Endpoints or self-hosted TGI servers. If not provided, the Serverless API is used
- `max_new_tokens` - Maximum number of new tokens to generate (default: `512`)
- `temperature` - Sampling temperature (default: `0.7`)
- `top_p` - Top-p (nucleus) sampling parameter
- `top_k` - Top-k sampling parameter
- `repetition_penalty` - Penalty for repeating tokens
- `seed` - Random seed for reproducible generation
- `timeout` - Request timeout in seconds (default: `120.0`)
```yaml
llms:
  # Serverless Inference API
  serverless_llm:
    _type: huggingface_inference
    model_name: meta-llama/Llama-3.2-8B-Instruct
    api_key: ${HF_TOKEN}
    max_new_tokens: 512
    temperature: 0.7

  # Dedicated Inference Endpoint
  endpoint_llm:
    _type: huggingface_inference
    model_name: your-model-name
    api_key: ${HF_TOKEN}
    endpoint_url: https://your-endpoint.endpoints.huggingface.cloud

  # Self-hosted TGI server
  tgi_llm:
    _type: huggingface_inference
    model_name: local-model
    endpoint_url: http://localhost:8080
```
NVIDIA Dynamo (experimental)#
Dynamo is an inference-engine-agnostic LLM provider designed to optimize KV cache reuse for LLMs served on NVIDIA hardware. See the ai-dynamo repository for instructions on how to use Dynamo.

The Dynamo LLM provider is defined by the `DynamoModelConfig` class. The provider mirrors the implementation of the OpenAI provider, with additional prefix hints for Dynamo inference optimizations.

- `model_name` - The name of the model to use
- `temperature` - The temperature to use for the model
- `top_p` - The top-p value to use for the model
- `max_tokens` - The maximum number of tokens to generate
- `seed` - The seed to use for the model
- `api_key` - The API key to use for the model
- `base_url` - The base URL to use for the model
- `max_retries` - The maximum number of retries for the request
- `prefix_template` - A template for conversation prefix IDs. Setting it to `null` disables use of `prefix_template`, `prefix_total_requests`, `prefix_osl`, and `prefix_iat`
- `prefix_total_requests` - Expected number of requests for this conversation
- `prefix_osl` - Output sequence length for the Dynamo router
- `prefix_iat` - Inter-arrival time hint for the Dynamo router
- `request_timeout` - HTTP request timeout in seconds for Dynamo LLM requests
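A sketch of a Dynamo configuration, assuming the provider is registered under the `dynamo` type (the base URL and prefix template are illustrative placeholders, not values from this document):

```yaml
llms:
  dynamo_llm:
    _type: dynamo                               # assumed type name for DynamoModelConfig
    model_name: meta/llama-3.1-70b-instruct     # illustrative model
    base_url: http://localhost:8000/v1          # placeholder Dynamo-served endpoint
    prefix_template: "conv-{conversation_id}"   # hypothetical template; set to null to disable prefix hints
    prefix_total_requests: 10
    prefix_osl: 512
    prefix_iat: 1.0
    request_timeout: 60
```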
Testing Provider#
nat_test_llm#
`nat_test_llm` is a development and testing provider intended for examples and CI. It is not intended for production use.
Installation:
```bash
uv pip install nvidia-nat-test
```

- Purpose: Deterministic cycling responses for quick validation
- Not for production
Minimal YAML example with `chat_completion`:
```yaml
llms:
  main:
    _type: nat_test_llm
    response_seq: [alpha, beta, gamma]
    delay_ms: 0

workflow:
  _type: chat_completion
  llm_name: main
  system_prompt: "Say only the answer."
```
Learn how to add your own LLM provider: Adding an LLM Provider
See a short tutorial using YAML and `nat_test_llm`: Test with nat_test_llm