Change the LLM or Embedding Model for NVIDIA RAG Blueprint#
You can change the LLM or embedding models for the NVIDIA RAG Blueprint by using the following procedures.
Tip
To navigate this page more easily, click the outline button at the top of the page. 
For NVIDIA-Hosted Microservices#
Change the LLM Model#
To change the inference model to a model from the API catalog,
specify the model in the APP_LLM_MODELNAME environment variable when you start the RAG Server.
export APP_LLM_MODELNAME='nvidia/llama-3.3-nemotron-super-49b-v1.5'
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
To get a list of valid model names, use one of the following methods:
Browse the models at https://build.nvidia.com/. View the sample Python code and get the model name from the
modelargument to theclient.chat.completions.createmethod.
Tip
Follow steps in For Helm Deployments to change the inference model for Helm charts.
Model-Specific Configuration#
Nemotron-3-Nano-30B#
The nemotron-3-nano-30b model has different naming conventions depending on the deployment method:
Deployment Type |
Model Name |
|---|---|
NVIDIA-hosted (build.nvidia.com) |
|
Self-hosted / Local NIM |
|
Both names refer to the same underlying model. Use the appropriate name based on your deployment type.
Nemotron 3 Super#
Nemotron 3 Super is a larger model with different GPU and environment requirements: local NIM deployment requires at least 2 GPUs (FP8 TP2), and you may need a dedicated prompt config and reasoning settings. For full deployment steps (Docker and Helm), see the Nemotron 3 Super deployment guide.
Change the Embedding Model#
To change the embedding model to a model from the API catalog,
specify the model in the APP_EMBEDDINGS_MODELNAME environment variable when you start the RAG server.
The following example uses the NVIDIA Embed QA 4 model.
export APP_EMBEDDINGS_MODELNAME='NV-Embed-QA'
export APP_RANKING_MODELNAME='NV-Embed-QA'
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build
As an alternative you can also specify the model names at runtime using /generate API call. Refer to the Generate Answer Endpoint and Document Search Endpoint payload schema in this notebook.
To get a list of valid model names, use one of the following methods:
Browse the models at https://build.nvidia.com/explore/retrieval. View the sample Python code and get the model name from the
modelargument to theclient.embeddings.createmethod.Install the langchain-nvidia-ai-endpoints Python package from PyPi. Use the
get_available_models()method to on an instance of anNVIDIAEmbeddingsobject to list the models. Refer to the package web page for sample code to list the models.
Tip
Always use same embedding model or model having same tokinizers for both ingestion and retrieval to yield good accuracy.
Configure Embedding Dimensions#
The default embedding model (nvidia/llama-nemotron-embed-1b-v2) uses 2048 dimensions by default. When changing to a different embedding model, you may need to update the dimensions to match the model’s output.
Important: Some embedding models have fixed output dimensions and do not accept a dimensions parameter. For example, nvidia/nv-embedqa-e5-v5 always outputs 1024-dimensional embeddings. If you use such a model without configuring the dimensions, you may encounter an error like:
This model does not support 'dimensions', but a value of '2048' was provided.
Configure via Environment Variable#
export APP_EMBEDDINGS_DIMENSIONS=1024 # Match your model's output dimensions
export APP_EMBEDDINGS_MODELNAME='nvidia/nv-embedqa-e5-v5'
Configure via config.yaml (Library Mode)#
embeddings:
model_name: "nvidia/nv-embedqa-e5-v5"
dimensions: 1024 # Must match the model's output dimensions
server_url: "https://integrate.api.nvidia.com/v1"
Warning
Ingestion and retrieval must use the same embedding model and dimensions. If you change the embedding model or dimensions after ingesting documents, you must re-ingest your documents to the vector database for accurate retrieval results.
Note
When using models from different providers (e.g., NVIDIA for LLM, Azure OpenAI for embeddings), you can configure service-specific API keys. See Service-Specific API Keys for details.
For Self-Hosted On Premises Microservices#
You can specify the model for NVIDIA NIM containers to use in the nims.yaml file.
Edit the
deploy/nims.yamlfile and specify an image that includes the model to deploy.services: nim-llm: container_name: nim-llm-ms image: nvcr.io/nim/<image>:<tag> ... nemotron-embedding-ms: container_name: nemotron-embedding-ms image: nvcr.io/nim/<image>:<tag> nemotron-ranking-ms: container_name: nemotron-ranking-ms image: nvcr.io/nim/<image>:<tag>
To get a list of valid model names, use one of the following methods:
Run
ngc registry image list "nim/*".Browse the NGC catalog at https://catalog.ngc.nvidia.com/containers.
Update the corresponding model names using environment variables as required.
export APP_LLM_MODELNAME=<> export APP_RANKING_MODELNAME=<> export APP_EMBEDDINGS_MODELNAME=<>
Follow the steps specified here to relaunch the containers with the updated models. Make sure to specify the correct model names using appropriate environment variables as shown in the earlier step.
For Helm Deployments#
Use this procedure to change models when you are running self-hosted NVIDIA NIM microservices. The Helm values map directly to the Docker Compose/self-hosted settings.
List the available models and images by running the following code. You can browse the NGC catalog at https://catalog.ngc.nvidia.com/containers to learn about the available models.
ngc registry image list "nim/*"
Set the model names and service URLs used by
rag-serverdeployment in values.yaml.# rag-server runtime configuration envVars: # === LLM === APP_LLM_MODELNAME: "<llm-model-name>" # Use the in-cluster NIM LLM service; if empty, NVIDIA-hosted API is used APP_LLM_SERVERURL: "nim-llm:8000" # === Embeddings === APP_EMBEDDINGS_MODELNAME: "<embedding-model-name>" APP_EMBEDDINGS_SERVERURL: "nemotron-embedding-ms:8000/v1" # === Reranker === APP_RANKING_MODELNAME: "<reranker-model-name>" APP_RANKING_SERVERURL: "nemotron-ranking-ms:8000"
Configure the NIM microservices that host those models. Replace
<image>:<tag>with the image you selected (formatnvcr.io/nim/<image>:<tag>) in values.yaml.# LLM NIM nimOperator: nim-llm: enabled: true replicas: 1 service: name: "nim-llm" image: # nvcr.io/nim/<image>:<tag> repository: nvcr.io/nim/<image> tag: "<tag>" pullPolicy: IfNotPresent resources: limits: nvidia.com/gpu: 1 requests: nvidia.com/gpu: 1 model: engine: tensorrt_llm env: - name: NIM_HTTP_API_PORT value: "8000" - name: NIM_TRITON_LOG_VERBOSE value: "1" - name: NIM_SERVED_MODEL_NAME value: "<llm-model-name>" # Must match APP_LLM_MODELNAME # Embedding NIM nvidia-nim-llama-32-nv-embedqa-1b-v2: enabled: true replicas: 1 service: name: "nemotron-embedding-ms" image: # nvcr.io/nim/<image>:<tag> repository: nvcr.io/nim/<image> tag: "<tag>" pullPolicy: IfNotPresent resources: limits: nvidia.com/gpu: 1 requests: nvidia.com/gpu: 1 env: - name: NIM_HTTP_API_PORT value: "8000" - name: NIM_TRITON_LOG_VERBOSE value: "1" # Reranker NIM nvidia-nim-llama-32-nv-rerankqa-1b-v2: enabled: true replicas: 1 service: name: "nemotron-ranking-ms" image: # nvcr.io/nim/<image>:<tag> repository: nvcr.io/nim/<image> tag: "<tag>" pullPolicy: IfNotPresent resources: limits: nvidia.com/gpu: 1 requests: nvidia.com/gpu: 1 env: []
Nemotron Nano Models (Thinking budget LLMs) – vLLM profile
For these Thinking budget LLMs, only the vLLM profile is supported on H100 and RTX GPUs (for example, RTX 6000 Pro).
GPU
Model
Supported profile
H100, RTX 6000 Pro
nvidia/nvidia-nemotron-nano-9b-v2
vllm
H100, RTX 6000 Pro
nvidia/nemotron-3-nano
vllm
Note
If only the vLLM profile is available
When only a vLLM profile is available for a model, such as on H100 and RTX GPUs, you must use the vLLM engine. First run the list-model-profiles command to confirm which profiles are available and then apply the following configurations. For Nemotron Nano Models VLLM profile
When deploying
nvidia/nvidia-nemotron-nano-9b-v2ornvidia/nemotron-3-nano, check iftensorrt_llmprofile is available using below command for your required model.# Change model name as needed USERID=$(id -u) docker run --rm --gpus all \ nvcr.io/nim/nvidia/nvidia-nemotron-nano-9b-v2:latest \ list-model-profiles
If only
vllmprofile is available, you must use the vLLM engine and add these specific configurations:nimOperator: nim-llm: image: repository: nvcr.io/nim/nvidia/nvidia-nemotron-nano-9b-v2 tag: "latest" model: engine: vllm # Required: use vLLM instead of tensorrt_llm env: - name: NIM_SERVED_MODEL_NAME value: "nvidia/nvidia-nemotron-nano-9b-v2" # Must match APP_LLM_MODELNAME # ... other env vars ...
Ensure
APP_LLM_MODELNAMEin therag-serversection matchesNIM_SERVED_MODEL_NAME.After you modify the
values.yamlfile, apply the changes described in Change a Deployment.