Multimodal Query Support for NVIDIA RAG Blueprint#
The multimodal query feature in the NVIDIA RAG Blueprint enables you to query your knowledge base using both text and images. This is particularly useful for use cases where visual context enhances the query, such as:
Product identification: “What is the price of this item?” + product image
Document lookup: “Find documents related to this chart” + chart image
Visual Q&A: “What material is this made of?” + product image
This feature combines:
VLM Embeddings: nvidia/llama-nemotron-embed-vl-1b-v2 for creating multimodal embeddings that understand both text and images
Vision-Language Model: nvidia/nemotron-3-nano-omni-30b-a3b-reasoning for generating intelligent responses based on visual and textual context
Prerequisites#
Before enabling multimodal query support, ensure you have:
An NVIDIA H100 or A100 GPU for on-prem deployments
Self-Hosted (On-Prem) Deployment#
Use this section to deploy multimodal query support with locally hosted NVIDIA NIMs.
1. Start the Vector Database#
Start the Milvus vector database service:
docker compose -f deploy/compose/vectordb.yaml up -d
2. Deploy the Ingestion and VLM RAG NIMs#
Set your NGC API key (replace with your actual key):
export NGC_API_KEY="nvapi-..."
Then run the deployment commands:
# Create the model cache directory
mkdir -p ~/.cache/model-cache
export MODEL_DIRECTORY=~/.cache/model-cache
# (Optional) Select a specific GPU for the VLM Microservice
# Use `nvidia-smi` to check available GPUs and set the desired GPU ID
export VLM_MS_GPU_ID=1 # Default is GPU 5; this example uses GPU 1 instead
# Deploy ingestion NIMs plus the VLM RAG NIMs.
USERID=$(id -u) docker compose --profile ingest --profile vlm-rag -f deploy/compose/nims.yaml up -d
Warning
The first deployment may take 10-20 minutes as models download (~10GB+). Subsequent deployments will be faster as models are cached.
Monitor the deployment status:
watch -n 5 'docker ps --format "table {{.Names}}\t{{.Status}}"'
Wait until the services show as healthy before continuing.
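If you prefer a scripted check over watching the terminal, the status column can be parsed programmatically. The following is a minimal sketch, not part of the blueprint: the function names are hypothetical, and it assumes the tab-separated (non-table) form of the same `docker ps --format` template, which prints no header row.

```python
import subprocess


def parse_docker_status(output: str) -> dict[str, str]:
    """Map container name -> status from tab-separated `docker ps` output."""
    statuses = {}
    for line in output.strip().splitlines():
        name, _, status = line.partition("\t")
        if name and name != "NAMES":  # tolerate a header row if one is present
            statuses[name.strip()] = status.strip()
    return statuses


def not_ready(statuses: dict[str, str]) -> list[str]:
    """Containers that are not up, or whose health check is not yet 'healthy'.

    Containers without a health check (no parenthesized suffix) count as ready
    once they are up.
    """
    pending = []
    for name, status in statuses.items():
        is_up = status.startswith("Up")
        reports_health = "(" in status
        is_healthy = "(healthy)" in status
        if not is_up or (reports_health and not is_healthy):
            pending.append(name)
    return sorted(pending)


def current_statuses() -> dict[str, str]:
    """Run `docker ps` and parse its output (requires the Docker CLI)."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_docker_status(result.stdout)
```

Calling `not_ready(current_statuses())` in a loop until it returns an empty list approximates the manual wait. Note that `docker ps` lists only running containers, so a container that has exited will not appear at all.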
3. Configure Environment Variables#
Set the model names and service URLs for the RAG pipeline:
# VLM (Vision-Language Model) configuration
export APP_VLM_MODELNAME="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export APP_VLM_SERVERURL="http://vlm-ms:8000/v1"
export APP_LLM_SERVERURL=""
# Optional: use the same VLM for document summaries when no LLM NIM is running.
# You can also point SUMMARY_LLM* to a separate LLM or NVIDIA-hosted endpoint.
export SUMMARY_LLM="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export SUMMARY_LLM_SERVERURL="http://vlm-ms:8000/v1"
# Multimodal embedding model configuration
export APP_EMBEDDINGS_MODELNAME="nvidia/llama-nemotron-embed-vl-1b-v2"
export APP_EMBEDDINGS_SERVERURL="nemotron-vlm-embedding-ms:8000/v1"
export ENABLE_VLM_INFERENCE="true"
export VLM_TO_LLM_FALLBACK="false"
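Because these variables are exported by hand, a typo or a missed line is easy to overlook. The sketch below checks an environment mapping against the settings listed in this step. It is an illustration only: `check_multimodal_env` is a hypothetical helper, and the rules reflect this guide's values rather than any validation the blueprint itself performs.

```python
from typing import Mapping

# Variables this step sets to a non-empty value.
NON_EMPTY = (
    "APP_VLM_MODELNAME",
    "APP_VLM_SERVERURL",
    "APP_EMBEDDINGS_MODELNAME",
    "APP_EMBEDDINGS_SERVERURL",
)


def check_multimodal_env(env: Mapping[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = [f"{name} is unset or empty" for name in NON_EMPTY if not env.get(name)]

    if env.get("ENABLE_VLM_INFERENCE", "").lower() != "true":
        problems.append("ENABLE_VLM_INFERENCE is not set to true")

    # The summary settings are optional, but only make sense as a pair.
    if bool(env.get("SUMMARY_LLM")) != bool(env.get("SUMMARY_LLM_SERVERURL")):
        problems.append("SUMMARY_LLM and SUMMARY_LLM_SERVERURL should be set together")

    return problems
```

To check the current shell environment, pass `os.environ` (after `import os`) and print any returned messages before starting the servers.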
4. Configure Image Extraction for Ingestion#
Enable image extraction and storage during document ingestion:
# Configure image extraction
export APP_NVINGEST_STRUCTURED_ELEMENTS_MODALITY=""
export APP_NVINGEST_IMAGE_ELEMENTS_MODALITY="image"
export APP_NVINGEST_EXTRACTIMAGES="True"
# Disable reranker for image-query requests. Image queries use the multimodal
# vector retrieval path directly and bypass reranking.
export ENABLE_RERANKER="false"
export APP_RANKING_SERVERURL=""
5. Start the Ingestor Server#
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d --build
Verify that the ingestor server is healthy.
6. Start the RAG Server#
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build
Verify that the RAG server is healthy.
7. Verify All Services Are Running#
Check the status of all deployed containers:
docker ps --format "table {{.Names}}\t{{.Status}}"
Confirm that all containers are running and healthy.
NVIDIA-Hosted (Cloud) Deployment#
Use this section to deploy multimodal query support using NVIDIA-hosted API endpoints.
Note
When using NVIDIA-hosted endpoints, you might encounter rate limiting with larger file ingestions (>10 files). For details, see Troubleshoot.
1. Start the Vector Database#
docker compose -f deploy/compose/vectordb.yaml up -d
2. Configure Environment Variables#
a. Open deploy/compose/.env and uncomment the section Endpoints for using cloud NIMs. Then set the environment variables by running the following code.#
source deploy/compose/.env
b. Set the environment variables to use NVIDIA-hosted endpoints for VLM models:#
Set your NGC API key (replace with your actual key):
export NGC_API_KEY="nvapi-..."
Then set the VLM configuration:
# VLM (Vision-Language Model) configuration - cloud hosted
export APP_VLM_MODELNAME="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export APP_VLM_SERVERURL="https://integrate.api.nvidia.com/v1"
export APP_LLM_SERVERURL=""
# Optional: use the same NVIDIA-hosted VLM for document summaries.
# You can also leave SUMMARY_LLM* pointing at another supported summarizer.
export SUMMARY_LLM="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export SUMMARY_LLM_SERVERURL="https://integrate.api.nvidia.com/v1"
# Multimodal embedding model configuration - cloud hosted
export APP_EMBEDDINGS_MODELNAME="nvidia/llama-nemotron-embed-vl-1b-v2"
export APP_EMBEDDINGS_SERVERURL="https://integrate.api.nvidia.com/v1"
export ENABLE_VLM_INFERENCE="true"
export VLM_TO_LLM_FALLBACK="false"
3. Configure Image Extraction for Ingestion#
# Configure image extraction
export APP_NVINGEST_STRUCTURED_ELEMENTS_MODALITY=""
export APP_NVINGEST_IMAGE_ELEMENTS_MODALITY="image"
export APP_NVINGEST_EXTRACTIMAGES="True"
# Disable reranker (not supported with multimodal queries)
export ENABLE_RERANKER="false"
export APP_RANKING_SERVERURL=""
4. Start the Ingestor Server#
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d --build
Verify that the ingestor server is healthy.
5. Start the RAG Server#
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build
Verify that the RAG server is healthy.
6. Verify All Services Are Running#
Check the status of all deployed containers:
docker ps --format "table {{.Names}}\t{{.Status}}"
You should see output similar to the following:
NAMES                            STATUS
compose-nv-ingest-ms-runtime-1   Up 5 minutes (healthy)
ingestor-server                  Up 5 minutes
compose-redis-1                  Up 5 minutes
rag-frontend                     Up 9 minutes
rag-server                       Up 9 minutes
elasticsearch                    Up 36 minutes (healthy)
seaweedfs                        Up 35 minutes (healthy)
Helm Chart Deployment#
Use this section to deploy multimodal query support on Kubernetes using Helm charts.
Note
This configuration disables the default LLM NIM and text embedding NIM and replaces them with the VLM NIM and VLM embedding NIM. The GPU resources previously used by the disabled services become available to the VLM services. If MIG slicing is enabled on the cluster, make sure to assign a dedicated slice to the VLM. See mig-deployment.md for more information.
1. Modify values.yaml#
Modify values.yaml to enable multimodal query support:
# Multimodal Query Configuration
# This replaces the default LLM and text embedding NIMs with VLM variants

# Enable VLM NIM for multimodal generation
nim-vlm:
  enabled: true

# Enable VLM embedding NIM for multimodal embeddings
nvidia-nim-llama-nemotron-embed-vl-1b-v2:
  enabled: true
  image:
    repository: nvcr.io/nim/nvidia/llama-nemotron-embed-vl-1b-v2
    tag: "1.12.0"

# Optional: disable the default text embedding NIM
nvidia-nim-llama-nemotron-embed-1b-v2:
  enabled: false

# Disable LLM NIM (VLM handles generation)
nim-llm:
  enabled: false

# Enable dedicated VLM captioning NIM (image-cap model changed after RC1)
nimOperator:
  nim-vlm-captioning:
    enabled: true

# Configure environment variables
envVars:
  # VLM inference settings
  ENABLE_VLM_INFERENCE: "true"
  VLM_TO_LLM_FALLBACK: "false"
  APP_VLM_MODELNAME: "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
  APP_VLM_SERVERURL: "http://nim-vlm:8000/v1"
  # VLM embedding settings
  APP_EMBEDDINGS_SERVERURL: "nemotron-vlm-embedding-ms:8000/v1"
  APP_EMBEDDINGS_MODELNAME: "nvidia/llama-nemotron-embed-vl-1b-v2"
  # Disable reranker (not supported with multimodal queries)
  ENABLE_RERANKER: "False"
  APP_RANKING_SERVERURL: ""

ingestor-server:
  envVars:
    # Image extraction settings
    APP_NVINGEST_STRUCTURED_ELEMENTS_MODALITY: ""
    APP_NVINGEST_IMAGE_ELEMENTS_MODALITY: "image"
    APP_NVINGEST_EXTRACTIMAGES: "True"
    # Summary generation settings.
    # Required for generate_summary=true when nim-llm is disabled.
    SUMMARY_LLM: "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
    SUMMARY_LLM_SERVERURL: "nim-vlm:8000"
    # VLM embedding settings for ingestor
    APP_EMBEDDINGS_SERVERURL: "nemotron-vlm-embedding-ms:8000/v1"
    APP_EMBEDDINGS_MODELNAME: "nvidia/llama-nemotron-embed-vl-1b-v2"

nv-ingest:
  envVars:
    EMBEDDING_NIM_ENDPOINT: "http://nemotron-vlm-embedding-ms:8000/v1"
    EMBEDDING_NIM_MODEL_NAME: "nvidia/llama-nemotron-embed-vl-1b-v2"
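The same embedding model name appears in three places in these values, and the summary settings matter only because nim-llm is disabled, so a quick consistency check can catch drift after manual edits. The sketch below works on an already-loaded values dictionary (for example, one produced by PyYAML's `yaml.safe_load`). It is an assumption-laden illustration: `dig` and `check_values` are hypothetical helpers, and the key paths assume that `envVars`, `ingestor-server`, `nv-ingest`, and `nim-llm` are top-level keys in values.yaml.

```python
from typing import Any


def dig(values: dict, *path: str) -> Any:
    """Follow a key path through nested dicts; return None if any key is missing."""
    node: Any = values
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def check_values(values: dict) -> list[str]:
    """Return consistency problems found in a multimodal values mapping."""
    problems = []

    # The embedding model should be named identically everywhere it is set.
    names = [
        dig(values, "envVars", "APP_EMBEDDINGS_MODELNAME"),
        dig(values, "ingestor-server", "envVars", "APP_EMBEDDINGS_MODELNAME"),
        dig(values, "nv-ingest", "envVars", "EMBEDDING_NIM_MODEL_NAME"),
    ]
    if None in names or len(set(names)) != 1:
        problems.append(f"embedding model names are missing or differ: {names}")

    # With nim-llm disabled, summaries need an explicit SUMMARY_LLM.
    llm_disabled = dig(values, "nim-llm", "enabled") is False
    summary_llm = dig(values, "ingestor-server", "envVars", "SUMMARY_LLM")
    if llm_disabled and not summary_llm:
        problems.append("nim-llm is disabled but ingestor-server SUMMARY_LLM is not set")

    return problems
```

If your chart nests these keys differently, adjust the paths passed to `dig` accordingly.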
2. Deploy or Upgrade the Chart#
After modifying values.yaml, apply the changes as described in Change a Deployment.
For detailed Helm deployment instructions, see Helm Deployment Guide.
3. Verify the Deployment#
Verify the VLM pods are running:
kubectl get pods -n rag | grep -E "(vlm|embedding)"
Expected output:
nim-vlm-f4c446cbf-ffzm7 1/1 Running 0 22m
nemotron-vlm-embedding-ms-... 1/1 Running 0 22m
Note
It may take several minutes for the VLM pods to initialize and download the model weights.
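While the pods initialize, the READY and STATUS columns show which ones are still starting. As a rough sketch, the hypothetical helper below scans `kubectl get pods` text output and returns the pods that are not yet fully ready. It assumes the default column order (NAME, READY, STATUS, ...).

```python
def pods_not_ready(kubectl_output: str) -> list[str]:
    """Return pod names whose STATUS is not Running or whose READY count is incomplete."""
    pending = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            pending.append(name)
    return pending
```

An empty list means every listed pod reports Running with all of its containers ready.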
Using Multimodal Queries#
After deployment, you can start querying your knowledge base with both text and images.
Important
You must select a collection before querying. Multimodal queries require a knowledge base to search against. Before performing any query (including visual Q&A, product identification, or document lookup), ensure you have:
Created a collection: Use the Web UI, Python client, or API to create a new collection
Ingested documents: Upload documents (PDFs, images, etc.) to your collection
Selected the collection: When querying, explicitly specify the collection name
Queries without a selected collection will not return relevant results from your knowledge base.
Web UI#
Access the RAG frontend at http://localhost:8090 to experiment with multimodal queries through the user interface.
In the sidebar, select your collection from the Collection dropdown
Upload an image and/or enter your text query
Click Send to get responses based on your knowledge base
For details, see User Interface for NVIDIA RAG Blueprint.
Python Client#
When using the Python client, pass image input using the OpenAI vision content
format and always specify collection_names in your query:
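import base64
from pathlib import Path

# `rag` is assumed to be an already-initialized client from the Python package.
image_path = Path("Creme_clutch_purse1-small.jpg")
image_b64 = base64.b64encode(image_path.read_bytes()).decode()

image_query = [
    {"type": "text", "text": "What material is this made of?"},
    {
        "type": "image_url",
        "image_url": {
            # The file is a JPEG, so the data URL declares image/jpeg.
            "url": f"data:image/jpeg;base64,{image_b64}",
            "detail": "auto",
        },
    },
]

# Top-level await works in a notebook; in a script, call this from an async function.
await rag.generate(
    messages=[{"role": "user", "content": image_query}],
    use_knowledge_base=True,
    collection_names=["your_collection_name"],
    enable_reranker=False,
)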
For details, see NVIDIA RAG Blueprint Python Package.
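When queries use images of different types, the MIME type in the data URL should match each file. The following sketch builds the same content list for any image path by guessing the MIME type from the file extension with the standard library. `image_content` is a hypothetical helper, not part of the Python package.

```python
import base64
import mimetypes
from pathlib import Path


def image_content(question: str, image_path: str, detail: str = "auto") -> list[dict]:
    """Build an OpenAI vision-style content list from a question and an image file."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {image_path!r}")

    encoded = base64.b64encode(Path(image_path).read_bytes()).decode()
    return [
        {"type": "text", "text": question},
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:{mime};base64,{encoded}",
                "detail": detail,
            },
        },
    ]
```

The returned list can be passed as the `content` of the user message in the call shown above.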
Interactive Notebook#
For a step-by-step guide with code examples covering collection creation, document ingestion, and querying with images, see the Multimodal Query Notebook.
Limitations#
Image-query reranking is bypassed: When the user query includes an image, use enable_reranker: False. Image queries use the multimodal vector retrieval path directly.
Single-page retrieval for image queries: When an image is included in the query, the retrieval results are constrained to content from a single page per document. Multi-page context retrieval is not supported for image-based queries.
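For applications that mix text-only and image queries, the reranker limitation can be handled by choosing the flag per request. The sketch below inspects messages in the OpenAI vision content format and returns the value to pass as `enable_reranker`. Both function names and the `default` parameter are hypothetical.

```python
def has_image(messages: list[dict]) -> bool:
    """Return True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    return True
    return False


def reranker_setting(messages: list[dict], default: bool = True) -> bool:
    """Disable reranking for image queries; otherwise keep the caller's default."""
    return False if has_image(messages) else default
```

A caller would then pass `enable_reranker=reranker_setting(messages)` so that text-only queries keep their usual setting.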