Multimodal Query Support for NVIDIA RAG Blueprint#

The multimodal query feature in the NVIDIA RAG Blueprint enables you to query your knowledge base using both text and images. This is particularly useful for use cases where visual context enhances the query, such as:

Product identification: “What is the price of this item?” + product image
Document lookup: “Find documents related to this chart” + chart image
Visual Q&A: “What material is this made of?” + product image

This feature combines:

VLM Embeddings: nvidia/llama-nemotron-embed-vl-1b-v2 for creating multimodal embeddings that understand both text and images
Vision-Language Model: nvidia/nemotron-3-nano-omni-30b-a3b-reasoning for generating intelligent responses based on visual and textual context

Prerequisites#

Before enabling multimodal query support, ensure you have:

Obtained an API Key
Deployed the NVIDIA RAG Blueprint
An NVIDIA H100 or A100 GPU for on-prem deployments

Self-Hosted (On-Prem) Deployment#

Use this section to deploy multimodal query support with locally hosted NVIDIA NIMs.

1. Start the Vector Database#

Start the Milvus vector database service:

docker compose -f deploy/compose/vectordb.yaml up -d

2. Deploy the Ingestion and VLM RAG NIMs#

Set your NGC API key (replace with your actual key):

export NGC_API_KEY="nvapi-..."

Then run the deployment commands:

# Create the model cache directory
mkdir -p ~/.cache/model-cache
export MODEL_DIRECTORY=~/.cache/model-cache

# (Optional) Select a specific GPU for the VLM Microservice
# Use `nvidia-smi` to check available GPUs and set the desired GPU ID
export VLM_MS_GPU_ID=1  # Default is GPU 5; change to use a different GPU

# Deploy ingestion NIMs plus the VLM RAG NIMs.
USERID=$(id -u) docker compose --profile ingest --profile vlm-rag -f deploy/compose/nims.yaml up -d

Warning

The first deployment may take 10-20 minutes as models download (~10GB+). Subsequent deployments will be faster as models are cached.

Monitor the deployment status:

watch -n 5 'docker ps --format "table {{.Names}}\t{{.Status}}"'

Wait until the services show as healthy:

3. Configure Environment Variables#

Set the model names and service URLs for the RAG pipeline:

# VLM (Vision-Language Model) configuration
export APP_VLM_MODELNAME="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export APP_VLM_SERVERURL="http://vlm-ms:8000/v1"
export APP_LLM_SERVERURL=""

# Optional: use the same VLM for document summaries when no LLM NIM is running.
# You can also point SUMMARY_LLM* to a separate LLM or NVIDIA-hosted endpoint.
export SUMMARY_LLM="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export SUMMARY_LLM_SERVERURL="http://vlm-ms:8000/v1"

# Multimodal embedding model configuration
export APP_EMBEDDINGS_MODELNAME="nvidia/llama-nemotron-embed-vl-1b-v2"
export APP_EMBEDDINGS_SERVERURL="nemotron-vlm-embedding-ms:8000/v1"
export ENABLE_VLM_INFERENCE="true"
export VLM_TO_LLM_FALLBACK="false"

4. Configure Image Extraction for Ingestion#

Enable image extraction and storage during document ingestion:

# Configure image extraction
export APP_NVINGEST_STRUCTURED_ELEMENTS_MODALITY=""
export APP_NVINGEST_IMAGE_ELEMENTS_MODALITY="image"
export APP_NVINGEST_EXTRACTIMAGES="True"

# Disable reranker for image-query requests. Image queries use the multimodal
# vector retrieval path directly and bypass reranking.
export ENABLE_RERANKER="false"
export APP_RANKING_SERVERURL=""

5. Start the Ingestor Server#

docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d --build

Verify the service is healthy

6. Start the RAG Server#

docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build

Verify the service is healthy

7. Verify All Services Are Running#

Check the status of all deployed containers:

docker ps --format "table {{.Names}}\t{{.Status}}"

Confirm all the containers are running and healthy

NVIDIA-Hosted (Cloud) Deployment#

Use this section to deploy multimodal query support using NVIDIA-hosted API endpoints.

Note

When using NVIDIA-hosted endpoints, you might encounter rate limiting with larger file ingestions (>10 files). For details, see Troubleshoot.

1. Start the Vector Database#

docker compose -f deploy/compose/vectordb.yaml up -d

2. Configure Environment Variables#

a. Open `deploy/compose/.env` and uncomment the section `Endpoints for using cloud NIMs`. Then set the environment variables by running the following code.#

source deploy/compose/.env

b. Set the environment variables to use NVIDIA-hosted endpoints for VLM models:#

Set your NGC API key (replace with your actual key):

export NGC_API_KEY="nvapi-..."

Then set the VLM configuration:

# VLM (Vision-Language Model) configuration - cloud hosted
export APP_VLM_MODELNAME="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export APP_VLM_SERVERURL="https://integrate.api.nvidia.com/v1"
export APP_LLM_SERVERURL=""

# Optional: use the same NVIDIA-hosted VLM for document summaries.
# You can also leave SUMMARY_LLM* pointing at another supported summarizer.
export SUMMARY_LLM="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export SUMMARY_LLM_SERVERURL="https://integrate.api.nvidia.com/v1"

# Multimodal embedding model configuration - cloud hosted
export APP_EMBEDDINGS_MODELNAME="nvidia/llama-nemotron-embed-vl-1b-v2"
export APP_EMBEDDINGS_SERVERURL="https://integrate.api.nvidia.com/v1"
export ENABLE_VLM_INFERENCE="true"
export VLM_TO_LLM_FALLBACK="false"

3. Configure Image Extraction for Ingestion#

# Configure image extraction
export APP_NVINGEST_STRUCTURED_ELEMENTS_MODALITY=""
export APP_NVINGEST_IMAGE_ELEMENTS_MODALITY="image"
export APP_NVINGEST_EXTRACTIMAGES="True"

# Disable reranker (not supported with multimodal queries)
export ENABLE_RERANKER="false"
export APP_RANKING_SERVERURL=""

4. Start the Ingestor Server#

docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d --build

Verify the ingestor server is healthy

5. Start the RAG Server#

docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build

Verify the RAG server is healthy

6. Verify All Services Are Running#

Check the status of all deployed containers

docker ps --format "table {{.Names}}\t{{.Status}}"

You should see output similar to the following:

NAMES                                   STATUS
compose-nv-ingest-ms-runtime-1          Up 5 minutes (healthy)
ingestor-server                         Up 5 minutes
compose-redis-1                         Up 5 minutes
rag-frontend                            Up 9 minutes
rag-server                              Up 9 minutes
elasticsearch                           Up 36 minutes (healthy)
seaweedfs                               Up 35 minutes (healthy)

Helm Chart Deployment#

Use this section to deploy multimodal query support on Kubernetes using Helm charts.

Note

This configuration disables the default LLM NIM and text embedding NIM, replacing them with VLM NIM and VLM embedding NIM. The GPU resources previously used by the disabled services will be available for the VLM services. If MIG slicing is enabled on the cluster, ensure to assign a dedicated slice to the VLM. Check mig-deployment.md for more information.

1. Modify values.yaml#

Modify values.yaml to enable multimodal query support:

# Multimodal Query Configuration
# This replaces the default LLM and text embedding NIMs with VLM variants

# Enable VLM NIM for multimodal generation
nim-vlm:
  enabled: true

# Enable VLM embedding NIM for multimodal embeddings
nvidia-nim-llama-nemotron-embed-vl-1b-v2:
  enabled: true
  image:
    repository: nvcr.io/nim/nvidia/llama-nemotron-embed-vl-1b-v2
    tag: "1.12.0"

# Optional: disable the default text embedding NIM
nvidia-nim-llama-nemotron-embed-1b-v2:
  enabled: false

# Disable LLM NIM (VLM handles generation)
nim-llm:
  enabled: false

# Enable dedicated VLM captioning NIM (image-cap model changed after RC1)
nimOperator:
  nim-vlm-captioning:
    enabled: true

# Configure environment variables
envVars:
  # VLM inference settings
  ENABLE_VLM_INFERENCE: "true"
  VLM_TO_LLM_FALLBACK: "false"
  APP_VLM_MODELNAME: "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
  APP_VLM_SERVERURL: "http://nim-vlm:8000/v1"

  # VLM embedding settings
  APP_EMBEDDINGS_SERVERURL: "nemotron-vlm-embedding-ms:8000/v1"
  APP_EMBEDDINGS_MODELNAME: "nvidia/llama-nemotron-embed-vl-1b-v2"

  # Disable reranker (not supported with multimodal queries)
  ENABLE_RERANKER: "False"
  APP_RANKING_SERVERURL: ""

ingestor-server:
  envVars:
    # Image extraction settings
    APP_NVINGEST_STRUCTURED_ELEMENTS_MODALITY: ""
    APP_NVINGEST_IMAGE_ELEMENTS_MODALITY: "image"
    APP_NVINGEST_EXTRACTIMAGES: "True"

    # Summary generation settings.
    # Required for generate_summary=true when nim-llm is disabled.
    SUMMARY_LLM: "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
    SUMMARY_LLM_SERVERURL: "nim-vlm:8000"

    # VLM embedding settings for ingestor
    APP_EMBEDDINGS_SERVERURL: "nemotron-vlm-embedding-ms:8000/v1"
    APP_EMBEDDINGS_MODELNAME: "nvidia/llama-nemotron-embed-vl-1b-v2"

nv-ingest:
  envVars:
    EMBEDDING_NIM_ENDPOINT: "http://nemotron-vlm-embedding-ms:8000/v1"
    EMBEDDING_NIM_MODEL_NAME: "nvidia/llama-nemotron-embed-vl-1b-v2"

2. Deploy or Upgrade the Chart#

After modifying values.yaml, apply the changes as described in Change a Deployment.

For detailed HELM deployment instructions, see Helm Deployment Guide.

3. Verify the Deployment#

Verify the VLM pods are running:

kubectl get pods -n rag | grep -E "(vlm|embedding)"

Expected output:

nim-vlm-f4c446cbf-ffzm7                              1/1     Running   0          22m
nemotron-vlm-embedding-ms-...                   1/1     Running   0          22m

Note

It may take several minutes for the VLM pods to initialize and download the model weights.

Using Multimodal Queries#

After deployment, you can start querying your knowledge base with both text and images.

Important

You must select a collection before querying. Multimodal queries require a knowledge base to search against. Before performing any query (including visual Q&A, product identification, or document lookup), ensure you have:

Created a collection: Use the Web UI, Python client, or API to create a new collection
Ingested documents: Upload documents (PDFs, images, etc.) to your collection
Selected the collection: When querying, explicitly specify the collection name

Queries without a selected collection will not return relevant results from your knowledge base.

Web UI#

Access the RAG frontend at http://localhost:8090 to experiment with multimodal queries through the user interface.

In the sidebar, select your collection from the Collection dropdown
Upload an image and/or enter your text query
Click Send to get responses based on your knowledge base

For details, see User Interface for NVIDIA RAG Blueprint.

Python Client#

When using the Python client, pass image input using the OpenAI vision content format and always specify collection_names in your query:

import base64
from pathlib import Path

image_b64 = base64.b64encode(Path("Creme_clutch_purse1-small.jpg").read_bytes()).decode()
image_query = [
    {"type": "text", "text": "What material is this made of?"},
    {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{image_b64}",
            "detail": "auto",
        },
    },
]

await rag.generate(
    messages=[{"role": "user", "content": image_query}],
    use_knowledge_base=True,
    collection_names=["your_collection_name"],
    enable_reranker=False,
)

For details, see NVIDIA RAG Blueprint Python Package.

Interactive Notebook#

For a step-by-step guide with code examples covering collection creation, document ingestion, and querying with images, see the Multimodal Query Notebook.

Limitations#

Image-query reranking is bypassed: When the user query includes an image, use enable_reranker: False. Image queries use the multimodal vector retrieval path directly.
Single-page retrieval for image queries: When an image is included in the query, the retrieval results are constrained to content from a single page per document. Multi-page context retrieval is not supported for image-based queries.

Multimodal Query Support for NVIDIA RAG Blueprint#

Prerequisites#

Self-Hosted (On-Prem) Deployment#

1. Start the Vector Database#

2. Deploy the Ingestion and VLM RAG NIMs#

3. Configure Environment Variables#

4. Configure Image Extraction for Ingestion#

5. Start the Ingestor Server#

6. Start the RAG Server#

7. Verify All Services Are Running#

NVIDIA-Hosted (Cloud) Deployment#

1. Start the Vector Database#

2. Configure Environment Variables#

a. Open deploy/compose/.env and uncomment the section Endpoints for using cloud NIMs. Then set the environment variables by running the following code.#

b. Set the environment variables to use NVIDIA-hosted endpoints for VLM models:#

3. Configure Image Extraction for Ingestion#

4. Start the Ingestor Server#

5. Start the RAG Server#

6. Verify All Services Are Running#

Helm Chart Deployment#

1. Modify values.yaml#

2. Deploy or Upgrade the Chart#

3. Verify the Deployment#

Using Multimodal Queries#

Web UI#

Python Client#

Interactive Notebook#

Limitations#

Related Topics#

a. Open `deploy/compose/.env` and uncomment the section `Endpoints for using cloud NIMs`. Then set the environment variables by running the following code.#