Vision-Language Model (VLM) for Generation for NVIDIA RAG Blueprint#
The Vision-Language Model (VLM) inference feature in the NVIDIA RAG Blueprint enhances the system’s ability to understand and reason about visual content. Unlike traditional image upload systems, this feature operates on image citations that are internally discovered during the retrieval process.
Note
B200 GPUs are not supported for VLM-based inference in RAG. For this feature, use H100 or A100 GPUs instead.
Key use cases for VLM
Documents with charts and graphs: Financial reports, scientific papers, business analytics
Technical diagrams: Engineering schematics, architectural plans, flowcharts
Visual data representations: Infographics, tables with visual elements, dashboards
Mixed content documents: PDFs containing both text and images
Image-heavy content: Catalogs, product documentation, visual guides
Key benefits of VLM
Integrated multimodal experience – Users do not need to manually upload images; visual content is automatically discovered and analyzed from images embedded in documents.
Improved accuracy – Enhanced response quality for documents containing images, charts, diagrams, and visual data.
Quality assurance – Internal reasoning ensures only relevant visual insights are used.
Contextual understanding – Visual analysis is performed in the context of the user’s specific question.
Fallback handling – System gracefully handles cases where images are insufficient or irrelevant.
Warning
Enabling VLM inference increases response latency because of the additional image processing and VLM inference time. Weigh this trade-off between accuracy and speed against your requirements.
How VLM Works in the RAG Pipeline#
When VLM inference is enabled, the VLM replaces the traditional LLM in the RAG pipeline for generation tasks.
Automatic Image Discovery: When a user query is processed, the RAG system retrieves relevant documents from the vector database. If any of these documents contain images (charts, diagrams, photos, and so on), they are automatically identified.
Image Captioning at Ingestion: During ingestion, images are extracted and captioned so they can be indexed and later cited for question answering.
VLM Answer Generation: At query time, the RAG server sends the user question, conversation history, and cited images to a Vision-Language Model. The VLM directly generates the final answer for the user, taking the place of the traditional LLM.
What users experience: Users interact with the system normally. The VLM processing happens transparently:
User asks a question about content that may have visual elements.
System retrieves relevant documents including any images.
VLM analyzes images and text context if present and relevant.
User receives a single, coherent answer generated directly by the VLM.
Prompt Customization#
The VLM feature uses predefined prompts that can be customized in src/nvidia_rag/rag_server/prompt.yaml under the vlm_template section. The vlm_template controls how the question, textual context, and cited images are presented to the VLM.
VLM reasoning compared to non-reasoning mode: Nemotron Omni supports two modes controlled by the APP_VLM_ENABLE_THINKING environment variable:
Reasoning mode (default): APP_VLM_ENABLE_THINKING=true. The model produces a chain-of-thought trace before the final answer. Default parameters: APP_VLM_TEMPERATURE=0.6, APP_VLM_TOP_P=0.95, APP_VLM_MAX_TOKENS=32768, APP_VLM_THINKING_TOKEN_BUDGET=16384.
Non-reasoning mode: APP_VLM_ENABLE_THINKING=false. The model skips the reasoning trace and returns only the final answer.
What reaches the streaming client is structured by field:
Reasoning is filtered out of user-facing content and surfaced in reasoning_content when the model emits it.
The final answer streams through content.
VLM_FILTER_THINK_TOKENS is retained as a compatibility setting; streamed reasoning is not wrapped or concatenated into content.
Set these parameters using environment variables in your deployment configuration (for example in docker-compose-rag-server.yaml or Helm values.yaml).
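As a minimal sketch of switching between the two modes, the helper below prints the export lines for each one. The function name vlm_mode_env is hypothetical and not part of the blueprint; the variable names and values are the defaults listed above.

```shell
# Print the export lines for one of the two documented VLM modes.
# vlm_mode_env is a hypothetical helper; values are the defaults stated above.
vlm_mode_env() {
  case "$1" in
    reasoning)
      printf '%s\n' \
        'export APP_VLM_ENABLE_THINKING=true' \
        'export APP_VLM_TEMPERATURE=0.6' \
        'export APP_VLM_TOP_P=0.95' \
        'export APP_VLM_MAX_TOKENS=32768' \
        'export APP_VLM_THINKING_TOKEN_BUDGET=16384'
      ;;
    non-reasoning)
      printf '%s\n' 'export APP_VLM_ENABLE_THINKING=false'
      ;;
    *)
      echo "usage: vlm_mode_env reasoning|non-reasoning" >&2
      return 1
      ;;
  esac
}

# Usage: load the settings into the current shell, then restart the rag-server.
# eval "$(vlm_mode_env non-reasoning)"
# docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
```

Printing the lines instead of exporting them directly keeps the helper side-effect free, so the output can be reviewed or written to an env file before it is applied.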
Enable VLM with Docker Compose#
NVIDIA RAG uses the Nemotron Omni (nvidia/nemotron-3-nano-omni-30b-a3b-reasoning) Vision-Language Model by default, provided as the vlm-ms service in deploy/compose/nims.yaml.
The vlm-generation profile in deploy/compose/nims.yaml is designed for VLM-based generation on 2xH100 GPUs. It skips the NIM LLM deployment (VLM replaces LLM), deploys the VLM service (vlm-ms), and deploys embedding and reranker microservices. The ingest profile (combined below) additionally starts the ingestion-extraction NIMs (page-elements, graphic-elements, table-structure, nemotron-ocr) and the captioning VLM (vlm-captioning-ms): without these, ingestion of PDFs and image-bearing documents will fail to extract tables, charts, and OCR text.
GPU allocation for 2xH100: GPU 0 for Embedding and Reranker; GPU 1 for VLM (replaces LLM). You must set VLM_MS_GPU_ID=1.
Set the VLM GPU assignment and start the VLM generation services together with the ingestion-extraction NIMs (skips nim-llm):
export VLM_MS_GPU_ID=1
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile vlm-generation --profile ingest up -d
Combining --profile vlm-generation with --profile ingest is equivalent to “start everything in nims.yaml except nim-llm”: the LLM container is intentionally omitted because VLM replaces it. You can confirm the exact service set with docker compose -f deploy/compose/nims.yaml --profile vlm-generation --profile ingest config --services.
Warning
Only change VLM_MS_GPU_ID for systems with 3+ GPUs.
For systems with 3+ GPUs, you can assign VLM to a different GPU (for example, GPU 3; GPU indices are zero-based, so this index requires at least four GPUs):
export VLM_MS_GPU_ID=3
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile vlm-generation --profile ingest up -d
Enable image extraction and captioning for ingestion. In deploy/compose/docker-compose-ingestor-server.yaml, under the ingestor-server service, set APP_NVINGEST_EXTRACTIMAGES to True so images are extracted and stored (disabled by default). Image captioning is enabled by default: APP_NVINGEST_CAPTIONMODELNAME is set to nvidia/nemotron-3-nano-omni-30b-a3b-reasoning and APP_NVINGEST_CAPTIONENDPOINTURL points to the vlm-ms service. Override using environment variables if needed:
export APP_NVINGEST_EXTRACTIMAGES=True
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d
Configure the RAG server to use VLM. Set the following environment variables in docker-compose-rag-server.yaml, then restart the rag-server:
export ENABLE_VLM_INFERENCE="true"
export APP_VLM_MODELNAME="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export APP_VLM_SERVERURL="http://vlm-ms:8000/v1"
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
ENABLE_VLM_INFERENCE: Enables VLM inference in the RAG server.
APP_VLM_MODELNAME: The name of the VLM model to use.
APP_VLM_SERVERURL: The URL of the VLM NIM server (local or remote).
Note
When using the vlm-generation profile, there is no LLM service running. The VLM handles all generation tasks. Optional fallback is controlled by VLM_TO_LLM_FALLBACK (see VLM to LLM Fallback).
Continue with the rest of the steps in Deploy with Docker (Self-Hosted Models) to deploy the ingestion-server and rag-server containers.
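The GPU assignment from the first step can be sanity-checked before the containers start. The sketch below assumes zero-based GPU indices, as in the GPU 0 / GPU 1 allocation described above. The helper name gpu_id_in_range is hypothetical and not part of the blueprint.

```shell
# Return 0 when a zero-based GPU index exists on a host with the given GPU count.
# gpu_id_in_range is a hypothetical helper. Returns 2 for empty or non-numeric input.
gpu_id_in_range() {
  for value in "$1" "$2"; do
    case "$value" in
      ''|*[!0-9]*) return 2 ;;
    esac
  done
  [ "$1" -lt "$2" ]
}

# Usage: count the GPUs with nvidia-smi, then check the chosen index.
# gpu_count=$(nvidia-smi --query-gpu=index --format=csv,noheader | wc -l)
# gpu_id_in_range "${VLM_MS_GPU_ID:-1}" "$gpu_count" || echo "VLM_MS_GPU_ID is out of range" >&2
```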
Using a Remote NVIDIA-Hosted NIM (Optional)#
To use a remote NVIDIA-hosted NIM for VLM inference, set APP_VLM_SERVERURL to the remote endpoint:
export ENABLE_VLM_INFERENCE="true"
export APP_VLM_MODELNAME="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export APP_VLM_SERVERURL="https://integrate.api.nvidia.com/v1/"
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
Continue with Deploy with Docker (NVIDIA-Hosted Models) as needed.
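The endpoint values on this page appear both with and without a trailing slash. The sketch below normalizes the base URL and builds a model-listing URL for a quick reachability check. It assumes the endpoint is OpenAI-compatible and lists models under /models, and that NGC_API_KEY is accepted as a bearer token; the helper name vlm_models_url is hypothetical.

```shell
# Build the model-listing URL for an OpenAI-compatible base URL.
# vlm_models_url is a hypothetical helper; the /models path is an assumption.
vlm_models_url() {
  base=${1%/}                      # drop one trailing slash if present
  printf '%s/models\n' "$base"
}

# Usage (bearer-token authentication is an assumption):
# curl -sS -H "Authorization: Bearer $NGC_API_KEY" "$(vlm_models_url "$APP_VLM_SERVERURL")"
```

For the self-hosted case, the vlm-ms hostname resolves only inside the Compose network, so the same check would have to run from a container on that network.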
Enable VLM with Helm#
Note
GPU requirements for Helm: VLM uses the same GPU normally assigned to LLM (GPU 1). With MIG slicing, assign a dedicated MIG slice to the VLM; see mig-deployment.md and values-mig-h100.yaml or values-mig-rtx6000.yaml. To run both VLM and LLM simultaneously, an additional GPU is required.
In values.yaml, under the rag-server envVars section, set:
ENABLE_VLM_INFERENCE: "true"
APP_VLM_MODELNAME: "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
APP_VLM_SERVERURL: "http://nim-vlm:8000/v1"
Enable image extraction and captioning for ingestion. Image captioning is recommended when running VLM generation so that ingested images are indexed with their captions and surface as citations at query time. The captioning model is served by a dedicated nim-vlm-captioning NIM (see Separate VLMs for Generation and Captioning). Under nimOperator and ingestor-server.envVars, set:
nimOperator:
  nim-vlm-captioning:
    enabled: true
ingestor-server:
  envVars:
    APP_NVINGEST_EXTRACTIMAGES: "True"
    APP_NVINGEST_CAPTIONENDPOINTURL: "http://nim-vlm-captioning:8000/v1/chat/completions"
    APP_NVINGEST_CAPTIONMODELNAME: "nvidia/nemotron-nano-12b-v2-vl"
Enable nim-vlm and disable nim-llm (VLM replaces LLM for generation):
nimOperator:
  nim-vlm:
    enabled: true
  nim-llm:
    enabled: false
Important
By disabling nim-llm and enabling nim-vlm, the VLM uses the GPU resources normally allocated to the LLM, so no additional hardware is required.
Apply the changes as described in Change a Deployment. For full steps, see Deploy with Helm.
Verify the VLM pod is running. A pod with the name nim-vlm-* will start (the nim-llm pod will not be created when it is disabled). Example status:
rag nim-vlm-f4c446cbf-ffzm7 1/1 Running 0 22m
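The pod check in the last step can be scripted. The sketch below reads kubectl get pods output on standard input and succeeds only when a generation VLM pod is Running with all containers ready. The helper name vlm_pod_ready is hypothetical, and the rag namespace in the usage line is an assumption taken from the example status.

```shell
# Succeed when the input lists a generation VLM pod that is Running and fully ready.
# vlm_pod_ready is a hypothetical helper. It skips nim-vlm-captioning pods.
vlm_pod_ready() {
  awk '
    {
      # Locate the pod name, then read the READY and STATUS columns after it.
      for (i = 1; i <= NF - 2; i++) {
        if ($i ~ /^nim-vlm-/ && $i !~ /^nim-vlm-captioning/) {
          split($(i + 1), ready, "/")
          if (ready[1] == ready[2] && $(i + 2) == "Running") found = 1
        }
      }
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage (the namespace is an assumption):
# kubectl get pods -n rag | vlm_pod_ready && echo "VLM pod is ready"
```

Scanning for the pod name rather than relying on a fixed column position lets the same helper work with and without the namespace column.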
Note
Service architecture: With VLM enabled and LLM disabled, the RAG pipeline uses VLM for all generation tasks. The embedding and reranking services remain active for document retrieval. For local VLM inference, ensure the VLM NIM service is running and accessible at the configured APP_VLM_SERVERURL. For remote endpoints, the NGC_API_KEY is required for authentication.
Configuration#
Image limits: APP_VLM_MAX_TOTAL_IMAGES (default: 5) is the maximum total number of images (from query, history, and context) included in the VLM prompt. Set it using an environment variable and restart the rag-server to apply.
Example (Docker Compose):
export ENABLE_VLM_INFERENCE="true"
export APP_VLM_MAX_TOTAL_IMAGES="5"
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
Context limitations: The VLM receives the current user query, a truncated conversation history, and a textual summary of retrieved documents, together with any cited images. The effective context window of the VLM is limited, so very long conversations or large document contexts may be truncated.
Warning
Keep user questions as self-contained as possible, especially in long-running conversations. Use retrieval and prompt tuning to focus the most relevant context for the VLM.
Separate VLMs for Generation and Captioning#
This blueprint uses two distinct VLM NIMs by default: one for chat / RAG answering and one for ingestion-time image captioning. The two services are configured independently so each can be scaled, replaced, or pointed at a different endpoint without affecting the other.
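| Role | Default model | Compose service | Helm chart key |
|---|---|---|---|
| Generation (chat / RAG answering) | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning | vlm-ms | nimOperator.nim-vlm |
| Ingestion-time image captioning | nvidia/nemotron-nano-12b-v2-vl | vlm-captioning-ms | nimOperator.nim-vlm-captioning |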
Configure the generation VLM#
Set the generation VLM using APP_VLM_MODELNAME and APP_VLM_SERVERURL. With Docker Compose:
export APP_VLM_MODELNAME="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export APP_VLM_SERVERURL="http://vlm-ms:8000/v1"
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
With Helm, in values.yaml:
envVars:
  APP_VLM_MODELNAME: "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
  APP_VLM_SERVERURL: "http://nim-vlm:8000/v1"
nimOperator:
  nim-vlm:
    enabled: true
Configure the captioning VLM#
The captioning model is consumed by both the ingestor-server and the upstream nv-ingest-ms-runtime, so set both pairs of env vars to the same values.
With Docker Compose:
export APP_NVINGEST_CAPTIONMODELNAME="nvidia/nemotron-nano-12b-v2-vl"
export APP_NVINGEST_CAPTIONENDPOINTURL="http://vlm-captioning-ms:8000/v1/chat/completions"
export VLM_CAPTION_MODEL_NAME="nvidia/nemotron-nano-12b-v2-vl"
export VLM_CAPTION_ENDPOINT="http://vlm-captioning-ms:8000/v1/chat/completions"
docker compose -f deploy/compose/nims.yaml --profile vlm-rag up -d # starts vlm-captioning-ms
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d
With Helm, in values.yaml:
nimOperator:
  nim-vlm-captioning:
    enabled: true
ingestor-server:
  envVars:
    APP_NVINGEST_CAPTIONMODELNAME: "nvidia/nemotron-nano-12b-v2-vl"
    APP_NVINGEST_CAPTIONENDPOINTURL: "http://nim-vlm-captioning:8000/v1/chat/completions"
nv-ingest:
  envVars:
    VLM_CAPTION_MODEL_NAME: nvidia/nemotron-nano-12b-v2-vl
    VLM_CAPTION_ENDPOINT: http://nim-vlm-captioning:8000/v1/chat/completions
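For the Docker Compose flow, the requirement that both variable pairs carry the same values can be checked in the shell before the services start. This is a minimal sketch; the helper name caption_env_consistent is hypothetical and not part of the blueprint.

```shell
# Return 0 when both captioning variable pairs are set and agree with each other.
# caption_env_consistent is a hypothetical helper.
caption_env_consistent() {
  [ -n "${APP_NVINGEST_CAPTIONMODELNAME:-}" ] &&
  [ "${APP_NVINGEST_CAPTIONMODELNAME:-}" = "${VLM_CAPTION_MODEL_NAME:-}" ] &&
  [ -n "${APP_NVINGEST_CAPTIONENDPOINTURL:-}" ] &&
  [ "${APP_NVINGEST_CAPTIONENDPOINTURL:-}" = "${VLM_CAPTION_ENDPOINT:-}" ]
}

# Usage:
# caption_env_consistent || echo "captioning model or endpoint settings differ" >&2
```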
NVIDIA-hosted endpoints#
To use NVIDIA-hosted endpoints for either role, point the corresponding *_SERVERURL / *_ENDPOINTURL at https://integrate.api.nvidia.com/v1 (chat completions: https://integrate.api.nvidia.com/v1/chat/completions). Both nvidia/nemotron-3-nano-omni-30b-a3b-reasoning and nvidia/nemotron-nano-12b-v2-vl are available there.
VLM to LLM Fallback (Optional)#
By default, with VLM enabled, the RAG server uses VLM for all generation tasks. The VLM_TO_LLM_FALLBACK environment variable controls behavior for text-only queries (no images in query, messages, or retrieved context).
Default (no fallback): VLM_TO_LLM_FALLBACK="false". The VLM handles all queries. Recommended for the 2xH100 setup.
Enable fallback: VLM_TO_LLM_FALLBACK="true". Text-only queries use a traditional LLM. You must deploy an LLM service alongside the VLM.
GPU requirements for fallback: A minimum of 3xH100 GPUs is required (GPU 0 for Embedding and Reranker, GPU 1 for VLM, GPU 2 for LLM).
Docker Compose with fallback: Start both VLM and LLM (do not use the vlm-generation profile):
export VLM_MS_GPU_ID=1
export LLM_MS_GPU_ID=2
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d
export ENABLE_VLM_INFERENCE="true"
export VLM_TO_LLM_FALLBACK="true"
export APP_VLM_MODELNAME="nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
export APP_VLM_SERVERURL="http://vlm-ms:8000/v1"
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
Warning
Do not use the vlm-generation profile when fallback is enabled; it skips the LLM. Using VLM_TO_LLM_FALLBACK="true" with that profile will cause errors for text-only queries.
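The mismatch described in this warning can be caught before startup by comparing the fallback flag with the services that Compose would start. The sketch below reads service names on standard input, one per line. The helper name fallback_services_ok is hypothetical, and it assumes the LLM service is listed as nim-llm, the name used earlier on this page.

```shell
# Read service names on stdin; fail when fallback is "true" but nim-llm is absent.
# fallback_services_ok is a hypothetical helper.
fallback_services_ok() {
  fallback=$1
  services=$(cat)
  if [ "$fallback" = "true" ]; then
    printf '%s\n' "$services" | grep -qx 'nim-llm'
  else
    return 0
  fi
}

# Usage: pass the same -f and --profile flags you plan to start with.
# docker compose -f deploy/compose/nims.yaml config --services |
#   fallback_services_ok "${VLM_TO_LLM_FALLBACK:-false}" ||
#   echo "fallback is enabled but nim-llm would not be started" >&2
```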
Helm with fallback: In values.yaml, set VLM_TO_LLM_FALLBACK: "true" and keep both nim-vlm and nim-llm enabled:
envVars:
  ENABLE_VLM_INFERENCE: "true"
  VLM_TO_LLM_FALLBACK: "true"
  APP_VLM_MODELNAME: "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
  APP_VLM_SERVERURL: "http://nim-vlm:8000/v1"
nimOperator:
  nim-vlm:
    enabled: true
  nim-llm:
    enabled: true
Enabling Full VLM Multimodal RAG Pipeline#
The VLM generation path covered above can be combined with the multimodal embedder and a multimodal reranker for a fully VLM-powered ingestion + retrieval + generation pipeline. This is the recommended setup when your corpus is image-heavy (PDFs with charts, diagrams, scanned tables) or when end-user queries themselves carry images.
The pipeline has three independently switchable components, each with its own dedicated guide:
1. VLM Embedding for ingestion (image modality)#
The default embedder is nvidia/llama-nemotron-embed-vl-1b-v2, so PDF pages, tables, charts, and image elements are embedded by a multimodal model. The same model embeds text + image queries at retrieval time, so no extra rag-server config is needed beyond pointing APP_EMBEDDINGS_* at the VLM embedding NIM.
Setup, modality switches (text-only, structured-as-image, page-as-image), and Docker/Helm flows: see Multimodal Retriever: Part 1: VLM Embedding for Ingestion.
2. VLM Reranker (image-aware reranking)#
Swap the default text reranker for nvidia/llama-nemotron-rerank-vl-1b-v2 and turn on ENABLE_VLM_RERANKER_IMAGE_INPUT=True so the reranker scores passages with awareness of the cited images, not just the surrounding text. This noticeably improves ordering when the most relevant chunk is signalled by its image content.
What the flag does, when to enable it, and Docker/Helm flows: see Multimodal Retriever: Part 2: VLM Reranker.
3. VLM Generation#
Covered earlier on this page (Enable VLM with Docker Compose, Enable VLM with Helm). With reasoning mode enabled, the VLM produces a chain-of-thought before the final answer; the rag-server streams the final answer through content and surfaces the reasoning separately in reasoning_content, as described under Prompt Customization. VLM_FILTER_THINK_TOKENS is retained as a compatibility setting.
Putting it all together#
For an end-to-end deployment that wires up multimodal queries (image + text from the user) on top of all three components, see Multimodal Query Support. It walks through the combined Docker Compose and Helm setups, including the trade-offs around reranking on image queries.
Troubleshooting#
Ensure the VLM NIM is running and reachable at the configured APP_VLM_SERVERURL.
For remote endpoints, ensure NGC_API_KEY is valid and has access to the model.
Check rag-server logs for VLM inference or API authentication errors.
Verify that images are ingested, captioned, and indexed in your knowledge base.
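As a first troubleshooting step, it can help to confirm that the three rag-server variables from the Docker Compose setup are set in the shell that launches the containers. This is a minimal sketch; the helper name missing_vlm_vars is hypothetical and not part of the blueprint.

```shell
# Print the names of required VLM variables that are unset or empty.
# missing_vlm_vars is a hypothetical helper. Returns 1 when any are missing.
missing_vlm_vars() {
  missing=""
  for name in ENABLE_VLM_INFERENCE APP_VLM_MODELNAME APP_VLM_SERVERURL; do
    eval "value=\${$name:-}"       # portable indirect lookup of the variable
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
}

# Usage:
# missing_vlm_vars || echo "set the variables above, then restart the rag-server" >&2
```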