Release Notes for NVIDIA RAG Blueprint#
This documentation contains the release notes for NVIDIA RAG Blueprint.
Release 2.5.0 (2026-03-12)#
This release introduces support for the Nemotron-super-3 model, updates NIMs to the latest versions, upgrades NV-Ingest, and adds continuous ingestion along with RTX 6000 MIG support.
Highlights#
This release includes the following key updates:
Nemotron-super-3 model support. You can now integrate the Nemotron-super-3 model by following the steps outlined in Change the Inference or Embedding Model.
NIMs updated to latest versions. The following model updates are included:
nvidia/llama-3.2-nv-embedqa-1b-v2→nvidia/llama-nemotron-embed-1b-v2nvidia/llama-3.2-nv-rerankqa-1b-v2→nvidia/llama-nemotron-rerank-1b-v2nemoretriever-page-elements-v3→nemotron-page-elements-v3nemoretriever-graphic-elements-v1→nemotron-graphic-elements-v1nemoretriever-table-structure-v1→nemotron-table-structure-v1nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1→nvidia/llama-nemotron-embed-vl-1b-v2
Updated NVIngest to version 26.1.2.
Added an example demonstrating the continuous ingestion pipeline. For more information, see rag_event_ingest.ipynb.
Added MIG support for RTX 6000. For details, refer to MIG Deployment and use
values-mig-rtx6000.yamlandmig-config-rtx6000.yaml.Added documentation for the experimental Nemotron-parse-only ingestion pipeline. This configuration allows you to perform extraction using only Nemotron Parse through NV-Ingest, without relying on OCR, page-elements, graphic-elements, or table-structure NIMs. For more information, refer to nemotron-parse-extraction.md.
Several bug fixes, including frontend CVE resolutions, improved multimodal content concatenation for VLM embeddings, enhanced VDB serialization for high-concurrency parallel ingestion, and updates to observability and NeMo Guardrails configurations.
Added agentic skills support: the
rag-blueprintskill enables AI coding assistants (Claude Code, Cursor, Codex, etc.) to deploy, configure, troubleshoot, and manage the RAG Blueprint autonomously. For details, refer to RAG Blueprint Agent Skill.Added accuracy benchmark results across seven public datasets (RagBattlepacket, KG-RAG, Financebench, DC767, HotPotQA, Google Frames, and Vidore), comparing LLM and VLM configurations with reasoning on/off. Benchmarks use the NVIDIA Answer Accuracy metric from RAGAS.
Fixed Known Issues#
The following known issues have been resolved in this release:
Addressed frontend CVEs.
Resolved VDB indexing issues during high-concurrency batch parallel ingestion by implementing VDB serialization.
Release 2.4.0 (2026-02-20)#
This release adds new features to the RAG pipeline for supporting agent workflows and enhances generations with VLMs augmenting multimodal input.
Highlights#
This release contains the following key changes:
Updated NIMs and code to support NeMo Retriever Library 26.01 release.
Added support for non-NIM models including OpenAI, models hosted on AWS and Azure, OSS models, and others. Supported through service-specific API keys. For details, refer to Get an API Key.
The RAG Blueprint now uses nemoretriever-ocr-v1 as the default OCR model. For details, refer to NeMo Retriever Library OCR Configuration Guide.
Improved VLM based generation support. The Vision-Language Model (VLM) inference feature now uses the model nemotron-nano-12b-v2-vl. For details, refer to VLM for Generation.
User interface improvements including catalog display, image and text query, and others. For details, refer to User Interface.
Added ingestion metrics endpoint support with OpenTelemetry (OTEL) for monitoring document uploads, elements ingested, and pages processed. For details, refer to Observability.
Support image and text as input query. For details, refer to Multimodal Query Support.
Nemotron-3-Nano model support with reasoning budget. For details, refer to Enable Reasoning.
Vector Database enhancements including secure database access. For details, refer to Milvus Configuration and Elasticsearch Configuration.
You can now access RAG functionality from a Model Context Protocol (MCP) server for tool integration. For details, refer to MCP Server and Client Usage.
Added OpenAI-compatible search endpoint for integration with OpenAI tools. For details, refer to API - RAG Server Schema.
Added support for collection-level data catalog, descriptions, and metadata. For details, refer to Data Catalog.
Enhanced
/statusendpoint publishing ingestion metrics and status information. For details, refer to the ingestion notebook.Multi-turn conversation support is no longer the default for either retrieval or generation stage in the pipeline. Refer to Multi-Turn Conversation Support for details.
Improved document processing and element extraction.
Enhancements to RAG library mode including the following. For details, refer to Use the NVIDIA RAG Blueprint Python Package.
Independent multi-instance support for the RAG Server and the ingestion server
Configuration support through function arguments
Async interface for RAG methods
Compatibility with the NVIDIA NeMo Agent Toolkit (NAT)
Summarization enhancements including the following. For details, refer to Document Summarization Customization Guide.
Shallow summarization support
Easy model switches and dedicated configurations
Ease of prompt changes
Reserved field names
type,subtype, andlocationfor NeMo Retriever Library exclusive use in metadata schemas.Added support for rag_library_lite_usage.ipynb which demonstrates containerless deployment of the NVIDIA RAG Python package in lite mode.
Added example showcasing NeMo Agent Toolkit integration with NVIDIA RAG.
Added weighted hybrid search support with configurable weights.
RAG server logging improvements
Fixed Known Issues#
The following are the known issues that are fixed in this version:
Fixed issue in NIM LLM for automatic profile selection. For details, refer to Model Profiles.
Known limitations#
The following are the known limitations in this version:
DRA support using NIM operator based helm chart is not available in this release.
For the full list of known issues, refer to Known Issues.
Release 2.3.2 (2025-12-25)#
This release is a hotfix for RAG v2.3.0, and includes the following changes:
Bump embedqa version to 1.10.1 and nim-llm to version 1.14.0.
Align Helm values and any referenced tags with the new embedqa and nim-llm versions.
All Known Issues#
The following are the known issues for the NVIDIA RAG Blueprint:
DRA support
Optional features reflection and image captioning are not available in Helm-based deployment.
Currently, Helm-based deployment is not supported for NeMo Guardrails.
The Blueprint responses can have significant latency when using NVIDIA API Catalog cloud hosted models.
The accuracy of the pipeline is optimized for certain file types like
.pdf,.txt,.docx. The accuracy may be poor for other file types supported by NeMo Retriever Library, since image captioning is disabled by default.When updating model configurations in Kubernetes
values.yaml(for example, changing from 70B to 8B models), the RAG UI automatically detects and displays the new model configuration from the backend. No container rebuilds are required - simply redeploy the Helm chart with updated values and refresh the UI to see the new model settings in the Settings panel.The NeMo LLM microservice can take 5-6 minutes to start for every deployment.
B200 GPUs are not supported for the following advanced features. For these features, use H100 or A100 GPUs instead.
Image captioning support for ingested documents
NeMo Guardrails for guardrails at input/output
VLM-based inferencing in RAG
PDF extraction with Nemotron Parse
Sometimes when HTTP cloud NIM endpoints are used from
deploy/compose/.env, thenv-ingest-ms-runtimestill logs gRPC environment variables. Following log entries can be ignored.For MIG support, currently the ingestion profile has been scaled down while deploying the chart with MIG slicing. This affects the ingestion performance during bulk ingestion, specifically large bulk ingestion jobs might fail.
Individual file uploads are limited to a maximum size of 400 MB during ingestion. Files exceeding this limit are rejected and must be split into smaller segments before ingesting.
llama-3.3-nemotron-super-49b-v1.5model provides more verbose responses in non-reasoning mode compared to v1.0. For some queries the LLM model may respond with information not available in given context. Also for out of domain queries the model may provide responses based on its own knowledge. Developers are strongly advised to tune the prompt for their use cases to avoid these scenarios.Slow VDB upload is observed in Helm deployments for Elasticsearch.
Audio model deployment on Kubernetes on RTX‑6000 Pro is not supported in this release.