Release Notes for NVIDIA RAG Blueprint#
This documentation contains the release notes for the NVIDIA RAG Blueprint.
Release 2.3.0 (2025-10-14)#
This release adds RTX Pro 6000 platform support, adds deployment by using the NIM Operator, improves vector database pluggability with the blueprint, and includes other changes.
Highlights#
This release contains the following key changes:
You can now deploy the RAG Blueprint on RTX Pro 6000 Blackwell Server Edition.
Migrated to llama-3.3-nemotron-super-49b-v1.5 as the default LLM model.
Added support to deploy the Helm chart by using the NVIDIA NIM Operator. For details, refer to Deploy NVIDIA RAG Blueprint with NIM Operator.
Updated all NIMs, NVIDIA Ingest, and third-party dependencies to the latest versions.
Refactored the blueprint to streamline custom third-party vector database integration. For details, refer to Building Custom Vector Database Operators.
Added support for the Elasticsearch vector database as an alternative to Milvus.
Added opt-in query decomposition support.
Added opt-in nemoretriever-ocr support.
Added opt-in VLM embedding support.
Custom metadata enhancements, including the following. For details, refer to Advanced Metadata Filtering.
Added support for more data types.
Added opt-in support to generate filters by using an LLM, which yields better accuracy.
Added an interactive notebook that showcases the new features. For details, refer to Notebooks.
Added dependency check support for the ingestor-server /health API (see the sketch after this list).
Added support for a configurable confidence threshold for retrieval at the API layer.
Added support to store NV-Ingest extraction results directly from the filesystem.
Logging enhancements, including the following:
Added better latency data reporting for the RAG server.
Added API-level enhancements for component-level latency.
Added a dedicated Prometheus metrics endpoint.
Added an independent script that showcases batch ingestion.
Enabled support for GPU indexing with CPU search.
Exposed APP_VECTORSTORE_EF as a configurable parameter.
Added the LLM_MAX_TOKENS, LLM_TEMPERATURE, and LLM_TOP_P environment variables to control LLM parameters.
Added notebooks for showcasing RAG evaluation using common metrics. For details, refer to Notebooks.
Added unit tests and pre-commit hooks for maintaining code quality.
Optimized container sizes by removing unnecessary packages and improving security.
Refactored the rag-playground code, including the following changes. For details, refer to User Interface.
Uses React end to end; Next.js dependencies were deprecated.
Provides a more developer-friendly and intuitive look and feel.
The rag-playground service is renamed to rag-frontend.
Refactored Helm chart support, including the following changes. For details, refer to Deploy with Helm.
Expanded and reorganized Helm chart configuration, enabling granular control over service components, resource settings, and observability (tracing, metrics).
Introduced ConfigMap and service definitions to facilitate improved application deployment flexibility.
Implemented refined service account and secret management in Helm templates.
Added a new Helm values file for nim-operator to configure LLM model environment and component toggles.
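The new health and metrics surfaces mentioned above can be probed with plain HTTP. Below is a minimal Python sketch, assuming local default ports and a /metrics path on the RAG server; the hosts, ports, and response shapes are illustrative assumptions, not the confirmed API contract.

```python
import requests

INGESTOR_URL = "http://localhost:8082"  # ingestor-server (assumed port)
RAG_URL = "http://localhost:8081"       # rag-server (assumed port)

# The /health endpoint now also reports dependency status; the exact
# response schema may differ from what this sketch prints.
health = requests.get(f"{INGESTOR_URL}/health", timeout=10)
print("ingestor health:", health.status_code, health.json())

# The dedicated Prometheus endpoint serves plain-text metrics that a
# Prometheus scraper (or a quick script like this) can read directly.
metrics = requests.get(f"{RAG_URL}/metrics", timeout=10)
for line in metrics.text.splitlines()[:10]:  # preview the first few metrics
    print(line)
```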
Removed#
Removed consistency level configuration support for Milvus.
Removed the EMBEDDING_NIM_ENDPOINT and EMBEDDING_NIM_MODEL_NAME environment variables for nv-ingest.
Removed the ENABLE_MULTITURN environment variable from rag-server.
Removed the ENABLE_NEMOTRON_THINKING environment variable from rag-server.
Fixed Known Issues#
The following are the known issues that are fixed in this version:
Fixed support for long audio file ingestion.
Fixed support to ingest images without charts/tables.
Fixed the requirement to rebuild the rag-frontend container when the LLM model name changed.
For the full list of known issues, see Known Issues.
Release 2.2.1 (2025-07-22)#
This is a minor patch release.
This release updates to the latest nv-ingest-client version 25.6.3 to fix breaking changes introduced by pypdfium2.
For details, refer to NVIDIA NV Ingest 25.6.3.
Release 2.2.0 (2025-07-08)#
This release adds B200 platform support, a native Python API, and major enhancements for multimodal and metadata features. It also improves deployment flexibility and customization across the RAG blueprint.
Highlights#
This release contains the following key changes:
You can now deploy the RAG Blueprint on DGX B200.
You can use Multi-Instance GPU (MIG) to reduce the GPU requirement to 3xH100. For details, refer to Deploy on Kubernetes with Helm and MIG Support.
The RAG Blueprint project now uses uv as the package manager.
Added support for NVIDIA AI Workbench.
Added support for a native Python API. For details, refer to Python Client Package.
Added support for custom metadata for files and metadata-based filtering. For details, refer to Advanced Metadata Filtering.
Added support for multi-collection retrieval. For details, refer to Multi-Collection Retrieval.
Added support for .mp3 and .wav files. For details, refer to Audio Ingestion.
Added support for vision language model-based inferencing for charts and images. For details, refer to Vision Language Model for Generation.
Added support for generating summaries of uploaded files. For details, refer to Document Summarization.
Added support for configurable vector store consistency levels (Bounded/Strong/Session) to optimize retrieval performance and accuracy trade-offs.
Added support for non-blocking file upload to the user interface. For details, refer to User Interface.
Added more efficient error reporting to the user interface for ingestion failures. For details, refer to User Interface.
Added support for customizing prompts without rebuilding images. For details, refer to Customize Prompts.
Added support to enable infographics, which improves accuracy for documents containing text in image format. For details, refer to Ingestion and Chunking.
Optimized batch mode ingestion support to improve performance for multi-user concurrent file upload. For details, refer to Advanced Ingestion Batch Mode Optimization.
Added support for enhanced PDF extraction. For details, refer to Nemoretriever Parse.
Added support for running CPU-based Milvus. For details, refer to Milvus Configuration.
Added support for running NV-Ingest as a standalone service for the RAG Blueprint. For more information, refer to Deploy NV-Ingest Standalone.
Updated the API, including the following changes (see the sketch after this list). For details, refer to Migration Guide.
POST /collections is replaced by POST /collection for ingestor-server.
New endpoint GET /summary added for rag-server.
Metadata information is available as part of the GET /collections and GET /documents APIs.
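To make the API changes concrete, here is a hedged Python sketch that exercises the new endpoints. The host, ports, and payload field names are assumptions for illustration; refer to the Migration Guide for the authoritative request and response schemas.

```python
import requests

INGESTOR_URL = "http://localhost:8082"  # ingestor-server (assumed port)
RAG_URL = "http://localhost:8081"       # rag-server (assumed port)

# POST /collection replaces POST /collections (payload fields assumed).
resp = requests.post(
    f"{INGESTOR_URL}/collection",
    json={"collection_name": "my_docs"},
    timeout=30,
)
print("create collection:", resp.status_code)

# Metadata now appears in GET /collections and GET /documents responses.
docs = requests.get(
    f"{INGESTOR_URL}/documents",
    params={"collection_name": "my_docs"},
    timeout=30,
)
print("documents with metadata:", docs.json())

# GET /summary is new on rag-server (query parameters assumed).
summary = requests.get(
    f"{RAG_URL}/summary",
    params={"collection_name": "my_docs", "file_name": "report.pdf"},
    timeout=60,
)
print("summary:", summary.json())
```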
Release 2.1.0 (2025-05-13)#
This release reduces the overall GPU requirement for the deployment of the RAG Blueprint. This release also improves the performance and stability for both Docker- and Helm-based deployments.
Highlights#
This release contains the following key changes:
The overall GPU requirement is now reduced to 2xH100 / 3xA100. For details, refer to Support Matrix.
Changed the default LLM model to llama-3_3-nemotron-super-49b-v1. This reduces the overall GPUs needed to deploy the LLM model to 1xH100 / 2xA100.
Changed the default GPU needed for all other NIMs (ingestion and reranker) to 1xH100 / 1xA100.
The Helm chart is now published on the NGC Public registry. For details, refer to NVIDIA NGC Catalog.
Helm chart customization is now available for many optional features. For details, refer to Deploy with Helm.
Changed the default chunk size to 512 to reduce the LLM context size and, in turn, reduce RAG server response latency.
Exposed config to split PDFs after chunking. You can control this by using the APP_NVINGEST_ENABLEPDFSPLITTER environment variable. For details, refer to Best Practices.
Added batch-based ingestion, which can help manage memory usage of ingestor-server more effectively. You can control this by using the ENABLE_NV_INGEST_BATCH_MODE and NV_INGEST_FILES_PER_BATCH variables. For details, refer to Best Practices.
Removed extract_options from the API level of ingestor-server.
Made security and stability improvements.
Added non-blocking async support to the upload documents API (see the sketch after this list). For details, refer to Migration Guide.
Added a new field blocking: bool to control this behavior from the client side. The default is set to true.
Added a new API /status to monitor the state or completion status of uploaded documents.
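The non-blocking upload flow described above can be sketched as follows. This is a minimal illustration assuming a local ingestor-server; the multipart form layout, task-ID field, and state values are hypothetical. Refer to the Migration Guide for the exact contract.

```python
import time
import requests

INGESTOR_URL = "http://localhost:8082"  # ingestor-server (assumed port)

# blocking defaults to true, so it must be set to false explicitly to get
# the asynchronous behavior. The form-field names below are illustrative.
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{INGESTOR_URL}/documents",
        files={"documents": ("report.pdf", f, "application/pdf")},
        data={"blocking": "false", "collection_name": "my_docs"},
        timeout=60,
    )
task_id = resp.json().get("task_id")  # hypothetical field name

# Poll the new /status API until the upload completes or fails.
while True:
    status = requests.get(
        f"{INGESTOR_URL}/status", params={"task_id": task_id}, timeout=30
    ).json()
    print("ingestion state:", status)
    if status.get("state") in ("FINISHED", "FAILED"):  # assumed state values
        break
    time.sleep(5)
```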
Fixed Known Issues#
The following are the known issues that are fixed in this version:
Fixed issues with processing very large files.
Resolved an issue during bulk ingestion where the ingestion job failed if ingestion of a single file failed.
For the full list of known issues, see Known Issues.
Release 2.0.0 (2025-03-18)#
This release adds support for multimodal documents, including the ability to parse PDF, Word, and PowerPoint documents. This release also significantly improves accuracy and performance by refactoring the APIs and architecture. There is also a new developer-friendly user interface.
Highlights#
This release contains the following key changes:
The RAG Blueprint now uses two separate microservices to manage ingestion and retrieval/generation.
The RAG Blueprint now uses Retriever Extraction instead of unstructured.io for the ingestion pipeline.
Default settings are now configured to achieve a balance between accuracy and performance. For details, refer to Best Practices.
Added support for observability and telemetry. For details, refer to Observability.
Added a new React and Node.js-based user interface to showcase runtime configurations. For details, refer to User Interface.
Query rewriting now uses a smaller llama3.1-8b-instruct model and is turned off by default. For details, refer to Query Rewriting.
Added support for the following optional features to improve accuracy and reliability of the pipeline. These are turned off by default. For details, refer to Best Practices.
Support for using conversation history during retrieval for low-latency multi-turn support.
Added a deployment-ready notebook intended to run in a Brev environment. For details, refer to Notebooks.
Helm charts are now modularized, with separate helm charts provided for each distinct microservice. For details, refer to Deploy with Helm.
The default Docker deployment flow now uses on-premises models. Alternatively, you can deploy with Docker by using NVIDIA-hosted models. For details, refer to the deployment guides.
Made security and stability improvements.
Updated the API, including the following changes (see the sketch after this list). For details, refer to Migration Guide.
Support for runtime configuration of all common parameters.
Multimodal citation support.
New dedicated endpoints for deleting collections, creating collections, and re-ingesting documents.
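As an illustration of runtime configuration, the following hedged sketch passes common generation parameters per request rather than fixing them at deployment time. The endpoint path and field names are assumptions; consult the Migration Guide for the actual API.

```python
import requests

RAG_URL = "http://localhost:8081"  # rag-server (assumed port)

# Common parameters supplied at request time (field names assumed).
payload = {
    "messages": [{"role": "user", "content": "What does the report conclude?"}],
    "use_knowledge_base": True,
    "temperature": 0.2,   # runtime-tunable generation settings
    "top_p": 0.7,
    "max_tokens": 512,
}
resp = requests.post(f"{RAG_URL}/generate", json=payload, timeout=120)
print(resp.json())
```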
Release 1.0.0 (2025-01-15)#
This is the initial release of the NVIDIA RAG Blueprint.
All Known Issues#
The following are the known issues for the NVIDIA RAG Blueprint:
The optional reflection and image captioning features are not available in Helm-based deployments.
Currently, Helm-based deployment is not supported for NeMo Guardrails.
Blueprint responses can have significant latency when using NVIDIA API Catalog cloud-hosted models.
The accuracy of the pipeline is optimized for certain file types like .pdf, .txt, and .docx. The accuracy may be poor for other file types supported by NV-Ingest, because image captioning is disabled by default.
When you update model configurations in the Kubernetes values.yaml file (for example, changing from 70B to 8B models), the RAG UI automatically detects and displays the new model configuration from the backend. No container rebuilds are required; redeploy the Helm chart with updated values and refresh the UI to see the new model settings in the Settings panel.
The NeMo LLM microservice can take 5-6 minutes to start for every deployment.
B200 GPUs are not supported for the following advanced features. For these features, use H100 or A100 GPUs instead.
Image captioning support for ingested documents
NeMo Guardrails for guardrails at input/output
VLM-based inferencing in RAG
PDF extraction with Nemoretriever Parse
Sometimes when HTTP cloud NIM endpoints are used from deploy/compose/.env, the nv-ingest-ms-runtime still logs gRPC environment variables. These log entries can be ignored.
For MIG support, the ingestion profile is currently scaled down when the chart is deployed with MIG slicing. This affects ingestion performance during bulk ingestion; specifically, large bulk ingestion jobs might fail.
Individual file uploads are limited to a maximum size of 400 MB during ingestion. Files exceeding this limit are rejected and must be split into smaller segments before ingesting.
The llama-3.3-nemotron-super-49b-v1.5 model provides more verbose responses in non-reasoning mode compared to v1.0. For some queries, the LLM model may respond with information that is not available in the given context. Also, for out-of-domain queries, the model may provide responses based on its own knowledge. Developers are strongly advised to tune the prompt for their use cases to avoid these scenarios.
The auto-selected NIM LLM profile for llama-3.3-nemotron-super-49b-v1.5 may not work for some GPUs. Follow the steps outlined in the appropriate deployment guide to select an optimized profile by using NIM_MODEL_PROFILE before deploying.
Slow VDB upload is observed in Helm deployments for Elasticsearch.