# Best Practices for Common NVIDIA RAG Blueprint Settings

Use this documentation to learn how to tune the performance of the NVIDIA RAG Blueprint for your specific use case. The default values balance accuracy and performance; change a setting only if you want different behavior.
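
Most of the settings on this page are environment variables that you override before starting the services. The following is a minimal sketch assuming a Docker Compose deployment of the blueprint; the compose file path is illustrative and depends on your deployment:

```bash
# Override a default before starting the services (value is illustrative).
export APP_NVINGEST_CHUNKSIZE=1024   # default is 512

# The compose file path below is illustrative; use the one from your deployment.
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d
```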

## Ingestion and Chunking

| Name | Default | Description | Advantages | Disadvantages |
|------|---------|-------------|------------|---------------|
| `APP_NVINGEST_CHUNKOVERLAP` | 150 | Increase the overlap to ensure smooth transitions between chunks. | Larger overlap provides smoother transitions between chunks. | Might increase processing overhead. |
| `APP_NVINGEST_CHUNKSIZE` | 512 | Increase the chunk size for more context. | Larger chunks retain more context, improving coherence. | Larger chunks increase embedding size, slowing retrieval. Longer chunks might also increase latency due to larger prompt size. |
| `APP_NVINGEST_ENABLEPDFSPLITTER` | true | Set to true to perform chunk-based splitting of PDFs after the default page-level extraction. Recommended for PDFs that are mostly text content. | Provides more granular content segmentation. | Can increase the number of chunks and slow down ingestion. |
| `APP_NVINGEST_EXTRACTCHARTS` | true | Set to true to extract charts. | Improves accuracy for documents that contain charts. | Increases ingestion time. |
| `APP_NVINGEST_EXTRACTIMAGES` | false | Set to true to enable image captioning during ingestion. For details, refer to Image Captioning Support. | Enhances multimodal retrieval accuracy for documents that contain images. | Increases processing time during ingestion. Requires additional GPU resources for VLM model deployment. |
| `APP_NVINGEST_EXTRACTINFOGRAPHICS` | false | Set to true to extract infographics and text rendered as images. | Improves accuracy for documents that contain text in image format. | Increases ingestion time. |
| `APP_NVINGEST_EXTRACTTABLES` | true | Set to true to extract tables. | Improves accuracy for documents that contain tables. | Increases ingestion time. |
| `APP_NVINGEST_PDFEXTRACTMETHOD` | pdfium | Set to `nemoretriever_parse` to extract PDFs with Nemoretriever Parse. For details, refer to PDF Extraction with Nemoretriever Parse. | Provides enhanced PDF parsing and structure understanding. Better extraction of complex PDF layouts and content. | Requires additional GPU resources for the Nemoretriever Parse service. Supports only PDF documents. Not supported on NVIDIA B200 GPUs. |
| `APP_NVINGEST_SEGMENTAUDIO` | false | Set to true to enable audio segmentation. For details, refer to Audio Ingestion Support. | Segments audio files at commas and other punctuation marks for more granular audio chunks. Improves downstream processing and retrieval accuracy for audio content. | Might increase processing time during audio ingestion. |
| `INGEST_DISABLE_DYNAMIC_SCALING` | true | Set to true to disable dynamic scaling. | With dynamic scaling disabled, ingestion performance and throughput improve, and resource allocation and processing behavior are more predictable. | With dynamic scaling disabled, memory utilization is higher because resources are allocated statically, and memory use is less efficient for smaller workloads. |
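
For example, a text-heavy corpus of long documents might trade some retrieval latency for more context per chunk while skipping image-oriented extraction. This is a minimal sketch; the values are illustrative starting points, not tested recommendations:

```bash
# Illustrative overrides for a text-heavy corpus of long documents.
export APP_NVINGEST_CHUNKSIZE=1024        # larger chunks keep more context (default 512)
export APP_NVINGEST_CHUNKOVERLAP=200      # smoother transitions between chunks (default 150)

# Text-only corpus: skip image-oriented extraction to reduce ingestion time.
export APP_NVINGEST_EXTRACTIMAGES=false
export APP_NVINGEST_EXTRACTINFOGRAPHICS=false
```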

## Retrieval and Generation

| Name | Default | Description | Advantages | Disadvantages |
|------|---------|-------------|------------|---------------|
| `APP_LLM_MODELNAME`, `APP_EMBEDDINGS_MODELNAME`, `APP_RANKING_MODELNAME` | See description | The default models are nvidia/llama-3.3-nemotron-super-49b-v1.5, nvidia/llama-3.2-nv-embedqa-1b-v2, and nvidia/llama-3.2-nv-rerankqa-1b-v2. You can use larger models. For details, refer to Change the Inference or Embedding Model. | Higher accuracy with better reasoning and a larger context length. | Slower response time. Higher inference cost. Higher GPU requirements. |
| `APP_VECTORSTORE_SEARCHTYPE` | dense | Set to hybrid to enable hybrid search. For details, refer to Hybrid Search Support. | Can provide better retrieval accuracy for domain-specific content. | Can increase latency for large numbers of documents. |
| `ENABLE_GUARDRAILS` | false | Set to true to enable NeMo Guardrails. For details, refer to NeMo Guardrails Support. | Applies input/output constraints for better safety and consistency. | Significantly increases processing overhead due to additional LLM calls. Needs additional GPUs to deploy guardrails-specific models locally. |
| `ENABLE_QUERYREWRITER` | false | Set to true to enable query rewriting. For details, refer to Query Rewriting Support. | Enhances retrieval accuracy in multi-turn scenarios by rephrasing the query. | Adds an extra LLM call, increasing latency. |
| `ENABLE_REFLECTION` | false | Set to true to enable self-reflection. For details, refer to Self-Reflection Support. | Can improve response quality by refining intermediate retrieval results and the final LLM output. | Significantly higher latency due to multiple LLM calls per query. You might need to deploy a separate judge LLM, increasing GPU requirements. |
| `ENABLE_RERANKER` | true | Set to true to use the reranking model. | Improves accuracy by selecting better documents for response generation. | Increases latency due to additional processing. Additional hardware requirements for self-hosted, on-premises deployments. |
| `ENABLE_VLM_INFERENCE` | false | Set to true to use a vision-language model (VLM) for response generation. For details, refer to VLM for Generation. | Enables analysis of retrieved images alongside text for richer, multimodal responses. Can process up to four images per citation. Useful for document Q&A, visual search, and multimodal chatbots. | Requires additional GPU resources for VLM model deployment. Increases latency due to image processing. |
| Reasoning in llama-3.3-nemotron-super-49b-v1.5 | `/no_think` | Use `/think` to enable reasoning. For details, refer to Enable Reasoning. | Improves response quality through enhanced reasoning capabilities. Yields more precise responses; the default model is verbose and works best with reasoning enabled. | Can increase response latency due to the additional thinking process. Can increase token usage and computational overhead. |
| `RERANKER_CONFIDENCE_THRESHOLD` | 0.0 | Filters out retrieved chunks whose reranker relevance score is lower than this threshold. We recommend a value between 0.3 and 0.5 to balance quality and coverage. For details, refer to Use the Python Package. | Faster retrieval by processing fewer documents. Can improve accuracy by excluding low-relevance documents. | Requires `ENABLE_RERANKER` set to true to take effect. Might filter out too many chunks if the threshold is set too high, causing no response from the RAG server. |
| Reranker top-k | 10 | Increase the reranker top-k to raise the probability that relevant context appears among the top-k contexts. | Increasing the value can improve accuracy. | Increasing the value can increase latency. |
| VDB top-k | 100 | Increase the VDB top-k to provide a larger candidate pool for reranking. | Increasing the value can improve accuracy. | Increasing the value can increase latency. |
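
As an example of combining these settings, the following sketch biases the pipeline toward accuracy at the cost of latency. The threshold is one illustrative point in the recommended 0.3 to 0.5 band, not a prescribed value:

```bash
# Illustrative accuracy-oriented retrieval configuration.
export ENABLE_RERANKER=true                  # required for threshold filtering (default true)
export RERANKER_CONFIDENCE_THRESHOLD=0.4     # one point in the recommended 0.3-0.5 band
export APP_VECTORSTORE_SEARCHTYPE=hybrid     # better accuracy for domain-specific content
export ENABLE_QUERYREWRITER=true             # helps multi-turn chat; adds an extra LLM call
```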

## Advanced Ingestion Batch Mode Optimization

By default, the ingestion server processes files in parallel batches, distributing the workload across multiple workers for efficient ingestion. This parallel-processing architecture optimizes throughput while managing system resources effectively. Use the following environment variables to configure batch-processing behavior.

> **Caution:** These are not "set it and forget it" settings. They require trial-and-error tuning for optimal performance.

| Name | Default | Description | Advantages | Disadvantages |
|------|---------|-------------|------------|---------------|
| `NV_INGEST_CONCURRENT_BATCHES` | 4 | Controls the number of parallel batch-processing streams. | You can increase this value on systems with high memory capacity. | Higher values require more system memory. Requires careful tuning based on available system resources. |
| `NV_INGEST_FILES_PER_BATCH` | 16 | Controls how many files are processed in a single batch during ingestion. | Adjusting this value helps optimize memory usage and processing efficiency. | Setting this too high can cause memory pressure. Setting this too low can reduce throughput. |

> **Tip:** For optimal resource utilization, `NV_INGEST_CONCURRENT_BATCHES` times `NV_INGEST_FILES_PER_BATCH` should approximately equal `MAX_INGEST_PROCESS_WORKERS`.
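
A minimal sketch of a configuration that preserves this relationship, using the defaults for both batch variables; whether you size the worker pool through an environment variable and the worker count shown are assumptions for illustration:

```bash
# Keep: NV_INGEST_CONCURRENT_BATCHES * NV_INGEST_FILES_PER_BATCH ~= MAX_INGEST_PROCESS_WORKERS
export NV_INGEST_CONCURRENT_BATCHES=4    # default
export NV_INGEST_FILES_PER_BATCH=16      # default
export MAX_INGEST_PROCESS_WORKERS=64     # 4 * 16 = 64; illustrative sizing
```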