# Best Practices for Common NVIDIA RAG Blueprint Settings

Use this documentation to learn how to tune the performance of the NVIDIA RAG Blueprint for your specific use case. The default values balance accuracy and performance; change a setting only if you want different behavior.
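
Most of the settings on this page are environment variables that you override before starting the services. The following is a minimal sketch assuming a Docker Compose deployment of the blueprint; the compose file path is illustrative and depends on your deployment:

```bash
# Override a default before starting the services (value is illustrative).
export APP_NVINGEST_CHUNKSIZE=1024   # default is 512

# The compose file path below is illustrative; use the one from your deployment.
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d
```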

## Ingestion and Chunking

| Name | Default | Description | Advantages | Disadvantages |
|------|---------|-------------|------------|---------------|
| `APP_NVINGEST_CHUNKOVERLAP` | 150 | Increase the overlap to ensure smooth transitions between chunks. | Larger overlap provides smoother transitions between chunks. | Might increase processing overhead. |
| `APP_NVINGEST_CHUNKSIZE` | 512 | Increase the chunk size for more context. | Larger chunks retain more context, improving coherence. | Larger chunks increase embedding size, slowing retrieval. Longer chunks might also increase latency due to larger prompt size. |
| `APP_NVINGEST_ENABLEPDFSPLITTER` | true | Set to true to perform chunk-based splitting of PDFs after the default page-level extraction. Recommended for PDFs that are mostly text content. | Provides more granular content segmentation. | Can increase the number of chunks and slow down ingestion. |
| `APP_NVINGEST_EXTRACTCHARTS` | true | Set to true to extract charts. | Improves accuracy for documents that contain charts. | Increases ingestion time. |
| `APP_NVINGEST_EXTRACTIMAGES` | false | Set to true to enable image captioning during ingestion. For details, refer to Image Captioning Support. | Enhances multimodal retrieval accuracy for documents that contain images. | Increases processing time during ingestion. Requires additional GPU resources for VLM model deployment. |
| `APP_NVINGEST_EXTRACTINFOGRAPHICS` | false | Set to true to extract infographics and text rendered as images. | Improves accuracy for documents that contain text in image format. | Increases ingestion time. |
| `APP_NVINGEST_EXTRACTTABLES` | true | Set to true to extract tables. | Improves accuracy for documents that contain tables. | Increases ingestion time. |
| `APP_NVINGEST_PDFEXTRACTMETHOD` | pdfium | Set to `nemoretriever_parse` to extract PDFs with Nemoretriever Parse. For details, refer to PDF Extraction with Nemoretriever Parse. | Provides enhanced PDF parsing and structure understanding. Better extraction of complex PDF layouts and content. | Requires additional GPU resources for the Nemoretriever Parse service. Supports only PDF documents. Not supported on NVIDIA B200 GPUs. |
| `APP_NVINGEST_SEGMENTAUDIO` | false | Set to true to enable audio segmentation. For details, refer to Audio Ingestion Support. | Segments audio files at commas and other punctuation marks for more granular audio chunks. Improves downstream processing and retrieval accuracy for audio content. | Might increase processing time during audio ingestion. |
| `INGEST_DISABLE_DYNAMIC_SCALING` | true | Set to true to disable dynamic scaling. | With dynamic scaling disabled, ingestion performance and throughput improve, and resource allocation and processing behavior are more predictable. | With dynamic scaling disabled, memory utilization is higher because resources are allocated statically, and memory use is less efficient for smaller workloads. |
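
For example, a text-heavy corpus of long documents might trade some retrieval latency for more context per chunk while skipping image-oriented extraction. This is a minimal sketch; the values are illustrative starting points, not tested recommendations:

```bash
# Illustrative overrides for a text-heavy corpus of long documents.
export APP_NVINGEST_CHUNKSIZE=1024        # larger chunks keep more context (default 512)
export APP_NVINGEST_CHUNKOVERLAP=200      # smoother transitions between chunks (default 150)

# Text-only corpus: skip image-oriented extraction to reduce ingestion time.
export APP_NVINGEST_EXTRACTIMAGES=false
export APP_NVINGEST_EXTRACTINFOGRAPHICS=false
```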

## Retrieval and Generation

| Name | Default | Description | Advantages | Disadvantages |
|------|---------|-------------|------------|---------------|
| `APP_LLM_MODELNAME`, `APP_EMBEDDINGS_MODELNAME`, `APP_RANKING_MODELNAME` | See description | The default models are nvidia/llama-3.3-nemotron-super-49b-v1.5, nvidia/llama-3.2-nv-embedqa-1b-v2, and nvidia/llama-3.2-nv-rerankqa-1b-v2. You can use larger models. For details, refer to Change the Inference or Embedding Model. | Higher accuracy with better reasoning and a larger context length. | Slower response time. Higher inference cost. Higher GPU requirements. |
| `APP_VECTORSTORE_SEARCHTYPE` | dense | Set to hybrid to enable hybrid search. For details, refer to Hybrid Search Support. | Can provide better retrieval accuracy for domain-specific content. | Can increase latency for large numbers of documents. |
| `ENABLE_GUARDRAILS` | false | Set to true to enable NeMo Guardrails. For details, refer to NeMo Guardrails Support. | Applies input/output constraints for better safety and consistency. | Significantly increases processing overhead due to additional LLM calls. Needs additional GPUs to deploy guardrails-specific models locally. |
| `ENABLE_QUERYREWRITER` | false | Set to true to enable query rewriting. For details, refer to Query Rewriting Support. | Enhances retrieval accuracy in multi-turn scenarios by rephrasing the query. | Adds an extra LLM call, increasing latency. |
| `ENABLE_REFLECTION` | false | Set to true to enable self-reflection. For details, refer to Self-Reflection Support. | Can improve response quality by refining intermediate retrieval results and the final LLM output. | Significantly higher latency due to multiple LLM calls per query. You might need to deploy a separate judge LLM, increasing GPU requirements. |
| `ENABLE_RERANKER` | true | Set to true to use the reranking model. | Improves accuracy by selecting better documents for response generation. | Increases latency due to additional processing. Additional hardware requirements for self-hosted, on-premises deployments. |
| `ENABLE_VLM_INFERENCE` | false | Set to true to use a vision-language model (VLM) for response generation. For details, refer to VLM for Generation. | Enables analysis of retrieved images alongside text for richer, multimodal responses. Can process up to four images per citation. Useful for document Q&A, visual search, and multimodal chatbots. | Requires additional GPU resources for VLM model deployment. Increases latency due to image processing. |
| Reasoning in llama-3.3-nemotron-super-49b-v1.5 | `/no_think` | Use `/think` to enable reasoning. For details, refer to Enable Reasoning. | Improves response quality through enhanced reasoning capabilities. Yields more precise responses; the default model is verbose and works best with reasoning enabled. | Can increase response latency due to the additional thinking process. Can increase token usage and computational overhead. |
| `RERANKER_CONFIDENCE_THRESHOLD` | 0.0 | Filters out retrieved chunks whose reranker relevance score is lower than this threshold. We recommend a value between 0.3 and 0.5 to balance quality and coverage. For details, refer to Use the Python Package. | Faster retrieval by processing fewer documents. Can improve accuracy by excluding low-relevance documents. | Requires `ENABLE_RERANKER` set to true to take effect. Might filter out too many chunks if the threshold is set too high, causing no response from the RAG server. |
| Reranker top-k | 10 | Increase the reranker top-k to raise the probability that relevant context appears among the top-k contexts. | Increasing the value can improve accuracy. | Increasing the value can increase latency. |
| VDB top-k | 100 | Increase the VDB top-k to provide a larger candidate pool for reranking. | Increasing the value can improve accuracy. | Increasing the value can increase latency. |
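
As an example of combining these settings, the following sketch biases the pipeline toward accuracy at the cost of latency. The threshold is one illustrative point in the recommended 0.3 to 0.5 band, not a prescribed value:

```bash
# Illustrative accuracy-oriented retrieval configuration.
export ENABLE_RERANKER=true                  # required for threshold filtering (default true)
export RERANKER_CONFIDENCE_THRESHOLD=0.4     # one point in the recommended 0.3-0.5 band
export APP_VECTORSTORE_SEARCHTYPE=hybrid     # better accuracy for domain-specific content
export ENABLE_QUERYREWRITER=true             # helps multi-turn chat; adds an extra LLM call
```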

## Advanced Ingestion Batch Mode Optimization

By default, the ingestion server processes files in parallel batches, distributing the workload across multiple workers for efficient ingestion. This parallel-processing architecture optimizes throughput while managing system resources effectively. Use the following environment variables to configure batch-processing behavior.

> **Caution:** These are not "set it and forget it" settings. They require trial-and-error tuning for optimal performance.

| Name | Default | Description | Advantages | Disadvantages |
|------|---------|-------------|------------|---------------|
| `NV_INGEST_CONCURRENT_BATCHES` | 4 | Controls the number of parallel batch-processing streams. | You can increase this value on systems with high memory capacity. | Higher values require more system memory. Requires careful tuning based on available system resources. |
| `NV_INGEST_FILES_PER_BATCH` | 16 | Controls how many files are processed in a single batch during ingestion. | Adjusting this value helps optimize memory usage and processing efficiency. | Setting this too high can cause memory pressure. Setting this too low can reduce throughput. |

> **Tip:** For optimal resource utilization, `NV_INGEST_CONCURRENT_BATCHES` times `NV_INGEST_FILES_PER_BATCH` should approximately equal `MAX_INGEST_PROCESS_WORKERS`.
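
A minimal sketch of a configuration that preserves this relationship, using the defaults for both batch variables; whether you size the worker pool through an environment variable and the worker count shown are assumptions for illustration:

```bash
# Keep: NV_INGEST_CONCURRENT_BATCHES * NV_INGEST_FILES_PER_BATCH ~= MAX_INGEST_PROCESS_WORKERS
export NV_INGEST_CONCURRENT_BATCHES=4    # default
export NV_INGEST_FILES_PER_BATCH=16      # default
export MAX_INGEST_PROCESS_WORKERS=64     # 4 * 16 = 64; illustrative sizing
```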