# Best Practices for Common NVIDIA RAG Blueprint Settings
Use this documentation to learn how to tune the NVIDIA RAG Blueprint for your specific use case. The default values balance accuracy and performance; change a setting only if you want different behavior.
## Ingestion and Chunking

| Name | Default | Description | Advantages | Disadvantages |
|---|---|---|---|---|
| | | Increase overlap to ensure smooth transitions between chunks (see the sketch after this table). | Larger overlap provides smoother transitions between chunks. | Might increase processing overhead. |
| | | Increase chunk size to retain more context per chunk. | Larger chunks retain more context, improving coherence. | Larger chunks increase embedding size, slowing retrieval. |
| | | Enable for more granular content segmentation. | Provides more granular content segmentation. | Can increase the number of chunks and slow down the ingestion process. |
| | | Enable chart extraction during ingestion. | Improves accuracy for documents that contain charts. | Increases ingestion time. |
| | | Enable image extraction for multimodal retrieval. | Enhances multimodal retrieval accuracy for documents that contain images. | Increases processing time during ingestion. |
| | | Enable extraction of text that appears in image format. | Improves accuracy for documents that contain text in image format. | Increases ingestion time. |
| | | Enable table extraction during ingestion. | Improves accuracy for documents that contain tables. | Increases ingestion time. |
| | | Enable the Nemoretriever Parse service for PDF parsing. | Provides enhanced PDF parsing and structure understanding. | Requires additional GPU resources for the Nemoretriever Parse service. |
| | | Enable punctuation-based segmentation of audio files. | Segments audio files based on commas and other punctuation marks for more granular audio chunks. | Might increase processing time during audio ingestion. |
| | | Controls dynamic allocation of ingestion resources. | When disabled, provides better ingestion performance and throughput. | When disabled, higher memory utilization because resources are statically allocated. |
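The chunk size and overlap settings above trade context retention against indexing and retrieval cost. The following sketch is not the blueprint's actual splitter (which is token- and structure-aware); it is a minimal character-based sliding window that shows how the two values interact, with illustrative parameter names `chunk_size` and `chunk_overlap`.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks with a sliding-window overlap.

    Illustrative only: the blueprint's ingestion pipeline splits on tokens and
    document structure, but the size/overlap trade-off is the same.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# Larger chunk_size yields fewer, more coherent chunks (but slower retrieval);
# larger chunk_overlap yields smoother transitions but more chunks to embed.
print(len(chunk_text("word " * 2000, chunk_size=512, chunk_overlap=150)))
```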
## Retrieval and Generation

| Name | Default | Description | Advantages | Disadvantages |
|---|---|---|---|---|
| | See description | The default models are the following: | Higher accuracy with better reasoning and a larger context length. | Slower response time. |
| | | Enable for better retrieval accuracy on domain-specific content. | Can provide better retrieval accuracy for domain-specific content. | Can induce higher latency for a large number of documents. |
| | | Enable input and output guardrail checks. | Applies input/output constraints for better safety and consistency. | Significantly increased processing overhead from additional LLM calls. |
| | | Enable query rewriting for multi-turn conversations. | Enhances retrieval accuracy for multi-turn scenarios by rephrasing the query. | Adds an extra LLM call, increasing latency. |
| | | Enable iterative refinement of retrieval results and the generated response. | Can improve response quality by refining intermediate retrieval and the final LLM output. | Significantly higher latency due to multiple iterations of LLM calls. |
| | | Enable reranking of retrieved documents. | Improves accuracy by selecting better documents for response generation. | Increases latency due to additional processing. |
| | | Enable VLM-based analysis of retrieved images. | Enables analysis of retrieved images alongside text for richer, multimodal responses. | Requires additional GPU resources for VLM model deployment. |
| Reasoning in | | Enable the model's reasoning (thinking) mode. | Improves response quality through enhanced reasoning capabilities. | Can increase response latency due to the additional thinking process. |
| | | Filters out retrieved chunks whose reranker relevance score is lower than this threshold (see the sketch after this table). | Faster retrieval by processing fewer documents. | Requires reranker relevance scores; setting the threshold too high can filter out relevant chunks. |
| | 10 | Increase this value to pass more reranked chunks to response generation. | Increasing the value can improve accuracy. | Increasing the value can increase latency. |
| | 100 | Increase this value to retrieve more candidate chunks from the vector database. | Increasing the value can improve accuracy. | Increasing the value can increase latency. |
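Several of the settings above interact at query time: the vector database returns a wide candidate set, the reranker scores those candidates, a relevance threshold drops weak ones, and only the top-ranked chunks are passed to the LLM. The sketch below shows that flow in plain Python under assumed names; `vector_search`, `rerank`, `vdb_top_k`, `reranker_top_k`, and `score_threshold` are illustrative, not the blueprint's actual API.

```python
from typing import Callable

def retrieve_context(
    query: str,
    vector_search: Callable[[str, int], list[dict]],   # returns candidate chunks
    rerank: Callable[[str, list[dict]], list[float]],  # relevance score per chunk
    vdb_top_k: int = 100,          # candidates pulled from the vector database
    reranker_top_k: int = 10,      # chunks forwarded to response generation
    score_threshold: float = 0.0,  # drop chunks scored below this value
) -> list[dict]:
    """Two-stage retrieval: wide vector search, then rerank, filter, and truncate."""
    candidates = vector_search(query, vdb_top_k)
    scores = rerank(query, candidates)
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    kept = [chunk for score, chunk in ranked if score >= score_threshold]
    return kept[:reranker_top_k]
```

Raising the candidate count or the final chunk count improves recall at the cost of latency, while a higher score threshold trims weak context but risks discarding relevant chunks.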
## Advanced Ingestion Batch Mode Optimization
By default, the ingestion server processes files in parallel batches, distributing the workload across multiple workers for efficient ingestion. This parallel processing architecture helps optimize throughput while managing system resources effectively. You can use the following environment variables to configure the batch processing behavior.
Caution
These are not "set it and forget it" variables; they require trial-and-error tuning for optimal performance.
| Name | Default | Description | Advantages | Disadvantages |
|---|---|---|---|---|
| `NV_INGEST_CONCURRENT_BATCHES` | 4 | Controls the number of parallel batch processing streams. | You can increase this on systems with high memory capacity. | Higher values require more system memory. |
| `NV_INGEST_FILES_PER_BATCH` | 16 | Controls how many files are processed in a single batch during ingestion. | Adjusting this helps optimize memory usage and processing efficiency. | Setting this too high can cause memory pressure. |
Tip
For optimal resource utilization, `NV_INGEST_CONCURRENT_BATCHES` times `NV_INGEST_FILES_PER_BATCH` should approximately equal `MAX_INGEST_PROCESS_WORKERS`.
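As a quick sanity check on that guidance, you can read the three variables from the environment and compare the product with the worker count. The variable names come from this page; the helper function and the 0.5x-2x tolerance band are only an illustrative sketch, not part of the blueprint.

```python
import os

def check_ingest_batch_settings() -> None:
    """Warn when concurrent batches x files per batch drifts far from the worker count."""
    concurrent = int(os.environ.get("NV_INGEST_CONCURRENT_BATCHES", 4))
    per_batch = int(os.environ.get("NV_INGEST_FILES_PER_BATCH", 16))
    # Assumed fallback: treat the product itself as the worker count if unset.
    workers = int(os.environ.get("MAX_INGEST_PROCESS_WORKERS", concurrent * per_batch))

    in_flight = concurrent * per_batch
    ratio = in_flight / workers if workers else float("inf")
    print(f"files in flight: {in_flight}, ingest process workers: {workers}")
    if not 0.5 <= ratio <= 2.0:
        print("Consider retuning: the batch product should roughly match the worker count.")

check_ingest_batch_settings()
```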