Multi-Turn Conversation Support for NVIDIA RAG Blueprint#
The NVIDIA RAG Blueprint supports multi-turn conversations through two configuration options:

- CONVERSATION_HISTORY: Controls how many conversation turns are passed to the LLM for response generation
- Query processing: Either query rewriting (ENABLE_QUERYREWRITER) or simple history concatenation (MULTITURN_RETRIEVER_SIMPLE)
Important
For multi-turn conversations to work, you must set CONVERSATION_HISTORY > 0 (e.g., 3-5 conversation turns).
Additionally, enable either:

- ENABLE_QUERYREWRITER=True (recommended for best accuracy), or
- MULTITURN_RETRIEVER_SIMPLE=True (for lower latency)
Without these settings, each query is processed independently without conversational context.
How Multi-Turn Conversations Work#
Generation Stage (CONVERSATION_HISTORY)#
CONVERSATION_HISTORY determines the number of conversation turns (user-assistant pairs) passed to the LLM when generating responses. This provides the LLM with context from previous exchanges.
Default: 0 (no conversation history)
Example:
CONVERSATION_HISTORY=2
This passes the last 2 conversation turns (4 messages: 2 user + 2 assistant) to the LLM, providing context from recent exchanges.
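As a rough illustration (not the blueprint's actual implementation), trimming a message list to the last N turns can be sketched as follows; the `trim_history` helper and message layout are assumptions for the example:

```python
# Illustrative sketch: keep only the last `conversation_history`
# user-assistant turns when building the LLM prompt.
def trim_history(messages, conversation_history):
    """messages: list of {"role": ..., "content": ...} dicts, oldest first."""
    if conversation_history <= 0:
        return []  # CONVERSATION_HISTORY=0 means no history at all
    # Each turn is one user message plus one assistant reply.
    return messages[-2 * conversation_history:]

history = [
    {"role": "user", "content": "What is NVIDIA?"},
    {"role": "assistant", "content": "NVIDIA designs GPUs."},
    {"role": "user", "content": "Tell me about their GPUs"},
    {"role": "assistant", "content": "They make data center and gaming GPUs."},
    {"role": "user", "content": "Which one is newest?"},
    {"role": "assistant", "content": "The latest generation."},
]

# CONVERSATION_HISTORY=2 keeps the last 2 turns, i.e. 4 messages.
recent = trim_history(history, 2)
```

With a longer conversation, only the most recent turns reach the LLM, which bounds token usage regardless of conversation length.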
Retrieval Stage#
The retrieval stage supports two approaches:
Option 1: Query Rewriting (ENABLE_QUERYREWRITER)#
Query rewriting makes an additional LLM call to decontextualize the incoming question before sending it to the retrieval pipeline, enabling higher accuracy for multi-turn queries.
Default: False (disabled)
How it works:
Uses an LLM to reformulate the user’s query based on conversation context
Creates a standalone, context-aware query that doesn’t require history
Provides best retrieval accuracy for multi-turn conversations
Adds latency due to additional LLM call
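The rewriting step can be pictured with a minimal sketch. The blueprint's actual prompt and LLM client are internal, so the `build_rewrite_prompt` helper below is hypothetical; it only shows the shape of the decontextualization call:

```python
# Hypothetical sketch of the query-rewriting step: fold the conversation
# into a prompt that asks an LLM for a standalone query.
def build_rewrite_prompt(history, current_query):
    turns = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    return (
        "Rewrite the final user question as a standalone query that "
        "requires no conversation context.\n\n"
        f"Conversation:\n{turns}\n\n"
        f"Question: {current_query}\n"
        "Standalone query:"
    )

prompt = build_rewrite_prompt(
    [
        {"role": "user", "content": "What is NVIDIA?"},
        {"role": "assistant", "content": "NVIDIA designs GPUs."},
    ],
    "Tell me about their GPUs",
)
# This prompt is sent in an extra LLM call; the rewritten query
# (e.g. something like "Tell me about NVIDIA's GPUs") is then used
# for retrieval instead of the raw follow-up question.
```

That extra LLM round trip is the source of the added latency noted above.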
Warning
If you enable query rewriting (ENABLE_QUERYREWRITER=True) but keep CONVERSATION_HISTORY=0, query rewriting will be skipped with a warning.
Option 2: Simple History Concatenation (MULTITURN_RETRIEVER_SIMPLE)#
When MULTITURN_RETRIEVER_SIMPLE is enabled, previous user queries from the conversation are concatenated with the current query before retrieving documents from the vector database.
Default: False (disabled)
Example:
User Turn 1: "What is NVIDIA?"
User Turn 2: "Tell me about their GPUs"
When disabled (False): Only “Tell me about their GPUs” is used for retrieval
When enabled (True): “What is NVIDIA?. Tell me about their GPUs” is used for retrieval
How it works:
Concatenates previous user queries with the current query using a ". " separator
Lower latency (no additional LLM call)
May be less accurate than query rewriting for complex conversational references
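The concatenation behavior can be sketched in a few lines (illustrative only, assuming the ". " separator described above; `concat_user_queries` is not a blueprint function):

```python
# Illustrative sketch of simple history concatenation: join prior user
# queries with the current one using the ". " separator before retrieval.
def concat_user_queries(previous_user_queries, current_query):
    return ". ".join(previous_user_queries + [current_query])

# Mirrors the two-turn example above.
query = concat_user_queries(["What is NVIDIA?"], "Tell me about their GPUs")
# query == "What is NVIDIA?. Tell me about their GPUs"
```

Because this is plain string joining, there is no extra model call, which is why this option has lower latency than query rewriting.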
Note
MULTITURN_RETRIEVER_SIMPLE only applies when query rewriting is disabled. If ENABLE_QUERYREWRITER is True, query rewriting takes precedence.
API Usage#
The RAG server exposes an OpenAI-compatible API for providing custom conversation history. For full details, see API - RAG Server Schema.
Use the /generate endpoint to generate responses with custom conversation history.
Required Parameters#
| Parameter | Description | Type |
|---|---|---|
| messages | A sequence of messages that form a conversation history. Each message contains a `role` and `content`. | Array |
| use_knowledge_base | Whether to retrieve context from the knowledge base when generating the response. | Boolean |
Example API Payload#
{
"messages": [
{
"role": "system",
"content": "You are an assistant that provides information about FastAPI."
},
{
"role": "user",
"content": "What is FastAPI?"
},
{
"role": "assistant",
"content": "FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints."
},
{
"role": "user",
"content": "What are the key features of FastAPI?"
}
],
"use_knowledge_base": true
}
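A client-side sketch of sending this payload follows. The server URL and port are assumptions; substitute the address of your rag-server deployment, and note that the actual HTTP call is left commented out so the snippet stands alone:

```python
import json

# Assumed server address; adjust for your deployment.
RAG_SERVER_URL = "http://localhost:8081"

# Same conversation as the example payload above.
payload = {
    "messages": [
        {"role": "system", "content": "You are an assistant that provides information about FastAPI."},
        {"role": "user", "content": "What is FastAPI?"},
        {"role": "assistant", "content": "FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints."},
        {"role": "user", "content": "What are the key features of FastAPI?"},
    ],
    "use_knowledge_base": True,
}

body = json.dumps(payload)
# Uncomment to send (requires the `requests` package and a running server):
# import requests
# resp = requests.post(f"{RAG_SERVER_URL}/generate", data=body,
#                      headers={"Content-Type": "application/json"})
```

The final user message is the active query; earlier messages supply the conversational context that CONVERSATION_HISTORY and the retrieval options act on.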
For hands-on examples, refer to the retriever API usage notebook.
Multi-Turn Conversation Strategies#
This section describes the strategies available for enabling multi-turn query handling in the pipeline.
Strategy 1: Query Rewriting (Recommended for Best Accuracy)#
Configuration:
ENABLE_QUERYREWRITER="True"
CONVERSATION_HISTORY="5"
When to use:
Accuracy is the highest priority
User queries frequently reference previous conversation turns
You can tolerate additional latency for better results
Strategy 2: Simple History Concatenation#
Configuration:
MULTITURN_RETRIEVER_SIMPLE="True"
CONVERSATION_HISTORY="5"
When to use:
You need multi-turn support with lower latency
Queries have simple references to previous turns
Query rewriting adds too much latency for your use case
Strategy 3: Single-Turn Mode (No History)#
Configuration:
CONVERSATION_HISTORY="0"
When to use:
This is the default setting
Queries are independent and don’t reference previous turns
Minimizing token usage and latency is critical
Building a Q&A system without conversational memory
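For reference, the three strategies above map to environment settings as follows. The helper is a convenience sketch, not part of the blueprint; the variable names and values come from this guide:

```python
# Illustrative mapping from strategy name to the environment settings
# described in this guide.
def strategy_env(strategy):
    if strategy == "query_rewriting":
        return {"ENABLE_QUERYREWRITER": "True", "CONVERSATION_HISTORY": "5"}
    if strategy == "simple_concatenation":
        return {"MULTITURN_RETRIEVER_SIMPLE": "True", "CONVERSATION_HISTORY": "5"}
    if strategy == "single_turn":
        return {"CONVERSATION_HISTORY": "0"}
    raise ValueError(f"unknown strategy: {strategy}")
```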
Docker Deployment#
Prerequisites#
Follow the deployment guide for Self-Hosted Models or NVIDIA-Hosted Models.
Enable Query Rewriting with On-Prem Model (Recommended)#
Verify the nim-llm container is healthy:
docker ps --filter "name=nim-llm" --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
Example Output:
NAMES       STATUS
nim-llm     Up 38 minutes (healthy)
Enable query rewriting:
export APP_QUERYREWRITER_SERVERURL="nim-llm:8000"
export ENABLE_QUERYREWRITER="True"
export CONVERSATION_HISTORY="5"
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
Tip
You can enable query rewriting at runtime by setting enable_query_rewriting: True in the POST /generate API schema without relaunching containers. Refer to the retrieval notebook. Note that CONVERSATION_HISTORY must still be > 0.
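Building on the tip above, a per-request override body might look like the following sketch (the `enable_query_rewriting` field is the one named in the tip; the messages are illustrative):

```python
import json

# Per-request override sketch: enable query rewriting in the POST /generate
# body instead of via environment variables. CONVERSATION_HISTORY on the
# server must still be > 0 for this to take effect.
payload = {
    "messages": [
        {"role": "user", "content": "What is NVIDIA?"},
        {"role": "assistant", "content": "NVIDIA designs GPUs."},
        {"role": "user", "content": "Tell me about their GPUs"},
    ],
    "use_knowledge_base": True,
    "enable_query_rewriting": True,
}

body = json.dumps(payload, indent=2)
```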
Enable Query Rewriting with Cloud-Hosted Model#
Configure for cloud-hosted model:
export APP_QUERYREWRITER_SERVERURL=""
export ENABLE_QUERYREWRITER="True"
export CONVERSATION_HISTORY="5"
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
Tip
For externally hosted LLM models, customize the endpoint and model name:
export APP_QUERYREWRITER_SERVERURL="<llm_nim_http_endpoint_url>"
export APP_QUERYREWRITER_MODELNAME="<model_name>"
Enable Simple History Concatenation#
export MULTITURN_RETRIEVER_SIMPLE="True"
export CONVERSATION_HISTORY="5"
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
Disable All Multi-Turn Features (Single-Turn Mode)#
export CONVERSATION_HISTORY="0"
export MULTITURN_RETRIEVER_SIMPLE="False"
export ENABLE_QUERYREWRITER="False"
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
Helm Deployment#
For details on Helm deployment, see Deploy with Helm.
Enable Query Rewriting with On-Prem Model (Recommended)#
Note
Only on-prem deployment of the LLM is supported for Helm. The model must be deployed separately using the NIM LLM Helm chart.
Modify values.yaml to enable query rewriting:

```yaml
# Environment variables for rag-server
envVars:
  # ... existing configurations ...

  # === Query Rewriter Model specific configurations ===
  APP_QUERYREWRITER_MODELNAME: "nvidia/llama-3.3-nemotron-super-49b-v1.5"
  APP_QUERYREWRITER_SERVERURL: "nim-llm:8000"  # Fully qualified service name
  ENABLE_QUERYREWRITER: "True"
  CONVERSATION_HISTORY: "5"
```
Deploy or upgrade the chart:
After modifying values.yaml, apply the changes as described in Change a Deployment. For detailed Helm deployment instructions, see Helm Deployment Guide.
Enable Simple History Concatenation#
Modify values.yaml to enable simple history concatenation:

```yaml
# Environment variables for rag-server
envVars:
  # ... existing configurations ...

  # === Simple Multi-Turn (History Concatenation) ===
  MULTITURN_RETRIEVER_SIMPLE: "True"
  CONVERSATION_HISTORY: "5"
```
Upgrade the deployment:
After modifying values.yaml, apply the changes as described in Change a Deployment. For detailed Helm deployment instructions, see Helm Deployment Guide.
Configuration Summary#
| Environment Variable | Stage | Default | Required For | Description |
|---|---|---|---|---|
| CONVERSATION_HISTORY | Generation | 0 | All multi-turn features | Number of conversation turns to pass to LLM (0 = no history) |
| ENABLE_QUERYREWRITER | Retrieval | False | Advanced multi-turn | Enable AI-powered query rewriting for better retrieval accuracy |
| MULTITURN_RETRIEVER_SIMPLE | Retrieval | False | Simple multi-turn | Concatenate conversation history with current query for document retrieval |
| APP_QUERYREWRITER_SERVERURL | Retrieval | - | Query rewriting | Server URL for query rewriter model (empty string for cloud-hosted) |
| APP_QUERYREWRITER_MODELNAME | Retrieval | - | Query rewriting | Model name for query rewriter |