Multi-Turn Conversation Support for NVIDIA RAG Blueprint#

The NVIDIA RAG Blueprint supports multi-turn conversations through two configuration options:

  1. CONVERSATION_HISTORY: Controls how many conversation turns are passed to the LLM for response generation

  2. Query Processing: Either query rewriting (ENABLE_QUERYREWRITER) or simple history concatenation (MULTITURN_RETRIEVER_SIMPLE)

Important

For multi-turn conversations to work, you must set CONVERSATION_HISTORY > 0 (e.g., 3-5 conversation turns).

Additionally, enable either:

  • ENABLE_QUERYREWRITER=True (recommended for best accuracy), OR

  • MULTITURN_RETRIEVER_SIMPLE=True (for lower latency)

Without these settings, each query is processed independently without conversational context.
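
For example, the minimal environment configuration for multi-turn support with query rewriting (the same values used in the Docker deployment steps later on this page) is:

export ENABLE_QUERYREWRITER="True"
export CONVERSATION_HISTORY="5"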

How Multi-Turn Conversations Work#

Generation Stage (CONVERSATION_HISTORY)#

CONVERSATION_HISTORY determines the number of conversation turns (user-assistant pairs) passed to the LLM when generating responses. This provides the LLM with context from previous exchanges.

Default: 0 (no conversation history)

Example:

CONVERSATION_HISTORY=2

This passes the last 2 conversation turns (4 messages: 2 user + 2 assistant) to the LLM, providing context from recent exchanges.
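
For illustration, suppose a hypothetical conversation already contains three complete turns when a new question arrives. With CONVERSATION_HISTORY=2, only the two most recent turns accompany the new question:

Turn 1: "What is NVIDIA?" + assistant reply → dropped
Turn 2: "Tell me about their GPUs" + assistant reply → passed to the LLM
Turn 3: "Which GPU is best for inference?" + assistant reply → passed to the LLM
New question: "How much memory does it have?" → current query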

Retrieval Stage#

The retrieval stage supports two approaches:

Option 1: Query Rewriting (ENABLE_QUERYREWRITER)#

Query rewriting makes an additional LLM call to decontextualize the incoming question before sending it to the retrieval pipeline, enabling higher accuracy for multi-turn queries.

Default: False (disabled)

How it works:

  • Uses an LLM to reformulate the user’s query based on conversation context

  • Creates a standalone, context-aware query that doesn’t require history

  • Provides best retrieval accuracy for multi-turn conversations

  • Adds latency due to additional LLM call
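
For illustration, a hypothetical rewrite (the exact wording depends on the rewriter model and prompt):

User Turn 1: "What is NVIDIA?"
User Turn 2: "Tell me about their GPUs"
Rewritten query sent to retrieval: "Tell me about NVIDIA's GPUs"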

Warning

If you enable query rewriting (ENABLE_QUERYREWRITER=True) but keep CONVERSATION_HISTORY=0, query rewriting will be skipped with a warning.

Option 2: Simple History Concatenation (MULTITURN_RETRIEVER_SIMPLE)#

When MULTITURN_RETRIEVER_SIMPLE is enabled, previous user queries from the conversation are concatenated with the current query before retrieving documents from the vector database.

Default: False (disabled)

Example:

User Turn 1: "What is NVIDIA?"
User Turn 2: "Tell me about their GPUs"

  • When disabled (False): Only “Tell me about their GPUs” is used for retrieval

  • When enabled (True): “What is NVIDIA?. Tell me about their GPUs” is used for retrieval

How it works:

  • Concatenates previous user queries with the current query using a “. ” separator

  • Lower latency (no additional LLM call)

  • May be less accurate than query rewriting for complex conversational references

Note

MULTITURN_RETRIEVER_SIMPLE only applies when query rewriting is disabled. If ENABLE_QUERYREWRITER is True, query rewriting takes precedence.

API Usage#

The RAG server exposes an OpenAI-compatible API for providing custom conversation history. For full details, see API - RAG Server Schema.

Use the /generate endpoint to generate responses with custom conversation history.

Required Parameters#

| Parameter | Description | Type |
|---|---|---|
| messages | A sequence of messages that form a conversation history. Each message contains a role field (user, assistant, or system) and a content field. | Array |
| use_knowledge_base | true to use a knowledge base; otherwise false. | Boolean |

Example API Payload#

{
    "messages": [
        {
            "role": "system",
            "content": "You are an assistant that provides information about FastAPI."
        },
        {
            "role": "user",
            "content": "What is FastAPI?"
        },
        {
            "role": "assistant",
            "content": "FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints."
        },
        {
            "role": "user",
            "content": "What are the key features of FastAPI?"
        }
    ],
    "use_knowledge_base": true
}
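
As a minimal sketch, this payload can be posted with curl. The host, port, and endpoint path below (http://localhost:8081/generate) are assumptions; adjust them to match how your rag-server is exposed in your deployment.

curl -X POST "http://localhost:8081/generate" \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "What is FastAPI?"},
          {"role": "assistant", "content": "FastAPI is a modern, fast (high-performance) web framework for building APIs with Python."},
          {"role": "user", "content": "What are the key features of FastAPI?"}
        ],
        "use_knowledge_base": true
      }'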

For hands-on examples, refer to the retriever API usage notebook.

Multi-Turn Conversation Strategies#

This section describes the different strategies available for enabling multi-turn query handling in the pipeline.

Strategy 1: Query Rewriting (Recommended)#

Configuration:

ENABLE_QUERYREWRITER="True"
CONVERSATION_HISTORY="5"

When to use:

  • You need the best retrieval accuracy for multi-turn conversations

  • Queries contain complex references to previous turns

  • The latency of an additional LLM call is acceptable

Strategy 2: Simple History Concatenation#

Configuration:

MULTITURN_RETRIEVER_SIMPLE="True"
CONVERSATION_HISTORY="5"

When to use:

  • You need multi-turn support with lower latency

  • Queries have simple references to previous turns

  • Query rewriting adds too much latency for your use case

Strategy 3: Single-Turn Mode (No History)#

Configuration:

CONVERSATION_HISTORY="0"

When to use:

  • This is the default setting

  • Queries are independent and don’t reference previous turns

  • Minimizing token usage and latency is critical

  • Building a Q&A system without conversational memory

Docker Deployment#

Prerequisites#

Follow the deployment guide for Self-Hosted Models or NVIDIA-Hosted Models.

Enable Query Rewriting with Cloud-Hosted Model#

  1. Configure for cloud-hosted model:

    export APP_QUERYREWRITER_SERVERURL=""
    export ENABLE_QUERYREWRITER="True"
    export CONVERSATION_HISTORY="5"
    docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
    

Tip

For externally hosted LLM models, customize the endpoint and model name:

export APP_QUERYREWRITER_SERVERURL="<llm_nim_http_endpoint_url>"
export APP_QUERYREWRITER_MODELNAME="<model_name>"

Enable Simple History Concatenation#

export MULTITURN_RETRIEVER_SIMPLE="True"
export CONVERSATION_HISTORY="5"
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d

Disable All Multi-Turn Features (Single-Turn Mode)#

export CONVERSATION_HISTORY="0"
export MULTITURN_RETRIEVER_SIMPLE="False"
export ENABLE_QUERYREWRITER="False"
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d

Helm Deployment#

For details on Helm deployment, see Deploy with Helm.

Enable Query Rewriting with On-Prem Model (Recommended)#

Note

Only on-prem deployment of the LLM is supported for Helm. The model must be deployed separately using the NIM LLM Helm chart.

  1. Modify values.yaml to enable query rewriting:

    # Environment variables for rag-server
    envVars:
      # ... existing configurations ...
      
      # === Query Rewriter Model specific configurations ===
      APP_QUERYREWRITER_MODELNAME: "nvidia/llama-3.3-nemotron-super-49b-v1.5"
      APP_QUERYREWRITER_SERVERURL: "nim-llm:8000"  # Fully qualified service name
      ENABLE_QUERYREWRITER: "True"
      CONVERSATION_HISTORY: "5"
    
  2. Deploy or upgrade the chart:

    After modifying values.yaml, apply the changes as described in Change a Deployment.

    For detailed Helm deployment instructions, see Helm Deployment Guide.

Enable Simple History Concatenation#

  1. Modify values.yaml to enable simple history concatenation:

    # Environment variables for rag-server
    envVars:
      # ... existing configurations ...
      
      # === Simple Multi-Turn (History Concatenation) ===
      MULTITURN_RETRIEVER_SIMPLE: "True"
      CONVERSATION_HISTORY: "5"
    
  2. Upgrade the deployment:

    After modifying values.yaml, apply the changes as described in Change a Deployment.

    For detailed Helm deployment instructions, see Helm Deployment Guide.

Configuration Summary#

| Environment Variable | Stage | Default | Required For | Description |
|---|---|---|---|---|
| CONVERSATION_HISTORY | Generation | 0 | All multi-turn features | Number of conversation turns to pass to the LLM (0 = no history) |
| ENABLE_QUERYREWRITER | Retrieval | False | Advanced multi-turn | Enable AI-powered query rewriting for better retrieval accuracy |
| MULTITURN_RETRIEVER_SIMPLE | Retrieval | False | Simple multi-turn | Concatenate conversation history with the current query for document retrieval |
| APP_QUERYREWRITER_SERVERURL | Retrieval | - | Query rewriting | Server URL for the query rewriter model (empty string for cloud-hosted) |
| APP_QUERYREWRITER_MODELNAME | Retrieval | - | Query rewriting | Model name for the query rewriter |