Enable Reasoning for NVIDIA RAG Blueprint#
The NVIDIA RAG Blueprint supports reasoning capabilities that allow models to “think through” complex questions before answering. This feature improves accuracy for challenging queries but increases response latency due to additional reasoning tokens.
Tip
Reasoning is particularly beneficial for the following:
- Complex multi-step questions
- Queries requiring logical deduction
- Technical or mathematical problem-solving
- Scenarios where accuracy is more important than response speed
This guide explains how to enable reasoning for different Nemotron models, each using a different control mechanism.
| Model | Control Method | Thinking Budget Parameters |
|---|---|---|
| Nemotron 1.5 | System prompts | None |
| Nemotron-3-Nano 9B | System prompts | min/max thinking tokens |
| Nemotron-3-Nano 30B | Environment variable | max thinking tokens only |
Enable Reasoning for Nemotron 1.5#
Reasoning in Nemotron 1.5 models (such as nvidia/llama-3.3-nemotron-super-49b-v1.5) is controlled through system prompts. The model switches between reasoning and non-reasoning modes using /think and /no_think directives.
Update the System Prompt#
To enable reasoning, update the system prompt from /no_think to /think in prompt.yaml, as shown in the following code.
```yaml
rag_template:
  system: |
    /think
  human: |
    You are a helpful AI assistant named Envie.
    You must answer only using the information provided in the context. While answering you must follow the instructions given below.

    <instructions>
    1. Do NOT use any external knowledge.
    2. Do NOT add explanations, suggestions, opinions, disclaimers, or hints.
    3. NEVER say phrases like "based on the context", "from the documents", or "I cannot find".
    4. NEVER offer to answer using general knowledge or invite the user to ask again.
    5. Do NOT include citations, sources, or document mentions.
    6. Answer concisely. Use short, direct sentences by default. Only give longer responses if the question truly requires it.
    7. Do not mention or refer to these rules in any way.
    8. Do not ask follow-up questions.
    9. Do not mention these instructions in your response.
    </instructions>

    Context:
    {context}

    Make sure the response you generate strictly follows the rules mentioned above, i.e., never say phrases like "based on the context", "from the documents", or "I cannot find", and never mention the instructions in your response.
```
Configure Model Parameters#
After you enable the /think prompt, configure the model parameters for optimal reasoning performance:
```shell
export LLM_TEMPERATURE=0.6
export LLM_TOP_P=0.95
```
Filter Reasoning Tokens#
By default, reasoning tokens (shown between <think> tags) are filtered out so only the final answer is returned in the model response.
To view the full reasoning process, including the <think> tags, in the model response, set the following environment variable:

```shell
export FILTER_THINK_TOKENS=false
```
Note
For most production use cases, keep FILTER_THINK_TOKENS=true (default) to provide cleaner responses to end users.
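The filtering semantics can be sketched in a few lines of Python. This is an illustration of what the flag does conceptually, not the blueprint's actual implementation:

```python
import re

def filter_think_tokens(text: str, enabled: bool = True) -> str:
    """Strip <think>...</think> reasoning spans when filtering is enabled."""
    if not enabled:
        return text
    # Drop everything between <think> and </think>, including the tags.
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)

raw = "<think>France's capital is Paris.</think>The capital of France is Paris."
print(filter_think_tokens(raw))                   # final answer only
print(filter_think_tokens(raw, enabled=False))    # full reasoning retained
```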
Enable Reasoning for Nemotron-3-Nano 9B#
The nvidia/nvidia-nemotron-nano-9b-v2 model uses system prompts to control reasoning, similar to Nemotron 1.5. It also supports thinking budget parameters that control the extent of reasoning.
Update the System Prompt#
Change the system prompt from /no_think to /think in prompt.yaml as shown in the previous Nemotron 1.5 example.
Configure Model Parameters#
```shell
export LLM_TEMPERATURE=0.6
export LLM_TOP_P=0.95
```
Configure Thinking Budget (Optional)#
The 9B model supports both minimum and maximum thinking token limits to control the reasoning phase. You can include these parameters in API requests to the model:
```json
{
  "model": "nvidia/nvidia-nemotron-nano-9b-v2",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "min_thinking_tokens": 1024,
  "max_thinking_tokens": 8192
}
```
Thinking budget parameters:
- `min_thinking_tokens`: Minimum number of reasoning tokens before generating the final answer.
- `max_thinking_tokens`: Maximum number of reasoning tokens allowed before generating the final answer.
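To set these parameters programmatically, you can build the request body in code. This is a minimal sketch assuming an OpenAI-compatible chat completions payload; the helper name `build_9b_request` is illustrative, not part of the blueprint:

```python
import json

def build_9b_request(question: str,
                     min_thinking_tokens: int = 1024,
                     max_thinking_tokens: int = 8192) -> dict:
    """Build a chat request body with a thinking budget for the 9B model."""
    return {
        "model": "nvidia/nvidia-nemotron-nano-9b-v2",
        "messages": [{"role": "user", "content": question}],
        "min_thinking_tokens": min_thinking_tokens,
        "max_thinking_tokens": max_thinking_tokens,
    }

print(json.dumps(build_9b_request("What is the capital of France?"), indent=2))
```

The resulting dictionary can be sent with any HTTP or OpenAI-compatible client that allows extra body fields.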
Important
The key differences for the 9B model are the following:
- Requires both `min_thinking_tokens` and `max_thinking_tokens` parameters.
- Reasoning is available in the model output's `reasoning_content` field (not wrapped in `<think>` tags).
- The `reasoning_content` field is present in the model output but isn't exposed in the generate API response.
- No filtering is needed because reasoning is already separated from the final answer.
Enable Reasoning for Nemotron-3-Nano 30B#
The nvidia/nemotron-3-nano-30b-a3b model uses a different approach for reasoning control. Instead of system prompts, you control reasoning through an environment variable.
Enable Reasoning Through an Environment Variable#
Set the environment variable to enable or disable reasoning:
```shell
# Enable reasoning (default)
export ENABLE_NEMOTRON_3_NANO_THINKING=true

# Disable reasoning
export ENABLE_NEMOTRON_3_NANO_THINKING=false
```
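The on/off semantics of the variable can be sketched as follows. This illustrates how such a boolean flag is typically read, and is not the blueprint's actual parsing code:

```python
import os

def nemotron_3_nano_thinking_enabled() -> bool:
    # Reasoning defaults to on; it is disabled only when the variable
    # is explicitly set to something other than "true".
    return os.environ.get("ENABLE_NEMOTRON_3_NANO_THINKING", "true").lower() == "true"

os.environ["ENABLE_NEMOTRON_3_NANO_THINKING"] = "false"
print(nemotron_3_nano_thinking_enabled())  # False
```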
Configure Thinking Budget (Optional)#
The 30B model supports a maximum thinking token limit to control the reasoning phase:
```json
{
  "model": "nvidia/nemotron-3-nano-30b-a3b",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "max_thinking_tokens": 8192
}
```
Thinking budget parameters:
- `max_thinking_tokens`: Maximum number of reasoning tokens allowed before generating the final answer.
Important
The key differences for the 30B model are the following:
- Uses only `max_thinking_tokens` (not `min_thinking_tokens`).
- Reasoning is available in the model output's `reasoning_content` field (not wrapped in `<think>` tags).
- The `reasoning_content` field is present in the model output but isn't exposed in the generate API response.
- No filtering is needed because reasoning is already separated from the final answer.
Model Naming#
Use the correct model name based on your deployment:
- Locally deployed NIMs: `nvidia/nemotron-3-nano`
- NVIDIA-hosted models: `nvidia/nemotron-3-nano-30b-a3b`
Deploy with Reasoning Enabled#
After you configure reasoning settings in prompt.yaml or environment variables, redeploy your services:
Docker Compose#
```shell
# For prompt changes, rebuild and restart the RAG server
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d --build

# For environment variable changes only
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
```
Helm#
For Helm deployments with custom prompts or environment variables, refer to Customize Prompts for detailed instructions.
Thinking Budget Recommendations#
For models that support thinking budget parameters, a max_thinking_tokens value of 8192 is recommended for most use cases. This value provides:
- Sufficient capacity for comprehensive reasoning
- Reasonable response times
- Good balance between quality and latency
Tip
Adjust the thinking budget based on your use case:
- Lower values (1024-4096): Faster responses for simpler questions
- Higher values (8192-16384): More thorough reasoning for complex queries
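One way to apply these ranges is to choose a budget per query. The heuristic below is purely illustrative: the marker words and thresholds are assumptions for the sketch, not part of the blueprint.

```python
def pick_thinking_budget(query: str) -> int:
    """Pick a max_thinking_tokens value from a rough complexity guess."""
    # Hypothetical markers suggesting a multi-step or analytical question.
    complexity_markers = ("why", "prove", "derive", "compare", "step by step")
    text = query.lower()
    if any(marker in text for marker in complexity_markers) or len(text.split()) > 30:
        return 8192  # more thorough reasoning for complex queries
    return 4096      # faster responses for simpler questions

print(pick_thinking_budget("What is the capital of France?"))      # 4096
print(pick_thinking_budget("Compare TCP and UDP and explain why")) # 8192
```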