Use Reasoning Models with NVIDIA NIM for LLMs#
NIM for LLMs supports deploying reasoning models designed to generate detailed, step-by-step thought processes. These models are post-trained using dedicated system prompts to support two modes:

- Return chain-of-thought responses
- Return concise responses

A single model can toggle between the two behaviors simply by changing the request; no additional scaffolding is required.
Some models are specially post-trained with parallel reasoning capabilities that can be enabled without any additional scaffolding. For more information, refer to Parallel Reasoning.
Some models also let you limit the number of “thinking” tokens the model can generate before it must start producing its final answer. For more information, refer to Thinking Budget Control.
Reasoning Mode#
Reasoning mode is controlled by the system prompt or chat template keyword arguments, depending on the model.
When you configure your request, you can instruct the model to either generate an extended chain-of-thought response or provide a more direct answer. For models where this mode is controlled by the system prompt, the system prompt must be the first message in the conversation.
Review the prerequisites and common setup steps before deploying a reasoning model. For information about supported configurations, refer to supported models.
Then select one of the following approaches, depending on your model:
Detailed Thinking Prompt#
Use the following system prompts to control the reasoning mode:
| System Prompt | Description | Recommended Settings |
|---|---|---|
| `detailed thinking on` | Generates long chain-of-thought style responses with explicit thinking tokens. | `temperature=0.6`, `top_p=0.95` |
| `detailed thinking off` | Generates more concise responses without extended chain-of-thought or thinking tokens. | `temperature=0` (greedy decoding) |
Refer to the API Reference for more general configuration options.
Detailed Thinking On#
In the following example, the system prompt instructs the model to include thinking tokens in its output, resulting in a detailed response.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "How many 'r's are in 'strawberry'?"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",  # Change the model ID if needed.
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0.6,
    top_p=0.95
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
```
Expected Output:
<think>
Okay, let's see. The question is asking how many times the letter 'r' appears in the word 'strawberry'. Hmm, I need to count the 'r's. First, I should write down the word so I can look at each letter one by one. Let me spell it out: S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let me check again. Strawberry is spelled S-T-R-A-W-B-E-R-R-Y. Yeah, that seems correct.
Now, I need to go through each letter and count every time I see an 'r'. Let's break it down letter by letter:
1. S – not an 'r'.
2. T – not an 'r'.
3. R – that's the first 'r'.
4. A – not an 'r'.
5. W – not an 'r'.
6. B – not an 'r'.
7. E – not an 'r'.
8. R – that's the second 'r'.
9. R – that's the third 'r'.
10. Y – not an 'r'.
Wait, so in the sequence S-T-R-A-W-B-E-R-R-Y, the 'r's are at positions 3, 8, and 9. That makes three 'r's in total. Let me double-check to make sure I didn't miss any or count an extra. Let me spell the word again: S-T-R-A-W-B-E-R-R-Y. Yes, after the 'E' there are two 'r's in a row, then a 'Y'. So that's three 'r's. Hmm, I think that's correct. But wait, sometimes people might confuse similar letters or miscount. Let me write it out again and underline the 'r's:
S - T - **R** - A - W - B - E - **R** - **R** - Y
Yes, that's three 'r's. So the answer should be 3. But wait, I remember sometimes in words like 'strawberry', the pronunciation might affect how letters are perceived, but the question is about the spelling, right? So regardless of pronunciation, we're just counting the letters as they are written. So even if the 'w' is silent or something, we don't count that. So in the spelling, there are three 'r's. Therefore, the answer is 3. I think that's right. Let me confirm once more. S-T-R-A-W-B-E-R-R-Y. Yep, three 'r's. Alright, confident with that.
</think>
The word "strawberry" is spelled as S-T-R-A-W-B-E-R-R-Y.
Breaking it down letter by letter:
1. **S** – not an 'r'
2. **T** – not an 'r'
3. **R** – 1st 'r'
4. **A** – not an 'r'
5. **W** – not an 'r'
6. **B** – not an 'r'
7. **E** – not an 'r'
8. **R** – 2nd 'r'
9. **R** – 3rd 'r'
10. **Y** – not an 'r'
There are **3** instances of the letter 'r' in "strawberry".
**Answer:** 3
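When detailed thinking is on, the reasoning trace is wrapped in <think>...</think> tags, as shown above. If your application only needs the final answer, you can split the message content on the closing tag. The following is a minimal sketch that assumes the tags appear exactly as in the expected output; it reuses the assistant_message from the previous example.

```python
# Minimal sketch: separate the reasoning trace from the final answer,
# assuming the model wraps its reasoning in <think>...</think> tags.
content = assistant_message.content
if "</think>" in content:
    reasoning, _, final_answer = content.partition("</think>")
    reasoning = reasoning.replace("<think>", "").strip()
    final_answer = final_answer.strip()
else:
    reasoning, final_answer = "", content.strip()

print(final_answer)
```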
Detailed Thinking Off#
In the following example, the system prompt instructs the model to exclude thinking tokens in its output, resulting in a more concise response.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8001/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "detailed thinking off"},
    {"role": "user", "content": "How many 'r's are in 'strawberry'?"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",  # Change the model ID if needed.
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
```
Expected Output:
A simple yet fun question!
Let's count the 'r's in 'strawberry' together:
1. **S** - no 'r' here
2. **T** - no 'r' here
3. **R** - here's the **1st 'r'**
4. **A** - no 'r' here
5. **W** - no 'r' here
6. **B** - no 'r' here
7. **E** - no 'r' here
8. **R** - here's the **2nd 'r'**
9. **R** - here's the **3rd 'r'**
10. **Y** - no 'r' here
So, there are **3 'r's** in the word 'strawberry'.
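Responses with long reasoning traces can take a while to complete, so you may want to stream tokens as they are generated rather than wait for the full completion. The following is a sketch that reuses the detailed thinking on settings with stream=True; adjust the model ID and sampling parameters for your deployment.

```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "How many 'r's are in 'strawberry'?"}
]

# Stream the response so thinking tokens print as they are generated.
stream = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",  # Change the model ID if needed.
    messages=messages,
    max_tokens=32768,
    stream=True,
    temperature=0.6,
    top_p=0.95,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```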
No Think Prompt#
Detailed reasoning is on by default for these models. Use the following system prompt to change the default reasoning mode:
| System Prompt | Description | Recommended Settings |
|---|---|---|
| `/no_think` | Generates more concise responses without extended chain-of-thought or thinking tokens. | `temperature=0` (greedy decoding) |
Refer to the API Reference for more general configuration options.
Detailed Thinking Off#
In the following example, the system prompt instructs the model to exclude thinking tokens in its output, resulting in a more concise response.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "How many 'r's are in 'strawberry'?"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1.5",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
```
Expected Output:
There are 3 'r's in 'strawberry'.
Chat Template Keyword Arguments#
Use the following chat template keyword argument to control the reasoning mode:
- `enable_thinking`: Defaults to `True`, which generates long chain-of-thought style responses with explicit thinking tokens. Set to `False` to generate more concise responses without extended chain-of-thought or thinking tokens.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "user", "content": "Tell me a story."}
]

# Disable thinking tokens through the chat template keyword arguments.
extra_body = {
    "chat_template_kwargs": {"enable_thinking": False},
}

chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
    extra_body=extra_body,  # Pass the chat template keyword arguments with the request.
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
```
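You can send the same keyword argument without the Python client by placing chat_template_kwargs directly in the request body. The following curl sketch assumes the same endpoint and model as the Python example above.

```bash
curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "nvidia/nemotron-3-nano",
    "messages": [{"role": "user", "content": "Tell me a story."}],
    "max_tokens": 32768,
    "temperature": 0,
    "chat_template_kwargs": {"enable_thinking": false},
    "stream": false
  }'
```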
Parallel Reasoning#
NIM for LLMs supports models with advanced inference-time compute scaling that generate parallel reasoning traces to solve difficult, complex problems. These models are specially post-trained with parallel reasoning capabilities that can be enabled without any additional scaffolding.
Parallel reasoning mode is incompatible with tool calling.
Parallel reasoning is supported on Nemotron 3 Nano.
Chat Template Keyword Arguments#
Use the following chat template keyword argument to control the parallel reasoning mode:
- `parallel_reasoning_mode`: Defaults to `none`, which disables parallel reasoning inference-time compute scaling. Set to `low`, `medium`, or `heavy` to enable parallel reasoning with different inference-time compute scaling budgets.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "user", "content": "Tell me a story."}
]

# Enable parallel reasoning with a low inference-time compute budget.
extra_body = {
    "chat_template_kwargs": {"parallel_reasoning_mode": "low"},
}

chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
    extra_body=extra_body,  # Pass the chat template keyword arguments with the request.
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
```
Reasoning Effort#
The reasoning_effort parameter controls how much computational effort the model spends on reasoning, allowing you to balance between response quality and generation speed. This parameter is only supported by the Chat Completions API for GPT-OSS-20B and GPT-OSS-120B when deployed using the multi-LLM NIM.
Important
To use `reasoning_effort`, you must set `-e NIM_REASONING_PARSER=openai_gptoss` when you deploy the multi-LLM NIM container.
| Reasoning Effort | Description | Use Case |
|---|---|---|
| `low` | Faster responses with less reasoning depth. | Quick questions or time-sensitive applications. |
| `medium` | Balanced reasoning and speed. | General-purpose reasoning tasks. |
| `high` | Maximum reasoning depth for complex problems. | Complex problem-solving requiring thorough analysis. |
Refer to Text-only Language Models for backend and feature compatibility.
Example Deployment Configuration#
In the following example, the OpenAI GPT-OSS 120B model is deployed using the multi-LLM NIM with NIM_REASONING_PARSER set.
```bash
export CONTAINER_NAME=llm-nim
export IMG_NAME=nvcr.io/nim/nvidia/llm-nim:latest
export NIM_MODEL_NAME=hf://openai/gpt-oss-120b
export LOCAL_NIM_CACHE=/raid/nfs/nim

sudo mkdir -p "$LOCAL_NIM_CACHE"
sudo chmod -R a+w "$LOCAL_NIM_CACHE"

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MODEL_NAME=$NIM_MODEL_NAME \
  -e NIM_REASONING_PARSER=openai_gptoss \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
```
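After the container starts, you can confirm that the model is loaded before sending requests. The following check uses the OpenAI-compatible models endpoint, which should list openai/gpt-oss-120b once the deployment is ready.

```bash
# List the models served by the NIM to confirm the deployment is ready.
curl http://0.0.0.0:8000/v1/models
```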
Example API Request#
In the following example, the model is configured with high reasoning effort for more thorough analysis.
```bash
curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [{"role": "user", "content": "What is the role of AI in medicine?"}],
    "model": "openai/gpt-oss-120b",
    "reasoning_effort": "high",
    "stream": false
  }'
```
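The same request can be made with the OpenAI Python client. The following is a sketch that assumes the deployment above; reasoning_effort is passed through extra_body so it works even on client versions that do not expose it as a named argument.

```python
from openai import OpenAI

# Assumes the GPT-OSS 120B deployment from the previous example.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

chat_response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What is the role of AI in medicine?"}],
    stream=False,
    # Passed through extra_body so this works even if the installed client
    # does not expose reasoning_effort as a named argument.
    extra_body={"reasoning_effort": "high"},
)
print(chat_response.choices[0].message)
```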