Use Reasoning Models with NVIDIA NIM for LLMs#
NVIDIA NIM for LLMs supports deploying reasoning models, which are designed to generate detailed, step-by-step thought processes. These models are post-trained with dedicated system prompts that support two modes:

- Return chain-of-thought responses
- Return concise responses

This enables a single model to toggle between the two behaviors simply by changing the system prompt, with no additional scaffolding required.
Reasoning Mode#
Reasoning mode is controlled entirely by the system prompt. When configuring your prompt, you can instruct the model to either generate an extended chain-of-thought response or provide a more direct answer. This system prompt must be the first message in the conversation.
The text of the system prompt varies according to model.
Llama 3.3 Nemotron Super 49B V1#
Use the following system prompts to control the reasoning mode:
| System Prompt | Description | Recommended Settings |
|---|---|---|
| `detailed thinking on` | Generates long chain-of-thought style responses with explicit thinking tokens. | `temperature=0.6`, `top_p=0.95` |
| `detailed thinking off` | Generates more concise responses without extended chain-of-thought or thinking tokens. | Greedy decoding (`temperature=0`) |
Refer to the API Reference for more general configuration options.
Detailed Thinking On#
In the following example, the system prompt instructs the model to include thinking tokens in its output, resulting in a detailed response.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0.6,
    top_p=0.95
)

assistant_message = chat_response.choices[0].message
print(assistant_message)
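With detailed thinking on, the model typically emits its reasoning inside `<think>...</think>` tags before the final answer. The exact tag convention is model-dependent; the following sketch (continuing from the example above) assumes that convention and separates the reasoning from the answer for display.

import re

content = assistant_message.content

# Assumes the reasoning is wrapped in <think>...</think> tags; adjust the
# pattern if the model you deploy uses a different convention.
match = re.search(r"<think>(.*?)</think>", content, flags=re.DOTALL)
if match:
    reasoning = match.group(1).strip()
    answer = content[match.end():].strip()
else:
    reasoning, answer = "", content.strip()

print("Reasoning:\n", reasoning)
print("\nAnswer:\n", answer)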
Detailed Thinking Off#
In the following example, the system prompt instructs the model to exclude thinking tokens from its output, resulting in a more concise response.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "detailed thinking off"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
)

assistant_message = chat_response.choices[0].message
print(assistant_message)
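For long chain-of-thought generations you may not want to wait for the full completion. The following is a minimal streaming sketch against the same OpenAI-compatible endpoint; it works for either reasoning mode and is shown here with detailed thinking on.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

stream = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=[
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
    ],
    max_tokens=32768,
    stream=True,
    temperature=0.6,
    top_p=0.95
)

# Print tokens as they arrive; some chunks carry no content.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()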
Llama 3.3 Nemotron Super 49B V1.5#
Detailed reasoning is on by default for this model. Use the following system prompt to turn it off:
| System Prompt | Description | Recommended Settings |
|---|---|---|
| `/no_think` | Generates more concise responses without extended chain-of-thought or thinking tokens. | Greedy decoding (`temperature=0`) |
Refer to the API Reference for more general configuration options.
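Detailed Thinking On (Default)#
Because detailed reasoning is the default for this model, a request without the `/no_think` directive returns a chain-of-thought response. The following is a minimal sketch of the default behavior; the sampling settings (`temperature=0.6`, `top_p=0.95`) are assumed here, carried over from the V1 reasoning-mode guidance rather than an official recommendation for this model.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    # No "/no_think" directive, so detailed reasoning stays on by default.
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3_3-nemotron-super-49b-v1_5",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0.6,  # assumed; mirrors the V1 "detailed thinking on" settings
    top_p=0.95
)

print(chat_response.choices[0].message)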
Detailed Thinking Off#
In the following example, the system prompt instructs the model to exclude thinking tokens from its output, resulting in a more concise response.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3_3-nemotron-super-49b-v1_5",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
)

assistant_message = chat_response.choices[0].message
print(assistant_message)
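Because the reasoning mode is selected per request, a single deployment can serve both behaviors. The sketch below toggles reasoning for the V1.5 model by adding or omitting the `/no_think` system prompt; the helper function `ask` is illustrative, not part of the NIM API, and the detailed-thinking sampling settings are assumed as above.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

def ask(question: str, thinking: bool) -> str:
    """Illustrative helper: send one question, toggling reasoning per request."""
    messages = []
    if not thinking:
        # "/no_think" must be the first message to disable detailed reasoning.
        messages.append({"role": "system", "content": "/no_think"})
    messages.append({"role": "user", "content": question})

    response = client.chat.completions.create(
        model="nvidia/llama-3_3-nemotron-super-49b-v1_5",
        messages=messages,
        max_tokens=32768,
        stream=False,
        temperature=0.6 if thinking else 0,  # assumed settings for thinking mode
        top_p=0.95 if thinking else 1,
    )
    return response.choices[0].message.content

print(ask("Solve x*(sin(x)+2)=0", thinking=True))              # detailed chain of thought
print(ask("What is the capital of France?", thinking=False))   # concise answer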