Using Reasoning Models#
NIM LLM supports deploying Reasoning Models designed to generate detailed, step-by-step thought processes. These models are post-trained with two dedicated system prompts that select between two modes: `detailed thinking on` (chain-of-thought responses) and `detailed thinking off` (concise responses). A single model can therefore toggle between the two behaviors simply by changing the system prompt, with no additional scaffolding required.
Note: You can try out a reasoning model like Llama 3.3 Nemotron Super 49B V1 via the preview API.
Reasoning Mode#
Reasoning mode is controlled entirely by the system prompt. When configuring your prompt, you can instruct the model to either generate an extended chain-of-thought response or provide a more direct answer.
| System Prompt | Description | Recommended Settings |
|---|---|---|
| `detailed thinking on` | Generates long chain-of-thought style responses with explicit thinking tokens. | `temperature=0.6`, `top_p=0.95` |
| `detailed thinking off` | Generates more concise responses without extended chain-of-thought or thinking tokens. | Greedy decoding (`temperature=0`) |
Refer to the API Reference for more general configuration options.
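Because the mode is selected entirely through the system prompt, switching behaviors at request time only requires swapping that one string. The following sketch shows one possible way to build the message list; the `build_messages` helper is illustrative and not part of NIM.

```python
def build_messages(user_prompt, reasoning=True):
    """Prepend the system prompt that selects the reasoning mode."""
    system_prompt = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

# Example: request a chain-of-thought response.
messages = build_messages("Solve x*(sin(x)+2)=0", reasoning=True)
```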
Detailed Thinking On#
In the following example, the system prompt instructs the model to include thinking tokens in its output, resulting in a detailed response.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0.6,
    top_p=0.95
)

assistant_message = chat_response.choices[0].message
print(assistant_message)
```
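The thinking tokens are returned inline in the assistant message content. If you want to separate the chain of thought from the final answer, you can post-process the text. The sketch below assumes the model wraps its reasoning in `<think>...</think>` tags, which Nemotron reasoning models typically emit; verify the delimiters against your model's actual output before relying on them.

```python
import re

# Continuing from the example above: split the response into the
# chain-of-thought block and the final answer. Assumes the reasoning
# is wrapped in <think>...</think> tags.
raw = assistant_message.content or ""
match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
thinking = match.group(1).strip() if match else ""
final_answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()

print("Thinking:\n", thinking)
print("Answer:\n", final_answer)
```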
Detailed Thinking Off#
In the following example, the system prompt instructs the model to omit thinking tokens from its output, resulting in a more concise response.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "detailed thinking off"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0
)

assistant_message = chat_response.choices[0].message
print(assistant_message)
```
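Both examples above wait for the complete response before printing it. Because chain-of-thought responses can be long, you may prefer to stream tokens as they are generated. The following sketch uses the same endpoint and parameters with `stream=True`.

```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]

# Stream the response and print tokens as they arrive.
chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=messages,
    max_tokens=32768,
    stream=True,
    temperature=0.6,
    top_p=0.95
)

for chunk in chat_response:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```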