Using Reasoning Models#
NIM LLM supports deploying Reasoning Models designed to generate detailed, step-by-step thought processes. These models are post-trained with two dedicated system prompts that enable two modes: detailed thinking on (chain-of-thought responses) and detailed thinking off (concise responses). A single model can therefore toggle between the two behaviors simply by changing the system prompt, with no additional scaffolding required.
Note
You can try out a reasoning model like Llama 3.3 Nemotron Super 49B V1 via the preview API.
Reasoning Mode#
Reasoning mode is controlled entirely by the system prompt. When configuring your prompt, you can instruct the model to either generate an extended chain-of-thought response or provide a more direct answer.
| System Prompt | Description | Recommended Settings |
|---|---|---|
| `detailed thinking on` | Generates long chain-of-thought style responses with explicit thinking tokens. | `temperature=0.6`, `top_p=0.95` |
| `detailed thinking off` | Generates more concise responses without extended chain-of-thought or thinking tokens. | Greedy decoding (`temperature=0`) |
Refer to the API Reference for more general configuration options.
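Because the mode is selected entirely by the system prompt, switching behaviors per request only requires swapping that one message. The helper below is a minimal sketch of this pattern; the build_messages function and the reasoning flag are illustrative names, not part of NIM or the OpenAI client.

# Minimal sketch: select the reasoning mode by swapping the system prompt.
# `build_messages` and `reasoning` are illustrative names, not part of the NIM API.
def build_messages(user_prompt: str, reasoning: bool) -> list:
    system_prompt = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

# The same question in both modes:
messages_on = build_messages("Solve x*(sin(x)+2)=0", reasoning=True)
messages_off = build_messages("Solve x*(sin(x)+2)=0", reasoning=False)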
Detailed Thinking On#
In the following example, the system prompt instructs the model to include thinking tokens in its output, resulting in a detailed response.
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
# "detailed thinking on" switches the model into reasoning mode for this request
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]
# Recommended settings for reasoning mode: temperature=0.6, top_p=0.95
chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0.6,
    top_p=0.95
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
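With detailed thinking on, the reasoning trace typically precedes the final answer in the message content. The exact delimiter (shown here as <think>...</think> tags) is an assumption; verify it against your model's output before relying on it. Continuing from the example above:

# Sketch: split the reasoning trace from the final answer.
# Assumes the trace is wrapped in <think>...</think> tags; check your model's output.
content = assistant_message.content or ""
if "</think>" in content:
    thinking, final_answer = content.split("</think>", 1)
    thinking = thinking.replace("<think>", "").strip()
else:
    thinking, final_answer = "", content
print(final_answer.strip())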
Detailed Thinking Off#
In the following example, the system prompt instructs the model to exclude thinking tokens from its output, resulting in a more concise response.
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")
# "detailed thinking off" returns a concise answer without thinking tokens
messages = [
    {"role": "system", "content": "detailed thinking off"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]
# Greedy decoding (temperature=0) is recommended when detailed thinking is off
chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
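For long responses, especially with detailed thinking on, you can stream tokens as they are generated instead of waiting for the full completion. The following sketch reuses the client and messages from the example above and only changes stream=True; it relies on the standard OpenAI-compatible streaming interface.

# Sketch: stream the response token by token.
stream = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=messages,
    max_tokens=32768,
    stream=True,
    temperature=0,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()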