Use Reasoning Models with NVIDIA NIM for LLMs#
NIM for LLMs supports deploying reasoning models designed to generate detailed, step-by-step thought processes. These models are post-trained using dedicated system prompts to support two modes:

- Return chain-of-thought responses
- Return concise responses

A single model can toggle between the two behaviors simply by changing the request; no additional scaffolding is required.
Some models are specially post-trained with parallel reasoning capabilities that can be enabled without any additional scaffolding. For more information, refer to Parallel Reasoning.
Some models also let you limit the number of “thinking” tokens the model can generate before it must start producing its final answer. For more information, refer to Thinking Budget Control.
Reasoning Mode#
Reasoning mode is controlled by the system prompt or chat template keyword arguments, depending on the model.
When you configure your request, you can instruct the model to either generate an extended chain-of-thought response or provide a more direct answer. For models where this mode is controlled by the system prompt, the system prompt must be the first message in the conversation.
Review the prerequisites and common setup steps before deploying a reasoning model. For information about supported configurations, refer to supported models.
Then select one of the following approaches, depending on your model:
Detailed Thinking Prompt#
Use the following system prompts to control the reasoning mode:
| System Prompt | Description | Recommended Settings |
|---|---|---|
| `detailed thinking on` | Generates long chain-of-thought style responses with explicit thinking tokens. | `temperature=0.6`, `top_p=0.95` |
| `detailed thinking off` | Generates more concise responses without extended chain-of-thought or thinking tokens. | `temperature=0` (greedy decoding) |
Refer to the API Reference for more general configuration options.
Detailed Thinking On#
In the following example, the system prompt instructs the model to include thinking tokens in its output, resulting in a detailed response.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "How many 'r's are in 'strawberry'?"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",  # Change the model ID if needed.
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0.6,
    top_p=0.95
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
```
Expected Output:
<think>
Okay, let's see. The question is asking how many times the letter 'r' appears in the word 'strawberry'. Hmm, I need to count the 'r's. First, I should write down the word so I can look at each letter one by one. Let me spell it out: S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let me check again. Strawberry is spelled S-T-R-A-W-B-E-R-R-Y. Yeah, that seems correct.
Now, I need to go through each letter and count every time I see an 'r'. Let's break it down letter by letter:
1. S – not an 'r'.
2. T – not an 'r'.
3. R – that's the first 'r'.
4. A – not an 'r'.
5. W – not an 'r'.
6. B – not an 'r'.
7. E – not an 'r'.
8. R – that's the second 'r'.
9. R – that's the third 'r'.
10. Y – not an 'r'.
Wait, so in the sequence S-T-R-A-W-B-E-R-R-Y, the 'r's are at positions 3, 8, and 9. That makes three 'r's in total. Let me double-check to make sure I didn't miss any or count an extra. Let me spell the word again: S-T-R-A-W-B-E-R-R-Y. Yes, after the 'E' there are two 'r's in a row, then a 'Y'. So that's three 'r's. Hmm, I think that's correct. But wait, sometimes people might confuse similar letters or miscount. Let me write it out again and underline the 'r's:
S - T - **R** - A - W - B - E - **R** - **R** - Y
Yes, that's three 'r's. So the answer should be 3. But wait, I remember sometimes in words like 'strawberry', the pronunciation might affect how letters are perceived, but the question is about the spelling, right? So regardless of pronunciation, we're just counting the letters as they are written. So even if the 'w' is silent or something, we don't count that. So in the spelling, there are three 'r's. Therefore, the answer is 3. I think that's right. Let me confirm once more. S-T-R-A-W-B-E-R-R-Y. Yep, three 'r's. Alright, confident with that.
</think>
The word "strawberry" is spelled as S-T-R-A-W-B-E-R-R-Y.
Breaking it down letter by letter:
1. **S** – not an 'r'
2. **T** – not an 'r'
3. **R** – 1st 'r'
4. **A** – not an 'r'
5. **W** – not an 'r'
6. **B** – not an 'r'
7. **E** – not an 'r'
8. **R** – 2nd 'r'
9. **R** – 3rd 'r'
10. **Y** – not an 'r'
There are **3** instances of the letter 'r' in "strawberry".
**Answer:** 3
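When detailed thinking is on, the reasoning trace is wrapped in <think>...</think> tags, as shown above. If your application only needs the final answer, you can split the message content on the closing tag. The following is a minimal sketch that assumes the tags appear exactly as in the expected output; it reuses the assistant_message from the previous example.

```python
# Minimal sketch: separate the reasoning trace from the final answer,
# assuming the model wraps its reasoning in <think>...</think> tags.
content = assistant_message.content
if "</think>" in content:
    reasoning, _, final_answer = content.partition("</think>")
    reasoning = reasoning.replace("<think>", "").strip()
    final_answer = final_answer.strip()
else:
    reasoning, final_answer = "", content.strip()

print(final_answer)
```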
Detailed Thinking Off#
In the following example, the system prompt instructs the model to exclude thinking tokens in its output, resulting in a more concise response.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8001/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "detailed thinking off"},
    {"role": "user", "content": "How many 'r's are in 'strawberry'?"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",  # Change the model ID if needed.
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
```
Expected Output:
A simple yet fun question!
Let's count the 'r's in 'strawberry' together:
1. **S** - no 'r' here
2. **T** - no 'r' here
3. **R** - here's the **1st 'r'**
4. **A** - no 'r' here
5. **W** - no 'r' here
6. **B** - no 'r' here
7. **E** - no 'r' here
8. **R** - here's the **2nd 'r'**
9. **R** - here's the **3rd 'r'**
10. **Y** - no 'r' here
So, there are **3 'r's** in the word 'strawberry'.
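Responses with long reasoning traces can take a while to complete, so you may want to stream tokens as they are generated rather than wait for the full completion. The following is a sketch that reuses the detailed thinking on settings with stream=True; adjust the model ID and sampling parameters for your deployment.

```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "How many 'r's are in 'strawberry'?"}
]

# Stream the response so thinking tokens print as they are generated.
stream = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",  # Change the model ID if needed.
    messages=messages,
    max_tokens=32768,
    stream=True,
    temperature=0.6,
    top_p=0.95,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```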
No Think Prompt#
Detailed reasoning is on by default for these models. Use the following system prompt to change the default reasoning mode:
| System Prompt | Description | Recommended Settings |
|---|---|---|
| `/no_think` | Generates more concise responses without extended chain-of-thought or thinking tokens. | `temperature=0` (greedy decoding) |
Refer to the API Reference for more general configuration options.
Detailed Thinking Off#
In the following example, the system prompt instructs the model to exclude thinking tokens in its output, resulting in a more concise response.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "How many 'r's are in 'strawberry'?"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1.5",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
```
Expected Output:
There are 3 'r's in 'strawberry'.
Chat Template Keyword Arguments#
Use the following chat template keyword argument to control the reasoning mode:
- `enable_thinking`: Defaults to `True`, which generates long chain-of-thought style responses with explicit thinking tokens. Set to `False` to generate more concise responses without extended chain-of-thought or thinking tokens.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "user", "content": "Tell me a story."}
]

# Disable thinking tokens through the chat template keyword arguments.
extra_body = {
    "chat_template_kwargs": {"enable_thinking": False},
}

chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
    extra_body=extra_body,  # Pass the chat template keyword arguments with the request.
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
```
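You can send the same keyword argument without the Python client by placing chat_template_kwargs directly in the request body. The following curl sketch assumes the same endpoint and model as the Python example above.

```bash
curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "nvidia/nemotron-3-nano",
    "messages": [{"role": "user", "content": "Tell me a story."}],
    "max_tokens": 32768,
    "temperature": 0,
    "chat_template_kwargs": {"enable_thinking": false},
    "stream": false
  }'
```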
Parallel Reasoning#
NIM for LLMs supports models with advanced inference-time compute scaling that generate parallel reasoning traces to solve difficult, complex problems. These models are specially post-trained with parallel reasoning capabilities that can be enabled without any additional scaffolding.
Parallel reasoning mode is incompatible with tool calling.
Parallel reasoning is supported on Nemotron 3 Nano.
Chat Template Keyword Arguments#
Use the following chat template keyword argument to control the parallel reasoning mode:
- `parallel_reasoning_mode`: Defaults to `none`, which disables parallel reasoning inference-time compute scaling. Set to `low`, `medium`, or `heavy` to enable parallel reasoning with different inference-time compute scaling budgets.
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

messages = [
    {"role": "user", "content": "Tell me a story."}
]

# Enable parallel reasoning with a low inference-time compute budget.
extra_body = {
    "chat_template_kwargs": {"parallel_reasoning_mode": "low"},
}

chat_response = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
    extra_body=extra_body,  # Pass the chat template keyword arguments with the request.
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
```
Reasoning Effort#
The reasoning_effort parameter controls how much computational effort the model spends on reasoning, allowing you to balance between response quality and generation speed. This parameter is only supported by the Chat Completions API for GPT-OSS-20B and GPT-OSS-120B when deployed using the multi-LLM NIM.
Important
To use `reasoning_effort`, you must set `-e NIM_REASONING_PARSER=openai_gptoss` when you deploy the multi-LLM NIM container.
| Reasoning Effort | Description | Use Case |
|---|---|---|
| `low` | Faster responses with less reasoning depth. | Quick questions or time-sensitive applications. |
| `medium` | Balanced reasoning and speed. | General-purpose reasoning tasks. |
| `high` | Maximum reasoning depth for complex problems. | Complex problem-solving requiring thorough analysis. |
Refer to Text-only Language Models for backend and feature compatibility.
Example Deployment Configuration#
In the following example, the OpenAI GPT-OSS 120B model is deployed using the multi-LLM NIM with NIM_REASONING_PARSER set.
```bash
export CONTAINER_NAME=llm-nim
export IMG_NAME=nvcr.io/nim/nvidia/llm-nim:latest
export NIM_MODEL_NAME=hf://openai/gpt-oss-120b
export LOCAL_NIM_CACHE=/raid/nfs/nim

sudo mkdir -p "$LOCAL_NIM_CACHE"
sudo chmod -R a+w "$LOCAL_NIM_CACHE"

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e HF_TOKEN=$HF_TOKEN \
  -e NIM_MODEL_NAME=$NIM_MODEL_NAME \
  -e NIM_REASONING_PARSER=openai_gptoss \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
```
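After the container starts, you can confirm that the model is loaded before sending requests. The following check uses the OpenAI-compatible models endpoint, which should list openai/gpt-oss-120b once the deployment is ready.

```bash
# List the models served by the NIM to confirm the deployment is ready.
curl http://0.0.0.0:8000/v1/models
```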
Example API Request#
In the following example, the model is configured with high reasoning effort for more thorough analysis.
```bash
curl --location 'http://0.0.0.0:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [{"role": "user", "content": "What is the role of AI in medicine?"}],
    "model": "openai/gpt-oss-120b",
    "reasoning_effort": "high",
    "stream": false
  }'
```
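The same request can be made with the OpenAI Python client. The following is a sketch that assumes the deployment above; reasoning_effort is passed through extra_body so it works even on client versions that do not expose it as a named argument.

```python
from openai import OpenAI

# Assumes the GPT-OSS 120B deployment from the previous example.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

chat_response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What is the role of AI in medicine?"}],
    stream=False,
    # Passed through extra_body so this works even if the installed client
    # does not expose reasoning_effort as a named argument.
    extra_body={"reasoning_effort": "high"},
)
print(chat_response.choices[0].message)
```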