Use Reasoning Models with NVIDIA NIM for LLMs#

NIM LLM supports deploying reasoning models designed to generate detailed, step-by-step thought processes. These models are post-trained with dedicated system prompts that select between two modes:

  • Return chain-of-thought responses

  • Return concise responses

This lets a single model toggle between the two behaviors simply by changing the system prompt, with no additional scaffolding required.

Reasoning Mode#

Reasoning mode is controlled entirely by the system prompt. When configuring your prompt, you can instruct the model to either generate an extended chain-of-thought response or provide a more direct answer. This system prompt must be the first message in the conversation.

The exact text of the system prompt varies by model.

Llama 3.3 Nemotron Super 49B V1#

Use the following system prompts to control the reasoning mode:

| System Prompt | Description | Recommended Settings |
|---------------|-------------|----------------------|
| detailed thinking on | Generates long chain-of-thought style responses with explicit thinking tokens. | temperature = 0.6, top_p = 0.95 |
| detailed thinking off | Generates more concise responses without extended chain-of-thought or thinking tokens. | temperature = 0 |

Refer to the API Reference for more general configuration options.

Detailed Thinking On#

In the following example, the system prompt instructs the model to include thinking tokens in its output, resulting in a detailed response.

from openai import OpenAI

# Point the client at the local NIM endpoint; the API key is not used by a local deployment.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# The system prompt must be the first message; "detailed thinking on" enables chain-of-thought output.
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]

# Recommended sampling settings for reasoning mode: temperature=0.6, top_p=0.95.
chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0.6,
    top_p=0.95
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
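
When detailed thinking is on, the chain of thought is returned in the message content alongside the final answer. The reasoning trace is typically delimited by <think>...</think> tags, but verify the exact delimiters emitted by your model version; the following post-processing sketch, which continues from the example above, assumes that format.

import re

# Assumption: the reasoning trace is wrapped in <think>...</think> tags.
content = assistant_message.content or ""
match = re.search(r"<think>(.*?)</think>", content, flags=re.DOTALL)
reasoning = match.group(1).strip() if match else ""
final_answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()

print("Reasoning:", reasoning)
print("Answer:", final_answer)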

Detailed Thinking Off#

In the following example, the system prompt instructs the model to exclude thinking tokens in its output, resulting in a more concise response.

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# "detailed thinking off" suppresses thinking tokens; temperature=0 is recommended for this mode.
messages = [
    {"role": "system", "content": "detailed thinking off"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
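
The two examples above differ only in the system prompt and the sampling settings, so the toggle is easy to wrap in a small helper. The sketch below is illustrative rather than part of NIM or the OpenAI client; the helper name and its defaults are assumptions.

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

def reasoning_request(prompt, thinking=True):
    # Select the reasoning-mode system prompt and the recommended sampling settings.
    system_prompt = "detailed thinking on" if thinking else "detailed thinking off"
    sampling = {"temperature": 0.6, "top_p": 0.95} if thinking else {"temperature": 0}
    return client.chat.completions.create(
        model="nvidia/llama-3.3-nemotron-super-49b-v1",
        messages=[
            {"role": "system", "content": system_prompt},  # must be the first message
            {"role": "user", "content": prompt},
        ],
        max_tokens=32768,
        stream=False,
        **sampling,
    )

print(reasoning_request("Solve x*(sin(x)+2)=0", thinking=False).choices[0].message)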

Llama 3.3 Nemotron Super 49B V1.5#

Detailed reasoning is on by default for this model. Use the following system prompt to turn it off:

| System Prompt | Description | Recommended Settings |
|---------------|-------------|----------------------|
| /no_think | Generates more concise responses without extended chain-of-thought or thinking tokens. | temperature = 0 |

Refer to the API Reference for more general configuration options.

Detailed Thinking Off#

In the following example, the system prompt instructs the model to exclude thinking tokens in its output, resulting in a more concise response.

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# "/no_think" turns off the default detailed reasoning; temperature=0 is recommended for this mode.
messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3_3-nemotron-super-49b-v1_5",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0,
)
assistant_message = chat_response.choices[0].message
print(assistant_message)
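
Because detailed reasoning is the default for this model, omitting the system prompt (or any /no_think instruction) yields chain-of-thought output. The sketch below assumes the reasoning-mode sampling settings recommended for V1 (temperature = 0.6, top_p = 0.95) also apply here; adjust them for your deployment.

from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="not-used")

# No system prompt: detailed reasoning is the default for this model.
messages = [
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"}
]

chat_response = client.chat.completions.create(
    model="nvidia/llama-3_3-nemotron-super-49b-v1_5",
    messages=messages,
    max_tokens=32768,
    stream=False,
    temperature=0.6,  # assumed from the V1 reasoning-mode recommendation
    top_p=0.95
)
print(chat_response.choices[0].message)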