Thinking Budget Control (Thinking-Token Limiter)#
Thinking budget control (also known as reasoning budget control) is a feature that limits how many “thinking” tokens a model can generate before it must start producing its final answer. It is useful for models that follow a reflect-then-answer pattern (for example, Chain-of-Thought), such as Qwen3, when you want to cap the reasoning portion for latency or cost reasons.
What the Processor Does#
When the request attribute nvext.max_thinking_tokens is set (see the OpenAI-compatible request schema), the BudgetControlLogitsProcessor keeps a counter while the model is inside the thinking region (between <think> and </think>, or any custom tags you pick).
Once the counter reaches the budget, it starts forcing the model to emit the early-exit string (default </think>). After that, the processor disables itself so generation continues normally.
If the model overruns the budget but hasn’t yet produced a newline, an extension window (10% of the budget by default) is granted so it can finish the current sentence gracefully. This is an empirical value that has minimal impact on performance while improving user experience.
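The counting, forced exit, and extension window described above can be illustrated with a small simulation. This is only a sketch and not the actual BudgetControlLogitsProcessor: it works on a list of string tokens instead of logits, the function name apply_thinking_budget is hypothetical, and it assumes that the extension window ends at the first token ending in a newline or after 10% of the budget, whichever comes first.

```python
from typing import Iterable, List


def apply_thinking_budget(
    tokens: Iterable[str],
    budget: int,
    start_tag: str = "<think>",
    stop_tag: str = "</think>",
    extension_ratio: float = 0.10,  # assumed meaning of "10% of the budget"
) -> List[str]:
    """Toy model of budget control over a stream of string tokens.

    Simplification: after a forced exit, the model's remaining thinking
    tokens are dropped and its post-thinking tokens are passed through.
    A real model would generate a new continuation instead.
    """
    extension = int(budget * extension_ratio)
    out: List[str] = []
    inside = False    # True while between start_tag and stop_tag
    done = False      # True once the thinking region has been closed
    skipping = False  # True while discarding leftover thinking tokens
    used = 0          # thinking tokens emitted so far

    for tok in tokens:
        if skipping:
            if tok == stop_tag:
                skipping = False
            continue
        if not inside:
            out.append(tok)
            if tok == start_tag and not done:
                inside = True
            continue
        if tok == stop_tag:
            # The model closed the thinking region on its own.
            out.append(tok)
            inside, done = False, True
            continue
        at_boundary = out[-1].endswith("\n")
        if used >= budget and (at_boundary or used >= budget + extension):
            # Budget (plus any extension) exhausted: force the early exit.
            out.append(stop_tag)
            inside, done, skipping = False, True, True
            continue
        out.append(tok)
        used += 1
    return out
```

For example, with a budget of 10 and no newline in the thinking text, the sketch lets one extra token through (10% of 10) before it emits </think>.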
Enable Thinking Budget Control#
Note
The NVIDIA-Nemotron-Nano-9B-v2 NIM has budget control enabled by default. You do not need to manually set the following environment variables for this NIM.
Set the environment variable NIM_ENABLE_BUDGET_CONTROL to 1 to enable
thinking budget control. Set NIM_BUDGET_CONTROL_THINKING_START_STRING and NIM_BUDGET_CONTROL_THINKING_STOP_STRING to the start and end
thinking tags of the model you’re using. You can also append prompt text through these environment variables. For example, setting
NIM_BUDGET_CONTROL_THINKING_STOP_STRING=$'</think>\nFinal answer:' adds Final answer: after the closing think tag.
docker run -e NIM_ENABLE_BUDGET_CONTROL=1 \
  -e NIM_BUDGET_CONTROL_THINKING_START_STRING="<think>" \
  -e NIM_BUDGET_CONTROL_THINKING_STOP_STRING="</think>" \
  … <IMAGE>
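If you launch the container from a script, the same three variables can be assembled as an argument list. The following is a minimal sketch: the helper name docker_run_args is hypothetical, the image name is passed in by the caller, and any other docker flags your deployment needs (elided in the command above) are not included.

```python
from typing import Dict, List


def docker_run_args(
    image: str,
    start_tag: str = "<think>",
    stop_tag: str = "</think>",
) -> List[str]:
    """Build a `docker run` argument list that enables budget control."""
    env: Dict[str, str] = {
        "NIM_ENABLE_BUDGET_CONTROL": "1",
        "NIM_BUDGET_CONTROL_THINKING_START_STRING": start_tag,
        "NIM_BUDGET_CONTROL_THINKING_STOP_STRING": stop_tag,
    }
    args = ["docker", "run"]
    for name, value in env.items():
        # Each variable becomes its own "-e NAME=value" pair.
        args += ["-e", f"{name}={value}"]
    args.append(image)
    return args
```

Passing the list to subprocess.run avoids shell quoting, so a stop string such as "</think>\nFinal answer:" can contain a real newline without the $'…' syntax.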
Make Requests#
To limit the number of thinking tokens, either explicitly set the reasoning budget or set the thinking effort to a low level.
Set the Reasoning Budget#
For some models, set the reasoning_budget field inside chat_template_kwargs in
the request. This field enforces a hard upper limit on reasoning tokens using a
logits processor that forces the </think> token once the budget is reached.
Thinking must be enabled to use this field:
{
  "model": "nvidia/nemotron-3-super-120b-a12b",
  "messages": [{"role": "user", "content": "What is 15 * 37? Think step by step."}],
  "max_tokens": 1000,
  "chat_template_kwargs": {"enable_thinking": true, "reasoning_budget": 25}
}
Other models use the nvext.max_thinking_tokens field in the request:
{
  "model": "nvidia/nvidia-nemotron-nano-9b-v2",
  "messages": [{"role": "user", "content": "…"}],
  "max_tokens": 512,
  "nvext": {
    "max_thinking_tokens": 128
  }
}
If you set both the reasoning_budget and nvext.max_thinking_tokens fields, the
smaller of the two values limits the number of thinking tokens.
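The two request shapes and the smaller-value rule can be sketched in Python as follows. The helper names build_request and effective_thinking_limit are hypothetical and only build or inspect the JSON body. They do not send anything to a server, and which field a given model honors still depends on the model.

```python
import json
from typing import Any, Dict, Optional


def effective_thinking_limit(
    reasoning_budget: Optional[int],
    max_thinking_tokens: Optional[int],
) -> Optional[int]:
    """Return the smaller of the limits that are set, or None if neither is."""
    limits = [v for v in (reasoning_budget, max_thinking_tokens) if v is not None]
    return min(limits) if limits else None


def build_request(
    model: str,
    prompt: str,
    max_tokens: int,
    reasoning_budget: Optional[int] = None,
    max_thinking_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a chat completion body with either or both budget fields."""
    body: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if reasoning_budget is not None:
        # reasoning_budget requires thinking to be enabled.
        body["chat_template_kwargs"] = {
            "enable_thinking": True,
            "reasoning_budget": reasoning_budget,
        }
    if max_thinking_tokens is not None:
        body["nvext"] = {"max_thinking_tokens": max_thinking_tokens}
    return body


if __name__ == "__main__":
    request = build_request(
        "nvidia/nvidia-nemotron-nano-9b-v2", "…", 512, max_thinking_tokens=128
    )
    print(json.dumps(request, indent=2, ensure_ascii=False))
```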
Set the Thinking Effort#
Instead of specifying a reasoning budget, some models let you limit the thinking
effort by setting the low_effort field in the request. Set
"low_effort": true to instruct the model to produce shorter chain-of-thought
reasoning. This is a soft, advisory hint; the model is encouraged, but not
forced, to reduce its thinking. The total number of reasoning tokens is not
bounded.
Set low_effort: true when shorter reasoning is preferred but not critical. Use
the reasoning_budget parameter when strict control
over the length of reasoning is required.
{
  "model": "nvidia/nemotron-3-super-120b-a12b",
  "messages": [{"role": "user", "content": "What is 15 * 37? Think step by step."}],
  "max_tokens": 1000,
  "chat_template_kwargs": {"enable_thinking": true, "low_effort": true}
}
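The choice between the soft hint and the hard limit can be captured in a small helper. This is a sketch under the assumption that both options are passed through chat_template_kwargs as in the examples on this page. The function name thinking_kwargs is hypothetical.

```python
from typing import Any, Dict, Optional


def thinking_kwargs(hard_limit: Optional[int] = None) -> Dict[str, Any]:
    """Pick strict or advisory control over reasoning length.

    With hard_limit set, use reasoning_budget (a hard cap).
    Without it, fall back to low_effort (an advisory hint, unbounded).
    """
    kwargs: Dict[str, Any] = {"enable_thinking": True}
    if hard_limit is not None:
        kwargs["reasoning_budget"] = hard_limit
    else:
        kwargs["low_effort"] = True
    return kwargs
```

Calling thinking_kwargs() yields the low_effort form shown above, while thinking_kwargs(25) yields the reasoning_budget form from the earlier example.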
Advanced Options (Environment Variables)#
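| Variable | Default | Description |
|---|---|---|
| NIM_ENABLE_BUDGET_CONTROL | – | Master switch. Set to 1 to enable thinking budget control. |
| NIM_BUDGET_CONTROL_THINKING_START_STRING | <think> | Tag that marks the beginning of the thinking region. |
| NIM_BUDGET_CONTROL_THINKING_STOP_STRING | </think> | Tag that marks the end of the thinking region. |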
Status#
Important
Thinking budget control is not supported with SGLang.
Important
Thinking budget control cannot be used with other structured generation backends at the moment.