Thinking Budget Control (Thinking-Token Limiter)

Thinking budget control (also known as reasoning budget control) limits how many “thinking” tokens a model can generate before it must start producing its final answer. It is useful for models that follow a reflect-then-answer pattern (for example, chain-of-thought), such as Qwen3, when you want to cap the reasoning portion for latency or cost reasons.

What the Processor Does

When the request attribute nvext.max_thinking_tokens is set (see the OpenAI-compatible request schema), the BudgetControlLogitsProcessor counts tokens while the model is inside the thinking region (between <think> and </think>, or between any custom tags you configure).
Once the counter reaches the budget, the processor forces the model to emit the early-exit string (default </think>). After that, the processor disables itself and generation continues normally.
If the model reaches the budget but has not yet produced a newline, it is granted an extension window (10% of the budget by default) so that it can finish the current sentence gracefully. This default was chosen empirically: it has minimal impact on performance while improving the user experience.
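
The behavior described above can be pictured with a small token-level simulation. This is only a sketch under stated assumptions, not the actual BudgetControlLogitsProcessor: the function name `apply_thinking_budget` is hypothetical, tokens are plain strings, the input stream stands in for what the model would have produced without a limit, and the exact moment at which the extension window ends (on the first newline, or once the extra 10% is used up) is one reading of the description.

```python
from typing import Iterable, List


def apply_thinking_budget(
    tokens: Iterable[str],
    budget: int,
    start: str = "<think>",
    stop: str = "</think>",
    extension_ratio: float = 0.10,
) -> List[str]:
    """Toy simulation of a thinking-token limiter (illustrative only)."""
    extension = int(budget * extension_ratio)  # grace window, in tokens
    out: List[str] = []
    state = "before"  # before -> thinking -> (skipping) -> after
    count = 0         # thinking tokens emitted so far

    for tok in tokens:
        if state == "before":
            out.append(tok)
            if tok == start:
                state = "thinking"
        elif state == "thinking":
            if tok == stop:
                # The model closed the region on its own; limiter is done.
                out.append(tok)
                state = "after"
                continue
            at_line_end = out[-1].endswith("\n")
            over_budget = count >= budget
            over_extension = count >= budget + extension
            if over_extension or (over_budget and at_line_end):
                # Force the early-exit string in place of the next token.
                out.append(stop)
                state = "skipping"
            else:
                out.append(tok)
                count += 1
        elif state == "skipping":
            # Simulation artifact: drop the thinking tokens that the
            # unconstrained stream would still have produced.
            if tok == stop:
                state = "after"
        else:  # "after": the limiter has disabled itself
            out.append(tok)
    return out


stream = ["<think>", "a", "b", "c", "d", "e", "</think>", "42"]
print(apply_thinking_budget(stream, budget=2))
# ['<think>', 'a', 'b', '</think>', '42']
```

With a budget of 2, the extension window rounds down to zero tokens, so the early-exit string is forced as soon as two thinking tokens have been emitted. In a real deployment the answer after the forced tag is generated fresh by the model; the sketch simply reuses the original tail of the stream.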

Enable Thinking Budget Control

Note

The NVIDIA-Nemotron-Nano-9B-v2 NIM has budget control enabled by default. You do not need to manually set the following environment variables for this NIM.

Set the environment variable NIM_ENABLE_BUDGET_CONTROL to 1 to enable thinking budget control. Set NIM_BUDGET_CONTROL_THINKING_START_STRING and NIM_BUDGET_CONTROL_THINKING_STOP_STRING to the start and end thinking tags of the model you’re using. You can also use these variables to inject extra prompt text. For example, NIM_BUDGET_CONTROL_THINKING_STOP_STRING=$'</think>\nFinal answer:' appends Final answer: on a new line after the closing think tag.

docker run -e NIM_ENABLE_BUDGET_CONTROL=1 \
           -e NIM_BUDGET_CONTROL_THINKING_START_STRING="<think>" \
           -e NIM_BUDGET_CONTROL_THINKING_STOP_STRING="</think>" \
           … <IMAGE>

Make Requests

To limit the number of thinking tokens, either explicitly set the reasoning budget or set the thinking effort to a low level.

Set the Reasoning Budget

For some models, set the reasoning_budget field in the request. This field enforces a hard upper limit on reasoning tokens using a logits processor that forces the </think> token once the budget is reached. You must have thinking enabled to use this field:

{
  "model": "nvidia/nemotron-3-super-120b-a12b",
  "messages": [{"role": "user", "content": "What is 15 * 37? Think step by step."}],
  "max_tokens": 1000,
  "chat_template_kwargs": {"enable_thinking": true, "reasoning_budget": 25}
}

Other models use the nvext.max_thinking_tokens field in the request:

{
  "model": "nvidia/nvidia-nemotron-nano-9b-v2",
  "messages": [{"role": "user", "content": "…"}],
  "max_tokens": 512,
  "nvext": {
    "max_thinking_tokens": 128
  }
}

If you set both the reasoning_budget and nvext.max_thinking_tokens fields, the smaller of the two values is used as the thinking-token limit.
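
As an illustration of the two request shapes above, the following sketch builds the JSON body in Python and computes which limit would apply when both fields are present. The helper names `build_chat_request` and `effective_thinking_limit` are hypothetical and not part of any NIM client library; the code only constructs the body and does not send it.

```python
import json
from typing import Any, Dict, Optional


def build_chat_request(
    model: str,
    prompt: str,
    max_tokens: int,
    *,
    reasoning_budget: Optional[int] = None,
    max_thinking_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a chat request body with optional thinking limits."""
    body: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if reasoning_budget is not None:
        # reasoning_budget requires thinking to be enabled.
        body["chat_template_kwargs"] = {
            "enable_thinking": True,
            "reasoning_budget": reasoning_budget,
        }
    if max_thinking_tokens is not None:
        body["nvext"] = {"max_thinking_tokens": max_thinking_tokens}
    return body


def effective_thinking_limit(body: Dict[str, Any]) -> Optional[int]:
    """Return the smaller of the two limits, or None if neither is set."""
    limits = [
        body.get("chat_template_kwargs", {}).get("reasoning_budget"),
        body.get("nvext", {}).get("max_thinking_tokens"),
    ]
    set_limits = [limit for limit in limits if limit is not None]
    return min(set_limits) if set_limits else None


body = build_chat_request(
    "nvidia/nvidia-nemotron-nano-9b-v2",
    "What is 15 * 37? Think step by step.",
    512,
    max_thinking_tokens=128,
)
print(json.dumps(body, indent=2))
print(effective_thinking_limit(body))  # 128
```

The resulting body would be sent as the JSON payload of a chat completion request to your deployment. Which of the two fields a given model honors depends on the model, so check its documentation before relying on either one.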

Set the Thinking Effort

Instead of specifying a reasoning budget, some models let you limit the thinking effort by setting the low_effort field in the request. Set "low_effort": true to instruct the model to produce shorter chain-of-thought reasoning. This is a soft, advisory hint; the model is encouraged, but not forced, to reduce its thinking. The total number of reasoning tokens is not bounded.

Set low_effort: true when shorter reasoning is preferred but not critical. Use the reasoning_budget parameter when strict control over the length of reasoning is required.

{
  "model": "nvidia/nemotron-3-super-120b-a12b",
  "messages": [{"role": "user", "content": "What is 15 * 37? Think step by step."}],
  "max_tokens": 1000,
  "chat_template_kwargs": {"enable_thinking": true, "low_effort": true}
}
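
The choice between the hard limit and the soft hint can be captured in a small helper. This is a minimal sketch: the function name `thinking_kwargs` is hypothetical, and it assumes the target model accepts the chat_template_kwargs fields shown in the examples above.

```python
from typing import Any, Dict, Optional


def thinking_kwargs(hard_limit: Optional[int] = None) -> Dict[str, Any]:
    """Pick chat_template_kwargs for strict or advisory thinking control."""
    kwargs: Dict[str, Any] = {"enable_thinking": True}
    if hard_limit is not None:
        # Strict control: reasoning is cut off at hard_limit tokens.
        kwargs["reasoning_budget"] = hard_limit
    else:
        # Advisory hint: shorter reasoning is encouraged, not enforced.
        kwargs["low_effort"] = True
    return kwargs


print(thinking_kwargs(25))
# {'enable_thinking': True, 'reasoning_budget': 25}
print(thinking_kwargs())
# {'enable_thinking': True, 'low_effort': True}
```

The returned dictionary goes under the "chat_template_kwargs" key of the request body.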

Advanced Options (Environment Variables)

| Variable | Default | Description |
| --- | --- | --- |
| `NIM_ENABLE_BUDGET_CONTROL` | – | Master switch. Set to `1` to enable. |
| `NIM_BUDGET_CONTROL_THINKING_START_STRING` | `<think>` | Tag that marks the beginning of the thinking region. |
| `NIM_BUDGET_CONTROL_THINKING_STOP_STRING` | `</think>` | Tag that marks the end of the thinking region. |

Status

Important

Thinking budget control is not supported with SGLang.

Important

Thinking budget control cannot be used with other structured generation backends at the moment.