Thinking Budget Control (Thinking-Token Limiter)#
Thinking budget control is a feature that limits how many “thinking” tokens a model can generate before it must start producing its final answer. It’s useful for models that follow a reflect-then-answer pattern (for example, chain-of-thought reasoning), such as Qwen3, when you want to cap the reasoning portion for latency or cost reasons.
What the processor does#
When the request attribute nvext.max_thinking_tokens is set (see the OpenAI-compatible request schema), the BudgetControlLogitsProcessor keeps a counter while the model is inside the thinking region (between <think> and </think>, or any custom tags you pick).
Once the counter reaches the budget, it starts forcing the model to emit the early-exit string (default </think>). After that, the processor disables itself so generation continues normally.
If the model overruns the budget but hasn’t yet produced a newline, an extension window (10% of the budget by default) is granted so it can finish the current sentence gracefully.
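The behavior described above can be illustrated with a small, string-level simulation. This is only a sketch under stated assumptions, not the actual BudgetControlLogitsProcessor: the class name BudgetControlSketch, its fields, and the run helper are hypothetical, each list element stands in for one token, and the early-exit string is treated as a single token (a real logits processor works on token IDs and may need several steps to force a multi-token string).

```python
from dataclasses import dataclass, field


@dataclass
class BudgetControlSketch:
    """Toy model of the budget logic (hypothetical, not NIM's implementation)."""

    max_thinking_tokens: int
    start_tag: str = "<think>"
    stop_tag: str = "</think>"
    extension_ratio: float = 0.10  # extension window as a fraction of the budget
    count: int = field(default=0, init=False)            # thinking tokens accepted
    thinking: bool = field(default=False, init=False)    # inside the thinking region
    done: bool = field(default=False, init=False)        # processor disabled itself
    last_was_newline: bool = field(default=False, init=False)

    def step(self, proposed: str) -> str:
        """Return the token that is emitted when the model proposes `proposed`."""
        if self.done:
            return proposed  # processor is disabled, pass everything through
        if not self.thinking:
            if proposed == self.start_tag:
                self.thinking = True
            return proposed
        if proposed == self.stop_tag:
            self.done = True  # the model closed the region on its own
            return proposed
        extension = int(self.max_thinking_tokens * self.extension_ratio)
        hard_limit = self.max_thinking_tokens + extension
        if self.count >= self.max_thinking_tokens and (
            self.last_was_newline or self.count >= hard_limit
        ):
            # Budget reached and either the sentence ended or the extension
            # window is used up: override the proposed token with the exit tag.
            self.done = True
            return self.stop_tag
        self.count += 1
        self.last_was_newline = proposed.endswith("\n")
        return proposed


def run(tokens: list[str], budget: int) -> list[str]:
    """Feed proposed tokens through the sketch and collect the emitted ones."""
    processor = BudgetControlSketch(max_thinking_tokens=budget)
    return [processor.step(token) for token in tokens]
```

In this sketch, a budget of 10 yields an extension window of one token, so a model that never emits a newline is cut off after 11 thinking tokens, while a model whose tenth thinking token ends in a newline is cut off immediately after it.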
How to enable#
Set the environment variable NIM_ENABLE_BUDGET_CONTROL to 1 to enable thinking budget control. Set NIM_BUDGET_CONTROL_THINKING_START_STRING and NIM_BUDGET_CONTROL_THINKING_STOP_STRING to the start and end thinking tags for the model you’re using. You can also append extra prompt text through these environment variables. For example, NIM_BUDGET_CONTROL_THINKING_STOP_STRING=$'</think>\nFinal answer:' adds Final answer: on a new line after the closing think tag.
docker run -e NIM_ENABLE_BUDGET_CONTROL=1 \
  -e NIM_BUDGET_CONTROL_THINKING_START_STRING="<think>" \
  -e NIM_BUDGET_CONTROL_THINKING_STOP_STRING="</think>" \
  … <IMAGE>
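If you launch the container from a script rather than a shell, the same settings can be assembled as an argument list. The following is a minimal sketch: the function name build_docker_command, the extra_args parameter, and the image name used in the example are hypothetical placeholders, and only the three environment variables come from this page. Passing the result to subprocess.run as a list bypasses shell quoting, so a stop string that contains a real newline is handed to Docker unchanged.

```python
import shlex


def build_docker_command(
    image: str,
    start_tag: str = "<think>",
    stop_tag: str = "</think>",
    extra_args: tuple[str, ...] = (),
) -> list[str]:
    """Build a `docker run` argument list with budget control enabled."""
    env = {
        "NIM_ENABLE_BUDGET_CONTROL": "1",
        "NIM_BUDGET_CONTROL_THINKING_START_STRING": start_tag,
        "NIM_BUDGET_CONTROL_THINKING_STOP_STRING": stop_tag,
    }
    cmd = ["docker", "run"]
    for name, value in env.items():
        cmd += ["-e", f"{name}={value}"]
    cmd += list(extra_args)  # any other flags your deployment needs
    cmd.append(image)        # placeholder: substitute your actual image
    return cmd


if __name__ == "__main__":
    # "my-nim-image" is a made-up name used only for illustration.
    command = build_docker_command(
        "my-nim-image", stop_tag="</think>\nFinal answer:"
    )
    # shlex.join shows a shell-safe rendering; use subprocess.run(command)
    # to actually start the container.
    print(shlex.join(command))
```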
Making requests#
Set the nvext.max_thinking_tokens field on your (chat) completion request:
{
  "model": "my-model",
  "messages": [{"role": "user", "content": "…"}],
  "max_tokens": 512,
  "nvext": {
    "max_thinking_tokens": 128
  }
}
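The request body above can also be built and sent programmatically. The sketch below uses only the Python standard library. The helper names build_request and send_request are hypothetical, and the default URL is an assumption about a locally running OpenAI-compatible endpoint, so adjust it to your deployment.

```python
import json
import urllib.request


def build_request(
    model: str, user_content: str, max_tokens: int, max_thinking_tokens: int
) -> dict:
    """Build a chat completion body that carries a thinking budget."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
        "max_tokens": max_tokens,
        # The thinking budget lives under the nvext extension object.
        "nvext": {"max_thinking_tokens": max_thinking_tokens},
    }


def send_request(
    payload: dict,
    url: str = "http://localhost:8000/v1/chat/completions",  # assumed endpoint
) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, build_request("my-model", "…", 512, 128) produces the same body as the JSON shown above, and passing that dictionary to send_request submits it to the server.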
Advanced knobs (environment variables)#
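Variable | Default | Description |
---|---|---|
NIM_ENABLE_BUDGET_CONTROL | – | Master switch. Set to 1 to enable thinking budget control. |
NIM_BUDGET_CONTROL_THINKING_START_STRING | <think> | Tag that marks the beginning of the thinking region. |
NIM_BUDGET_CONTROL_THINKING_STOP_STRING | </think> | Tag that marks the end of the thinking region. |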
Status#
Important
Thinking budget control is not supported with SGLang.
Important
Thinking budget control cannot be used with other structured generation backends at the moment.