Remove Reasoning Traces#
This guide walks you through configuring NeMo Evaluator Launcher for evaluating reasoning models. It shows how to:
adjust sampling parameters
remove reasoning traces from the answer
controlling thinking budget
ensuring accurate benchmark evaluation.
Tip
Need more in-depth explanation? See the Evaluation of Reasoning Models guide.
Before You Start#
Ensure you have:
Model Endpoint: An OpenAI-compatible API reasoning endpoint for your model (completions or chat). See Testing Endpoint Compatibility for snippets you can use to test your endpoint and Evaluation of Reasoning Models for details on reasoning models.
API Access: Valid API key if your endpoint requires authentication
Installed Packages: NeMo Evaluator or access to evaluation containers
Prepare your config file#
Configure the Evaluation#
Select tasks:
evaluation:
tasks:
- name: simple_evals.mmlu_pro
- name: mgsm
Adjust sampling parameters for a reasoning model, e.g.:
evaluation:
tasks:
- name: simple_evals.mmlu_pro
- name: mgsm
nemo_evaluator_config:
config:
params:
temperature: 0.6
top_p: 0.95
max_new_tokens: 32768 # for reasoning + final answer
request_timeout: 3600 # long timeout to account for thinking time
parallelism: 1 # single parallel request to avoid overloading the server
Enable Reasoning Interceptor to remove reasoning traces from the model’s responses:
evaluation:
tasks:
- name: simple_evals.mmlu_pro
- name: mgsm
nemo_evaluator_config:
config:
params:
temperature: 0.6
top_p: 0.95
max_new_tokens: 32768 # for reasoning + final answer
request_timeout: 3600 # long timeout to account for thinking time
parallelism: 1 # single parallel request to avoid overloading the server
target:
api_endpoint:
adapter_config:
interceptors:
- name: endpoint
- name: reasoning
In this example we will use NVIDIA-Nemotron-Nano-9B-v2, which produces reasoning trace in a <think>...</think> format.
If your model uses a different formatting, make sure to configure the interceptor as shown in Evaluation of Reasoning Models.
(Optional) Modify the request to turn the reasoning on.
In this example we work with an endpoint that requires “/think” to be present in the system message to enable reasoning. We will use the Interceptor to add it to the request.
Adjust the example below to match your endpoint (see detailed instructions in Evaluation of Reasoning Models).
evaluation:
tasks:
- name: simple_evals.mmlu_pro
- name: mgsm
nemo_evaluator_config:
target:
api_endpoint:
adapter_config:
interceptors:
- name: system_message
config:
system_message: "/think"
- name: endpoint
- name: reasoning
Select your execution backend and deployment specification#
For the purpose of this example, we will use local execution without deployment. See other How-to guides to adjust this example to your needs.
Configure local executor
defaults:
- execution: local
- _self_
execution:
output_dir: nel-results
Configure target endpoint
defaults:
- execution: local
- deployment: none
- _self_
execution:
output_dir: nel-results
target:
api_endpoint:
# see https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2 for endpoint details
model_id: nvidia/nvidia-nemotron-nano-9b-v2
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com
The Full Config#
Combine all components into a config file for your experiment:
defaults:
- execution: local
- deployment: none
- _self_
execution:
output_dir: nel-results
target:
api_endpoint:
# see https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2 for endpoint details
model_id: nvidia/nvidia-nemotron-nano-9b-v2
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key_name: NGC_API_KEY # API Key with access to build.nvidia.com
evaluation:
tasks:
- name: simple_evals.mmlu_pro
- name: mgsm
nemo_evaluator_config:
config:
params:
temperature: 0.6
top_p: 0.95
max_new_tokens: 32768 # for reasoning + final answer
request_timeout: 3600 # long timeout to account for thinking time
parallelism: 1 # single parallel request to avoid overloading the server
target:
api_endpoint:
adapter_config:
interceptors:
- name: system_message
config:
system_message: "/think"
- name: endpoint
- name: reasoning
Verify and execute your experiment#
Save the prepared config in a file, e.g.
nemotron_eval.yaml(Recommended) Inspect the configuration with
--dry_run
export NGC_API_KEY=nvapi-your-key
nemo-evaluator-launcher run --config nemotron_eval.yaml --dry_run
(Recommended) Run a short experiment with 10 samples per benchmark to verify your config
export NGC_API_KEY=nvapi-your-key
nemo-evaluator-launcher run --config nemotron_eval.yaml \
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
Tip
If everything works correctly you should see logs from the ResponseReasoningInterceptor similar to the ones below:
[I 2025-12-02T16:14:28.257] Reasoning tracking information reasoning_words=1905 original_content_words=85 updated_content_words=85 reasoning_finished=True reasoning_started=True reasoning_tokens=unknown updated_content_tokens=unknown logger=ResponseReasoningInterceptor request_id=ccff76b2-2b85-4eed-a9d0-2363b533ae58
Run the full experiment
export NGC_API_KEY=nvapi-your-key
nemo-evaluator-launcher run --config nemotron_eval.yaml
Analyze the metrics and reasoning statistics
After evaluation completes, check these key artifacts:
results.yaml: Contains the benchmark metrics (see Evaluation Output)eval_factory_metrics.json: Contains reasoning statistics under thereasoningkey, including:responses_with_reasoning: How many responses included reasoning tracesreasoning_finished_countvsreasoning_started_count: If these match, yourmax_new_tokenswas sufficientreasoning_unfinished_count: Number of responses where reasoning started but was truncated (didn’t reach end token)reasoning_finished_ratio: Percentage (expressed as ratio within 0-1) of responses where reasoning completed to all responses with reasoningavg_reasoning_wordsand other word- and tokens count metrics: Use these for cost analysis
Tip
For detailed explanation of reasoning statistics and artifacts, see Evaluation of Reasoning Models.