# Reason1 Evaluation Guide
This page outlines how to set up and run evaluation of a trained Reason1 model using the `evaluate.py` script.
## Download Benchmark Assets
Download annotations and sample video clips using the script below:
```bash
python tools/eval/download_hf_data.py \
    --target data \
    --task benchmark
```
This script downloads the following:
- Annotations for the following platforms:
  - AV: General description, driving difficulty, and notices for autonomous vehicles
  - RoboVQA: Videos, instructions, and question-answer pairs for agents (robots, humans, humans-with-grasping-tools) executing a task
  - AgiBot-World: A wide range of real-life tasks for robot manipulation
  - BridgeData V2: A wide array of robotic manipulation behaviors
  - HoloAssist: First-person perspectives that provide a natural and immersive understanding of human actions
- Video clips for the following platforms:
  - AV
  - RoboVQA
> **Note:** Optional video clips for AgiBot-World, BridgeData V2, and HoloAssist must be downloaded manually in the next step.
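To confirm the download completed before moving on, you can run a quick check like the sketch below. The directory names under `data/` are assumptions, not guaranteed output of the download script; adjust them to match what you actually see on disk.

```python
from pathlib import Path

# Hypothetical layout: assumes the download script places assets under data/<dataset>.
# Adjust these names to the directories actually created on your machine.
EXPECTED_DIRS = ["av", "robovqa", "agibot", "bridgev2", "holoassist"]

data_root = Path("data")
for name in EXPECTED_DIRS:
    path = data_root / name
    status = "found" if path.exists() else "missing (may require manual download)"
    print(f"{path}: {status}")
```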
## Download Additional Video Clips (Optional)
Follow these steps to manually download and preprocess the AgiBot-World, BridgeData V2, or HoloAssist datasets for evaluation.
Ensure the following packages are installed:
```bash
pip install tensorflow tensorflow_datasets ffmpeg
```
Accept the AgiBot-World license on Hugging Face to get access.
After downloading the data, run the following script:
> **Note:** Replace `holoassist` in the command below with `agibot` or `bridgev2` as needed.

```bash
python tools/eval/preprocessing.py \
    --dataset holoassist \
    --data_dir data \
    --task benchmark
```
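If you need all three optional datasets, a small driver script can run the preprocessing step once per dataset instead of repeating the command by hand. This is a sketch that simply shells out to `tools/eval/preprocessing.py` with the flags shown above; it is not part of the repository tooling.

```python
import subprocess

# Datasets that require manual download and preprocessing (see the note above).
OPTIONAL_DATASETS = ["agibot", "bridgev2", "holoassist"]

for dataset in OPTIONAL_DATASETS:
    # Same invocation as the command above, parameterized over the dataset name.
    subprocess.run(
        [
            "python", "tools/eval/preprocessing.py",
            "--dataset", dataset,
            "--data_dir", "data",
            "--task", "benchmark",
        ],
        check=True,
    )
```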
## Run Evaluation on Benchmarks
This step walks you through running evaluations on your model using the provided script.
### Configure Evaluation
You can configure evaluation settings by editing the YAML file for a specific dataset under `tools/eval/configs/`. For example, `robovqa.yaml` has the following configuration:
```yaml
datasets:
  - robovqa
model:
  model_name: nvidia/Cosmos-Reason1-7B  # You can also replace model_name with a local safetensors folder path
  tokenizer_model_name: qwen2.5-vl-7b
  dtype: bfloat16
  tp_size: 4
  max_length: 128000
evaluation:
  answer_type: reasoning
  num_processes: 80
  skip_saved: false
  fps: 4
  seed: 1
generation:
  max_retries: 10
  max_tokens: 1024
  temperature: 0.6
  repetition_penalty: 1.0
  presence_penalty: 0.0
  frequency_penalty: 0.0
```
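If you prefer not to edit the YAML by hand for every experiment, you can load it, override a field, and save a variant to pass to `evaluate.py` via `--config`. The sketch below uses PyYAML and assumes the structure shown above; the local safetensors path and output filename are placeholders.

```python
import yaml  # requires PyYAML

# Load the dataset-specific evaluation config shown above.
with open("tools/eval/configs/robovqa.yaml") as f:
    config = yaml.safe_load(f)

# Example tweak: point at a local safetensors folder and lower tensor parallelism.
config["model"]["model_name"] = "/path/to/local/safetensors"  # placeholder path
config["model"]["tp_size"] = 2

# Write a variant config to pass to evaluate.py via --config.
with open("tools/eval/configs/robovqa_local.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```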
### Run Evaluation
Run the `evaluate.py` script. The following example uses the RoboVQA dataset:
```bash
# Set tensor parallelism size (adjust as needed)
export TP_SIZE=4

# Run the evaluation script
PYTHONPATH=. python3 tools/eval/evaluate.py \
    --config tools/eval/configs/robovqa.yaml \
    --data_dir data \
    --results_dir results
```
> **Tip:** You can also use `--model_name` to specify either a Hugging Face model name or a local safetensors folder path.
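To evaluate several benchmarks in one run, you can wrap the command above in a short driver script. The sketch below is not part of the repository; it assumes each dataset has a matching config file under `tools/eval/configs/` and reuses the same flags and environment variables shown above.

```python
import os
import subprocess

# Extend with agibot, bridgev2, and holoassist once their clips are downloaded.
DATASETS = ["robovqa", "av"]

# Mirror the environment used in the manual command above.
env = {**os.environ, "TP_SIZE": "4", "PYTHONPATH": "."}

for dataset in DATASETS:
    subprocess.run(
        [
            "python3", "tools/eval/evaluate.py",
            "--config", f"tools/eval/configs/{dataset}.yaml",
            "--data_dir", "data",
            "--results_dir", "results",
        ],
        check=True,
        env=env,
    )
```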
## Benchmark Scoring
This step computes benchmark accuracy metrics from prediction results stored in a specified directory. It is used to evaluate model performance on datasets such as RoboVQA.
### About Evaluation
The evaluation process uses accuracy as the primary metric, comparing model predictions against ground-truth answers. Accuracy is computed as follows:
```
Accuracy = (# correct predictions) / (total questions)
```
For open-ended questions, a prediction is considered correct if it exactly matches the ground truth (case-insensitive string match). For multiple-choice questions, the selected option is compared against the correct choice.
> **Note:** These scoring rules follow common practices in the VLM QA literature, but users are encouraged to adapt or extend them for specific use cases (e.g., partial credit or VQA-style soft accuracy).
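As an illustration of these rules, the sketch below computes exact-match accuracy on a toy example. It is not the repository's scorer (`calculate_accuracy.py` operates on the saved prediction files); it only mirrors the matching logic described above.

```python
def is_correct(prediction: str, ground_truth: str) -> bool:
    """Exact, case-insensitive string match. For multiple-choice questions,
    the strings compared are the selected option and the correct option."""
    return prediction.strip().lower() == ground_truth.strip().lower()


def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Accuracy = (# correct predictions) / (total questions)."""
    correct = sum(is_correct(p, g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)


# Toy example with made-up data:
preds = ["pick up the cup", "B", "turn left"]
truths = ["Pick up the cup", "C", "turn left"]
print(f"accuracy = {accuracy(preds, truths):.3f}")  # 2 correct out of 3 -> 0.667
```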
### Usage
Run the following command to compute accuracy:
```bash
python tools/eval/calculate_accuracy.py --result_dir results --dataset robovqa
```
- `--result_dir`: Path to the directory containing the model prediction results. This should match the `--results_dir` used during evaluation in `evaluate.py`.
- `--dataset`: Name of the dataset to evaluate (e.g., `robovqa`, `av`, `agibot`, `bridgev2`, `holoassist`).