Reason1 Evaluation Guide#

This page describes how to set up and run evaluation of a trained Reason1 model using the evaluate.py script.

Download Benchmark Assets#

Download annotations and sample video clips using the script below:

python tools/eval/download_hf_data.py \
    --target data \
    --task benchmark 

This script downloads the following:

  • Annotations for the following platforms:

    • AV: General description, driving difficulty, and notices for autonomous vehicles

    • RoboVQA: Videos, instructions, and question-answer pairs for agents (robots, humans, humans-with-grasping-tools) executing a task

    • AgiBot-World: A wide range of real-life tasks for robot manipulation

    • BridgeData V2: A wide array of robotic manipulation behaviors

    • HoloAssist Dataset: Crucial first-person perspectives that provide natural and immersive understanding of human actions

  • Video clips for the following platforms:

    • AV

    • RoboVQA

Note

Optional video clips for AgiBot-World, BridgeData V2, and HoloAssist must be downloaded manually, as described in the next section.

Download Additional Video Clips (Optional)#

Follow these steps to manually download and preprocess the AgiBot-World, BridgeData V2, or HoloAssist datasets for evaluation.

  1. Ensure the following packages are installed:

    pip install tensorflow tensorflow_datasets ffmpeg
    
  2. Accept the AgiBot-World license on Hugging Face to get access.

  3. After downloading the data, run the following script:

    Note

    Replace holoassist in the command below with agibot or bridgev2 as needed.

    python tools/eval/preprocessing.py \
        --dataset holoassist \
        --data_dir data \
        --task benchmark
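
To preprocess all three optional datasets in one pass, you can drive the same script in a loop. The following is a minimal sketch (not part of the toolkit), assuming the raw data for each dataset has already been downloaded into data:

# Sketch: preprocess every optional dataset in sequence.
# Assumes the raw data has already been downloaded under data/.
import subprocess

for dataset in ["agibot", "bridgev2", "holoassist"]:
    subprocess.run(
        [
            "python", "tools/eval/preprocessing.py",
            "--dataset", dataset,
            "--data_dir", "data",
            "--task", "benchmark",
        ],
        check=True,  # stop immediately if preprocessing fails for a dataset
    )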
    

Run Evaluation on Benchmarks#

This section walks you through running evaluation of your model using the provided evaluate.py script.

Configure Evaluation#

You can configure evaluation settings by editing the YAML file for a specific dataset under tools/eval/configs/. For example, robovqa.yaml contains the following configuration:

datasets:
  - robovqa

model:
  model_name: nvidia/Cosmos-Reason1-7B # You can also set this to a local safetensors folder path (e.g. a trained checkpoint)
  tokenizer_model_name: qwen2.5-vl-7b
  dtype: bfloat16
  tp_size: 4
  max_length: 128000

evaluation:
  answer_type: reasoning
  num_processes: 80
  skip_saved: false
  fps: 4
  seed: 1

generation:
  max_retries: 10
  max_tokens: 1024
  temperature: 0.6
  repetition_penalty: 1.0
  presence_penalty: 0.0
  frequency_penalty: 0.0
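
If you prefer to adjust settings such as tp_size or fps programmatically rather than by hand, the config file can be loaded and rewritten with PyYAML. The following is a minimal sketch, assuming PyYAML is installed; the output filename robovqa_tp2.yaml is only illustrative:

# Sketch: tweak an evaluation config programmatically (assumes PyYAML is installed).
import yaml

with open("tools/eval/configs/robovqa.yaml") as f:
    config = yaml.safe_load(f)

# Example adjustments: smaller tensor-parallel size and lower sampling fps.
config["model"]["tp_size"] = 2
config["evaluation"]["fps"] = 2

# Write the modified config to an illustrative new file.
with open("tools/eval/configs/robovqa_tp2.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

You can then pass the new file to evaluate.py via --config.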

Run Evaluation#

Run the evaluate.py script. The following example uses the RoboVQA dataset:

# Set tensor parallelism size (adjust as needed)
export TP_SIZE=4

# Run the evaluation script
PYTHONPATH=. python3 tools/eval/evaluate.py \
    --config tools/eval/configs/robovqa.yaml \
    --data_dir data \
    --results_dir results 

Tip

You can also use --model_name to specify either a Hugging Face model name or a local safetensors folder path.
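
To evaluate several benchmarks back to back, evaluate.py can be invoked once per config file. The following is a minimal sketch; only robovqa.yaml is shown in this guide, so the other config filenames are assumptions and should be adjusted to match the files under tools/eval/configs/:

# Sketch: run evaluation for several benchmarks in sequence.
# Config filenames other than robovqa.yaml are assumptions here.
import os
import subprocess

env = dict(os.environ, TP_SIZE="4", PYTHONPATH=".")

for dataset in ["robovqa", "av", "agibot", "bridgev2", "holoassist"]:
    subprocess.run(
        [
            "python3", "tools/eval/evaluate.py",
            "--config", f"tools/eval/configs/{dataset}.yaml",
            "--data_dir", "data",
            "--results_dir", "results",
        ],
        env=env,
        check=True,
    )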

Benchmark Scoring#

This step computes benchmark accuracy metrics from prediction results stored in a specified directory. It is used to evaluate model performance on datasets such as RoboVQA.

About Evaluation#

The evaluation process uses accuracy as the primary metric, comparing model predictions against ground-truth answers. Accuracy is computed as follows:

Accuracy = (# correct predictions) / (total questions)

For open-ended questions, a prediction is considered correct if it exactly matches the ground truth (case-insensitive string match). For multiple-choice questions, the selected option is compared against the correct choice.

Note

These scoring rules follow common practices in the VLM QA literature, but users are encouraged to adapt or extend them for specific use cases (e.g., partial credit or VQA-style soft accuracy).
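
As an illustration of these rules, the sketch below scores a list of predictions with the same case-insensitive exact-match criterion. It is a reimplementation of the described logic for clarity, not the actual calculate_accuracy.py code:

# Sketch: accuracy as described above (case-insensitive exact match).
# This illustrates the scoring rule; it is not the calculate_accuracy.py code.
def is_correct(prediction: str, ground_truth: str) -> bool:
    # Open-ended answers and selected multiple-choice options are both
    # compared as lower-cased, whitespace-trimmed strings.
    return prediction.strip().lower() == ground_truth.strip().lower()

def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    correct = sum(is_correct(p, g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

# Example: 2 of 3 predictions match the ground truth, so accuracy is 2/3.
print(accuracy(["turn left", "pick up the cup", "B"],
               ["Turn left", "open the drawer", "B"]))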

Usage#

Run the following command to compute accuracy:

python tools/eval/calculate_accuracy.py --result_dir results --dataset robovqa

  • --result_dir: Path to the directory containing the model prediction results. This should match the --results_dir value used when running evaluate.py.

  • --dataset: Name of the dataset to evaluate (e.g. robovqa, av, agibot, bridgev2, holoassist)