Benchmark Example

This guide provides instructions for evaluating models on the Cosmos-Reason1 Benchmark.

Setup

  1. Follow the Installation guide to install system dependencies and clone the Cosmos-Reason1 repository from GitHub.

  2. Switch to the benchmark directory:

    cd examples/benchmark
    

Prepare the Dataset

  1. Request access to the AgiBotWorld-Beta dataset.

  2. Download annotations and sample video clips (a quick check of the unpacked layout is sketched after this list):

    # Download
    hf download --repo-type dataset nvidia/Cosmos-Reason1-Benchmark --local-dir data/benchmark
    # Unpack (enable globstar so ** matches archives at any depth in bash)
    shopt -s globstar
    for file in data/benchmark/**/*.tar.gz; do tar -xzf "$file" -C "$(dirname "$file")"; done
    

    Note

    The following will be downloaded:

    • Annotations:

      • AV: General description, driving difficulty, and notice annotations for autonomous vehicles

      • RoboVQA: Videos, instructions, and question-answer pairs of agents (robots, humans, humans-with-grasping-tools) executing a task.

      • AgiBot-World: A wide range of real-life tasks for robot manipulation

      • BridgeData V2: A wide array of robotic manipulation behaviors

      • HoloAssist Dataset: Crucial first-person perspectives that provide natural and immersive understanding of human actions

    • Video clips:

      • AV

      • RoboVQA

  3. [Optional] Download the full dataset. This will take a long time and require multiple terabytes of disk space:

    ./tools/eval/process_raw_data.py --data_dir data --task benchmark
    

    Note

    The following will be downloaded:

    • Video clips:

      • AgiBot-World

      • BridgeData V2

      • HoloAssist

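Before moving on, it can help to confirm that the annotations and clips actually landed where the evaluation expects them. The following is a minimal sketch, assuming everything sits under data/benchmark as in the commands above; the file extensions it counts are only examples, not a guaranteed layout.

    from pathlib import Path

    root = Path("data/benchmark")

    # Top-level entries created by the download/unpack steps.
    for entry in sorted(root.iterdir()):
        print(entry)

    # Rough file counts by type. The extensions are assumptions; adjust them to
    # whatever the listing above actually shows.
    for pattern in ("*.tar.gz", "*.json", "*.mp4"):
        count = sum(1 for _ in root.rglob(pattern))
        print(f"{pattern}: {count} file(s)")
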
Run Evaluation

  1. Configure evaluation settings by editing the examples/benchmark/configs/evaluate.yaml file (a quick way to inspect the config is sketched after this list).

  2. Evaluate the model on the dataset:

    ./tools/eval/evaluate.py --config configs/evaluate.yaml --data_dir data --results_dir outputs/benchmark
    

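Before launching a long run, you may want to sanity-check the settings you just edited. The sketch below merely loads the config and prints its top-level entries; it assumes PyYAML is available and makes no assumptions about the specific option names inside evaluate.yaml.

    import yaml  # PyYAML: pip install pyyaml

    # Load the evaluation config and print its top-level entries.
    # Run this from examples/benchmark so the relative path matches the command above.
    with open("configs/evaluate.yaml") as f:
        config = yaml.safe_load(f)

    if isinstance(config, dict):
        for key, value in config.items():
            print(f"{key}: {value!r}")
    else:
        print(config)
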
Calculate Accuracy

Use the following script to calculate the accuracy of the results:

./tools/eval/calculate_accuracy.py --result_dir outputs/benchmark

The script compares model predictions against ground-truth answers:

Accuracy = (# correct predictions) / (total questions)

For open-ended questions, a prediction is considered correct if it exactly matches the ground truth (case-insensitive string match). For multiple-choice questions, the selected option is compared against the correct choice.

Note

These scoring rules follow common practices in VLM QA literature, but you can adapt or extend them for specific use cases (e.g. partial credit or VQA-style soft accuracy).
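
As a concrete illustration of the default rules above, here is a minimal scoring sketch. The record format (prediction and answer fields) is hypothetical, chosen only to show the comparison; it is not the structure of the benchmark's actual result files.

    def is_correct(prediction: str, answer: str) -> bool:
        # Case-insensitive exact string match, as described above. The same rule
        # covers multiple-choice answers when both sides are the selected option.
        return prediction.lower() == answer.lower()

    def accuracy(results: list[dict]) -> float:
        # Accuracy = (# correct predictions) / (total questions)
        if not results:
            return 0.0
        correct = sum(is_correct(r["prediction"], r["answer"]) for r in results)
        return correct / len(results)

    # Hypothetical records, only to illustrate the rule; the benchmark's actual
    # result files may use different field names.
    results = [
        {"prediction": "B", "answer": "B"},                  # multiple-choice
        {"prediction": "turn left", "answer": "Turn left"},  # open-ended, case-insensitive
        {"prediction": "stop", "answer": "slow down"},       # counted as incorrect
    ]
    print(f"Accuracy: {accuracy(results):.2f}")  # 0.67

To award partial credit or VQA-style soft accuracy, you would swap is_correct for a graded score in [0, 1] and sum those scores instead.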