# Benchmark Example
This guide provides instructions for evaluating models on the Cosmos-Reason1 Benchmark.
## Setup
Follow the Installation guide to install system dependencies and clone the Cosmos-Reason1 repository from GitHub. Then switch to the `benchmark` directory:

```bash
cd examples/benchmark
```
## Prepare the Dataset
Request access to the AgiBotWorld-Beta dataset.
Download annotations and sample video clips:
```bash
# Download
hf download --repo-type dataset nvidia/Cosmos-Reason1-Benchmark --local-dir data/benchmark

# Unpack
for file in data/benchmark/**/*.tar.gz; do
    tar -xzf "$file" -C "$(dirname "$file")"
done
```
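The unpack loop relies on recursive globbing (`**`), which bash only expands recursively when `shopt -s globstar` is enabled. If that is inconvenient, the same step can be done in Python; the snippet below is a sketch, not part of the repository's tooling:

```python
# Sketch (not part of the repository's tooling): unpack every .tar.gz archive
# under data/benchmark, extracting each into the directory that contains it.
import tarfile
from pathlib import Path

for archive in Path("data/benchmark").rglob("*.tar.gz"):
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=archive.parent)
```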
**Note:** The following will be downloaded:

- Annotations:
  - **AV**: General description, driving difficulty, and notice for autonomous vehicles
  - **RoboVQA**: Videos, instructions, and question-answer pairs of agents (robots, humans, humans-with-grasping-tools) executing a task
  - **AgiBot-World**: A wide range of real-life tasks for robot manipulation
  - **BridgeData V2**: A wide array of robotic manipulation behaviors
  - **HoloAssist Dataset**: Crucial first-person perspectives that provide a natural and immersive understanding of human actions
- Video clips:
  - AV
  - RoboVQA
[Optional] Download the full dataset. This takes a long time and requires multiple terabytes of disk space:
```bash
./tools/eval/process_raw_data.py --data_dir data --task benchmark
```
**Note:** The following will be downloaded:

- Video clips:
  - AgiBot-World
  - BridgeData V2
  - HoloAssist
## Run Evaluation
Configure evaluation settings by editing the `examples/benchmark/configs/evaluate.yaml` file, then evaluate the model on the dataset:

```bash
./tools/eval/evaluate.py --config configs/evaluate.yaml --data_dir data --results_dir outputs/benchmark
```
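Before kicking off a long run, it can help to confirm that the config points at the intended model, data, and output locations. A minimal sketch for inspecting the parsed file, assuming PyYAML is installed (the actual keys are defined by the repository's schema):

```python
# Sketch: load and print the evaluation config to confirm its settings before
# launching a run. Assumes PyYAML; the keys depend on the repository's schema.
import yaml

with open("examples/benchmark/configs/evaluate.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg)
```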
## Calculate Accuracy
Use the following script to calculate the accuracy of the results:
```bash
./tools/eval/calculate_accuracy.py --result_dir outputs/benchmark
```
The script compares model predictions against ground-truth answers:
Accuracy = (# correct predictions) / (total questions)
For open-ended questions, a prediction is considered correct if it exactly matches the ground truth (case-insensitive string match). For multiple-choice questions, the selected option is compared against the correct choice.
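In code, that scoring rule amounts to roughly the following. This is a minimal sketch for illustration, not the repository's `calculate_accuracy.py`, which may normalize answers differently:

```python
# Minimal sketch of the scoring rule described above; the repository's
# calculate_accuracy.py may differ in details such as answer normalization.
def is_correct(prediction: str, ground_truth: str) -> bool:
    # Exact, case-insensitive match, used for open-ended answers and for the
    # selected multiple-choice option alike.
    return prediction.strip().lower() == ground_truth.strip().lower()

def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    correct = sum(is_correct(p, g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)
```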
**Note:** These scoring rules follow common practices in VLM QA literature, but you can adapt or extend them for specific use cases (e.g., partial credit or VQA-style soft accuracy).
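As one example of such an extension, a VQA-style soft-accuracy variant gives partial credit based on how many reference answers agree with the prediction. This is hypothetical here, since it needs several reference answers per question rather than a single ground truth:

```python
# Hypothetical extension: VQA-style soft accuracy. Requires multiple reference
# answers per question; a prediction earns min(#matching references / 3, 1).
def vqa_soft_accuracy(prediction: str, reference_answers: list[str]) -> float:
    pred = prediction.strip().lower()
    matches = sum(pred == ref.strip().lower() for ref in reference_answers)
    return min(matches / 3.0, 1.0)
```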