Run benchmarks
Prepare benchmark data
- Request access to the gated HuggingFace datasets required by the benchmarks.
- Set your HuggingFace token in your env.yaml. This is needed to authenticate to HuggingFace and authorize local download of the gated datasets above. You can create a HF token by following these instructions: https://huggingface.co/docs/hub/en/security-tokens
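As an illustration, the token entry in env.yaml might look like the following. This is a sketch only: the key name (`hf_token`) is an assumption, so match it to the field your repository's env.yaml template actually expects.

```yaml
# Hypothetical env.yaml fragment: the key name is an assumption;
# check your repository's env.yaml template for the expected field.
hf_token: hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx  # your HuggingFace access token
```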
- Prepare benchmark data using `ng_prepare_benchmark`. In the command below, we prepare the `aime24`, `aime25`, and `gpqa` benchmark datasets.
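A sketch of the invocation, assuming `ng_prepare_benchmark` accepts the benchmark names as positional arguments — this syntax is an assumption, so consult the tool's help output for the actual interface:

```shell
# Hypothetical invocation: the argument syntax is an assumption.
ng_prepare_benchmark aime24 aime25 gpqa
```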
Configure Weights & Biases benchmark result upload
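To allow result upload, authenticate to Weights & Biases. A minimal sketch using the standard W&B mechanisms (the `wandb` CLI login, or the `WANDB_API_KEY` environment variable); whether this tool reads the key from the environment or from env.yaml is an assumption to verify against your setup:

```shell
# Either log in interactively...
wandb login
# ...or set the API key in the environment (e.g. in your shell profile).
export WANDB_API_KEY=your-wandb-api-key
```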
Run benchmarks using an OpenAI model
- Configure the benchmark run. We set the W&B project and experiment name, which are used to control where outputs are saved.
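For illustration, the W&B project and experiment name might be set in the run configuration as below. The key names are assumptions, not the tool's confirmed schema; use whatever fields your benchmark config actually defines:

```yaml
# Hypothetical run-config fragment: key names are assumptions.
wandb:
  project: my-benchmark-project  # W&B project results are uploaded to
  name: my-experiment            # experiment name; controls where outputs are saved
```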
- When using `openai_model`, configure your OpenAI API key and other policy model information.
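For illustration, the OpenAI settings might live in env.yaml alongside the HuggingFace token. The key name below is an assumption, not the tool's confirmed schema:

```yaml
# Hypothetical env.yaml fragment: the key name is an assumption.
openai_api_key: sk-xxxxxxxxxxxxxxxxxxxxxxxx  # your OpenAI API key
```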
- Run the benchmarks using `gpt-5-nano-2025-08-07`.
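A sketch of selecting that model, assuming the model name is passed through the run configuration (the field name here is an assumption; check your config schema):

```yaml
# Hypothetical config fragment: the key name is an assumption.
policy_model: gpt-5-nano-2025-08-07
```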
You can resume stopped or crashed rollouts using: