Run benchmarks

Prepare benchmark data

  1. Request access to the gated HuggingFace datasets below.

     | Benchmark | Gated dataset to request access to |
     |-----------|------------------------------------|
     | GPQA      | Idavidrein/gpqa                    |
  2. Set your HuggingFace token in your env.yaml. This is needed to authenticate to HuggingFace and authorize local download of the gated datasets above.
$ echo "hf_token: ?" >> env.yaml

You can create a HF token by following the instructions at https://huggingface.co/docs/hub/en/security-tokens.
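As a quick sanity check, here is a sketch of appending the token and confirming the key landed in env.yaml (`hf_XXXX` is a dummy placeholder, not a real token):

```shell
# Append a placeholder token (replace hf_XXXX with your real token)
# and confirm the hf_token key is now present in env.yaml.
echo "hf_token: hf_XXXX" >> env.yaml
grep -q "^hf_token:" env.yaml && echo "hf_token key present"
```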

  3. Prepare benchmark data using ng_prepare_benchmark. The command below prepares the aime24, aime25, and gpqa benchmark datasets.
$ config_paths="benchmarks/aime24/config.yaml,\
> benchmarks/aime25/config.yaml,\
> benchmarks/gpqa/config.yaml"
$ ng_prepare_benchmark "+config_paths=[$config_paths]"
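The comma-joined list can also be built in a loop, which is easier to extend as you add benchmarks. A sketch using the same config paths as above:

```shell
# Build the same comma-separated config_paths value with a loop
# instead of a hand-wrapped string.
config_paths="benchmarks/aime24/config.yaml"
for b in aime25 gpqa; do
  config_paths="$config_paths,benchmarks/$b/config.yaml"
done
echo "$config_paths"
# → benchmarks/aime24/config.yaml,benchmarks/aime25/config.yaml,benchmarks/gpqa/config.yaml
```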

Configure Weights & Biases benchmark result upload

$ echo "wandb_api_key: ?" >> env.yaml

Run benchmarks using an OpenAI model

  1. Configure the benchmark run. We set the W&B project and experiment name, which control where outputs are saved.
$ WANDB_PROJECT=bxyu-gym-dev
$ EXPERIMENT_NAME=benchmark-dev/gpt-5-nano-2025-08-07
$
$ config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
> benchmarks/aime24/config.yaml,\
> benchmarks/aime25/config.yaml,\
> benchmarks/gpqa/config.yaml"
  2. If you are using openai_model, configure your OpenAI API key and other policy model information.
$ echo 'openai_api_key: ?
> policy_base_url: https://api.openai.com/v1
> policy_api_key: ${openai_api_key}' >> env.yaml
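Before launching a run, it can help to confirm env.yaml defines every policy key appended above. A sketch with placeholder values (replace them with your real settings):

```shell
# Write placeholder policy settings, then verify that each expected
# key exists in env.yaml.
cat >> env.yaml <<'EOF'
openai_api_key: sk-XXXX
policy_base_url: https://api.openai.com/v1
policy_api_key: ${openai_api_key}
EOF
for key in openai_api_key policy_base_url policy_api_key; do
  grep -q "^$key:" env.yaml || { echo "missing: $key"; exit 1; }
done
echo "all policy keys present"
```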
  3. Run the benchmarks using gpt-5-nano-2025-08-07.
$ ng_e2e_collect_rollouts \
> "+config_paths=[${config_paths}]" \
> +wandb_project=$WANDB_PROJECT \
> +wandb_name=$EXPERIMENT_NAME \
> ++output_jsonl_fpath=results/$EXPERIMENT_NAME.jsonl \
> ++resume_from_cache=true \
> ++split=benchmark \
> ++policy_model_name=gpt-5-nano-2025-08-07
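Note that EXPERIMENT_NAME contains a slash, so the output JSONL path resolves one directory deep under results/. A sketch with the names from above:

```shell
# EXPERIMENT_NAME contains a slash, so the JSONL lands in a
# subdirectory of results/.
EXPERIMENT_NAME=benchmark-dev/gpt-5-nano-2025-08-07
echo "results/$EXPERIMENT_NAME.jsonl"
# → results/benchmark-dev/gpt-5-nano-2025-08-07.jsonl
```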

You can resume stopped or crashed rollouts by re-running the same command with:

++resume_from_cache=true