# Audio GRPO on AVQA
This guide explains how to use NeMo RL to train Qwen2.5-Omni-3B with GRPO on the AVQA audio question-answering dataset, following the approach described in the R1-AQA paper, and then evaluate the trained model on the MMAU benchmark.
## 1. Train the Model
Run GRPO training with the provided config:
uv run examples/run_vlm_grpo.py --config examples/configs/audio_grpo_3B_megatron.yaml
Config: `examples/configs/audio_grpo_3B_megatron.yaml`
Key hyperparameters:
| Parameter | Value |
|---|---|
| Model | Qwen2.5-Omni-3B |
| Dataset | AVQA (train split) |
| GPUs | 8 x 1 node, Megatron backend |
| Learning rate | 1e-6 |
| KL penalty | 0.01 |
| Generations per prompt | 8 |
| Prompts per step | 8 |
| Max steps | 200 |
| Save period | 100 |
| Reward | format (0.2) + exact_alnum (0.8), see the sketch below |
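The reward in the last row combines a formatting check with an exact match on alphanumeric characters. The snippet below is a minimal sketch of how such a weighted reward could be computed; the `<think>`/`<answer>` tag layout and the helper names are illustrative assumptions, not the exact NeMo RL reward implementation.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows an assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, response, re.DOTALL) else 0.0

def exact_alnum_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the reference after keeping only
    lowercase alphanumeric characters (an 'exact_alnum' style comparison)."""
    def normalize(text: str) -> str:
        return "".join(ch for ch in text.lower() if ch.isalnum())
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = match.group(1) if match else response
    return 1.0 if normalize(answer) == normalize(ground_truth) else 0.0

def combined_reward(response: str, ground_truth: str) -> float:
    """Weighted sum mirroring the config: 0.2 * format + 0.8 * exact_alnum."""
    return 0.2 * format_reward(response) + 0.8 * exact_alnum_reward(response, ground_truth)

# A well-formed, correct answer earns the full reward of 1.0.
resp = "<think>The clip contains barking, so the animal is a dog.</think><answer>Dog</answer>"
print(combined_reward(resp, "dog"))  # 1.0
```

The 0.2/0.8 weighting keeps a small incentive for well-formed responses while making answer correctness the dominant training signal.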
## 2. Convert Checkpoint (Megatron to HF)
Throughout training, checkpoints are saved to the `results/audio_grpo_3B_megatron` directory (specified by `checkpointing.checkpoint_dir`). To evaluate a checkpoint, first convert it from Megatron format to Hugging Face format:
uv run --extra mcore python examples/converters/convert_megatron_to_hf.py \
--config results/audio_grpo_3B_megatron/step_200/config.yaml \
--megatron-ckpt-path results/audio_grpo_3B_megatron/step_200/policy/weights/iter_0000000 \
--hf-ckpt-path results/audio_grpo_3B_megatron/step_200/hf --no-strict
Replace the step number with the checkpoint you want to evaluate. Note that the `--extra mcore` flag is required for the Megatron converter.
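Optionally, you can sanity-check that the converted directory loads as a Hugging Face checkpoint before launching the full evaluation. The sketch below assumes a transformers version that supports Qwen2.5-Omni and that the converter writes the model config and processor files into the output directory; if the processor files are missing, load `AutoProcessor` from the base Qwen2.5-Omni-3B model instead.

```python
# Optional sanity check that the converted directory is a loadable HF checkpoint.
from transformers import AutoConfig, AutoProcessor

ckpt_dir = "results/audio_grpo_3B_megatron/step_200/hf"  # output of the conversion step above

config = AutoConfig.from_pretrained(ckpt_dir)
print(config.model_type)  # expect the Qwen2.5-Omni model type

# Assumes processor/tokenizer files were copied alongside the weights.
processor = AutoProcessor.from_pretrained(ckpt_dir)
print(type(processor).__name__)
```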
## 3. Evaluate on MMAU
Evaluate the converted checkpoint on the MMAU benchmark:
uv run examples/run_eval.py \
--config=examples/configs/evals/mmau.yaml \
generation.model_name=results/audio_grpo_3B_megatron/step_200/hf \
data.dataset_name=TwinkStart/MMAU
Config: `examples/configs/evals/mmau.yaml`
Use `generation.model_name` to specify the path to the converted Hugging Face checkpoint.
## 4. Results
Evaluating the step-200 checkpoint on MMAU, we get the following result:
============================================================
model_name='hf_iter_0000000' dataset_name='MMAU'
max_new_tokens=8000 temperature=0.0 top_p=1.0 top_k=-1 seed=42
metric=pass@1 num_tests_per_prompt=1
score=0.7210 (721.0/1000)
============================================================
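With `num_tests_per_prompt=1` and greedy decoding (`temperature=0.0`), pass@1 reduces to plain accuracy: the fraction of the 1000 MMAU questions whose single generated answer is judged correct. The sketch below simply illustrates that arithmetic.

```python
def pass_at_1(per_question_correct: list[bool]) -> float:
    """pass@1 with one sample per prompt is simply the fraction of correct answers."""
    return sum(per_question_correct) / len(per_question_correct)

# 721 correct answers out of 1000 questions -> 0.721, matching the score above.
print(pass_at_1([True] * 721 + [False] * 279))  # 0.721
```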
For reference, the table below compares the baseline, the R1-AQA Hugging Face vanilla implementation, and NeMo-RL:
| Model | MMAU Score |
|---|---|
| Qwen2.5-Omni-3B (baseline) | 69.8 |
| Qwen2.5-Omni-3B + GRPO (HF vanilla) | 71.6 |
| Qwen2.5-Omni-3B + GRPO (NeMo-RL) | 72.1 |
The NeMo-RL score (72.1) is comparable to, and slightly higher than, the score from the Hugging Face Transformers reference implementation (71.6), confirming that the training pipeline reproduces the expected improvement over the baseline.