GRPO on DeepScaler#

This guide explains how to use NeMo RL to train long Chain of Thought (CoT) reasoning models with Group Relative Policy Optimization (GRPO). To do so, we train DeepSeek-R1-Distill-Qwen-1.5B on the DeepScaleR dataset. We then show how to use NeMo RL’s evaluation scripts to evaluate the trained model on the AIME24 benchmark.

Train the Model#

We follow the DeepScaleR recipe and train the model in three stages with progressively longer context windows: 8K in the first stage, 16K in the second, and 24K in the third. To train the first stage with NeMo RL, use the examples/configs/grpo-deepscaler-1.5b-8K.yaml config file, which closely matches the experiment settings in the original DeepScaleR recipe. For the second and third stages, train with examples/configs/grpo-deepscaler-1.5b-16K.yaml and examples/configs/grpo-deepscaler-1.5b-24K.yaml, respectively.

uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml
uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-16K.yaml policy.model_name=/path/to/8K/checkpoint/hf
uv run examples/run_grpo_math.py --config=examples/configs/grpo-deepscaler-1.5b-24K.yaml policy.model_name=/path/to/16K/checkpoint/hf

At the end of each stage, you need a Hugging Face checkpoint to continue training from. To get it, convert the stage's final PyTorch distributed checkpoint to Hugging Face format with the following command:

uv run examples/converters/convert_dcp_to_hf.py --config=results/grpo-deepscaler-1.5b-8K/step_240/config.yaml --dcp-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/policy/weights --hf-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/hf

When launching the next stage, we pass this Hugging Face checkpoint as the initial checkpoint via policy.model_name. We train with an 8K context window for 240 steps, a 16K context window for 290 steps, and a 24K context window for 50 steps. We run all experiments on a single 8xH100 80GB node. If you're running on 8xA100 80GB nodes, you will need at least 1 node for 8K training and 2 nodes for 16K-24K training.
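
For example, chaining the end of the 8K stage into the start of the 16K stage might look like the following. The paths mirror the commands above; the step_240 directory is simply the last step of our 8K run, so adjust it to match your own checkpoint directory:

# Convert the final 8K-stage DCP checkpoint to Hugging Face format.
STAGE1_DIR=results/grpo-deepscaler-1.5b-8K/step_240
uv run examples/converters/convert_dcp_to_hf.py \
    --config=${STAGE1_DIR}/config.yaml \
    --dcp-ckpt-path=${STAGE1_DIR}/policy/weights \
    --hf-ckpt-path=${STAGE1_DIR}/hf
# Start the 16K stage from the converted checkpoint.
uv run examples/run_grpo_math.py \
    --config=examples/configs/grpo-deepscaler-1.5b-16K.yaml \
    policy.model_name=${STAGE1_DIR}/hf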

Training Curve#

When using the above commands, we get the following training curve:

Figure: Training performance (average training reward over training steps).

Notably, we are able to achieve an average training reward of 0.65 in just 400 training steps.

Evaluate the Model#

Throughout training, model checkpoints are saved to the results folder (specified by checkpointing.checkpoint_dir). To evaluate a model, first convert its PyTorch distributed checkpoint to Hugging Face format as before. Then, to evaluate on the AIME24 benchmark, use the following command:

uv run examples/run_eval.py \
    generation.model_name=results/grpo-deepscaler-1.5b-8K/step_240/hf \
    data.prompt_file=examples/prompts/cot.txt \
    generation.vllm_cfg.max_model_len=32768

Use generation.model_name to specify the path to the Hugging Face checkpoint you want to evaluate. During training, we also use AIME24 as the validation dataset and compute pass@1 on it.
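
The same command works for any converted checkpoint. For example, to evaluate a checkpoint from the end of the 24K stage, point generation.model_name at its converted Hugging Face directory (the step_50 path below is hypothetical and assumes the 24K stage ran for 50 steps as described above; substitute your actual step directory):

# Hypothetical checkpoint path; adjust the step directory to match your run.
uv run examples/run_eval.py \
    generation.model_name=results/grpo-deepscaler-1.5b-24K/step_50/hf \
    data.prompt_file=examples/prompts/cot.txt \
    generation.vllm_cfg.max_model_len=32768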

Evaluation Results#

Using the above instructions to train DeepSeek-R1-Distill-Qwen-1.5B on the DeepScaleR dataset, we can track the model’s performance on the AIME24 benchmark throughout training. The following plot shows the evaluation metrics as training progresses:

Figure: AIME24 performance throughout training.

We surpass OpenAI O1's performance on the AIME24 benchmark after about 600 training steps.