Training Approaches#
Goal: Understand the differences between SFT, DPO, and GRPO, and why reinforcement learning from verifiable rewards (RLVR) puts the focus on environments.
Comparing SFT and RL#
Supervised Fine-Tuning (SFT) fits best when clear target behaviors can be provided via demonstrations (instruction-response pairs). It is effective for teaching format and style. However, SFT has limitations:
Imitation over Adaptivity: When the dataset is small, models learn to mimic answers rather than the process used to reach them.
Data Bottlenecks: Curating high-quality demonstrations or large datasets for every edge case is expensive.
Brittleness: SFT models often struggle when scenarios fall outside their training distribution, so the dataset needs to be diverse and large.
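Mechanically, SFT is plain maximum-likelihood training on demonstrations. A minimal sketch of the per-token objective (assuming log-probabilities already computed by some model's forward pass; the function name is illustrative):

```python
import math

def sft_loss(token_logprobs):
    """Minimal SFT objective: mean negative log-likelihood of the
    demonstration tokens under the model (teacher forcing)."""
    return -sum(token_logprobs) / len(token_logprobs)

# A model that assigns probability 0.5 to each target token
# incurs a loss of log(2) per token.
loss = sft_loss([math.log(0.5), math.log(0.5)])
```

The model is rewarded only for reproducing the demonstration exactly, which is the root of the imitation and brittleness issues above.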
Reinforcement Learning (RL) becomes the better choice as complexity grows. Instead of telling the model “say exactly this,” RL provides a goal and a way to verify it. This allows the model to explore reasoning paths, making it resilient to edge cases. This tends to work well for tasks like math, code, and tool calling: tasks that have a clear path to verification of answers.
Combining SFT + RL#
In practice, SFT and RL are not mutually exclusive. A hybrid strategy is often employed:
SFT for Warm-Starting RL: Use a high-quality set of demonstrations to teach the chat template, tool-calling format, and general readability. This prevents the RL stage from wasting compute learning output formatting from scratch.
RL for Scaling: Transition to RL to allow the model to explore and self-correct. This “post-training” refinement is where reasoning and robustness are truly forged.
For example, the NVIDIA Nemotron 3 family of models utilizes SFT as a substantial first stage to ground the model before moving into RL refinement. The ultimate choice depends on compute budget, data availability, and the level of generalization the agent requires. That said, the industry is generally shifting towards allocating more compute during RL stages, especially as RL environments become more sophisticated and accessible.
From Algorithms to Environments#
Traditionally, RL methods like PPO (Proximal Policy Optimization) were the standard. However, PPO is resource-intensive: it requires training and serving auxiliary models (a reward model and a critic) alongside the policy, which has driven a shift toward more scalable algorithms.
Modern workflows increasingly adopt lighter-weight methods, each suited to a different aspect of model improvement.
DPO#
Direct Preference Optimization (DPO) sidesteps the RL loop entirely, treating alignment as a classification problem on static preference data.
Reward Type: Pairwise. Relies on labeled preferences (“Response A > Response B”).
Efficiency: Computationally light and stable, making it ideal for alignment tasks (safety, tone, style).
However, DPO lacks explicit reward optimization or exploration. It learns from fixed preference pairs, preventing it from discovering new strategies or optimizing long-horizon outcomes. Because the DPO algorithm models relative output preference rather than trajectory reward, it is less effective for agentic workflows requiring multi-step reasoning and tool use.
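The DPO objective itself is compact: a binary cross-entropy on the margin between the policy's and a frozen reference model's log-probabilities for the chosen versus rejected response. A minimal sketch (inputs are assumed to be summed sequence log-probabilities from the two models):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sketch of the DPO loss for one preference pair.
    Each argument is a summed log-probability of a full response."""
    # Log-ratio of policy vs. reference for each response
    chosen_ratio = pi_chosen - ref_chosen
    rejected_ratio = pi_rejected - ref_rejected
    # Implicit reward margin, scaled by beta
    margin = beta * (chosen_ratio - rejected_ratio)
    # Binary cross-entropy: -log(sigmoid(margin)) pushes the margin positive
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that nothing here samples from the model: the loss is computed entirely from a static dataset of preference pairs, which is precisely why DPO cannot explore.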
GRPO#
To address these limitations in agentic domains, developers are turning to algorithms that leverage verifiable rewards.
Group Relative Policy Optimization (GRPO) is a simplification of PPO that drops the learned critic: for each prompt, a group of outputs is generated and scored by a deterministic verifier, and each output's advantage is computed relative to the rest of its group.
Reward Type: Typically binary (0 or 1), but supports continuous values. While it thrives when an environment can programmatically say “Yes” or “No” (for example, passing a unit test), it supports more granular scoring as well.
Efficiency: Eliminating the value model and the reward model from PPO significantly reduces memory overhead and is a key factor in scaling reasoning capabilities.
During GRPO training, rollouts are generated by interacting with an environment: complete attempts at a task, from initial prompt through tool usage to a final reward score. A group of rollouts is generated per prompt, the environment scores each one, and the algorithm uses those relative scores to update the policy.
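The group-relative part is simple enough to sketch: each rollout's advantage is its reward minus the group mean, normalized by the group standard deviation, so the group itself plays the role of the critic's baseline (a minimal illustration, not a full GRPO implementation):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Sketch of GRPO's advantage estimate: standardize each rollout's
    reward against the statistics of its own group, replacing PPO's
    learned value model with a group baseline."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Binary verifier scores for a group of 4 rollouts: two passed, two failed.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat their group average get a positive advantage and are reinforced; the rest are suppressed. No separate value network needs to be trained or held in memory.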
The Rise of RLVR#
This broader shift toward verifiable correctness is distinct from any single algorithm. While verification can drive improvements even in supervised settings (such as rejection sampling), it is central to the paradigm of Reinforcement Learning from Verifiable Rewards (RLVR).
By replacing subjective scoring with explicit checks (Did the agent produce the correct answer? Did it call the right tools?), RLVR moves the “center of gravity” from the optimizer to the environment. Algorithms like GRPO provide an efficient mechanism to optimize against these environmental signals.
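What a verifiable reward looks like in practice is often just a small program. As a toy sketch for code generation (the `solve` convention and function names here are illustrative assumptions, not a standard API), the environment executes the model's candidate and returns the fraction of unit tests it passes:

```python
def code_reward(candidate_src, tests):
    """Toy verifiable reward: run the candidate source, which is assumed
    to define a function `solve`, and score it by unit-test pass rate."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # in real systems: sandboxed execution
    except Exception:
        return 0.0  # code that does not even run earns nothing
    solve = namespace.get("solve")
    if solve is None:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply does not count as passed
    return passed / len(tests)
```

A reward like this is cheap, deterministic, and impossible to flatter: the burden of making training effective shifts to designing environments whose checks actually capture the behavior you want.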