Goal: Understand the differences between SFT, DPO, and GRPO, and why reinforcement learning from verifiable rewards (RLVR) puts the focus on environments.
Supervised Fine-Tuning (SFT) fits best when clear target behaviors can be provided via demonstrations (instruction-response pairs). It is effective for teaching format and style. However, SFT has limitations:
Reinforcement Learning (RL) becomes the better choice as complexity grows. Instead of telling the model “say exactly this,” RL provides a goal and a way to verify it. The model can then explore different reasoning paths, which tends to make it more resilient to edge cases. This works well for math, code, and tool calling: tasks whose answers have a clear path to verification.
In practice, SFT and RL are not mutually exclusive. A hybrid strategy is often employed:
For example, the NVIDIA Nemotron 3 family of models utilizes SFT as a substantial first stage to ground the model before moving into RL refinement. The ultimate choice depends on compute budget, data availability, and the level of generalization the agent requires. That said, the industry is generally shifting towards allocating more compute during RL stages, especially as RL environments become more sophisticated and accessible.
Traditionally, RL methods like PPO (Proximal Policy Optimization) were the standard. However, PPO is resource-intensive: it requires additional compute-heavy models, such as a reward model and a critic model, alongside the policy being trained. That cost has driven a shift toward more scalable algorithms.
Modern workflows are increasingly adopting more efficient methods to handle different aspects of model improvement.
Direct Preference Optimization (DPO) sidesteps the RL loop entirely, treating alignment as a classification problem on static preference data.
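The classification framing can be made concrete with a minimal sketch of the per-pair DPO loss. It assumes the summed log-probabilities of the chosen and rejected responses have already been computed under both the policy and a frozen reference model. The function names, argument names, and the default beta value are illustrative choices, not taken from any particular library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio))."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Binary-classification logit: positive when the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)
```

When the policy and reference agree, the margin is zero and the loss equals log 2. The loss falls as the policy shifts probability toward the chosen response relative to the rejected one. No rollouts are sampled and no reward is computed anywhere in this calculation.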
However, DPO lacks explicit reward optimization or exploration. It learns from fixed preference pairs, preventing it from discovering new strategies or optimizing long-horizon outcomes. Because the DPO algorithm models relative output preference rather than trajectory reward, it is less effective for agentic workflows requiring multi-step reasoning and tool use.
To address these limitations in agentic domains, developers are turning to algorithms that leverage verifiable rewards.
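Group Relative Policy Optimization (GRPO) is a PPO variant that drops the separate critic model. Instead of learning a value function, it samples a group of outputs for each prompt, scores each one, and uses the group's own average score as the baseline. In RLVR setups, those scores come from a deterministic verifier rather than a learned reward model.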
During GRPO training, the model generates rollouts by interacting with an environment. A rollout is a complete attempt at a task, from the initial prompt through tool usage to a final reward score. For each task, the model generates a group of rollouts, the environment scores each one, and the algorithm uses those scores to update the model as training proceeds.
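The scoring step can be sketched as follows. This is a minimal illustration of the group-relative advantage only, assuming each rollout in a group has already received a scalar reward from the environment. The function name and the eps constant are illustrative, and the population standard deviation is an assumption, since implementations differ on this detail. A full GRPO update would also include the clipped policy-ratio objective and typically a KL penalty, which are not shown here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own group.

    The group mean acts as the baseline that a critic model would
    otherwise have to predict.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for the same prompt, scored 1.0 (pass) or 0.0 (fail).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Passing rollouts get positive advantages, failing ones negative.
```

If every rollout in a group receives the same reward, all advantages are zero and that group contributes no learning signal. This is one reason task difficulty in the environment matters.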
This broader shift toward verifiable correctness is distinct from any single algorithm. While verification can drive improvements even in supervised settings (such as rejection sampling), it is central to the paradigm of Reinforcement Learning from Verifiable Rewards (RLVR).
RLVR replaces subjective scoring with explicit checks: Did the agent produce the correct answer? Did it call the right tools? In doing so, it moves the “center of gravity” from the optimizer to the environment. Algorithms like GRPO provide an efficient mechanism to optimize against these environmental signals.
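As a rough illustration of what such explicit checks might look like, the sketch below scores a rollout with two deterministic tests: an exact match on the final answer, and a check that the required tools were called. The Rollout class, the verify function, and the binary 0/1 reward are all hypothetical. Real environments usually need more careful answer normalization and may award partial credit.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """Hypothetical record of one complete attempt at a task."""
    final_answer: str
    tool_calls: list[str] = field(default_factory=list)  # names of tools invoked


def verify(rollout: Rollout, expected_answer: str, required_tools: list[str]) -> float:
    """Return 1.0 only if both deterministic checks pass, else 0.0."""
    # Check 1: did the agent produce the correct answer?
    answer_ok = rollout.final_answer.strip() == expected_answer.strip()

    # Check 2: did it call the right tools? (every required tool at least once)
    tools_ok = set(required_tools).issubset(rollout.tool_calls)

    return 1.0 if (answer_ok and tools_ok) else 0.0


reward = verify(Rollout("42", ["calculator"]), "42", ["calculator"])
```

Because the reward is a pure function of the rollout and the task specification, the quality of training depends heavily on how well these checks capture the intended behavior.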