# NVIDIA NeMo Agent Toolkit Finetuning Harness for Reinforcement Learning
> **Warning:** Experimental feature: the Finetuning Harness is experimental and may change in future releases. Future versions may introduce breaking changes without notice.
The NeMo Agent Toolkit provides a finetuning harness for in-situ reinforcement learning of agentic LLM workflows, enabling agents to improve iteratively by learning from their interactions with environments, tools, and users.
## Overview
The finetuning harness is built on four foundational principles:
| Principle | Description |
|---|---|
| Decoupled Architecture | Training logic is separated from backends, allowing you to use any RL framework (OpenPipe ART, NeMo Aligner, or custom implementations). |
| In-Situ Training | Train agents with the same workflow you run in production, without moving to a different development environment. |
| Flexible Targeting | Finetune specific functions or entire workflows, enabling targeted improvements in complex agentic systems. |
| Composable Components | Three pluggable components (TrajectoryBuilder, TrainerAdapter, Trainer) can be mixed, matched, and customized (sketched below). |
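
To make the composition concrete, the sketch below outlines the three components as minimal Python interfaces. This is an illustration only, assuming hypothetical class and method names (`collect`, `submit`, `wait_until_complete`, `run`); it is not the toolkit's actual API.

```python
from abc import ABC, abstractmethod


class TrajectoryBuilder(ABC):
    """Hypothetical interface: turns workflow runs into reward-scored trajectories."""

    @abstractmethod
    def collect(self, examples: list[dict]) -> list[dict]:
        """Run evaluations on the examples and return scored trajectories."""


class TrainerAdapter(ABC):
    """Hypothetical interface: bridges trajectories to one specific RL backend."""

    @abstractmethod
    def submit(self, trajectories: list[dict]) -> None:
        """Validate the trajectories and submit them to the training backend."""

    @abstractmethod
    def wait_until_complete(self) -> None:
        """Monitor the backend and block until the training step finishes."""


class Trainer(ABC):
    """Hypothetical interface: orchestrates the finetuning loop across epochs."""

    def __init__(self, builder: TrajectoryBuilder, adapter: TrainerAdapter) -> None:
        self.builder = builder
        self.adapter = adapter

    @abstractmethod
    def run(self, num_epochs: int) -> None:
        """Alternate trajectory collection and backend training for each epoch."""
```

Because each interface is independent, you can, for example, swap in a different TrainerAdapter to target another backend without touching trajectory collection.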
## Architecture

```
┌────────────────────────────────────────────────────────────────────────┐
│                                Trainer                                 │
│            (Orchestrates the finetuning loop across epochs)            │
│                                                                        │
│   ┌───────────────────────┐           ┌───────────────────────────┐    │
│   │   TrajectoryBuilder   │           │      TrainerAdapter       │    │
│   │                       │           │                           │    │
│   │ - Runs evaluations    │  ──────►  │ - Validates trajectories  │    │
│   │ - Collects episodes   │           │ - Submits to backend      │    │
│   │ - Computes rewards    │           │ - Monitors training       │    │
│   │ - Groups trajectories │           │ - Reports status          │    │
│   └───────────────────────┘           └───────────────────────────┘    │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                       ┌─────────────────────────┐
                       │     Remote Training     │
                       │         Backend         │
                       └─────────────────────────┘
```
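
The control flow implied by the diagram can be summarized in a few lines. This sketch reuses the hypothetical interfaces from the overview and is purely illustrative:

```python
def run_finetuning_loop(builder, adapter, train_examples, num_epochs=3):
    """Illustrative epoch loop, not the toolkit's real implementation."""
    for epoch in range(num_epochs):
        # TrajectoryBuilder: run evaluations, collect episodes, compute
        # rewards, and group the resulting trajectories.
        trajectories = builder.collect(train_examples)
        # TrainerAdapter: validate and submit trajectories to the remote
        # backend, then monitor the run until the training step completes.
        adapter.submit(trajectories)
        adapter.wait_until_complete()
```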
## Documentation

| Guide | Description |
|---|---|
|  | Core concepts, RL fundamentals, curriculum learning, and architecture details |
|  | How to implement custom TrajectoryBuilders, TrainerAdapters, and Trainers |
|  | Using the OpenPipe ART backend for GRPO training |
## Supported Backends

| Backend | Plugin Package | Description |
|---|---|---|
| OpenPipe ART |  | GRPO-based training with vLLM and TorchTune |
## Key Features

- **Curriculum Learning**: Progressively introduce harder examples during training
- **Multi-Generation Trajectories**: Collect multiple responses per example for GRPO optimization (see the advantage formula after this list)
- **Validation Monitoring**: Periodic evaluation on held-out data to track generalization
- **Progress Visualization**: Automatic reward plots and metrics logging
- **Flexible Targeting**: Train specific functions or models in complex workflows
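
Multiple generations per example are needed because GRPO normalizes rewards within each group of responses. In the standard GRPO formulation (shown here for reference), with $G$ sampled responses to the same example and rewards $r_1, \ldots, r_G$, the advantage of response $i$ is:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
$$

A single response per example would make these group statistics degenerate, which is why the harness collects several responses per prompt.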
## Requirements

- Training backend (for example, an OpenPipe ART server with a GPU)
- LLM inference endpoint with log-probability support
- Training dataset in JSON/JSONL format
- Custom evaluator for computing rewards (a rough sketch follows)
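
To make the last two requirements concrete, here is a rough sketch of what a dataset row and a reward evaluator could look like. The field names and the evaluator signature are placeholders; the real schema and evaluator interface depend on your workflow and the guides above.

```python
import json

# Hypothetical JSONL row: one training example with an input and a reference answer.
row = {"id": "ex-001", "question": "What is 17 * 24?", "reference_answer": "408"}
print(json.dumps(row))


def compute_reward(model_answer: str, reference_answer: str) -> float:
    """Placeholder evaluator returning a reward in [0, 1].

    Exact match is the simplest possible scheme; a real evaluator might use
    partial credit, task-specific checks, or an LLM judge.
    """
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0
```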