GRPO with OpenPipe ART#
This guide covers the integration between the NVIDIA NeMo Agent toolkit finetuning harness and OpenPipe ART (Agent Reinforcement Trainer), an open-source framework for teaching LLMs through reinforcement learning.
About OpenPipe ART#
OpenPipe ART is designed to improve agent performance and reliability through experience. It provides:
GRPO Training: Uses Group Relative Policy Optimization, which compares multiple responses to the same prompt rather than requiring a separate value function
Async Client-Server Architecture: Separates inference from training, allowing you to run inference anywhere while training happens on GPU infrastructure
Easy Integration: Designed to work with existing LLM applications with minimal code changes
Built-in Observability: Integrations with Weights & Biases, Langfuse, and OpenPipe for monitoring and debugging
When to Use ART#
ART is well-suited for scenarios where:
You want to improve agent reliability on specific tasks
You have a way to score agent performance (even if you don't have "correct" answers)
You're working with agentic workflows that make decisions or take actions
You want to iterate quickly with online training
ART Architecture#
```
Your Application                              ART Backend Server (GPU)
┌─────────────────────────┐                   ┌───────────────────────────────┐
│ Workflow                │  uses model for   │ vLLM Inference Engine         │
│    │                    │  inference ······►│ (serves updated weights)      │
│    │ Trajectories       │                   │                               │
│    ▼                    │                   │                               │
│ ARTTrajectoryBuilder    │  training         │ GRPO Trainer (TorchTune)      │
│ ARTTrainerAdapter       │  request ────────►│ (updates model weights)       │
└─────────────────────────┘                   └───────────────────────────────┘
```
The ART backend runs on GPU infrastructure and provides:
vLLM Inference Engine: Serves the model for inference with log probability support
GRPO Trainer: Performs weight updates based on submitted trajectories
The NeMo Agent toolkit connects to this backend through the ARTTrainerAdapter, which handles the protocol for submitting trajectories and monitoring training.
Supported Agent Frameworks#
The following table highlights the current support matrix for using ART with different agent frameworks in the NeMo Agent toolkit:
| Agent Framework | Support |
|---|---|
| LangChain or LangGraph | ✅ Supported |
| Google ADK | ✅ Supported |
| LlamaIndex | ✅ Supported |
| All others | 🛠️ In Progress |
Installation#
Install the OpenPipe ART plugin package:
```bash
pip install nvidia-nat-openpipe-art
```
This provides:
`openpipe_art_traj_builder`: The trajectory builder implementation
`openpipe_art_trainer_adapter`: The trainer adapter for ART
`openpipe_art_trainer`: The trainer orchestrator
You'll also need to set up an ART backend server. See the ART documentation for server setup instructions.
Configuration#
Basic Configuration#
```yaml
llms:
  training_llm:
    _type: openai
    model_name: Qwen/Qwen2.5-3B-Instruct
    base_url: http://localhost:8000/v1  # ART inference endpoint
    api_key: default
    temperature: 0.4

workflow:
  _type: my_workflow
  llm: training_llm

eval:
  general:
    max_concurrency: 16
    output_dir: .tmp/nat/finetuning/eval
    dataset:
      _type: json
      file_path: data/training_data.json
  evaluators:
    my_reward:
      _type: my_custom_evaluator

trajectory_builders:
  art_builder:
    _type: openpipe_art_traj_builder
    num_generations: 2

trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      ip: "localhost"
      port: 7623
      name: "my_training_run"
      project: "my_project"
      base_model: "Qwen/Qwen2.5-3B-Instruct"
      api_key: "default"
    training:
      learning_rate: 1e-6

trainers:
  art_trainer:
    _type: openpipe_art_trainer

finetuning:
  enabled: true
  trainer: art_trainer
  trajectory_builder: art_builder
  trainer_adapter: art_adapter
  reward_function:
    name: my_reward
  num_epochs: 20
  output_dir: .tmp/nat/finetuning/output
```
Configuration Reference#
Trajectory Builder Configuration#
```yaml
trajectory_builders:
  art_builder:
    _type: openpipe_art_traj_builder
    num_generations: 2  # Trajectories per example
```
| Field | Type | Default | Description |
|---|---|---|---|
| `num_generations` | int |  | Number of trajectory generations per example. More generations provide better GRPO signal but increase computation time. |
Trainer Adapter Configuration#
```yaml
trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      ip: "0.0.0.0"
      port: 7623
      name: "training_run_name"
      project: "project_name"
      base_model: "Qwen/Qwen2.5-3B-Instruct"
      api_key: "default"
      delete_old_checkpoints: false
      init_args:
        max_seq_length: 8192
      engine_args:
        gpu_memory_utilization: 0.9
        tensor_parallel_size: 1
    training:
      learning_rate: 1e-6
      beta: 0.0
```
Backend Configuration

| Field | Type | Default | Description |
|---|---|---|---|
| `ip` | str | - | IP address of the ART backend server |
| `port` | int | - | Port of the ART backend server |
| `name` | str |  | Name for this training run |
| `project` | str |  | Project name for organization |
| `base_model` | str |  | Base model being trained (must match server) |
| `api_key` | str |  | API key for authentication |
| `delete_old_checkpoints` | bool |  | Delete old checkpoints before training |
Model Initialization Arguments (init_args)

| Field | Type | Default | Description |
|---|---|---|---|
| `max_seq_length` | int | - | Maximum sequence length for the model |
vLLM Engine Arguments (engine_args)

| Field | Type | Default | Description |
|---|---|---|---|
| `gpu_memory_utilization` | float | - | Fraction of GPU memory to use (0.0-1.0) |
| `tensor_parallel_size` | int | - | Number of GPUs for tensor parallelism |
Training Arguments

| Field | Type | Default | Description |
|---|---|---|---|
| `learning_rate` | float |  | Learning rate for GRPO updates |
| `beta` | float |  | KL penalty coefficient |
Trainer Configuration#
```yaml
trainers:
  art_trainer:
    _type: openpipe_art_trainer
```
The trainer has no additional configuration options; it uses the shared finetuning configuration.
How It Works#
ARTTrajectoryBuilder#
The ARTTrajectoryBuilder collects training trajectories through the NeMo Agent toolkit evaluation system:
The `start_run()` call launches N parallel evaluation runs, where N is `num_generations`. Each run:

1. Loads the training dataset
2. Runs the workflow on each example
3. Captures intermediate steps (with logprobs from LLM calls)
4. Computes the reward using the configured evaluator

The `finalize()` call waits for all evaluation runs to complete and then, for each result:

1. Extracts the reward from the evaluator output
2. Filters intermediate steps to the target functions
3. Parses steps into OpenAI message format
4. Validates that assistant messages have logprobs
5. Groups trajectories by example ID

Finally, it returns a `TrajectoryCollection` grouped by example for GRPO.
Key Implementation Details:
Parallel Generation: Multiple evaluation runs execute concurrently using `asyncio.create_task()`. This generates diverse trajectories for the same inputs.

Log Probability Extraction: The builder parses intermediate steps to extract log probabilities from LLM responses. Messages without logprobs are skipped since they can't be used for training.

Target Function Filtering: Only steps from functions listed in `finetuning.target_functions` are included. This lets you focus training on specific parts of complex workflows.

Grouping for GRPO: Trajectories are organized as `list[list[Trajectory]]`, where each inner list contains all generations for a single example. This structure enables group-relative policy optimization, as illustrated in the sketch below.
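The grouping structure is easiest to see in code. The following is a minimal illustrative sketch (not the toolkit's actual implementation; `run_eval` is a hypothetical coroutine standing in for one evaluation pass) of how concurrent generations are collected and grouped by example ID:

```python
# Illustrative sketch only: launch num_generations evaluation passes
# concurrently and group the resulting trajectories by example ID so that
# GRPO can compare generations of the same prompt against each other.
import asyncio
from collections import defaultdict


async def collect_grouped_trajectories(run_eval, num_generations: int):
    # run_eval() is a hypothetical coroutine that performs one full pass over
    # the training dataset and returns (example_id, trajectory) pairs.
    runs = [asyncio.create_task(run_eval()) for _ in range(num_generations)]
    results = await asyncio.gather(*runs)

    groups = defaultdict(list)
    for run in results:
        for example_id, trajectory in run:
            groups[example_id].append(trajectory)

    # list[list[Trajectory]]: each inner list holds every generation
    # produced for a single example.
    return list(groups.values())
```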
The ARTTrainerAdapter Class#
The ARTTrainerAdapter converts NeMo Agent toolkit trajectories to ART's format and manages training:
During `initialize()`, the adapter:

1. Creates the ART Backend client
2. Creates a TrainableModel with the configuration
3. Registers the model with the backend
4. Verifies backend health

`submit(trajectories)` then:

1. Validates episode ordering (the first message must be from the user or system, and there must be no consecutive assistant messages)
2. Converts trajectories to ART TrajectoryGroup format (EpisodeItem → dict or Choice, including logprobs in Choice objects)
3. Submits the job via `model.train()` (async)
4. Returns a TrainingJobRef for tracking

`wait_until_complete()` polls the task status until it is done and returns the final TrainingJobStatus.
Key Implementation Details:
ART Client Management: The adapter maintains an `art.Backend` client and an `art.TrainableModel` instance that persist across epochs.

Trajectory Conversion: NeMo Agent toolkit `Trajectory` objects are converted to ART's `art.Trajectory` format:

```python
# NeMo Agent toolkit format
EpisodeItem(role=EpisodeItemRole.ASSISTANT, content="...", logprobs=...)

# Converted to ART format
Choice(index=0, logprobs=..., message={"role": "assistant", "content": "..."}, finish_reason="stop")
```
Message Validation: The adapter validates that conversations follow expected patterns (user/system first, no consecutive assistant messages); a minimal sketch of these checks follows this list.
Async Training: Training is submitted as an async task, allowing the trainer to monitor progress without blocking.
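As a rough illustration of the ordering rules described above (a hypothetical helper, not the adapter's actual code), the validation amounts to:

```python
# Hypothetical sketch of the episode-ordering checks described above.
def validate_episode(messages: list[dict]) -> None:
    if not messages:
        raise ValueError("Episode contains no messages")
    if messages[0]["role"] not in ("user", "system"):
        raise ValueError("First message must come from the user or system")
    for prev, curr in zip(messages, messages[1:]):
        if prev["role"] == curr["role"] == "assistant":
            raise ValueError("Consecutive assistant messages are not allowed")
```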
The ARTTrainer Class#
The ARTTrainer orchestrates the complete training loop:
During `initialize()`, the trainer:

1. Generates a unique run ID
2. Initializes the trajectory builder
3. Initializes the trainer adapter
4. Sets up curriculum learning state

`run(num_epochs)` then loops over epochs. For each epoch it:

1. Runs validation if the validation interval is reached: evaluates the validation dataset, records metrics (avg_reward, etc.), and stores them in the validation history
2. Runs `run_epoch()`: starts trajectory collection, finalizes and computes metrics, applies curriculum learning (filtering groups), submits to the trainer adapter, then logs progress and generates plots
3. Waits for training to complete
4. Checks the status and breaks on failure

When all epochs finish, it returns the list of TrainingJobStatus results.
Key Implementation Details:
Curriculum Learning: The trainer implements curriculum learning to progressively include harder examples (a schedule sketch follows this list):
Groups trajectories by average reward
Filters out groups with insufficient variance (no learning signal)
Starts with easiest fraction, expands at intervals
Validation: Optionally runs evaluation on a separate validation dataset to monitor generalization.
Progress Visualization: Generates reward plots (`reward_plot.png`) showing training and validation reward progression.

Metrics Logging: Writes detailed metrics to JSONL files for analysis:

`training_metrics.jsonl`: Per-epoch metrics
`reward_history.json`: Reward progression
`curriculum_state.json`: Curriculum learning state
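The exact scheduling logic lives in the trainer, but conceptually the included fraction of example groups grows over time. A hypothetical sketch of such a schedule, using the curriculum settings shown later under Advanced Configuration, looks like this:

```python
# Hypothetical curriculum schedule: the fraction of example groups included
# at a given epoch, assuming the settings described in the Curriculum
# Learning section (initial_percentile, increment_percentile,
# expansion_interval). Not the trainer's actual implementation.
def curriculum_fraction(
    epoch: int,
    initial_percentile: float = 0.3,
    increment_percentile: float = 0.2,
    expansion_interval: int = 5,
) -> float:
    expansions = epoch // expansion_interval
    return min(1.0, initial_percentile + expansions * increment_percentile)


# With the values above, epoch 0 uses the easiest 30% of groups,
# epoch 5 uses 50%, epoch 10 uses 70%, and so on.
```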
Running Finetuning#
Prerequisites#
ART Backend Server: You need a running ART server with your model loaded. See ART documentation for setup.
LLM with Logprobs: Your LLM must return log probabilities. For vLLM, use the `--enable-log-probs` flag. A quick check is sketched after this list.
Training Dataset: A JSON/JSONL dataset with your training examples.
Reward Function: An evaluator that can score workflow outputs.
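One quick way to confirm the logprobs prerequisite is to send a single request to the inference endpoint and check that log probabilities come back. The snippet below is a minimal sketch that assumes an OpenAI-compatible endpoint at the `base_url` from the basic configuration; the URL and model name are placeholders:

```python
# Minimal sanity check (placeholder URL/model): verify the inference
# endpoint returns log probabilities, which trajectory building requires.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="default")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
    logprobs=True,
)
print("logprobs returned:", response.choices[0].logprobs is not None)
```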
Running Training#
You must have the OpenPipe ART plugin (`nvidia-nat-openpipe-art`) installed and an OpenPipe ART server running and configured to accept training jobs.
```bash
# Basic training
nat finetune --config_file=configs/finetune.yml

# With validation
nat finetune --config_file=configs/finetune.yml \
  --validation_dataset=data/val.json \
  --validation_interval=5

# Override epochs
nat finetune --config_file=configs/finetune.yml \
  -o finetuning.num_epochs 50
```
Monitoring Progress#
During training, check:
Console Output: Shows epoch progress, reward statistics, trajectory counts
Metrics Files: In your `output_dir` (a short snippet for reading these files follows this list):
`training_metrics.jsonl`: Detailed per-epoch metrics
`reward_plot.png`: Visual reward progression
`reward_history.json`: Raw reward data
ART Server Logs: Training progress from the ART side
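For programmatic analysis, the JSONL metrics file can be read line by line. The snippet below is only a sketch; the exact field names in each record (for example `epoch` and `avg_reward`) are assumptions based on the console output and may differ in your version:

```python
# Sketch for inspecting per-epoch metrics; record field names are assumptions.
import json
from pathlib import Path

metrics_path = Path(".tmp/nat/finetuning/output/training_metrics.jsonl")
for line in metrics_path.read_text().splitlines():
    record = json.loads(line)
    print(record.get("epoch"), record.get("avg_reward"))
```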
Example console output:
```text
INFO - Starting epoch 1 for run art_run_a1b2c3d4
INFO - Starting 2 evaluation runs for run_id: art_run_a1b2c3d4
INFO - Built 100 trajectories across 50 examples for run_id: art_run_a1b2c3d4
INFO - Submitted 100 trajectories in 50 groups for training
INFO - Epoch 1 progress logged - Avg Reward: 0.4523, Trajectories: 100
INFO - Training art_run_a1b2c3d4 completed successfully.
INFO - Completed epoch 1/20
```
Advanced Configuration#
Multi-GPU Training#
For larger models, configure tensor parallelism:
```yaml
trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      engine_args:
        tensor_parallel_size: 2  # Use 2 GPUs
        gpu_memory_utilization: 0.85
```
Memory Optimization#
If you encounter OOM errors:
```yaml
trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      init_args:
        max_seq_length: 4096  # Reduce sequence length
      engine_args:
        gpu_memory_utilization: 0.7  # Leave more headroom
```
Curriculum Learning#
Enable curriculum learning to improve training stability:
```yaml
finetuning:
  curriculum_learning:
    enabled: true
    initial_percentile: 0.3    # Start with easiest 30%
    increment_percentile: 0.2  # Add 20% each expansion
    expansion_interval: 5      # Expand every 5 epochs
    min_reward_diff: 0.1       # Filter no-variance groups
    sort_ascending: false      # Easy-to-hard
```
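The `min_reward_diff` setting exists because a group in which every generation earned (nearly) the same reward carries no relative signal for GRPO. A hypothetical sketch of that filter:

```python
# Hypothetical sketch of the "no-variance group" filter controlled by
# min_reward_diff: keep a group only if its rewards differ enough to give
# GRPO a relative ranking signal.
def has_learning_signal(rewards: list[float], min_reward_diff: float = 0.1) -> bool:
    return bool(rewards) and (max(rewards) - min(rewards)) >= min_reward_diff


# Example: [0.2, 0.2, 0.2] is filtered out, while [0.1, 0.6] is kept.
```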
Targeting Specific Functions#
For multi-component workflows, target specific functions:
```yaml
finetuning:
  target_functions:
    - my_agent_function
    - tool_calling_function
  target_model: training_llm  # Only include steps from this model
```
Troubleshooting#
Connection Issues#
"Failed to connect to ART backend"
Verify the server is running: `curl http://localhost:7623/health`
Check the IP and port in your configuration
Verify network connectivity (firewalls, etc.)
Missing Log Probabilities#
"No valid assistant messages with logprobs"
Ensure your LLM provider returns logprobs
For vLLM: verify the `--enable-log-probs` flag
Check your LLM configuration
Out of Memory#
"CUDA out of memory"

Reduce `gpu_memory_utilization`
Reduce `max_seq_length`
Reduce `num_generations` (fewer parallel trajectories)
Increase `tensor_parallel_size` (distribute across GPUs)
No Trajectories Collected#
"No trajectories collected for epoch"

Check that `target_functions` matches your workflow
Verify the workflow produces intermediate steps
Check evaluator is returning rewards
Look for errors in evaluation logs
Training Not Improving#
Rewards not increasing
Increase `num_generations` for better GRPO signal
Try curriculum learning to focus on learnable examples
Adjust learning rate
Verify reward function is well-calibrated
Check for sufficient variance in trajectory groups
Examples#
The examples/finetuning/rl_with_openpipe_art directory contains a complete working example demonstrating:
Custom workflow with intermediate step tracking
Custom reward evaluator with reward shaping
Full configuration for ART integration
Training and evaluation datasets
See the example's README for detailed instructions.
See Also#
Finetuning Concepts - Core concepts and RL fundamentals
Extending the Finetuning Harness - Creating custom components
OpenPipe ART Documentation - Official ART documentation
Custom Evaluators - Creating reward functions