GRPO with OpenPipe ART#

This guide covers the integration between the NVIDIA NeMo Agent toolkit finetuning harness and OpenPipe ART (Agent Reinforcement Trainer), an open-source framework for teaching LLMs through reinforcement learning.

About OpenPipe ART#

OpenPipe ART is designed to improve agent performance and reliability through experience. It provides:

  • GRPO Training: Uses Group Relative Policy Optimization, which compares multiple responses to the same prompt rather than requiring a separate value function (a short sketch of the idea follows this list)

  • Async Client-Server Architecture: Separates inference from training, allowing you to run inference anywhere while training happens on GPU infrastructure

  • Easy Integration: Designed to work with existing LLM applications with minimal code changes

  • Built-in Observability: Integrations with Weights & Biases, Langfuse, and OpenPipe for monitoring and debugging
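
For intuition, here is a minimal Python sketch of the group-relative step (an illustration of the idea, not ART's implementation): each response's reward is standardized against the other responses sampled for the same prompt, so the best-scoring responses receive positive advantages without any learned value function.

```python
# Illustration only: group-relative advantages as used by GRPO-style methods.
def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four sampled responses to one prompt, scored by a reward function:
print(group_relative_advantages([0.9, 0.4, 0.4, 0.1]))
```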

When to Use ART#

ART is well-suited for scenarios where:

  • You want to improve agent reliability on specific tasks

  • You have a way to score agent performance (even if you don't have "correct" answers)

  • You're working with agentic workflows that make decisions or take actions

  • You want to iterate quickly with online training

ART Architecture#

┌─────────────────────────────────────────────────────────────────────────┐
│                           Your Application                              │
│                                                                         │
│   ┌─────────────────────┐                                               │
│   │    Workflow         │ ◄──── Uses model for inference                │
│   └─────────────────────┘                                               │
│            │                                                            │
│            │ Trajectories                                               │
│            ▼                                                            │
│   ┌─────────────────────┐         ┌─────────────────────────────────┐   │
│   │ ARTTrajectoryBuilder│────────►│      ART Backend Server         │   │
│   │  ARTTrainerAdapter  │         │                                 │   │
│   └─────────────────────┘         │  ┌─────────────────────────────┐│   │
│            │                      │  │  vLLM Inference Engine      ││   │
│            │ Training request     │  │  (serves updated weights)   ││   │
│            │                      │  └─────────────────────────────┘│   │
│            └─────────────────────►│  ┌─────────────────────────────┐│   │
│                                   │  │  GRPO Trainer (TorchTune)   ││   │
│                                   │  │  (updates model weights)    ││   │
│                                   │  └─────────────────────────────┘│   │
│                                   └─────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

The ART backend runs on GPU infrastructure and provides:

  • vLLM Inference Engine: Serves the model for inference with log probability support

  • GRPO Trainer: Performs weight updates based on submitted trajectories

NeMo Agent toolkit connects to this backend through the ARTTrainerAdapter, which handles the protocol for submitting trajectories and monitoring training.
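
Under the hood, the adapter drives ART's client API, which accepts scored conversations grouped per prompt. The sketch below follows the pattern in ART's public examples; `model` is assumed to be a registered `art.TrainableModel` and `choice` an OpenAI-style choice carrying logprobs from the inference call. Treat the exact names as subject to change and defer to the ART documentation.

```python
# Hedged sketch of ART's client-side submission, based on ART's public
# examples; not the adapter's actual code.
import art

async def submit_one_group(model, choice, reward: float):
    trajectory = art.Trajectory(
        messages_and_choices=[
            {"role": "user", "content": "..."},  # prompt messages as dicts
            choice,                              # assistant turn with logprobs
        ],
        reward=reward,                           # score from your evaluator
    )
    # One TrajectoryGroup per example: all generations for the same prompt.
    await model.train([art.TrajectoryGroup([trajectory])])
```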

Supported Agent Frameworks#

The following table highlights the current support matrix for using ART with different agent frameworks in the NeMo Agent toolkit:

| Agent Framework        | Support        |
|------------------------|----------------|
| LangChain or LangGraph | ✅ Supported    |
| Google ADK             | ✅ Supported    |
| LlamaIndex             | ✅ Supported    |
| All others             | 🛠️ In Progress |

Installation#

Install the OpenPipe ART plugin package:

pip install nvidia-nat-openpipe-art

This provides:

  • openpipe_art_traj_builder: The trajectory builder implementation

  • openpipe_art_trainer_adapter: The trainer adapter for ART

  • openpipe_art_trainer: The trainer orchestrator

You'll also need to set up an ART backend server. See the ART documentation for server setup instructions.
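
For reference, ART's quickstart registers a trainable model against a local backend roughly as follows. This is an illustrative sketch modeled on ART's documented pattern; verify the names against the current ART documentation before relying on it.

```python
# Rough sketch of a local ART backend, modeled on ART's quickstart pattern.
import asyncio

import art
from art.local import LocalBackend

async def main():
    backend = LocalBackend()
    model = art.TrainableModel(
        name="my_training_run",        # should match trainer_adapters.backend.name
        project="my_project",          # should match trainer_adapters.backend.project
        base_model="Qwen/Qwen2.5-3B-Instruct",
    )
    await model.register(backend)      # attach the model to the backend

asyncio.run(main())
```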

Configuration#

Basic Configuration#

llms:
  training_llm:
    _type: openai
    model_name: Qwen/Qwen2.5-3B-Instruct
    base_url: http://localhost:8000/v1  # ART inference endpoint
    api_key: default
    temperature: 0.4

workflow:
  _type: my_workflow
  llm: training_llm

eval:
  general:
    max_concurrency: 16
    output_dir: .tmp/nat/finetuning/eval
    dataset:
      _type: json
      file_path: data/training_data.json

  evaluators:
    my_reward:
      _type: my_custom_evaluator

trajectory_builders:
  art_builder:
    _type: openpipe_art_traj_builder
    num_generations: 2

trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      ip: "localhost"
      port: 7623
      name: "my_training_run"
      project: "my_project"
      base_model: "Qwen/Qwen2.5-3B-Instruct"
      api_key: "default"
    training:
      learning_rate: 1e-6

trainers:
  art_trainer:
    _type: openpipe_art_trainer

finetuning:
  enabled: true
  trainer: art_trainer
  trajectory_builder: art_builder
  trainer_adapter: art_adapter
  reward_function:
    name: my_reward
  num_epochs: 20
  output_dir: .tmp/nat/finetuning/output

Configuration Reference#

Trajectory Builder Configuration#

trajectory_builders:
  art_builder:
    _type: openpipe_art_traj_builder
    num_generations: 2  # Trajectories per example

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| num_generations | int | 2 | Number of trajectory generations per example. More generations provide better GRPO signal but increase computation time. |

Trainer Adapter Configuration#

trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      ip: "0.0.0.0"
      port: 7623
      name: "training_run_name"
      project: "project_name"
      base_model: "Qwen/Qwen2.5-3B-Instruct"
      api_key: "default"
      delete_old_checkpoints: false
      init_args:
        max_seq_length: 8192
      engine_args:
        gpu_memory_utilization: 0.9
        tensor_parallel_size: 1
    training:
      learning_rate: 1e-6
      beta: 0.0

Backend Configuration

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| ip | str | - | IP address of the ART backend server |
| port | int | - | Port of the ART backend server |
| name | str | "trainer_run" | Name for this training run |
| project | str | "trainer_project" | Project name for organization |
| base_model | str | "Qwen/Qwen2.5-7B-Instruct" | Base model being trained (must match the server) |
| api_key | str | "default" | API key for authentication |
| delete_old_checkpoints | bool | false | Delete old checkpoints before training |

Model Initialization Arguments (init_args)

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| max_seq_length | int | - | Maximum sequence length for the model |

vLLM Engine Arguments (engine_args)

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| gpu_memory_utilization | float | - | Fraction of GPU memory to use (0.0-1.0) |
| tensor_parallel_size | int | - | Number of GPUs for tensor parallelism |

Training Arguments

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| learning_rate | float | 5e-5 | Learning rate for GRPO updates |
| beta | float | 0.0 | KL penalty coefficient |
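
For reference, in the standard GRPO formulation beta weights a KL penalty that keeps the trained policy close to the reference model:

$$
\mathcal{L}(\theta) = -\,\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i\right)\right] + \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
$$

where \(A_i\) is the group-relative advantage of generation \(i\) within its group of \(G\) and \(\rho_i\) is the policy probability ratio. With the default beta of 0.0, the KL term vanishes and no reference-model regularization is applied.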

Trainer Configuration#

trainers:
  art_trainer:
    _type: openpipe_art_trainer

The trainer has no additional configuration options; it uses the shared finetuning configuration.

How It Works#

ARTTrajectoryBuilder#

The ARTTrajectoryBuilder collects training trajectories through the NeMo Agent toolkit evaluation system:

┌─────────────────────────────────────────────────────────────────────────┐
│                     ARTTrajectoryBuilder Flow                           │
│                                                                         │
│  start_run()                                                            │
│      │                                                                  │
│      ▼                                                                  │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │  Launch N parallel evaluation runs (num_generations)               │ │
│  │                                                                    │ │
│  │  Each run:                                                         │ │
│  │    1. Loads the training dataset                                   │ │
│  │    2. Runs the workflow on each example                            │ │
│  │    3. Captures intermediate steps (with logprobs from LLM calls)   │ │
│  │    4. Computes reward using configured evaluator                   │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│      │                                                                  │
│      ▼                                                                  │
│  finalize()                                                             │
│      │                                                                  │
│      ▼                                                                  │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │  Wait for all evaluation runs to complete                          │ │
│  │                                                                    │ │
│  │  For each result:                                                  │ │
│  │    1. Extract reward from evaluator output                         │ │
│  │    2. Filter intermediate steps to target functions                │ │
│  │    3. Parse steps into OpenAI message format                       │ │
│  │    4. Validate assistant messages have logprobs                    │ │
│  │    5. Group trajectories by example ID                             │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│      │                                                                  │
│      ▼                                                                  │
│  Return TrajectoryCollection                                            │
│  (grouped by example for GRPO)                                          │
└─────────────────────────────────────────────────────────────────────────┘

Key Implementation Details:

  1. Parallel Generation: Multiple evaluation runs execute concurrently using asyncio.create_task(). This generates diverse trajectories for the same inputs.

  2. Log Probability Extraction: The builder parses intermediate steps to extract log probabilities from LLM responses. Messages without logprobs are skipped since they can't be used for training.

  3. Target Function Filtering: Only steps from functions listed in finetuning.target_functions are included. This lets you focus training on specific parts of complex workflows.

  4. Grouping for GRPO: Trajectories are organized as list[list[Trajectory]] where each inner list contains all generations for a single example. This structure enables group-relative policy optimization.
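
To make that shape concrete, here is a minimal sketch of the grouping step (illustrative names, not the toolkit's internal code):

```python
# Illustrative sketch: bucket trajectories by example ID so GRPO can compare
# alternative generations of the same input against each other.
from collections import defaultdict
from typing import Any

def group_by_example(trajectories: list[Any]) -> list[list[Any]]:
    """Assumes each trajectory records the example_id it was generated from."""
    groups: dict[str, list[Any]] = defaultdict(list)
    for traj in trajectories:
        groups[traj.example_id].append(traj)
    # list[list[Trajectory]]: one inner list per example,
    # one entry per generation of that example.
    return list(groups.values())
```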

The ARTTrainerAdapter Class#

The ARTTrainerAdapter converts NeMo Agent toolkit trajectories to ART's format and manages training:

┌─────────────────────────────────────────────────────────────────────────┐
│                     ARTTrainerAdapter Flow                              │
│                                                                         │
│  initialize()                                                           │
│      │                                                                  │
│      ▼                                                                  │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │  1. Create ART Backend client                                      │ │
│  │  2. Create TrainableModel with configuration                       │ │
│  │  3. Register model with backend                                    │ │
│  │  4. Verify backend health                                          │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│      │                                                                  │
│      ▼                                                                  │
│  submit(trajectories)                                                   │
│      │                                                                  │
│      ▼                                                                  │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │  1. Validate episode ordering                                      │ │
│  │     - First message: user or system                                │ │
│  │     - No consecutive assistant messages                            │ │
│  │                                                                    │ │
│  │  2. Convert to ART TrajectoryGroup format                          │ │
│  │     - EpisodeItem → dict or Choice                                 │ │
│  │     - Include logprobs in Choice objects                           │ │
│  │                                                                    │ │
│  │  3. Submit via model.train() (async)                               │ │
│  │                                                                    │ │
│  │  4. Return TrainingJobRef for tracking                             │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│      │                                                                  │
│      ▼                                                                  │
│  wait_until_complete()                                                  │
│      │                                                                  │
│      ▼                                                                  │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │  Poll task status until done                                       │ │
│  │  Return final TrainingJobStatus                                    │ │
│  └────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘

Key Implementation Details:

  1. ART Client Management: The adapter maintains an art.Backend client and art.TrainableModel instance that persist across epochs.

  2. Trajectory Conversion: NeMo Agent toolkit Trajectory objects are converted to ART's art.Trajectory format:

    # NeMo Agent toolkit format
    EpisodeItem(role=EpisodeItemRole.ASSISTANT, content="...", logprobs=...)
    
    # Converted to ART format
    Choice(index=0, logprobs=..., message={"role": "assistant", "content": "..."}, finish_reason="stop")
    
  3. Message Validation: The adapter validates that conversations follow expected patterns (user/system first, no consecutive assistant messages). A sketch of these checks follows this list.

  4. Async Training: Training is submitted as an async task, allowing the trainer to monitor progress without blocking.
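
A minimal sketch of the ordering checks from step 3 (illustrative, not the adapter's actual code):

```python
# Illustrative sketch of the episode-ordering rules described above.
def validate_episode(messages: list[dict]) -> None:
    if not messages:
        raise ValueError("empty episode")
    if messages[0]["role"] not in ("user", "system"):
        raise ValueError("episode must start with a user or system message")
    for prev, curr in zip(messages, messages[1:]):
        if prev["role"] == curr["role"] == "assistant":
            raise ValueError("consecutive assistant messages are not allowed")
```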

The ARTTrainer Class#

The ARTTrainer orchestrates the complete training loop:

┌─────────────────────────────────────────────────────────────────────────┐
│                        ARTTrainer Flow                                  │
│                                                                         │
│  initialize()                                                           │
│      │                                                                  │
│      ▼                                                                  │
│  ┌────────────────────────────────────────────────────────────────────┐ │
│  │  1. Generate unique run ID                                         │ │
│  │  2. Initialize trajectory builder                                  │ │
│  │  3. Initialize trainer adapter                                     │ │
│  │  4. Set up curriculum learning state                               │ │
│  └────────────────────────────────────────────────────────────────────┘ │
│      │                                                                  │
│      ▼                                                                  │
│  run(num_epochs)                                                        │
│      │                                                                  │
│  for epoch in range(num_epochs):                                        │
│      │                                                                  │
│      ├─── Validation (if interval reached) ─────────────────────────┐   │
│      │                                                              │   │
│      │    ┌───────────────────────────────────────────────────────┐ │   │
│      │    │  Run evaluation on validation dataset                 │ │   │
│      │    │  Record metrics (avg_reward, etc.)                    │ │   │
│      │    │  Store in validation history                          │ │   │
│      │    └───────────────────────────────────────────────────────┘ │   │
│      │                                                              │   │
│      ◄──────────────────────────────────────────────────────────────┘   │
│      │                                                                  │
│      ├─── run_epoch() ──────────────────────────────────────────────┐   │
│      │                                                              │   │
│      │    ┌───────────────────────────────────────────────────────┐ │   │
│      │    │  1. Start trajectory collection                       │ │   │
│      │    │  2. Finalize and compute metrics                      │ │   │
│      │    │  3. Apply curriculum learning (filter groups)         │ │   │
│      │    │  4. Submit to trainer adapter                         │ │   │
│      │    │  5. Log progress and generate plots                   │ │   │
│      │    └───────────────────────────────────────────────────────┘ │   │
│      │                                                              │   │
│      ◄──────────────────────────────────────────────────────────────┘   │
│      │                                                                  │
│      ├─── Wait for training to complete                                 │
│      │                                                                  │
│      └─── Check status, break on failure                                │
│                                                                         │
│  Return list of TrainingJobStatus                                       │
└─────────────────────────────────────────────────────────────────────────┘

Key Implementation Details:

  1. Curriculum Learning: The trainer implements curriculum learning to progressively include harder examples (see the sketch after this list):

    • Groups trajectories by average reward

    • Filters out groups with insufficient variance (no learning signal)

    • Starts with easiest fraction, expands at intervals

  2. Validation: Optionally runs evaluation on a separate validation dataset to monitor generalization.

  3. Progress Visualization: Generates reward plots (reward_plot.png) showing training and validation reward progression.

  4. Metrics Logging: Writes detailed metrics to JSONL files for analysis:

    • training_metrics.jsonl: Per-epoch metrics

    • reward_history.json: Reward progression

    • curriculum_state.json: Curriculum learning state
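
The curriculum step from point 1 can be sketched as follows, under the simplifying assumption that a higher average reward marks an easier example (the trainer's actual logic may differ):

```python
# Simplified sketch of the curriculum filtering described above.
def select_curriculum(groups: list[list[float]], keep_fraction: float,
                      min_reward_diff: float = 0.1) -> list[list[float]]:
    """groups: per-example lists of rewards, one reward per generation."""
    # Drop groups whose rewards barely differ: identical rewards yield zero
    # group-relative advantage, hence no learning signal.
    learnable = [g for g in groups if max(g) - min(g) >= min_reward_diff]
    # Easiest first (highest mean reward), then keep a growing fraction.
    learnable.sort(key=lambda g: sum(g) / len(g), reverse=True)
    cutoff = max(1, int(len(learnable) * keep_fraction))
    return learnable[:cutoff]
```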

Running Finetuning#

Prerequisites#

  1. ART Backend Server: You need a running ART server with your model loaded. See ART documentation for setup.

  2. LLM with Logprobs: Your LLM must return log probabilities. For vLLM, use the --enable-log-probs flag.

  3. Training Dataset: A JSON/JSONL dataset with your training examples (a minimal example follows this list).

  4. Reward Function: An evaluator that can score workflow outputs.
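
For the training dataset (item 3), the exact fields depend on what your workflow and evaluator expect; a minimal, hypothetical dataset could be written like this:

```python
# Hypothetical training dataset; field names are illustrative and must match
# whatever inputs your workflow and evaluator consume.
import json

examples = [
    {"id": "ex-001", "question": "What is the capital of France?"},
    {"id": "ex-002", "question": "List three prime numbers."},
]
with open("data/training_data.json", "w") as f:
    json.dump(examples, f, indent=2)
```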

Running Training#

You must have the OpenPipe ART plugin installed (nvidia-nat-openpipe-art) and an OpenPipe ART server running and configured to accept training jobs.

# Basic training
nat finetune --config_file=configs/finetune.yml

# With validation
nat finetune --config_file=configs/finetune.yml \
    --validation_dataset=data/val.json \
    --validation_interval=5

# Override epochs
nat finetune --config_file=configs/finetune.yml \
    -o finetuning.num_epochs 50

Monitoring Progress#

During training, check:

  1. Console Output: Shows epoch progress, reward statistics, trajectory counts

  2. Metrics Files: In your output_dir:

    • training_metrics.jsonl: Detailed per-epoch metrics

    • reward_plot.png: Visual reward progression

    • reward_history.json: Raw reward data

  3. ART Server Logs: Training progress from the ART side

Example console output:

INFO - Starting epoch 1 for run art_run_a1b2c3d4
INFO - Starting 2 evaluation runs for run_id: art_run_a1b2c3d4
INFO - Built 100 trajectories across 50 examples for run_id: art_run_a1b2c3d4
INFO - Submitted 100 trajectories in 50 groups for training
INFO - Epoch 1 progress logged - Avg Reward: 0.4523, Trajectories: 100
INFO - Training art_run_a1b2c3d4 completed successfully.
INFO - Completed epoch 1/20

Advanced Configuration#

Multi-GPU Training#

For larger models, configure tensor parallelism:

trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      engine_args:
        tensor_parallel_size: 2  # Use 2 GPUs
        gpu_memory_utilization: 0.85

Memory Optimization#

If you encounter OOM errors:

trainer_adapters:
  art_adapter:
    _type: openpipe_art_trainer_adapter
    backend:
      init_args:
        max_seq_length: 4096  # Reduce sequence length
      engine_args:
        gpu_memory_utilization: 0.7  # Leave more headroom

Curriculum Learning#

Enable curriculum learning to improve training stability:

finetuning:
  curriculum_learning:
    enabled: true
    initial_percentile: 0.3      # Start with easiest 30%
    increment_percentile: 0.2     # Add 20% each expansion
    expansion_interval: 5         # Expand every 5 epochs
    min_reward_diff: 0.1         # Filter no-variance groups
    sort_ascending: false         # Easy-to-hard

Targeting Specific Functions#

For multi-component workflows, target specific functions:

finetuning:
  target_functions:
    - my_agent_function
    - tool_calling_function
  target_model: training_llm  # Only include steps from this model

Troubleshooting#

Connection Issues#

"Failed to connect to ART backend"

  1. Verify the server is running:

    curl http://localhost:7623/health
    
  2. Check IP and port in configuration

  3. Verify network connectivity (firewalls, etc.)

Missing Log Probabilities#

"No valid assistant messages with logprobs"

  1. Ensure your LLM provider returns logprobs

  2. For vLLM: verify --enable-log-probs flag

  3. Check your LLM configuration

Out of Memory#

"CUDA out of memory"

  1. Reduce gpu_memory_utilization

  2. Reduce max_seq_length

  3. Reduce num_generations (fewer parallel trajectories)

  4. Increase tensor_parallel_size (distribute across GPUs)

No Trajectories Collected#

"No trajectories collected for epoch"

  1. Check target_functions matches your workflow

  2. Verify workflow produces intermediate steps

  3. Check evaluator is returning rewards

  4. Look for errors in evaluation logs

Training Not Improving#

Rewards not increasing

  1. Increase num_generations for better GRPO signal

  2. Try curriculum learning to focus on learnable examples

  3. Adjust learning rate

  4. Verify reward function is well-calibrated

  5. Check for sufficient variance in trajectory groups

Examples#

The examples/finetuning/rl_with_openpipe_art directory contains a complete working example demonstrating:

  • Custom workflow with intermediate step tracking

  • Custom reward evaluator with reward shaping

  • Full configuration for ART integration

  • Training and evaluation datasets

See the example's README for detailed instructions.

See Also#