Offline Training (SFT/DPO)


This tutorial is experimental and may contain bugs. Proceed with caution.

Goal: Transform your generated rollouts into high-quality training data for supervised fine-tuning (SFT) and direct preference optimization (DPO).

Time: ~20 minutes

In this tutorial, you will:

  1. Filter and process collected rollouts
  2. Generate SFT and DPO training datasets
  3. Train models using offline training pipelines

Prerequisites


Why Offline Training?

Offline training uses pre-collected rollouts to improve AI models without real-time exploration. This approach is ideal when:

  • You have a working agent that demonstrates good behaviors
  • You want reproducible results - same data, consistent training outcomes
  • You need cost-effective training - no expensive exploration during training
  • You want to capture expert demonstrations - preserve successful patterns
  • You have limited compute - more efficient than reinforcement learning

The offline training pipeline: Generate rollouts → Filter and process → Train models → Deploy improved agents

Training Data Types

Supervised Fine-Tuning (SFT) Data

Purpose: Train models to follow successful agent interaction patterns

Data structure: Input-output pairs showing complete agent conversations

{
  "messages": [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}]},
    {"role": "tool", "content": "Temperature: 22°C, sunny"},
    {"role": "assistant", "content": "The weather in Paris is 22°C and sunny."}
  ],
  "quality_score": 0.95
}

Direct Preference Optimization (DPO) Data

Purpose: Train models to prefer better responses over worse ones

Data structure: Preference pairs with chosen vs rejected responses

{
  "prompt": [{"role": "user", "content": "Solve this math problem: 2x + 5 = 13"}],
  "chosen": [
    {"role": "assistant", "content": "I'll solve for x step by step:\n2x + 5 = 13\n2x = 13 - 5\n2x = 8\nx = 4"}
  ],
  "rejected": [
    {"role": "assistant", "content": "The answer is x = 3"}
  ],
  "quality_difference": 0.7
}

Data Preparation Overview

The offline training pipeline follows this logical flow:

  1. Collect rollouts
    • SFT data: Use consistent generation (low temperature, single rollout per task)
    • DPO data: Use diverse generation (higher temperature, 2 rollouts per task for comparison)
  2. Filter for quality - Remove poor rollouts before processing
  3. Format for training - Convert to SFT or DPO format based on your goals

Step 1: Quality Filtering and Curation

Always filter your rollouts first before formatting them for training. Here are example approaches you can customize for your needs:

Automatic Filtering

import json
from typing import Dict

def filter_rollouts(input_file: str, output_file: str, filters: Dict):
    """Apply automatic quality filters to rollouts."""
    with open(input_file) as f, open(output_file, 'w') as out:
        kept = 0
        total = 0

        for line in f:
            rollout = json.loads(line)
            total += 1

            # Apply filters
            if (rollout.get('reward', 0) >= filters.get('min_reward', 0.5) and
                    rollout.get('success', False) and
                    len(rollout.get('output', [])) >= filters.get('min_turns', 2) and
                    len(rollout.get('output', [])) <= filters.get('max_turns', 20)):

                out.write(line)
                kept += 1

    print(f"Kept {kept}/{total} rollouts ({kept / max(total, 1) * 100:.1f}%)")

# Apply filters first
filter_rollouts('raw_rollouts.jsonl', 'filtered_rollouts.jsonl', {
    'min_reward': 0.7,
    'min_turns': 3,
    'max_turns': 15
})

Manual Curation (Optional)

For critical applications, sample and manually review:

import json
import random

def sample_for_review(input_file: str, sample_size: int = 50):
    """Sample rollouts for manual review."""
    with open(input_file) as f:
        rollouts = [json.loads(line) for line in f]

    # Stratified sampling by reward
    low_reward = [r for r in rollouts if r.get('reward', 0) < 0.5]
    mid_reward = [r for r in rollouts if 0.5 <= r.get('reward', 0) < 0.8]
    high_reward = [r for r in rollouts if r.get('reward', 0) >= 0.8]

    # Oversample the mid and high buckets (roughly 1:2:2, i.e. 10/20/20 at the default sample_size of 50)
    sample = (random.sample(low_reward, min(sample_size // 5, len(low_reward))) +
              random.sample(mid_reward, min(2 * sample_size // 5, len(mid_reward))) +
              random.sample(high_reward, min(2 * sample_size // 5, len(high_reward))))

    with open('manual_review_sample.jsonl', 'w') as out:
        for rollout in sample:
            out.write(json.dumps(rollout) + '\n')

sample_for_review('filtered_rollouts.jsonl')

These are example filtering approaches. Customize the criteria, thresholds, and sampling strategies based on your specific domain and quality requirements.

Step 2: Format for Training

Once your rollouts have been filtered for quality, format them for your chosen training method:

SFT Data Processing

Transform filtered rollouts into conversation format:

import json

def process_sft_data(filtered_rollout_file: str, output_file: str):
    """Convert filtered rollouts to SFT training format."""
    with open(filtered_rollout_file) as f, open(output_file, 'w') as out:
        for line in f:
            rollout = json.loads(line)
            sft_example = {
                "messages": rollout['output'],
                "reward": rollout['reward'],
                "task_type": rollout.get('metadata', {}).get('task_type', 'general')
            }
            out.write(json.dumps(sft_example) + '\n')

# Process filtered rollouts (no additional filtering needed)
process_sft_data('filtered_rollouts.jsonl', 'sft_data.jsonl')

DPO Data Processing

Create preference pairs from filtered rollouts (requires 2 rollouts per task):

import json

def create_dpo_pairs(filtered_rollout_file: str, output_file: str):
    """Create preference pairs from pairs of filtered rollouts."""

    # Group rollouts by task
    task_groups = {}
    with open(filtered_rollout_file) as f:
        for line in f:
            rollout = json.loads(line)
            task_id = rollout.get('task_id') or hash(json.dumps(rollout['responses_create_params']['input']))

            if task_id not in task_groups:
                task_groups[task_id] = []
            task_groups[task_id].append(rollout)

    # Create preference pairs from pairs of rollouts
    pairs = []
    for task_rollouts in task_groups.values():
        if len(task_rollouts) == 2:  # DPO works with pairs
            rollout_1, rollout_2 = task_rollouts

            # Determine which is better based on reward
            if rollout_1['reward'] > rollout_2['reward']:
                chosen, rejected = rollout_1, rollout_2
            else:
                chosen, rejected = rollout_2, rollout_1

            # Only create a pair if there's a meaningful difference
            quality_diff = chosen['reward'] - rejected['reward']
            if quality_diff >= 0.1:  # Minimum difference threshold
                pairs.append({
                    "prompt": chosen['responses_create_params']['input'],
                    "chosen": chosen['output'],
                    "rejected": rejected['output'],
                    "quality_difference": quality_diff
                })

    # Save preference pairs
    with open(output_file, 'w') as out:
        for pair in pairs:
            out.write(json.dumps(pair) + '\n')

    print(f"Created {len(pairs)} preference pairs")

# Create DPO pairs from filtered rollouts
create_dpo_pairs('filtered_rollouts.jsonl', 'dpo_pairs.jsonl')

Training Integration

After you have your processed data (sft_data.jsonl or dpo_pairs.jsonl), you can use any post-training framework for SFT or DPO:

Standard Data Formats

SFT data follows the conversation format used by most training libraries:

1{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

DPO data follows the preference pair format:

1{"prompt": ["..."], "chosen": ["..."], "rejected": ["..."]}

Validation and Evaluation

Pre-Training Validation

Before training, validate your data quality by checking the points below (a short validation script is sketched after the list):

  • Dataset size: Sufficient examples for training objectives
  • Reward distribution: Reasonable range and average quality scores
  • Length distribution: Appropriate conversation lengths
  • Task diversity: Balanced representation across different task types
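
A quick way to run these checks is a short script over the processed file. This sketch assumes the sft_data.jsonl produced in Step 2, with its messages, reward, and task_type fields:

import json
from collections import Counter

def validate_dataset(path: str):
    """Print basic statistics for a processed SFT dataset."""
    rewards, lengths, task_types = [], [], Counter()
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            rewards.append(example.get('reward', 0))
            lengths.append(len(example.get('messages', [])))
            task_types[example.get('task_type', 'general')] += 1

    n = len(rewards)
    print(f"Dataset size: {n} examples")
    if n:
        print(f"Reward: min={min(rewards):.2f} mean={sum(rewards)/n:.2f} max={max(rewards):.2f}")
        print(f"Turns per conversation: min={min(lengths)} mean={sum(lengths)/n:.1f} max={max(lengths)}")
    print(f"Task types: {dict(task_types)}")

validate_dataset('sft_data.jsonl')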

Post-Training Evaluation

Test your improved model by generating new rollouts on held-out evaluation tasks:

# Generate rollouts with improved model
ng_collect_rollouts +agent_name=improved_agent \
  +input_jsonl_fpath=evaluation_tasks.jsonl \
  +output_jsonl_fpath=post_training_rollouts.jsonl

Compare key metrics like average reward, success rate, and task-specific performance against your baseline to measure improvement.
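
A minimal comparison sketch follows (the baseline_rollouts.jsonl filename is a placeholder for whichever pre-training rollout file you kept; both files are assumed to carry the reward and success fields used earlier):

import json

def summarize(path: str):
    """Return (average reward, success rate) for a rollout JSONL file."""
    rewards, successes = [], []
    with open(path) as f:
        for line in f:
            rollout = json.loads(line)
            rewards.append(rollout.get('reward', 0))
            successes.append(bool(rollout.get('success', False)))
    n = max(len(rewards), 1)
    return sum(rewards) / n, sum(successes) / n

base_reward, base_success = summarize('baseline_rollouts.jsonl')  # placeholder: your pre-training rollouts
new_reward, new_success = summarize('post_training_rollouts.jsonl')
print(f"Average reward: {base_reward:.3f} -> {new_reward:.3f}")
print(f"Success rate:   {base_success:.1%} -> {new_success:.1%}")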

Best Practices

1. Data Quality Over Quantity

# Prefer high-quality filtered data over large noisy datasets
filter_criteria = {
    'min_reward': 0.8,  # High threshold for SFT
    'min_success_rate': 0.9,
    'require_tool_usage': True  # Domain-specific requirements
}
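
Note that require_tool_usage is a domain-specific criterion that the filter_rollouts example above does not implement. A hedged sketch of such a check, assuming tool calls appear as tool_calls entries on assistant messages as in the SFT example earlier:

def uses_tools(rollout: dict) -> bool:
    """Return True if any assistant message in the rollout issued a tool call."""
    return any(
        message.get('role') == 'assistant' and message.get('tool_calls')
        for message in rollout.get('output', [])
    )

You could add a uses_tools(rollout) condition inside filter_rollouts whenever require_tool_usage is set.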

2. Balanced Datasets

import json

# Ensure diverse task representation
def balance_dataset(input_file: str, output_file: str, max_per_category: int = 100):
    task_counts = {}
    balanced_data = []

    with open(input_file) as f:
        for line in f:
            data = json.loads(line)
            task_type = data.get('metadata', {}).get('task_type', 'general')

            if task_counts.get(task_type, 0) < max_per_category:
                balanced_data.append(data)
                task_counts[task_type] = task_counts.get(task_type, 0) + 1

    with open(output_file, 'w') as out:
        for data in balanced_data:
            out.write(json.dumps(data) + '\n')

3. Iterative Improvement

Iteration cycle:

  1. Generate rollouts with the current agent
  2. Filter and prepare training data
  3. Train an improved model
  4. Deploy and evaluate
  5. Use the improved agent to generate better rollouts

4. Version Control

# Track data versions
mkdir -p training/v1.0/
mv sft_data.jsonl training/v1.0/
mv dpo_pairs.jsonl training/v1.0/

# Track model versions
mkdir -p models/agent_v1.0/
cp -r ./results/* models/agent_v1.0/

Troubleshooting

Problem: Poor Training Data Quality

Symptoms: Low average rewards, inconsistent behaviors

Solutions:

  • Increase min_reward threshold for filtering
  • Generate rollouts with lower temperature (more consistent)
  • Add manual curation step
  • Improve base agent before data collection

Problem: Insufficient Data Diversity

Symptoms: Model overfits to a limited set of patterns

Solutions:

  • Generate rollouts with higher temperature
  • Use more diverse input tasks
  • Collect data from multiple agent configurations
  • Balance dataset across task types

Problem: Training Instability

Symptoms: Loss doesn't converge, model performance degrades

Solutions:

  • Check data format compatibility with training framework
  • Reduce learning rate
  • Add regularization
  • Filter out extremely long or short conversations

What You’ve Learned

You now know how to transform rollouts into training data:

  • Data preparation strategies for SFT and DPO
  • Quality filtering and curation techniques
  • Evaluation methods to measure improvement
  • Best practices for sustainable offline training workflows

Next Steps