Offline Training (SFT/DPO)


This tutorial is experimental and may contain bugs. Proceed with caution.

Goal: Transform your generated rollouts into high-quality training data for supervised fine-tuning (SFT) and direct preference optimization (DPO).

Time: ~20 minutes

In this tutorial, you will:

  1. Filter and process collected rollouts
  2. Generate SFT and DPO training datasets
  3. Train models using offline training pipelines

Prerequisites


Why Offline Training?

Offline training uses pre-collected rollouts to improve AI models without real-time exploration. This approach is ideal when:

  • You have a working agent that demonstrates good behaviors
  • You want reproducible results - same data, consistent training outcomes
  • You need cost-effective training - no expensive exploration during training
  • You want to capture expert demonstrations - preserve successful patterns
  • You have limited compute - more efficient than reinforcement learning

The offline training pipeline: Generate rollouts → Filter and process → Train models → Deploy improved agents

Training Data Types

Supervised Fine-Tuning (SFT) Data

Purpose: Train models to follow successful agent interaction patterns

Data structure: Input-output pairs showing complete agent conversations

{
  "messages": [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}]},
    {"role": "tool", "content": "Temperature: 22°C, sunny"},
    {"role": "assistant", "content": "The weather in Paris is 22°C and sunny."}
  ],
  "quality_score": 0.95
}

Direct Preference Optimization (DPO) Data

Purpose: Train models to prefer better responses over worse ones

Data structure: Preference pairs with chosen vs rejected responses

{
  "prompt": [{"role": "user", "content": "Solve this math problem: 2x + 5 = 13"}],
  "chosen": [
    {"role": "assistant", "content": "I'll solve for x step by step:\n2x + 5 = 13\n2x = 13 - 5\n2x = 8\nx = 4"}
  ],
  "rejected": [
    {"role": "assistant", "content": "The answer is x = 3"}
  ],
  "quality_difference": 0.7
}

Data Preparation Overview

The offline training pipeline follows this logical flow:

  1. Collect rollouts
    • SFT data: Use consistent generation (low temperature, single rollout per task)
    • DPO data: Use diverse generation (higher temperature, 2 rollouts per task for comparison)
  2. Filter for quality - Remove poor rollouts before processing
  3. Format for training - Convert to SFT or DPO format based on your goals

Step 1: Quality Filtering and Curation

Always filter your rollouts first before formatting them for training. Here are example approaches you can customize for your needs:

Automatic Filtering

import json
from typing import Dict

def filter_rollouts(input_file: str, output_file: str, filters: Dict):
    """Apply automatic quality filters to rollouts."""
    with open(input_file) as f, open(output_file, 'w') as out:
        kept = 0
        total = 0

        for line in f:
            rollout = json.loads(line)
            total += 1

            # Apply filters
            if (rollout.get('reward', 0) >= filters.get('min_reward', 0.5) and
                    rollout.get('success', False) and
                    len(rollout.get('output', [])) >= filters.get('min_turns', 2) and
                    len(rollout.get('output', [])) <= filters.get('max_turns', 20)):

                out.write(line)
                kept += 1

    print(f"Kept {kept}/{total} rollouts ({kept / max(total, 1) * 100:.1f}%)")

# Apply filters first
filter_rollouts('raw_rollouts.jsonl', 'filtered_rollouts.jsonl', {
    'min_reward': 0.7,
    'min_turns': 3,
    'max_turns': 15
})

Manual Curation (Optional)

For critical applications, sample and manually review:

import json
import random

def sample_for_review(input_file: str, sample_size: int = 50):
    """Sample rollouts for manual review."""
    with open(input_file) as f:
        rollouts = [json.loads(line) for line in f]

    # Stratified sampling by reward
    low_reward = [r for r in rollouts if r.get('reward', 0) < 0.5]
    mid_reward = [r for r in rollouts if 0.5 <= r.get('reward', 0) < 0.8]
    high_reward = [r for r in rollouts if r.get('reward', 0) >= 0.8]

    # Oversample the mid and high buckets (roughly 1:2:2, i.e. 10/20/20 at the default sample_size of 50)
    sample = (random.sample(low_reward, min(sample_size // 5, len(low_reward))) +
              random.sample(mid_reward, min(2 * sample_size // 5, len(mid_reward))) +
              random.sample(high_reward, min(2 * sample_size // 5, len(high_reward))))

    with open('manual_review_sample.jsonl', 'w') as out:
        for rollout in sample:
            out.write(json.dumps(rollout) + '\n')

sample_for_review('filtered_rollouts.jsonl')

These are example filtering approaches. Customize the criteria, thresholds, and sampling strategies based on your specific domain and quality requirements.

Step 2: Format for Training

Once your rollouts have been filtered for quality, format them for your chosen training method:

SFT Data Processing

Transform filtered rollouts into conversation format:

import json

def process_sft_data(filtered_rollout_file: str, output_file: str):
    """Convert filtered rollouts to SFT training format."""
    with open(filtered_rollout_file) as f, open(output_file, 'w') as out:
        for line in f:
            rollout = json.loads(line)
            sft_example = {
                "messages": rollout['output'],
                "reward": rollout['reward'],
                "task_type": rollout.get('metadata', {}).get('task_type', 'general')
            }
            out.write(json.dumps(sft_example) + '\n')

# Process filtered rollouts (no additional filtering needed)
process_sft_data('filtered_rollouts.jsonl', 'sft_data.jsonl')

DPO Data Processing

Create preference pairs from filtered rollouts (requires 2 rollouts per task):

import json

def create_dpo_pairs(filtered_rollout_file: str, output_file: str):
    """Create preference pairs from pairs of filtered rollouts."""

    # Group rollouts by task
    task_groups = {}
    with open(filtered_rollout_file) as f:
        for line in f:
            rollout = json.loads(line)
            task_id = rollout.get('task_id') or hash(json.dumps(rollout['responses_create_params']['input']))

            if task_id not in task_groups:
                task_groups[task_id] = []
            task_groups[task_id].append(rollout)

    # Create preference pairs from pairs of rollouts
    pairs = []
    for task_rollouts in task_groups.values():
        if len(task_rollouts) == 2:  # DPO works with pairs
            rollout_1, rollout_2 = task_rollouts

            # Determine which is better based on reward
            if rollout_1['reward'] > rollout_2['reward']:
                chosen, rejected = rollout_1, rollout_2
            else:
                chosen, rejected = rollout_2, rollout_1

            # Only create a pair if there's a meaningful difference
            quality_diff = chosen['reward'] - rejected['reward']
            if quality_diff >= 0.1:  # Minimum difference threshold
                pairs.append({
                    "prompt": chosen['responses_create_params']['input'],
                    "chosen": chosen['output'],
                    "rejected": rejected['output'],
                    "quality_difference": quality_diff
                })

    # Save preference pairs
    with open(output_file, 'w') as out:
        for pair in pairs:
            out.write(json.dumps(pair) + '\n')

    print(f"Created {len(pairs)} preference pairs")

# Create DPO pairs from filtered rollouts
create_dpo_pairs('filtered_rollouts.jsonl', 'dpo_pairs.jsonl')

Training Integration

After you have your processed data (sft_data.jsonl or dpo_pairs.jsonl), you can use any post-training framework for SFT or DPO:

Standard Data Formats

SFT data follows the conversation format used by most training libraries:

1{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

DPO data follows the preference pair format:

1{"prompt": ["..."], "chosen": ["..."], "rejected": ["..."]}

Validation and Evaluation

Pre-Training Validation

Before training, validate your data quality by checking the points below (a short validation script is sketched after the list):

  • Dataset size: Sufficient examples for training objectives
  • Reward distribution: Reasonable range and average quality scores
  • Length distribution: Appropriate conversation lengths
  • Task diversity: Balanced representation across different task types
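
A quick way to run these checks is a short script over the processed file. This sketch assumes the sft_data.jsonl produced in Step 2, with its messages, reward, and task_type fields:

import json
from collections import Counter

def validate_dataset(path: str):
    """Print basic statistics for a processed SFT dataset."""
    rewards, lengths, task_types = [], [], Counter()
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            rewards.append(example.get('reward', 0))
            lengths.append(len(example.get('messages', [])))
            task_types[example.get('task_type', 'general')] += 1

    n = len(rewards)
    print(f"Dataset size: {n} examples")
    if n:
        print(f"Reward: min={min(rewards):.2f} mean={sum(rewards)/n:.2f} max={max(rewards):.2f}")
        print(f"Turns per conversation: min={min(lengths)} mean={sum(lengths)/n:.1f} max={max(lengths)}")
    print(f"Task types: {dict(task_types)}")

validate_dataset('sft_data.jsonl')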

Post-Training Evaluation

Test your improved model by generating new rollouts on held-out evaluation tasks:

# Generate rollouts with improved model
ng_collect_rollouts +agent_name=improved_agent \
  +input_jsonl_fpath=evaluation_tasks.jsonl \
  +output_jsonl_fpath=post_training_rollouts.jsonl

Compare key metrics like average reward, success rate, and task-specific performance against your baseline to measure improvement.
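
A minimal comparison sketch follows (the baseline_rollouts.jsonl filename is a placeholder for whichever pre-training rollout file you kept; both files are assumed to carry the reward and success fields used earlier):

import json

def summarize(path: str):
    """Return (average reward, success rate) for a rollout JSONL file."""
    rewards, successes = [], []
    with open(path) as f:
        for line in f:
            rollout = json.loads(line)
            rewards.append(rollout.get('reward', 0))
            successes.append(bool(rollout.get('success', False)))
    n = max(len(rewards), 1)
    return sum(rewards) / n, sum(successes) / n

base_reward, base_success = summarize('baseline_rollouts.jsonl')  # placeholder: your pre-training rollouts
new_reward, new_success = summarize('post_training_rollouts.jsonl')
print(f"Average reward: {base_reward:.3f} -> {new_reward:.3f}")
print(f"Success rate:   {base_success:.1%} -> {new_success:.1%}")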

Best Practices

1. Data Quality Over Quantity

# Prefer high-quality filtered data over large noisy datasets
filter_criteria = {
    'min_reward': 0.8,  # High threshold for SFT
    'min_success_rate': 0.9,
    'require_tool_usage': True  # Domain-specific requirements
}
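
Note that require_tool_usage is a domain-specific criterion that the filter_rollouts example above does not implement. A hedged sketch of such a check, assuming tool calls appear as tool_calls entries on assistant messages as in the SFT example earlier:

def uses_tools(rollout: dict) -> bool:
    """Return True if any assistant message in the rollout issued a tool call."""
    return any(
        message.get('role') == 'assistant' and message.get('tool_calls')
        for message in rollout.get('output', [])
    )

You could add a uses_tools(rollout) condition inside filter_rollouts whenever require_tool_usage is set.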

2. Balanced Datasets

import json

# Ensure diverse task representation
def balance_dataset(input_file: str, output_file: str, max_per_category: int = 100):
    task_counts = {}
    balanced_data = []

    with open(input_file) as f:
        for line in f:
            data = json.loads(line)
            task_type = data.get('metadata', {}).get('task_type', 'general')

            if task_counts.get(task_type, 0) < max_per_category:
                balanced_data.append(data)
                task_counts[task_type] = task_counts.get(task_type, 0) + 1

    with open(output_file, 'w') as out:
        for data in balanced_data:
            out.write(json.dumps(data) + '\n')

3. Iterative Improvement

Iteration cycle:

  1. Generate rollouts with the current agent
  2. Filter and prepare training data
  3. Train an improved model
  4. Deploy and evaluate
  5. Use the improved agent to generate better rollouts

4. Version Control

# Track data versions
mkdir -p training/v1.0/
mv sft_data.jsonl training/v1.0/
mv dpo_pairs.jsonl training/v1.0/

# Track model versions
mkdir -p models/agent_v1.0/
cp -r ./results/* models/agent_v1.0/

Troubleshooting

Problem: Poor Training Data Quality

Symptoms: Low average rewards, inconsistent behaviors

Solutions:

  • Increase min_reward threshold for filtering
  • Generate rollouts with lower temperature (more consistent)
  • Add manual curation step
  • Improve base agent before data collection

Problem: Insufficient Data Diversity

Symptoms: Model overfits to a limited set of patterns

Solutions:

  • Generate rollouts with higher temperature
  • Use more diverse input tasks
  • Collect data from multiple agent configurations
  • Balance dataset across task types

Problem: Training Instability

Symptoms: Loss doesn't converge, model performance degrades

Solutions:

  • Check data format compatibility with training framework
  • Reduce learning rate
  • Add regularization
  • Filter out extremely long or short conversations

What You’ve Learned

You now know how to transform rollouts into training data:

  • Data preparation strategies for SFT and DPO
  • Quality filtering and curation techniques
  • Evaluation methods to measure improvement
  • Best practices for sustainable offline training workflows

Next Steps