Offline Training with Rollouts (SFT/DPO)#

Goal: Transform your generated rollouts into high-quality training data for supervised fine-tuning (SFT) and direct preference optimization (DPO).

Why Offline Training?#

Offline training uses pre-collected rollouts to improve AI models without real-time exploration. This approach is ideal when:

  • You have a working agent that demonstrates good behaviors

  • You want reproducible results - same data, consistent training outcomes

  • You need cost-effective training - no expensive exploration during training

  • You want to capture expert demonstrations - preserve successful patterns

  • You have limited compute - training on fixed data is typically cheaper than online reinforcement learning

The offline training pipeline: Generate rollouts → Filter and process → Train models → Deploy improved agents

Training Data Types#

Supervised Fine-Tuning (SFT) Data#

Purpose: Train models to follow successful agent interaction patterns

Data structure: Input-output pairs showing complete agent conversations

{
  "messages": [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}]},
    {"role": "tool", "content": "Temperature: 22°C, sunny"},
    {"role": "assistant", "content": "The weather in Paris is 22°C and sunny."}
  ],
  "quality_score": 0.95
}

Direct Preference Optimization (DPO) Data#

Purpose: Train models to prefer better responses over worse ones

Data structure: Preference pairs with chosen vs rejected responses

{
  "prompt": [{"role": "user", "content": "Solve this math problem: 2x + 5 = 13"}],
  "chosen": [
    {"role": "assistant", "content": "I'll solve for x step by step:\n2x + 5 = 13\n2x = 13 - 5\n2x = 8\nx = 4"}
  ],
  "rejected": [
    {"role": "assistant", "content": "The answer is x = 3"}
  ],
  "quality_difference": 0.7
}

Data Preparation Overview#

The offline training pipeline follows this logical flow:

  1. Collect rollouts using strategies from [Tutorial 5]

  • SFT data: Use consistent generation (low temperature, single rollout per task)

  • DPO data: Use diverse generation (higher temperature, two rollouts per task for comparison; see the sketch after this list)

  2. Filter for quality - Remove poor rollouts before processing

  3. Format for training - Convert to SFT or DPO format based on your goals
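
If your collection setup produces one rollout per input line, a simple way to get the two rollouts per task that DPO pairing needs is to duplicate each task before collection. This is a minimal sketch under that assumption; the file names are illustrative:

import json

def duplicate_tasks_for_dpo(input_file: str, output_file: str, copies: int = 2):
    """Write each task `copies` times so the collector samples it more than once."""
    with open(input_file) as f, open(output_file, 'w') as out:
        for line in f:
            task = json.loads(line)  # also validates that each line is well-formed JSON
            for _ in range(copies):
                out.write(json.dumps(task) + '\n')

# Hypothetical file names for illustration
duplicate_tasks_for_dpo('tasks.jsonl', 'tasks_dpo.jsonl')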

Step 1: Quality Filtering and Curation#

Always filter your rollouts before formatting them for training. Here are example approaches you can customize for your needs:

Automatic Filtering#

import json
from typing import Dict

def filter_rollouts(input_file: str, output_file: str, filters: Dict):
    """Apply automatic quality filters to rollouts."""
    with open(input_file) as f, open(output_file, 'w') as out:
        kept = 0
        total = 0

        for line in f:
            rollout = json.loads(line)
            total += 1

            # Keep only successful rollouts with sufficient reward and a
            # reasonable number of conversation turns
            if (rollout.get('reward', 0) >= filters.get('min_reward', 0.5) and
                rollout.get('success', False) and
                len(rollout.get('output', [])) >= filters.get('min_turns', 2) and
                len(rollout.get('output', [])) <= filters.get('max_turns', 20)):

                out.write(line)
                kept += 1

        print(f"Kept {kept}/{total} rollouts ({kept / max(total, 1) * 100:.1f}%)")

# Apply filters first
filter_rollouts('raw_rollouts.jsonl', 'filtered_rollouts.jsonl', {
    'min_reward': 0.7,
    'min_turns': 3,
    'max_turns': 15
})

Manual Curation (Optional)#

For critical applications, sample and manually review:

import json
import random

def sample_for_review(input_file: str, sample_size: int = 50):
    """Sample rollouts for manual review, stratified by reward."""
    with open(input_file) as f:
        rollouts = [json.loads(line) for line in f]

    # Stratified sampling by reward
    low_reward = [r for r in rollouts if r.get('reward', 0) < 0.5]
    mid_reward = [r for r in rollouts if 0.5 <= r.get('reward', 0) < 0.8]
    high_reward = [r for r in rollouts if r.get('reward', 0) >= 0.8]

    # Split the requested sample size roughly 1:2:2 across the strata
    n_low = sample_size // 5
    n_mid = n_high = 2 * sample_size // 5

    sample = (random.sample(low_reward, min(n_low, len(low_reward))) +
              random.sample(mid_reward, min(n_mid, len(mid_reward))) +
              random.sample(high_reward, min(n_high, len(high_reward))))

    with open('manual_review_sample.jsonl', 'w') as out:
        for rollout in sample:
            out.write(json.dumps(rollout) + '\n')

Note: These are example filtering approaches. Customize the criteria, thresholds, and sampling strategies based on your specific domain and quality requirements.
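
If you do review a sample by hand, you will also want to fold the decisions back into your dataset. A minimal sketch, assuming reviewers add a hypothetical keep boolean field to each line of the review file:

import json

def apply_review_decisions(review_file: str, output_file: str):
    """Keep only rollouts a reviewer marked with "keep": true (hypothetical field)."""
    with open(review_file) as f, open(output_file, 'w') as out:
        for line in f:
            rollout = json.loads(line)
            if rollout.get('keep', False):
                # Drop the review-only field before saving for training
                rollout.pop('keep', None)
                out.write(json.dumps(rollout) + '\n')

apply_review_decisions('manual_review_sample.jsonl', 'reviewed_rollouts.jsonl')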

Step 2: Format for Training#

Once you have filtered, high-quality rollouts, format them for your chosen training method:

SFT Data Processing#

Transform filtered rollouts into conversation format:

import json
from typing import List, Dict

def process_sft_data(filtered_rollout_file: str, output_file: str):
    """Convert filtered rollouts to SFT training format."""
    with open(filtered_rollout_file) as f, open(output_file, 'w') as out:
        for line in f:
            rollout = json.loads(line)
            sft_example = {
                "messages": rollout['output'],
                "reward": rollout['reward'],
                "task_type": rollout.get('metadata', {}).get('task_type', 'general')
            }
            out.write(json.dumps(sft_example) + '\n')

# Process filtered rollouts (no additional filtering needed)
process_sft_data('filtered_rollouts.jsonl', 'sft_data.jsonl')

DPO Data Processing#

Create preference pairs from filtered rollouts (requires 2 rollouts per task):

def create_dpo_pairs(filtered_rollout_file: str, output_file: str):
    """Create preference pairs from pairs of filtered rollouts."""
    
    # Group rollouts by task
    task_groups = {}
    with open(filtered_rollout_file) as f:
        for line in f:
            rollout = json.loads(line)
            task_id = rollout.get('task_id') or hash(json.dumps(rollout['responses_create_params']['input']))
            
            if task_id not in task_groups:
                task_groups[task_id] = []
            task_groups[task_id].append(rollout)
    
    # Create preference pairs from pairs of rollouts
    pairs = []
    for task_rollouts in task_groups.values():
        if len(task_rollouts) == 2:  # DPO works with pairs
            rollout_1, rollout_2 = task_rollouts
            
            # Determine which is better based on reward
            if rollout_1['reward'] > rollout_2['reward']:
                chosen, rejected = rollout_1, rollout_2
            else:
                chosen, rejected = rollout_2, rollout_1
            
            # Only create pair if there's meaningful difference
            quality_diff = chosen['reward'] - rejected['reward']
            if quality_diff >= 0.1:  # Minimum difference threshold
                pairs.append({
                    "prompt": chosen['responses_create_params']['input'],
                    "chosen": chosen['output'],
                    "rejected": rejected['output'],
                    "quality_difference": quality_diff
                })
    
    # Save preference pairs
    with open(output_file, 'w') as out:
        for pair in pairs:
            out.write(json.dumps(pair) + '\n')
    
    print(f"Created {len(pairs)} preference pairs")

# Create DPO pairs from filtered rollouts
create_dpo_pairs('filtered_rollouts.jsonl', 'dpo_pairs.jsonl')

Training Integration#

Once you have your processed data (sft_data.jsonl or dpo_pairs.jsonl), you can use any post-training framework for SFT or DPO:

Standard Data Formats#

SFT data follows the conversation format used by most training libraries:

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

DPO data follows the preference pair format:

{"prompt": ["..."], "chosen": ["..."], "rejected": ["..."]}

Validation and Evaluation#

Pre-Training Validation#

Before training, validate your data quality by checking the points below; a sketch of these checks follows the list:

  • Dataset size: Sufficient examples for training objectives

  • Reward distribution: Reasonable range and average quality scores

  • Length distribution: Appropriate conversation lengths

  • Task diversity: Balanced representation across different task types
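
A minimal sketch of these checks for the SFT file, assuming the fields produced by process_sft_data above (messages, reward, task_type):

import json
from collections import Counter

def validate_sft_data(sft_file: str):
    """Print basic dataset statistics before training."""
    with open(sft_file) as f:
        examples = [json.loads(line) for line in f]
    if not examples:
        print("No examples found")
        return

    rewards = [ex.get('reward', 0) for ex in examples]
    lengths = [len(ex.get('messages', [])) for ex in examples]
    task_types = Counter(ex.get('task_type', 'general') for ex in examples)

    print(f"Dataset size: {len(examples)} examples")
    print(f"Reward: min={min(rewards):.2f}, mean={sum(rewards) / len(rewards):.2f}, max={max(rewards):.2f}")
    print(f"Turns per conversation: min={min(lengths)}, mean={sum(lengths) / len(lengths):.1f}, max={max(lengths)}")
    print(f"Task types: {dict(task_types)}")

validate_sft_data('sft_data.jsonl')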

Post-Training Evaluation#

Test your improved model by generating new rollouts on held-out evaluation tasks:

# Generate rollouts with improved model
ng_collect_rollouts +agent_name=improved_agent \
    +input_jsonl_fpath=evaluation_tasks.jsonl \
    +output_jsonl_fpath=post_training_rollouts.jsonl

Compare key metrics like average reward, success rate, and task-specific performance against your baseline to measure improvement.
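
A small comparison sketch, assuming both rollout files carry the reward and success fields used earlier; the baseline file name is illustrative:

import json

def summarize(rollout_file: str):
    """Return (average reward, success rate) for a rollout file."""
    with open(rollout_file) as f:
        rollouts = [json.loads(line) for line in f]
    n = max(len(rollouts), 1)
    avg_reward = sum(r.get('reward', 0) for r in rollouts) / n
    success_rate = sum(1 for r in rollouts if r.get('success', False)) / n
    return avg_reward, success_rate

# 'baseline_rollouts.jsonl' is a hypothetical file of rollouts from the pre-training agent
baseline = summarize('baseline_rollouts.jsonl')
post_training = summarize('post_training_rollouts.jsonl')

print(f"Average reward: {baseline[0]:.3f} -> {post_training[0]:.3f}")
print(f"Success rate:   {baseline[1]:.1%} -> {post_training[1]:.1%}")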

Best Practices#

1. Data Quality Over Quantity#

# Prefer high-quality filtered data over large noisy datasets
filter_criteria = {
    'min_reward': 0.8,        # High threshold for SFT
    'min_success_rate': 0.9,
    'require_tool_usage': True  # Domain-specific requirements
}

2. Balanced Datasets#

# Ensure diverse task representation
def balance_dataset(input_file: str, output_file: str, max_per_category: int = 100):
    task_counts = {}
    balanced_data = []
    
    with open(input_file) as f:
        for line in f:
            data = json.loads(line)
            task_type = data.get('metadata', {}).get('task_type', 'general')
            
            if task_counts.get(task_type, 0) < max_per_category:
                balanced_data.append(data)
                task_counts[task_type] = task_counts.get(task_type, 0) + 1
    
    with open(output_file, 'w') as out:
        for data in balanced_data:
            out.write(json.dumps(data) + '\n')

3. Iterative Improvement#

# Iteration cycle
1. Generate rollouts with current agent
2. Filter and prepare training data  
3. Train improved model
4. Deploy and evaluate
5. Use improved agent to generate better rollouts

4. Version Control#

# Track data versions
mkdir -p training/v1.0/
mv sft_data.jsonl training/v1.0/
mv dpo_pairs.jsonl training/v1.0/

# Track model versions  
mkdir -p models/agent_v1.0/
cp -r ./results/* models/agent_v1.0/

Troubleshooting#

Problem: Poor Training Data Quality#

Symptoms: Low average rewards, inconsistent behaviors

Solutions:

  • Increase min_reward threshold for filtering

  • Generate rollouts with lower temperature (more consistent)

  • Add manual curation step

  • Improve base agent before data collection

Problem: Insufficient Data Diversity#

Symptoms: Model overfits to limited patterns

Solutions:

  • Generate rollouts with higher temperature

  • Use more diverse input tasks

  • Collect data from multiple agent configurations

  • Balance dataset across task types

Problem: Training Instability#

Symptoms: Loss doesn't converge, model performance degrades

Solutions:

  • Check data format compatibility with training framework

  • Reduce learning rate

  • Add regularization

  • Filter out extremely long or short conversations

What You’ve Learned#

You now know how to transform rollouts into training data:

  • Data preparation strategies for SFT and DPO

  • Quality filtering and curation techniques

  • Evaluation methods to measure improvement

  • Best practices for sustainable offline training workflows