Offline Training (SFT/DPO)
This tutorial is experimental and may contain bugs. Proceed with caution.
Goal: Transform your generated rollouts into high-quality training data for supervised fine-tuning (SFT) and direct preference optimization (DPO).
Time: ~20 minutes
In this tutorial, you will:
- Filter and process collected rollouts
- Generate SFT and DPO training datasets
- Train models using offline training pipelines
Prerequisites
- Completed Getting Started
- Virtual environment activated
Why Offline Training?
Offline training uses pre-collected rollouts to improve AI models without real-time exploration. This approach is ideal when:
- You have a working agent that demonstrates good behaviors
- You want reproducible results - same data, consistent training outcomes
- You need cost-effective training - no expensive exploration during training
- You want to capture expert demonstrations - preserve successful patterns
- You have limited compute - more efficient than reinforcement learning
The offline training pipeline: Generate rollouts → Filter and process → Train models → Deploy improved agents
Training Data Types
Supervised Fine-Tuning (SFT) Data
Purpose: Train models to follow successful agent interaction patterns
Data structure: Input-output pairs showing complete agent conversations
Direct Preference Optimization (DPO) Data
Purpose: Train models to prefer better responses over worse ones
Data structure: Preference pairs with chosen vs rejected responses
Data Preparation Overview
The offline training pipeline follows this logical flow:
- Collect rollouts:
  - SFT data: Use consistent generation (low temperature, single rollout per task)
  - DPO data: Use diverse generation (higher temperature, 2 rollouts per task for comparison); see the sampling sketch after this list
- Filter for quality - Remove poor rollouts before processing
- Format for training - Convert to SFT or DPO format based on your goals
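To make the two collection modes concrete, here is a minimal sketch using an OpenAI-compatible client; the model name and task prompt are placeholders, and the temperatures are illustrative starting points, not tuned values.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible inference endpoint

task_prompt = [{"role": "user", "content": "..."}]  # placeholder task

# SFT collection: low temperature, one rollout per task (consistency)
sft_rollout = client.chat.completions.create(
    model="your-agent-model",  # placeholder model name
    messages=task_prompt,
    temperature=0.2,
    n=1,
)

# DPO collection: higher temperature, two rollouts per task (diversity),
# so the pair can later be ranked as chosen vs rejected
dpo_rollouts = client.chat.completions.create(
    model="your-agent-model",
    messages=task_prompt,
    temperature=0.9,
    n=2,
)
```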
Step 1: Quality Filtering and Curation
Always filter your rollouts before formatting them for training. Here are example approaches you can customize for your needs:
Automatic Filtering
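A minimal automatic filter, assuming rollouts are stored as JSONL records with reward, messages, and error fields; the field names, thresholds, and file names are assumptions to adapt to your rollout schema.

```python
import json

MIN_REWARD = 0.7               # example threshold; tune for your domain
MIN_TURNS, MAX_TURNS = 2, 40   # drop degenerate or runaway conversations

def passes_filters(rollout: dict) -> bool:
    """Keep only successful, well-formed rollouts."""
    if rollout.get("reward", 0.0) < MIN_REWARD:
        return False
    if not MIN_TURNS <= len(rollout.get("messages", [])) <= MAX_TURNS:
        return False
    if rollout.get("error"):   # discard rollouts that raised errors
        return False
    return True

with open("rollouts.jsonl") as src, open("filtered_rollouts.jsonl", "w") as dst:
    for line in src:
        rollout = json.loads(line)
        if passes_filters(rollout):
            dst.write(json.dumps(rollout) + "\n")
```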
Manual Curation (Optional)
For critical applications, sample and manually review:
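One way to do this is to draw a small random sample from the filtered set into a separate file for human review; the 5% sampling rate and 50-example cap below are arbitrary starting points.

```python
import json
import random

random.seed(42)  # make the review sample reproducible

with open("filtered_rollouts.jsonl") as f:
    rollouts = [json.loads(line) for line in f]

# Review roughly 5% of the data, capped at 50 examples
k = min(50, max(1, len(rollouts) // 20))
sample = random.sample(rollouts, k)

with open("manual_review.jsonl", "w") as f:
    for rollout in sample:
        f.write(json.dumps(rollout) + "\n")
```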
These are example filtering approaches. Customize the criteria, thresholds, and sampling strategies based on your specific domain and quality requirements.
Step 2: Format for Training
Once you have a set of filtered, high-quality rollouts, format them for your chosen training method:
SFT Data Processing
Transform filtered rollouts into conversation format:
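A conversion sketch, assuming each filtered rollout already stores its conversation as an OpenAI-style messages list; the output keeps only what the trainer needs.

```python
import json

with open("filtered_rollouts.jsonl") as src, open("sft_data.jsonl", "w") as dst:
    for line in src:
        rollout = json.loads(line)
        # Drop rollout metadata; keep only the full conversation
        dst.write(json.dumps({"messages": rollout["messages"]}) + "\n")
```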
DPO Data Processing
Create preference pairs from filtered rollouts (requires 2 rollouts per task):
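A pairing sketch, assuming each rollout record carries task_id, reward, and messages fields (for DPO, filter out malformed rollouts but keep both strong and weak completions). Per task, the higher-reward rollout becomes chosen and the lower-reward one rejected; the reward-gap threshold, an assumption, skips near-ties.

```python
import json
from collections import defaultdict

MIN_REWARD_GAP = 0.1  # skip pairs whose quality difference is negligible

# Group rollouts by task so same-task completions can be compared
by_task = defaultdict(list)
with open("filtered_rollouts.jsonl") as f:
    for line in f:
        rollout = json.loads(line)
        by_task[rollout["task_id"]].append(rollout)

with open("dpo_pairs.jsonl", "w") as dst:
    for rollouts in by_task.values():
        if len(rollouts) < 2:
            continue  # need at least two rollouts to form a pair
        rollouts.sort(key=lambda r: r["reward"], reverse=True)
        best, worst = rollouts[0], rollouts[-1]
        if best["reward"] - worst["reward"] < MIN_REWARD_GAP:
            continue
        pair = {
            # Shared task prompt: first user message of the conversation
            "prompt": next(m["content"] for m in best["messages"] if m["role"] == "user"),
            "chosen": best["messages"][-1]["content"],    # final assistant reply
            "rejected": worst["messages"][-1]["content"],
        }
        dst.write(json.dumps(pair) + "\n")
```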
Training Integration
After you have your processed data (sft_data.jsonl or dpo_pairs.jsonl), you can use any post-training framework for SFT or DPO:
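For example, a minimal SFT sketch with Hugging Face TRL; the base model here is a placeholder, and argument names vary across TRL versions, so check the docs for your installed release.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Each line of sft_data.jsonl holds one {"messages": [...]} record
dataset = load_dataset("json", data_files="sft_data.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder base model
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-output"),
)
trainer.train()
```

DPO training is analogous with trl.DPOTrainer and the dpo_pairs.jsonl file.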
Standard Data Formats
SFT data follows the conversation format used by most training libraries:
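A representative record (content illustrative; each record occupies a single line in the actual JSONL file, pretty-printed here for readability):

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Find the cheapest flight to Tokyo."},
    {"role": "assistant", "content": "The cheapest option is ..."}
  ]
}
```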
DPO data follows the preference pair format:
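Again illustrative, one record per line:

```json
{
  "prompt": "Find the cheapest flight to Tokyo.",
  "chosen": "The cheapest option is ...",
  "rejected": "I am not able to help with that."
}
```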
Validation and Evaluation
Pre-Training Validation
Before training, validate your data quality by checking the points below (a quick validation script follows the list):
- Dataset size: Sufficient examples for training objectives
- Reward distribution: Reasonable range and average quality scores
- Length distribution: Appropriate conversation lengths
- Task diversity: Balanced representation across different task types
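A sketch covering these checks over the filtered rollouts; the task_type field is an assumption, so substitute whatever task metadata your rollouts carry.

```python
import json
from collections import Counter
from statistics import mean

with open("filtered_rollouts.jsonl") as f:
    rollouts = [json.loads(line) for line in f]

rewards = [r["reward"] for r in rollouts]
lengths = [len(r["messages"]) for r in rollouts]
tasks = Counter(r.get("task_type", "unknown") for r in rollouts)  # assumed field

print(f"dataset size: {len(rollouts)} examples")
print(f"reward: min={min(rewards):.2f} mean={mean(rewards):.2f} max={max(rewards):.2f}")
print(f"turns:  min={min(lengths)} mean={mean(lengths):.1f} max={max(lengths)}")
print(f"task distribution: {tasks.most_common()}")
```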
Post-Training Evaluation
Test your improved model by generating new rollouts on held-out evaluation tasks:
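A comparison sketch, assuming you have saved the baseline and fine-tuned models' evaluation rollouts as JSONL files with reward fields; the success threshold of 1.0 is an assumption to adapt to your reward scale.

```python
import json
from statistics import mean

def summarize(path: str) -> dict:
    """Aggregate reward metrics from a JSONL file of evaluation rollouts."""
    with open(path) as f:
        rewards = [json.loads(line)["reward"] for line in f]
    return {
        "avg_reward": mean(rewards),
        "success_rate": sum(r >= 1.0 for r in rewards) / len(rewards),  # assumed threshold
    }

baseline = summarize("eval_rollouts_baseline.jsonl")
trained = summarize("eval_rollouts_trained.jsonl")
print(f"baseline: {baseline}")
print(f"trained:  {trained}")
print(f"avg reward delta: {trained['avg_reward'] - baseline['avg_reward']:+.3f}")
```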
Compare key metrics like average reward, success rate, and task-specific performance against your baseline to measure improvement.
Best Practices
1. Data Quality Over Quantity: a smaller set of carefully filtered rollouts trains better models than a large, noisy one.
2. Balanced Datasets: keep task types represented evenly so the model does not overfit to one pattern.
3. Iterative Improvement: train, evaluate on held-out tasks, collect new rollouts with the improved agent, and repeat.
4. Version Control: version your datasets, filtering thresholds, and training configurations so results stay reproducible.
Troubleshooting
Problem: Poor Training Data Quality
Solutions:
- Increase min_reward threshold for filtering
- Generate rollouts with lower temperature (more consistent)
- Add manual curation step
- Improve base agent before data collection
Problem: Insufficient Data Diversity
Solutions:
- Generate rollouts with higher temperature
- Use more diverse input tasks
- Collect data from multiple agent configurations
- Balance dataset across task types
Problem: Training Instability
Solutions:
- Check data format compatibility with training framework
- Reduce learning rate
- Add regularization
- Filter out extremely long or short conversations
What You’ve Learned
You now know how to transform rollouts into training data:
- Data preparation strategies for SFT and DPO
- Quality filtering and curation techniques
- Evaluation methods to measure improvement
- Best practices for sustainable offline training workflows