Copyright © 2026, NVIDIA Corporation.

NeMo Gym

Offline Training (SFT/DPO)

Overview

This tutorial is experimental and may contain bugs. Proceed with caution.

Goal: Transform your generated rollouts into high-quality training data for supervised fine-tuning (SFT) and direct preference optimization (DPO).

Time: ~20 minutes

In this tutorial, you will:

  1. Filter and process collected rollouts
  2. Generate SFT and DPO training datasets
  3. Train models using offline training pipelines

Prerequisites

  • Completed Detailed Setup
  • Collected rollouts (Rollout Collection)
  • Virtual environment activated

Why Offline Training?

Offline training uses pre-collected rollouts to improve AI models without real-time exploration. This approach is ideal when:

  • You have a working agent that demonstrates good behaviors
  • You want reproducible results - same data, consistent training outcomes
  • You need cost-effective training - no expensive exploration during training
  • You want to capture expert demonstrations - preserve successful patterns
  • You have limited compute - offline training generates no rollouts during training, so it typically needs less compute than online reinforcement learning

The offline training pipeline: Generate rollouts → Filter and process → Train models → Deploy improved agents

Training Data Types

Supervised Fine-Tuning (SFT) Data

Purpose: Train models to follow successful agent interaction patterns

Data structure: Input-output pairs showing complete agent conversations

{
  "messages": [
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}]},
    {"role": "tool", "content": "Temperature: 22°C, sunny"},
    {"role": "assistant", "content": "The weather in Paris is 22°C and sunny."}
  ],
  "quality_score": 0.95
}
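
A quick structural check can catch malformed records before they reach a trainer. The following is a minimal sketch, assuming the message layout shown above. The helper `validate_sft_example`, the `ALLOWED_ROLES` set (including the "system" role), and the rule that a record should end with an assistant message are all assumptions for illustration, not part of NeMo Gym.

```python
from typing import Any, Dict, List

# Roles seen in the SFT example above; "system" is added as an assumption.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_sft_example(example: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one SFT record (empty if none)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems: List[str] = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: must be an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        # An assistant turn may carry tool calls instead of text content.
        if "content" not in msg and "tool_calls" not in msg:
            problems.append(f"message {i}: needs 'content' or 'tool_calls'")

    # Assumption: a useful SFT target ends with the assistant's final answer.
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


example = {
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": "The weather in Paris is 22°C and sunny."},
    ],
    "quality_score": 0.95,
}
print(validate_sft_example(example))  # prints []
```

Returning a list of problems rather than raising lets you log every issue in a file in one pass.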

Direct Preference Optimization (DPO) Data

Purpose: Train models to prefer better responses over worse ones

Data structure: Preference pairs with chosen vs rejected responses

{
  "prompt": [{"role": "user", "content": "Solve this math problem: 2x + 5 = 13"}],
  "chosen": [
    {"role": "assistant", "content": "I'll solve for x step by step:\n2x + 5 = 13\n2x = 13 - 5\n2x = 8\nx = 4"}
  ],
  "rejected": [
    {"role": "assistant", "content": "The answer is x = 3"}
  ],
  "quality_difference": 0.7
}
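
Preference pairs can be checked in the same spirit. This sketch assumes the three-field layout shown above; `validate_dpo_pair` is a hypothetical helper, and the rule that `quality_difference` must be positive is an assumption that follows from "chosen" being the better response.

```python
from typing import Any, Dict, List


def validate_dpo_pair(pair: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one DPO record (empty if none)."""
    problems: List[str] = []
    for key in ("prompt", "chosen", "rejected"):
        value = pair.get(key)
        if not isinstance(value, list) or not value:
            problems.append(f"'{key}' must be a non-empty list of messages")
    if problems:
        return problems

    # Identical responses carry no preference signal.
    if pair["chosen"] == pair["rejected"]:
        problems.append("'chosen' and 'rejected' are identical")

    # The field is optional; when present it should favor the chosen response.
    diff = pair.get("quality_difference")
    if diff is not None and diff <= 0:
        problems.append("'quality_difference' should be positive")
    return problems


pair = {
    "prompt": [{"role": "user", "content": "Solve this math problem: 2x + 5 = 13"}],
    "chosen": [{"role": "assistant", "content": "2x = 8, so x = 4"}],
    "rejected": [{"role": "assistant", "content": "The answer is x = 3"}],
    "quality_difference": 0.7,
}
print(validate_dpo_pair(pair))  # prints []
```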

Data Preparation Overview

The offline training pipeline follows this logical flow:

  1. Collect rollouts
    • SFT data: Use consistent generation (low temperature, single rollout per task)
    • DPO data: Use diverse generation (higher temperature, 2 rollouts per task for comparison)
  2. Filter for quality - Remove poor rollouts before processing
  3. Format for training - Convert to SFT or DPO format based on your goals
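
The two collection modes in step 1 can be sketched as input preparation. This is only an illustration under stated assumptions: `SAMPLING_PRESETS`, its temperature values, and `build_collection_rows` are hypothetical, and the sketch assumes that the collection tool reads a per-row `temperature` from `responses_create_params` and treats duplicated rows as separate rollouts. Check your collection setup before relying on either behavior.

```python
import copy
from typing import Any, Dict, List

# Hypothetical presets: the temperature values are illustrative, not recommended defaults.
SAMPLING_PRESETS = {
    "sft": {"temperature": 0.2, "rollouts_per_task": 1},
    "dpo": {"temperature": 1.0, "rollouts_per_task": 2},
}


def build_collection_rows(tasks: List[Dict[str, Any]], mode: str) -> List[Dict[str, Any]]:
    """Expand task rows for collection under the given mode ("sft" or "dpo")."""
    preset = SAMPLING_PRESETS[mode]
    rows = []
    for task in tasks:
        for _ in range(preset["rollouts_per_task"]):
            row = copy.deepcopy(task)  # avoid sharing nested dicts between copies
            params = row.setdefault("responses_create_params", {})
            params["temperature"] = preset["temperature"]
            rows.append(row)
    return rows


tasks = [{"responses_create_params": {"input": [{"role": "user", "content": "2x + 5 = 13"}]}}]
print(len(build_collection_rows(tasks, "sft")))  # prints 1
print(len(build_collection_rows(tasks, "dpo")))  # prints 2
```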

Step 1: Quality Filtering and Curation

Always filter your rollouts before formatting them for training. The following example approaches can be customized for your needs:

Automatic Filtering

import json
from typing import Dict

def filter_rollouts(input_file: str, output_file: str, filters: Dict):
    """Apply automatic quality filters to rollouts."""
    kept = 0
    total = 0
    with open(input_file) as f, open(output_file, 'w') as out:
        for line in f:
            rollout = json.loads(line)
            total += 1

            # Keep rollouts that meet the reward threshold, succeeded,
            # and have an output length within the allowed range
            num_turns = len(rollout.get('output', []))
            if (rollout.get('reward', 0) >= filters.get('min_reward', 0.5) and
                    rollout.get('success', False) and
                    filters.get('min_turns', 2) <= num_turns <= filters.get('max_turns', 20)):
                out.write(line)
                kept += 1

    # Guard against an empty input file to avoid dividing by zero
    pct = kept / total * 100 if total else 0.0
    print(f"Kept {kept}/{total} rollouts ({pct:.1f}%)")

# Apply filters first
filter_rollouts('raw_rollouts.jsonl', 'filtered_rollouts.jsonl', {
    'min_reward': 0.7,
    'min_turns': 3,
    'max_turns': 15
})

Manual Curation (Optional)

For critical applications, sample and manually review:

import json
import random

def sample_for_review(input_file: str, sample_size: int = 50):
    """Sample rollouts for manual review."""
    with open(input_file) as f:
        rollouts = [json.loads(line) for line in f]

    # Stratified sampling by reward
    low_reward = [r for r in rollouts if r.get('reward', 0) < 0.5]
    mid_reward = [r for r in rollouts if 0.5 <= r.get('reward', 0) < 0.8]
    high_reward = [r for r in rollouts if r.get('reward', 0) >= 0.8]

    # Split the budget 20% / 40% / 40% (10 / 20 / 20 for the default of 50)
    n_low = sample_size // 5
    n_mid = (sample_size * 2) // 5
    n_high = sample_size - n_low - n_mid

    sample = (random.sample(low_reward, min(n_low, len(low_reward))) +
              random.sample(mid_reward, min(n_mid, len(mid_reward))) +
              random.sample(high_reward, min(n_high, len(high_reward))))

    with open('manual_review_sample.jsonl', 'w') as out:
        for rollout in sample:
            out.write(json.dumps(rollout) + '\n')

These are example filtering approaches. Customize the criteria, thresholds, and sampling strategies based on your specific domain and quality requirements.

Step 2: Format for Training

Once your rollouts are filtered, format them for your chosen training method:

SFT Data Processing

Transform filtered rollouts into conversation format:

import json
from typing import List, Dict

def process_sft_data(filtered_rollout_file: str, output_file: str):
    """Convert filtered rollouts to SFT training format."""
    with open(filtered_rollout_file) as f, open(output_file, 'w') as out:
        for line in f:
            rollout = json.loads(line)
            sft_example = {
                "messages": rollout['output'],
                "reward": rollout['reward'],
                "task_type": rollout.get('metadata', {}).get('task_type', 'general')
            }
            out.write(json.dumps(sft_example) + '\n')

# Process filtered rollouts (no additional filtering needed)
process_sft_data('filtered_rollouts.jsonl', 'sft_data.jsonl')

DPO Data Processing

Create preference pairs from filtered rollouts (requires 2 rollouts per task):

import json

def create_dpo_pairs(filtered_rollout_file: str, output_file: str):
    """Create preference pairs from pairs of filtered rollouts."""

    # Group rollouts by task
    task_groups = {}
    with open(filtered_rollout_file) as f:
        for line in f:
            rollout = json.loads(line)
            # Fall back to the serialized prompt when no task_id is present.
            # Check for None explicitly so that a task_id of 0 is not discarded.
            task_id = rollout.get('task_id')
            if task_id is None:
                task_id = json.dumps(rollout['responses_create_params']['input'], sort_keys=True)

            task_groups.setdefault(task_id, []).append(rollout)

    # Create preference pairs from pairs of rollouts
    pairs = []
    for task_rollouts in task_groups.values():
        if len(task_rollouts) == 2:  # DPO works with pairs
            rollout_1, rollout_2 = task_rollouts

            # Determine which is better based on reward
            if rollout_1['reward'] > rollout_2['reward']:
                chosen, rejected = rollout_1, rollout_2
            else:
                chosen, rejected = rollout_2, rollout_1

            # Only create pair if there's meaningful difference.
            # Note: a strict min_reward filter upstream can leave few pairs with a gap this large.
            quality_diff = chosen['reward'] - rejected['reward']
            if quality_diff >= 0.1:  # Minimum difference threshold
                pairs.append({
                    "prompt": chosen['responses_create_params']['input'],
                    "chosen": chosen['output'],
                    "rejected": rejected['output'],
                    "quality_difference": quality_diff
                })

    # Save preference pairs
    with open(output_file, 'w') as out:
        for pair in pairs:
            out.write(json.dumps(pair) + '\n')

    print(f"Created {len(pairs)} preference pairs")

# Create DPO pairs from filtered rollouts
create_dpo_pairs('filtered_rollouts.jsonl', 'dpo_pairs.jsonl')
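
The function above skips any task that does not have exactly two rollouts. If you collect more than two per task, one possible extension is to pair the best rollout with the worst. The sketch below is an assumption-laden illustration, not part of the tutorial's pipeline: `pick_preference_pair` is a hypothetical helper that reuses the field names and the 0.1 threshold from `create_dpo_pairs`.

```python
from typing import Any, Dict, List, Optional


def pick_preference_pair(
    task_rollouts: List[Dict[str, Any]], min_diff: float = 0.1
) -> Optional[Dict[str, Any]]:
    """Pair the highest- and lowest-reward rollouts of one task, if far enough apart."""
    if len(task_rollouts) < 2:
        return None

    # Sort ascending by reward: first is the weakest rollout, last the strongest.
    ranked = sorted(task_rollouts, key=lambda r: r["reward"])
    rejected, chosen = ranked[0], ranked[-1]

    diff = chosen["reward"] - rejected["reward"]
    if diff < min_diff:
        return None
    return {
        "prompt": chosen["responses_create_params"]["input"],
        "chosen": chosen["output"],
        "rejected": rejected["output"],
        "quality_difference": diff,
    }


prompt = {"input": [{"role": "user", "content": "2x + 5 = 13"}]}
rollouts = [
    {"reward": 1.0, "output": ["x = 4"], "responses_create_params": prompt},
    {"reward": 0.25, "output": ["x = 3"], "responses_create_params": prompt},
    {"reward": 0.5, "output": ["x = 4?"], "responses_create_params": prompt},
]
pair = pick_preference_pair(rollouts)
print(pair["quality_difference"])  # prints 0.75
```

Using only the extremes discards the middle rollouts. That keeps each pair's reward gap as large as possible at the cost of fewer pairs per task.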

Training Integration

Once you have processed data (sft_data.jsonl or dpo_pairs.jsonl), you can train with any post-training framework that supports SFT or DPO.

Standard Data Formats

SFT data follows the conversation format used by most training libraries:

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

DPO data follows the preference pair format:

{"prompt": ["..."], "chosen": ["..."], "rejected": ["..."]}

Validation and Evaluation

Pre-Training Validation

Before training, validate your data quality by checking:

  • Dataset size: Sufficient examples for training objectives
  • Reward distribution: Reasonable range and average quality scores
  • Length distribution: Appropriate conversation lengths
  • Task diversity: Balanced representation across different task types

Post-Training Evaluation

Test your improved model by generating new rollouts on held-out evaluation tasks:

# Generate rollouts with improved model
ng_collect_rollouts +agent_name=improved_agent \
    +input_jsonl_fpath=evaluation_tasks.jsonl \
    +output_jsonl_fpath=post_training_rollouts.jsonl

Compare key metrics like average reward, success rate, and task-specific performance against your baseline to measure improvement.
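
The comparison can be sketched as follows, assuming each rollout carries the `reward` and `success` fields used by the filtering example earlier. `load_jsonl`, `summarize_rollouts`, and `compare_to_baseline` are hypothetical helpers, and the baseline file name in the usage note is likewise an assumption.

```python
import json
from typing import Any, Dict, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize_rollouts(rollouts: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute average reward and success rate for a set of rollouts."""
    n = len(rollouts)
    if n == 0:
        return {"count": 0, "avg_reward": 0.0, "success_rate": 0.0}
    return {
        "count": n,
        "avg_reward": sum(r.get("reward", 0.0) for r in rollouts) / n,
        "success_rate": sum(1 for r in rollouts if r.get("success", False)) / n,
    }


def compare_to_baseline(
    baseline: List[Dict[str, Any]], improved: List[Dict[str, Any]]
) -> Dict[str, float]:
    """Return improved-minus-baseline deltas; positive values mean improvement."""
    before, after = summarize_rollouts(baseline), summarize_rollouts(improved)
    return {
        "avg_reward_delta": after["avg_reward"] - before["avg_reward"],
        "success_rate_delta": after["success_rate"] - before["success_rate"],
    }


baseline = [{"reward": 0.5, "success": True}, {"reward": 0.5, "success": False}]
improved = [{"reward": 1.0, "success": True}, {"reward": 0.5, "success": True}]
print(compare_to_baseline(baseline, improved))
# prints {'avg_reward_delta': 0.25, 'success_rate_delta': 0.5}
```

In practice you would pass `load_jsonl('post_training_rollouts.jsonl')` as `improved` and a file of rollouts from the original agent on the same evaluation tasks as `baseline`.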

Best Practices

1. Data Quality Over Quantity

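# Prefer high-quality filtered data over large noisy datasets.
# Only min_reward is read by the filter_rollouts example above; the other
# keys are illustrative and need matching checks in your own filter.
filter_criteria = {
    'min_reward': 0.8,  # High threshold for SFT
    'min_success_rate': 0.9,
    'require_tool_usage': True  # Domain-specific requirements
}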

2. Balanced Datasets

import json

# Ensure diverse task representation
def balance_dataset(input_file: str, output_file: str, max_per_category: int = 100):
    """Keep at most max_per_category examples per task type, in file order."""
    task_counts = {}
    balanced_data = []

    with open(input_file) as f:
        for line in f:
            data = json.loads(line)
            task_type = data.get('metadata', {}).get('task_type', 'general')

            if task_counts.get(task_type, 0) < max_per_category:
                balanced_data.append(data)
                task_counts[task_type] = task_counts.get(task_type, 0) + 1

    with open(output_file, 'w') as out:
        for data in balanced_data:
            out.write(json.dumps(data) + '\n')

3. Iterative Improvement

Iteration cycle:

  1. Generate rollouts with current agent
  2. Filter and prepare training data
  3. Train improved model
  4. Deploy and evaluate
  5. Use improved agent to generate better rollouts

4. Version Control

# Track data versions
mkdir -p training/v1.0/
mv sft_data.jsonl training/v1.0/
mv dpo_pairs.jsonl training/v1.0/

# Track model versions
mkdir -p models/agent_v1.0/
cp -r ./results/* models/agent_v1.0/
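
Directory names alone do not show whether a file changed after it was versioned. One option is to store a checksum manifest next to the data. This is a sketch under assumptions: `fingerprint`, `write_manifest`, and the `manifest.json` file name are hypothetical and not part of NeMo Gym.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def fingerprint(path: str) -> Dict[str, object]:
    """Return the file name, SHA-256 digest, and newline count of one file."""
    data = Path(path).read_bytes()
    return {
        "file": Path(path).name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "num_lines": data.count(b"\n"),
    }


def write_manifest(version_dir: str) -> List[Dict[str, object]]:
    """Fingerprint every .jsonl file in version_dir and save manifest.json there."""
    directory = Path(version_dir)
    entries = [fingerprint(str(p)) for p in sorted(directory.glob("*.jsonl"))]
    (directory / "manifest.json").write_text(json.dumps(entries, indent=2))
    return entries
```

Calling `write_manifest('training/v1.0')` after the `mv` commands above records a digest for each data file. Recomputing the digest later shows whether the file still matches the version that was used for training.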

Troubleshooting

Problem: Poor Training Data Quality

Symptoms: Low average rewards, inconsistent behaviors

Solutions:

  • Increase min_reward threshold for filtering
  • Generate rollouts with lower temperature (more consistent)
  • Add manual curation step
  • Improve base agent before data collection

Problem: Insufficient Data Diversity

Symptoms: Model overfits to limited patterns

Solutions:

  • Generate rollouts with higher temperature
  • Use more diverse input tasks
  • Collect data from multiple agent configurations
  • Balance dataset across task types

Problem: Training Instability

Symptoms: Loss doesn't converge, model performance degrades

Solutions:

  • Check data format compatibility with training framework
  • Reduce learning rate
  • Add regularization
  • Filter out extremely long or short conversations
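
The last suggestion can be sketched as a percentile trim on conversation length. This is an illustration under assumptions: `trim_length_outliers` is a hypothetical helper, the 5th and 95th percentile cutoffs are arbitrary starting points, and the percentile is approximated with a simple floor index rather than interpolation.

```python
from typing import Any, Dict, List


def trim_length_outliers(
    examples: List[Dict[str, Any]], lower_pct: float = 0.05, upper_pct: float = 0.95
) -> List[Dict[str, Any]]:
    """Drop examples whose message count falls outside the given percentiles."""
    if not examples:
        return []

    lengths = sorted(len(ex["messages"]) for ex in examples)
    last = len(lengths) - 1
    # Floor-index approximation of the percentile bounds.
    low = lengths[int(lower_pct * last)]
    high = lengths[int(upper_pct * last)]
    return [ex for ex in examples if low <= len(ex["messages"]) <= high]


# Ten ordinary conversations and one 40-message outlier.
sizes = [2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 40]
examples = [{"messages": [{"role": "user", "content": "hi"}] * n} for n in sizes]
trimmed = trim_length_outliers(examples)
print(len(trimmed))  # prints 10
```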

What You’ve Learned

You now know how to transform rollouts into training data:

  • Data preparation strategies for SFT and DPO
  • Quality filtering and curation techniques
  • Evaluation methods to measure improvement
  • Best practices for sustainable offline training workflows

Next Steps

GRPO Training

Train with GRPO for more powerful optimization than offline methods.

Build Custom Environments

Create custom training environments with advanced verification patterns.