Synthetic Data How-To Guides#

This section provides task-focused guides for common SDG workflows. For your first run, start with Generate Your First Synthetic Dataset.

If you are new to model training or want a calmer on-ramp before tasks, read Use the SDG Skill With Confidence for how to run a productive session with a coding agent.

Run the Pipeline

Preview, generate, and customize output path and projection.

Tips for the Data Generation Pipeline
Create a Domain Dataset

Adapt the pipeline to a custom domain with a seed file and multiple category dimensions.

Create a Domain Dataset for Airlines Customer Service
Generate Tool-Call Data

Generate multi-turn conversations with OpenAI-style tool calls for tool-use SFT.

Generate Tool-Calling Data for SFT
Generate Preference Data

Generate DPO preference pairs (prompt / chosen / rejected) from rl_pref.yaml.

Generate Preference Data for DPO
Dispatch to a Cluster

Configure an env.toml profile and run SDG on Lepton or Slurm.

Dispatch SDG to a Cluster