Seed datasets let you bootstrap synthetic data generation from existing data. Instead of generating everything from scratch, you provide a dataset whose columns become available as context in your prompts and expressions—grounding your synthetic data in real-world examples.
When to Use Seed Datasets
Seed datasets shine when you have real data you want to build on:
The seed data provides realism and domain specificity; Data Designer adds volume and variation.
Every column in your seed dataset becomes available as a Jinja2 variable in prompts and expressions. Data Designer automatically:
Data Designer supports multiple ways to provide seed data, including:
Load from a local file—CSV, Parquet, or JSON.
Supported Formats
CSV (.csv)
Parquet (.parquet)
JSON (.json, .jsonl)
Load directly from HuggingFace datasets without downloading manually.
Use an in-memory pandas DataFrame—great for preprocessing or combining multiple sources.
Serialization
DataFrameSeedSource can’t be serialized to YAML/JSON configs. Use LocalFileSeedSource if you need to save and share configurations.
Treat a directory tree as the seed dataset. Each matching file becomes one seed row, exposing file metadata you can reference in prompts and expressions.
Directory-backed seed datasets expose these columns:
source_kind: always "directory_file"
source_path: full path to the matched file
relative_path: path relative to the configured directory
file_name: basename of the matched file
Filesystem matching
file_pattern matches file names only, not relative paths. recursive=True is the default, so nested subdirectories are searched unless you turn it off.
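The row shape and matching rules above can be approximated with the standard library. This is only a sketch of the described behavior, not Data Designer's implementation: `directory_rows` is a hypothetical helper, and treating `file_pattern` as an `fnmatch`-style glob is an assumption.

```python
from fnmatch import fnmatch
from pathlib import Path


def directory_rows(root, file_pattern="*", recursive=True):
    """Hypothetical helper: build one metadata row per matching file."""
    root = Path(root)
    # rglob walks nested subdirectories; glob stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    rows = []
    for path in sorted(candidates):
        # The pattern is tested against the file name only, never the relative path.
        if path.is_file() and fnmatch(path.name, file_pattern):
            rows.append(
                {
                    "source_kind": "directory_file",
                    "source_path": str(path.resolve()),
                    "relative_path": str(path.relative_to(root)),
                    "file_name": path.name,
                }
            )
    return rows
```

Because only the file name is tested, a pattern such as `sub/*.md` matches nothing in this sketch; to restrict results to one subdirectory, point `root` at that subdirectory instead.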
Read matching text files into the seed dataset. Each file becomes one seed row with the same metadata as DirectorySeedSource, plus the decoded file contents in a content column.
FileContentsSeedSource exposes these seeded columns:
source_kind: always "file_contents"
source_path: full path to the matched file
relative_path: path relative to the configured directory
file_name: basename of the matched file
content: decoded text contents of the matched file
Custom Filesystem Readers
If you need custom row construction, fan-out behavior, or expensive hydration logic for any directory-backed seed source, build a custom FileSystemSeedReader and pass it via DataDesigner(seed_readers=[...]). See the FileSystemSeedReader Plugins guide.
Encoding
encoding="utf-8" is the default. Set a different Python codec name if your files use another text encoding.
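To see what the encoding setting changes, the sketch below reads one file with an explicit codec using only the standard library. `read_content_row` is a hypothetical helper rather than part of Data Designer; it only mirrors the columns listed above.

```python
from pathlib import Path


def read_content_row(path, root, encoding="utf-8"):
    """Hypothetical helper: one seed row holding a file's decoded text."""
    path, root = Path(path), Path(root)
    return {
        "source_kind": "file_contents",
        "source_path": str(path.resolve()),
        "relative_path": str(path.relative_to(root)),
        "file_name": path.name,
        # The codec must match how the file was written, or decoding fails
        # with UnicodeDecodeError (or silently yields wrong characters).
        "content": path.read_text(encoding=encoding),
    }
```

A file saved as Latin-1, for instance, would be read with `encoding="latin-1"`; leaving the default in place for such a file typically raises `UnicodeDecodeError` as soon as a non-ASCII byte is encountered.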
Parse agent rollout trace files (e.g. from ATIF, Claude Code, Codex, or Hermes Agent) into a structured seed dataset. Each trace becomes one seed row with normalized metadata and the full message history, ready for distillation or analysis pipelines.
Dedicated guide
See Agent Rollout Ingestion for the rollout-specific guide, including:
path and file_pattern
AgentRolloutSeedSource
Trace Distillation
See the Agent Rollout Trace Distillation recipe for a complete example that turns agent traces into supervised fine-tuning data.
Control how rows are read from the seed dataset.
Rows are read sequentially in their original order. Each generated record corresponds to the next row in the seed dataset. If you generate more records than the seed dataset has rows, Data Designer cycles back to the first row and continues in order until generation completes.
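The wrap-around behavior can be pictured with a few lines of plain Python. This is an illustration of the semantics described above, not Data Designer code; `ordered_rows` is a hypothetical name.

```python
from itertools import cycle, islice


def ordered_rows(seed_rows, num_records):
    """Return seed rows in original order, wrapping around when exhausted."""
    return list(islice(cycle(seed_rows), num_records))


seed = ["row-0", "row-1", "row-2"]
print(ordered_rows(seed, 7))
# ['row-0', 'row-1', 'row-2', 'row-0', 'row-1', 'row-2', 'row-0']
```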
Rows are randomly shuffled before sampling. Useful when your seed data has some ordering you want to break.
Select a subset of your seed dataset—useful for large datasets or parallel processing.
Select a specific range of row indices.
Split the dataset into N equal partitions and select one. Perfect for distributing work across multiple jobs.
Parallel Processing
Run 5 jobs in parallel, each with a different partition index, to process a large seed dataset.
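The arithmetic behind such a split can be sketched in plain Python. This is not Data Designer's partitioning code: `partition_bounds` is a hypothetical helper, and giving any leftover rows to the lowest-numbered partitions is an assumption the text above does not specify.

```python
def partition_bounds(num_rows, index, num_partitions):
    """Return the half-open [start, stop) row range for one partition.

    Assumption: when num_rows is not divisible by num_partitions, the
    first `remainder` partitions each take one extra row.
    """
    if not 0 <= index < num_partitions:
        raise ValueError("index must be in [0, num_partitions)")
    base, remainder = divmod(num_rows, num_partitions)
    start = index * base + min(index, remainder)
    stop = start + base + (1 if index < remainder else 0)
    return start, stop


# Five jobs, one partition index each, covering 1,003 rows with no overlap.
for job_index in range(5):
    print(job_index, partition_bounds(1003, job_index, 5))
```

Each job only needs its own index and the shared partition count, so the jobs can run independently without coordinating which rows they read.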
Sampling and selection strategies work together. For example, shuffle rows within a specific partition:
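As a rough picture of how the two steps compose, the sketch below first selects one partition and then shuffles only that slice. It is plain Python rather than the Data Designer API; `shuffled_partition` and `rng_seed` are hypothetical names, and an even split is assumed.

```python
import random


def shuffled_partition(seed_rows, index, num_partitions, rng_seed=0):
    """Select one contiguous partition, then shuffle only that slice."""
    size = len(seed_rows) // num_partitions  # assumes an even split
    block = seed_rows[index * size : (index + 1) * size]
    # sample() returns a new shuffled list and leaves seed_rows untouched.
    return random.Random(rng_seed).sample(block, len(block))


rows = list(range(20))
picked = shuffled_partition(rows, index=1, num_partitions=4)
print(sorted(picked))  # [5, 6, 7, 8, 9]
```

The shuffle changes the order within the partition but never pulls in rows from outside it.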
Here’s a complete example generating physician notes from a symptom-to-diagnosis seed dataset:
Garbage in, garbage out. Clean your seed data before using it:
If your seed dataset has 1,000 rows and you generate 10,000 records, each seed row will be used ~10 times. Consider whether that’s appropriate for your use case.
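A quick way to check the reuse factor before a run, assuming ordered sampling that cycles through the seed rows (`reuse_counts` is a hypothetical helper, not a Data Designer function):

```python
def reuse_counts(num_seed_rows, num_records):
    """Return (min_uses, max_uses, rows_at_max) under ordered cycling."""
    full_passes, leftover = divmod(num_records, num_seed_rows)
    if leftover == 0:
        return full_passes, full_passes, num_seed_rows
    return full_passes, full_passes + 1, leftover


print(reuse_counts(1_000, 10_000))  # (10, 10, 1000)
print(reuse_counts(1_000, 2_500))   # (2, 3, 500)
```

In the second case, the first 500 rows are used three times and the rest twice, so reuse is not always uniform across the seed dataset.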
Seed datasets are excellent for controlling the distribution of your synthetic data. Want 30% electronics, 50% clothing, 20% home goods? Curate your seed dataset to match.
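One way to curate such a mix is to sample a fixed number of rows per category from a larger pool before handing the result to Data Designer. The sketch below uses only the standard library; `curate_seed`, the `category` and `product_id` fields, and the pool itself are hypothetical.

```python
import random
from collections import Counter


def curate_seed(pool, target_mix, total, rng_seed=0):
    """Draw `total` rows from `pool` so categories follow `target_mix`.

    pool: list of dicts with a "category" key (hypothetical schema).
    target_mix: category -> fraction; fractions should sum to 1.
    """
    rng = random.Random(rng_seed)
    by_category = {}
    for row in pool:
        by_category.setdefault(row["category"], []).append(row)
    seed = []
    for category, fraction in target_mix.items():
        count = round(total * fraction)
        # Sample without replacement so the seed has no duplicate rows.
        seed.extend(rng.sample(by_category[category], count))
    rng.shuffle(seed)
    return seed


pool = [
    {"category": c, "product_id": f"{c}-{i}"}
    for c in ("electronics", "clothing", "home goods")
    for i in range(100)
]
mix = {"electronics": 0.3, "clothing": 0.5, "home goods": 0.2}
seed = curate_seed(pool, mix, total=100)
print(Counter(row["category"] for row in seed))
```

Sampling without replacement requires each category in the pool to hold at least as many rows as the mix asks for; otherwise `random.sample` raises `ValueError`.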