Seeding SDG with External Data#
The most effective way to generate synthetic data that matches your specific domain is to seed the synthetic data generation (SDG) process with your existing (real-world) datasets. By providing real data as a foundation, you can steer the generation process to ensure the synthetic data maintains the patterns, distributions, and characteristics of your actual data.
How Seeding Works in Data Designer#
When you seed Data Designer with an external dataset, each record in your seed dataset becomes available as context during the generation process. You can reference any column from the seed data in your prompts using Jinja templating, allowing the LLM to understand and replicate the patterns in your real data.
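To make the templating behavior concrete, here is a minimal, illustrative sketch of how a seed record's columns fill the `{{ column }}` placeholders in a prompt. This is not Data Designer's actual rendering engine (which uses Jinja), and the column names and values are hypothetical:

```python
# Illustrative only: mimics how one seed record's columns become prompt context.
seed_record = {"product_name": "Acme Widget", "category": "hardware"}
prompt_template = "Write a review for {{ product_name }} in the {{ category }} category."

def render(template: str, record: dict) -> str:
    # Minimal stand-in for Jinja rendering: replace {{ col }} with the record's value.
    out = template
    for col, value in record.items():
        out = out.replace("{{ " + col + " }}", str(value))
    return out

prompt = render(prompt_template, seed_record)
print(prompt)  # Write a review for Acme Widget in the hardware category.
```

During generation, each sampled seed record produces a different rendered prompt, so the LLM sees your real data as grounding context for every synthetic record.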
Below we outline the steps to seed your generation process with an external dataset.
Step 1: Prepare Your Seed Dataset for Upload to the Datastore#
You will need to have your seed dataset stored in a CSV, JSONL, or Parquet file. The file will be uploaded to the datastore, which can be either the Hugging Face Hub or the NeMo Microservices Datastore, and made available for the generation process. See the Manage Entities section for information about setting up a datastore.
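If your seed data is not already in one of these formats, you can write it out with a few lines of standard-library Python. The sketch below produces a JSONL file; the records, column names, and file name are placeholders for your own data:

```python
import json

# Hypothetical seed records; replace with your real data.
seed_records = [
    {"product_name": "Acme Widget", "category": "hardware", "review": "Solid build."},
    {"product_name": "Beta Gadget", "category": "electronics", "review": "Great value."},
]

# JSONL: one JSON object per line, one of the accepted seed dataset formats.
with open("my_seed_dataset.jsonl", "w") as f:
    for record in seed_records:
        f.write(json.dumps(record) + "\n")
```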
Step 2: Configure the Seed Dataset#
Use the `with_seed_dataset()` method of the `DataDesignerConfigBuilder` to configure your seed dataset:
```python
import os

from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import DataDesignerClient, DataDesignerConfigBuilder
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

data_designer_client = DataDesignerClient(
    client=NeMoMicroservices(base_url=os.environ["NEMO_MICROSERVICES_BASE_URL"])
)

config_builder = DataDesignerConfigBuilder(model_configs="path/to/your/model_configs.yaml")

# By passing in the `dataset_path` argument, we are telling Data Designer to upload
# the dataset to the datastore. If you already uploaded the dataset to the datastore,
# you do not need to pass in a path.
config_builder.with_seed_dataset(
    repo_id="your-org/your-repo",
    filename="my_seed_dataset.parquet",
    dataset_path="path/to/your/seed_dataset.parquet",
    sampling_strategy="shuffle",
    with_replacement=False,
    # If uploading to the NMP Datastore, you must provide the datastore's endpoint,
    # which must match the endpoint in the docker-compose file when running in quickstart mode.
    datastore={"endpoint": os.environ["NEMO_MICROSERVICES_DATASTORE_ENDPOINT"]},
)
```
Under the hood, the datastore is based on the Hugging Face Hub API, so the `repo_id` and `filename` arguments follow the same format.
There are two arguments that control how your dataset is sampled during the generation process:
Sampling Strategy#
Options for the `sampling_strategy` argument:

- `"ordered"` (default): Maintains the original order of records.
  - Useful when your seed data has meaningful sequences.
  - Records are used in the same order as they appear in the dataset.
- `"shuffle"`: Randomly shuffles records before sampling.
  - Useful when there are no correlations between records.
  - Prevents predictable patterns in generated data.
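The difference between the two strategies can be illustrated with plain Python. This is a sketch of the sampling semantics, not Data Designer's internals:

```python
import random

seed_records = ["r1", "r2", "r3", "r4"]

# "ordered": records are consumed in their original order.
ordered = list(seed_records)

# "shuffle": records are randomly reordered before sampling.
random.seed(0)  # fixed seed so the example is reproducible
shuffled = list(seed_records)
random.shuffle(shuffled)

print(ordered)   # ['r1', 'r2', 'r3', 'r4']
print(shuffled)  # same records, in a random order
```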
Replacement Strategy#
Options for the `with_replacement` argument:

- `with_replacement=False` (default): Each seed record is used only once.
  - Ensures all seed records are used exactly once.
  - The maximum number of generated records equals the size of your seed dataset.
- `with_replacement=True`: Records can be sampled multiple times.
  - Useful for large-scale generation from small seed datasets.
  - Allows the same seed record to influence multiple generated records.
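The two modes correspond to the standard distinction between sampling without and with replacement, sketched below with Python's `random` module (illustrative only, not the service's implementation):

```python
import random

seed_records = ["r1", "r2", "r3"]
random.seed(0)

# with_replacement=False: each record is drawn at most once, so the number of
# generated records is capped at len(seed_records).
without_replacement = random.sample(seed_records, k=len(seed_records))

# with_replacement=True: the same record can be drawn repeatedly, so a small
# seed dataset can drive a much larger generation run.
with_replacement = random.choices(seed_records, k=10)

print(len(without_replacement))  # 3
print(len(with_replacement))     # 10
```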
Step 3: Reference Seed Data in Your Prompt Templates#
Once configured, you can reference any column from your seed dataset in your LLM prompts:
```python
config_builder.add_column(
    C.LLMTextColumn(
        name="generated_content",
        prompt=(
            "Based on the following data: "
            "column_name: {{ column_in_your_seed_dataset }} "
            "Generate a new, more awesome record."
        ),
        model_alias="your-model-alias",
    )
)
```
For end-to-end examples, we recommend following along with the Tutorial Notebooks.