Seeding SDG with External Data#

The most effective way to generate synthetic data that matches your specific domain is to seed the synthetic data generation (SDG) process with your existing (real-world) datasets. By providing real data as a foundation, you can steer the generation process to ensure the synthetic data maintains the patterns, distributions, and characteristics of your actual data.

How Seeding Works in Data Designer#

When you seed Data Designer with an external dataset, each record in your seed dataset becomes available as context during the generation process. You can reference any column from the seed data in your prompts using Jinja templating, allowing the LLM to understand and replicate the patterns in your real data.
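For intuition, the templating step can be approximated in plain Python. Data Designer itself uses Jinja templating; the regex-based `render_prompt` helper and the seed record below are purely illustrative, not part of the client API:

```python
import re

def render_prompt(template: str, seed_record: dict) -> str:
    """Replace {{ column }} placeholders with values from one seed record."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(seed_record[m.group(1)]),
        template,
    )

# Hypothetical seed record and prompt template for illustration.
seed_record = {"product_name": "Widget Pro", "category": "hardware"}
template = "Write a review for {{ product_name }} in the {{ category }} category."

print(render_prompt(template, seed_record))
# Write a review for Widget Pro in the hardware category.
```

During generation, each sampled seed record fills the placeholders this way, so the LLM sees your real data as grounding context in every prompt.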

Note

As part of deployment, the Data Designer microservice is configured with a source for seed datasets. This source must be a service compatible with the Hugging Face Hub API, such as the public Hugging Face Hub or NeMo Data Store. Currently, only one source is supported.

See Deployment Configuration for details on configuring this source.

Below we outline the steps to seed your generation process with an external dataset.

Step 1: Upload Your Seed Dataset to the Datastore#

Your seed dataset must exist in the configured seed dataset source before it can be used in Data Designer workloads. See the Manage Entities section for information about setting up a datastore.

Use the upload_seed_dataset() method of the NeMoDataDesignerClient to upload a pandas DataFrame or a CSV, Parquet, or JSON file to the datastore:

import os

from nemo_microservices.data_designer.essentials import (
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    NeMoDataDesignerClient,
)

# `NEMO_MICROSERVICES_BASE_URL` is `http://localhost:8080` if you followed the
# quickstart deployment guide and deployed Data Designer on your localhost
# using Docker Compose.
data_designer_client = NeMoDataDesignerClient(
    base_url=os.environ["NEMO_MICROSERVICES_BASE_URL"]
)

# Upload your dataset to the datastore
# You can pass a pandas DataFrame, file path str, or Path object
seed_dataset_reference = data_designer_client.upload_seed_dataset(
    dataset="path/to/your/seed_dataset.parquet",
    repo_id="your-org/your-repo",
    datastore_settings={"endpoint": os.environ["NEMO_MICROSERVICES_DATASTORE_ENDPOINT"]},
)

The upload_seed_dataset() method returns a SeedDatasetReference object that you will use in the next step to configure your seed dataset.
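As noted in the snippet above, the `dataset` argument also accepts a pandas DataFrame or a CSV, Parquet, or JSON file path. For example, a small seed file can be assembled with the standard library before uploading; the column names here are hypothetical:

```python
import csv
import tempfile
from pathlib import Path

# Write a toy seed dataset to a CSV file. The columns are illustrative;
# the resulting path could then be passed as `dataset=` to
# upload_seed_dataset() as shown above.
records = [
    {"product_name": "Widget Pro", "category": "hardware"},
    {"product_name": "Gizmo Lite", "category": "software"},
]
seed_path = Path(tempfile.mkdtemp()) / "seed_dataset.csv"
with seed_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["product_name", "category"])
    writer.writeheader()
    writer.writerows(records)
```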

Step 2: Configure the Seed Dataset#

Use the with_seed_dataset() method of the DataDesignerConfigBuilder to configure your seed dataset:

config_builder = DataDesignerConfigBuilder(model_configs="path/to/your/model_configs.yaml")

# Use the seed dataset reference from the upload step
config_builder.with_seed_dataset(
    dataset_reference=seed_dataset_reference,
    sampling_strategy="shuffle",
)

The sampling_strategy argument controls how your dataset is sampled during the generation process:

Sampling Strategy#

Options for the sampling_strategy argument:

  • "ordered" (default): Maintains the original order of records

    • Useful when your seed data has meaningful sequences

    • Records are used in the same order as they appear in the dataset

  • "shuffle": Randomly shuffles before sampling

    • Useful when there are no correlations between records

    • Prevents predictable patterns in generated data
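The difference between the two strategies can be sketched in plain Python. The actual sampling is performed by the Data Designer service; the `sample` helper and its cycling behavior are assumptions made purely for illustration:

```python
import itertools
import random

# Toy seed dataset of four records.
seed_records = ["r1", "r2", "r3", "r4"]

def sample(records, n, strategy="ordered", rng=None):
    """Return n seed records, cycling through the dataset if n exceeds its size."""
    pool = list(records)
    if strategy == "shuffle":
        (rng or random).shuffle(pool)
    return list(itertools.islice(itertools.cycle(pool), n))

print(sample(seed_records, 6))
# ordered (default): ['r1', 'r2', 'r3', 'r4', 'r1', 'r2']
print(sample(seed_records, 6, strategy="shuffle"))
# shuffled: same records, randomized order
```

With "ordered", generated records track the sequence of your seed data; with "shuffle", each run draws seed records in a different order, which avoids predictable patterns in the output.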

Step 3: Reference Seed Data in Your Prompt Templates#

Once configured, you can reference any column from your seed dataset in your LLM prompts:

config_builder.add_column(
    LLMTextColumnConfig(
        name="generated_content",
        prompt=(
            "Based on the following data: "
            "column_name: {{ column_in_your_seed_dataset }} "
            "Generate a new, more awesome record."
        ),
        model_alias="your-model-alias",
    )
)

For end-to-end examples, we recommend following along with the Tutorial Notebooks.