Generating Data

Data Generation in Data Designer

Bringing Your Data Design to Life

Once you’ve set up your Data Designer with appropriate seeds and column definitions, you’re ready for the exciting part: generating data! This guide explains how to preview your design, create full datasets, and access your generated data.

The Data Generation Process

Data Designer follows this straightforward workflow when generating data:

1. Design Phase: Define your data schema by adding columns and establishing their relationships.
2. Preview Phase: Generate a small sample for validation.
3. Iteration Phase: Refine your design based on preview results.
4. Batch Generation: Scale up to create large datasets.

Note

You must create at least one non-LLM-generated column before you can create an LLM-generated column. This enforces a best practice of synthetic data generation: seeding the LLM with sampled values so that its output is diverse and high quality.

Understanding the Generation Workflow

1. Design Phase

During this first phase, you define what data you want to generate by adding columns, setting up relationships, and establishing constraints.

Key activities:

- Adding columns of various types (sampling-based, LLM-based)
- Setting up person samplers
- Defining constraints between columns
- Creating templates that reference other columns

Data Designer automatically analyzes your column definitions to determine the correct generation order based on how columns reference each other.
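
To make the idea of reference-based ordering concrete, here is a minimal sketch using only the Python standard library. It is not Data Designer's actual implementation. The `generation_order` helper and the `summary` column are hypothetical, and the sketch assumes references always take the simple `{{ column_name }}` form.

```python
import re
from graphlib import TopologicalSorter

# Matches simple Jinja-style references such as {{ topic }}.
REFERENCE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def generation_order(columns: dict[str, str]) -> list[str]:
    """Order column names so each column comes after the columns it references.

    `columns` maps a column name to its prompt template. Columns that
    reference nothing (for example, samplers) can use an empty string.
    Hypothetical helper for illustration only.
    """
    graph = {}
    for name, template in columns.items():
        refs = set(REFERENCE.findall(template))
        unknown = refs - set(columns)
        if unknown:
            raise ValueError(f"{name!r} references undefined columns: {sorted(unknown)}")
        graph[name] = refs  # the columns that must be generated first
    # static_order() raises graphlib.CycleError if columns reference each other in a loop.
    return list(TopologicalSorter(graph).static_order())


columns = {
    "summary": "Summarize this article: {{ article }}",  # hypothetical extra column
    "article": "Write a short article about {{ topic }}",
    "topic": "",  # sampler column, no references
}
order = generation_order(columns)
print(order)  # ['topic', 'article', 'summary']
```

The order in which columns are declared does not matter here: `summary` is declared first but generated last, because it depends on `article`, which in turn depends on `topic`.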

Tip

Streamlined Workflow with DataDesignerClient

For a simplified approach to the data generation workflow, consider using the DataDesignerClient wrapper:

import os
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import DataDesignerClient, DataDesignerConfigBuilder
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P


# Create a configuration builder with your model
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias="main-model",
            model=P.Model(
                api_endpoint=P.ApiEndpoint(
                    model_id="meta/llama-3.3-70b-instruct",
                    url="https://integrate.api.nvidia.com/v1",
                    api_key="your-api-key",
                ),
            ),
            inference_parameters=P.InferenceParameters(
                temperature=0.90,
                top_p=0.99,
                max_tokens=2048,
            ),
        ),
    ]
)

# Add some columns to your configuration
config_builder.add_column(
    C.SamplerColumn(
        name="topic",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Technology", "Science", "Health"]
        )
    )
)

config_builder.add_column(
    C.LLMTextColumn(
        name="article",
        prompt="Write a short article about {{ topic }}",
        model_alias="main-model"
    )
)

# Initialize wrapper
data_designer_client = DataDesignerClient(
    client=NeMoMicroservices(base_url=os.environ['NEMO_MICROSERVICES_BASE_URL'])
)

# Preview with enhanced logging
preview_results = data_designer_client.preview(config_builder, verbose_logging=True)

# Generate and wait for completion in one call
job_results = data_designer_client.create(config_builder, num_records=1000, wait_until_done=True)
df = job_results.load_dataset()

This approach combines all phases into a more streamlined workflow while maintaining the same underlying functionality.

2. Preview Phase

The preview phase generates a small dataset (around 10 records) to help you validate your design:

# Validate that you have the right config
config_builder.validate()

# Generate a preview
preview_result = data_designer_client.preview(config_builder, verbose_logging=True)

This quick process lets you see your design in action without waiting for a full dataset generation.

Inspecting Preview Results

Data Designer provides several ways to examine your preview results:

# Method 1: Display a sample record with formatted output
preview_result.display_sample_record()

# Method 2: Access the preview dataset as a pandas DataFrame
preview_df = preview_result.dataset

These inspection methods help you assess whether your design is producing the expected data. You’ll often go through multiple design-preview-iterate cycles before you’re ready to generate a full dataset.
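
Beyond eyeballing a sample record, a few quick counts can flag problems early. The sketch below is one possible approach. The `summarize_records` helper is hypothetical, and the rows are stand-in data so the example runs without a service. With a real preview, you could obtain the same structure from the DataFrame with `preview_result.dataset.to_dict(orient="records")`.

```python
from collections import Counter


def summarize_records(records: list[dict], category_column: str, text_column: str) -> dict:
    """Collect a few quick health indicators from preview records.

    Hypothetical helper for illustration; not part of Data Designer.
    """
    texts = [record.get(text_column) for record in records]
    non_empty = [t for t in texts if isinstance(t, str) and t.strip()]
    return {
        "num_records": len(records),
        # How often each sampled category appears.
        "category_counts": dict(Counter(r.get(category_column) for r in records)),
        # Missing, non-string, or whitespace-only generations.
        "empty_text": len(texts) - len(non_empty),
        # Non-empty generations that repeat an earlier one exactly.
        "duplicate_text": len(non_empty) - len(set(non_empty)),
    }


# Stand-in rows shaped like the topic/article example in this guide.
records = [
    {"topic": "Science", "article": "Cells divide."},
    {"topic": "Health", "article": ""},
    {"topic": "Science", "article": "Cells divide."},
]
summary = summarize_records(records, category_column="topic", text_column="article")
print(summary)
```

Empty or repeated text in a preview is often a hint that a prompt or its seed columns need adjusting before scaling up.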

3. Iteration Phase

Based on preview results, you can refine your design by modifying columns, adjusting parameters, or changing templates:

# Modify a column definition. Adding a new column with a previously used name
# will replace the old config with the new one.
config_builder.add_column(
    C.LLMTextColumn(
        name="article",
        prompt="Write a more detailed article about topic {{ topic }} in about 10 sentences",
        model_alias="main-model"
    )
)

# Preview again
preview_result = data_designer_client.preview(config_builder)

This iterative cycle helps you optimize your design before generating a full dataset.

4. Batch Generation

Once your design meets your requirements, you can scale up to create a full dataset:

# Generate the full dataset
job_result = data_designer_client.create(
    config_builder,
    num_records=1000
)

Parameters for Batch Generation

num_records: The number of records to generate.
wait_until_done: Whether the call blocks while the job runs.
- True: The function blocks until the job completes.
- False: The function returns immediately, and you can check the status later.

Checking Job Status

If you didn’t wait for completion, you can check the status later:

# Check if the job is completed
print(f"Job status: {job_result.get_job_status()}")

# Fetch the ID of the job
print(f"Job ID: {job_result.get_job().id}")
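
If you would rather poll on your own schedule than block, a generic loop like the following sketch may help. The `poll_until` helper is hypothetical, and the status names `"completed"` and `"error"` are assumptions made for illustration. Check what `get_job_status()` actually returns in your environment before relying on them.

```python
import time
from typing import Callable, Collection


def poll_until(
    get_status: Callable[[], str],
    terminal_statuses: Collection[str],
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Call get_status() until it returns a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        time.sleep(interval_seconds)


# Stand-in status source so the sketch runs without a service.
fake_statuses = iter(["pending", "running", "completed"])
final_status = poll_until(
    lambda: next(fake_statuses),
    terminal_statuses={"completed", "error"},  # assumed status names
    interval_seconds=0.01,
)
print(final_status)
```

With a real job, the status source could be something like `lambda: str(job_result.get_job_status())`, again assuming the status converts cleanly to one of the strings you list as terminal.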

To block until a running job completes, call wait_until_done() on the job result:

job_result.wait_until_done()

After successful generation, you can access your data as follows:

# Access the generated dataset as a pandas DataFrame
generated_df = job_result.load_dataset()

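If you didn’t wait for completion, or you need to reconnect to a previous job from a new session, load its results by job ID:

import os
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import DataDesignerClient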

# Initialize Data Designer client
data_designer_client = DataDesignerClient(
    client=NeMoMicroservices(base_url=os.environ['NEMO_MICROSERVICES_BASE_URL'])
)

# Get the job results by job ID
job_result = data_designer_client.load_job_results("your-job-id")

# Wait for completion if it's still running
job_result.wait_until_done()

# Access the dataset
generated_df = job_result.load_dataset()

Saving Your Data Designer Object

You can capture your Data Designer object as a configuration dictionary by building the config and calling to_dict():

config = config_builder.build().to_dict()

You can create a new Data Designer object from an existing config as follows:

new_config_builder = DataDesignerConfigBuilder.from_config(config)
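
To keep a configuration between sessions, one option is to write the dictionary to a JSON file. This is a sketch that assumes the output of `to_dict()` contains only JSON-serializable values. The `save_config` and `load_config` helpers are hypothetical, and `example_config` is a stand-in that does not reflect the real configuration schema.

```python
import json
import tempfile
from pathlib import Path


def save_config(config: dict, path: Path) -> None:
    """Write a configuration dictionary to disk as indented JSON."""
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")


def load_config(path: Path) -> dict:
    """Read a configuration dictionary back from a JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))


# Stand-in dictionary; in practice this would be config_builder.build().to_dict().
example_config = {"columns": [{"name": "topic", "values": ["Technology", "Science", "Health"]}]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data_designer_config.json"
    save_config(example_config, path)
    restored = load_config(path)

print(restored == example_config)  # True
```

The restored dictionary can then be passed to `DataDesignerConfigBuilder.from_config(...)` as shown above.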