Preview Data Generation

Generate a small sample dataset to validate your configuration before creating a full-scale batch job.

Prerequisites

Before you can preview data generation, make sure that you have:

  • Obtained the base URL of your NeMo Data Designer service

  • Prepared your data generation configuration including:

  • Set the NEMO_MICROSERVICES_BASE_URL environment variable to your NeMo Data Designer service endpoint

export NEMO_MICROSERVICES_BASE_URL="https://your-data-designer-service-url"
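The client example later on this page reads this variable with `os.environ[...]`, which raises a bare `KeyError` if the variable is unset. A minimal sketch of a fail-fast check follows, assuming you would rather see an explicit message. The helper name `get_base_url` is hypothetical and not part of the NeMo Data Designer client, and stripping the trailing slash is an optional convenience rather than a documented requirement.

```python
import os


def get_base_url(env_var: str = "NEMO_MICROSERVICES_BASE_URL") -> str:
    """Return the service base URL from the environment (hypothetical helper)."""
    base_url = os.environ.get(env_var, "").strip()
    if not base_url:
        # Fail early with a message that names the missing variable.
        raise RuntimeError(
            f"{env_var} is not set; export it before creating the client."
        )
    # Optional normalization (an assumption): drop any trailing slash.
    return base_url.rstrip("/")
```

The returned value could then be passed as `base_url` when constructing the client.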

To Preview Data Generation

Preview mode generates a small dataset (10 records by default, configurable up to the service’s maximum) so that you can validate your design before running a full batch job.

import os
from nemo_microservices.data_designer.essentials import (
    CategorySamplerParams,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMTextColumnConfig,
    ModelConfig,
    NeMoDataDesignerClient,
    SamplerColumnConfig,
    SamplerType,
)

# Create a configuration builder with your model
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        ModelConfig(
            alias="main-model",
            model="meta/llama-3.3-70b-instruct",
            inference_parameters=InferenceParameters(
                temperature=0.90,
                top_p=0.99,
                max_tokens=2048,
            ),
        ),
    ]
)

# Add columns to define your data structure
config_builder.add_column(
    SamplerColumnConfig(
        name="topic",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Technology", "Science", "Health"]
        )
    )
)

config_builder.add_column(
    LLMTextColumnConfig(
        name="article",
        prompt="Write a short article about {{ topic }}",
        model_alias="main-model"
    )
)

# Initialize client
data_designer_client = NeMoDataDesignerClient(
    base_url=os.environ['NEMO_MICROSERVICES_BASE_URL']
)

# Generate preview - automatically handles streaming and returns results
preview_result = data_designer_client.preview(config_builder, num_records=10)

# Display a sample record
preview_result.display_sample_record()

# Access the preview dataset as a pandas DataFrame
df = preview_result.dataset
print(f"Generated {len(df)} records")
print(df.head())

Understanding Preview Results

Preview results include three key components:

  1. Generated Dataset: The actual synthetic data records produced

  2. Column Statistics: Distribution analysis for each column

  3. Execution Logs: Progress tracking and model usage information

Use these results to:

  • Verify that column configurations produce expected data

  • Check data quality and distributions before scaling up

  • Iterate on your configuration without waiting for large batch jobs
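The checks in the list above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the `records` list is hand-written stand-in data, not real service output, and `summarize_preview` is a hypothetical helper, not part of the client. Because the preview dataset is a pandas DataFrame, you could likely obtain records in the same shape with `preview_result.dataset.to_dict("records")`.

```python
from collections import Counter


def summarize_preview(records, category_column, text_column):
    """Summarize a list of record dicts (hypothetical helper).

    Returns the share of each category value and the indices of records
    whose text column is None, empty, or whitespace-only.
    """
    total = len(records)
    counts = Counter(record[category_column] for record in records)
    shares = {value: count / total for value, count in counts.items()}
    blank_rows = [
        index
        for index, record in enumerate(records)
        if not str(record.get(text_column) or "").strip()
    ]
    return shares, blank_rows


# Stand-in data shaped like the "topic" and "article" columns defined earlier.
records = [
    {"topic": "Technology", "article": "Chips keep shrinking."},
    {"topic": "Science", "article": "A new exoplanet was found."},
    {"topic": "Health", "article": ""},
    {"topic": "Technology", "article": "Batteries are improving."},
]

shares, blank_rows = summarize_preview(records, "topic", "article")
print(shares)      # share of records per topic value
print(blank_rows)  # indices of records with a blank article
```

A heavily skewed share or any blank rows at preview size is a cue to revisit the sampler values or the prompt before scaling up.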

Next Steps

After verifying your configuration with preview: