Preview Data Generation
Generate a small sample dataset to validate your configuration before creating a full-scale batch job.
Prerequisites
Before you can preview data generation, make sure that you have:
Obtained the base URL of your NeMo Data Designer service
Prepared your data generation configuration, including:
Model configurations - Configure model aliases and inference parameters
Column schemas - Define your data column types and parameters
Set the NEMO_MICROSERVICES_BASE_URL environment variable to your NeMo Data Designer service endpoint
export NEMO_MICROSERVICES_BASE_URL="https://your-data-designer-service-url"
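The client code later on this page reads this variable directly from `os.environ`, so a missing value surfaces as a bare `KeyError`. The following is a minimal sketch of a friendlier check that uses only the Python standard library. The helper name `get_base_url` is hypothetical and is not part of the NeMo Data Designer SDK.

```python
import os
from urllib.parse import urlparse


def get_base_url(env_var: str = "NEMO_MICROSERVICES_BASE_URL") -> str:
    """Return the service URL from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of the SDK.
    """
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise RuntimeError(f"{env_var} is not set; export it before running the preview.")

    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise RuntimeError(f"{env_var}={value!r} does not look like an http(s) URL.")

    # Drop a trailing slash so the value is always in the same form.
    return value.rstrip("/")
```

The returned string can be passed wherever the example passes `os.environ['NEMO_MICROSERVICES_BASE_URL']`.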
To Preview Data Generation
Preview mode generates a small dataset (10 records by default, configurable up to the service’s maximum) so that you can validate your design before running a full batch job.
import os

from nemo_microservices.data_designer.essentials import (
    CategorySamplerParams,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMTextColumnConfig,
    ModelConfig,
    NeMoDataDesignerClient,
    SamplerColumnConfig,
    SamplerType,
)

# Create a configuration builder with your model
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        ModelConfig(
            alias="main-model",
            model="meta/llama-3.3-70b-instruct",
            inference_parameters=InferenceParameters(
                temperature=0.90,
                top_p=0.99,
                max_tokens=2048,
            ),
        ),
    ]
)

# Add columns to define your data structure
config_builder.add_column(
    SamplerColumnConfig(
        name="topic",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Technology", "Science", "Health"]
        )
    )
)
config_builder.add_column(
    LLMTextColumnConfig(
        name="article",
        prompt="Write a short article about {{ topic }}",
        model_alias="main-model"
    )
)

# Initialize client
data_designer_client = NeMoDataDesignerClient(
    base_url=os.environ['NEMO_MICROSERVICES_BASE_URL']
)

# Generate preview - automatically handles streaming and returns results
preview_result = data_designer_client.preview(config_builder, num_records=10)

# Display a sample record
preview_result.display_sample_record()

# Access the preview dataset as pandas DataFrame
df = preview_result.dataset
print(f"Generated {len(df)} records")
print(df.head())
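The `{{ topic }}` placeholder in the `article` prompt refers to the `topic` column, so each record's prompt depends on the category sampled for that record. The sketch below imitates that substitution with the standard library only, to show the kind of prompts the example configuration would produce. It is an illustration, not the service's implementation: `render_prompt` is a hypothetical helper, it handles only simple `{{ name }}` placeholders, and picking categories uniformly with `random.choice` is an assumption about the sampler's default behavior.

```python
import random
import re

# Matches simple placeholders such as {{ topic }} or {{topic}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, record: dict) -> str:
    """Fill simple {{ name }} placeholders from a record (hypothetical helper)."""

    def substitute(match):
        name = match.group(1)
        if name not in record:
            # A typo in a column reference is reported instead of silently ignored.
            raise KeyError(f"prompt references unknown column {name!r}")
        return str(record[name])

    return PLACEHOLDER.sub(substitute, template)


topics = ["Technology", "Science", "Health"]
template = "Write a short article about {{ topic }}"

rng = random.Random(0)  # seeded so repeated runs print the same prompts
for _ in range(3):
    record = {"topic": rng.choice(topics)}  # assumption: uniform sampling
    print(render_prompt(template, record))
```

Rendering prompts locally like this is a cheap way to catch a misspelled column name before sending a preview request.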
Understanding Preview Results
Preview results include three key components:
Generated Dataset: The actual synthetic data records produced
Column Statistics: Distribution analysis for each column
Execution Logs: Progress tracking and model usage information
Use these results to:
Verify that column configurations produce expected data
Check data quality and distributions before scaling up
Iterate on your configuration without waiting for large batch jobs
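One way to make these checks repeatable is a small script that runs against each preview. The sketch below works on plain record dictionaries, which can be obtained from the preview DataFrame with pandas' `df.to_dict("records")`. The function name `check_preview`, the `min_chars` threshold, and the sample records are hypothetical; the column names match the `topic` and `article` columns from the example above.

```python
from collections import Counter


def check_preview(records, allowed_topics, min_chars=50):
    """Summarize a preview: topic counts plus a list of problems found."""
    counts = Counter(r.get("topic") for r in records)
    problems = []

    for i, r in enumerate(records):
        if r.get("topic") not in allowed_topics:
            problems.append(f"record {i}: unexpected topic {r.get('topic')!r}")
        article = (r.get("article") or "").strip()
        if len(article) < min_chars:
            problems.append(f"record {i}: article shorter than {min_chars} characters")

    # Flag categories that never appeared in this preview.
    missing = [t for t in allowed_topics if counts[t] == 0]
    if missing:
        problems.append(f"topics never sampled: {missing}")

    return counts, problems


# Hypothetical records standing in for df.to_dict("records")
sample = [
    {"topic": "Technology", "article": "x" * 120},
    {"topic": "Science", "article": ""},
]
counts, problems = check_preview(sample, ["Technology", "Science", "Health"])
print(dict(counts))
for problem in problems:
    print(problem)
```

With only a handful of preview records, a category that never appears is not necessarily a configuration error, so treat that message as a prompt to look closer rather than as a failure.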
Next Steps
After verifying your configuration with preview:
Create a full data generation job to produce larger datasets
Configure model parameters to adjust generation quality
Add constraints to check generated data