Preview Data Generation#
Generate a small sample dataset to validate your configuration before creating a full-scale batch job.
Prerequisites#
Before you can preview data generation, make sure that you have:
Obtained the base URL of your NeMo Data Designer service
Prepared your data generation configuration including:
Model configurations - Configure model aliases and inference parameters
Column schemas - Define your data column types and parameters
Set the NEMO_MICROSERVICES_BASE_URL environment variable to your NeMo Data Designer service endpoint:
export NEMO_MICROSERVICES_BASE_URL="https://your-data-designer-service-url"
To Preview Data Generation#
Preview mode generates a small dataset (10 records by default, configurable up to the service's maximum) so that you can validate your configuration before running a full batch job.
Choose one of the following options to preview data generation.
import os
from nemo_microservices import NeMoMicroservices
# Initialize the client
client = NeMoMicroservices(
base_url=os.environ['NEMO_MICROSERVICES_BASE_URL']
)
# Create preview request
preview_response = client.data_designer.preview(
config={
"model_configs": [
{
"alias": "main_model",
"model": "meta/llama-3.3-70b-instruct",
"inference_parameters": {
"temperature": 0.5,
"top_p": 1.0,
"max_tokens": 1024
},
}
],
"columns": [
{
"name": "topic",
"sampler_type": "category",
"params": {"values": ["Technology", "Science", "Health"]}
},
{
"name": "article",
"model_alias": "main_model",
"prompt": "Write a short article about {{ topic }}",
},
],
},
num_records=10 # Optional: defaults to service configuration
)
# Process streaming JSONL response
for message in preview_response:
if message.message_type == "dataset":
print(f"Generated dataset: {message.message}")
elif message.message_type == "analysis":
print(f"Dataset analysis: {message.message}")
elif message.message_type == "log":
print(f"[{message.extra.get('level', 'info')}] {message.message}")
curl -X POST \
"${NEMO_MICROSERVICES_BASE_URL}/v1/data-designer/preview" \
-H 'accept: application/jsonl' \
-H 'Content-Type: application/json' \
-d '{
"config": {
"model_configs": [
{
"alias": "main_model",
"model": "meta/llama-3.3-70b-instruct",
"inference_parameters": {
"temperature": 0.5,
"top_p": 1.0,
"max_tokens": 1024
}
}
],
"columns": [
{
"name": "topic",
"sampler_type": "category",
"params": {
"values": ["Technology", "Science", "Health"]
}
},
{
"name": "article",
"model_alias": "main_model",
"prompt": "Write a short article about {{ topic }}"
}
]
},
"num_records": 10
}'
Example Streaming JSONL Response
The preview endpoint returns a streaming JSONL response with multiple message types:
{"message":"Starting preview job","message_type":"log","extra":{"level":"debug"}}
{"message":"🩺 Running health checks for models...","message_type":"log","extra":{"level":"info"}}
{"message":"⏳ Processing batch 1 of 1","message_type":"log","extra":{"level":"info"}}
{"message":"🎲 Preparing samplers to generate 10 records across 2 columns","message_type":"log","extra":{"level":"info"}}
{"message":"[{\"topic\": \"Technology\", \"article\": \"...\"}, ...]","message_type":"dataset","extra":null}
{"message":"📐 Measuring dataset column statistics:","message_type":"log","extra":{"level":"info"}}
{"message":"{\"num_records\": 10, \"target_num_records\": 10, \"column_statistics\": [...]}","message_type":"analysis","extra":null}
{"message":"Preview job ended","message_type":"log","extra":{"level":"debug"}}
Message Types:
log: Execution progress and debugging information
dataset: The generated dataset as a JSON array (see the parsing sketch below)
analysis: Statistical analysis of the generated data
heartbeat: Periodic status updates during long-running operations
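If you call the REST endpoint directly rather than using the SDK, you can consume this stream with any HTTP client that reads the response line by line. The following is a minimal sketch using the third-party requests library (an assumption, not part of the service); it reuses the configuration from the curl example above and assumes, as the example response suggests, that dataset and analysis message bodies are themselves JSON-encoded strings.
import json
import os

import requests

# Reuse the same configuration shown in the curl example above.
payload = {
    "config": {
        "model_configs": [
            {
                "alias": "main_model",
                "model": "meta/llama-3.3-70b-instruct",
                "inference_parameters": {"temperature": 0.5, "top_p": 1.0, "max_tokens": 1024},
            }
        ],
        "columns": [
            {
                "name": "topic",
                "sampler_type": "category",
                "params": {"values": ["Technology", "Science", "Health"]},
            },
            {
                "name": "article",
                "model_alias": "main_model",
                "prompt": "Write a short article about {{ topic }}",
            },
        ],
    },
    "num_records": 10,
}

url = f"{os.environ['NEMO_MICROSERVICES_BASE_URL']}/v1/data-designer/preview"

# Stream the JSONL response line by line and dispatch on message_type.
dataset, analysis = None, None
with requests.post(url, json=payload, headers={"accept": "application/jsonl"}, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        message = json.loads(line)
        if message["message_type"] == "dataset":
            # Assumption: the dataset message carries the records as a JSON-encoded array.
            dataset = json.loads(message["message"])
        elif message["message_type"] == "analysis":
            analysis = json.loads(message["message"])  # column statistics
        elif message["message_type"] == "log":
            print(message["message"])

print(f"Received {len(dataset or [])} preview records")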
Tip
Simplified Preview with NeMoDataDesignerClient
Instead of manually processing streaming responses, use the NeMoDataDesignerClient wrapper for a more convenient approach:
import os
from nemo_microservices.data_designer.essentials import (
CategorySamplerParams,
DataDesignerConfigBuilder,
InferenceParameters,
LLMTextColumnConfig,
ModelConfig,
NeMoDataDesignerClient,
SamplerColumnConfig,
SamplerType,
)
# Create a configuration builder with your model
config_builder = DataDesignerConfigBuilder(
model_configs=[
ModelConfig(
alias="main-model",
model="meta/llama-3.3-70b-instruct",
inference_parameters=InferenceParameters(
temperature=0.90,
top_p=0.99,
max_tokens=2048,
),
),
]
)
# Add columns to define your data structure
config_builder.add_column(
SamplerColumnConfig(
name="topic",
sampler_type=SamplerType.CATEGORY,
params=CategorySamplerParams(
values=["Technology", "Science", "Health"]
)
)
)
config_builder.add_column(
LLMTextColumnConfig(
name="article",
prompt="Write a short article about {{ topic }}",
model_alias="main-model"
)
)
# Initialize wrapper
data_designer_client = NeMoDataDesignerClient(
base_url=os.environ['NEMO_MICROSERVICES_BASE_URL']
)
# Generate preview - automatically handles streaming and returns results
preview_result = data_designer_client.preview(config_builder, num_records=10)
# Display a sample record
preview_result.display_sample_record()
# Access the preview dataset as pandas DataFrame
df = preview_result.dataset
print(f"Generated {len(df)} records")
print(df.head())
This wrapper automatically handles the streaming JSONL response and provides convenient methods for inspecting results.
Understanding Preview Results#
Preview results include three key components:
Generated Dataset: The actual synthetic data records produced
Column Statistics: Distribution analysis for each column
Execution Logs: Progress tracking and model usage information
Use these results to:
Verify that column configurations produce expected data
Check data quality and distributions before scaling up, as shown in the sketch below
Iterate on your configuration without waiting for large batch jobs
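For example, once you have the preview dataset as a pandas DataFrame (via preview_result.dataset from the wrapper example above), a few quick checks can confirm that sampler columns follow the intended distribution and that LLM-generated columns are non-empty. This is a minimal sketch using the column names from the earlier examples, not a required validation step.
# Quick sanity checks on the preview dataset (pandas DataFrame from preview_result.dataset).
df = preview_result.dataset

# Sampler column: confirm the category values and their distribution.
print(df["topic"].value_counts())

# LLM-generated column: confirm records are non-empty and inspect length spread.
article_lengths = df["article"].str.len()
print(article_lengths.describe())
assert (article_lengths > 0).all(), "Some generated articles are empty"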
Next Steps#
After verifying your configuration with preview:
Create a full data generation job to produce larger datasets
Configure model parameters to adjust generation quality
Add constraints to check generated data