Preview Data Generation#

Generate a small sample dataset to validate your configuration before creating a full-scale batch job.

Prerequisites#

Before you can preview data generation, make sure that you have:

  • Obtained the base URL of your NeMo Data Designer service

  • Prepared your data generation configuration, including model configurations and column definitions

  • Set the NEMO_MICROSERVICES_BASE_URL environment variable to your NeMo Data Designer service endpoint

export NEMO_MICROSERVICES_BASE_URL="https://your-data-designer-service-url"

To Preview Data Generation#

Preview mode generates a small dataset (10 records by default, configurable up to the service's maximum) to help you validate your design before running a full batch job.

Choose one of the following options to preview data generation.

Python SDK

import os
from nemo_microservices import NeMoMicroservices

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ['NEMO_MICROSERVICES_BASE_URL']
)

# Create preview request
preview_response = client.data_designer.preview(
    config={
        "model_configs": [
            {
                "alias": "main_model",
                "model": "meta/llama-3.3-70b-instruct",
                "inference_parameters": {
                    "temperature": 0.5,
                    "top_p": 1.0,
                    "max_tokens": 1024
                },
            }
        ],
        "columns": [
            {
                "name": "topic",
                "sampler_type": "category",
                "params": {"values": ["Technology", "Science", "Health"]}
            },
            {
                "name": "article",
                "model_alias": "main_model",
                "prompt": "Write a short article about {{ topic }}",
            },
        ],
    },
    num_records=10  # Optional: defaults to service configuration
)

# Process streaming JSONL response
for message in preview_response:
    if message.message_type == "dataset":
        print(f"Generated dataset: {message.message}")
    elif message.message_type == "analysis":
        print(f"Dataset analysis: {message.message}")
    elif message.message_type == "log":
        print(f"[{message.extra.get('level', 'info')}] {message.message}")

REST API (curl)

curl -X POST \
  "${NEMO_MICROSERVICES_BASE_URL}/v1/data-designer/preview" \
  -H 'accept: application/jsonl' \
  -H 'Content-Type: application/json' \
  -d '{
    "config": {
        "model_configs": [
            {
                "alias": "main_model",
                "model": "meta/llama-3.3-70b-instruct",
                "inference_parameters": {
                    "temperature": 0.5,
                    "top_p": 1.0,
                    "max_tokens": 1024
                }
            }
        ],
        "columns": [
            {
                "name": "topic",
                "sampler_type": "category",
                "params": {
                    "values": ["Technology", "Science", "Health"]
                }
            },
            {
                "name": "article",
                "model_alias": "main_model",
                "prompt": "Write a short article about {{ topic }}"
            }
        ]
    },
    "num_records": 10
  }'

Example Streaming JSONL Response

The preview endpoint returns a streaming JSONL response with multiple message types:

{"message":"Starting preview job","message_type":"log","extra":{"level":"debug"}}
{"message":"🩺 Running health checks for models...","message_type":"log","extra":{"level":"info"}}
{"message":"⏳ Processing batch 1 of 1","message_type":"log","extra":{"level":"info"}}
{"message":"🎲 Preparing samplers to generate 10 records across 2 columns","message_type":"log","extra":{"level":"info"}}
{"message":"[{\"topic\": \"Technology\", \"article\": \"...\"}, ...]","message_type":"dataset","extra":null}
{"message":"📐 Measuring dataset column statistics:","message_type":"log","extra":{"level":"info"}}
{"message":"{\"num_records\": 10, \"target_num_records\": 10, \"column_statistics\": [...]}","message_type":"analysis","extra":null}
{"message":"Preview job ended","message_type":"log","extra":{"level":"debug"}}

Message Types (a parsing sketch follows this list):

  • log: Execution progress and debugging information

  • dataset: The generated dataset as a JSON array

  • analysis: Statistical analysis of the generated data

  • heartbeat: Periodic status updates during long-running operations
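
If you call the preview endpoint directly over REST (as in the curl example above) instead of using the Python client, you can parse the stream yourself. The following is a minimal sketch using the requests library; it assumes the same request body as the curl example, and it relies on the fact, shown in the sample response above, that dataset and analysis messages carry their payloads as JSON-encoded strings.

import json
import os

import requests

BASE_URL = os.environ["NEMO_MICROSERVICES_BASE_URL"]

# Same request body as the curl example above.
payload = {
    "config": {
        "model_configs": [
            {
                "alias": "main_model",
                "model": "meta/llama-3.3-70b-instruct",
                "inference_parameters": {"temperature": 0.5, "top_p": 1.0, "max_tokens": 1024},
            }
        ],
        "columns": [
            {
                "name": "topic",
                "sampler_type": "category",
                "params": {"values": ["Technology", "Science", "Health"]},
            },
            {
                "name": "article",
                "model_alias": "main_model",
                "prompt": "Write a short article about {{ topic }}",
            },
        ],
    },
    "num_records": 10,
}

# Stream the JSONL response line by line and dispatch on message_type.
with requests.post(
    f"{BASE_URL}/v1/data-designer/preview",
    json=payload,
    headers={"accept": "application/jsonl"},
    stream=True,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        message = json.loads(line)
        if message["message_type"] == "dataset":
            # The dataset message contains the records as a JSON-encoded array.
            records = json.loads(message["message"])
            print(f"Received {len(records)} preview records")
        elif message["message_type"] == "analysis":
            print(f"Column statistics: {message['message']}")
        elif message["message_type"] == "log":
            level = (message.get("extra") or {}).get("level", "info")
            print(f"[{level}] {message['message']}")
        # heartbeat messages can be ignored or used to confirm the stream is alive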

Tip

Simplified Preview with NeMoDataDesignerClient

Instead of manually processing streaming responses, use the NeMoDataDesignerClient wrapper for a more convenient approach:

import os
from nemo_microservices.data_designer.essentials import (
    CategorySamplerParams,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMTextColumnConfig,
    ModelConfig,
    NeMoDataDesignerClient,
    SamplerColumnConfig,
    SamplerType,
)

# Create a configuration builder with your model
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        ModelConfig(
            alias="main-model",
            model="meta/llama-3.3-70b-instruct",
            inference_parameters=InferenceParameters(
                temperature=0.90,
                top_p=0.99,
                max_tokens=2048,
            ),
        ),
    ]
)

# Add columns to define your data structure
config_builder.add_column(
    SamplerColumnConfig(
        name="topic",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Technology", "Science", "Health"]
        )
    )
)

config_builder.add_column(
    LLMTextColumnConfig(
        name="article",
        prompt="Write a short article about {{ topic }}",
        model_alias="main-model"
    )
)

# Initialize wrapper
data_designer_client = NeMoDataDesignerClient(
    base_url=os.environ['NEMO_MICROSERVICES_BASE_URL']
)

# Generate preview - automatically handles streaming and returns results
preview_result = data_designer_client.preview(config_builder, num_records=10)

# Display a sample record
preview_result.display_sample_record()

# Access the preview dataset as pandas DataFrame
df = preview_result.dataset
print(f"Generated {len(df)} records")
print(df.head())

This wrapper automatically handles the streaming JSONL response and provides convenient methods for inspecting results.

Understanding Preview Results#

Preview results include three key components:

  1. Generated Dataset: The actual synthetic data records produced

  2. Column Statistics: Distribution analysis for each column

  3. Execution Logs: Progress tracking and model usage information

Use these results to:

  • Verify that column configurations produce expected data

  • Check data quality and distributions before scaling up (see the sketch after this list)

  • Iterate on your configuration without waiting for large batch jobs
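
For example, with the pandas DataFrame returned by the NeMoDataDesignerClient wrapper above, a quick sketch for checking category balance and rough text quality (column names come from the earlier example configuration):

# Continuing from the NeMoDataDesignerClient example above.
df = preview_result.dataset

# Verify the category sampler produced the expected distribution of topics.
print(df["topic"].value_counts())

# Spot-check the generated articles, e.g. their length in characters.
print(df["article"].str.len().describe())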

Next Steps#

After verifying your configuration with preview, create a full-scale batch job to generate your complete dataset.