The Basics
This tutorial demonstrates the fundamentals of Data Designer by generating a product review dataset.
For more detail, see the open-source library’s version of this tutorial.
Prerequisites
Ensure you have set up inference by creating a model provider for build.nvidia.com.
Part 1: Build the Configuration
Use the data_designer.config package to define your dataset schema. This code is identical whether you’re using the standalone library or the NMP service.
Tip
Already using the standalone library? This configuration code is identical. You can copy your existing config_builder code directly; only the execution step (Part 2) differs.
Define Models
Start by defining the models you want to use:
import data_designer.config as dd
MODEL_ALIAS = "text"
model_configs = [
dd.ModelConfig(
provider="system/nvidia-build",
model="nvidia/nemotron-3-nano-30b-a3b", # Use the `served_model_name` from the provider
alias=MODEL_ALIAS,
inference_parameters=dd.ChatCompletionInferenceParams(
temperature=1.0,
top_p=1.0,
),
)
]
config_builder = dd.DataDesignerConfigBuilder(model_configs)
Add Columns
Define the columns for your dataset. The library documentation explains these column types in detail.
# Product category sampler
config_builder.add_column(
dd.SamplerColumnConfig(
name="product_category",
sampler_type=dd.SamplerType.CATEGORY,
params=dd.CategorySamplerParams(
values=[
"Electronics",
"Clothing",
"Home & Kitchen",
"Books",
"Home Office",
],
),
)
)
# Product subcategory sampler (conditional on category)
config_builder.add_column(
dd.SamplerColumnConfig(
name="product_subcategory",
sampler_type=dd.SamplerType.SUBCATEGORY,
params=dd.SubcategorySamplerParams(
category="product_category",
values={
"Electronics": [
"Smartphones",
"Laptops",
"Headphones",
"Cameras",
"Accessories",
],
"Clothing": [
"Men's Clothing",
"Women's Clothing",
"Winter Coats",
"Activewear",
"Accessories",
],
"Home & Kitchen": [
"Appliances",
"Cookware",
"Furniture",
"Decor",
"Organization",
],
"Books": [
"Fiction",
"Non-Fiction",
"Self-Help",
"Textbooks",
"Classics",
],
"Home Office": [
"Desks",
"Chairs",
"Storage",
"Office Supplies",
"Lighting",
],
},
),
)
)
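Conceptually, the subcategory sampler draws the parent category first and then samples only from that category's list, so every row stays internally consistent. Here is a minimal stdlib sketch of that idea (illustrative only, not Data Designer's implementation):

```python
import random

# A trimmed version of the category-to-subcategory mapping above.
subcategories = {
    "Electronics": ["Smartphones", "Laptops", "Headphones"],
    "Books": ["Fiction", "Non-Fiction", "Classics"],
}

def sample_row(rng: random.Random) -> dict:
    # Draw the parent category first, then condition the subcategory on it.
    category = rng.choice(sorted(subcategories))
    subcategory = rng.choice(subcategories[category])
    return {"product_category": category, "product_subcategory": subcategory}

row = sample_row(random.Random(0))
print(row)
```

Because the subcategory is looked up through the already-sampled category, a row can never pair "Books" with "Smartphones".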
# Target age range
config_builder.add_column(
dd.SamplerColumnConfig(
name="target_age_range",
sampler_type=dd.SamplerType.CATEGORY,
params=dd.CategorySamplerParams(values=["18-25", "25-35", "35-50", "50-65", "65+"]),
)
)
# Customer details using Faker
config_builder.add_column(
dd.SamplerColumnConfig(
name="customer",
sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
params=dd.PersonFromFakerSamplerParams(age_range=[18, 70], locale="en_US"),
)
)
# Star rating
config_builder.add_column(
dd.SamplerColumnConfig(
name="number_of_stars",
sampler_type=dd.SamplerType.UNIFORM,
params=dd.UniformSamplerParams(low=1, high=5),
convert_to="int", # Convert the sampled float to an integer
)
)
# Review style
config_builder.add_column(
dd.SamplerColumnConfig(
name="review_style",
sampler_type=dd.SamplerType.CATEGORY,
params=dd.CategorySamplerParams(
values=["rambling", "brief", "detailed", "structured with bullet points"],
weights=[1, 2, 2, 1],
),
)
)
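The weights above are relative frequencies: "brief" and "detailed" reviews are sampled roughly twice as often as the other two styles. A stdlib sketch of weighted category sampling (illustrative only, not Data Designer's internals):

```python
import random
from collections import Counter

values = ["rambling", "brief", "detailed", "structured with bullet points"]
weights = [1, 2, 2, 1]

# Draw many samples and count how often each style appears.
rng = random.Random(42)
draws = Counter(rng.choices(values, weights=weights, k=6000))
print(draws)
```

With weights 1:2:2:1, "brief" should appear roughly twice as often as "rambling" over a large sample.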
# LLM-generated product name
config_builder.add_column(
dd.LLMTextColumnConfig(
name="product_name",
prompt=(
"You are a helpful assistant that generates product names. DO NOT add quotes around the product name.\n\n"
"Come up with a creative product name for a product in the '{{ product_category }}' category, focusing "
"on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is "
"{{ target_age_range }} years old. Respond with only the product name, no other text."
),
model_alias=MODEL_ALIAS,
)
)
# LLM-generated customer review
config_builder.add_column(
dd.LLMTextColumnConfig(
name="customer_review",
prompt=(
"You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
"You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. "
"Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. "
"The style of the review should be '{{ review_style }}'. "
"Respond with only the review, no other text."
),
model_alias=MODEL_ALIAS,
)
)
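The `{{ ... }}` placeholders in the prompts above reference other columns in the same row, including nested fields such as `customer.city`. To make the substitution step concrete, here is a rough stdlib sketch that resolves dotted lookups (Data Designer itself uses a full Jinja-style templating engine; this is only an approximation):

```python
import re

def render(template: str, row: dict) -> str:
    # Replace each {{ path }} with a dotted lookup into the row dict.
    def lookup(match: re.Match) -> str:
        value = row
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)

# A hypothetical sampled row, shaped like the columns defined above.
row = {
    "customer": {"first_name": "Ada", "city": "Austin", "state": "TX", "age": 34},
    "product_name": "AquaPeak Bottle",
    "number_of_stars": 4,
}
prompt = render(
    "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
    "You gave {{ product_name }} a rating of {{ number_of_stars }} stars.",
    row,
)
print(prompt)
```

Every placeholder resolves against the row's already-generated columns, which is why the LLM columns must reference columns defined earlier in the configuration.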
Part 2: Execute on NMP
Now submit your configuration to the Data Designer service for execution.
Creating a Client
The DataDesignerResource is your interface to the Data Designer service. You can access it from an existing SDK instance:
import os
from nemo_platform import NeMoPlatform
base_url = os.environ.get("NMP_BASE_URL", "http://localhost:8080")
sdk = NeMoPlatform(base_url=base_url, workspace="default")
data_designer = sdk.data_designer
Previewing the Dataset
Use the preview method for rapid iteration. Generate a small sample, inspect the results, adjust your configuration, and repeat:
preview = data_designer.preview(config_builder)
# Display a random sample record
preview.display_sample_record()
# Access the full preview dataset as a pandas DataFrame
df = preview.dataset
print(df.head())
# View statistical analysis
preview.analysis.to_report()
More about preview results
The PreviewResults object returned by sdk.data_designer.preview stores all its fields in memory; nothing is persisted to disk by default.
Use standard Python methods to save any preview data you want to keep.
For example, the dataset is a regular pandas DataFrame and can be saved to disk via methods like to_csv or to_parquet.
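For instance, the DataFrame can be written out and reloaded like this (toy data stands in for an actual preview result):

```python
from pathlib import Path

import pandas as pd

# Stand-in for preview.dataset, which is a regular pandas DataFrame.
df = pd.DataFrame({"product_name": ["AquaPeak Bottle"], "number_of_stars": [4]})

# Persist the preview to disk, then reload it to confirm the round trip.
out = Path("preview_sample.csv")
df.to_csv(out, index=False)
reloaded = pd.read_csv(out)
print(reloaded.shape)
```

Use to_parquet instead of to_csv if you want to preserve column dtypes exactly (this requires a parquet engine such as pyarrow).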
Iterate: Adjust column configurations, prompts, or parameters in your config_builder, then run preview again until you’re satisfied with the results.
Scaling Up with Jobs
When you’re happy with the preview, submit a larger generation job:
# Generate 30 records to keep the demo fast. Happy with the output? Scale up num_records!
job = data_designer.create(config_builder, num_records=30)
# Block until the job completes
job.wait_until_done()
# Download the generated artifacts
results = job.download_artifacts()
# Load the dataset as a pandas DataFrame
dataset = results.load_dataset()
print(dataset.head())
# Load the full analysis report
analysis = results.load_analysis()
analysis.to_report()
More about job results
The Data Designer library writes several artifacts to disk when running a full generation job, including the final dataset as parquet.
When a Data Designer job runs on NMP, the entire working directory of artifacts produced by the library is saved as a job result.
The download_artifacts method downloads this artifacts directory (stored in NMP as a .tar.gz archive),
unarchives it, and returns a DataDesignerJobResults object that can be used to load results into memory as DataFrames or other objects for programmatic inspection.
By default, download_artifacts saves the artifacts to a relative local directory named after the job.
An alternative path can be passed to download_artifacts.
What Happens Under the Hood
When you submit a job to the Data Designer service:
Configuration Validation: The service validates your configuration and resolves column dependencies
Job Creation: A job is created and queued for execution
Distributed Execution: The service orchestrates generation across multiple workers
Inference Routing: All LLM calls are routed through the Inference Gateway to your configured model providers
Artifact Storage: Generated datasets and analysis reports are stored in NMP artifact storage
Job Completion: You can monitor job status and load results when complete
Next Steps
Seed data: Learn how to use external datasets in the seeding tutorial
Column types: Explore all available column types in the library documentation
Advanced features: Learn about processors and validation