🎨 Data Designer Tutorial: The Basics
📚 What you'll learn
This notebook demonstrates the basics of Data Designer by generating a simple product review dataset.
📦 Import Data Designer
- `data_designer.config` provides access to the configuration API.
- `DataDesigner` is the main interface for data generation.
⚙️ Initialize the Data Designer interface
- `DataDesigner` is the main object responsible for managing the data generation process.
- When initialized without arguments, the default model providers are used.
🎛️ Define model configurations
- Each `ModelConfig` defines a model that can be used during the generation process.
- The "model alias" is used to reference the model in the Data Designer config (as we will see below).
- The "model provider" is the external service that hosts the model (see the model config docs for more details).
- By default, we use build.nvidia.com as the model provider.
🏗️ Initialize the Data Designer Config Builder
- The Data Designer config defines the dataset schema and generation process.
- The config builder provides an intuitive interface for building this configuration.
- The list of model configs is provided to the builder at initialization.
🎲 Getting started with sampler columns
- Sampler columns offer non-LLM based generation of synthetic data.
- They are particularly useful for steering the diversity of the generated data, as we demonstrate below.
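To make the idea of non-LLM sampling concrete, here is a minimal standard-library sketch of what a category sampler followed by a dependent subcategory sampler does conceptually. It does not use Data Designer; the `SUBCATEGORIES` mapping, the weights, and the `sample_rows` helper are hypothetical names chosen for illustration.

```python
import random

# Illustrative category -> subcategory mapping (an assumption, not the tutorial's config).
SUBCATEGORIES = {
    "Home & Kitchen": ["Furniture", "Decor"],
    "Books": ["Fiction", "Classics"],
    "Home Office": ["Chairs"],
}


def sample_rows(n, weights=None, seed=None):
    """Draw n rows: a (optionally weighted) category, then a subcategory conditioned on it."""
    rng = random.Random(seed)
    categories = list(SUBCATEGORIES)
    rows = []
    for _ in range(n):
        # Weighted draw over the top-level categories.
        category = rng.choices(categories, weights=weights, k=1)[0]
        # The subcategory is only ever drawn from the chosen category's values.
        subcategory = rng.choice(SUBCATEGORIES[category])
        rows.append({"product_category": category, "product_subcategory": subcategory})
    return rows


rows = sample_rows(5, weights=[2, 1, 1], seed=0)
for row in rows:
    print(row)
```

Because the values and weights are fixed up front, the spread of categories in the dataset is controlled directly rather than left to an LLM's preferences.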
You can view available samplers using the config builder's info property:
NeMo Data Designer Samplers

| Type | Parameter | Data Type | Required | Constraints |
|---|---|---|---|---|
| bernoulli | p | number | ✓ | >= 0.0, <= 1.0 |
| | sampler_type | string | | |
| bernoulli_mixture | p | number | ✓ | >= 0.0, <= 1.0 |
| | dist_name | string | ✓ | |
| | dist_params | dict | ✓ | |
| | sampler_type | string | | |
| binomial | n | integer | ✓ | |
| | p | number | ✓ | >= 0.0, <= 1.0 |
| | sampler_type | string | | |
| category | values | string[] \| integer[] \| number[] | ✓ | len > 1 |
| | weights | number[] \| null | | |
| | sampler_type | string | | |
| datetime | start | string | ✓ | |
| | end | string | ✓ | |
| | unit | string | | |
| | sampler_type | string | | |
| gaussian | mean | number | ✓ | |
| | stddev | number | ✓ | |
| | decimal_places | integer \| null | | |
| | sampler_type | string | | |
| person | locale | string | | |
| | sex | string \| null | | |
| | city | string \| string[] \| null | | |
| | age_range | integer[] | | len > 2, len < 2 |
| | select_field_values | object \| null | | |
| | with_synthetic_personas | boolean | | |
| | sampler_type | string | | |
| person_from_faker | locale | string | | |
| | sex | string \| null | | |
| | city | string \| string[] \| null | | |
| | age_range | integer[] | | len > 2, len < 2 |
| | sampler_type | string | | |
| poisson | mean | number | ✓ | |
| | sampler_type | string | | |
| scipy | dist_name | string | ✓ | |
| | dist_params | dict | ✓ | |
| | decimal_places | integer \| null | | |
| | sampler_type | string | | |
| subcategory | category | string | ✓ | |
| | values | dict | ✓ | |
| | sampler_type | string | | |
| timedelta | dt_min | integer | ✓ | >= 0 |
| | dt_max | integer | ✓ | > 0 |
| | reference_column_name | string | ✓ | |
| | unit | string | | |
| | sampler_type | string | | |
| uniform | low | number | ✓ | |
| | high | number | ✓ | |
| | decimal_places | integer \| null | | |
| | sampler_type | string | | |
| uuid | prefix | string \| null | | |
| | short_form | boolean | | |
| | uppercase | boolean | | |
| | sampler_type | string | | |

Let's start designing our product review dataset by adding product category and subcategory columns.
[21:15:31] [INFO] ✅ Validation passed
Next, let's add samplers to generate data related to the customer and their review.
[21:15:31] [INFO] ✅ Validation passed
🦜 LLM-generated columns
- The real power of Data Designer comes from leveraging LLMs to generate text, code, and structured data.
- When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.
- As we see below, nested JSON fields can be accessed using dot notation.
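As a rough illustration of how a `{{ column.field }}` reference resolves against a record, the sketch below implements a tiny stand-in for template rendering using only the standard library. It is not Jinja and not Data Designer's rendering engine; the `render` and `resolve` helpers and the prompt wording are hypothetical, and the record reuses a few values from the sample record shown later in this notebook.

```python
import re

# Matches placeholders such as {{ product_name }} or {{ customer.first_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def resolve(path, record):
    """Walk a dotted path (e.g. 'customer.city') through nested dictionaries."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value


def render(template, record):
    """Replace every placeholder with the value found at its dotted path."""
    return PLACEHOLDER.sub(lambda m: str(resolve(m.group(1), record)), template)


record = {
    "product_name": "ErgoPulse ChairMate Mini",
    "number_of_stars": 3,
    "customer": {"first_name": "Brittany", "city": "East Timothy"},
}

# Hypothetical prompt text; the nested 'customer' fields are reached with dot notation.
prompt = render(
    "Write a {{ number_of_stars }}-star review of {{ product_name }} "
    "as {{ customer.first_name }} from {{ customer.city }}.",
    record,
)
print(prompt)
# Write a 3-star review of ErgoPulse ChairMate Mini as Brittany from East Timothy.
```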
[21:15:32] [INFO] ✅ Validation passed
🔁 Iteration is key – preview the dataset!
- Use the `preview` method to generate a sample of records quickly.
- Inspect the results for quality and format issues.
- Adjust column configurations, prompts, or parameters as needed.
- Re-run the preview until satisfied.
[21:15:32] [INFO] 🧐 Preview generation in progress
[21:15:32] [INFO] |-- 🔒 Jinja rendering engine: secure
[21:15:32] [INFO] ✅ Validation passed
[21:15:32] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[21:15:32] [INFO] 🩺 Running health checks for models...
[21:15:32] [INFO] |-- 👀 Checking 'nvidia/nemotron-3-nano-30b-a3b' in provider named 'nvidia' for model alias 'nemotron-nano-v3'...
[21:15:33] [INFO] |-- ✅ Passed!
[21:15:33] [INFO] ⚡ DATA_DESIGNER_ASYNC_ENGINE is enabled - using async task-queue preview
[21:15:33] [INFO] 📝 llm-text model config for column 'product_name'
[21:15:33] [INFO] |-- model: 'nvidia/nemotron-3-nano-30b-a3b'
[21:15:33] [INFO] |-- model alias: 'nemotron-nano-v3'
[21:15:33] [INFO] |-- model provider: 'nvidia'
[21:15:33] [INFO] |-- inference parameters:
[21:15:33] [INFO] | |-- generation_type=chat-completion
[21:15:33] [INFO] | |-- max_parallel_requests=4
[21:15:33] [INFO] | |-- extra_body={'chat_template_kwargs': {'enable_thinking': False}}
[21:15:33] [INFO] | |-- temperature=1.00
[21:15:33] [INFO] | |-- top_p=1.00
[21:15:33] [INFO] | |-- max_tokens=2048
[21:15:33] [INFO] 📝 llm-text model config for column 'customer_review'
[21:15:33] [INFO] |-- model: 'nvidia/nemotron-3-nano-30b-a3b'
[21:15:33] [INFO] |-- model alias: 'nemotron-nano-v3'
[21:15:33] [INFO] |-- model provider: 'nvidia'
[21:15:33] [INFO] |-- inference parameters:
[21:15:33] [INFO] | |-- generation_type=chat-completion
[21:15:33] [INFO] | |-- max_parallel_requests=4
[21:15:33] [INFO] | |-- extra_body={'chat_template_kwargs': {'enable_thinking': False}}
[21:15:33] [INFO] | |-- temperature=1.00
[21:15:33] [INFO] | |-- top_p=1.00
[21:15:33] [INFO] | |-- max_tokens=2048
[21:15:33] [INFO] ⚡️ Async generation: 2 column(s) (product_name, customer_review), 4 tasks across 1 row group(s)
[21:15:33] [INFO] 🚀 (1/1) Dispatching with 2 records
[21:15:33] [INFO] 🎲 (1/1) Preparing samplers to generate 2 records across 6 columns
[21:15:37] [INFO] 📊 Progress [3.9s]:
[21:15:37] [INFO] |-- 🌕 product_name: 2/2 (100%) 0.5 rec/s
[21:15:37] [INFO] |-- 🦁 customer_review: 2/2 (100%) 0.5 rec/s
[21:15:37] [INFO] ✅ Async generation complete [3.9s]: 4 ok, 0 failed across 2 column(s)
[21:15:37] [INFO] 📊 Model usage summary:
[21:15:37] [INFO] |-- model: nvidia/nemotron-3-nano-30b-a3b
[21:15:37] [INFO] |-- tokens: input=360, output=448, total=808, tps=204
[21:15:37] [INFO] |-- requests: success=4, failed=0, total=4, rpm=60
[21:15:37] [INFO] 📐 Measuring dataset column statistics:
[21:15:37] [INFO] |-- 🎲 column: 'product_category'
[21:15:37] [INFO] |-- 🎲 column: 'product_subcategory'
[21:15:37] [INFO] |-- 🎲 column: 'target_age_range'
[21:15:37] [INFO] |-- 🎲 column: 'customer'
[21:15:37] [INFO] |-- 🎲 column: 'number_of_stars'
[21:15:37] [INFO] |-- 🎲 column: 'review_style'
[21:15:37] [INFO] |-- 📝 column: 'product_name'
[21:15:37] [INFO] |-- 📝 column: 'customer_review'
[21:15:37] [INFO] 🙌 Preview complete!
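The log above mentions sorting column configs into a Directed Acyclic Graph. One way to picture that step: each prompt template depends on the columns it references, so those columns must be generated first. The sketch below derives such an order with Python's `graphlib`; the template strings are hypothetical, and this is an illustration of the idea rather than Data Designer's actual implementation.

```python
import re
from graphlib import TopologicalSorter

# Captures the top-level column name of each reference, e.g. 'customer' in {{ customer.city }}.
REFERENCE = re.compile(r"\{\{\s*(\w+)")

# Hypothetical prompt templates for the two LLM-generated columns.
templates = {
    "product_name": "Name a product in {{ product_category }} / {{ product_subcategory }}.",
    "customer_review": (
        "Review {{ product_name }} as {{ customer.first_name }} "
        "with {{ number_of_stars }} stars."
    ),
}
sampler_columns = ["product_category", "product_subcategory", "customer", "number_of_stars"]

# Map each column to the set of columns it depends on; samplers depend on nothing here.
graph = {name: set() for name in sampler_columns}
for name, template in templates.items():
    graph[name] = set(REFERENCE.findall(template))

# static_order() yields every column after all of its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

In any valid order, `product_name` appears before `customer_review`, since the review prompt references the product name.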
Generated Columns

| Name | Value |
|---|---|
| product_category | Home Office |
| product_subcategory | Chairs |
| target_age_range | 18-25 |
| customer | {'uuid': '22a58e4d-938c-4c7a-8905-00b6b659c589', 'locale': 'en_US', 'first_name': 'Brittany', 'last_name': 'Tran', 'middle_name': None, 'sex': 'Female', 'street_number': '43048', 'street_name': 'Deborah Stream', 'city': 'East Timothy', 'state': 'Virginia', 'postcode': '99870', 'age': 60, 'birth_date': '1966-05-04', 'country': 'Eritrea', 'marital_status': 'married_present', 'education_level': 'some_college', 'unit': '', 'occupation': 'Teacher, music', 'phone_number': '001-286-878-1827', 'bachelors_field': 'no_degree'} |
| number_of_stars | 3 |
| review_style | brief |
| product_name | ErgoPulse ChairMate Mini |
| customer_review | I’m 60, live in East Timothy, VA, and just bought the ErgoPulse ChairMate Mini. It’s compact and the lumbar support feels decent, but the cushion is a bit firm for my taste and the armrests don’t adjust much. Worth a try if you need a small, supportive chair, but I expected a little more comfort for the price. Rating: 3 stars. |

| | product_category | product_subcategory | target_age_range | customer | number_of_stars | review_style | product_name | customer_review |
|---|---|---|---|---|---|---|---|---|
| 0 | Home Office | Chairs | 18-25 | {'uuid': '22a58e4d-938c-4c7a-8905-00b6b659c589... | 3 | brief | ErgoPulse ChairMate Mini | I’m 60, live in East Timothy, VA, and just bou... |
| 1 | Home Office | Chairs | 50-65 | {'uuid': 'd3f3069f-ff15-43ba-bc35-87a7b625c4fc... | 2 | detailed | ErgoLux Adjustable Lumbar Home Office Chair | I bought the ErgoLux Adjustable Lumbar Home Of... |
📊 Analyze the generated data
- Data Designer automatically generates a basic statistical analysis of the generated data.
- This analysis is available via the `analysis` property of generation result objects.
🎨 Data Designer Dataset Profile

Dataset Overview

| number of records | number of columns | percent complete records |
|---|---|---|
| 2 | 8 | 100.0% |

🎲 Sampler Columns

| column name | data type | number unique values | sampler type |
|---|---|---|---|
| product_category | string | 1 (50.0%) | category |
| product_subcategory | string | 1 (50.0%) | subcategory |
| target_age_range | string | 2 (100.0%) | category |
| customer | dict | 2 (100.0%) | person_from_faker |
| number_of_stars | int | 2 (100.0%) | uniform |
| review_style | string | 2 (100.0%) | category |

📝 LLM-Text Columns

| column name | data type | number unique values | prompt tokens per record | completion tokens per record |
|---|---|---|---|---|
| product_name | string | 2 (100.0%) | 74.0 +/- 0.0 | 9.0 +/- 1.4 |
| customer_review | string | 2 (100.0%) | 72.0 +/- 2.0 | 204.5 +/- 170.4 |

Table Notes

1. All token statistics are based on a sample of max(1000, len(dataset)) records.
2. Tokens are calculated using tiktoken's cl100k_base tokenizer.

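The "number unique values" entries in the profile can be read as a distinct-value count followed by that count as a percentage of the number of records. The sketch below reproduces two of the preview figures under that assumption; `unique_summary` is a hypothetical helper, not part of Data Designer.

```python
def unique_summary(values):
    """Format the distinct-value count and its share of all records, e.g. '1 (50.0%)'."""
    n_unique = len(set(values))
    pct = 100.0 * n_unique / len(values)
    return f"{n_unique} ({pct:.1f}%)"


# Column values taken from the two preview records shown earlier.
preview_categories = ["Home Office", "Home Office"]
preview_age_ranges = ["18-25", "50-65"]

print(unique_summary(preview_categories))  # 1 (50.0%)
print(unique_summary(preview_age_ranges))  # 2 (100.0%)
```

A low percentage on a sampler column is expected when the column has few allowed values; a low percentage on an LLM-generated column is more likely a sign of repetitive output worth investigating.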
🆙 Scale up!
- Happy with your preview data?
- Use the `create` method to submit larger Data Designer generation jobs.
[21:15:37] [INFO] 🎨 Creating Data Designer dataset
[21:15:37] [INFO] |-- 🔒 Jinja rendering engine: secure
[21:15:37] [INFO] ✅ Validation passed
[21:15:37] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[21:15:37] [INFO] 🩺 Running health checks for models...
[21:15:37] [INFO] |-- 👀 Checking 'nvidia/nemotron-3-nano-30b-a3b' in provider named 'nvidia' for model alias 'nemotron-nano-v3'...
[21:15:38] [INFO] |-- ✅ Passed!
[21:15:38] [INFO] ⚡ DATA_DESIGNER_ASYNC_ENGINE is enabled - using async task-queue builder
[21:15:38] [INFO] 📝 llm-text model config for column 'product_name'
[21:15:38] [INFO] |-- model: 'nvidia/nemotron-3-nano-30b-a3b'
[21:15:38] [INFO] |-- model alias: 'nemotron-nano-v3'
[21:15:38] [INFO] |-- model provider: 'nvidia'
[21:15:38] [INFO] |-- inference parameters:
[21:15:38] [INFO] | |-- generation_type=chat-completion
[21:15:38] [INFO] | |-- max_parallel_requests=4
[21:15:38] [INFO] | |-- extra_body={'chat_template_kwargs': {'enable_thinking': False}}
[21:15:38] [INFO] | |-- temperature=1.00
[21:15:38] [INFO] | |-- top_p=1.00
[21:15:38] [INFO] | |-- max_tokens=2048
[21:15:38] [INFO] 📝 llm-text model config for column 'customer_review'
[21:15:38] [INFO] |-- model: 'nvidia/nemotron-3-nano-30b-a3b'
[21:15:38] [INFO] |-- model alias: 'nemotron-nano-v3'
[21:15:38] [INFO] |-- model provider: 'nvidia'
[21:15:38] [INFO] |-- inference parameters:
[21:15:38] [INFO] | |-- generation_type=chat-completion
[21:15:38] [INFO] | |-- max_parallel_requests=4
[21:15:38] [INFO] | |-- extra_body={'chat_template_kwargs': {'enable_thinking': False}}
[21:15:38] [INFO] | |-- temperature=1.00
[21:15:38] [INFO] | |-- top_p=1.00
[21:15:38] [INFO] | |-- max_tokens=2048
[21:15:38] [INFO] ⚡️ Async generation: 2 column(s) (product_name, customer_review), 20 tasks across 1 row group(s)
[21:15:38] [INFO] 🚀 (1/1) Dispatching with 10 records
[21:15:38] [INFO] 🎲 (1/1) Preparing samplers to generate 10 records across 6 columns
[21:15:43] [INFO] 📊 Progress [4.8s]:
[21:15:43] [INFO] |-- 🐔 product_name: 10/10 (100%) 2.1 rec/s
[21:15:43] [INFO] |-- ☀️ customer_review: 10/10 (100%) 2.1 rec/s
[21:15:43] [INFO] ✅ Async generation complete [4.8s]: 20 ok, 0 failed across 2 column(s)
[21:15:43] [INFO] 📊 Model usage summary:
[21:15:43] [INFO] |-- model: nvidia/nemotron-3-nano-30b-a3b
[21:15:43] [INFO] |-- tokens: input=1777, output=3136, total=4913, tps=984
[21:15:43] [INFO] |-- requests: success=20, failed=0, total=20, rpm=240
[21:15:43] [INFO] 📐 Measuring dataset column statistics:
[21:15:43] [INFO] |-- 🎲 column: 'product_category'
[21:15:43] [INFO] |-- 🎲 column: 'product_subcategory'
[21:15:43] [INFO] |-- 🎲 column: 'target_age_range'
[21:15:43] [INFO] |-- 🎲 column: 'customer'
[21:15:43] [INFO] |-- 🎲 column: 'number_of_stars'
[21:15:43] [INFO] |-- 🎲 column: 'review_style'
[21:15:43] [INFO] |-- 📝 column: 'product_name'
[21:15:43] [INFO] |-- 📝 column: 'customer_review'
| | product_category | product_subcategory | target_age_range | customer | number_of_stars | review_style | product_name | customer_review |
|---|---|---|---|---|---|---|---|---|
| 0 | Clothing | Women's Clothing | 18-25 | {'age': 42, 'bachelors_field': 'education', 'b... | 2 | structured with bullet points | Luna Threads™ Cardigan Set | **Luna Threads™ Cardigan Set – 2‑Star Review**... |
| 1 | Home & Kitchen | Furniture | 35-50 | {'age': 59, 'bachelors_field': 'education', 'b... | 3 | brief | Aurora Modular Sofa | I love the Aurora Modular Sofa's modern look a... |
| 2 | Books | Fiction | 65+ | {'age': 26, 'bachelors_field': 'stem', 'birth_... | 4 | detailed | Timeless Tales Emporium | I purchased the Timeless Tales Emporium six we... |
| 3 | Home & Kitchen | Decor | 65+ | {'age': 27, 'bachelors_field': 'no_degree', 'b... | 2 | structured with bullet points | Sunlit Heritage Wall Clock | - **Purchased:** Sunlit Heritage Wall Clock – ... |
| 4 | Books | Classics | 50-65 | {'age': 32, 'bachelors_field': 'no_degree', 'b... | 1 | detailed | The Golden Quill Classic Collection | I’m Nicholas from South Katherine, Ohio, and I... |
🎨 Data Designer Dataset Profile

Dataset Overview

| number of records | number of columns | percent complete records |
|---|---|---|
| 10 | 8 | 100.0% |

🎲 Sampler Columns

| column name | data type | number unique values | sampler type |
|---|---|---|---|
| product_category | string | 4 (40.0%) | category |
| product_subcategory | string | 7 (70.0%) | subcategory |
| target_age_range | string | 5 (50.0%) | category |
| customer | dict | 10 (100.0%) | person_from_faker |
| number_of_stars | int | 4 (40.0%) | uniform |
| review_style | string | 4 (40.0%) | category |

📝 LLM-Text Columns

| column name | data type | number unique values | prompt tokens per record | completion tokens per record |
|---|---|---|---|---|
| product_name | string | 10 (100.0%) | 74.0 +/- 1.0 | 5.5 +/- 1.2 |
| customer_review | string | 10 (100.0%) | 70.0 +/- 1.9 | 275.0 +/- 180.5 |

Table Notes

1. All token statistics are based on a sample of max(1000, len(dataset)) records.
2. Tokens are calculated using tiktoken's cl100k_base tokenizer.

⏭️ Next Steps
Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about: