Copyright © 2026, NVIDIA Corporation.


Person Sampling in Data Designer


Person sampling in Data Designer allows you to generate synthetic person data for your datasets. There are two distinct approaches, each with different capabilities and use cases.

Overview

Data Designer provides two ways to generate synthetic people:

  1. Faker-based sampling - Quick, basic PII generation for testing or when realistic demographic distributions are not relevant for your use case
  2. Nemotron-Personas datasets - Demographically accurate, rich persona data

Approach 1: Faker-Based Sampling

What It Does

Uses the Faker library to generate random personal information. The data is basic and not demographically accurate, but is useful for quick testing, prototyping, or when realistic demographic distributions are not relevant for your use case.

Features

  • Gives you access to person attributes that Faker exposes
  • Quick to set up with no additional downloads
  • Generates random names, emails, addresses, phone numbers, etc.
  • Supports all Faker-supported locales
  • Not demographically grounded - data patterns don’t reflect real-world demographics

Usage Example

import data_designer.config as dd

# Assumes `config_builder` is an existing Data Designer config builder instance.
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="customer",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(
            locale="en_US",
            age_range=[25, 65],
            sex="Female",
        ),
    )
)

For more details, see the documentation for SamplerColumnConfig and PersonFromFakerSamplerParams.


Approach 2: Nemotron-Personas Datasets

What It Does

Uses curated Nemotron-Personas datasets from NVIDIA GPU Cloud (NGC) to generate demographically accurate person data with rich personality profiles and behavioral characteristics.

The NGC datasets are extended versions of the open-source Nemotron-Personas datasets on Hugging Face, with additional fields and enhanced data quality.

Supported locales:

  • en_US: United States
  • en_IN: India (English)
  • en_SG: Singapore (English)
  • fr_FR: France (French)
  • hi_Deva_IN: India (Devanagari script)
  • hi_Latn_IN: India (Latin script)
  • ja_JP: Japan
  • ko_KR: South Korea (Korean)
  • pt_BR: Brazil (Portuguese)
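
A small pre-flight check can catch a misspelled locale before it reaches the sampler configuration. The sketch below hard-codes the nine locales listed above. The helper name `normalize_locale` is hypothetical and not part of Data Designer, and accepting lowercase or hyphenated input is a convenience assumed here, not documented behavior.

```python
# Locale codes copied from the supported-locales list above.
SUPPORTED_LOCALES = (
    "en_US", "en_IN", "en_SG", "fr_FR", "hi_Deva_IN",
    "hi_Latn_IN", "ja_JP", "ko_KR", "pt_BR",
)

# Map lowercased codes to their canonical spelling, e.g. "en_us" -> "en_US".
_CANONICAL = {code.lower(): code for code in SUPPORTED_LOCALES}


def normalize_locale(locale: str) -> str:
    """Return the canonical locale code, or raise ValueError if unsupported.

    Hypothetical helper for illustration; not a Data Designer API.
    """
    key = locale.strip().replace("-", "_").lower()
    try:
        return _CANONICAL[key]
    except KeyError:
        raise ValueError(
            f"Unsupported locale {locale!r}; "
            f"expected one of: {', '.join(SUPPORTED_LOCALES)}"
        ) from None
```

The returned value can then be passed as the `locale` argument of the sampler parameters.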

Features

  • Demographically accurate personal details: Names, ages, sex, marital status, education, occupation based on census data
  • Rich persona details: Comprehensive behavioral profiles including:
    • Big Five personality traits with scores
    • Cultural backgrounds and narratives
    • Skills and hobbies
    • Career goals and aspirations
    • Context-specific personas (professional, financial, healthcare, sports, arts, travel, culinary, etc.)
  • Consistent, referenceable attributes across your dataset
  • Grounded in real-world demographic distributions

Prerequisites

To use the extended Nemotron-Personas datasets with Data Designer, you need to download them from NGC and move them to the Data Designer managed assets directory.

See below for step-by-step instructions.

Nemotron-Personas Datasets Setup Instructions

Step 0: Obtain an NGC API Key and install the NGC CLI

To download the Nemotron-Personas datasets from NGC, you will need to obtain an NGC API key and install the NGC CLI.

  1. NGC API Key: Obtain one from NVIDIA GPU Cloud
  2. NGC CLI: Install it by following the NGC CLI documentation

Step 1: Set Your NGC API Key

export NGC_API_KEY="your-ngc-api-key-here"
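
If the download is driven from a Python script, it may help to fail early when the key is missing. This is a minimal sketch that only reads the `NGC_API_KEY` variable named above. The function `require_ngc_api_key` is a hypothetical helper, not part of Data Designer or the NGC CLI.

```python
import os
from typing import Mapping, Optional


def require_ngc_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the NGC API key from the environment or raise a clear error.

    Hypothetical helper; `env` defaults to os.environ and exists so the
    function can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get("NGC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "NGC_API_KEY is not set; export it before downloading the datasets."
        )
    return key
```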

Step 2 (option 1): Download Nemotron-Personas Datasets via the Data Designer CLI

Once you have the NGC CLI and your NGC API key set up, you can download the datasets via the Data Designer CLI.

You can pass the locales you want to download as arguments to the CLI command:

data-designer download personas --locale en_US --locale ja_JP

Or you can use the interactive mode to select the locales you want to download:

data-designer download personas
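
For scripted setups, the same command can be assembled and run from Python. The sketch below uses only the `data-designer download personas` command and the repeated `--locale` flag shown above. The function names are hypothetical, and it assumes the `data-designer` executable is on your `PATH`.

```python
import subprocess
from typing import Iterable, List


def build_download_command(locales: Iterable[str]) -> List[str]:
    """Build the argv list for `data-designer download personas`.

    Each locale is passed with its own --locale flag, as in the CLI example.
    An empty iterable yields the bare command, which runs in interactive mode.
    """
    command = ["data-designer", "download", "personas"]
    for locale in locales:
        command += ["--locale", locale]
    return command


def download_personas(locales: Iterable[str]) -> None:
    """Run the CLI; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(build_download_command(locales), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems.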

Step 2 (option 2): Download Nemotron-Personas Datasets Directly

Use the NGC CLI to download the datasets:

# For Nemotron-Personas USA
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_us"

# For Nemotron-Personas IN
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_deva_in"
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_latn_in"
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_in"

# For Nemotron-Personas FR
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-fr_fr"

# For Nemotron-Personas JP
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-ja_jp"

# For Nemotron-Personas KR
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-ko_kr"

# For Nemotron-Personas SG
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_sg"

# For Nemotron-Personas BR
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-pt_br"
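
The resource names in the commands above all share one prefix followed by the lowercased locale code. Assuming that pattern holds for every supported locale, a sketch like the following can generate the command for any of them. The helper name `ngc_download_command` is hypothetical.

```python
from typing import List

# Shared prefix taken from the NGC commands above.
NGC_RESOURCE_PREFIX = "nvidia/nemotron-personas/nemotron-personas-dataset-"


def ngc_download_command(locale: str) -> List[str]:
    """Return the argv list for downloading one locale's dataset via the NGC CLI.

    Assumes the resource name is the shared prefix plus the lowercased locale,
    which matches every command listed in this guide.
    """
    resource = NGC_RESOURCE_PREFIX + locale.lower()
    return ["ngc", "registry", "resource", "download-version", resource]
```

For example, `ngc_download_command("hi_Deva_IN")` produces the same arguments as the second command in the list above, and the result can be passed to `subprocess.run(..., check=True)`.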

Then move the downloaded dataset to the Data Designer managed assets directory:

mkdir -p ~/.data-designer/managed-assets/datasets/
mv nemotron-personas-dataset-*/*.parquet ~/.data-designer/managed-assets/datasets/
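
On systems without a POSIX shell, the same move can be done with the Python standard library. This is a sketch of the two shell commands above, not a Data Designer utility. The function name `install_persona_datasets` and its parameters are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List, Optional, Union

PathLike = Union[str, Path]


def install_persona_datasets(
    download_root: PathLike, assets_dir: Optional[PathLike] = None
) -> List[Path]:
    """Move downloaded .parquet files into the managed assets directory.

    Mirrors `mkdir -p` followed by `mv nemotron-personas-dataset-*/*.parquet`.
    `assets_dir` defaults to ~/.data-designer/managed-assets/datasets.
    """
    if assets_dir is None:
        assets_dir = Path.home() / ".data-designer" / "managed-assets" / "datasets"
    assets_dir = Path(assets_dir)
    assets_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    pattern = "nemotron-personas-dataset-*/*.parquet"
    for parquet in sorted(Path(download_root).glob(pattern)):
        target = assets_dir / parquet.name
        shutil.move(str(parquet), str(target))
        moved.append(target)
    return moved
```

Calling `install_persona_datasets(".")` from the directory where the NGC CLI saved its downloads returns the list of files that were moved. An empty list suggests the download did not land where expected.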

Step 3: Use PersonSampler in Your Code

import data_designer.config as dd

# Assumes `config_builder` is an existing Data Designer config builder instance.
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="customer",
        sampler_type=dd.SamplerType.PERSON,
        params=dd.PersonSamplerParams(
            locale="en_US",
            sex="Female",
            age_range=[25, 45],
            with_synthetic_personas=True,
        ),
    )
)

For more details, see the documentation for SamplerColumnConfig and PersonSamplerParams.

Available Data Fields

Core Fields (all locales):

| Field | Type | Notes |
| --- | --- | --- |
| uuid | UUID | Unique identifier |
| first_name | string | |
| middle_name | string | |
| last_name | string | |
| sex | enum | "Male" or "Female" |
| birth_date | date | Derived: year, month, day |
| street_number | int | |
| street_name | string | |
| unit | string | Address line 2 |
| city | string | |
| region | string | Alias: state |
| district | string | Alias: county |
| postcode | string | Alias: zipcode |
| country | string | |
| phone_number | PhoneNumber | Derived: area_code, country_code, prefix, line_number |
| marital_status | string | Values: never_married, married_present, separated, widowed, divorced |
| education_level | string or None | |
| bachelors_field | string or None | |
| occupation | string or None | |
| email_address | string | |
| national_id | string | |

France-Specific Fields (fr_FR):

  • household_type - Household composition (e.g., single person, couple with/without children)
  • monthly_income_eur - Estimated monthly income in euros
  • first_name_heritage - Cultural origin of the first name
  • name_heritage - Cultural, linguistic, or geographic origin of the surname
  • is_first_gen_immigrant - Whether the individual is a first-generation immigrant to France

Japan-Specific Fields (ja_JP):

  • area

Korea-Specific Fields (ko_KR):

  • economic_activity_status - Employment / economic activity status
  • family_type - Household / family composition type
  • housing_type - Dwelling type (apartment, detached home, etc.)
  • housing_tenure - Owned vs rented, etc.
  • income_bracket - Income range
  • military_status - Military service status
  • drinking_status - Drinking frequency / status
  • smoking_status - Smoking frequency / status
  • blood_pressure_status - Blood pressure health indicator
  • blood_sugar_status - Blood sugar health indicator
  • bmi_status - BMI health indicator
  • waist_status - Waist-circumference health indicator

Brazil-Specific Fields (pt_BR):

  • race - Census-reported race

Singapore-Specific Fields (en_SG):

  • industry - Industry of employment
  • preferred_english_name - Preferred English-form name

English Locales Shared Fields (en_US, en_SG):

  • ethnic_background - Self-identified ethnic background

Religion Fields (en_IN, hi_Deva_IN, hi_Latn_IN, en_SG, pt_BR):

  • religion - Census-reported religion

India Locales Fields (en_IN, hi_Deva_IN, hi_Latn_IN):

  • education_degree - Census-reported education degree
  • first_language - Native language
  • second_language - Second language (if applicable)
  • third_language - Third language (if applicable)
  • zone - Urban vs rural

With Synthetic Personas Enabled:

  • Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) with t-scores and labels
  • Cultural background narratives
  • Skills and competencies
  • Hobbies and interests
  • Career goals
  • Context-specific personas (professional, financial, healthcare, sports, arts & entertainment, travel, culinary, etc.)

Japan-specific persona fields (ja_JP):

  • aspects
  • digital_skill

Korea-specific persona fields (ko_KR):

  • family_persona

Religious persona fields (en_IN, hi_Deva_IN, hi_Latn_IN, en_SG, pt_BR):

  • religious_persona
  • religious_background

India-locales persona fields (en_IN, hi_Deva_IN, hi_Latn_IN):

  • linguistic_persona
  • linguistic_background

Configuration Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| locale | str | Language/region code; must be one of: "en_US", "en_IN", "en_SG", "fr_FR", "hi_Deva_IN", "hi_Latn_IN", "ja_JP", "ko_KR", "pt_BR" |
| sex | str (optional) | Filter by "Male" or "Female" |
| city | str or list[str] (optional) | Filter by specific city or cities within locale |
| age_range | list[int] (optional) | Two-element list [min_age, max_age] (default: [18, 114]) |
| with_synthetic_personas | bool (optional) | Include rich personality profiles (default: False) |
| select_field_values | dict (optional) | Custom field-based filtering (e.g., {"state": ["NY", "CA"], "education_level": ["bachelors"]}) |
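
To make the filtering parameters concrete, the following sketch applies `sex`, `age_range`, and `select_field_values` to plain dictionaries. It illustrates the intended semantics as described in the table above and is not Data Designer's implementation. The function names and the sample records are hypothetical, and treating both age bounds as inclusive is an assumption.

```python
from datetime import date
from typing import Any, Dict, List, Optional, Sequence


def age_on(birth_date: date, today: date) -> int:
    """Whole years elapsed between birth_date and today."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def matches_filters(
    person: Dict[str, Any],
    today: date,
    sex: Optional[str] = None,
    age_range: Sequence[int] = (18, 114),  # default taken from the table above
    select_field_values: Optional[Dict[str, List[Any]]] = None,
) -> bool:
    """Return True if the record passes every filter (illustrative only)."""
    if sex is not None and person["sex"] != sex:
        return False
    min_age, max_age = age_range
    # Assumption: both bounds are inclusive.
    if not min_age <= age_on(person["birth_date"], today) <= max_age:
        return False
    # Each listed field must hold one of its allowed values.
    for field, allowed in (select_field_values or {}).items():
        if person.get(field) not in allowed:
            return False
    return True


# Made-up records for illustration; real records come from the datasets.
PEOPLE = [
    {"first_name": "Ana", "sex": "Female", "birth_date": date(1990, 6, 15), "state": "NY"},
    {"first_name": "Ben", "sex": "Male", "birth_date": date(1985, 3, 2), "state": "CA"},
    {"first_name": "Cam", "sex": "Female", "birth_date": date(2010, 1, 1), "state": "NY"},
]
```

Filtering `PEOPLE` with `sex="Female"`, `age_range=[25, 45]`, and `select_field_values={"state": ["NY", "CA"]}` keeps only the first record when evaluated as of 2025-01-01.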