Person Sampling in Data Designer

Person sampling in Data Designer allows you to generate synthetic person data for your datasets. There are two distinct approaches, each with different capabilities and use cases.

Overview

Data Designer provides two ways to generate synthetic people:

Faker-based sampling - Quick, basic PII generation for testing or when realistic demographic distributions are not relevant for your use case
Nemotron-Personas datasets - Demographically accurate, rich persona data

Approach 1: Faker-Based Sampling

What It Does

Uses the Faker library to generate random personal information. The data is basic and not demographically accurate, but is useful for quick testing, prototyping, or when realistic demographic distributions are not relevant for your use case.

Features

Gives you access to person attributes that Faker exposes
Quick to set up with no additional downloads
Generates random names, emails, addresses, phone numbers, etc.
Supports all Faker-supported locales
Not demographically grounded - data patterns don’t reflect real-world demographics

Usage Example

import data_designer.config as dd

# config_builder is assumed to be an existing Data Designer config builder.
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="customer",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(
            locale="en_US",
            age_range=[25, 65],
            sex="Female",
        ),
    )
)

Use SamplerColumnConfig with PersonFromFakerSamplerParams when you need locale-aware synthetic person fields.

Approach 2: Nemotron-Personas Datasets

What It Does

Uses curated Nemotron-Personas datasets from NVIDIA GPU Cloud (NGC) to generate demographically accurate person data with rich personality profiles and behavioral characteristics.

The NGC datasets are extended versions of the open-source Nemotron-Personas datasets on HuggingFace, with additional fields and enhanced data quality.

Supported locales:

en_US: United States
en_IN: India (English)
en_SG: Singapore (English)
fr_FR: France (French)
hi_Deva_IN: India (Devanagari script)
hi_Latn_IN: India (Latin script)
ja_JP: Japan
ko_KR: South Korea (Korean)
pt_BR: Brazil (Portuguese)
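
The locale codes above combine a language, an optional script, and a region (for example, hi_Deva_IN is Hindi in Devanagari script for India). As a minimal sketch, the helper below splits a code into those parts and rejects anything outside the supported list. The names `parse_locale` and `LocaleParts` are hypothetical and are not part of Data Designer.

```python
from typing import NamedTuple, Optional

# Locale codes listed in this section.
SUPPORTED_LOCALES = (
    "en_US", "en_IN", "en_SG", "fr_FR", "hi_Deva_IN",
    "hi_Latn_IN", "ja_JP", "ko_KR", "pt_BR",
)


class LocaleParts(NamedTuple):
    language: str
    script: Optional[str]  # only present for codes such as hi_Deva_IN
    region: str


def parse_locale(code: str) -> LocaleParts:
    """Split a supported locale code into language, optional script, and region.

    Hypothetical helper for illustration; not part of Data Designer.
    """
    if code not in SUPPORTED_LOCALES:
        raise ValueError(f"Unsupported locale: {code!r}")
    parts = code.split("_")
    if len(parts) == 3:
        language, script, region = parts
        return LocaleParts(language, script, region)
    language, region = parts
    return LocaleParts(language, None, region)


print(parse_locale("hi_Deva_IN"))
print(parse_locale("en_US"))
```

The check is case-sensitive, so a lowercase code such as en_us raises ValueError.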

Features

Demographically accurate personal details: Names, ages, sex, marital status, education, occupation based on census data
Rich persona details: Comprehensive behavioral profiles including:
- Big Five personality traits with scores
- Cultural backgrounds and narratives
- Skills and hobbies
- Career goals and aspirations
- Context-specific personas (professional, financial, healthcare, sports, arts, travel, culinary, etc.)
Consistent, referenceable attributes across your dataset
Grounded in real-world demographic distributions

Prerequisites

To use the extended Nemotron-Personas datasets with Data Designer, you need to download them from NGC and move them to the Data Designer managed assets directory.

See below for step-by-step instructions.

Nemotron-Personas Datasets Setup Instructions

Step 0: Obtain an NGC API Key and install the NGC CLI

To download the Nemotron-Personas datasets from NGC, you will need to obtain an NGC API key and install the NGC CLI.

NGC API Key: Obtain from NVIDIA GPU Cloud
NGC CLI: Install the NGC CLI

Step 1: Create the default NGC CLI config

Configure the NGC CLI with your API key. When prompted, paste the API key you obtained from NGC. This creates the default ~/.ngc/config file that Data Designer checks before downloading persona datasets.

$ ngc config set
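
If you script your setup, you may want to confirm that the config file exists before starting a download. The sketch below only checks for the default ~/.ngc/config path mentioned above. The helper names are hypothetical, and this is not necessarily how Data Designer performs its own check.

```python
from pathlib import Path
from typing import Optional


def ngc_config_path(home: Optional[Path] = None) -> Path:
    """Return the default NGC CLI config location, ~/.ngc/config."""
    return (home or Path.home()) / ".ngc" / "config"


def ngc_config_exists(home: Optional[Path] = None) -> bool:
    """True if the default NGC CLI config file is present (hypothetical helper)."""
    return ngc_config_path(home).is_file()


if ngc_config_exists():
    print("NGC CLI config found at", ngc_config_path())
else:
    print("No NGC CLI config found; run `ngc config set` first.")
```

The optional `home` argument exists only so the check can be pointed at a different directory, for example in tests.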

Step 2 (option 1): Download Nemotron-Personas Datasets via the Data Designer CLI

Once you have configured the NGC CLI, you can download the datasets via the Data Designer CLI.

You can pass the locales you want to download as arguments to the CLI command:

$ data-designer download personas --locale en_US --locale ja_JP

Or you can use the interactive mode to select the locales you want to download:

$ data-designer download personas
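
The non-interactive form repeats the --locale flag once per locale. If you drive the CLI from a script, a small wrapper can assemble the arguments. This is a sketch only: `build_download_command` is a hypothetical helper, and it uses just the command and flag shown above.

```python
import shlex
from typing import Iterable, List


def build_download_command(locales: Iterable[str]) -> List[str]:
    """Build argv for `data-designer download personas`, one --locale flag per locale.

    Hypothetical helper. With no locales, the result is the interactive form.
    """
    argv = ["data-designer", "download", "personas"]
    for locale in locales:
        argv += ["--locale", locale]
    return argv


command = build_download_command(["en_US", "ja_JP"])
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```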

Step 2 (option 2): Download Nemotron-Personas Datasets Directly

Use the configured NGC CLI to download the datasets:

# For Nemotron-Personas USA
$ ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_us"

# For Nemotron-Personas IN
$ ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_deva_in"
$ ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_latn_in"
$ ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_in"

# For Nemotron-Personas FR
$ ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-fr_fr"

# For Nemotron-Personas JP
$ ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-ja_jp"

# For Nemotron-Personas KR
$ ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-ko_kr"

# For Nemotron-Personas SG
$ ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_sg"

# For Nemotron-Personas BR
$ ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-pt_br"
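
In the commands above, each resource name is a fixed prefix followed by the lowercased locale code. The sketch below derives the name and the matching command from a locale, assuming that pattern holds for every supported locale. The helper names are hypothetical.

```python
from typing import List

# Prefix shared by all resource names in the commands above.
RESOURCE_PREFIX = "nvidia/nemotron-personas/nemotron-personas-dataset-"


def ngc_resource_name(locale: str) -> str:
    """Map a locale such as 'hi_Deva_IN' to its NGC resource name (assumed pattern)."""
    return RESOURCE_PREFIX + locale.lower()


def ngc_download_command(locale: str) -> List[str]:
    """Build argv for the `ngc registry resource download-version` call."""
    return ["ngc", "registry", "resource", "download-version", ngc_resource_name(locale)]


for locale in ("en_US", "hi_Deva_IN"):
    print(" ".join(ngc_download_command(locale)))
```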

Then move the downloaded Parquet files into the Data Designer managed assets directory:

$ mkdir -p ~/.data-designer/managed-assets/datasets/
$ mv nemotron-personas-dataset-*/*.parquet ~/.data-designer/managed-assets/datasets/
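
If you prefer to do this step from Python, for example on a system without a POSIX shell, the following sketch mirrors the two commands above using only the standard library. `move_parquet_files` is a hypothetical helper; the glob pattern and the destination directory are taken from the commands above.

```python
import shutil
from pathlib import Path
from typing import List


def move_parquet_files(download_root: Path, assets_dir: Path) -> List[Path]:
    """Move nemotron-personas-dataset-*/*.parquet from download_root into assets_dir."""
    assets_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    moved = []
    for src in sorted(download_root.glob("nemotron-personas-dataset-*/*.parquet")):
        dest = assets_dir / src.name
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved


# Example call, run from the directory where the NGC CLI saved the downloads:
# move_parquet_files(
#     Path.cwd(),
#     Path.home() / ".data-designer" / "managed-assets" / "datasets",
# )
```

Only .parquet files are moved; any other files in the download directories stay where they are.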

Step 3: Use PersonSampler in Your Code

import data_designer.config as dd

# config_builder is assumed to be an existing Data Designer config builder.
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="customer",
        sampler_type=dd.SamplerType.PERSON,
        params=dd.PersonSamplerParams(
            locale="en_US",
            sex="Female",
            age_range=[25, 45],
            with_synthetic_personas=True,
        ),
    )
)

Use SamplerColumnConfig with PersonSamplerParams when you need richer personas from curated datasets.

Available Data Fields

Core Fields (all locales):

Field	Type	Notes
`uuid`	UUID	Unique identifier
`first_name`	string
`middle_name`	string
`last_name`	string
`sex`	enum	"Male" or "Female"
`birth_date`	date	Derived: year, month, day
`street_number`	int
`street_name`	string
`unit`	string	Address line 2
`city`	string
`region`	string	Alias: state
`district`	string	Alias: county
`postcode`	string	Alias: zipcode
`country`	string
`phone_number`	PhoneNumber	Derived: area_code, country_code, prefix, line_number
`marital_status`	string	Values: never_married, married_present, separated, widowed, divorced
`education_level`	string or None
`bachelors_field`	string or None
`occupation`	string or None
`email_address`	string
`national_id`	string
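
The table notes that year, month, and day are derived from birth_date. As a rough illustration of what such derived values look like, the sketch below computes them from a date and also computes an age in whole years, which is the quantity the age_range filter operates on. This is not Data Designer's implementation, and both function names are hypothetical.

```python
from datetime import date
from typing import Dict


def birth_date_parts(birth_date: date) -> Dict[str, int]:
    """Return the year, month, and day components of a birth date."""
    return {"year": birth_date.year, "month": birth_date.month, "day": birth_date.day}


def age_on(birth_date: date, today: date) -> int:
    """Age in whole years on `today` (hypothetical helper)."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


born = date(1990, 6, 15)  # made-up example value
print(birth_date_parts(born))
print(age_on(born, date(2024, 6, 14)), age_on(born, date(2024, 6, 15)))
```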

France-Specific Fields (fr_FR):

household_type - Household composition (e.g., single person, couple with/without children)
monthly_income_eur - Estimated monthly income in euros
first_name_heritage - Cultural origin of the first name
name_heritage - Cultural, linguistic, or geographic origin of the surname
is_first_gen_immigrant - Whether the individual is a first-generation immigrant to France

Japan-Specific Fields (ja_JP):

area

Korea-Specific Fields (ko_KR):

economic_activity_status - Employment / economic activity status
family_type - Household / family composition type
housing_type - Dwelling type (apartment, detached home, etc.)
housing_tenure - Owned vs rented, etc.
income_bracket - Income range
military_status - Military service status
drinking_status - Drinking frequency / status
smoking_status - Smoking frequency / status
blood_pressure_status - Blood pressure health indicator
blood_sugar_status - Blood sugar health indicator
bmi_status - BMI health indicator
waist_status - Waist-circumference health indicator

Brazil-Specific Fields (pt_BR):

race - Census-reported race

Singapore-Specific Fields (en_SG):

industry - Industry of employment
preferred_english_name - Preferred English-form name

English Locales Shared Fields (en_US, en_SG):

ethnic_background - Self-identified ethnic background

Religion Fields (en_IN, hi_Deva_IN, hi_Latn_IN, en_SG, pt_BR):

religion - Census-reported religion

India Locales Fields (en_IN, hi_Deva_IN, hi_Latn_IN):

education_degree - Census-reported education degree
first_language - Native language
second_language - Second language (if applicable)
third_language - Third language (if applicable)
zone - Urban vs rural

With Synthetic Personas Enabled:

Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) with t-scores and labels
Cultural background narratives
Skills and competencies
Hobbies and interests
Career goals
Context-specific personas (professional, financial, healthcare, sports, arts & entertainment, travel, culinary, etc.)
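
For readers unfamiliar with t-scores: in psychometrics, a T-score conventionally rescales a value so that the reference population has mean 50 and standard deviation 10. The sketch below shows that conventional formula only. It assumes the dataset follows the convention, which this page does not state, and it does not reproduce how labels are assigned.

```python
def t_score(raw: float, mean: float, std: float) -> float:
    """Conventional T-score: 50 + 10 * z, where z = (raw - mean) / std.

    Assumption: the dataset's t-scores follow this standard definition.
    """
    if std <= 0:
        raise ValueError("std must be positive")
    return 50.0 + 10.0 * (raw - mean) / std


# Made-up numbers: a value one standard deviation above the mean.
print(t_score(3.5, mean=3.0, std=0.5))
```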

Japan-specific persona fields (ja_JP):

aspects
digital_skill

Korea-specific persona fields (ko_KR):

family_persona

Religious persona fields (en_IN, hi_Deva_IN, hi_Latn_IN, en_SG, pt_BR):

religious_persona
religious_background

India-locales persona fields (en_IN, hi_Deva_IN, hi_Latn_IN):

linguistic_persona
linguistic_background

Configuration Parameters

Parameter	Type	Description
`locale`	str	Language/region code - must be one of: "en_US", "en_IN", "en_SG", "fr_FR", "hi_Deva_IN", "hi_Latn_IN", "ja_JP", "ko_KR", "pt_BR"
`sex`	str (optional)	Filter by "Male" or "Female"
`city`	str or list[str] (optional)	Filter by specific city or cities within locale
`age_range`	list[int] (optional)	Two-element list [min_age, max_age] (default: [18, 114])
`with_synthetic_personas`	bool (optional)	Include rich personality profiles (default: False)
`select_field_values`	dict (optional)	Custom field-based filtering (e.g., `{"state": ["NY", "CA"], "education_level": ["bachelors"]}`)
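
To make the filtering parameters concrete, here is a conceptual sketch of how sex, age_range, and select_field_values could combine to narrow a pool of records. It is not Data Designer's implementation. It assumes that age_range is inclusive at both ends, that all filters must match, and that aliases such as state resolve to the underlying field name. The `matches` function, the `age` key, and the toy records are all made up for illustration.

```python
from typing import Any, Dict, List, Optional, Sequence

# Aliases listed in the core fields table (alias -> field name).
FIELD_ALIASES = {"state": "region", "county": "district", "zipcode": "postcode"}


def matches(
    person: Dict[str, Any],
    sex: Optional[str] = None,
    age_range: Sequence[int] = (18, 114),
    select_field_values: Optional[Dict[str, List[Any]]] = None,
) -> bool:
    """Return True if a record passes every filter (hypothetical helper)."""
    if sex is not None and person["sex"] != sex:
        return False
    min_age, max_age = age_range
    if not min_age <= person["age"] <= max_age:  # assumed inclusive bounds
        return False
    for field, allowed in (select_field_values or {}).items():
        key = FIELD_ALIASES.get(field, field)  # e.g. "state" -> "region"
        if person.get(key) not in allowed:
            return False
    return True


# Toy records, invented for this example.
people = [
    {"first_name": "A", "sex": "Female", "age": 30, "region": "NY", "education_level": "bachelors"},
    {"first_name": "B", "sex": "Female", "age": 52, "region": "CA", "education_level": "bachelors"},
    {"first_name": "C", "sex": "Male", "age": 30, "region": "NY", "education_level": "bachelors"},
    {"first_name": "D", "sex": "Female", "age": 28, "region": "TX", "education_level": "high_school"},
]

selected = [
    p for p in people
    if matches(
        p,
        sex="Female",
        age_range=[25, 45],
        select_field_values={"state": ["NY", "CA"], "education_level": ["bachelors"]},
    )
]
print([p["first_name"] for p in selected])
```

Each additional filter can only shrink the pool, so a very narrow combination may leave few or no matching records.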