Sampling-Based Columns

Sampling-based columns generate data through statistical sampling methods, distributions, and predefined datasets.

Before You Start

Set up the Data Designer client and the configuration builder first. The examples on this page use the `config_builder` object created here:

import os
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import DataDesignerClient, DataDesignerConfigBuilder
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

data_designer_client = DataDesignerClient(
    client=NeMoMicroservices(base_url=os.environ["NEMO_MICROSERVICES_BASE_URL"])
)

config_builder = DataDesignerConfigBuilder(model_configs="path/to/your/model_configs.yaml")

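Using Conditional Parameters

All sampling-based columns support conditional parameters: alternative parameters that apply when a condition on other column values is met. The examples below assume a `number_of_pets` column is already defined and set `pet_type` to "none" whenever `number_of_pets` is 0: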

Simplified API

config_builder.add_column(
    name="pet_type",
    type="category",
    params={"values": ["dog", "cat", "fish"], "weights": [0.5, 0.3, 0.2]},
    conditional_params={
        "number_of_pets == 0": {"values": ["none"]}
    }
)

Typed API

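config_builder.add_column(
    # C and P are the `columns` and `params` modules imported in "Before You Start"
    C.SamplerColumn(
        name="pet_type",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(values=["dog", "cat", "fish"], weights=[0.5, 0.3, 0.2]),
        conditional_params={
            # Used in place of `params` for rows where the condition holds
            "number_of_pets == 0": P.CategorySamplerParams(values=["none"])
        }
    )
)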
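
To make the behavior of conditional parameters concrete, the following is a minimal pure-Python sketch of how a sampler might choose between default and conditional parameters for each row. It is not the Data Designer implementation: `condition_holds`, `resolve_params`, and `sample_category` are hypothetical helpers, and the condition parser only understands `column == literal`.

```python
import ast
import random


def condition_holds(condition, row):
    """Evaluate a condition of the form 'column == literal' against one row.

    Hypothetical helper: only equality checks are supported in this sketch.
    """
    column, literal = (part.strip() for part in condition.split("==", 1))
    return row[column] == ast.literal_eval(literal)


def resolve_params(row, default_params, conditional_params):
    """Return the first conditional params whose condition holds, else the defaults."""
    for condition, params in conditional_params.items():
        if condition_holds(condition, row):
            return params
    return default_params


def sample_category(params, rng):
    """Draw one value; a missing 'weights' entry means uniform sampling."""
    return rng.choices(params["values"], weights=params.get("weights"), k=1)[0]


default_params = {"values": ["dog", "cat", "fish"], "weights": [0.5, 0.3, 0.2]}
conditional_params = {"number_of_pets == 0": {"values": ["none"]}}

rng = random.Random(0)  # seeded so repeated runs give the same draws
rows = [{"number_of_pets": n} for n in (0, 2, 0, 1)]
for row in rows:
    params = resolve_params(row, default_params, conditional_params)
    row["pet_type"] = sample_category(params, rng)
    print(row)
```

Rows with `number_of_pets` equal to 0 always receive "none"; every other row draws from the weighted default values.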

Reference Table

Simplified API Type	Typed API Equivalent	Description
`"category"`	`SamplerType.CATEGORY`	Categorical values
`"subcategory"`	`SamplerType.SUBCATEGORY`	Dependent categories
`"uuid"`	`SamplerType.UUID`	Unique identifiers
`"uniform"`	`SamplerType.UNIFORM`	Uniform distribution
`"gaussian"`	`SamplerType.GAUSSIAN`	Normal distribution
`"poisson"`	`SamplerType.POISSON`	Poisson distribution
`"bernoulli"`	`SamplerType.BERNOULLI`	Binary outcomes
`"bernoulli_mixture"`	`SamplerType.BERNOULLI_MIXTURE`	Mixed distribution
`"binomial"`	`SamplerType.BINOMIAL`	Number of successes
`"scipy"`	`SamplerType.SCIPY`	SciPy distributions
`"datetime"`	`SamplerType.DATETIME`	Date/time values
`"timedelta"`	`SamplerType.TIMEDELTA`	Time intervals
`"person"`	`SamplerType.PERSON`	Person entities
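
The table follows a regular naming pattern: each typed member name is the upper-cased simplified string. The sketch below mirrors the table with a stand-in enum to illustrate that mapping. `SketchSamplerType` and `to_typed` are hypothetical names defined here only for illustration; whether the real `SamplerType` members carry these string values is an assumption.

```python
from enum import Enum


class SketchSamplerType(str, Enum):
    """Stand-in for the table above; NOT the library's SamplerType."""

    CATEGORY = "category"
    SUBCATEGORY = "subcategory"
    UUID = "uuid"
    UNIFORM = "uniform"
    GAUSSIAN = "gaussian"
    POISSON = "poisson"
    BERNOULLI = "bernoulli"
    BERNOULLI_MIXTURE = "bernoulli_mixture"
    BINOMIAL = "binomial"
    SCIPY = "scipy"
    DATETIME = "datetime"
    TIMEDELTA = "timedelta"
    PERSON = "person"


def to_typed(simplified):
    """Look up the enum member for a simplified API type string."""
    return SketchSamplerType(simplified)


# Every member name is the upper-cased simplified string
for member in SketchSamplerType:
    assert member.name == member.value.upper()

print(to_typed("bernoulli_mixture").name)
```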

Available Person Attributes

When you reference a person sampler column in a prompt template or Jinja template, the following attributes are available:

first_name: First name
last_name: Last name
email: Email address
phone: Phone number
address: Street address
city: City name
state: State/province
zip_code: Postal code
country: Country
date_of_birth: Date of birth
age: Age (calculated from date of birth)
sex: Gender
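
As an illustration of attribute access, the sketch below renders a Jinja-style prompt that references several of the attributes listed above. The column name `customer`, the sample values, and the tiny `render` helper are all hypothetical; `render` handles only `{{ column.attribute }}` placeholders and stands in for a real template engine so the example runs with the standard library alone.

```python
import re
from types import SimpleNamespace

# Made-up values standing in for one sampled person record
customer = SimpleNamespace(first_name="Ada", last_name="Lopez", city="Austin", age=34)

template = (
    "Write a short welcome note for {{ customer.first_name }} "
    "{{ customer.last_name }}, aged {{ customer.age }}, from {{ customer.city }}."
)


def render(template, **context):
    """Replace each '{{ column.attribute }}' placeholder with the attribute's value."""

    def substitute(match):
        column, attribute = match.group(1), match.group(2)
        return str(getattr(context[column], attribute))

    return re.sub(r"\{\{\s*(\w+)\.(\w+)\s*\}\}", substitute, template)


prompt = render(template, customer=customer)
print(prompt)
```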
