Copyright © 2026, NVIDIA Corporation.


data_designer.config.analysis.column_statistics


Module Contents

Classes

| Name | Description |
| --- | --- |
| MissingValue | String enum of sentinel values stored in place of a statistic that is unavailable (CALCULATION_FAILED, OUTPUT_FORMAT_ERROR). |
| ColumnDistributionType | String enum classifying a column's distribution (categorical, numerical, text, other, or unknown). |
| BaseColumnStatistics | Abstract base class for all column statistics types. |
| GeneralColumnStatistics | Container for general statistics applicable to all column types. |
| LLMTextColumnStatistics | Container for statistics on LLM-generated text columns. |
| LLMCodeColumnStatistics | Container for statistics on LLM-generated code columns. |
| LLMStructuredColumnStatistics | Container for statistics on LLM-generated structured JSON columns. |
| LLMJudgedColumnStatistics | Container for statistics on LLM-as-a-judge quality assessment columns. |
| SamplerColumnStatistics | Container for statistics on sampler-generated columns. |
| SeedDatasetColumnStatistics | Container for statistics on columns sourced from seed datasets. |
| ExpressionColumnStatistics | Container for statistics on expression-based derived columns. |
| ValidationColumnStatistics | Container for statistics on validation result columns. |
| CategoricalHistogramData | Container for categorical distribution histogram data. |
| CategoricalDistribution | Container for computed categorical distribution statistics. |
| NumericalDistribution | Container for computed numerical distribution statistics. |

Data

ColumnStatisticsT
DEFAULT_COLUMN_STATISTICS_MAP

API

class data_designer.config.analysis.column_statistics.MissingValue

Bases: str, enum.Enum

CALCULATION_FAILED = "--"
OUTPUT_FORMAT_ERROR = "output_format_error"
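
Because MissingValue derives from both str and enum.Enum, a field typed as int | MissingValue can hold either a number or one of these sentinels. The sketch below shows how calling code might tell the two apart. It re-declares a stand-in enum with the two documented members rather than importing the library, and format_stat is a hypothetical helper that does not appear in this module.

```python
from enum import Enum
from typing import Union


class MissingValue(str, Enum):
    """Stand-in mirroring the documented members; not the library class."""

    CALCULATION_FAILED = "--"
    OUTPUT_FORMAT_ERROR = "output_format_error"


def format_stat(value: Union[int, float, MissingValue]) -> str:
    """Hypothetical helper: render a statistic, passing sentinels through."""
    if isinstance(value, MissingValue):
        # Sentinels already carry a display string as their value.
        return value.value
    # Real numbers get a thousands separator.
    return f"{value:,}"
```

Checking with isinstance keeps the test explicit even though a member also compares equal to its plain string value.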
class data_designer.config.analysis.column_statistics.ColumnDistributionType

Bases: str, enum.Enum

CATEGORICAL = "categorical"
NUMERICAL = "numerical"
TEXT = "text"
OTHER = "other"
UNKNOWN = "unknown"
class data_designer.config.analysis.column_statistics.BaseColumnStatistics(
    /,
    **data: typing.Any
)

Bases: pydantic.BaseModel, abc.ABC

Abstract base class for all column statistics types.

Serves as a container for computed statistics across different column types in Data-Designer-generated datasets. Subclasses hold column-specific statistical results and provide methods for formatting these results for display in reports.

Initialization:

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config = ConfigDict(...)
create_report_row_data() -> dict[str, str]

Creates a formatted dictionary of statistics for display in reports.

Returns:

dict[str, str]

Dictionary mapping display labels to formatted statistic values.
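
As a rough illustration of this contract, the sketch below defines an abstract base with a create_report_row_data method and one concrete subclass. It uses abc and dataclasses instead of pydantic to stay dependency-free, and the class names ReportRow and NullSummary, along with the display labels, are hypothetical rather than taken from the library.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class ReportRow(ABC):
    """Hypothetical stand-in for an abstract statistics container."""

    @abstractmethod
    def create_report_row_data(self) -> dict[str, str]:
        """Return display labels mapped to formatted values."""


@dataclass
class NullSummary(ReportRow):
    """Hypothetical subclass holding a few column-level counts."""

    column_name: str
    num_records: int
    num_null: int

    def create_report_row_data(self) -> dict[str, str]:
        # Every value is formatted to a string so a report can print it directly.
        return {
            "column name": self.column_name,
            "number of records": f"{self.num_records:,}",
            "number of nulls": f"{self.num_null:,}",
        }
```

The base class cannot be instantiated on its own, which mirrors the role an abstract statistics type plays here.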

class data_designer.config.analysis.column_statistics.GeneralColumnStatistics(
    /,
    **data: typing.Any
)

Bases: data_designer.config.analysis.column_statistics.BaseColumnStatistics

Container for general statistics applicable to all column types.

Holds core statistical measures that apply universally across all column types, including null counts, unique values, and data type information. Serves as the base for more specialized column statistics classes that store additional column-specific metrics.

Parameters:

column_name

Name of the column being analyzed.

num_records

Total number of records in the column.

num_null

Number of null/missing values in the column.

num_unique

Number of distinct values in the column. If a value is not hashable, it is converted to a string.

pyarrow_dtype

PyArrow data type of the column as a string.

simple_dtype

Simplified human-readable data type label.

column_type

Discriminator field, always “general” for this statistics type.
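
The num_unique description says unhashable values are converted to strings before counting. A minimal sketch of that idea follows. The function count_unique is hypothetical, and details such as how the library treats nulls are not specified on this page.

```python
from collections.abc import Iterable
from typing import Any


def count_unique(values: Iterable[Any]) -> int:
    """Count distinct values, stringifying anything that cannot be hashed."""
    seen = set()
    for value in values:
        try:
            seen.add(value)
        except TypeError:
            # Lists, dicts, and other unhashable values fall back to str().
            seen.add(str(value))
    return len(seen)
```

With this fallback, two equal lists such as [1, 2] and [1, 2] collapse into one entry because they share the same string form.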

column_name: str
num_records: int | data_designer.config.analysis.column_statistics.MissingValue
num_null: int | data_designer.config.analysis.column_statistics.MissingValue
num_unique: int | data_designer.config.analysis.column_statistics.MissingValue
pyarrow_dtype: str
simple_dtype: str
column_type: typing.Literal["general"] = "general"
general_statistics_ensure_python_integers(v: int | data_designer.config.analysis.column_statistics.MissingValue) -> int | data_designer.config.analysis.column_statistics.MissingValue
percent_null: float | data_designer.config.analysis.column_statistics.MissingValue
percent_unique: float | data_designer.config.analysis.column_statistics.MissingValue
_general_display_row: dict[str, str]
create_report_row_data() -> dict[str, str]
_is_missing_value(v: float | int | data_designer.config.analysis.column_statistics.MissingValue) -> bool
class data_designer.config.analysis.column_statistics.LLMTextColumnStatistics(
    /,
    **data: typing.Any
)

Bases: data_designer.config.analysis.column_statistics.GeneralColumnStatistics

Container for statistics on LLM-generated text columns.

Inherits general statistics plus token usage metrics specific to LLM text generation. Stores both prompt (input) and completion (output) token consumption data.

Parameters:

output_tokens_mean

Mean number of output tokens generated per record.

output_tokens_median

Median number of output tokens generated per record.

output_tokens_stddev

Standard deviation of output tokens per record.

input_tokens_mean

Mean number of input tokens used per record.

input_tokens_median

Median number of input tokens used per record.

input_tokens_stddev

Standard deviation of input tokens per record.

column_type

Discriminator field, always “llm-text” for this statistics type.

output_tokens_mean: float | data_designer.config.analysis.column_statistics.MissingValue
output_tokens_median: float | data_designer.config.analysis.column_statistics.MissingValue
output_tokens_stddev: float | data_designer.config.analysis.column_statistics.MissingValue
input_tokens_mean: float | data_designer.config.analysis.column_statistics.MissingValue
input_tokens_median: float | data_designer.config.analysis.column_statistics.MissingValue
input_tokens_stddev: float | data_designer.config.analysis.column_statistics.MissingValue
column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.LLM_TEXT.value]
llm_column_ensure_python_floats(v: float | int | data_designer.config.analysis.column_statistics.MissingValue) -> float | int | data_designer.config.analysis.column_statistics.MissingValue
create_report_row_data() -> dict[str, typing.Any]
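
The validator signature above accepts float | int | MissingValue and returns the same union, which suggests it normalizes numeric values while letting sentinels pass through untouched. The body is not shown on this page, so the following is only a sketch of one way such coercion could work. The function to_python_number and the stand-in enum are assumptions, not library code.

```python
import numbers
from enum import Enum
from typing import Any


class MissingValue(str, Enum):
    """Stand-in for the documented enum; not imported from the library."""

    CALCULATION_FAILED = "--"
    OUTPUT_FORMAT_ERROR = "output_format_error"


def to_python_number(value: Any) -> Any:
    """Hypothetical coercion: return built-in int/float, keep sentinels as-is."""
    if isinstance(value, MissingValue):
        return value
    # The numbers ABCs also match third-party scalar types registered with them.
    if isinstance(value, numbers.Integral):
        return int(value)
    if isinstance(value, numbers.Real):
        return float(value)
    return value
```

Using the numbers abstract base classes avoids a hard dependency on any particular numeric library in the sketch.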
class data_designer.config.analysis.column_statistics.LLMCodeColumnStatistics(
    /,
    **data: typing.Any
)

Bases: data_designer.config.analysis.column_statistics.LLMTextColumnStatistics

Container for statistics on LLM-generated code columns.

Inherits all token usage metrics from LLMTextColumnStatistics. Stores statistics from columns that generate code snippets in specific programming languages.

Parameters:

column_type

Discriminator field, always “llm-code” for this statistics type.

column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.LLM_CODE.value]
class data_designer.config.analysis.column_statistics.LLMStructuredColumnStatistics(
    /,
    **data: typing.Any
)

Bases: data_designer.config.analysis.column_statistics.LLMTextColumnStatistics

Container for statistics on LLM-generated structured JSON columns.

Inherits all token usage metrics from LLMTextColumnStatistics. Stores statistics from columns that generate structured data conforming to JSON schemas or Pydantic models.

Parameters:

column_type

Discriminator field, always “llm-structured” for this statistics type.

column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.LLM_STRUCTURED.value]
class data_designer.config.analysis.column_statistics.LLMJudgedColumnStatistics(
    /,
    **data: typing.Any
)

Bases: data_designer.config.analysis.column_statistics.LLMTextColumnStatistics

Container for statistics on LLM-as-a-judge quality assessment columns.

Inherits all token usage metrics from LLMTextColumnStatistics. Stores statistics from columns that evaluate and score other generated content based on defined criteria.

Parameters:

column_type

Discriminator field, always “llm-judge” for this statistics type.

column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.LLM_JUDGE.value]
class data_designer.config.analysis.column_statistics.SamplerColumnStatistics(
    /,
    **data: typing.Any
)

Bases: data_designer.config.analysis.column_statistics.GeneralColumnStatistics

Container for statistics on sampler-generated columns.

Inherits general statistics plus sampler-specific information including the sampler type used and the empirical distribution of generated values. Stores both categorical and numerical distribution results.

Parameters:

sampler_type

Type of sampler used to generate this column (e.g., “uniform”, “category”, “gaussian”, “person”).

distribution_type

Classification of the column’s distribution (categorical, numerical, text, other, or unknown).

distribution

Empirical distribution statistics for the generated values. Can be CategoricalDistribution (for discrete values), NumericalDistribution (for continuous values), or MissingValue if distribution could not be computed.

column_type

Discriminator field, always “sampler” for this statistics type.

sampler_type: data_designer.config.sampler_params.SamplerType
distribution_type: data_designer.config.analysis.column_statistics.ColumnDistributionType
distribution: CategoricalDistribution | NumericalDistribution | data_designer.config.analysis.column_statistics.MissingValue | None
column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.SAMPLER.value]
create_report_row_data() -> dict[str, str]
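
Since distribution can be a CategoricalDistribution, a NumericalDistribution, a MissingValue, or None, code that reads it has to branch on the type. The sketch below shows one way to do that with simplified stand-in classes that carry only a few of the documented fields. The function describe_distribution and its output strings are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class MissingValue(str, Enum):
    """Stand-in for the documented enum; not imported from the library."""

    CALCULATION_FAILED = "--"
    OUTPUT_FORMAT_ERROR = "output_format_error"


@dataclass
class CategoricalDistribution:
    """Simplified stand-in with two of the documented fields."""

    most_common_value: Union[str, int]
    least_common_value: Union[str, int]


@dataclass
class NumericalDistribution:
    """Simplified stand-in with three of the documented fields."""

    min: float
    max: float
    mean: float


Distribution = Union[CategoricalDistribution, NumericalDistribution, MissingValue, None]


def describe_distribution(distribution: Distribution) -> str:
    """Hypothetical formatter that handles every member of the union."""
    if distribution is None or isinstance(distribution, MissingValue):
        return "distribution unavailable"
    if isinstance(distribution, CategoricalDistribution):
        return (
            f"most common: {distribution.most_common_value}, "
            f"least common: {distribution.least_common_value}"
        )
    return f"range: [{distribution.min}, {distribution.max}], mean: {distribution.mean}"
```

Handling None and MissingValue first keeps the remaining branches free of attribute errors.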
class data_designer.config.analysis.column_statistics.SeedDatasetColumnStatistics(
    /,
    **data: typing.Any
)

Bases: data_designer.config.analysis.column_statistics.GeneralColumnStatistics

Container for statistics on columns sourced from seed datasets.

Inherits general statistics and stores statistics computed from columns that originate from existing data provided via the seed dataset functionality.

Parameters:

column_type

Discriminator field, always “seed-dataset” for this statistics type.

column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.SEED_DATASET.value]
class data_designer.config.analysis.column_statistics.ExpressionColumnStatistics(
    /,
    **data: typing.Any
)

Bases: data_designer.config.analysis.column_statistics.GeneralColumnStatistics

Container for statistics on expression-based derived columns.

Inherits general statistics and stores statistics computed from columns that are derived from Jinja2 expressions referencing other column values.

Parameters:

column_type

Discriminator field, always “expression” for this statistics type.

column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.EXPRESSION.value]
class data_designer.config.analysis.column_statistics.ValidationColumnStatistics(
    /,
    **data: typing.Any
)

Bases: data_designer.config.analysis.column_statistics.GeneralColumnStatistics

Container for statistics on validation result columns.

Inherits general statistics plus validation-specific metrics including the count and percentage of records that passed validation. Stores results from validation logic (Python, SQL, local callable, or remote) executed against target columns.

Parameters:

num_valid_records

Number of records that passed validation.

column_type

Discriminator field, always “validation” for this statistics type.

num_valid_records: int | data_designer.config.analysis.column_statistics.MissingValue
column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.VALIDATION.value]
code_validation_column_ensure_python_integers(v: int | data_designer.config.analysis.column_statistics.MissingValue) -> int | data_designer.config.analysis.column_statistics.MissingValue
percent_valid: float | data_designer.config.analysis.column_statistics.MissingValue
create_report_row_data() -> dict[str, str]
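
percent_valid shares the float | MissingValue type of the other percentage fields, so it presumably falls back to a sentinel when a count is unavailable. The page does not give the formula, so the sketch below rests on two assumptions: the percentage is on a 0 to 100 scale, and a missing or zero record count yields CALCULATION_FAILED. The function is a hypothetical free-standing helper, not the library's implementation.

```python
from enum import Enum
from typing import Union


class MissingValue(str, Enum):
    """Stand-in for the documented enum; not imported from the library."""

    CALCULATION_FAILED = "--"
    OUTPUT_FORMAT_ERROR = "output_format_error"


Count = Union[int, MissingValue]


def percent_valid(num_valid_records: Count, num_records: Count) -> Union[float, MissingValue]:
    """Assumed formula: share of valid records on a 0-100 scale."""
    # A sentinel in either input means the percentage cannot be computed.
    if isinstance(num_valid_records, MissingValue) or isinstance(num_records, MissingValue):
        return MissingValue.CALCULATION_FAILED
    # Guard against division by zero for empty columns.
    if num_records == 0:
        return MissingValue.CALCULATION_FAILED
    return 100.0 * num_valid_records / num_records
```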
class data_designer.config.analysis.column_statistics.CategoricalHistogramData(
    /,
    **data: typing.Any
)

Bases: pydantic.BaseModel

Container for categorical distribution histogram data.

Stores the computed frequency distribution of categorical values.

Parameters:

categories

List of unique category values that appear in the data.

counts

List of occurrence counts for each category.

categories: list[float | int | str]
counts: list[int]
ensure_python_types() -> typing_extensions.Self

Ensure numerical values are Python objects rather than Numpy types.

from_series(series: pandas.Series) -> typing_extensions.Self
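
The categories and counts fields are parallel lists, one count per category. To show the shape of that data, the sketch below builds the two lists from a plain Python list with collections.Counter. The real from_series takes a pandas.Series, and ordering the categories by descending count is an assumption of this sketch, as is the name build_histogram.

```python
from collections import Counter
from typing import Union

Category = Union[float, int, str]


def build_histogram(values: list[Category]) -> dict[str, list]:
    """Hypothetical helper: return parallel 'categories' and 'counts' lists."""
    # most_common() sorts by descending count; ties keep first-seen order.
    pairs = Counter(values).most_common()
    return {
        "categories": [category for category, _ in pairs],
        "counts": [count for _, count in pairs],
    }
```

For the input ["a", "b", "a", "c", "a", "b"], the result pairs "a" with 3, "b" with 2, and "c" with 1.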
class data_designer.config.analysis.column_statistics.CategoricalDistribution(
    /,
    **data: typing.Any
)

Bases: pydantic.BaseModel

Container for computed categorical distribution statistics.

Parameters:

most_common_value

The category value that appears most frequently in the data.

least_common_value

The category value that appears least frequently in the data.

histogram

Complete frequency distribution showing all categories and their counts.

most_common_value: str | int
least_common_value: str | int
histogram: data_designer.config.analysis.column_statistics.CategoricalHistogramData
ensure_python_types(v: str | int) -> str | int
from_series(series: pandas.Series) -> typing_extensions.Self
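
most_common_value and least_common_value can both be read off a histogram's parallel lists. The following sketch does that for plain lists. The function name is hypothetical, and how the library breaks ties between equally frequent categories is not stated here, so the sketch simply takes the first match.

```python
from typing import Union

Category = Union[str, int]


def most_and_least_common(
    categories: list[Category], counts: list[int]
) -> tuple[Category, Category]:
    """Hypothetical helper: pick the categories with the highest and lowest counts."""
    if not categories or len(categories) != len(counts):
        raise ValueError("categories and counts must be non-empty and equal in length")
    indices = range(len(counts))
    # max/min return the first index on ties, so earlier categories win.
    most = categories[max(indices, key=counts.__getitem__)]
    least = categories[min(indices, key=counts.__getitem__)]
    return most, least
```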
class data_designer.config.analysis.column_statistics.NumericalDistribution(
    /,
    **data: typing.Any
)

Bases: pydantic.BaseModel

Container for computed numerical distribution statistics.

Parameters:

min

Minimum value in the distribution.

max

Maximum value in the distribution.

mean

Arithmetic mean (average) of all values.

stddev

Standard deviation measuring the spread of values around the mean.

median

Median value of the distribution.

min: float | int
max: float | int
mean: float
stddev: float
median: float
ensure_python_types(v: float | int) -> float | int
from_series(series: pandas.Series) -> typing_extensions.Self
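
The five fields can be illustrated with the standard library alone. The sketch below computes them from a list of numbers. The real from_series works on a pandas.Series, and whether the library reports a sample or a population standard deviation is not stated on this page, so the use of the population form here is an assumption. summarize_numbers is a hypothetical name.

```python
import statistics
from typing import Union

Number = Union[float, int]


def summarize_numbers(values: list[Number]) -> dict[str, Number]:
    """Hypothetical helper returning the five documented summary fields."""
    if not values:
        raise ValueError("values must not be empty")
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        # Population standard deviation; an assumption of this sketch.
        "stddev": statistics.pstdev(values),
        "median": float(statistics.median(values)),
    }
```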
ColumnStatisticsT: typing_extensions.TypeAlias

DEFAULT_COLUMN_STATISTICS_MAP
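
Each statistics class on this page carries a column_type discriminator with a fixed value, which makes it possible to choose the right class from serialized data. The contents of DEFAULT_COLUMN_STATISTICS_MAP are not shown here, so the sketch below is only an illustration of discriminator-based lookup. The registry STATISTICS_BY_COLUMN_TYPE, the simplified dataclasses, and load_statistics are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneralStats:
    """Simplified stand-in for a general statistics container."""

    column_name: str
    column_type: str = "general"


@dataclass
class SamplerStats(GeneralStats):
    """Simplified stand-in whose discriminator defaults to 'sampler'."""

    column_type: str = "sampler"


# Hypothetical registry keyed by the discriminator value.
STATISTICS_BY_COLUMN_TYPE = {
    "general": GeneralStats,
    "sampler": SamplerStats,
}


def load_statistics(payload: dict) -> GeneralStats:
    """Pick the class from the payload's column_type and build an instance."""
    column_type = payload.get("column_type")
    try:
        statistics_cls = STATISTICS_BY_COLUMN_TYPE[column_type]
    except KeyError:
        raise ValueError(f"unknown column_type: {column_type!r}") from None
    return statistics_cls(**payload)
```

With pydantic models, a discriminated union over the column_type literals would serve the same purpose without a hand-written registry.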