# data\_designer.config.analysis.column\_statistics

## Module Contents

### Classes

| Name                                                                                                          | Description                                                                 |
| ------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| [`MissingValue`](#data_designerconfiganalysiscolumn_statisticsmissingvalue)                                   | String enum of sentinels for statistics that could not be computed.         |
| [`ColumnDistributionType`](#data_designerconfiganalysiscolumn_statisticscolumndistributiontype)               | String enum classifying a column's value distribution.                      |
| [`BaseColumnStatistics`](#data_designerconfiganalysiscolumn_statisticsbasecolumnstatistics)                   | Abstract base class for all column statistics types.                        |
| [`GeneralColumnStatistics`](#data_designerconfiganalysiscolumn_statisticsgeneralcolumnstatistics)             | Container for general statistics applicable to all column types.            |
| [`LLMTextColumnStatistics`](#data_designerconfiganalysiscolumn_statisticsllmtextcolumnstatistics)             | Container for statistics on LLM-generated text columns.                     |
| [`LLMCodeColumnStatistics`](#data_designerconfiganalysiscolumn_statisticsllmcodecolumnstatistics)             | Container for statistics on LLM-generated code columns.                     |
| [`LLMStructuredColumnStatistics`](#data_designerconfiganalysiscolumn_statisticsllmstructuredcolumnstatistics) | Container for statistics on LLM-generated structured JSON columns.          |
| [`LLMJudgedColumnStatistics`](#data_designerconfiganalysiscolumn_statisticsllmjudgedcolumnstatistics)         | Container for statistics on LLM-as-a-judge quality assessment columns.      |
| [`SamplerColumnStatistics`](#data_designerconfiganalysiscolumn_statisticssamplercolumnstatistics)             | Container for statistics on sampler-generated columns.                      |
| [`SeedDatasetColumnStatistics`](#data_designerconfiganalysiscolumn_statisticsseeddatasetcolumnstatistics)     | Container for statistics on columns sourced from seed datasets.             |
| [`ExpressionColumnStatistics`](#data_designerconfiganalysiscolumn_statisticsexpressioncolumnstatistics)       | Container for statistics on expression-based derived columns.               |
| [`ValidationColumnStatistics`](#data_designerconfiganalysiscolumn_statisticsvalidationcolumnstatistics)       | Container for statistics on validation result columns.                      |
| [`CategoricalHistogramData`](#data_designerconfiganalysiscolumn_statisticscategoricalhistogramdata)           | Container for categorical distribution histogram data.                      |
| [`CategoricalDistribution`](#data_designerconfiganalysiscolumn_statisticscategoricaldistribution)             | Container for computed categorical distribution statistics.                 |
| [`NumericalDistribution`](#data_designerconfiganalysiscolumn_statisticsnumericaldistribution)                 | Container for computed numerical distribution statistics.                   |

### Data

[`ColumnStatisticsT`](#data_designerconfiganalysiscolumn_statisticscolumnstatisticst)
[`DEFAULT_COLUMN_STATISTICS_MAP`](#data_designerconfiganalysiscolumn_statisticsdefault_column_statistics_map)

### API

```python
class data_designer.config.analysis.column_statistics.MissingValue
```

**Bases**: `str`, `enum.Enum`

```python
CALCULATION_FAILED = "--"
```

```python
OUTPUT_FORMAT_ERROR = "output_format_error"
```
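Because `MissingValue` derives from both `str` and `enum.Enum`, its members compare equal to their string values. The following is a minimal stand-in sketch using only the standard library: it re-declares the enum locally from the members listed above instead of importing the real class, and `format_stat` is a hypothetical helper, not part of the library.

```python
from __future__ import annotations

from enum import Enum


class MissingValue(str, Enum):
    """Local stand-in mirroring the documented members (not the library class)."""

    CALCULATION_FAILED = "--"
    OUTPUT_FORMAT_ERROR = "output_format_error"


def format_stat(value: int | float | MissingValue) -> str:
    """Hypothetical helper: render a statistic, passing sentinels through as text."""
    if isinstance(value, MissingValue):
        return value.value
    # Numbers get a thousands separator for readability.
    return f"{value:,}"


print(format_stat(12345))
print(format_stat(MissingValue.CALCULATION_FAILED))
```

Checking for the sentinel with `isinstance` before doing arithmetic keeps fields typed as `int | MissingValue` safe to consume.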

```python
class data_designer.config.analysis.column_statistics.ColumnDistributionType
```

**Bases**: `str`, `enum.Enum`

```python
CATEGORICAL = "categorical"
```

```python
NUMERICAL = "numerical"
```

```python
TEXT = "text"
```

```python
OTHER = "other"
```

```python
UNKNOWN = "unknown"
```

```python
class data_designer.config.analysis.column_statistics.BaseColumnStatistics(
    /,
    **data: typing.Any
)
```

**Bases**: `pydantic.BaseModel`, `abc.ABC`

Abstract base class for all column statistics types.

Serves as a container for computed statistics across different column types in
Data-Designer-generated datasets. Subclasses hold column-specific statistical results
and provide methods for formatting these results for display in reports.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
model_config = ConfigDict(...)
```

```python
create_report_row_data() -> dict[str, str]
```

Creates a formatted dictionary of statistics for display in reports.

**Returns:**

`dict[str, str]`

Dictionary mapping display labels to formatted statistic values.

```python
class data_designer.config.analysis.column_statistics.GeneralColumnStatistics(
    /,
    **data: typing.Any
)
```

**Bases**: `data_designer.config.analysis.column_statistics.BaseColumnStatistics`

Container for general statistics applicable to all column types.

Holds core statistical measures that apply universally across all column types,
including null counts, unique values, and data type information. Serves as the base
for more specialized column statistics classes that store additional column-specific metrics.

**Parameters:**

`column_name`: Name of the column being analyzed.

`num_records`: Total number of records in the column.

`num_null`: Number of null/missing values in the column.

`num_unique`: Number of distinct values in the column. If a value is not hashable, it is converted to a string.

`pyarrow_dtype`: PyArrow data type of the column as a string.

`simple_dtype`: Simplified human-readable data type label.

`column_type`: Discriminator field, always "general" for this statistics type.

**Attributes:**

`column_name`: Name of the column being analyzed.

`num_records`: Total number of records in the column.

`num_null`: Number of null/missing values in the column.

`num_unique`: Number of distinct values in the column. If a value is not hashable, it is converted to a string.

`pyarrow_dtype`: PyArrow data type of the column as a string.

`simple_dtype`: Simplified human-readable data type label.

`column_type`: Discriminator field, always "general" for this statistics type.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
column_name: str
```

```python
num_records: int | data_designer.config.analysis.column_statistics.MissingValue
```

```python
num_null: int | data_designer.config.analysis.column_statistics.MissingValue
```

```python
num_unique: int | data_designer.config.analysis.column_statistics.MissingValue
```

```python
pyarrow_dtype: str
```

```python
simple_dtype: str
```

```python
column_type: typing.Literal["general"] = "general"
```

```python
general_statistics_ensure_python_integers(v: int | data_designer.config.analysis.column_statistics.MissingValue) -> int | data_designer.config.analysis.column_statistics.MissingValue
```

```python
percent_null: float | data_designer.config.analysis.column_statistics.MissingValue
```

```python
percent_unique: float | data_designer.config.analysis.column_statistics.MissingValue
```

```python
_general_display_row: dict[str, str]
```

```python
create_report_row_data() -> dict[str, str]
```

```python
_is_missing_value(v: float | int | data_designer.config.analysis.column_statistics.MissingValue) -> bool
```
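This page lists `percent_null` and `percent_unique` without stating how they are computed. One plausible derivation, shown purely as an assumption, is a percentage of `num_records` on a 0 to 100 scale that falls back to a sentinel when an input is missing or the column is empty. The `percent_of` function and the local `MissingValue` stand-in below are illustrative and are not the library's code.

```python
from __future__ import annotations

from enum import Enum


class MissingValue(str, Enum):
    """Local stand-in mirroring the documented members (not the library class)."""

    CALCULATION_FAILED = "--"
    OUTPUT_FORMAT_ERROR = "output_format_error"


def percent_of(
    part: int | MissingValue, total: int | MissingValue
) -> float | MissingValue:
    """Hypothetical helper: percentage of `part` in `total`, or a sentinel.

    Assumption: a 0-100 scale and CALCULATION_FAILED as the fallback value.
    """
    if isinstance(part, MissingValue) or isinstance(total, MissingValue):
        return MissingValue.CALCULATION_FAILED
    if total == 0:
        # Avoid dividing by zero for an empty column.
        return MissingValue.CALCULATION_FAILED
    return 100.0 * part / total


# Example: 5 nulls out of 20 records.
print(percent_of(5, 20))
```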

```python
class data_designer.config.analysis.column_statistics.LLMTextColumnStatistics(
    /,
    **data: typing.Any
)
```

**Bases**: `data_designer.config.analysis.column_statistics.GeneralColumnStatistics`

Container for statistics on LLM-generated text columns.

Inherits general statistics plus token usage metrics specific to LLM text generation.
Stores both prompt and completion token consumption data.

**Parameters:**

`output_tokens_mean`: Mean number of output tokens generated per record.

`output_tokens_median`: Median number of output tokens generated per record.

`output_tokens_stddev`: Standard deviation of output tokens per record.

`input_tokens_mean`: Mean number of input tokens used per record.

`input_tokens_median`: Median number of input tokens used per record.

`input_tokens_stddev`: Standard deviation of input tokens per record.

`column_type`: Discriminator field, always "llm-text" for this statistics type.

**Attributes:**

`output_tokens_mean`: Mean number of output tokens generated per record.

`output_tokens_median`: Median number of output tokens generated per record.

`output_tokens_stddev`: Standard deviation of output tokens per record.

`input_tokens_mean`: Mean number of input tokens used per record.

`input_tokens_median`: Median number of input tokens used per record.

`input_tokens_stddev`: Standard deviation of input tokens per record.

`column_type`: Discriminator field, always "llm-text" for this statistics type.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
output_tokens_mean: float | data_designer.config.analysis.column_statistics.MissingValue
```

```python
output_tokens_median: float | data_designer.config.analysis.column_statistics.MissingValue
```

```python
output_tokens_stddev: float | data_designer.config.analysis.column_statistics.MissingValue
```

```python
input_tokens_mean: float | data_designer.config.analysis.column_statistics.MissingValue
```

```python
input_tokens_median: float | data_designer.config.analysis.column_statistics.MissingValue
```

```python
input_tokens_stddev: float | data_designer.config.analysis.column_statistics.MissingValue
```

```python
column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.LLM_TEXT.value]
```

```python
llm_column_ensure_python_floats(v: float | int | data_designer.config.analysis.column_statistics.MissingValue) -> float | int | data_designer.config.analysis.column_statistics.MissingValue
```

```python
create_report_row_data() -> dict[str, typing.Any]
```

```python
class data_designer.config.analysis.column_statistics.LLMCodeColumnStatistics(
    /,
    **data: typing.Any
)
```

**Bases**: `data_designer.config.analysis.column_statistics.LLMTextColumnStatistics`

Container for statistics on LLM-generated code columns.

Inherits all token usage metrics from LLMTextColumnStatistics. Stores
statistics from columns that generate code snippets in specific programming languages.

**Parameters:**

`column_type`: Discriminator field, always "llm-code" for this statistics type.

**Attributes:**

`column_type`: Discriminator field, always "llm-code" for this statistics type.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.LLM_CODE.value]
```

```python
class data_designer.config.analysis.column_statistics.LLMStructuredColumnStatistics(
    /,
    **data: typing.Any
)
```

**Bases**: `data_designer.config.analysis.column_statistics.LLMTextColumnStatistics`

Container for statistics on LLM-generated structured JSON columns.

Inherits all token usage metrics from LLMTextColumnStatistics. Stores statistics from
columns that generate structured data conforming to JSON schemas or Pydantic models.

**Parameters:**

`column_type`: Discriminator field, always "llm-structured" for this statistics type.

**Attributes:**

`column_type`: Discriminator field, always "llm-structured" for this statistics type.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.LLM_STRUCTURED.value]
```

```python
class data_designer.config.analysis.column_statistics.LLMJudgedColumnStatistics(
    /,
    **data: typing.Any
)
```

**Bases**: `data_designer.config.analysis.column_statistics.LLMTextColumnStatistics`

Container for statistics on LLM-as-a-judge quality assessment columns.

Inherits all token usage metrics from LLMTextColumnStatistics. Stores statistics from
columns that evaluate and score other generated content based on defined criteria.

**Parameters:**

`column_type`: Discriminator field, always "llm-judge" for this statistics type.

**Attributes:**

`column_type`: Discriminator field, always "llm-judge" for this statistics type.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.LLM_JUDGE.value]
```
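The four LLM statistics classes share the same token fields and differ only in their `column_type` discriminator. As a minimal sketch, assuming statistics have already been serialized to plain dictionaries (for instance with Pydantic's `model_dump()`), the discriminator strings documented on this page are enough to separate LLM columns from the rest. `split_by_family`, `LLM_COLUMN_TYPES`, and the sample rows are hypothetical.

```python
from __future__ import annotations

# Discriminator values documented for the LLM statistics classes.
LLM_COLUMN_TYPES = frozenset({"llm-text", "llm-code", "llm-structured", "llm-judge"})


def split_by_family(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Hypothetical helper: partition serialized statistics into LLM and non-LLM rows."""
    llm_rows: list[dict] = []
    other_rows: list[dict] = []
    for row in rows:
        if row["column_type"] in LLM_COLUMN_TYPES:
            llm_rows.append(row)
        else:
            other_rows.append(row)
    return llm_rows, other_rows


# Made-up rows that only carry the fields needed for the illustration.
sample_rows = [
    {"column_name": "summary", "column_type": "llm-text"},
    {"column_name": "age", "column_type": "sampler"},
    {"column_name": "quality", "column_type": "llm-judge"},
]
llm_rows, other_rows = split_by_family(sample_rows)
print([row["column_name"] for row in llm_rows])
```

When working with model instances rather than dictionaries, `isinstance(stats, LLMTextColumnStatistics)` would cover the same four classes, since the other three inherit from it.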

```python
class data_designer.config.analysis.column_statistics.SamplerColumnStatistics(
    /,
    **data: typing.Any
)
```

**Bases**: `data_designer.config.analysis.column_statistics.GeneralColumnStatistics`

Container for statistics on sampler-generated columns.

Inherits general statistics plus sampler-specific information including the sampler type
used and the empirical distribution of generated values. Stores both categorical and
numerical distribution results.

**Parameters:**

`sampler_type`: Type of sampler used to generate this column (e.g., "uniform", "category",
"gaussian", "person").

`distribution_type`: Classification of the column's distribution (categorical, numerical,
text, other, or unknown).

`distribution`: Empirical distribution statistics for the generated values. Can be
`CategoricalDistribution` (for discrete values), `NumericalDistribution` (for continuous
values), or `MissingValue` if the distribution could not be computed.

`column_type`: Discriminator field, always "sampler" for this statistics type.

**Attributes:**

`sampler_type`: Type of sampler used to generate this column (e.g., "uniform", "category",
"gaussian", "person").

`distribution_type`: Classification of the column's distribution (categorical, numerical,
text, other, or unknown).

`distribution`: Empirical distribution statistics for the generated values. Can be
`CategoricalDistribution` (for discrete values), `NumericalDistribution` (for continuous
values), or `MissingValue` if the distribution could not be computed.

`column_type`: Discriminator field, always "sampler" for this statistics type.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
sampler_type: data_designer.config.sampler_params.SamplerType
```

```python
distribution_type: data_designer.config.analysis.column_statistics.ColumnDistributionType
```

```python
distribution: CategoricalDistribution | NumericalDistribution | data_designer.config.analysis.column_statistics.MissingValue | None
```

```python
column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.SAMPLER.value]
```

```python
create_report_row_data() -> dict[str, str]
```

```python
class data_designer.config.analysis.column_statistics.SeedDatasetColumnStatistics(
    /,
    **data: typing.Any
)
```

**Bases**: `data_designer.config.analysis.column_statistics.GeneralColumnStatistics`

Container for statistics on columns sourced from seed datasets.

Inherits general statistics and stores statistics computed from columns that originate
from existing data provided via the seed dataset functionality.

**Parameters:**

`column_type`: Discriminator field, always "seed-dataset" for this statistics type.

**Attributes:**

`column_type`: Discriminator field, always "seed-dataset" for this statistics type.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.SEED_DATASET.value]
```

```python
class data_designer.config.analysis.column_statistics.ExpressionColumnStatistics(
    /,
    **data: typing.Any
)
```

**Bases**: `data_designer.config.analysis.column_statistics.GeneralColumnStatistics`

Container for statistics on expression-based derived columns.

Inherits general statistics and stores statistics computed from columns that are derived
from Jinja2 expressions referencing other column values.

**Parameters:**

`column_type`: Discriminator field, always "expression" for this statistics type.

**Attributes:**

`column_type`: Discriminator field, always "expression" for this statistics type.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.EXPRESSION.value]
```

```python
class data_designer.config.analysis.column_statistics.ValidationColumnStatistics(
    /,
    **data: typing.Any
)
```

**Bases**: `data_designer.config.analysis.column_statistics.GeneralColumnStatistics`

Container for statistics on validation result columns.

Inherits general statistics plus validation-specific metrics including the count and
percentage of records that passed validation. Stores results from validation logic
(Python, SQL, local callable, or remote) executed against target columns.

**Parameters:**

`num_valid_records`: Number of records that passed validation.

`column_type`: Discriminator field, always "validation" for this statistics type.

**Attributes:**

`num_valid_records`: Number of records that passed validation.

`column_type`: Discriminator field, always "validation" for this statistics type.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
num_valid_records: int | data_designer.config.analysis.column_statistics.MissingValue
```

```python
column_type: typing.Literal[data_designer.config.column_types.DataDesignerColumnType.VALIDATION.value]
```

```python
code_validation_column_ensure_python_integers(v: int | data_designer.config.analysis.column_statistics.MissingValue) -> int | data_designer.config.analysis.column_statistics.MissingValue
```

```python
percent_valid: float | data_designer.config.analysis.column_statistics.MissingValue
```

```python
create_report_row_data() -> dict[str, str]
```

```python
class data_designer.config.analysis.column_statistics.CategoricalHistogramData(
    /,
    **data: typing.Any
)
```

**Bases**: `pydantic.BaseModel`

Container for categorical distribution histogram data.

Stores the computed frequency distribution of categorical values.

**Parameters:**

`categories`: List of unique category values that appear in the data.

`counts`: List of occurrence counts for each category.

**Attributes:**

`categories`: List of unique category values that appear in the data.

`counts`: List of occurrence counts for each category.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
categories: list[float | int | str]
```

```python
counts: list[int]
```

```python
ensure_python_types() -> typing_extensions.Self
```

Ensure numerical values are Python objects rather than Numpy types.

```python
from_series(series: pandas.Series) -> typing_extensions.Self
```
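`from_series` is listed here without a description. As a rough illustration of the `categories`/`counts` layout only, the sketch below builds the two parallel lists with `collections.Counter` from a plain list rather than a `pandas.Series`. Ordering by descending count is an assumption, and `build_histogram` is a hypothetical name, not the library's implementation.

```python
from __future__ import annotations

from collections import Counter


def build_histogram(values: list[str | int | float]) -> tuple[list, list[int]]:
    """Hypothetical helper: return parallel (categories, counts) lists.

    Assumption: categories are ordered from most to least frequent.
    """
    ranked = Counter(values).most_common()
    categories = [category for category, _ in ranked]
    counts = [count for _, count in ranked]
    return categories, counts


categories, counts = build_histogram(["a", "b", "a", "c", "a", "b"])
print(categories, counts)
```

Keeping `categories` and `counts` as index-aligned lists means `counts[i]` is always the frequency of `categories[i]`.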

```python
class data_designer.config.analysis.column_statistics.CategoricalDistribution(
    /,
    **data: typing.Any
)
```

**Bases**: `pydantic.BaseModel`

Container for computed categorical distribution statistics.

**Parameters:**

`most_common_value`: The category value that appears most frequently in the data.

`least_common_value`: The category value that appears least frequently in the data.

`histogram`: Complete frequency distribution showing all categories and their counts.

**Attributes:**

`most_common_value`: The category value that appears most frequently in the data.

`least_common_value`: The category value that appears least frequently in the data.

`histogram`: Complete frequency distribution showing all categories and their counts.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
most_common_value: str | int
```

```python
least_common_value: str | int
```

```python
histogram: data_designer.config.analysis.column_statistics.CategoricalHistogramData
```

```python
ensure_python_types(v: str | int) -> str | int
```

```python
from_series(series: pandas.Series) -> typing_extensions.Self
```
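To make `most_common_value` and `least_common_value` concrete, here is a small standard-library sketch that picks both from a list of values. How the library breaks ties is not stated on this page, so the behavior below (first-seen order among equal counts) is simply what `Counter.most_common` does; `most_and_least_common` is a hypothetical name.

```python
from __future__ import annotations

from collections import Counter


def most_and_least_common(values: list[str | int]) -> tuple[str | int, str | int]:
    """Hypothetical helper: return the most and least frequent values."""
    if not values:
        raise ValueError("values must not be empty")
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = Counter(values).most_common()
    return ranked[0][0], ranked[-1][0]


print(most_and_least_common(["red", "blue", "red", "green", "red", "blue"]))
```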

```python
class data_designer.config.analysis.column_statistics.NumericalDistribution(
    /,
    **data: typing.Any
)
```

**Bases**: `pydantic.BaseModel`

Container for computed numerical distribution statistics.

**Parameters:**

`min`: Minimum value in the distribution.

`max`: Maximum value in the distribution.

`mean`: Arithmetic mean (average) of all values.

`stddev`: Standard deviation measuring the spread of values around the mean.

`median`: Median value of the distribution.

**Attributes:**

`min`: Minimum value in the distribution.

`max`: Maximum value in the distribution.

`mean`: Arithmetic mean (average) of all values.

`stddev`: Standard deviation measuring the spread of values around the mean.

`median`: Median value of the distribution.

**Initialization:**

Create a new model by parsing and validating input data from keyword arguments.

Raises `pydantic_core.ValidationError` if the input data cannot be
validated to form a valid model.

`self` is explicitly positional-only to allow `self` as a field name.

```python
min: float | int
```

```python
max: float | int
```

```python
mean: float
```

```python
stddev: float
```

```python
median: float
```

```python
ensure_python_types(v: float | int) -> float | int
```

```python
from_series(series: pandas.Series) -> typing_extensions.Self
```
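The five fields above correspond to standard summary statistics. The sketch below computes them with Python's `statistics` module from a plain list instead of a `pandas.Series`. It uses the sample standard deviation (which is also the default of `pandas.Series.std`), but whether the library makes the same choice is an assumption; `summarize` is a hypothetical name.

```python
from __future__ import annotations

import statistics


def summarize(values: list[float | int]) -> dict[str, float | int]:
    """Hypothetical helper: compute min, max, mean, stddev, and median.

    Assumption: sample standard deviation (n - 1 denominator), which
    requires at least two values.
    """
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "stddev": statistics.stdev(values),
        "median": float(statistics.median(values)),
    }


print(summarize([1, 2, 3, 4]))
```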

```python
DEFAULT_COLUMN_STATISTICS_MAP
```