Fingerprint | NVIDIA NeMo Data Designer

Deterministic content-addressable fingerprint for a workflow config.

The fingerprint identifies the data-relevant portion of a DataDesignerConfig. Two configs that produce the same dataset hash to the same value. Configs that differ only in environment, runtime, or post-generation analysis settings hash to the same value when the difference does not affect the generated data, and to different values when it does.

The hash is computed over a canonical JSON dump of the config (Pydantic model_dump(mode="json")) with non-identity fields removed. Column order is part of identity (DAG ordering); alias-keyed lookup tables (model_configs, tool_configs) are sorted by alias so their internal order is irrelevant. Empty/None optional collections are canonicalized to a single representation so that builder-API and YAML-loaded configs producing identical datasets fingerprint identically.
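The normalization described above can be sketched on plain dicts. This is a simplified illustration, not the module's implementation: `canonical_hash`, `EXCLUDED_KEYS`, `OPTIONAL_COLLECTIONS`, and the `analysis` key are hypothetical names, and the real exclusion sets live in the module-level constants.

```python
import hashlib
import json
from typing import Any

# Hypothetical stand-ins for the module's exclusion and optional-collection sets.
EXCLUDED_KEYS = frozenset({"analysis"})
OPTIONAL_COLLECTIONS = frozenset({"constraints", "processors"})


def canonical_hash(config_dict: dict[str, Any]) -> str:
    """Hash a plain-dict config after a simplified normalization pass."""
    normalized: dict[str, Any] = {}
    for key, value in config_dict.items():
        if key in EXCLUDED_KEYS:
            continue  # non-identity field: never reaches the hash
        if key in OPTIONAL_COLLECTIONS and (value is None or value == []):
            continue  # None and [] both collapse to "absent"
        normalized[key] = value
    # Alias-keyed lookup table: sort so its internal order is irrelevant.
    if normalized.get("model_configs"):
        normalized["model_configs"] = sorted(
            normalized["model_configs"], key=lambda m: m["alias"]
        )
    # Columns keep their given order, since order is part of identity.
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


cols = [{"name": "x"}, {"name": "y"}]
a = {"columns": cols, "model_configs": [{"alias": "m1"}, {"alias": "m2"}], "constraints": None}
b = {"columns": cols, "model_configs": [{"alias": "m2"}, {"alias": "m1"}], "constraints": [], "analysis": {"on": True}}
c = {**a, "columns": list(reversed(cols))}

assert canonical_hash(a) == canonical_hash(b)  # alias order, None vs [], excluded key
assert canonical_hash(a) != canonical_hash(c)  # column order changes identity
```

Serializing with `sort_keys=True` and fixed separators keeps the byte payload stable regardless of dict insertion order, which is what makes the digest reproducible across processes.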

The normalization scheme is versioned via CONFIG_HASH_VERSION. Persist the version alongside the hash so future scheme changes can be detected as “unknown identity” rather than “definite mismatch”.
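One way a consumer might act on the persisted version is a three-way comparison. The sketch below assumes the stored record uses the keys returned by `fingerprint_config` (`config_hash`, `config_hash_algo`, `config_hash_version`); the `Identity` enum and `compare_fingerprints` helper are hypothetical and not part of the module.

```python
from enum import Enum
from typing import Any


class Identity(Enum):
    SAME = "same"
    DIFFERENT = "different"
    UNKNOWN = "unknown"


def compare_fingerprints(stored: dict[str, Any], current: dict[str, Any]) -> Identity:
    """Compare two fingerprint records, treating a scheme change as unknown."""
    for key in ("config_hash_version", "config_hash_algo"):
        if stored.get(key) != current.get(key):
            # Hashes from different schemes are not comparable at all.
            return Identity.UNKNOWN
    if stored.get("config_hash") == current.get("config_hash"):
        return Identity.SAME
    return Identity.DIFFERENT


current = {"config_hash": "sha256:aaa", "config_hash_algo": "sha256", "config_hash_version": 1}
legacy = {**current, "config_hash": "sha256:bbb", "config_hash_version": 0}

assert compare_fingerprints(legacy, current) is Identity.UNKNOWN
```

Checking the version before the hash means a record written under an older scheme is never reported as a definite mismatch.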

Module Contents

Functions

Name	Description
`fingerprint_config`	Compute a deterministic fingerprint of a workflow config.
`_drop_keys`	(no description)
`_drop_empty_optional`	Drop keys whose value is `None` or an empty list.
`_normalize_model_config`	(no description)
`_normalize_tool_config`	(no description)
`_normalize_seed_config`	(no description)
`_enrich_custom_columns`	Replace each custom column’s serialized `generator_function` (just the bare `__name__`) with a richer identity dict that includes `__qualname__`, `__module__`, and the `@custom_column_generator()` decorator metadata.
`_normalize_config_dict`	(no description)

Data

CONFIG_HASH_VERSION
CONFIG_HASH_ALGO
_EXCLUDED_TOP_LEVEL_KEYS
_EXCLUDED_MODEL_KEYS
_EXCLUDED_INFERENCE_KEYS
_EXCLUDED_TOOL_CONFIG_KEYS
_EXCLUDED_HF_SEED_KEYS
_TOP_LEVEL_OPTIONAL_COLLECTIONS
_TOOL_CONFIG_OPTIONAL_COLLECTIONS

API

CONFIG_HASH_VERSION = 1

CONFIG_HASH_ALGO = sha256

_EXCLUDED_TOP_LEVEL_KEYS

frozenset[str], defaults to frozenset(...)

_EXCLUDED_MODEL_KEYS

frozenset[str], defaults to frozenset(...)

_EXCLUDED_INFERENCE_KEYS

frozenset[str], defaults to frozenset(...)

_EXCLUDED_TOOL_CONFIG_KEYS

frozenset[str], defaults to frozenset(...)

_EXCLUDED_HF_SEED_KEYS

frozenset[str], defaults to frozenset(...)

_TOP_LEVEL_OPTIONAL_COLLECTIONS

frozenset[str], defaults to frozenset(...)

_TOOL_CONFIG_OPTIONAL_COLLECTIONS

frozenset[str], defaults to frozenset(...)

data_designer.config.fingerprint.fingerprint_config(config: data_designer.config.data_designer_config.DataDesignerConfig) -> dict[str, str | int]

Compute a deterministic fingerprint of a workflow config.

The fingerprint is content-addressable: identical configs (modulo excluded fields) produce identical hashes across processes, Python versions, and module load orders. Changing any identity-relevant field changes the hash; changing an excluded field does not.

Identity-relevant fields:

columns - names, types, generator params, processors, validators, skip/drop flags. Column order is part of identity (DAG ordering).
model_configs - alias, model, provider, sampling-relevant inference params (temperature, top_p, max_tokens, extra_body). Sorted by alias.
tool_configs - alias, providers, allow_tools, max_tool_call_turns (the set of MCP tools shapes generation). Sorted by tool_alias.
seed_config - source path, sampling strategy, selection strategy.
constraints, top-level processors.

See module-level constants for the canonical excluded-fields table.

Custom column generators contribute their function’s __name__, __qualname__, __module__, generator_params, and the decorator metadata set by @custom_column_generator() (required_columns, side_effect_columns, model_aliases).

Limitation: closures captured via factory functions (e.g. make_gen(factor) returning a gen whose body references factor) share __name__, __qualname__, __module__, and source text, so two closures with different captured state will fingerprint identically. The fingerprint cannot see closure cell values.
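The limitation is easy to reproduce with plain Python, independent of the library. In this sketch, `make_gen` follows the example named above, while `function_identity` is a hypothetical helper that collects only the attributes the docstring lists.

```python
from typing import Callable


def make_gen(factor: int) -> Callable[[int], int]:
    """Factory returning a closure that captures `factor`."""

    def gen(value: int) -> int:
        return value * factor

    return gen


def function_identity(fn: Callable) -> tuple[str, str, str]:
    """Collect the name-based attributes visible without inspecting closure cells."""
    return (fn.__name__, fn.__qualname__, fn.__module__)


double = make_gen(2)
triple = make_gen(3)

# Same name, qualname, and module, so a name-based identity cannot tell them apart.
assert function_identity(double) == function_identity(triple)

# The behavior differs, and the difference lives only in the closure cells.
assert double(10) != triple(10)
assert double.__closure__[0].cell_contents != triple.__closure__[0].cell_contents
```

When captured state matters for dataset identity, passing it through `generator_params` instead of a closure would make it visible, since `generator_params` is listed among the contributing fields.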

Parameters:

config

data_designer.config.data_designer_config.DataDesignerConfig

The workflow config to fingerprint.

Returns:

dict[str, str | int]

A dict with config_hash ("sha256:..."), config_hash_algo, and config_hash_version suitable for embedding in dataset metadata.

data_designer.config.fingerprint._drop_keys(
    source: dict[str, typing.Any],
    keys: collections.abc.Iterable[str]
) -> dict[str, typing.Any]

data_designer.config.fingerprint._drop_empty_optional(
    source: dict[str, typing.Any],
    keys: collections.abc.Iterable[str]
) -> dict[str, typing.Any]

Drop keys whose value is None or an empty list.

None and [] are user-equivalent for optional collection fields; this collapses both to “absent” before hashing.
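A minimal sketch of the behavior described, written against the documented signature. Whether the real helper copies or mutates its input is not stated, so returning a new dict here is an assumption, and `drop_empty_optional_sketch` is a hypothetical name.

```python
from collections.abc import Iterable
from typing import Any


def drop_empty_optional_sketch(source: dict[str, Any], keys: Iterable[str]) -> dict[str, Any]:
    """Return a copy of `source` without listed keys whose value is None or []."""
    candidates = set(keys)

    def is_empty(value: Any) -> bool:
        return value is None or (isinstance(value, list) and not value)

    return {
        key: value
        for key, value in source.items()
        if not (key in candidates and is_empty(value))
    }


builder_style = {"columns": ["a"], "constraints": None, "processors": []}
yaml_style = {"columns": ["a"]}

# Both spellings of "no constraints, no processors" collapse to the same dict.
assert drop_empty_optional_sketch(builder_style, ["constraints", "processors"]) == yaml_style
```

Only the listed keys are candidates for removal, so a `None` stored under any other key is kept and still influences the hash.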

data_designer.config.fingerprint._normalize_model_config(model_config: dict[str, typing.Any]) -> dict[str, typing.Any]

data_designer.config.fingerprint._normalize_tool_config(tool_config: dict[str, typing.Any]) -> dict[str, typing.Any]

data_designer.config.fingerprint._normalize_seed_config(seed_config: dict[str, typing.Any]) -> dict[str, typing.Any]

data_designer.config.fingerprint._enrich_custom_columns(
    config: data_designer.config.data_designer_config.DataDesignerConfig,
    columns_dump: list[dict[str, typing.Any]]
) -> list[dict[str, typing.Any]]

Replace each custom column’s serialized generator_function (just the bare __name__) with a richer identity dict that includes __qualname__, __module__, and the @custom_column_generator() decorator metadata.

Walks config.columns and columns_dump in lockstep so positional correspondence is reliable.

data_designer.config.fingerprint._normalize_config_dict(
    config_dict: dict[str, typing.Any],
    config: data_designer.config.data_designer_config.DataDesignerConfig
) -> dict[str, typing.Any]