data_designer.config.fingerprint
Deterministic content-addressable fingerprint for a workflow config.
The fingerprint identifies the data-relevant portion of a DataDesignerConfig
so that two configs producing the same dataset hash to the same value. Configs
that differ only in environment, runtime, or post-generation analysis settings
also hash to the same value; configs that differ in any data-relevant field
hash to different values.
The hash is computed over a canonical JSON dump of the config (Pydantic
model_dump(mode="json")) with non-identity fields removed. Column order is
part of identity (DAG ordering); alias-keyed lookup tables (model_configs,
tool_configs) are sorted by alias so their internal order is irrelevant.
Empty/None optional collections are canonicalized to a single representation
so that builder-API and YAML-loaded configs producing identical datasets
fingerprint identically.
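The canonicalization described above can be sketched roughly as follows. This is a minimal illustration and not the module's actual implementation: the function name canonical_hash, the exclusion set, and the key example_runtime_setting are hypothetical, and the input is assumed to be a plain dict such as the one returned by model_dump(mode="json").

```python
import hashlib
import json

# Hypothetical exclusion set for illustration; the real module keeps its own
# constants (for example _EXCLUDED_TOP_LEVEL_KEYS) with different contents.
EXAMPLE_EXCLUDED_TOP_LEVEL_KEYS = {"example_runtime_setting"}


def canonical_hash(dump: dict) -> str:
    """Hash a JSON-mode config dump after removing non-identity fields."""
    data = {k: v for k, v in dump.items() if k not in EXAMPLE_EXCLUDED_TOP_LEVEL_KEYS}

    # Alias-keyed lookup tables are sorted so their internal order is irrelevant.
    if data.get("model_configs"):
        data["model_configs"] = sorted(data["model_configs"], key=lambda m: m["alias"])

    # The columns list is left in its original order: order is part of identity.
    payload = json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting dict keys and fixing the separators keeps the serialized bytes stable across processes, which is what makes the digest reproducible.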
The normalization scheme is versioned via CONFIG_HASH_VERSION. Persist the
version alongside the hash so future scheme changes can be detected as
“unknown identity” rather than “definite mismatch”.
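One way a consumer might use the persisted version is sketched below. The Match enum and compare_fingerprints are hypothetical names introduced only for this example; the dict keys are the ones listed in the return value of the fingerprint function.

```python
from enum import Enum


class Match(Enum):
    SAME = "same"
    DIFFERENT = "different"
    UNKNOWN = "unknown"


def compare_fingerprints(stored: dict, current: dict) -> Match:
    """Compare two fingerprint dicts, treating scheme changes as 'unknown'."""
    same_scheme = (
        stored.get("config_hash_version") == current.get("config_hash_version")
        and stored.get("config_hash_algo") == current.get("config_hash_algo")
    )
    if not same_scheme:
        # Hashes from different normalization schemes are not comparable.
        return Match.UNKNOWN
    if stored.get("config_hash") == current.get("config_hash"):
        return Match.SAME
    return Match.DIFFERENT
```

Returning a third state instead of a boolean lets callers decide whether an unknown identity should trigger regeneration or just a warning.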
CONFIG_HASH_VERSION
CONFIG_HASH_ALGO
_EXCLUDED_TOP_LEVEL_KEYS
_EXCLUDED_MODEL_KEYS
_EXCLUDED_INFERENCE_KEYS
_EXCLUDED_TOOL_CONFIG_KEYS
_EXCLUDED_HF_SEED_KEYS
_TOP_LEVEL_OPTIONAL_COLLECTIONS
_TOOL_CONFIG_OPTIONAL_COLLECTIONS
Compute a deterministic fingerprint of a workflow config.
The fingerprint is content-addressable: identical configs (modulo excluded fields) produce identical hashes across processes, Python versions, and module load orders. Changing any identity-relevant field changes the hash; changing an excluded field does not.
Identity-relevant fields:
columns - names, types, generator params, processors, validators, skip/drop flags. Column order is part of identity (DAG ordering).
model_configs - alias, model, provider, sampling-relevant inference params (temperature, top_p, max_tokens, extra_body). Sorted by alias.
tool_configs - alias, providers, allow_tools, max_tool_call_turns (the set of MCP tools shapes generation). Sorted by tool_alias.
seed_config - source path, sampling strategy, selection strategy.
constraints, top-level processors.

See module-level constants for the canonical excluded-fields table.
Custom column generators contribute their function’s __name__,
__qualname__, __module__, generator_params, and the decorator
metadata set by @custom_column_generator() (required_columns,
side_effect_columns, model_aliases).
Limitation: closures captured via factory functions (e.g. make_gen(factor)
returning a gen whose body references factor) share __name__,
__qualname__, __module__, and source text, so two closures with
different captured state will fingerprint identically. The fingerprint
cannot see closure cell values.
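The limitation can be reproduced in plain Python without the library. In this sketch, make_gen follows the example named above, while name_identity is a hypothetical helper that collects only the name-based attributes.

```python
def make_gen(factor):
    """Factory returning a closure that captures `factor`."""
    def gen(row):
        return row["x"] * factor
    return gen


def name_identity(fn):
    """Name-based identity only; closure cell values are not included."""
    return (fn.__name__, fn.__qualname__, fn.__module__)


double = make_gen(2)
triple = make_gen(3)

# Both closures share the same name-based identity...
identities_match = name_identity(double) == name_identity(triple)

# ...even though they compute different results from the same input.
results_differ = double({"x": 5}) != triple({"x": 5})
```

A practical workaround is to pass the varying value through generator_params, which is listed above as part of the fingerprint, rather than capturing it in a closure.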
Parameters:
The workflow config to fingerprint.
Returns:
dict[str, str | int]
A dict with config_hash ("sha256:..."), config_hash_algo, and
config_hash_version suitable for embedding in dataset metadata.
Drop keys whose value is None or an empty list.
None and [] are user-equivalent for optional collection fields; this
collapses both to “absent” before hashing.
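A minimal sketch of this kind of normalization is shown below. The helper name drop_empty and its keys parameter are assumptions for illustration; the actual helper's signature is not shown here.

```python
from typing import Any, Dict, Iterable


def drop_empty(data: Dict[str, Any], keys: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `data` without the listed keys when they are None or []."""
    optional = set(keys)
    return {
        k: v
        for k, v in data.items()
        # Only the named optional collections are collapsed; other falsy
        # values such as 0 or "" are kept as they are.
        if not (k in optional and (v is None or v == []))
    }
```

With this normalization, a config that sets an optional collection to None and one that sets it to [] both serialize without the key, so they hash identically.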
Replace each custom column’s serialized generator_function (just the
bare __name__) with a richer identity dict that includes __qualname__,
__module__, and the @custom_column_generator() decorator metadata.
Walks config.columns and columns_dump in lockstep so positional
correspondence is reliable.
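The lockstep walk might look like the sketch below. The names function_identity and enrich_custom_columns are hypothetical, the column objects are assumed to expose a generator_function attribute, and the decorator metadata is left out because its storage location is not described here.

```python
def function_identity(fn) -> dict:
    """Build a richer identity dict than the bare __name__."""
    return {
        "name": fn.__name__,
        "qualname": fn.__qualname__,
        "module": fn.__module__,
    }


def enrich_custom_columns(columns, columns_dump):
    """Walk column objects and their dumped dicts pairwise, in order."""
    if len(columns) != len(columns_dump):
        raise ValueError("columns and columns_dump must have the same length")
    for column, dumped in zip(columns, columns_dump):
        fn = getattr(column, "generator_function", None)
        if callable(fn):
            # Replace the serialized bare name with the identity dict.
            dumped["generator_function"] = function_identity(fn)
    return columns_dump
```

The explicit length check guards against zip silently truncating when the two sequences fall out of step.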