data_designer.interface.data_designer
logger
_interface_runtime_initialized
DEFAULT_SECRET_RESOLVER
DEFAULT_SEED_READERS
Run one-time runtime initialization for the interface package.
Bases: data_designer.config.interface.DataDesignerInterface[data_designer.interface.results.DatasetCreationResults]
Main interface for creating datasets with Data Designer.
This class provides the primary interface for building synthetic datasets using Data Designer configurations. It manages model providers, artifact storage, and orchestrates the dataset creation and profiling processes.
Parameters:
Path where generated artifacts will be stored. If not
provided, artifacts are stored in an artifacts directory under the
current working directory.
Optional list of model providers for LLM generation. If None, uses default providers.
Resolver for handling secrets and credentials. If None, uses the default composite resolver, which checks environment variables and plaintext values.
Optional list of seed readers. If None, uses default readers.
Path to the managed assets directory. This is used to point
to the location of managed datasets and other assets used during dataset generation.
If not provided, will check for an environment variable called DATA_DESIGNER_MANAGED_ASSETS_PATH.
If the environment variable is not set, will use the default managed assets directory, which
is defined in data_designer.config.utils.constants.
Optional custom reader for person datasets. If provided, this reader will be used instead of the default local reader. This allows clients to customize how managed datasets are accessed (e.g., using custom fsspec clients for S3 or other remote storage).
Optional list of MCP provider configurations to enable tool-calling for LLM generation columns. Supports both MCPProvider (remote SSE or Streamable HTTP) and LocalStdioMCPProvider (local subprocess).
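The lookup order described above for the managed assets directory (explicit argument, then the DATA_DESIGNER_MANAGED_ASSETS_PATH environment variable, then the default) can be sketched as follows. This is only an illustration, not the library's code: the function name resolve_managed_assets_path and the default path are hypothetical, since the real default lives in data_designer.config.utils.constants and is not shown here.

```python
import os
from pathlib import Path

ENV_VAR = "DATA_DESIGNER_MANAGED_ASSETS_PATH"

# Hypothetical stand-in for the default defined in
# data_designer.config.utils.constants; the real value may differ.
HYPOTHETICAL_DEFAULT_ASSETS_PATH = Path("managed-assets")


def resolve_managed_assets_path(explicit=None, environ=None):
    """Return the explicit path, else the env var value, else the default."""
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return Path(explicit)
    from_env = environ.get(ENV_VAR)
    if from_env:
        return Path(from_env)
    return HYPOTHETICAL_DEFAULT_ASSETS_PATH
```

Passing the environment as a plain mapping keeps the sketch easy to exercise without touching the real process environment.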
Get information about the Data Designer interface.
Returns:
Any
InterfaceInfo object with information about the Data Designer interface.
Connect to a configured MCP provider and return the names of its available tools.
Parameters:
The name field of an MCP provider passed to the constructor.
Timeout in seconds for the MCP handshake. Defaults to 10.
Returns:
list[str]
A list of tool name strings exposed by the MCP server.
Raises:
If no provider with the given name was configured.
Create dataset and save results to the local artifact storage.
This method orchestrates the full dataset creation pipeline including building the dataset according to the configuration, profiling the generated data, and storing artifacts.
Parameters:
The DataDesignerConfigBuilder containing the dataset configuration (columns, constraints, seed data, etc.).
Number of records to generate.
Name of the dataset. This name is used as the dataset folder name inside the artifact path directory. If a non-empty directory with the same name already exists, the dataset is saved to a new directory whose name carries a datetime stamp. For example, if the dataset name is “awesome_dataset” and a non-empty directory with that name already exists, the dataset is saved to a new directory named “awesome_dataset_2025-01-01_12-00-00”.
Controls how interrupted runs are handled.
ResumeMode.NEVER (default): always start a fresh generation run.
ResumeMode.ALWAYS: resume from the last completed batch (sync) or row group (async). buffer_size must match the original run. num_records may be equal to or greater than the number of records already generated (so you can extend the dataset); a num_records smaller than the number of records generated so far raises DatasetGenerationError. If no checkpoint exists yet (the run was interrupted before the first batch finished), generation silently restarts from the beginning. Raises if the stored config is incompatible.
ResumeMode.IF_POSSIBLE: like ALWAYS when the current config fingerprint matches the stored config; otherwise starts a fresh run without raising an error.
In all resume modes, in-flight partial results from the interrupted run are discarded before generation continues.
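The resume rules above can be read as a small decision function. The sketch below is an assumption-laden model, not the library's implementation: ResumeMode and DatasetGenerationError are local stand-ins, plan_run is a hypothetical helper, the buffer_size check is folded into config_matches for brevity, and the error type raised for an incompatible config under ALWAYS is assumed because the text does not name it.

```python
from enum import Enum


class ResumeMode(Enum):
    """Local stand-in for the library's enum (member values are assumed)."""
    NEVER = "never"
    ALWAYS = "always"
    IF_POSSIBLE = "if_possible"


class DatasetGenerationError(Exception):
    """Local stand-in for the library's error type."""


def plan_run(mode, *, checkpoint_exists, config_matches, records_done, num_records):
    """Return ("fresh", n) or ("resume", n), where n is the records still to generate."""
    if mode is ResumeMode.NEVER:
        return ("fresh", num_records)
    if not checkpoint_exists:
        # Interrupted before the first batch finished: silently start over.
        return ("fresh", num_records)
    if not config_matches:
        if mode is ResumeMode.IF_POSSIBLE:
            return ("fresh", num_records)
        # Error type assumed; the source only says ALWAYS "raises" here.
        raise DatasetGenerationError("stored config is incompatible")
    if num_records < records_done:
        raise DatasetGenerationError("num_records is below the records already generated")
    return ("resume", num_records - records_done)
```

For example, with 50 records already checkpointed and num_records=100, ALWAYS yields ("resume", 50), while IF_POSSIBLE with a changed config yields ("fresh", 100).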
Returns:
data_designer.interface.results.DatasetCreationResults
DatasetCreationResults object with methods for loading the generated dataset, analysis results, and displaying sample records for inspection.
Raises:
If an error occurs during dataset generation.
If an error occurs during dataset profiling.
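The dataset folder naming rule described for the dataset name parameter can be sketched as below. The helper resolve_dataset_dir is hypothetical and only mirrors the documented behavior (a datetime stamp is appended when a non-empty directory of the same name exists); the timestamp format is inferred from the “awesome_dataset_2025-01-01_12-00-00” example rather than taken from the library's code.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def resolve_dataset_dir(artifact_path, dataset_name, now=None):
    """Pick the dataset directory, adding a timestamp on a name collision."""
    base = Path(artifact_path) / dataset_name
    if base.is_dir() and any(base.iterdir()):
        # Format inferred from the documented example name.
        stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
        return base.with_name(f"{dataset_name}_{stamp}")
    return base


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "awesome_dataset"
    existing.mkdir()
    (existing / "data.parquet").touch()  # make the directory non-empty
    fixed_now = datetime(2025, 1, 1, 12, 0, 0)
    fresh_name = resolve_dataset_dir(tmp, "other_dataset", now=fixed_now).name
    stamped_name = resolve_dataset_dir(tmp, "awesome_dataset", now=fixed_now).name
```

An empty directory with the same name does not trigger the timestamp in this sketch, matching the "non-empty" wording above.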
Generate preview dataset for fast iteration on your Data Designer configuration.
All preview results are stored in memory. Once you are satisfied with the preview,
use the create method to generate data at a larger scale and save results to disk.
Parameters:
The DataDesignerConfigBuilder containing the dataset configuration (columns, constraints, seed data, etc.).
Number of records to generate.
Returns:
data_designer.config.preview_results.PreviewResults
PreviewResults object with methods for inspecting the results.
Raises:
If an error occurs during preview dataset generation.
If the preview was terminated by the early-shutdown gate
with zero records produced. This error is a subclass of DataDesignerGenerationError.
If an error occurs during preview dataset profiling.
Validate the Data Designer configuration as defined by the DataDesignerConfigBuilder with the configured engine components (SecretResolver, SeedReaders, etc.).
Parameters:
The DataDesignerConfigBuilder containing the dataset configuration (columns, constraints, seed data, etc.).
Returns:
None
None if the configuration is valid.
Raises:
If the configuration is invalid.
Get the default model configurations.
Returns:
list[data_designer.config.models.ModelConfig]
List of default model configurations.
Get the default model providers.
Returns:
list[data_designer.config.models.ModelProvider]
List of default model providers.
Get the secret resolver used by this DataDesigner instance.
Returns:
Any
The SecretResolver instance handling credentials and secrets.
Get the resolved model provider registry.
Returns:
Any
The ModelProviderRegistry containing the providers and default
resolved at construction time. The default is taken from the
first user-supplied provider when model_providers was passed
to the constructor; otherwise from the YAML’s default: key
when set, falling back to the first provider in the YAML list.
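The precedence for choosing the default provider can be sketched as follows. This is a simplified model under stated assumptions: providers are represented by their names only, the YAML content is modeled as an already-parsed dict with hypothetical "default" and "providers" keys, and resolve_default_provider is not a library function.

```python
def resolve_default_provider(user_providers=None, yaml_config=None):
    """Return the default provider name following the documented precedence."""
    # 1. Providers passed to the constructor win: the first one is the default.
    if user_providers:
        return user_providers[0]
    yaml_config = yaml_config or {}
    # 2. Otherwise use the YAML's explicit default, when set.
    explicit_default = yaml_config.get("default")
    if explicit_default:
        return explicit_default
    # 3. Otherwise fall back to the first provider listed in the YAML.
    providers = yaml_config.get("providers", [])  # key name is assumed
    if not providers:
        raise ValueError("no model providers configured")
    return providers[0]
```

Note that in this sketch a user-supplied list takes precedence even when the YAML sets a default, which is how the description above reads.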
Get the runtime configuration applied to dataset generation.
Returns:
Any
The active RunConfig instance. Note that RunConfig normalizes
some fields on construction (e.g., shutdown_error_rate becomes
1.0 when disable_early_shutdown=True), so the returned
object may not exactly equal the one originally passed to
set_run_config.
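The normalization mentioned above can be illustrated with a small stand-in class. SketchRunConfig is hypothetical and models only the two fields named in the text; the default threshold of 0.5 is an arbitrary placeholder, not the library's default.

```python
from dataclasses import dataclass


@dataclass
class SketchRunConfig:
    """Hypothetical stand-in for RunConfig, limited to two fields."""
    disable_early_shutdown: bool = False
    shutdown_error_rate: float = 0.5  # placeholder default, assumed

    def __post_init__(self):
        # Mirror the documented rule: disabling early shutdown forces the
        # error-rate threshold to 1.0, overriding whatever was passed in.
        if self.disable_early_shutdown:
            self.shutdown_error_rate = 1.0


normalized = SketchRunConfig(disable_early_shutdown=True, shutdown_error_rate=0.2)
untouched = SketchRunConfig(disable_early_shutdown=False, shutdown_error_rate=0.2)
```

Here normalized.shutdown_error_rate no longer holds the 0.2 that was supplied, which is why a round trip through the setter and getter may not return the values you originally wrote.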
Set the runtime configuration for dataset generation.
Parameters:
A RunConfig instance containing runtime settings such as
early shutdown behavior, batch sizing via buffer_size, and non-inference worker
concurrency via non_inference_max_parallel_workers.
Notes:
When disable_early_shutdown=True, DataDesigner will never terminate generation early
due to error-rate thresholds. Errors are still tracked for reporting.
Get a dict of ModelFacade instances for custom column development.
Use this to experiment with custom column generator functions outside of
the full pipeline. The returned dict matches the models argument passed
to 3-arg custom column functions.
Parameters:
List of model aliases to include in the dict.
Returns:
dict[str, data_designer.engine.models.facade.ModelFacade]
Dict mapping alias to ModelFacade instance.
Pick the model-client mode that matches the engine the run will use.
The async engine is the default, but allow_resize=True columns force
a sync-engine fallback (see DatasetBuilder._resolve_async_compatibility).
Without aligning the client mode here, those runs would create async-only
clients and then call sync methods on them, raising SyncClientUnavailableError
from inside the sync engine. Matching the client mode to the actual engine
choice keeps the fallback path functional.
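The selection rule above reduces to a single check over the configured columns. The sketch below is a loose model: columns are plain dicts, and pick_client_mode with its "sync"/"async" return values is hypothetical. The real decision is made by DatasetBuilder._resolve_async_compatibility, whose details are not shown here.

```python
def pick_client_mode(columns):
    """Return "sync" if any column forces the sync engine, else "async"."""
    # A single allow_resize=True column is enough to force the sync fallback.
    needs_sync = any(col.get("allow_resize", False) for col in columns)
    return "sync" if needs_sync else "async"


default_mode = pick_client_mode([{"name": "a"}, {"name": "b"}])
fallback_mode = pick_client_mode([{"name": "a"}, {"name": "b", "allow_resize": True}])
```

Deriving the client mode from the same condition that drives the engine choice is what prevents async-only clients from being handed to the sync engine.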