data_designer.interface.results

Module Contents

Classes

Name	Description
`DatasetCreationResults`	Results container for a Data Designer dataset creation run.

Functions

Name	Description
`_export_jsonl`	Write batch_files to output as JSONL, one record per line.
`_export_csv`	Write batch_files to output as CSV with a single header row.
`_export_parquet`	Write batch_files to output as a single Parquet file.

Data

ExportFormat SUPPORTED_EXPORT_FORMATS

API

ExportFormat

SUPPORTED_EXPORT_FORMATS

tuple[str, ...], defaults to get_args(...)
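The `get_args(...)` default suggests that SUPPORTED_EXPORT_FORMATS is derived from the ExportFormat type alias. The following is a minimal sketch of that pattern, assuming ExportFormat is a `typing.Literal` over the three formats listed for export(). The actual definition and the order of its values are not shown in this reference, so treat both as assumptions.

```python
from typing import Literal, get_args

# Assumption: ExportFormat is a Literal alias; the real definition may differ.
ExportFormat = Literal["jsonl", "csv", "parquet"]

# get_args() returns the literal values as a tuple, so the runtime list of
# supported formats stays in sync with the type annotation.
SUPPORTED_EXPORT_FORMATS: tuple[str, ...] = get_args(ExportFormat)

print(SUPPORTED_EXPORT_FORMATS)
```

Deriving the tuple from the alias means a new format only has to be added in one place.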

class data_designer.interface.results.DatasetCreationResults(
    *,
    artifact_storage: data_designer.engine.storage.artifact_storage.ArtifactStorage,
    analysis: data_designer.config.analysis.dataset_profiler.DatasetProfilerResults,
    config_builder: data_designer.config.config_builder.DataDesignerConfigBuilder,
    dataset_metadata: data_designer.config.dataset_metadata.DatasetMetadata,
    task_traces: list[data_designer.engine.dataset_builders.utils.task_model.TaskTrace] | None = None
)

Bases: data_designer.config.utils.visualization.WithRecordSamplerMixin

Results container for a Data Designer dataset creation run.

This class provides access to the generated dataset, profiling analysis, and visualization utilities. It is returned by the DataDesigner.create() method and implements ResultsProtocol of the DataDesigner interface.

Resume scope: methods that read from the artifact directory (load_dataset, count_records, load_analysis, export, push_to_hub) reflect the full dataset on disk, including rows produced by earlier create() calls that the current invocation resumed. Per-run observability — task_traces and any model-usage / telemetry side effects emitted during the call — is scoped to the current invocation only, because the original run’s in-memory state is not persisted across process boundaries.

Initialization:

Creates a new instance with results based on a dataset creation run.

Parameters:

artifact_storage

data_designer.engine.storage.artifact_storage.ArtifactStorage

Storage manager for accessing generated artifacts.

analysis

data_designer.config.analysis.dataset_profiler.DatasetProfilerResults

Profiling results for the generated dataset.

config_builder

data_designer.config.config_builder.DataDesignerConfigBuilder

Configuration builder used to create the dataset.

dataset_metadata

data_designer.config.dataset_metadata.DatasetMetadata

Metadata about the generated dataset (e.g., seed column names).

task_traces

list[data_designer.engine.dataset_builders.utils.task_model.TaskTrace] | None, defaults to None

Optional list of TaskTrace objects from the async scheduler. Resume note: only contains traces for the current invocation; traces from earlier create() calls that this run resumed are not retained.

load_analysis() -> data_designer.config.analysis.dataset_profiler.DatasetProfilerResults

Load the profiling analysis results for the generated dataset.

Returns:

data_designer.config.analysis.dataset_profiler.DatasetProfilerResults

DatasetProfilerResults containing statistical analysis and quality metrics for configured columns in the generated dataset.

load_dataset() -> pandas.DataFrame

Load the generated dataset as a pandas DataFrame.

Returns:

pandas.DataFrame

A pandas DataFrame containing the full generated dataset.

count_records() -> int

Return the total number of records in the generated dataset.

Counts rows by reading Parquet file metadata only — no data pages are loaded, so memory usage is constant regardless of dataset size.

Returns:

int

Total row count across all batch parquet files.

load_processor_dataset(processor_name: str) -> pandas.DataFrame

Load the dataset generated by a processor.

This only works for processors that write their artifacts in Parquet format.

Parameters:

processor_name

str

The name of the processor to load the dataset from.

Returns:

pandas.DataFrame

A pandas DataFrame containing the dataset generated by the processor.

get_path_to_processor_artifacts(processor_name: str) -> pathlib.Path

Get the path to the artifacts generated by a processor.

Parameters:

processor_name

str

The name of the processor whose artifact path should be returned.

Returns:

pathlib.Path

The path to the artifacts.

export(
    path: pathlib.Path | str,
    *,
    format: data_designer.interface.results.ExportFormat | None = None
) -> pathlib.Path

Export the generated dataset to a single file by streaming batch files.

The output format is inferred from the file extension when format is omitted. Pass format explicitly to override the extension (e.g. write a .txt file as JSONL).

Unlike load_dataset(), this method never materialises the full dataset in memory: it reads batch parquet files one at a time and appends each to the output file, keeping peak memory proportional to a single batch.

Parameters:

path

pathlib.Path | str

Output file path. The exact path is used as-is; the extension is not rewritten.

format

data_designer.interface.results.ExportFormat | None, defaults to None

Output format. One of 'jsonl', 'csv', or 'parquet'. When omitted, the format is inferred from the file extension.

Returns:

pathlib.Path

Path to the written file.

Raises:

InvalidFileFormatError

If the format cannot be determined or is not one of the supported values.

ArtifactStorageError

If no batch parquet files are found.

Example:

>>> results = data_designer.create(config, num_records=1000)
>>> results.export("output.jsonl")
PosixPath('output.jsonl')
>>> results.export("output.csv")
PosixPath('output.csv')
>>> results.export("output.txt", format="jsonl")
PosixPath('output.txt')
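The format resolution described for export() (an explicit `format` wins, otherwise the file extension decides) can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `resolve_export_format` is a hypothetical helper, the local `InvalidFileFormatError` is a stand-in for the library's exception of the same name, and the extension is matched exactly as written.

```python
from __future__ import annotations

from pathlib import Path

# Formats documented for export(); defined locally to keep the sketch self-contained.
SUPPORTED = ("jsonl", "csv", "parquet")


class InvalidFileFormatError(ValueError):
    """Stand-in for the library's error of the same name (base class is an assumption)."""


def resolve_export_format(path: Path | str, format: str | None = None) -> str:
    """Return the explicit format if given, else infer it from the file extension."""
    # An explicit format overrides the extension; the path itself is never rewritten.
    resolved = format if format is not None else Path(path).suffix.lstrip(".")
    if resolved not in SUPPORTED:
        raise InvalidFileFormatError(
            f"Cannot determine a supported export format for {str(path)!r}; "
            f"expected one of {SUPPORTED}."
        )
    return resolved
```

With this helper, `resolve_export_format("output.txt", format="jsonl")` resolves to JSONL even though the extension is `.txt`, while `resolve_export_format("output.txt")` raises because `txt` is not a supported format.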

push_to_hub(
    repo_id: str,
    description: str,
    *,
    token: str | None = None,
    private: bool = False,
    tags: list[str] | None = None
) -> str

Push dataset to HuggingFace Hub.

Uploads all artifacts including:

Main parquet batch files (data subset)
Processor output batch files ({processor_name} subsets)
Configuration (builder_config.json)
Metadata (metadata.json)
Auto-generated dataset card (README.md)

Parameters:

repo_id

str

HuggingFace repo ID (e.g., “username/my-dataset”)

description

str

Custom description text for the dataset card. Appears after the title.

token

str | None, defaults to None

HuggingFace API token. If None, the token is automatically resolved from HF_TOKEN environment variable or cached credentials from hf auth login.

private

bool, defaults to False

Whether to create the repository as private.

Example:

>>> results = data_designer.create(config, num_records=1000)
>>> description = "This dataset contains synthetic conversations for training chatbots."
>>> results.push_to_hub("username/my-synthetic-dataset", description, tags=["chatbot", "conversation"])
'https://huggingface.co/datasets/username/my-synthetic-dataset'

data_designer.interface.results._export_jsonl(
    batch_files: list[pathlib.Path],
    output: pathlib.Path
) -> None

data_designer.interface.results._export_csv(
    batch_files: list[pathlib.Path],
    output: pathlib.Path
) -> None

data_designer.interface.results._export_parquet(
    batch_files: list[pathlib.Path],
    output: pathlib.Path
) -> None
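The batch-at-a-time writing that these helpers are described as doing (JSONL with one record per line, CSV with a single header row) can be illustrated with the standard library alone. The sketch below is not the library's code: the function names are hypothetical, and in-memory lists of dicts stand in for the batch parquet files so the example runs without a Parquet reader.

```python
import csv
import json
from pathlib import Path
from typing import Iterable

Record = dict[str, object]


def export_jsonl(batches: Iterable[list[Record]], output: Path) -> None:
    """Write one JSON object per line, consuming one batch at a time."""
    with output.open("w", encoding="utf-8") as f:
        for batch in batches:
            for record in batch:
                f.write(json.dumps(record) + "\n")


def export_csv(batches: Iterable[list[Record]], output: Path) -> None:
    """Write all batches to one CSV file, emitting the header only once."""
    with output.open("w", encoding="utf-8", newline="") as f:
        writer = None
        for batch in batches:
            for record in batch:
                if writer is None:
                    # Assumption: every record shares the first record's columns.
                    writer = csv.DictWriter(f, fieldnames=list(record))
                    writer.writeheader()
                writer.writerow(record)
```

Because `batches` can be any iterable, including a generator that loads one batch per step, only the batch currently being written needs to be held in memory.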
