data_designer.interface.results
ExportFormat
SUPPORTED_EXPORT_FORMATS
Bases: data_designer.config.utils.visualization.WithRecordSamplerMixin
Results container for a Data Designer dataset creation run.
This class provides access to the generated dataset, profiling analysis, and visualization utilities. It is returned by the DataDesigner.create() method and implements ResultsProtocol of the DataDesigner interface.
Resume scope: methods that read from the artifact directory (load_dataset,
count_records, load_analysis, export, push_to_hub) reflect the
full dataset on disk, including rows produced by earlier create() calls
that the current invocation resumed. Per-run observability — task_traces
and any model-usage / telemetry side effects emitted during the call — is
scoped to the current invocation only, because the original run’s in-memory
state is not persisted across process boundaries.
Initialization:
Creates a new instance with results based on a dataset creation run.
Parameters:
Storage manager for accessing generated artifacts.
Profiling results for the generated dataset.
Configuration builder used to create the dataset.
Metadata about the generated dataset (e.g., seed column names).
Optional list of TaskTrace objects from the async scheduler.
Resume note: only contains traces for the current invocation; traces
from earlier create() calls that this run resumed are not
retained.
Load the profiling analysis results for the generated dataset.
Returns:
data_designer.config.analysis.dataset_profiler.DatasetProfilerResults
DatasetProfilerResults containing statistical analysis and quality metrics for configured columns in the generated dataset.
Load the generated dataset as a pandas DataFrame.
Returns:
pandas.DataFrame
A pandas DataFrame containing the full generated dataset.
Return the total number of records in the generated dataset.
Counts rows by reading Parquet file metadata only — no data pages are loaded, so memory usage is constant regardless of dataset size.
Returns:
int
Total row count across all batch parquet files.
Load the dataset generated by a processor.
This only works for processors that write their artifacts in Parquet format.
Parameters:
The name of the processor to load the dataset from.
Returns:
pandas.DataFrame
A pandas DataFrame containing the dataset generated by the processor.
Get the path to the artifacts generated by a processor.
Parameters:
The name of the processor to load the artifact from.
Returns:
pathlib.Path
The path to the artifacts.
Export the generated dataset to a single file by streaming batch files.
The output format is inferred from the file extension when format is
omitted. Pass format explicitly to override the extension (e.g. write a
.txt file as JSONL).
Unlike load_dataset, this method never materialises the full dataset
in memory: it reads batch parquet files one at a time and appends each to
the output file, keeping peak memory proportional to a single batch.
Parameters:
Output file path. The exact path is used as-is; the extension is not rewritten.
Output format. One of 'jsonl', 'csv', or 'parquet'.
When omitted, the format is inferred from the file extension.
Returns:
pathlib.Path
Path to the written file.
Raises:
If the format cannot be determined or is not one of the supported values.
If no batch parquet files are found.
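The format-resolution rule described above (an explicit format wins, otherwise the file extension decides) might look roughly like the sketch below. The function name resolve_export_format, the SUPPORTED tuple, the case-insensitive extension match, and the choice of ValueError are assumptions made for illustration; the documentation does not state the exception type raised.

```python
from pathlib import Path
from typing import Optional

# Values taken from the parameter description; the constant name is hypothetical.
SUPPORTED = ("jsonl", "csv", "parquet")


def resolve_export_format(path: str, format: Optional[str] = None) -> str:
    """Pick the export format: explicit argument first, file extension second."""
    if format is not None:
        chosen = format
    else:
        # Assumption: the extension is matched case-insensitively.
        chosen = Path(path).suffix.lstrip(".").lower()
    if chosen not in SUPPORTED:
        # Assumption: ValueError stands in for whatever the library raises.
        raise ValueError(f"Cannot determine a supported format for {path!r}")
    return chosen
```

Under these assumptions, resolve_export_format("data.txt", format="jsonl") returns "jsonl" and leaves the path untouched, which matches the note that the extension is not rewritten.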
Example:
Push dataset to HuggingFace Hub.
Uploads all artifacts including:
Parameters:
HuggingFace repo ID (e.g., “username/my-dataset”)
Custom description text for the dataset card. Appears after the title.
HuggingFace API token. If None, the token is automatically
resolved from HF_TOKEN environment variable or cached credentials
from hf auth login.
Whether to create the repository as private.
Additional custom tags for the dataset.
Returns:
str
URL to the uploaded dataset
Example:
Write batch_files to output as JSONL, one record per line.
Each batch is appended in turn so peak memory stays proportional to one batch.
Write batch_files to output as CSV with a single header row.
Write batch_files to output as a single Parquet file.
Schemas are unified across batches before writing so that columns with minor
type drift (e.g. int64 vs float64 across batches) are cast to a
consistent schema rather than causing a write error.
Raises:
If batch schemas have incompatible column names or types that cannot be unified or cast.