Processors are transformations that modify your dataset before or after columns are generated. They run at different stages and can reshape, filter, or augment the data.
When to Use Processors
Processors handle transformations that don’t fit the “column” model: restructuring the schema for a specific output format, dropping intermediate columns in bulk, or applying batch-wide operations.
Each processor:
Processors can run at three stages, determined by which callback methods they implement:
Each batch carries the full dataset schema during generation. Post-batch schema changes such as column dropping only alter past batches, so all columns remain accessible to generators while building follow-up batches.
The async engine (default) enforces row-count invariance in process_before_batch() and process_after_batch(): a processor that returns a different row count raises DatasetGenerationError. Run row-filtering or row-expansion logic in process_after_generation(), which operates on the final dataset and supports row-count changes. The legacy sync engine (enabled by opting out of the async engine with DATA_DESIGNER_ASYNC_ENGINE=0) is permissive about row-count changes at all stages.
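The row-count rule can be pictured with a small, self-contained sketch. It does not use Data Designer itself: the `DatasetGenerationError` class and the `run_batch_stage` helper below are hypothetical stand-ins, and batches are plain lists of dicts rather than real dataframes. It only illustrates the kind of check described above, where a batch-stage callback must return as many rows as it received.

```python
class DatasetGenerationError(Exception):
    """Stand-in for the error named above; not imported from Data Designer."""


def run_batch_stage(callback, batch):
    """Apply a batch-stage callback and enforce row-count invariance (hypothetical helper)."""
    result = callback(batch)
    if len(result) != len(batch):
        raise DatasetGenerationError(
            f"processor changed row count from {len(batch)} to {len(result)}"
        )
    return result


def add_length(batch):
    # Augments each row; the row count is unchanged, so this passes the check.
    return [{**row, "length": len(row["text"])} for row in batch]


def drop_short(rows):
    # Filters rows; in this sketch it is only safe as a final, whole-dataset step.
    return [row for row in rows if len(row["text"]) > 3]


batch = [{"text": "hi"}, {"text": "hello"}]
augmented = run_batch_stage(add_length, batch)

try:
    run_batch_stage(drop_short, batch)
    filter_rejected = False
except DatasetGenerationError:
    filter_rejected = True

# The same filter applied once to the finished dataset may change the row count.
final_rows = drop_short(augmented)
```

In this sketch the filter is rejected at the batch stage but accepted when applied to the finished rows, which mirrors the advice to keep row-count changes in process_after_generation().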
process_after_generation() runs once on the entire generated dataset, not once per buffer. It loads the final parquet dataset, applies the processor, deletes the previous parquet files, and writes a new chunked result. Because this can change row counts, schemas, and row-group boundaries, Data Designer treats a dataset as terminal for resume after this stage has completed. Re-running with the same target is a no-op; extending the dataset requires a fresh run.
A processor can implement any combination of these callbacks. The built-in processors use process_after_batch() by default.
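One way to picture "any combination of these callbacks" is a base class whose callbacks are no-ops, with subclasses overriding only the stages they need. The `Processor` base class below is a simplified stand-in written for this illustration (the real base class and its signatures may differ), and `NormalizeAndDeduplicate` and `implemented_stages` are hypothetical; only the three callback names come from the text above.

```python
class Processor:
    """Simplified stand-in base class: every callback passes data through unchanged."""

    def process_before_batch(self, batch):
        return batch

    def process_after_batch(self, batch):
        return batch

    def process_after_generation(self, dataset):
        return dataset


class NormalizeAndDeduplicate(Processor):
    """Hypothetical processor that implements two of the three callbacks."""

    def process_after_batch(self, batch):
        # Row-preserving cleanup runs per batch.
        return [{**row, "text": row["text"].strip().lower()} for row in batch]

    def process_after_generation(self, dataset):
        # Row-count changes are left to the final, whole-dataset stage.
        seen, unique = set(), []
        for row in dataset:
            if row["text"] not in seen:
                seen.add(row["text"])
                unique.append(row)
        return unique


def implemented_stages(processor):
    """List the callbacks a processor overrides relative to the stand-in base class."""
    names = ("process_before_batch", "process_after_batch", "process_after_generation")
    return [n for n in names if getattr(type(processor), n) is not getattr(Processor, n)]


processor = NormalizeAndDeduplicate()
stages = implemented_stages(processor)
cleaned = processor.process_after_batch([{"text": " Hello "}, {"text": "hello"}])
deduplicated = processor.process_after_generation(cleaned)
```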
Removes specified columns from the output dataset. Dropped columns are saved separately in the dropped-columns directory for reference.
Dropping Columns Is More Easily Achieved via drop=True
Unlike the other processors, the Drop Columns Processor does not need to be added explicitly: setting drop=True when configuring a column accomplishes the same thing.
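The effect of dropping columns can be sketched without the library. The `drop_columns` helper below is hypothetical and works on plain lists of dicts; it mirrors the description above by returning both the trimmed rows and the removed values, which the real processor is described as saving separately.

```python
def drop_columns(rows, column_names):
    """Split rows into kept and dropped parts (hypothetical helper, not the library API)."""
    to_drop = set(column_names)
    kept = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
    dropped = [{k: v for k, v in row.items() if k in to_drop} for row in rows]
    return kept, dropped


rows = [
    {"question": "What is 2 + 2?", "scratchpad": "2 + 2 = 4", "answer": "4"},
    {"question": "What is 3 * 3?", "scratchpad": "3 * 3 = 9", "answer": "9"},
]
kept, dropped = drop_columns(rows, ["scratchpad"])
```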
Configuration:
Behavior:
- Columns listed in column_names are removed from the output
- The affected columns are marked drop=True when this processor is added
Use Cases:
Creates an additional dataset with a transformed schema using Jinja2 templates. The output is written to a separate directory alongside the main dataset.
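The idea of building a second dataset from templates can be sketched as follows. To stay dependency-free, this example swaps Jinja2 for a tiny regex-based `render` function that only understands `{{ name }}` placeholders and the `upper`, `lower`, and `trim` filters. The `render`, `render_value`, and `transform_schema` helpers and the sample template are hypothetical, not part of Data Designer.

```python
import re

# Stand-in filters; real Jinja2 offers many more.
FILTERS = {"upper": str.upper, "lower": str.lower, "trim": str.strip}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*(?:\|\s*(\w+)\s*)?\}\}")


def render(template, row):
    """Substitute {{ column }} and {{ column | filter }} placeholders in a string."""
    def replace(match):
        value = str(row[match.group(1)])
        filter_name = match.group(2)
        return FILTERS[filter_name](value) if filter_name else value
    return PLACEHOLDER.sub(replace, template)


def render_value(value, row):
    """Render strings, and recurse into lists and dicts of templates."""
    if isinstance(value, str):
        return render(value, row)
    if isinstance(value, list):
        return [render_value(item, row) for item in value]
    if isinstance(value, dict):
        return {key: render_value(item, row) for key, item in value.items()}
    return value


def transform_schema(rows, template):
    """Build a new dataset with one output column per top-level template key."""
    return [render_value(template, row) for row in rows]


template = {
    "prompt": "Q: {{ question | trim }}",
    "tags": ["{{ topic | upper }}", "{{ level }}"],
}
rows = [{"question": "  What is 2 + 2? ", "topic": "math", "level": "easy"}]
transformed = transform_schema(rows, template)
```

The original rows are left untouched; the transformed rows form a separate dataset, in the same spirit as the separate output directory described above.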
Configuration:
Behavior:
- Each key in template becomes a column in the transformed dataset
- Output is written to the processors-outputs/{name}/ directory
Template Capabilities:
- Variable substitution: {{ column_name }}
- Filters: {{ text | upper }}, {{ text | lower }}, {{ text | trim }}
- Lists: ["{{ col1 }}", "{{ col2 }}"]
Use Cases:
Add processors to your configuration using the builder’s add_processor method:
Processors execute in the order they’re added. Plan accordingly when one processor’s output affects another.
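A small sketch of why order matters, using plain functions as stand-ins for processors (the `run_in_order` helper and both toy processors are hypothetical). A processor that reads a column has to run before one that drops it.

```python
def run_in_order(rows, processors):
    """Apply each processor to the output of the previous one, in list order."""
    for processor in processors:
        rows = processor(rows)
    return rows


def drop_draft(rows):
    # Removes the intermediate "draft" column.
    return [{k: v for k, v in row.items() if k != "draft"} for row in rows]


def summarize_draft(rows):
    # Reads "draft" if it is still present; otherwise falls back to an empty summary.
    return [{**row, "summary": row.get("draft", "")[:5]} for row in rows]


rows = [{"id": 1, "draft": "hello world"}]

summary_then_drop = run_in_order(rows, [summarize_draft, drop_draft])
drop_then_summary = run_in_order(rows, [drop_draft, summarize_draft])
```

With the summary step first, the summary is built from the draft before it is removed; with the drop step first, the summary step no longer sees the draft and produces an empty value.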
You can extend Data Designer with custom processors via the plugin system. A processor plugin is a Python package that provides:
- A ProcessorConfig subclass with a processor_type: Literal["your-type"] discriminator
- A Processor subclass that overrides the desired callback methods
- A Plugin instance connecting the two

Once installed, plugin processors are automatically discovered and can be used with add_processor() like built-in processors.
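The three pieces listed above can be sketched in plain Python. Everything below is a simplified stand-in: the real `ProcessorConfig`, `Processor`, and `Plugin` classes come from Data Designer, and their constructors and method signatures may differ. The `"uppercase-text"` type, the `column` field, and the `config_cls` and `impl_cls` fields are invented for illustration.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class ProcessorConfig:
    """Stand-in base config; the real class is provided by the library."""
    name: str


@dataclass
class UppercaseTextConfig(ProcessorConfig):
    """Hypothetical plugin config, identified by its processor_type discriminator."""
    column: str = "text"
    processor_type: Literal["uppercase-text"] = "uppercase-text"


class Processor:
    """Stand-in base implementation with a pass-through callback."""

    def __init__(self, config):
        self.config = config

    def process_after_batch(self, batch):
        return batch


class UppercaseTextProcessor(Processor):
    """Overrides only the callback it needs."""

    def process_after_batch(self, batch):
        column = self.config.column
        return [{**row, column: row[column].upper()} for row in batch]


@dataclass(frozen=True)
class Plugin:
    """Stand-in for the object that connects a config class to its implementation."""
    config_cls: type
    impl_cls: type


plugin = Plugin(config_cls=UppercaseTextConfig, impl_cls=UppercaseTextProcessor)

# What a host application could do once it has found the plugin.
config = plugin.config_cls(name="shout")
processor = plugin.impl_cls(config)
result = processor.process_after_batch([{"text": "hello"}])
```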
Entry point configuration in pyproject.toml:
See the plugins overview for the full guide on creating plugins.