Copyright © 2026, NVIDIA Corporation.

NeMo Data Designer

Processors

Processors are transformations that modify your dataset before or after columns are generated. They run at different stages and can reshape, filter, or augment the data.

When to Use Processors

Processors handle transformations that don’t fit the “column” model: restructuring the schema for a specific output format, dropping intermediate columns in bulk, or applying batch-wide operations.

Overview

Each processor:

  • Receives the complete batch DataFrame
  • Applies its transformation
  • Passes the result to the next processor (or to output)

Processors can run at three stages, determined by which callback methods they implement:

| Stage | When it runs | Callback method | Use cases |
| --- | --- | --- | --- |
| Pre-batch | After seed columns, before dependent columns | process_before_batch() | Transform seed data before other columns are generated |
| Post-batch | After each batch completes | process_after_batch() | Drop columns, transform schema per batch |
| After generation | Once, on the final dataset after all batches | process_after_generation() | Deduplicate, aggregate statistics, final cleanup |

Full Schema Available During Generation

Each batch carries the full dataset schema during generation. Post-batch schema changes, such as dropping columns, affect only batches that have already completed, so all columns remain accessible to generators while they build subsequent batches.

Row-count changes under the async engine

The async engine (the default) enforces row-count invariance in process_before_batch() and process_after_batch(): a processor that returns a different row count raises DatasetGenerationError. Put row-filtering or row-expansion logic in process_after_generation(), which operates on the final dataset and supports row-count changes. The legacy sync engine (enabled by setting DATA_DESIGNER_ASYNC_ENGINE=0) permits row-count changes at all stages.
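The row-count rule can be illustrated with a small, self-contained sketch. Everything here is a stand-in: rows are plain dicts rather than a DataFrame, `DatasetGenerationError` is redefined locally instead of imported from the library, and `checked` and `drop_short_answers` are hypothetical helpers, not part of Data Designer.

```python
class DatasetGenerationError(Exception):
    """Local stand-in for the library exception of the same name."""


def checked(callback, rows):
    """Run a per-batch callback and reject any change in row count."""
    result = callback(rows)
    if len(result) != len(rows):
        raise DatasetGenerationError(
            f"row count changed from {len(rows)} to {len(result)}"
        )
    return result


def drop_short_answers(rows):
    """A row filter: keeps only rows whose answer has at least 5 characters."""
    return [r for r in rows if len(r["answer"]) >= 5]


batch = [{"answer": "yes"}, {"answer": "a longer answer"}]

# Used as a per-batch step, the filter trips the invariance check.
try:
    checked(drop_short_answers, batch)
    guard_raised = False
except DatasetGenerationError:
    guard_raised = True

# Applied to the final dataset instead, the same filter is allowed to shrink it.
final_rows = drop_short_answers(batch)
```

The same filtering function is acceptable or not depending only on the stage where it runs, which is why row-changing logic belongs in the after-generation callback.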

Resume after process_after_generation

process_after_generation() runs once on the entire generated dataset, not once per buffer. It loads the final parquet dataset, applies the processor, deletes the previous parquet files, and writes a new chunked result. Because this can change row counts, schemas, and row-group boundaries, Data Designer treats a dataset as terminal for resume after this stage has completed. Re-running with the same target is a no-op; extending the dataset requires a fresh run.

A processor can implement any combination of these callbacks. The built-in processors use process_after_batch() by default.
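As a rough illustration of how the three callbacks relate to each other, the toy driver below calls them in stage order. It is a sketch under stated assumptions, not the library's engine: `ToyProcessor`, `NormalizeSeed`, `Deduplicate`, and `run` are hypothetical names, rows are plain dicts instead of a DataFrame, and the "generation" step is a placeholder string format.

```python
class ToyProcessor:
    """Hypothetical base class: every callback defaults to a pass-through."""

    def process_before_batch(self, rows):
        return rows

    def process_after_batch(self, rows):
        return rows

    def process_after_generation(self, rows):
        return rows


class NormalizeSeed(ToyProcessor):
    """Implements only the pre-batch callback: cleans seed values."""

    def process_before_batch(self, rows):
        return [{**r, "topic": r["topic"].strip().lower()} for r in rows]


class Deduplicate(ToyProcessor):
    """Implements only the after-generation callback: may change row count."""

    def process_after_generation(self, rows):
        seen, unique = set(), []
        for r in rows:
            if r["topic"] not in seen:
                seen.add(r["topic"])
                unique.append(r)
        return unique


def run(seed_batches, processors):
    final = []
    for batch in seed_batches:
        for p in processors:  # pre-batch: after seed columns exist
            batch = p.process_before_batch(batch)
        # Placeholder for generating dependent columns.
        batch = [{**r, "question": f"What is {r['topic']}?"} for r in batch]
        for p in processors:  # post-batch: once per completed batch
            batch = p.process_after_batch(batch)
        final.extend(batch)
    for p in processors:  # after generation: once, on the whole dataset
        final = p.process_after_generation(final)
    return final


result = run(
    [[{"topic": " Cats "}, {"topic": "dogs"}], [{"topic": "CATS"}]],
    [NormalizeSeed(), Deduplicate()],
)
```

Each toy processor overrides only the callback it needs, and the duplicate "cats" row from the second batch is removed only at the final stage, where row-count changes are permitted.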

Processor Types

🗑️ Drop Columns Processor

Removes specified columns from the output dataset. Dropped columns are saved separately in the dropped-columns directory for reference.

Dropping Columns is More Easily Achieved via drop=True

Unlike the other processors, the Drop Columns Processor does not need to be added explicitly: setting drop=True when configuring a column accomplishes the same thing.

Configuration:

import data_designer.config as dd

processor = dd.DropColumnsProcessorConfig(
    name="remove_intermediate",
    column_names=["temp_calculation", "raw_input", "debug_info"],
)

Behavior:

  • Columns specified in column_names are removed from the output
  • Original values are preserved in a separate parquet file
  • Missing columns produce a warning but don’t fail the build
  • Column configs are automatically marked with drop=True when this processor is added
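The behavior listed above can be approximated in a few lines of plain Python. This is a minimal sketch, assuming rows are dicts rather than a DataFrame; `drop_columns` is a hypothetical helper, and returning the dropped values in a second list merely stands in for the separate parquet file that the real processor writes.

```python
import warnings


def drop_columns(rows, column_names):
    """Split rows into (kept, dropped); warn about names that are not present."""
    present = set().union(*(r.keys() for r in rows))
    missing = [c for c in column_names if c not in present]
    if missing:
        # Missing columns are reported but do not abort the run.
        warnings.warn(f"columns not found, skipping: {missing}")
    to_drop = [c for c in column_names if c in present]
    kept = [{k: v for k, v in r.items() if k not in to_drop} for r in rows]
    dropped = [{k: r[k] for k in to_drop if k in r} for r in rows]
    return kept, dropped


rows = [{"question": "q1", "raw_input": "x", "debug_info": "d"}]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept, dropped = drop_columns(
        rows, ["raw_input", "debug_info", "temp_calculation"]
    )
```

Here `temp_calculation` does not exist in the rows, so it triggers a single warning while the other two columns are removed and their original values are preserved separately.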

Use Cases:

  • Removing intermediate columns used only for LLM context
  • Cleaning up debug or validation columns before final output
  • Separating sensitive data from the main dataset

🔄 Schema Transform Processor

Creates an additional dataset with a transformed schema using Jinja2 templates. The output is written to a separate directory alongside the main dataset.

Configuration:

import data_designer.config as dd

processor = dd.SchemaTransformProcessorConfig(
    name="chat_format",
    template={
        "messages": [
            {"role": "user", "content": "{{ question }}"},
            {"role": "assistant", "content": "{{ answer }}"},
        ],
        "metadata": "{{ category | upper }}",
    },
)

Behavior:

  • Each key in template becomes a column in the transformed dataset
  • Values are Jinja2 templates with access to all columns in the batch
  • Complex structures (lists, nested dicts) are supported
  • Output is saved to the processors-outputs/{name}/ directory
  • The original dataset passes through unchanged

Template Capabilities:

  • Variable substitution: {{ column_name }}
  • Filters: {{ text | upper }}, {{ text | lower }}, {{ text | trim }}
  • Nested structures: Arbitrarily deep JSON structures
  • Lists: ["{{ col1 }}", "{{ col2 }}"]
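To make the per-row rendering concrete, the sketch below walks a nested template and fills in placeholders from one row. It deliberately does not use Jinja2: a small regular expression stands in for it and supports only plain variables and the three filters named above, so it illustrates the idea rather than the processor's actual implementation. `render`, `render_string`, and the sample row are hypothetical.

```python
import re

# Stand-in for Jinja2: handles "{{ name }}" and "{{ name | filter }}" only.
FILTERS = {"upper": str.upper, "lower": str.lower, "trim": str.strip}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*(?:\|\s*(\w+)\s*)?\}\}")


def render_string(template, row):
    """Replace each placeholder with the row value, applying an optional filter."""

    def substitute(match):
        value = str(row[match.group(1)])
        filter_name = match.group(2)
        return FILTERS[filter_name](value) if filter_name else value

    return PLACEHOLDER.sub(substitute, template)


def render(node, row):
    """Recursively render strings inside nested dicts and lists."""
    if isinstance(node, str):
        return render_string(node, row)
    if isinstance(node, list):
        return [render(item, row) for item in node]
    if isinstance(node, dict):
        return {key: render(value, row) for key, value in node.items()}
    return node


template = {
    "messages": [
        {"role": "user", "content": "{{ question }}"},
        {"role": "assistant", "content": "{{ answer }}"},
    ],
    "metadata": "{{ category | upper }}",
}
row = {"question": "What is 2+2?", "answer": "4", "category": "math"}
rendered = render(template, row)
```

Each top-level key of the rendered result (`messages`, `metadata`) corresponds to one column of the transformed dataset, and the input row itself is left untouched.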

Use Cases:

  • Converting flat columns to chat message format
  • Restructuring data for specific model training formats
  • Creating derived views without modifying the source dataset

Using Processors

Add processors to your configuration using the builder’s add_processor method:

import data_designer.config as dd

builder = dd.DataDesignerConfigBuilder()

# ... add columns ...

# Drop intermediate columns
builder.add_processor(
    dd.DropColumnsProcessorConfig(
        name="cleanup",
        column_names=["scratch_work", "raw_context"],
    )
)

# Transform to chat format
builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="chat_format",
        template={
            "messages": [
                {"role": "user", "content": "{{ question }}"},
                {"role": "assistant", "content": "{{ answer }}"},
            ],
        },
    )
)

Execution Order

Processors execute in the order they’re added. Plan accordingly when one processor’s output affects another.
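A toy example of why order matters: below, one step derives a value from a column and another step removes that column. The step functions and `apply_in_order` are hypothetical and operate on plain dicts; they only mimic the "each processor feeds the next" behavior described here and are not the library's processors.

```python
def add_answer_length(rows):
    """Derive a new field from the 'answer' column."""
    return [{**r, "answer_length": len(r["answer"])} for r in rows]


def drop_answer(rows):
    """Remove the 'answer' column."""
    return [{k: v for k, v in r.items() if k != "answer"} for r in rows]


def apply_in_order(rows, steps):
    """Feed the output of each step into the next, in the order given."""
    for step in steps:
        rows = step(rows)
    return rows


rows = [{"answer": "42"}]

# Derive first, then drop: works.
ok = apply_in_order(rows, [add_answer_length, drop_answer])

# Drop first, then derive: the second step no longer finds its input.
try:
    apply_in_order(rows, [drop_answer, add_answer_length])
    reversed_failed = False
except KeyError:
    reversed_failed = True
```

When one step consumes a column that another removes, add the consuming step first.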

Processor Plugins

You can extend Data Designer with custom processors via the plugin system. A processor plugin is a Python package that provides:

  • A config class inheriting from ProcessorConfig with a processor_type: Literal["your-type"] discriminator
  • An implementation class inheriting from Processor that overrides the desired callback methods
  • A Plugin instance connecting the two

Once installed, plugin processors are automatically discovered and can be used with add_processor() like built-in processors.

from my_processor_plugin.config import MyProcessorConfig

builder.add_processor(
    MyProcessorConfig(
        name="my_processor",
        # ... plugin-specific parameters ...
    )
)

Entry point configuration in pyproject.toml:

[project.entry-points."data_designer.plugins"]
my-processor = "my_plugin.plugin:my_processor_plugin"

See the plugins overview for the full guide on creating plugins.

Configuration Parameters

Common Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Identifier for the processor, used in output directory names |

DropColumnsProcessorConfig

| Parameter | Type | Description |
| --- | --- | --- |
| column_names | list[str] | Columns to remove from output |

SchemaTransformProcessorConfig

| Parameter | Type | Description |
| --- | --- | --- |
| template | dict[str, Any] | Jinja2 template defining the output schema. Must be JSON-serializable. |