Copyright © 2026, NVIDIA Corporation.

NeMo Data Designer
Custom Columns

Custom columns let you implement your own generation logic using Python functions. Use them for multi-step LLM workflows, external API integration, or any scenario requiring full programmatic control. For reusable, distributable components, see Plugins instead.

Quick Start

import data_designer.config as dd

# Declare the columns this generator reads so the framework can order the DAG.
@dd.custom_column_generator(required_columns=["name"])
def create_greeting(row: dict) -> dict:
    row["greeting"] = f"Hello, {row['name']}!"
    return row

# config_builder is assumed to be an existing config builder instance.
config_builder.add_column(
    dd.CustomColumnConfig(
        name="greeting",
        generator_function=create_greeting,
    )
)

Function Signatures

Three signatures are supported. Parameter names are validated, so name them as shown:

| Args | Signature | Use Case |
| --- | --- | --- |
| 1 | fn(row) -> dict | Simple transforms |
| 2 | fn(row, generator_params) -> dict | With typed params |
| 3 | fn(row, generator_params, models) -> dict | LLM access via models dict |
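
The two-argument form has no example on this page, so here is a minimal sketch of how a params object might be consumed. It is written to run standalone: the decorator is left off, and a frozen dataclass named GreetingParams (a hypothetical name) stands in for the BaseModel that a real config would pass as generator_params.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GreetingParams:
    """Hypothetical params object; a real config would pass a BaseModel instead."""

    salutation: str = "Hello"
    punctuation: str = "!"


def create_greeting(row: dict, generator_params: GreetingParams) -> dict:
    # Parameter names follow the two-argument signature: row, generator_params.
    row["greeting"] = (
        f"{generator_params.salutation}, {row['name']}{generator_params.punctuation}"
    )
    return row


result = create_greeting({"name": "Alice"}, GreetingParams(salutation="Welcome"))
print(result["greeting"])  # Welcome, Alice!
```

Keeping tunable values in the params object rather than in module-level constants lets the same function be reused across columns with different settings.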

For the full_column strategy, name the first parameter df instead of row.

For LLM access without params, use generator_params: None:

@dd.custom_column_generator(required_columns=["name"], model_aliases=["my-model"])
def generate_message(row: dict, generator_params: None, models: dict) -> dict:
    response, _ = models["my-model"].generate(prompt=f"Greet {row['name']}")
    row["greeting"] = response
    return row

Model aliases are validated before generation starts. If an alias doesn’t exist in your config, an error is raised during the health check.

Generation Strategies

| Strategy | Input | Use Case |
| --- | --- | --- |
| cell_by_cell (default) | row: dict | LLM calls, row-by-row logic |
| full_column | df: DataFrame | Vectorized DataFrame operations |

Recommendation: Use cell_by_cell for LLM calls. The framework handles parallelization automatically. Use full_column only for vectorized operations that don’t involve LLM calls.

For full_column, set generation_strategy=dd.GenerationStrategy.FULL_COLUMN.

Concurrent dispatch

Sync cell_by_cell generators are dispatched concurrently across rows under the async engine. Module-level mutable state (counters, caches, non-thread-safe HTTP clients) therefore needs synchronization or per-row instantiation. For network-bound work, prefer async def fn(row): the engine runs it directly on its event loop and skips the thread bridge.

The Decorator

@dd.custom_column_generator(
    required_columns=["col1"],       # DAG ordering
    side_effect_columns=["extra"],   # Additional columns created
    model_aliases=["model1"],        # Required for LLM access
)

Models Dict

The third argument is a dict of ModelFacade instances, keyed by alias. You must declare every model your custom column generator uses in model_aliases. This populates the models dict and enables health checks before generation starts.

@dd.custom_column_generator(model_aliases=["my-model"])
def my_generator(row: dict, generator_params: None, models: dict) -> dict:
    model = models["my-model"]
    # generate() returns the (parsed) response together with a trace.
    response, trace = model.generate(
        prompt="...",
        parser=my_custom_parser,  # optional, defaults to identity; defined elsewhere
        system_prompt="...",
        max_correction_steps=3,
    )
    row["result"] = response
    return row

This gives you direct access to all ModelFacade capabilities: custom parsers, correction loops, structured output, tool use, etc.

Configuration

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| name | str | Yes | Column name |
| generator_function | Callable | Yes | Decorated function |
| generation_strategy | GenerationStrategy | No | CELL_BY_CELL or FULL_COLUMN |
| generator_params | BaseModel | No | Typed params passed to function |
| allow_resize | bool | No | Allow 1:N or N:1 generation |

Resizing (1:N and N:1)

FULL_COLUMN: Set allow_resize=True and return a DataFrame with more or fewer rows than the input:

import pandas as pd

@dd.custom_column_generator(
    required_columns=["topic"],
    side_effect_columns=["variation_id"],
)
# The second parameter is named generator_params because parameter names are validated.
def expand_topics(df: pd.DataFrame, generator_params: None, models: dict) -> pd.DataFrame:
    rows = []
    for _, row in df.iterrows():
        for i in range(3):  # Generate 3 variations per input
            rows.append({
                "topic": row["topic"],
                "question": f"Question {i+1} about {row['topic']}",
                "variation_id": i,
            })
    # The returned DataFrame has three times as many rows as the input.
    return pd.DataFrame(rows)

dd.CustomColumnConfig(
    name="question",
    generator_function=expand_topics,
    generation_strategy=dd.GenerationStrategy.FULL_COLUMN,
    allow_resize=True,
)

CELL_BY_CELL: With allow_resize=True, your function may return a single row (dict) or multiple rows (list[dict]). Return [] to drop that input row.

@dd.custom_column_generator(required_columns=["id"])
def expand_row(row: dict) -> list[dict]:
    return [
        {**row, "variant": "a"},
        {**row, "variant": "b"},
    ]

dd.CustomColumnConfig(
    name="variant",
    generator_function=expand_row,
    generation_strategy=dd.GenerationStrategy.CELL_BY_CELL,
    allow_resize=True,
)

Use cases:

  • Expansion (1:N): Generate multiple variations per input
  • Retraction (N:1): Filter, aggregate, or deduplicate records (FULL_COLUMN) or return [] per row (CELL_BY_CELL)
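
The CELL_BY_CELL retraction case has no example above, so here is a minimal standalone sketch of a row filter that returns [] to drop a row. The decorator and config are omitted so it runs without the library; in a real pipeline the function would be registered with allow_resize=True. The flatten helper is hypothetical and only mimics how per-row results might be combined.

```python
def keep_long_answers(row: dict) -> list[dict]:
    """Return [] to drop the row, or a one-element list to keep it."""
    if len(row["answer"].split()) < 3:
        return []  # dropped: this input row produces no output rows
    return [{**row, "kept": True}]


def flatten(results: list) -> list[dict]:
    """Hypothetical helper: accept a dict or a list of dicts per input row."""
    out: list[dict] = []
    for result in results:
        out.extend(result if isinstance(result, list) else [result])
    return out


rows = [
    {"id": 1, "answer": "Yes"},
    {"id": 2, "answer": "It depends on the input size"},
]
output = flatten([keep_long_answers(r) for r in rows])
print([r["id"] for r in output])  # [2]
```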

Multi-Turn Example

@dd.custom_column_generator(
    required_columns=["topic"],
    side_effect_columns=["draft", "critique"],
    model_aliases=["writer", "editor"],
)
def writer_editor(row: dict, generator_params: None, models: dict) -> dict:
    draft, _ = models["writer"].generate(prompt=f"Write about '{row['topic']}'")
    critique, _ = models["editor"].generate(prompt=f"Critique: {draft}")
    revised, _ = models["writer"].generate(prompt=f"Revise based on: {critique}\n\nOriginal: {draft}")

    row["final_text"] = revised
    row["draft"] = draft
    row["critique"] = critique
    return row
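
To check the call order of a multi-turn generator like this one without any LLM, one option is to pass in stub models. The sketch below is an assumption-laden illustration: StubModel is a hypothetical class that only mimics generate(prompt=...) returning a (text, trace) pair, and the generator body is repeated without the decorator so the snippet runs on its own.

```python
class StubModel:
    """Hypothetical stand-in exposing only generate(prompt=...) -> (text, trace)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.prompts: list[str] = []

    def generate(self, prompt: str):
        # Record the prompt and return a deterministic, numbered reply.
        self.prompts.append(prompt)
        return f"[{self.name} #{len(self.prompts)}]", None


def writer_editor(row: dict, generator_params: None, models: dict) -> dict:
    # Same body as the generator above, with the decorator left off.
    draft, _ = models["writer"].generate(prompt=f"Write about '{row['topic']}'")
    critique, _ = models["editor"].generate(prompt=f"Critique: {draft}")
    revised, _ = models["writer"].generate(
        prompt=f"Revise based on: {critique}\n\nOriginal: {draft}"
    )
    row["final_text"] = revised
    row["draft"] = draft
    row["critique"] = critique
    return row


models = {"writer": StubModel("writer"), "editor": StubModel("editor")}
row = writer_editor({"topic": "tides"}, None, models)
print(row["draft"], row["critique"], row["final_text"])
# [writer #1] [editor #1] [writer #2]
```

Because each stub records its prompts, a test can also assert that the critique was fed into the revision prompt.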

Development Testing

Test generators with real LLM calls without running the full pipeline:

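# Assumes DataDesigner is imported and my_generator is the decorated function defined earlier.
data_designer = DataDesigner()
models = data_designer.get_models(["my-model"])
# Call the generator directly with a sample row, no params, and the real models.
result = my_generator({"name": "Alice"}, None, models)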

In unit tests that mock model clients, use MagicMock(spec=ModelFacade) so async methods are auto-detected:

from unittest.mock import MagicMock
from data_designer.engine.models.facade import ModelFacade

mock_model = MagicMock(spec=ModelFacade)

Mocking only generate() will silently no-op under the async engine because the bridge routes through agenerate().
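
The effect of spec= can be seen with the standard library alone. In this sketch, FakeFacade is a hypothetical stand-in for ModelFacade so the snippet runs without the library installed, and it assumes agenerate() returns the same (response, trace) pair as generate().

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


class FakeFacade:
    """Hypothetical stand-in for ModelFacade with one sync and one async method."""

    def generate(self, prompt: str):
        raise NotImplementedError

    async def agenerate(self, prompt: str):
        raise NotImplementedError


mock_model = MagicMock(spec=FakeFacade)

# With spec=, the async method becomes an AsyncMock automatically,
# while the sync method stays a regular MagicMock.
assert isinstance(mock_model.agenerate, AsyncMock)
assert not isinstance(mock_model.generate, AsyncMock)

# Configure both paths so the test behaves the same under either engine.
mock_model.generate.return_value = ("hello", None)
mock_model.agenerate.return_value = ("hello", None)

response, _ = asyncio.run(mock_model.agenerate(prompt="Greet Alice"))
print(response)  # hello
mock_model.agenerate.assert_awaited_once_with(prompt="Greet Alice")
```

Setting return values on both generate and agenerate keeps a unit test valid whichever method the engine ends up calling.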

See Also

  • Column Configs Reference
  • Plugins Overview