Copyright © 2026, NVIDIA Corporation.

NeMo Data Designer

Workflow Chaining

Experimental Feature

Workflow chaining is currently experimental and under active development. The documentation, examples, workflow API, metadata schema, and artifact layout are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting a discussion on GitHub.

Workflow chaining lets you split a dataset build into named stages. Each stage runs a normal DataDesigner.create() call, writes its own artifact directory, and hands a selected parquet output to the next stage as a LocalFileSeedSource.

Use it when one generation step naturally depends on the cleaned or reshaped output of another step, especially when a processor-only stage is clearer than mixing all transformations into one config.
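The handoff between stages can be pictured with a small standard-library sketch. This is not Data Designer code: `run_linear_stages`, the stage functions, and the use of JSON files in place of parquet are all assumptions made for illustration. It only shows the pattern described above, where each stage writes into its own directory and the next stage reads that file as its seed.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

Row = dict[str, object]
StageFn = Callable[[list[Row]], list[Row]]


def run_linear_stages(
    stages: list[tuple[str, StageFn]],
    seed_rows: list[Row],
    artifact_root: Path,
) -> Path:
    """Run stages in order; each stage reads the previous stage's output file.

    Hypothetical helper: JSON stands in for parquet to stay stdlib-only.
    """
    # Write the initial seed so that every stage reads its input from disk.
    current_input = artifact_root / "seed.json"
    current_input.write_text(json.dumps(seed_rows))

    for name, stage_fn in stages:
        # Each stage gets its own artifact directory.
        stage_dir = artifact_root / name
        stage_dir.mkdir(parents=True)
        rows = json.loads(current_input.read_text())
        output_path = stage_dir / "output.json"
        output_path.write_text(json.dumps(stage_fn(rows)))
        # The selected output becomes the next stage's seed.
        current_input = output_path
    return current_input


def add_question(rows: list[Row]) -> list[Row]:
    return [{**row, "question": f"What is {row['topic']}?"} for row in rows]


def drop_topic(rows: list[Row]) -> list[Row]:
    return [{k: v for k, v in row.items() if k != "topic"} for row in rows]


with tempfile.TemporaryDirectory() as tmp:
    final_path = run_linear_stages(
        [("drafts", add_question), ("cleanup", drop_topic)],
        [{"topic": "parquet"}],
        Path(tmp),
    )
    final_rows = json.loads(final_path.read_text())
    stage_dirs = sorted(p.name for p in Path(tmp).iterdir() if p.is_dir())
```

The second stage never sees the in-memory result of the first; it only sees the file the first stage selected as its output.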

Basic shape

    import data_designer.config as dd
    from data_designer.interface import DataDesigner

    data_designer = DataDesigner()

    # Stage 1 config: summarize each seed passage, then draft a question and answer.
    # `fast_model` is assumed to be a model config defined elsewhere with alias "fast".
    drafts = (
        dd.DataDesignerConfigBuilder(model_configs=[fast_model])
        .with_seed_dataset(dd.LocalFileSeedSource(path="parsed_docs/*.parquet"))
        .add_column(
            name="chunk_summary",
            column_type="llm_text",
            model_alias="fast",
            prompt="Summarize this passage:\n\n{{ text }}",
        )
        .add_column(
            name="question",
            column_type="llm_text",
            model_alias="fast",
            prompt="Write a question about this passage:\n\n{{ chunk_summary }}",
        )
        .add_column(
            name="answer",
            column_type="llm_text",
            model_alias="fast",
            prompt="Answer {{ question }} using this passage:\n\n{{ text }}",
        )
    )

    # Stage 2 config: processor-only, reshapes each row into a chat-style record.
    chatml = dd.DataDesignerConfigBuilder().add_processor(
        dd.SchemaTransformProcessorConfig(
            name="chatml",
            template={
                "messages": [
                    {"role": "user", "content": "{{ question }}"},
                    {"role": "assistant", "content": "{{ answer }}"},
                ],
            },
        )
    )

    workflow = data_designer.compose_workflow(name="doc-qa")

    # The output processor drops scratch columns before the handoff to the next stage.
    workflow.add_stage(
        "drafts",
        drafts,
        num_records=100,
        output_processors=[
            dd.DropColumnsProcessorConfig(
                name="drop_scratch",
                column_names=["text", "chunk_summary"],
            )
        ],
    )

    # Select the "chatml" processor's output as this stage's output.
    workflow.add_stage("chatml", chatml, output="processor:chatml")

    results = workflow.run()
    training_rows = results.load_dataset()
    results.export("chatml.jsonl")
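To make the final row shape concrete, here is a rough standard-library sketch of what filling the `chatml` template means for one row. It is a stand-in, not the library's implementation: the `render` helper is hypothetical, and it handles only bare `{{ name }}` placeholders with a regular expression rather than a real template engine.

```python
import re

# Matches bare placeholders such as "{{ question }}" (illustrative subset only).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, row):
    """Recursively fill {{ name }} placeholders inside strings, lists, and dicts."""
    if isinstance(template, str):
        return PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), template)
    if isinstance(template, list):
        return [render(item, row) for item in template]
    if isinstance(template, dict):
        return {key: render(value, row) for key, value in template.items()}
    return template


# Same template structure as the "chatml" processor in the example above.
chatml_template = {
    "messages": [
        {"role": "user", "content": "{{ question }}"},
        {"role": "assistant", "content": "{{ answer }}"},
    ],
}

# A made-up row, shaped like the output of the "drafts" stage after
# the scratch columns have been dropped.
row = {
    "question": "What is a seed dataset?",
    "answer": "The input rows for a stage.",
}
record = render(chatml_template, row)
```

Each upstream row becomes one record with a two-message `messages` list, which is the shape `results.export("chatml.jsonl")` would then write out line by line.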

Stage outputs

A stage can expose different views of its data:

| Surface | What it returns |
| --- | --- |
| results["stage_name"] | The effective DatasetCreationResults for that stage. If the stage uses output_processors, this points at the output-processor run. |
| results.load_stage_output("stage_name") | The selected output handed to downstream stages. This follows output="processor:<name>" and on_success. |
| results.load_dataset() | The selected output from the final stage. |

Processors added with config_builder.add_processor(...) run inside the stage and usually create side artifacts. They do not automatically change what the next stage receives. Use output_processors=[...] when a processor should define the stage boundary output.
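The distinction can be modeled with a toy stage runner. Everything here is hypothetical (`ToyStageResult`, `run_toy_stage`, `drop_columns`) and says nothing about how Data Designer is implemented internally; it only mirrors the rule stated above, that in-stage processors produce side artifacts while output processors define what crosses the stage boundary.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]
Processor = Callable[[list[Row]], list[Row]]


@dataclass
class ToyStageResult:
    dataset: list[Row]                    # rows the stage produced
    side_artifacts: dict[str, list[Row]]  # named outputs of in-stage processors
    boundary_output: list[Row]            # what the next stage would receive


def run_toy_stage(
    rows: list[Row],
    processors: dict[str, Processor],
    output_processors: list[Processor],
) -> ToyStageResult:
    # In-stage processors: results are stored on the side, the dataset is untouched.
    side = {name: proc(rows) for name, proc in processors.items()}
    # Output processors: applied in order to define the stage boundary output.
    boundary = rows
    for proc in output_processors:
        boundary = proc(boundary)
    return ToyStageResult(rows, side, boundary)


def drop_columns(names: set[str]) -> Processor:
    def _drop(rows: list[Row]) -> list[Row]:
        return [{k: v for k, v in row.items() if k not in names} for row in rows]
    return _drop


rows = [{"text": "raw passage", "question": "Q?", "answer": "A."}]

side_only = run_toy_stage(
    rows, processors={"drop_scratch": drop_columns({"text"})}, output_processors=[]
)
at_boundary = run_toy_stage(
    rows, processors={}, output_processors=[drop_columns({"text"})]
)
```

In `side_only`, the next stage would still receive the `text` column; in `at_boundary`, it would not.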

Processor-only stages

Stages can be processor-only when they receive seed data from an upstream stage:

    cleanup = dd.DataDesignerConfigBuilder().add_processor(
        dd.DropColumnsProcessorConfig(
            name="drop_private_fields",
            column_names=["email", "raw_notes"],
        )
    )

    workflow.add_stage("cleanup", cleanup)

This is useful for final cleanup, schema transforms, and format-specific export preparation.
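As a row-level picture of what such a cleanup stage amounts to, the sketch below drops the same two fields from an in-memory row and serializes the result as JSON Lines. The helper names and the sample row are made up for illustration, and plain dictionaries stand in for the parquet data a real stage would receive.

```python
import json

Row = dict[str, object]


def drop_fields(rows: list[Row], column_names: list[str]) -> list[Row]:
    """Return copies of the rows without the named columns."""
    blocked = set(column_names)
    return [{k: v for k, v in row.items() if k not in blocked} for row in rows]


def to_jsonl(rows: list[Row]) -> str:
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


# Hypothetical upstream rows containing fields that should not be exported.
upstream_rows = [
    {
        "id": 1,
        "email": "a@example.com",
        "raw_notes": "call back",
        "summary": "Asked about pricing.",
    },
]

cleaned = drop_fields(upstream_rows, ["email", "raw_notes"])
jsonl_text = to_jsonl(cleaned)
```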

Current limits

  • Stages are linear. DAGs, parallel branches, and joins are planned separately.
  • Stage-level resume is not implemented yet.
  • push_to_hub() does not support selected processor or callback outputs yet. Use export() for the selected workflow output.
  • on_success callbacks are trusted user code. If a callback returns a path, Data Designer reads that path as the next stage input.
  • The artifact layout is intended for inspection, but it is not yet a stable public contract.
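Because a path returned by an on_success callback is read as the next stage's input, it can be worth checking that path inside your own callback before returning it. The sketch below is one possible guard, not part of Data Designer: `resolve_callback_path` is a hypothetical helper, and the idea of confining outputs to a single allowed root directory is an assumption about how you organize your artifacts.

```python
import tempfile
from pathlib import Path


def resolve_callback_path(returned: str, allowed_root: Path) -> Path:
    """Resolve a callback-returned path and require it to stay under allowed_root."""
    root = allowed_root.resolve()
    # Joining with an absolute path replaces the root, so absolute paths
    # outside the root are caught by the containment check below.
    candidate = (root / returned).resolve()
    if not candidate.is_relative_to(root):  # requires Python 3.9+
        raise ValueError(f"{returned!r} resolves outside {root}")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # A path inside the root is accepted and returned in resolved form.
    accepted = resolve_callback_path("cleanup/output.parquet", root)
    accepted_ok = accepted == root.resolve() / "cleanup" / "output.parquet"

    # A path that climbs out of the root is rejected.
    try:
        resolve_callback_path("../elsewhere.parquet", root)
        rejected = False
    except ValueError:
        rejected = True
```

This does not make untrusted callbacks safe; it only catches accidental paths that point outside the directory you intended.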