Workflow Chaining | NVIDIA NeMo Data Designer

Experimental Feature

Workflow chaining is currently experimental and under active development. The documentation, examples, workflow API, metadata schema, and artifact layout are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting a discussion on GitHub.

Workflow chaining lets you split a dataset build into named stages. Each stage runs a normal DataDesigner.create() call, writes its own artifact directory, and hands a selected parquet output to the next stage as a LocalFileSeedSource.

Use it when one generation step naturally depends on the cleaned or reshaped output of another step, especially when a processor-only stage is clearer than mixing all transformations into one config.

Basic shape

import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()

# Stage 1 config: generate a summary, a question, and an answer per seed passage.
# fast_model is a model config registered under the alias "fast", defined elsewhere.
drafts = (
    dd.DataDesignerConfigBuilder(model_configs=[fast_model])
    .with_seed_dataset(dd.LocalFileSeedSource(path="parsed_docs/*.parquet"))
    .add_column(
        name="chunk_summary",
        column_type="llm_text",
        model_alias="fast",
        prompt="Summarize this passage:\n\n{{ text }}",
    )
    .add_column(
        name="question",
        column_type="llm_text",
        model_alias="fast",
        prompt="Write a question about this passage:\n\n{{ chunk_summary }}",
    )
    .add_column(
        name="answer",
        column_type="llm_text",
        model_alias="fast",
        prompt="Answer {{ question }} using this passage:\n\n{{ text }}",
    )
)

# Stage 2 config: processor-only, reshapes each row into ChatML messages.
chatml = dd.DataDesignerConfigBuilder().add_processor(
    dd.SchemaTransformProcessorConfig(
        name="chatml",
        template={
            "messages": [
                {"role": "user", "content": "{{ question }}"},
                {"role": "assistant", "content": "{{ answer }}"},
            ],
        },
    )
)

workflow = data_designer.compose_workflow(name="doc-qa")
# output_processors define the stage boundary: the next stage is seeded
# with the rows after the scratch columns have been dropped.
workflow.add_stage(
    "drafts",
    drafts,
    num_records=100,
    output_processors=[
        dd.DropColumnsProcessorConfig(
            name="drop_scratch",
            column_names=["text", "chunk_summary"],
        )
    ],
)
# Select the "chatml" processor's output as this stage's output.
workflow.add_stage("chatml", chatml, output="processor:chatml")

results = workflow.run()
training_rows = results.load_dataset()
results.export("chatml.jsonl")
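
Before training on an exported file, it can help to confirm that every row has the shape the chatml template describes. The sketch below uses only the standard library and is not part of Data Designer. It assumes the export holds one JSON object per line with a top-level messages list; that layout is an assumption inferred from the template above, so check it against your own export. check_chatml_jsonl and the sample rows are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def check_chatml_jsonl(path: Path) -> int:
    """Count rows, raising ValueError on the first row that is not a user/assistant pair."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            roles = [message["role"] for message in row["messages"]]
            if roles != ["user", "assistant"]:
                raise ValueError(f"line {line_number}: unexpected roles {roles}")
            count += 1
    return count


# Hypothetical rows standing in for a real export (assumed shape, see above).
sample_rows = [
    {"messages": [
        {"role": "user", "content": "What does the passage describe?"},
        {"role": "assistant", "content": "It describes workflow chaining."},
    ]},
    {"messages": [
        {"role": "user", "content": "When is a processor-only stage useful?"},
        {"role": "assistant", "content": "For cleanup and schema transforms."},
    ]},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "chatml.jsonl"
    sample_path.write_text(
        "".join(json.dumps(row) + "\n" for row in sample_rows), encoding="utf-8"
    )
    row_count = check_chatml_jsonl(sample_path)

print(row_count)  # 2
```

To check a real export, call check_chatml_jsonl(Path("chatml.jsonl")) after the workflow finishes.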

Stage outputs

A stage can expose different views of its data:

Surface	What it returns
`results["stage_name"]`	The effective `DatasetCreationResults` for that stage. If the stage uses `output_processors`, this points at the output-processor run.
`results.load_stage_output("stage_name")`	The selected output handed to downstream stages. This follows `output="processor:<name>"` and `on_success`.
`results.load_dataset()`	The selected output from the final stage.

Processors added with config_builder.add_processor(...) run inside the stage and usually create side artifacts. They do not automatically change what the next stage receives. Use output_processors=[...] when a processor should define the stage boundary output.

Processor-only stages

Stages can be processor-only when they receive seed data from an upstream stage:

cleanup = dd.DataDesignerConfigBuilder().add_processor(
    dd.DropColumnsProcessorConfig(
        name="drop_private_fields",
        column_names=["email", "raw_notes"],
    )
)

workflow.add_stage("cleanup", cleanup)

This is useful for final cleanup, schema transforms, and format-specific export preparation.

Postprocessing hooks

Use output_processors for structured transforms that can be expressed as processor configs. Use on_success when a stage boundary needs arbitrary Python code, such as filtering rows before the next stage runs.

The callback receives the completed stage artifact directory and must return a parquet file or directory that can seed downstream stages.

from pathlib import Path

import pandas as pd


def keep_disagreements(stage_path: Path) -> Path:
    # Load the stage's generated rows and keep only those where the judges disagree.
    df = pd.read_parquet(stage_path / "parquet-files")
    df = df[df["judge_a"] != df["judge_b"]]

    # Write the filtered rows to a new directory and return it as the selected output.
    out = stage_path / "callback-outputs" / "disagreements-v1"
    out.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out / "data.parquet", index=False)
    return out


# candidates, judges, and enriched are config builders defined elsewhere.
workflow = data_designer.compose_workflow(name="judge-disagreements")
workflow.add_stage("candidates", candidates, num_records=10_000)
workflow.add_stage(
    "judged",
    judges,
    on_success=keep_disagreements,
    on_success_version="disagreements-v1",
    allow_empty=True,
)
workflow.add_stage("enriched", enriched)

on_success_version is part of the stage resume identity. Change it when the callback’s output semantics change. If a callback returns zero rows, the workflow raises by default; set allow_empty=True to mark that stage as completed empty and skip downstream stages.
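
The callback examples on this page name the output directory after the on_success_version value. The sketch below factors that convention into a helper and adds a pre-flight check that the returned path looks like something a downstream stage could seed from. Both helpers are hypothetical and not part of Data Designer. The check is an assumption about what counts as usable (a .parquet file, or a directory with at least one .parquet file directly inside it); it does not open the file or count rows.

```python
import tempfile
from pathlib import Path


def versioned_output_dir(stage_path: Path, version: str) -> Path:
    """Create and return <stage_path>/callback-outputs/<version>."""
    out = stage_path / "callback-outputs" / version
    out.mkdir(parents=True, exist_ok=True)
    return out


def check_callback_output(path: Path) -> Path:
    """Return path if it is a .parquet file or a directory holding one; raise otherwise."""
    if path.is_file() and path.suffix == ".parquet":
        return path
    if path.is_dir() and any(path.glob("*.parquet")):
        return path
    raise ValueError(f"{path} is not a parquet file or a directory containing one")


with tempfile.TemporaryDirectory() as tmp:
    stage_path = Path(tmp) / "judged"  # hypothetical stage artifact directory
    out = versioned_output_dir(stage_path, "disagreements-v1")

    empty_dir_ok = True
    try:
        check_callback_output(out)
    except ValueError:
        empty_dir_ok = False  # nothing written yet, so the check fails

    # Placeholder bytes only; a real callback would call df.to_parquet(...) here.
    (out / "data.parquet").write_bytes(b"")
    checked = check_callback_output(out)
    relative = checked.relative_to(stage_path).as_posix()

print(empty_dir_ok, relative)  # False callback-outputs/disagreements-v1
```

Deriving the directory name from the version string keeps outputs from different callback versions side by side, so an old result is never overwritten.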

Changing row counts between stages

Each stage has a fixed requested row count while it runs. To resize a workflow, change the selected output at a stage boundary and let the next stage seed from that output.

Filtering is the shrink case: a callback can write fewer rows than the stage generated, and the next stage defaults to that filtered row count when num_records is omitted.

Growing is the explode case. If a downstream stage asks for more rows than the upstream selected output contains, the seed reader cycles through the seed rows in order.

workflow = data_designer.compose_workflow(name="persona-conversations")
workflow.add_stage("personas", personas, num_records=100)
workflow.add_stage("conversations", conversations, num_records=1_000)

The conversations stage receives 100 persona rows as its seed and requests 1,000 output rows. Data Designer reuses persona rows in order, so each persona seeds about 10 conversation rows. Add downstream sampler or LLM columns when each repeated seed row should produce distinct outputs.
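
The shrink and grow rules reduce to simple arithmetic. The following sketch models them in plain Python to show where "about 10" comes from. It illustrates the behavior described above and is not Data Designer's seed reader; seed_indices is a hypothetical helper.

```python
from collections import Counter
from itertools import cycle, islice
from typing import List, Optional


def seed_indices(seed_rows: int, num_records: Optional[int] = None) -> List[int]:
    """Return the seed row index used for each output row.

    With num_records omitted, the stage defaults to one output row per seed row.
    With a larger num_records, seed rows are reused in order.
    """
    if seed_rows <= 0:
        raise ValueError("the upstream selected output has no rows to seed from")
    requested = seed_rows if num_records is None else num_records
    return list(islice(cycle(range(seed_rows)), requested))


grown = Counter(seed_indices(100, 1_000))   # 100 personas, 1,000 conversations
uneven = Counter(seed_indices(100, 1_050))  # request is not a multiple of 100
shrunk = seed_indices(37)                   # filtered to 37 rows, num_records omitted

print(set(grown.values()))    # {10}
print(uneven[0], uneven[99])  # 11 10
print(len(shrunk))            # 37
```

When the requested count is not a multiple of the seed count, the earliest seed rows are used one extra time, which is why the per-seed count is only approximate.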

For custom upsampling, make the expanded dataset the selected output of the upstream stage:

from pathlib import Path

import pandas as pd


def repeat_each_persona(stage_path: Path) -> Path:
    df = pd.read_parquet(stage_path / "parquet-files")

    # Repeat every persona row 10 times, keeping the copies adjacent.
    expanded = df.reset_index(drop=True)
    expanded = expanded.loc[expanded.index.repeat(10)].copy()
    # Number the copies of each persona 0..9, then reset to a unique index.
    expanded["variant_id"] = expanded.groupby(level=0).cumcount()
    expanded = expanded.reset_index(drop=True)

    out = stage_path / "callback-outputs" / "repeat-personas-v1"
    out.mkdir(parents=True, exist_ok=True)
    expanded.to_parquet(out / "data.parquet", index=False)
    return out


workflow = data_designer.compose_workflow(name="custom-persona-conversations")
workflow.add_stage(
    "personas",
    personas,
    num_records=100,
    on_success=repeat_each_persona,
    on_success_version="repeat-personas-v1",
)
workflow.add_stage("conversations", conversations)

In this version, conversations defaults to the 1,000-row callback output and can use variant_id to diversify prompts.

Resume

Workflow names are durable artifact identities. Reusing the same name with resume=ResumeMode.IF_POSSIBLE reuses compatible completed stages, resumes a matching partial stage through DataDesigner.create(..., resume=ResumeMode.ALWAYS), and reruns the first changed or missing stage plus its descendants.

from data_designer.interface import ResumeMode

results = workflow.run(resume=ResumeMode.IF_POSSIBLE)

Use ResumeMode.ALWAYS for strict resume. Until the first checkpoint is recovered, a changed stage or a missing selected output raises instead of starting fresh. If a matching partial stage resumes successfully, descendants are recreated from that stage’s current output.
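
For a linear workflow, the ResumeMode.IF_POSSIBLE rule "reuse compatible completed stages, then rerun the first changed or missing stage plus its descendants" can be modeled as a single scan. The sketch below illustrates that rule only. It is not Data Designer's implementation, it leaves out the partial-stage case, and the fingerprint strings are hypothetical stand-ins for whatever identity the library actually records.

```python
from typing import Dict, List, Tuple


def plan_resume(
    stage_order: List[str],
    current: Dict[str, str],
    stored: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Split stages into (reused, rerun) for a linear workflow.

    current maps each stage to the fingerprint of its present definition.
    stored maps each completed stage to the fingerprint saved with its artifacts.
    """
    for position, name in enumerate(stage_order):
        # A missing stage has no stored fingerprint, so .get() returns None here.
        if stored.get(name) != current[name]:
            return stage_order[:position], stage_order[position:]
    return list(stage_order), []


stage_order = ["candidates", "judged", "enriched"]
stored = {
    "candidates": "cfg-a",
    "judged": "cfg-b|disagreements-v1",
    "enriched": "cfg-c",
}
# Bumping the callback version changes only the judged stage's identity.
current = dict(stored, judged="cfg-b|disagreements-v2")

reused, rerun = plan_resume(stage_order, current, stored)
print(reused)  # ['candidates']
print(rerun)   # ['judged', 'enriched']
```

The enriched stage is unchanged here, yet it is rerun because its seed comes from a stage that was rebuilt.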

Review gates

Use targets to materialize an intermediate stage without running the rest of the workflow. export_stage() writes the selected stage output for review. After review, pass the approved parquet as a stage output override and resume the downstream target.

draft_results = workflow.run(targets="drafts")
draft_results.export_stage("drafts", "drafts_for_review.parquet")

results = workflow.run(
    targets="expanded",
    resume=ResumeMode.IF_POSSIBLE,
    stage_output_overrides={"drafts": "approved.parquet"},
)

If the reviewed data replaces a stage’s selected output in place, run with resume=ResumeMode.IF_POSSIBLE and rerun_from="expanded" to rebuild that stage and its descendants from the current boundary output.

Current limits

Stages are linear. DAGs, parallel branches, and joins are planned separately.
push_to_hub() does not support selected processor or callback outputs yet. Use export() for the selected workflow output.
on_success callbacks are trusted user code. If a callback returns a path, Data Designer reads that path as the next stage input.
The artifact layout is intended for inspection, but it is not yet a stable public contract.