nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.base

NDD-backed base stage for NemotronCC synthetic data generation.

This module re-implements the BaseSyntheticStage interface on top of DataDesignerStage (NeMo Data Designer) instead of using LLMClient/AsyncLLMClient directly. Child stages (WikipediaParaphrasingStage, DistillStage, etc.) can inherit from this class with the same field-based API (system_prompt, prompt, input_field, output_field) and gain NDD execution automatically.

Module Contents

Classes

Name | Description
NDDBaseSyntheticStage | Base class for NemotronCC synthetic stages backed by NeMo Data Designer.

Data

_FORMATTED_PROMPT_COL

API

class nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.base.NDDBaseSyntheticStage(
config_builder: data_designer.config.DataDesignerConfigBuilder | None = None,
data_designer_config_file: str | None = None,
model_providers: list | None = None,
verbose: bool = False,
system_prompt: str | None = None,
prompt: str | None = None,
input_field: str | None = None,
output_field: str | None = None,
model_alias: str | None = None,
model_configs: list | None = None
)
Dataclass

Bases: DataDesignerStage

Base class for NemotronCC synthetic stages backed by NeMo Data Designer.

Parameters

system_prompt : str | None
    Optional system prompt prepended to every LLM call.
prompt : str | None
    User prompt template. Must contain {document}, which is replaced by the value of input_field at runtime.
input_field : str | None
    Column name in the input DataFrame whose value is substituted into the prompt template.
output_field : str | None
    Column name where the LLM response is stored in the output DataFrame.
model_alias : str | None
    NDD model alias that maps to a ModelConfig entry.
model_configs : list | None
    List of data_designer.config.ModelConfig objects. If not provided, NDD uses its default model configuration.
model_providers : list | None
    Optional list of data_designer.config.models.ModelProvider for custom endpoints. Forwarded to DataDesignerStage.
verbose : bool
    When False (default), suppress NDD log output.
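The {document} placeholder can be illustrated with plain Python string formatting. This is a minimal sketch, assuming the template is filled with str.format-style substitution; the helper name format_prompt and the sample row are hypothetical and not part of this module.

```python
def format_prompt(prompt: str, input_field: str, sample: dict) -> str:
    """Fill the {document} placeholder with the value of `input_field`.

    Hypothetical helper: assumes str.format-style substitution.
    """
    if "{document}" not in prompt:
        raise ValueError("prompt must contain the {document} placeholder")
    return prompt.format(document=sample[input_field])


# Hypothetical row from an input DataFrame, represented as a dict.
row = {"text": "Paris is the capital of France."}
formatted = format_prompt("Paraphrase the following:\n{document}", "text", row)
print(formatted)
```

Under this assumption, any other literal braces in the template would need to be doubled ({{ and }}) so they are not read as placeholders.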

config_builder
DataDesignerConfigBuilder | None = field(default=None, repr=False)
data_designer_config_file
str | None = None
input_field
str | None = None
model_alias
str | None = None
model_configs
list | None = None
model_providers
list | None = None
output_field
str | None = None
prompt
str | None = None
system_prompt
str | None = None
verbose
bool = False

nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.base.NDDBaseSyntheticStage.__post_init__() -> None

nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.base.NDDBaseSyntheticStage._build_config_from_prompt() -> None

Auto-build a DataDesignerConfigBuilder from stage fields.

Skipped when config_builder or data_designer_config_file is already provided (advanced usage).
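The skip condition can be sketched as a small predicate. This reflects an assumption about the control flow rather than the library's actual code, and needs_auto_build is a hypothetical name.

```python
from __future__ import annotations


def needs_auto_build(
    config_builder: object | None,
    data_designer_config_file: str | None,
) -> bool:
    """Return True only when neither advanced option was supplied.

    Hypothetical helper mirroring the documented rule: a user-provided
    builder or config file takes precedence over the prompt fields.
    """
    return config_builder is None and data_designer_config_file is None


# Simple usage: only the prompt-based fields are set.
print(needs_auto_build(None, None))
```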

nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.base.NDDBaseSyntheticStage._process_llm_prompt(
sample: dict
) -> str

Process the input sample to create the LLM prompt.

Called per-row before NDD generation. Child classes can override this to customise prompt formatting.
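A child-class override of this hook might look like the sketch below. To keep it runnable without NeMo Curator installed, it uses a stand-in base class (StandInStage) whose default behaviour is assumed; TruncatingStage and max_chars are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StandInStage:
    """Stand-in for NDDBaseSyntheticStage so the sketch runs on its own.

    Only the fields and hook used here are mirrored; the default body
    is an assumption.
    """

    prompt: str | None = None
    input_field: str | None = None

    def _process_llm_prompt(self, sample: dict) -> str:
        # Assumed default: substitute the input column into {document}.
        return self.prompt.format(document=sample[self.input_field])


@dataclass
class TruncatingStage(StandInStage):
    """Hypothetical child stage that caps document length before formatting."""

    max_chars: int = 40

    def _process_llm_prompt(self, sample: dict) -> str:
        # Copy the row so the caller's data is left untouched.
        trimmed = dict(sample)
        trimmed[self.input_field] = str(sample[self.input_field])[: self.max_chars]
        return super()._process_llm_prompt(trimmed)


stage = TruncatingStage(prompt="Summarize: {document}", input_field="text", max_chars=5)
result = stage._process_llm_prompt({"text": "Hello, world"})
print(result)
```

Delegating to super() keeps the template handling in one place, so the child only changes what is fed into it.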

nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.base.NDDBaseSyntheticStage._process_llm_response(
response: list[str]
) -> str

Process a single response from the LLM.

Called per-row after NDD generation. Child classes can override this to customise response parsing.
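Response parsing can be customised in the same way. The sketch below again uses a stand-in base class with an assumed default (return the first generation); StripPreambleStage and its preamble attribute are hypothetical.

```python
from __future__ import annotations


class StandInStage:
    """Stand-in for NDDBaseSyntheticStage; the default body is an assumption."""

    def _process_llm_response(self, response: list[str]) -> str:
        # Assumed default: return the first generation, or "" when empty.
        return response[0] if response else ""


class StripPreambleStage(StandInStage):
    """Hypothetical child stage that drops a leading label from the output."""

    preamble = "Paraphrase:"

    def _process_llm_response(self, response: list[str]) -> str:
        text = super()._process_llm_response(response).strip()
        if text.startswith(self.preamble):
            # Remove the label and any whitespace that followed it.
            text = text[len(self.preamble):].lstrip()
        return text


cleaned = StripPreambleStage()._process_llm_response(
    ["  Paraphrase: The sky is blue.  "]
)
print(cleaned)
```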

nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.base.NDDBaseSyntheticStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.base.NDDBaseSyntheticStage.outputs() -> tuple[list[str], list[str]]

nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.base.NDDBaseSyntheticStage.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch

nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.base._FORMATTED_PROMPT_COL = '_ndd_formatted_prompt'