nemo_curator.stages.synthetic.nemotron_cc.base


This module contains a simple stage for generating synthetic data. It takes an Empty task and a prompt and produces its output as a DocumentBatch.

Module Contents

Classes

Name | Description
BaseSyntheticStage | A simple stage for generating synthetic data. It takes an Empty task and a prompt and produces its output as a DocumentBatch.

API

class nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage(
system_prompt: str = None,
prompt: str = None,
input_field: str = None,
output_field: str = None,
client: nemo_curator.models.client.llm_client.AsyncLLMClient | nemo_curator.models.client.llm_client.LLMClient = None,
model_name: str = None,
generation_config: nemo_curator.models.client.llm_client.GenerationConfig | None = None,
name: str = 'NemotronCCBaseStage'
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

A simple stage for generating synthetic data. It takes an Empty task and a prompt and produces its output as a DocumentBatch.

client
AsyncLLMClient | LLMClient = None
generation_config
GenerationConfig | None = None
input_field
str = None
model_name
str = None
name
str = 'NemotronCCBaseStage'
output_field
str = None
prompt
str = None
system_prompt
str = None
nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage.__post_init__() -> None
nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage._generate_responses_async(
df: pandas.DataFrame
) -> list[str]
async

Generate responses asynchronously using concurrent requests.
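The concurrent-request pattern described here can be illustrated with a minimal sketch. It does not reproduce the library's implementation: `FakeAsyncClient`, its `query` method, and `generate_responses` are hypothetical stand-ins, and the sketch assumes each request returns a list of candidate strings of which the first is kept.

```python
import asyncio


class FakeAsyncClient:
    """Hypothetical stand-in for an async LLM client; not the NeMo Curator API."""

    async def query(self, prompt: str) -> list[str]:
        await asyncio.sleep(0.01)  # simulate network latency
        return [f"echo: {prompt}"]


async def generate_responses(client: FakeAsyncClient, prompts: list[str]) -> list[str]:
    # One coroutine per prompt; gather runs them concurrently and
    # returns the results in the same order as the inputs.
    results = await asyncio.gather(*(client.query(p) for p in prompts))
    # Assumption: each result is a list of candidates; keep the first.
    return [candidates[0] for candidates in results]


responses = asyncio.run(generate_responses(FakeAsyncClient(), ["a", "b", "c"]))
```

Because `asyncio.gather` preserves input order, the returned list can be written back to the DataFrame column row by row.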

nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage._process_async(
df: pandas.DataFrame
) -> list[str]

Process samples using async client (concurrent).

This method handles both cases:

  • Normal case: No event loop exists, creates one with asyncio.run()
  • Edge case: Called from async context, runs in separate thread
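The two cases above follow a common asyncio pattern, sketched below under stated assumptions. `run_coroutine_blocking` and `work` are hypothetical names used for illustration only; the library's actual code may differ.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def work() -> list[str]:
    await asyncio.sleep(0)
    return ["done"]


def run_coroutine_blocking(make_coro):
    """Run a coroutine to completion from sync code, with or without a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # Normal case: no event loop in this thread, so asyncio.run() is safe.
        return asyncio.run(make_coro())
    # Edge case: a loop is already running and asyncio.run() would raise,
    # so run the coroutine in a worker thread that gets its own loop.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, make_coro()).result()


sync_result = run_coroutine_blocking(work)  # called from plain sync code


async def caller() -> list[str]:
    return run_coroutine_blocking(work)  # called while a loop is running


nested_result = asyncio.run(caller())
```

Note that in the edge case the calling thread blocks until the worker thread finishes, so the outer event loop makes no progress during that time.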
nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage._process_llm_prompt(
sample: dict
) -> str

Process the input sample to create the LLM prompt.

nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage._process_llm_response(
response: list[str]
) -> str

Process a single response from the LLM.
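As a rough illustration of the two hooks above, the sketch below builds a prompt from a sample and reduces a response list to one string. Everything in it is an assumption: the `{document}` placeholder, the choice to keep the first candidate, and the function names `build_prompt` and `first_response` are hypothetical and not taken from the library.

```python
def build_prompt(template: str, sample: dict, input_field: str) -> str:
    # Assumption: the template exposes a single "{document}" placeholder
    # that is filled with the value of the configured input field.
    return template.format(document=sample[input_field])


def first_response(response: list[str]) -> str:
    # Assumption: only the first candidate is kept, with surrounding
    # whitespace removed; an empty list yields an empty string.
    return response[0].strip() if response else ""


prompt_text = build_prompt("Rewrite this text:\n{document}", {"text": "Hello"}, "text")
cleaned = first_response(["  Hi there.  ", "unused"])
```

Subclasses would typically override hooks like these to change how prompts are assembled or how raw model output is cleaned.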

nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage._process_sync(
df: pandas.DataFrame
) -> list[str]

Process DataFrame using synchronous sequential processing.
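Sequential processing can be sketched as a plain loop that issues one blocking request per row. This is an illustrative sketch only: `FakeSyncClient` and `process_rows` are hypothetical, and a list of dicts stands in for the rows of a pandas DataFrame to keep the example dependency-free.

```python
class FakeSyncClient:
    """Hypothetical synchronous client; not the NeMo Curator API."""

    def query(self, prompt: str) -> list[str]:
        return [prompt.upper()]


def process_rows(client: FakeSyncClient, rows: list[dict], input_field: str) -> list[str]:
    outputs = []
    for row in rows:
        # One blocking request per row, issued in input order.
        candidates = client.query(row[input_field])
        outputs.append(candidates[0])
    return outputs


rows = [{"text": "first"}, {"text": "second"}]
generated = process_rows(FakeSyncClient(), rows, "text")
```

Total latency here is the sum of the per-row latencies, which is the main reason to prefer the async path when an async client is available.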

nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch
nemo_curator.stages.synthetic.nemotron_cc.base.BaseSyntheticStage.setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None