nemo_curator.stages.math.modifiers.llm_cleanup


Module Contents

Classes

| Name | Description |
| --- | --- |
| `LLMCleanupStage` | LLM-based text cleanup stage using vLLM for distributed inference. |

API

class nemo_curator.stages.math.modifiers.llm_cleanup.LLMCleanupStage(
model: str | nemo_curator.models.vllm_model.VLLMModel,
system_prompt: str,
text_field: str = 'text',
output_field: str = 'cleaned_text',
max_model_len: int | None = None,
classification: bool = False,
temperature: float = 0.7,
top_p: float = 0.8,
top_k: int = 20,
min_p: float = 0.0,
max_tokens: int | None = None,
cache_dir: str | None = None,
n_tokens_field: str = 'n_tokens'
)

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

LLM-based text cleanup stage using vLLM for distributed inference.

This stage uses a VLLMModel wrapper to generate cleaned text from input prompts. It handles filtering, sorting, prompt formatting, and output field management.

_model_kwargs

model_name = model.model

name

resources = Resources(cpus=1.0, gpus=1.0)

nemo_curator.stages.math.modifiers.llm_cleanup.LLMCleanupStage._initialize_model() -> None

Create and initialize the VLLMModel.

nemo_curator.stages.math.modifiers.llm_cleanup.LLMCleanupStage.inputs() -> tuple[list[str], list[str]]

nemo_curator.stages.math.modifiers.llm_cleanup.LLMCleanupStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.math.modifiers.llm_cleanup.LLMCleanupStage.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch
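Conceptually, `process()` formats each document's text together with the system prompt, runs batched generation, and writes the result under the configured output field. A minimal self-contained sketch of that flow, with a stub callable standing in for the vLLM generation call (the chat-message prompt shape and the `clean_batch` helper are assumptions; field names follow the constructor defaults):

```python
def clean_batch(docs: list[dict], system_prompt: str, generate,
                text_field: str = "text",
                output_field: str = "cleaned_text") -> list[dict]:
    # Format each document into a chat-style prompt (assumed shape).
    prompts = [
        [{"role": "system", "content": system_prompt},
         {"role": "user", "content": doc[text_field]}]
        for doc in docs
    ]
    # `generate` stands in for the model's batched generation call.
    outputs = generate(prompts)
    # Write the generated text back under the configured output field,
    # keeping the original text field intact.
    for doc, out in zip(docs, outputs):
        doc[output_field] = out
    return docs

# Usage with a stub generator that uppercases the user message:
docs = [{"text": "some  messy   text"}]
cleaned = clean_batch(docs, "Clean this text.",
                      lambda ps: [p[1]["content"].upper() for p in ps])
```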
nemo_curator.stages.math.modifiers.llm_cleanup.LLMCleanupStage.setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Load tokenizer per worker. Falls back to full init if setup_on_node was not called.

nemo_curator.stages.math.modifiers.llm_cleanup.LLMCleanupStage.setup_on_node(
_node_info: nemo_curator.backends.base.NodeInfo | None = None,
_worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None

Download weights and initialize vLLM once per node to avoid torch.compile race conditions.
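The split between `setup_on_node` (heavy, once-per-node work) and `setup` (light, per-worker work with a fallback) follows a common initialization pattern. A minimal sketch of that pattern, with stand-in strings in place of real model and tokenizer objects (the class name, guard mechanism, and method bodies are illustrative, not the stage's actual implementation):

```python
class NodeInitStage:
    """Illustrative once-per-node init with a per-worker fallback."""

    _node_initialized = False  # shared guard; real code may coordinate differently

    def setup_on_node(self) -> None:
        # Heavy, node-wide work: download weights, initialize the engine.
        # Running this once per node avoids torch.compile race conditions.
        type(self)._node_initialized = True
        self._download_and_initialize()

    def setup(self) -> None:
        # Light per-worker work (e.g. loading a tokenizer). If the backend
        # never called setup_on_node, fall back to full initialization.
        if not type(self)._node_initialized:
            self.setup_on_node()
        self.tokenizer = "tokenizer"  # stand-in for the real tokenizer

    def _download_and_initialize(self) -> None:
        self.model = "model"  # stand-in for the heavy initialization

stage = NodeInitStage()
stage.setup()  # setup_on_node was never called, so the fallback triggers
```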