stages.deduplication.semantic.identify_duplicates

Module Contents

Classes

IdentifyDuplicatesStage

Stage for batch removal of similar documents with optional ID-based partitioning. It is a CPU-only stage.

API

class stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.FileGroupTask]

Stage for batch removal of similar documents with optional ID-based partitioning. It is a CPU-only stage.

eps: float

None

output_path: str

None

process(
task: nemo_curator.tasks.FileGroupTask,
) → nemo_curator.tasks.FileGroupTask

Process a task and return the result.

Args: task (X): Input task to process

Returns (Y | list[Y]):
- Single task: For 1-to-1 transformations
- List of tasks: For 1-to-many transformations (e.g., readers)
- None: If the task should be filtered out

process_batch(
tasks: list[nemo_curator.tasks.FileGroupTask],
) → list[nemo_curator.tasks.FileGroupTask]

Process a batch of tasks and combine results into fewer output files.

This allows processing multiple clusters together and optionally partitioning by ID ranges for more efficient reading.

Args: tasks: List of FileGroupTask containing pairwise similarity results

Returns: List of FileGroupTask with combined filtered results
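The batch behaviour described above can be illustrated with a minimal, library-free sketch. It is not the stage's implementation: the function name `identify_duplicates`, the `(doc_id, cosine_sim_score)` row layout, the `num_partitions` argument, and the rule that a row is flagged when its score is at least `1 - eps` are all assumptions made for illustration.

```python
from collections import defaultdict


def identify_duplicates(batches, eps, num_partitions=None):
    """Combine per-cluster pairwise rows and group flagged ids by ID range.

    batches: list of lists of (doc_id, cosine_sim_score) tuples, one inner
             list per cluster (hypothetical layout, not the real schema).
    eps:     similarity tolerance; a row is flagged when score >= 1 - eps
             (assumed rule).
    Returns a dict mapping partition index -> sorted list of flagged ids.
    """
    threshold = 1.0 - eps

    # Merge all clusters and keep each flagged id once, in sorted order.
    flagged = sorted(
        {doc_id for rows in batches for doc_id, score in rows if score >= threshold}
    )

    # Without partitioning, everything goes into a single output group.
    if not flagged or not num_partitions:
        return {0: flagged}

    # Split the observed id range into equal-width buckets (ceil division).
    lo, hi = flagged[0], flagged[-1]
    width = max(1, -(-(hi - lo + 1) // num_partitions))

    parts = defaultdict(list)
    for doc_id in flagged:
        parts[(doc_id - lo) // width].append(doc_id)
    return dict(parts)
```

Grouping by contiguous ID range means a later reader that only needs ids in a given interval can open one group instead of scanning every output, which is one plausible reading of "more efficient reading" above.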

read_kwargs: dict[str, Any] | None

None

verbose: bool

False

write_kwargs: dict[str, Any] | None

None