`stages.deduplication.semantic.identify_duplicates`#

Module Contents#

class stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage#

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.FileGroupTask]

Stage for batch removal of similar documents with optional ID-based partitioning. It is a CPU-only stage.

process( task: nemo_curator.tasks.FileGroupTask, ) → nemo_curator.tasks.FileGroupTask#

Process a task and return the result.

Args: task (X): Input task to process

Returns (Y | list[Y]):
- Single task: For 1-to-1 transformations
- List of tasks: For 1-to-many transformations (e.g., readers)
- None: If the task should be filtered out
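The three return shapes described for `process` can be illustrated with a small, library-independent sketch. `ToyTask` and `normalize_result` are hypothetical names introduced here for illustration only; they are not part of `nemo_curator`, and this is not how the library itself dispatches results.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class ToyTask:
    """Hypothetical stand-in for a task; NOT nemo_curator's FileGroupTask."""
    task_id: str
    data: List[str] = field(default_factory=list)


def normalize_result(
    result: Optional[Union[ToyTask, List[ToyTask]]],
) -> List[ToyTask]:
    """Flatten the three documented return shapes into a plain list."""
    if result is None:
        # The task was filtered out: nothing flows downstream.
        return []
    if isinstance(result, list):
        # 1-to-many transformation: pass every produced task along.
        return result
    # 1-to-1 transformation: wrap the single task.
    return [result]
```

A caller that treats every stage uniformly could pass each `process` result through a helper like this before handing tasks to the next stage.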

process_batch( tasks: list[nemo_curator.tasks.FileGroupTask], ) → list[nemo_curator.tasks.FileGroupTask]#

Process a batch of tasks and combine results into fewer output files.

This allows processing multiple clusters together and optionally partitioning by ID ranges for more efficient reading.

Args: tasks: List of FileGroupTask containing pairwise similarity results

Returns: List of FileGroupTask with combined filtered results
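As a rough illustration of the idea behind `process_batch`, the sketch below merges ID lists from several clusters and, optionally, splits the merged IDs into contiguous ID ranges. It assumes integer document IDs and plain Python lists; the function name `combine_and_partition` and its `num_partitions` parameter are hypothetical and do not reflect the stage's actual parameters or file handling.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def combine_and_partition(
    batches: List[List[int]],
    num_partitions: Optional[int] = None,
) -> Dict[int, List[int]]:
    """Merge per-cluster ID lists; optionally split them into ID ranges."""
    # Combine all clusters, dropping repeated IDs and sorting the result.
    all_ids = sorted({doc_id for batch in batches for doc_id in batch})
    if not all_ids or num_partitions is None:
        # No partitioning requested: everything goes into one output group.
        return {0: all_ids}

    lo, hi = all_ids[0], all_ids[-1]
    # Width of each ID range, using ceiling division over the ID span.
    width = max(1, -(-(hi - lo + 1) // num_partitions))

    partitions: Dict[int, List[int]] = defaultdict(list)
    for doc_id in all_ids:
        partitions[(doc_id - lo) // width].append(doc_id)
    return dict(partitions)
```

Grouping by contiguous ID range means a later reader that only needs a given range of IDs can open a single partition instead of scanning every output file.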