nemo_curator.stages.deduplication.semantic.identify_duplicates
Module Contents
Classes
API
Dataclass
Bases: ProcessingStage[FileGroupTask, FileGroupTask]
Stage for batch removal of similar documents with optional ID-based partitioning. It is a CPU-only stage.
_num_row_groups_hint
eps
output_path
read_kwargs
verbose
write_kwargs
Initialize parent class after dataclass initialization.
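The post-initialization step described above follows a common dataclass pattern, sketched below. `BaseStage` and `ExampleDedupStage` are hypothetical stand-ins rather than the library's classes. The field names are taken from the attribute list above, but their types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


class BaseStage:
    """Hypothetical stand-in for a parent stage that is not a dataclass."""

    def __init__(self) -> None:
        self.initialized = True


@dataclass
class ExampleDedupStage(BaseStage):
    """Hypothetical stage; types and defaults here are assumptions."""

    output_path: str
    eps: float
    verbose: bool = False
    read_kwargs: dict[str, Any] = field(default_factory=dict)
    write_kwargs: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The __init__ generated by @dataclass does not call the parent's
        # __init__, so it is invoked explicitly once the fields are set.
        super().__init__()


stage = ExampleDedupStage(output_path="/tmp/duplicates", eps=0.01)
```

Without the explicit `super().__init__()` call, any state the parent sets up in its own `__init__` would be missing on the instance.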
Process a batch of tasks and combine results into fewer output files.
Batching lets multiple clusters be processed together, and the output can optionally be partitioned by ID range for more efficient reading.
Parameters:
tasks
List of FileGroupTask containing pairwise similarity results
Returns: list[FileGroupTask]
List of FileGroupTask with combined filtered results
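The following standard-library sketch illustrates the general idea of combining per-cluster pairwise similarity results into fewer, ID-ordered outputs. It is not the stage's implementation. The row layout, the helper names `combine_duplicates` and `partition_by_id`, and the rule that a document counts as a duplicate when its similarity is at least `1 - eps` are all assumptions made for illustration.

```python
from typing import Iterable

# Hypothetical row layout: (document id, cosine similarity to a retained neighbor).
Row = tuple[str, float]


def combine_duplicates(batches: Iterable[list[Row]], eps: float) -> list[str]:
    """Merge rows from several clusters and keep IDs at or above the threshold."""
    threshold = 1.0 - eps  # assumed convention, not confirmed by the source
    ids = {doc_id for rows in batches for doc_id, score in rows if score >= threshold}
    return sorted(ids)


def partition_by_id(ids: list[str], num_partitions: int) -> list[list[str]]:
    """Split sorted IDs into contiguous ranges, one per output partition."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if not ids:
        return []
    size = -(-len(ids) // num_partitions)  # ceiling division
    return [ids[i:i + size] for i in range(0, len(ids), size)]


cluster_a = [("doc-03", 0.995), ("doc-01", 0.40)]
cluster_b = [("doc-07", 0.999), ("doc-02", 0.97), ("doc-05", 0.9901)]

duplicates = combine_duplicates([cluster_a, cluster_b], eps=0.01)
partitions = partition_by_id(duplicates, num_partitions=2)
print(duplicates)  # ['doc-03', 'doc-05', 'doc-07']
print(partitions)  # [['doc-03', 'doc-05'], ['doc-07']]
```

Sorting before splitting keeps each partition a contiguous ID range, so a reader looking for a given ID only needs to open the partition whose range covers it.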