nemo_curator.stages.deduplication.semantic.identify_duplicates

Module Contents

Classes

Name | Description
IdentifyDuplicatesStage | Stage for batch removal of similar documents with optional ID-based partitioning.

API

class nemo_curator.stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage(
output_path: str,
eps: float,
_num_row_groups_hint: int | None = None,
verbose: bool = False,
read_kwargs: dict[str, typing.Any] | None = None,
write_kwargs: dict[str, typing.Any] | None = None
)
Dataclass

Bases: ProcessingStage[FileGroupTask, FileGroupTask]

Stage for batch removal of similar documents with optional ID-based partitioning. It is a CPU-only stage.

_num_row_groups_hint: int | None = None
eps: float
output_path: str
read_kwargs: dict[str, Any] | None = None
verbose: bool = False
write_kwargs: dict[str, Any] | None = None

nemo_curator.stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage.__post_init__()

Initialize parent class after dataclass initialization.
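
Because the stage is a dataclass, it is constructed with keyword arguments that match the signature above. The following is a minimal sketch, not an official usage pattern: the output path and the eps value are illustrative assumptions, and the import is guarded so the snippet still runs where NeMo Curator is not installed.

```python
import importlib

# Keyword arguments follow the documented signature; the values are
# illustrative assumptions, not recommended settings.
stage_kwargs = {
    "output_path": "/tmp/semantic_dedup/duplicates",  # hypothetical location
    "eps": 0.01,                                      # illustrative value
    "verbose": True,
}

try:
    module = importlib.import_module(
        "nemo_curator.stages.deduplication.semantic.identify_duplicates"
    )
except ImportError:
    module = None  # NeMo Curator is not available in this environment.

# Construct the stage only when the library could be imported.
stage = module.IdentifyDuplicatesStage(**stage_kwargs) if module else None
```

The remaining fields (_num_row_groups_hint, read_kwargs, write_kwargs) all default to None and are left unset here.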

nemo_curator.stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage.process(
task: nemo_curator.tasks.FileGroupTask
) -> nemo_curator.tasks.FileGroupTask

nemo_curator.stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage.process_batch(
tasks: list[nemo_curator.tasks.FileGroupTask]
) -> list[nemo_curator.tasks.FileGroupTask]

Process a batch of tasks and combine results into fewer output files.

Batching allows multiple clusters to be processed together, and the combined output can optionally be partitioned by ID ranges so that it can be read more efficiently.

Parameters:

tasks
list[FileGroupTask]

List of FileGroupTask containing pairwise similarity results

Returns: list[FileGroupTask]

List of FileGroupTask with combined filtered results
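
The batch behavior described above can be illustrated with a small pure-Python sketch. It does not use NeMo Curator and is not the library's implementation. It assumes that each pairwise result row carries an "id" and a "cosine_sim_score" field, that a document counts as a duplicate when its score is at least 1 - eps, and that the row-group hint simply splits the sorted IDs into contiguous ranges. All function names and field names here are hypothetical.

```python
import math


def identify_duplicate_ids(batches, eps):
    """Combine rows from several clusters and keep IDs above the threshold.

    Assumption: a row is a duplicate when cosine_sim_score >= 1 - eps.
    """
    threshold = 1.0 - eps
    ids = [
        row["id"]
        for batch in batches
        for row in batch
        if row["cosine_sim_score"] >= threshold
    ]
    # Sorting gives each output chunk a contiguous ID range.
    return sorted(ids)


def split_into_id_ranges(sorted_ids, num_groups):
    """Split sorted IDs into at most num_groups contiguous chunks."""
    if not sorted_ids:
        return []
    size = max(1, math.ceil(len(sorted_ids) / num_groups))
    return [sorted_ids[i:i + size] for i in range(0, len(sorted_ids), size)]


# Two hypothetical clusters of pairwise similarity results.
cluster_a = [
    {"id": 7, "cosine_sim_score": 0.995},
    {"id": 2, "cosine_sim_score": 0.80},
]
cluster_b = [
    {"id": 3, "cosine_sim_score": 0.999},
    {"id": 9, "cosine_sim_score": 0.95},
    {"id": 1, "cosine_sim_score": 0.992},
]

duplicate_ids = identify_duplicate_ids([cluster_a, cluster_b], eps=0.01)
id_ranges = split_into_id_ranges(duplicate_ids, num_groups=2)
print(duplicate_ids)  # [1, 3, 7]
print(id_ranges)      # [[1, 3], [7]]
```

Under these assumptions, a reader that only needs a given ID range can skip chunks whose minimum and maximum IDs fall outside it, which is one way partitioning by ID can make reads cheaper.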