stages.deduplication.semantic.identify_duplicates
Module Contents
Classes
IdentifyDuplicatesStage | Stage for batch removal of similar documents with optional ID-based partitioning. It is a CPU-only stage.
API
- class stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.FileGroupTask]
Stage for batch removal of similar documents with optional ID-based partitioning. It is a CPU-only stage.
- eps: float = None
- output_path: str = None
- process(task: nemo_curator.tasks.FileGroupTask)
Process a task and return the result.
Args: task (X): Input task to process
Returns (Y | list[Y]):
- Single task: For 1-to-1 transformations
- List of tasks: For 1-to-many transformations (e.g., readers)
- None: If the task should be filtered out
- process_batch(tasks: list[nemo_curator.tasks.FileGroupTask])
Process a batch of tasks and combine results into fewer output files.
This allows processing multiple clusters together and optionally partitioning by ID ranges for more efficient reading.
Args: tasks: List of FileGroupTask containing pairwise similarity results
Returns: List of FileGroupTask with combined filtered results
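The combining and ID-range partitioning described for process_batch can be sketched in plain Python. This is only an illustration under stated assumptions, not the stage's implementation: the helper combine_and_partition, the "id" field, and the num_partitions parameter are hypothetical names, and the real stage works on files referenced by FileGroupTask objects rather than in-memory lists.

```python
def combine_and_partition(cluster_rows, num_partitions):
    """Merge rows from several clusters and split them into contiguous ID ranges.

    cluster_rows: list of lists of dicts, one inner list per cluster (hypothetical layout).
    num_partitions: upper bound on the number of output groups (hypothetical parameter).
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Flatten the batch so all clusters are handled together, ordered by ID.
    rows = sorted(
        (row for cluster in cluster_rows for row in cluster),
        key=lambda row: row["id"],
    )
    if not rows:
        return []
    # Ceiling division: every partition except possibly the last holds `size` rows.
    size = -(-len(rows) // num_partitions)
    return [rows[start:start + size] for start in range(0, len(rows), size)]


# Three small per-cluster results combined into two ID-ordered groups.
batch = [
    [{"id": 5}, {"id": 1}],
    [{"id": 3}],
    [{"id": 2}, {"id": 4}],
]
partitions = combine_and_partition(batch, num_partitions=2)
id_ranges = [(part[0]["id"], part[-1]["id"]) for part in partitions]
print(id_ranges)
```

Because each group covers a contiguous, non-overlapping ID range, a later reader looking for a given ID only has to open the one group whose range contains it, which is one plausible reading of "more efficient reading" above.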
- read_kwargs: dict[str, Any] | None = None
- verbose: bool = False
- write_kwargs: dict[str, Any] | None = None
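The page does not say how eps is applied, so the following is a hedged sketch of one common convention in semantic deduplication: a document is treated as a duplicate when its cosine similarity to a retained neighbour is at least 1 - eps. The function ids_to_remove and the column names "id" and "cosine_sim_score" are assumptions made for illustration and are not confirmed by this reference.

```python
def ids_to_remove(pairwise_rows, eps):
    """Return sorted IDs whose similarity score meets the assumed 1 - eps threshold.

    pairwise_rows: iterable of dicts with "id" and "cosine_sim_score" keys
    (assumed column names for the pairwise similarity results).
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps is expected to lie in [0, 1] in this sketch")
    threshold = 1.0 - eps
    return sorted(
        row["id"] for row in pairwise_rows if row["cosine_sim_score"] >= threshold
    )


rows = [
    {"id": 10, "cosine_sim_score": 0.99},
    {"id": 11, "cosine_sim_score": 0.50},
    {"id": 12, "cosine_sim_score": 0.95},
]
print(ids_to_remove(rows, eps=0.1))   # IDs 10 and 12 exceed the 0.9 threshold
print(ids_to_remove(rows, eps=0.02))  # only ID 10 exceeds the 0.98 threshold
```

Under this convention a smaller eps is stricter: fewer documents count as duplicates, so fewer are removed.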