nemo_curator.stages.deduplication.exact.workflow
Module Contents
Classes
Data
API
Bases: WorkflowBase
A pipeline that performs exact deduplication of a dataset. It consists of the following stages:
- FilePartitioningStage: Groups input files into smaller groups that can be processed in parallel.
- ExactDuplicateIdentification: Finds exact duplicates in a given column by hashing the column.
- Removal (optional): Currently not implemented.
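The first two stages can be illustrated with a minimal, library-independent sketch. It does not use the NeMo Curator API: the helper names (`partition_files`, `find_exact_duplicates`), the record layout, and the choice of SHA-256 as the hash are all assumptions made for illustration only.

```python
import hashlib


def partition_files(files, group_size):
    """Split a list of file paths into consecutive groups of at most group_size.

    Hypothetical helper: mimics the idea of grouping input files so that
    each group can be processed independently.
    """
    return [files[i:i + group_size] for i in range(0, len(files), group_size)]


def find_exact_duplicates(records, text_field="text", id_field="id"):
    """Return the ids of records whose text repeats an earlier record's text.

    Hypothetical helper: hashes the chosen column and flags every record
    whose digest has already been seen. SHA-256 is an assumption here.
    """
    seen = {}  # digest -> id of the first record with that text
    duplicate_ids = []
    for record in records:
        digest = hashlib.sha256(record[text_field].encode("utf-8")).hexdigest()
        if digest in seen:
            duplicate_ids.append(record[id_field])
        else:
            seen[digest] = record[id_field]
    return duplicate_ids


# Toy inputs, invented for this example.
records = [
    {"id": 0, "text": "hello world"},
    {"id": 1, "text": "goodbye"},
    {"id": 2, "text": "hello world"},
]
groups = partition_files(["a.jsonl", "b.jsonl", "c.jsonl"], group_size=2)
duplicates = find_exact_duplicates(records)
```

In this sketch the first occurrence of each text is kept and later occurrences are reported as duplicates; identifying them is separate from removing them, which matches the note above that removal is not yet implemented.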
executor_config
Run the deduplication pipeline.
Parameters:
executor: RayActorPoolExecutor | None. Executor to use for the pipeline. If not provided, the default RayActorPoolExecutor is used.
initial_tasks
Returns: WorkflowRunResult
A WorkflowRunResult object containing the results and timing information.