nemo_curator.stages.deduplication.semantic.workflow
End-to-End Semantic Deduplication Pipeline for Ray Curator.
This module contains the complete semantic deduplication workflow:
- K-means clustering on embedding data (always uses RayActorPoolExecutor)
- Pairwise similarity computation within clusters + duplicate identification (configurable executor)
Module Contents
Classes
Data
API
Bases: WorkflowBase
End-to-End Semantic Deduplication Workflow. It consists of the following stages:
- KMeansStage: Takes the input path (embeddings) and clusters the embeddings into n_clusters. Writes data partitioned by centroid to cache_path.
- PairwiseStage: Takes the output of KMeansStage and computes pairwise similarity between all embeddings in each cluster. The result is written to cache_path.
- IdentifyDuplicatesStage (optional): Identifies duplicates based on the pairwise similarity scores. Runs only if eps is provided. The result is written to output_path.
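The three stages can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the function names are hypothetical, the centroids are given rather than learned (only the assignment step of k-means is shown), and it assumes that an item counts as a duplicate when its cosine similarity to an earlier item in the same cluster exceeds 1 - eps.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_clusters(embeddings, centroids):
    """Group ids by most similar centroid (assignment step only, centroids are fixed)."""
    clusters = defaultdict(list)
    for doc_id, vec in embeddings.items():
        best = max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))
        clusters[best].append(doc_id)
    return dict(clusters)


def max_similarity_to_earlier(cluster_ids, embeddings):
    """For each id, the highest similarity to any earlier id in the same cluster."""
    scores = {}
    for i, doc_id in enumerate(cluster_ids):
        sims = [cosine(embeddings[doc_id], embeddings[other]) for other in cluster_ids[:i]]
        scores[doc_id] = max(sims, default=0.0)
    return scores


def identify_duplicates(scores, eps):
    """Assumed rule: flag ids whose best in-cluster match exceeds 1 - eps."""
    return sorted(doc_id for doc_id, score in scores.items() if score > 1.0 - eps)


# Toy embeddings: "b" is nearly identical to "a"; "c" and "d" are distinct.
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.999, 0.01],
    "c": [0.0, 1.0],
    "d": [0.6, 0.8],
}
centroids = [[1.0, 0.0], [0.0, 1.0]]

clusters = assign_clusters(embeddings, centroids)
duplicates = []
for ids in clusters.values():
    scores = max_similarity_to_earlier(ids, embeddings)
    duplicates.extend(identify_duplicates(scores, eps=0.01))
```

Restricting the pairwise comparison to items that share a cluster is what keeps the cost manageable: the quadratic work is done per cluster rather than over the full dataset.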
Log workflow configuration.
Run K-means clustering stage (always uses RayActorPoolExecutor).
Run pairwise similarity + duplicate identification stage.
Setup output directories with fsspec compliance.
Validate the configuration.
Run the complete semantic deduplication pipeline.
Parameters:
Executor for kmeans stage. Defaults to RayActorPoolExecutor().
Executor for pairwise stage. Defaults to XennaExecutor().
Returns: WorkflowRunResult
WorkflowRunResult object containing the results and timing information
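As a rough illustration of a run that returns both results and timing information, the sketch below times two phases separately and collects everything in one object. `RunResultSketch` and `run_sketch` are hypothetical stand-ins, not the actual `WorkflowRunResult` class or the workflow's run method, and the two callables only play the role of the k-means and pairwise phases.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RunResultSketch:
    """Hypothetical stand-in for a run result: phase outputs plus timings in seconds."""
    results: dict = field(default_factory=dict)
    timings: dict = field(default_factory=dict)


def run_sketch(kmeans_fn, pairwise_fn, data):
    """Run two phases in order, timing each one separately."""
    result = RunResultSketch()
    total_start = time.perf_counter()

    start = time.perf_counter()
    clustered = kmeans_fn(data)  # phase 1: clustering
    result.timings["kmeans"] = time.perf_counter() - start

    start = time.perf_counter()
    result.results["pairwise"] = pairwise_fn(clustered)  # phase 2: pairwise + duplicates
    result.timings["pairwise"] = time.perf_counter() - start

    result.timings["total"] = time.perf_counter() - total_start
    return result


# Toy phases: "clustering" sorts the input, "pairwise" counts the items.
outcome = run_sketch(lambda d: sorted(d), lambda c: len(c), [3, 1, 2])
```

Keeping the phases as separate callables mirrors the idea that each stage can be driven by its own executor while the caller still receives a single result object.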