nemo_curator.stages.text.deduplication.semantic
Monolithic Text Semantic Deduplication Workflow.
This module contains a complete end-to-end workflow for text semantic deduplication:
- Embedding generation from text data
- Semantic deduplication using clustering and pairwise similarity
- Optional duplicate removal based on identified duplicates
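The three phases listed above can be illustrated with a small, self-contained toy. This is only a sketch of the general idea, not the module's implementation: the embeddings and cluster assignments are hand-written inputs rather than model or clustering output, and `cosine`, `toy_semantic_dedup`, and the `threshold` parameter are hypothetical names that do not come from NeMo Curator.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def toy_semantic_dedup(embeddings, cluster_ids, threshold=0.95):
    """Return the ids flagged as duplicates (hypothetical helper).

    Comparisons happen only inside a cluster, so the pairwise step never
    compares every document against every other document.
    """
    clusters = defaultdict(list)
    for doc_id, cid in cluster_ids.items():
        clusters[cid].append(doc_id)

    duplicates = set()
    for members in clusters.values():
        kept = []
        for doc_id in sorted(members):
            # Flag a document if it is too similar to one already kept.
            if any(cosine(embeddings[doc_id], embeddings[k]) >= threshold for k in kept):
                duplicates.add(doc_id)
            else:
                kept.append(doc_id)
    return duplicates


# Phase 1 (stand-in): embeddings are hard-coded instead of model output.
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.99, 0.05],
    "c": [0.0, 1.0],
    "d": [0.6, 0.8],
}
# Phase 2 (stand-in): cluster assignments are given, then pairs are compared.
cluster_ids = {"a": 0, "b": 0, "c": 1, "d": 1}
duplicate_ids = toy_semantic_dedup(embeddings, cluster_ids, threshold=0.95)

# Phase 3 (optional): drop the flagged ids.
deduplicated = [doc_id for doc_id in sorted(embeddings) if doc_id not in duplicate_ids]
```

Here `b` is flagged because it is nearly parallel to `a`, while `c` and `d` share a cluster but fall below the threshold, so both are kept.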
Module Contents
Classes
API
Monolithic workflow for end-to-end text semantic deduplication.
This workflow combines:
- Text embedding generation (configurable executor)
- Semantic deduplication (configurable executor for pairwise stage)
- Duplicate removal (configurable executor)
Supports flexible executor configuration: a single executor can be used for all stages, or different executors can be assigned to different phases.
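One way to picture "a single executor or one per phase" is a default value with optional per-phase overrides. The sketch below is an assumption made for illustration only: `ToyExecutorConfig`, its field names, and the string placeholders standing in for executor objects are all hypothetical and are not the workflow's actual parameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyExecutorConfig:
    """Hypothetical config: one default executor plus per-phase overrides."""

    default: str
    embedding: Optional[str] = None
    pairwise: Optional[str] = None
    removal: Optional[str] = None

    def resolve(self):
        # Any phase without an explicit override falls back to the default.
        return {
            "embedding": self.embedding or self.default,
            "pairwise": self.pairwise or self.default,
            "removal": self.removal or self.default,
        }


# Same executor everywhere.
single = ToyExecutorConfig(default="executor_a").resolve()

# A different executor for the pairwise phase only.
mixed = ToyExecutorConfig(default="executor_a", pairwise="executor_b").resolve()
```

Resolving the overrides once, up front, keeps each phase's run method free of fallback logic.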
Initialize the text semantic deduplication workflow.
Initialize parent class after dataclass initialization.
Log workflow configuration.
Run duplicate removal stage.
Run embedding generation stage.
Run semantic deduplication stage.
Setup output directories.
Validate workflow configuration.
Run the complete text semantic deduplication workflow.
Returns: WorkflowRunResult
An object containing the results and timing information from all stages.
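As a rough picture of a result object that carries both per-stage outputs and timing information, the following sketch runs placeholder stages in order and records how long each one took. `ToyRunResult`, `run_stages`, and the stage names are hypothetical; the real `WorkflowRunResult` fields are not described in this page and may differ.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToyRunResult:
    """Hypothetical stand-in for a workflow result object."""

    outputs: dict = field(default_factory=dict)
    timings: dict = field(default_factory=dict)


def run_stages(stages):
    """Run (name, callable) pairs in order, recording output and elapsed time."""
    result = ToyRunResult()
    for name, fn in stages:
        start = time.perf_counter()
        result.outputs[name] = fn()
        result.timings[name] = time.perf_counter() - start
    # The sum is evaluated before the "total" key is added.
    result.timings["total"] = sum(result.timings.values())
    return result


# Placeholder stages; each lambda stands in for real work.
result = run_stages([
    ("embedding", lambda: "embeddings written"),
    ("semantic_dedup", lambda: "duplicate ids written"),
    ("removal", lambda: "deduplicated data written"),
])
```

A caller can then inspect `result.timings` to see which phase dominates the run.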