nemo_curator.stages.deduplication.semantic.pairwise_io
Module Contents
Classes
API
Bases: ProcessingStage[_EmptyTask, FileGroupTask]
Stage that partitions input files into PairwiseFileGroupTasks for deduplication.
This stage takes an EmptyTask as input and outputs partition-aware file groups. It reads Parquet files partitioned by centroid (the k-means output) and creates one PairwiseFileGroupTask per centroid partition.
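The grouping described above can be illustrated with a minimal, library-independent sketch. It is not the stage's actual implementation: it assumes the k-means output uses hive-style partition directories named `centroid=<id>`, and `FileGroup` and `group_files_by_centroid` are hypothetical names introduced only for illustration.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field

# Assumption: partition directories look like ".../centroid=12/part-0.parquet".
_CENTROID_RE = re.compile(r"(?:^|/)centroid=(\d+)(?:/|$)")


@dataclass
class FileGroup:
    """Hypothetical stand-in for a per-centroid file group task."""

    centroid_id: int
    files: list[str] = field(default_factory=list)


def group_files_by_centroid(paths: list[str]) -> list[FileGroup]:
    """Bucket file paths by the centroid id found in their partition directory."""
    groups: dict[int, list[str]] = defaultdict(list)
    for path in paths:
        match = _CENTROID_RE.search(path)
        if match is None:
            continue  # Skip files that are not inside a centroid partition.
        groups[int(match.group(1))].append(path)
    # One group per centroid, ordered by centroid id, with files sorted by path.
    return [FileGroup(cid, sorted(files)) for cid, files in sorted(groups.items())]
```

Producing one group per centroid keeps all embeddings of a cluster together, which is what allows a later pairwise comparison to run independently on each partition.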
fs
name
path_normalizer
resources
Process the EmptyTask to create PairwiseFileGroupTasks.
Parameters:
task
EmptyTask input (ignored; used only to trigger the stage)
Returns: list[FileGroupTask]
List of PairwiseFileGroupTask objects, one per centroid partition, each containing that partition's file group
Ray stage specification for this stage.