nemo_curator.stages.deduplication.semantic.pairwise_io


Module Contents

Classes

Name | Description
ClusterWiseFilePartitioningStage | Stage that partitions input files into PairwiseFileGroupTasks for deduplication.

API

class nemo_curator.stages.deduplication.semantic.pairwise_io.ClusterWiseFilePartitioningStage(
input_path: str,
storage_options: dict[str, typing.Any] | None = None
)

Bases: ProcessingStage[_EmptyTask, FileGroupTask]

Stage that partitions input files into PairwiseFileGroupTasks for deduplication.

This stage takes an EmptyTask as input and outputs partition-aware file groups. It reads Parquet files partitioned by centroid (the output of the k-means stage) and creates one PairwiseFileGroupTask per centroid partition.
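The grouping idea described above can be illustrated with a minimal, standard-library-only sketch. It is not the stage's implementation: the helper name `group_files_by_centroid`, the example paths, and the hive-style `centroid=<id>` directory layout are all assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_files_by_centroid(paths: list[str]) -> dict[int, list[str]]:
    """Group Parquet file paths by the centroid partition they sit in.

    Hypothetical helper: assumes directories named ``centroid=<id>``.
    """
    groups: dict[int, list[str]] = defaultdict(list)
    for path in paths:
        # Skip marker files and anything else that is not a Parquet file.
        if not path.endswith(".parquet"):
            continue
        for part in PurePosixPath(path).parts:
            if part.startswith("centroid="):
                centroid_id = int(part.split("=", 1)[1])
                groups[centroid_id].append(path)
                break
    # One sorted file list per centroid, ordered by centroid id.
    return {cid: sorted(files) for cid, files in sorted(groups.items())}


# Illustrative paths only; the real layout depends on the k-means output.
example_paths = [
    "kmeans_out/centroid=0/part-0.parquet",
    "kmeans_out/centroid=0/part-1.parquet",
    "kmeans_out/centroid=1/part-0.parquet",
    "kmeans_out/_SUCCESS",
]
file_groups = group_files_by_centroid(example_paths)
```

Each entry of `file_groups` plays the role that one PairwiseFileGroupTask plays in the stage: a set of files belonging to a single centroid, so that pairwise comparison can later run within that cluster only.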

fs: AbstractFileSystem | None = None
name = 'pairwise_file_partitioning'
path_normalizer = lambda x: x
resources = Resources(cpus=0.5)

nemo_curator.stages.deduplication.semantic.pairwise_io.ClusterWiseFilePartitioningStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.deduplication.semantic.pairwise_io.ClusterWiseFilePartitioningStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.deduplication.semantic.pairwise_io.ClusterWiseFilePartitioningStage.process(
_: nemo_curator.tasks._EmptyTask
) -> list[nemo_curator.tasks.FileGroupTask]

Process the EmptyTask to create PairwiseFileGroupTasks.

Parameters:

_ (described as "task" in the docstring)

EmptyTask input. It is ignored and serves only to trigger the stage.

Returns: list[FileGroupTask]

List of PairwiseFileGroupTask, each containing partitioned file groups per centroid

nemo_curator.stages.deduplication.semantic.pairwise_io.ClusterWiseFilePartitioningStage.ray_stage_spec() -> dict[str, typing.Any]

Ray stage specification for this stage.

nemo_curator.stages.deduplication.semantic.pairwise_io.ClusterWiseFilePartitioningStage.setup(
_: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.deduplication.semantic.pairwise_io.ClusterWiseFilePartitioningStage.xenna_stage_spec() -> dict[str, typing.Any]