stages.deduplication.fuzzy.buckets_to_edges

Module Contents

Classes

BucketsToEdgesStage

Stage that takes a file of LSH bucket ids and the document ids belonging to each bucket, and outputs a file of edges between documents that share a bucket id.

API

class stages.deduplication.fuzzy.buckets_to_edges.BucketsToEdgesStage(
output_path: str,
doc_id_field: str = CURATOR_DEDUP_ID_STR,
read_kwargs: dict[str, Any] | None = None,
write_kwargs: dict[str, Any] | None = None,
)

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.FileGroupTask]

Stage that takes a file of LSH bucket ids and the document ids belonging to each bucket, and outputs a file of edges between documents that share a bucket id.

Args:
doc_id_field: The name of the field containing the document ids for each bucket.
output_path: The directory to write the output file to.
read_kwargs: Keyword arguments to pass when reading the input files. Only the storage_options key is supported for now.
write_kwargs: Keyword arguments to pass when writing the output files. Only the storage_options key is supported for now.
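As an illustration of the transformation described above, here is a minimal, library-independent sketch. It assumes each bucket is a list of document ids and links consecutive ids within a bucket, which gives a linear number of edges while keeping every document in a bucket connected. Whether the stage links consecutive ids or emits all pairs is not stated on this page, and `buckets_to_edges` is a hypothetical helper, not part of the library.

```python
from collections.abc import Iterable, Sequence


def buckets_to_edges(buckets: Iterable[Sequence[int]]) -> list[tuple[int, int]]:
    """Hypothetical helper: link consecutive document ids within each bucket."""
    edges: list[tuple[int, int]] = []
    for doc_ids in buckets:
        # Pair each id with its successor; a bucket with fewer than
        # two documents contributes no edges.
        edges.extend(zip(doc_ids, doc_ids[1:]))
    return edges


# Documents 1, 2, 3 share one bucket; documents 3 and 7 share another;
# document 9 is alone in its bucket.
edges = buckets_to_edges([[1, 2, 3], [3, 7], [9]])
print(edges)  # [(1, 2), (2, 3), (3, 7)]
```

Because document 3 appears in two buckets, a later connected-components pass over these edges would place documents 1, 2, 3, and 7 in the same group.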

Initialization

name

‘BucketsToEdgesStage’

process(
task: nemo_curator.tasks.FileGroupTask,
) -> nemo_curator.tasks.FileGroupTask

Process a task and return the result.

Args:
task (X): Input task to process.

Returns (Y | list[Y]):
- Single task: For 1-to-1 transformations
- List of tasks: For 1-to-many transformations (e.g., readers)
- None: If the task should be filtered out
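The three return shapes listed above can be handled uniformly by a caller. The sketch below is only an illustration of that contract: `normalize_result` is a hypothetical helper, and how the framework actually consumes the return value of `process` is an assumption, not something this page documents.

```python
from __future__ import annotations

from typing import TypeVar

T = TypeVar("T")


def normalize_result(result: T | list[T] | None) -> list[T]:
    """Hypothetical helper: flatten the three return shapes into a list."""
    if result is None:
        # The task was filtered out, so nothing is passed downstream.
        return []
    if isinstance(result, list):
        # 1-to-many transformation: pass every produced task along.
        return result
    # 1-to-1 transformation: wrap the single task.
    return [result]
```

For this stage the signature maps one FileGroupTask to one FileGroupTask, so only the 1-to-1 branch would apply.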

resources

‘Resources(…)’