stages.deduplication.fuzzy.buckets_to_edges#
Module Contents#
Classes#
Stage that takes in a file consisting of LSH bucket IDs and the document IDs belonging to each bucket, and outputs a file consisting of edges between documents that share a bucket ID.
API#
- class stages.deduplication.fuzzy.buckets_to_edges.BucketsToEdgesStage(
- output_path: str,
- doc_id_field: str = CURATOR_DEDUP_ID_STR,
- read_kwargs: dict[str, Any] | None = None,
- write_kwargs: dict[str, Any] | None = None,
- )
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.FileGroupTask]

Stage that takes in a file consisting of LSH bucket IDs and the document IDs belonging to each bucket, and outputs a file consisting of edges between documents that share a bucket ID.
Args:
- doc_id_field: The field name containing the document IDs for each bucket.
- output_path: The directory to write the output file to.
- read_kwargs: Keyword arguments to pass for reading the input files. Only the storage_options key is supported for now.
- write_kwargs: Keyword arguments to pass for writing the output files. Only the storage_options key is supported for now.
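To make the bucket-to-edge transformation concrete, here is a minimal pure-Python sketch. It is not the stage's implementation: the function name buckets_to_edges and the in-memory input shape are hypothetical, and linking only consecutive document IDs within a bucket (rather than every pair) is an assumption chosen because it keeps the edge count linear while still connecting all documents in a bucket.

```python
from collections.abc import Iterable


def buckets_to_edges(
    buckets: Iterable[tuple[str, list[int]]],
) -> list[tuple[int, int]]:
    """Turn (bucket_id, doc_ids) pairs into a sorted list of unique edges.

    Hypothetical helper for illustration only. Documents in the same
    bucket are chained: sorted IDs [a, b, c] yield edges (a, b), (b, c).
    """
    edges: set[tuple[int, int]] = set()
    for _bucket_id, doc_ids in buckets:
        # Deduplicate and sort so every edge is emitted as (smaller, larger).
        ordered = sorted(set(doc_ids))
        # Pair each document with its successor; single-doc buckets add nothing.
        edges.update(zip(ordered, ordered[1:]))
    return sorted(edges)


# Toy input: three buckets, one of which contains a single document.
example_buckets = [("b0", [3, 1, 2]), ("b1", [2, 5]), ("b2", [7])]
example_edges = buckets_to_edges(example_buckets)
print(example_edges)
```

A downstream connected-components step would treat each edge as "these two documents are near-duplicate candidates"; in this toy input, documents 1, 2, 3, and 5 end up in one component and document 7 stays alone.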
Initialization
- name#
‘BucketsToEdgesStage’
- process(
- task: nemo_curator.tasks.FileGroupTask,
- )
Process a task and return the result.

Args:
- task (X): Input task to process

Returns (Y | list[Y]):
- Single task: For 1-to-1 transformations
- List of tasks: For 1-to-many transformations (e.g., readers)
- None: If the task should be filtered out
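The three return shapes described for process can be handled uniformly by a caller. The sketch below shows one way to do that; normalize_result is a hypothetical helper written for illustration and is not part of the documented API.

```python
from __future__ import annotations

from typing import TypeVar

T = TypeVar("T")


def normalize_result(result: T | list[T] | None) -> list[T]:
    """Flatten the three documented return shapes into a plain list.

    Hypothetical helper, not part of the library.
    """
    if result is None:
        # The task was filtered out: nothing to pass downstream.
        return []
    if isinstance(result, list):
        # 1-to-many transformation: already a list of tasks.
        return result
    # 1-to-1 transformation: wrap the single task.
    return [result]


print(normalize_result(None))
print(normalize_result("task-a"))
print(normalize_result(["task-a", "task-b"]))
```

Normalizing to a list lets downstream code iterate over results without checking which case applied.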
- resources#
‘Resources(…)’