nemo_curator.stages.deduplication.fuzzy.buckets_to_edges


Module Contents

Classes

| Name | Description |
| --- | --- |
| BucketsToEdgesStage | Stage that takes in a file consisting of LSH bucket ids and the document ids belonging to each bucket |

API

class nemo_curator.stages.deduplication.fuzzy.buckets_to_edges.BucketsToEdgesStage(
output_path: str,
document_id_field: str = CURATOR_DEDUP_ID_STR,
read_kwargs: dict[str, typing.Any] | None = None,
write_kwargs: dict[str, typing.Any] | None = None
)

Bases: ProcessingStage[FileGroupTask, FileGroupTask]

Stage that takes in a file consisting of LSH bucket ids and the document ids belonging to each bucket, and outputs a file consisting of edges between documents that share a bucket id.
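To make the bucket-to-edge idea concrete, here is a minimal pure-Python sketch. It is not the stage's implementation: the function name `buckets_to_edges`, the `(bucket_id, doc_id)` row layout, and the choice to link consecutive sorted ids (rather than every pair) are assumptions made for illustration. Chaining consecutive ids yields n - 1 edges per bucket of n documents while keeping every document in the bucket connected.

```python
from collections import defaultdict


def buckets_to_edges(rows):
    """Hypothetical helper: turn (bucket_id, doc_id) rows into document edges.

    Assumption: documents in the same bucket are linked as a chain of
    consecutive sorted ids, which is enough to keep them connected.
    """
    buckets = defaultdict(list)
    for bucket_id, doc_id in rows:
        buckets[bucket_id].append(doc_id)

    edges = set()
    for doc_ids in buckets.values():
        ordered = sorted(set(doc_ids))
        # Pair each id with its successor; a single-document bucket yields no edge.
        for left, right in zip(ordered, ordered[1:]):
            edges.add((left, right))
    return sorted(edges)


# Illustrative input: three buckets, one of which holds a single document.
rows = [("b0", 3), ("b0", 1), ("b0", 2), ("b1", 2), ("b1", 7), ("b2", 9)]
edges = buckets_to_edges(rows)
print(edges)
```

A later connected-components step over such edges would group documents 1, 2, 3, and 7 together, since bucket `b1` links document 2 to document 7.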

Parameters:

document_id_field
str. Defaults to CURATOR_DEDUP_ID_STR

The field name containing the document ids for each bucket.

output_path
str

The directory to write the output file to.

read_kwargs
dict[str, Any] | None. Defaults to None

Keyword arguments to pass for reading the input files. Only the storage_options key is supported for now.

write_kwargs
dict[str, Any] | None. Defaults to None

Keyword arguments to pass for writing the output files. Only the storage_options key is supported for now.
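Since only the storage_options key is supported in read_kwargs and write_kwargs, it can help to check a kwargs dict before passing it in. The sketch below is a hypothetical standalone helper (`unsupported_io_keys` and `SUPPORTED_IO_KEYS` are names invented here); it does not reproduce the stage's own `_check_io_kwargs`, whose behavior on unsupported keys is not described on this page.

```python
from typing import Any, Dict, Optional, Set

# Per the parameter descriptions, storage_options is the only supported key.
SUPPORTED_IO_KEYS = {"storage_options"}


def unsupported_io_keys(kwargs: Optional[Dict[str, Any]]) -> Set[str]:
    """Hypothetical helper: return the keys the stage would not make use of."""
    if kwargs is None:
        return set()
    return set(kwargs) - SUPPORTED_IO_KEYS


# Illustrative kwargs: "columns" is not a supported key for this stage.
read_kwargs = {"storage_options": {"anon": True}, "columns": ["id"]}
extra = unsupported_io_keys(read_kwargs)
print(extra)
```

Returning the offending keys, rather than raising, leaves it to the caller to decide whether to warn, fail, or drop them.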

name = 'BucketsToEdgesStage'
output_fs = get_fs(output_path, self.write_storage_options)
output_path = self.output_fs.sep.join([output_path, self.name])
read_storage_options
resources = Resources(cpus=1.0)
write_storage_options

nemo_curator.stages.deduplication.fuzzy.buckets_to_edges.BucketsToEdgesStage._check_io_kwargs(
kwargs: dict[str, typing.Any] | None
) -> None

nemo_curator.stages.deduplication.fuzzy.buckets_to_edges.BucketsToEdgesStage.process(
task: nemo_curator.tasks.FileGroupTask
) -> nemo_curator.tasks.FileGroupTask
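The output_path attribute listed above is built by joining the user-supplied output_path and the stage name with the filesystem separator, so output lands in a subdirectory named after the stage. The sketch below mimics that join in plain Python. It assumes the separator is "/", and `stage_output_dir` is a hypothetical helper, not part of the library.

```python
STAGE_NAME = "BucketsToEdgesStage"  # value of the `name` attribute


def stage_output_dir(output_path: str, sep: str = "/") -> str:
    """Hypothetical helper mirroring sep.join([output_path, name])."""
    return sep.join([output_path, STAGE_NAME])


# Illustrative path; the bucket name is made up.
resolved = stage_output_dir("s3://my-bucket/dedup")
print(resolved)
```

Because this is a plain string join, a trailing separator on output_path would produce a doubled separator in the result, so passing the directory without a trailing slash is the safer choice.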