nemo_curator.stages.deduplication.fuzzy.buckets_to_edges
Module Contents
Classes
API
Bases: ProcessingStage[FileGroupTask, FileGroupTask]
Stage that takes a file consisting of LSH bucket IDs and the document IDs belonging to each bucket, and outputs a file consisting of edges between documents that share a bucket ID.
Parameters:
document_id_field
The field name containing the document ids for each bucket.
output_path
The directory to write the output file to.
read_kwargs
Keyword arguments used when reading the input files. Currently, only the storage_options key is supported.
write_kwargs
Keyword arguments used when writing the output files. Currently, only the storage_options key is supported.
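As a rough illustration of the shape read_kwargs and write_kwargs are described as taking, the sketch below builds two dictionaries that carry only a storage_options entry. The inner key `anon` is an assumption borrowed from common fsspec-style S3 configuration and is not taken from this page; the accepted inner keys depend on the filesystem backend in use.

```python
# Hypothetical values: only the outer "storage_options" key comes from the
# parameter descriptions; the inner "anon" option is an assumed fsspec-style
# setting and may not apply to your storage backend.
read_kwargs = {"storage_options": {"anon": False}}
write_kwargs = {"storage_options": {"anon": False}}

# Both dictionaries contain exactly one top-level key, matching the stated
# restriction that only storage_options is supported.
supported_keys = {"storage_options"}
read_ok = set(read_kwargs) <= supported_keys
write_ok = set(write_kwargs) <= supported_keys
```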
name
output_fs
output_path
read_storage_options
resources
write_storage_options
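The bucket-to-edge conversion described for this stage can be pictured with a small pure-Python sketch. It is not the stage's implementation: the function name `buckets_to_edges`, the dictionary input, and the choice to link consecutive documents within a bucket (rather than every pair) are all assumptions made for illustration. Consecutive links are enough to keep each bucket's documents in one connected component while emitting fewer edges.

```python
def buckets_to_edges(buckets):
    """Turn {bucket_id: [doc_id, ...]} into a sorted list of unique edges.

    Hypothetical helper: documents in the same bucket are chained together
    (doc_0 -> doc_1 -> doc_2 ...), an assumed strategy, not one confirmed
    by this page.
    """
    edges = set()
    for doc_ids in buckets.values():
        # Deduplicate and sort so each edge has a canonical orientation.
        unique_ids = sorted(set(doc_ids))
        # Pair each document with its successor in the sorted order.
        for left, right in zip(unique_ids, unique_ids[1:]):
            edges.add((left, right))
    return sorted(edges)


# Made-up bucket contents for illustration only.
example_buckets = {
    "b0": ["doc1", "doc2", "doc3"],
    "b1": ["doc2", "doc3"],
    "b2": ["doc4"],  # A single-document bucket yields no edges.
}
edges = buckets_to_edges(example_buckets)
```

In this toy input, the edge shared by buckets `b0` and `b1` appears only once in the result, and the single-document bucket contributes nothing.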