nemo_curator.stages.text.deduplication.removal_workflow
Module Contents
Classes
API
Dataclass
Bases: WorkflowBase
duplicate_id_field
duplicate_id_read_kwargs
id_field
id_generator_path
id_generator_storage_options
ids_to_remove_path
input_blocksize
input_fields
input_file_extensions
input_files_per_partition
input_filetype
input_kwargs
input_path
input_task_limit
output_fields
output_file_extension
output_filetype
output_kwargs
output_mode
output_path
Initialize parent class after dataclass initialization.
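The post-initialization pattern described here can be sketched as follows. This is a minimal illustration, not the library's code: `WorkflowBaseSketch` and `RemovalWorkflowSketch` are hypothetical stand-ins, the `str` field types are assumptions, and only the three field names are taken from the attribute list above.

```python
from dataclasses import dataclass


class WorkflowBaseSketch:
    """Hypothetical stand-in for WorkflowBase; not the real class."""

    def __init__(self) -> None:
        # Marker so the example can show that the parent initializer ran.
        self.initialized = True


@dataclass
class RemovalWorkflowSketch(WorkflowBaseSketch):
    """Hypothetical dataclass; field types are assumed for illustration."""

    input_path: str
    ids_to_remove_path: str
    output_path: str

    def __post_init__(self) -> None:
        # The dataclass-generated __init__ only assigns the fields and then
        # calls __post_init__, so the parent initializer is invoked here.
        super().__init__()


workflow = RemovalWorkflowSketch("in/", "ids/", "out/")
```

Without the explicit `super().__init__()` call, the parent initializer would never run, because the generated `__init__` replaces it.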
staticmethod
Sum num_removed metadata reported by downstream stages.
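One way such a summation could look is sketched below. It assumes, purely for illustration, that each downstream task exposes its metadata as a plain dict with an optional `num_removed` key; the function name `sum_num_removed` and the dict layout are hypothetical and are not taken from the library.

```python
from typing import Any, Dict, List


def sum_num_removed(task_metadata: List[Dict[str, Any]]) -> int:
    """Add up the num_removed values found in a list of metadata dicts."""
    # Entries without the key contribute zero to the total.
    return sum(int(meta.get("num_removed", 0)) for meta in task_metadata)


# Hypothetical metadata from three downstream tasks.
example_metadata = [
    {"num_removed": 3},
    {"source_files": ["a.jsonl"]},  # no num_removed reported
    {"num_removed": 4},
]
total_removed = sum_num_removed(example_metadata)
```

A helper like this needs no instance state, which fits its declaration as a static method.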