modules.fuzzy_dedup.fuzzyduplicates
Module Contents
Classes
FuzzyDuplicates | Base class for all NeMo Curator deduplication modules.
API
- class modules.fuzzy_dedup.fuzzyduplicates.FuzzyDuplicates(
- config: nemo_curator.modules.config.FuzzyDuplicatesConfig,
- logger: logging.LoggerAdapter | str = './',
- perform_removal: bool = False,
- )
Bases: nemo_curator.modules.base.BaseDeduplicationModule
Base class for all NeMo Curator deduplication modules.
Initialization
Parameters
config: FuzzyDuplicatesConfig. Configuration options for finding fuzzy duplicates.
logger: Existing logger to log to, or a path to a log directory.
perform_removal: Whether to remove duplicates from the dataset. Default is False.
Returns
DocumentDataset containing IDs of all documents and the corresponding duplicate group they belong to. Documents in the same group are near duplicates.
- identify_duplicates(
- dataset: nemo_curator.datasets.DocumentDataset,
- )
Parameters
dataset: DocumentDataset. The input dataset on which to compute fuzzy duplicates. Must contain a text field and a unique ID field.
Returns
DocumentDataset containing IDs of all documents and the corresponding duplicate group they belong to. Documents in the same group are near duplicates.
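To make the shape of this result concrete, here is a minimal pure-Python sketch that maps every document ID to a duplicate-group label. It is an illustration only and does not use NeMo Curator or reproduce how the library computes duplicates. The helper names (`shingles`, `jaccard`, `toy_identify_duplicates`), the character 3-gram Jaccard similarity, the 0.8 threshold, and the use of the smallest ID as the group label are all assumptions made for this example.

```python
from itertools import combinations


def shingles(text, k=3):
    """Character k-grams of the lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    if len(norm) <= k:
        return {norm}
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def toy_identify_duplicates(docs, threshold=0.8):
    """Map every document ID to a duplicate-group label.

    docs is a dict of id -> text. Documents whose shingle sets have
    Jaccard similarity >= threshold are merged into one group with a
    small union-find; the label is the smallest ID in the group.
    (Hypothetical helper; all-pairs comparison only suits tiny inputs.)
    """
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    for a, b in combinations(docs, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Attach the larger root under the smaller one.
                parent[max(root_a, root_b)] = min(root_a, root_b)
    return {doc_id: find(doc_id) for doc_id in docs}


docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumps over the lazy dog!",
    "c": "An entirely different sentence about GPUs.",
}
groups = toy_identify_duplicates(docs)
print(groups)  # {'a': 'a', 'b': 'a', 'c': 'c'}
```

Documents "a" and "b" share a group label because they differ only in the final character, while "c" forms a group of its own.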
- remove(
- dataset: nemo_curator.datasets.DocumentDataset,
- duplicates_to_remove: nemo_curator.datasets.DocumentDataset | None,
- )
Remove fuzzy duplicates from a given DocumentDataset.
Parameters
dataset: DocumentDataset. The input dataset from which to remove fuzzy duplicates.
duplicates_to_remove: DocumentDataset. The dataset containing IDs of the fuzzy duplicates to remove.
Returns
DocumentDataset containing only non-duplicate documents
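The removal step can be pictured as a filter by ID. The sketch below is plain Python over dictionaries, not the library's implementation. The helper names (`ids_to_remove`, `toy_remove`) and the policy of keeping the smallest ID in each group are assumptions for illustration; the source does not state which member of a group is kept.

```python
def ids_to_remove(groups):
    """Return the IDs to drop, keeping one document per group.

    groups is a dict of id -> group label. Keeping the smallest ID in
    each group is an assumed policy for this example.
    """
    keep = {}
    for doc_id, group in groups.items():
        if group not in keep or doc_id < keep[group]:
            keep[group] = doc_id
    return set(groups) - set(keep.values())


def toy_remove(docs, duplicates_to_remove):
    """Return only the documents whose ID is not marked for removal."""
    return {
        doc_id: text
        for doc_id, text in docs.items()
        if doc_id not in duplicates_to_remove
    }


# Sample data defined here so the example stands alone.
sample_docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumps over the lazy dog!",
    "c": "An entirely different sentence about GPUs.",
}
sample_groups = {"a": "a", "b": "a", "c": "c"}

to_drop = ids_to_remove(sample_groups)
deduplicated = toy_remove(sample_docs, to_drop)
print(sorted(to_drop))       # ['b']
print(sorted(deduplicated))  # ['a', 'c']
```

Separating the choice of which IDs to drop from the filtering itself mirrors the two arguments of `remove`: the dataset to filter and the dataset of duplicate IDs.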