modules.fuzzy_dedup.fuzzyduplicates

Module Contents

Classes

FuzzyDuplicates

Base class for all NeMo Curator deduplication modules.

API

class modules.fuzzy_dedup.fuzzyduplicates.FuzzyDuplicates(
config: nemo_curator.modules.config.FuzzyDuplicatesConfig,
logger: logging.LoggerAdapter | str = './',
perform_removal: bool = False,
)

Bases: nemo_curator.modules.base.BaseDeduplicationModule

Base class for all NeMo Curator deduplication modules.

Initialization

Parameters

config: FuzzyDuplicatesConfig. Configuration options for finding fuzzy duplicates.

logger: Existing logger to log to, or a path to a log directory.

perform_removal: Whether to remove duplicates from the dataset. Default is False.

Returns

DocumentDataset containing IDs of all documents and the corresponding duplicate group they belong to. Documents in the same group are near duplicates.

identify_duplicates(
dataset: nemo_curator.datasets.DocumentDataset,
) -> nemo_curator.datasets.DocumentDataset | None

Parameters

dataset: DocumentDataset. The input dataset in which to find fuzzy duplicates. Must contain a text field and a unique ID field.

Returns

DocumentDataset containing IDs of all documents and the corresponding duplicate group they belong to. Documents in the same group are near duplicates.
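To make the shape of this return value concrete, the following is a toy, pure-Python sketch that assigns a group label to every document ID. It is not the library's implementation: it swaps in exact Jaccard similarity over word sets plus union-find, and the function name `toy_duplicate_groups`, the `threshold` parameter, and the `id` and `group` keys are all hypothetical names chosen for illustration. Whether unique documents appear in the real output is not stated here; this sketch gives them a group of their own.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (two empty sets count as identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def toy_duplicate_groups(docs: Dict[str, str], threshold: float = 0.8) -> List[dict]:
    """Return one {"id", "group"} row per document; near duplicates share a group."""
    tokens = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair whose similarity reaches the threshold (quadratic, toy only).
    for a, b in combinations(docs, 2):
        if jaccard(tokens[a], tokens[b]) >= threshold:
            parent[find(a)] = find(b)

    # Number the groups in order of first appearance.
    group_of_root: Dict[str, int] = {}
    rows = []
    for doc_id in docs:
        root = find(doc_id)
        group = group_of_root.setdefault(root, len(group_of_root))
        rows.append({"id": doc_id, "group": group})
    return rows
```

The all-pairs comparison is only workable for a handful of documents; it is here to show the ID-to-group mapping, not how the grouping is computed at scale.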

remove(
dataset: nemo_curator.datasets.DocumentDataset,
duplicates_to_remove: nemo_curator.datasets.DocumentDataset | None,
) -> nemo_curator.datasets.DocumentDataset

Remove fuzzy duplicates from a given DocumentDataset.

Parameters

dataset: DocumentDataset. The input dataset from which to remove fuzzy duplicates.

duplicates_to_remove: DocumentDataset. The dataset containing IDs of the fuzzy duplicates to remove.

Returns

DocumentDataset containing only non-duplicate documents.
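The two methods above are meant to be used in sequence, which can be sketched as a small helper. This is a minimal sketch under stated assumptions: `identify_then_remove` is a hypothetical name, `fuzzy` is assumed to be an already-constructed `FuzzyDuplicates` instance, and returning the input unchanged when `identify_duplicates` yields `None` is an assumption about what `None` means, not documented behavior.

```python
from typing import Any, Optional


def identify_then_remove(fuzzy: Any, dataset: Any) -> Any:
    """Run the two documented steps in order on an already-constructed object."""
    # Step 1: find near-duplicate groups. The documented return type allows None.
    duplicates: Optional[Any] = fuzzy.identify_duplicates(dataset)
    if duplicates is None:
        # Assumption: None means nothing was flagged, so nothing is removed.
        return dataset
    # Step 2: drop the flagged documents from the original dataset.
    return fuzzy.remove(dataset, duplicates)
```

With the library installed, `fuzzy` would be built from the constructor documented above, for example `FuzzyDuplicates(config=config, logger="./", perform_removal=False)`, where `config` is a `FuzzyDuplicatesConfig` and `dataset` is a `DocumentDataset` with text and unique ID fields.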