modules.exact_dedup#

Module Contents#

Classes#

ExactDuplicates

Find exact duplicates in a document corpus

API#

class modules.exact_dedup.ExactDuplicates(
logger: logging.LoggerAdapter | str = './',
id_field: str = 'id',
text_field: str = 'text',
hash_method: str = 'md5',
perform_removal: bool = False,
profile_dir: str | None = None,
cache_dir: str | None = None,
)#

Bases: nemo_curator.modules.base.BaseDeduplicationModule

Find exact duplicates in a document corpus

Initialization

Parameters

logger: Existing logger to log to, or a path to a log directory.

id_field: Column in the Dataset denoting document ID.

text_field: Column in the Dataset denoting document content.

hash_method: The hashing algorithm used for identifying exact duplicates. Currently supports {“md5”}.

profile_dir: str, default None. If specified, the directory to write the Dask profile to.

cache_dir: str, default None. If specified, duplicate IDs are computed and written to the cache directory.

SUPPORTED_HASHES#

‘frozenset(…)’

hash_documents(
df: cudf.Series | pandas.Series,
) → cudf.Series | pandas.Series#

Compute hashes for a Series containing documents
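The hashing step can be illustrated with a minimal sketch in plain Python. It is not the library's implementation: the real method operates on a cudf or pandas Series, whereas the hypothetical helper `md5_hash_documents` below takes an ordinary list of strings and uses the standard-library `hashlib`. It only shows the idea that byte-identical documents get the same MD5 digest, while any difference (even letter case) produces a different one.

```python
import hashlib


def md5_hash_documents(texts):
    """Return the hex MD5 digest of each document string.

    Hypothetical helper for illustration only; it is not part of
    modules.exact_dedup.
    """
    return [hashlib.md5(t.encode("utf-8")).hexdigest() for t in texts]


docs = ["hello world", "Hello world", "hello world"]
hashes = md5_hash_documents(docs)

# Documents 0 and 2 are byte-identical, so their digests match;
# document 1 differs in case, so its digest does not.
print(hashes[0] == hashes[2])
print(hashes[0] == hashes[1])
```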

identify_duplicates(
dataset: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset#

Find document IDs for exact duplicates in a given DocumentDataset.

Parameters

dataset: DocumentDataset. The input dataset in which to find exact duplicates.

Returns

DocumentDataset containing IDs and hashes of all duplicate documents.
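As a rough illustration of duplicate identification, the sketch below groups documents by their MD5 digest and reports every member of a group that has more than one document. It works on a plain list of dicts rather than a DocumentDataset, and the function name `find_exact_duplicates` and the output key `"hash"` are assumptions made for this example, not names taken from the library. Whether the library reports every member of a duplicate group in the same way is also an assumption here.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(records, id_field="id", text_field="text"):
    """Return the ID and hash of every document whose text occurs more than once.

    Hypothetical stand-in for illustration; it operates on a list of dicts,
    not on a DocumentDataset.
    """
    # Map each digest to the IDs of the documents that produced it.
    groups = defaultdict(list)
    for rec in records:
        digest = hashlib.md5(rec[text_field].encode("utf-8")).hexdigest()
        groups[digest].append(rec[id_field])

    # Keep only groups with at least two members.
    return [
        {id_field: doc_id, "hash": digest}
        for digest, ids in groups.items()
        if len(ids) > 1
        for doc_id in ids
    ]


records = [
    {"id": "a", "text": "the quick brown fox"},
    {"id": "b", "text": "a different document"},
    {"id": "c", "text": "the quick brown fox"},
]
duplicates = find_exact_duplicates(records)
duplicate_ids = [d["id"] for d in duplicates]
```

Here `duplicate_ids` contains `"a"` and `"c"`, since those two records share the same text, while `"b"` is unique and is not reported.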

remove(
dataset: nemo_curator.datasets.DocumentDataset,
duplicates_to_remove: nemo_curator.datasets.DocumentDataset | None,
) → nemo_curator.datasets.DocumentDataset#

Remove exact duplicates from a given DocumentDataset.

Parameters

dataset: DocumentDataset. The input dataset from which to remove exact duplicates.

duplicates_to_remove: DocumentDataset. The dataset containing IDs of the exact duplicates to remove.

Returns

DocumentDataset containing only non-duplicate documents.
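The removal step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: the hypothetical function `remove_exact_duplicates` takes lists of dicts instead of DocumentDataset objects, expects each duplicate entry to carry a `"hash"` key (a name chosen for this example), and keeps the first document listed for each hash while dropping the rest. Which copy the library retains is not stated in this reference, so that choice is an assumption.

```python
def remove_exact_duplicates(records, duplicates, id_field="id", hash_field="hash"):
    """Drop all but the first listed document of each duplicate group.

    Hypothetical stand-in for illustration; it operates on lists of dicts,
    not on DocumentDataset objects.
    """
    seen_hashes = set()
    ids_to_drop = set()
    for dup in duplicates:
        if dup[hash_field] in seen_hashes:
            # A document with this hash was already kept, so drop this one.
            ids_to_drop.add(dup[id_field])
        else:
            seen_hashes.add(dup[hash_field])
    return [rec for rec in records if rec[id_field] not in ids_to_drop]


corpus = [
    {"id": "a", "text": "the quick brown fox"},
    {"id": "b", "text": "a different document"},
    {"id": "c", "text": "the quick brown fox"},
]
# Example duplicate listing; "h1" is a placeholder hash value.
dupes = [{"id": "a", "hash": "h1"}, {"id": "c", "hash": "h1"}]

deduplicated = remove_exact_duplicates(corpus, dupes)
kept_ids = [rec["id"] for rec in deduplicated]
```

With this input, `kept_ids` is `["a", "b"]`: one copy of the repeated text survives and the unique document is untouched.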