modules.exact_dedup
Module Contents
Classes
ExactDuplicates | Find exact duplicates in a document corpus
API
- class modules.exact_dedup.ExactDuplicates(
- logger: logging.LoggerAdapter | str = './',
- id_field: str = 'id',
- text_field: str = 'text',
- hash_method: str = 'md5',
- perform_removal: bool = False,
- profile_dir: str | None = None,
- cache_dir: str | None = None,
Bases:
nemo_curator.modules.base.BaseDeduplicationModule
Find exact duplicates in a document corpus
Initialization
Parameters
logger: Existing logger to log to, or a path to a log directory.
id_field: Column in the Dataset denoting the document ID.
text_field: Column in the Dataset denoting the document content.
hash_method: The hashing algorithm used for identifying exact duplicates. Currently supports {"md5"}.
profile_dir: str, default None. If specified, the directory to write the Dask profile to.
cache_dir: str, default None. If specified, duplicate IDs are computed and written to the cache directory.
- SUPPORTED_HASHES
‘frozenset(…)’
- hash_documents(
- df: cudf.Series | pandas.Series,
Compute hashes for a Series containing documents
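To illustrate what hashing a Series of documents looks like, here is a minimal sketch using only pandas and the standard library. It is not the library's implementation: `md5_hash_series` is a hypothetical helper, and the UTF-8 encoding and hex-digest output are assumptions made for the example.

```python
import hashlib

import pandas as pd


def md5_hash_series(texts: pd.Series) -> pd.Series:
    """Return the hex MD5 digest of each document string (hypothetical helper)."""
    # Assumption: documents are Python strings and are hashed as UTF-8 bytes.
    return texts.map(lambda t: hashlib.md5(t.encode("utf-8")).hexdigest())


docs = pd.Series(["hello world", "hello world", "goodbye"])
hashes = md5_hash_series(docs)
# Identical documents produce identical digests; different documents do not.
```

Because the digest depends only on the text, comparing hashes is enough to detect byte-for-byte identical documents without comparing the full strings pairwise.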
- identify_duplicates(
- dataset: nemo_curator.datasets.DocumentDataset,
Find document IDs for exact duplicates in a given DocumentDataset.
Parameters
dataset: DocumentDataset The input dataset in which to find exact duplicates.
Returns
DocumentDataset containing the IDs and hashes of all duplicate documents.
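The idea of identifying duplicates by hash can be sketched on a plain pandas DataFrame. This is an illustration under stated assumptions, not the module's code: `find_exact_duplicates` and the output column name `hash` are hypothetical, and the sketch flags every member of a duplicate group, which may differ from what the library returns.

```python
import hashlib

import pandas as pd


def find_exact_duplicates(
    df: pd.DataFrame, id_field: str = "id", text_field: str = "text"
) -> pd.DataFrame:
    """Return the IDs and hashes of documents whose text occurs more than once."""
    hashes = df[text_field].map(
        lambda t: hashlib.md5(t.encode("utf-8")).hexdigest()
    )
    # "hash" is a hypothetical column name chosen for this sketch.
    hashed = pd.DataFrame({id_field: df[id_field], "hash": hashes})
    # keep=False marks every row of a duplicate group, including the first.
    is_dup = hashed.duplicated(subset="hash", keep=False)
    return hashed[is_dup].reset_index(drop=True)


corpus = pd.DataFrame({"id": ["a", "b", "c", "d"], "text": ["x", "y", "x", "z"]})
dups = find_exact_duplicates(corpus)
# Documents "a" and "c" share the same text, so both are reported.
```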
- remove(
- dataset: nemo_curator.datasets.DocumentDataset,
- duplicates_to_remove: nemo_curator.datasets.DocumentDataset | None,
Remove exact duplicates from a given DocumentDataset.
Parameters
dataset: DocumentDataset The input dataset from which to remove exact duplicates.
duplicates_to_remove: DocumentDataset The dataset containing IDs of the exact duplicates to remove.
Returns
DocumentDataset containing only non-duplicate documents.
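Removal by ID amounts to an anti-join between the corpus and the list of IDs to drop. The following sketch shows that step on pandas DataFrames. It is a simplified illustration, not the library's implementation: `drop_listed_ids` is a hypothetical helper, and which IDs end up in the removal list is left to the caller.

```python
import pandas as pd


def drop_listed_ids(
    df: pd.DataFrame, ids_to_remove: pd.DataFrame, id_field: str = "id"
) -> pd.DataFrame:
    """Return the rows of df whose ID does not appear in ids_to_remove."""
    keep_mask = ~df[id_field].isin(ids_to_remove[id_field])
    return df[keep_mask].reset_index(drop=True)


corpus = pd.DataFrame({"id": ["a", "b", "c", "d"], "text": ["x", "y", "x", "z"]})
# Assumption for this example: "c" repeats the text of "a", so only "c" is listed.
to_remove = pd.DataFrame({"id": ["c"]})
cleaned = drop_listed_ids(corpus, to_remove)
```

In this example one copy of the repeated text survives because only the second copy was listed for removal. Listing both IDs would drop the text entirely.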