Deduplication
Exact
- class nemo_curator.ExactDuplicates(
- logger: logging.LoggerAdapter | str = './',
- id_field: str = 'id',
- text_field: str = 'text',
- hash_method: str = 'md5',
- perform_removal: bool = False,
- profile_dir: str | None = None,
- cache_dir: str | None = None,
- )
Find exact duplicates in a document corpus
- call(
- dataset: DocumentDataset,
- )
Performs an arbitrary operation on a dataset
- Parameters:
dataset (DocumentDataset) – The dataset to operate on
- hash_documents(
- df: cudf.Series | pd.Series,
- )
Compute hashes for a Series containing documents
- identify_duplicates(
- dataset: DocumentDataset,
- )
Find document IDs for exact duplicates in a given DocumentDataset
- Parameters:
dataset (DocumentDataset) – The input dataset in which to find exact duplicates
- Return type:
DocumentDataset containing IDs and hashes of all duplicate documents
- remove(
- dataset: DocumentDataset,
- duplicates_to_remove: DocumentDataset | None,
- )
Remove exact duplicates from a given DocumentDataset
- Parameters:
dataset (DocumentDataset) – The input dataset from which to remove exact duplicates
duplicates_to_remove (DocumentDataset) – The dataset containing IDs of the exact duplicates to remove
- Return type:
DocumentDataset containing only non-duplicate documents
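A minimal usage sketch of exact deduplication. The file paths, cache location, and JSONL corpus layout below are placeholder assumptions, not part of the API:

```python
from nemo_curator import ExactDuplicates
from nemo_curator.datasets import DocumentDataset

# Placeholder corpus path; any JSONL files with "id" and "text" columns work.
dataset = DocumentDataset.read_json("/data/corpus/*.jsonl", backend="cudf")

exact_dups = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5",
    cache_dir="/data/exact_cache",  # placeholder; stores intermediate hashes
)

# identify_duplicates returns the IDs and hashes of duplicated documents;
# remove then filters those documents out of the original dataset.
duplicates = exact_dups.identify_duplicates(dataset)
deduplicated = exact_dups.remove(dataset, duplicates)
deduplicated.to_json("/data/deduped/")  # placeholder output directory
```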
Fuzzy
- class nemo_curator.BucketsToEdges(
- cache_dir: str | None = None,
- id_fields: list | str = 'id',
- str_id_name: str = 'id',
- bucket_field: str = '_bucket_id',
- logger: LoggerAdapter | str = './',
- profile_dir: str | None = None,
- )
Maps buckets generated from LSH into an edge list that can be processed further by Connected Components to find duplicate documents
- class nemo_curator.ConnectedComponents(
- cache_dir: str,
- jaccard_pairs_path: str,
- id_column='id',
- jaccard_threshold: float = 0.8,
- logger: LoggerAdapter | str = './',
- profile_dir: str | None = None,
- )
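These two classes are the internal stages between LSH buckets and duplicate groups. A rough sketch of how they connect; all paths are placeholders, and the cc_workflow method name is an assumption based on the fuzzy-dedup internals, so verify it against your installed version:

```python
from nemo_curator import BucketsToEdges, ConnectedComponents
from nemo_curator.datasets import DocumentDataset

# Assumed to be the bucket output of the LSH stage (see the LSH class below),
# i.e. a dataset with an id column and a "_bucket_id" column.
lsh_buckets = DocumentDataset.read_parquet("/data/lsh_cache/buckets", backend="cudf")

# Turn each LSH bucket into edges between its member documents.
buckets_to_edges = BucketsToEdges(cache_dir="/data/cc_cache", id_fields="id")
edges = buckets_to_edges(lsh_buckets)

# Group connected documents into duplicate clusters. cc_workflow is assumed
# to write its result to the given output path.
cc = ConnectedComponents(
    cache_dir="/data/cc_cache",
    jaccard_pairs_path="/data/cc_cache/jaccard_similarity_results.parquet",
    id_column="id",
    jaccard_threshold=0.8,
)
cc.cc_workflow(output_path="/data/cc_cache/connected_components.parquet")
```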
- class nemo_curator.FuzzyDuplicatesConfig(
- cache_dir: str,
- profile_dir: str | None = None,
- id_field: str = 'id',
- text_field: str = 'text',
- perform_removal: bool = False,
- seed: int = 42,
- char_ngrams: int = 24,
- num_buckets: int = 20,
- hashes_per_bucket: int = 13,
- use_64_bit_hash: bool = False,
- buckets_per_shuffle: int = 1,
- false_positive_check: bool = False,
- num_anchors: int | None = None,
- jaccard_threshold: float | None = None,
- bucket_mapping_blocksize: int | None = None,
- parts_per_worker: int | None = None,
- bucket_parts_per_worker: int | None = None,
- )
Configuration for MinHash-based fuzzy duplicate detection.
- Parameters:
seed (int) – Seed for minhash permutations
char_ngrams (int) – Size of character n-gram shingles used in minhash computation
num_buckets (int) – Number of bands or buckets to use during Locality Sensitive Hashing
hashes_per_bucket (int) – Number of hashes per bucket/band
use_64_bit_hash (bool) – Whether to use a 32-bit or 64-bit hash function for minhashing
buckets_per_shuffle (int) – Number of buckets to shuffle concurrently. Larger values process larger batches by processing multiple bands but might lead to memory pressure and related errors.
id_field (str) – Column in the dataset denoting document ID
text_field (str) – Column in the dataset denoting document content
perform_removal (bool) – Whether calling the module should remove the duplicates from the original dataset or return the list of IDs denoting duplicates
profile_dir (str, default None) – If specified, directory to write Dask profile
cache_dir (str, default None) – Location to store deduplication intermediates such as minhashes, buckets, etc.
false_positive_check (bool) – Whether to run a check to look for false positives within buckets. Note: this is a computationally expensive step.
num_anchors (int) – Number of documents per bucket to use as reference for computing Jaccard pairs within that bucket to identify false positives
jaccard_threshold (float) – The Jaccard similarity threshold at which to consider a document a near duplicate during false positive evaluations
- class nemo_curator.FuzzyDuplicates(
- config: FuzzyDuplicatesConfig,
- logger: LoggerAdapter | str = './',
- )
- call(
- dataset: DocumentDataset,
- perform_removal: bool = False,
- )
Performs an arbitrary operation on a dataset
- Parameters:
dataset (DocumentDataset) – The dataset to operate on
- identify_duplicates(
- dataset: DocumentDataset,
- )
- Parameters:
dataset (DocumentDataset) – The input dataset on which to compute fuzzy duplicates. Must contain a text field and a unique ID field.
- Returns:
DocumentDataset containing IDs of all documents and the corresponding duplicate group
they belong to. Documents in the same group are near duplicates.
- remove(
- dataset: DocumentDataset,
- duplicates_to_remove: DocumentDataset | None,
- )
Remove fuzzy duplicates from a given DocumentDataset
- Parameters:
dataset (DocumentDataset) – The input dataset from which to remove fuzzy duplicates
duplicates_to_remove (DocumentDataset) – The dataset containing IDs of the fuzzy duplicates to remove
- Return type:
DocumentDataset containing only non-duplicate documents
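A minimal end-to-end sketch of the high-level fuzzy-dedup flow. Paths are placeholders; the parameter values mirror the defaults above:

```python
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("/data/corpus/*.jsonl", backend="cudf")

config = FuzzyDuplicatesConfig(
    cache_dir="/data/fuzzy_cache",  # placeholder; stores minhashes, buckets, etc.
    id_field="id",
    text_field="text",
    # 20 bands x 13 hashes per band = 260 minhash permutations in total.
    num_buckets=20,
    hashes_per_bucket=13,
    false_positive_check=False,
)
fuzzy = FuzzyDuplicates(config=config)

# identify_duplicates returns document IDs with their duplicate group;
# remove filters those IDs out of the original dataset.
duplicates = fuzzy.identify_duplicates(dataset)
deduplicated = fuzzy.remove(dataset, duplicates)
```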
- class nemo_curator.JaccardSimilarity(
- id_field='id',
- anchor_id_fields=['anchor_0_id', 'anchor_1_id'],
- text_field='text',
- ngram_width=5,
- )
- class nemo_curator.LSH(
- cache_dir: str,
- num_hashes: int,
- num_buckets: int,
- buckets_per_shuffle: int = 1,
- false_positive_check: bool = False,
- logger: LoggerAdapter | str = './',
- id_fields: str | list = 'id',
- minhash_field: str = '_minhash_signature',
- profile_dir: str | None = None,
- )
Performs LSH on MinHash signatures
- bucket_id_to_int(
- bucket_ddf: dask_cudf.DataFrame,
- bucket_col_name: str = 'bucket_id',
- start_id: int = 0,
- )
Maps bucket IDs to a contiguous integer range starting from start_id.
- lsh(write_path: str, df: dask_cudf.DataFrame) -> bool
Computes hash buckets for the DataFrame and writes them as parquet files to the specified path.
- Parameters:
write_path (str) – The directory path to write parquet files.
df (dask_cudf.DataFrame) – The input DataFrame with minhashes to be bucketed.
- Returns:
True if buckets were empty (no duplicates found), False otherwise.
- Return type:
bool
- class nemo_curator.MinHash(
- seed: int = 42,
- num_hashes: int = 260,
- char_ngrams: int = 24,
- use_64bit_hash: bool = False,
- logger: LoggerAdapter | str = './',
- id_field: str = 'id',
- text_field: str = 'text',
- profile_dir: str | None = None,
- cache_dir: str | None = None,
- )
Computes minhash signatures of a document corpus
- generate_hash_permutation_seeds(
- bit_width: int,
- n_permutations: int = 260,
- seed: int = 0,
- )
Generate seeds for all minhash permutations based on the given seed.
- generate_seeds(
- n_seeds: int = 260,
- seed: int = 0,
- )
Generate seeds for all minhash permutations based on the given seed.
- minhash32(
- ser: cudf.Series,
- seeds: numpy.ndarray,
- char_ngram: int,
- )
Compute 32-bit minhashes based on the MurmurHash3 algorithm
- minhash64(
- ser: cudf.Series,
- seeds: numpy.ndarray,
- char_ngram: int,
- )
Compute 64-bit minhashes based on the MurmurHash3 algorithm
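For finer control than FuzzyDuplicates, the MinHash and LSH stages can be run separately. A sketch with placeholder paths; note that num_hashes must equal num_buckets * hashes_per_bucket (here 260 = 20 * 13):

```python
from nemo_curator import LSH, MinHash
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("/data/corpus/*.jsonl", backend="cudf")

# Stage 1: compute one 260-hash MinHash signature per document.
minhasher = MinHash(
    num_hashes=260,
    char_ngrams=24,
    cache_dir="/data/minhash_cache",  # placeholder
)
minhashes = minhasher(dataset)

# Stage 2: band the signatures into 20 buckets of 13 hashes each;
# documents sharing any bucket become duplicate candidates.
lsh = LSH(
    cache_dir="/data/lsh_cache",  # placeholder
    num_hashes=260,
    num_buckets=20,
    minhash_field="_minhash_signature",
)
buckets = lsh(minhashes)
```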
Semantic
- class nemo_curator.SemDedup(
- config: SemDedupConfig,
- input_column: str = 'text',
- id_column: str = 'id',
- id_column_type: str = 'int',
- logger: Logger | str = './',
- )
- call(
- dataset: DocumentDataset,
- )
Execute the SemDedup process.
- Parameters:
dataset (DocumentDataset) – Input dataset for deduplication.
- Returns:
Deduplicated dataset.
- Return type:
DocumentDataset
- class nemo_curator.SemDedupConfig(
- cache_dir: str,
- profile_dir: str | None = None,
- num_files: int = -1,
- embedding_model_name_or_path: str = 'sentence-transformers/all-MiniLM-L6-v2',
- embedding_batch_size: int = 128,
- embeddings_save_loc: str = 'embeddings',
- embedding_max_mem_gb: int | None = None,
- embedding_pooling_strategy: str = 'mean_pooling',
- embedding_column: str = 'embeddings',
- write_embeddings_to_disk: bool = True,
- write_to_filename: bool = False,
- max_iter: int = 100,
- n_clusters: int = 1000,
- clustering_save_loc: str = 'clustering_results',
- random_state: int = 1234,
- sim_metric: str = 'cosine',
- which_to_keep: str = 'hard',
- batched_cosine_similarity: bool | int = 1024,
- sort_clusters: bool = True,
- kmeans_with_cos_dist: bool = False,
- clustering_input_partition_size: str = '2gb',
- eps_thresholds: List[float] = <factory>,
- eps_to_extract: float = 0.01,
- )
Configuration for Semantic Deduplication.
- Parameters:
cache_dir (str) – Directory to store cache.
profile_dir (Optional[str]) – If specified, directory to write Dask profile. Default is None.
num_files (int) – Number of files to process. Default is -1, meaning all files.
embedding_model_name_or_path (str) – Model name or path for embeddings. Default is "sentence-transformers/all-MiniLM-L6-v2".
embedding_batch_size (int) – Initial batch size for processing embeddings. Default is 128.
embeddings_save_loc (str) – Location to save embeddings. Default is "embeddings".
embedding_max_mem_gb (int) – Maximum memory usage in GB for the embedding process. If None, defaults to the available GPU memory minus 4 GB.
embedding_pooling_strategy (str) – Strategy for pooling embeddings, either "mean_pooling" or "last_token". Default is "mean_pooling".
embedding_column (str) – The column name that stores the embeddings. Default is "embeddings".
write_embeddings_to_disk (bool) – If True, saves the embeddings to disk. We recommend setting this to False when you have a delayed pipeline; note that setting it to False can lead to more memory overhead. Default is True.
write_to_filename (bool) – If True, saves the embeddings to the same filename as the input files. Default is False.
max_iter (int) – Maximum iterations for clustering. Default is 100.
n_clusters (int) – Number of clusters. Default is 1000.
clustering_save_loc (str) – Location to save clustering results. Default is "clustering_results".
random_state (int) – KMeans random state used for reproducibility. Default is 1234.
sim_metric (str) – Similarity metric for deduplication. Default is "cosine".
which_to_keep (str) – Method to determine which duplicates to keep. Default is "hard".
batched_cosine_similarity (Union[bool, int]) – Whether to use batched cosine similarity, which lowers memory usage. When False or 0, no batching is used and memory requirements are O(N^2), where N is the number of items in the cluster. When True, the batch size is set to 1024 and memory requirements are O(N*B), where B is the batch size. Default is 1024.
sort_clusters (bool) – Whether to sort clusters. Default is True.
kmeans_with_cos_dist (bool) – Whether to use KMeans with cosine distance. Default is False.
clustering_input_partition_size (str) – The size of the data partition with which to run KMeans. Default is "2gb".
eps_thresholds (List[float]) – Epsilon thresholds used to determine whether documents are semantically similar. Default is [0.01, 0.001].
eps_to_extract (float) – Epsilon value at which to extract deduplicated data. Default is 0.01.
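A minimal sketch of running the end-to-end semantic pipeline via SemDedup; the cache path and corpus path are placeholders:

```python
from nemo_curator import SemDedup, SemDedupConfig
from nemo_curator.datasets import DocumentDataset

config = SemDedupConfig(
    cache_dir="/data/semdedup_cache",  # placeholder; holds embeddings and clusters
    n_clusters=1000,
    eps_to_extract=0.01,
)
sem = SemDedup(config=config, input_column="text", id_column="id")

dataset = DocumentDataset.read_json("/data/corpus/*.jsonl", backend="cudf")
deduplicated = sem(dataset)  # returns the deduplicated DocumentDataset
```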
- class nemo_curator.EmbeddingCreator(
- embedding_model_name_or_path: str = 'sentence-transformers/all-MiniLM-L6-v2',
- embedding_batch_size: int = 128,
- embedding_output_dir: str = './embeddings',
- embedding_max_mem_gb: int | None = None,
- embedding_pooling_strategy: str = 'mean_pooling',
- input_column: str = 'text',
- embedding_column: str = 'embeddings',
- write_embeddings_to_disk: bool = True,
- write_to_filename: bool = False,
- logger: Logger | str = './',
- profile_dir: str | None = None,
- )
- class nemo_curator.ClusteringModel(
- id_column: str = 'id',
- max_iter: int = 100,
- n_clusters: int = 1000,
- clustering_output_dir: str = './clustering_results',
- embedding_column: str = 'embeddings',
- random_state: int = 1234,
- sim_metric: str = 'cosine',
- which_to_keep: str = 'hard',
- sort_clusters: bool = True,
- kmeans_with_cos_dist: bool = False,
- clustering_input_partition_size: str = '2gb',
- logger: Logger | str = './',
- profile_dir: str | None = None,
- )
- class nemo_curator.SemanticClusterLevelDedup(
- n_clusters: int = 1000,
- emb_by_clust_dir: str = './clustering_results/embs_by_nearest_center',
- sorted_clusters_dir: str = './clustering_results/sorted',
- id_column: str = 'id',
- id_column_type: str = 'int',
- which_to_keep: str = 'hard',
- output_dir: str = './clustering_results',
- embedding_column: str = 'embeddings',
- batched_cosine_similarity: int = 1024,
- logger: Logger | str = './',
- profile_dir: str | None = None,
- )
- compute_semantic_match_dfs(
- eps_list: List[float] | None = None,
- )
Compute semantic match dataframes for clusters.
- Parameters:
eps_list (Optional[List[float]]) – List of epsilon values for clustering.
- extract_dedup_data(
- eps_to_extract: float,
- )
Extract deduplicated data based on epsilon value.
- Parameters:
eps_to_extract (float) – Epsilon threshold for extracting deduplicated data.
- Returns:
Dataset containing deduplicated documents.
- Return type:
DocumentDataset
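The three component classes above can also be chained by hand instead of using SemDedup. A sketch under the default directory layout; all paths are placeholder assumptions:

```python
from nemo_curator import ClusteringModel, EmbeddingCreator, SemanticClusterLevelDedup
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("/data/corpus/*.jsonl", backend="cudf")

# 1. Embed each document.
embedder = EmbeddingCreator(
    embedding_model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
    embedding_output_dir="/data/sem_cache/embeddings",  # placeholder
)
embeddings = embedder(dataset)

# 2. Cluster the embeddings with KMeans.
clusterer = ClusteringModel(
    n_clusters=1000,
    clustering_output_dir="/data/sem_cache/clustering_results",  # placeholder
)
clustered = clusterer(embeddings)

# 3. Rank documents within each cluster, compute pairwise similarities,
#    and extract the deduplicated dataset at the chosen epsilon.
dedup = SemanticClusterLevelDedup(
    n_clusters=1000,
    emb_by_clust_dir="/data/sem_cache/clustering_results/embs_by_nearest_center",
    sorted_clusters_dir="/data/sem_cache/clustering_results/sorted",
    output_dir="/data/sem_cache/clustering_results",
)
dedup.compute_semantic_match_dfs(eps_list=[0.01, 0.001])
deduplicated = dedup.extract_dedup_data(eps_to_extract=0.01)
```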