Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

Deduplication#

Exact#

class nemo_curator.ExactDuplicates(
logger: logging.LoggerAdapter | str = './',
id_field: str = 'id',
text_field: str = 'text',
hash_method: str = 'md5',
profile_dir: str | None = None,
cache_dir: str | None = None,
)#

Finds exact duplicates in a document corpus.

hash_documents(
df: cudf.Series | pd.Series,
) → cudf.Series | pd.Series#

Computes hashes for a Series containing documents.
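A minimal usage sketch follows. The input path and cache directory are hypothetical, and it assumes NeMo Curator's standard pattern of calling the module on a DocumentDataset to get back the documents flagged as duplicates.

from nemo_curator import ExactDuplicates
from nemo_curator.datasets import DocumentDataset

# Hypothetical input: a JSONL corpus with "id" and "text" fields.
dataset = DocumentDataset.read_json("input_data/", backend="cudf")

exact_dups = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5",
    cache_dir="./exact_dedup_cache",  # hypothetical; stores intermediate hashes
)

# Calling the module on the dataset returns the documents identified as duplicates.
duplicates = exact_dups(dataset)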

Fuzzy#

class nemo_curator.FuzzyDuplicatesConfig(
cache_dir: str,
profile_dir: str | None = None,
id_field: str = 'id',
text_field: str = 'text',
seed: int = 42,
char_ngrams: int = 5,
num_buckets: int = 20,
hashes_per_bucket: int = 13,
use_64_bit_hash: bool = False,
buckets_per_shuffle: int = 1,
false_positive_check: bool = True,
num_anchors: int = 2,
jaccard_threshold: float = 0.8,
bucket_mapping_blocksize: int = 256,
parts_per_worker: int = 1,
bucket_parts_per_worker: int = 8,
)#

Configuration for MinHash-based fuzzy duplicate detection.

Parameters:
  • cache_dir (str) – Location to store deduplication intermediates such as minhashes, buckets, etc.

  • profile_dir (str, Default None) – If specified, directory to write the Dask profile.

  • id_field (str) – Column in the dataset denoting the document ID.

  • text_field (str) – Column in the dataset denoting the document content.

  • seed (int) – Seed for the minhash permutations.

  • char_ngrams (int) – Size of the character n-gram shingles used in minhash computation.

  • num_buckets (int) – Number of bands or buckets to use during Locality Sensitive Hashing.

  • hashes_per_bucket (int) – Number of hashes per bucket/band.

  • use_64_bit_hash (bool) – Whether to use a 32-bit or 64-bit hash function for minhashing.

  • buckets_per_shuffle (int) – Number of bands processed per shuffle. Larger values process larger batches by shuffling multiple bands at once, but might lead to memory pressure and related errors.

  • false_positive_check (bool) – Whether to run a check to look for false positives within buckets. Note: this is a computationally expensive step.

  • num_anchors (int) – Number of documents per bucket to use as reference for computing Jaccard pairs within that bucket to identify false positives.

  • jaccard_threshold (float) – The Jaccard similarity threshold at which a document is considered a near duplicate during false positive evaluations.

class nemo_curator.FuzzyDuplicates(
config: FuzzyDuplicatesConfig,
logger: logging.LoggerAdapter | str = './',
)#
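The sketch below wires the config and the module together. All paths and tuning values are illustrative only, and the call-on-dataset usage assumes NeMo Curator's standard module interface.

from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("input_data/", backend="cudf")

config = FuzzyDuplicatesConfig(
    cache_dir="./fuzzy_dedup_cache",  # required; holds minhashes, buckets, etc.
    id_field="id",
    text_field="text",
    num_buckets=20,
    hashes_per_bucket=13,
    false_positive_check=True,
    jaccard_threshold=0.8,
)

fuzzy_dups = FuzzyDuplicates(config=config)

# Returns the documents identified as near duplicates of one another.
duplicates = fuzzy_dups(dataset)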
class nemo_curator.LSH(
cache_dir: str,
num_hashes: int,
num_buckets: int,
buckets_per_shuffle: int = 1,
logger: logging.LoggerAdapter | str = './',
id_fields: str | list = 'id',
minhash_field: str = '_minhash_signature',
profile_dir: str | None = None,
)#

Performs LSH on MinHash signatures.

bucket_id_to_int(
bucket_ddf: dask_cudf.DataFrame,
bucket_col_name: str = 'bucket_id',
start_id: int = 0,
) → Tuple[dask_cudf.DataFrame, int]#

Maps bucket IDs to a contiguous integer range starting from start_id.

lsh(write_path: str, df: dask_cudf.DataFrame) → None#

Computes buckets and writes them as Parquet files to write_path.
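A hedged sketch of bucket computation: the paths are hypothetical, and LSH is assumed to consume a dataset of minhash signatures such as those produced by MinHash (documented below).

from nemo_curator import LSH, MinHash
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# MinHash (documented below) produces the signatures LSH consumes.
minhashes = MinHash(num_hashes=260)(dataset)

lsh = LSH(
    cache_dir="./lsh_cache",  # hypothetical; bucket outputs are written here
    num_hashes=260,           # must match the MinHash signature length
    num_buckets=20,           # 260 hashes / 20 buckets = 13 hashes per band
    id_fields="id",
    minhash_field="_minhash_signature",
)
buckets = lsh(minhashes)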

class nemo_curator.MinHash(
seed: int = 42,
num_hashes: int = 260,
char_ngrams: int = 5,
use_64bit_hash: bool = False,
logger: logging.LoggerAdapter | str = './',
id_field: str = 'id',
text_field: str = 'text',
profile_dir: str | None = None,
cache_dir: str | None = None,
)#

Computes minhash signatures of a document corpus.

generate_seeds(
n_seeds: int = 260,
seed: int = 0,
) → numpy.ndarray#

Generates seeds for all minhash permutations based on the given seed.

minhash32(
ser: cudf.Series,
seeds: numpy.ndarray,
char_ngram: int,
) → cudf.Series#

Computes 32-bit minhashes based on the MurmurHash3 algorithm.

minhash64(
ser: cudf.Series,
seeds: numpy.ndarray,
char_ngram: int,
) → cudf.Series#

Computes 64-bit minhashes based on the MurmurHash3 algorithm.
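A short sketch of computing signatures on their own; the field names, input path, and in-memory return value are assumptions rather than guarantees.

from nemo_curator import MinHash
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("input_data/", backend="cudf")

minhasher = MinHash(
    seed=42,
    num_hashes=260,
    char_ngrams=5,
    use_64bit_hash=False,  # 32-bit hashes halve signature memory vs. 64-bit
    id_field="id",
    text_field="text",
)

# Assumed output: the ID field plus a '_minhash_signature' column per document.
signatures = minhasher(dataset)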

Semantic#

class nemo_curator.SemDedup(
config: SemDedupConfig,
logger: Logger | str = './',
)#
class nemo_curator.SemDedupConfig(
cache_dir: str,
num_files: int = -1,
id_col_name: str = 'id',
id_col_type: str = 'str',
input_column: str = 'text',
embeddings_save_loc: str = 'embeddings',
embedding_model_name_or_path: str = 'sentence-transformers/all-MiniLM-L6-v2',
embedding_batch_size: int = 128,
embedding_max_mem_gb: int = 25,
clustering_save_loc: str = 'clustering_results',
n_clusters: int = 1000,
seed: int = 1234,
max_iter: int = 100,
kmeans_with_cos_dist: bool = False,
which_to_keep: str = 'hard',
largest_cluster_size_to_process: int = 100000,
sim_metric: str = 'cosine',
eps_thresholds: List[float] = <factory>,
eps_to_extract: float = 0.01,
)#

Configuration for Semantic Deduplication.

cache_dir#

Directory to store cache.

Type:

str

num_files#

Number of input files to process. Default is -1, meaning all files.

Type:

int

id_col_name#

Column name for ID.

Type:

str

id_col_type#

Column type for ID.

Type:

str

input_column#

Input column for embeddings.

Type:

str

embeddings_save_loc#

Location to save embeddings.

Type:

str

embedding_model_name_or_path#

Model name or path for embeddings.

Type:

str

embedding_batch_size#

Initial batch size for processing embeddings.

Type:

int

embedding_max_mem_gb#

Maximum memory in GB for embeddings.

Type:

int

clustering_save_loc#

Location to save clustering results.

Type:

str

n_clusters#

Number of clusters.

Type:

int

seed#

Seed for clustering.

Type:

int

max_iter#

Maximum iterations for clustering.

Type:

int

kmeans_with_cos_dist#

Use KMeans with cosine distance.

Type:

bool

which_to_keep#

Which duplicates to keep.

Type:

str

largest_cluster_size_to_process#

Largest cluster size to process.

Type:

int

sim_metric#

Similarity metric for deduplication.

Type:

str

eps_thresholds#

Epsilon thresholds used to decide whether documents are semantically similar.

Type:

List[float]

eps_to_extract#

Epsilon value to extract deduplicated data.

Type:

float
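Taken together, SemDedupConfig drives the high-level SemDedup module documented above. The sketch below uses hypothetical paths and illustrative values, and assumes SemDedup is called directly on a DocumentDataset.

from nemo_curator import SemDedup, SemDedupConfig
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("input_data/", backend="cudf")

config = SemDedupConfig(
    cache_dir="./semdedup_cache",  # embeddings and clustering results land here
    id_col_name="id",
    input_column="text",
    embedding_model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
    n_clusters=1000,
    which_to_keep="hard",
    eps_to_extract=0.01,  # smaller epsilon removes only very close duplicates
)

sem_dedup = SemDedup(config=config)
deduplicated = sem_dedup(dataset)  # dataset containing deduplicated documents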

class nemo_curator.EmbeddingCreator(
embedding_model_name_or_path: str,
embedding_max_mem_gb: int,
embedding_batch_size: int,
embedding_output_dir: str,
input_column: str = 'text',
write_embeddings_to_disk: bool = True,
write_to_filename: bool = False,
logger: Logger | str = './',
)#
class nemo_curator.ClusteringModel(
id_col: str,
max_iter: int,
n_clusters: int,
clustering_output_dir: str,
sim_metric: str = 'cosine',
which_to_keep: str = 'hard',
sort_clusters: bool = True,
kmeans_with_cos_dist: bool = False,
partition_size: str = '2gb',
logger: Logger | str = './',
)#
class nemo_curator.SemanticClusterLevelDedup(
n_clusters: int,
emb_by_clust_dir: str,
sorted_clusters_dir: str,
id_col: str,
id_col_type: str,
which_to_keep: str,
output_dir: str,
logger: Logger | str = './',
)#
compute_semantic_match_dfs(
eps_list: List[float] | None = None,
) → None#

Computes semantic match dataframes for clusters.

Parameters:

eps_list (Optional[List[float]]) – List of epsilon thresholds at which to compute semantic matches.

extract_dedup_data(
eps_to_extract: float,
) → DocumentDataset#

Extracts deduplicated data based on the given epsilon value.

Parameters:

eps_to_extract (float) – Epsilon threshold for extracting deduplicated data.

Returns:

Dataset containing deduplicated documents.

Return type:

DocumentDataset
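The three lower-level components (EmbeddingCreator, ClusteringModel, SemanticClusterLevelDedup) can also be staged by hand when you need control over each step. The sketch below is assumption-laden: every path is a placeholder, and the callable-module convention for EmbeddingCreator and ClusteringModel is inferred from the rest of the library rather than guaranteed.

from nemo_curator import ClusteringModel, EmbeddingCreator, SemanticClusterLevelDedup
from nemo_curator.datasets import DocumentDataset

dataset = DocumentDataset.read_json("input_data/", backend="cudf")

# 1. Embed documents.
embeddings = EmbeddingCreator(
    embedding_model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
    embedding_max_mem_gb=25,
    embedding_batch_size=128,
    embedding_output_dir="./semdedup/embeddings",  # placeholder path
    input_column="text",
)(dataset)

# 2. Cluster the embeddings with k-means; sorted clusters are written to disk.
clustered = ClusteringModel(
    id_col="id",
    max_iter=100,
    n_clusters=1000,
    clustering_output_dir="./semdedup/clustering_results",  # placeholder path
)(embeddings)

# 3. Compute within-cluster similarities, then extract at a chosen epsilon.
#    Both directory arguments below are placeholders for the clustering outputs.
cluster_dedup = SemanticClusterLevelDedup(
    n_clusters=1000,
    emb_by_clust_dir="./semdedup/clustering_results/embs_by_nearest_center",
    sorted_clusters_dir="./semdedup/clustering_results/sorted",
    id_col="id",
    id_col_type="str",
    which_to_keep="hard",
    output_dir="./semdedup/extracted",
)
cluster_dedup.compute_semantic_match_dfs(eps_list=[0.01, 0.001])
deduplicated = cluster_dedup.extract_dedup_data(eps_to_extract=0.01)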