GPU Accelerated Exact and Fuzzy Deduplication#
Background#
The exact and fuzzy document-level deduplication modules in NeMo Curator aim to reduce the occurrence of duplicate and near-duplicate documents in a dataset. Both functionalities are supported in NeMo Curator and accelerated using RAPIDS.
The main motivation is that training language models for many epochs on randomly selected documents can be sub-optimal for downstream performance. For more information on when this is harmful, please see Muennighoff et al., 2023 and Tirumala et al., 2023.
Exact Deduplication#
Exact deduplication refers to removing identical documents (i.e., document strings that are equal) from the dataset.
Because exact deduplication requires significantly less compute, we typically run it before fuzzy deduplication. In our experience deduplicating Common Crawl snapshots, a significant portion (as high as ~40%) of the duplicates can be exact duplicates.
How It Works#
Exact deduplication works by hashing each document and keeping only one document per hash. Exact deduplication runs on both CPU- and GPU-based backends.
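As a rough illustration of the idea only (not the NeMo Curator implementation), hash-based exact deduplication can be sketched with pandas; the toy data and the choice of MD5 below are assumptions for the example:
import hashlib
import pandas as pd
# Toy dataset using the same field names as the examples below.
df = pd.DataFrame(
    {
        "my_id": ["doc_prefix-0", "doc_prefix-1", "doc_prefix-2"],
        "text": ["hello world", "hello world", "something else"],
    }
)
# Hash every document string, then keep only one document per hash value.
df["_hashes"] = df["text"].map(lambda t: hashlib.md5(t.encode("utf-8")).hexdigest())
deduplicated = df.drop_duplicates(subset="_hashes", keep="first")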
Usage#
Python API#
Note
Before running exact deduplication, you need to ensure that the dataset contains a unique ID for each document.
If needed, you can use the add_id
module within NeMo Curator to accomplish this.
from nemo_curator import AddId
from nemo_curator.datasets import DocumentDataset
add_id = AddId(id_field="my_id", id_prefix="doc_prefix")
dataset = DocumentDataset.read_json("input_file_path")
id_dataset = add_id(dataset)
id_dataset.to_parquet("/path/to/parquet/data")
After ensuring your dataset has a unique ID field (or creating one with the code above), you can perform exact deduplication as follows:
from nemo_curator import ExactDuplicates
from nemo_curator.datasets import DocumentDataset
# Initialize the deduplication object
exact_duplicates = ExactDuplicates(
id_field="my_id",
text_field="text",
perform_removal=True,
cache_dir="/path/to/dedup_outputs", # Recommended to specify a cache_dir if perform_removal=True
)
dataset = DocumentDataset.read_parquet(
input_files="/path/to/parquet/data",
backend="cudf", # or "pandas" for CPU
)
# Users who have specified perform_removal=False can identify and remove duplicates in two steps as follows
duplicate_docs = exact_duplicates.identify_duplicates(dataset)
"""
Sample output:
my_id _hashes
22 doc_prefix-37820 e7cb1e88a7a30ea101d33e0c4c8857ef
70 doc_prefix-56261 bfce4501b9caa93cb3daccd6db1f13af
75 doc_prefix-56271 bfce4501b9caa93cb3daccd6db1f13af
84 doc_prefix-52261 0f763a2937d57b9d96bf9f220e55f2bd
107 doc_prefix-52271 0f763a2937d57b9d96bf9f220e55f2bd
"""
deduplicated_dataset = exact_duplicates.remove(dataset, duplicate_docs)
# Users who have specified perform_removal=True can get the output deduplicated dataset directly as follows
# deduplicated_dataset = exact_duplicates(dataset)
Tip
A more comprehensive example can be found in examples/exact_deduplication.py.
CLI Utility#
Assuming that a unique ID has been added to each document, users can proceed with finding exact duplicates as follows:
- Find Exact Duplicates
Input: Data directories
Output: _exact_duplicates.parquet. A list of exact duplicates and the document hash.
# same as `python nemo_curator/scripts/find_exact_duplicates.py`
gpu_exact_dups \
--input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
--output-dir /path/to/output_dir \
--input-json-text-field text_column_name \
--input-json-id-field id_column_name \
--log-dir ./
# --scheduler-file /path/to/file.json
All CLI scripts are included in the nemo_curator/scripts/
subdirectory.
Caution
The CLI utilities are limited to JSONL datasets and only work with GPU-based backends. For different dataset formats or backends, use the Python API.
Fuzzy Deduplication#
To remove near-duplicates from the corpus, we perform fuzzy deduplication at the document level, removing documents with high Jaccard similarity scores. Our approach closely resembles the one described in Smith et al., 2020.
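For reference, the Jaccard similarity between two documents is the size of the intersection of their n-gram sets divided by the size of their union. The sketch below uses character n-grams; the n-gram size of 24 mirrors the char_ngrams default shown later and is otherwise just an illustrative choice:
def char_ngrams(text: str, n: int = 24) -> set:
    # Set of character n-grams in a document (the whole text if shorter than n).
    return {text[i : i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard_similarity(doc_a: str, doc_b: str, n: int = 24) -> float:
    # Jaccard similarity = |intersection| / |union| of the two n-gram sets.
    set_a, set_b = char_ngrams(doc_a, n), char_ngrams(doc_b, n)
    return len(set_a & set_b) / len(set_a | set_b)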
How It Works#
This approach can essentially be split into the following stages:
Compute Minhashes: The first stage involves computing MinHash signatures on documents. NeMo Curator currently only supports character-based n-grams for MinHashing. Users more familiar with word-based n-grams can approximate the character n-gram size using an average of ~4.5 characters per word.
LSH (Locality Sensitive Hashing): Perform LSH to find candidate duplicates.
Buckets to Edgelist: If not using the false positive check, we directly convert the LSH buckets to edges for the connected components computation.
False Positive Check (optional alternative to Buckets to Edgelist): Due to the approximate nature of the bucketing via MinHash + LSH (Leskovec et al., 2020), NeMo Curator provides the option to further process each bucket by computing pairwise Jaccard similarity scores between documents in the bucket and filtering out false positives that might have been hashed into the same bucket.
Jaccard Map Buckets: Since buckets generated by LSH can have high cardinality, we map multiple LSH buckets to larger batches for efficient processing. Additionally, we assign a few documents (controlled via num_anchor_docs) in each bucket to be candidate documents for pairwise Jaccard similarity computations within that bucket.
Jaccard Shuffle: Store documents from the original dataset into new directories and files such that all documents in the same batch (bucket) are stored together. This allows parallelizing pairwise Jaccard similarity computations across different buckets.
Jaccard Compute: Compute Jaccard similarity scores between all documents in each bucket and that bucket's candidate anchor documents.
Connected Components: Due to the approximate nature of LSH, documents that are near duplicates may be assigned to different buckets, with a few overlapping documents between these buckets. We use a GPU-accelerated connected components algorithm to find all connected components in the graph formed by the edges between documents in the same bucket.
The result from the connected components step is a list of document IDs and the group they belong to. All documents in the same group are considered near duplicates. These results can be used to remove the near duplicates from the corpus.
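The following is a minimal, library-free sketch of the MinHash signature and LSH banding idea behind the first two stages. It illustrates the technique only; the hash function, parameter names, and data layout are assumptions and do not reflect the GPU implementation:
import hashlib
from collections import defaultdict

def minhash_signature(ngrams: set, num_hashes: int) -> list:
    # One MinHash value per seeded hash function: the minimum hash over all n-grams.
    return [
        min(int(hashlib.md5(f"{seed}:{gram}".encode()).hexdigest(), 16) for gram in ngrams)
        for seed in range(num_hashes)
    ]

def lsh_candidate_buckets(signatures: dict, num_bands: int, hashes_per_band: int) -> dict:
    # Split each signature into bands; documents that agree on every hash within
    # any band land in the same bucket and become candidate duplicates.
    buckets = defaultdict(list)
    for doc_id, signature in signatures.items():
        for band in range(num_bands):
            band_values = tuple(signature[band * hashes_per_band : (band + 1) * hashes_per_band])
            buckets[(band, band_values)].append(doc_id)
    return buckets
Documents that share a bucket form the edges that the connected components stage later groups into sets of near duplicates.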
Usage#
Python API#
Note
Before running fuzzy deduplication, you need to ensure that the dataset contains a unique ID for each document.
If needed, you can use the add_id
module within NeMo Curator to accomplish this.
from nemo_curator import AddId
from nemo_curator.datasets import DocumentDataset
add_id = AddId(id_field="my_id", id_prefix="doc_prefix")
dataset = DocumentDataset.read_json("input_file_path")
id_dataset = add_id(dataset)
id_dataset.to_json("/path/to/jsonl/data")
Configuration
Using the API Directly
from nemo_curator import FuzzyDuplicatesConfig

config = FuzzyDuplicatesConfig(
    cache_dir="/path/to/dedup_outputs",  # must be cleared between runs
    id_field="my_id",
    text_field="text",
    perform_removal=False,  # dictates if deduplicated dataset or IDs of duplicates are returned
    seed=42,
    char_ngrams=24,
    num_buckets=20,
    hashes_per_bucket=13,
    use_64_bit_hash=False,
    buckets_per_shuffle=2,
    false_positive_check=False,
)
Using a YAML file
cache_dir: /path/to/dedup_outputs
id_field: my_id
text_field: text
perform_removal: False
seed: 42
char_ngrams: 24
num_buckets: 20
hashes_per_bucket: 13
use_64_bit_hash: False
buckets_per_shuffle: 2
false_positive_check: False

from nemo_curator import FuzzyDuplicatesConfig

config = FuzzyDuplicatesConfig.from_yaml("/path/to/config.yaml")
Usage Post Configuration
from nemo_curator import FuzzyDuplicates
from nemo_curator.datasets import DocumentDataset
# Initialize the deduplication object
fuzzy_duplicates = FuzzyDuplicates(config=config, logger="./")
dataset = DocumentDataset.read_json(
input_files="/path/to/jsonl/data",
backend="cudf", # FuzzyDuplicates only supports datasets with the cuDF backend.
)
# Users who have specified perform_removal=False can identify and remove duplicates in two steps as follows
duplicate_docs = fuzzy_duplicates.identify_duplicates(dataset)
"""
Sample output:
my_id group
0 doc_prefix-56151 32
1 doc_prefix-47071 590
2 doc_prefix-06840 305
3 doc_prefix-20910 305
4 doc_prefix-42050 154
"""
deduplicated_dataset = fuzzy_duplicates.remove(dataset, duplicate_docs)
# Users who have specified perform_removal=True can get the output deduplicated dataset directly as follows
# deduplicated_dataset = fuzzy_duplicates(dataset)
Tip
A comprehensive example can be found in examples/fuzzy_deduplication.py.
The default values of num_buckets and hashes_per_bucket are set to find documents with an approximate Jaccard similarity of 0.8 or above (see the sketch after these tips).
Higher buckets_per_shuffle values can lead to better performance but might lead to out-of-memory errors.
Setting the false_positive_check flag to False is ideal for optimal performance.
When setting the false_positive_check flag to True, ensure cache_dir is emptied between runs to avoid data from previous runs interfering with the current run's results.
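To see why these defaults target roughly 0.8, the standard LSH banding analysis (Leskovec et al., 2020) estimates the probability that two documents with Jaccard similarity s share at least one bucket as 1 - (1 - s^r)^b, where b is num_buckets and r is hashes_per_bucket. A small sketch under that assumption:
def candidate_probability(s: float, num_buckets: int = 20, hashes_per_bucket: int = 13) -> float:
    # Probability that two documents with Jaccard similarity s land in at least one common bucket.
    return 1.0 - (1.0 - s**hashes_per_bucket) ** num_buckets

# Approximate similarity where the curve rises steeply: (1/b)^(1/r) ~= 0.79
threshold = (1 / 20) ** (1 / 13)

for s in (0.7, 0.8, 0.9):
    print(f"s={s}: candidate probability ~= {candidate_probability(s):.2f}")
# s=0.7 -> ~0.18, s=0.8 -> ~0.68, s=0.9 -> ~1.00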
CLI Utility#
Caution
Fuzzy deduplication CLI scripts only work with the specific ID format generated by the add_id script. If the dataset does not contain IDs in this format, it is recommended to create them with the add_id script as follows:
add_id \
--id-field-name="my_id" \
--input-data-dir=<Path to directory containing jsonl files> \
--id-prefix="doc_prefix" \
--log-dir=./log/add_id
This will create a new field named my_id within each JSON document, which will have the form "doc_prefix-000001". If the dataset already has a unique ID, this step can be skipped.
Once a unique ID has been added to each document, users can proceed with fuzzy deduplication, which roughly requires the following steps (all scripts are included in the nemo_curator/scripts/fuzzy_deduplication subdirectory):
Compute Minhashes
Input: Data directories
Output: minhashes.parquet for each data directory
Example call:
# same as `python compute_minhashes.py`
gpu_compute_minhashes \
--input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
--output-minhash-dir /path/to/output_minhashes \
--input-json-text-field text_column_name \
--input-json-id-field id_column_name \
--minhash-length number_of_hashes \
--char-ngram char_ngram_size \
--hash-bytes 4 `#or 8 byte hashes` \
--seed 42 \
--log-dir ./
# --scheduler-file /path/to/file.json
Buckets (Minhash Buckets)
Input: Minhash directories
Output: _buckets.parquet
Example call:
# same as `python minhash_lsh.py`
minhash_buckets \
--input-data-dirs /path/to/output_minhashes/dir1 /path/to/output_minhashes/dir2 \
--output-bucket-dir /path/to/dedup_output \
--input-minhash-field _minhash_signature \
--input-json-id-field id_column_name \
--minhash-length number_of_hashes \
--num-bands num_bands \
--buckets-per-shuffle 1 `#Value between [1-num_bands]. Higher is better but might lead to OOM` \
--log-dir ./
# --false-positive-check `#Writes bucket ID's in a format required for the false positive check`
# --scheduler-file /path/to/file.json
False Positive Check (optional): If skipping this step, proceed to the "Skipping the false positive check" section below.
Jaccard Map Buckets
Input: _buckets.parquet and data directories
Output: anchor_docs_with_bk.parquet
Example call:
# same as `python map_buckets.py`
jaccard_map_buckets \
--input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
--input-bucket-dir /path/to/dedup_output/_buckets.parquet \
--output-dir /path/to/dedup_output \
--input-json-text-field text_column_name \
--input-json-id-field id_column_name
# --scheduler-file /path/to/file.json
Jaccard Shuffle
Input: anchor_docs_with_bk.parquet and data directories
Output: shuffled_docs.parquet
Example call:
# same as `python jaccard_shuffle.py`
jaccard_shuffle \
--input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
--input-bucket-mapping-dir /path/to/dedup_output/anchor_docs_with_bk.parquet \
--output-dir /path/to/dedup_output \
--input-json-text-field text_column_name \
--input-json-id-field id_column_name
# --scheduler-file /path/to/file.json
Jaccard Compute
Input: shuffled_docs.parquet
Output: jaccard_similarity_results.parquet
Example call:
# same as `python jaccard_compute.py`
jaccard_compute \
--shuffled-docs-path /path/to/dedup_output/shuffled_docs.parquet \
--output-dir /path/to/dedup_output \
--ngram-size char_ngram_size_for_similarity \
--input-json-id-field id_column_name
# --scheduler-file /path/to/file.json
Skipping the false positive check (more performant). This step is not needed if the false positive check was performed.
Buckets to Edgelist
Input: _buckets.parquet
Output: _edges.parquet
Example call:
# same as `python buckets_to_edges.py`
buckets_to_edges \
--input-bucket-dir /path/to/dedup_output/_buckets.parquet \
--output-dir /path/to/dedup_output \
--input-json-id-field id_column_name
# --scheduler-file /path/to/file.json
Connected Components
Input: jaccard_similarity_results.parquet (if you ran the false positive check) or _edges.parquet (if you skipped the false positive check)
Output: connected_components.parquet
Example call:
# same as `python connected_components.py`
gpu_connected_component \
--jaccard-pairs-path /path/to/dedup_output/jaccard_similarity_results.parquet `#Or /path/to/dedup_output/_edges.parquet` \
--output-dir /path/to/dedup_output \
--cache-dir /path/to/cc_cache \
--jaccard-threshold 0.8 \
--input-json-id-field id_column_name
# --scheduler-file /path/to/file.json
Caution
The CLI utilities are limited to JSONL datasets and only work with specific ID formats. For different dataset or ID formats, use the Python API.
Incremental Fuzzy Deduplication#
If any new data is added to the corpus, you will need to perform deduplication incrementally. To perform fuzzy deduplication incrementally, you do not need to recompute minhashes for datasets whose minhashes were already computed. Instead, organize your incremental datasets into separate directories and pass a list of all new directories to gpu_compute_minhashes.
Input (assuming incremental snapshots are all under /input/):
/input/cc-2020-40
/input/cc-2021-42
/input/cc-2022-60
Output (assuming --output-minhash-dir=/output):
/output/cc-2020-40/minhashes.parquet
/output/cc-2021-42/minhashes.parquet
/output/cc-2022-60/minhashes.parquet
Example call:
# same as `python compute_minhashes.py`
gpu_compute_minhashes \
--input-data-dirs /input/cc-2020-40 /input/cc-2021-42 /input/cc-2022-60 \
--output-minhash-dir /output/ \
--input-json-text-field text_column_name \
--input-json-id-field id_column_name \
--minhash-length number_of_hashes \
--char-ngram char_ngram_size \
--hash-bytes 4 `#or 8 byte hashes` \
--seed 42 \
--log-dir ./
# --scheduler-file /path/to/file.json
All subsequent steps, starting with Buckets, can be executed on all the data (old and new) as described above without modification.