Within the NeMo Data Curator, we have both exact and fuzzy deduplication modules for removing duplicate and near-duplicate documents from NLP corpora. As exact deduplication is a much less involved procedure and requires significantly less compute, we typically run it before fuzzy deduplication. Also, from our experience in deduplicating Common Crawl snapshots, a significant portion of the duplicates are in fact exact duplicates.
In both our exact and fuzzy deduplication implementations, we rely on a redis-cluster as a distributed key-value store. In both cases, the keys are hashes computed from the documents, and the values are document ids unique to each document.
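As a minimal illustration of this key-value layout (a sketch only: the host address, hash function, and identifiers below are placeholders and a single-node connection stands in for a full redis-cluster, so this is not the modules' actual implementation):

# Illustrative sketch of the hash -> document id layout, not the NeMo Data Curator implementation.
import hashlib

import redis

client = redis.Redis(host="127.0.0.1", port=6379)  # placeholder: a node in the redis-cluster

def push_document(doc_id: str, text: str) -> None:
    # Key: hash of the document text; value: the set of document ids sharing that hash.
    key = hashlib.md5(text.encode("utf-8")).hexdigest()
    client.sadd(key, doc_id)

push_document("doc_id-000001", "the quick brown fox")
push_document("doc_id-000002", "the quick brown fox")  # exact duplicate lands in the same bucket

Exact deduplication keys on a hash of the full document text, while fuzzy deduplication keys on banded MinHash signatures, as described in the sections below.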
For exact deduplication, we implement the following steps in order to remove exact duplicates from the corpus:
1. Assign a unique document ID to each document in the corpus. This can be accomplished using the add_id module within the NeMo Data Curator:

   add_id \
     --input-data-dir=<Path to directory containing jsonl files> \
     --log-dir=./log/add_id

   By default, this will create a new field named adlr_id within each json document which will have the form “doc_id-000001”.

2. Compute the hash of each document in the corpus and push the hash-document_id key-value pair to a redis-cluster:
   hash_documents \
     --input-data-dir=<Path to directory containing jsonl files> \
     --redis-host-ip-addr=<IP address of single node in redis-cluster> \
     --use-md5-hash \
     --output-redis-hash-names=./data/redis_hash_names.txt \
     --log-dir=./log/hash_docs
   This script will compute the hash of each document in the corpus and form a key-value pair of the hash and the assigned document id. Once this key-value pair has been computed, it is pushed to and stored on the redis-cluster. This ensures that documents with the same hash have their ids mapped into the same bucket. As is apparent from the --redis-host-ip-addr argument, this script requires that the redis-cluster be running prior to this command. We provide the script examples/start_redis_cluster.sh, as well as examples of its use within the examples/exact_deduplication.sh and examples/fuzzy_deduplication.sh scripts.

3. Once all documents have been hashed and the buckets formed, pull the buckets from the redis-cluster and form a list of duplicate document ids:
   pull_duplicate_ids \
     --input-redis-hash-names=<Input redis-hash-names> \
     --redis-host-ip-addr=<IP address of single node in redis-cluster> \
     --output-bucket-file-dir=<Output directory containing jsonl files of buckets> \
     --output-duplicate-doc-ids=<Output list of duplicate ids> \
     --log-dir=./log/pull_duplicate_ids
   The output list of ids is written to the file specified by --output-duplicate-doc-ids. Additionally, the script will write out the formed buckets if a path is provided to the --output-bucket-file-dir argument.

4. With the id list, construct a filter from the list of ids in order to remove the duplicate documents from the corpus. The filter ndc.deduplication.filter.DuplicateFilter reads in a list of ids and, for each document that it processes, checks if the document's id exists within its list. If the id is found within the list, the document is discarded; otherwise, it is retained. The configuration file config/duplicate_filter.yaml provides a default configuration for using this filter. With this configuration file, the filter_documents utility can be run as:

   filter_documents \
     --input-data-dir=<Path to original data containing duplicates> \
     --input-json-field="adlr_id" \
     --filter-config-file=./config/exact_duplicate_filter.yaml \
     --output-retained-document-dir=<Output directory containing deduped data> \
     --output-removed-document-dir=<Output directory containing exact duplicates> \
     --log-dir=./log/remove_duplicates
   Note that the --input-json-field="adlr_id" argument indicates that, instead of filtering on the “text” field, we are filtering on the document ids. For more information about filtering documents with the filter_documents utility, please see the documentation available within 1_document_filtering.rst. A conceptual sketch of the pull and filter steps is shown below this list.
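As a conceptual sketch of steps 3 and 4 (the file names, host address, and logic below are placeholders, not how pull_duplicate_ids and filter_documents are implemented; it assumes the hash-to-id sets sketched earlier have already been pushed to the cluster):

# Pull each bucket (the set of doc ids sharing a hash), keep one id per bucket,
# and treat the rest as duplicates; then drop those ids from the corpus.
import json

import redis

client = redis.Redis(host="127.0.0.1", port=6379)  # placeholder host

duplicate_ids = set()
for key in client.scan_iter():
    ids = sorted(member.decode("utf-8") for member in client.smembers(key))
    duplicate_ids.update(ids[1:])  # retain the first id in each bucket, discard the others

# Filter the corpus: drop any document whose id appears in the duplicate list.
with open("corpus.jsonl") as fin, open("deduped.jsonl", "w") as fout:  # placeholder paths
    for line in fin:
        document = json.loads(line)
        if document["adlr_id"] not in duplicate_ids:
            fout.write(line)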
When removing near-duplicates within the corpus, we perform fuzzy deduplication at the document level in order to remove documents that
have high Jaccard similarity. Our approach closely resembles the approach described in Smith et al., 2020. This
approach can essentially be split into two stages. The first stage closely resembles the exact deduplication procedure in that it involves computing
the hash of each document, pushing the hashes to a redis-cluster to form buckets of similar documents, and finally pulling the buckets from the
redis-cluster and saving them to disk. The hashing and bucketing steps of this stage use min-wise hashing and locality-sensitive hashing, and thus this stage is often
referred to as MinHashLSH. To carry out this MinHashLSH stage and form these buckets, we provide the utilities compute_minhashes
and lsh, which can be run as follows:
compute_minhashes \
--input-data-dir=<Input directory containing jsonl files> \
--minhash-length=<Number of minhashes to compute per document> \
--output-minhash-dir=<Output directory of computed minhashes> \
--output-doc-id-file=./data/all_doc_ids.txt \
--log-dir=./log/minhashes
lsh \
--input-minhash-dir=<Directory containing computed minhashes> \
--total-num-bands=<Number of bands used to divide up minhash signatures> \
--total-num-documents=<Total number of documents in the corpus> \
--redis-host-ip-addr=<IP address of host in redis-cluster> \
--output-redis-hash-names=./doc/redis_hashes.txt \
--output-bucket-file-dir=<Output directory containing buckets> \
--log-dir=./log/lsh
The first command, compute_minhashes, computes a MinHash signature for each document and writes these computed signatures and the document id of the
associated document to disk in the form of pickle files. Additionally, it writes out all of the ids processed to the file specified by
--output-doc-id-file. Each MinHash signature is an array of 64-bit integers, and the length of this array is determined by the value provided
to the --minhash-length argument. The choice of minhash length depends on the dataset as well as the desired accuracy when forming buckets
of similar documents (longer minhash signatures will tend to lead to fewer false positives and negatives during the locality-sensitive hashing stage).
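For intuition, here is a minimal from-scratch sketch of a MinHash signature (the word-level shingling, the md5-based hash family, and the default length are illustrative assumptions, not the scheme used by compute_minhashes):

# Minimal MinHash sketch. The signature length plays the role of --minhash-length.
import hashlib

def shingles(text, n=5):
    # Word-level n-gram shingles (placeholder choice of granularity).
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, length=100):
    signature = []
    for seed in range(length):
        # One hash function per seed; keep the minimum 64-bit value over all shingles.
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode("utf-8")).digest()[:8], "little")
            for s in shingles(text)
        ))
    return signature

signature = minhash_signature("the quick brown fox jumps over the lazy dog")

Two documents that share most of their shingles will agree on most positions of their signatures; the banding step described next exploits this to group likely near-duplicates.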
Once the minhashes have been written to disk, the lsh utility reads in the minhashes, divides each signature into bands, forms a key-value
pair between each banded minhash signature and the corresponding document id, and pushes these key-value pairs to the redis-cluster. Once all key-value pairs
have been pushed and the buckets formed, lsh will pull the formed buckets from the redis-cluster and write them to the directory provided to the --output-bucket-file-dir
argument.
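A rough sketch of the banding idea follows (the bucket-key format, band count, and in-memory dictionary are placeholders; the lsh utility pushes its keys to the redis-cluster rather than holding buckets locally):

# Split each signature into bands; documents that agree on every value within any
# one band receive the same bucket key and therefore land in the same bucket.
import hashlib
from collections import defaultdict

def band_keys(signature, num_bands):
    rows = len(signature) // num_bands
    for b in range(num_bands):
        band = tuple(signature[b * rows:(b + 1) * rows])
        digest = hashlib.md5(repr(band).encode("utf-8")).hexdigest()
        yield f"band-{b}:{digest}"  # placeholder key format

signatures = {  # placeholder signatures of length 100
    "doc_id-000001": list(range(100)),
    "doc_id-000002": list(range(100)),
}
buckets = defaultdict(set)
for doc_id, signature in signatures.items():
    for key in band_keys(signature, num_bands=10):
        buckets[key].add(doc_id)

Using more, shorter bands makes it more likely that near-duplicates share at least one bucket, at the cost of more candidate pairs to verify later.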
Depending on the length of the minhash signature as well as the number of bands, this step can require significant amounts of memory. In the figure
shown below, we provide two scaling curves that show how much memory is required in the redis-cluster for dataset sizes ranging from 0.25-1.0 TB and
for MinHash lengths of 100 with 10 bands and 260 with 20 bands. Each of these experiments was run using a 30-node CPU cluster with hardware similar to the c5.24xlarge Amazon AWS C5 instance.
[Figure: redis-cluster memory usage vs. dataset size (0.25-1.0 TB) for MinHash lengths of 100 with 10 bands and 260 with 20 bands]
If you find that you are running out of memory when running lsh to form the buckets on the redis-cluster,
we provide the additional arguments --min-band-index and --max-band-index that will only
form key-value pairs within a range of bands (e.g., if users desire to push only a single band,
they can set --min-band-index and --max-band-index to the same value). This has the potential to reduce the memory
usage by a factor of the number of bands. Assuming the same memory scaling curve as shown in the figure above,
it would then require only 100 GB of memory to run lsh on a 1 TB dataset. Additionally, if users desire to push all bands
simultaneously, additional nodes can be added to the redis-cluster in order to achieve a greater distributed memory capacity.
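As a rough sketch of why restricting the band range bounds memory (the key format, host, band counts, and signatures below are placeholders, not the lsh utility's internals), only the keys for the selected bands exist in the cluster during a given run:

# Push only a chosen range of bands per run, so the cluster holds roughly
# (max_band - min_band + 1) / num_bands of the full key set at any one time.
import hashlib

import redis

client = redis.Redis(host="127.0.0.1", port=6379)  # placeholder host

num_bands, rows_per_band = 10, 10
min_band, max_band = 0, 0  # analogous in spirit to --min-band-index / --max-band-index
signatures = {  # placeholder signatures of length 100
    "doc_id-000001": list(range(100)),
    "doc_id-000002": list(range(100)),
}

for band in range(min_band, max_band + 1):
    for doc_id, signature in signatures.items():
        chunk = tuple(signature[band * rows_per_band:(band + 1) * rows_per_band])
        key = f"band-{band}:" + hashlib.md5(repr(chunk).encode("utf-8")).hexdigest()
        client.sadd(key, doc_id)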
Once the buckets have been written to disk after running lsh, we process each of the buckets to remove any potential
false positives that may have been hashed into them, due to the approximate nature of the bucketing via MinHashLSH
(Leskovec et al., 2020). When processing the buckets, in order to avoid loading all documents into memory and
repeatedly searching for documents in the .jsonl files, each MPI rank will read all candidate duplicate documents in the dataset in parallel
and will cache them to a persistent dictionary (this is accomplished using both the add_file_paths_to_buckets
and cache_data_to_dict scripts). Once all required documents have been cached,
using the compute_jaccard_pairs and convert_pairs_to_matrices utilities, Jaccard similarities will be computed amongst the documents
within the buckets, and these similarities will be stored in a sparse adjacency matrix. Finally, using the sparse graph tools available within scipy,
the script compute_connected_components finds the connected components of this adjacency matrix,
selects a single document from each connected component as the deduplicated document, and labels all others as duplicates. Additionally, it writes out a file
containing a list of duplicate ids that can be used to remove duplicate documents via the filter_documents utility (as was done
when removing exact duplicates). For a detailed example of how this is all carried out, please try out the examples/fuzzy_deduplication.sh
script.
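To illustrate this final verification step, here is a minimal sketch of computing pairwise Jaccard similarities, building a sparse adjacency matrix, and keeping one representative per connected component (the documents, the word-level token sets, and the 0.75 threshold are placeholder choices, not the settings used by the compute_jaccard_pairs, convert_pairs_to_matrices, and compute_connected_components utilities):

# Sketch: exact Jaccard similarities between candidate documents, a sparse adjacency
# matrix over pairs above a similarity threshold, and scipy connected components to
# pick one retained document per group.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

docs = {
    "doc_id-000001": "the quick brown fox jumps over the lazy dog",
    "doc_id-000002": "the quick brown fox leaps over the lazy dog",
    "doc_id-000003": "an entirely different document about something else",
}

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

ids = list(docs)
rows, cols = [], []
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        if jaccard(docs[ids[i]], docs[ids[j]]) > 0.75:  # placeholder threshold
            rows += [i, j]
            cols += [j, i]
adjacency = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(ids), len(ids)))

# Keep one document per connected component; label the rest as duplicates.
n_components, labels = connected_components(adjacency, directed=False)
keep = {labels.tolist().index(component) for component in range(n_components)}
duplicates = [ids[i] for i in range(len(ids)) if i not in keep]
print(duplicates)  # ['doc_id-000002']

In the actual pipeline this is performed per bucket and at much larger scale, and the resulting duplicate id list feeds back into the filter_documents utility as described above.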