Exact and fuzzy deduplication

The NeMo Data Curator provides both exact and fuzzy deduplication modules for removing duplicate and near-duplicate documents from NLP corpora. Because exact deduplication is a far less involved procedure and requires significantly less compute, we typically run it before fuzzy deduplication. Moreover, in our experience deduplicating Common Crawl snapshots, a significant portion of the duplicates are in fact exact duplicates.

Both our exact and fuzzy deduplication implementations rely on a redis-cluster as a distributed key-value store. In both cases, the keys are hashes computed from the documents and the values are ids unique to each document.
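
As a rough sketch of this key-value pattern (not the Data Curator implementation itself), the snippet below computes an MD5 hash of each document's text and adds the document's id to the set stored under that hash. It assumes the redis-py client, a reachable redis node, and jsonl records that already carry text and adlr_id fields (the id assigned in step 1 below); the helper name is hypothetical:

    import hashlib
    import json

    import redis  # assumes the redis-py client; a real deployment targets a redis-cluster

    def push_document_hashes(jsonl_path, host):
        """Hash each document's text and map that hash to the document's id in redis."""
        client = redis.Redis(host=host, port=6379)
        with open(jsonl_path) as f:
            for line in f:
                doc = json.loads(line)
                key = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
                # Documents with identical text share a key, forming a bucket of ids.
                client.sadd(key, doc["adlr_id"])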

For exact deduplication, we implement the following steps in order to remove exact duplicates from the corpus:

  1. Assign a unique document ID to each document in the corpus. This can be accomplished using the add_id module within the NeMo Data Curator:

    add_id \
      --input-data-dir=<Path to directory containing jsonl files> \
      --log-dir=./log/add_id

    By default, this creates a new field named adlr_id within each JSON document, with values of the form “doc_id-000001”.

  2. Compute the hash of each document in the corpus and push the hash-document_id key-value pair to a redis-cluster:

    hash_documents \
      --input-data-dir=<Path to directory containing jsonl files> \
      --redis-host-ip-addr=<IP address of single node in redis-cluster> \
      --use-md5-hash \
      --output-redis-hash-names=./data/redis_hash_names.txt \
      --log-dir=./log/hash_docs

    This script computes the hash of each document in the corpus and forms a key-value pair of the hash and the assigned document id. Each key-value pair is then pushed to and stored on the redis-cluster, so that documents with the same hash have their ids mapped into the same bucket.

    As the --redis-host-ip-addr argument suggests, the redis-cluster must be running before this command is issued. We provide the script examples/start_redis_cluster.sh for starting a cluster, as well as examples of its use within the examples/exact_deduplication.sh and examples/fuzzy_deduplication.sh scripts.

  3. Once all documents have been hashed and the buckets formed, we then pull the buckets from the redis-cluster and form a list of duplicate document ids:

    pull_duplicate_ids \
      --input-redis-hash-names=<Input redis-hash-names> \
      --redis-host-ip-addr=<IP address of single node in redis-cluster> \
      --output-bucket-file-dir=<Output directory containing jsonl files of buckets> \
      --output-duplicate-doc-ids=<Output list of duplicate ids> \
      --log-dir=./log/pull_duplicate_ids

    The output list of ids is written to the file specified by --output-duplicate-doc-ids. Additionally, the script will write out the formed buckets if a path is provided to the --output-bucket-file-dir argument.

  4. With the id list, a filter can be constructed to remove the duplicate documents from the corpus. The filter ndc.deduplication.filter.DuplicateFilter reads in a list of ids and, for each document that it processes, checks whether the document's id exists within that list. If the id is found, the document is discarded; otherwise it is retained (an illustrative sketch of this id-based filtering is given after this list). The configuration file config/duplicate_filter.yaml provides a default configuration for this filter. With this configuration file, the filter_documents utility can be run as:

    filter_documents \
      --input-data-dir=<Path to original data containing duplicates> \
      --input-json-field="adlr_id" \
      --filter-config-file=./config/exact_duplicate_filter.yaml \
      --output-retained-document-dir=<Output directory containing deduped data> \
      --output-removed-document-dir=<Output directory containing exact duplicates> \
      --log-dir=./log/remove_duplicates

    Note that the --input-json-field="adlr_id" argument indicates that instead of filtering on the “text” input field, we are filtering on the ids. For more information about filtering documents with the filter_documents utility, please see the documentation available within 1_document_filtering.rst.
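
To make the bucket-to-filter workflow of steps 3 and 4 concrete, the sketch below derives a duplicate-id list from in-memory buckets and then drops the corresponding documents from jsonl files. It is purely illustrative: the actual pipeline uses the pull_duplicate_ids and filter_documents utilities, and the helper names here are hypothetical.

    import json
    from pathlib import Path

    def write_duplicate_ids(buckets, output_path):
        """Keep one id per bucket and record the rest as duplicates."""
        duplicate_ids = []
        for ids in buckets.values():
            ids = sorted(ids)
            duplicate_ids.extend(ids[1:])  # the first id in each bucket is retained
        Path(output_path).write_text("\n".join(duplicate_ids))
        return set(duplicate_ids)

    def remove_duplicates(input_dir, output_dir, duplicate_ids):
        """Mirror the id-based filter: drop any document whose adlr_id is in the list."""
        out = Path(output_dir)
        out.mkdir(parents=True, exist_ok=True)
        for path in Path(input_dir).glob("*.jsonl"):
            kept = [
                line
                for line in path.read_text().splitlines()
                if json.loads(line)["adlr_id"] not in duplicate_ids
            ]
            (out / path.name).write_text("\n".join(kept))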

To remove near-duplicates from the corpus, we perform fuzzy deduplication at the document level, removing documents that have a high Jaccard similarity. Our approach closely resembles the one described in Smith et al., 2020, and can essentially be split into two stages. The first stage resembles exact deduplication: the hash of each document is computed, the hashes are pushed to a redis-cluster to form buckets of similar documents, and finally the buckets are pulled from the redis-cluster and saved to disk. Because the hashing and bucketing steps use min-wise hashing and locality-sensitive hashing, this stage is often referred to as MinHashLSH. To carry out this stage and form the buckets, we provide the compute_minhashes and lsh utilities, which can be run as follows:

    compute_minhashes \
      --input-data-dir=<Input directory containing jsonl files> \
      --minhash-length=<Number of minhashes to compute per document> \
      --output-minhash-dir=<Output directory of computed minhashes> \
      --output-doc-id-file=./data/all_doc_ids.txt \
      --log-dir=./log/minhashes

    lsh \
      --input-minhash-dir=<Directory containing computed minhashes> \
      --total-num-bands=<Number of bands used to divide up minhash signatures> \
      --total-num-documents=<Total number of documents in the corpus> \
      --redis-host-ip-addr=<IP address of host in redis-cluster> \
      --output-redis-hash-names=./doc/redis_hashes.txt \
      --output-bucket-file-dir=<Output directory containing buckets> \
      --log-dir=./log/lsh

The first command, compute_minhashes, computes a MinHash signature for each document and writes the computed signatures, along with the id of the associated document, to disk as pickle files. It also writes all of the processed ids to the file specified by --output-doc-id-file. Each MinHash signature is an array of 64-bit integers whose length is determined by the --minhash-length argument. The choice of MinHash length depends on the dataset as well as the desired accuracy when forming buckets of similar documents (longer signatures tend to produce fewer false positives and false negatives during the locality-sensitive hashing stage).

Once the minhashes have been written to disk, the lsh utility reads them in, divides each signature into bands, forms a key-value pair between each banded signature and the document id, and pushes these key-value pairs to the redis-cluster. Once all key-value pairs have been pushed and the buckets have formed, lsh pulls the formed buckets to the directory provided to the --output-bucket-file-dir argument. Depending on the length of the MinHash signature and the number of bands, this step can require a significant amount of memory. In the figure shown below, we provide two scaling curves that show how much memory the redis-cluster requires for dataset sizes ranging from 0.25 to 1.0 TB and for MinHash lengths of 100 with 10 bands and 260 with 20 bands. Each of these experiments was run on a 30-node CPU cluster with hardware similar to the Amazon AWS c5.24xlarge C5 instance.

[Figure: redis_memory_scaling.png — redis-cluster memory required as a function of dataset size (0.25–1.0 TB) for MinHash lengths of 100 with 10 bands and 260 with 20 bands.]

If you run out of memory on the redis-cluster when running lsh to form the buckets, the additional arguments --min-band-index and --max-band-index restrict lsh to forming key-value pairs for only a range of bands (e.g., to push a single band at a time, set --min-band-index and --max-band-index to the same value). Pushing one band at a time can reduce peak memory usage by a factor of up to the number of bands; assuming the memory scaling shown in the figure above, running lsh on a 1 TB dataset would then require only about 100 GB of memory. Alternatively, if users wish to push all bands simultaneously, additional nodes can be added to the redis-cluster to provide greater distributed memory capacity.
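
For intuition about what compute_minhashes and lsh compute, the sketch below illustrates min-wise hashing and banding on raw text. It is not the Data Curator implementation: the shingling scheme, the seeded MD5 hash functions, and the function names are all assumptions made for illustration.

    import hashlib
    import re

    def shingles(text, k=5):
        """Split text into overlapping k-word shingles (the set used for Jaccard similarity)."""
        tokens = re.findall(r"\w+", text.lower())
        return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

    def minhash_signature(text, minhash_length=100):
        """One 64-bit minimum per seeded hash function approximates a random permutation."""
        signature = []
        for seed in range(minhash_length):
            signature.append(min(
                int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
                for s in shingles(text)
            ))
        return signature

    def band_keys(signature, num_bands=10):
        """Hash each band of the signature; documents sharing any band key share a bucket."""
        rows = len(signature) // num_bands
        return [
            "band-%d:%s" % (b, hashlib.md5(str(signature[b * rows:(b + 1) * rows]).encode()).hexdigest())
            for b in range(num_bands)
        ]

Documents that collide on any band key land in the same bucket, which is why dividing a fixed-length signature into more (shorter) bands catches more candidate pairs at the cost of more false positives to filter out later.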

Once the buckets have been written to disk after running lsh, due to the approximate nature of bucketing via MinHashLSH (Leskovec et al., 2020), we process each bucket to remove any false positives that may have been hashed into it. To avoid loading all documents into memory and repeatedly searching the .jsonl files for them, the MPI ranks read the candidate duplicate documents in the dataset in parallel and cache them to a persistent dictionary (this is accomplished with the add_file_paths_to_buckets and cache_data_to_dict scripts).

Once all required documents have been cached, the compute_jaccard_pairs and convert_pairs_to_matrices utilities compute Jaccard similarities among the documents within each bucket and store these similarities in a sparse adjacency matrix. Finally, using the sparse graph tools available within scipy, the compute_connected_components script finds the connected components of this adjacency matrix, selects a single document from each component as the document to retain, and labels all others as duplicates. It also writes out a file containing the list of duplicate ids, which can be used to remove the duplicate documents via the filter_documents utility (as was done when removing exact duplicates). For a detailed example of how all of this is carried out, please try the examples/fuzzy_deduplication.sh script.
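
As a minimal sketch of this final step (not the compute_connected_components implementation; the inputs and variable names below are hypothetical), the following builds a sparse adjacency matrix from precomputed near-duplicate pairs and uses scipy's connected components to keep one document per cluster:

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    # Hypothetical inputs: document ids and index pairs whose Jaccard similarity
    # exceeded the chosen threshold during bucket processing.
    doc_ids = ["doc_id-000001", "doc_id-000002", "doc_id-000003", "doc_id-000004"]
    edges = [(0, 1), (1, 2)]

    rows, cols = zip(*edges)
    adjacency = csr_matrix(
        (np.ones(len(edges)), (rows, cols)), shape=(len(doc_ids), len(doc_ids))
    )

    # Each connected component of the (undirected) graph is one cluster of near-duplicates.
    _, labels = connected_components(adjacency, directed=False)

    duplicates, kept_components = [], set()
    for idx, component in enumerate(labels):
        if component in kept_components:
            duplicates.append(doc_ids[idx])  # every id after the first in a component is a duplicate
        else:
            kept_components.add(component)  # retain the first document seen in each component

    print(duplicates)  # these ids would feed the duplicate-id list consumed by filter_documents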
