GPU Accelerated Exact and Fuzzy Deduplication


Training on randomly selected documents for many epochs can be sub-optimal for the downstream performance of language models. For more information on when this is harmful, please see Muennighoff et al., 2023 and Tirumala et al., 2023. The exact and fuzzy document-level deduplication module in NeMo Curator aims to reduce the occurrence of duplicate and near-duplicate documents in the dataset. Exact deduplication refers to removing identical (i.e., document strings are equal) documents from the dataset, while fuzzy deduplication refers to removing near-identical (e.g., an excerpt of a document is used in another document) documents from the dataset.

Both functionalities are supported in NeMo Curator and accelerated using RAPIDS. Exact deduplication works by hashing each document and keeping only one document per hash. Fuzzy deduplication is more involved and follows the method outlined in Microsoft Turing NLG 530B.
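The hash-and-keep-one idea behind exact deduplication can be sketched in a few lines of plain Python. This is a CPU-only illustration, not the RAPIDS-accelerated implementation; the choice of MD5 and the function name `exact_dedup` are assumptions made for the example, since the text above does not say which hash function is used.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first document seen for each content hash.

    `docs` is an iterable of (doc_id, text) pairs. Returns (kept, duplicates),
    where `duplicates` lists (doc_id, hash) for every dropped document.
    MD5 is an assumption for this sketch, not a documented choice.
    """
    seen = {}
    kept, duplicates = [], []
    for doc_id, text in docs:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append((doc_id, digest))
        else:
            seen[digest] = doc_id
            kept.append(doc_id)
    return kept, duplicates


docs = [("a", "hello world"), ("b", "hello world"), ("c", "hello world!")]
kept, duplicates = exact_dedup(docs)
print(kept)                         # IDs of documents that survive
print([d for d, _ in duplicates])   # IDs of documents flagged as exact duplicates
```

Only byte-identical strings collide here, which is why a single trailing character is enough to make document "c" count as distinct.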

Because exact deduplication is a much less involved procedure and requires significantly less compute, we typically run exact deduplication before fuzzy deduplication. Also, in our experience deduplicating Common Crawl snapshots, a significant portion of the duplicates are in fact exact duplicates.

When removing near-duplicates within the corpus, we perform fuzzy deduplication at the document level in order to remove documents that have high Jaccard similarity. Our approach closely resembles the approach described in Smith et al., 2020. It can essentially be split into two conceptual stages. The first stage involves computing MinHash signatures on documents and then performing Locality Sensitive Hashing (LSH) to find candidate duplicates. In the second stage, because bucketing via MinHash + LSH (Leskovec et al., 2020) is approximate, we process each of the buckets to remove any potential false positives that may have been hashed into them.
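The first stage can be illustrated with a small pure-Python sketch: character n-gram shingles, a MinHash signature built from seeded hash functions, and LSH banding to produce candidate pairs. All names here (`char_ngrams`, `minhash_signature`, `lsh_candidates`), the use of BLAKE2b as the hash family, and the default parameters are assumptions made for illustration. The GPU implementation will differ in its details.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def char_ngrams(text, n=5):
    """Set of overlapping character n-grams (shingles) of `text`."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def _hash(seed, shingle):
    # One seeded 64-bit hash; BLAKE2b is an assumption for this sketch.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64, n=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    shingles = char_ngrams(text, n)
    return [min(_hash(seed, s) for s in shingles) for seed in range(num_hashes)]


def lsh_candidates(signatures, num_bands=16):
    """Split signatures into bands; documents that agree on any band share a bucket."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // num_bands
        for band in range(num_bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog!",
    "c": "an entirely different sentence about GPUs and data curation",
}
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
print(lsh_candidates(signatures))  # candidate pairs, still to be verified
```

Because two documents only need to agree on one band to land in the same bucket, the candidate set can contain pairs that are not actually similar. That is the reason for the second stage, which checks the candidates against their true Jaccard similarity.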

Before running either of these modules, users should assign a unique document ID to each document in the corpus. This can be accomplished using the add_id module within the NeMo Curator:


add_id \
  --input-data-dir=<Path to directory containing jsonl files> \
  --log-dir=./log/add_id

By default, this will create a new field named adlr_id within each JSON document, which will have the form “doc_prefix-000001”. If the dataset already has a unique ID, this step can be skipped.

Note: Fuzzy deduplication only works with numeric IDs or the specific ID format generated by the add_id script. If the dataset does not contain IDs in this format, it’s recommended to convert to an integer-based ID or an ID created by the add_id script.
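To make the ID format concrete, here is a minimal sketch that attaches an ID of the form “doc_prefix-000001” to each JSON line. It is not the add_id script itself: the function name `add_ids`, the starting index, and the six-digit zero padding are assumptions inferred only from the example ID shown above.

```python
import json


def add_ids(jsonl_lines, prefix="doc_prefix", id_field="adlr_id", start=1, width=6):
    """Yield JSON lines with a sequential, zero-padded ID field added.

    `start` and `width` are assumptions based on the example "doc_prefix-000001".
    """
    for offset, line in enumerate(jsonl_lines):
        record = json.loads(line)
        record[id_field] = f"{prefix}-{start + offset:0{width}d}"
        yield json.dumps(record)


lines = ['{"text": "first document"}', '{"text": "second document"}']
with_ids = [json.loads(line) for line in add_ids(lines)]
print([record["adlr_id"] for record in with_ids])
```

A fixed prefix plus a numeric suffix keeps the ID unique while leaving an integer that is easy to recover, which fits the note above about numeric IDs.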

Once a unique ID has been added to each document, users can proceed with exact and fuzzy deduplication, which roughly require the following steps (all scripts are included in the nemo_curator/scripts/ subdirectory):

  • Exact dedup
    1. Input: Data directories

    2. Output: _exact_duplicates.parquet. List of exact duplicates and the document hash.

  • Fuzzy Dedup

    1. Compute Minhashes

    • Input: Data Directories

    • Output: minhashes.parquet for each data dir.

    • Example call:


      # same as `python compute_minhashes.py`
      # --hash-bytes takes 4 or 8 (4-byte or 8-byte hashes)
      gpu_compute_minhashes \
        --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
        --output-minhash-dir /path/to/output_minhashes \
        --input-json-text-field text_column_name \
        --input-json-id-field id_column_name \
        --minhash-length number_of_hashes \
        --char-ngram char_ngram_size \
        --hash-bytes 4 \
        --seed 42 \
        --log-dir ./
        # --scheduler-file /path/to/file.json


    2. Buckets (Minhash Buckets)

    • Input: Minhash directories

    • Output: _buckets.parquet

    • Example call:


      # same as `python minhash_lsh.py`
      minhash_buckets \
        --input-data-dirs /path/to/output_minhashes/dir1 /path/to/output_minhashes/dir2 \
        --output-bucket-dir /path/to/dedup_output \
        --input-minhash-field _minhash_signature \
        --input-json-id-field id_column_name \
        --minhash-length number_of_hashes \
        --num-bands num_bands \
        --buckets-per-shuffle 1 `#Value b/w [1-num_bands]. Higher is better but might lead to oom` \
        --log-dir ./
        # --scheduler-file /path/to/file.json


    3. Jaccard Map Buckets

    • Input: _buckets.parquet + Data Dir

    • Output: anchor_docs_with_bk.parquet

    • Example call:


      # same as `python map_buckets.py`
      jaccard_map_buckets \
        --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
        --input-bucket-dir /path/to/dedup_output/_buckets.parquet \
        --output-dir /path/to/dedup_output \
        --input-json-text-field text_column_name \
        --input-json-id-field id_column_name \
        # --scheduler-file /path/to/file.json


    4. Jaccard Shuffle

    • Input: anchor_docs_with_bk.parquet + Data Dir

    • Output: shuffled_docs.parquet

    • Example call:


      # same as `python jaccard_shuffle.py`
      jaccard_shuffle \
        --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
        --input-bucket-mapping-dir /path/to/dedup_output/anchor_docs_with_bk.parquet \
        --output-dir /path/to/dedup_output \
        --input-json-text-field text_column_name \
        --input-json-id-field id_column_name \
        # --scheduler-file /path/to/file.json


    5. Jaccard compute

    • Input: shuffled_docs.parquet

    • Output: jaccard_similarity_results.parquet

    • Example call:


      # same as `python jaccard_compute.py`
      jaccard_compute \
        --shuffled-docs-path /path/to/dedup_output/shuffled_docs.parquet \
        --output-dir /path/to/dedup_output \
        --ngram-size char_ngram_size_for_similarity \
        # --scheduler-file /path/to/file.json


    6. Connected Components

    • Input: jaccard_similarity_results.parquet

    • Output: connected_components.parquet

    • Example call:


      # same as `python connected_components.py`
      gpu_connected_component \
        --jaccard-pairs_path /path/to/dedup_output/jaccard_similarity_results.parquet \
        --output-dir /path/to/dedup_output \
        --cache-dir /path/to/cc_cache \
        --jaccard-threshold 0.8
        # --scheduler-file /path/to/file.json
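Conceptually, the last two steps verify candidate pairs with a true Jaccard similarity over character n-grams, then group documents whose similarity passes the threshold into connected components. The sketch below shows that logic in plain Python with a union-find structure. The function names, the `>=` comparison against the threshold, and the rule of keeping the first ID in each sorted group are assumptions for illustration, not the behavior of the GPU scripts.

```python
def char_ngrams(text, n=5):
    """Set of overlapping character n-grams of `text`."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b, n=5):
    """Jaccard similarity |A ∩ B| / |A ∪ B| over character n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    return len(x & y) / len(x | y)


def connected_components(pairs, threshold=0.8):
    """Group IDs linked by pairs whose similarity meets the threshold (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right, similarity in pairs:
        if similarity >= threshold:  # assumption: threshold is inclusive
            parent[find(left)] = find(right)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in groups.values())


# (id_1, id_2, jaccard) triples, as a similarity step might produce them.
pairs = [("a", "b", 0.9), ("b", "c", 0.85), ("c", "d", 0.4), ("e", "f", 0.95)]
components = connected_components(pairs, threshold=0.8)
# Keep one document per component and mark the rest for removal.
to_remove = [doc_id for group in components for doc_id in group[1:]]
print(components)
print(to_remove)
```

Grouping by connected components makes duplication transitive: "a" and "c" end up in the same group through "b" even though that pair was never compared directly.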


In addition to the scripts, there are examples in the examples directory that showcase using the Python module directly in your own code. The directory also has examples of how to remove documents from the corpus using the list of duplicate IDs generated from exact or fuzzy deduplication.
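As a rough idea of what removal by ID involves, the following sketch filters JSON lines against a set of duplicate IDs. It is a hypothetical helper, not code from the examples directory; the function name `remove_duplicates` and the default field name `adlr_id` are assumptions.

```python
import json


def remove_duplicates(jsonl_lines, duplicate_ids, id_field="adlr_id"):
    """Yield only the JSON lines whose ID is not in `duplicate_ids`."""
    duplicate_ids = set(duplicate_ids)
    for line in jsonl_lines:
        if json.loads(line)[id_field] not in duplicate_ids:
            yield line


corpus = [
    '{"adlr_id": "doc_prefix-000001", "text": "hello world"}',
    '{"adlr_id": "doc_prefix-000002", "text": "hello world"}',
    '{"adlr_id": "doc_prefix-000003", "text": "something else"}',
]
deduplicated = list(remove_duplicates(corpus, ["doc_prefix-000002"]))
print(len(deduplicated))
```

The same filter works for either list of duplicates, because both the exact and the fuzzy pipeline end with a set of document IDs to drop.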

Last updated on Jun 24, 2024.