GPU Accelerated Exact and Fuzzy Deduplication

Training on randomly selected documents for many epochs can be sub-optimal for the downstream performance of language models. For more information on when this is harmful, please see Muennighoff et al., 2023 and Tirumala et al., 2023. The exact and fuzzy document-level deduplication module in the NeMo Data Curator aims to reduce the occurrence of duplicate and near-duplicate documents in the dataset. Exact deduplication refers to removing identical documents (i.e., the document strings are equal) from the dataset, while fuzzy deduplication refers to removing near-identical documents (e.g., an excerpt of one document is reused in another).

Both functionalities are supported in the NeMo Data Curator and accelerated using RAPIDS cuDF. Exact deduplication works by hashing each document and keeping only one document per hash. Fuzzy deduplication is more involved and follows the method outlined in Microsoft Turing NLG 530B.
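To make the exact-deduplication idea concrete, here is a minimal CPU sketch, assuming the corpus is an iterable of (ID, text) pairs. The function name and the use of MD5 are illustrative; the actual module runs hashing on GPUs with cuDF.

```python
import hashlib


def exact_duplicates(documents):
    """Hash each document and flag all but the first document per hash.

    Illustrative sketch only: `documents` is assumed to be an iterable of
    (doc_id, text) pairs, and MD5 stands in for whatever hash the real
    GPU-accelerated module uses.
    """
    seen = {}          # hash digest -> first doc_id observed with it
    duplicates = []    # (doc_id, hash digest) pairs flagged for removal
    for doc_id, text in documents:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append((doc_id, digest))
        else:
            seen[digest] = doc_id
    return duplicates


docs = [
    ("doc-0", "hello world"),
    ("doc-1", "hello world"),   # byte-identical to doc-0, so it is flagged
    ("doc-2", "goodbye"),
]
dups = exact_duplicates(docs)
```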

Because exact deduplication is a much simpler procedure and requires significantly less compute, we typically run it before fuzzy deduplication. Moreover, in our experience deduplicating Common Crawl snapshots, a significant portion of the duplicates are in fact exact duplicates.

Before running either of these modules, users should assign a unique document ID to each document in the corpus. This can be accomplished using the add_id module within the NeMo Data Curator:


add_id \
  --input-data-dir=<Path to directory containing jsonl files> \
  --log-dir=./log/add_id

By default, this will create a new field named adlr_id within each JSON document, with values of the form “doc_id-000001”.
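For illustration, a record in an input .jsonl file would look roughly like the following after add_id runs (only adlr_id is the field the module adds; the text field name here is an assumption about the input schema):

```json
{"text": "An example document from the corpus.", "adlr_id": "doc_id-000001"}
```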

Once a unique ID has been added to each document, users can proceed with exact and fuzzy deduplication which roughly require the following steps (all scripts are included in the examples/gpu_deduplication subdirectory):

  • Exact dedup
    1. Input: Data directories

    2. Output: exact_duplicates.parquet. List of exact duplicates and the document hash.

  • Fuzzy Dedup
    1. Minhashes (Compute minhashes)
      1. Input: Data Directories

      2. Output: minhashes.parquet for each data dir.

    2. Buckets (Minhash Buckets/LSH)
      1. Input: Minhash directories

      2. Output: Buckets.parquet

    3. Jaccard Map Buckets + Jaccard shuffle
      1. Input: Buckets.parquet + Data Dir

      2. Output: Shuffled docs.parquet

    4. Jaccard compute
      1. Input: Shuffled docs.parquet

      2. Output: dedup_final_results.parquet

    5. Connected Components
      1. Input: dedup_final_results.parquet

      2. Output: connected_components.parquet
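The minhash and bucketing stages above can be sketched on the CPU as follows. This is not the GPU-accelerated implementation: the shingle size, signature length, and band layout are assumed values chosen for illustration, and MD5 stands in for the real hash functions.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 12  # minhash signature length (assumed, not the curator's default)
BAND_SIZE = 3    # rows per LSH band -> NUM_HASHES // BAND_SIZE bands


def shingles(text, k=5):
    """Character k-grams of the document; shingle size is an assumption."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(text):
    """Step 1: one minimum hash value per seeded hash function."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)))
    return sig


def lsh_buckets(signatures):
    """Step 2: documents sharing any band of their signature share a bucket."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(0, NUM_HASHES, BAND_SIZE):
            buckets[(b, tuple(sig[b:b + BAND_SIZE]))].append(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


def jaccard(a, b):
    """Step 4: exact Jaccard similarity over shingle sets, used to verify
    that candidate pairs from LSH are genuinely near-duplicates."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)
```

Candidate pairs whose Jaccard similarity clears a threshold then form the edges of a graph, and connected components (step 5) groups all mutually similar documents so that one representative per component can be kept.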

When calling the main script that points to these runscripts, users can also set the relevant LIBCUDF_CUFILE_POLICY environment variable. It is recommended to set LIBCUDF_CUFILE_POLICY=OFF for all runs calling the script.
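For example, the variable can be exported in the shell before launching the run (the variable name comes from cuDF; everything else here is just standard shell usage):

```shell
# Disable cuDF's cuFile (GPUDirect Storage) I/O path for this session,
# as recommended above, before invoking the deduplication runscripts.
export LIBCUDF_CUFILE_POLICY=OFF
echo "LIBCUDF_CUFILE_POLICY=$LIBCUDF_CUFILE_POLICY"
# prints: LIBCUDF_CUFILE_POLICY=OFF
```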

After obtaining a list of duplicate IDs from the exact or fuzzy deduplication pipeline, the user can run the or script, respectively, to generate a text file containing a list of all document IDs which need to be removed in order to have a deduplicated dataset. Finally, the user can pass that text file into, which outputs the deduplicated dataset.
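The final removal step can be sketched as follows, assuming documents are stored as .jsonl lines keyed by the adlr_id field added earlier. The function name is hypothetical; the actual curator scripts operate on files and directories rather than in-memory lines.

```python
import json


def remove_listed_ids(jsonl_lines, ids_to_remove, id_field="adlr_id"):
    """Yield only the documents whose ID is not in the removal list.

    Minimal sketch of the deduplicated-dataset step; `remove_listed_ids`
    is an illustrative helper, not a curator API.
    """
    remove = set(ids_to_remove)
    for line in jsonl_lines:
        if json.loads(line)[id_field] not in remove:
            yield line


lines = [
    '{"adlr_id": "doc_id-000001", "text": "hello world"}',
    '{"adlr_id": "doc_id-000002", "text": "hello world"}',
]
# Suppose the dedup pipeline flagged doc_id-000002 for removal.
kept = list(remove_listed_ids(lines, ["doc_id-000002"]))
```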

© Copyright 2023-2024, NVIDIA. Last updated on Feb 22, 2024.