GPU Accelerated Exact and Fuzzy Deduplication

Training on randomly selected documents for many epochs can be sub-optimal for the downstream performance of language models. For more information on when this is harmful, please see Muennighoff et al., 2023 and Tirumala et al., 2023. The exact and fuzzy document-level deduplication module in NeMo Curator aims to reduce the occurrence of duplicate and near-duplicate documents in the dataset. Exact deduplication refers to removing identical documents (i.e., documents whose strings are equal) from the dataset, while fuzzy deduplication refers to removing near-identical documents (e.g., an excerpt of one document is reused in another) from the dataset.

Both functionalities are supported in NeMo Curator and accelerated using RAPIDS. Exact deduplication works by hashing each document and keeping only one document per hash. Fuzzy deduplication is more involved and follows the method outlined in Microsoft Turing NLG 530B.
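
As an illustration of the idea behind exact deduplication (a minimal sketch of the technique, not the RAPIDS-accelerated NeMo Curator implementation), the following Python snippet hashes each document's text and keeps only the first document seen for each hash. The "id" and "text" field names are assumptions for the example.

    import hashlib

    def exact_deduplicate(documents):
        """Keep one document per unique content hash.

        `documents` is assumed to be an iterable of dicts with
        hypothetical "id" and "text" fields.
        """
        seen_hashes = set()
        kept = []
        for doc in documents:
            # Hash the raw document string; identical strings collide.
            digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
            if digest not in seen_hashes:
                seen_hashes.add(digest)
                kept.append(doc)
        return kept

    docs = [
        {"id": "doc_prefix-000001", "text": "the quick brown fox"},
        {"id": "doc_prefix-000002", "text": "the quick brown fox"},  # exact duplicate
        {"id": "doc_prefix-000003", "text": "a different document"},
    ]
    print([d["id"] for d in exact_deduplicate(docs)])
    # ['doc_prefix-000001', 'doc_prefix-000003']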

Because exact deduplication is a much less involved procedure and requires significantly less compute, we typically run it before fuzzy deduplication. Also, in our experience deduplicating Common Crawl snapshots, a significant portion of the duplicates are in fact exact duplicates.

When removing near-duplicates within the corpus, we perform fuzzy deduplication at the document level in order to remove documents with high Jaccard similarity. Our approach closely resembles the approach described in Smith et al., 2020. This approach can essentially be split into two conceptual stages. The first stage involves computing MinHash signatures on documents and then performing Locality Sensitive Hashing (LSH) to find candidate duplicates. Because the bucketing via MinHash + LSH (Leskovec et al., 2020) is approximate, the second stage processes each of the buckets to remove any potential false positives that may have been hashed into them.
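
The following is a minimal, CPU-only sketch of the MinHash + LSH idea described above, not the GPU-accelerated implementation: each document gets a MinHash signature over its character shingles, signatures are split into bands that are hashed into LSH buckets, and documents sharing a bucket become candidate duplicates whose true Jaccard similarity can then be checked to remove false positives. The parameter values, function names, and document IDs here are illustrative assumptions.

    import random

    NUM_HASHES = 128   # signature length
    BAND_SIZE = 8      # rows per LSH band; 128 / 8 = 16 bands
    SHINGLE_WIDTH = 5  # character shingles

    random.seed(0)
    # Parameters for simple universal hash functions h(x) = (a*x + b) % p.
    PRIME = (1 << 61) - 1
    HASH_PARAMS = [(random.randrange(1, PRIME), random.randrange(PRIME))
                   for _ in range(NUM_HASHES)]

    def shingles(text):
        return {text[i:i + SHINGLE_WIDTH] for i in range(len(text) - SHINGLE_WIDTH + 1)}

    def minhash_signature(text):
        """One minimum per hash function over the document's shingle set."""
        shingle_hashes = [hash(s) & 0xFFFFFFFF for s in shingles(text)]
        return [min((a * x + b) % PRIME for x in shingle_hashes)
                for a, b in HASH_PARAMS]

    def lsh_buckets(doc_signatures):
        """Group documents whose signatures agree on at least one band."""
        buckets = {}
        for doc_id, sig in doc_signatures.items():
            for band_start in range(0, NUM_HASHES, BAND_SIZE):
                band = tuple(sig[band_start:band_start + BAND_SIZE])
                buckets.setdefault((band_start, band), set()).add(doc_id)
        # Only buckets with more than one document yield candidate pairs.
        return [ids for ids in buckets.values() if len(ids) > 1]

    def jaccard(text_a, text_b):
        a, b = shingles(text_a), shingles(text_b)
        return len(a & b) / len(a | b)

    docs = {
        "d1": "the quick brown fox jumps over the lazy dog",
        "d2": "the quick brown fox jumps over the lazy dog!",
        "d3": "an entirely unrelated piece of text",
    }
    sigs = {doc_id: text and minhash_signature(text) for doc_id, text in docs.items()}
    candidate_groups = {frozenset(ids) for ids in lsh_buckets(sigs)}
    for group in candidate_groups:
        a, b = sorted(group)  # in this toy example each bucket holds two documents
        print(a, b, round(jaccard(docs[a], docs[b]), 3))  # e.g. d1 d2 0.975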

Before running either of these modules, users should assign a unique document ID to each document in the corpus. This can be accomplished using the add_id module within NeMo Curator:

    add_id \
      --input-data-dir=<Path to directory containing jsonl files> \
      --log-dir=./log/add_id

By default, this will create a new field named adlr_id within each JSON document, which will have the form “doc_prefix-000001”. If the dataset already has a unique ID, this step can be skipped.

Note: Fuzzy deduplication only works with numeric IDs or the specific ID format generated by the add_id script. If the dataset does not contain IDs in this format, it is recommended to convert to an integer-based ID or to an ID created by the add_id script.
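
For example, if an existing dataset uses arbitrary string IDs, one way to produce IDs in a compatible format is to assign sequential, zero-padded integers per document, as in the rough sketch below. This only mirrors the format shown above and is not the add_id implementation; the file names are placeholders.

    import json

    def add_sequential_ids(jsonl_in, jsonl_out, prefix="doc_prefix"):
        """Rewrite a JSONL file, adding an 'adlr_id' field of the form
        '<prefix>-000001'. A rough stand-in for the add_id script,
        for illustration only."""
        with open(jsonl_in) as src, open(jsonl_out, "w") as dst:
            for i, line in enumerate(src, start=1):
                doc = json.loads(line)
                doc["adlr_id"] = f"{prefix}-{i:06d}"
                dst.write(json.dumps(doc) + "\n")

    add_sequential_ids("input.jsonl", "input_with_ids.jsonl")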

Once a unique ID has been added to each document, users can proceed with exact and fuzzy deduplication which roughly require the following steps (all scripts are included in the nemo_curator/scripts/ subdirectory):

  • Exact dedup
    1. Input: Data directories

    2. Output: _exact_duplicates.parquet. List of exact duplicates and the document hash.

  • Fuzzy dedup
    1. Minhashes (compute MinHash signatures)
      1. Input: Data directories

      2. Output: minhashes.parquet for each data directory.

    2. Buckets (MinHash Buckets/LSH)
      1. Input: MinHash directories

      2. Output: _buckets.parquet

    3. Map Buckets
      1. Input: _buckets.parquet + data directories

      2. Output: anchor_docs_with_bk.parquet

    4. Jaccard Shuffle
      1. Input: anchor_docs_with_bk.parquet + data directories

      2. Output: shuffled_docs.parquet

    5. Jaccard Compute
      1. Input: shuffled_docs.parquet

      2. Output: jaccard_similarity_results.parquet

    6. Connected Components (a conceptual sketch of the last two steps follows this list)
      1. Input: jaccard_similarity_results.parquet

      2. Output: connected_components.parquet
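
As a conceptual illustration of the last two steps (not the GPU implementation), the sketch below takes document pairs with their Jaccard similarities, keeps the pairs above a threshold, and groups the documents into connected components with a simple union-find; within each component, all but one document would be removed. The threshold value, document IDs, and pair data are assumptions for the example.

    # Pairs of (doc_id_a, doc_id_b, jaccard_similarity), e.g. read from
    # a hypothetical jaccard_similarity_results.parquet.
    pairs = [
        ("doc-000001", "doc-000002", 0.91),
        ("doc-000002", "doc-000007", 0.88),
        ("doc-000003", "doc-000004", 0.12),
    ]
    JACCARD_THRESHOLD = 0.8  # assumed cutoff for "near-duplicate"

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Connect documents whose similarity exceeds the threshold.
    for a, b, sim in pairs:
        if sim >= JACCARD_THRESHOLD:
            union(a, b)

    # Group documents by the root of their component.
    components = {}
    for doc in parent:
        components.setdefault(find(doc), set()).add(doc)

    for members in components.values():
        # Keep one document per component, drop the rest.
        print(sorted(members))  # ['doc-000001', 'doc-000002', 'doc-000007']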

In addition to the scripts, there are examples in the examples directory that showcase using the Python modules directly in your own code. They also show how to remove documents from the corpus using the list of duplicate IDs generated by exact or fuzzy deduplication.
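
For instance, given a list of duplicate IDs (however it was produced), removing those documents from a JSONL corpus can be as simple as the rough sketch below. The 'adlr_id' field name matches the default added by add_id, while the file names and example IDs are placeholders.

    import json

    def remove_duplicates(jsonl_in, jsonl_out, duplicate_ids, id_field="adlr_id"):
        """Write every document whose ID is not in `duplicate_ids`."""
        duplicate_ids = set(duplicate_ids)
        with open(jsonl_in) as src, open(jsonl_out, "w") as dst:
            for line in src:
                doc = json.loads(line)
                if doc[id_field] not in duplicate_ids:
                    dst.write(line)

    # The IDs to drop would come from the exact or fuzzy deduplication outputs.
    remove_duplicates("corpus.jsonl", "corpus_deduped.jsonl",
                      duplicate_ids=["doc_prefix-000002", "doc_prefix-000017"])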
