
Deduplication


Remove duplicate and near-duplicate documents from text datasets using NeMo Curator’s GPU-accelerated deduplication workflows. Removing duplicates prevents overrepresentation of repeated content in language model training.

NeMo Curator provides three deduplication approaches: exact matching (MD5 hashing), fuzzy matching (MinHash + LSH), and semantic matching (embeddings). All methods are GPU-accelerated and integrate with the data processing pipeline.

How It Works

NeMo Curator provides three deduplication approaches, each optimized for a different duplicate type. The fastest is exact deduplication:

  • Method: MD5 hashing
  • Detects: Character-for-character identical documents
  • Speed: Fastest

Exact deduplication computes an MD5 hash of each document's text content and groups documents with identical hashes. It is best for removing exact copies.

```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

ray_client = RayClient()
ray_client.start()

exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Identification only
    assign_id=True,
    input_filetype="parquet",
)
exact_workflow.run()
```
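Under the hood, the identification step amounts to hashing each document and grouping identical digests. A minimal, library-independent sketch of the idea (not the Curator implementation, which runs distributed on GPUs):

```python
import hashlib

def find_exact_duplicates(docs: dict[str, str]) -> list[str]:
    """Return IDs of documents whose text is byte-identical to an earlier document."""
    seen: dict[str, str] = {}  # MD5 hex digest -> ID of first document with that text
    duplicates = []
    for doc_id, text in docs.items():
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append(doc_id)  # keep the first occurrence, flag the rest
        else:
            seen[digest] = doc_id
    return duplicates

docs = {"a": "hello world", "b": "different text", "c": "hello world"}
print(find_exact_duplicates(docs))  # -> ['c']
```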

For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Exact Duplicate Removal for details.



Common Operations

Document IDs

Duplicate removal workflows require stable document identifiers. Choose one approach:

  • Use AddId to add IDs at the start of your pipeline
  • Use reader-based ID generation (_generate_ids, _assign_ids) backed by the ID Generator actor for stable integer IDs
  • Use existing IDs if your documents already have unique identifiers

Some workflows write an ID generator state file (*_id_generator.json) for later removal when IDs are auto-assigned.
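To illustrate why persisted, monotonic integer IDs matter across workflow stages, here is a simplified, hypothetical sketch of an ID assigner with saved state (not the Curator ID Generator actor itself):

```python
import json

class SimpleIdGenerator:
    """Hands out monotonically increasing integer IDs, one batch at a time."""

    def __init__(self, start: int = 0):
        self.next_id = start

    def assign(self, num_docs: int) -> range:
        # Each batch gets a contiguous, never-reused ID range
        ids = range(self.next_id, self.next_id + num_docs)
        self.next_id += num_docs
        return ids

    def save(self, path: str) -> None:
        # Persisting state lets a later removal run map rows back to the same IDs
        with open(path, "w") as f:
            json.dump({"next_id": self.next_id}, f)

gen = SimpleIdGenerator()
batch1 = list(gen.assign(3))  # [0, 1, 2]
batch2 = list(gen.assign(2))  # [3, 4]
```

Because the counter is saved rather than reset, re-reading the same input later reproduces the same IDs, which is what makes identification and removal consistent across separate runs.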

Removing Duplicates

Use TextDuplicatesRemovalWorkflow to remove identified duplicates from your original dataset. It works with duplicate IDs produced by exact, fuzzy, or semantic deduplication.

```python
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input",
    ids_to_remove_path="/path/to/duplicates",  # ExactDuplicateIds/, FuzzyDuplicateIds/, or duplicates/
    output_path="/path/to/clean",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/id_generator.json",  # Required when IDs were auto-assigned
)
removal_workflow.run()
```

When assign_id=True (IDs auto-assigned):

  • Duplicate IDs file contains _curator_dedup_id column
  • Set ids_to_remove_duplicate_id_field="_curator_dedup_id"
  • id_generator_path is required

When assign_id=False (using existing IDs):

  • Duplicate IDs file contains the column specified by id_field (e.g., "id")
  • Set ids_to_remove_duplicate_id_field to match your id_field value
  • id_generator_path not required

Outputs and Artifacts

Each deduplication method produces specific output files and directories:

Output Locations

| Method | Duplicate IDs Location | ID Generator File | Deduplicated Output |
|---|---|---|---|
| Exact | ExactDuplicateIds/ (Parquet) | exact_id_generator.json (if assign_id=True) | Via TextDuplicatesRemovalWorkflow |
| Fuzzy | FuzzyDuplicateIds/ (Parquet) | fuzzy_id_generator.json (if IDs auto-assigned) | Via TextDuplicatesRemovalWorkflow |
| Semantic | output_path/duplicates/ (Parquet) | N/A | output_path/deduplicated/ (if perform_removal=True) |

Column names:

  • _curator_dedup_id when assign_id=True or IDs are auto-assigned
  • Matches id_field parameter when assign_id=False

Choosing a Deduplication Method

Compare deduplication methods to select the best approach for your dataset:

Method Comparison

| Method | Best For | Speed | Duplicate Types | GPU Required |
|---|---|---|---|---|
| Exact | Identical copies | Very fast | Character-for-character matches | Required |
| Fuzzy | Near-duplicates with small changes | Fast | Minor edits, reformatting (~80% similarity) | Required |
| Semantic | Similar meaning, different words | Moderate | Paraphrases, translations, rewrites | Required |

Quick Decision Guide

Use this guide to quickly select the right method:

  • Start with Exact if you have numerous identical documents or need the fastest speed
  • Use Fuzzy if you need to catch near-duplicates with minor formatting differences
  • Use Semantic for meaning-based deduplication on large, diverse datasets

Exact Deduplication:

  • Removing identical copies of documents
  • Fast initial deduplication pass
  • Datasets with numerous exact duplicates
  • When speed is more important than detecting near-duplicates

Fuzzy Deduplication:

  • Removing near-duplicate documents with minor formatting differences
  • Detecting documents with small edits or typos
  • Fast deduplication when exact matching misses numerous duplicates
  • When speed is important but some near-duplicate detection is needed
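The intuition behind fuzzy (MinHash) deduplication: two documents that share most of their word shingles end up with similar hash signatures, and comparing signatures approximates their Jaccard similarity. A small, library-independent sketch (Curator's GPU MinHash + LSH pipeline is far more optimized):

```python
import hashlib

def shingles(text: str, k: int = 3) -> set[str]:
    """Break text into overlapping k-word shingles."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    # For each of num_hashes seeded hash functions, keep the minimum hash value
    return [
        min(int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16) for item in items)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching signature slots estimates Jaccard similarity
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over a lazy dog"
sim = estimated_jaccard(minhash_signature(shingles(a)), minhash_signature(shingles(b)))
```

In production, LSH banding is applied on top of these signatures so that candidate pairs are found without comparing every document to every other document.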

Semantic Deduplication:

  • Removing semantically similar content (paraphrases, translations)
  • Large, diverse web-scale datasets
  • When meaning-based deduplication is more important than speed
  • Advanced use cases requiring embedding-based similarity detection
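The core operation in semantic deduplication is comparing embedding vectors by cosine similarity. A minimal sketch with toy vectors standing in for transformer embeddings (the real workflow uses K-means clustering to avoid all-pairs comparison):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_duplicates(embeddings: dict[str, list[float]], threshold: float = 0.9) -> list[tuple[str, str]]:
    """Flag every pair of documents whose embedding similarity exceeds the threshold."""
    ids = list(embeddings)
    return [
        (ids[i], ids[j])
        for i in range(len(ids))
        for j in range(i + 1, len(ids))
        if cosine_similarity(embeddings[ids[i]], embeddings[ids[j]]) >= threshold
    ]

# Toy embeddings: doc_a and doc_b point in nearly the same direction
embeddings = {
    "doc_a": [1.0, 0.1, 0.0],
    "doc_b": [0.9, 0.12, 0.01],
    "doc_c": [0.0, 1.0, 0.5],
}
print(semantic_duplicates(embeddings))  # -> [('doc_a', 'doc_b')]
```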

You can combine deduplication methods for comprehensive duplicate removal:

  1. Exact → Fuzzy → Semantic: Start with fastest methods, then apply more sophisticated methods
  2. Exact → Semantic: Use exact for quick wins, then semantic for meaning-based duplicates
  3. Fuzzy → Semantic: Use fuzzy for near-duplicates, then semantic for paraphrases

Run each method independently, then combine duplicate IDs before removal.
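Combining the runs amounts to a set union over the duplicate IDs each method produced, followed by a single removal pass. A sketch with illustrative integer IDs:

```python
# Duplicate IDs identified by each independent run (illustrative values)
exact_ids = {101, 102, 103}
fuzzy_ids = {103, 104}
semantic_ids = {104, 105}

# Union the sets so each duplicate is removed exactly once, in one pass
ids_to_remove = exact_ids | fuzzy_ids | semantic_ids
print(sorted(ids_to_remove))  # -> [101, 102, 103, 104, 105]
```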


GPU Acceleration

All deduplication workflows require GPU acceleration:

  • Exact: Ray backend with GPU support for MD5 hashing operations
  • Fuzzy: Ray backend with GPU support for MinHash computation and LSH operations
  • Semantic: GPU required for embedding generation (transformer models), K-means clustering, and pairwise similarity computation

GPU acceleration provides significant speedup for large datasets through parallel processing.

Hardware Requirements

  • GPU: Required for all workflows (Ray with GPU support for exact/fuzzy, GPU for semantic)
  • Memory: GPU memory requirements scale with dataset size, batch sizes, and embedding dimensions
  • Executors: Can use various executors (XennaExecutor, RayDataExecutor) with GPU support

Backend Setup

For optimal performance with large datasets, configure Ray backend:

```python
from nemo_curator.core.client import RayClient

client = RayClient(
    num_cpus=64,  # Adjust based on available cores
    num_gpus=4,   # Adjust based on available GPUs
)
client.start()

try:
    workflow.run()
finally:
    client.stop()
```

For TB-scale datasets, consider distributed GPU clusters with Ray.

ID Generator for Large-Scale Operations

For large-scale duplicate removal, persist the ID Generator for consistent document tracking:

```python
from nemo_curator.stages.deduplication.id_generator import (
    create_id_generator_actor,
    write_id_generator_to_disk,
    kill_id_generator_actor,
)

create_id_generator_actor()
id_generator_path = "semantic_id_generator.json"
write_id_generator_to_disk(id_generator_path)
kill_id_generator_actor()

# Use the saved ID generator in the removal workflow
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path=input_path,
    ids_to_remove_path=duplicates_path,
    output_path=output_path,
    id_generator_path=id_generator_path,
    # ... other parameters
)
```

The ID Generator ensures consistent IDs across workflow stages.

Next Steps

Ready to use deduplication?

For hands-on guidance: See Text Curation Tutorials for step-by-step examples.