Deduplication#

Remove duplicate and near-duplicate documents efficiently from your text datasets using NeMo Curator’s GPU-accelerated exact, fuzzy, and semantic deduplication modules.

Removing duplicates improves language model training by preventing overrepresentation of repeated content. NeMo Curator provides multiple approaches to deduplication, from exact hash-based matching to semantic similarity detection using embeddings. These workflows are part of the comprehensive data processing pipeline.

How It Works#

NeMo Curator’s deduplication framework is built around three main approaches that work within the data processing architecture:

Exact deduplication uses MD5 hashing to identify identical documents:

from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# Configure exact deduplication
exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    assign_id=True,  # Automatically assign unique IDs
    input_filetype="parquet"  # "parquet" or "jsonl"
)

# Run with Ray backend (GPU required)
exact_workflow.run()

The workflow (see the conceptual sketch after these steps):

  1. Computes MD5 hashes for each document’s text content

  2. Groups documents by identical hash values

  3. Identifies duplicates for removal or creates cleaned dataset
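
To make these steps concrete, here is a minimal pure-Python sketch of the same hash-and-group logic (illustration only; the workflow itself runs these steps on GPUs through Ray):

import hashlib
from collections import defaultdict

docs = {
    "doc-0": "The quick brown fox.",
    "doc-1": "An entirely different document.",
    "doc-2": "The quick brown fox.",  # exact copy of doc-0
}

# 1. Hash each document's text; 2. group document IDs by identical hashes
groups = defaultdict(list)
for doc_id, text in docs.items():
    groups[hashlib.md5(text.encode("utf-8")).hexdigest()].append(doc_id)

# 3. Keep the first document in each group; flag the rest as duplicates
duplicate_ids = [dup for ids in groups.values() for dup in ids[1:]]
print(duplicate_ids)  # ['doc-2']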

Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:

from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Configure fuzzy deduplication
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    input_blocksize="1GiB",  # Default block size (differs from exact dedup)
    # MinHash + LSH parameters
    seed=42,
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13
)

# Run with Ray backend (GPU required)
fuzzy_workflow.run()

The workflow (see the conceptual sketch after these steps):

  1. Generates MinHash signatures for each document

  2. Uses Locality Sensitive Hashing (LSH) to find similar signatures

  3. Identifies near-duplicates based on similarity thresholds
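
To make these steps concrete, here is a tiny pure-Python sketch of MinHash signatures and LSH banding (illustration only; the parameters are shrunk to suit short strings, and the real workflow computes MinHashes on GPUs):

import hashlib

CHAR_NGRAMS = 5        # char_ngrams (24 in the workflow above)
NUM_BANDS = 4          # num_bands
HASHES_PER_BAND = 2    # minhashes_per_band
NUM_HASHES = NUM_BANDS * HASHES_PER_BAND

def shingles(text):
    return {text[i : i + CHAR_NGRAMS] for i in range(len(text) - CHAR_NGRAMS + 1)}

def minhash_signature(text):
    # 1. One minimum per seeded hash function; matching minima approximate
    #    the Jaccard similarity of the two documents' shingle sets
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        )
        for seed in range(NUM_HASHES)
    ]

def band_keys(sig):
    # 2. Documents sharing any complete band of MinHash values become
    #    candidate near-duplicate pairs
    return [tuple(sig[b * HASHES_PER_BAND : (b + 1) * HASHES_PER_BAND]) for b in range(NUM_BANDS)]

a = band_keys(minhash_signature("the quick brown fox jumps over the lazy dog"))
b = band_keys(minhash_signature("the quick brown fox jumps over the lazy dogs"))
# 3. These near-identical strings almost certainly collide in at least one band
print(any(x == y for x, y in zip(a, b)))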

Semantic deduplication uses embeddings to identify meaning-based duplicates:

from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow

# End-to-end semantic deduplication
text_workflow = TextSemanticDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output", 
    cache_path="/path/to/cache",
    text_field="text",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    n_clusters=100,
    eps=0.01,  # Similarity threshold
    perform_removal=True  # Complete deduplication
)

# Run with GPU backend
text_workflow.run()

The workflow (see the conceptual sketch after these steps):

  1. Generates embeddings for each document using transformer models

  2. Clusters embeddings using K-means

  3. Computes pairwise similarities within clusters

  4. Identifies semantic duplicates based on cosine similarity threshold
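
To make these steps concrete, here is a rough NumPy/scikit-learn sketch of the cluster-then-compare logic, with random vectors standing in for transformer embeddings (illustration only; the actual workflow runs these steps on GPUs):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384)).astype(np.float32)  # stand-in for step 1
embeddings[1] = embeddings[0] + 0.001  # inject one near-duplicate pair
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# 2. Cluster embeddings with K-means
labels = KMeans(n_clusters=10, random_state=0).fit_predict(embeddings)

# 3./4. Within each cluster, flag pairs whose cosine similarity exceeds 1 - eps
eps = 0.01
duplicates = set()
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    sims = embeddings[idx] @ embeddings[idx].T  # cosine similarity of unit vectors
    _, j = np.where(np.triu(sims, k=1) > 1.0 - eps)
    duplicates.update(idx[j].tolist())  # keep the earlier document of each pair

print(sorted(duplicates))  # [1] -- the injected near-duplicate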

Note: Semantic deduplication offers two workflows:

  • TextSemanticDeduplicationWorkflow: For raw text input with automatic embedding generation

  • SemanticDeduplicationWorkflow: For pre-computed embeddings

For advanced users, semantic deduplication can be broken down into separate stages:

# Note: Pipeline, reader, and writer import paths may vary by NeMo Curator version
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

input_path = "/path/to/input/data"
embedding_output_path = "/path/to/embeddings"
semantic_workflow_path = "/path/to/semantic/output"

# 1. Create ID generator for consistent tracking
create_id_generator_actor()

# 2. Generate embeddings separately
embedding_pipeline = Pipeline(
    stages=[
        ParquetReader(file_paths=input_path, _generate_ids=True),
        EmbeddingCreatorStage(
            model_identifier="sentence-transformers/all-MiniLM-L6-v2",
            text_field="text"
        ),
        ParquetWriter(path=embedding_output_path, fields=["_curator_dedup_id", "embeddings"])
    ]
)
embedding_out = embedding_pipeline.run()

# 3. Run clustering and pairwise similarity
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=semantic_workflow_path,
    n_clusters=100,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
    eps=None  # Skip duplicate identification for analysis
)
semantic_out = semantic_workflow.run()

# 4. Analyze results and choose eps parameter
# (see the analysis sketch below)

# 5. Identify and remove duplicates
# (run duplicate identification with the chosen eps, then apply the
#  removal workflow described under "Removing Duplicates" below)
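
A minimal sketch of the analysis in step 4, assuming the pairwise stage writes parquet files containing a maximum cosine-similarity score per document (the path and the cosine_sim_score column name are assumptions; verify them against the output of your NeMo Curator version):

import pandas as pd

pairwise = pd.read_parquet("/path/to/semantic/output/pairwise")  # hypothetical location
print(pairwise["cosine_sim_score"].quantile([0.90, 0.99, 0.999]))

# If, say, the 99.9th percentile sits at 0.99, then eps = 1 - 0.99 = 0.01 would
# flag roughly the top 0.1% most similar documents as duplicates.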

This approach provides fine-grained control over each stage and enables analysis of intermediate results.

Each approach serves different use cases and offers different trade-offs between speed, accuracy, and the types of duplicates detected.


Deduplication Methods#

  • Exact Duplicate Removal: Identify character-for-character duplicates using hashing

  • Fuzzy Duplicate Removal: Identify near-duplicates using MinHash and LSH

  • Semantic Deduplication: Remove semantically similar documents using embeddings

Common Operations#

Document IDs#

Duplicate removal workflows require stable document identifiers.

  • Use AddId to add IDs at the start of your pipeline

  • Or use reader-based ID generation (_generate_ids, _assign_ids) backed by the ID Generator actor for stable integer IDs (see the sketch after this list)

  • Some workflows write an ID generator state file for later removal
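
A minimal sketch of the reader-based option, assuming the semantics described above (reader and ID-generator import paths follow the staged example earlier on this page; adjust to your installation):

from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.text.io.reader import ParquetReader

create_id_generator_actor()  # backs both modes below

# First pass: mint stable integer IDs into _curator_dedup_id while reading
first_read = ParquetReader(file_paths="/path/to/input", _generate_ids=True)

# Later passes over the same files: reattach the previously minted IDs
later_read = ParquetReader(file_paths="/path/to/input", _assign_ids=True)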

Outputs and Artifacts#

  • Exact duplicate identification:

    • ExactDuplicateIds/ (parquet with column id)

    • exact_id_generator.json

  • Fuzzy duplicate identification:

    • FuzzyDuplicateIds/ (parquet with column id)

    • fuzzy_id_generator.json

  • Semantic duplicate identification/removal:

    • output_path/duplicates/ (parquet with column id)

    • output_path/deduplicated/ (when perform_removal=True)
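
All of these artifacts are ordinary parquet datasets, so you can inspect them before removal, for example with pandas:

import pandas as pd

# Identification outputs hold a single `id` column listing documents to drop
dup_ids = pd.read_parquet("/path/to/output/ExactDuplicateIds")
print(f"{len(dup_ids):,} duplicate documents flagged")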

Removing Duplicates#

Use the Text Duplicates Removal workflow to apply a list of duplicate IDs to your original dataset.

from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input",
    ids_to_remove_path="/path/to/duplicates",
    output_path="/path/to/clean",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="id",
)

removal_workflow.run()
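
Here input_id_field names the ID column in your original dataset (such as the _curator_dedup_id column minted by reader-based ID generation), while ids_to_remove_duplicate_id_field names the ID column inside the duplicates parquet, which is id for the identification outputs listed above.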

Usage#

Here’s a quick comparison of the different deduplication approaches:

Table 11 Deduplication Method Comparison#

| Method                 | Best For                           | Speed     | Duplicate Types Detected                | GPU Required |
|------------------------|------------------------------------|-----------|-----------------------------------------|--------------|
| Exact Deduplication    | Identical copies                   | Very Fast | Character-for-character matches         | Required     |
| Fuzzy Deduplication    | Near-duplicates with small changes | Fast      | Content with minor edits, reformatting  | Required     |
| Semantic Deduplication | Similar meaning, different words   | Moderate  | Paraphrases, translations, rewrites     | Required     |

Quick Start Example#

# Import workflows directly from their modules (not from __init__.py)
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow

# Option 1: Exact deduplication (requires Ray + GPU)
exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    assign_id=True,  # Automatically assign unique IDs
    input_filetype="parquet"  # "parquet" or "jsonl"
)
exact_workflow.run()

# Option 2: Fuzzy deduplication (requires Ray + GPU)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    input_blocksize="1GiB",  # Default block size (differs from exact dedup)
    # MinHash + LSH parameters
    seed=42,
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13
)
fuzzy_workflow.run()

# Option 3: Semantic deduplication (requires GPU)
# For text with embedding generation

text_sem_workflow = TextSemanticDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output", 
    cache_path="/path/to/cache",
    text_field="text",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    n_clusters=100,
    perform_removal=False  # Set to True to remove duplicates, False to only identify
)
# Uses XennaExecutor by default for all stages
text_sem_workflow.run()

# Alternative: For pre-computed embeddings

sem_workflow = SemanticDeduplicationWorkflow(
    input_path="/path/to/embeddings/data",
    output_path="/path/to/output",
    n_clusters=100,
    id_field="id",
    embedding_field="embeddings"
)
# Requires executor for pairwise stage
sem_workflow.run()  # Uses XennaExecutor by default

Performance Considerations#

GPU Acceleration#

  • Exact deduplication: Requires Ray backend with GPU support for MD5 hashing operations. GPU acceleration provides significant speedup for large datasets through parallel processing

  • Fuzzy deduplication: Requires Ray backend with GPU support for MinHash computation and LSH operations. GPU acceleration is essential for processing large datasets efficiently

  • Semantic deduplication:

    • TextSemanticDeduplicationWorkflow: Requires GPU for embedding generation (transformer models), K-means clustering, and pairwise similarity computation

    • SemanticDeduplicationWorkflow: Requires GPU for K-means clustering and pairwise similarity operations when working with pre-computed embeddings

    • GPU acceleration is critical for feasible processing times, especially for embedding generation and similarity computations

Hardware Requirements#

  • GPU Requirements: All deduplication workflows require GPU acceleration

    • Exact and fuzzy deduplication require Ray distributed computing framework with GPU support for hash computations

    • Semantic deduplication requires GPU for transformer model inference, clustering algorithms, and similarity computations

    • Can use various executors (XennaExecutor, RayDataExecutor) with GPU support

  • Memory considerations: GPU memory requirements scale with dataset size, batch sizes, and embedding dimensions (for semantic deduplication)
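
As a rough, illustrative sizing example for semantic deduplication: embeddings for 100 million documents at 384 dimensions in float32 occupy about 100,000,000 × 384 × 4 bytes ≈ 154 GB, far more than a single GPU's memory, which is one reason the workflow clusters embeddings and compares documents within clusters rather than across the full dataset.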

Backend Setup#

For optimal performance, especially with large datasets, configure Ray backend appropriately:

from nemo_curator.core.client import RayClient

# Configure Ray cluster for deduplication workloads
client = RayClient(
    num_cpus=64,    # Adjust based on available cores
    num_gpus=4      # Total GPU memory should be roughly 2x the size of your embeddings
)
client.start()

try:
    # Run your deduplication workflow
    workflow.run()
finally:
    client.stop()

For very large datasets (TB-scale), consider running deduplication on distributed GPU clusters with Ray.

ID Generator for Large-Scale Operations#

For large-scale duplicate removal, use the ID Generator to ensure consistent document tracking:

from nemo_curator.stages.deduplication.id_generator import (
    create_id_generator_actor,
    write_id_generator_to_disk,
    kill_id_generator_actor
)
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

# Create and persist ID generator
create_id_generator_actor()
id_generator_path = "semantic_id_generator.json"
write_id_generator_to_disk(id_generator_path)
kill_id_generator_actor()

# Use saved ID generator in removal workflow
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path=input_path,
    ids_to_remove_path=duplicates_path,
    output_path=output_path,
    id_generator_path=id_generator_path,
    # ... other parameters
)

The ID Generator ensures that the same documents receive identical IDs across different workflow stages, enabling efficient duplicate removal.