Deduplication#
Remove duplicate and near-duplicate documents from text datasets using NeMo Curator’s GPU-accelerated deduplication workflows. Removing duplicates prevents overrepresentation of repeated content in language model training.
NeMo Curator provides three deduplication approaches: exact matching (MD5 hashing), fuzzy matching (MinHash + LSH), and semantic matching (embeddings). All methods are GPU-accelerated and integrate with the data processing pipeline.
How It Works#
NeMo Curator provides three deduplication approaches, each optimized for different duplicate types:
Exact Deduplication
Method: MD5 hashing
Detects: Character-for-character identical documents
Speed: Fastest
Computes MD5 hashes for each document’s text content and groups documents with identical hashes. Best for removing exact copies.
Code Example
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow
exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Identification only
    assign_id=True,
    input_filetype="parquet"
)
exact_workflow.run()
For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Exact Duplicate Removal for details.
Fuzzy Deduplication
Method: MinHash + Locality Sensitive Hashing (LSH)
Detects: Near-duplicates with minor edits (~80% similarity)
Speed: Fast
Generates MinHash signatures and uses LSH to find similar documents. Best for detecting documents with small formatting differences or typos.
Code Example
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Identification only
    input_blocksize="1GiB",
    seed=42,
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13
)
fuzzy_workflow.run()
For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Fuzzy Duplicate Removal for details.
Semantic Deduplication
Method: Embeddings + clustering + pairwise similarity
Detects: Semantically similar content (paraphrases, translations)
Speed: Moderate
Generates embeddings using transformer models, clusters them, and computes pairwise similarities. Best for meaning-based deduplication.
Code Example
from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow
text_workflow = TextSemanticDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    cache_path="/path/to/cache",
    text_field="text",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    n_clusters=100,
    eps=0.01,  # Similarity threshold
    perform_removal=True  # Complete deduplication
)
text_workflow.run()
Note: Two workflows are available:
- TextSemanticDeduplicationWorkflow: For raw text with automatic embedding generation
- SemanticDeduplicationWorkflow: For pre-computed embeddings
See Semantic Deduplication for details.
Advanced: Step-by-Step Semantic Deduplication
For fine-grained control, break semantic deduplication into separate stages:
# Pipeline and Parquet reader/writer import paths are assumed here; adjust
# them to match your NeMo Curator version.
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

# 1. Create ID generator
create_id_generator_actor()

# 2. Generate embeddings separately
embedding_pipeline = Pipeline(
    stages=[
        ParquetReader(file_paths=input_path, _generate_ids=True),
        EmbeddingCreatorStage(
            model_identifier="sentence-transformers/all-MiniLM-L6-v2",
            text_field="text"
        ),
        ParquetWriter(path=embedding_output_path, fields=["_curator_dedup_id", "embeddings"])
    ]
)
embedding_out = embedding_pipeline.run()
# 3. Run clustering and pairwise similarity
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=semantic_workflow_path,
    n_clusters=100,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
    eps=None  # Skip duplicate identification for analysis
)
semantic_out = semantic_workflow.run()
# 4. Analyze results and choose eps parameter
# 5. Identify and remove duplicates
This approach enables analysis of intermediate results and fine-grained control.
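The last two steps can be sketched as follows, continuing the variables from the example above. This is a sketch, not the definitive API usage: the eps value and the duplicates/ location under semantic_workflow_path are illustrative assumptions.

from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

# 4. Inspect the pairwise-similarity results written under semantic_workflow_path,
#    choose an eps threshold, then re-run identification with it.
dedup_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=semantic_workflow_path,
    n_clusters=100,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
    eps=0.01  # illustrative value chosen from the similarity analysis
)
dedup_workflow.run()

# 5. Remove the identified duplicates from the original dataset.
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path=input_path,
    ids_to_remove_path=f"{semantic_workflow_path}/duplicates",  # assumed location of the duplicate IDs
    output_path=output_path,
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path=id_generator_path  # written with write_id_generator_to_disk (see below)
)
removal_workflow.run()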
Deduplication Methods#
Choose a deduplication method based on your needs:
- Exact: Identify and remove character-for-character duplicates using MD5 hashing
- Fuzzy: Identify and remove near-duplicates using MinHash and LSH similarity
- Semantic: Remove semantically similar documents using embeddings
Common Operations#
Document IDs#
Duplicate removal workflows require stable document identifiers. Choose one approach:
- Use AddId to add IDs at the start of your pipeline
- Use reader-based ID generation (_generate_ids, _assign_ids) backed by the ID Generator actor for stable integer IDs
- Use existing IDs if your documents already have unique identifiers
Some workflows write an ID generator state file (*_id_generator.json) for later removal when IDs are auto-assigned.
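For example, a minimal sketch of the reader-based approach, combining pieces shown elsewhere on this page (the Pipeline and Parquet reader import paths are assumptions; adjust for your version):

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.deduplication.id_generator import (
    create_id_generator_actor,
    write_id_generator_to_disk,
)
from nemo_curator.stages.text.io.reader import ParquetReader  # assumed import path

# Start the ID Generator actor so readers can assign stable integer IDs.
create_id_generator_actor()

# _generate_ids=True adds a _curator_dedup_id column as documents are read.
pipeline = Pipeline(
    stages=[
        ParquetReader(file_paths="/path/to/input/data", _generate_ids=True),
        # ... deduplication or other processing stages ...
    ]
)
pipeline.run()

# Persist the actor state so removal workflows can reproduce the same IDs later.
write_id_generator_to_disk("/path/to/id_generator.json")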
Removing Duplicates#
Use TextDuplicatesRemovalWorkflow to apply duplicate IDs to your original dataset. Works with IDs from exact, fuzzy, or semantic deduplication.
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input",
    ids_to_remove_path="/path/to/duplicates",  # ExactDuplicateIds/, FuzzyDuplicateIds/, or duplicates/
    output_path="/path/to/clean",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/id_generator.json"  # Required when IDs were auto-assigned
)
removal_workflow.run()
ID Field Configuration
When assign_id=True (IDs auto-assigned):
- Duplicate IDs file contains the _curator_dedup_id column
- Set ids_to_remove_duplicate_id_field="_curator_dedup_id"
- id_generator_path is required

When assign_id=False (using existing IDs):
- Duplicate IDs file contains the column specified by id_field (e.g., "id")
- Set ids_to_remove_duplicate_id_field to match your id_field value
- id_generator_path is not required
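As an illustration of the assign_id=False path, a minimal sketch that keys both identification and removal on an existing id column (parameter names follow the descriptions above; id_field support on the identification workflow is assumed):

# Identification with existing IDs; id_field is assumed to name the column
# that already uniquely identifies each document.
exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    assign_id=False,
    id_field="id",
    perform_removal=False
)
exact_workflow.run()

# Removal keyed on the same existing column; no id_generator_path needed.
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/ExactDuplicateIds",
    output_path="/path/to/clean",
    input_id_field="id",
    ids_to_remove_duplicate_id_field="id"
)
removal_workflow.run()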
Outputs and Artifacts#
Each deduplication method produces specific output files and directories:
| Method | Duplicate IDs Location | ID Generator File | Deduplicated Output |
|---|---|---|---|
| Exact | ExactDuplicateIds/ | *_id_generator.json | Via TextDuplicatesRemovalWorkflow |
| Fuzzy | FuzzyDuplicateIds/ | *_id_generator.json | Via TextDuplicatesRemovalWorkflow |
| Semantic | duplicates/ | N/A | Written directly when perform_removal=True |
Column names:
- _curator_dedup_id when assign_id=True or IDs are auto-assigned
- Matches the id_field parameter when assign_id=False
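To confirm which ID column your duplicate-IDs files actually contain before configuring removal, a quick inspection sketch (assumes the IDs were written as Parquet, as in the examples above):

import pandas as pd

# Peek at one duplicate-IDs output and list its columns.
dup_ids = pd.read_parquet("/path/to/duplicates")  # e.g. ExactDuplicateIds/ or duplicates/
print(dup_ids.columns.tolist())  # expect ["_curator_dedup_id"] or your id_field
print(len(dup_ids), "duplicate document IDs")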
Choosing a Deduplication Method#
Compare deduplication methods to select the best approach for your dataset:
| Method | Best For | Speed | Duplicate Types | GPU Required |
|---|---|---|---|---|
| Exact | Identical copies | Very fast | Character-for-character matches | Required |
| Fuzzy | Near-duplicates with small changes | Fast | Minor edits, reformatting (~80% similarity) | Required |
| Semantic | Similar meaning, different words | Moderate | Paraphrases, translations, rewrites | Required |
Quick Decision Guide#
Use this guide to quickly select the right method:
Start with Exact if you have numerous identical documents or need the fastest speed
Use Fuzzy if you need to catch near-duplicates with minor formatting differences
Use Semantic for meaning-based deduplication on large, diverse datasets
When to Use Each Method
Exact Deduplication:
Removing identical copies of documents
Fast initial deduplication pass
Datasets with numerous exact duplicates
When speed is more important than detecting near-duplicates
Fuzzy Deduplication:
Removing near-duplicate documents with minor formatting differences
Detecting documents with small edits or typos
Fast deduplication when exact matching misses numerous duplicates
When speed is important but some near-duplicate detection is needed
Semantic Deduplication:
Removing semantically similar content (paraphrases, translations)
Large, diverse web-scale datasets
When meaning-based deduplication is more important than speed
Advanced use cases requiring embedding-based similarity detection
Combining Methods
You can combine deduplication methods for comprehensive duplicate removal:
Exact → Fuzzy → Semantic: Start with fastest methods, then apply more sophisticated methods
Exact → Semantic: Use exact for quick wins, then semantic for meaning-based duplicates
Fuzzy → Semantic: Use fuzzy for near-duplicates, then semantic for paraphrases
Run each method independently, then combine duplicate IDs before removal.
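For instance, a minimal sketch of merging the duplicate-ID outputs before a single removal pass, assuming each method was run with auto-assigned IDs so all files share the _curator_dedup_id column (directory names follow the outputs table above):

import pandas as pd

# Concatenate duplicate IDs from each method and drop overlaps.
combined = pd.concat(
    [
        pd.read_parquet("/path/to/output/ExactDuplicateIds"),
        pd.read_parquet("/path/to/output/FuzzyDuplicateIds"),
        pd.read_parquet("/path/to/output/duplicates"),  # semantic
    ]
)[["_curator_dedup_id"]].drop_duplicates()

# Write the merged IDs where a removal workflow can read them.
combined.to_parquet("/path/to/output/combined_duplicate_ids.parquet")

# Then point ids_to_remove_path in TextDuplicatesRemovalWorkflow at the
# combined file (see Removing Duplicates above).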
For detailed implementation guides, see Exact Duplicate Removal, Fuzzy Duplicate Removal, and Semantic Deduplication.
Performance Considerations
GPU Acceleration
All deduplication workflows require GPU acceleration:
Exact: Ray backend with GPU support for MD5 hashing operations
Fuzzy: Ray backend with GPU support for MinHash computation and LSH operations
Semantic: GPU required for embedding generation (transformer models), K-means clustering, and pairwise similarity computation
GPU acceleration provides significant speedup for large datasets through parallel processing.
Hardware Requirements
GPU: Required for all workflows (Ray with GPU support for exact/fuzzy, GPU for semantic)
Memory: GPU memory requirements scale with dataset size, batch sizes, and embedding dimensions
Executors: Can use various executors (XennaExecutor, RayDataExecutor) with GPU support
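As a sketch of passing an executor explicitly, reusing the embedding pipeline from the advanced example above (the XennaExecutor import path and the executor argument to Pipeline.run are assumptions to verify against your NeMo Curator version):

from nemo_curator.backends.xenna import XennaExecutor  # assumed import path

# Run the pipeline on a GPU-backed executor instead of the default.
executor = XennaExecutor()
embedding_pipeline.run(executor)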
Backend Setup
For optimal performance with large datasets, configure the Ray backend:
from nemo_curator.core.client import RayClient
client = RayClient(
    num_cpus=64,  # Adjust based on available cores
    num_gpus=4    # Total GPU memory should be roughly 2x the size of the embeddings
)
client.start()

try:
    workflow.run()
finally:
    client.stop()
For TB-scale datasets, consider distributed GPU clusters with Ray.
ID Generator for Large-Scale Operations
For large-scale duplicate removal, persist the ID Generator for consistent document tracking:
from nemo_curator.stages.deduplication.id_generator import (
    create_id_generator_actor,
    write_id_generator_to_disk,
    kill_id_generator_actor
)

create_id_generator_actor()

# ... run the identification workflow here so the actor records the assigned IDs ...

# Persist the actor state, then shut the actor down.
id_generator_path = "semantic_id_generator.json"
write_id_generator_to_disk(id_generator_path)
kill_id_generator_actor()

# Use the saved ID generator in the removal workflow
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path=input_path,
    ids_to_remove_path=duplicates_path,
    output_path=output_path,
    id_generator_path=id_generator_path,
    # ... other parameters
)
The ID Generator ensures consistent IDs across workflow stages.
Next Steps#
Ready to use deduplication?
New to deduplication: Start with Exact Duplicate Removal for the fastest approach
Need near-duplicate detection: See Fuzzy Duplicate Removal for MinHash-based matching
Require semantic matching: Explore Semantic Deduplication for meaning-based deduplication
For hands-on guidance: See Text Curation Tutorials for step-by-step examples.