***
description: >-
Remove duplicate and near-duplicate documents efficiently using
GPU-accelerated and semantic deduplication modules
categories:
* workflows
tags:
* deduplication
* fuzzy-dedup
* semantic-dedup
* exact-dedup
* gpu-accelerated
* minhash
personas:
* data-scientist-focused
* mle-focused
difficulty: intermediate
content_type: explanation
modality: text-only
***
# Deduplication
Remove duplicate and near-duplicate documents from text datasets using NeMo Curator's GPU-accelerated deduplication workflows. Removing duplicates prevents overrepresentation of repeated content in language model training.
NeMo Curator provides three deduplication approaches: exact matching (MD5 hashing), fuzzy matching (MinHash + LSH), and semantic matching (embeddings). All methods are GPU-accelerated and integrate with the [data processing pipeline](/about/concepts/text/data/processing).
## How It Works
Each approach is optimized for a different type of duplicate:
**Exact deduplication**
* **Method**: MD5 hashing
* **Detects**: Character-for-character identical documents
* **Speed**: Fastest
Computes MD5 hashes for each document's text content and groups documents with identical hashes. Best for removing exact copies.
```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow
ray_client = RayClient()
ray_client.start()
exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Identification only
    assign_id=True,
    input_filetype="parquet",
)
exact_workflow.run()
```
For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate IDs. See [Exact Duplicate Removal](/curate-text/process-data/deduplication/exact) for details.
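To make the hash-and-group idea concrete, here is a minimal CPU-only sketch using Python's standard library. It is an illustration of the technique, not NeMo Curator's implementation. The helper name `find_exact_duplicates` and the keep-the-lowest-ID policy are assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs: dict[int, str]) -> list[int]:
    """Return IDs to drop, keeping the lowest ID in each group of identical texts."""
    groups: dict[str, list[int]] = defaultdict(list)
    for doc_id, text in docs.items():
        # Identical text always produces an identical MD5 digest
        digest = hashlib.md5(text.encode("utf-8"), usedforsecurity=False).hexdigest()
        groups[digest].append(doc_id)

    duplicates: list[int] = []
    for ids in groups.values():
        duplicates.extend(sorted(ids)[1:])  # everything after the first is a duplicate
    return sorted(duplicates)


# Hypothetical sample data: documents 0 and 2 are character-for-character identical
docs = {
    0: "The quick brown fox.",
    1: "A different document.",
    2: "The quick brown fox.",
    3: "The quick brown fox!",  # differs by one character, so it is not an exact match
}
duplicate_ids = find_exact_duplicates(docs)
print(duplicate_ids)
```

Document 3 survives because a single changed character yields a different hash. Catching that kind of variation is the job of fuzzy deduplication.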
**Fuzzy deduplication**
* **Method**: MinHash + Locality Sensitive Hashing (LSH)
* **Detects**: Near-duplicates with minor edits (\~80% similarity)
* **Speed**: Fast
Generates MinHash signatures and uses LSH to find similar documents. Best for detecting documents with small formatting differences or typos.
```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
ray_client = RayClient()
ray_client.start()
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Identification only
    input_blocksize="1GiB",
    seed=42,
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
)
fuzzy_workflow.run()
```
For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate IDs. See [Fuzzy Duplicate Removal](/curate-text/process-data/deduplication/fuzzy) for details.
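The `num_bands` and `minhashes_per_band` values in the example above can be related to the \~80% similarity figure through the standard LSH banding analysis. The sketch below assumes that textbook model applies: two documents with Jaccard similarity `s` become candidates with probability `1 - (1 - s^r)^b`, where `b` is the number of bands and `r` the number of hashes per band. The function names are hypothetical, and the source does not confirm that NeMo Curator uses exactly this formula.

```python
def candidate_probability(similarity: float, num_bands: int, minhashes_per_band: int) -> float:
    """Probability that two documents collide in at least one LSH band."""
    return 1.0 - (1.0 - similarity ** minhashes_per_band) ** num_bands


def approximate_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Similarity near which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


# Values taken from the FuzzyDeduplicationWorkflow example above
threshold = approximate_threshold(num_bands=20, minhashes_per_band=13)
print(f"approximate similarity threshold: {threshold:.3f}")

for s in (0.5, 0.7, 0.8, 0.9):
    p = candidate_probability(s, num_bands=20, minhashes_per_band=13)
    print(f"similarity {s:.1f} -> candidate probability {p:.4f}")
```

Under this model, raising `num_bands` lowers the effective threshold (more candidates, more work), while raising `minhashes_per_band` makes matching stricter.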
**Semantic deduplication**
* **Method**: Embeddings + clustering + pairwise similarity
* **Detects**: Semantically similar content (paraphrases, translations)
* **Speed**: Moderate
Generates embeddings using transformer models, clusters them, and computes pairwise similarities. Best for meaning-based deduplication.
```python
from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow
text_workflow = TextSemanticDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    cache_path="/path/to/cache",
    text_field="text",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    n_clusters=100,
    eps=0.01,  # Similarity threshold
    perform_removal=True,  # Complete deduplication
)
text_workflow.run()
```
**Note**: Two workflows are available:
* `TextSemanticDeduplicationWorkflow`: For raw text with automatic embedding generation
* `SemanticDeduplicationWorkflow`: For pre-computed embeddings
See [Semantic Deduplication](/curate-text/process-data/deduplication/semdedup) for details.
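The role of `eps` can be illustrated with a small pure-Python sketch of the pairwise-similarity step inside a single cluster. It assumes that a document counts as a duplicate when its cosine similarity to an already-kept document is at least `1 - eps`. That interpretation, the greedy keep-first policy, and the function names are assumptions for illustration, not the library's implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def find_semantic_duplicates(cluster: dict[int, list[float]], eps: float) -> list[int]:
    """Return IDs whose similarity to an earlier kept document is >= 1 - eps."""
    kept: list[int] = []
    duplicates: list[int] = []
    for doc_id in sorted(cluster):
        is_duplicate = any(
            cosine_similarity(cluster[doc_id], cluster[k]) >= 1.0 - eps for k in kept
        )
        (duplicates if is_duplicate else kept).append(doc_id)
    return duplicates


# Hypothetical 3-dimensional embeddings for one cluster
cluster = {
    0: [1.0, 0.0, 0.0],
    1: [0.999, 0.01, 0.0],  # nearly parallel to document 0
    2: [0.0, 1.0, 0.0],     # orthogonal to document 0
}
print(find_semantic_duplicates(cluster, eps=0.01))
print(find_semantic_duplicates(cluster, eps=0.00001))
```

A larger `eps` tolerates more distance between embeddings and therefore removes more documents. A very small `eps` removes only near-identical embeddings.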
For fine-grained control, break semantic deduplication into separate stages:
```python
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
# The three import paths below are assumptions; verify them against your installed version
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

# Placeholder paths; replace with your own locations
input_path = "/path/to/input/data"
embedding_output_path = "/path/to/embeddings"
semantic_workflow_path = "/path/to/semantic/output"

# 1. Create ID generator
create_id_generator_actor()

# 2. Generate embeddings separately
embedding_pipeline = Pipeline(
    stages=[
        ParquetReader(file_paths=input_path, _generate_ids=True),
        EmbeddingCreatorStage(
            model_identifier="sentence-transformers/all-MiniLM-L6-v2",
            text_field="text",
        ),
        ParquetWriter(path=embedding_output_path, fields=["_curator_dedup_id", "embeddings"]),
    ]
)
embedding_out = embedding_pipeline.run()

# 3. Run clustering and pairwise similarity
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=semantic_workflow_path,
    n_clusters=100,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
    eps=None,  # Skip duplicate identification for analysis
)
semantic_out = semantic_workflow.run()

# 4. Analyze results and choose eps parameter
# 5. Identify and remove duplicates
This approach enables analysis of intermediate results and fine-grained control.
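One way to approach the "choose eps" step is to sweep candidate values and count how many documents each would flag. The sketch below assumes you have already extracted, for each document, its maximum cosine similarity to another document in the same cluster. The sample values and the function name `duplicates_per_eps` are hypothetical, and the `1 - eps` cutoff is the same assumed interpretation of `eps` as a distance threshold.

```python
def duplicates_per_eps(max_similarities: list[float], eps_values: list[float]) -> dict[float, int]:
    """Count documents whose best in-cluster match has similarity >= 1 - eps."""
    return {
        eps: sum(1 for s in max_similarities if s >= 1.0 - eps)
        for eps in eps_values
    }


# Hypothetical per-document maximum similarities from the pairwise stage
max_similarities = [0.9995, 0.995, 0.97, 0.93, 0.80, 0.42]

counts = duplicates_per_eps(max_similarities, eps_values=[0.001, 0.01, 0.05, 0.1])
for eps, n in counts.items():
    share = n / len(max_similarities)
    print(f"eps={eps}: {n} flagged ({share:.0%} of documents)")
```

Plotting or tabulating these counts against `eps` shows where the removal rate starts to climb sharply, which is a common place to inspect samples by hand before committing to a value.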
***
## Deduplication Methods
Choose a deduplication method based on your needs:
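* [Exact Duplicate Removal](/curate-text/process-data/deduplication/exact): Identify and remove character-for-character duplicates using MD5 hashing. Tags: hashing, fast, gpu-accelerated
* [Fuzzy Duplicate Removal](/curate-text/process-data/deduplication/fuzzy): Identify and remove near-duplicates using MinHash and LSH similarity. Tags: minhash, lsh, gpu-accelerated
* [Semantic Deduplication](/curate-text/process-data/deduplication/semdedup): Remove semantically similar documents using embeddings. Tags: embeddings, gpu-accelerated, meaning-based, advanced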
## Common Operations
### Document IDs
Duplicate removal workflows require stable document identifiers. Choose one approach:
* **Use `AddId`** to add IDs at the start of your pipeline
* **Use reader-based ID generation** (`_generate_ids`, `_assign_ids`) backed by the ID Generator actor for stable integer IDs
* **Use existing IDs** if your documents already have unique identifiers
Some workflows write an ID generator state file (`*_id_generator.json`) for later removal when IDs are auto-assigned.
### Removing Duplicates
Use `TextDuplicatesRemovalWorkflow` to apply duplicate IDs to your original dataset. Works with IDs from exact, fuzzy, or semantic deduplication.
```python
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
removal_workflow = TextDuplicatesRemovalWorkflow(
input_path="/path/to/input",
ids_to_remove_path="/path/to/duplicates", # ExactDuplicateIds/, FuzzyDuplicateIds/, or duplicates/
output_path="/path/to/clean",
input_filetype="parquet",
input_id_field="_curator_dedup_id",
ids_to_remove_duplicate_id_field="_curator_dedup_id",
id_generator_path="/path/to/id_generator.json" # Required when IDs were auto-assigned
)
removal_workflow.run()
```
**When `assign_id=True`** (IDs auto-assigned):
* Duplicate IDs file contains `_curator_dedup_id` column
* Set `ids_to_remove_duplicate_id_field="_curator_dedup_id"`
* `id_generator_path` is required
**When `assign_id=False`** (using existing IDs):
* Duplicate IDs file contains the column specified by `id_field` (e.g., `"id"`)
* Set `ids_to_remove_duplicate_id_field` to match your `id_field` value
* `id_generator_path` is not required
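Conceptually, removal is an anti-join: keep every record whose ID does not appear in the duplicate ID set. The sketch below shows that logic on in-memory records so the role of the ID field name is easy to see. It is not how `TextDuplicatesRemovalWorkflow` is implemented, and the helper `remove_duplicates` is hypothetical.

```python
def remove_duplicates(records: list[dict], ids_to_remove: list, id_field: str) -> list[dict]:
    """Keep only records whose `id_field` value is not in `ids_to_remove`."""
    drop = set(ids_to_remove)  # set lookup keeps the filter linear in the number of records
    return [record for record in records if record[id_field] not in drop]


# Case 1: IDs were auto-assigned, so both sides use `_curator_dedup_id`
auto_records = [
    {"_curator_dedup_id": 0, "text": "first"},
    {"_curator_dedup_id": 1, "text": "first"},
    {"_curator_dedup_id": 2, "text": "second"},
]
clean_auto = remove_duplicates(auto_records, ids_to_remove=[1], id_field="_curator_dedup_id")

# Case 2: the dataset already has its own ID column, here named "id"
own_records = [
    {"id": "doc-a", "text": "first"},
    {"id": "doc-b", "text": "first"},
]
clean_own = remove_duplicates(own_records, ids_to_remove=["doc-b"], id_field="id")

print([r["_curator_dedup_id"] for r in clean_auto])
print([r["id"] for r in clean_own])
```

The filter only works when the ID column in the dataset and the ID column in the duplicates file refer to the same identifier space, which is why the field names must match.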
### Outputs and Artifacts
Each deduplication method produces specific output files and directories:
| Method | Duplicate IDs Location | ID Generator File | Deduplicated Output |
| -------- | ----------------------------------- | ------------------------------------------------ | ------------------------------------------------------- |
| Exact | `ExactDuplicateIds/` (parquet) | `exact_id_generator.json` (if `assign_id=True`) | Via `TextDuplicatesRemovalWorkflow` |
| Fuzzy | `FuzzyDuplicateIds/` (parquet) | `fuzzy_id_generator.json` (if IDs auto-assigned) | Via `TextDuplicatesRemovalWorkflow` |
| Semantic | `output_path/duplicates/` (parquet) | N/A | `output_path/deduplicated/` (if `perform_removal=True`) |
**Column names**:
* `_curator_dedup_id` when `assign_id=True` or IDs are auto-assigned
* Matches `id_field` parameter when `assign_id=False`
## Choosing a Deduplication Method
Compare deduplication methods to select the best approach for your dataset:
| Method | Best For | Speed | Duplicate Types | GPU Required |
| ------------ | ---------------------------------- | --------- | -------------------------------------------- | ------------ |
| **Exact** | Identical copies | Very fast | Character-for-character matches | Required |
| **Fuzzy** | Near-duplicates with small changes | Fast | Minor edits, reformatting (\~80% similarity) | Required |
| **Semantic** | Similar meaning, different words | Moderate | Paraphrases, translations, rewrites | Required |
### Quick Decision Guide
Use this guide to quickly select the right method:
* **Start with Exact** if you have numerous identical documents or need the fastest speed
* **Use Fuzzy** if you need to catch near-duplicates with minor formatting differences
* **Use Semantic** for meaning-based deduplication on large, diverse datasets
**Exact Deduplication**:
* Removing identical copies of documents
* Fast initial deduplication pass
* Datasets with numerous exact duplicates
* When speed is more important than detecting near-duplicates
**Fuzzy Deduplication**:
* Removing near-duplicate documents with minor formatting differences
* Detecting documents with small edits or typos
* Fast deduplication when exact matching misses numerous duplicates
* When speed is important but some near-duplicate detection is needed
**Semantic Deduplication**:
* Removing semantically similar content (paraphrases, translations)
* Large, diverse web-scale datasets
* When meaning-based deduplication is more important than speed
* Advanced use cases requiring embedding-based similarity detection
You can combine deduplication methods for comprehensive duplicate removal:
1. **Exact → Fuzzy → Semantic**: Start with the fastest method, then apply progressively more sophisticated ones
2. **Exact → Semantic**: Use exact for quick wins, then semantic for meaning-based duplicates
3. **Fuzzy → Semantic**: Use fuzzy for near-duplicates, then semantic for paraphrases
Run each method independently, then combine duplicate IDs before removal.
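Combining duplicate IDs amounts to taking the union of the per-method ID lists before a single removal pass. The sketch below assumes all three methods were run against the same ID space (for example, the same auto-assigned IDs). The sample IDs and the helper name are hypothetical.

```python
def combine_duplicate_ids(*id_lists: list[int]) -> list[int]:
    """Union several duplicate ID lists into one sorted, de-duplicated list."""
    combined: set[int] = set()
    for ids in id_lists:
        combined.update(ids)
    return sorted(combined)


# Hypothetical duplicate IDs reported by each method
exact_ids = [3, 7]
fuzzy_ids = [7, 12]
semantic_ids = [12, 20, 3]

all_ids = combine_duplicate_ids(exact_ids, fuzzy_ids, semantic_ids)
print(all_ids)
```

IDs flagged by more than one method appear once in the result, so the removal step sees each document at most once.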
For detailed implementation guides, see:
* [Exact Duplicate Removal](/curate-text/process-data/deduplication/exact)
* [Fuzzy Duplicate Removal](/curate-text/process-data/deduplication/fuzzy)
* [Semantic Deduplication](/curate-text/process-data/deduplication/semdedup)
### GPU Acceleration
All deduplication workflows require GPU acceleration:
* **Exact**: Ray backend with GPU support for MD5 hashing operations
* **Fuzzy**: Ray backend with GPU support for MinHash computation and LSH operations
* **Semantic**: GPU required for embedding generation (transformer models), K-means clustering, and pairwise similarity computation
GPU acceleration provides significant speedup for large datasets through parallel processing.
### Hardware Requirements
* **GPU**: Required for all workflows (Ray with GPU support for exact/fuzzy, GPU for semantic)
* **Memory**: GPU memory requirements scale with dataset size, batch sizes, and embedding dimensions
* **Executors**: Can use various executors (XennaExecutor, RayDataExecutor) with GPU support
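As a rough guide to how embedding memory scales, the raw size of an embedding matrix is the number of documents times the embedding dimension times the bytes per value. The sketch below assumes float32 storage (4 bytes per value) and uses the 384-dimensional output of `sentence-transformers/all-MiniLM-L6-v2`. The document count is a hypothetical example, and real GPU usage also includes model weights, batches, and intermediate buffers.

```python
def embedding_memory_bytes(num_docs: int, embedding_dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for a dense embedding matrix, excluding any overhead."""
    return num_docs * embedding_dim * bytes_per_value


# Hypothetical dataset of 10 million documents with 384-dimensional float32 embeddings
size_bytes = embedding_memory_bytes(num_docs=10_000_000, embedding_dim=384)
size_gib = size_bytes / 2**30
print(f"{size_bytes:,} bytes = {size_gib:.2f} GiB")
```

Doubling either the document count or the embedding dimension doubles this figure, which is why larger embedding models need more GPU memory at the clustering and pairwise stages.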
### Backend Setup
For optimal performance with large datasets, configure Ray backend:
```python
from nemo_curator.core.client import RayClient
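client = RayClient(
    num_cpus=64,  # Adjust based on available cores
    num_gpus=4,  # Adjust based on available GPUs; total GPU memory should be roughly 2x the size of the embeddings
)
client.start()
try:
    # `workflow` is any deduplication workflow constructed as shown earlier
    workflow.run()
finally:
    client.stop()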
```
For TB-scale datasets, consider distributed GPU clusters with Ray.
### ID Generator for Large-Scale Operations
For large-scale duplicate removal, persist the ID Generator for consistent document tracking:
```python
from nemo_curator.stages.deduplication.id_generator import (
    create_id_generator_actor,
    kill_id_generator_actor,
    write_id_generator_to_disk,
)
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

create_id_generator_actor()
# ... run the pipeline or workflow that assigns document IDs here ...
id_generator_path = "semantic_id_generator.json"
write_id_generator_to_disk(id_generator_path)
kill_id_generator_actor()

# Use saved ID generator in removal workflow
# (input_path, duplicates_path, and output_path are placeholders defined elsewhere)
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path=input_path,
    ids_to_remove_path=duplicates_path,
    output_path=output_path,
    id_generator_path=id_generator_path,
    # ... other parameters
)
The ID Generator ensures consistent IDs across workflow stages.
## Next Steps
**Ready to use deduplication?**
* **New to deduplication**: Start with [Exact Duplicate Removal](/curate-text/process-data/deduplication/exact) for the fastest approach
* **Need near-duplicate detection**: See [Fuzzy Duplicate Removal](/curate-text/process-data/deduplication/fuzzy) for MinHash-based matching
* **Require semantic matching**: Explore [Semantic Deduplication](/curate-text/process-data/deduplication/semdedup) for meaning-based deduplication
**For hands-on guidance**: See [Text Curation Tutorials](/curate-text/tutorials) for step-by-step examples.