Deduplication
Remove duplicate and near-duplicate documents from text datasets using NeMo Curator’s GPU-accelerated deduplication workflows. Removing duplicates prevents overrepresentation of repeated content in language model training.
NeMo Curator provides three deduplication approaches: exact matching (MD5 hashing), fuzzy matching (MinHash + LSH), and semantic matching (embeddings). All methods are GPU-accelerated and integrate with the data processing pipeline.
How It Works
NeMo Curator provides three deduplication approaches, each optimized for different duplicate types:
Exact
- Method: MD5 hashing
- Detects: Character-for-character identical documents
- Speed: Fastest
Computes MD5 hashes for each document’s text content and groups documents with identical hashes. Best for removing exact copies.
Code Example
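The core idea can be sketched in plain Python with the standard library's hashlib (a toy illustration of the hashing-and-grouping step, not the GPU-accelerated workflow API):

```python
import hashlib

def exact_duplicate_ids(docs):
    """Group documents by the MD5 hash of their text and return the IDs of
    every document after the first in each group (the duplicates to remove)."""
    first_seen = {}   # hex digest -> ID of the first document with that hash
    duplicates = []
    for doc_id, text in docs:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates.append(doc_id)
        else:
            first_seen[digest] = doc_id
    return duplicates

docs = [(0, "the cat sat"), (1, "a dog ran"), (2, "the cat sat")]
print(exact_duplicate_ids(docs))  # [2]
```

Because hashing is exact, a single changed character produces a different digest, which is why near-duplicates need the fuzzy or semantic methods below.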
For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Exact Duplicate Removal for details.
Fuzzy
- Method: MinHash + Locality Sensitive Hashing (LSH)
- Detects: Near-duplicates with minor edits (~80% similarity)
- Speed: Fast
Generates MinHash signatures and uses LSH to find similar documents. Best for detecting documents with small formatting differences or typos.
Code Example
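A stdlib-only sketch of MinHash signatures with LSH banding (a toy illustration; the shingle size, num_perm, and bands values are illustrative, and the real workflow runs these steps on GPU at scale):

```python
import hashlib

def shingles(text, n=3):
    """Character n-gram shingle set for one document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_perm=32):
    """One minimum per salted hash function over the shingle set."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def lsh_candidate_pairs(texts, num_perm=32, bands=8):
    """Band each signature; documents sharing any full band become candidates."""
    rows = num_perm // bands
    buckets = {}
    for doc_id, text in enumerate(texts):
        sig = minhash_signature(text, num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    return {(a, c) for ids in buckets.values() for a in ids for c in ids if a < c}

texts = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog!",
    "an entirely different sentence about ships",
]
print(lsh_candidate_pairs(texts))
```

The first two documents differ by a single character, so most of their minhashes agree and at least one band almost certainly collides, making them a candidate pair; the unrelated third document shares few shingles and lands in different buckets.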
For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Fuzzy Duplicate Removal for details.
Semantic
- Method: Embeddings + clustering + pairwise similarity
- Detects: Semantically similar content (paraphrases, translations)
- Speed: Moderate
Generates embeddings using transformer models, clusters them, and computes pairwise similarities. Best for meaning-based deduplication.
Code Example
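The pairwise-similarity step can be illustrated in plain Python. The vectors below are toy stand-ins for transformer embeddings, all documents are treated as one cluster, and the 0.9 threshold is illustrative (the real workflow first clusters embeddings with K-means so that pairwise comparison only happens within clusters):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_duplicates(embeddings, threshold=0.9):
    """Pairwise similarity within one cluster; flag the later document of
    any pair above the threshold as a duplicate."""
    duplicates = set()
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                duplicates.add(j)
    return sorted(duplicates)

# Toy stand-ins for transformer embeddings: documents 0 and 2 are "paraphrases".
vecs = [
    [0.90, 0.10, 0.00],
    [0.00, 0.20, 0.90],
    [0.88, 0.15, 0.05],
]
print(semantic_duplicates(vecs))  # [2]
```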
Note: Two workflows are available:
- TextSemanticDeduplicationWorkflow: for raw text with automatic embedding generation
- SemanticDeduplicationWorkflow: for pre-computed embeddings
See Semantic Deduplication for details.
Advanced: Step-by-Step Semantic Deduplication
For fine-grained control, break semantic deduplication into separate stages:
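As a toy illustration of the staged flow, here is a pipeline whose intermediate artifacts (embeddings, cluster assignments) are plain Python objects you can inspect between stages. The 2-d "embeddings", fixed centroids, and thresholds are all stand-ins; the actual stage classes and parameters should be taken from the Semantic Deduplication guide:

```python
import math

# Stage 1: embedding generation. Toy 2-d vectors (length, word count) stand in
# for transformer-model output.
def embedding_stage(docs):
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    return {i: normalize([float(len(t)), float(t.count(" ") + 1)])
            for i, t in enumerate(docs)}

# Stage 2: assign each embedding to the nearest of k fixed centroids
# (a stand-in for K-means clustering).
def clustering_stage(embeddings, centroids):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    clusters = {}
    for doc_id, vec in embeddings.items():
        c = min(range(len(centroids)), key=lambda k: dist2(vec, centroids[k]))
        clusters.setdefault(c, []).append(doc_id)
    return clusters

# Stage 3: pairwise cosine similarity within each cluster only.
def pairwise_stage(embeddings, clusters, threshold=0.99):
    dups = set()
    for ids in clusters.values():
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                i, j = ids[a], ids[b]
                if sum(x * y for x, y in zip(embeddings[i], embeddings[j])) >= threshold:
                    dups.add(j)
    return sorted(dups)

docs = ["the cat sat on the mat",
        "the cat sat on the mat!",
        "completely different"]
embeddings = embedding_stage(docs)  # inspectable intermediate artifact
clusters = clustering_stage(embeddings, centroids=[[1.0, 0.25], [0.0, 1.0]])
dups = pairwise_stage(embeddings, clusters)
print(dups)  # [1]
```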
This approach lets you inspect intermediate results and tune each stage independently.
Deduplication Methods
Choose a deduplication method based on your needs:
- Exact: Identify and remove character-for-character duplicates using MD5 hashing
- Fuzzy: Identify and remove near-duplicates using MinHash and LSH
- Semantic: Remove semantically similar documents using embeddings
Common Operations
Document IDs
Duplicate removal workflows require stable document identifiers. Choose one approach:
- Use AddId to add IDs at the start of your pipeline
- Use reader-based ID generation (_generate_ids, _assign_ids) backed by the ID Generator actor for stable integer IDs
- Use existing IDs if your documents already have unique identifiers
When IDs are auto-assigned, some workflows write an ID generator state file (*_id_generator.json) that later removal runs require.
Removing Duplicates
Use TextDuplicatesRemovalWorkflow to apply duplicate IDs to your original dataset. Works with IDs from exact, fuzzy, or semantic deduplication.
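Conceptually, removal is an anti-join between the original dataset and the flagged IDs. A minimal sketch with plain Python rows (the column names are illustrative):

```python
# Rows of the original dataset, each carrying a stable document ID.
dataset = [
    {"id": 0, "text": "the cat sat"},
    {"id": 1, "text": "a dog ran"},
    {"id": 2, "text": "the cat sat"},
]

# IDs flagged by an exact, fuzzy, or semantic identification run.
duplicate_ids = {2}

# Anti-join: keep only rows whose ID was not flagged.
deduplicated = [row for row in dataset if row["id"] not in duplicate_ids]
print([row["id"] for row in deduplicated])  # [0, 1]
```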
ID Field Configuration
When assign_id=True (IDs auto-assigned):
- Duplicate IDs file contains the _curator_dedup_id column
- Set ids_to_remove_duplicate_id_field="_curator_dedup_id"
- id_generator_path is required

When assign_id=False (using existing IDs):
- Duplicate IDs file contains the column specified by id_field (e.g., "id")
- Set ids_to_remove_duplicate_id_field to match your id_field value
- id_generator_path is not required
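As a sketch, the two cases differ only in the keyword arguments below; the state-file path is hypothetical, and the full TextDuplicatesRemovalWorkflow signature should be checked against the API reference:

```python
# Case 1: IDs were auto-assigned (assign_id=True during identification).
auto_id_config = dict(
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="exact_id_generator.json",  # hypothetical state-file path
)

# Case 2: documents already had unique IDs (assign_id=False, id_field="id").
existing_id_config = dict(
    ids_to_remove_duplicate_id_field="id",  # must match your id_field value
    # id_generator_path is not required in this case
)
```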
Outputs and Artifacts
Each deduplication method produces specific output files and directories:
Column names:
- _curator_dedup_id when assign_id=True or IDs are auto-assigned
- Matches the id_field parameter when assign_id=False
Choosing a Deduplication Method
Compare deduplication methods to select the best approach for your dataset:
Quick Decision Guide
Use this guide to quickly select the right method:
- Start with Exact if you have numerous identical documents or need the fastest speed
- Use Fuzzy if you need to catch near-duplicates with minor formatting differences
- Use Semantic for meaning-based deduplication on large, diverse datasets
When to Use Each Method
Exact Deduplication:
- Removing identical copies of documents
- Fast initial deduplication pass
- Datasets with numerous exact duplicates
- When speed is more important than detecting near-duplicates
Fuzzy Deduplication:
- Removing near-duplicate documents with minor formatting differences
- Detecting documents with small edits or typos
- Fast deduplication when exact matching misses numerous duplicates
- When speed is important but some near-duplicate detection is needed
Semantic Deduplication:
- Removing semantically similar content (paraphrases, translations)
- Large, diverse web-scale datasets
- When meaning-based deduplication is more important than speed
- Advanced use cases requiring embedding-based similarity detection
Combining Methods
You can combine deduplication methods for comprehensive duplicate removal:
- Exact → Fuzzy → Semantic: Start with fastest methods, then apply more sophisticated methods
- Exact → Semantic: Use exact for quick wins, then semantic for meaning-based duplicates
- Fuzzy → Semantic: Use fuzzy for near-duplicates, then semantic for paraphrases
Run each method independently, then combine duplicate IDs before removal.
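Combining the independent runs amounts to a set union of the duplicate IDs (the ID values below are illustrative); the union also collapses documents flagged by more than one method:

```python
# Duplicate IDs produced by independent runs of each method (illustrative).
exact_dups = {101, 205}
fuzzy_dups = {205, 310}
semantic_dups = {310, 422}

# Union before a single removal pass; 205 and 310 are each counted once.
all_dups = exact_dups | fuzzy_dups | semantic_dups
print(sorted(all_dups))  # [101, 205, 310, 422]
```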
For detailed implementation guides, see Exact Duplicate Removal, Fuzzy Duplicate Removal, and Semantic Deduplication.
Performance Considerations
GPU Acceleration
All deduplication workflows require GPU acceleration:
- Exact: Ray backend with GPU support for MD5 hashing operations
- Fuzzy: Ray backend with GPU support for MinHash computation and LSH operations
- Semantic: GPU required for embedding generation (transformer models), K-means clustering, and pairwise similarity computation
GPU acceleration provides significant speedup for large datasets through parallel processing.
Hardware Requirements
- GPU: Required for all workflows (Ray with GPU support for exact/fuzzy, GPU for semantic)
- Memory: GPU memory requirements scale with dataset size, batch sizes, and embedding dimensions
- Executors: Can use various executors (XennaExecutor, RayDataExecutor) with GPU support
Backend Setup
For optimal performance with large datasets, configure Ray backend:
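One way to bring up a GPU-enabled Ray cluster is with the Ray CLI; the GPU count and address below are illustrative, and a single-node setup needs only the head command:

```shell
# On the head node (GPU count is illustrative)
ray start --head --num-gpus=8

# On each worker node, pointing at the head node's address
ray start --address=<head-node-ip>:6379 --num-gpus=8
```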
For TB-scale datasets, consider distributed GPU clusters with Ray.
ID Generator for Large-Scale Operations
For large-scale duplicate removal, persist the ID Generator for consistent document tracking:
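The idea behind the persisted state file can be illustrated with a toy stand-in for the ID Generator actor: it hands out monotonically increasing integer IDs and saves its counter to JSON, so a later removal run resumes from the same state and assigns consistent IDs (the class and file name here are illustrative, not NeMo Curator's API):

```python
import json
import os
import tempfile

class MinimalIdGenerator:
    """Toy stand-in for the ID Generator actor: monotonically increasing
    integer IDs with state persisted to a JSON file."""

    def __init__(self, next_id=0):
        self.next_id = next_id

    def assign(self, num_docs):
        """Hand out the next num_docs consecutive IDs."""
        start = self.next_id
        self.next_id += num_docs
        return list(range(start, start + num_docs))

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"next_id": self.next_id}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(json.load(f)["next_id"])

gen = MinimalIdGenerator()
batch1 = gen.assign(3)  # [0, 1, 2]
state = os.path.join(tempfile.mkdtemp(), "exact_id_generator.json")
gen.save(state)

# A later run resumes from the persisted state instead of restarting at 0.
resumed = MinimalIdGenerator.load(state)
print(resumed.assign(2))  # [3, 4]
```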
The ID Generator ensures consistent IDs across workflow stages.
Next Steps
Ready to use deduplication?
- New to deduplication: Start with Exact Duplicate Removal for the fastest approach
- Need near-duplicate detection: See Fuzzy Duplicate Removal for MinHash-based matching
- Require semantic matching: Explore Semantic Deduplication for meaning-based deduplication
For hands-on guidance: See Text Curation Tutorials for step-by-step examples.