---
description: >-
  Remove duplicate and near-duplicate documents efficiently using
  GPU-accelerated and semantic deduplication modules
categories:
  - workflows
tags:
  - deduplication
  - fuzzy-dedup
  - semantic-dedup
  - exact-dedup
  - gpu-accelerated
  - minhash
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: explanation
modality: text-only
---

# Deduplication

Remove duplicate and near-duplicate documents from text datasets using NeMo Curator's GPU-accelerated deduplication workflows. Removing duplicates prevents overrepresentation of repeated content in language model training.

NeMo Curator provides three deduplication approaches: exact matching (MD5 hashing), fuzzy matching (MinHash + LSH), and semantic matching (embeddings). All methods are GPU-accelerated and integrate with the [data processing pipeline](/about/concepts/text/data/processing).

## How It Works

NeMo Curator provides three deduplication approaches, each optimized for different duplicate types:

<Tabs>
  <Tab title="Exact">
    **Method**: MD5 hashing
    **Detects**: Character-for-character identical documents
    **Speed**: Fastest

    Computes MD5 hashes for each document's text content and groups documents with identical hashes. Best for removing exact copies.

    <Accordion title="Code Example">
      ```python
      from nemo_curator.core.client import RayClient
      from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

      ray_client = RayClient()
      ray_client.start()

      exact_workflow = ExactDeduplicationWorkflow(
          input_path="/path/to/input/data",
          output_path="/path/to/output",
          text_field="text",
          perform_removal=False,  # Identification only
          assign_id=True,
          input_filetype="parquet"
      )
      exact_workflow.run()
      ```

      For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate IDs. See [Exact Duplicate Removal](/curate-text/process-data/deduplication/exact) for details.
    </Accordion>
  </Tab>

<Tab title="Fuzzy">
  **Method**: MinHash + Locality Sensitive Hashing (LSH)
  **Detects**: Near-duplicates with minor edits (\~80% similarity)
  **Speed**: Fast

  Generates MinHash signatures and uses LSH to find similar documents. Best for detecting documents with small formatting differences or typos.

  <Accordion title="Code Example">
    ```python
    from nemo_curator.core.client import RayClient
    from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

    ray_client = RayClient()
    ray_client.start()

    fuzzy_workflow = FuzzyDeduplicationWorkflow(
        input_path="/path/to/input/data",
        cache_path="/path/to/cache",
        output_path="/path/to/output",
        text_field="text",
        perform_removal=False,  # Identification only
        input_blocksize="1GiB",
        seed=42,
        char_ngrams=24,
        num_bands=20,
        minhashes_per_band=13
    )
    fuzzy_workflow.run()
    ```

    For removal, use `TextDuplicatesRemovalWorkflow` with the generated duplicate IDs. See [Fuzzy Duplicate Removal](/curate-text/process-data/deduplication/fuzzy) for details.
  </Accordion>
</Tab>

<Tab title="Semantic">
  **Method**: Embeddings + clustering + pairwise similarity
  **Detects**: Semantically similar content (paraphrases, translations)
  **Speed**: Moderate

  Generates embeddings using transformer models, clusters them, and computes pairwise similarities. Best for meaning-based deduplication.

  <Accordion title="Code Example">
    ```python
    from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow

    text_workflow = TextSemanticDeduplicationWorkflow(
        input_path="/path/to/input/data",
        output_path="/path/to/output",
        cache_path="/path/to/cache",
        text_field="text",
        model_identifier="sentence-transformers/all-MiniLM-L6-v2",
        n_clusters=100,
        eps=0.01,  # Similarity threshold
        perform_removal=True  # Complete deduplication
    )
    text_workflow.run()
    ```

    **Note**: Two workflows are available:

    * `TextSemanticDeduplicationWorkflow`: For raw text with automatic embedding generation
    * `SemanticDeduplicationWorkflow`: For pre-computed embeddings

    See [Semantic Deduplication](/curate-text/process-data/deduplication/semdedup) for details.
  </Accordion>
</Tab>
</Tabs>

<Accordion title="Advanced: Step-by-Step Semantic Deduplication">
  For fine-grained control, break semantic deduplication into separate stages:

  ```python
  # Import paths for Pipeline, ParquetReader, and ParquetWriter are assumed here;
  # adjust them to match your NeMo Curator version
  from nemo_curator.pipeline import Pipeline
  from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
  from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
  from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
  from nemo_curator.stages.text.io.reader import ParquetReader
  from nemo_curator.stages.text.io.writer import ParquetWriter

  # Placeholder paths: replace with your own locations
  input_path = "/path/to/input/data"
  embedding_output_path = "/path/to/embeddings"
  semantic_workflow_path = "/path/to/semantic/output"

  # 1. Create ID generator
  create_id_generator_actor()

  # 2. Generate embeddings separately
  embedding_pipeline = Pipeline(
      stages=[
          ParquetReader(file_paths=input_path, _generate_ids=True),
          EmbeddingCreatorStage(
              model_identifier="sentence-transformers/all-MiniLM-L6-v2",
              text_field="text"
          ),
          ParquetWriter(path=embedding_output_path, fields=["_curator_dedup_id", "embeddings"])
      ]
  )
  embedding_out = embedding_pipeline.run()

  # 3. Run clustering and pairwise similarity
  semantic_workflow = SemanticDeduplicationWorkflow(
      input_path=embedding_output_path,
      output_path=semantic_workflow_path,
      n_clusters=100,
      id_field="_curator_dedup_id",
      embedding_field="embeddings",
      eps=None  # Skip duplicate identification for analysis
  )
  semantic_out = semantic_workflow.run()

  # 4. Analyze results and choose eps parameter
  # 5. Identify and remove duplicates
  ```

  This approach enables analysis of intermediate results and fine-grained control.
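
  For step 4, here is a hypothetical sketch of inspecting pairwise similarity scores to choose `eps`; the output subdirectory and column name are assumptions, so check the artifacts your run actually produces:

  ```python
  import pandas as pd

  # Hypothetical location of the pairwise similarity output
  similarities = pd.read_parquet("/path/to/semantic/output/pairwise_results")

  # Pick an eps threshold from the similarity distribution
  print(similarities["cosine_sim_score"].describe())
  ```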
</Accordion>

***

## Deduplication Methods

Choose a deduplication method based on your needs:

<Cards>
  <Card title="Exact Duplicate Removal" href="/curate-text/process-data/deduplication/exact">
    Identify and remove character-for-character duplicates using MD5 hashing
  </Card>

  <Card title="Fuzzy Duplicate Removal" href="/curate-text/process-data/deduplication/fuzzy">
    Identify and remove near-duplicates using MinHash and LSH similarity
  </Card>

  <Card title="Semantic Deduplication" href="/curate-text/process-data/deduplication/semdedup">
    Remove semantically similar documents using embeddings
  </Card>
</Cards>

## Common Operations

### Document IDs

Duplicate removal workflows require stable document identifiers. Choose one approach:

* **Use `AddId`** to add IDs at the start of your pipeline
* **Use reader-based ID generation** (`_generate_ids`, `_assign_ids`) backed by the ID Generator actor for stable integer IDs
* **Use existing IDs** if your documents already have unique identifiers

When IDs are auto-assigned, some workflows write an ID generator state file (`*_id_generator.json`) for use during later removal.
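
For example, here is a minimal sketch of reader-based ID generation; the `Pipeline` and `ParquetReader` import paths are assumptions, so adjust them to your NeMo Curator version:

```python
from nemo_curator.pipeline import Pipeline  # import path assumed
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.text.io.reader import ParquetReader  # import path assumed

# The ID Generator actor backs stable integer ID assignment
create_id_generator_actor()

pipeline = Pipeline(
    stages=[
        # _generate_ids=True attaches a stable `_curator_dedup_id` to each document
        ParquetReader(file_paths="/path/to/input/data", _generate_ids=True),
        # ... downstream deduplication stages ...
    ]
)
```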

### Removing Duplicates

Use `TextDuplicatesRemovalWorkflow` to apply duplicate IDs to your original dataset. It works with IDs from exact, fuzzy, or semantic deduplication.

```python
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input",
    ids_to_remove_path="/path/to/duplicates",  # ExactDuplicateIds/, FuzzyDuplicateIds/, or duplicates/
    output_path="/path/to/clean",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/id_generator.json"  # Required when IDs were auto-assigned
)
removal_workflow.run()
```

<Accordion title="ID Field Configuration">
  **When `assign_id=True`** (IDs auto-assigned):

  * Duplicate IDs file contains `_curator_dedup_id` column
  * Set `ids_to_remove_duplicate_id_field="_curator_dedup_id"`
  * `id_generator_path` is required

  **When `assign_id=False`** (using existing IDs):

  * Duplicate IDs file contains the column specified by `id_field` (e.g., `"id"`)
  * Set `ids_to_remove_duplicate_id_field` to match your `id_field` value
  * `id_generator_path` not required
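
  A hedged sketch of the `assign_id=False` configuration, assuming your documents carry an existing `id` column:

  ```python
  removal_workflow = TextDuplicatesRemovalWorkflow(
      input_path="/path/to/input",
      ids_to_remove_path="/path/to/duplicates",
      output_path="/path/to/clean",
      input_id_field="id",                    # your existing ID column
      ids_to_remove_duplicate_id_field="id",  # matches the id_field used during identification
      # id_generator_path is not needed when using existing IDs
  )
  removal_workflow.run()
  ```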
</Accordion>

### Outputs and Artifacts

Each deduplication method produces specific output files and directories:

| Method   | Duplicate IDs Location              | ID Generator File                                | Deduplicated Output                                     |
| -------- | ----------------------------------- | ------------------------------------------------ | ------------------------------------------------------- |
| Exact    | `ExactDuplicateIds/` (parquet)      | `exact_id_generator.json` (if `assign_id=True`)  | Via `TextDuplicatesRemovalWorkflow`                     |
| Fuzzy    | `FuzzyDuplicateIds/` (parquet)      | `fuzzy_id_generator.json` (if IDs auto-assigned) | Via `TextDuplicatesRemovalWorkflow`                     |
| Semantic | `output_path/duplicates/` (parquet) | N/A                                              | `output_path/deduplicated/` (if `perform_removal=True`) |

**Column names**:

* `_curator_dedup_id` when `assign_id=True` or IDs are auto-assigned
* Matches `id_field` parameter when `assign_id=False`

## Choosing a Deduplication Method

Compare deduplication methods to select the best approach for your dataset:

| Method       | Best For                           | Speed     | Duplicate Types                              | GPU Required |
| ------------ | ---------------------------------- | --------- | -------------------------------------------- | ------------ |
| **Exact**    | Identical copies                   | Very fast | Character-for-character matches              | Required     |
| **Fuzzy**    | Near-duplicates with small changes | Fast      | Minor edits, reformatting (\~80% similarity) | Required     |
| **Semantic** | Similar meaning, different words   | Moderate  | Paraphrases, translations, rewrites          | Required     |

### Quick Decision Guide

Use this guide to quickly select the right method:

* **Start with Exact** if you have numerous identical documents or need the fastest speed
* **Use Fuzzy** if you need to catch near-duplicates with minor formatting differences
* **Use Semantic** for meaning-based deduplication on large, diverse datasets

<Accordion title="When to Use Each Method">
  **Exact Deduplication**:

  * Removing identical copies of documents
  * Fast initial deduplication pass
  * Datasets with numerous exact duplicates
  * When speed is more important than detecting near-duplicates

  **Fuzzy Deduplication**:

  * Removing near-duplicate documents with minor formatting differences
  * Detecting documents with small edits or typos
  * Fast deduplication when exact matching misses numerous duplicates
  * When speed is important but some near-duplicate detection is needed

  **Semantic Deduplication**:

  * Removing semantically similar content (paraphrases, translations)
  * Large, diverse web-scale datasets
  * When meaning-based deduplication is more important than speed
  * Advanced use cases requiring embedding-based similarity detection
</Accordion>

<Accordion title="Combining Methods">
  You can combine deduplication methods for comprehensive duplicate removal:

  1. **Exact → Fuzzy → Semantic**: Start with fastest methods, then apply more sophisticated methods
  2. **Exact → Semantic**: Use exact for quick wins, then semantic for meaning-based duplicates
  3. **Fuzzy → Semantic**: Use fuzzy for near-duplicates, then semantic for paraphrases

  Run each method independently, then combine duplicate IDs before removal.
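
  As a hedged sketch, assuming each run wrote its duplicate IDs as parquet files sharing a `_curator_dedup_id` column, you could union them with pandas before a single removal pass:

  ```python
  import pandas as pd

  # Placeholder paths for the duplicate ID outputs of separate runs
  exact_ids = pd.read_parquet("/path/to/exact/ExactDuplicateIds")
  fuzzy_ids = pd.read_parquet("/path/to/fuzzy/FuzzyDuplicateIds")

  # Union the IDs; a document flagged by both methods appears once
  combined = pd.concat([exact_ids, fuzzy_ids]).drop_duplicates(subset="_curator_dedup_id")
  combined.to_parquet("/path/to/combined_duplicates/ids.parquet")
  ```

  Pass the combined output as `ids_to_remove_path` to `TextDuplicatesRemovalWorkflow`.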
</Accordion>

For detailed implementation guides, see:

* [Exact Duplicate Removal](/curate-text/process-data/deduplication/exact)
* [Fuzzy Duplicate Removal](/curate-text/process-data/deduplication/fuzzy)
* [Semantic Deduplication](/curate-text/process-data/deduplication/semdedup)

<Accordion title="Performance Considerations">
  ### GPU Acceleration

  All deduplication workflows require GPU acceleration:

  * **Exact**: Ray backend with GPU support for MD5 hashing operations
  * **Fuzzy**: Ray backend with GPU support for MinHash computation and LSH operations
  * **Semantic**: GPU required for embedding generation (transformer models), K-means clustering, and pairwise similarity computation

  GPU acceleration provides significant speedup for large datasets through parallel processing.

  ### Hardware Requirements

  * **GPU**: Required for all workflows (Ray with GPU support for exact and fuzzy; transformer inference, clustering, and similarity computation for semantic)
  * **Memory**: GPU memory requirements scale with dataset size, batch sizes, and embedding dimensions
  * **Executors**: Can use various executors (XennaExecutor, RayDataExecutor) with GPU support, as sketched below
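
  A hedged sketch of passing an executor explicitly; the `XennaExecutor` import path and the `run()` signature are assumptions, so check your NeMo Curator version:

  ```python
  from nemo_curator.backends.xenna import XennaExecutor  # import path assumed

  executor = XennaExecutor()
  pipeline.run(executor)  # run a previously constructed Pipeline on the Xenna executor
  ```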

  ### Backend Setup

  For optimal performance with large datasets, configure Ray backend:

  ```python
  from nemo_curator.core.client import RayClient

  client = RayClient(
      num_cpus=64,    # Adjust based on available cores
      num_gpus=4      # Adjust based on available GPUs
  )
  client.start()

  try:
      workflow.run()
  finally:
      client.stop()
  ```

  For TB-scale datasets, consider distributed GPU clusters with Ray.

  ### ID Generator for Large-Scale Operations

  For large-scale duplicate removal, persist the ID Generator for consistent document tracking:

  ```python
  from nemo_curator.stages.deduplication.id_generator import (
      create_id_generator_actor,
      write_id_generator_to_disk,
      kill_id_generator_actor
  )

  create_id_generator_actor()

  # ... run the duplicate identification workflow here ...

  # Persist the ID mapping so removal can reuse the same IDs
  id_generator_path = "semantic_id_generator.json"
  write_id_generator_to_disk(id_generator_path)
  kill_id_generator_actor()

  # Use saved ID generator in removal workflow
  removal_workflow = TextDuplicatesRemovalWorkflow(
      input_path=input_path,
      ids_to_remove_path=duplicates_path,
      output_path=output_path,
      id_generator_path=id_generator_path,
      # ... other parameters
  )
  ```

  The ID Generator ensures consistent IDs across workflow stages.
</Accordion>

## Next Steps

**Ready to use deduplication?**

* **New to deduplication**: Start with [Exact Duplicate Removal](/curate-text/process-data/deduplication/exact) for the fastest approach
* **Need near-duplicate detection**: See [Fuzzy Duplicate Removal](/curate-text/process-data/deduplication/fuzzy) for MinHash-based matching
* **Require semantic matching**: Explore [Semantic Deduplication](/curate-text/process-data/deduplication/semdedup) for meaning-based deduplication

**For hands-on guidance**: See [Text Curation Tutorials](/curate-text/tutorials) for step-by-step examples.
