Exact Duplicate Removal

Remove character-for-character duplicate documents using NeMo Curator’s exact duplicate removal workflow. This method computes MD5 hashes for each document’s text and identifies documents with identical hashes as duplicates.

For an overview of all duplicate removal options, refer to Deduplication.

How It Works

Exact deduplication uses MD5 hashing to identify identical documents:

  1. Computes MD5 hash for each document’s text content
  2. Groups documents by identical hash values
  3. Identifies duplicates and saves IDs for removal

This method targets character-for-character duplicates and is recommended for removing exact copies of documents.
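The three steps above can be sketched in plain Python. This is only an illustration of the hash-and-group idea, not NeMo Curator's implementation: the function name `find_exact_duplicates` is hypothetical, and keeping the first document seen in each group is an assumption made for the example.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs: dict[str, str]) -> list[str]:
    """Return IDs of documents whose text exactly matches an earlier document."""
    # Steps 1 and 2: hash each document's text and group IDs by hash value.
    groups: dict[str, list[str]] = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)

    # Step 3: within each group, keep the first ID (an assumption made for
    # this sketch) and mark the remaining IDs for removal.
    to_remove: list[str] = []
    for ids in groups.values():
        to_remove.extend(ids[1:])
    return to_remove


docs = {
    "a": "hello world",
    "b": "hello world",  # exact copy of "a"
    "c": "Hello world",  # differs by one character, so not an exact duplicate
    "d": "hello world",  # exact copy of "a"
}
duplicate_ids = find_exact_duplicates(docs)
print(duplicate_ids)  # ['b', 'd']
```

Document "c" survives because exact matching is case-sensitive. Near-duplicates of that kind fall outside this method.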

Before You Start

Prerequisites:

  • Ray cluster with GPU support (required for distributed processing)
  • Stable document identifiers for removal (either existing IDs or IDs assigned by the workflow)

Quick Start

The following example first identifies exact duplicates, then removes them:

from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

ray_client = RayClient()
ray_client.start()

# Step 1: Identify duplicates
exact_workflow = ExactDeduplicationWorkflow(
    input_path="input_data/",
    output_path="./results",
    text_field="text",
    assign_id=True,
    perform_removal=False,
    input_filetype="parquet"
)
exact_workflow.run()
# Duplicate IDs saved to ./results/ExactDuplicateIds/

# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/ExactDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/exact_id_generator.json"
)
removal_workflow.run()
# Clean dataset saved to ./deduplicated/

Configuration

Configure exact deduplication using these key parameters:

Key Configuration Parameters

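| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input_path | str or list[str] | None | Path or list of paths to the input data |
| output_path | str | Required | Directory to write duplicate IDs and ID generator |
| text_field | str | "text" | Name of the text field in input data |
| assign_id | bool | True | Whether to automatically assign unique IDs |
| id_field | str or None | None | Name of the existing ID field (used when assign_id=False) |
| input_filetype | str | "parquet" | Input file format ("parquet" or "jsonl") |
| input_blocksize | str or int | "2GiB" | Input block size |
| perform_removal | bool | False | Reserved; must remain False. Exact removal is performed with TextDuplicatesRemovalWorkflow. |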
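The constraints in this table can be checked before a run is launched. The helper below is a minimal sketch: `check_exact_dedup_config` is a hypothetical name, not part of NeMo Curator, and the rule that id_field must be set when assign_id=False is an assumption inferred from the ID field notes on this page.

```python
from typing import Optional


def check_exact_dedup_config(
    assign_id: bool = True,
    id_field: Optional[str] = None,
    input_filetype: str = "parquet",
    perform_removal: bool = False,
) -> list[str]:
    """Return a list of problems found in the settings (empty if none)."""
    problems: list[str] = []
    # perform_removal is reserved and must remain False.
    if perform_removal:
        problems.append("perform_removal must remain False")
    # Only the two documented input formats are accepted.
    if input_filetype not in ("parquet", "jsonl"):
        problems.append(f"unsupported input_filetype: {input_filetype!r}")
    # Assumption: without assigned IDs, an existing ID column must be named.
    if not assign_id and id_field is None:
        problems.append("id_field is required when assign_id=False")
    return problems


print(check_exact_dedup_config())
print(check_exact_dedup_config(assign_id=False, input_filetype="csv"))
```

Returning a list of messages rather than raising on the first problem lets a caller report every misconfiguration at once.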

Removing Duplicates

After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:

from nemo_curator.core.client import RayClient
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

ray_client = RayClient()
ray_client.start()

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/ExactDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/exact_id_generator.json"  # Required if assign_id=True
)
removal_workflow.run()

ID Field Configuration

When assign_id=True:

  • Duplicate IDs file contains _curator_dedup_id column
  • Set ids_to_remove_duplicate_id_field="_curator_dedup_id"
  • id_generator_path is required

When assign_id=False:

  • Duplicate IDs file contains the column specified by id_field (e.g., "id")
  • Set ids_to_remove_duplicate_id_field to match your id_field value
  • id_generator_path is not required
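The two cases above can be captured in a small helper that builds the ID-related keyword arguments for TextDuplicatesRemovalWorkflow. This is a sketch under stated assumptions: `removal_id_kwargs` is a hypothetical name, and setting input_id_field to the same value as id_field when assign_id=False is a guess that this page does not confirm.

```python
import os
from typing import Optional


def removal_id_kwargs(
    dedup_output_path: str,
    assign_id: bool,
    id_field: Optional[str] = None,
) -> dict[str, str]:
    """Build the ID-related removal arguments for either assign_id setting."""
    kwargs = {"ids_to_remove_path": os.path.join(dedup_output_path, "ExactDuplicateIds")}
    if assign_id:
        # IDs were assigned by the workflow, so the generator state is needed.
        kwargs["input_id_field"] = "_curator_dedup_id"
        kwargs["ids_to_remove_duplicate_id_field"] = "_curator_dedup_id"
        kwargs["id_generator_path"] = os.path.join(dedup_output_path, "exact_id_generator.json")
    else:
        if id_field is None:
            raise ValueError("id_field must be given when assign_id=False")
        # Assumption: the input data uses the same column name as id_field.
        kwargs["input_id_field"] = id_field
        kwargs["ids_to_remove_duplicate_id_field"] = id_field
    return kwargs


assigned = removal_id_kwargs("./results", assign_id=True)
existing = removal_id_kwargs("./results", assign_id=False, id_field="id")
print(sorted(assigned))
print(sorted(existing))
```

The resulting dictionary could be unpacked into the removal workflow alongside input_path, output_path, and input_filetype.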

Output Format

The exact deduplication process produces the following directory structure:

output_path/
├── ExactDuplicateIds/          # Duplicate identification results
│   └── *.parquet               # Parquet files with document IDs to remove
└── exact_id_generator.json     # ID generator mapping (if assign_id=True)
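Before starting the removal step, it can help to confirm that this layout exists. The sketch below uses only the standard library. `summarize_exact_dedup_output` is a hypothetical helper, and the demo builds a throwaway directory with placeholder files instead of reading real workflow output.

```python
import tempfile
from pathlib import Path


def summarize_exact_dedup_output(output_path: str) -> dict:
    """Report which expected exact-deduplication outputs exist under output_path."""
    root = Path(output_path)
    id_files = sorted((root / "ExactDuplicateIds").glob("*.parquet"))
    return {
        "duplicate_id_files": [p.name for p in id_files],
        "has_id_generator": (root / "exact_id_generator.json").is_file(),
    }


# Demo on a temporary directory that mimics the layout; file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    ids_dir = Path(tmp) / "ExactDuplicateIds"
    ids_dir.mkdir()
    (ids_dir / "part-0.parquet").touch()  # empty placeholder, not a real Parquet file
    (Path(tmp) / "exact_id_generator.json").write_text("{}")
    summary = summarize_exact_dedup_output(tmp)

print(summary)
```

If has_id_generator is False after a run with assign_id=True, the removal workflow would lack the id_generator_path it requires.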

File Formats

The workflow produces these output files:

  1. Duplicate IDs (ExactDuplicateIds/*.parquet):

    • Contains document IDs to remove
    • Format: Parquet files with a single ID column
    • Column name depends on assign_id:
      • When assign_id=True: Column is "_curator_dedup_id"
      • When assign_id=False: Column matches the id_field parameter (e.g., "id")
    • Important: Contains only the IDs of documents to remove, not the full document content
  2. ID Generator (exact_id_generator.json):

    • JSON file containing ID generator state
    • Required for removal workflow when assign_id=True
    • Ensures consistent ID mapping across workflow stages

Performance Considerations

Performance characteristics:

  • Uses MD5 hashing over the configured text field to derive duplicate groups
  • Runs as a Ray-based workflow and writes duplicate IDs to the ExactDuplicateIds/ directory
  • Stores only document IDs to remove in the output files, not full document content

Best practices:

  • Use input_blocksize="2GiB" for optimal performance
  • Clear output directory between runs
  • Use assign_id=True for consistent ID tracking
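One way to clear the output directory between runs is sketched below with the standard library. `reset_output_dir` is a hypothetical helper. It deletes everything under the given path, including any earlier ExactDuplicateIds/ results and exact_id_generator.json, so it should only be pointed at a directory dedicated to this workflow.

```python
import shutil
import tempfile
from pathlib import Path


def reset_output_dir(output_path: str) -> Path:
    """Delete output_path if it exists, then recreate it empty."""
    root = Path(output_path)
    if root.exists():
        shutil.rmtree(root)  # removes all previous results under output_path
    root.mkdir(parents=True)
    return root


# Demo: simulate leftovers from a previous run, then reset the directory.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results"
    (out / "ExactDuplicateIds").mkdir(parents=True)
    (out / "exact_id_generator.json").write_text("{}")
    reset_output_dir(str(out))
    leftover = list(out.iterdir())

print(leftover)  # []
```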

For comparison with other deduplication methods and guidance on when to use exact deduplication, refer to the Deduplication overview.