Exact Duplicate Removal

Remove character-for-character duplicate documents using NeMo Curator’s exact duplicate removal workflow. This method computes MD5 hashes for each document’s text and identifies documents with identical hashes as duplicates.

For an overview of all duplicate removal options, refer to Deduplication.

How It Works

Exact deduplication uses MD5 hashing to identify identical documents:

1. Computes an MD5 hash of each document’s text content
2. Groups documents by identical hash values
3. Identifies the duplicates in each group and saves their IDs for removal

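Because any difference in the text, even a single whitespace character, produces a different hash, this method finds only character-for-character copies. It is the recommended way to remove exact copies of documents, but it does not detect near-duplicates.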
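The three steps above can be sketched in plain Python. This is a minimal, single-process illustration built on the standard library's `hashlib`, not NeMo Curator's implementation. The function name `find_exact_duplicate_ids`, the `(doc_id, text)` input shape, the UTF-8 encoding, and the choice to keep the first document in each group are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicate_ids(docs):
    """Return IDs of documents whose text exactly matches an earlier document.

    `docs` is an iterable of (doc_id, text) pairs. Hypothetical helper for
    illustration only.
    """
    # Steps 1 and 2: hash each document's text and group IDs by hash value.
    groups = defaultdict(list)
    for doc_id, text in docs:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)

    # Step 3: in each group, keep the first ID and mark the rest for removal.
    to_remove = []
    for ids in groups.values():
        to_remove.extend(ids[1:])
    return to_remove


docs = [
    (0, "The quick brown fox."),
    (1, "A different document."),
    (2, "The quick brown fox."),
    (3, "The quick brown fox. "),  # trailing space: not an exact duplicate
]
print(find_exact_duplicate_ids(docs))  # [2]
```

Document 3 differs from documents 0 and 2 only by a trailing space, so it hashes differently and is kept.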

Before You Start

Prerequisites:

- Ray cluster with GPU support (required for distributed processing)
- Stable document identifiers for removal (either existing IDs or IDs assigned by the workflow)

Quick Start

Get started with exact deduplication using these examples:

Two-Step Process

Identify duplicates, then remove them:

from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

# Step 1: Identify duplicates
exact_workflow = ExactDeduplicationWorkflow(
    input_path="input_data/",
    output_path="./results",
    text_field="text",
    assign_id=True,
    perform_removal=False,
    input_filetype="parquet"
)
exact_workflow.run()
# Duplicate IDs saved to ./results/ExactDuplicateIds/

# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/ExactDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/exact_id_generator.json"
)
removal_workflow.run()
# Clean dataset saved to ./deduplicated/

Minimal Example

from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

exact_workflow = ExactDeduplicationWorkflow(
    input_path="input_data/",
    output_path="./results",
    text_field="text",
    assign_id=True,
    perform_removal=False
)
exact_workflow.run()
# Duplicate IDs saved to ./results/ExactDuplicateIds/ (no documents are removed in this step)

Configuration

Configure exact deduplication using these key parameters:

Table 12 Key Configuration Parameters
Parameter	Type	Default	Description
`input_path`	str | list[str]	None	Path(s) to input files or directories
`output_path`	str	Required	Directory to write duplicate IDs and ID generator
`text_field`	str	"text"	Name of the text field in input data
`assign_id`	bool	True	Whether to automatically assign unique IDs
`id_field`	str | None	None	Existing ID field name (if `assign_id=False`)
`input_filetype`	str	"parquet"	Input file format ("parquet" or "jsonl")
`input_blocksize`	str | int	"2GiB"	Size of input blocks for processing
`perform_removal`	bool	False	Reserved; must remain `False`. Exact removal is performed with `TextDuplicatesRemovalWorkflow`.
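The constraints in the table can be checked before launching a job. The sketch below is a hypothetical pre-flight helper, not part of NeMo Curator: the name `check_exact_dedup_config` and its error messages are made up for illustration, and the rule that `id_field` must be set when `assign_id=False` is an assumption inferred from the parameter descriptions.

```python
from typing import Optional

SUPPORTED_FILETYPES = {"parquet", "jsonl"}


def check_exact_dedup_config(
    output_path: str,
    assign_id: bool = True,
    id_field: Optional[str] = None,
    input_filetype: str = "parquet",
    perform_removal: bool = False,
) -> list:
    """Return a list of problems found in the settings (empty if none)."""
    problems = []
    if not output_path:
        problems.append("output_path is required")
    if input_filetype not in SUPPORTED_FILETYPES:
        problems.append(f"unsupported input_filetype: {input_filetype!r}")
    if perform_removal:
        problems.append("perform_removal must remain False")
    # Assumption: reusing existing IDs means naming the field that holds them.
    if not assign_id and id_field is None:
        problems.append("id_field is needed when assign_id=False")
    return problems


print(check_exact_dedup_config("./results"))  # []
print(check_exact_dedup_config("./results", assign_id=False))
# ['id_field is needed when assign_id=False']
```

Returning a list of problems rather than raising on the first one lets a caller report every misconfiguration at once.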

Removing Duplicates

After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:

from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/ExactDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/exact_id_generator.json"  # Required if assign_id=True
)
removal_workflow.run()

Output Format

The exact deduplication process produces the following directory structure:

output_path/
├── ExactDuplicateIds/              # Duplicate identification results
│   └── *.parquet                   # Parquet files with document IDs to remove
└── exact_id_generator.json         # ID generator mapping (if assign_id=True)
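After a run, it can be useful to confirm that the expected outputs are in place before starting removal. The sketch below uses only `pathlib`. `summarize_dedup_output` is a hypothetical helper, not part of NeMo Curator, and it checks file names only; it does not open the Parquet files.

```python
from pathlib import Path


def summarize_dedup_output(output_path):
    """Report which expected outputs exist under `output_path`."""
    root = Path(output_path)
    ids_dir = root / "ExactDuplicateIds"
    # List the duplicate-ID Parquet files, if the directory was created.
    id_files = sorted(ids_dir.glob("*.parquet")) if ids_dir.is_dir() else []
    return {
        "duplicate_id_files": [p.name for p in id_files],
        # Only expected when the workflow ran with assign_id=True.
        "has_id_generator": (root / "exact_id_generator.json").is_file(),
    }


print(summarize_dedup_output("./results"))
```

An empty `duplicate_id_files` list means either that the directory is missing or that it contains no Parquet files, so the identification step may need to be rerun or checked.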

File Formats

The workflow produces these output files:

Duplicate IDs (ExactDuplicateIds/*.parquet):
- Contains only the IDs of documents to remove, not the full document content
- Format: Parquet files with a single ID column
- Column name depends on assign_id:
  - When assign_id=True: Column is "_curator_dedup_id"
  - When assign_id=False: Column matches the id_field parameter (e.g., "id")

ID Generator (exact_id_generator.json):
- JSON file containing ID generator state
- Required for removal workflow when assign_id=True
- Ensures consistent ID mapping across workflow stages
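The column-naming rules above determine which ID-related arguments the removal step needs. The sketch below derives them from the identification settings. `removal_id_settings` is a hypothetical helper, not a NeMo Curator function. The `assign_id=True` branch mirrors the removal examples on this page; the `assign_id=False` branch (same field name on both sides, no ID generator file) is an assumption based on the column-naming rule and should be verified against your data.

```python
from pathlib import Path
from typing import Optional


def removal_id_settings(
    output_path: str, assign_id: bool, id_field: Optional[str] = None
) -> dict:
    """Build ID-related keyword arguments for the removal step (hypothetical)."""
    if assign_id:
        column = "_curator_dedup_id"
        return {
            "input_id_field": column,
            "ids_to_remove_duplicate_id_field": column,
            # The generator state is written next to ExactDuplicateIds/.
            "id_generator_path": str(Path(output_path) / "exact_id_generator.json"),
        }
    if id_field is None:
        raise ValueError("id_field is needed when assign_id=False")
    # Assumption: with existing IDs, both sides use the same field name
    # and no ID generator file is involved.
    return {
        "input_id_field": id_field,
        "ids_to_remove_duplicate_id_field": id_field,
    }


print(removal_id_settings("results", assign_id=True))
print(removal_id_settings("results", assign_id=False, id_field="id"))
```

The returned dictionary could be unpacked into the `TextDuplicatesRemovalWorkflow(...)` call alongside the path and file type arguments.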

For comparison with other deduplication methods and guidance on when to use exact deduplication, refer to the Deduplication overview.