Adding Document IDs#
Add unique identifiers to each document in your text dataset for tracking and deduplication workflows.
How It Works#
Document IDs are required for:
Deduplication workflows - Track which documents are duplicates
Pipeline tracking - Monitor documents through processing stages
Dataset versioning - Identify documents across different versions
Usage#
Basic Usage#
from nemo_curator.stages.text.modules import AddId
# Add to your pipeline
pipeline.add_stage(AddId(id_field="doc_id"))
Configuration Options#
# Customize ID generation
pipeline.add_stage(AddId(
id_field="document_id", # Field name for IDs
id_prefix="corpus_v2", # Optional prefix
overwrite=True # Overwrite existing IDs
))
Parameters#
Parameter |
Type |
Default |
Description |
---|---|---|---|
|
|
Required |
Field name where IDs will be stored |
|
|
|
Optional prefix for IDs |
|
|
|
Whether to overwrite existing ID fields |
ID Format#
Generated IDs follow this pattern:
Without prefix:
{task_uuid}_{index}
With prefix:
{prefix}_{task_uuid}_{index}
Examples:
a1b2c3d4-e5f6-7890-abcd-ef1234567890_0
corpus_v1_a1b2c3d4-e5f6-7890-abcd-ef1234567890_1
Complete Example#
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.modules import AddId
from nemo_curator.stages.text.io.writer import JsonlWriter
# Create pipeline
pipeline = Pipeline(name="add_ids")
# Add stages
pipeline.add_stage(JsonlReader(file_paths="input/*.jsonl"))
pipeline.add_stage(AddId(id_field="doc_id", id_prefix="v1"))
pipeline.add_stage(JsonlWriter(output_path="output/"))
# Run pipeline
result = pipeline.run()
Alternative: Reader-Based ID Generation#
For deduplication workflows, you can generate IDs during data loading:
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
# Create ID generator
create_id_generator_actor()
# Reader generates IDs automatically
reader = JsonlReader(
file_paths="data/*.jsonl",
_generate_ids=True # Adds '_curator_dedup_id' field
)
This approach:
Generates monotonically increasing integer IDs
Required for some deduplication workflows
Persists ID state across pipeline runs
Error Handling#
Existing ID field:
# This raises ValueError if 'doc_id' already exists
AddId(id_field="doc_id", overwrite=False)
# This overwrites existing field with warning
AddId(id_field="doc_id", overwrite=True)
Best Practices#
Place early in pipeline - Add IDs after loading, before filtering
Use descriptive field names -
doc_id
,document_id
,unique_id
Choose appropriate method:
Use
AddId
for general document trackingUse reader-based generation for deduplication workflows
For deduplication workflows, see Deduplication.