Adding Document IDs

Add unique identifiers to each document in your text dataset.

How It Works

Document IDs are useful for:

  • Pipeline tracking - Monitor documents through processing stages
  • Dataset versioning - Identify documents across different versions

Usage

Basic Usage

from nemo_curator.stages.text.modules import AddId

# Initialize pipeline, read stage, etc.

# Add to your pipeline
pipeline.add_stage(AddId(id_field="doc_id"))

Configuration Options

# Customize ID generation
pipeline.add_stage(AddId(
    id_field="document_id",  # Field name for IDs
    id_prefix="corpus_v2",   # Optional prefix
    overwrite=True           # Overwrite existing IDs
))

Parameters

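| Parameter | Type | Default  | Description                             |
|-----------|------|----------|-----------------------------------------|
| id_field  | str  | Required | Field name where IDs will be stored     |
| id_prefix | str  | None     | Optional prefix for IDs                 |
| overwrite | bool | False    | Whether to overwrite existing ID fields |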

ID Format

Generated IDs follow this pattern:

  • Without prefix: {task_uuid}_{index}
  • With prefix: {prefix}_{task_uuid}_{index}
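The two patterns above can be reproduced with a small helper. This is a minimal sketch, not the library's implementation: `format_doc_id` is a hypothetical name, and the UUID is generated locally with Python's `uuid` module as a stand-in for the task UUID that the pipeline assigns.

```python
import uuid
from typing import Optional


def format_doc_id(task_uuid: str, index: int, prefix: Optional[str] = None) -> str:
    """Build an ID following the documented pattern (hypothetical helper)."""
    base = f"{task_uuid}_{index}"
    return f"{prefix}_{base}" if prefix else base


# Stand-in for the UUID of the task (batch) being processed (assumption).
task_uuid = str(uuid.uuid4())

ids_plain = [format_doc_id(task_uuid, i) for i in range(3)]
ids_prefixed = [format_doc_id(task_uuid, i, prefix="corpus_v2") for i in range(3)]

print(ids_plain[0])     # {task_uuid}_0
print(ids_prefixed[0])  # corpus_v2_{task_uuid}_0
```

Because the index restarts for every task, uniqueness across the dataset depends on the task UUID part of the ID; the prefix only labels the corpus or version.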

Complete Example

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.modules import AddId
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create pipeline
pipeline = Pipeline(name="add_ids")

# Add stages
pipeline.add_stage(JsonlReader(file_paths="input/"))
pipeline.add_stage(AddId(id_field="doc_id", id_prefix="v1"))
pipeline.add_stage(JsonlWriter("output/"))

# Run pipeline
result = pipeline.run()

# Stop Ray client
ray_client.stop()

Alternative: Reader-Based ID Generation

For deduplication workflows, the reader can generate unique IDs while it loads the data:

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.text.io.reader import JsonlReader

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

pipeline = Pipeline(name="id_generator_example")

# Create ID generator
create_id_generator_actor()

# Reader generates IDs automatically
reader = JsonlReader(
    file_paths="data/",
    _generate_ids=True  # Adds '_curator_dedup_id' field
)
pipeline.add_stage(reader)

# Run pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()

# Examine the first 5 rows of the first DocumentBatch
print(results[0].data.head())

This approach:

  • Generates monotonically increasing integer IDs
  • Is required for some deduplication workflows
  • Persists ID state across pipeline runs
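The properties listed above can be illustrated with a plain counter. This is only a sketch: `CounterIdGenerator` is a hypothetical class, not the NeMo Curator ID generator actor, and saving state to a JSON file is an assumption made for illustration.

```python
import json
import tempfile
from pathlib import Path


class CounterIdGenerator:
    """Hypothetical stand-in for an ID generator; not the library's actor."""

    def __init__(self, next_id: int = 0) -> None:
        self.next_id = next_id

    def reserve(self, batch_size: int) -> range:
        """Reserve a contiguous block of integer IDs for one batch."""
        start = self.next_id
        self.next_id += batch_size
        return range(start, self.next_id)

    def save(self, path: Path) -> None:
        """Persist the counter so a later run can resume from it."""
        path.write_text(json.dumps({"next_id": self.next_id}))

    @classmethod
    def load(cls, path: Path) -> "CounterIdGenerator":
        return cls(next_id=json.loads(path.read_text())["next_id"])


state_file = Path(tempfile.mkdtemp()) / "id_state.json"

# First run: two batches of 3 and 2 documents.
gen = CounterIdGenerator()
first_run = [list(gen.reserve(3)), list(gen.reserve(2))]
gen.save(state_file)

# Second run: resume from the saved state so new IDs do not collide.
resumed = CounterIdGenerator.load(state_file)
second_run = list(resumed.reserve(2))

print(first_run)   # [[0, 1, 2], [3, 4]]
print(second_run)  # [5, 6]
```

Reserving a contiguous block per batch keeps the IDs monotonically increasing without coordinating on every single document.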

Error Handling

Existing ID field:

# This raises ValueError if 'doc_id' already exists
AddId(id_field="doc_id", overwrite=False)

# This overwrites the existing field and logs a warning
AddId(id_field="doc_id", overwrite=True)
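The overwrite behavior described here can be sketched without the library. The helper `add_ids`, the `batch0_{index}` ID format, and the logger name are all hypothetical; the sketch only mirrors the documented behavior (error when the field exists and `overwrite` is false, warning and replacement otherwise).

```python
import logging
from typing import Dict, List

logger = logging.getLogger("add_id_sketch")


def add_ids(rows: List[Dict], id_field: str, overwrite: bool = False) -> List[Dict]:
    """Hypothetical helper mirroring the documented overwrite semantics."""
    if any(id_field in row for row in rows):
        if not overwrite:
            raise ValueError(
                f"Field '{id_field}' already exists; pass overwrite=True to replace it"
            )
        logger.warning("Overwriting existing field '%s'", id_field)
    # Return new dicts so the input rows are left untouched.
    return [{**row, id_field: f"batch0_{i}"} for i, row in enumerate(rows)]


rows = [{"text": "a", "doc_id": "old"}, {"text": "b", "doc_id": "old"}]

try:
    add_ids(rows, "doc_id")
    raised = False
except ValueError:
    raised = True

replaced = add_ids(rows, "doc_id", overwrite=True)
print(raised)                            # True
print([r["doc_id"] for r in replaced])   # ['batch0_0', 'batch0_1']
```

Leaving `overwrite` at its default of False is the safer choice when input data may already carry IDs from an earlier run.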

Best Practices

  • Place early in pipeline - Add IDs after loading, before filtering
  • Use descriptive field names - doc_id, document_id, unique_id
  • Choose appropriate method:
    • Use AddId for general document tracking
    • Use ID generator for deduplication workflows
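To see why IDs are best added before filtering, consider this small sketch in plain Python. The `run1_{index}` ID format and the length filter are assumptions chosen for illustration; they stand in for `AddId` and a real filtering stage.

```python
docs = ["short", "a much longer document", "", "another usable document"]

# Assign IDs immediately after loading (hypothetical 'run1_{index}' format).
records = [{"doc_id": f"run1_{i}", "text": t} for i, t in enumerate(docs)]

# A simple length filter standing in for a real filtering stage.
kept = [r for r in records if len(r["text"]) >= 10]

# Because IDs were assigned first, dropped documents can be traced by ID.
kept_ids = {r["doc_id"] for r in kept}
dropped_ids = sorted({r["doc_id"] for r in records} - kept_ids)

print(sorted(kept_ids))  # ['run1_1', 'run1_3']
print(dropped_ids)       # ['run1_0', 'run1_2']
```

If IDs were assigned after the filter, the dropped documents would never receive one, and there would be no way to report which inputs were removed.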

For deduplication workflows, see Deduplication.