---
description: >-
  Add unique identifiers to documents in your text dataset for tracking and deduplication workflows
categories:
  - text-curation
tags:
  - preprocessing
  - identifiers
  - document-tracking
  - pipeline
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: how-to
modality: text-only
---

# Adding Document IDs

Add a unique identifier to each document in your text dataset.

## How It Works

Document IDs are useful for:

* **Pipeline tracking** - Monitor documents through processing stages
* **Dataset versioning** - Identify documents across different versions

***

## Usage

### Basic Usage

```python
from nemo_curator.stages.text.modules import AddId

# Initialize pipeline, read stage, etc.

# Add to your pipeline
pipeline.add_stage(AddId(id_field="doc_id"))
```

### Configuration Options

```python
# Customize ID generation
pipeline.add_stage(AddId(
    id_field="document_id",  # Field name for IDs
    id_prefix="corpus_v2",   # Optional prefix
    overwrite=True,          # Overwrite existing IDs
))
```

#### Parameters

| Parameter   | Type   | Default  | Description                             |
| ----------- | ------ | -------- | --------------------------------------- |
| `id_field`  | `str`  | Required | Field name where IDs will be stored     |
| `id_prefix` | `str`  | `None`   | Optional prefix for IDs                 |
| `overwrite` | `bool` | `False`  | Whether to overwrite existing ID fields |

#### ID Format

Generated IDs follow this pattern:

* Without prefix: `{task_uuid}_{index}`
* With prefix: `{prefix}_{task_uuid}_{index}`

### Complete Example

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.modules import AddId
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create pipeline
pipeline = Pipeline(name="add_ids")

# Add stages: read JSONL, assign IDs, write JSONL
pipeline.add_stage(JsonlReader(file_paths="input/"))
pipeline.add_stage(AddId(id_field="doc_id", id_prefix="v1"))
pipeline.add_stage(JsonlWriter("output/"))

# Run pipeline
result = pipeline.run()

# Stop Ray client
ray_client.stop()
```

### Alternative: Reader-Based ID Generation

For deduplication workflows, the reader can generate unique IDs during data loading:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.text.io.reader import JsonlReader

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

pipeline = Pipeline(name="id_generator_example")

# Create ID generator
create_id_generator_actor()

# Reader generates IDs automatically
reader = JsonlReader(
    file_paths="data/",
    _generate_ids=True,  # Adds '_curator_dedup_id' field
)
pipeline.add_stage(reader)

# Run pipeline
results = pipeline.run()

# Stop Ray client
ray_client.stop()

# Examine the first 5 rows of the first DocumentBatch
print(results[0].data.head())
```

This approach:

* Generates monotonically increasing integer IDs
* Is required for some deduplication workflows
* Persists ID state across pipeline runs

***

## Error Handling

**Existing ID field:**

```python
# Raises ValueError if 'doc_id' already exists
AddId(id_field="doc_id", overwrite=False)

# Overwrites the existing field and logs a warning
AddId(id_field="doc_id", overwrite=True)
```

***

## Best Practices

* **Place early in pipeline** - Add IDs after loading, before filtering
* **Use descriptive field names** - `doc_id`, `document_id`, `unique_id`
* **Choose appropriate method**:
  * Use `AddId` for general document tracking
  * Use the reader-based ID generator for deduplication workflows

***

For deduplication workflows, see [Deduplication](/curate-text/process-data/deduplication).
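
To make the pattern from the ID Format section concrete, here is a minimal standard-library sketch that builds and parses strings of the shape `{prefix}_{task_uuid}_{index}`. It does not call NeMo Curator: `make_ids` and `split_index` are hypothetical helpers written for illustration, and the zero-based index is an assumption rather than documented behavior.

```python
import uuid
from typing import List, Optional, Tuple


def make_ids(task_uuid: str, count: int, prefix: Optional[str] = None) -> List[str]:
    """Build IDs shaped like the documented pattern (hypothetical helper)."""
    # With a prefix: {prefix}_{task_uuid}_{index}; without: {task_uuid}_{index}
    base = f"{prefix}_{task_uuid}" if prefix else task_uuid
    # Assumption: indices start at 0 within each task
    return [f"{base}_{i}" for i in range(count)]


def split_index(doc_id: str) -> Tuple[str, int]:
    """Split an ID into everything before the last underscore and the index."""
    # rpartition splits on the LAST underscore, so prefixes that themselves
    # contain underscores (for example "corpus_v2") are handled correctly.
    head, _, index = doc_id.rpartition("_")
    return head, int(index)


# Stand-in for the per-task UUID; a standard UUID string contains
# hyphens but no underscores.
task_uuid = str(uuid.uuid4())

for doc_id in make_ids(task_uuid, 3, prefix="corpus_v2"):
    print(doc_id, "->", split_index(doc_id))
```

Because the index is always the final underscore-separated segment, downstream code can recover it without knowing whether a prefix was used.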