Adding Document IDs
Add a unique identifier to each document in your text dataset.
How It Works
Document IDs are useful for:
- Pipeline tracking - Monitor documents through processing stages
- Dataset versioning - Identify documents across different versions
Usage
Basic Usage
Configuration Options
Parameters
ID Format
Generated IDs follow this pattern:
- Without prefix: {task_uuid}_{index}
- With prefix: {prefix}_{task_uuid}_{index}
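The pattern above can be illustrated with a minimal sketch in plain Python. It does not use the library's own API: `make_doc_ids` is a hypothetical helper, and the zero-based index and hex-formatted UUID are assumptions made for illustration only.

```python
import uuid
from typing import Optional


def make_doc_ids(
    count: int,
    prefix: Optional[str] = None,
    task_uuid: Optional[str] = None,
) -> list[str]:
    """Build IDs shaped like {task_uuid}_{index} or {prefix}_{task_uuid}_{index}."""
    # Assumption: one UUID is shared by every document in the same task.
    task_uuid = task_uuid or uuid.uuid4().hex
    base = f"{prefix}_{task_uuid}" if prefix else task_uuid
    # Assumption: the index starts at 0 within the task.
    return [f"{base}_{index}" for index in range(count)]


# A fixed task_uuid is passed here only to make the result reproducible.
ids = make_doc_ids(3, prefix="corpus", task_uuid="abc123")
print(ids)
```

Because the UUID component differs between tasks, two documents that share an index in different tasks still receive distinct IDs.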
Complete Example
Alternative: Reader-Based ID Generation
For deduplication workflows, unique IDs can instead be generated by the reader during data loading.
This approach:
- Generates monotonically increasing integer IDs
- Is required for some deduplication workflows
- Persists ID state across pipeline runs
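The idea of monotonically increasing IDs whose state survives across runs can be sketched as follows. This is not the library's ID generator: `PersistentIdCounter`, the JSON state file, and the `doc_id` field name are hypothetical and chosen only to show the behavior.

```python
import json
import os
import tempfile


class PersistentIdCounter:
    """Hypothetical helper: hands out increasing integer IDs and saves its state."""

    def __init__(self, state_path: str):
        self.state_path = state_path
        self.next_id = 0
        # Resume from the saved counter if an earlier run left one behind.
        if os.path.exists(state_path):
            with open(state_path) as f:
                self.next_id = json.load(f)["next_id"]

    def assign(self, docs: list[dict], id_field: str = "doc_id") -> list[dict]:
        """Return copies of the documents with the next integer IDs attached."""
        out = []
        for doc in docs:
            out.append({**doc, id_field: self.next_id})
            self.next_id += 1
        return out

    def save(self) -> None:
        """Write the counter to disk so a later run continues where this one stopped."""
        with open(self.state_path, "w") as f:
            json.dump({"next_id": self.next_id}, f)


state_file = os.path.join(tempfile.mkdtemp(), "id_state.json")

# First run: two documents are loaded and the counter is saved.
first_run = PersistentIdCounter(state_file)
batch_a = first_run.assign([{"text": "a"}, {"text": "b"}])
first_run.save()

# Second run: a new counter object picks up from the saved state.
second_run = PersistentIdCounter(state_file)
batch_b = second_run.assign([{"text": "c"}])
```

Persisting the counter means documents loaded in a later run never reuse an integer already handed out, which matters when deduplication results are keyed by ID.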
Error Handling
Existing ID field:
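One defensive way to handle a document that already carries the ID field is sketched below with plain dictionaries. The `add_id_field` function and its `overwrite` flag are hypothetical and are not taken from the library; the sketch assumes that silently replacing an existing ID is undesirable unless explicitly requested.

```python
def add_id_field(
    doc: dict,
    doc_id: str,
    id_field: str = "doc_id",
    overwrite: bool = False,
) -> dict:
    """Return a copy of doc with id_field set, refusing to clobber an existing value."""
    if id_field in doc and not overwrite:
        raise ValueError(
            f"Field '{id_field}' already exists; pass overwrite=True to replace it."
        )
    return {**doc, id_field: doc_id}


doc = {"text": "hello", "doc_id": "old"}

try:
    add_id_field(doc, "new")
    outcome = "added"
except ValueError:
    # The existing ID is left untouched when overwrite is not requested.
    outcome = "rejected"

# Explicitly opting in replaces the value in the returned copy.
replaced = add_id_field(doc, "new", overwrite=True)
```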
Best Practices
- Place early in pipeline - Add IDs after loading, before filtering
- Use descriptive field names - doc_id, document_id, unique_id
- Choose appropriate method:
  - Use AddId for general document tracking
  - Use ID generator for deduplication workflows
For deduplication workflows, see Deduplication.