Add unique identifiers to each document in your text dataset.
How It Works
Document IDs are useful for:
- Pipeline tracking - Monitor documents through processing stages
- Dataset versioning - Identify documents across different versions
Usage
Basic Usage
Configuration Options
Parameters
Generated IDs follow this pattern:
- Without prefix: {task_uuid}_{index}
- With prefix: {prefix}_{task_uuid}_{index}
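The pattern above can be illustrated with a small standalone sketch. It does not use the library itself: `make_doc_ids` is a hypothetical helper, and the zero-based index and the hex form of the task UUID are assumptions made only for this example.

```python
import uuid
from typing import List, Optional


def make_doc_ids(
    num_docs: int,
    prefix: Optional[str] = None,
    task_uuid: Optional[str] = None,
) -> List[str]:
    """Build IDs shaped like {task_uuid}_{index} or {prefix}_{task_uuid}_{index}.

    Hypothetical helper for illustration; the index is assumed to start at 0.
    """
    # One UUID is shared by every document produced by the same task.
    task_uuid = task_uuid or uuid.uuid4().hex
    base = f"{prefix}_{task_uuid}" if prefix else task_uuid
    return [f"{base}_{index}" for index in range(num_docs)]


# Example: three documents in one task, with and without a prefix.
print(make_doc_ids(3, prefix="corpus", task_uuid="abc123"))
print(make_doc_ids(3, task_uuid="abc123"))
```

Because the UUID is shared per task and the index is unique within the task, IDs stay unique across tasks without any coordination between workers.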
Complete Example
Alternative: Reader-Based ID Generation
For deduplication workflows, unique IDs are generated during data loading:
This approach:
- Generates monotonically increasing integer IDs
- Is required for some deduplication workflows
- Persists ID state across pipeline runs
Error Handling
Existing ID field:
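One way to guard against a field that already exists can be sketched as follows. This is an illustration under stated assumptions, not the library's behavior: `add_ids`, its `overwrite` flag, and the choice to raise `ValueError` are all hypothetical.

```python
from typing import Dict, List


def add_ids(
    docs: List[Dict[str, object]],
    id_field: str = "doc_id",
    overwrite: bool = False,
) -> List[Dict[str, object]]:
    """Return copies of `docs` with an ID stored under `id_field`.

    Hypothetical helper: refuses to replace an existing field
    unless `overwrite` is set.
    """
    if not overwrite:
        for doc in docs:
            if id_field in doc:
                raise ValueError(
                    f"Field '{id_field}' already exists; "
                    "choose another field name or pass overwrite=True."
                )
    # Copy each record so the caller's input is left untouched.
    return [{**doc, id_field: f"doc_{index}"} for index, doc in enumerate(docs)]
```

Failing loudly by default avoids silently replacing IDs that downstream stages may already depend on.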
Best Practices
- Place early in pipeline - Add IDs after loading, before filtering
- Use descriptive field names - doc_id, document_id, unique_id
- Choose appropriate method:
  - Use AddId for general document tracking
  - Use ID generator for deduplication workflows
For deduplication workflows, see Deduplication.