nemo_curator.stages.text.modules.add_id
Module Contents
Classes
API
Dataclass
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
The module responsible for adding unique IDs to each document record.
This stage adds a unique identifier to each document in the batch by combining the task UUID with a sequential index.
Parameters:
id_field
The field where the generated ID will be stored.
id_prefix
A prefix to add to the generated IDs.
overwrite
Whether to overwrite existing IDs.
id_field
id_prefix
name
overwrite
Adds unique IDs to each document in the batch.
Each ID combines the batch UUID with a sequential index, so IDs stay unique across the entire dataset as long as every batch has a distinct UUID.
Parameters:
batch
The batch to add IDs to.
Returns: DocumentBatch | None
A batch with unique IDs added to each document.
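The ID scheme described above (a batch UUID combined with a sequential index, plus an optional prefix) can be sketched in plain Python. This is only an illustration and does not use the library: the helper name `add_ids`, the `_` separator, the list-of-dicts representation of a batch, and the choice to raise `ValueError` when the field already exists and `overwrite` is false are all assumptions, not the stage's confirmed behavior.

```python
import uuid


def add_ids(records, batch_uuid, id_field="id", id_prefix=None, overwrite=False):
    """Hypothetical helper: return copies of `records` with an ID under `id_field`."""
    if not overwrite and any(id_field in record for record in records):
        # Assumption: existing IDs are protected unless overwrite=True.
        raise ValueError(
            f"Field {id_field!r} already exists; pass overwrite=True to replace it."
        )
    # Assumption: prefix, UUID, and index are joined with underscores.
    prefix = f"{id_prefix}_" if id_prefix else ""
    return [
        {**record, id_field: f"{prefix}{batch_uuid}_{index}"}
        for index, record in enumerate(records)
    ]


# One UUID per batch; the index distinguishes documents within the batch.
batch_uuid = uuid.uuid4().hex
docs = [{"text": "first"}, {"text": "second"}]
with_ids = add_ids(docs, batch_uuid, id_field="doc_id", id_prefix="corpus")
```

Because the index restarts at zero for every batch, uniqueness across a dataset in this sketch depends entirely on each batch receiving its own UUID.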