nemo_curator.stages.text.modules.add_id


Module Contents

Classes

Name | Description
AddId | The module responsible for adding unique IDs to each document record.

API

class nemo_curator.stages.text.modules.add_id.AddId(
id_field: str,
id_prefix: str | None = None,
overwrite: bool = False,
name: str = 'add_id'
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

The module responsible for adding unique IDs to each document record.

This stage adds a unique identifier to each document in the batch by combining the task UUID with a sequential index.

Parameters:

id_field
str

The field where the generated ID will be stored.

id_prefix
str | None, defaults to None

A prefix to add to the generated IDs.

overwrite
bool, defaults to False

Whether to overwrite existing IDs.
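
As a rough illustration of the ID scheme described above, the sketch below combines a UUID with a sequential index. The helper name `make_ids` and the exact `<prefix>_<uuid>_<index>` layout are assumptions made for this example; the source only states that the UUID and the index are combined, so the real separator and ordering may differ.

```python
from __future__ import annotations

import uuid


def make_ids(task_uuid: str, num_docs: int, id_prefix: str | None = None) -> list[str]:
    """Hypothetical helper: build one ID per document from a UUID and a row index."""
    # Assumed layout: "<prefix>_<uuid>_<index>"; the prefix part is skipped when unset.
    prefix = f"{id_prefix}_" if id_prefix else ""
    return [f"{prefix}{task_uuid}_{i}" for i in range(num_docs)]


# Each batch gets its own UUID, so equal indices in different batches do not collide.
batch_a = make_ids(str(uuid.uuid4()), 3, id_prefix="doc")
batch_b = make_ids(str(uuid.uuid4()), 3, id_prefix="doc")
all_ids = batch_a + batch_b
```

The index alone is only unique within one batch; the UUID component is what separates batches from each other.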

id_field: str
id_prefix: str | None = None
name: str = 'add_id'
overwrite: bool = False
nemo_curator.stages.text.modules.add_id.AddId.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.modules.add_id.AddId.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.modules.add_id.AddId.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch | None

Adds unique IDs to each document in the batch.

Each ID combines the batch UUID with the document's sequential index within the batch. Because every batch carries its own UUID, the resulting IDs are unique across the entire dataset.

Parameters:

batch
DocumentBatch

The batch to add IDs to.

Returns: DocumentBatch | None

A batch with unique IDs added to each document.
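
The behavior of `process` can be approximated with plain Python records, as in the minimal sketch below. It is not the library implementation: `add_ids` is a hypothetical stand-in, a list of dicts replaces `DocumentBatch`, the ID layout is assumed, and raising `ValueError` when the field already exists and `overwrite` is False is also an assumption about how the stage treats existing IDs.

```python
from __future__ import annotations

import uuid


def add_ids(
    records: list[dict],
    id_field: str,
    batch_uuid: str,
    id_prefix: str | None = None,
    overwrite: bool = False,
) -> list[dict]:
    """Hypothetical stand-in for process(); not the library implementation."""
    # Assumption: an existing ID field is an error unless overwrite is enabled.
    if not overwrite and any(id_field in r for r in records):
        raise ValueError(f"Field {id_field!r} already exists; set overwrite=True to replace it")
    prefix = f"{id_prefix}_" if id_prefix else ""
    # Build new dicts so the input records are left untouched.
    return [{**r, id_field: f"{prefix}{batch_uuid}_{i}"} for i, r in enumerate(records)]


docs = [{"text": "first"}, {"text": "second"}]
with_ids = add_ids(docs, id_field="id", batch_uuid=str(uuid.uuid4()), id_prefix="doc")
```

Calling `add_ids` again on `with_ids` without `overwrite=True` raises in this sketch, which mirrors the intent of the `overwrite` parameter: existing IDs are only replaced on request.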