stages.text.download.base.iterator#

Module Contents#

Classes#

DocumentIterateStage

Stage that iterates through downloaded files and extracts records.

DocumentIterator

Abstract base class for document iterators.

API#

class stages.text.download.base.iterator.DocumentIterateStage#

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.DocumentBatch]

Stage that iterates through downloaded files and extracts records.

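Takes a FileGroupTask of local file paths and produces a DocumentBatch of the extracted records. Iterators are expected to yield uniform dict[str, str] records, although iterate() is annotated as returning dict[str, Any].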

add_filename_column: bool | str#

True

inputs() → tuple[list[str], list[str]]#

Define input requirements - expects FileGroupTask with local file paths.

iterator: stages.text.download.base.iterator.DocumentIterator#

None

outputs() → tuple[list[str], list[str]]#

Define output - produces DocumentBatch with records.

process(
task: nemo_curator.tasks.FileGroupTask,
) → nemo_curator.tasks.DocumentBatch#

Iterate through files and extract records.

Args: task (FileGroupTask): Task containing local file paths

Returns: DocumentBatch: Batch containing records

record_limit: int | None#

None
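The interplay of iterator, record_limit, and add_filename_column can be sketched without the library. The snippet below is a minimal illustration, not the stage's actual implementation. It assumes that the limit applies per file, that `True` selects a default column name (`"file_name"` here, a guess), that a string value names the column, and that the stored value is the file's base name. `collect_records` and `fake_iterate` are hypothetical helpers, and a plain list of dicts stands in for DocumentBatch.

```python
from __future__ import annotations

import os
from collections.abc import Callable, Iterable, Iterator
from typing import Any


def collect_records(
    iterate: Callable[[str], Iterator[dict[str, Any]]],
    file_paths: Iterable[str],
    record_limit: int | None = None,
    add_filename_column: bool | str = True,
) -> list[dict[str, Any]]:
    """Gather records from every file, optionally tagging each with its source file."""
    # Assumption: True picks a default column name; a non-empty string names the column.
    if add_filename_column is True:
        filename_col = "file_name"
    elif add_filename_column:
        filename_col = add_filename_column
    else:
        filename_col = None

    records: list[dict[str, Any]] = []
    for path in file_paths:
        for count, record in enumerate(iterate(path)):
            # Assumption: the limit is applied to each file separately.
            if record_limit is not None and count >= record_limit:
                break
            row = dict(record)  # copy so the iterator's dict is not mutated
            if filename_col is not None:
                row[filename_col] = os.path.basename(path)
            records.append(row)
    return records


def fake_iterate(path: str) -> Iterator[dict[str, Any]]:
    """Hypothetical iterator that yields three records per file."""
    for i in range(3):
        yield {"text": f"{path}:{i}"}


batch = collect_records(
    fake_iterate,
    ["/data/a.jsonl", "/data/b.jsonl"],
    record_limit=2,
    add_filename_column="source",
)
```

With `record_limit=2`, each of the two files contributes two records, and every record carries a `source` field holding the file's base name.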

class stages.text.download.base.iterator.DocumentIterator#

Bases: abc.ABC

Abstract base class for document iterators.

Always yields dict[str, str] records. For raw content that needs extraction, the iterator can put it in any field (e.g., "raw_content", "html", or "content").

abstractmethod iterate(file_path: str) → collections.abc.Iterator[dict[str, Any]]#

Iterate over records in a file, yielding dict records.

abstractmethod output_columns() → list[str]#

Define the output columns: the field names of the records this iterator yields.
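A concrete iterator implements both abstract methods. The sketch below shows one possible subclass for JSON Lines files. So that it runs without the library installed, it declares a local stand-in `DocumentIterator` that mirrors the two documented signatures; in real code you would presumably subclass the library's class instead. `JsonlIterator` and the `id` and `text` field names are hypothetical.

```python
from __future__ import annotations

import json
import os
import tempfile
from abc import ABC, abstractmethod
from collections.abc import Iterator
from typing import Any


class DocumentIterator(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]: ...

    @abstractmethod
    def output_columns(self) -> list[str]: ...


class JsonlIterator(DocumentIterator):
    """Hypothetical iterator that reads one JSON object per line."""

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                obj = json.loads(line)
                # Coerce values to str so every record is a dict[str, str].
                yield {"id": str(obj["id"]), "text": str(obj["text"])}

    def output_columns(self) -> list[str]:
        # Must list the keys that iterate() puts in each record.
        return ["id", "text"]


# Usage: write a small sample file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"id": 1, "text": "first"}\n\n{"id": 2, "text": "second"}\n')
    records = list(JsonlIterator().iterate(path))
```

Keeping `output_columns()` in sync with the keys yielded by `iterate()` lets downstream code know the record schema without reading any file.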