`stages.text.download.base.iterator`

Module Contents

Classes

`DocumentIterateStage`	Stage that iterates through downloaded files and extracts records.
`DocumentIterator`	Abstract base class for document iterators.

API

class stages.text.download.base.iterator.DocumentIterateStage

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.DocumentBatch]

Stage that iterates through downloaded files and extracts records.

Takes local file paths and produces a DocumentBatch with records. All iterators yield dict[str, str] records uniformly.

add_filename_column: bool | str = True

inputs() → tuple[list[str], list[str]]: Define input requirements - expects FileGroupTask with local file paths.

iterator: stages.text.download.base.iterator.DocumentIterator = None

outputs() → tuple[list[str], list[str]]: Define output - produces DocumentBatch with records.

process(task: nemo_curator.tasks.FileGroupTask) → nemo_curator.tasks.DocumentBatch

Iterate through files and extract records.

Args: task (FileGroupTask): Task containing local file paths

Returns: DocumentBatch: Batch containing records

record_limit: int | None = None
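The interaction between `iterator`, `record_limit`, and `add_filename_column` is not spelled out above, so the following is only a minimal sketch of how a `process`-style loop could combine them. It does not use the library: `collect_records` and `fake_iterate` are hypothetical helpers, and three behaviors are assumptions rather than documented facts: the limit is applied per file, `add_filename_column=True` maps to a default column named `file_name`, and the column holds the file's base name.

```python
from __future__ import annotations

import os
from collections.abc import Callable, Iterable, Iterator
from typing import Any


def collect_records(
    file_paths: Iterable[str],
    iterate: Callable[[str], Iterator[dict[str, Any]]],
    record_limit: int | None = None,
    add_filename_column: bool | str = True,
) -> list[dict[str, Any]]:
    """Gather records from every file, mimicking a process()-style loop."""
    # Assumption: True selects a default column name, a string names the column.
    if add_filename_column is True:
        filename_col: str | None = "file_name"
    elif isinstance(add_filename_column, str) and add_filename_column:
        filename_col = add_filename_column
    else:
        filename_col = None

    records: list[dict[str, Any]] = []
    for path in file_paths:
        count = 0
        for record in iterate(path):
            # Assumption: the limit caps records per file, not per batch.
            if record_limit is not None and count >= record_limit:
                break
            if filename_col is not None:
                # Copy the record so the iterator's own dict is not mutated.
                record = {**record, filename_col: os.path.basename(path)}
            records.append(record)
            count += 1
    return records


def fake_iterate(path: str) -> Iterator[dict[str, Any]]:
    """Hypothetical iterator that yields three records for any path."""
    for i in range(3):
        yield {"text": f"record {i}"}


batch = collect_records(
    ["/data/a.txt", "/data/b.txt"],
    fake_iterate,
    record_limit=2,
    add_filename_column="source",
)
```

Here `batch` is a plain list of dicts standing in for the rows of a `DocumentBatch`; the real stage returns a `DocumentBatch` object instead.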

class stages.text.download.base.iterator.DocumentIterator

Bases: abc.ABC

Abstract base class for document iterators.

Always yields dict[str, str] records. Raw content that needs extraction can be placed in any field (e.g., "raw_content", "html", or "content").

abstractmethod iterate(file_path: str) → collections.abc.Iterator[dict[str, Any]]: Iterate over records in a file, yielding dict records.

abstractmethod output_columns() → list[str]: Define the output columns, that is, the field names of the records that end up in the DocumentBatch.
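As an illustration of the two abstract methods, here is a minimal sketch of a concrete iterator for JSON Lines files. To stay runnable without the library installed, it defines a local stand-in `DocumentIterator` that mirrors the signatures documented above; in real use you would presumably subclass the library's class instead. `JsonlIterator`, the `id` and `text` field names, and the `demo` helper are all hypothetical.

```python
from __future__ import annotations

import abc
import json
import os
import tempfile
from collections.abc import Iterator
from typing import Any


class DocumentIterator(abc.ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abc.abstractmethod
    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        """Iterate over records in a file, yielding dict records."""

    @abc.abstractmethod
    def output_columns(self) -> list[str]:
        """Return the names of the fields each record carries."""


class JsonlIterator(DocumentIterator):
    """Hypothetical iterator: one JSON object per line with 'id' and 'text' keys."""

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                obj = json.loads(line)
                # Coerce values to str so every record is a dict[str, str].
                yield {"id": str(obj["id"]), "text": str(obj["text"])}

    def output_columns(self) -> list[str]:
        return ["id", "text"]


def demo() -> list[dict[str, Any]]:
    """Write a small sample file and read it back through the iterator."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write('{"id": 1, "text": "hello"}\n\n{"id": 2, "text": "world"}\n')
        return list(JsonlIterator().iterate(path))
```

Keeping `output_columns()` in sync with the keys yielded by `iterate()` is the subclass author's responsibility in this sketch; nothing here enforces it.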
