stages.text.download.base.iterator
Module Contents
Classes
DocumentIterateStage | Stage that iterates through downloaded files and extracts records.
DocumentIterator | Abstract base class for document iterators.
API
- class stages.text.download.base.iterator.DocumentIterateStage
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.DocumentBatch]
Stage that iterates through downloaded files and extracts records.
Takes local file paths and produces a DocumentBatch with records. All iterators yield dict[str, str] records uniformly.
- add_filename_column: bool | str = True
- inputs() -> tuple[list[str], list[str]]
Define input requirements - expects FileGroupTask with local file paths.
- iterator: stages.text.download.base.iterator.DocumentIterator = None
- outputs() -> tuple[list[str], list[str]]
Define output - produces DocumentBatch with records.
- process(task: nemo_curator.tasks.FileGroupTask) -> nemo_curator.tasks.DocumentBatch
Iterate through files and extract records.
Args: task (FileGroupTask): Task containing local file paths.
Returns: DocumentBatch: Batch containing records.
- record_limit: int | None = None
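The options above can be pictured with a minimal sketch. It is not the library's implementation: `collect_records`, `fake_iterate`, and the default column name `"file_name"` are hypothetical, and it assumes that `record_limit` applies per file and that `add_filename_column` accepts either `True` (use a default name) or a string (use that name). A plain list of dicts stands in for the DocumentBatch.

```python
from __future__ import annotations

import os
from collections.abc import Callable, Iterator
from typing import Any


def collect_records(
    file_paths: list[str],
    iterate: Callable[[str], Iterator[dict[str, Any]]],
    record_limit: int | None = None,
    add_filename_column: bool | str = True,
) -> list[dict[str, Any]]:
    """Gather records from every file, mimicking the documented stage options."""
    # Assumption: True maps to a default column name; a string names the column.
    if add_filename_column is True:
        filename_col = "file_name"  # hypothetical default name
    elif add_filename_column:
        filename_col = add_filename_column
    else:
        filename_col = None

    records: list[dict[str, Any]] = []
    for path in file_paths:
        for count, record in enumerate(iterate(path)):
            # Assumption: the limit applies per file, not across the whole task.
            if record_limit is not None and count >= record_limit:
                break
            if filename_col is not None:
                # Copy the record so the iterator's own dict is left untouched.
                record = {**record, filename_col: os.path.basename(path)}
            records.append(record)
    return records


def fake_iterate(path: str) -> Iterator[dict[str, Any]]:
    # Stand-in for DocumentIterator.iterate: yields three records per file.
    for i in range(3):
        yield {"content": f"{path}-{i}"}


batch = collect_records(["/data/a.txt", "/data/b.txt"], fake_iterate, record_limit=2)
```

With two files and `record_limit=2`, `batch` holds four records, each carrying the base name of its source file in the filename column.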
- class stages.text.download.base.iterator.DocumentIterator
Bases: abc.ABC
Abstract base class for document iterators.
Always yields dict[str, str] records. For raw content that needs extraction, the iterator can put it in any field (e.g., “raw_content”, “html”, or “content”).
- abstractmethod iterate(file_path: str) -> collections.abc.Iterator[dict[str, Any]]
Iterate over records in a file, yielding dict records.
- abstractmethod output_columns() -> list[str]
Define the output columns: the field names of the records this iterator yields.
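As an illustration of the two abstract methods, here is a minimal sketch of a concrete iterator. To stay runnable without the library installed, it defines a local stand-in `DocumentIterator` that mirrors the documented signatures. In real use you would presumably subclass the library's class instead. `JsonlIterator` and its `url` / `raw_content` fields are hypothetical.

```python
from __future__ import annotations

import json
import os
import tempfile
from abc import ABC, abstractmethod
from collections.abc import Iterator
from typing import Any


class DocumentIterator(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]: ...

    @abstractmethod
    def output_columns(self) -> list[str]: ...


class JsonlIterator(DocumentIterator):
    """Hypothetical iterator: one JSON object per line with 'url' and 'raw_content'."""

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                obj = json.loads(line)
                # Coerce values to str so every record is a dict[str, str].
                yield {
                    "url": str(obj.get("url", "")),
                    "raw_content": str(obj.get("raw_content", "")),
                }

    def output_columns(self) -> list[str]:
        # Must match the keys yielded by iterate().
        return ["url", "raw_content"]


# Write a small sample file (including a blank line) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"url": "https://example.com", "raw_content": "<p>hi</p>"}\n\n')
        f.write('{"url": "https://example.org", "raw_content": "<p>bye</p>"}\n')
    records = list(JsonlIterator().iterate(path))
```

Keeping `output_columns()` in sync with the keys that `iterate()` yields is the design point here: downstream stages can then rely on the declared columns without inspecting individual records.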