nemo_curator.stages.text.download.base.iterator

Module Contents

Classes

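| Name | Description |
| --- | --- |
| DocumentIterateExtractStage | Stage that iterates through downloaded files with DocumentIterator, then extracts structured content from raw records with DocumentExtractor. |
| DocumentIterator | Abstract base class for document iterators. |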

API

class nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage(
iterator: nemo_curator.stages.text.download.base.iterator.DocumentIterator,
extractor: nemo_curator.stages.text.download.base.extract.DocumentExtractor | None = None,
record_limit: int | None = None,
add_filename_column: bool | str = True,
max_calls_per_worker: int | None = None
)
Dataclass

Bases: ProcessingStage[FileGroupTask, DocumentBatch]

Stage that iterates through downloaded files with DocumentIterator, then extracts structured content from raw records with DocumentExtractor.

Takes local file paths and produces a DocumentBatch with extracted content. If the DocumentIterator already produces records in the final format, no DocumentExtractor is needed and extractor can be left as None.

add_filename_column: bool | str = True
extractor: DocumentExtractor | None = None
iterator: DocumentIterator
max_calls_per_worker: int | None = None
record_limit: int | None = None

nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage.__post_init__()

Initialize the stage.

nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage.inputs() -> tuple[list[str], list[str]]

Define input requirements: expects a FileGroupTask with local file paths.

nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage.outputs() -> tuple[list[str], list[str]]

Define output: produces a DocumentBatch with processed records.

nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage.process(
task: nemo_curator.tasks.FileGroupTask
) -> nemo_curator.tasks.DocumentBatch

Iterate through files and extract structured content.

Parameters:

task
FileGroupTask

Task containing local file paths

Returns: DocumentBatch

Batch containing extracted records
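
The iterate-then-extract flow described for this stage can be sketched in plain Python. This is not the library's implementation: the function `iterate_and_extract`, the callables it takes, the default column name `file_name`, the per-file reading of `record_limit`, and treating a `None` extraction result as "skip this record" are all assumptions made for illustration. It returns a list of dicts rather than a DocumentBatch.

```python
from __future__ import annotations

import os
from collections.abc import Callable, Iterable, Iterator
from typing import Any

Record = dict[str, Any]


def iterate_and_extract(
    file_paths: Iterable[str],
    iterate: Callable[[str], Iterator[Record]],
    extract: Callable[[Record], Record | None] | None = None,
    record_limit: int | None = None,
    add_filename_column: bool | str = True,
) -> list[Record]:
    """Hypothetical stand-in for the stage's process step."""
    # Assumption: a string names the column, True selects a default name.
    filename_col = (
        add_filename_column if isinstance(add_filename_column, str) else "file_name"
    )
    records: list[Record] = []
    for path in file_paths:
        kept = 0  # assumption: the limit is counted per file
        for record in iterate(path):
            if record_limit is not None and kept >= record_limit:
                break
            if extract is not None:
                extracted = extract(record)
                if extracted is None:  # assumption: None means "skip"
                    continue
                record = extracted
            if add_filename_column:
                record = {**record, filename_col: os.path.basename(path)}
            records.append(record)
            kept += 1
    return records


# In-memory "files" so the sketch runs without touching disk.
fake_files = {"/data/a.txt": ["x", "", "y", "z"], "/data/b.txt": ["w"]}


def fake_iterate(path: str) -> Iterator[Record]:
    for raw in fake_files[path]:
        yield {"raw_content": raw}


def fake_extract(record: Record) -> Record | None:
    raw = record["raw_content"]
    return {"text": raw.upper()} if raw else None


out = iterate_and_extract(
    list(fake_files), fake_iterate, fake_extract,
    record_limit=2, add_filename_column="source",
)
```

Passing `extract=None` mirrors the case where the iterator already yields records in their final form.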

nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage.ray_stage_spec() -> dict[str, typing.Any]

Get Ray configuration for this stage.

class nemo_curator.stages.text.download.base.iterator.DocumentIterator()
Abstract

Abstract base class for document iterators.

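Always yields dict records. The class description gives the record type as dict[str, str], while the iterate signature is annotated as dict[str, typing.Any]. For raw content that needs extraction, the iterator can put it in any field (e.g., "raw_content", "html", "content").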

nemo_curator.stages.text.download.base.iterator.DocumentIterator.iterate(
file_path: str
) -> collections.abc.Iterator[dict[str, typing.Any]]
abstract

Iterate over records in a file, yielding dict records.

nemo_curator.stages.text.download.base.iterator.DocumentIterator.output_columns() -> list[str]
abstract

Define the output columns: the column names of the records this iterator yields.
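
A minimal sketch of what a concrete iterator might look like. To stay runnable without NeMo Curator installed, it declares a local stand-in `DocumentIterator` that mirrors the two abstract methods documented above; in real code you would presumably subclass the library's class instead. `JsonlIterator`, the field names `id` and `raw_content`, and the JSON Lines input layout are hypothetical.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from collections.abc import Iterator
from typing import Any


class DocumentIterator(ABC):
    """Local stand-in mirroring the documented abstract interface."""

    @abstractmethod
    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]: ...

    @abstractmethod
    def output_columns(self) -> list[str]: ...


class JsonlIterator(DocumentIterator):
    """Hypothetical iterator: one JSON object per non-empty line."""

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:  # skip blank lines
                    continue
                obj = json.loads(line)
                # Raw text goes into a field of our choosing for later extraction.
                yield {"id": str(obj["id"]), "raw_content": str(obj["text"])}

    def output_columns(self) -> list[str]:
        # Must match the keys of the dicts yielded by iterate().
        return ["id", "raw_content"]


def demo() -> list[dict[str, Any]]:
    """Write a small sample file and read it back through the iterator."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write('{"id": 1, "text": "hello"}\n\n{"id": 2, "text": "world"}\n')
        return list(JsonlIterator().iterate(path))
```

Keeping `output_columns()` in sync with the keys yielded by `iterate()` is the design point here: the column list describes the records rather than the batch that a downstream stage builds from them.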