nemo_curator.stages.text.download.base.iterator
Module Contents
Classes
API
Bases: ProcessingStage[FileGroupTask, DocumentBatch]
Stage that iterates through downloaded files with DocumentIterator, then extracts structured content from raw records with DocumentExtractor.
Takes local file paths and produces a DocumentBatch with extracted content. If DocumentIterator produces the final format, then DocumentExtractor is not needed.
Initialize the stage.
Define input requirements - expects FileGroupTask with local file paths.
Define output - produces DocumentBatch with processed records.
Iterate through files and extract structured content.
Parameters:
Task containing local file paths
Returns: DocumentBatch
Batch containing extracted records
Get Ray configuration for this stage.
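The iterate-then-extract flow described for this stage can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: `iterate_jsonl`, `extract_text`, and `iterate_and_extract` are hypothetical names, the JSON Lines input format is assumed, and so is the convention that an extractor returns `None` to drop a record. A list of dicts stands in for the DocumentBatch.

```python
import json
from collections.abc import Callable, Iterable, Iterator
from typing import Optional

Record = dict[str, str]


def iterate_jsonl(path: str) -> Iterator[Record]:
    """Hypothetical iterator: yield one dict[str, str] per non-empty JSON line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield {key: str(value) for key, value in json.loads(line).items()}


def extract_text(record: Record) -> Optional[Record]:
    """Hypothetical extractor: turn the raw field into a clean "text" field.

    Returning None to drop a record is an assumption made for this sketch.
    """
    text = record.get("raw_content", "").strip()
    if not text:
        return None
    return {"url": record.get("url", ""), "text": text}


def iterate_and_extract(
    paths: Iterable[str],
    iterate: Callable[[str], Iterator[Record]],
    extract: Optional[Callable[[Record], Optional[Record]]] = None,
) -> list[Record]:
    """Collect records from every file, applying the extractor when one is given."""
    rows: list[Record] = []
    for path in paths:
        for record in iterate(path):
            if extract is not None:
                extracted = extract(record)
                if extracted is None:
                    continue  # extractor chose to drop this record
                record = extracted
            rows.append(record)
    return rows
```

Passing `extract=None` mirrors the case where the iterator already produces the final format, so no extractor is needed.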
Abstract base class for document iterators.
Always yields dict[str, str] records. When raw content needs extraction, the iterator can place it in any field (e.g., “raw_content”, “html”, or “content”).
Iterate over records in a file, yielding dict records.
Define output columns - produces DocumentBatch with records.
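A minimal sketch of what a concrete iterator could look like. `RecordIterator` is a hypothetical stand-in for the abstract base class described above, and the method names `iterate` and `output_columns`, like the `LineIterator` subclass and its field names, are assumptions chosen for illustration rather than the library's confirmed signatures.

```python
import os
from abc import ABC, abstractmethod
from collections.abc import Iterator


class RecordIterator(ABC):
    """Hypothetical stand-in for the abstract document iterator."""

    @abstractmethod
    def iterate(self, file_path: str) -> Iterator[dict[str, str]]:
        """Yield one dict[str, str] record at a time from a local file."""

    @abstractmethod
    def output_columns(self) -> list[str]:
        """Name the fields that every yielded record contains."""


class LineIterator(RecordIterator):
    """Treat each non-empty line of a text file as one raw record."""

    def iterate(self, file_path: str) -> Iterator[dict[str, str]]:
        file_name = os.path.basename(file_path)
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    # The raw payload goes in "raw_content"; any field name would work.
                    yield {"file_name": file_name, "raw_content": line}

    def output_columns(self) -> list[str]:
        return ["file_name", "raw_content"]
```

Keeping `output_columns` in step with the keys yielded by `iterate` lets downstream code know the record schema without reading a file first.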