stages.text.download.arxiv.iterator

Module Contents

Classes

ArxivIterator

Processes downloaded Arxiv files and extracts article content.

API

class stages.text.download.arxiv.iterator.ArxivIterator(log_frequency: int = 1000)

Bases: nemo_curator.stages.text.download.DocumentIterator

Processes downloaded Arxiv files and extracts article content.

Initialization

iterate(file_path: str) -> collections.abc.Iterator[dict[str, Any]]

Iterate over the records in a file, yielding each record as a dict.
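The shape of this method can be illustrated with a small stand-in. The sketch below is not the ArxivIterator implementation: `ToyTarIterator`, the `id` and `content` keys, and the use of a plain tar archive are all assumptions made for illustration. It only mirrors the documented signatures (`log_frequency`, `iterate`, `output_columns`) and shows records being yielded lazily, one dict per archive member.

```python
import io
import tarfile
import tempfile
from collections.abc import Iterator
from pathlib import Path
from typing import Any


class ToyTarIterator:
    """Hypothetical stand-in with the same method names; not the real class."""

    def __init__(self, log_frequency: int = 1000) -> None:
        self.log_frequency = log_frequency

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        # Yield one dict per regular file in the archive, reading lazily.
        with tarfile.open(file_path, "r") as archive:
            count = 0
            for member in archive:
                if not member.isfile():
                    continue
                handle = archive.extractfile(member)
                if handle is None:
                    continue
                text = handle.read().decode("utf-8", errors="replace")
                count += 1
                if count % self.log_frequency == 0:
                    print(f"processed {count} members")
                # Key names are assumptions, not the library's schema.
                yield {"id": member.name, "content": text}

    def output_columns(self) -> list[str]:
        return ["id", "content"]


def build_sample_archive(directory: Path) -> str:
    """Write a two-member tar file so the example has something to read."""
    path = directory / "sample.tar"
    with tarfile.open(path, "w") as archive:
        for name, body in [("a.tex", "alpha"), ("b.tex", "beta")]:
            data = body.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return str(path)


with tempfile.TemporaryDirectory() as tmp:
    archive_path = build_sample_archive(Path(tmp))
    records = list(ToyTarIterator(log_frequency=1).iterate(archive_path))
```

Because `iterate` is a generator, nothing is read until the caller starts consuming it, so the file must still exist at that point.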

output_columns() -> list[str]

Define the output columns for the records in the resulting DocumentBatch.
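One way to read the relationship between `iterate` and `output_columns` is that every yielded record should carry exactly the declared columns. The sketch below checks that under this assumption; `check_records` is a hypothetical helper, and the `id` and `content` column names are illustrative rather than taken from the library.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def check_records(
    records: Iterable[dict[str, Any]], columns: list[str]
) -> Iterator[dict[str, Any]]:
    """Pass records through, raising if a record's keys differ from columns."""
    expected = set(columns)
    for index, record in enumerate(records):
        if set(record) != expected:
            raise ValueError(
                f"record {index} has keys {sorted(record)}, "
                f"expected {sorted(expected)}"
            )
        yield record


# Hypothetical records and column names, for illustration only.
sample = [
    {"id": "a.tex", "content": "alpha"},
    {"id": "b.tex", "content": "beta"},
]
validated = list(check_records(sample, ["id", "content"]))
```

A check like this is lazy as well, so it can wrap an iterator without loading all records into memory.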