stages.text.download.arxiv.iterator
Module Contents
Classes
ArxivIterator | Processes downloaded arXiv files and extracts article content.
API
- class stages.text.download.arxiv.iterator.ArxivIterator(log_frequency: int = 1000)
Bases: nemo_curator.stages.text.download.DocumentIterator
Processes downloaded arXiv files and extracts article content.
Initialization
- iterate(file_path: str) -> collections.abc.Iterator[dict[str, Any]]
Iterate over records in a file, yielding dict records.
- output_columns() -> list[str]
Define the output columns. The iterator produces a DocumentBatch whose records carry these columns.
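Example (illustrative sketch)
The two methods above form a small contract: iterate yields one dict per record, and output_columns names the keys those dicts carry. The sketch below shows how such an iterator might be consumed. It does not import nemo_curator. ToyLineIterator, its column names (id, content), and the collect helper are hypothetical stand-ins, and reading log_frequency as "log progress every N records" is an assumption. The real ArxivIterator parses downloaded arXiv files and may return different columns.

```python
import logging
from collections.abc import Iterator
from typing import Any

logger = logging.getLogger(__name__)


class ToyLineIterator:
    """Hypothetical stand-in with the same two-method shape as ArxivIterator."""

    def __init__(self, log_frequency: int = 1000) -> None:
        # Assumption: log_frequency means "log progress every N records".
        self.log_frequency = log_frequency

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        """Yield one dict per non-empty line of a text file."""
        count = 0
        with open(file_path, encoding="utf-8") as f:
            for line_no, line in enumerate(f):
                text = line.strip()
                if not text:
                    continue  # skip blank lines
                count += 1
                if count % self.log_frequency == 0:
                    logger.info("Processed %d records from %s", count, file_path)
                # Keys must match the names returned by output_columns().
                yield {"id": f"{file_path}:{line_no}", "content": text}

    def output_columns(self) -> list[str]:
        """Names of the keys present in every yielded record."""
        return ["id", "content"]


def collect(iterator: ToyLineIterator, file_path: str) -> list[dict[str, Any]]:
    """Materialize records, keeping only the declared columns in order."""
    columns = iterator.output_columns()
    return [{c: record[c] for c in columns} for record in iterator.iterate(file_path)]
```

Because iterate is a generator, records are produced lazily. A caller can stream them one at a time instead of calling collect when the input is large.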