stages.text.download.common_crawl.warc_iterator
Module Contents
Classes
CommonCrawlWarcIterator | Processes downloaded WARC files.
API
- class stages.text.download.common_crawl.warc_iterator.CommonCrawlWarcIterator
Bases: nemo_curator.stages.text.download.DocumentIterator
Processes downloaded WARC files.
- iterate(file_path: str) → collections.abc.Iterator[dict[str, Any]]
Process a task containing WARC files and extract their contents.
- output_columns() → list[str]
Define the output columns. The stage produces a DocumentBatch with records.
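
The contract described above (iterate yields one dictionary per record, and output_columns names the keys of those dictionaries) can be illustrated with a minimal, standard-library-only sketch. This is not the NeMo Curator implementation: the class name ToyWarcIterator, the helper _iterate_stream, the sample-building function _record, and the column names url, warc_id, and content are all assumptions made for illustration. The sketch also assumes an uncompressed WARC stream and keeps only records whose WARC-Type is response.

```python
import io
from collections.abc import Iterator
from typing import Any, BinaryIO


class ToyWarcIterator:
    """Hypothetical stand-in for a WARC document iterator (illustration only)."""

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        # Assumes an uncompressed WARC file; real crawl archives are usually gzipped.
        with open(file_path, "rb") as stream:
            yield from self._iterate_stream(stream)

    def _iterate_stream(self, stream: BinaryIO) -> Iterator[dict[str, Any]]:
        while True:
            line = stream.readline()
            if not line:
                return  # end of input
            if not line.startswith(b"WARC/"):
                continue  # skip the blank lines that separate records
            # Read "Name: value" header lines until the empty line that ends them.
            headers: dict[str, str] = {}
            while header_line := stream.readline().strip():
                name, _, value = header_line.partition(b":")
                headers[name.decode().lower()] = value.strip().decode()
            # The record body is exactly Content-Length bytes long.
            body = stream.read(int(headers.get("content-length", "0")))
            if headers.get("warc-type") != "response":
                continue  # assumption: only response records become documents
            yield {
                "url": headers.get("warc-target-uri"),
                "warc_id": headers.get("warc-record-id"),
                "content": body,
            }

    def output_columns(self) -> list[str]:
        # Assumed column names; they match the keys yielded above.
        return ["url", "warc_id", "content"]


def _record(warc_type: bytes, uri: bytes, body: bytes) -> bytes:
    """Build one synthetic WARC record for the demonstration below."""
    return (
        b"WARC/1.0\r\n"
        b"WARC-Type: " + warc_type + b"\r\n"
        b"WARC-Target-URI: " + uri + b"\r\n"
        b"WARC-Record-ID: <urn:uuid:1>\r\n"
        b"Content-Length: " + str(len(body)).encode() + b"\r\n"
        b"\r\n" + body + b"\r\n\r\n"
    )


# Two synthetic records: a request (skipped) and a response (kept).
sample = _record(b"request", b"http://a/", b"GET /") + _record(
    b"response", b"http://a/", b"<html>hi</html>"
)
iterator = ToyWarcIterator()
rows = list(iterator._iterate_stream(io.BytesIO(sample)))
```

Keeping the yielded keys identical to the list returned by output_columns is the design point the sketch is meant to show: a downstream consumer can then build a tabular batch from the records without inspecting each dictionary.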