stages.text.download.common_crawl.warc_iterator

Module Contents

Classes

CommonCrawlWarcIterator

Processes downloaded WARC files.

API

class stages.text.download.common_crawl.warc_iterator.CommonCrawlWarcIterator

Bases: nemo_curator.stages.text.download.DocumentIterator

Processes downloaded WARC files.

iterate(file_path: str) -> collections.abc.Iterator[dict[str, Any]]

Process the downloaded WARC file at file_path and yield the extracted contents of its records as dicts.

output_columns() -> list[str]

Define the output columns. The stage produces a DocumentBatch of records with these columns.
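
To illustrate how iterate() and output_columns() relate, here is a minimal sketch of an iterator with the same two-method shape. It is not the library's implementation. ToyWarcIterator, SAMPLE, demo, and the column names (url, warc_id, source_id, content) are hypothetical and chosen for illustration. The sketch uses only the standard library and handles uncompressed WARC data only; real Common Crawl files are gzip-compressed, and a production iterator would more likely rely on a dedicated WARC parsing library.

```python
import os
import tempfile
from collections.abc import Iterator
from typing import Any

# Two hand-written WARC records: one "warcinfo" and one "response".
SAMPLE = (
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 4\r\n\r\ninfo\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Record-ID: <urn:uuid:1234>\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
)


class ToyWarcIterator:
    """Hypothetical stand-in with the same method shape (not the library class)."""

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        with open(file_path, "rb") as fh:
            while True:
                line = fh.readline()
                if not line:
                    return  # end of file
                if not line.startswith(b"WARC/"):
                    continue  # blank separator lines between records
                headers: dict[str, str] = {}
                # Header lines run until the first empty line.
                while line := fh.readline().strip():
                    name, _, value = line.decode("utf-8").partition(":")
                    headers[name.strip().lower()] = value.strip()
                # The payload length is given by the Content-Length header.
                body = fh.read(int(headers.get("content-length", "0")))
                # Assumption: only "response" records become documents.
                if headers.get("warc-type") == "response":
                    yield {
                        "url": headers.get("warc-target-uri"),
                        "warc_id": headers.get("warc-record-id"),
                        "source_id": os.path.basename(file_path),
                        "content": body,
                    }

    def output_columns(self) -> list[str]:
        # Must match the keys of every dict yielded by iterate().
        return ["url", "warc_id", "source_id", "content"]


def demo() -> list[dict[str, Any]]:
    """Write SAMPLE to a temporary file and collect the yielded records."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.warc")
        with open(path, "wb") as fh:
            fh.write(SAMPLE)
        return list(ToyWarcIterator().iterate(path))
```

Calling demo() returns a list of plain dicts whose keys line up with output_columns(), which is the contract a downstream stage would presumably rely on when assembling records into a batch.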