nemo_curator.stages.text.download.common_crawl.warc_iterator

Module Contents

Classes

| Name | Description |
| --- | --- |
| CommonCrawlWarcIterator | Processes downloaded WARC files. |

API

class nemo_curator.stages.text.download.common_crawl.warc_iterator.CommonCrawlWarcIterator()

Bases: DocumentIterator

Processes downloaded WARC files.

nemo_curator.stages.text.download.common_crawl.warc_iterator.CommonCrawlWarcIterator.iterate(
file_path: str
) -> collections.abc.Iterator[dict[str, typing.Any]]

Process the WARC file at file_path and extract its contents, yielding each extracted item as a dictionary.

nemo_curator.stages.text.download.common_crawl.warc_iterator.CommonCrawlWarcIterator.output_columns() -> list[str]
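
The following is a minimal, standard-library-only sketch of the method shape documented above: an iterate(file_path) generator that yields dictionaries, paired with an output_columns() method that names their keys. It is not the library's implementation. The class SketchWarcIterator, the helper make_record, the column names (url, source_id, content), and the choice to keep only response records are all assumptions made for illustration, and the parser handles only uncompressed WARC input.

```python
import os
import tempfile
from collections.abc import Iterator
from typing import Any


class SketchWarcIterator:
    """Hypothetical stand-in that mirrors the documented method signatures."""

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        # Assumption: the file is an uncompressed WARC file.
        with open(file_path, "rb") as fh:
            while True:
                line = fh.readline()
                if not line:
                    return  # end of file
                if not line.startswith(b"WARC/"):
                    continue  # skip the blank lines that separate records
                headers: dict[str, str] = {}
                for raw in iter(fh.readline, b""):
                    if raw in (b"\r\n", b"\n"):
                        break  # a blank line ends the header block
                    name, _, value = raw.decode("utf-8").partition(":")
                    headers[name.strip()] = value.strip()
                # Read exactly the declared number of payload bytes.
                body = fh.read(int(headers.get("Content-Length", "0")))
                # Assumption: only response records are of interest.
                if headers.get("WARC-Type") == "response":
                    yield {
                        "url": headers.get("WARC-Target-URI"),
                        "source_id": os.path.basename(file_path),
                        "content": body,
                    }

    def output_columns(self) -> list[str]:
        # Assumed column names; they match the keys yielded by iterate().
        return ["url", "source_id", "content"]


def make_record(warc_type: str, url: str, payload: bytes) -> bytes:
    """Build one minimal WARC record (hypothetical helper for the demo)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


def demo() -> list[dict[str, Any]]:
    """Write a two-record sample file and collect what the iterator yields."""
    data = make_record("request", "http://example.com/", b"GET / HTTP/1.1\r\n\r\n")
    data += make_record("response", "http://example.com/", b"<html>hi</html>")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.warc")
        with open(path, "wb") as fh:
            fh.write(data)
        return list(SketchWarcIterator().iterate(path))


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Because iterate is a generator, records are produced lazily as the file is read, so a caller can stop early without parsing the rest of the file. Keeping output_columns() in step with the yielded keys lets downstream code learn the record schema without reading any data.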