download.commoncrawl#

Module Contents#

Classes#

CommonCrawlWARCDownloader

Downloads WARC files from the Common Crawl

CommonCrawlWARCDownloaderExtractOnly

A ‘dummy’ downloader that simply puts pre-downloaded files on the queue

CommonCrawlWARCExtractor

Abstract class for extracting text from records read from disk

CommonCrawlWARCIterator

Abstract iterator class for reading in raw records that have been downloaded to disk

HTMLExtractorAlgorithm

Abstract base class for HTML text extraction algorithms.

JusTextExtractor

Extracts text from HTML using the jusText boilerplate-removal algorithm.

ResiliparseExtractor

Extracts text from HTML using the Resiliparse library.

TrafilaturaExtractor

Extracts text from HTML using the Trafilatura library.

Functions#

decode_html

download_common_crawl

Downloads Common Crawl WARC snapshots and extracts text content using a specified extraction algorithm.

get_all_stop_words

get_stop_list_dict

lang_detect

try_decode_with_detected_encoding
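The listed entry point, download_common_crawl, takes a pair of Common Crawl snapshot identifiers in year-week form (e.g., "2020-50"). The sketch below is a hedged illustration: the import path, the parameter names beyond the snapshot pair, and the output_type value are assumptions not shown in this listing, so verify them against the actual signature before use.

```python
# Hedged usage sketch for download_common_crawl. Parameter names and the
# import path are assumptions; only the snapshot-range idea is documented here.
import re

# Common Crawl main snapshots are named by year and ISO week: "YYYY-WW".
SNAPSHOT_PATTERN = re.compile(r"^\d{4}-\d{2}$")


def looks_like_snapshot(s: str) -> bool:
    """Return True if s matches the year-week snapshot naming scheme."""
    return bool(SNAPSHOT_PATTERN.match(s))


def run_download() -> None:
    """Not executed here: requires nemo_curator and network access."""
    from nemo_curator.download import download_common_crawl  # assumed path

    download_common_crawl(
        "/output/path",        # where extracted text is written
        "2020-50",             # start snapshot (year-week)
        "2021-04",             # end snapshot (year-week)
        output_type="jsonl",   # assumed parameter
    )
```

The helper only validates the snapshot naming; the guarded function shows the shape of the call without executing it.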

Data#

API#

class download.commoncrawl.CommonCrawlWARCDownloader(
download_dir: str,
aws: bool = False,
verbose: bool = False,
)#

Bases: nemo_curator.download.doc_builder.DocumentDownloader

Downloads WARC files from the Common Crawl

Initialization

Creates a downloader

Args:

- download_dir: Path to store raw compressed WARC files
- aws: If True, uses the s5cmd command to download from the Common Crawl's S3 bucket. If False, uses wget.
- verbose: If True, logs stdout and stderr of the download command (s5cmd/wget)

download(url: str) → str#
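Per the docstring above, the downloader shells out to s5cmd (for S3) or wget (for HTTP). A minimal stand-in showing how such a command line could be assembled; the exact flags used by the real class are not shown in this listing, so the ones below are assumptions:

```python
# Illustrative sketch only: the real CommonCrawlWARCDownloader builds its own
# command line; the flags below are assumptions based on the docstring.

def build_download_command(url: str, download_dir: str, aws: bool) -> list[str]:
    """Assemble an s5cmd (S3) or wget (HTTP) command for one WARC file."""
    if aws:
        # s5cmd copies from Common Crawl's S3 bucket into the local directory.
        return ["s5cmd", "cp", url, download_dir]
    # wget fetches over HTTP and writes into the download directory.
    return ["wget", url, "-P", download_dir]


cmd = build_download_command(
    "s3://commoncrawl/crawl-data/example.warc.gz", "/tmp/warc", aws=True
)
```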
class download.commoncrawl.CommonCrawlWARCDownloaderExtractOnly(
aws: bool = False,
verbose: bool = False,
)#

Bases: nemo_curator.download.doc_builder.DocumentDownloader

A ‘dummy’ downloader that simply puts pre-downloaded files on the queue

Initialization

download(url: str) → str#
class download.commoncrawl.CommonCrawlWARCExtractor(
algorithm: download.commoncrawl.HTMLExtractorAlgorithm | None = None,
stop_lists: dict[str, frozenset[str]] | None = None,
)#

Bases: nemo_curator.download.doc_builder.DocumentExtractor

Abstract class for extracting text from records read from disk

Initialization

extract(content: str) → dict[str, str] | None#
class download.commoncrawl.CommonCrawlWARCIterator(log_frequency: int = 1000)#

Bases: nemo_curator.download.doc_builder.DocumentIterator

Abstract iterator class for reading in raw records that have been downloaded to disk

Initialization

iterate(
file_path: str,
) → collections.abc.Iterator[tuple[dict[str, str], str]]#
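The return type above means iterate yields (metadata, content) pairs per record. A toy iterator honoring that contract, as a hedged illustration (the real class parses actual WARC files, and the metadata keys shown are assumptions):

```python
from collections.abc import Iterator


def toy_iterate(records: list[str]) -> Iterator[tuple[dict[str, str], str]]:
    """Toy stand-in for CommonCrawlWARCIterator.iterate: yields a metadata
    dict and the raw record content, mirroring the documented return type."""
    for i, content in enumerate(records):
        meta = {"id": str(i), "source_id": "example.warc.gz"}  # assumed keys
        yield meta, content


pairs = list(toy_iterate(["<html>a</html>", "<html>b</html>"]))
```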
class download.commoncrawl.HTMLExtractorAlgorithm#

Bases: abc.ABC

Abstract base class for HTML text extraction algorithms.

abstractmethod extract_text(
html: str,
stop_words: frozenset[str],
language: str,
) → list[str] | None#
class download.commoncrawl.JusTextExtractor(
length_low: int = 70,
length_high: int = 200,
stopwords_low: float = 0.3,
stopwords_high: float = 0.32,
max_link_density: float = 0.2,
max_heading_distance: int = 200,
no_headings: bool = False,
is_boilerplate: bool | None = None,
logger: logging.Logger | None = None,
)#

Bases: download.commoncrawl.HTMLExtractorAlgorithm

Extracts text from HTML using the jusText boilerplate-removal algorithm.

Initialization

Initialize the jusText text extraction algorithm with specified parameters.

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora. The key idea is that long blocks can often be classified with high confidence, while shorter blocks require context-based adjustments.
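The constructor parameters above drive jusText's context-free classification: high link density marks a block as boilerplate, very short blocks carry too little evidence, and longer blocks are judged by stop-word density against the low/high thresholds. The sketch below condenses that logic for illustration; it is not the actual jusText implementation, and the rule ordering is simplified:

```python
def classify_block(
    text: str,
    stop_density: float,
    link_density: float,
    length_low: int = 70,
    length_high: int = 200,
    stopwords_low: float = 0.3,
    stopwords_high: float = 0.32,
    max_link_density: float = 0.2,
) -> str:
    """Condensed context-free classification in the spirit of jusText.
    Returns 'bad', 'good', 'near-good', or 'short'."""
    if link_density > max_link_density:
        return "bad"        # link-heavy block: almost certainly boilerplate
    if len(text) < length_low:
        return "short"      # too little text to decide without context
    if stop_density >= stopwords_high:
        # Plenty of function words; long blocks are confidently good.
        return "good" if len(text) >= length_high else "near-good"
    if stop_density >= stopwords_low:
        return "near-good"
    return "bad"            # few stop words: likely boilerplate
```

In full jusText, the "short" and "near-good" blocks are then re-labeled in a context-sensitive pass based on their good/bad neighbors, which is what the max_heading_distance and no_headings options influence.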

Here is an overview of the jusText algorithm:

- Segmentation: The document is split into textual blocks based on HTML tags that typically define separate sections (e.g., <div>, <p>, <table>).
- Preprocessing: Contents of <header>,