download.commoncrawl#

Module Contents#

Classes#

CommonCrawlWARCDownloader

Downloads WARC files from the Common Crawl

CommonCrawlWARCDownloaderExtractOnly

A ‘dummy’ downloader that simply puts pre-downloaded files on the queue

CommonCrawlWARCExtractor

Abstract class for extracting text from records read from disk

CommonCrawlWARCIterator

Abstract iterator class for reading in raw records that have been downloaded to disk

HTMLExtractorAlgorithm

Abstract base class for HTML text extraction algorithms.

JusTextExtractor

Extracts text from HTML using the jusText boilerplate-removal algorithm.

ResiliparseExtractor

Extracts text from HTML using the Resiliparse library.

TrafilaturaExtractor

Extracts text from HTML using the Trafilatura library.

Functions#

decode_html

download_common_crawl

Downloads Common Crawl WARC snapshots and extracts text content using a specified extraction algorithm.

get_all_stop_words

get_stop_list_dict

lang_detect

try_decode_with_detected_encoding
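The listed entry point, download_common_crawl, takes a pair of Common Crawl snapshot identifiers in year-week form (e.g., "2020-50"). The sketch below is a hedged illustration: the import path, the parameter names beyond the snapshot pair, and the output_type value are assumptions not shown in this listing, so verify them against the actual signature before use.

```python
# Hedged usage sketch for download_common_crawl. Parameter names and the
# import path are assumptions; only the snapshot-range idea is documented here.
import re

# Common Crawl main snapshots are named by year and ISO week: "YYYY-WW".
SNAPSHOT_PATTERN = re.compile(r"^\d{4}-\d{2}$")


def looks_like_snapshot(s: str) -> bool:
    """Return True if s matches the year-week snapshot naming scheme."""
    return bool(SNAPSHOT_PATTERN.match(s))


def run_download() -> None:
    """Not executed here: requires nemo_curator and network access."""
    from nemo_curator.download import download_common_crawl  # assumed path

    download_common_crawl(
        "/output/path",        # where extracted text is written
        "2020-50",             # start snapshot (year-week)
        "2021-04",             # end snapshot (year-week)
        output_type="jsonl",   # assumed parameter
    )
```

The helper only validates the snapshot naming; the guarded function shows the shape of the call without executing it.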

Data#

API#

class download.commoncrawl.CommonCrawlWARCDownloader(
download_dir: str,
aws: bool = False,
verbose: bool = False,
)#

Bases: nemo_curator.download.doc_builder.DocumentDownloader

Downloads WARC files from the Common Crawl

Initialization

Creates a downloader

Args:

- download_dir: Path to store raw compressed WARC files
- aws: If True, uses the s5cmd command to download from the Common Crawl's S3 bucket. If False, uses wget.
- verbose: If True, logs stdout and stderr of the download command (s5cmd/wget)

download(url: str) → str#
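Per the docstring above, the downloader shells out to s5cmd (for S3) or wget (for HTTP). A minimal stand-in showing how such a command line could be assembled; the exact flags used by the real class are not shown in this listing, so the ones below are assumptions:

```python
# Illustrative sketch only: the real CommonCrawlWARCDownloader builds its own
# command line; the flags below are assumptions based on the docstring.

def build_download_command(url: str, download_dir: str, aws: bool) -> list[str]:
    """Assemble an s5cmd (S3) or wget (HTTP) command for one WARC file."""
    if aws:
        # s5cmd copies from Common Crawl's S3 bucket into the local directory.
        return ["s5cmd", "cp", url, download_dir]
    # wget fetches over HTTP and writes into the download directory.
    return ["wget", url, "-P", download_dir]


cmd = build_download_command(
    "s3://commoncrawl/crawl-data/example.warc.gz", "/tmp/warc", aws=True
)
```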
class download.commoncrawl.CommonCrawlWARCDownloaderExtractOnly(
aws: bool = False,
verbose: bool = False,
)#

Bases: nemo_curator.download.doc_builder.DocumentDownloader

A ‘dummy’ downloader that simply puts pre-downloaded files on the queue

Initialization

download(url: str) → str#
class download.commoncrawl.CommonCrawlWARCExtractor(
algorithm: download.commoncrawl.HTMLExtractorAlgorithm | None = None,
stop_lists: dict[str, frozenset[str]] | None = None,
)#

Bases: nemo_curator.download.doc_builder.DocumentExtractor

Abstract class for extracting text from records read from disk

Initialization

extract(content: str) → dict[str, str] | None#
class download.commoncrawl.CommonCrawlWARCIterator(log_frequency: int = 1000)#

Bases: nemo_curator.download.doc_builder.DocumentIterator

Abstract iterator class for reading in raw records that have been downloaded to disk

Initialization

iterate(
file_path: str,
) → collections.abc.Iterator[tuple[dict[str, str], str]]#
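The return type above means iterate yields (metadata, content) pairs per record. A toy iterator honoring that contract, as a hedged illustration (the real class parses actual WARC files, and the metadata keys shown are assumptions):

```python
from collections.abc import Iterator


def toy_iterate(records: list[str]) -> Iterator[tuple[dict[str, str], str]]:
    """Toy stand-in for CommonCrawlWARCIterator.iterate: yields a metadata
    dict and the raw record content, mirroring the documented return type."""
    for i, content in enumerate(records):
        meta = {"id": str(i), "source_id": "example.warc.gz"}  # assumed keys
        yield meta, content


pairs = list(toy_iterate(["<html>a</html>", "<html>b</html>"]))
```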
class download.commoncrawl.HTMLExtractorAlgorithm#

Bases: abc.ABC

Abstract base class for HTML text extraction algorithms.

abstractmethod extract_text(
html: str,
stop_words: frozenset[str],
language: str,
) → list[str] | None#
class download.commoncrawl.JusTextExtractor(
length_low: int = 70,
length_high: int = 200,
stopwords_low: float = 0.3,
stopwords_high: float = 0.32,
max_link_density: float = 0.2,
max_heading_distance: int = 200,
no_headings: bool = False,
is_boilerplate: bool | None = None,
logger: logging.Logger | None = None,
)#

Bases: download.commoncrawl.HTMLExtractorAlgorithm

Extracts text from HTML using the jusText boilerplate-removal algorithm.

Initialization

Initialize the jusText text extraction algorithm with specified parameters.

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora. The key idea is that long blocks can often be classified with high confidence, while shorter blocks require context-based adjustments.
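The constructor parameters above drive jusText's context-free classification: high link density marks a block as boilerplate, very short blocks carry too little evidence, and longer blocks are judged by stop-word density against the low/high thresholds. The sketch below condenses that logic for illustration; it is not the actual jusText implementation, and the rule ordering is simplified:

```python
def classify_block(
    text: str,
    stop_density: float,
    link_density: float,
    length_low: int = 70,
    length_high: int = 200,
    stopwords_low: float = 0.3,
    stopwords_high: float = 0.32,
    max_link_density: float = 0.2,
) -> str:
    """Condensed context-free classification in the spirit of jusText.
    Returns 'bad', 'good', 'near-good', or 'short'."""
    if link_density > max_link_density:
        return "bad"        # link-heavy block: almost certainly boilerplate
    if len(text) < length_low:
        return "short"      # too little text to decide without context
    if stop_density >= stopwords_high:
        # Plenty of function words; long blocks are confidently good.
        return "good" if len(text) >= length_high else "near-good"
    if stop_density >= stopwords_low:
        return "near-good"
    return "bad"            # few stop words: likely boilerplate
```

In full jusText, the "short" and "near-good" blocks are then re-labeled in a context-sensitive pass based on their good/bad neighbors, which is what the max_heading_distance and no_headings options influence.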

Here is an overview of the jusText algorithm:

- Segmentation: The document is split into textual blocks based on HTML tags that typically define separate sections (e.g., <div>, <p>, <table>).
- Preprocessing: Contents of <header>,