download.commoncrawl#
Module Contents#
Classes#
- CommonCrawlWARCDownloader: Downloads WARC files from the Common Crawl
- CommonCrawlWARCDownloaderExtractOnly: A ‘dummy’ downloader that simply puts pre-downloaded files on the queue
- CommonCrawlWARCExtractor: Abstract class for extracting text from records read from disk
- CommonCrawlWARCIterator: Abstract iterator class for reading in raw records that have been downloaded to disk
- HTMLExtractorAlgorithm: Helper class that provides a standard way to create an ABC using inheritance.
- JusTextExtractor: Helper class that provides a standard way to create an ABC using inheritance.
- ResiliparseExtractor: Helper class that provides a standard way to create an ABC using inheritance.
- TrafilaturaExtractor: Helper class that provides a standard way to create an ABC using inheritance.
Functions#
- download_common_crawl: Downloads Common Crawl WARC snapshots and extracts text content using a specified extraction algorithm.
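This function is the module's main entry point. As a hedged sketch (the function name, argument order, and snapshot identifiers below are assumptions based on the broader NeMo Curator API rather than on this excerpt), a typical call might look like:

```python
from nemo_curator.download import download_common_crawl
from nemo_curator.download.commoncrawl import JusTextExtractor

# Assumed arguments: an output directory, a snapshot range in CC-MAIN
# "YYYY-WW" form, an output format, and an HTML extraction algorithm
# chosen from the classes documented below.
dataset = download_common_crawl(
    "/output/common_crawl/",       # where extracted records are written
    "2023-06",                     # first snapshot to download
    "2023-10",                     # last snapshot to download
    output_type="jsonl",
    algorithm=JusTextExtractor(),  # any HTMLExtractorAlgorithm subclass
)
```

The returned object is expected to be a document dataset ready for downstream curation steps.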
Data#
API#
- class download.commoncrawl.CommonCrawlWARCDownloader(download_dir: str, aws: bool = False, verbose: bool = False)#
Bases:
nemo_curator.download.doc_builder.DocumentDownloader
Downloads WARC files from the Common Crawl
Initialization
Creates a downloader
Args:
- download_dir: Path to store raw compressed WARC files
- aws: If True, uses the s5cmd command to download from the Common Crawl’s S3 bucket. If False, uses wget.
- verbose: If True, logs stdout and stderr of the download command (s5cmd/wget)
- download(url: str) → str#
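For finer control, the downloader can also be used on its own. A minimal sketch, assuming download() returns the local path of the fetched file and using a placeholder WARC URL:

```python
from nemo_curator.download.commoncrawl import CommonCrawlWARCDownloader

# Store compressed WARC files under ./warc_downloads and fetch them with
# wget (aws=False); set aws=True to use s5cmd against the S3 bucket instead.
downloader = CommonCrawlWARCDownloader(
    download_dir="./warc_downloads",
    aws=False,
    verbose=True,
)

# Placeholder URL; real paths come from a snapshot's warc.paths listing.
warc_url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/example/warc/example.warc.gz"

# Assumed to return the path of the downloaded file on local disk.
local_path = downloader.download(warc_url)
```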
- class download.commoncrawl.CommonCrawlWARCDownloaderExtractOnly(aws: bool = False, verbose: bool = False)#
Bases:
nemo_curator.download.doc_builder.DocumentDownloader
A ‘dummy’ downloader that simply puts pre-downloaded files on the queue
Initialization
- download(url: str) → str#
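Since this variant performs no download, a hedged sketch of its intended use (an assumption based on the class description) is to hand it WARC files that are already on disk:

```python
from nemo_curator.download.commoncrawl import CommonCrawlWARCDownloaderExtractOnly

# No download_dir is needed because nothing is fetched; the "downloader"
# simply forwards files that were downloaded ahead of time.
downloader = CommonCrawlWARCDownloaderExtractOnly(verbose=True)

# Assumed behavior: the returned string refers to the same pre-downloaded file.
local_path = downloader.download("/data/warc/CC-MAIN-example.warc.gz")
```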
- class download.commoncrawl.CommonCrawlWARCExtractor(
- algorithm: download.commoncrawl.HTMLExtractorAlgorithm | None = None,
- stop_lists: dict[str, frozenset[str]] | None = None,
- )#
Bases:
nemo_curator.download.doc_builder.DocumentExtractor
Abstract class for extracting text from records read from disk
Initialization
- extract(content: str) → dict[str, str] | None#
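A hedged sketch of extracting text from the HTML payload of a single record; the shape of the returned dictionary is an assumption (roughly, detected language plus cleaned text), and None indicates that no usable text was found:

```python
from nemo_curator.download.commoncrawl import (
    CommonCrawlWARCExtractor,
    JusTextExtractor,
)

# Default construction; a different HTMLExtractorAlgorithm subclass (or a
# custom stop_lists mapping) can be supplied instead.
extractor = CommonCrawlWARCExtractor(algorithm=JusTextExtractor())

html = "<html><body><p>A full sentence of genuine page content for extraction.</p></body></html>"

record = extractor.extract(html)
if record is not None:
    print(record)  # e.g. a mapping with language and extracted text (assumed keys)
```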
- class download.commoncrawl.CommonCrawlWARCIterator(log_frequency: int = 1000)#
Bases:
nemo_curator.download.doc_builder.DocumentIterator
Abstract iterator class for reading in raw records that have been downloaded to disk
Initialization
- iterate(file_path: str)#
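A sketch of walking a downloaded WARC file; the structure of the yielded items is not shown in this excerpt, so the loop below only inspects them:

```python
from nemo_curator.download.commoncrawl import CommonCrawlWARCIterator

# Log progress every 1000 records (the constructor default).
iterator = CommonCrawlWARCIterator(log_frequency=1000)

# Hypothetical path to a WARC file fetched by one of the downloaders above.
warc_path = "./warc_downloads/example.warc.gz"

for item in iterator.iterate(warc_path):
    print(type(item), item)  # inspect whatever record structure is yielded
    break
```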
- class download.commoncrawl.HTMLExtractorAlgorithm#
Bases:
abc.ABC
Helper class that provides a standard way to create an ABC using inheritance.
- abstractmethod extract_text(html: str, stop_words: frozenset[str], language: str)#
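Because HTMLExtractorAlgorithm is an ABC, new extraction strategies are plugged in by implementing extract_text. A minimal sketch follows; the tag-stripping logic and the list-of-strings return value are illustrative assumptions, not part of the documented interface:

```python
import re

from nemo_curator.download.commoncrawl import HTMLExtractorAlgorithm


class NaiveParagraphExtractor(HTMLExtractorAlgorithm):
    """Toy algorithm: strip tags and keep blocks that look like prose."""

    def extract_text(
        self,
        html: str,
        stop_words: frozenset[str],
        language: str,
    ) -> list[str] | None:
        # Crudely remove tags, then split on blank lines into candidate blocks.
        text = re.sub(r"<[^>]+>", "\n", html)
        blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]

        # Keep blocks containing at least one stop word, a rough proxy for
        # "full sentences" in the given language.
        kept = [
            b for b in blocks
            if any(w.lower() in stop_words for w in b.split())
        ]
        return kept or None
```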
- class download.commoncrawl.JusTextExtractor(
- length_low: int = 70,
- length_high: int = 200,
- stopwords_low: float = 0.3,
- stopwords_high: float = 0.32,
- max_link_density: float = 0.2,
- max_heading_distance: int = 200,
- no_headings: bool = False,
- is_boilerplate: bool | None = None,
- logger: logging.Logger | None = None,
- )#
Bases:
download.commoncrawl.HTMLExtractorAlgorithm
Helper class that provides a standard way to create an ABC using inheritance.
Initialization
Initialize the jusText text extraction algorithm with specified parameters.
jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora. The key idea is that long blocks can often be classified with high confidence, while shorter blocks require context-based adjustments.
Here is an overview of the jusText algorithm:
• Segmentation: The document is split into textual blocks based on HTML tags that typically define separate sections (e.g., <div>, <p>, <table>).
• Preprocessing: Contents of <header>, <style>, and <script> tags are removed.
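A hedged sketch of exercising the jusText-based extractor directly. The stop-word set, the uppercase language label, and the assumption that extract_text returns the retained text blocks (or None) are illustrative and not taken from this excerpt:

```python
from nemo_curator.download.commoncrawl import JusTextExtractor

# Thresholds mirror the constructor defaults above; raising stopwords_low and
# stopwords_high makes the classification of short blocks stricter.
extractor = JusTextExtractor(
    length_low=70,
    length_high=200,
    stopwords_low=0.3,
    stopwords_high=0.32,
    max_link_density=0.2,
)

# Small, assumed English stop-word set; real pipelines would use a full list.
stop_words = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "it"})

html = """
<html><body>
  <p>This is a long paragraph of genuine article text that contains many of the
  common function words of the English language, and it is therefore likely to
  be kept rather than discarded as boilerplate.</p>
  <div><a href="/home">Home</a> | <a href="/about">About</a></div>
</body></html>
"""

blocks = extractor.extract_text(html, stop_words, "ENGLISH")
print(blocks)
```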