nemo_curator.stages.text.download.common_crawl.stage
Module Contents
Classes
API
Bases: DocumentDownloadExtractStage
Composite stage for downloading and processing Common Crawl data.
This pipeline:
- Generates WARC URLs (either from main or news crawls)
- Downloads WARC files
- Extracts content from WARC files
- Extracts text from HTML content
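The four steps above can be sketched as a chain of plain functions. This is only an illustrative toy using the standard library, not the stage's actual implementation: every function name, the record layout, and the `example.invalid` URLs are assumptions made for the example, and no network access or real WARC parsing takes place.

```python
import re
from typing import Iterator


def generate_urls(crawl_type: str, count: int) -> list[str]:
    """Step 1 (hypothetical): produce WARC URLs for the 'main' or 'news' crawl."""
    if crawl_type not in ("main", "news"):
        raise ValueError(f"unknown crawl type: {crawl_type!r}")
    # Placeholder URLs; a real generator would query the crawl's path listings.
    return [
        f"https://example.invalid/{crawl_type}/segment-{i}.warc.gz"
        for i in range(count)
    ]


def download(url: str) -> str:
    """Step 2 (hypothetical): stand-in for a download; returns a fake local path."""
    return "/tmp/" + url.rsplit("/", 1)[-1]


def iterate_records(path: str) -> Iterator[dict]:
    """Step 3 (hypothetical): stand-in for reading WARC records from a file."""
    # One fabricated HTML record per file, purely for illustration.
    yield {"source": path, "html": f"<html><body><p>Text from {path}</p></body></html>"}


def extract_text(record: dict) -> dict:
    """Step 4 (hypothetical): crude tag stripping in place of a real HTML extractor."""
    text = re.sub(r"<[^>]+>", " ", record["html"])
    return {"source": record["source"], "text": " ".join(text.split())}


def run_pipeline(crawl_type: str, count: int) -> list[dict]:
    """Chain the four steps in the order the description lists them."""
    docs = []
    for url in generate_urls(crawl_type, count):
        path = download(url)
        for record in iterate_records(path):
            docs.append(extract_text(record))
    return docs


docs = run_pipeline("news", 2)
```

Each step consumes only the output of the previous one, which is what lets a composite stage be split into independently schedulable parts.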
downloader
extractor
iterator
name
url_generator
Decompose this composite stage into its constituent stages.
Get a description of this composite stage.
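As a rough illustration of the composite pattern described here, the sketch below models a stage that holds its components as attributes, can be decomposed into them, and can describe itself. The attribute names mirror those listed above; the classes `Stage` and `ToyCompositeStage`, the method names `decompose` and `get_description`, and the ordering of the constituent stages are assumptions for the example and are not taken from the library.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """Hypothetical minimal stage: just a name."""
    name: str


@dataclass
class ToyCompositeStage:
    """Hypothetical composite holding the four components as attributes."""
    url_generator: Stage
    downloader: Stage
    iterator: Stage
    extractor: Stage
    name: str = "toy_composite"

    def decompose(self) -> list[Stage]:
        # Assumed execution order: generate URLs, download, iterate, extract.
        return [self.url_generator, self.downloader, self.iterator, self.extractor]

    def get_description(self) -> str:
        # Summarize the composite as a chain of its constituent stage names.
        return f"{self.name}: " + " -> ".join(s.name for s in self.decompose())


composite = ToyCompositeStage(
    url_generator=Stage("url_generation"),
    downloader=Stage("download"),
    iterator=Stage("iterate"),
    extractor=Stage("extract"),
)
```

Calling `composite.decompose()` returns the four `Stage` objects in order, so an executor could schedule them individually instead of treating the composite as a single opaque unit.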