The download_and_extract utility within the NeMo Data Curator is a generic tool that can be used to download and extract text from a number of different datasets. In general, it can be invoked as follows to download and extract text from the web:
download_and_extract \
--input-url-file=<Path to .txt file containing list of URLs> \
--builder-config-file=<Path to .yaml file that describes how the data should be downloaded and extracted> \
--output-json-dir=<Path to output directory to which data will be written in .jsonl format>
This utility takes as input a list of URLs that point to files containing prepared, unextracted data (e.g., pre-crawled web pages from Common Crawl), a config file that describes how to download and extract the data, and an output directory to which the extracted text will be written in .jsonl format (one JSON document per line). For each URL provided in the list of URLs, a corresponding .jsonl file will be written to the output directory.
The config file that must be provided at runtime should take the following form:
download_module: ndc.download.mydataset.DatasetDownloader
download_params: {}
iterator_module: ndc.download.mydataset.DatasetIterator
iterator_params: {}
extract_module: ndc.download.mydataset.DatasetExtractor
extract_params: {}
Each pair of lines corresponds to an implementation of the abstract DocumentDownloader, DocumentIterator, and DocumentExtractor classes. In this case, the dummy names DatasetDownloader, DatasetIterator, and DatasetExtractor have been provided. For this example, each of these has been defined within the fictitious file ndc/download/mydataset.py.
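To make this structure concrete, below is a minimal sketch of what ndc/download/mydataset.py might contain. The method names (download, iterate, extract) and constructor arguments are assumptions chosen for illustration only; a real implementation would subclass the DocumentDownloader, DocumentIterator, and DocumentExtractor abstract classes and implement whatever interface they actually define.

# A minimal sketch of the three classes referenced by the config above.
# The method names used here are illustrative assumptions; consult the
# DocumentDownloader, DocumentIterator, and DocumentExtractor base classes in
# the NeMo Data Curator source for the exact interface they require.
import os
import urllib.request


class DatasetDownloader:
    """Downloads one raw file for each URL in the input URL list."""

    def __init__(self, download_dir="./raw_data"):
        self._download_dir = download_dir
        os.makedirs(download_dir, exist_ok=True)

    def download(self, url):
        # Fetch the remote file and return the local path it was written to.
        output_file = os.path.join(self._download_dir, os.path.basename(url))
        urllib.request.urlretrieve(url, output_file)
        return output_file


class DatasetIterator:
    """Iterates over the individual records contained in one downloaded file."""

    def iterate(self, file_path):
        # Assume one record per line; real datasets (e.g., WARC files) require
        # format-specific parsing.
        with open(file_path, "rb") as f:
            for line in f:
                yield line


class DatasetExtractor:
    """Extracts clean text (and any metadata) from a single record."""

    def extract(self, record):
        text = record.decode("utf-8", errors="ignore").strip()
        if text:
            return {"text": text}
        return None  # Records with no usable text are skipped.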
Within the NeMo Data Curator, we already provide implementations of each of these classes for the Common Crawl, Wikipedia, and ArXiv datasets. For the remainder of this document, we walk users through how to use the download_and_extract utility to efficiently prepare terabytes of web crawl data using the provided Common Crawl implementation.
The Common Crawl data repository consists of temporal snapshots of significant portions of the web that can be downloaded and used by the general public. In total, it contains several petabytes of web crawls stored in millions of compressed Web ARChive (WARC) files (.warc.gz). The crawls performed to obtain the Common Crawl data are carried out at semi-regular intervals and the result of each crawl is labeled by the year and week number of the year when the crawl was posted. In this short guide, we will be working with the November/December 2020 crawl which is labeled as CC-MAIN-2020-50. As posted on the page describing the crawl, this crawl contains about 60 TB (compressed) in total of crawled web pages. Using the tools provided within the NeMo Data Curator, users can download and extract the text from the crawled HTML in a distributed fashion.
As described in the first section of this document, the first step towards using download_and_extract for Common Crawl is to create a list of URLs that point to the WARC files hosted by Common Crawl.
Within the NDC, we provide the utility get_common_crawl_urls to obtain these URLs. This utility can be run as follows:
get_common_crawl_urls \
--cc-snapshot-index-file=./url_data/collinfo.json \
--starting-snapshot="2020-50" \
--ending-snapshot="2020-50" \
--output-warc-url-file=./url_data/warc_urls_cc_2020_50.txt
This script pulls the Common Crawl index from https://index.commoncrawl.org and stores the index to the file specified by the argument --cc-snapshot-index-file. It then retrieves all WARC URLs between the snapshots specified by the arguments --starting-snapshot and --ending-snapshot. Finally, it writes all WARC URLs to the text file specified by --output-warc-url-file. This file is a simple text file with the following format:
https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00000.warc.gz
https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00001.warc.gz
https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00002.warc.gz
https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00003.warc.gz
https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00004.warc.gz
...
For the CC-MAIN-2020-50 snapshot, there are a total of 72,000 compressed WARC files, each between 800 and 900 MB in size.
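For reference, the same URL list can be assembled directly from publicly documented Common Crawl artifacts: the collinfo.json index that get_common_crawl_urls consumes and the per-crawl warc.paths.gz listing. The sketch below is not the NDC implementation (which also handles snapshot ranges and index caching); it simply illustrates where these URLs come from.

# A sketch of how the WARC URL list for a single snapshot could be assembled
# by hand. The get_common_crawl_urls utility performs the equivalent work,
# including filtering by the requested snapshot range.
import gzip
import io
import json
import os
import urllib.request

INDEX_URL = "https://index.commoncrawl.org/collinfo.json"
DATA_PREFIX = "https://data.commoncrawl.org/"


def get_warc_urls(snapshot="CC-MAIN-2020-50"):
    # Confirm the requested snapshot appears in the published crawl index.
    with urllib.request.urlopen(INDEX_URL) as resp:
        crawls = json.load(resp)
    if snapshot not in {crawl["id"] for crawl in crawls}:
        raise ValueError(f"Snapshot {snapshot} not found in {INDEX_URL}")

    # Each crawl publishes a gzipped list of relative WARC paths.
    paths_url = f"{DATA_PREFIX}crawl-data/{snapshot}/warc.paths.gz"
    with urllib.request.urlopen(paths_url) as resp:
        with gzip.GzipFile(fileobj=io.BytesIO(resp.read())) as gz:
            paths = gz.read().decode("utf-8").splitlines()

    # Prefix each relative path to form a full download URL.
    return [DATA_PREFIX + path for path in paths]


if __name__ == "__main__":
    os.makedirs("./url_data", exist_ok=True)
    with open("./url_data/warc_urls_cc_2020_50.txt", "w") as f:
        f.write("\n".join(get_warc_urls("CC-MAIN-2020-50")) + "\n")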
Now, with the prepared list of URLs, we can use the Common Crawl config included in the config directory under the root directory of the repository. This config uses the downloader, iterator, and extractor classes defined in the file ndc/download/commoncrawl.py.
With this config and the input list of URLs, the download_and_extract utility can be used as follows to download and extract text from Common Crawl:
download_and_extract \
--input-url-file=./url_data/warc_urls_cc_2020_50.txt \
--builder-config-file=./config/cc_warc_builder.yaml \
--log-dir=./log/download_cc \
--output-json-dir=/datasets/CC-MAIN-2020-50/json
If used in conjunction with the srun command described in the General Usage section of the README, the above command will distribute the input URLs across the MPI ranks, each of which is assigned to a unique node. Then, with its assigned URLs, each MPI rank will by default fork a single download process and as many extraction processes as there are CPUs on the node. The WARC downloader pulls WARC files from S3 buckets using s5cmd, which is installed in the NeMo Megatron Container (note that to use s5cmd, one must have the necessary credentials within ~/.aws/config). Once the downloader has fully written a WARC file to disk, a separate process opens the WARC file and iterates over its records (via the CommonCrawlWARCIterator). Then, for each WARC record, the CommonCrawlWARCExtractor performs the following steps (sketched in code after the list):
1. Decode the HTML within the record from binary to text.
2. If the HTML can be properly decoded, perform language detection on the input HTML with pyCLD2.
3. Finally, extract the relevant text from the HTML with jusText and write it out as a single string within the 'text' field of a JSON entry within a .jsonl file.
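The snippet below is a simplified sketch of these three steps applied to one record's HTML payload, calling pycld2 and justext directly. It is not the CommonCrawlWARCExtractor itself; in particular, passing the English stop list to jusText is an assumption made here for brevity, whereas the real extractor chooses language-appropriate extraction settings.

# A simplified sketch of the per-record extraction steps. Not the actual
# CommonCrawlWARCExtractor: the English jusText stop list below is a
# simplifying assumption.
import justext
import pycld2 as cld2


def extract_record(html_bytes):
    # 1. Decode the HTML within the record from binary to text.
    try:
        html = html_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return None  # Skip records that cannot be properly decoded.

    # 2. Perform language detection on the decoded HTML with pyCLD2.
    try:
        _, _, details = cld2.detect(html)
    except cld2.error:
        return None
    language = details[0][0]  # e.g., "ENGLISH", "KOREAN"

    # 3. Extract the relevant text with jusText and join it into the single
    #    string that will populate the 'text' field of a .jsonl entry.
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    text = "\n\n".join(p.text for p in paragraphs if not p.is_boilerplate)
    if not text:
        return None

    return {"text": text, "language": language}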
As the text is extracted from the WARC records, the prepared documents are written to the directory specified by --output-json-dir. Here is an example of a single line of an output .jsonl file extracted from a WARC record:
{"text": "커뮤니티\n\n어린이 요리 교실은 평소 조리와 제과 제빵에 관심이 있는 초등학생을 대상으로 나이프스킬, 한식, 중식, 양식, 제과, 제빵, 디저트,
생활요리 등 요리 기초부터 시작해 다양한 요리에 대해 배우고, 경험할 수 있도록 구성되었다.\n\n요즘 부모들의 자녀 요리 교육에 대한 관심이 높아지고
있는데, 어린이 요리교실은 자녀들이 어디서 어떻게 요리를 처음 시작할지 막막하고 어려워 고민하는 이들을 위해 만들어졌다.\n\n그 뿐만 아니라 학생들이
식재료를 다루는 과정에서 손으로 만지고 느끼는 것이 감각을 자극하여 두뇌발달에 도움을 주며, 조리를 통해 자신의 감정을 자연스럽게 표현할 수
있고 이를 통해 정서적 안정을 얻을 수 있다. 또한, 다양한 사물을 만져 보면서 차이점을 구별하고 사물의 특징에 대해 인지할 수 있으므로 인지 능력 향상에
도움이 되며, 만지고 느끼고 비교하는 과정에서 감각 기능을 향상시킬 수 있다.\n\n방과 후 시간이 되지 않는 초등학생들을 위해 평일반 뿐만 아니라 주말반도
운영하고 있으며 두 분의 선생님들의 안전적인 지도하에 수업이 진행된다. 한국조리예술학원은 젊은 감각과 학생들과의 소통을 통해 자발적인 교육을 가르친다.
자세한 학원 문의는 한국조리예술학원 홈페이지나 대표 전화, 카카오톡 플러스친구를 통해 가능하다.", "id": "a515a7b6-b6ec-4bed-998b-8be2f86f8eac",
"source_id": "https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-50/segments/1606141163411.0/warc/CC-MAIN-20201123153826-20201123183826-00000.warc.gz",
"url": "http://hanjowon.co.kr/web/home.php?mid=70&go=pds.list&pds_type=1&start=20&num=67&s_key1=&s_que=", "language": "KOREAN"}
Once all records within a WARC file have been processed, the WARC file is deleted from disk by default.
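Once the run completes, the output directory can be treated as an ordinary .jsonl dataset and each file read back one JSON document per line. The filename below is a placeholder for one of the files written under --output-json-dir.

# Quick sanity check on the extracted output: every line is one JSON document
# with 'text', 'id', 'source_id', 'url', and 'language' fields.
# The path below is a placeholder, not a file name produced by the utility.
import json

with open("/datasets/CC-MAIN-2020-50/json/example.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        print(doc["language"], doc["url"], len(doc["text"]))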