Download and Extract#

Base Classes#

class nemo_curator.download.DocumentDownloader#

Abstract class for downloading remote data to disk.

class nemo_curator.download.DocumentIterator#

Abstract iterator class for reading raw records that have been downloaded to disk.

class nemo_curator.download.DocumentExtractor#

Abstract class for extracting text from records read from disk.
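
A custom data source is wired up by subclassing all three. The sketch below is illustrative only: it assumes the abstract hooks are named download, iterate, and extract, uses requests for the HTTP fetch, and simplifies the record shapes (the iterator yields a metadata dict plus raw content; the extractor returns a dict of column values).

    import os

    import requests

    from nemo_curator.download import (
        DocumentDownloader,
        DocumentExtractor,
        DocumentIterator,
    )


    class PlainTextDownloader(DocumentDownloader):
        """Fetches a remote text file over HTTP and stores it on disk."""

        def __init__(self, download_dir: str):
            super().__init__()
            self._download_dir = download_dir

        def download(self, url: str) -> str:
            output_file = os.path.join(self._download_dir, os.path.basename(url))
            if not os.path.exists(output_file):
                response = requests.get(url, timeout=60)
                with open(output_file, "wb") as f:
                    f.write(response.content)
            return output_file


    class LineIterator(DocumentIterator):
        """Yields one record per non-empty line of a downloaded file."""

        def iterate(self, file_path: str):
            with open(file_path, encoding="utf-8") as f:
                for line_number, line in enumerate(f):
                    if line.strip():
                        # Metadata dict plus the raw content for this record
                        yield {"id": f"{os.path.basename(file_path)}-{line_number}"}, line


    class PassthroughExtractor(DocumentExtractor):
        """Returns the extracted columns for one record."""

        def extract(self, content: str) -> dict:
            return {"text": content.strip()}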

nemo_curator.download.batch_download(
urls: list[str],
downloader: DocumentDownloader,
)#

Downloads all of the URLs in parallel using the given downloader.
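
As a hedged usage sketch (the WARC URL and download directory are placeholders, and the return value is not relied upon here):

    from nemo_curator.download import CommonCrawlWARCDownloader, batch_download

    downloader = CommonCrawlWARCDownloader(download_dir="/data/raw_warc")
    warc_urls = [
        # Placeholder URL; substitute real WARC paths from a Common Crawl index
        "https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-04/segments/example/warc/example-00000.warc.gz",
    ]
    batch_download(warc_urls, downloader)  # each URL is fetched in parallel into download_dir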

nemo_curator.download.download_and_extract(
urls: list[str],
output_paths: list[str],
downloader: DocumentDownloader,
iterator: DocumentIterator,
extractor: DocumentExtractor,
output_format: dict,
output_type: Literal['jsonl', 'parquet'] = 'jsonl',
keep_raw_download: bool = False,
force_download: bool = False,
input_meta: str | dict | None = None,
filename_col: str = 'file_name',
record_limit: int | None = None,
)#

Download files from the given URLs, extract their records, and construct a DocumentDataset.

For each URL provided, this function downloads the corresponding file (unless an extracted output already exists and force_download is False), iterates over its records, extracts the desired content, and finally converts all records into a DocumentDataset.

Parameters:
  • urls (List[str]) – A list of URLs from which to download dataset files.

  • output_paths (List[str]) – A list of file paths where the extracted outputs should be found. If a file already exists at a given path and force_download is False, that partition is skipped.

  • downloader (DocumentDownloader) – The downloader instance responsible for fetching files from the specified URLs.

  • iterator (DocumentIterator) – The iterator instance used to traverse the downloaded file and yield records.

  • extractor (DocumentExtractor) – The extractor instance used to obtain the desired content from each record.

  • output_format (dict) – A dictionary mapping column names to the data types for the extracted records.

  • output_type (Literal["jsonl", "parquet"], optional) – The output file format/extension. Must be either “jsonl” or “parquet”. Defaults to “jsonl”. This parameter is only used to verify whether an extracted output already exists.

  • keep_raw_download (bool, optional) – If True, the raw downloaded files are retained after extraction. Defaults to False.

  • force_download (bool, optional) – If False and an output file already exists at a given path, the download and extraction for that file are skipped. Defaults to False.

  • input_meta (Union[str, dict], optional) – Optional metadata describing the input file’s schema. Defaults to None.

  • filename_col (str, optional) – The name for the column in the resulting dataset that records the basename of the output file. Defaults to “file_name”.

  • record_limit (int, optional) – Limit the number of records to extract from each file. Defaults to None.

Returns:

A dataset composed of the records extracted from the downloaded files.

Return type:

DocumentDataset
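
Putting it together, a run might look like the following sketch. It reuses the hypothetical PlainTextDownloader, LineIterator, and PassthroughExtractor subclasses sketched under Base Classes; the URLs, output paths, and the final to_json call for persisting the dataset are illustrative assumptions.

    from nemo_curator.download import download_and_extract

    urls = [
        "https://example.com/corpus/part-000.txt",
        "https://example.com/corpus/part-001.txt",
    ]
    # One extracted output path per URL; existing files are skipped unless
    # force_download=True.
    output_paths = [f"/data/extracted/part-{i:03d}.jsonl" for i in range(len(urls))]

    dataset = download_and_extract(
        urls=urls,
        output_paths=output_paths,
        downloader=PlainTextDownloader("/data/raw"),
        iterator=LineIterator(),
        extractor=PassthroughExtractor(),
        output_format={"text": str, "id": str},  # column name -> data type
        output_type="jsonl",
        record_limit=1000,  # optional cap on records per file
    )

    # The result is a DocumentDataset; persisting it is up to the caller.
    dataset.to_json("/data/final", write_to_filename=True)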

nemo_curator.download.import_downloader(downloader_path: str)#
nemo_curator.download.import_extractor(extractor_path: str)#
nemo_curator.download.import_iterator(iterator_path: str)#
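
These helpers load user-defined components from a string path so pipelines can be configured without hard-coded imports. The sketch below assumes the path is a dotted module path ending in the class name and that each helper returns the imported class; the module and class names are hypothetical.

    from nemo_curator.download import import_downloader, import_extractor, import_iterator

    # Hypothetical dotted paths pointing at the custom subclasses shown earlier
    downloader_cls = import_downloader("my_project.sources.PlainTextDownloader")
    iterator_cls = import_iterator("my_project.sources.LineIterator")
    extractor_cls = import_extractor("my_project.sources.PassthroughExtractor")

    downloader = downloader_cls(download_dir="/data/raw")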

Common Crawl#

nemo_curator.download.download_common_crawl(
output_path: str,
start_snapshot: str,
end_snapshot: str,
output_type: Literal['jsonl', 'parquet'] = 'jsonl',
algorithm: HTMLExtractorAlgorithm | None = None,
stop_lists: dict[str, frozenset[str]] | None = None,
news: bool = False,
aws: bool = False,
raw_download_dir: str | None = None,
keep_raw_download: bool = False,
force_download: bool = False,
url_limit: int | None = None,
record_limit: int | None = None,
)#

Downloads Common Crawl WARC snapshots and extracts text content using a specified extraction algorithm.

Parameters:
  • output_path (str) – The root directory used for managing download and extraction. Raw WARC files are stored in a “downloads” subdirectory under this path. This path is also checked for existing extraction results; if found, extraction can be skipped. Note that this function returns a DocumentDataset, and writing the extracted data to disk is the caller’s responsibility.

  • start_snapshot (str) – Identifier for the earliest snapshot to process. For CC-MAIN datasets, use the ‘YYYY-WeekNumber’ format (e.g., ‘2020-50’ or ‘2021-04’). For CC-NEWS datasets (when news=True), use the ‘YYYY-MM’ (Year-Month) format.

  • end_snapshot (str) – Identifier for the latest snapshot to process, which must be chronologically after start_snapshot.

  • output_type (Literal["jsonl", "parquet"]) – The file format for the extracted output. Must be either “jsonl” or “parquet”. This is not used for the output file, but is used to check if an extracted output already exists.

  • algorithm – The text extraction algorithm instance to use for HTML processing. This can be a JusTextExtractor (default), ResiliparseExtractor, or TrafilaturaExtractor object.

  • stop_lists – A dictionary of stop lists, where the keys are languages (e.g., “ENGLISH”) and the values are Python frozensets denoting the stop words for that language. If None, defaults to jusText’s stop lists (miso-belica/jusText), with added Thai, Chinese, and Japanese support.

  • news (bool) – When True, indicates that URLs should be retrieved from the CC-NEWS dataset. This also means snapshot identifiers should follow the ‘YYYY-MM’ format.

  • aws (bool) – If True, downloads are sourced from Common Crawl’s S3 bucket using s5cmd; if False, wget is used to fetch the files via HTTPS.

  • raw_download_dir – Optional; the directory used to temporarily store raw WARC files. If not provided, defaults to a “downloads” folder within output_path.

  • keep_raw_download (bool) – If True, retains the downloaded raw WARC files after extraction. If False, these raw files may be removed following extraction.

  • force_download (bool) – If False, skips re-downloading or re-extracting snapshots if outputs already exist in output_path.

  • url_limit – Optional; the maximum number of WARC files to download from the snapshot range. If None, all available files within the specified snapshots are downloaded.

  • record_limit – Optional; the maximum number of records to extract from each WARC file. If None, all available records are extracted.
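
A typical invocation, sketched with illustrative paths and limits (the trailing to_json call assumes the DocumentDataset writer, since persisting the result is left to the caller):

    from nemo_curator.download import download_common_crawl

    common_crawl = download_common_crawl(
        output_path="/data/common_crawl",
        start_snapshot="2020-50",
        end_snapshot="2021-04",
        output_type="jsonl",
        url_limit=10,       # keep a first run small
        record_limit=5000,  # cap records extracted per WARC file
    )

    # Writing the extracted records to disk is the caller's responsibility.
    common_crawl.to_json("/data/common_crawl/extracted", write_to_filename=True)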

class nemo_curator.download.CommonCrawlWARCDownloader(
download_dir: str,
aws: bool = False,
verbose: bool = False,
)#

Downloads WARC files from Common Crawl.

class nemo_curator.download.CommonCrawlWARCExtractor(
algorithm: HTMLExtractorAlgorithm | None = None,
stop_lists: dict[str, frozenset[str]] | None = None,
)#
class nemo_curator.download.CommonCrawlWARCIterator(log_frequency: int = 1000)#
class nemo_curator.download.CommonCrawlWARCDownloaderExtractOnly(
aws: bool = False,
verbose: bool = False,
)#

A ‘dummy’ downloader that simply puts pre-downloaded files on the queue.

class nemo_curator.download.JusTextExtractor(
length_low: int = 70,
length_high: int = 200,
stopwords_low: float = 0.3,
stopwords_high: float = 0.32,
max_link_density: float = 0.2,
max_heading_distance: int = 200,
no_headings: bool = False,
is_boilerplate: bool | None = None,
logger: Logger | None = None,
)#
class nemo_curator.download.ResiliparseExtractor(
required_stopword_density: float = 0.32,
main_content: bool = True,
alt_texts: bool = False,
)#
class nemo_curator.download.TrafilaturaExtractor(
required_stopword_density: float = 0.32,
min_extracted_size: int = 250,
min_extracted_comm_size: int = 1,
min_output_size: int = 1,
min_output_comm_size: int = 1,
max_tree_size: int | None = None,
min_duplcheck_size: int = 100,
max_repetitions: int = 2,
**extract_kwargs,
)#
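
Any of these extractor objects can be passed as the algorithm argument of download_common_crawl to switch the HTML extraction backend. In the sketch below the constructor arguments simply restate the defaults shown above and are not tuned recommendations; paths and snapshots are illustrative.

    from nemo_curator.download import ResiliparseExtractor, download_common_crawl

    algorithm = ResiliparseExtractor(
        required_stopword_density=0.32,  # stop-word density threshold used for boilerplate filtering
        main_content=True,               # restrict extraction to the main content region
        alt_texts=False,                 # do not include image alt texts
    )

    dataset = download_common_crawl(
        output_path="/data/common_crawl",
        start_snapshot="2021-04",
        end_snapshot="2021-10",
        algorithm=algorithm,
    )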

Wikipedia#

nemo_curator.download.download_wikipedia(
output_path: str,
language: str = 'en',
dump_date: str | None = None,
output_type: Literal['jsonl', 'parquet'] = 'jsonl',
raw_download_dir: str | None = None,
keep_raw_download: bool = False,
force_download: bool = False,
url_limit: int | None = None,
record_limit: int | None = None,
)#

Downloads and extracts articles from a Wikipedia dump.

This function retrieves a list of Wikipedia dump URLs for the specified language and dump date, downloads the compressed bz2 dump file (if it is not already present), and extracts its articles using mwparserfromhell. The resulting articles are saved in the specified output format (e.g., “jsonl”) along with relevant metadata.

Parameters:
  • output_path (str) – The root directory where the final extracted files and intermediate outputs (if any) are stored.

  • language (str, optional) – The language code for the Wikipedia dump to download. Default is “en”.

  • dump_date (Optional[str], optional) – The dump date in “YYYYMMDD” format. If None, the latest available dump is downloaded.

  • output_type (Literal["jsonl", "parquet"], optional) – The file format/extension for saving the extracted documents (e.g., “jsonl”). Defaults to “jsonl”. This is not used for the output file, but is used to check if an extracted output already exists and read it if so.

  • raw_download_dir (Optional[str], optional) – Directory used for temporary storage of raw bz2 dump files. If None, a subdirectory named “downloads” under output_path is used.

  • keep_raw_download (bool, optional) – If True, retains the raw bz2 files after extraction. Default is False.

  • force_download (bool, optional) – If False, skips re-downloading or re-extracting files that already exist.

  • url_limit (Optional[int], optional) – The maximum number of dump file URLs to process. If None, all available URLs are processed.

  • record_limit (Optional[int], optional) – Limit the number of records to extract from each file. If None, all available records are extracted.

Returns:

A dataset object containing the extracted Wikipedia articles along with associated metadata.

Return type:

DocumentDataset
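
A hedged sketch of a typical call (the dump date, paths, limits, and the to_json call are illustrative assumptions):

    from nemo_curator.download import download_wikipedia

    wikipedia = download_wikipedia(
        output_path="/data/wikipedia",
        language="en",
        dump_date="20240401",  # illustrative "YYYYMMDD" date; omit to use the latest dump
        url_limit=5,           # only process a few dump files while testing
    )

    wikipedia.to_json("/data/wikipedia/extracted", write_to_filename=True)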

class nemo_curator.download.WikipediaDownloader(download_dir: str, verbose: bool = False)#
class nemo_curator.download.WikipediaIterator(language: str = 'en', log_frequency: int = 1000)#
class nemo_curator.download.WikipediaExtractor(language: str = 'en', parser=mwparserfromhell)#

arXiv#

nemo_curator.download.download_arxiv(
output_path: str,
output_type: Literal['jsonl', 'parquet'] = 'jsonl',
raw_download_dir: str | None = None,
keep_raw_download: bool = False,
force_download: bool = False,
url_limit: int | None = None,
record_limit: int | None = None,
)#

Download arXiv tar files and extract the contained LaTeX projects.

This function obtains a list of arXiv tar file URLs (via get_arxiv_urls), downloads the tar files, and then extracts the contained LaTeX source files. The resulting documents (after extraction) are assembled into a DocumentDataset.

Parameters:
  • output_path (str) – The root directory where both the final extracted files and the raw download subdirectory will be stored. The extracted files (in the format specified by output_type) are eventually saved in this directory.

  • output_type (Literal["jsonl", "parquet"], optional) – The file format/extension used for saving the extracted documents (e.g., “jsonl” or “parquet”). Default is “jsonl”. This is not used for the output file, but is used to check if an extracted output already exists and read it if so.

  • raw_download_dir (Optional[str], optional) – The directory where the raw downloaded tar files will be kept. If None, a folder named “downloads” under output_path is used.

  • keep_raw_download (bool, optional) – If True, the raw tar files (before extraction) are not removed after processing. Default is False.

  • force_download (bool, optional) – If False, then if an output file already exists for a given URL, re-downloading and re-extraction will be skipped. Default is False.

  • url_limit (Optional[int], optional) – Limits the maximum number of arXiv tar file URLs to download and process. If None, all available URLs (from get_arxiv_urls) are processed.

  • record_limit (Optional[int], optional) – Limits the maximum number of records to extract from each tar file. If None, all available records are extracted.

Returns:

A dataset object containing the extracted documents.

Return type:

DocumentDataset
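
A hedged sketch of a typical call (paths, limits, and the to_json call are illustrative assumptions):

    from nemo_curator.download import download_arxiv

    arxiv = download_arxiv(
        output_path="/data/arxiv",
        output_type="jsonl",
        url_limit=2,  # arXiv source tar files are large; limit URLs while experimenting
    )

    arxiv.to_json("/data/arxiv/extracted", write_to_filename=True)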

class nemo_curator.download.ArxivDownloader(download_dir: str, verbose: bool = False)#
class nemo_curator.download.ArxivIterator(log_frequency: int = 1000)#
class nemo_curator.download.ArxivExtractor#