download.doc_builder#

Module Contents#

Classes#

DocumentDownloader

Abstract class for downloading remote data to disk

DocumentExtractor

Abstract class for extracting text from records read from disk

DocumentIterator

Abstract iterator class for reading in raw records that have been downloaded to disk

Functions#

batch_download

Downloads all the URLs using the downloader in parallel

download_and_extract

Download files from the given URLs, extract their records, and construct a DocumentDataset.

import_downloader

import_extractor

import_iterator

API#

class download.doc_builder.DocumentDownloader#

Bases: abc.ABC

Abstract class for downloading remote data to disk

Initialization

abstractmethod download(url: str) -> str#
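A concrete downloader only needs to implement `download`, which fetches one URL and returns the local file path. A minimal sketch follows, using a local stand-in for the abstract base class (since the package may not be importable here) and a hypothetical subclass that writes a placeholder file instead of performing a real network fetch:

```python
import abc
import os
import tempfile

class DocumentDownloader(abc.ABC):
    """Local stand-in mirroring the documented abstract interface."""
    @abc.abstractmethod
    def download(self, url: str) -> str:
        """Download the resource at `url` and return the local file path."""

class LocalCopyDownloader(DocumentDownloader):
    """Hypothetical subclass: 'downloads' by writing a placeholder file."""
    def __init__(self, download_dir: str):
        self._download_dir = download_dir

    def download(self, url: str) -> str:
        # Derive a local filename from the last URL segment.
        filename = url.rstrip("/").rsplit("/", 1)[-1]
        path = os.path.join(self._download_dir, filename)
        with open(path, "w") as f:
            f.write(f"contents fetched from {url}\n")
        return path

with tempfile.TemporaryDirectory() as d:
    downloader = LocalCopyDownloader(d)
    path = downloader.download("https://example.com/data/shard-00.warc")
    print(os.path.basename(path))  # shard-00.warc
```

A real subclass would replace the file write with an HTTP or S3 fetch; the contract is only that `download` returns a path on local disk.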
class download.doc_builder.DocumentExtractor#

Bases: abc.ABC

Abstract class for extracting text from records read from disk

Initialization

abstractmethod extract(content: str) -> dict[str, str]#
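An extractor turns one raw record into a dict of named text fields. Below is a minimal sketch, again with a local stand-in for the abstract base class and a hypothetical extractor that treats the first line of a record as its title:

```python
import abc

class DocumentExtractor(abc.ABC):
    """Local stand-in mirroring the documented abstract interface."""
    @abc.abstractmethod
    def extract(self, content: str) -> dict[str, str]:
        """Turn one raw record into a dict of named text fields."""

class TitleBodyExtractor(DocumentExtractor):
    """Hypothetical extractor: first line is the title, the rest is the body."""
    def extract(self, content: str) -> dict[str, str]:
        title, _, body = content.partition("\n")
        return {"title": title.strip(), "text": body.strip()}

record = TitleBodyExtractor().extract("A Headline\nSome body text.")
print(record)  # {'title': 'A Headline', 'text': 'Some body text.'}
```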
class download.doc_builder.DocumentIterator#

Bases: abc.ABC

Abstract iterator class for reading in raw records that have been downloaded to disk

Initialization

abstractmethod iterate(
file_path: str,
) -> collections.abc.Iterator[tuple[dict[str, str], str]]#
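An iterator walks a downloaded file and yields `(metadata, raw_content)` pairs, one per record. A sketch under the same local stand-in approach, with a hypothetical iterator for a JSON-lines file whose records carry a `raw` field (the field name is an assumption for illustration):

```python
import abc
import json
import tempfile
from collections.abc import Iterator

class DocumentIterator(abc.ABC):
    """Local stand-in mirroring the documented abstract interface."""
    @abc.abstractmethod
    def iterate(self, file_path: str) -> Iterator[tuple[dict[str, str], str]]:
        """Yield (metadata, raw_content) pairs from a downloaded file."""

class JsonlIterator(DocumentIterator):
    """Hypothetical iterator over a JSON-lines file: the metadata dict
    carries the line number, the raw content is the record's 'raw' field."""
    def iterate(self, file_path: str) -> Iterator[tuple[dict[str, str], str]]:
        with open(file_path) as f:
            for lineno, line in enumerate(f):
                record = json.loads(line)
                yield {"id": str(lineno)}, record["raw"]

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"raw": "first doc"}\n{"raw": "second doc"}\n')
    tmp_path = f.name

pairs = list(JsonlIterator().iterate(tmp_path))
print(pairs)  # [({'id': '0'}, 'first doc'), ({'id': '1'}, 'second doc')]
```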
download.doc_builder.batch_download(
urls: list[str],
downloader: download.doc_builder.DocumentDownloader,
) -> list[str]#

Downloads all the URLs using the downloader in parallel
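The signature above is a fan-out: every URL goes through `downloader.download`, and the resulting local paths come back in order. A simplified model of that behavior (not the library's actual implementation, which may manage parallelism differently):

```python
from concurrent.futures import ThreadPoolExecutor

def batch_download(urls: list[str], downloader) -> list[str]:
    """Simplified model of the documented behavior: download every URL
    through the downloader, in parallel, and return the local paths."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so paths[i] corresponds to urls[i].
        return list(pool.map(downloader.download, urls))

class EchoDownloader:
    """Hypothetical downloader: returns a fake local path instead of fetching."""
    def download(self, url: str) -> str:
        return "/downloads/" + url.rsplit("/", 1)[-1]

paths = batch_download(
    ["https://example.com/a.gz", "https://example.com/b.gz"],
    EchoDownloader(),
)
print(paths)  # ['/downloads/a.gz', '/downloads/b.gz']
```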

download.doc_builder.download_and_extract(
urls: list[str],
output_paths: list[str],
downloader: download.doc_builder.DocumentDownloader,
iterator: download.doc_builder.DocumentIterator,
extractor: download.doc_builder.DocumentExtractor,
output_format: dict,
output_type: Literal['jsonl', 'parquet'] = 'jsonl',
keep_raw_download: bool = False,
force_download: bool = False,
input_meta: str | dict | None = None,
filename_col: str = 'file_name',
record_limit: int | None = None,
) -> nemo_curator.datasets.DocumentDataset#

Download files from the given URLs, extract their records, and construct a DocumentDataset.

For each URL provided, this function downloads the corresponding file (unless an extracted output already exists and force_download is False), iterates over its records, extracts the desired content, and finally converts all records into a DocumentDataset.

Args:
    urls (List[str]): A list of URLs from which to download dataset files.
    output_paths (List[str]): A list of file paths where the extracted outputs should be found. If a file already exists at a given path and force_download is False, that partition is skipped.
    downloader (DocumentDownloader): The downloader instance responsible for fetching files from the specified URLs.
    iterator (DocumentIterator): The iterator instance used to traverse the downloaded file and yield records.
    extractor (DocumentExtractor): The extractor instance used to obtain the desired content from each record.
    output_format (dict): A dictionary mapping column names to the data types for the extracted records.
    output_type (Literal["jsonl", "parquet"], optional): The output file format/extension. Must be either "jsonl" or "parquet". Defaults to "jsonl". This parameter is only used to verify whether an extracted output already exists.
    keep_raw_download (bool, optional): If True, the raw downloaded files are retained after extraction. Defaults to False.
    force_download (bool, optional): If False and an output file already exists at a given path, the download and extraction for that file are skipped. Defaults to False.
    input_meta (Union[str, dict], optional): Optional metadata describing the input file's schema. Defaults to None.
    filename_col (str, optional): The name for the column in the resulting dataset that records the basename of the output file. Defaults to "file_name".
    record_limit (int, optional): Limit the number of records to extract from each file. Defaults to None.

Returns:
    DocumentDataset: A dataset composed of the records extracted from the downloaded files.
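The download → iterate → extract flow described above can be modeled end to end with small stand-in components following the documented interfaces. This is a simplified sketch, not the library's implementation: it collects plain dicts where the real function builds a DocumentDataset, and the component classes here are hypothetical:

```python
import json
import os
import tempfile

class FakeDownloader:
    """Hypothetical downloader: writes a one-record JSONL file locally."""
    def download(self, url: str) -> str:
        path = os.path.join(workdir, url.rsplit("/", 1)[-1])
        with open(path, "w") as f:
            f.write('{"raw": "hello world"}\n')
        return path

class JsonlIterator:
    """Hypothetical iterator: yields (metadata, raw_content) per line."""
    def iterate(self, file_path: str):
        with open(file_path) as f:
            for i, line in enumerate(f):
                yield {"id": str(i)}, json.loads(line)["raw"]

class UpperExtractor:
    """Hypothetical extractor: uppercases the raw content."""
    def extract(self, content: str) -> dict[str, str]:
        return {"text": content.upper()}

def download_and_extract_sketch(urls, downloader, iterator, extractor):
    """Simplified model of the pipeline: download each URL, iterate its
    records, extract each one, and collect rows (a stand-in for
    constructing a DocumentDataset)."""
    rows = []
    for url in urls:
        path = downloader.download(url)
        for meta, raw in iterator.iterate(path):
            rows.append({**meta, **extractor.extract(raw)})
    return rows

with tempfile.TemporaryDirectory() as workdir:
    rows = download_and_extract_sketch(
        ["https://example.com/part-0.jsonl"],
        FakeDownloader(), JsonlIterator(), UpperExtractor(),
    )
print(rows)  # [{'id': '0', 'text': 'HELLO WORLD'}]
```

The real function adds the caching behavior described above: when an extracted output already exists at the corresponding output path and force_download is False, that partition is skipped.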

download.doc_builder.import_downloader(
downloader_path: str,
) download.doc_builder.DocumentDownloader#
download.doc_builder.import_extractor(
extractor_path: str,
) download.doc_builder.DocumentExtractor#
download.doc_builder.import_iterator(
iterator_path: str,
) download.doc_builder.DocumentIterator#
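Each `import_*` helper takes a path string and returns the corresponding class. The exact path format is not documented here; an illustrative guess is a dotted `module.ClassName` path resolved with `importlib`, demonstrated below against a stdlib class (both the path format and the helper's internals are assumptions):

```python
import importlib

def import_class_by_path(path: str):
    """Illustrative sketch of what an import_* helper might do: split a
    dotted path into module and class name, import the module, and
    return the class object."""
    module_name, _, class_name = path.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

# Demonstrate with a stdlib class in place of a real downloader path.
cls = import_class_by_path("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```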