download.doc_builder#

Module Contents#

Classes#

DocumentDownloader

Abstract class for downloading remote data to disk

DocumentExtractor

Abstract class for extracting text from records read from disk

DocumentIterator

Abstract iterator class for reading in raw records that have been downloaded to disk

Functions#

batch_download

Downloads all the URLs using the downloader in parallel

download_and_extract

Download files from the given URLs, extract their records, and construct a DocumentDataset.

import_downloader

import_extractor

import_iterator

API#

class download.doc_builder.DocumentDownloader#

Bases: abc.ABC

Abstract class for downloading remote data to disk

Initialization

abstractmethod download(url: str) -> str#
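A concrete downloader only needs to implement `download`, which fetches one URL and returns the local file path. A minimal sketch follows, using a local stand-in for the abstract base class (since the package may not be importable here) and a hypothetical subclass that writes a placeholder file instead of performing a real network fetch:

```python
import abc
import os
import tempfile

class DocumentDownloader(abc.ABC):
    """Local stand-in mirroring the documented abstract interface."""
    @abc.abstractmethod
    def download(self, url: str) -> str:
        """Download the resource at `url` and return the local file path."""

class LocalCopyDownloader(DocumentDownloader):
    """Hypothetical subclass: 'downloads' by writing a placeholder file."""
    def __init__(self, download_dir: str):
        self._download_dir = download_dir

    def download(self, url: str) -> str:
        # Derive a local filename from the last URL segment.
        filename = url.rstrip("/").rsplit("/", 1)[-1]
        path = os.path.join(self._download_dir, filename)
        with open(path, "w") as f:
            f.write(f"contents fetched from {url}\n")
        return path

with tempfile.TemporaryDirectory() as d:
    downloader = LocalCopyDownloader(d)
    path = downloader.download("https://example.com/data/shard-00.warc")
    print(os.path.basename(path))  # shard-00.warc
```

A real subclass would replace the file write with an HTTP or S3 fetch; the contract is only that `download` returns a path on local disk.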
class download.doc_builder.DocumentExtractor#

Bases: abc.ABC

Abstract class for extracting text from records read from disk

Initialization

abstractmethod extract(content: str) -> dict[str, str]#
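An extractor turns one raw record into a dict of named text fields. Below is a minimal sketch, again with a local stand-in for the abstract base class and a hypothetical extractor that treats the first line of a record as its title:

```python
import abc

class DocumentExtractor(abc.ABC):
    """Local stand-in mirroring the documented abstract interface."""
    @abc.abstractmethod
    def extract(self, content: str) -> dict[str, str]:
        """Turn one raw record into a dict of named text fields."""

class TitleBodyExtractor(DocumentExtractor):
    """Hypothetical extractor: first line is the title, the rest is the body."""
    def extract(self, content: str) -> dict[str, str]:
        title, _, body = content.partition("\n")
        return {"title": title.strip(), "text": body.strip()}

record = TitleBodyExtractor().extract("A Headline\nSome body text.")
print(record)  # {'title': 'A Headline', 'text': 'Some body text.'}
```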
class download.doc_builder.DocumentIterator#

Bases: abc.ABC

Abstract iterator class for reading in raw records that have been downloaded to disk

Initialization

abstractmethod iterate(
file_path: str,
) -> collections.abc.Iterator[tuple[dict[str, str], str]]#
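An iterator walks a downloaded file and yields `(metadata, raw_content)` pairs, one per record. A sketch under the same local stand-in approach, with a hypothetical iterator for a JSON-lines file whose records carry a `raw` field (the field name is an assumption for illustration):

```python
import abc
import json
import tempfile
from collections.abc import Iterator

class DocumentIterator(abc.ABC):
    """Local stand-in mirroring the documented abstract interface."""
    @abc.abstractmethod
    def iterate(self, file_path: str) -> Iterator[tuple[dict[str, str], str]]:
        """Yield (metadata, raw_content) pairs from a downloaded file."""

class JsonlIterator(DocumentIterator):
    """Hypothetical iterator over a JSON-lines file: the metadata dict
    carries the line number, the raw content is the record's 'raw' field."""
    def iterate(self, file_path: str) -> Iterator[tuple[dict[str, str], str]]:
        with open(file_path) as f:
            for lineno, line in enumerate(f):
                record = json.loads(line)
                yield {"id": str(lineno)}, record["raw"]

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"raw": "first doc"}\n{"raw": "second doc"}\n')
    tmp_path = f.name

pairs = list(JsonlIterator().iterate(tmp_path))
print(pairs)  # [({'id': '0'}, 'first doc'), ({'id': '1'}, 'second doc')]
```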
download.doc_builder.batch_download(
urls: list[str],
downloader: download.doc_builder.DocumentDownloader,
) -> list[str]#

Downloads all the URLs using the downloader in parallel
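The signature above is a fan-out: every URL goes through `downloader.download`, and the resulting local paths come back in order. A simplified model of that behavior (not the library's actual implementation, which may manage parallelism differently):

```python
from concurrent.futures import ThreadPoolExecutor

def batch_download(urls: list[str], downloader) -> list[str]:
    """Simplified model of the documented behavior: download every URL
    through the downloader, in parallel, and return the local paths."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so paths[i] corresponds to urls[i].
        return list(pool.map(downloader.download, urls))

class EchoDownloader:
    """Hypothetical downloader: returns a fake local path instead of fetching."""
    def download(self, url: str) -> str:
        return "/downloads/" + url.rsplit("/", 1)[-1]

paths = batch_download(
    ["https://example.com/a.gz", "https://example.com/b.gz"],
    EchoDownloader(),
)
print(paths)  # ['/downloads/a.gz', '/downloads/b.gz']
```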

download.doc_builder.download_and_extract(
urls: list[str],
output_paths: list[str],
downloader: download.doc_builder.DocumentDownloader,
iterator: download.doc_builder.DocumentIterator,
extractor: download.doc_builder.DocumentExtractor,
output_format: dict,
output_type: Literal['jsonl', 'parquet'] = 'jsonl',
keep_raw_download: bool = False,
force_download: bool = False,
input_meta: str | dict | None = None,
filename_col: str = 'file_name',
record_limit: int | None = None,
) -> nemo_curator.datasets.DocumentDataset#

Download files from the given URLs, extract their records, and construct a DocumentDataset.

For each URL provided, this function downloads the corresponding file (unless an extracted output already exists and force_download is False), iterates over its records, extracts the desired content, and finally converts all records into a DocumentDataset.

Args:
    urls (List[str]): A list of URLs from which to download dataset files.
    output_paths (List[str]): A list of file paths where the extracted outputs should be found. If a file already exists at a given path and force_download is False, that partition is skipped.
    downloader (DocumentDownloader): The downloader instance responsible for fetching files from the specified URLs.
    iterator (DocumentIterator): The iterator instance used to traverse the downloaded file and yield records.
    extractor (DocumentExtractor): The extractor instance used to obtain the desired content from each record.
    output_format (dict): A dictionary mapping column names to the data types for the extracted records.
    output_type (Literal["jsonl", "parquet"], optional): The output file format/extension. Must be either "jsonl" or "parquet". Defaults to "jsonl". This parameter is only used to verify whether an extracted output already exists.
    keep_raw_download (bool, optional): If True, the raw downloaded files are retained after extraction. Defaults to False.
    force_download (bool, optional): If False and an output file already exists at a given path, the download and extraction for that file are skipped. Defaults to False.
    input_meta (Union[str, dict], optional): Optional metadata describing the input file's schema. Defaults to None.
    filename_col (str, optional): The name for the column in the resulting dataset that records the basename of the output file. Defaults to "file_name".
    record_limit (int, optional): Limit the number of records to extract from each file. Defaults to None.

Returns:
    DocumentDataset: A dataset composed of the records extracted from the downloaded files.
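The download → iterate → extract flow described above can be modeled end to end with small stand-in components following the documented interfaces. This is a simplified sketch, not the library's implementation: it collects plain dicts where the real function builds a DocumentDataset, and the component classes here are hypothetical:

```python
import json
import os
import tempfile

class FakeDownloader:
    """Hypothetical downloader: writes a one-record JSONL file locally."""
    def download(self, url: str) -> str:
        path = os.path.join(workdir, url.rsplit("/", 1)[-1])
        with open(path, "w") as f:
            f.write('{"raw": "hello world"}\n')
        return path

class JsonlIterator:
    """Hypothetical iterator: yields (metadata, raw_content) per line."""
    def iterate(self, file_path: str):
        with open(file_path) as f:
            for i, line in enumerate(f):
                yield {"id": str(i)}, json.loads(line)["raw"]

class UpperExtractor:
    """Hypothetical extractor: uppercases the raw content."""
    def extract(self, content: str) -> dict[str, str]:
        return {"text": content.upper()}

def download_and_extract_sketch(urls, downloader, iterator, extractor):
    """Simplified model of the pipeline: download each URL, iterate its
    records, extract each one, and collect rows (a stand-in for
    constructing a DocumentDataset)."""
    rows = []
    for url in urls:
        path = downloader.download(url)
        for meta, raw in iterator.iterate(path):
            rows.append({**meta, **extractor.extract(raw)})
    return rows

with tempfile.TemporaryDirectory() as workdir:
    rows = download_and_extract_sketch(
        ["https://example.com/part-0.jsonl"],
        FakeDownloader(), JsonlIterator(), UpperExtractor(),
    )
print(rows)  # [{'id': '0', 'text': 'HELLO WORLD'}]
```

The real function adds the caching behavior described above: when an extracted output already exists at the corresponding output path and force_download is False, that partition is skipped.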

download.doc_builder.import_downloader(
downloader_path: str,
) download.doc_builder.DocumentDownloader#
download.doc_builder.import_extractor(
extractor_path: str,
) download.doc_builder.DocumentExtractor#
download.doc_builder.import_iterator(
iterator_path: str,
) download.doc_builder.DocumentIterator#
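Each `import_*` helper takes a path string and returns the corresponding class. The exact path format is not documented here; an illustrative guess is a dotted `module.ClassName` path resolved with `importlib`, demonstrated below against a stdlib class (both the path format and the helper's internals are assumptions):

```python
import importlib

def import_class_by_path(path: str):
    """Illustrative sketch of what an import_* helper might do: split a
    dotted path into module and class name, import the module, and
    return the class object."""
    module_name, _, class_name = path.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

# Demonstrate with a stdlib class in place of a real downloader path.
cls = import_class_by_path("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```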