download.doc_builder#
Module Contents#
Classes#

| Class | Description |
|---|---|
| DocumentDownloader | Abstract class for downloading remote data to disk |
| DocumentExtractor | Abstract class for extracting text from records read from disk |
| DocumentIterator | Abstract iterator class for reading in raw records that have been downloaded to disk |
Functions#

| Function | Description |
|---|---|
| batch_download | Downloads all the URLs using the downloader in parallel |
| download_and_extract | Download files from the given URLs, extract their records, and construct a DocumentDataset. |
API#
- class download.doc_builder.DocumentDownloader#
Bases: abc.ABC
Abstract class for downloading remote data to disk
Initialization
- abstractmethod download(url: str) → str#
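A concrete subclass supplies the `download` method, taking a URL and returning the path of the file it wrote. A minimal sketch, using a local stand-in for the abstract base (the stand-in class, `SimpleFileDownloader`, and its skip-if-present behavior are illustrative assumptions, not the library's own code):

```python
import abc
import os
import urllib.request


class DocumentDownloader(abc.ABC):
    """Stand-in mirroring the documented interface."""

    @abc.abstractmethod
    def download(self, url: str) -> str: ...


class SimpleFileDownloader(DocumentDownloader):
    """Hypothetical downloader that saves each URL into a target directory."""

    def __init__(self, download_dir: str):
        self._download_dir = download_dir

    def download(self, url: str) -> str:
        # Derive a local filename from the last URL segment and
        # skip the fetch if that file is already on disk.
        filename = url.rstrip("/").rsplit("/", 1)[-1] or "index"
        path = os.path.join(self._download_dir, filename)
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)
        return path
```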
- class download.doc_builder.DocumentExtractor#
Bases: abc.ABC
Abstract class for extracting text from records read from disk
Initialization
- abstractmethod extract(content: str) → dict[str, str]#
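A concrete extractor turns one raw record into a dict of column values. A minimal sketch, again with a local stand-in base class; the JSON-per-record format and the `text` column are assumptions for illustration:

```python
import abc
import json


class DocumentExtractor(abc.ABC):
    """Stand-in mirroring the documented interface."""

    @abc.abstractmethod
    def extract(self, content: str) -> dict[str, str]: ...


class JsonTextExtractor(DocumentExtractor):
    """Hypothetical extractor: each record is a JSON object with a 'text' field."""

    def extract(self, content: str) -> dict[str, str]:
        record = json.loads(content)
        # Keep only the column the downstream dataset needs.
        return {"text": record.get("text", "")}
```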
- class download.doc_builder.DocumentIterator#
Bases: abc.ABC
Abstract iterator class for reading in raw records that have been downloaded to disk
Initialization
- abstractmethod iterate(
- file_path: str,
)#
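A concrete iterator walks one downloaded file and yields its raw records. A minimal sketch with a stand-in base; treating each non-empty line as one record is an assumption for illustration:

```python
import abc
from collections.abc import Iterator


class DocumentIterator(abc.ABC):
    """Stand-in mirroring the documented interface."""

    @abc.abstractmethod
    def iterate(self, file_path: str) -> Iterator[str]: ...


class LineIterator(DocumentIterator):
    """Hypothetical iterator that yields one record per non-empty line."""

    def iterate(self, file_path: str) -> Iterator[str]:
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line
```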
- download.doc_builder.batch_download(
- urls: list[str],
- downloader: download.doc_builder.DocumentDownloader,
)#

Downloads all the URLs using the downloader in parallel
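Conceptually this maps `downloader.download` over the URLs concurrently and collects the resulting local paths. A rough sketch of that behavior (the library's actual implementation and return value may differ):

```python
from concurrent.futures import ThreadPoolExecutor


def batch_download_sketch(urls: list[str], downloader, max_workers: int = 4) -> list[str]:
    # Run downloader.download on every URL in a thread pool and
    # return the local paths in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(downloader.download, urls))
```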
- download.doc_builder.download_and_extract(
- urls: list[str],
- output_paths: list[str],
- downloader: download.doc_builder.DocumentDownloader,
- iterator: download.doc_builder.DocumentIterator,
- extractor: download.doc_builder.DocumentExtractor,
- output_format: dict,
- output_type: Literal['jsonl', 'parquet'] = 'jsonl',
- keep_raw_download: bool = False,
- force_download: bool = False,
- input_meta: str | dict | None = None,
- filename_col: str = 'file_name',
- record_limit: int | None = None,
)#
Download files from the given URLs, extract their records, and construct a DocumentDataset.
For each URL provided, this function downloads the corresponding file (unless an extracted output already exists and force_download is False), iterates over its records, extracts the desired content, and finally converts all records into a DocumentDataset.
Args:

- urls (List[str]): A list of URLs from which to download dataset files.
- output_paths (List[str]): A list of file paths where the extracted outputs should be found. If a file already exists at a given path and force_download is False, that partition is skipped.
- downloader (DocumentDownloader): The downloader instance responsible for fetching files from the specified URLs.
- iterator (DocumentIterator): The iterator instance used to traverse the downloaded file and yield records.
- extractor (DocumentExtractor): The extractor instance used to obtain the desired content from each record.
- output_format (dict): A dictionary mapping column names to the data types for the extracted records.
- output_type (Literal["jsonl", "parquet"], optional): The output file format/extension. Must be either "jsonl" or "parquet". Defaults to "jsonl". This parameter is only used to verify whether an extracted output already exists.
- keep_raw_download (bool, optional): If True, the raw downloaded files are retained after extraction. Defaults to False.
- force_download (bool, optional): If False and an output file already exists at a given path, the download and extraction for that file are skipped. Defaults to False.
- input_meta (Union[str, dict], optional): Optional metadata describing the input file's schema. Defaults to None.
- filename_col (str, optional): The name for the column in the resulting dataset that records the basename of the output file. Defaults to "file_name".
- record_limit (int, optional): Limit the number of records to extract from each file. Defaults to None.

Returns:

- DocumentDataset: A dataset composed of the records extracted from the downloaded files.
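The per-URL flow described above (skip an existing output unless force_download, download, iterate, extract, honor record_limit) can be sketched in plain Python; `process_one_url` is a hypothetical helper for illustration, not part of the module:

```python
import os


def process_one_url(url, output_path, downloader, iterator, extractor,
                    force_download=False, record_limit=None):
    # Skip the partition entirely when its extracted output already exists.
    if os.path.exists(output_path) and not force_download:
        return []
    raw_path = downloader.download(url)
    records = []
    for i, raw in enumerate(iterator.iterate(raw_path)):
        # Stop early once the optional record cap is reached.
        if record_limit is not None and i >= record_limit:
            break
        extracted = extractor.extract(raw)
        if extracted is not None:
            records.append(extracted)
    return records
```

In the real function, the records gathered across all URLs are then assembled into a single DocumentDataset.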
- download.doc_builder.import_downloader(
- downloader_path: str,
)#

- download.doc_builder.import_extractor(
- extractor_path: str,
)#

- download.doc_builder.import_iterator(
- iterator_path: str,
)#
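These helpers presumably resolve a dotted path string (e.g. `"my_pkg.my_module.MyDownloader"`) to the class it names; a generic sketch of that pattern, not the module's actual code:

```python
import importlib


def import_by_path(dotted_path: str):
    # Split "package.module.ClassName" into module path and attribute
    # name, import the module, and return the attribute.
    module_name, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```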