stages.text.download.base.download
Module Contents#
Classes#
| DocumentDownloadStage | Stage that downloads files from URLs to local storage. |
| DocumentDownloader | Abstract base class for document downloaders. |
API#
- class stages.text.download.base.download.DocumentDownloadStage#
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.FileGroupTask]
Stage that downloads files from URLs to local storage.
Takes a FileGroupTask with URLs and returns a FileGroupTask with local file paths. This allows the download step to scale independently from iteration/extraction.
- downloader: stages.text.download.base.download.DocumentDownloader#
None
- inputs() tuple[list[str], list[str]] #
Define input requirements - expects FileGroupTask with URLs.
- outputs() tuple[list[str], list[str]] #
Define output - produces FileGroupTask with local paths.
- process(task: nemo_curator.tasks.FileGroupTask) #
Download URLs to local files.
Args: task (FileGroupTask): Task containing URLs to download
Returns: FileGroupTask: Task containing local file paths
- xenna_stage_spec() dict[str, Any] #
Get Xenna configuration for this stage.
Returns (dict[str, Any]): Dictionary containing Xenna-specific configuration
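The URL-to-local-path mapping described for `process` can be illustrated with a minimal sketch. It does not use the real NeMo Curator classes: `FileGroupSketch` and `FakeDownloader` are hypothetical stand-ins, and skipping URLs whose download returns `None` is an assumption, not documented behavior of `DocumentDownloadStage`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FileGroupSketch:
    """Hypothetical stand-in for FileGroupTask: a task id plus a list of strings."""

    task_id: str
    data: list[str] = field(default_factory=list)


class FakeDownloader:
    """Hypothetical downloader: returns a local path, or None on failure."""

    def __init__(self, download_dir: str) -> None:
        self.download_dir = download_dir

    def download(self, url: str) -> str | None:
        if not url.startswith("https://"):
            return None  # simulate a failed download
        return f"{self.download_dir}/{url.rsplit('/', 1)[-1]}"


def process(task: FileGroupSketch, downloader: FakeDownloader) -> FileGroupSketch:
    """Map each URL to a local path; failed downloads are skipped (an assumption)."""
    local_paths = []
    for url in task.data:
        path = downloader.download(url)
        if path is not None:
            local_paths.append(path)
    return FileGroupSketch(task_id=task.task_id, data=local_paths)


task = FileGroupSketch("demo", ["https://example.com/a.warc.gz", "ftp://example.com/b.txt"])
result = process(task, FakeDownloader("/tmp/downloads"))
print(result.data)  # ['/tmp/downloads/a.warc.gz']
```

Because the output task has the same shape as the input task, a stage like this can be scheduled separately from the later iteration and extraction steps.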
- class stages.text.download.base.download.DocumentDownloader(download_dir: str, verbose: bool = False)#
Bases: abc.ABC
Abstract base class for document downloaders.
Initialization
Initialize the downloader.
Args:
download_dir: Directory to store downloaded files.
verbose: If True, logs detailed download information.
- download(url: str) str | None #
Download a document from URL with temporary file handling.
Downloads the file to a temporary location, then atomically moves it to the final path. Checks for an existing file to avoid re-downloading. Supports resumable downloads.
Args: url: URL to download
Returns: Path to downloaded file, or None if download failed
- num_workers_per_node() int | None #
Number of workers per node for downloading. Limiting this is sometimes necessary to avoid overloading the network.
Returns: Number of workers per node, or None if there is no limit and we can download as fast as possible
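The temporary-file pattern described for `download` can be sketched roughly as follows. This is not the library's implementation: `download_atomically` and the `fetch` callable are hypothetical names introduced for illustration, the `.tmp` suffix is an assumption, and resumption of partial downloads is not implemented here.

```python
from __future__ import annotations

import os
import tempfile
from typing import Callable


def download_atomically(
    url: str,
    download_dir: str,
    fetch: Callable[[str, str], bool],
) -> str | None:
    """Hypothetical helper: fetch `url` into `download_dir` via a temp file."""
    os.makedirs(download_dir, exist_ok=True)
    final_path = os.path.join(download_dir, url.rsplit("/", 1)[-1])

    # Skip the network entirely if a non-empty file is already in place.
    if os.path.exists(final_path) and os.path.getsize(final_path) > 0:
        return final_path

    tmp_path = final_path + ".tmp"
    if not fetch(url, tmp_path):
        # A leftover .tmp file is where a resumable fetch could pick up;
        # this sketch does not implement resumption.
        return None

    # os.replace is atomic when source and destination share a filesystem,
    # so readers never observe a half-written final file.
    os.replace(tmp_path, final_path)
    return final_path


calls = []


def fake_fetch(url: str, path: str) -> bool:
    """Stand-in for a real network fetch: writes a fixed payload."""
    calls.append(url)
    with open(path, "wb") as f:
        f.write(b"payload")
    return True


with tempfile.TemporaryDirectory() as d:
    first = download_atomically("https://example.com/doc.txt", d, fake_fetch)
    second = download_atomically("https://example.com/doc.txt", d, fake_fetch)
    print(first == second, len(calls))  # True 1
```

Writing to a temporary path first means an interrupted download never leaves a truncated file at the final path, which is what makes the existing-file check safe to rely on.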