stages.text.download.base.download#
Module Contents#
Classes#
DocumentDownloadStage | Stage that downloads files from URLs to local storage.
DocumentDownloader | Abstract base class for document downloaders.
API#
- class stages.text.download.base.download.DocumentDownloadStage#
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.FileGroupTask]
Stage that downloads files from URLs to local storage.
Takes a FileGroupTask with URLs and returns a FileGroupTask with local file paths. This allows the download step to scale independently from iteration/extraction.
- downloader: stages.text.download.base.download.DocumentDownloader#
None
- inputs() tuple[list[str], list[str]]#
Define input requirements - expects FileGroupTask with URLs.
- outputs() tuple[list[str], list[str]]#
Define output - produces FileGroupTask with local paths.
- process(task: nemo_curator.tasks.FileGroupTask) nemo_curator.tasks.FileGroupTask#
Download URLs to local files.
Args: task (FileGroupTask): Task containing URLs to download
Returns: FileGroupTask: Task containing local file paths
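The URL-to-path mapping described above can be sketched without the library. This is a minimal illustration, not the stage's actual implementation: `FileGroupSketch`, `process_sketch`, and `fake_download` are hypothetical stand-ins, and dropping URLs whose download returns None is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FileGroupSketch:
    """Hypothetical stand-in for FileGroupTask: only carries a list of strings."""

    data: list[str] = field(default_factory=list)


def process_sketch(
    task: FileGroupSketch,
    download: Callable[[str], Optional[str]],
) -> FileGroupSketch:
    """Map a task holding URLs to a task holding local file paths."""
    local_paths = []
    for url in task.data:
        path = download(url)
        # Assumption: a None result means the download failed, so the URL is skipped.
        if path is not None:
            local_paths.append(path)
    return FileGroupSketch(data=local_paths)


def fake_download(url: str) -> Optional[str]:
    """Hypothetical downloader that fails for one URL and never touches the network."""
    if url.endswith("missing.txt"):
        return None
    return "/tmp/downloads/" + url.rsplit("/", 1)[-1]


urls = [
    "https://example.com/a.txt",
    "https://example.com/missing.txt",
    "https://example.com/b.txt",
]
result = process_sketch(FileGroupSketch(data=urls), fake_download)
```

Here `result.data` holds only the paths of the two downloads that succeeded.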
- xenna_stage_spec() dict[str, Any]#
Get Xenna configuration for this stage.
Returns (dict[str, Any]): Dictionary containing Xenna-specific configuration
- class stages.text.download.base.download.DocumentDownloader(download_dir: str, verbose: bool = False)#
Bases: abc.ABC
Abstract base class for document downloaders.
Initialization
Initialize the downloader.
Args:
download_dir: Directory to store downloaded files
verbose: If True, logs detailed download information
- download(url: str) str | None#
Download a document from URL with temporary file handling.
Downloads the file to a temporary location, then atomically moves it to the final path. Skips the download if the file already exists. Supports resumable downloads.
Args: url: URL to download
Returns: Path to downloaded file, or None if download failed
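The temporary-file pattern described above might look roughly like the following. This is a sketch under stated assumptions, not the class's real code: `download_atomically`, the injected `fetch` callable, the `.tmp` suffix, and the filename derived from the last URL segment are all hypothetical.

```python
import os
import tempfile
from typing import Callable, Optional


def download_atomically(
    url: str,
    download_dir: str,
    fetch: Callable[[str, str], bool],
) -> Optional[str]:
    """Fetch `url` into `download_dir` through a temporary file.

    `fetch(url, path)` is a hypothetical callable that writes the payload
    to `path` and returns True on success.
    """
    os.makedirs(download_dir, exist_ok=True)
    # Hypothetical naming scheme: use the last path segment of the URL.
    filename = url.rstrip("/").rsplit("/", 1)[-1]
    final_path = os.path.join(download_dir, filename)
    tmp_path = final_path + ".tmp"

    # Skip the download if an earlier run already produced the final file.
    if os.path.exists(final_path):
        return final_path

    try:
        ok = fetch(url, tmp_path)
    except OSError:
        ok = False
    if not ok:
        # The partial temporary file is left in place so that a
        # resume-capable fetch could continue from it later.
        return None

    # os.replace renames atomically on POSIX when both paths share a filesystem.
    os.replace(tmp_path, final_path)
    return final_path


# Usage with a fake fetch that writes a fixed payload instead of using the network.
calls = []


def fake_fetch(url: str, path: str) -> bool:
    calls.append(url)
    with open(path, "w") as f:
        f.write("data")
    return True


demo_dir = tempfile.mkdtemp()
first = download_atomically("https://example.com/a.txt", demo_dir, fake_fetch)
second = download_atomically("https://example.com/a.txt", demo_dir, fake_fetch)
failed = download_atomically("https://example.com/b.txt", demo_dir, lambda u, p: False)
```

Because the final path only appears after the rename, a reader of `download_dir` never sees a half-written file, and the second call returns early without invoking `fake_fetch` again.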
- num_workers_per_node() int | None#
Number of workers per node for downloading. Capping this is sometimes needed to avoid overloading the network.
Returns: Number of workers per node, or None if there is no limit and we can download as fast as possible
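One way a caller could interpret this return value is sketched below. `workers_for_node` and `available_slots` are hypothetical names chosen for illustration; how the real executor consumes the limit is not stated here.

```python
from typing import Optional


def workers_for_node(limit: Optional[int], available_slots: int) -> int:
    """Pick how many download workers to start on one node.

    `limit` plays the role of the value returned by num_workers_per_node();
    `available_slots` is a hypothetical count of workers the node could host.
    """
    if limit is None:
        # No limit: use every slot the node offers.
        return available_slots
    # A limit never raises the count above what the node can host.
    return min(limit, available_slots)
```

With eight available slots, a limit of None yields eight workers, while a limit of two yields two.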