nemo_curator.stages.text.download.base.download
Module Contents
Classes
API
Bases: ProcessingStage[FileGroupTask, FileGroupTask]
Stage that downloads files from URLs to local storage.
Takes a FileGroupTask with URLs and returns a FileGroupTask with local file paths. This allows the download step to scale independently from iteration/extraction.
Define input requirements - expects FileGroupTask with URLs.
Define output - produces FileGroupTask with local paths.
Download URLs to local files.
Parameters:
Task containing URLs to download
Returns: FileGroupTask
Task containing local file paths
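The URL-to-path mapping described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `FileGroup` is a hypothetical stand-in for `FileGroupTask`, the `download` callable stands in for a downloader's download method, and dropping failed URLs from the output is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FileGroup:
    """Hypothetical stand-in for FileGroupTask: an id plus a list of URLs or paths."""

    task_id: str
    data: list[str] = field(default_factory=list)


def process(task: FileGroup, download: Callable[[str], str | None]) -> FileGroup:
    """Map each URL in the input task to a local path using `download`."""
    local_paths = []
    for url in task.data:
        path = download(url)
        # Assumption: a failed download (None) is skipped rather than raised.
        if path is not None:
            local_paths.append(path)
    # Return a new task so the input task is left unmodified.
    return FileGroup(task_id=task.task_id, data=local_paths)
```

Because the stage only turns URLs into paths, later iteration and extraction steps can consume the returned task without knowing how the files were fetched.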
Abstract base class for document downloaders.
Download URL to specified path.
Parameters:
URL to download
Local path to save file
Returns: tuple[bool, str | None]
Tuple of (success, error_message). If success is True, error_message should be None.
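A minimal sketch of a function that follows this (success, error_message) contract, using only the standard library. The name `download_to_path`, the 30-second timeout, and the choice of `urllib` are assumptions made for illustration; a concrete downloader may use a different transport entirely.

```python
from __future__ import annotations

import shutil
import urllib.request


def download_to_path(url: str, path: str) -> tuple[bool, str | None]:
    """Hypothetical downloader: fetch `url` into `path` and report the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response, open(path, "wb") as out:
            # Stream the body to disk instead of loading it into memory.
            shutil.copyfileobj(response, out)
    except (OSError, ValueError) as exc:
        # URLError is a subclass of OSError; malformed URLs raise ValueError.
        # Failures are reported through the return value, not by raising.
        return False, str(exc)
    return True, None
```

Returning the error as a string lets the caller decide whether to log, retry, or skip the URL.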
Generate output filename from URL.
Parameters:
URL to download
Returns: str
Output filename (without directory path)
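One possible way to derive such a filename, shown as a sketch rather than the library's scheme: take the last component of the URL path and fall back to a short digest of the URL when the path has no usable name. The function name `get_output_filename` and the fallback rule are assumptions.

```python
import hashlib
import posixpath
from urllib.parse import unquote, urlparse


def get_output_filename(url: str) -> str:
    """Hypothetical helper: derive a bare filename (no directories) from a URL."""
    # Query strings and fragments are ignored; only the path is used.
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path.rstrip("/"))
    if not name:
        # No path component to use, so fall back to a digest of the full URL.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return name
```

Using only the last path component means two URLs with the same basename in different directories would collide; a concrete downloader may prefer to encode the whole path into the filename.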
Download a document from URL with temporary file handling.
Downloads file to temporary location then atomically moves to final path. Checks for existing file to avoid re-downloading. Supports resumable downloads.
Parameters:
URL to download
Returns: str | None
Path to downloaded file, or None if download failed
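The temporary-file pattern described above can be sketched as below. This is an illustration under stated assumptions, not the library's code: the function signature, the `.tmp` suffix, the non-empty-file check, and the injected `fetch` callable (which returns a (success, error_message) tuple) are all hypothetical.

```python
from __future__ import annotations

import os
from typing import Callable


def download(
    url: str,
    download_dir: str,
    filename: str,
    fetch: Callable[[str, str], tuple[bool, str | None]],
) -> str | None:
    """Hypothetical sketch: fetch to a temporary path, then move into place."""
    os.makedirs(download_dir, exist_ok=True)
    final_path = os.path.join(download_dir, filename)
    tmp_path = final_path + ".tmp"

    # Skip the download when an earlier run already produced a non-empty file.
    if os.path.exists(final_path) and os.path.getsize(final_path) > 0:
        return final_path

    success, error = fetch(url, tmp_path)
    if not success:
        print(f"Failed to download {url}: {error}")
        return None

    # os.replace is atomic when both paths are on the same filesystem, so
    # readers never observe a partially written file at final_path.
    os.replace(tmp_path, final_path)
    return final_path
```

In this sketch a failed attempt leaves any partial `.tmp` file in place, which would let a `fetch` implementation that supports resuming continue from it; whether a given downloader does so is up to that implementation.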
Number of workers per node for downloading. Limiting this is sometimes needed to avoid overloading the network.
Returns: int | None
Number of workers per node, or None if there is no limit and we can download as fast as possible