stages.text.download.base.download

Module Contents

Classes

DocumentDownloadStage

Stage that downloads files from URLs to local storage.

DocumentDownloader

Abstract base class for document downloaders.

API

class stages.text.download.base.download.DocumentDownloadStage

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.FileGroupTask]

Stage that downloads files from URLs to local storage.

Takes a FileGroupTask with URLs and returns a FileGroupTask with local file paths. This allows the download step to scale independently from iteration/extraction.

downloader: stages.text.download.base.download.DocumentDownloader

None

inputs() -> tuple[list[str], list[str]]

Define input requirements - expects FileGroupTask with URLs.

outputs() -> tuple[list[str], list[str]]

Define output - produces FileGroupTask with local paths.

process(
task: nemo_curator.tasks.FileGroupTask,
) -> nemo_curator.tasks.FileGroupTask

Download URLs to local files.

Args: task (FileGroupTask): Task containing URLs to download

Returns: FileGroupTask: Task containing local file paths
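The URL-to-path mapping described for process() can be sketched roughly as follows. This is an illustration, not the library's implementation: `UrlGroup` is a hypothetical stand-in for FileGroupTask, `download_group` is a hypothetical helper, and dropping URLs whose download returns None is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class UrlGroup:
    """Hypothetical stand-in for FileGroupTask: an id plus a list of strings."""

    task_id: str
    data: List[str] = field(default_factory=list)


def download_group(
    task: UrlGroup,
    download: Callable[[str], Optional[str]],
) -> UrlGroup:
    """Map each URL in the task to a local path using the given download callable."""
    local_paths = []
    for url in task.data:
        # Same contract as DocumentDownloader.download: a path, or None on failure.
        path = download(url)
        if path is not None:
            local_paths.append(path)
    # Assumption for this sketch: failed URLs are dropped instead of raising.
    return UrlGroup(task_id=task.task_id, data=local_paths)
```

Because the function only maps strings to strings, a step like this can be scheduled separately from the iteration and extraction steps that later read the local files.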

xenna_stage_spec() -> dict[str, Any]

Get Xenna configuration for this stage.

Returns (dict[str, Any]): Dictionary containing Xenna-specific configuration

class stages.text.download.base.download.DocumentDownloader(download_dir: str, verbose: bool = False)

Bases: abc.ABC

Abstract base class for document downloaders.

Initialization

Initialize the downloader.

Args:

download_dir: Directory to store downloaded files

verbose: If True, logs detailed download information

download(url: str) -> str | None

Download a document from URL with temporary file handling.

Downloads the file to a temporary location, then atomically moves it to the final path. Checks for an existing file to avoid re-downloading. Supports resumable downloads.

Args: url: URL to download

Returns: Path to downloaded file, or None if download failed
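The temporary-file pattern described for download() can be sketched as below. This is a minimal illustration under stated assumptions, not the class's actual code: `download_atomically` and the `fetch` callable are hypothetical, the output filename is assumed to be the last URL path segment, and resumable downloads are not modeled.

```python
import os
import tempfile
from typing import Callable, Optional


def download_atomically(
    url: str,
    download_dir: str,
    fetch: Callable[[str, str], bool],
) -> Optional[str]:
    """Hypothetical helper: fetch into a temp file, then rename it into place."""
    os.makedirs(download_dir, exist_ok=True)
    # Assumed naming rule for this sketch: last URL path segment.
    filename = url.rstrip("/").split("/")[-1] or "index"
    final_path = os.path.join(download_dir, filename)

    # Skip the download if an earlier run already produced the file.
    if os.path.exists(final_path):
        return final_path

    # Keep the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=download_dir, suffix=".tmp")
    os.close(fd)
    try:
        if not fetch(url, tmp_path):
            return None
        # os.replace is atomic when source and destination share a filesystem.
        os.replace(tmp_path, final_path)
        return final_path
    finally:
        # Remove the temp file if the fetch failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Since the final path only appears after a successful rename, a reader of `download_dir` never sees a partially written file, and the existence check is safe to use as a "already downloaded" signal.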

num_workers_per_node() -> int | None

Number of workers per node for downloading. A limit is sometimes needed to avoid overloading the network.

Returns: Number of workers per node, or None if there is no limit and we can download as fast as possible
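One way a caller might interpret this return value is sketched below. The function `workers_for_node` and its `requested` parameter are hypothetical and only illustrate the "None means no limit" convention; how the library's scheduler actually applies the value is not specified here.

```python
from typing import Optional


def workers_for_node(requested: int, per_node_limit: Optional[int]) -> int:
    """Hypothetical helper: cap the requested worker count by an optional limit."""
    if per_node_limit is None:
        # No limit: use as many workers as were requested.
        return requested
    return min(requested, per_node_limit)
```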