nemo_curator.stages.text.download.base.download

Module Contents

Classes

Name                   Description
DocumentDownloadStage  Stage that downloads files from URLs to local storage.
DocumentDownloader     Abstract base class for document downloaders.

API

class nemo_curator.stages.text.download.base.download.DocumentDownloadStage(
downloader: nemo_curator.stages.text.download.base.download.DocumentDownloader
)
Dataclass

Bases: ProcessingStage[FileGroupTask, FileGroupTask]

Stage that downloads files from URLs to local storage.

Takes a FileGroupTask with URLs and returns a FileGroupTask with local file paths. This allows the download step to scale independently from iteration/extraction.

downloader: DocumentDownloader

resources = Resources(cpus=0.5)

nemo_curator.stages.text.download.base.download.DocumentDownloadStage.__post_init__()

nemo_curator.stages.text.download.base.download.DocumentDownloadStage.inputs() -> tuple[list[str], list[str]]

Define input requirements - expects FileGroupTask with URLs.

nemo_curator.stages.text.download.base.download.DocumentDownloadStage.outputs() -> tuple[list[str], list[str]]

Define output - produces FileGroupTask with local paths.

nemo_curator.stages.text.download.base.download.DocumentDownloadStage.process(
task: nemo_curator.tasks.FileGroupTask
) -> nemo_curator.tasks.FileGroupTask

Download URLs to local files.

Parameters:

task
FileGroupTask

Task containing URLs to download

Returns: FileGroupTask

Task containing local file paths
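
The URL-to-path contract described above can be illustrated without the library. The following is a minimal sketch, not the stage's actual implementation: `FileGroupStandIn` and `download_group` are hypothetical names, the `data` and `task_id` fields are assumptions about what a task carries, and dropping failed downloads from the output is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FileGroupStandIn:
    """Hypothetical stand-in for a task that carries a list of strings."""

    task_id: str
    data: List[str] = field(default_factory=list)


def download_group(
    task: FileGroupStandIn,
    download: Callable[[str], Optional[str]],
) -> FileGroupStandIn:
    """Map each URL in the task to a local path using `download`."""
    local_paths = []
    for url in task.data:
        path = download(url)
        # Assumption: a None result marks a failed download and is skipped.
        if path is not None:
            local_paths.append(path)
    return FileGroupStandIn(task_id=task.task_id, data=local_paths)
```

Here `download` plays the role of a downloader's `download(url)` method, which returns a path or None. Keeping the input and output types identical is what lets a stage like this sit in front of iteration and extraction and scale on its own.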

nemo_curator.stages.text.download.base.download.DocumentDownloadStage.xenna_stage_spec() -> dict[str, typing.Any]

class nemo_curator.stages.text.download.base.download.DocumentDownloader(
download_dir: str,
verbose: bool = False
)
Abstract

Abstract base class for document downloaders.

nemo_curator.stages.text.download.base.download.DocumentDownloader._download_to_path(
url: str,
path: str
) -> tuple[bool, str | None]
abstract

Download URL to specified path.

Parameters:

url
str

URL to download

path
str

Local path to save file

Returns: tuple[bool, str | None]

Tuple of (success, error_message). If success is True, error_message should be None.

nemo_curator.stages.text.download.base.download.DocumentDownloader._get_output_filename(
url: str
) -> str
abstract

Generate output filename from URL.

Parameters:

url
str

URL to download

Returns: str

Output filename (without directory path)
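
To show how the two abstract methods fit together, here is a minimal sketch of a concrete downloader. It does not import the library: `DocumentDownloaderStandIn` is a hypothetical stand-in that mirrors only the signatures documented above, and `UrllibDownloader`, its filename scheme, and the use of `urllib.request` are all assumptions chosen for illustration.

```python
import shutil
import urllib.request
from abc import ABC, abstractmethod
from typing import Optional
from urllib.parse import urlparse


class DocumentDownloaderStandIn(ABC):
    """Hypothetical stand-in mirroring only the documented signatures."""

    def __init__(self, download_dir: str, verbose: bool = False):
        self._download_dir = download_dir
        self._verbose = verbose

    @abstractmethod
    def _get_output_filename(self, url: str) -> str: ...

    @abstractmethod
    def _download_to_path(self, url: str, path: str) -> tuple[bool, Optional[str]]: ...


class UrllibDownloader(DocumentDownloaderStandIn):
    def _get_output_filename(self, url: str) -> str:
        parsed = urlparse(url)
        # Flatten host and path into one name with no directory separators.
        name = (parsed.netloc + parsed.path).strip("/").replace("/", "-")
        return name or "index"

    def _download_to_path(self, url: str, path: str) -> tuple[bool, Optional[str]]:
        try:
            with urllib.request.urlopen(url) as response, open(path, "wb") as out:
                shutil.copyfileobj(response, out)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            return False, str(exc)
        # On success the error message is None, as the contract requires.
        return True, None
```

Returning `(False, message)` instead of raising keeps a single failed URL from aborting a whole batch, and leaves the decision about logging or retrying to the caller.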

nemo_curator.stages.text.download.base.download.DocumentDownloader.download(
url: str
) -> str | None

Download a document from URL with temporary file handling.

Downloads the file to a temporary location, then atomically moves it to the final path. Skips the download if the file already exists. Supports resumable downloads.

Parameters:

url
str

URL to download

Returns: str | None

Path to downloaded file, or None if download failed
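
The temporary-file pattern described for `download` can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `download_via_temp_file`, the `.tmp` suffix, and the `fetch` callable (standing in for `_download_to_path`) are hypothetical.

```python
import os
from typing import Callable, Optional, Tuple


def download_via_temp_file(
    url: str,
    download_dir: str,
    filename: str,
    fetch: Callable[[str, str], Tuple[bool, Optional[str]]],
) -> Optional[str]:
    """Fetch `url` into `download_dir/filename` via a temporary file."""
    os.makedirs(download_dir, exist_ok=True)
    final_path = os.path.join(download_dir, filename)
    temp_path = final_path + ".tmp"  # assumed suffix, for illustration only

    # A complete file from an earlier run means there is nothing to do.
    if os.path.exists(final_path):
        return final_path

    success, _error = fetch(url, temp_path)
    if not success:
        # The partial temp file is left in place so a fetcher could resume it.
        return None

    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(temp_path, final_path)
    return final_path
```

Because the final path only ever appears through the rename, a reader of the download directory never sees a half-written file, and the existence check is a safe way to skip completed work.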

nemo_curator.stages.text.download.base.download.DocumentDownloader.num_workers_per_node() -> int | None

Number of workers per node for downloading. Limiting this is sometimes necessary to avoid overloading the network.

Returns: int | None

Number of workers per node, or None if there is no limit and downloads can run as fast as possible