stages.text.download.base.url_generation

Module Contents

Classes

URLGenerationStage

Stage that generates URLs from minimal input parameters.

URLGenerator

Abstract base class for URL generators - generates URLs from minimal input.

API

class stages.text.download.base.url_generation.URLGenerationStage

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.FileGroupTask]

Stage that generates URLs from minimal input parameters.

This allows a pipeline to begin with URL generation instead of an existing set of input files, as a Common Crawl download pipeline does.

inputs() -> tuple[list[str], list[str]]

Define input requirements. This stage expects an empty task.

limit: int | None

None

outputs() -> tuple[list[str], list[str]]

Define outputs. This stage produces FileGroupTask objects containing URLs.

process(task: nemo_curator.tasks._EmptyTask) -> list[nemo_curator.tasks.FileGroupTask]

Generate URLs and create FileGroupTasks.

Args: task (_EmptyTask): Empty input task

Returns: list[FileGroupTask]: List of tasks containing URLs
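The flow described above (generate URLs, then wrap them in tasks) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `SketchFileGroupTask` and `sketch_process` are hypothetical stand-ins, the `task_id` and `data` field names are assumed, and the choices to treat `limit=None` as "no limit" and to emit one task per URL are guesses about the behavior.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchFileGroupTask:
    """Hypothetical stand-in for FileGroupTask; the field names are assumptions."""

    task_id: str
    data: list[str] = field(default_factory=list)


def sketch_process(urls: list[str], limit: Optional[int] = None) -> list[SketchFileGroupTask]:
    """Wrap generated URLs in tasks, optionally keeping only the first `limit`."""
    # Assumption: a limit of None means "keep every URL".
    if limit is not None:
        urls = urls[:limit]
    # Assumption: one task per URL, so later stages can process URLs independently.
    return [
        SketchFileGroupTask(task_id=f"url_{i}", data=[url])
        for i, url in enumerate(urls)
    ]


tasks = sketch_process(
    [
        "https://example.com/a.gz",
        "https://example.com/b.gz",
        "https://example.com/c.gz",
    ],
    limit=2,
)
```

In the real stage the URL list would come from `url_generator.generate_urls()` and the cap from the `limit` attribute; the sketch takes both as plain arguments so that it runs without the library installed.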

ray_stage_spec() -> dict[str, Any]

Get Ray configuration for this stage.

Note: This is only used for Ray Data, which is an experimental backend. The keys are defined in RayStageSpecKeys in backends/experimental/ray_data/utils.py.

Returns: dict[str, Any]: Dictionary containing Ray-specific configuration

url_generator: stages.text.download.base.url_generation.URLGenerator

None

class stages.text.download.base.url_generation.URLGenerator

Bases: abc.ABC

Abstract base class for URL generators - generates URLs from minimal input.

abstractmethod generate_urls() -> list[str]

Generate a list of URLs to download.
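As an illustration of the abstract interface, the sketch below shows what a concrete generator might look like. To stay runnable without the library, it defines a local stand-in `URLGenerator` that mirrors only the documented `generate_urls()` signature; `StaticURLGenerator`, its constructor parameters, and the example.com URLs are hypothetical.

```python
from abc import ABC, abstractmethod


class URLGenerator(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def generate_urls(self) -> list[str]:
        """Generate a list of URLs to download."""


class StaticURLGenerator(URLGenerator):
    """Hypothetical generator that joins a base URL with a fixed list of file names."""

    def __init__(self, base_url: str, file_names: list[str]) -> None:
        # Drop any trailing slash so the join below never produces "//".
        self.base_url = base_url.rstrip("/")
        self.file_names = file_names

    def generate_urls(self) -> list[str]:
        return [f"{self.base_url}/{name}" for name in self.file_names]


generator = StaticURLGenerator("https://example.com/data/", ["a.jsonl", "b.jsonl"])
urls = generator.generate_urls()
```

Because `generate_urls()` is abstract, instantiating the base class directly raises `TypeError`; only subclasses that implement the method can be constructed and handed to a stage as its `url_generator`.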