stages.text.download.base.url_generation#
Module Contents#
Classes#
- URLGenerationStage: Stage that generates URLs from minimal input parameters.
- URLGenerator: Abstract base class for URL generators; generates URLs from minimal input.
API#
- class stages.text.download.base.url_generation.URLGenerationStage#
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.FileGroupTask]
Stage that generates URLs from minimal input parameters.
This allows pipelines to start with URL generation (like Common Crawl).
- inputs() tuple[list[str], list[str]]#
Define input requirements - expects empty task.
- limit: int | None#
None
- outputs() tuple[list[str], list[str]]#
Define output - produces FileGroupTask with URLs.
- process(task: nemo_curator.tasks._EmptyTask) list[nemo_curator.tasks.FileGroupTask]#
Generate URLs and create FileGroupTasks.
Args: task (_EmptyTask): Empty input task
Returns: list[FileGroupTask]: List of tasks containing URLs
- ray_stage_spec() dict[str, Any]#
Get Ray configuration for this stage. Note: this is only used for Ray Data, which is an experimental backend. The keys are defined in RayStageSpecKeys in backends/experimental/ray_data/utils.py.
Returns (dict[str, Any]): Dictionary containing Ray-specific configuration
- url_generator: stages.text.download.base.url_generation.URLGenerator#
None
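The flow described for URLGenerationStage (an empty task comes in, the url_generator produces URLs, limit optionally truncates them, and FileGroupTasks go out) can be illustrated with a minimal, self-contained sketch. It does not import nemo_curator: every class below is a hypothetical stand-in, and the generate_urls method name, the task fields, the one-URL-per-task grouping, and the value returned by outputs() are assumptions made for illustration, not the library's confirmed API.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class EmptyTaskSketch:
    """Hypothetical stand-in for nemo_curator.tasks._EmptyTask."""
    task_id: str = "empty"


@dataclass
class FileGroupTaskSketch:
    """Hypothetical stand-in for nemo_curator.tasks.FileGroupTask."""
    task_id: str
    data: list[str] = field(default_factory=list)


class URLGeneratorSketch(ABC):
    """Stand-in for URLGenerator; the method name is an assumption."""

    @abstractmethod
    def generate_urls(self) -> list[str]:
        """Return the full list of URLs to download."""


class StaticURLGenerator(URLGeneratorSketch):
    """Toy generator that builds numbered URLs under a base URL."""

    def __init__(self, base_url: str, count: int) -> None:
        self.base_url = base_url
        self.count = count

    def generate_urls(self) -> list[str]:
        return [f"{self.base_url}/part-{i:05d}.gz" for i in range(self.count)]


@dataclass
class URLGenerationStageSketch:
    """Mirrors the documented attributes: url_generator and limit."""
    url_generator: URLGeneratorSketch
    limit: int | None = None

    def inputs(self) -> tuple[list[str], list[str]]:
        # An empty task carries nothing, so no attributes are required.
        return [], []

    def outputs(self) -> tuple[list[str], list[str]]:
        # Assumption: the produced tasks expose their URLs under "data".
        return ["data"], []

    def process(self, task: EmptyTaskSketch) -> list[FileGroupTaskSketch]:
        urls = self.url_generator.generate_urls()
        if self.limit is not None:
            # Keep only the first `limit` URLs.
            urls = urls[: self.limit]
        # Assumption: one URL per output task.
        return [
            FileGroupTaskSketch(task_id=f"{task.task_id}_{i}", data=[url])
            for i, url in enumerate(urls)
        ]


stage = URLGenerationStageSketch(
    url_generator=StaticURLGenerator("https://example.com/crawl", count=5),
    limit=2,
)
tasks = stage.process(EmptyTaskSketch())
```

Here the toy generator yields five URLs and limit=2 keeps the first two, so tasks holds two entries. With limit left at None, all five URLs would be emitted.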