stages.text.download.base.url_generation
Module Contents
Classes
| URLGenerationStage | Stage that generates URLs from minimal input parameters. |
| URLGenerator | Abstract base class for URL generators - generates URLs from minimal input. |
API
- class stages.text.download.base.url_generation.URLGenerationStage
Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.FileGroupTask]
Stage that generates URLs from minimal input parameters.
This allows pipelines to start with URL generation (like Common Crawl).
- inputs() → tuple[list[str], list[str]]
Define input requirements: expects an empty task.
- limit: int | None = None
- outputs() → tuple[list[str], list[str]]
Define outputs: produces FileGroupTask objects containing URLs.
- process(task: nemo_curator.tasks._EmptyTask) → list[nemo_curator.tasks.FileGroupTask]
Generate URLs and create FileGroupTasks.
Args: task (_EmptyTask): Empty input task
Returns: list[FileGroupTask]: List of tasks containing URLs
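The flow that process describes (take an empty task, generate URLs, return a list of tasks holding those URLs) can be sketched without NeMo Curator installed. Everything in the sketch is a stand-in chosen for illustration and not the library's actual definition: the `EmptyTask` and `FileGroupTask` dataclasses and their fields, the `StaticURLGenerator` class, the `generate_urls` method name, the one-URL-per-task grouping, and the assumption that `limit` truncates the URL list.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EmptyTask:
    """Stand-in for nemo_curator.tasks._EmptyTask (hypothetical; carries no data)."""

    task_id: str = "empty"


@dataclass
class FileGroupTask:
    """Stand-in for nemo_curator.tasks.FileGroupTask (hypothetical fields)."""

    task_id: str
    data: list[str] = field(default_factory=list)


class StaticURLGenerator:
    """Hypothetical generator that returns a fixed list of URLs."""

    def __init__(self, urls: list[str]) -> None:
        self._urls = list(urls)

    def generate_urls(self) -> list[str]:  # method name is an assumption
        return list(self._urls)


def process_sketch(
    task: EmptyTask,
    url_generator: StaticURLGenerator,
    limit: int | None = None,
) -> list[FileGroupTask]:
    """Generate URLs, optionally keep only the first `limit`, wrap each in a task."""
    urls = url_generator.generate_urls()
    if limit is not None:
        urls = urls[:limit]  # assumed meaning of `limit`
    return [
        FileGroupTask(task_id=f"{task.task_id}_{i}", data=[url])
        for i, url in enumerate(urls)
    ]


generator = StaticURLGenerator(
    [
        "https://example.com/a.warc.gz",
        "https://example.com/b.warc.gz",
        "https://example.com/c.warc.gz",
    ]
)
tasks = process_sketch(EmptyTask(), generator, limit=2)
```

With `limit=2`, the sketch keeps the first two URLs and returns one task per URL. Whether the real stage groups URLs this way is not stated on this page.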
- ray_stage_spec() → dict[str, Any]
Get Ray configuration for this stage. Note: This is only used for Ray Data, which is an experimental backend. The keys are defined in RayStageSpecKeys in backends/experimental/ray_data/utils.py
Returns (dict[str, Any]): Dictionary containing Ray-specific configuration
- url_generator: stages.text.download.base.url_generation.URLGenerator = None
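Because URLGenerator is described as an abstract base class, a concrete generator would subclass it and supply the URL-building logic. The following is a minimal sketch of that pattern using only the standard library. `URLGeneratorSketch`, `NumberedShardGenerator`, the `generate_urls` method name, and the shard URL layout are all hypothetical and are not taken from NeMo Curator.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class URLGeneratorSketch(ABC):
    """Stand-in for the abstract URLGenerator; the method name is an assumption."""

    @abstractmethod
    def generate_urls(self) -> list[str]:
        """Return the list of URLs to download."""


class NumberedShardGenerator(URLGeneratorSketch):
    """Hypothetical generator: builds shard URLs from a base URL and a shard count."""

    def __init__(self, base_url: str, num_shards: int) -> None:
        self.base_url = base_url.rstrip("/")
        self.num_shards = num_shards

    def generate_urls(self) -> list[str]:
        # Zero-padded shard index; the naming scheme is made up for this example.
        return [
            f"{self.base_url}/shard-{i:05d}.jsonl.gz"
            for i in range(self.num_shards)
        ]


shard_urls = NumberedShardGenerator("https://example.com/data/", 3).generate_urls()

# The abstract base cannot be instantiated directly; Python raises TypeError.
try:
    URLGeneratorSketch()
    abstract_instantiable = True
except TypeError:
    abstract_instantiable = False
```

The minimal input here is just a base URL and a count, and the generator expands it into the full URL list.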