nemo_curator.stages.text.download.base.url_generation
Module Contents
Classes
API
Dataclass
Bases: ProcessingStage[_EmptyTask, FileGroupTask]
Stage that generates URLs from minimal input parameters.
This lets a pipeline begin with URL generation rather than with existing input data (as in Common Crawl downloads).
limit
resources
url_generator
Define input requirements - expects empty task.
Define output - produces FileGroupTask with URLs.
Generate URLs and create FileGroupTasks.
Parameters:
task
Empty input task
Returns: list[FileGroupTask]
List of tasks containing the generated URLs
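As a rough illustration of the behaviour described above, the following self-contained sketch mimics a stage that takes an empty input, asks a generator for URLs, optionally truncates the list with `limit`, and wraps the result in tasks. It does not import nemo_curator: `FileGroupTaskSketch` and `URLGenerationStageSketch` are hypothetical stand-ins, and both the one-task-per-URL layout and the reading of `limit` as a cap on the number of URLs are assumptions, not confirmed details of the library.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FileGroupTaskSketch:
    """Hypothetical stand-in for FileGroupTask: an id plus a list of URLs."""

    task_id: str
    data: list[str] = field(default_factory=list)


@dataclass
class URLGenerationStageSketch:
    """Hypothetical stand-in for the stage; not the library's implementation."""

    # Any zero-argument callable returning URLs (assumption for this sketch).
    url_generator: Callable[[], list[str]]
    # Assumed meaning: maximum number of URLs to keep; None keeps all of them.
    limit: Optional[int] = None

    def process(self, task: object = None) -> list[FileGroupTaskSketch]:
        # The input task is empty, so it carries nothing to read.
        urls = self.url_generator()
        if self.limit is not None:
            urls = urls[: self.limit]
        # Assumption: one task per URL.
        return [
            FileGroupTaskSketch(task_id=f"url_{i}", data=[url])
            for i, url in enumerate(urls)
        ]


example_urls = [
    "https://example.com/a.gz",
    "https://example.com/b.gz",
    "https://example.com/c.gz",
]
stage = URLGenerationStageSketch(url_generator=lambda: list(example_urls), limit=2)
tasks = stage.process(None)
```

Here `tasks` holds two entries because `limit=2` truncates the three generated URLs before they are wrapped.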
Abstract
Abstract base class for URL generators - generates URLs from minimal input.
abstract
Generate a list of URLs to download.
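A minimal sketch of the abstract-generator pattern described above follows. To stay runnable without nemo_curator installed, it defines its own stand-in base class rather than importing the real one; the class name `URLGeneratorSketch`, the method name `generate_urls`, and the `MonthlyDumpURLGenerator` subclass with its URL layout are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class URLGeneratorSketch(ABC):
    """Stand-in for the abstract base class (not imported from nemo_curator)."""

    @abstractmethod
    def generate_urls(self) -> list[str]:
        """Generate a list of URLs to download."""


class MonthlyDumpURLGenerator(URLGeneratorSketch):
    """Hypothetical generator that yields one archive URL per month of a year."""

    def __init__(self, base_url: str, year: int) -> None:
        self.base_url = base_url.rstrip("/")
        self.year = year

    def generate_urls(self) -> list[str]:
        # Months are zero-padded so the URLs sort chronologically.
        return [
            f"{self.base_url}/{self.year}-{month:02d}.jsonl.gz"
            for month in range(1, 13)
        ]


urls = MonthlyDumpURLGenerator("https://example.com/dumps/", 2024).generate_urls()
```

Because the base class declares the method as abstract, instantiating it directly raises `TypeError`; only subclasses that implement the method can be created.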