nemo_curator.stages.text.download.base.url_generation

Module Contents

Classes

| Name | Description |
| --- | --- |
| URLGenerationStage | Stage that generates URLs from minimal input parameters. |
| URLGenerator | Abstract base class for URL generators - generates URLs from minimal input. |

API

class nemo_curator.stages.text.download.base.url_generation.URLGenerationStage(
url_generator: nemo_curator.stages.text.download.base.url_generation.URLGenerator,
limit: int | None = None
)
Dataclass

Bases: ProcessingStage[_EmptyTask, FileGroupTask]

Stage that generates URLs from minimal input parameters.

This lets a pipeline begin with a URL generation step that takes an empty task as input, as in Common Crawl downloads.

limit: int | None = None
resources = Resources(cpus=0.5)
url_generator: URLGenerator

nemo_curator.stages.text.download.base.url_generation.URLGenerationStage.__post_init__()

nemo_curator.stages.text.download.base.url_generation.URLGenerationStage.inputs() -> tuple[list[str], list[str]]

Define input requirements - expects empty task.

nemo_curator.stages.text.download.base.url_generation.URLGenerationStage.outputs() -> tuple[list[str], list[str]]

Define output - produces FileGroupTask with URLs.

nemo_curator.stages.text.download.base.url_generation.URLGenerationStage.process(
task: nemo_curator.tasks._EmptyTask
) -> list[nemo_curator.tasks.FileGroupTask]

Generate URLs and create FileGroupTasks.

Parameters:

task: _EmptyTask

Empty input task.

Returns: list[FileGroupTask]

List of tasks containing URLs.
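
The step described above (generate URLs, then wrap them in tasks) can be sketched as follows. This is a minimal illustration, not the library's implementation: `UrlTask` is a hypothetical stand-in for `FileGroupTask`, `urls_to_tasks` is a hypothetical helper, and both the one-URL-per-task grouping and the reading of `limit` as a cap on the number of URLs are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class UrlTask:
    """Hypothetical stand-in for FileGroupTask: an id plus a list of URLs."""

    task_id: str
    data: list[str] = field(default_factory=list)


def urls_to_tasks(urls: list[str], limit: int | None = None) -> list[UrlTask]:
    """Apply an optional cap, then wrap each URL in its own task (assumed behavior)."""
    if limit is not None:
        urls = urls[:limit]
    return [UrlTask(task_id=f"url_{i}", data=[url]) for i, url in enumerate(urls)]


tasks = urls_to_tasks(
    [
        "https://example.com/a.gz",
        "https://example.com/b.gz",
        "https://example.com/c.gz",
    ],
    limit=2,
)
print([t.data for t in tasks])
```

Under these assumptions, a small `limit` is a convenient way to smoke-test a pipeline on a few URLs before running it on the full list.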

nemo_curator.stages.text.download.base.url_generation.URLGenerationStage.ray_stage_spec() -> dict[str, typing.Any]

nemo_curator.stages.text.download.base.url_generation.URLGenerationStage.xenna_stage_spec() -> dict[str, typing.Any]

class nemo_curator.stages.text.download.base.url_generation.URLGenerator()
Abstract

Abstract base class for URL generators - generates URLs from minimal input.

nemo_curator.stages.text.download.base.url_generation.URLGenerator.generate_urls() -> list[str]
abstract

Generate a list of URLs to download.
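
A concrete generator only needs to implement `generate_urls()`. The sketch below shows the shape of such a subclass. So that it runs without NeMo Curator installed, it defines a local stand-in `URLGenerator` that mirrors the documented abstract interface; in a real pipeline you would import the class from this module instead. `ShardURLGenerator`, its parameters, and the URL pattern are hypothetical.

```python
from abc import ABC, abstractmethod


class URLGenerator(ABC):
    """Local stand-in mirroring the documented interface; not the library class."""

    @abstractmethod
    def generate_urls(self) -> list[str]:
        """Generate a list of URLs to download."""


class ShardURLGenerator(URLGenerator):
    """Hypothetical generator: one URL per numbered shard under a base URL."""

    def __init__(self, base_url: str, num_shards: int) -> None:
        self.base_url = base_url.rstrip("/")
        self.num_shards = num_shards

    def generate_urls(self) -> list[str]:
        # Zero-pad the shard index so the URLs sort lexicographically.
        return [
            f"{self.base_url}/shard-{i:05d}.jsonl.gz"
            for i in range(self.num_shards)
        ]


generator = ShardURLGenerator("https://example.com/data/", num_shards=3)
urls = generator.generate_urls()
print(urls)
```

An instance like `generator` would then be passed as the `url_generator` argument of `URLGenerationStage`.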