nemo_curator.stages.text.download.base.stage

Module Contents

Classes

| Name | Description |
| --- | --- |
| DocumentDownloadExtractStage | Composite stage that combines URL generation, download, and iterate-extract stages. |

API

class nemo_curator.stages.text.download.base.stage.DocumentDownloadExtractStage(
url_generator: nemo_curator.stages.text.download.base.url_generation.URLGenerator,
downloader: nemo_curator.stages.text.download.base.download.DocumentDownloader,
iterator: nemo_curator.stages.text.download.base.iterator.DocumentIterator,
extractor: nemo_curator.stages.text.download.base.extract.DocumentExtractor | None = None,
url_limit: int | None = None,
record_limit: int | None = None,
add_filename_column: bool | str = True,
extractor_max_calls_per_worker: int | None = None
)
Dataclass

Bases: CompositeStage[_EmptyTask, DocumentBatch]

Composite stage that combines URL generation, download, and iterate-extract stages.

This supports the full three-step pipeline pattern used by sources such as Common Crawl:

  1. Generate URLs from minimal input
  2. Download files from URLs
  3. Iterate through files to extract structured content
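
The three steps above can be illustrated with a small standalone sketch. It does not import or call NeMo Curator: the function names, the `example.invalid` URLs, and the fake file contents are all hypothetical stand-ins, and the way `url_limit` and `record_limit` are applied here (cap on URLs, cap on records per file) is an assumption about their meaning rather than documented behavior.

```python
from typing import Iterator, Optional


def generate_urls(url_limit: Optional[int] = None) -> list:
    """Step 1: produce URLs from minimal input (hypothetical hard-coded shards)."""
    urls = [f"https://example.invalid/shard-{i}.txt" for i in range(3)]
    # Assumption: url_limit caps how many URLs are passed downstream.
    return urls if url_limit is None else urls[:url_limit]


def download(url: str) -> str:
    """Step 2: stand-in for a download; returns fake contents, no network access."""
    name = url.rsplit("/", 1)[-1]
    return f"{name}: first record\n{name}: second record\n"


def iterate_and_extract(content: str, record_limit: Optional[int] = None) -> Iterator[dict]:
    """Step 3: walk the file's records and turn each into structured content."""
    for index, line in enumerate(content.splitlines()):
        # Assumption: record_limit caps the records taken from each file.
        if record_limit is not None and index >= record_limit:
            break
        yield {"text": line.strip()}


# Chain the three steps: URLs -> downloaded contents -> extracted records.
records = [
    record
    for url in generate_urls(url_limit=2)
    for record in iterate_and_extract(download(url), record_limit=1)
]
```

In this toy version the three steps are plain functions chained in a comprehension; the composite stage documented here plays the analogous role of bundling the three steps behind a single object.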

add_filename_column: bool | str = True
downloader: DocumentDownloader
extractor: DocumentExtractor | None = None
extractor_max_calls_per_worker: int | None = None
iterator: DocumentIterator
record_limit: int | None = None
url_generator: URLGenerator
url_limit: int | None = None

nemo_curator.stages.text.download.base.stage.DocumentDownloadExtractStage.__post_init__()

Initialize the constituent stages.

nemo_curator.stages.text.download.base.stage.DocumentDownloadExtractStage.decompose() -> list[nemo_curator.stages.base.ProcessingStage]

Decompose into constituent stages.
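
As a rough illustration of how `__post_init__` and `decompose()` can fit together in a dataclass-based composite, here is a minimal standalone sketch. `ToyStage` and `ToyCompositeStage` are hypothetical classes written for this example; they are not NeMo Curator types, and the stage names they produce are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStage:
    """Hypothetical stand-in for a single processing stage."""
    name: str


@dataclass
class ToyCompositeStage:
    """Hypothetical composite that builds its sub-stages after dataclass init."""
    source_name: str
    # Not a constructor argument; populated in __post_init__.
    stages: list = field(init=False, default_factory=list)

    def __post_init__(self) -> None:
        # Build the constituent stages once the dataclass fields are set.
        self.stages = [
            ToyStage(f"url_generation_{self.source_name}"),
            ToyStage(f"download_{self.source_name}"),
            ToyStage(f"iterate_extract_{self.source_name}"),
        ]

    def decompose(self) -> list:
        # Return a copy so callers cannot mutate the internal list.
        return list(self.stages)

    def get_description(self) -> str:
        return "Composite of: " + ", ".join(stage.name for stage in self.stages)


composite = ToyCompositeStage(source_name="demo")
stage_names = [stage.name for stage in composite.decompose()]
```

The design point the sketch tries to show is that the sub-stages are constructed once, right after field initialization, so that decomposition afterwards only has to hand them back in order.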

nemo_curator.stages.text.download.base.stage.DocumentDownloadExtractStage.get_description() -> str

Get description of this composite stage.