nemo_curator.stages.text.download.base.stage
Module Contents
Classes
API
Dataclass
Bases: CompositeStage[_EmptyTask, DocumentBatch]
Composite stage that combines URL generation, download, and iterate-extract stages.
It supports the full three-step pipeline pattern used by sources such as Common Crawl:
- Generate URLs from minimal input
- Download files from URLs
- Iterate through files to extract structured content
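The three steps above can be illustrated with a minimal, library-independent sketch. It does not use the NeMo Curator API: every name in it (`generate_urls`, `download`, `iterate_extract`, `run_pipeline`, the `memory://` URLs, and the record fields) is a hypothetical stand-in, and the "download" step only writes a small local file so the example runs without network access. The `url_limit` and `record_limit` arguments are assumed here to cap the number of URLs and the number of records per file, respectively.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, Optional


def generate_urls(limit: Optional[int] = None) -> list:
    """Step 1: produce URLs from minimal input (here, a hard-coded pattern)."""
    urls = [f"memory://shard-{i}.jsonl" for i in range(3)]
    return urls if limit is None else urls[:limit]


def download(url: str, out_dir: Path) -> Path:
    """Step 2: fetch a URL to a local file (simulated; no network access)."""
    path = out_dir / url.rsplit("/", 1)[-1]
    rows = [{"id": f"{path.stem}-{n}", "raw": f"  text {n}  "} for n in range(2)]
    path.write_text("\n".join(json.dumps(r) for r in rows), encoding="utf-8")
    return path


def iterate_extract(path: Path, record_limit: Optional[int] = None) -> Iterator[dict]:
    """Step 3: iterate over raw records in a file and extract structured fields."""
    with path.open(encoding="utf-8") as fh:
        for n, line in enumerate(fh):
            if record_limit is not None and n >= record_limit:
                break
            record = json.loads(line)
            # Keep the source file name alongside the extracted text.
            yield {"id": record["id"], "text": record["raw"].strip(), "file_name": path.name}


def run_pipeline(url_limit: Optional[int] = None, record_limit: Optional[int] = None) -> list:
    """Chain the three steps and collect the extracted documents."""
    with tempfile.TemporaryDirectory() as tmp:
        docs = []
        for url in generate_urls(url_limit):
            docs.extend(iterate_extract(download(url, Path(tmp)), record_limit))
        return docs
```

Calling `run_pipeline(url_limit=2, record_limit=1)` in this sketch processes two simulated files and keeps one record from each.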
add_filename_column
downloader
extractor
extractor_max_calls_per_worker
iterator
record_limit
url_generator
url_limit
Initialize the constituent stages.
Decompose into constituent stages.
Get description of this composite stage.
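As a rough illustration of the three behaviours described above (initializing the constituent stages, decomposing into them, and describing the composite), the following sketch uses plain Python dataclasses. It is not the library's implementation: `SketchStage`, `SketchCompositeStage`, `run`, and the method names `decompose` and `get_description` are hypothetical names chosen only to mirror the descriptions on this page.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SketchStage:
    """Hypothetical single stage: a name plus a function applied to its input."""
    name: str
    fn: Callable[[list], list]


@dataclass
class SketchCompositeStage:
    """Hypothetical composite that builds its sub-stages at construction time."""
    url_limit: int = 2
    stages: List[SketchStage] = field(init=False, default_factory=list)

    def __post_init__(self) -> None:
        # Initialize the constituent stages once, from the configuration fields.
        self.stages = [
            SketchStage("url_generation", lambda _: [f"u{i}" for i in range(self.url_limit)]),
            SketchStage("download", lambda urls: [f"{u}.bin" for u in urls]),
            SketchStage("iterate_extract", lambda files: [{"file_name": f} for f in files]),
        ]

    def decompose(self) -> List[SketchStage]:
        # Return the sub-stages so a runner can execute each one separately.
        return list(self.stages)

    def get_description(self) -> str:
        # Summarize the composite in terms of its sub-stage names.
        return "Composite of: " + " -> ".join(s.name for s in self.stages)


def run(stage: SketchCompositeStage) -> list:
    """Run the decomposed sub-stages in order, feeding each output to the next."""
    data: list = []
    for sub in stage.decompose():
        data = sub.fn(data)
    return data
```

The design point in this sketch is that the composite itself does no work: a runner asks it for its sub-stages and executes those, so each step can be scheduled or scaled independently.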