stages.text.download.base.stage
Module Contents
Classes
DocumentDownloadExtractStage | Composite stage that combines URL generation, download, iterate, and extract stages.
API
- class stages.text.download.base.stage.DocumentDownloadExtractStage
Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.DocumentBatch]
Composite stage that combines URL generation, download, iterate, and extract stages.
This supports the full 4-step pipeline pattern used by sources such as Common Crawl:
1. Generate URLs from minimal input
2. Download files from URLs
3. Iterate through files to extract raw records
4. Extract structured content from raw records
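The four steps can be sketched in plain Python to show how data flows from one step to the next. This is only an illustration under stated assumptions: the function names, the `example.invalid` URL scheme, and the `source:text` record layout are all hypothetical, the download is faked so the example runs offline, and nothing here uses or mirrors the actual NeMo Curator classes.

```python
from __future__ import annotations

from typing import Iterator


def generate_urls(seed: str) -> list[str]:
    """Step 1: expand a minimal input into concrete URLs (hypothetical scheme)."""
    return [f"https://example.invalid/{seed}/part-{i}.txt" for i in range(2)]


def download(url: str) -> str:
    """Step 2: stand-in for a real download; returns fake file content, no I/O."""
    name = url.rsplit("/", 1)[-1]
    return f"{name}:record A\n{name}:record B"


def iterate(file_content: str) -> Iterator[dict]:
    """Step 3: yield one raw record per line of the file."""
    for line in file_content.splitlines():
        yield {"raw": line}


def extract(record: dict) -> dict | None:
    """Step 4: turn a raw record into structured fields; None means 'skip'."""
    source, _, text = record["raw"].partition(":")
    return {"source": source, "text": text.strip()} if text else None


def run_pipeline(seed: str) -> list[dict]:
    """Chain the four steps in order and collect the structured documents."""
    documents = []
    for url in generate_urls(seed):
        for record in iterate(download(url)):
            doc = extract(record)
            if doc is not None:
                documents.append(doc)
    return documents


docs = run_pipeline("snapshot")
```

Each step consumes only the output of the previous one, which is what allows a composite stage to be split into independent stages.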
Initialization
- add_filename_column: bool | str
True
- decompose() → list[nemo_curator.stages.base.ProcessingStage]
Decompose into constituent stages.
- downloader: stages.text.download.base.download.DocumentDownloader
None
- extractor: stages.text.download.base.extract.DocumentExtractor | None
None
- get_description() → str
Get description of this composite stage.
- iterator: stages.text.download.base.iterator.DocumentIterator
None
- record_limit: int | None
None
- url_limit: int | None
None
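To make the roles of `decompose()`, `get_description()`, and the limit fields more concrete, here is a minimal, self-contained sketch of a composite object that expands into simpler stages. It does not import NeMo Curator and is not its implementation: `SketchStage`, `SketchComposite`, and `run` are hypothetical names, the components are plain callables rather than the real `URLGenerator`, `DocumentDownloader`, `DocumentIterator`, and `DocumentExtractor` types, and the semantics shown for `url_limit` (caps the generated URLs) and `record_limit` (caps the records per file) are assumptions. `add_filename_column` is left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchStage:
    """Hypothetical stand-in for a processing stage: a name plus a function."""
    name: str
    fn: Callable[[list], list]


@dataclass
class SketchComposite:
    """Hypothetical composite; field names mirror the attributes listed above."""
    url_generator: Callable[[], list]
    downloader: Callable[[str], str]
    iterator: Callable[[str], list]
    extractor: Callable[[dict], dict] | None = None
    url_limit: int | None = None
    record_limit: int | None = None

    def decompose(self) -> list[SketchStage]:
        def gen(_: list) -> list:
            # Assumption: url_limit caps how many generated URLs are kept.
            return self.url_generator()[: self.url_limit]

        def fetch(urls: list) -> list:
            return [self.downloader(url) for url in urls]

        def iterate(files: list) -> list:
            # Assumption: record_limit caps the records taken from each file.
            return [r for f in files for r in self.iterator(f)[: self.record_limit]]

        stages = [
            SketchStage("url_generation", gen),
            SketchStage("download", fetch),
            SketchStage("iterate", iterate),
        ]
        if self.extractor is not None:
            extractor = self.extractor
            stages.append(SketchStage("extract", lambda recs: [extractor(r) for r in recs]))
        return stages

    def get_description(self) -> str:
        return " -> ".join(stage.name for stage in self.decompose())


def run(composite: SketchComposite) -> list:
    """Feed each stage's output into the next, starting from an empty input."""
    data: list = []
    for stage in composite.decompose():
        data = stage.fn(data)
    return data


composite = SketchComposite(
    url_generator=lambda: ["u1", "u2", "u3"],
    downloader=lambda url: f"{url}-a,{url}-b,{url}-c",
    iterator=lambda content: [{"raw": r} for r in content.split(",")],
    extractor=lambda rec: {"text": rec["raw"].upper()},
    url_limit=2,
    record_limit=1,
)
result = run(composite)
```

In this sketch the extract stage is added only when an extractor is supplied, reflecting that `extractor` is the one component typed as optional in the attribute list.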