stages.text.download.base.stage

Module Contents

Classes

DocumentDownloadExtractStage

Composite stage that combines the URL generation, download, iteration, and extraction stages.

API

class stages.text.download.base.stage.DocumentDownloadExtractStage

Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.DocumentBatch]

Composite stage that combines the URL generation, download, iteration, and extraction stages.

This supports the full four-step pipeline pattern used for sources such as Common Crawl:

  1. Generate URLs from minimal input

  2. Download files from URLs

  3. Iterate through files to extract raw records

  4. Extract structured content from raw records
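The four steps above can be sketched in plain Python. This is only an illustration of the data flow, not the library's implementation: `run_four_step_pipeline` and the `fake_*` helpers are hypothetical names, and the way `url_limit` and `record_limit` are applied here (truncating the URL list, and capping raw records per file) is an assumption, since this page does not define their semantics.

```python
from __future__ import annotations

from typing import Callable, Iterable, Iterator

# Hypothetical in-memory "remote files", used instead of real network access.
FAKE_FILES = {"a.txt": ["hello", "", "world"], "b.txt": ["foo"]}


def fake_generate_urls() -> list[str]:
    # Step 1: produce URLs from minimal (here: no) input.
    return [f"http://example.invalid/{name}" for name in FAKE_FILES]


def fake_download(url: str) -> str | None:
    # Step 2: "download" a URL and return a local path, or None on failure.
    name = url.rsplit("/", 1)[-1]
    return name if name in FAKE_FILES else None


def fake_iterate(path: str) -> Iterable[dict]:
    # Step 3: yield raw records from one downloaded file.
    for line in FAKE_FILES[path]:
        yield {"raw": line, "file_name": path}


def fake_extract(record: dict) -> dict | None:
    # Step 4: turn a raw record into structured content; drop empty records.
    if not record["raw"]:
        return None
    return {"text": record["raw"].upper(), "file_name": record["file_name"]}


def run_four_step_pipeline(
    generate_urls: Callable[[], list[str]],
    download: Callable[[str], str | None],
    iterate: Callable[[str], Iterable[dict]],
    extract: Callable[[dict], dict | None],
    url_limit: int | None = None,
    record_limit: int | None = None,
) -> Iterator[dict]:
    urls = generate_urls()
    if url_limit is not None:
        urls = urls[:url_limit]  # assumption: limit applies to the URL list
    for url in urls:
        path = download(url)
        if path is None:
            continue  # skip failed downloads
        for index, record in enumerate(iterate(path)):
            # Assumption: the limit counts raw records per file.
            if record_limit is not None and index >= record_limit:
                break
            document = extract(record)
            if document is not None:
                yield document


docs = list(
    run_four_step_pipeline(fake_generate_urls, fake_download, fake_iterate, fake_extract)
)
limited = list(
    run_four_step_pipeline(
        fake_generate_urls, fake_download, fake_iterate, fake_extract,
        url_limit=1, record_limit=1,
    )
)
```

Passing each step in as a callable mirrors the way the class takes a separate `url_generator`, `downloader`, `iterator`, and optional `extractor`, so any one step can be swapped without touching the others.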

Initialization

add_filename_column: bool | str

True

decompose() -> list[nemo_curator.stages.base.ProcessingStage]

Decompose into constituent stages.
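The composite pattern behind `decompose()` can be illustrated with a small stand-alone sketch. `ToyStage`, `ToyCompositeStage`, and `execute` are hypothetical names written for this example; they are not part of `nemo_curator`, and how the real class groups the four steps into concrete stages is not stated on this page.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToyStage:
    """Hypothetical single processing step: a name plus a function."""

    name: str
    fn: Callable[[Any], Any]


@dataclass
class ToyCompositeStage:
    """Hypothetical composite that only holds an ordered list of stages."""

    stages: list[ToyStage] = field(default_factory=list)

    def decompose(self) -> list[ToyStage]:
        # Return a copy so callers cannot mutate the composite's own list.
        return list(self.stages)

    def get_description(self) -> str:
        return "Composite of: " + " -> ".join(s.name for s in self.stages)


def execute(composite: ToyCompositeStage, payload: Any) -> Any:
    # An executor never runs the composite itself; it runs the
    # constituent stages in the order decompose() returns them.
    for stage in composite.decompose():
        payload = stage.fn(payload)
    return payload


composite = ToyCompositeStage(
    [
        ToyStage("generate_urls", lambda _: ["a", "b"]),
        ToyStage("download", lambda urls: [u + ".bin" for u in urls]),
        ToyStage("iterate_extract", lambda files: [{"text": f} for f in files]),
    ]
)
result = execute(composite, None)
description = composite.get_description()
```

In this sketch the composite carries configuration only, and all work happens in the stages it decomposes into.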

downloader: stages.text.download.base.download.DocumentDownloader

None

extractor: stages.text.download.base.extract.DocumentExtractor | None

None

get_description() -> str

Get description of this composite stage.

iterator: stages.text.download.base.iterator.DocumentIterator

None

record_limit: int | None

None

url_generator: stages.text.download.base.url_generation.URLGenerator

None

url_limit: int | None

None