# nemo_curator.stages.text.download.base.iterator

## Module Contents

### Classes

| Name                                                                                                          | Description                                                         |
| ------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| [`DocumentIterateExtractStage`](#nemo_curator-stages-text-download-base-iterator-DocumentIterateExtractStage) | Iterates through downloaded files and extracts structured content.  |
| [`DocumentIterator`](#nemo_curator-stages-text-download-base-iterator-DocumentIterator)                       | Abstract base class for document iterators.                         |

### API

<Anchor id="nemo_curator-stages-text-download-base-iterator-DocumentIterateExtractStage">
  <CodeBlock links={{"nemo_curator.stages.text.download.base.iterator.DocumentIterator":"#nemo_curator-stages-text-download-base-iterator-DocumentIterator","nemo_curator.stages.text.download.base.extract.DocumentExtractor":"/nemo-curator/nemo_curator/stages/text/download/base/extract#nemo_curator-stages-text-download-base-extract-DocumentExtractor"}} showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage(
        iterator: nemo_curator.stages.text.download.base.iterator.DocumentIterator,
        extractor: nemo_curator.stages.text.download.base.extract.DocumentExtractor | None = None,
        record_limit: int | None = None,
        add_filename_column: bool | str = True,
        max_calls_per_worker: int | None = None
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  <Badge>
    Dataclass
  </Badge>

  **Bases:** [ProcessingStage\[FileGroupTask, DocumentBatch\]](/nemo-curator/nemo_curator/stages/base#nemo_curator-stages-base-ProcessingStage)

  Stage that iterates through downloaded files with DocumentIterator,
  then extracts structured content from raw records with DocumentExtractor.

  Takes a `FileGroupTask` of local file paths and produces a `DocumentBatch` with the extracted content.
  If the `DocumentIterator` already yields records in their final format, no `DocumentExtractor` is needed and `extractor` can stay `None`.

  <ParamField path="add_filename_column" type="bool | str = True" />

  <ParamField path="extractor" type="DocumentExtractor | None = None" />

  <ParamField path="iterator" type="DocumentIterator" />

  <ParamField path="max_calls_per_worker" type="int | None = None" />

  <ParamField path="record_limit" type="int | None = None" />

  <Anchor id="nemo_curator-stages-text-download-base-iterator-DocumentIterateExtractStage-__post_init__">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage.__post_init__()
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Initialize the stage.
  </Indent>

  <Anchor id="nemo_curator-stages-text-download-base-iterator-DocumentIterateExtractStage-inputs">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage.inputs() -> tuple[list[str], list[str]]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Define input requirements - expects FileGroupTask with local file paths.
  </Indent>

  <Anchor id="nemo_curator-stages-text-download-base-iterator-DocumentIterateExtractStage-outputs">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage.outputs() -> tuple[list[str], list[str]]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Define output - produces DocumentBatch with processed records.
  </Indent>

  <Anchor id="nemo_curator-stages-text-download-base-iterator-DocumentIterateExtractStage-process">
    <CodeBlock links={{"nemo_curator.tasks.FileGroupTask":"/nemo-curator/nemo_curator/tasks/file_group#nemo_curator-tasks-file_group-FileGroupTask","nemo_curator.tasks.DocumentBatch":"/nemo-curator/nemo_curator/tasks/document#nemo_curator-tasks-document-DocumentBatch"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage.process(
          task: nemo_curator.tasks.FileGroupTask
      ) -> nemo_curator.tasks.DocumentBatch
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Iterate through files and extract structured content.

    **Parameters:**

    <ParamField path="task" type="FileGroupTask">
      Task containing local file paths
    </ParamField>

    **Returns:** `DocumentBatch`

    Batch containing extracted records
  </Indent>

  <Anchor id="nemo_curator-stages-text-download-base-iterator-DocumentIterateExtractStage-ray_stage_spec">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.base.iterator.DocumentIterateExtractStage.ray_stage_spec() -> dict[str, typing.Any]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Get Ray configuration for this stage.
  </Indent>
</Indent>
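
The iterate-then-extract flow described for `DocumentIterateExtractStage` can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `LineIterator`, `UppercaseExtractor`, `iterate_and_extract`, and `demo` are hypothetical names, the extractor is assumed to expose an `extract(record)` method that returns a dict or `None`, `record_limit` is assumed to cap the records read per file, and a plain list of dicts stands in for `DocumentBatch`.

```python
import os
import tempfile
from collections.abc import Iterator
from typing import Any, Optional


class LineIterator:
    """Hypothetical iterator: yields one record per non-empty line of a text file."""

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield {"raw_content": line}


class UppercaseExtractor:
    """Hypothetical extractor: maps a raw record to a final record, or None to drop it."""

    def extract(self, record: dict[str, Any]) -> Optional[dict[str, Any]]:
        text = record["raw_content"]
        if text.startswith("#"):  # treat comment lines as records to drop
            return None
        return {"text": text.upper()}


def iterate_and_extract(
    file_paths: list[str],
    iterator: LineIterator,
    extractor: Optional[UppercaseExtractor] = None,
    record_limit: Optional[int] = None,
    filename_column: Optional[str] = "file_name",
) -> list[dict[str, Any]]:
    """Stand-in for the stage's processing step (returns a list, not a DocumentBatch)."""
    records: list[dict[str, Any]] = []
    for path in file_paths:
        count = 0
        for raw in iterator.iterate(path):
            if record_limit is not None and count >= record_limit:
                break  # assumed semantics: the limit applies per file
            count += 1
            # Without an extractor, the iterator's record is taken as the final format.
            record = extractor.extract(raw) if extractor is not None else raw
            if record is None:
                continue
            if filename_column is not None:
                # Assumed behavior: tag each record with its source file name.
                record = {**record, filename_column: os.path.basename(path)}
            records.append(record)
    return records


def demo() -> list[dict[str, Any]]:
    """Run the sketch on a small temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("hello\n# skipped\nworld\n")
        return iterate_and_extract([path], LineIterator(), UppercaseExtractor())
```

Calling `demo()` yields two records, one for `hello` and one for `world`, each carrying the source file name. Passing `extractor=None` returns the iterator's records unchanged, which mirrors the case where the iterator already produces the final format.
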

<Anchor id="nemo_curator-stages-text-download-base-iterator-DocumentIterator">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.download.base.iterator.DocumentIterator()
    ```
  </CodeBlock>
</Anchor>

<Indent>
  <Badge>
    Abstract
  </Badge>

  Abstract base class for document iterators.

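  Always yields `dict` records. They are described as `dict[str, str]`, although the `iterate` signature is annotated more loosely as `dict[str, typing.Any]`.
  Raw content that still needs extraction can go in any field, for example `raw_content`, `html`, or `content`.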

  <Anchor id="nemo_curator-stages-text-download-base-iterator-DocumentIterator-iterate">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.base.iterator.DocumentIterator.iterate(
          file_path: str
      ) -> collections.abc.Iterator[dict[str, typing.Any]]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    <Badge>
      abstract
    </Badge>

    Iterate over records in a file, yielding dict records.
  </Indent>

  <Anchor id="nemo_curator-stages-text-download-base-iterator-DocumentIterator-output_columns">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.base.iterator.DocumentIterator.output_columns() -> list[str]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    <Badge>
      abstract
    </Badge>

    Define the output columns, returned as a list of the column names in the records that end up in the `DocumentBatch`.
  </Indent>
</Indent>
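
A concrete iterator implements both abstract methods. The sketch below is an illustration rather than library code: to stay self-contained it defines `DocumentIteratorSketch`, a hypothetical stand-in that mirrors the two documented method signatures instead of importing `DocumentIterator`, and `JsonlIterator` and `read_sample` are likewise hypothetical. It assumes a JSON Lines input whose objects carry `id` and `text` keys.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from collections.abc import Iterator
from typing import Any


class DocumentIteratorSketch(ABC):
    """Stand-in mirroring the two abstract methods documented above."""

    @abstractmethod
    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        """Yield one dict per record in the file."""

    @abstractmethod
    def output_columns(self) -> list[str]:
        """Return the column names of the yielded records."""


class JsonlIterator(DocumentIteratorSketch):
    """Hypothetical iterator for JSON Lines files with "id" and "text" keys."""

    def iterate(self, file_path: str) -> Iterator[dict[str, Any]]:
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                obj = json.loads(line)
                # Emit string values so records match the dict[str, str] description.
                yield {"id": str(obj["id"]), "text": obj["text"]}

    def output_columns(self) -> list[str]:
        return ["id", "text"]


def read_sample() -> list[dict[str, Any]]:
    """Write a two-record JSONL file and read it back through the iterator."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write('{"id": 1, "text": "first"}\n\n{"id": 2, "text": "second"}\n')
        return list(JsonlIterator().iterate(path))
```

Keeping the keys yielded by `iterate` identical to the list returned by `output_columns` lets downstream code rely on the declared columns. In real use the subclass would derive from `DocumentIterator` itself.
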