# nemo_curator.stages.text.download.html_extractors.base

## Module Contents

### Classes

| Name                                                                                                       | Description |
| ---------------------------------------------------------------------------------------------------------- | ----------- |
| [`HTMLExtractorAlgorithm`](#nemo_curator-stages-text-download-html_extractors-base-HTMLExtractorAlgorithm) | Abstract base class for HTML text extraction algorithms. |

### API

<Anchor id="nemo_curator-stages-text-download-html_extractors-base-HTMLExtractorAlgorithm">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.download.html_extractors.base.HTMLExtractorAlgorithm()
    ```
  </CodeBlock>
</Anchor>

<Indent>
  <Badge>
    Abstract
  </Badge>

  <ParamField path="NON_SPACED_LANGUAGES" />

  <Anchor id="nemo_curator-stages-text-download-html_extractors-base-HTMLExtractorAlgorithm-extract_text">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.html_extractors.base.HTMLExtractorAlgorithm.extract_text(
          html: str,
          stop_words: frozenset[str],
          language: str
      ) -> list[str] | None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    <Badge>
      abstract
    </Badge>
  </Indent>
</Indent>
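
The following is a minimal sketch of how a concrete extractor might implement the abstract `extract_text` method. To stay self-contained, it defines a local stand-in for `HTMLExtractorAlgorithm` that mirrors only the signature documented above; in real code you would presumably import the base class from `nemo_curator.stages.text.download.html_extractors.base` instead. The `ParagraphExtractor` class, its regex-based parsing, and its stop-word filtering rule are hypothetical and chosen purely for illustration; they are not the behavior of any extractor shipped with the library.

```python
from __future__ import annotations

import re
from abc import ABC, abstractmethod


class HTMLExtractorAlgorithm(ABC):
    """Local stand-in mirroring the documented signature (NOT the real class)."""

    @abstractmethod
    def extract_text(
        self, html: str, stop_words: frozenset[str], language: str
    ) -> list[str] | None:
        ...


class ParagraphExtractor(HTMLExtractorAlgorithm):
    """Hypothetical extractor: keeps <p> blocks containing at least one stop word."""

    _P_TAG = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
    _ANY_TAG = re.compile(r"<[^>]+>")

    def extract_text(
        self, html: str, stop_words: frozenset[str], language: str
    ) -> list[str] | None:
        # `language` is accepted to match the interface but unused in this sketch.
        paragraphs = []
        for raw in self._P_TAG.findall(html):
            # Strip nested tags and collapse runs of whitespace.
            text = " ".join(self._ANY_TAG.sub(" ", raw).split())
            words = {w.lower() for w in text.split()}
            # Assumed heuristic: natural-language text contains stop words.
            if text and words & stop_words:
                paragraphs.append(text)
        # Return None when nothing was extracted, as the return type allows.
        return paragraphs or None


html = "<html><body><p>The cat sat on the mat.</p><p>Buy now!!!</p></body></html>"
result = ParagraphExtractor().extract_text(html, frozenset({"the", "on"}), "ENGLISH")
print(result)
```

The sketch returns `None` rather than an empty list when no paragraph survives filtering, which is one plausible reading of the `list[str] | None` return type. The language string `"ENGLISH"` is likewise an assumed value.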