# nemo_curator.stages.text.download.wikipedia.extract

## Module Contents

### Classes

| Name                                                                                            | Description                                                |
| ----------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
| [`WikipediaExtractor`](#nemo_curator-stages-text-download-wikipedia-extract-WikipediaExtractor) | Extractor for Wikipedia articles from MediaWiki XML dumps. |

### Data

[`CAT_ALIASES`](#nemo_curator-stages-text-download-wikipedia-extract-CAT_ALIASES)

[`MEDIA_ALIASES`](#nemo_curator-stages-text-download-wikipedia-extract-MEDIA_ALIASES)

### API

<Anchor id="nemo_curator-stages-text-download-wikipedia-extract-WikipediaExtractor">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor(
        language: str = 'en'
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** `DocumentExtractor`

  Extractor for Wikipedia articles from MediaWiki XML dumps.

  <Anchor id="nemo_curator-stages-text-download-wikipedia-extract-WikipediaExtractor-_create_filter_functions">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor._create_filter_functions(
          re_rm_wikilink: re.Pattern[str],
          re_clean_wikilink: re.Pattern[str]
      ) -> tuple
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Create the filter functions used for Wikipedia content processing from the compiled wikilink patterns `re_rm_wikilink` and `re_clean_wikilink`.
  </Indent>

  <Anchor id="nemo_curator-stages-text-download-wikipedia-extract-WikipediaExtractor-_create_filters">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor._create_filters() -> tuple[re.Pattern[str], re.Pattern[str], re.Pattern[str]]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Create the three compiled regex patterns used to filter Wikipedia content.
  </Indent>
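
This page does not show the patterns themselves. As a rough sketch only, the parameter names used elsewhere on this page (`re_rm_magic`, `re_rm_wikilink`, `re_clean_wikilink`) suggest one pattern for MediaWiki magic words and two for namespace-prefixed wikilinks. The helper `build_patterns`, the prefix lists, the regexes, and the order of the returned tuple are all assumptions made for illustration, not the library's implementation.

```python
import re


def build_patterns(media_prefixes, category_prefixes):
    """Hypothetical helper: return (magic-word, media-link, category-link) patterns."""
    # Behavior switches such as __NOTOC__ or __FORCETOC__.
    re_rm_magic = re.compile(r"__[A-Z]+__")

    # Link titles that start with a media namespace, e.g. "File:Cat.jpg".
    media = "|".join(re.escape(p) for p in media_prefixes)
    re_rm_wikilink = re.compile(rf"^(?:{media}):", re.IGNORECASE)

    # Link titles that start with a category namespace, e.g. "Category:Mammals".
    cats = "|".join(re.escape(p) for p in category_prefixes)
    re_clean_wikilink = re.compile(rf"^(?:{cats}):", re.IGNORECASE)

    return re_rm_magic, re_rm_wikilink, re_clean_wikilink


# Assumed English namespace prefixes, chosen for the example.
re_rm_magic, re_rm_wikilink, re_clean_wikilink = build_patterns(
    ["File", "Image", "Media"], ["Category"]
)

cleaned = re_rm_magic.sub("", "__NOTOC__Intro text")
```

Escaping each prefix with `re.escape` keeps aliases that contain regex metacharacters from changing the meaning of the pattern.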

  <Anchor id="nemo_curator-stages-text-download-wikipedia-extract-WikipediaExtractor-_process_sections">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor._process_sections(
          wikicode: typing.Any,
          re_rm_magic: re.Pattern[str],
          rm_wikilink: typing.Any,
          rm_tag: typing.Any,
          is_category: typing.Any,
          try_replace_obj: typing.Any,
          try_remove_obj: typing.Any
      ) -> str
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Process the sections of the Wikipedia article in `wikicode` and return the result as a string.
  </Indent>

  <Anchor id="nemo_curator-stages-text-download-wikipedia-extract-WikipediaExtractor-extract">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor.extract(
          record: dict[str, typing.Any]
      ) -> dict[str, typing.Any] | None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Extract and clean Wikipedia article content.

    **Parameters:**

    <ParamField path="record" type="dict[str, Any]">
      Record containing a `raw_content` field that holds the article's Wikipedia markup.
    </ParamField>

    **Returns:** `dict[str, Any] | None`

    A dict with the cleaned text and metadata, or `None` if extraction fails.
  </Indent>
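
Because `extract` can return `None`, callers need to skip failed records. The sketch below shows one way to do that. It runs against a hypothetical `StubExtractor` rather than the real class, so it works without NeMo Curator installed. The helper `extract_all` and the output key `text` are assumptions for illustration. Only the `raw_content` input field and the `dict | None` return contract come from this page.

```python
from typing import Any, Iterable, Iterator


def extract_all(extractor: Any, records: Iterable[dict]) -> Iterator[dict]:
    """Yield extracted documents, skipping records where extract() returns None."""
    for record in records:
        result = extractor.extract(record)
        if result is not None:
            yield result


class StubExtractor:
    """Hypothetical stand-in with the same extract() contract; NOT WikipediaExtractor."""

    def extract(self, record: dict) -> "dict | None":
        raw = record.get("raw_content")
        if not raw:
            # Mirror the documented contract: None signals a failed extraction.
            return None
        # "text" is an assumed output key, used only for this example.
        return {"text": raw.strip()}


records = [{"raw_content": " Hello "}, {"raw_content": ""}]
docs = list(extract_all(StubExtractor(), records))
```

Any object exposing `extract(record)` with the same contract, including a `WikipediaExtractor` instance, could be passed to `extract_all` in place of the stub.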

  <Anchor id="nemo_curator-stages-text-download-wikipedia-extract-WikipediaExtractor-input_columns">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor.input_columns() -> list[str]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Define the input columns expected by this extractor.
  </Indent>

  <Anchor id="nemo_curator-stages-text-download-wikipedia-extract-WikipediaExtractor-output_columns">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor.output_columns() -> list[str]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Define the output columns produced by this extractor.
  </Indent>
</Indent>
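
The column lists returned by `input_columns()` can be used to validate a record before it reaches `extract`. A minimal sketch follows. The helper `missing_columns` is hypothetical, and the `required` list is a placeholder: only `raw_content` is named on this page, so in practice the list would come from `extractor.input_columns()`.

```python
def missing_columns(required: list, record: dict) -> list:
    """Return the required column names that are absent from the record."""
    return [name for name in required if name not in record]


# Placeholder for extractor.input_columns(); the real list may contain more names.
required = ["raw_content"]

missing = missing_columns(required, {"title": "Example"})
ok = missing_columns(required, {"raw_content": "'''Example''' is ..."})
```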

<Anchor id="nemo_curator-stages-text-download-wikipedia-extract-CAT_ALIASES">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.stages.text.download.wikipedia.extract.CAT_ALIASES = {'ab': ['Категория', 'Акатегориа'], 'ace': ['Kawan', 'Kategori'], 'af': ['Katego...
    ```
  </CodeBlock>
</Anchor>

<Anchor id="nemo_curator-stages-text-download-wikipedia-extract-MEDIA_ALIASES">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.stages.text.download.wikipedia.extract.MEDIA_ALIASES = {'ab': ['Медиа', 'Файл', 'Афаил', 'Амедиа', 'Изображение'], 'ace': ['Beureukaih'...
    ```
  </CodeBlock>
</Anchor>
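
Both tables map a language code to a list of localized namespace names. As a sketch of how such a table might be consumed, the example below merges the aliases for one language with a default prefix and tests whether a link title falls in that namespace. The excerpt contains only entries visible in the truncated value above. The helpers `namespace_prefixes` and `is_in_namespace`, and the default prefix `"Category"`, are assumptions for illustration.

```python
import re

# Entries copied from the truncated value shown above; the real table is larger.
CAT_ALIASES_EXCERPT = {
    "ab": ["Категория", "Акатегориа"],
    "ace": ["Kawan", "Kategori"],
}


def namespace_prefixes(base, aliases, language):
    """Combine default prefixes with the aliases for one language, if any."""
    return list(base) + aliases.get(language, [])


def is_in_namespace(title, prefixes):
    """Return True if the title starts with '<prefix>:' for any given prefix."""
    alternation = "|".join(re.escape(p) for p in prefixes)
    pattern = re.compile(rf"^(?:{alternation}):", re.IGNORECASE)
    return pattern.match(title) is not None


cat_prefixes = namespace_prefixes(["Category"], CAT_ALIASES_EXCERPT, "ab")
fallback = namespace_prefixes(["Category"], CAT_ALIASES_EXCERPT, "xx")
```

Using `dict.get` with an empty-list default means a language with no entry in the table falls back to the default prefixes alone.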