nemo_curator.stages.text.download.wikipedia.extract

Module Contents

Classes

Name                  Description
WikipediaExtractor    Extractor for Wikipedia articles from MediaWiki XML dumps.

Data

CAT_ALIASES

MEDIA_ALIASES

API

class nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor(
language: str = 'en'
)

Bases: DocumentExtractor

Extractor for Wikipedia articles from MediaWiki XML dumps.

nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor._create_filter_functions(
re_rm_wikilink: re.Pattern[str],
re_clean_wikilink: re.Pattern[str]
) -> tuple

Create filter functions for Wikipedia content processing.

nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor._create_filters() -> tuple[re.Pattern[str], re.Pattern[str], re.Pattern[str]]

Create regex patterns for filtering Wikipedia content.
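
To make the idea of language-aware filter patterns concrete, here is a minimal sketch using only Python's standard re module. It is not the library's implementation: the function build_patterns, the sample alias tables, the exact expressions, and the order of the returned tuple are all assumptions made for illustration. Only the pattern names (re_rm_magic, re_rm_wikilink, re_clean_wikilink) are taken from the signatures on this page.

```python
import re

# Hypothetical, tiny alias tables used only for this illustration.
SAMPLE_MEDIA_ALIASES = {"ab": ["Медиа", "Файл"]}
SAMPLE_CAT_ALIASES = {"ab": ["Категория", "Акатегориа"]}


def build_patterns(language):
    """Return three compiled patterns (assumed order: magic, media, category)."""
    # Magic words such as __NOTOC__ or __TOC__.
    re_rm_magic = re.compile(r"__[A-Z]+__")

    # Link titles that start with a media namespace prefix, e.g. "File:".
    media = ["File", "Image", "Media", *SAMPLE_MEDIA_ALIASES.get(language, [])]
    media_alt = "|".join(re.escape(p) for p in media)
    re_rm_wikilink = re.compile(rf"^(?:{media_alt}):", re.IGNORECASE)

    # Link titles that start with a category namespace prefix.
    cats = ["Category", *SAMPLE_CAT_ALIASES.get(language, [])]
    cat_alt = "|".join(re.escape(p) for p in cats)
    re_clean_wikilink = re.compile(rf"^(?:{cat_alt}):", re.IGNORECASE)

    return re_rm_magic, re_rm_wikilink, re_clean_wikilink


magic, media_link, category_link = build_patterns("ab")
stripped = magic.sub("", "__NOTOC__Text")
```

Anchoring the namespace patterns with ^ keeps them from matching a colon that appears later in an ordinary article title.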

nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor._process_sections(
wikicode: typing.Any,
re_rm_magic: re.Pattern[str],
rm_wikilink: typing.Any,
rm_tag: typing.Any,
is_category: typing.Any,
try_replace_obj: typing.Any,
try_remove_obj: typing.Any
) -> str

Process sections of the Wikipedia article.

nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor.extract(
record: dict[str, typing.Any]
) -> dict[str, typing.Any] | None

Extract and clean Wikipedia article content.

Parameters:

record
dict[str, Any]

Record containing a raw_content field with Wikipedia markup.

Returns: dict[str, Any] | None

Dict with cleaned text and metadata, or None if extraction fails.
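
Because extract may return None, callers need to skip failed records. The sketch below illustrates that contract with a hypothetical stand-in function, extract_stub, so it runs without the library installed. The cleaning step (dropping bold and italic quote markers) and the "text" output key are assumptions for illustration only. The raw_content input field and the dict-or-None return shape come from the description above.

```python
from typing import Any, Optional


def extract_stub(record: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Hypothetical stand-in that mimics the dict-or-None contract."""
    raw = record.get("raw_content")
    # Treat missing, non-string, or blank markup as a failed extraction.
    if not isinstance(raw, str) or not raw.strip():
        return None
    # Toy cleaning step: drop MediaWiki bold (''') and italic ('') markers.
    text = raw.replace("'''", "").replace("''", "")
    return {"text": text.strip()}


records = [
    {"raw_content": "'''Paris''' is a city."},
    {"raw_content": ""},
    {"title": "no content"},
]

# Keep only the records for which extraction succeeded.
cleaned = [out for r in records if (out := extract_stub(r)) is not None]
```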

nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor.input_columns() -> list[str]

Define the input columns expected by this extractor.

nemo_curator.stages.text.download.wikipedia.extract.WikipediaExtractor.output_columns() -> list[str]

Define the output columns produced by this extractor.
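
One way to use declared columns is to validate a record before and after extraction. The following is a sketch under stated assumptions: StubExtractor is a hypothetical stand-in, its column names are placeholders rather than the values the real class returns, and check_columns is an illustrative helper that is not part of the library. It also assumes that the keys of the extracted dict are expected to fall within output_columns(), which this page does not state.

```python
class StubExtractor:
    """Hypothetical extractor exposing the same three method names."""

    def input_columns(self):
        return ["raw_content"]  # placeholder column name

    def output_columns(self):
        return ["text"]  # placeholder column name

    def extract(self, record):
        return {"text": record["raw_content"].strip()}


def check_columns(extractor, record):
    """Run extract() only if the declared inputs are present, then verify the outputs."""
    missing = [c for c in extractor.input_columns() if c not in record]
    if missing:
        raise KeyError(f"record is missing input columns: {missing}")
    out = extractor.extract(record)
    if out is not None:
        unexpected = sorted(set(out) - set(extractor.output_columns()))
        if unexpected:
            raise ValueError(f"undeclared output columns: {unexpected}")
    return out


result = check_columns(StubExtractor(), {"raw_content": "  hello  "})
```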

nemo_curator.stages.text.download.wikipedia.extract.CAT_ALIASES = {'ab': ['Категория', 'Акатегориа'], 'ace': ['Kawan', 'Kategori'], 'af': ['Katego...
nemo_curator.stages.text.download.wikipedia.extract.MEDIA_ALIASES = {'ab': ['Медиа', 'Файл', 'Афаил', 'Амедиа', 'Изображение'], 'ace': ['Beureukaih'...
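
Both constants map a language code to a list of localized namespace names. The sketch below shows one way such a mapping could be consulted, using only the two CAT_ALIASES entries visible above. The helper names are hypothetical, and always including the English "Category" prefix is an assumption, not documented behavior.

```python
# Sample copied from the two entries visible in the truncated value above.
CAT_ALIASES_SAMPLE = {
    "ab": ["Категория", "Акатегориа"],
    "ace": ["Kawan", "Kategori"],
}


def category_prefixes(language, aliases):
    """English default plus any localized aliases for the language (assumed rule)."""
    return ["Category", *aliases.get(language, [])]


def is_category_link(title, language, aliases):
    """True if the link title starts with a known category namespace prefix."""
    prefix, sep, _ = title.partition(":")
    if not sep:
        return False  # no namespace separator at all
    known = {p.casefold() for p in category_prefixes(language, aliases)}
    return prefix.strip().casefold() in known


ace_prefixes = category_prefixes("ace", CAT_ALIASES_SAMPLE)
```

Using dict.get with an empty-list default means a language without an entry falls back to the English prefix instead of raising KeyError.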