nemo_curator.stages.text.download.wikipedia.iterator

Module Contents

Classes

WikipediaIterator: Processes downloaded Wikipedia dump files and extracts article content.

API

class nemo_curator.stages.text.download.wikipedia.iterator.WikipediaIterator(
language: str = 'en',
log_frequency: int = 1000
)

Bases: DocumentIterator

Processes downloaded Wikipedia dump files and extracts article content.

_counter = 0
nemo_curator.stages.text.download.wikipedia.iterator.WikipediaIterator._extract_element_text(
elem: xml.etree.ElementTree.Element,
namespace: str,
tag: str
) -> str | None

Extract text from an XML element.

nemo_curator.stages.text.download.wikipedia.iterator.WikipediaIterator._get_article_content(
elem: xml.etree.ElementTree.Element,
namespace: str
) -> str | None

Extract raw content from Wikipedia article element.

nemo_curator.stages.text.download.wikipedia.iterator.WikipediaIterator._get_article_metadata(
elem: xml.etree.ElementTree.Element,
namespace: str
) -> dict[str, typing.Any] | None

Extract metadata from a Wikipedia article element.

nemo_curator.stages.text.download.wikipedia.iterator.WikipediaIterator._should_log_progress(
_: str
) -> bool

Check if progress should be logged based on counter.

nemo_curator.stages.text.download.wikipedia.iterator.WikipediaIterator._should_skip_article(
metadata: dict[str, typing.Any],
raw_content: str | None
) -> bool

Check if article should be skipped based on metadata and content.

nemo_curator.stages.text.download.wikipedia.iterator.WikipediaIterator.iterate(
file_path: str
) -> collections.abc.Iterator[dict[str, typing.Any]]

Process a Wikipedia dump file and extract article content.

Parameters:

file_path
str

Path to the downloaded .bz2 dump file.

nemo_curator.stages.text.download.wikipedia.iterator.WikipediaIterator.output_columns() -> list[str]

Define the output columns produced by this iterator.
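To illustrate the iteration contract documented above, the following is a minimal standalone sketch, not the NeMo Curator implementation: it stream-parses a .bz2 MediaWiki export and yields one dict per article, mirroring what iterate() is described as doing. The namespace URI (export schema version 0.10) and the ns == "0" main-namespace filter are assumptions based on the standard Wikipedia export format, and iterate_dump is a hypothetical helper name.

```python
import bz2
import xml.etree.ElementTree as ET

# Assumed MediaWiki export schema; real dumps embed the version in the XML.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iterate_dump(file_path):
    """Yield a {"title", "raw_content"} dict per main-namespace article."""
    with bz2.open(file_path, "rb") as f:
        # Stream-parse so the whole (multi-GB) dump is never held in memory.
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                page_ns = elem.findtext(NS + "ns")
                text = elem.findtext(f"{NS}revision/{NS}text")
                # Skip non-article namespaces and empty pages, the role
                # _should_skip_article plays in the class above.
                if page_ns == "0" and text:
                    yield {"title": title, "raw_content": text}
                elem.clear()  # release the parsed subtree to keep memory flat
```

Iterating lazily (a generator here, an Iterator[dict] in the class above) is what makes single-pass processing of full dump files practical.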