nemo_curator.stages.text.filters.histogram.histogram


Module Contents

Classes

Name | Description
HistogramFilter | Histogram filter used by the NLLB paper (https://arxiv.org/pdf/2207.04672). See p30 for details.

API

class nemo_curator.stages.text.filters.histogram.histogram.HistogramFilter(
lang: str | None = 'en',
threshold: float | None = 0.8,
cache_dir: str | None = '',
threshold_char: str | None = ']'
)

Bases: DocumentFilter

Histogram filter used by the NLLB paper (https://arxiv.org/pdf/2207.04672). See p30 for details.

At a high level, the histogram filter is a cheap form of language identification. It computes the ratio of characters in a data instance that appear in the character histograms collected from trusted data in the corresponding language. If the ratio is too low, a language ID mismatch is likely and the data instance should be discarded.

Written with reference to the original fairseq implementation at: https://github.com/facebookresearch/fairseq/blob/main/examples/m2m_100/process_data/clean_histogram.py.
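
The character-ratio idea described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the library's implementation: the function name char_in_histogram_ratio, the toy character set, and the sample strings are assumptions made for this example, whereas the real filter loads its histograms from downloaded files.

```python
from __future__ import annotations


def char_in_histogram_ratio(text: str, histogram: set[str]) -> float:
    """Return the fraction of characters in `text` found in `histogram`."""
    stripped = text.strip()
    if not stripped:
        # Avoid dividing by zero on empty or whitespace-only input.
        return 0.0
    hits = sum(1 for ch in stripped if ch in histogram)
    return hits / len(stripped)


# Toy "trusted" character set for English (hypothetical, for illustration).
toy_histogram = set("abcdefghijklmnopqrstuvwxyz ")

english_ratio = char_in_histogram_ratio("hello world", toy_histogram)
russian_ratio = char_in_histogram_ratio("привет мир", toy_histogram)

print(english_ratio)  # every character is in the toy set
print(russian_ratio)  # only the space is in the toy set
```

A text in the expected script scores close to 1, while a text in another script scores close to 0, which is what makes the ratio usable as a rough language ID signal.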

_cache_dir = cache_dir if cache_dir else user_cache_dir()
_name = 'histogram'

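
The _cache_dir expression relies on truthiness, so both the default empty string and None fall back to the user cache directory. A minimal sketch of that behavior, where resolve_cache_dir and the default_dir path are hypothetical stand-ins (the real code calls user_cache_dir()):

```python
from __future__ import annotations

from typing import Optional


def resolve_cache_dir(cache_dir: Optional[str], default_dir: str) -> str:
    # Same shape as the documented expression: any falsy value falls back.
    return cache_dir if cache_dir else default_dir


default_dir = "/home/user/.cache"  # stand-in for the value of user_cache_dir()

print(resolve_cache_dir("", default_dir))            # falls back to default_dir
print(resolve_cache_dir(None, default_dir))          # falls back to default_dir
print(resolve_cache_dir("/data/hist", default_dir))  # explicit path is kept
```
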
nemo_curator.stages.text.filters.histogram.histogram.HistogramFilter._download_histograms() -> None

Download and process histograms from default repo.

Raises:

  • requests.exceptions.RequestException: If download fails.

nemo_curator.stages.text.filters.histogram.histogram.HistogramFilter._read_hist() -> None

Load histogram files.

nemo_curator.stages.text.filters.histogram.histogram.HistogramFilter.keep_document(
score: float
) -> bool

nemo_curator.stages.text.filters.histogram.histogram.HistogramFilter.score_document(
text: str
) -> float

Compute the histogram token (character) ratio of a text data instance according to the loaded histogram.

Parameters:

text (str): Text data instance.

Returns: float

Ratio of tokens included in the histogram.
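
The signatures suggest a two-step flow: score_document produces a score, and keep_document turns that score into a keep or discard decision. The sketch below is a hypothetical stand-in, not the library class. It assumes the score is the raw ratio and that the decision is a comparison against threshold; the actual class may split this work differently between the two methods. ToyHistogramFilter and its character set are invented for illustration.

```python
from __future__ import annotations


class ToyHistogramFilter:
    """Hypothetical stand-in that mirrors the two documented method names."""

    def __init__(self, histogram: set[str], threshold: float = 0.8) -> None:
        self._histogram = histogram
        self._threshold = threshold

    def score_document(self, text: str) -> float:
        # Fraction of characters that appear in the histogram.
        if not text:
            return 0.0
        return sum(ch in self._histogram for ch in text) / len(text)

    def keep_document(self, score: float) -> bool:
        # Assumption: keep documents whose ratio exceeds the threshold.
        return score > self._threshold


toy = ToyHistogramFilter(set("abcdefghijklmnopqrstuvwxyz "))
docs = ["plain english text", "текст на русском", "12345 !!! ???"]

# Score each document, then decide whether to keep it.
kept = [d for d in docs if toy.keep_document(toy.score_document(d))]
print(kept)
```

Keeping scoring and the keep decision in separate methods lets a pipeline store the score alongside each document before any filtering is applied.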