nemo_curator.stages.text.filters.histogram.histogram
Module Contents
Classes
API
Bases: DocumentFilter
Histogram filter used by the NLLB paper (https://arxiv.org/pdf/2207.04672). See p. 30 of the paper for details.
At a high level, the histogram filter is a cheap form of language identification. It checks what ratio of the characters in a data instance appear in the character histograms collected from trusted data in the corresponding language. If the ratio is too low, there is a good chance of a language ID mismatch, and the data instance should be discarded.
Written with reference to the original fairseq implementation at: https://github.com/facebookresearch/fairseq/blob/main/examples/m2m_100/process_data/clean_histogram.py.
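The check described above can be illustrated with a minimal, self-contained sketch. It is not the NeMo Curator implementation: the function names, the toy character set, and the `0.8` threshold are all hypothetical and chosen only to show the idea of scoring a text by the share of its characters found in a trusted character set.

```python
from typing import AbstractSet


def char_in_histogram_ratio(text: str, histogram: AbstractSet[str]) -> float:
    """Return the fraction of characters in `text` that appear in `histogram`.

    Hypothetical helper for illustration; empty input is scored as 0.0.
    """
    if not text:
        return 0.0
    hits = sum(1 for ch in text if ch in histogram)
    return hits / len(text)


def keep_document(text: str, histogram: AbstractSet[str], threshold: float = 0.8) -> bool:
    """Keep the text only if its ratio exceeds the (assumed) threshold."""
    return char_in_histogram_ratio(text, histogram) > threshold


# Toy stand-in for a histogram collected from trusted English data.
english_chars = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,!?'")

english_ratio = char_in_histogram_ratio("Hello, world!", english_chars)
russian_ratio = char_in_histogram_ratio("Привет, мир!", english_chars)

print(english_ratio, keep_document("Hello, world!", english_chars))
print(russian_ratio, keep_document("Привет, мир!", english_chars))
```

In this toy setup the English sentence is made up entirely of characters from the set, while the Russian sentence matches only on punctuation and the space, so it falls below the cutoff and would be discarded as a likely language mismatch.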
Download and process the histograms from the default repository.
Raises:
requests.exceptions.RequestException: If download fails.
Load histogram files.
Compute histogram token ratio of a text data instance according to the loaded histogram.
Parameters:
Text data instance.
Returns: float
Ratio of tokens included in the histogram.