Text Normalization

Text normalization converts written text into its verbalized (spoken) form. It targets tokens that belong to special semiotic classes, such as numbers, times, dates, and monetary amounts, which are often written differently from the way they are spoken. For example, 10:00 -> ten o’clock, 10:00 a.m. -> ten a m, 10kg -> ten kilograms. Text normalization is used as a preprocessing step before text-to-speech (TTS) synthesis. It can also be used to preprocess training transcripts for automatic speech recognition (ASR).

This tool supports prediction on text files and evaluation on the Google text normalization dataset [TOOLS-NORM1], [TOOLS-NORM2], [TOOLS-NORM3]. On the output-00001-of-00100 file of that dataset, it reaches 81% sentence accuracy and 97.4% token accuracy.
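The gap between the sentence and token accuracy figures above is easier to see with a small example. The sketch below assumes the common definitions (a token counts as correct when its predicted verbalization matches the reference exactly, and a sentence counts as correct only when all of its tokens do); the function names and sample data are hypothetical, and the tool's own scoring may differ in details.

```python
from typing import List

Sentence = List[str]  # one verbalized string per input token


def token_accuracy(predicted: List[Sentence], reference: List[Sentence]) -> float:
    """Fraction of tokens whose predicted verbalization equals the reference."""
    correct = 0
    total = 0
    for pred_sent, ref_sent in zip(predicted, reference):
        for pred_tok, ref_tok in zip(pred_sent, ref_sent):
            correct += int(pred_tok == ref_tok)
            total += 1
    return correct / total if total else 0.0


def sentence_accuracy(predicted: List[Sentence], reference: List[Sentence]) -> float:
    """Fraction of sentences in which every token is verbalized correctly."""
    if not reference:
        return 0.0
    correct = sum(int(p == r) for p, r in zip(predicted, reference))
    return correct / len(reference)


# Hypothetical data: one wrong token out of five, in the first of two sentences.
reference = [["the", "alarm", "at", "ten thirty a m"], ["ten kilograms"]]
predicted = [["the", "alarm", "at", "ten thirty"], ["ten kilograms"]]

print(token_accuracy(predicted, reference))     # 0.8
print(sentence_accuracy(predicted, reference))  # 0.5
```

A single wrong token makes the whole sentence count as wrong, which is why sentence accuracy is typically much lower than token accuracy.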

Note

We recommend you try the tutorial NeMo/tutorials/tools/Text_Normalization_Tutorial.ipynb.

We use the same semiotic classes as the Google text normalization dataset: PLAIN, PUNCT, DATE, CARDINAL, LETTERS, VERBATIM, MEASURE, DECIMAL, ORDINAL, DIGIT, MONEY, TELEPHONE, ELECTRONIC, FRACTION, TIME, ADDRESS. We also add the class WHITELIST for whitelisted tokens, whose verbalizations are looked up directly from a user-defined list.
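The idea behind the WHITELIST class can be illustrated with a plain dictionary lookup. This is only a sketch of the concept: the entries, the case-insensitive matching, and the function name are assumptions made for illustration, not the tool's actual whitelist or file format.

```python
from typing import Dict, List

# Hypothetical user-defined list: written form -> verbalization.
WHITELIST: Dict[str, str] = {
    "dr.": "doctor",
    "mr.": "mister",
    "e.g.": "for example",
}


def normalize_whitelisted(tokens: List[str]) -> List[str]:
    """Replace whitelisted tokens with their verbalization; leave others as-is."""
    return [WHITELIST.get(token.lower(), token) for token in tokens]


print(normalize_whitelisted(["Dr.", "Smith", "arrived"]))
# ['doctor', 'Smith', 'arrived']
```

Because the lookup is direct, a whitelisted token bypasses any rule-based analysis, which makes the list a convenient place for abbreviations and other fixed exceptions.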

The NeMo rule-based system is divided into a tagger and a verbalizer. The tagger detects and classifies semiotic classes in the input text; the verbalizer takes the tagger's output and carries out the normalization. In the example The alarm goes off at 10:30 a.m., the time tagger detects 10:30 a.m. as valid time data with hour=10, minutes=30, suffix=a.m., and the verbalizer then turns this into ten thirty a m.
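The tagger/verbalizer split described above can be sketched for the TIME class as follows. This is a toy illustration, not the NeMo implementation: a regular expression stands in for the real grammar rules, and every name here (tag_time, verbalize_time, number_to_words) is hypothetical.

```python
import re
from typing import Dict, Optional

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}

# Hours 0-23, minutes 00-59, optional "a.m." / "p.m." suffix.
TIME_PATTERN = re.compile(r"\b([01]?\d|2[0-3]):([0-5]\d)(?:\s*([ap])\.m\.)?")


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0-59."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def tag_time(text: str) -> Optional[Dict[str, str]]:
    """Tagger: detect the first time expression and return its fields."""
    match = TIME_PATTERN.search(text)
    if match is None:
        return None
    hour, minutes, half = match.groups()
    tagged = {"hour": hour, "minutes": minutes}
    if half:
        tagged["suffix"] = f"{half}.m."
    return tagged


def verbalize_time(tagged: Dict[str, str]) -> str:
    """Verbalizer: turn the tagged fields into spoken form."""
    words = [number_to_words(int(tagged["hour"]))]
    minutes = int(tagged["minutes"])
    if minutes == 0:
        if "suffix" not in tagged:
            words.append("o'clock")
    elif minutes < 10:
        words.append("oh " + number_to_words(minutes))
    else:
        words.append(number_to_words(minutes))
    if "suffix" in tagged:
        # "a.m." is spoken letter by letter: "a m"
        words.extend(tagged["suffix"].replace(".", " ").split())
    return " ".join(words)


tagged = tag_time("The alarm goes off at 10:30 a.m.")
print(tagged)                  # {'hour': '10', 'minutes': '30', 'suffix': 'a.m.'}
print(verbalize_time(tagged))  # ten thirty a m
```

Keeping detection and verbalization separate means each stage can be inspected on its own: the tagged fields show what was recognized before any wording decisions are made.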

The system is designed to be easy to debug and to extend with more rules. We provide a set of rules that covers the majority of semiotic classes found in the Google text normalization dataset for English. As in every language, there is a long tail of special cases that these rules may not cover.

Prediction

Example prediction run:

python run_prediction.py --input=<INPUT_TEXT_FILE> --output=<OUTPUT_PATH>

Evaluation

Example evaluation run:

python run_evaluation.py --input=./en_with_types/output-00001-of-00100 [--cat CLASS_CATEGORY]

References

TOOLS-NORM1

Peter Ebden and Richard Sproat. The Kestrel TTS text normalization system. Natural Language Engineering, 21(3):333, 2015.

TOOLS-NORM2

Richard Sproat and Navdeep Jaitly. RNN approaches to text normalization: a challenge. arXiv preprint arXiv:1611.00068, 2016.

TOOLS-NORM3

Paul Taylor. Text-to-speech synthesis. Cambridge University Press, 2009.