Text Normalization

NeMo Text Normalization converts text from written form into its verbalized form. It is used as a preprocessing step before Text to Speech (TTS). It can also be used to preprocess Automatic Speech Recognition (ASR) training transcripts.

For example, “at 10:00” -> “at ten o’clock” and “it weighs 10kg.” -> “it weighs ten kilograms .”.

NeMo Text Normalization [TEXTPROCESSING-NORM5] is based on WFST grammars [TEXTPROCESSING-NORM3]. We also provide a deployment route to C++ using Sparrowhawk [TEXTPROCESSING-NORM2] – an open-source version of Google Kestrel [TEXTPROCESSING-NORM1]. See Text Processing Deployment for details.

Classes

The base class for every grammar is GraphFst. This tool is designed as a two-stage application: 1. classification of the input into semiotic tokens and 2. verbalization into spoken form. For every stage and every semiotic token class there is a corresponding grammar, e.g. taggers.CardinalFst and verbalizers.CardinalFst. Together, they compose the final grammars ClassifyFst and VerbalizeFinalFst that are compiled into WFSTs and used for inference.
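The two stages can be illustrated with a toy, pure-Python sketch. The real grammars are compiled WFSTs, not Python functions, and the `tag`, `verbalize`, and `verbalize_cardinal` names below are invented for illustration; only cardinals 0–99 are handled here.

```python
import re

# Stage 1 (tagger): classify each input chunk into a semiotic token class.
# This toy recognizes only one class, "cardinal"; everything else is "name".
def tag(text):
    tokens = []
    for word in text.split():
        if re.fullmatch(r"\d+", word):
            tokens.append(("cardinal", word))
        else:
            tokens.append(("name", word))
    return tokens

# Toy cardinal verbalizer for 0-99 (the real CardinalFst covers far more).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def verbalize_cardinal(digits):
    n = int(digits)
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])

# Stage 2 (verbalizer): render each tagged token in its spoken form.
def verbalize(tokens):
    out = []
    for cls, value in tokens:
        out.append(verbalize_cardinal(value) if cls == "cardinal" else value)
    return " ".join(out)

print(verbalize(tag("it weighs 42 kilograms")))  # it weighs forty two kilograms
```

In the library itself, both stages are class-specific grammars (taggers.CardinalFst, verbalizers.CardinalFst, and so on) that are composed into ClassifyFst and VerbalizeFinalFst rather than chained Python calls.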

class nemo_text_processing.text_normalization.ClassifyFst(input_case: str, deterministic: bool = True)[source]

Bases: nemo_text_processing.text_normalization.graph_utils.GraphFst

Final class that composes all other classification grammars. This class can process an entire sentence including punctuation. For deployment, this grammar will be compiled and exported to an OpenFst Finite State Archive (FAR) file. See NeMo/tools/text_processing_deployment for deployment details.

Parameters
  • input_case – accepting either “lower_cased” or “cased” input.

  • deterministic – if True, provides a single transduction option; if False, provides multiple options (used for audio-based normalization)

class nemo_text_processing.text_normalization.VerbalizeFinalFst(deterministic: bool = True)[source]

Bases: nemo_text_processing.text_normalization.graph_utils.GraphFst

Finite state transducer that verbalizes an entire sentence, e.g. tokens { name: “its” } tokens { time { hours: “twelve” minutes: “thirty” } } tokens { name: “now” } tokens { name: “.” } -> its twelve thirty now .
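One way to read the serialized token string above: each quoted value is a field of a classified token, and the verbalized sentence is those values in order. The parser below is a hypothetical simplification for illustration only, not the NeMo or Sparrowhawk implementation, which applies a class-specific grammar to each token.

```python
import re

def verbalize_tokens(serialized):
    """Toy reading of the protobuf-like token string: join all quoted
    field values in order of appearance."""
    return " ".join(re.findall(r'"([^"]*)"', serialized))

s = ('tokens { name: "its" } '
     'tokens { time { hours: "twelve" minutes: "thirty" } } '
     'tokens { name: "now" } tokens { name: "." }')
print(verbalize_tokens(s))  # its twelve thirty now .
```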

Parameters

deterministic – if True, provides a single transduction option; if False, provides multiple options (used for audio-based normalization)

Prediction

Example prediction run:

python run_prediction.py  <--input INPUT_TEXT_FILE> <--output OUTPUT_PATH> [--input_case INPUT_CASE]

INPUT_CASE specifies whether to treat the input as lower-cased or case-sensitive. By default, the input is treated as cased, since this is more informative, especially for abbreviations. Punctuation marks that follow semiotic tokens are output with separating spaces, e.g. “I see, it is 10:00…” -> “I see, it is ten o’clock . . .”. Inner-sentence whitespace characters in the input are not preserved.
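The punctuation-spacing convention can be illustrated with a small sketch. This is a toy rendering of the documented behavior, not the NeMo code; `space_punctuation` is a hypothetical helper that spaces out punctuation trailing a verbalized semiotic token.

```python
import re

def space_punctuation(verbalized, trailing_punct):
    # Toy illustration of the documented output convention: punctuation
    # following a verbalized semiotic token is written out with separating
    # spaces, and runs of whitespace are collapsed.
    spaced = " ".join(trailing_punct)  # "..." -> ". . ."
    return re.sub(r"\s+", " ", f"{verbalized} {spaced}").strip()

print(space_punctuation("ten o'clock", "..."))  # ten o'clock . . .
```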

Evaluation

Example evaluation run on Google’s text normalization dataset [TEXTPROCESSING-NORM4]:

python run_evaluation.py  --input=./en_with_types/output-00001-of-00100 [--cat CLASS_CATEGORY] [--input_case INPUT_CASE]

References

TEXTPROCESSING-NORM1

Peter Ebden and Richard Sproat. The Kestrel TTS text normalization system. Natural Language Engineering, 21(3):333, 2015.

TEXTPROCESSING-NORM2

Alexander Gutkin, Linne Ha, Martin Jansche, Knot Pipatsrisawat, and Richard Sproat. TTS for low resource languages: a Bangla synthesizer. In 10th Language Resources and Evaluation Conference. 2016.

TEXTPROCESSING-NORM3

Mehryar Mohri. Weighted Automata Algorithms, pages 213–254. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009. doi:10.1007/978-3-642-01492-5_6.

TEXTPROCESSING-NORM4

Richard Sproat and Navdeep Jaitly. RNN approaches to text normalization: a challenge. arXiv preprint arXiv:1611.00068, 2016.

TEXTPROCESSING-NORM5

Yang Zhang, Evelina Bakhturina, Kyle Gorman, and Boris Ginsburg. NeMo inverse text normalization: from development to production. 2021. arXiv:2104.05055.