Text Normalization

NeMo Text Normalization converts text from its written form into its verbalized form. It is used as a preprocessing step before Text to Speech (TTS). It can also be used to preprocess Automatic Speech Recognition (ASR) training transcripts.

For example, “at 10:00” -> “at ten o’clock” and “it weighs 10kg.” -> “it weighs ten kilograms .”.

NeMo Text Normalization [TEXTPROCESSING-NORM5] is based on weighted finite-state transducer (WFST) grammars [TEXTPROCESSING-NORM3]. We also provide a deployment route to C++ using Sparrowhawk [TEXTPROCESSING-NORM2], an open-source version of Google Kestrel [TEXTPROCESSING-NORM1]. See Text Processing Deployment for details.


The base class for every grammar is GraphFst. The tool is designed as a two-stage application: 1. classification of the input into semiotic tokens and 2. verbalization of those tokens into spoken form. For every stage and every semiotic token class there is a corresponding grammar, e.g. taggers.CardinalFst and verbalizers.CardinalFst. Together, they compose the final grammars ClassifyFst and VerbalizeFinalFst, which are compiled into WFSTs and used for inference.
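The two-stage design can be illustrated with a toy sketch in plain Python. This is not how NeMo implements it: the real grammars are WFSTs, and the names `classify`, `verbalize`, and `normalize` below, as well as the tiny vocabulary, are hypothetical and exist only for this illustration. The sketch shows how tagging into semiotic tokens is kept separate from turning those tokens into words.

```python
import re

# Hypothetical toy vocabulary; the real grammars cover far more cases.
_NUMBERS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve"]
_UNITS = {"kg": "kilograms", "km": "kilometers"}


def classify(text):
    """Stage 1: tag each whitespace-separated piece with a semiotic class."""
    tokens = []
    for piece in text.split():
        time = re.fullmatch(r"(\d{1,2}):00", piece)
        measure = re.fullmatch(r"(\d{1,2})(kg|km)", piece)
        if time and int(time.group(1)) <= 12:
            tokens.append({"time": {"hours": int(time.group(1))}})
        elif measure and int(measure.group(1)) <= 12:
            tokens.append({"measure": {"cardinal": int(measure.group(1)),
                                       "units": measure.group(2)}})
        else:
            # Anything unrecognized is passed through as a plain word.
            tokens.append({"name": piece})
    return tokens


def verbalize(tokens):
    """Stage 2: turn each tagged token into its spoken form."""
    words = []
    for token in tokens:
        if "time" in token:
            words.append(_NUMBERS[token["time"]["hours"]] + " o'clock")
        elif "measure" in token:
            fields = token["measure"]
            words.append(_NUMBERS[fields["cardinal"]] + " " + _UNITS[fields["units"]])
        else:
            words.append(token["name"])
    return " ".join(words)


def normalize(text):
    """Run both stages in sequence."""
    return verbalize(classify(text))
```

Keeping the stages separate means a new semiotic class needs one tagger and one verbalizer, without touching the others.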

class nemo_text_processing.text_normalization.ClassifyFst(input_case: str, deterministic: bool = True)[source]

Bases: nemo_text_processing.text_normalization.graph_utils.GraphFst

Final class that composes all other classification grammars. This class can process an entire sentence, including punctuation. For deployment, this grammar is compiled and exported to an OpenFst Finite State Archive (FAR) file. See NeMo/tools/text_processing_deployment for more details on deployment.

  • input_case – the casing of the input, either “lower_cased” or “cased”.

  • deterministic – if True, provides a single transduction option; if False, provides multiple options (used for audio-based normalization).

class nemo_text_processing.text_normalization.VerbalizeFinalFst(deterministic: bool = True)[source]

Bases: nemo_text_processing.text_normalization.graph_utils.GraphFst

Finite state transducer that verbalizes an entire sentence, e.g. tokens { name: “its” } tokens { time { hours: “twelve” minutes: “thirty” } } tokens { name: “now” } tokens { name: “.” } -> its twelve thirty now .


  • deterministic – if True, provides a single transduction option; if False, provides multiple options (used for audio-based normalization).
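To make the input format of the verbalization stage concrete, here is a minimal sketch that flattens the serialized token string from the example above by reading its quoted field values in order. It is a deliberate simplification and not the VerbalizeFinalFst grammar: real verbalizers may reorder fields or insert words. The function name `flatten_tokens` is hypothetical.

```python
import re


def flatten_tokens(serialized):
    """Join the quoted field values of a serialized token string with spaces."""
    # Each field looks like `name: "value"`; nested braces are ignored here.
    values = re.findall(r'[a-z_]+:\s*"([^"]*)"', serialized)
    return " ".join(values)


tagged = ('tokens { name: "its" } '
          'tokens { time { hours: "twelve" minutes: "thirty" } } '
          'tokens { name: "now" } tokens { name: "." }')
print(flatten_tokens(tagged))  # its twelve thirty now .
```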


Example prediction run:

python run_prediction.py  <--input INPUT_TEXT_FILE> <--output OUTPUT_PATH> [--input_case INPUT_CASE]

INPUT_CASE specifies whether to treat the input as lower-cased or case-sensitive. By default, the input is treated as cased, since casing is more informative, especially for abbreviations. Punctuation marks are output with separating spaces after semiotic tokens, e.g. “I see, it is 10:00…” -> “I see, it is ten o’clock . . .”. Inner-sentence whitespace characters in the input are not preserved.
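When the prediction script is driven from another Python program, the command line can be assembled programmatically. The sketch below uses only the flags shown above and the two input_case values documented for ClassifyFst. The helper name `build_prediction_command` and the file names are hypothetical, and the sketch assumes run_prediction.py is in the current working directory.

```python
import shlex

# The two casing modes documented for ClassifyFst.
VALID_INPUT_CASES = ("lower_cased", "cased")


def build_prediction_command(input_file, output_path, input_case=None):
    """Return the argument list for a run_prediction.py call."""
    command = ["python", "run_prediction.py",
               "--input", input_file, "--output", output_path]
    if input_case is not None:
        if input_case not in VALID_INPUT_CASES:
            raise ValueError(f"input_case must be one of {VALID_INPUT_CASES}")
        command += ["--input_case", input_case]
    return command


command = build_prediction_command("sentences.txt", "normalized.txt", "lower_cased")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Omitting input_case leaves the choice to the script's default.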


Example evaluation run on Google’s text normalization dataset [TEXTPROCESSING-NORM4]:

python run_evaluation.py  --input=./en_with_types/output-00001-of-00100 [--cat CLASS_CATEGORY] [--input_case INPUT_CASE]
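As a rough illustration of what a per-class evaluation involves, the sketch below computes token-level accuracy for each semiotic class. It is not the logic of run_evaluation.py. It assumes the dataset is tab-separated with one token per line (class, written form, spoken form), that sentence separator lines such as `<eos>` have fewer than three fields, and that `<self>` marks a token left unchanged. Verify these assumptions against the files you download. The names `load_tokens` and `per_class_accuracy` are hypothetical.

```python
from collections import defaultdict


def load_tokens(lines):
    """Yield (semiotic_class, written, spoken) triples from dataset lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            # Assumed sentence separators ("<eos>" lines) are skipped.
            continue
        semiotic_class, written, spoken = parts
        if spoken == "<self>":
            # Assumed marker for tokens whose spoken form equals the written form.
            spoken = written
        yield semiotic_class, written, spoken


def per_class_accuracy(triples, normalize_fn, category=None):
    """Return {class: exact-match accuracy}, optionally for a single class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for semiotic_class, written, spoken in triples:
        if category is not None and semiotic_class != category:
            continue
        total[semiotic_class] += 1
        if normalize_fn(written) == spoken:
            correct[semiotic_class] += 1
    return {name: correct[name] / total[name] for name in total}
```

Here `normalize_fn` stands for any callable that maps a written token to its predicted spoken form, and `category` plays the role of the optional class filter.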



[TEXTPROCESSING-NORM1] Peter Ebden and Richard Sproat. The Kestrel TTS text normalization system. Natural Language Engineering, 21(3):333, 2015.


[TEXTPROCESSING-NORM2] Alexander Gutkin, Linne Ha, Martin Jansche, Knot Pipatsrisawat, and Richard Sproat. TTS for low resource languages: a Bangla synthesizer. In 10th Language Resources and Evaluation Conference. 2016.


[TEXTPROCESSING-NORM3] Mehryar Mohri. Weighted Automata Algorithms, pages 213–254. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009. URL: https://doi.org/10.1007/978-3-642-01492-5_6, doi:10.1007/978-3-642-01492-5_6.


[TEXTPROCESSING-NORM4] Richard Sproat and Navdeep Jaitly. RNN approaches to text normalization: a challenge. arXiv preprint arXiv:1611.00068, 2016.


[TEXTPROCESSING-NORM5] Yang Zhang, Evelina Bakhturina, Kyle Gorman, and Boris Ginsburg. NeMo inverse text normalization: from development to production. 2021. arXiv:2104.05055.