Thutmose Tagger: Single-pass Tagger-based ITN Model#

Inverse text normalization (ITN) converts text from the spoken domain (e.g., ASR output) into its written form:

Input: on may third we paid one hundred and twenty three dollars
Output: on may 3 we paid $123

ThutmoseTaggerModel is a single-pass tagger-based model that maps spoken-domain words to written-domain fragments. The model also predicts “semiotic” classes for the spoken words (e.g., whether a word belongs to a span that expresses a time, a date, or a monetary amount).

The typical workflow is to first prepare the dataset, which requires finding granular alignments between spoken-domain words and written-domain fragments; an example bash script for the data preparation pipeline is provided. After getting the dataset, you can train the model with the provided example training script. A script for inference from a raw text file and an example bash script that runs inference and evaluation are also provided.

Quick Start Guide#

To run the pretrained models see Model Inference.

Available models#

Pretrained Models#


Pretrained Checkpoint


Initial Data#

The dataset is prepared from the Google Text Normalization Dataset. It is stored in tab-separated files (.tsv) with three columns. The first column is the “semiotic class” (e.g., numbers, times, dates), the second is the token in written form, and the third is the spoken form. An example sentence from the dataset is shown below. In the example, <self> denotes that the spoken form is the same as the written form.

PLAIN       The     <self>
PLAIN       company <self>
PLAIN       revenues        <self>
PLAIN       grew    <self>
PLAIN       four    <self>
PLAIN       fold    <self>
PLAIN       between <self>
DATE        2005    two thousand five
PLAIN       and     <self>
DATE        2008    two thousand eight
PUNCT       .       <self>
<eos>       <eos>

More information about the Google Text Normalization Dataset can be found in the paper RNN Approaches to Text Normalization: A Challenge [NLP-TEXTNORM3].
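As a minimal sketch of how this three-column format can be read, the snippet below groups token rows into sentences and rebuilds the spoken and written sides. It assumes that columns are separated by a single tab and that an <eos> row closes each sentence, as in the example above; the function names `read_sentences` and `to_pair` are hypothetical and not part of NeMo.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str, str]  # (semiotic class, written form, spoken form)


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Group tab-separated token rows into sentences, split on <eos> rows."""
    sentence: List[Token] = []
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if fields[0] == "<eos>":
            if sentence:
                yield sentence
            sentence = []
            continue
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        semiotic_class, written, spoken = fields
        if spoken == "<self>":
            spoken = written  # spoken form is identical to the written form
        sentence.append((semiotic_class, written, spoken))
    if sentence:
        yield sentence  # last sentence if the file lacks a final <eos>


def to_pair(sentence: List[Token]) -> Tuple[str, str]:
    """Return (spoken sentence, written sentence) for one parsed sentence."""
    written = " ".join(token[1] for token in sentence)
    spoken = " ".join(token[2] for token in sentence)
    return spoken, written


# The example sentence from this section, with explicit tab separators.
sample = [
    "PLAIN\tThe\t<self>",
    "PLAIN\tcompany\t<self>",
    "PLAIN\trevenues\t<self>",
    "PLAIN\tgrew\t<self>",
    "PLAIN\tfour\t<self>",
    "PLAIN\tfold\t<self>",
    "PLAIN\tbetween\t<self>",
    "DATE\t2005\ttwo thousand five",
    "PLAIN\tand\t<self>",
    "DATE\t2008\ttwo thousand eight",
    "PUNCT\t.\t<self>",
    "<eos>\t<eos>",
]

sentences = list(read_sentences(sample))
spoken, written = to_pair(sentences[0])
print(spoken)
print(written)
```

Note that the phrase-level pairing is visible here: the whole span "two thousand five" maps to "2005", with no word-by-word alignment inside the span.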

Data preprocessing#

Our preprocessing is rather complicated, because we need to find granular alignments for semiotic spans that are aligned only at the phrase level in the Google Text Normalization Dataset. Right now we only provide data preparation scripts for English and Russian. Data preparation includes running the GIZA++ automatic alignment tool, which must be installed first. The purpose of the preprocessing scripts is to build the training dataset for the tagging model. The final dataset has a simple 3-column TSV format: 1) the input sentence, 2) the tags for the input words, 3) the coordinates of “semiotic” spans, if any. For example:

this plan was first enacted in nineteen eighty four and continued to be followed for nineteen years    <SELF> <SELF> <SELF> <SELF> <SELF> <SELF> _19 8 4_ <SELF> <SELF> <SELF> <SELF> <SELF> <SELF> _19_ <SELF>    DATE 6 9;CARDINAL 15 16
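The following sketch shows one way to read such a line and turn the tags back into written text. Several things here are assumptions rather than documented behaviour: the span coordinates are treated as half-open word indices (inferred from the example, where `DATE 6 9` covers the three words "nineteen eighty four"), an underscore in a tag is treated as a fragment boundary so that adjacent fragments without underscores between them are glued together, and the names `parse_example` and `realize` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class TaggedExample(NamedTuple):
    words: List[str]
    tags: List[str]
    spans: List[Tuple[str, int, int]]  # (semiotic class, start, end)


def parse_example(line: str) -> TaggedExample:
    """Parse one line of the 3-column training file."""
    fields = line.rstrip("\n").split("\t")
    words = fields[0].split(" ")
    tags = fields[1].split(" ")
    if len(words) != len(tags):
        raise ValueError("expected exactly one tag per input word")
    spans = []
    span_column = fields[2] if len(fields) > 2 else ""
    for item in filter(None, span_column.split(";")):
        name, start, end = item.split(" ")
        spans.append((name, int(start), int(end)))
    return TaggedExample(words, tags, spans)


def realize(words: List[str], tags: List[str]) -> str:
    """Build written text from tags (assumed convention: '_' marks a boundary)."""
    pieces = []
    for word, tag in zip(words, tags):
        if tag == "<SELF>":
            pieces.append("_" + word + "_")  # keep the word as its own token
        elif tag != "<DELETE>":
            pieces.append(tag)  # written-domain fragment such as "_19" or "4_"
    # Glue fragments, then turn boundaries into single spaces.
    return " ".join("".join(pieces).replace("_", " ").split())


line = (
    "this plan was first enacted in nineteen eighty four and continued "
    "to be followed for nineteen years\t"
    "<SELF> <SELF> <SELF> <SELF> <SELF> <SELF> _19 8 4_ <SELF> <SELF> "
    "<SELF> <SELF> <SELF> <SELF> _19_ <SELF>\t"
    "DATE 6 9;CARDINAL 15 16"
)

example = parse_example(line)
for name, start, end in example.spans:
    print(name, example.words[start:end])
print(realize(example.words, example.tags))
```

This simple realization would mishandle input words that themselves contain underscores; it is meant only to make the tag format concrete.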

Model Training#

An example training script is provided. The config file used by default is thutmose_tagger_itn_config.yaml. You can change any parameter directly in the config file or override it with command-line arguments.

Most arguments in the example config file are self-explanatory. We have set most of the hyper-parameters to values that we found to be effective for the English and Russian subsets of the Google TN dataset. Some arguments that you may want to modify are:

  • lang: The language of the dataset.

  • data.train_ds.data_path: The path to the training file.

  • data.validation_ds.data_path: The path to the validation file.

  • model.language_model.pretrained_model_name: The Hugging Face transformer model used to initialize the model weights.

  • model.label_map: The path/…/label_map.txt. This is the dictionary of possible output tags that the model may produce.

  • model.semiotic_classes: The path/to/…/semiotic_classes.txt. This is the list of possible semiotic classes.

Example of a training command:

python examples/nlp/text_normalization_as_tagging/ \
    lang=en \
    data.validation_ds.data_path=<PATH_TO_DATASET_DIR>/valid.tsv \
    data.train_ds.data_path=<PATH_TO_DATASET_DIR>/train.tsv \
    model.language_model.pretrained_model_name=bert-base-uncased \
    model.label_map=<PATH_TO_DATASET_DIR>/label_map.txt \
    model.semiotic_classes=<PATH_TO_DATASET_DIR>/semiotic_classes.txt
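
The key=value arguments in the command above address nested config fields by dotted path. The sketch below illustrates that mapping on a plain dictionary. It is not how the training script parses its arguments (that is handled by the configuration framework the script uses); the function `apply_overrides`, the default values, and the paths are all hypothetical.

```python
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Set nested keys from 'a.b.c=value' strings; values are kept as strings."""
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for key in parents:
            node = node.setdefault(key, {})  # create missing sections on the way
        node[leaf] = value
    return config


# Hypothetical defaults standing in for the values in the YAML config file.
config = {
    "lang": "en",
    "data": {"train_ds": {"data_path": None}, "validation_ds": {"data_path": None}},
}

apply_overrides(
    config,
    [
        "lang=ru",
        "data.train_ds.data_path=/data/ru/train.tsv",  # hypothetical path
        "model.label_map=/data/ru/label_map.txt",  # hypothetical path
    ],
)
print(config)
```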

Model Inference#

Run the inference:

python examples/nlp/text_normalization_as_tagging/ \
    pretrained_model=itn_en_thutmose_bert \
    inference.from_file=./test_sent.txt

The output tsv file consists of 5 columns:

  • Final output text - it is generated from predicted tags after some simple post-processing.

  • Input text.

  • Sequence of predicted tags - one tag for each input word.

  • Sequence of tags after post-processing (some swaps may be applied).

  • Sequence of predicted semiotic classes - one class for each input word.
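
A minimal sketch for loading such an output file into named fields is shown below. It assumes the columns appear in the order listed above and that tags and classes within a column are separated by single spaces. The class `InferenceRow`, the function `read_inference_output`, and the sample row are hypothetical; the row is written by hand for illustration and is not actual model output.

```python
import io
from typing import Iterable, List, NamedTuple


class InferenceRow(NamedTuple):
    output_text: str
    input_text: str
    predicted_tags: List[str]
    postprocessed_tags: List[str]
    semiotic_classes: List[str]


def read_inference_output(lines: Iterable[str]) -> List[InferenceRow]:
    """Parse 5-column tab-separated rows; rows with another width are skipped."""
    rows = []
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue
        output_text, input_text, tags, post_tags, classes = fields
        rows.append(
            InferenceRow(
                output_text,
                input_text,
                tags.split(" "),
                post_tags.split(" "),
                classes.split(" "),
            )
        )
    return rows


# Hand-written sample row (not real model output).
input_text = (
    "this plan was first enacted in nineteen eighty four and continued "
    "to be followed for nineteen years"
)
output_text = "this plan was first enacted in 1984 and continued to be followed for 19 years"
tags = " ".join(["<SELF>"] * 6 + ["_19", "8", "4_"] + ["<SELF>"] * 6 + ["_19_", "<SELF>"])
classes = " ".join(["PLAIN"] * 6 + ["DATE"] * 3 + ["PLAIN"] * 6 + ["CARDINAL", "PLAIN"])
sample_file = io.StringIO("\t".join([output_text, input_text, tags, tags, classes]) + "\n")

rows = read_inference_output(sample_file)
row = rows[0]
# One tag and one semiotic class per input word.
print(len(row.input_text.split(" ")), len(row.predicted_tags), len(row.semiotic_classes))
```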

Model Architecture#

The model first uses a Transformer encoder (e.g., bert-base-uncased) to build a contextualized representation for each input token. It then uses a classification head to predict the tag for each token. Another classification head is used to predict a “semiotic” class label for each token.
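
The data flow described above, one shared representation per token feeding two independent classification heads, can be sketched in plain Python. This is only an illustration under strong simplifications: the encoder is replaced by a stub that returns random vectors, the heads are untrained linear layers, and the tiny label sets and all names (`make_linear`, `stub_encoder`, `TAGS`, `SEMIOTIC`) are hypothetical. A real implementation would use a Transformer encoder and a deep learning framework.

```python
import random
from typing import Callable, List

Vector = List[float]


def make_linear(in_dim: int, out_dim: int, rng: random.Random) -> Callable[[Vector], Vector]:
    """Return a randomly initialised linear layer mapping a vector to logits."""
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim

    def forward(x: Vector) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

    return forward


def argmax(values: Vector) -> int:
    return max(range(len(values)), key=values.__getitem__)


def stub_encoder(tokens: List[str], hidden_dim: int, rng: random.Random) -> List[Vector]:
    """Stand-in for the encoder: one vector per token.

    A real Transformer encoder would make each vector depend on the whole sentence.
    """
    return [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in tokens]


TAGS = ["<SELF>", "<DELETE>", "_19", "8", "4_"]  # toy tag vocabulary
SEMIOTIC = ["PLAIN", "DATE", "CARDINAL"]  # toy semiotic classes

rng = random.Random(0)
hidden_dim = 8
tag_head = make_linear(hidden_dim, len(TAGS), rng)
semiotic_head = make_linear(hidden_dim, len(SEMIOTIC), rng)

tokens = "in nineteen eighty four".split()
hidden = stub_encoder(tokens, hidden_dim, rng)

# Both heads read the same per-token representation.
predicted_tags = [TAGS[argmax(tag_head(h))] for h in hidden]
predicted_classes = [SEMIOTIC[argmax(semiotic_head(h))] for h in hidden]
print(list(zip(tokens, predicted_tags, predicted_classes)))
```

Because nothing is trained here, the printed labels are arbitrary; the point is only that each token receives exactly one tag and one semiotic class in a single pass.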

Overall, our design is partly inspired by the LaserTagger approach proposed in the paper Encode, tag, realize: High-precision text editing [NLP-TEXTNORM2].

The LaserTagger method is not directly applicable to ITN because it can only treat the whole non-common fragment as a single replacement tag, whereas a spoken-to-written conversion, e.g. of a date, needs to be aligned at a more granular level. Otherwise, the tag vocabulary would have to include all possible numbers, dates, etc., which is impossible. For example, given the pair “over four hundred thousand fish” - “over 400,000 fish”, LaserTagger would need a single replacement “400,000” in the tag vocabulary. To overcome this problem, we use another method of collecting the vocabulary of replacement tags, based on automatic alignment of spoken-domain words to small fragments of written-domain text, along with <SELF> and <DELETE> tags.
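
To make the vocabulary argument concrete, the sketch below counts how many replacement tags are needed for a small family of numbers under the two schemes. The word-to-fragment alignments are invented for illustration (they are not GIZA++ output), and the underscore is again assumed to mark a fragment boundary.

```python
# Spoken digit words and their written digits.
DIGITS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Hypothetical alignments: spoken suffix words -> one written fragment per word.
SUFFIXES = {
    ("hundred", "thousand"): ["00", ",000_"],
    ("thousand",): [",000_"],
    ("hundred",): ["00_"],
}

examples = {}  # spoken phrase -> written form
phrase_vocab = set()  # one tag per whole phrase (phrase-level replacement)
fragment_vocab = set()  # one tag per word (granular alignment)

for digit_word, digit in DIGITS.items():
    for suffix_words, fragments in SUFFIXES.items():
        tags = ["_" + digit] + fragments
        written = "".join(tags).strip("_")
        examples[" ".join((digit_word,) + suffix_words)] = written
        phrase_vocab.add(written)
        fragment_vocab.update(tags)

print(examples["four hundred thousand"])
print(len(phrase_vocab), len(fragment_vocab))
```

Even in this toy setting the phrase-level vocabulary grows with every new number, while the fragment vocabulary is shared across numbers and grows only when a new digit or suffix appears.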


[NLP-TEXTNORM1] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, 177–180. Prague, Czech Republic, June 2007. Association for Computational Linguistics.

[NLP-TEXTNORM2] Eric Malmi, Sebastian Krause, Sascha Rothe, Daniil Mirylenka, and Aliaksei Severyn. Encode, tag, realize: high-precision text editing. arXiv preprint arXiv:1909.01187, 2019.

[NLP-TEXTNORM3] Richard Sproat and Navdeep Jaitly. RNN approaches to text normalization: a challenge. arXiv preprint arXiv:1611.00068, 2016.

[NLP-TEXTNORM4] Hao Zhang, Richard Sproat, Axel H Ng, Felix Stahlberg, Xiaochang Peng, Kyle Gorman, and Brian Roark. Neural models of text normalization for speech applications. Computational Linguistics, 2019.