SpellMapper (Spellchecking ASR Customization) Model

SpellMapper [NLP-NER1] is a non-autoregressive model for postprocessing of ASR output. It takes a single ASR hypothesis (text) and a custom vocabulary as input and predicts which fragments of the hypothesis, if any, should be replaced by which custom words/phrases. Unlike traditional spellchecking approaches, which aim to correct known words using language models, SpellMapper targets highly specific user terms, out-of-vocabulary (OOV) words, and spelling variations (e.g., “John Koehn” vs. “Jon Cohen”).

This model is an alternative to word boosting/shallow fusion approaches:

  • it does not require retraining the ASR model;

  • it does not require beam search or an external language model (LM);

  • it can be applied on top of any English ASR model's output.

Though SpellMapper is based on the BERT [NLP-NER2] architecture, it uses several non-standard tricks that make it different from other BERT-based models:

  • ten separators ([SEP] tokens) are used to combine the ASR hypothesis and ten candidate phrases into a single input;

  • the model works on character level;

  • the embedding of a subword is concatenated to the embedding of each character that belongs to that subword.


Example input:
[CLS] a s t r o n o m e r s _ d i d i e _ s o m o n _ a n d _ t r i s t i a n _ g l l o [SEP] d i d i e r _ s a u m o n [SEP] a s t r o n o m i e [SEP] t r i s t a n _ g u i l l o t [SEP] ...

Input segments:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4

Example output:
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 0 ...
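
As a rough illustration of how such an input can be assembled, the following Python sketch builds the character-level token sequence and segment ids for the example above. This is not the NeMo implementation: build_spellmapper_input is a hypothetical helper name, and the attribution of each [SEP] to the segment of the candidate that follows it is inferred from the segment ids shown above.

# A minimal sketch (not the NeMo code) of packing the ASR hypothesis and the
# ten candidate phrases into one character-level input: spaces become "_",
# every character is a separate token, and segment ids 0..10 mark the
# hypothesis (0) and the candidates (1..10).
def build_spellmapper_input(hypothesis: str, candidates: list[str]):
    assert len(candidates) == 10, "SpellMapper expects exactly 10 candidates"
    tokens = ["[CLS]"] + list(hypothesis.replace(" ", "_"))
    segments = [0] * len(tokens)  # [CLS] and hypothesis characters -> segment 0
    for i, cand in enumerate(candidates, start=1):
        cand_chars = list(cand.replace(" ", "_"))
        # assumption: each [SEP] opens the segment of the candidate that follows it,
        # as in the segment ids shown in the example above
        tokens += ["[SEP]"] + cand_chars
        segments += [i] * (1 + len(cand_chars))
    return tokens, segments

# candidates taken from the examples in this page, for illustration only
tokens, segments = build_spellmapper_input(
    "astronomers didie somon and tristian gllo",
    ["didier saumon", "astronomie", "tristan guillot", "tristesse", "monade",
     "christian", "astronomer", "solomon", "dididididi", "mercy"],
)
print(" ".join(tokens[:20]), segments[:20])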


For each character, the model calculates logits over 11 labels:

  • 0 - the character does not belong to any candidate;

  • 1..10 - the character belongs to the candidate with this id.

At inference time, average pooling over the per-character probabilities is applied to calculate a replacement probability for each whole fragment.
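
For illustration, a minimal sketch of this pooling step (not the actual NeMo code; fragment_replacement_probability is a hypothetical helper):

import numpy as np

def fragment_replacement_probability(char_probs: np.ndarray, start: int, end: int,
                                     candidate_id: int) -> float:
    """Average the per-character probability of `candidate_id` over the
    fragment [start, end) of the hypothesis.

    char_probs: array of shape (num_characters, 11), softmax over the 11 labels
                (0 = no candidate, 1..10 = candidate ids).
    """
    return float(char_probs[start:end, candidate_id].mean())

# toy example: a 5-character fragment where candidate 3 dominates
probs = np.full((5, 11), 0.01)
probs[:, 3] = 0.9
print(fragment_replacement_probability(probs, 0, 5, candidate_id=3))  # ~0.9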

We recommend trying this model in a Jupyter notebook (a GPU is required): NeMo/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb.

A pretrained English checkpoint can be found on Hugging Face.

An example inference pipeline can be found here: NeMo/examples/nlp/spellchecking_asr_customization/run_infer.sh.

An example script on how to train the model can be found here: NeMo/examples/nlp/spellchecking_asr_customization/run_training.sh.

An example script on how to train on large datasets can be found here: NeMo/examples/nlp/spellchecking_asr_customization/run_training_tarred.sh.

The default configuration file for the model can be found here: NeMo/examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml.

Below we describe the input and output format of the SpellMapper model.

Note

If you use the inference pipeline, this format is handled internally: you only need to provide an input manifest and a user vocabulary, and you will get a corrected manifest.

An input line should consist of 4 tab-separated columns:
  1. text of the ASR hypothesis

  2. texts of 10 candidates, separated by semicolons

  3. 1-based ids of non-dummy candidates, separated by spaces

  4. approximate start/end coordinates of non-dummy candidates (corresponding to the ids in the third column)

Example input (in one line):


t h e _ t a r a s i c _ o o r d a _ i s _ a _ p a r t _ o f _ t h e _ a o r t a _ l o c a t e d _ i n _ t h e _ t h o r a x h e p a t i c _ c i r r h o s i s;u r a c i l;c a r d i a c _ a r r e s t;w e a n;a p g a r;p s y c h o m o t o r;t h o r a x;t h o r a c i c _ a o r t a;a v f;b l o c k a d e d 1 2 6 7 8 9 10 CUSTOM 6 23;CUSTOM 4 10;CUSTOM 4 15;CUSTOM 56 62;CUSTOM 5 19;CUSTOM 28 31;CUSTOM 39 48
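
A minimal sketch of assembling such a 4-column line in Python (the helper names are hypothetical, not part of NeMo; the character spelling with underscores follows the examples above):

def spell(text: str) -> str:
    """Space-separate characters and replace word boundaries with '_',
    as in the examples on this page."""
    return " ".join(text.replace(" ", "_"))

def make_input_line(hypothesis: str, candidates: list[str],
                    non_dummy_ids: list[int], spans: list[tuple[int, int]]) -> str:
    """Build one 4-column tab-separated SpellMapper input line.

    non_dummy_ids are 1-based candidate ids; spans are approximate (start, end)
    character coordinates in the hypothesis, one per non-dummy id.
    """
    col1 = spell(hypothesis)
    col2 = ";".join(spell(c) for c in candidates)
    col3 = " ".join(str(i) for i in non_dummy_ids)
    col4 = ";".join(f"CUSTOM {s} {e}" for s, e in spans)
    return "\t".join([col1, col2, col3, col4])

# reproduces the example input line above
line = make_input_line(
    "the tarasic oorda is a part of the aorta located in the thorax",
    ["hepatic cirrhosis", "uracil", "cardiac arrest", "wean", "apgar",
     "psychomotor", "thorax", "thoracic aorta", "avf", "blockaded"],
    [1, 2, 6, 7, 8, 9, 10],
    [(6, 23), (4, 10), (4, 15), (56, 62), (5, 19), (28, 31), (39, 48)],
)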

Each line in the SpellMapper output is tab-separated and consists of 4 columns:
  1. ASR hypothesis (same as in the input)

  2. 10 candidates separated by semicolons (same as in the input)

  3. fragment predictions, separated by semicolons; each prediction is a tuple (start, end, candidate_id, probability)

  4. letter predictions - the candidate_id predicted for each letter (for debugging purposes only)

Example output (in one line):


t h e _ t a r a s i c _ o o r d a _ i s _ a _ p a r t _ o f _ t h e _ a o r t a _ l o c a t e d _ i n _ t h e _ t h o r a x h e p a t i c _ c i r r h o s i s;u r a c i l;c a r d i a c _ a r r e s t;w e a n;a p g a r;p s y c h o m o t o r;t h o r a x;t h o r a c i c _ a o r t a;a v f;b l o c k a d e d 56 62 7 0.99998;4 20 8 0.95181;12 20 8 0.44829;4 17 8 0.99464;12 17 8 0.97645 8 8 8 0 8 8 8 8 8 8 8 8 8 8 8 8 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 7 7 7 7 7
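
To make this format concrete, here is a rough sketch of parsing an output line and keeping only the fragment predictions above a probability threshold. It is a simplified stand-in for what the inference pipeline does, and the names are hypothetical:

def parse_output_line(line: str, threshold: float = 0.9):
    """Parse one SpellMapper output line into the hypothesis, the candidate list,
    and the fragment predictions whose probability exceeds `threshold`."""
    hypothesis, candidates_col, fragments_col, _letters = line.rstrip("\n").split("\t")
    candidates = candidates_col.split(";")
    predictions = []
    for frag in fragments_col.split(";"):
        # each fragment prediction is "start end candidate_id probability"
        start, end, cand_id, prob = frag.split()
        if float(prob) >= threshold:
            predictions.append((int(start), int(end), int(cand_id), float(prob)))
    return hypothesis, candidates, predictions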

For training, the data should consist of 5 files:

  • config.json - BERT config

  • label_map.txt - labels from 0 to 10, do not change

  • semiotic_classes.txt - currently there are only two classes: PLAIN and CUSTOM, do not change

  • train.tsv - training examples

  • test.tsv - validation examples

Note that since all these examples are synthetic, we do not reserve a set for final testing. Instead, we run the inference pipeline and compare the resulting word error rate (WER) to the WER of the baseline ASR output.

One (non-tarred) training example should consist of 4 tab-separated columns:
  1. text of the ASR hypothesis

  2. texts of 10 candidates, separated by semicolons

  3. 1-based ids of correct candidates, separated by spaces, or 0 if there are none

  4. start/end coordinates of correct candidates (corresponding to the ids in the third column)

Example (in one line):


a s t r o n o m e r s _ d i d i e _ s o m o n _ a n d _ t r i s t i a n _ g l l o d i d i e r _ s a u m o n;a s t r o n o m i e;t r i s t a n _ g u i l l o t;t r i s t e s s e;m o n a d e;c h r i s t i a n;a s t r o n o m e r;s o l o m o n;d i d i d i d i d i;m e r c y 1 3 CUSTOM 12 23;CUSTOM 28 41
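
A small sanity check for such training lines might look as follows (a sketch only; the handling of the "0" case and the CUSTOM-only spans are assumptions based on the examples on this page):

def check_training_example(line: str) -> None:
    """Sanity-check one (non-tarred) training example: the number of start/end
    spans in column 4 must match the number of correct candidate ids in column 3."""
    text, candidates, ids_col, spans_col = line.rstrip("\n").split("\t")
    assert len(candidates.split(";")) == 10, "exactly 10 candidates expected"
    if ids_col.strip() == "0":
        return  # no correct candidate; assumption: column 4 carries no spans
    ids = ids_col.split()
    spans = spans_col.split(";")
    assert len(ids) == len(spans), "each correct id needs a start/end span"
    n_chars = len(text.split())  # column 1 is space-separated characters
    for span in spans:
        # the examples above only show CUSTOM spans
        label, start, end = span.split()
        assert label == "CUSTOM" and 0 <= int(start) < int(end) <= n_chars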

For data preparation, see this script.

Alexandra Antonova, Evelina Bakhturina, and Boris Ginsburg. SpellMapper: A Non-Autoregressive Neural Spellchecker for ASR Customization with Candidate Retrieval Based on N-gram Mappings. 2023. arXiv:2306.02317.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018. arXiv:1810.04805.
