Train Tokenizer#
There are two popular encoding choices: character encoding and sub-word encoding. Sub-word encoding models are nearly identical to character encoding models; the primary difference is that a sub-word encoding model accepts a sub-word tokenized text corpus and emits sub-word tokens in its decoding step. The process_asr_text_tokenizer.py script in NeMo makes preparing the tokenizer simple: we use it to build the text corpus directly from the manifest, then create a tokenizer from that corpus.
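Each line of a NeMo ASR manifest is a JSON object whose "text" field holds the transcript, and the script's corpus-building step amounts to collecting those fields into a plain-text file. A minimal sketch of that step — the file names, durations, and transcripts below are invented for illustration:

```python
import json

# Hypothetical manifest lines in the NeMo JSON-lines format:
# one JSON object per line, transcript under the "text" key.
manifest_lines = [
    '{"audio_filepath": "audio/sample_0.wav", "duration": 3.2, "text": "hallo welt"}',
    '{"audio_filepath": "audio/sample_1.wav", "duration": 1.7, "text": "guten morgen"}',
]

# Extract the transcripts to form the text corpus for tokenizer training.
corpus = [json.loads(line)["text"] for line in manifest_lines]
print("\n".join(corpus))
```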
Subword Tokenization#
If you are familiar with Natural Language Processing, you have likely come across the term “subword”. So what is a subword in the first place? Simply put, it is either a single character or a group of characters. When combined according to a tokenization-detokenization algorithm, subwords reconstruct words or entire sentences.
Many subword tokenization-detokenization algorithms exist, which can be built using large corpora of text data to tokenize and detokenize the data to and from subwords effectively. Some of the most commonly used subword tokenization methods are Byte Pair Encoding, Word Piece Encoding and Sentence Piece Encoding.
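To make the idea concrete, here is a minimal pure-Python sketch of the Byte Pair Encoding training loop: repeatedly merge the most frequent adjacent symbol pair. Real implementations such as sentencepiece add text normalization, special tokens, and far more efficient data structures; the toy word frequencies below are invented for illustration.

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules from a toy word-frequency dictionary.

    A minimal sketch of the BPE idea: start from characters and repeatedly
    merge the most frequent adjacent symbol pair.
    """
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = bpe_merges({"hello": 10, "hell": 5, "help": 3}, num_merges=3)
print(merges)  # -> [('h', 'e'), ('he', 'l'), ('hel', 'l')]
```

Note how the learned merges quickly build up the shared prefix "hel" — exactly the kind of inter-character dependency discussed below.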
The necessity of subword tokenization for ASR#
It has been found via extensive research in the domain of Neural Machine Translation and Language Modelling that subword tokenization not only reduces the length of the tokenized representation (thereby making sentences shorter and more manageable for models to learn), but also boosts the accuracy of prediction of correct tokens.
The Connectionist Temporal Classification loss function is commonly used to train acoustic models, but this loss function has a few limitations:
Generated tokens are conditionally independent of each other. In other words, the probability of the character “l” being predicted after “hel##” is conditionally independent of the previous tokens, so any other token can be predicted just as well unless the model has captured the context some other way!
The length of the generated (target) sequence must be shorter than that of the source sequence.
It turns out - subword tokenization helps alleviate both of these issues!
Sophisticated subword tokenization algorithms build their vocabularies based on large text corpora. To accurately tokenize such large volumes of text with minimal vocabulary size, the subwords that are learned inherently model the interdependency between tokens of that language to some degree.
Looking at the previous example, the token hel## is a single token that represents the relationship h => e => l. When the model predicts the single token hel##, it implicitly predicts this relationship, even though the subsequent token can be either l (for hell) or ##lo (for hello) and is still predicted independently of the previous token!
By reducing the target sequence length through subword tokenization (the target sequence here being the characters/subwords transcribed from the audio signal), we largely sidestep the sequence-length limitation of the CTC loss!
This means we can perform a larger number of pooling steps in our acoustic models, thereby improving execution speed while simultaneously reducing memory requirements.
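A toy illustration of that length reduction: tokenize the same string at the character level and with a small hypothetical subword vocabulary (greedy longest-match segmentation, a simplification of how real tokenizers decode), then compare the CTC target lengths. The vocabulary and text below are made up for this example.

```python
# Hypothetical subword vocabulary; single characters are always allowed as fallback.
VOCAB = {"hel", "lo", "wor", "ld", " "}

def greedy_tokenize(text, vocab, max_len=4):
    """Greedy longest-match segmentation (an illustrative stand-in for BPE decoding)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest piece first, backing off to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

text = "hello world"
chars = list(text)                       # character-level CTC target: 11 tokens
subwords = greedy_tokenize(text, VOCAB)  # subword CTC target: 5 tokens
print(len(chars), len(subwords))         # -> 11 5
```

The shorter target leaves more headroom for temporal pooling in the acoustic model while still satisfying the CTC constraint that the target be no longer than the (downsampled) input.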
First, download the tokenizer creation script from the NeMo repository.
import os
BRANCH = 'v1.11.0'
if not os.path.exists("scripts/process_asr_text_tokenizer.py"):
    !mkdir scripts
    !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py
mkdir: cannot create directory ‘scripts’: File exists
--2022-10-24 23:55:15-- https://raw.githubusercontent.com/NVIDIA/NeMo/v1.11.0/scripts/tokenizers/process_asr_text_tokenizer.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13359 (13K) [text/plain]
Saving to: ‘scripts/process_asr_text_tokenizer.py’
process_asr_text_to 100%[===================>] 13.05K --.-KB/s in 0s
2022-10-24 23:55:15 (76.9 MB/s) - ‘scripts/process_asr_text_tokenizer.py’ saved [13359/13359]
The script above takes a few important arguments:

- --manifest or --data_file: If your text data lies inside an ASR manifest file, use the --manifest path. If instead the text data is inside a file with one text line per row, use --data_file. In either case, you can concatenate different manifests or different data files with commas.
- --data_root: The output directory (whose subdirectories will be created if not present) where the tokenizers will be placed.
- --vocab_size: The size of the tokenizer vocabulary. Larger vocabularies can accommodate almost entire words, but the decoder size of any model grows proportionally.
- --tokenizer: Can be either spe or wpe. spe refers to the Google sentencepiece library tokenizer; wpe refers to the HuggingFace BERT Word Piece tokenizer. Please refer to the papers above for the relevant technique in order to select an appropriate tokenizer.
- --no_lower_case: When this flag is passed, the tokenizer creates separate tokens for upper- and lower-case characters. By default, the script turns all text to lower case before tokenization (and if upper-case characters are passed during training/inference, the tokenizer emits an Out-Of-Vocabulary token). Used primarily for the English language.
- --spe_type: The sentencepiece library has a few implementations of the tokenization technique, and spe_type selects among them. Currently supported types are unigram, bpe, char, and word. Defaults to bpe.
- --spe_character_coverage: How much of the original character set the sentencepiece library should cover in its “base set” of tokens (akin to the lower- and upper-case characters of the English language). For almost all languages with small base token sets (< 1000 tokens), this should be kept at its default of 1.0. For languages with larger character sets (say Japanese, Mandarin, Korean, etc.), the suggested value is 0.9995.
- --spe_sample_size: If the dataset is too large, consider training on a sample indicated by a positive integer. By default, any negative value (default = -1) uses the entire dataset.
- --spe_train_extremely_large_corpus: When training a sentencepiece tokenizer on very large amounts of text, the tokenizer can run out of memory or be unable to process so much data in RAM. At some point you might receive the error “Input corpus too large, try with train_extremely_large_corpus=true”. If your machine has large amounts of RAM, it might still be possible to build the tokenizer by passing this flag; it will fail silently if it runs out of RAM.
- --log: Whether the script should display log messages.
!python ./scripts/process_asr_text_tokenizer.py --manifest=./data/processed/train_manifest_merged.json \
--data_root=./data/processed/tokenizer \
--vocab_size=1024 \
--tokenizer="spe" \
--log
[NeMo W 2022-10-24 23:50:07 optimizers:77] Could not import distributed_fused_adam optimizer from Apex
INFO:root:Corpus already exists at path : ./data/processed/tokenizer/text_corpus/document.txt
WARNING:root:Model file already exists, overriding old model file !
[NeMo I 2022-10-24 23:50:07 sentencepiece_tokenizer:312] Processing ./data/processed/tokenizer/text_corpus/document.txt and store at ./data/processed/tokenizer/tokenizer_spe_bpe_v1024
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=./data/processed/tokenizer/text_corpus/document.txt --model_prefix=./data/processed/tokenizer/tokenizer_spe_bpe_v1024/tokenizer --vocab_size=1024 --shuffle_input_sentence=true --hard_vocab_limit=false --model_type=bpe --character_coverage=1.0 --bos_id=-1 --eos_id=-1 --normalization_rule_name=nmt_nfkc_cf
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with :
trainer_spec {
input: ./data/processed/tokenizer/text_corpus/document.txt
input_format:
model_prefix: ./data/processed/tokenizer/tokenizer_spe_bpe_v1024/tokenizer
model_type: BPE
vocab_size: 1024
self_test_sample_size: 0
character_coverage: 1
input_sentence_size: 0
shuffle_input_sentence: 1
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
max_sentence_length: 4192
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: 1
split_by_number: 1
split_by_whitespace: 1
split_digits: 0
treat_whitespace_as_suffix: 0
allow_whitespace_only_pieces: 0
required_chars:
byte_fallback: 0
vocabulary_output_piece_score: 1
train_extremely_large_corpus: 0
hard_vocab_limit: 0
use_all_vocab: 0
unk_id: 0
bos_id: -1
eos_id: -1
pad_id: -1
unk_piece: <unk>
bos_piece: <s>
eos_piece: </s>
pad_piece: <pad>
unk_surface: ⁇
enable_differential_privacy: 0
differential_privacy_noise_level: 0
differential_privacy_clipping_threshold: 0
}
normalizer_spec {
name: nmt_nfkc_cf
add_dummy_prefix: 1
remove_extra_whitespaces: 1
escape_whitespaces: 1
normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(350) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(181) LOG(INFO) Loading corpus: ./data/processed/tokenizer/text_corpus/document.txt
trainer_interface.cc(406) LOG(INFO) Loaded all 9978 sentences
trainer_interface.cc(422) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(427) LOG(INFO) Normalizing sentences...
trainer_interface.cc(536) LOG(INFO) all chars count=1439273
trainer_interface.cc(557) LOG(INFO) Alphabet size=64
trainer_interface.cc(558) LOG(INFO) Final character coverage=1
trainer_interface.cc(590) LOG(INFO) Done! preprocessed 9978 sentences.
trainer_interface.cc(596) LOG(INFO) Tokenizing input sentences with whitespace: 9978
trainer_interface.cc(607) LOG(INFO) Done! 26075
bpe_model_trainer.cc(167) LOG(INFO) Updating active symbols. max_freq=50898 min_freq=1
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=8123 size=20 all=1569 active=1505 piece=ss
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=5589 size=40 all=2234 active=2170 piece=in
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=3556 size=60 all=3325 active=3261 piece=lich
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=2548 size=80 all=4071 active=4007 piece=▁auch
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=1934 size=100 all=4945 active=4881 piece=▁für
bpe_model_trainer.cc(167) LOG(INFO) Updating active symbols. max_freq=1835 min_freq=123
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=1524 size=120 all=5570 active=1624 piece=▁st
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=1273 size=140 all=6429 active=2483 piece=▁europä
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=1096 size=160 all=6919 active=2973 piece=▁wie
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=948 size=180 all=7429 active=3483 piece=▁als
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=795 size=200 all=8251 active=4305 piece=sten
bpe_model_trainer.cc(167) LOG(INFO) Updating active symbols. max_freq=794 min_freq=118
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=702 size=220 all=8878 active=1567 piece=egen
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=647 size=240 all=9406 active=2095 piece=▁gr
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=579 size=260 all=9909 active=2598 piece=▁soll
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=534 size=280 all=10191 active=2880 piece=mer
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=482 size=300 all=10462 active=3151 piece=rie
bpe_model_trainer.cc(167) LOG(INFO) Updating active symbols. max_freq=480 min_freq=100
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=446 size=320 all=10885 active=1387 piece=rit
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=416 size=340 all=11409 active=1911 piece=lichen
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=393 size=360 all=11870 active=2372 piece=▁fa
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=366 size=380 all=12257 active=2759 piece=liche
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=348 size=400 all=12804 active=3306 piece=dert
bpe_model_trainer.cc(167) LOG(INFO) Updating active symbols. max_freq=342 min_freq=89
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=321 size=420 all=13076 active=1260 piece=▁de
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=310 size=440 all=13407 active=1591 piece=▁gemein
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=297 size=460 all=13762 active=1946 piece=▁präsident
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=279 size=480 all=14157 active=2341 piece=▁entwick
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=263 size=500 all=14416 active=2600 piece=gie
bpe_model_trainer.cc(167) LOG(INFO) Updating active symbols. max_freq=263 min_freq=78
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=253 size=520 all=14839 active=1397 piece=ssch
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=239 size=540 all=15152 active=1710 piece=▁am
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=224 size=560 all=15355 active=1913 piece=▁letz
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=214 size=580 all=15657 active=2215 piece=arbeit
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=205 size=600 all=16033 active=2591 piece=blem
bpe_model_trainer.cc(167) LOG(INFO) Updating active symbols. max_freq=205 min_freq=69
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=194 size=620 all=16244 active=1201 piece=▁jahr
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=188 size=640 all=16515 active=1472 piece=▁c
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=180 size=660 all=16892 active=1849 piece=zehn
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=174 size=680 all=17196 active=2153 piece=▁ener
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=168 size=700 all=17470 active=2427 piece=sicht
bpe_model_trainer.cc(167) LOG(INFO) Updating active symbols. max_freq=168 min_freq=62
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=164 size=720 all=17934 active=1441 piece=▁no
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=157 size=740 all=18224 active=1731 piece=▁grenz
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=153 size=760 all=18450 active=1957 piece=▁gesagt
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=147 size=780 all=18661 active=2168 piece=ekt
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=142 size=800 all=18964 active=2471 piece=inn
bpe_model_trainer.cc(167) LOG(INFO) Updating active symbols. max_freq=142 min_freq=57
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=138 size=820 all=19202 active=1219 piece=▁einge
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=134 size=840 all=19518 active=1535 piece=▁nun
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=130 size=860 all=19737 active=1754 piece=anken
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=125 size=880 all=20007 active=2024 piece=▁dabei
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=120 size=900 all=20215 active=2232 piece=▁türk
bpe_model_trainer.cc(167) LOG(INFO) Updating active symbols. max_freq=120 min_freq=51
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=117 size=920 all=20372 active=1163 piece=▁öffentlich
bpe_model_trainer.cc(258) LOG(INFO) Added: freq=112 size=940 all=20664 active=1455 piece=▁wei
trainer_interface.cc(685) LOG(INFO) Saving model: ./data/processed/tokenizer/tokenizer_spe_bpe_v1024/tokenizer.model
trainer_interface.cc(697) LOG(INFO) Saving vocabs: ./data/processed/tokenizer/tokenizer_spe_bpe_v1024/tokenizer.vocab
Serialized tokenizer at location : ./data/processed/tokenizer/tokenizer_spe_bpe_v1024
INFO:root:Done!
That’s it! Our tokenizer is now built and stored inside the data_root directory that we provided to the script.
We can inspect the tokenizer vocabulary itself. To keep it manageable, we will print just the first 10 tokens of the vocabulary:
!head -n 10 ./data/processed/tokenizer/tokenizer_spe_bpe_v1024/vocab.txt
##en
##er
d
##ch
##ei
##un
##ie
w
a
s
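In this vocab.txt listing, a leading ## appears to mark a piece that continues the previous word, following the WordPiece convention (a piece without the marker starts a new word). As a rough sketch of how such pieces join back into words — the real detokenization is handled internally by the tokenizer, and the pieces below are taken from the listing above purely for illustration:

```python
def detokenize(pieces):
    """Join subword pieces using the '##' continuation-marker convention."""
    words, current = [], ""
    for piece in pieces:
        if piece.startswith("##"):
            # Continuation piece: append to the word in progress.
            current += piece[2:]
        else:
            # New-word piece: flush the previous word, start a fresh one.
            if current:
                words.append(current)
            current = piece
    if current:
        words.append(current)
    return " ".join(words)

print(detokenize(["d", "##en", "w", "##ei", "s"]))  # -> "den wei s"
```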