N-Gram Language Model
Language models estimate the probability distribution of sequences of words. In general, this is a very large task because sequences can be arbitrarily long, so it is commonly assumed that the probability of a word depends only on the N-1 words preceding it. This is known as an N-Gram language model. An N-Gram model of order N stores the counts of all sequences of words observed in the training data, from length one (known as unigrams) up to length N. During inference, if a queried N-gram was not seen during training, the model backs off to the probability of the last N-1 words, weighted by a calculated backoff probability.
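For intuition, the following is a minimal sketch of the count-and-back-off idea for an order-2 model. It is illustrative only: the TAO n_gram task produces KenLM-style smoothed ARPA models, and the constant BACKOFF weight here is a simplification of the per-context backoff weights a real model stores.

# Minimal sketch of an order-2 (bigram) model with back-off.
# Illustrative only -- not the TAO/KenLM implementation.
from collections import Counter

corpus = "set alarm for seven thirty am".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total_words = sum(unigrams.values())
BACKOFF = 0.4  # simplified constant back-off weight (hypothetical)

def prob(word, prev=None):
    """P(word | prev): use the observed bigram if present,
    otherwise back off to a weighted unigram probability."""
    if prev is not None and (prev, word) in bigrams:
        return bigrams[(prev, word)] / unigrams[prev]
    weight = BACKOFF if prev is not None else 1.0
    return weight * unigrams.get(word, 0) / total_words

print(prob("alarm", "set"))    # seen bigram: relative frequency
print(prob("alarm", "seven"))  # unseen bigram: backed-off unigram estimate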
Before proceeding, let’s download the sample spec files that we will need for the rest of the subtasks.
tao n_gram download_specs -r /results/n_gram/default_specs/ \
-o /specs/lm/n_gram/
Before training the model, the dataset must first be converted to the required data format, which is a single text file. For a word-based N-Gram model, it is recommended to convert all letters to lowercase, remove undesired punctuation, replace numerals with words, and replace umlauts. These steps are handled by the provided dataset conversion tool.
To convert a dataset to this format, use the dataset_convert utility, which implements the recommended preprocessing steps mentioned above. It takes a directory and processes all files with a specified extension within that directory into a single text file. You can also write your own converter for the annotation format you use; the resulting data must still be contained in a single file. A minimal preprocessing sketch follows the argument descriptions below.
!tao n_gram dataset_convert \
source_data_dir=`source_data_dir` \
extension='file_extension' \
target_data_dir=`target_data_dir`
source_data_dir
: The directory location of your dataset

target_data_dir
: The directory location where the converted dataset should be saved
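If you write your own converter, the preprocessing can look roughly like the following sketch. It is illustrative only and is not the dataset_convert implementation; the NUMBER_WORDS and UMLAUT_MAP tables are hypothetical and only cover single digits and common German umlauts.

# Minimal sketch of the recommended word-level preprocessing for plain text.
# Illustrative only -- not the dataset_convert implementation.
import re

UMLAUT_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
NUMBER_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(line: str) -> str:
    line = line.lower().translate(UMLAUT_MAP)
    # spell out single digits (a real converter would handle full numbers)
    line = re.sub(r"\d", lambda m: f" {NUMBER_WORDS[m.group()]} ", line)
    # drop punctuation, keeping only word characters and spaces
    line = re.sub(r"[^\w\s]", " ", line)
    return " ".join(line.split())

print(normalize("Set the alarm for 7:30, bitte schön!"))
# -> "set the alarm for seven three zero bitte schoen"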
The following is an example of the training config spec (the train.yaml file). You can change any of these parameters and pass them to the train command.
model:
  intermediate: True
  order: 2
  pruning:
    - 0
training_ds:
  is_tarred: false
  is_file: true
  data_dir: ???
validation_ds:
  is_tarred: false
  is_file: true
  data_dir: ???
vocab_file: ""
encryption_key: "tlt_encode"
Parameter | Data Type | Default | Description
training_ds.data_dir | string | – | Path to the dataset file
model.order | int | – | Order of the N-Gram model (maximum number of grams)
vocab_file | string | – | Optional path to a vocab file to limit the vocabulary learned by the model
model.intermediate | boolean | true | Choose from [true, false]. If true, creates an intermediate file (required for finetune and interpolate)
model.pruning | list[int] | [0] | Prunes grams with counts less than or equal to the threshold provided for each gram order; must be non-decreasing and start with 0
export_to | string | – | The path to the trained .tlt model
The following is an example of the command for training the model:
!tao n_gram train -e /specs/lm/n_gram/train.yaml \
training_ds.data_dir=PATH_TO_DATA \
model.order=4 \
model.pruning=[0,1,1,3] \
-k $KEY
Required Arguments for Training
-e
: The experiment specification file to set up training

model.order
: The model order

training_ds.data_dir
: The dataset directory

-k
: The encryption key
Optional Arguments
model.intermediate
: If true, saves the intermediate file format as well

model.pruning
: The list of pruning thresholds, one per gram order; must be non-decreasing and start with 0 (see the sketch below)
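As an illustration of what the pruning thresholds mean, the sketch below shows the keep/drop rule for model.pruning=[0,1,1,3]; the counts are hypothetical and this is not KenLM's internal pruning logic.

# Illustrative sketch of pruning by count threshold (not TAO/KenLM internals).
# With model.pruning=[0, 1, 1, 3], unigrams are never pruned, bigrams and
# trigrams seen only once are dropped, and 4-grams need a count above 3.
pruning = [0, 1, 1, 3]  # thresholds for orders 1..4

# hypothetical n-gram counts keyed by (order, n-gram tuple)
counts = {
    (2, ("seven", "thirty")): 1,
    (2, ("set", "alarm")): 5,
    (4, ("set", "alarm", "for", "seven")): 2,
}

kept = {
    ngram: c
    for (order, ngram), c in counts.items()
    if c > pruning[order - 1]  # keep only counts above the threshold
}
print(kept)  # {('set', 'alarm'): 5}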
Training Procedure
At the start of training, TAO Toolkit will print out a log of the experiment specification, a summary of the training dataset, and the model parameters.
As the model trains, a progress bar is displayed. At the end of training, TAO Toolkit saves the model's ARPA file to the results directory.
The fine-tuning process continues training from a previously trained model by training a second model on new data and interpolating it with the original model. However, fine-tuning is rarely needed for this task, since training is fast and a trained model is specific to its dataset. Additionally, fine-tuning requires that the original model was trained with intermediate enabled. A fine-tuned model cannot be fine-tuned again.
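Conceptually, the interpolation combines the two models' probabilities as a weighted sum. The sketch below is illustrative only; the actual finetune step operates on KenLM intermediate files rather than Python dictionaries, and the probabilities shown are hypothetical.

# Minimal sketch of interpolating two models with a given weight.
# "weight" is the weight of the second (new) model, as in the finetune spec.
weight = 0.5

# hypothetical per-word probabilities from the original and the new model
p_original = {"alarm": 0.020, "schedule": 0.001}
p_new      = {"alarm": 0.005, "schedule": 0.030}

p_interpolated = {
    w: (1.0 - weight) * p_original[w] + weight * p_new[w]
    for w in p_original
}
print(p_interpolated)  # {'alarm': 0.0125, 'schedule': 0.0155}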
The following is an example spec for fine-tuning of the model:
# Spec for the new model to be trained
model:
  order: 2
  pruning:
    - 0
    - 1

# Spec for the fine-tuning dataset
tuning_ds:
  is_tarred: false
  is_file: true
  data_dir: "/data/data.txt"

# Weight of the second model in the interpolation
weight: 0.5

restore_from: "/results/prev_model.tlt"
vocab_file: ""
encryption_key: "tlt_encode"
Parameter | Data Type | Default | Description
restore_from | string | – | The path to the pre-trained model (.tlt with intermediate, or kenlm_intermediate + vocab)
weight | float | – | Weight of the new model in the interpolation
tuning_ds.data_dir | string | – | The path to the data converted to the specified format
model.order | int | – | Order of the N-Gram model (maximum number of grams)
vocab_file | string | – | Optional path to a vocab file to limit the vocabulary learned by the model
model.intermediate | boolean | true | Choose from [true, false]. If true, creates an intermediate file (required for finetune and interpolate)
model.pruning | list[int] | [0] | Prunes grams with counts less than or equal to the threshold provided for each gram order; must be non-decreasing and start with 0
export_to | string | – | The path to the fine-tuned model (cannot be fine-tuned again)
Use the following command to fine-tune the model:
!tao n_gram finetune \
-e /specs/lm/n_gram/finetune.yaml \
restore_from=PATH_TO_MODEL \
tuning_ds.data_dir=PATH_TO_DATA \
model.order=3 \
weight=0.5 \
-k $KEY
Required Arguments for Fine-tuning
-e
: The experiment specification file to set up fine-tuning

tuning_ds.data_dir
: The dataset directory

restore_from
: The model to restore from

model.order
: The order of the second model

-k
: The encryption key
Optional Arguments
weight
: The weight of the second model in the interpolation

vocab_file
: A file with the list of vocabulary to limit the second model to
You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call.
Fine-tuning Procedure
The fine-tuning procedure and logs will look similar to those described in the Model Training section, but there will be two progress bars: one for training the new model and one for the interpolation.
The following is an example spec to evaluate the pre-trained model:
test_ds:
  data_dir: ???
restore_from: ???
Parameter | Data Type | Default | Description
restore_from | string | – | The path to the pre-trained model
test_ds.data_dir | string | – | The path to the data converted to the specified format
Use the following command to evaluate the model:
!tao n_gram evaluate \
-e /specs/lm/n_gram/evaluate.yaml \
test_ds.data_dir=PATH_TO_DATA \
restore_from=PATH_TO_MODEL
Required Arguments for Evaluation
-e
: The experiment specification file to set up evaluation

test_ds.data_dir
: The path to the pre-processed data to run evaluation on

restore_from
: The path to the pre-trained model
Evaluation Procedure
After the previously trained model is initialized, it runs evaluation against the provided test set. You should see the metric for the cumulative dataset, as shown below.
...
Loaded 3-gram model from /results/n_gram/train/checkpoints/train_n_gram.arpa
Perplexity calculated from the 3-gram model is 474.87323201470394
To run inference on the model, specify the list of examples in the spec:
input_batch:
- 'set alarm for seven thirty am'
- 'lower volume by fifty percent'
- 'what is my schedule for tomorrow'
In the output, for each input query, the log likelihood of each N-Gram is printed, as shown in the following example:
Sentence: set alarm for seven thirty am
Log likelihood: -24.213985443115234
Perplexity: 2878.3312814538162
Enumerate all n-grams from model outputs:
-4.188663482666016 1: set
-4.425830841064453 1: alarm
-2.1986284255981445 1: for
-3.727583646774292 2: for seven
-3.8963475227355957 1: thirty
-4.2834343910217285 1: am
-1.4934978485107422 2: am </s>
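For reference, the reported numbers are consistent with the per-n-gram scores being base-10 log probabilities: the sentence log likelihood is their sum, and the perplexity is 10 raised to the negative average. The following sketch reproduces the values from the example above under that assumption.

# Relate the per-n-gram scores to the sentence-level numbers above,
# assuming the scores are base-10 log probabilities.
log10_probs = [
    -4.188663482666016,   # set
    -4.425830841064453,   # alarm
    -2.1986284255981445,  # for
    -3.727583646774292,   # for seven
    -3.8963475227355957,  # thirty
    -4.2834343910217285,  # am
    -1.4934978485107422,  # am </s>
]

log_likelihood = sum(log10_probs)                         # ~ -24.214
perplexity = 10 ** (-log_likelihood / len(log10_probs))   # ~ 2878.33
print(log_likelihood, perplexity)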
Use the following command to run inference:
!tao n_gram infer \
-e /specs/lm/n_gram/infer.yaml \
-m trained-model.tlt
Required Arguments for Inference
-e
: The experiment specification file to set up inference. This requires the input_batch with the list of examples to run inference on.

-m
: The path to the pre-trained model checkpoint from which to infer. The file should have a .tlt extension.
Inference Procedure
After the trained model is loaded, it runs on the provided set of query examples and displays the total log likelihood, the total perplexity, and the individual log likelihood of each n-gram:
...
Loaded 3-gram model from /results/n_gram/train/checkpoints/train_n_gram.arpa
Sentence: set alarm for seven thirty am
Log likelihood: -24.213985443115234
Perplexity: 2878.3312814538162
Enumerate all n-grams from model outputs:
-4.188663482666016 1: set
-4.425830841064453 1: alarm
-2.1986284255981445 1: for
-3.727583646774292 2: for seven
-3.8963475227355957 1: thirty
-4.2834343910217285 1: am
-1.4934978485107422 2: am </s>
Sentence: lower volume by fifty percent
Log likelihood: -22.31029510498047
Perplexity: 5228.565073265182
Enumerate all n-grams from model outputs:
-4.880340576171875 1: lower
-4.325512409210205 1: volume
-2.352994680404663 1: by
-4.150567054748535 1: fifty
-5.198777198791504 1: percent
"percent" is an OOV
-1.4021024703979492 1: </s>
Sentence: what is my schedule for tomorrow
Log likelihood: -19.649606704711914
Perplexity: 641.3375554793575
Enumerate all n-grams from model outputs:
-1.9918817281723022 2: <s> what
-1.2880996465682983 3: <s> what is
-1.9683064222335815 3: what is my
-5.349557876586914 1: schedule
"schedule" is an OOV
-2.0828189849853516 1: for
-5.479092121124268 1: tomorrow
-1.4898509979248047 1: </s>
The following is an example of the spec file for model export:
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
# Name of the .arpa EFF archive to be loaded/model to be exported.
restore_from: "/results/pretrain_model.tlt"
# Output EFF archive containing binary.
export_to: exported-model.riva
# Data structure to use (default is probing)
binary_type: "probing"
# probability bits (quantization)
binary_q_bits: 0
# backoff bits (quantization)
binary_b_bits: 0
# pointer compression
binary_a_bits: 0
Parameter | Data Type | Default | Description
restore_from | string | – | The path to the pre-trained model
export_to | string | exported-model.riva | The path to the exported model
binary_type | string | probing | Data structure used for the binary
binary_q_bits | int | 0 | Probability bits (quantization)
binary_b_bits | int | 0 | Backoff bits (quantization)
binary_a_bits | int | 0 | Pointer compression bits
To export a pre-trained model, run the following:
### For export to Riva
!tao n_gram export \
-e /specs/lm/n_gram/export.yaml \
-m finetuned-model.tlt \
export_to=PATH_TO_MODEL \
binary_type=probing \
-k $KEY
Required Arguments for Export
-e
: The experiment specification file to set up export. This requires the export_to parameter.

export_to
: The path to the RIVA file to export to

-m
: The path to the pre-trained model checkpoint to export. The file should have a .tlt extension.

-k
: The encryption key
You can use the Riva framework to deploy the trained model at runtime. For more details, refer to the Riva documentation.