N-Gram Language Model

Language models estimate the probability distribution of sequences of words. In general this is a large task, because sequences can be arbitrarily long, so it is often assumed that the probability of a word depends only on the N-1 words preceding it. This is known as an N-Gram Language Model. An N-Gram model of order N stores the counts of all word sequences observed in the training data, from length one (known as unigrams) up to length N. During inference, if an N-gram that was not seen during training is queried, the model backs off to the probability of the last N-1 words, weighted by a calculated backoff weight.
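The counting and backoff steps can be sketched in a few lines of Python. This is a minimal illustration, not the TAO Toolkit implementation: the function names and the toy table of log10 probabilities and backoff weights are invented for the example.

```python
from collections import Counter

def count_ngrams(tokens, order):
    """Count every n-gram of length 1..order in a token list."""
    counts = Counter()
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_logprob(ngram, logprobs, backoffs):
    """Score the last word of `ngram` given the words before it.

    logprobs maps n-gram tuples to log10 probabilities; backoffs maps
    context tuples to log10 backoff weights (0.0 when absent).
    """
    if ngram in logprobs:
        return logprobs[ngram]
    if len(ngram) == 1:
        raise KeyError(f"unknown word: {ngram[0]}")
    context = ngram[:-1]
    # Drop the oldest word and charge the context's backoff weight.
    return backoffs.get(context, 0.0) + backoff_logprob(ngram[1:], logprobs, backoffs)

# Hypothetical toy model (values invented for illustration).
logprobs = {("set",): -1.0, ("alarm",): -1.5, ("set", "alarm"): -0.3}
backoffs = {("alarm",): -0.2}

counts = count_ngrams("set alarm for seven".split(), order=2)
seen = backoff_logprob(("set", "alarm"), logprobs, backoffs)
unseen = backoff_logprob(("alarm", "set"), logprobs, backoffs)
```

Here `seen` is read directly from the table, while `unseen` is the unigram score of "set" plus the backoff weight of the context "alarm".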

Downloading Sample Spec Files

Before proceeding, download the sample spec files that are needed for the rest of the subtasks.

tao n_gram download_specs -r /results/n_gram/default_specs/ \
    -o /specs/lm/n_gram/

Data Format

Before training the model, the dataset must be converted to the required data format, which is a single text file. For a word-based N-Gram model, it is recommended to convert all letters to lowercase, remove any undesired punctuation, replace numerics with words, and replace umlauts. These steps are handled by the provided dataset conversion tool.
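For readers who write their own converter, the recommended steps might look like the following Python sketch. The umlaut and digit mappings are assumptions chosen for illustration, and the function is hypothetical; it does not reproduce the behavior of the provided conversion tool.

```python
import re

# Assumed mappings for illustration; the real tool may differ.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize_line(text):
    """Lowercase, transliterate umlauts, spell out digits, drop punctuation."""
    text = text.lower()
    for src, dst in UMLAUTS.items():
        text = text.replace(src, dst)
    # Spell each digit separately; a real converter would read whole numbers.
    text = re.sub(r"\d", lambda m: f" {DIGITS[int(m.group())]} ", text)
    # Keep only lowercase letters, apostrophes, and spaces.
    text = re.sub(r"[^a-z' ]", " ", text)
    return " ".join(text.split())

example = normalize_line("Set the Alarm for 7 AM, bitte schön!")
```

Spelling digits one at a time keeps the sketch short; a production converter would more likely verbalize whole numbers (for example "thirty" rather than "three zero").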

Dataset Conversion

To convert your data to the model's data format, use the dataset_convert utility, which implements the recommended preprocessing steps mentioned above. The utility takes a directory and processes all files with a specified extension within that directory into a single text file. You can also write your own converter for the format you use for data preprocessing. Whatever annotation format your dataset follows, the resulting data should be contained in a single file.

!tao n_gram dataset_convert \
    source_data_dir=SOURCE_DATA_DIR \
    extension=FILE_EXTENSION \
    target_data_dir=TARGET_DATA_DIR

source_data_dir: The directory containing your dataset
extension: The extension of the files to process
target_data_file: The location where the converted dataset should be saved

Model Training

The following is an example of the training spec file, train.yaml. You can change any of these parameters and pass them to the train command.

model:
  intermediate: True
  order: 2
  pruning:
    - 0
training_ds:
  is_tarred: false
  is_file: true
  data_dir: ???

validation_ds:
  is_tarred: false
  is_file: true
  data_dir: ???

vocab_file: ""
encryption_key: "tlt_encode"

Parameter	Data Type	Default	Description
training_ds.data_dir	string	–	Path to dataset file.
model.order	int	–	Order of N-Gram model (maximum number of grams)
vocab_file	string	–	Optional path to vocab file to limit vocabulary learned by model.
model.intermediate	boolean	true	Choose from [true, false]. If true, creates the intermediate file required for finetune and interpolate
model.pruning	list[int]	[0]	Prune grams with counts less than or equal to the threshold provided for each gram order. Thresholds must be non-decreasing and start with 0
export_to	string	–	The path to the trained .tlt model

The following is an example of the command for training the model:

!tao n_gram train -e /specs/lm/n_gram/train.yaml \
    training_ds.data_dir=PATH_TO_DATA \
    model.order=4 \
    model.pruning=[0,1,1,3] \
    -k $KEY

Required Arguments for Training

-e: The experiment-specification file to set up training
model.order: Model order
training_ds.data_dir: The dataset directory
-k: The encryption key

Optional Arguments

model.intermediate: If true, saves intermediate file format as well
model.pruning: List of pruning thresholds, one per gram order starting from unigrams. Must be non-decreasing and start with 0.
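The pruning constraints can be illustrated with a short Python sketch. The helper names and the toy counts are hypothetical, and the rule that orders beyond the end of the list reuse the last threshold is an assumption, not something this guide states.

```python
def check_pruning(pruning):
    """Raise ValueError unless thresholds start at 0 and never decrease."""
    if not pruning or pruning[0] != 0:
        raise ValueError("pruning must start with 0")
    if any(a > b for a, b in zip(pruning, pruning[1:])):
        raise ValueError("pruning must be non-decreasing")

def prune(counts, pruning):
    """Drop n-grams whose count is <= the threshold for their order."""
    check_pruning(pruning)
    kept = {}
    for ngram, count in counts.items():
        # Assumption: orders beyond the list reuse the last threshold.
        threshold = pruning[min(len(ngram), len(pruning)) - 1]
        if count > threshold:
            kept[ngram] = count
    return kept

# Hypothetical counts for illustration.
counts = {("set",): 1, ("set", "alarm"): 1, ("alarm", "for"): 2}
kept = prune(counts, [0, 1])
```

With thresholds [0, 1], every observed unigram survives, while bigrams seen only once are dropped.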

Training Procedure

At the start of training, TAO Toolkit prints a log of the experiment specification, a summary of the training dataset, and the model parameters.

As the model starts training, you will see a progress bar. At the end of training, TAO Toolkit saves the model as an ARPA file in the results directory.

Model Fine-tuning

Fine-tuning continues from a previously trained model by training a second model on new data and interpolating it with the original model. Fine-tuning is rarely needed for this task, since training is fast and a trained model is specific to its dataset. Fine-tuning also requires that the original model was trained with intermediate enabled, and a fine-tuned model cannot be fine-tuned again.
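To give a feel for what an interpolation weight does, the following Python sketch mixes the score of one n-gram under two models. It assumes simple linear interpolation of probabilities, which may not be the scheme the finetune task actually uses; the function name and the scores are invented for illustration.

```python
import math

def interpolate_logprob(old_log10, new_log10, weight):
    """Linearly mix two log10 probabilities; `weight` goes to the new model."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    mixed = (1.0 - weight) * 10 ** old_log10 + weight * 10 ** new_log10
    return math.log10(mixed)

# Hypothetical scores for one n-gram under each model.
mixed = interpolate_logprob(old_log10=-2.0, new_log10=-1.0, weight=0.5)
```

Under this assumption, a weight of 0 reproduces the original model's score and a weight of 1 reproduces the new model's score.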

The following is an example spec for fine-tuning of the model:

# Spec for new model trained
model:
  order: 2
  pruning:
    - 0
    - 1

# Spec for finetuning dataset
tuning_ds:
  is_tarred: false
  is_file: true
  data_dir: "/data/data.txt"

# Weight of the second model in interpolation
weight: 0.5
restore_from: "/results/prev_model.tlt"
vocab_file: ""
encryption_key: "tlt_encode"

Parameter	Data Type	Default	Description
restore_from	string	–	The path to the pre-trained model (tlt with intermediate or kenlm_intermediate + vocab)
weight	float	–	Weight of the new model in interpolation
tuning_ds.data_dir	string	–	The path to the data that is converted to the specified format
model.order	int	–	Order of N-Gram model (maximum number of grams)
vocab_file	string	–	Optional path to vocab file to limit vocabulary learned by model.
model.intermediate	boolean	true	Choose from [true, false]. If true, creates the intermediate file required for finetune and interpolate
model.pruning	list[int]	[0]	Prune grams with counts less than or equal to the threshold provided for each gram order. Thresholds must be non-decreasing and start with 0
export_to	string	–	The path to the finetuned model (Cannot be finetuned again)

Use the following command to fine-tune the model:

!tao n_gram finetune \
    -e /specs/lm/n_gram/finetune.yaml \
    restore_from=PATH_TO_MODEL \
    tuning_ds.data_dir=PATH_TO_DATA \
    model.order=3 \
    weight=0.5 \
    -k $KEY

Required Arguments for Fine-tuning

-e: The experiment specification file to set up fine-tuning
tuning_ds.data_dir: The dataset directory
restore_from: The model to restore from
model.order: Order of second model
-k: The encryption key

Optional Arguments

weight: Weight of the second model in interpolation
vocab_file: File with list of vocab to limit second model to.

Note

You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call.

Fine-tuning Procedure

The fine-tuning procedure and logs look similar to those described in the Model Training section, but there are two progress bars: one for training the new model and one for interpolation.

Model Evaluation

The following is an example spec to evaluate the pre-trained model:

test_ds:
  data_dir: ???

restore_from: ???

Parameter	Data Type	Default	Description
restore_from	string	–	The path to the pre-trained model
test_ds.data_dir	string	–	The path to the data converted to the specified format

Use the following command to evaluate the model:

!tao n_gram evaluate \
    -e /specs/lm/n_gram/evaluate.yaml \
    test_ds.data_dir=PATH_TO_DATA \
    restore_from=PATH_TO_MODEL

Required Arguments for Evaluation

-e: The experiment specification file to set up evaluation
test_ds.data_dir: Path to the pre-processed data to run evaluation on
restore_from: Path to the pre-trained model.

Evaluation Procedure

After the previously trained model is initialized, it is evaluated against the provided test set. You should see the metric computed over the whole dataset, as shown.

...
Loaded 3-gram model from /results/n_gram/train/checkpoints/train_n_gram.arpa

Perplexity calculated from the 3-gram model is 474.87323201470394

Model Inference

To run inference on the model, specify the list of examples in the spec:

input_batch:
    - 'set alarm for seven thirty am'
    - 'lower volume by fifty percent'
    - 'what is my schedule for tomorrow'

In the output, for each input query, the log-likelihoods of each N-Gram will be printed, as shown in the following example:

Sentence: set alarm for seven thirty am
Log likelihood: -24.213985443115234
Perplexity: 2878.3312814538162

Enumerate all n-grams from model outputs:
-4.188663482666016 1: set
-4.425830841064453 1: alarm
-2.1986284255981445 1: for
-3.727583646774292 2: for seven
-3.8963475227355957 1: thirty
-4.2834343910217285 1: am
-1.4934978485107422 2: am </s>

Use the following command to run inference:

!tao n_gram infer \
    -e /specs/lm/n_gram/infer.yaml \
    -m trained-model.tlt

Required Arguments for Inference

-e: The experiment specification file to set up inference. The spec must contain the input_batch list of examples.
-m: The path to the pre-trained model checkpoint from which to infer. The file should have a .tlt extension.

Inference Procedure

After the trained model is loaded, it runs on the provided set of query examples and displays the total log-likelihood, the total perplexity, and the individual log-likelihood of each n-gram.

...
Loaded 3-gram model from /results/n_gram/train/checkpoints/train_n_gram.arpa

Sentence: set alarm for seven thirty am
Log likelihood: -24.213985443115234
Perplexity: 2878.3312814538162

Enumerate all n-grams from model outputs:
-4.188663482666016 1: set
-4.425830841064453 1: alarm
-2.1986284255981445 1: for
-3.727583646774292 2: for seven
-3.8963475227355957 1: thirty
-4.2834343910217285 1: am
-1.4934978485107422 2: am </s>

Sentence: lower volume by fifty percent
Log likelihood: -22.31029510498047
Perplexity: 5228.565073265182

Enumerate all n-grams from model outputs:
-4.880340576171875 1: lower
-4.325512409210205 1: volume
-2.352994680404663 1: by
-4.150567054748535 1: fifty
-5.198777198791504 1: percent
    "percent" is an OOV
-1.4021024703979492 1: </s>

Sentence: what is my schedule for tomorrow
Log likelihood: -19.649606704711914
Perplexity: 641.3375554793575

Enumerate all n-grams from model outputs:
-1.9918817281723022 2: <s> what
-1.2880996465682983 3: <s> what is
-1.9683064222335815 3: what is my
-5.349557876586914 1: schedule
    "schedule" is an OOV
-2.0828189849853516 1: for
-5.479092121124268 1: tomorrow
-1.4898509979248047 1: </s>
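The numbers in this sample output appear consistent with base-10 log-likelihoods and a perplexity normalized over the number of words plus the end-of-sentence token. That relationship is inferred from the sample values rather than stated by the tool, so treat the following Python sketch as a way to sanity-check output, not as the tool's definition. The function name is hypothetical.

```python
def perplexity(log10_likelihood, num_words):
    """Perplexity from a total log10 likelihood; </s> counts as one token."""
    return 10 ** (-log10_likelihood / (num_words + 1))

# Scores copied from the first sentence in the sample output above.
scores = [-4.188663482666016, -4.425830841064453, -2.1986284255981445,
          -3.727583646774292, -3.8963475227355957, -4.2834343910217285,
          -1.4934978485107422]
total = sum(scores)
ppl = perplexity(-24.213985443115234, num_words=6)
print(total, ppl)
```

The summed n-gram scores can be compared with the reported log likelihood, and `ppl` with the reported perplexity for the same sentence.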

Model Export

The following is an example of the spec file for model export:

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.

# Name of the .arpa EFF archive to be loaded/model to be exported.
restore_from: "/results/pretrain_model.tlt"

# Output EFF archive containing binary.
export_to: exported-model.riva

# Data structure to use (default is probing)
binary_type: "probing"

# probability bits (quantization)
binary_q_bits: 0

# backoff bits (quantization)
binary_b_bits: 0

# pointer compression
binary_a_bits: 0

Parameter	Data Type	Default	Description
restore_from	string	–	The path to the pre-trained model
export_to	string	exported-model.riva	The path to the exported model
binary_type	string	probing	Data structure for binary
binary_q_bits	int	0	Probability bits (quantization)
binary_b_bits	int	0	Back off bits (quantization)
binary_a_bits	int	0	Pointer compression

To export a pre-trained model, run the following:

### For export to Riva
!tao n_gram export \
    -e /specs/lm/n_gram/export.yaml \
    -m finetuned-model.tlt \
    export_to=PATH_TO_MODEL \
    binary_type=probing \
    -k $KEY

Required Arguments for Export

-e: The experiment specification file to set up export
export_to: The path to the Riva file to export to
-m: The path to the pre-trained model checkpoint to export. The file should have a .tlt extension.
-k: The encryption key

Model Deployment

You can use the Riva framework to deploy the trained model at runtime. For more details, refer to the Riva documentation.