N-Gram Language Model

Language models estimate the probability distribution of sequences of words. Modeling sequences of arbitrary length in full is intractable, so it is often assumed that the probability of a word depends only on the N-1 words preceding it. This is known as an N-Gram language model. An N-Gram model of order N stores the counts of all word sequences observed in the training data, from length one (unigrams) up to length N. During inference, if a queried N-gram sequence was not seen during training, the model backs off to the probability of the last N-1 words, weighted by a calculated backoff probability.
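As a minimal illustration of counting and backing off, the sketch below uses a simplified "stupid backoff" with a fixed weight, rather than the calculated backoff weights a full toolkit computes:

```python
from collections import Counter

# Toy training corpus; a real model would be trained on far more text.
tokens = "set alarm for seven set alarm for eight".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
total = len(tokens)

def prob(word, prev, backoff=0.4):
    """P(word | prev): use the bigram estimate when observed,
    otherwise back off to a weighted unigram probability."""
    if (prev, word) in bigrams:
        return bigrams[(prev, word)] / unigrams[prev]
    return backoff * unigrams[word] / total

print(prob("alarm", "set"))    # seen bigram: count ratio
print(prob("eight", "seven"))  # unseen bigram: backs off to the unigram
```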

Downloading Sample Spec files

Before proceeding, let’s download the sample spec files needed for the rest of the subtasks.

tao n_gram download_specs -r /results/n_gram/default_specs/ \
                          -o /specs/lm/n_gram/

Data Format

When training the model, the dataset should first be converted to the required data format: a single text file. For a word-based N-Gram model, it is recommended to convert all letters to lowercase, remove undesired punctuation, replace numerals with words, and replace umlauts. These steps are handled by the provided dataset conversion tool.
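A sketch of these preprocessing steps (the function name and exact rules are illustrative, not the tool's internals; digits are spelled out one at a time as a simplification of full number expansion):

```python
import re

DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def normalize(line: str) -> str:
    # Lowercase and replace umlauts with ASCII equivalents.
    line = line.lower().translate(UMLAUTS)
    # Spell out each digit as a word.
    line = re.sub(r"\d", lambda m: f" {DIGIT_WORDS[m.group()]} ", line)
    # Drop everything that is not a letter or whitespace.
    line = re.sub(r"[^a-z\s]", " ", line)
    return " ".join(line.split())

print(normalize("Set 2 alarms, über-fast!"))  # set two alarms ueber fast
```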

Dataset Conversion

To convert a dataset to the model's data format, use the dataset_convert utility, which implements the recommended preprocessing steps mentioned above. It takes a directory and processes all files with a specified extension within it into a single text file. Alternatively, you can write your own converter for the annotation format you are using; the resulting data must be contained in a single file.

!tao n_gram dataset_convert \
   source_data_dir=`source_data_dir` \
   extension=`file_extension` \
   target_data_dir=`target_data_dir`
  • source_data_dir: The directory location of your dataset

  • extension: The file extension of the dataset files to process

  • target_data_dir: The directory location where the converted dataset should be saved
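If you write your own converter, a minimal stand-in for this step might look like the following (the helper name is hypothetical; only the one-file-out contract matters):

```python
from pathlib import Path

def convert_dir(source_data_dir: str, extension: str, target_file: str) -> int:
    """Concatenate every file with the given extension under
    source_data_dir into a single text file; returns lines written."""
    lines_written = 0
    with open(target_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_data_dir).rglob(f"*.{extension}")):
            for line in path.read_text(encoding="utf-8").splitlines():
                out.write(line.strip() + "\n")
                lines_written += 1
    return lines_written
```

Any preprocessing (lowercasing, punctuation removal, and so on) would be applied to each line before writing it.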

Model Training

The following is an example train.yaml config spec for training. You can change any of these parameters and pass them to the train command.

model:
  intermediate: True
  order: 2
  pruning:
    - 0
training_ds:
  is_tarred: false
  is_file: true
  data_dir: ???

validation_ds:
  is_tarred: false
  is_file: true
  data_dir: ???

vocab_file: ""
encryption_key: "tlt_encode"

Parameter            | Data Type | Default | Description
---------------------|-----------|---------|------------------------------------------
training_ds.data_dir | string    |         | Path to the dataset file
model.order          | int       |         | Order of the N-Gram model (maximum number of grams)
vocab_file           | string    |         | Optional path to a vocab file to limit the vocabulary learned by the model
model.intermediate   | boolean   | true    | If true, creates an intermediate file, which is required for finetune and interpolate
model.pruning        | list[int] | [0]     | Prune grams with counts less than or equal to the threshold provided for each gram order; must be non-decreasing and start with 0
export_to            | string    |         | The path to the trained .tlt model

The following is an example of the command for training the model:

!tao n_gram train -e /specs/lm/n_gram/train.yaml \
                     training_ds.data_dir=PATH_TO_DATA \
                     model.order=4 \
                     model.pruning=[0,1,1,3]  \
                     -k $KEY

Required Arguments for Training

  • -e: The experiment-specification file to set up training

  • model.order: Model order

  • training_ds.data_dir: The dataset directory

  • -k: The encryption key

Optional Arguments

  • model.intermediate: If true, saves intermediate file format as well

  • model.pruning: List of pruning thresholds, one per gram order. Must be non-decreasing and start with 0.
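To illustrate what the thresholds mean: with model.pruning=[0,1,1,3], an order-4 model keeps all unigrams, drops bigrams and trigrams seen at most once, and drops 4-grams seen at most three times. A sketch of this rule over hypothetical counts:

```python
# Grams of order k are pruned when their count <= thresholds[k - 1].
thresholds = [0, 1, 1, 3]
counts = {
    ("set",): 1,                          # unigram: kept (1 > 0)
    ("set", "alarm"): 1,                  # bigram: pruned (1 <= 1)
    ("set", "alarm", "for"): 2,           # trigram: kept (2 > 1)
    ("set", "alarm", "for", "seven"): 3,  # 4-gram: pruned (3 <= 3)
}

kept = {g: c for g, c in counts.items() if c > thresholds[len(g) - 1]}
print(sorted(len(g) for g in kept))  # orders of the surviving grams
```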

Training Procedure

At the start of training, TAO Toolkit will print out a log of the experiment specification, a summary of the training dataset, and the model parameters.

As the model starts training, you will see a progress bar. At the end of training, TAO Toolkit will save the trained model in ARPA format to the results directory.
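The ARPA format is a standard plain-text representation of back-off language models (also produced by tools such as SRILM and KenLM). An illustrative fragment with made-up values:

```
\data\
ngram 1=4
ngram 2=2

\1-grams:
-1.2041200	set	-0.3010300
-1.2041200	alarm	-0.3010300
...

\2-grams:
-0.3010300	set alarm

\end\
```

Each line lists a base-10 log probability, the n-gram itself, and (where applicable) a base-10 log backoff weight.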

Model Fine-tuning

The fine-tuning process continues training from a previously trained model by training a second model on new data and interpolating it with the original model. In practice, fine-tuning is rarely needed for this task, since training is fast and a trained model is specific to its dataset. Fine-tuning also requires that the original model was trained with model.intermediate enabled. A fine-tuned model cannot be fine-tuned again.
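Conceptually, the interpolation blends the two models' probability estimates using the configured weight. A sketch of the idea on single probabilities (the toolkit operates on whole ARPA models, and the values here are made-up placeholders):

```python
def interpolate(p_original: float, p_new: float, weight: float = 0.5) -> float:
    """Linear interpolation: weight is the share given to the new model."""
    return (1 - weight) * p_original + weight * p_new

# With weight=0.5 the result is the average of the two estimates.
print(interpolate(0.10, 0.30, weight=0.5))
```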

The following is an example spec for fine-tuning of the model:

# Spec for new model trained
model:
  order: 2
  pruning:
    - 0
    - 1

# Spec for finetuning dataset
tuning_ds:
  is_tarred: false
  is_file: true
  data_dir: "/data/data.txt"

# Weight of the second model in interpolation
weight: 0.5
restore_from: "/results/prev_model.tlt"
vocab_file: ""
encryption_key: "tlt_encode"

Parameter          | Data Type | Default | Description
-------------------|-----------|---------|------------------------------------------
restore_from       | string    |         | The path to the pre-trained model (.tlt with intermediate, or kenlm_intermediate + vocab)
weight             | float     |         | Weight of the new model in the interpolation
tuning_ds.data_dir | string    |         | The path to the data converted to the specified format
model.order        | int       |         | Order of the N-Gram model (maximum number of grams)
vocab_file         | string    |         | Optional path to a vocab file to limit the vocabulary learned by the model
model.intermediate | boolean   | true    | If true, creates an intermediate file, which is required for finetune and interpolate
model.pruning      | list[int] | [0]     | Prune grams with counts less than or equal to the threshold provided for each gram order; must be non-decreasing and start with 0
export_to          | string    |         | The path to the fine-tuned model (cannot be fine-tuned again)

Use the following command to fine-tune the model:

!tao n_gram finetune \
            -e /specs/lm/n_gram/finetune.yaml \
            restore_from=PATH_TO_MODEL \
tuning_ds.data_dir=PATH_TO_DATA \
            model.order=3 \
            weight=0.5 \
            -k $KEY

Required Arguments for Fine-tuning

  • -e: The experiment specification file to set up fine-tuning

  • tuning_ds.data_dir: The dataset directory

  • restore_from: The model to restore from

  • model.order: Order of second model

  • -k: The encryption key

Optional Arguments

  • weight: Weight of the second model in interpolation

  • vocab_file: A file with the list of vocabulary to limit the second model to

Note

You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call.

Fine-tuning Procedure

The fine-tuning procedure and logs look similar to those described in the Model Training section, but with two progress bars (one for training the new model, and one for interpolation).

Model Evaluation

The following is an example spec to evaluate the pre-trained model:

test_ds:
  data_dir: ???

restore_from: ???

Parameter        | Data Type | Default | Description
-----------------|-----------|---------|----------------------------------
restore_from     | string    |         | The path to the pre-trained model
test_ds.data_dir | string    |         | The path to the data converted to the specified format

Use the following command to evaluate the model:

!tao n_gram evaluate \
-e /specs/lm/n_gram/evaluate.yaml \
            test_ds.data_dir=PATH_TO_DATA \
            restore_from=PATH_TO_MODEL

Required Arguments for Evaluation

  • -e: The experiment specification file to set up evaluation

  • test_ds.data_dir: Path to the pre-processed data to run evaluation on

  • restore_from: Path to the pre-trained model.

Evaluation Procedure

After the previously trained model is initialized, evaluation runs against the provided test set. You should see metrics for the cumulative dataset, as shown.

...
Loaded 3-gram model from /results/n_gram/train/checkpoints/train_n_gram.arpa

Perplexity calculated from the 3-gram model is 474.87323201470394

Model Inference

To run inference on the model, specify the list of examples in the spec:

input_batch:
    - 'set alarm for seven thirty am'
    - 'lower volume by fifty percent'
    - 'what is my schedule for tomorrow'

In the output, for each input query, the log-likelihood of each n-gram is printed, as shown in the following example:

Sentence: set alarm for seven thirty am
Log likelihood: -24.213985443115234
Perplexity: 2878.3312814538162

Enumerate all n-grams from model outputs:
-4.188663482666016 1: set
-4.425830841064453 1: alarm
-2.1986284255981445 1: for
-3.727583646774292 2: for seven
-3.8963475227355957 1: thirty
-4.2834343910217285 1: am
-1.4934978485107422 2: am </s>

Use the following command to run inference:

!tao n_gram infer \
            -e /specs/lm/n_gram/infer.yaml \
            -m trained-model.tlt
Required Arguments for Inference

  • -e: The experiment specification file to set up inference; this requires the input_batch of query examples

  • -m: The path to the pre-trained model checkpoint from which to infer. The file should have a .tlt extension.

Inference Procedure

After the trained model is loaded, it runs on the provided set of query examples and displays the total log-likelihood, total perplexity, and the individual log-likelihood of each n-gram.

...
Loaded 3-gram model from /results/n_gram/train/checkpoints/train_n_gram.arpa

Sentence: set alarm for seven thirty am
Log likelihood: -24.213985443115234
Perplexity: 2878.3312814538162

Enumerate all n-grams from model outputs:
-4.188663482666016 1: set
-4.425830841064453 1: alarm
-2.1986284255981445 1: for
-3.727583646774292 2: for seven
-3.8963475227355957 1: thirty
-4.2834343910217285 1: am
-1.4934978485107422 2: am </s>

Sentence: lower volume by fifty percent
Log likelihood: -22.31029510498047
Perplexity: 5228.565073265182

Enumerate all n-grams from model outputs:
-4.880340576171875 1: lower
-4.325512409210205 1: volume
-2.352994680404663 1: by
-4.150567054748535 1: fifty
-5.198777198791504 1: percent
    "percent" is an OOV
-1.4021024703979492 1: </s>

Sentence: what is my schedule for tomorrow
Log likelihood: -19.649606704711914
Perplexity: 641.3375554793575

Enumerate all n-grams from model outputs:
-1.9918817281723022 2: <s> what
-1.2880996465682983 3: <s> what is
-1.9683064222335815 3: what is my
-5.349557876586914 1: schedule
    "schedule" is an OOV
-2.0828189849853516 1: for
-5.479092121124268 1: tomorrow
-1.4898509979248047 1: </s>
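The perplexity reported for each sentence follows directly from its base-10 log likelihood: perplexity = 10^(-LL / T), where T is the number of scored tokens (here six words plus the end-of-sentence token). A quick check against the first sentence above:

```python
log_likelihood = -24.213985443115234  # from the output above
num_tokens = 7                        # 6 words + </s>

perplexity = 10 ** (-log_likelihood / num_tokens)
print(perplexity)  # close to the reported 2878.33...
```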

Model Export

The following is an example of the spec file for model export:

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.

# Name of the .arpa EFF archive to be loaded/model to be exported.
restore_from: "/results/pretrain_model.tlt"

# Output EFF archive containing binary.
export_to: exported-model.riva

# Data structure to use (default is probing)
binary_type: "probing"

# probability bits (quantization)
binary_q_bits: 0

# backoff bits (quantization)
binary_b_bits: 0

# pointer compression
binary_a_bits: 0

Parameter     | Data Type | Default             | Description
--------------|-----------|---------------------|----------------------------------
restore_from  | string    |                     | The path to the pre-trained model
export_to     | string    | exported-model.riva | The path to the exported model
binary_type   | string    | probing             | Data structure for the binary
binary_q_bits | int       | 0                   | Probability bits (quantization)
binary_b_bits | int       | 0                   | Backoff bits (quantization)
binary_a_bits | int       | 0                   | Pointer compression

To export a pre-trained model, run the following:

### For export to Riva
!tao n_gram export \
           -e /specs/lm/n_gram/export.yaml \
           -m finetuned-model.tlt \
           export_to=PATH_TO_MODEL \
           binary_type=probing \
           -k $KEY

Required Arguments for Export

  • -e: The experiment specification file to set up export; this requires export_to, the path of the .riva file to export to

  • -m: The path to the pre-trained model checkpoint to export. The file should have a .tlt extension.

  • -k: The encryption key

Model Deployment

You can use the Riva framework to deploy the trained model at runtime. For more details, refer to the Riva documentation.