N-Gram Language Model

NVIDIA TAO Toolkit (Latest Release)

Language models estimate the probability distribution of sequences of words. In general this is a large task, because sequences can be arbitrarily long, so it is often assumed that the probability of a word depends only on the N-1 words preceding it. This is known as an N-Gram language model. An N-Gram model of order N stores the counts of all word sequences observed in the training data, from length one (unigrams) up to length N. During inference, if a queried N-gram was not seen during training, the model backs off: it uses the probability of the sequence formed by the last N-1 words, weighted by a calculated backoff probability.
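The backoff rule can be sketched in a few lines of Python. This is a minimal illustration and not the toolkit's implementation: the `logprob` and `backoff` dictionaries and the `score` function are hypothetical names, and the log10 values are invented for the example. It follows the convention of ARPA-format models, where an unseen N-gram is scored as the backoff weight of its context plus the score of the shortened N-gram.

```python
# Toy backoff lookup. All values are invented log10 numbers for illustration.
logprob = {
    ("set",): -4.2,
    ("alarm",): -4.4,
    ("set", "alarm"): -1.0,
    ("<unk>",): -6.0,
}
backoff = {
    ("set",): -0.5,    # penalty applied when backing off from context "set"
    ("alarm",): -0.3,
}

def score(ngram):
    """Return log10 P(last word | preceding words) using backoff."""
    ngram = tuple(ngram)
    if ngram in logprob:
        return logprob[ngram]
    if len(ngram) == 1:
        # Unknown word: fall back to the <unk> entry.
        return logprob[("<unk>",)]
    context = ngram[:-1]
    # Contexts without a stored backoff weight contribute 0 (a weight of 1).
    return backoff.get(context, 0.0) + score(ngram[1:])

print(score(["set", "alarm"]))   # seen bigram, used directly
print(score(["alarm", "set"]))   # unseen bigram: backoff("alarm") + P("set")
```

Because the scores are logarithms, the backoff weight is added rather than multiplied.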

Before proceeding, download the sample spec files needed for the rest of the subtasks.

tao n_gram download_specs -r /results/n_gram/default_specs/ \
    -o /specs/lm/n_gram/

Before training, the dataset must be converted to the required data format: a single text file. For a word-based N-Gram model, it is recommended to convert all letters to lowercase, remove any undesired punctuation, replace numerals with words, and replace umlauts. The provided dataset conversion tool handles these steps.

To convert your data, use the dataset_convert utility, which implements the recommended preprocessing steps above. It takes a directory and processes all files with the specified extension in that directory into a single text file. If your data follows your own annotation format, you can write your own converter instead; the resulting data must still be contained in a single file.

!tao n_gram dataset_convert \
    source_data_dir=SOURCE_DATA_DIR \
    extension='file_extension' \
    target_data_dir=TARGET_DATA_DIR

  • source_data_dir: The directory location of your dataset

  • extension: The extension of the files in the source directory to process

  • target_data_file: The directory location where the converted dataset should be saved
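If you write your own converter, the recommended preprocessing can be approximated as follows. This is a rough sketch and not the dataset_convert implementation: `normalize_line` and `convert_directory` are hypothetical helpers, the umlaut transliteration table is an assumption, and digits are spelled out one at a time as a simplification of replacing numerals with words.

```python
import re
from pathlib import Path

# Assumed transliteration table; the toolkit's own mapping may differ.
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize_line(line):
    """Lowercase, transliterate umlauts, spell out digits, drop punctuation."""
    line = line.lower().translate(UMLAUTS)
    # Simplification: spell each digit separately ("42" -> "four two").
    line = re.sub(r"\d", lambda m: f" {DIGITS[int(m.group())]} ", line)
    # Keep lowercase letters, apostrophes and spaces only.
    line = re.sub(r"[^a-z' ]+", " ", line)
    return " ".join(line.split())

def convert_directory(source_dir, extension, target_file):
    """Merge every *.<extension> file under source_dir into one text file."""
    with open(target_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).rglob(f"*.{extension}")):
            for line in path.read_text(encoding="utf-8").splitlines():
                cleaned = normalize_line(line)
                if cleaned:
                    out.write(cleaned + "\n")

print(normalize_line("Set alarm for 7:30 AM, bitte schön!"))
```

Whatever converter you use, the output should be one sentence per line in a single file, since that is the format the training step expects.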

The following is an example of the training spec file, train.yaml. You can change any of these parameters and pass them to the train command.

model:
  intermediate: True
  order: 2
  pruning:
    - 0

training_ds:
  is_tarred: false
  is_file: true
  data_dir: ???

validation_ds:
  is_tarred: false
  is_file: true
  data_dir: ???

vocab_file: ""
encryption_key: "tlt_encode"

| Parameter | Data Type | Default | Description |
| --- | --- | --- | --- |
| training_ds.data_dir | string | | Path to dataset file. |
| model.order | int | | Order of N-Gram model (maximum number of grams) |
| vocab_file | string | | Optional path to vocab file to limit vocabulary learned by model. |
| model.intermediate | boolean | true | Choose from [true,false]. If True, creates intermediate file - required for finetune and interpolate |
| model.pruning | list[int] | [0] | Prune grams with counts less than or equal to threshold provided for each gram. Non-decreasing. Starts with 0 |
| export_to | string | | The path to the trained .tlt model |

The following is an example of the command for training the model:

!tao n_gram train -e /specs/lm/n_gram/train.yaml \
    training_ds.data_dir=PATH_TO_DATA \
    model.order=4 \
    model.pruning=[0,1,1,3] \
    -k $KEY

Required Arguments for Training

  • -e: The experiment-specification file to set up training

  • model.order: Model order

  • training_ds.data_dir: The dataset directory

  • -k: The encryption key

Optional Arguments

  • model.intermediate: If true, saves intermediate file format as well

  • model.pruning: List of pruning thresholds for each gram order, ascending in order. Must be non-decreasing and start with 0.
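The constraints on model.pruning and the effect of the thresholds can be illustrated with a short Python sketch. It is not the toolkit's code: `check_pruning` and `prune` are hypothetical helpers, the counts come from a toy sentence, and reusing the last threshold for higher orders is an assumption made only to keep the example short.

```python
from collections import Counter

def check_pruning(thresholds):
    """Raise ValueError unless thresholds start at 0 and never decrease."""
    if not thresholds or thresholds[0] != 0:
        raise ValueError("pruning must start with 0")
    if any(b < a for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("pruning must be non-decreasing")

def prune(counts, thresholds):
    """Drop n-grams whose count is <= the threshold for their order."""
    check_pruning(thresholds)
    kept = {}
    for ngram, count in counts.items():
        order = len(ngram)
        # Assumption: orders beyond the list reuse the last threshold.
        threshold = thresholds[min(order, len(thresholds)) - 1]
        if count > threshold:
            kept[ngram] = count
    return kept

# Count unigrams and bigrams in a toy corpus.
words = "set alarm for seven set alarm for eight".split()
counts = Counter()
for n in (1, 2):
    counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

print(prune(counts, [0, 1]))
```

With thresholds [0, 1], every unigram is kept, while bigrams seen only once are dropped.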

Training Procedure

At the start of training, TAO Toolkit will print out a log of the experiment specification, a summary of the training dataset, and the model parameters.

As the model starts training, you will see a progress bar. At the end of training, TAO Toolkit will save the model ARPA to the results directory.

Fine-tuning continues from a previously trained model by training a second model on new data and interpolating it with the original model. Fine-tuning is rarely needed for this task, since training is fast and a trained model is specific to its dataset. Fine-tuning also requires that the original model was trained with intermediate enabled, and a fine-tuned model cannot be fine-tuned again.
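To give an intuition for the interpolation weight, the sketch below mixes two toy unigram distributions by plain linear interpolation. This is an assumption made for illustration only: the toolkit may combine the models differently (for example, log-linearly), and the `interpolate` function and the probabilities are hypothetical.

```python
def interpolate(p_original, p_new, weight):
    """Linear mix: weight goes to the new model, the rest to the original."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    words = set(p_original) | set(p_new)
    return {
        w: (1.0 - weight) * p_original.get(w, 0.0) + weight * p_new.get(w, 0.0)
        for w in words
    }

# Invented unigram distributions over a shared three-word vocabulary.
original = {"alarm": 0.5, "volume": 0.3, "schedule": 0.2}
new = {"alarm": 0.1, "volume": 0.1, "schedule": 0.8}

mixed = interpolate(original, new, weight=0.5)
print(mixed)
```

A weight close to 1 makes the result follow the new data, while a weight close to 0 keeps it near the original model.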

The following is an example spec for fine-tuning of the model:

# Spec for new model trained
model:
  order: 2
  pruning:
    - 0
    - 1

# Spec for finetuning dataset
tuning_ds:
  is_tarred: false
  is_file: true
  data_dir: "/data/data.txt"

# Weight of the second model in interpolation
weight: 0.5

restore_from: "/results/prev_model.tlt"
vocab_file: ""
encryption_key: "tlt_encode"

| Parameter | Data Type | Default | Description |
| --- | --- | --- | --- |
| restore_from | string | | The path to the pre-trained model (tlt with intermediate or kenlm_intermediate + vocab) |
| weight | string | | Weight of the new model in interpolation |
| tuning_ds.data_dir | string | | The path to the data that is converted to the specified format |
| model.order | int | | Order of N-Gram model (maximum number of grams) |
| vocab_file | string | | Optional path to vocab file to limit vocabulary learned by model. |
| model.intermediate | boolean | true | Choose from [true,false]. If True, creates intermediate file - required for finetune and interpolate |
| model.pruning | list[int] | [0] | Prune grams with counts less than or equal to threshold provided for each gram. Non-decreasing. Starts with 0 |
| export_to | string | | The path to the finetuned model (Cannot be finetuned again) |

Use the following command to fine-tune the model:

!tao n_gram finetune \
    -e /specs/lm/n_gram/finetune.yaml \
    restore_from=PATH_TO_MODEL \
    tuning_ds.data_dir=PATH_TO_DATA \
    model.order=3 \
    weight=0.5 \
    -k $KEY

Required Arguments for Fine-tuning

  • -e: The experiment specification file to set up fine-tuning

  • tuning_ds.data_dir: The dataset directory

  • restore_from: The model to restore from

  • model.order: Order of second model

  • -k: The encryption key

Optional Arguments

  • weight: Weight of the second model in interpolation

  • vocab_file: File with list of vocab to limit second model to.

Note

You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call.


Fine-tuning Procedure

The fine-tuning procedure and logs look similar to those described under Training Procedure, except that there are two progress bars: one for training the new model and one for interpolation.

The following is an example spec to evaluate the pre-trained model:

test_ds:
  data_dir: ???

restore_from: ???

| Parameter | Data Type | Default | Description |
| --- | --- | --- | --- |
| restore_from | string | | The path to the pre-trained model |
| test_ds.data_dir | string | | The path to the data converted to the specified format |

Use the following command to evaluate the model:

!tao n_gram evaluate \
    -e /specs/lm/n_gram/evaluate.yaml \
    test_ds.data_dir=PATH_TO_DATA \
    restore_from=PATH_TO_MODEL

Required Arguments for Evaluation

  • -e: The experiment specification file to set up evaluation

  • test_ds.data_dir: Path to the pre-processed data to run evaluation on

  • restore_from: Path to the pre-trained model.

Evaluation Procedure

After the previously trained model is initialized, it is evaluated against the provided test set. You should see the perplexity computed over the entire test set, as shown below.

...
Loaded 3-gram model from /results/n_gram/train/checkpoints/train_n_gram.arpa
Perplexity calculated from the 3-gram model is 474.87323201470394


To run inference on the model, specify the list of examples in the spec:

input_batch:
  - 'set alarm for seven thirty am'
  - 'lower volume by fifty percent'
  - 'what is my schedule for tomorrow'

In the output, the log-likelihood of each N-Gram is printed for each input query, as shown in the following example:

Sentence: set alarm for seven thirty am
Log likelihood: -24.213985443115234
Perplexity: 2878.3312814538162
Enumerate all n-grams from model outputs:
-4.188663482666016 1: set
-4.425830841064453 1: alarm
-2.1986284255981445 1: for
-3.727583646774292 2: for seven
-3.8963475227355957 1: thirty
-4.2834343910217285 1: am
-1.4934978485107422 2: am </s>

Use the following command to run inference:

!tao n_gram infer \
    -e /specs/lm/n_gram/infer.yaml \
    -m trained-model.tlt

Required Arguments for Inference

  • -e: The experiment specification file to set up inference. This requires the input_batch field with the list of examples to run inference on.

  • -m: The path to the pre-trained model checkpoint from which to infer. The file should have a .tlt extension.

Inference Procedure

After the trained model is loaded, it runs on the provided set of query examples and displays the total log-likelihood, the total perplexity, and the individual log-likelihood of each n-gram.

...
Loaded 3-gram model from /results/n_gram/train/checkpoints/train_n_gram.arpa

Sentence: set alarm for seven thirty am
Log likelihood: -24.213985443115234
Perplexity: 2878.3312814538162
Enumerate all n-grams from model outputs:
-4.188663482666016 1: set
-4.425830841064453 1: alarm
-2.1986284255981445 1: for
-3.727583646774292 2: for seven
-3.8963475227355957 1: thirty
-4.2834343910217285 1: am
-1.4934978485107422 2: am </s>

Sentence: lower volume by fifty percent
Log likelihood: -22.31029510498047
Perplexity: 5228.565073265182
Enumerate all n-grams from model outputs:
-4.880340576171875 1: lower
-4.325512409210205 1: volume
-2.352994680404663 1: by
-4.150567054748535 1: fifty
-5.198777198791504 1: percent
    "percent" is an OOV
-1.4021024703979492 1: </s>

Sentence: what is my schedule for tomorrow
Log likelihood: -19.649606704711914
Perplexity: 641.3375554793575
Enumerate all n-grams from model outputs:
-1.9918817281723022 2: <s> what
-1.2880996465682983 3: <s> what is
-1.9683064222335815 3: what is my
-5.349557876586914 1: schedule
    "schedule" is an OOV
-2.0828189849853516 1: for
-5.479092121124268 1: tomorrow
-1.4898509979248047 1: </s>
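The figures in this log can be cross-checked by hand. The Python sketch below assumes that the per-n-gram scores are base-10 log probabilities and that perplexity is 10 raised to the negative mean score per token, counting the end-of-sentence token; this relationship is inferred from the sample numbers and is not stated by the toolkit. The variable names are illustrative.

```python
# Per-token log10 scores copied from the first sentence in the log above.
token_scores = [
    -4.188663482666016,
    -4.425830841064453,
    -2.1986284255981445,
    -3.727583646774292,
    -3.8963475227355957,
    -4.2834343910217285,
    -1.4934978485107422,
]

# The sentence log-likelihood is the sum of the per-token scores.
log_likelihood = sum(token_scores)

# Perplexity: 10 raised to the negative mean log10 score per token,
# with the end-of-sentence token </s> included in the token count.
perplexity = 10 ** (-log_likelihood / len(token_scores))

print(f"Log likelihood: {log_likelihood:.4f}")
print(f"Perplexity: {perplexity:.2f}")
```

Under these assumptions the computed values agree with the reported log-likelihood and perplexity up to floating-point rounding, and the same check works for the other two sentences.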


The following is an example of the spec file for model export:

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.

# Name of the .arpa EFF archive to be loaded/model to be exported.
restore_from: "/results/pretrain_model.tlt"

# Output EFF archive containing binary.
export_to: exported-model.riva

# Data structure to use (default is probing)
binary_type: "probing"

# probability bits (quantization)
binary_q_bits: 0

# backoff bits (quantization)
binary_b_bits: 0

# pointer compression
binary_a_bits: 0

| Parameter | Data Type | Default | Description |
| --- | --- | --- | --- |
| restore_from | string | | The path to the pre-trained model |
| export_to | string | exported-model.riva | The path to the exported model |
| binary_type | string | probing | Data structure for binary |
| binary_q_bits | int | 0 | Probability bits (quantization) |
| binary_b_bits | int | 0 | Back off bits (quantization) |
| binary_a_bits | int | 0 | Pointer compression |

To export a pre-trained model, run the following:

### For export to Riva
!tao n_gram export \
    -e /specs/lm/n_gram/export.yaml \
    -m finetuned-model.tlt \
    export_to=PATH_TO_MODEL \
    binary_type=probing \
    -k $KEY

Required Arguments for Export

  • -e: The experiment specification file to set up export

  • export_to: The path to the Riva file to export to

  • -m: The path to the pre-trained model checkpoint to export. The file should have a .tlt extension.

  • -k: The encryption key

You can use the Riva framework to deploy the trained model at runtime. For more details, refer to the Riva documentation.

© Copyright 2023, NVIDIA.. Last updated on Sep 5, 2023.