# N-Gram Language Model

Language models estimate the probability distribution of sequences of words. In general this is an intractable task, since sequences can be arbitrarily long, so it is often assumed that the probability of a word depends only on the N-1 words preceding it. This is known as an N-Gram language model. An N-Gram model of order N stores the counts of all word sequences observed in the training data, from length one (unigrams) up to length N. During inference, if a queried N-gram was not seen during training, the query falls back to the probability of the last N-1 words, weighted by a calculated backoff probability.
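To make the counting-and-backoff idea concrete, here is a minimal, self-contained sketch of a bigram model with a constant backoff weight `alpha` ("stupid backoff" style). This is an illustration only, not the TAO implementation: real toolkits such as KenLM compute a per-context backoff weight and smoothed probabilities rather than raw counts.

```python
from collections import Counter

def train_counts(corpus, n=2):
    """Count all grams of length 1..n in a tokenized corpus."""
    counts = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for k in range(1, n + 1):
            for i in range(len(tokens) - k + 1):
                counts[tuple(tokens[i:i + k])] += 1
    return counts

def prob(counts, history, word, alpha=0.4):
    """Score word given history; if the full gram was never seen,
    back off to a shorter history, scaled by the backoff weight."""
    gram = tuple(history) + (word,)
    if counts[gram] > 0 and counts[tuple(history)] > 0:
        return counts[gram] / counts[tuple(history)]
    if history:                          # back off to a shorter context
        return alpha * prob(counts, list(history)[1:], word, alpha)
    total = sum(c for g, c in counts.items() if len(g) == 1)
    return counts[(word,)] / total       # unigram estimate

corpus = [["set", "alarm"], ["set", "timer"], ["lower", "volume"]]
counts = train_counts(corpus)
print(prob(counts, ["set"], "alarm"))    # 0.5: seen bigram
print(prob(counts, ["lower"], "alarm"))  # unseen: alpha * unigram estimate
```

Note that with a constant backoff weight these are scores rather than true probabilities; properly normalized backoff is what the trained model computes.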

Before proceeding, let’s download the sample spec files needed for the rest of the subtasks.


```shell
tao n_gram download_specs -r /results/n_gram/default_specs/ \
-o /specs/lm/n_gram/
```


## Data Format

When training the model, the dataset must first be converted to the required data format: a single text file. For a word-based N-Gram model, it is recommended to convert all text to lowercase, remove any undesired punctuation, replace numerals with words, and replace umlauts. These steps are handled by the provided dataset conversion tool.
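A hypothetical normalizer mirroring these recommended steps might look like the following; the actual `dataset_convert` tool may differ in detail (for example, in how multi-digit numbers are verbalized).

```python
import re

# Illustrative replacement tables; the real tool's mappings may differ.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(line: str) -> str:
    line = line.lower()
    for umlaut, replacement in UMLAUTS.items():
        line = line.replace(umlaut, replacement)
    # Spell out digits one at a time ("7:30" -> "seven three zero").
    line = re.sub(r"\d", lambda m: " " + DIGITS[m.group()] + " ", line)
    line = re.sub(r"[^a-z'\s]", " ", line)    # drop undesired punctuation
    return re.sub(r"\s+", " ", line).strip()  # collapse whitespace

print(normalize("Set Alarm für 7:30!"))  # set alarm fuer seven three zero
```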

## Dataset Conversion

To convert a dataset to the model's data format, use the dataset_convert utility, which implements the recommended preprocessing steps mentioned above. The provided dataset_convert utility takes a directory and processes all files with a specified extension within it into a single text file. You can also write your own converter for the annotation format your data uses; the resulting data should still be contained within a single file.
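A custom converter can be very small. The sketch below (names are illustrative, not the tool's API) merges every file with a given extension under a directory into one text file, one cleaned line at a time; real converters would also apply the normalization steps described above.

```python
from pathlib import Path

def convert_dataset(source_dir: str, extension: str, target_file: str) -> None:
    """Merge all files matching *.extension under source_dir into a
    single text file, stripping and lowercasing each line."""
    with open(target_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).rglob(f"*.{extension}")):
            for line in path.read_text(encoding="utf-8").splitlines():
                line = line.strip().lower()
                if line:
                    out.write(line + "\n")
```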


```shell
!tao n_gram dataset_convert \
source_data_dir=source_data_dir \
extension='file_extension' \
target_data_dir=target_data_dir
```


• source_data_dir: The directory containing your dataset

• extension: The file extension of the files to process

• target_data_dir: The directory where the converted dataset should be saved

## Model Training

The following is an example config spec (train.yaml) for training. You can change any of these parameters and pass them to the train command.


```yaml
model:
  intermediate: True
  order: 2
  pruning:
    - 0

training_ds:
  is_tarred: false
  is_file: true
  data_dir: ???

validation_ds:
  is_tarred: false
  is_file: true
  data_dir: ???

vocab_file: ""
encryption_key: "tlt_encode"
```


| Parameter | Data Type | Default | Description |
|---|---|---|---|
| training_ds.data_dir | string | – | Path to the dataset file |
| model.order | int | – | Order of the N-Gram model (maximum number of grams) |
| vocab_file | string | – | Optional path to a vocab file to limit the vocabulary learned by the model |
| model.intermediate | boolean | true | Choose from [true, false]. If true, creates an intermediate file, which is required for finetune and interpolate |
| model.pruning | list[int] | [0] | Prune grams with counts less than or equal to the threshold provided for each gram. Non-decreasing; starts with 0 |
| export_to | string | – | The path to the trained .tlt model |
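As a hypothetical illustration of how per-order pruning thresholds could be applied (the actual pruning lives inside the toolkit): the i-th threshold governs (i+1)-grams, and any gram with a count less than or equal to its threshold is dropped. With `pruning=[0,1,1,3]`, all observed unigrams are kept, bigrams and trigrams need a count of at least 2, and 4-grams need a count of at least 4.

```python
def prune(ngram_counts, thresholds):
    """Drop any gram whose count is <= the threshold for its order."""
    return {gram: count for gram, count in ngram_counts.items()
            if count > thresholds[len(gram) - 1]}

counts = {
    ("set",): 1,                          # kept: 1 > 0
    ("set", "alarm"): 1,                  # dropped: 1 <= 1
    ("lower", "volume"): 2,               # kept: 2 > 1
    ("what", "is", "my", "schedule"): 3,  # dropped: 3 <= 3
}
kept = prune(counts, [0, 1, 1, 3])
```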

The following is an example of the command for training the model:


```shell
!tao n_gram train -e /specs/nlp/lm/n_gram/train.yaml \
training_ds.data_dir=PATH_TO_DATA \
model.order=4 \
model.pruning=[0,1,1,3]
```


### Required Arguments for Fine-tuning

• -e: The experiment specification file to set up fine-tuning

• tuning_ds.data_dir: The dataset directory

• restore_from: The model to restore from

• model.order: Order of second model

• -k: The encryption key

### Optional Arguments

• weight: The weight of the second model in the interpolation

• vocab_file: A file listing the vocabulary to limit the second model to

Note

You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call.

### Fine-tuning Procedure

The fine-tuning procedure and logs will look similar to those described in the Model Training section, but there will be two progress bars (one for training the new model, and one for interpolation).
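The interpolation step can be pictured as a linear mix of the two models' estimates, with the optional `weight` argument playing the mixing role. This is a hypothetical sketch of the idea; the tool's actual interpolation may differ in detail.

```python
def interpolate(p_base: float, p_new: float, weight: float) -> float:
    """Mix two probability estimates for the same event: `weight` goes
    to the fine-tuned (second) model, the remainder to the base model."""
    return weight * p_new + (1.0 - weight) * p_base

p = interpolate(p_base=0.02, p_new=0.10, weight=0.5)  # ≈ 0.06
```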

## Model Evaluation

The following is an example spec to evaluate the pre-trained model:


```yaml
test_ds:
  data_dir: ???

restore_from: ???
```


| Parameter | Data Type | Default | Description |
|---|---|---|---|
| restore_from | string | – | The path to the pre-trained model |
| test_ds.data_dir | string | – | The path to the data converted to the specified format |

Use the following command to evaluate the model:


```shell
!tao n_gram evaluate \
-e /specs/lm/n_gram/evaluate.yaml \
test_ds.data_dir=PATH_TO_DATA \
restore_from=PATH_TO_MODEL
```


### Required Arguments for Evaluation

• -e: The experiment specification file to set up evaluation

• test_ds.data_dir: The path to the pre-processed data to run evaluation on

• restore_from: Path to the pre-trained model.

### Evaluation Procedure

After the previously trained model is initialized, it will run evaluation against the provided test set. You should see a metric for the cumulative dataset, as shown:


```text
...

Perplexity calculated from the 3-gram model is 474.87323201470394
```


## Model Inference

To run inference on the model, specify the list of examples in the spec:


```yaml
input_batch:
  - 'set alarm for seven thirty am'
  - 'lower volume by fifty percent'
  - 'what is my schedule for tomorrow'
```


In the output, the log-likelihood of each N-Gram is printed for every input query, as shown in the following example:


```text
Sentence: set alarm for seven thirty am
Log likelihood: -24.213985443115234
Perplexity: 2878.3312814538162

Enumerate all n-grams from model outputs:
-4.188663482666016 1: set
-4.425830841064453 1: alarm
-2.1986284255981445 1: for
-3.727583646774292 2: for seven
-3.8963475227355957 1: thirty
-4.2834343910217285 1: am
-1.4934978485107422 2: am </s>
```
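The printed log-likelihoods appear to be base-10 (the convention of KenLM-style models). Under that assumption, the perplexity follows directly from the total log-likelihood and the number of scored tokens, here 6 words plus the `</s>` end token:

```python
# Assuming base-10 log-likelihoods, perplexity = 10 ** (-LL / T),
# where T is the number of scored tokens (6 words + </s> = 7).
log_likelihood = -24.213985443115234
num_tokens = 7
perplexity = 10 ** (-log_likelihood / num_tokens)
print(perplexity)  # ≈ 2878.33, matching the perplexity printed above
```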


Use the following command to run inference:


```shell
!tao n_gram infer \
-e /specs/lm/n_gram/infer.yaml \
-m trained-model.tlt
```


### Required Arguments for Inference

• -e: The experiment specification file to set up inference. This requires the input_batch list of query examples.

• -m: The path to the pre-trained model checkpoint from which to infer. The file should have a .tlt extension.

### Inference Procedure

After the trained model is loaded, it will run on the provided set of query examples and display the total log-likelihood, total perplexity, and individual log-likelihoods of each n-gram:


```text
...

Sentence: set alarm for seven thirty am
Log likelihood: -24.213985443115234
Perplexity: 2878.3312814538162

Enumerate all n-grams from model outputs:
-4.188663482666016 1: set
-4.425830841064453 1: alarm
-2.1986284255981445 1: for
-3.727583646774292 2: for seven
-3.8963475227355957 1: thirty
-4.2834343910217285 1: am
-1.4934978485107422 2: am </s>

Sentence: lower volume by fifty percent
Log likelihood: -22.31029510498047
Perplexity: 5228.565073265182

Enumerate all n-grams from model outputs:
-4.880340576171875 1: lower
-4.325512409210205 1: volume
-2.352994680404663 1: by
-4.150567054748535 1: fifty
-5.198777198791504 1: percent
"percent" is an OOV
-1.4021024703979492 1: </s>

Sentence: what is my schedule for tomorrow
Log likelihood: -19.649606704711914
Perplexity: 641.3375554793575

Enumerate all n-grams from model outputs:
-1.9918817281723022 2: <s> what
-1.2880996465682983 3: <s> what is
-1.9683064222335815 3: what is my
-5.349557876586914 1: schedule
"schedule" is an OOV
-2.0828189849853516 1: for
-5.479092121124268 1: tomorrow
-1.4898509979248047 1: </s>
```


## Model Export

The following is an example of the spec file for model export:


```yaml
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.

# Name of the .arpa EFF archive to be loaded/model to be exported.
restore_from: "/results/pretrain_model.tlt"

# Output EFF archive containing binary.
export_to: exported-model.riva

# Data structure to use (default is probing)
binary_type: "probing"

# probability bits (quantization)
binary_q_bits: 0

# backoff bits (quantization)
binary_b_bits: 0

# pointer compression
binary_a_bits: 0
```


| Parameter | Data Type | Default | Description |
|---|---|---|---|
| restore_from | string | – | The path to the pre-trained model |
| export_to | string | exported-model.riva | The path to the exported model |
| binary_type | string | probing | Data structure for the binary |
| binary_q_bits | int | 0 | Probability bits (quantization) |
| binary_b_bits | int | 0 | Backoff bits (quantization) |
| binary_a_bits | int | 0 | Pointer compression |

To export a pre-trained model, run the following:


```shell
# For export to Riva
!tao n_gram export \
-e /specs/lm/n_gram/export.yaml \
-m finetuned-model.tlt \
export_to=PATH_TO_MODEL \
binary_type=probing \
-k \$KEY
```


### Required Arguments for Export

• -e: The experiment specification file to set up export. This requires export_to, the path to the .riva file to export to.

• -m: The path to the pre-trained model checkpoint from which to infer. The file should have a .tlt extension.

• -k: The encryption key

## Model Deployment

You can use the Riva framework to deploy the trained model at runtime. For more details, refer to the Riva documentation.