Punctuation and Capitalization

Automatic Speech Recognition (ASR) systems typically generate text with no punctuation and capitalization of the words. Besides being hard to read, the ASR output could be an input to named entity recognition, machine translation or text-to-speech models. These models could potentially benefit when the input text contains punctuation and the words are capitalized correctly.

For each word in the input text, the model:

  1. predicts a punctuation mark that should follow the word (if any). The model supports commas, periods, and question marks.

  2. predicts if the word should be capitalized or not.

TAO Toolkit provides a sample notebook to outline the end-to-end workflow on how to train a Punctuation and Capitalization model using TAO Toolkit and deploy it in Riva format on NGC resources.

Before proceeding, download sample spec files that are needed for the rest of the subtasks.

Copy
Copied!
            

tao punctuation_and_capitalization download_specs -r /results/punctuation_and_capitalization/default_specs/ \ -o /specs/nlp/punctuation_and_capitalization

Download Spec Required Arguments

  • -o: Path to where the spec files will be stored

  • -r: Output directory to store logs

After running the above, the spec files would be stored under /specs/nlp/punctuation_and_capitalization and you can modify them locally if this directory is mounted to your local folder in ~/.tao_mounts.json file.

This model can work with any text dataset, although it is recommended to balance the data, especially for the punctuation task.

Before pre-processing the data to the required format, the data should be split into train.txt and dev.txt (and optionally test.txt). The development set (or dev set) will be used to evaluate the performance of the model during model training. The hyper-parameters search and model selection should be based on the dev set, while the final evaluation of the selected model should be performed on the test set.

Each line in the train.txt/dev.txt/test.txt should represent one or more full and/or truncated sentences.

Example of the train.txt/dev.txt file:

Copy
Copied!
            

When is the next flight to New York? The next flight is ... ....

The source_data_dir structure should look like this:

Copy
Copied!
            

. |--sourced_data_dir |-- dev.txt |-- test.txt |-- train.txt

Raw data files from the source_data_dir described above will be converted to the following format with dataset_convert: The training and evaluation data is divided into two files: text.txt and labels.txt. Each line of the text.txt file contains text sequences, where words are separated with spaces, i.e., [WORD] [SPACE] [WORD] [SPACE] [WORD], for example:

Copy
Copied!
            

when is the next flight to new york the next flight is ... ...


The labels.txt file contains corresponding labels for each word in text.txt, the labels are separated with spaces. Each label in labels.txt file consists of two symbols:

  • the first symbol of the label indicates what punctuation mark should follow the word (where O means no punctuation needed)

  • the second symbol determines if a word needs to be capitalized or not (where U indicates that the word should be upper cased, and O - no capitalization needed.)

Punctuation marks considered: commas, periods, and question marks; the rest of the punctuation marks were removed from the data.

Each line of the labels.txt should follow the format: [LABEL] [SPACE] [LABEL] [SPACE] [LABEL] (for labels.txt). For example, labels for the above text.txt file should be:

Copy
Copied!
            

OU OO OO OO OO OO OU ?U OU OO OO OO ... ...


The complete list of all possible labels for this task used in this tutorial is: OO, ,O, .O, ?O, OU, ,U, .U, ?U.

Spec file for dataset conversion:

Copy
Copied!
            

# Path to the folder containing the dataset source files source_data_dir: ??? target_data_dir: ??? # list of file names inside source_data_dir to convert list_of_file_names: ['train.txt','dev.txt']

To pre-process the raw text data, stored under sourced_data_dir (see the Dataset section), run the following command:

Copy
Copied!
            

tao punctuation_and_capitalization dataset_convert [-h] \ -e /specs/nlp/punctuation_and_capitalization/dataset_convert.yaml \ -r /results/punctuation_and_capitalization/dataset_convert/ \ source_data_dir=/path/to/source_data_dir \ target_data_dir=/path/to/target_data_dir

Convert Dataset Required Arguments

  • -e: The experiment specification file.

  • source_data_dir - path to the raw data

  • target_data_dir - path to store the processed files

  • -r: Path to the directory to store logs.

Convert Dataset Optional Arguments

  • -h, --help: Show this help message and exit

  • list_of_file_names: List of files in source_data_dir for conversion

Parameter

Datatype

Default

Description

Supported Values

source_data_dir

string

Path to the dataset source data directory

target_data_dir

string

Path to the dataset target data directory

list_of_file_names

List of strings

[‘train.txt’,’dev.txt’]

List of files for conversion

After the conversion, the target_data_dir should contain the following files:

Copy
Copied!
            

. |--target_data_dir |-- labels_dev.txt |-- labels_test.txt |-- labels_train.txt |-- text_dev.txt |-- text_test.txt |-- text_train.txt

To download and convert a dataset from Tatoeba collection of sentences, run:

Copy
Copied!
            

tao punctuation_and_capitalization download_and_convert_tatoeba [-h] \ -e /specs/nlp/punctuation_and_capitalization/download_and_convert_tatoeba.yaml \ -r /results/punctuation_and_capitalization/download_and_convert_tatoeba/ \ target_data_dir=/path/to/`target_data_dir`

Output log from executing punctuation_and_capitalization download_and_convert_tatoeba:

Copy
Copied!
            

Downloading tatoeba dataset Downloading https://downloads.tatoeba.org/exports/sentences.csv to /path/to/target_data_dir/sentences.csv Saving to: ‘/path/to/target_data_dir/sentences.csv’ Processing English sentences... Splitting the dataset into train and dev sets and creating labels and text files Creating text and label files for training Cleaning up /home/ebakhturina/data/tatoeba/sample/dowdload_and_convert Processing of the tatoeba dataset is complete

After running punctuation_and_capitalization download_and_convert_tatoeba, the target_data_dir should contain the following files:

Copy
Copied!
            

. |--target_data_dir |-- labels_dev.txt # labels for the dev set |-- labels_train.txt # labels for the train set |-- sentences.csv # original Tatoeba data |-- text_dev.txt # text dev data |-- text_train.txt # text train data


Download and Convert Tatoeba Dataset Required Arguments

  • -e: The experiment specification file.

  • target_data_dir - path to store the processed files

Optional Arguments

  • -h, --help: Show this help message and exit

In the Punctuation and Capitalization Model, we are jointly training two token-level classifiers on top of a pre-trained language model, such as BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Unless the user provides a pre-trained checkpoint for the language model, the language model is initialized with the pre-trained model from HuggingFace Transformers.

Example spec for training:

Copy
Copied!
            

trainer: max_epochs: 5 # Path to the Data directory containing pre-processed dataset data_dir: ??? # Specifies parameters for the Punctuation and Capitalization model model: # Lists supported punctuation marks punct_label_ids: O: 0 ',': 1 '.': 2 '?': 3 capit_label_ids: O: 0 U: 1 tokenizer: tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece vocab_file: null # path to vocab file tokenizer_model: null # only used if tokenizer is sentencepiece special_tokens: null # Pre-trained language model such as BERT language_model: pretrained_model_name: bert-base-uncased lm_checkpoint: null config_file: null # json file, precedence over config config: null # Specifies parameters of the punctuation and capitalization heads that follow a BERT-based language-model punct_head: punct_num_fc_layers: 1 fc_dropout: 0.1 activation: 'relu' use_transformer_init: true capit_head: capit_num_fc_layers: 1 fc_dropout: 0.1 activation: 'relu' use_transformer_init: true # Specifies the parameters of the dataset to be used for training. training_ds: text_file: text_train.txt labels_file: labels_train.txt shuffle: true num_samples: -1 # number of samples to be considered, -1 means all the dataset batch_size: 64 # Specifies the parameters of the dataset to be used for validation. validation_ds: text_file: text_dev.txt labels_file: labels_dev.txt shuffle: false num_samples: -1 # number of samples to be considered, -1 means all the dataset batch_size: 64 # The parameters for the training optimizer, including learning rate, lr schedule, etc. optim: name: adam lr: 1e-5 weight_decay: 0.00 sched: name: WarmupAnnealing # Scheduler params warmup_steps: null warmup_ratio: 0.1 last_epoch: -1 # pytorch lightning args monitor: val_loss reduce_on_plateau: false

The specification can be roughly grouped into three categories:

  • Parameters that describe the training process

  • Parameters that describe the datasets, and

  • Parameters that describe the model.

More details about parameters in the spec file are provided below:

Parameter

Data Type

Default

Description

data_dir

string

Path to the data converted to the specified above format

trainer.max_epochs

integer

5

Maximum number of epochs to train the model

model.punct_label_ids

dictionary

O: 0, ‘,’: 1, ‘.’: 2, ‘?’: 3

Labels string name to integer mapping for punctuation task, do NOT change

model.capit_label_ids

dictionary

O: 0, U: 1

Labels string name to integer mapping for capitalization task, do NOT change

model.tokenizer.tokenizer_name

string

Will be filled automatically based on model.language_model.pretrained_model_name

Tokenizer name

model.tokenizer.vocab_file

string

null

Path to tokenizer vocabulary

model.tokenizer.tokenizer_model

string

null

Path to tokenizer model (only for sentencepiece tokenizer)

model.language_model.pretrained_model_name

string

bert-base-uncased

Pre-trained language model name (choose from bert-base-cased and bert-base-uncased)

model.language_model.lm_checkpoint

string

null

Path to the pre-trained language model checkpoint

model.language_model.config_file

string

null

Path to the pre-trained language model config file

model.language_model.config

dictionary

null

Config of the pre-trained language model

model.punct_head.punct_num_fc_layers

integer

1

Number of fully connected layers

model.punct_head.fc_dropout

float

0.1

Activation to use between fully connected layers

model.punct_head.activation

string

‘relu’

Dropout to apply to the input hidden states

model.punct_head.use_transrormer_init

bool

True

Whether to initialize the weights of the classifier head with the same approach used in Transformer

model.capit_head.punct_num_fc_layers

integer

1

Number of fully connected layers

model.capit_head.fc_dropout

float

0.1

Activation to use between fully connected layers

model.capit_head.activation

string

‘relu’

Dropout to apply to the input hidden states

model.capit_head.use_transrormer_init

bool

True

Whether to initialize the weights of the classifier head with the same approach used in Transformer

training_ds.text_file

string

text_train.txt

Name of the text training file located at data_dir

training_ds.labels_file

string

labels_train.txt

Name of the labels training file located at data_dir

training_ds.shuffle

bool

True

Whether to shuffle the training data

training_ds.num_samples

integer

-1

Number of samples to use from the training dataset, -1 means all

training_ds.batch_size

integer

64

Training data batch size

validation_ds.text_file

string

text_dev.txt

Name of the text file for evaluation, located at data_dir

validation_ds.labels_file

string

labels_dev.txt

Name of the labels dev file located at data_dir

validation_ds.shuffle

bool

False

Whether to shuffle the dev data

validation_ds.num_samples

integer

-1

Number of samples to use from the dev set, -1 means all

validation_ds.batch_size

integer

64

Dev set batch size

optim.name

string

adam

Optimizer to use for training

optim.lr

float

1e-5

Learning rate to use for training

optim.weight_decay

float

0

Weight decay to use for training

optim.sched.name

string

WarmupAnnealing

Warm up schedule

optim.sched.warmup_ratio

float

0.1

Warm up ratio

Example of the command for training the model:

Copy
Copied!
            

tao punctuation_and_capitalization train [-h] \ -e /specs/nlp/punctuation_and_capitalization/train.yaml \ -r /results/punctuation_and_capitalization/train/ \ -g 4 \ data_dir=/path/to/data_dir \ trainer.max_epochs=2 \ training_ds.num_samples=-1 \ validation_ds.num_samples=-1 \ -k $KEY

Required Arguments for Training

  • -e: The experiment specification file to set up training.

  • -r: Path to the directory to store the results/logs. Note, the trained-model.tlt would be saved in this

    specified folder under a subfolder checkpoints; in our case, it will be saved here: /results/punctuation_and_capitalization/train/checkpoints/trained-model.tlt

  • -k: Encryption key

  • data_dir: Path to the data_dir with the processed data files.

Optional Arguments

  • -h, --help: Show this help message and exit

  • -g: The number of GPUs to be used in evaluation in a multi-GPU scenario (default: 1).

  • Other arguments to override fields in the specification file.

Note

While the arguments are defined in the spec file, if you wish to override these parameter definitions in the spec file and experiment with them, you may do so over command line by simply defining the param. For example, the sample spec file mentioned above has validation_ds.batch_size set to 64. However, if you see that the GPU utilization can be optimized further by using larger a batch size, you may override to the desired value, by adding the field validation_ds.batch_size=128 over command line. You may repeat this with any of the parameters defined in the sample spec file.

Snippets of the output log from executing the punctuation_and_capitalization train command:

Copy
Copied!
            

# complete model's spec file will be shown [NeMo I] Spec file: restore_from: ??? exp_manager: explicit_log_dir: null exp_dir: null name: trained-model version: null use_datetime_version: true resume_if_exists: true resume_past_end: false resume_ignore_no_checkpoint: true create_tensorboard_logger: false summary_writer_kwargs: null create_wandb_logger: false wandb_logger_kwargs: null create_checkpoint_callback: true checkpoint_callback_params: filepath: null monitor: val_loss verbose: true save_last: true save_top_k: 3 save_weights_only: false mode: auto period: 1 prefix: null postfix: .tlt save_best_model: false files_to_copy: null model: tokenizer: ... ... # The dataset will be processed and tokenized [NeMo I punctuation_capitalization_model:251] Setting model.dataset.data_dir to sample/. [NeMo I punctuation_capitalization_dataset:289] Processing text_train.txt [NeMo I punctuation_capitalization_dataset:333] Using the provided label_ids dictionary. [NeMo I punctuation_capitalization_dataset:408] Labels: {'O': 0, ',': 1, '.': 2, '?': 3} [NeMo I punctuation_capitalization_dataset:409] Labels mapping saved to : sample/punct_label_ids.csv [NeMo I punctuation_capitalization_dataset:408] Labels: {'O': 0, 'U': 1} [NeMo I punctuation_capitalization_dataset:409] Labels mapping saved to : sample/capit_label_ids.csv [NeMo I punctuation_capitalization_dataset:134] Max length: 35 [NeMo I data_preprocessing:295] Some stats of the lengths of the sequences: # During training, you're going to see a progress bar for both training and evaluation of the model that is done during model training. # Once the training is complete, the results are going to be saved to the specified locations [NeMo I train:126] Experiment logs saved to 'nemo_experiments/trained-model' [NeMo I train:129] Trained model saved to 'nemo_experiments/trained-model/2021/checkpoints/trained-model.tlt'


Important Parameters

Below is the list of parameters that could help improve the model:

  • classification head parameters:
    • the number of layers in the classification heads (model.punct_head.punct_num_fc_layers and model.capit_head.capit_num_fc_layers)

    • dropout value between layers (model.punct_head.fc_dropout and model.capit_head.fc_dropout)

  • optimizer (model.optim.name, for example, adam)

  • learning rate (model.optim.lr, for example, 5e-5)

In the previous section Training a punctuation and capitalization model,

the Punctuation and Capitalization model was initialized with a pre-trained language model, but the classifiers were trained from scratch. Now that a user has trained the Punctuation and Capitalization model successfully (e.g., called trained-model.tlt), there may be scenarios where users are required to retrain this trained-model.tlt on a new smaller dataset. TAO conversational AI applications provide a separate tool called fine-tune to enable this.

Example for spec for fine-tuning of the model:

Copy
Copied!
            

trainer: max_epochs: 1 # DEMO purposes # 100 data_dir: ??? # Fine-tuning settings: training dataset. finetuning_ds: text_file: text_train.txt labels_file: labels_train.txt shuffle: true num_samples: -1 # number of samples to be considered, -1 means all the dataset batch_size: 64 # Fine-tuning settings: validation dataset. validation_ds: text_file: text_dev.txt labels_file: labels_dev.txt shuffle: false num_samples: -1 # number of samples to be considered, -1 means all the dataset batch_size: 64 # Fine-tuning settings: different optimizer. optim: name: adam lr: 2e-5

Parameter

Data Type

Default

Description

data_dir

string

Path to the data converted to the specified above format

trainer.max_epochs

integer

5

Maximum number of epochs to train the model

finetuning_ds.text_file

string

text_train.txt

Name of the text training file located at data_dir

finetuning_ds.labels_file

string

labels_train.txt

Name of the labels training file located at data_dir

finetuning_ds.shuffle

bool

True

Whether to shuffle the training data

finetuning_ds.num_samples

integer

-1

Number of samples to use from the training dataset, -1 means all

finetuning_ds.batch_size

integer

64

Training data batch size

validation_ds.text_file

string

text_dev.txt

Name of the text file for evaluation, located at data_dir

validation_ds.labels_file

string

labels_dev.txt

Name of the labels dev file located at data_dir

validation_ds.shuffle

bool

False

Whether to shuffle the dev data

validation_ds.num_samples

integer

-1

Number of samples to use from the dev set, -1 means all

validation_ds.batch_size

integer

64

Dev set batch size

optim.name

string

adam

Optimizer to use for training

optim.lr

float

2e-5

Learning rate to use for training

Use the following command to fine-tune the model:

Copy
Copied!
            

tao punctuation_and_capitalization finetune [-h] -e /specs/nlp/punctuation_and_capitalization/finetune.yaml \ -r /results/punctuation_and_capitalization/finetune/ \ -m /results/punctuation_and_capitalization/train/checkpoints/trained-model.tlt \ -g 1 \ data_dir=/path/to/`data_dir` \ trainer.max_epochs=3 \ -k $KEY

Required Arguments for Fine-tuning

  • -e: The experiment specification file to set up fine-tuning

  • -r: Path to the directory to store the results of the fine-tuning.

  • -m: Path to the pre-trained model to use for fine-tuning.

  • data_dir: Path to data directory with the pre-processed data to use for fine-tuning.

  • -k: Encryption key

Optional Arguments

  • -h, --help: Show this help message and exit

  • -g: The number of GPUs to be used in evaluation in a multi-GPU scenario (default: 1)

  • -exp_manager.name="my_model_finetuned": This argument can be used to change the default name of the fine-tuned model

    from finetuned-model.tlt to my_model_finetuned.tlt.

  • Other arguments to override fields in the specification file.

Output log for the tao punctuation_and_capitalization finetune command:

Copy
Copied!
            

Model restored from '/path/to/trained-model.tlt' # The rest of the log is similar to the output log snippet for :code:`punctuation_and_capitalization train`.


Spec example to evaluate the pre-trained model:

Copy
Copied!
            

# Name of the .tlt from which the model will be loaded. restore_from: trained-model.tlt # Test settings: dataset. data_dir: ??? test_ds: text_file: text_dev.txt labels_file: labels_dev.txt batch_size: 64 shuffle: false num_samples: -1 # number of samples to be considered, -1 means all the dataset

Use the following command to evaluate the model:

Copy
Copied!
            

tao punctuation_and_capitalization evaluate [-h] \ -e /specs/nlp/punctuation_and_capitalization/evaluate.yaml \ -m /results/punctuation_and_capitalization/train/checkpoints/trained-model.tlt \ -g 1 \ data_dir=/path/to/data_dir \ -k $KEY

Required Arguments for Evaluation

  • -e: The experiment specification file to set up evaluation.

  • -r: Path to the directory to store the results.

  • data_dir: Path to data directory with the pre-processed data to use for evaluation.

  • -m: Path to the pre-trained model checkpoint for evaluation. Should be a .tlt file.

  • -k: Encryption key

Optional Arguments

  • -h, --help: Show this help message and exit

punctuation_and_capitalization evaluate generates two classification reports: one for capitalization task and another one for punctuation task. These classification reports include the following metrics:

  • Precision

  • Recall

  • F1

More details about these metrics can be found here.

Output log from executing the above command (note, the values below are for demonstration purposes only):

Copy
Copied!
            

Punctuation report: label precision recall f1 support O (label_id: 0) 100.00 97.00 98.48 100 , (label_id: 1) 100.00 100.00 100.00 4 . (label_id: 2) 76.92 100.00 86.96 10 ? (label_id: 3) 0.00 0.00 0.00 0 ------------------- micro avg 97.37 97.37 97.37 114 macro avg 92.31 99.00 95.14 114 weighted avg 97.98 97.37 97.52 114 Capitalization report: label precision recall f1 support O (label_id: 0) 93.62 90.72 92.15 97 U (label_id: 1) 55.00 64.71 59.46 17 ------------------- micro avg 86.84 86.84 86.84 114 macro avg 74.31 77.71 75.80 114 weighted avg 87.86 86.84 87.27 114

Parameter

Data Type

Default

Description

data_dir

string

Path to the data converted to the specified above format

test_ds.text_file

string

text_dev.txt

Name of the text file to run evaluation on located at data_dir

test_ds.labels_file

string

labels_dev.txt

Name of the labels dev file located at data_dir

test_ds.shuffle

bool

False

Whether to shuffle the dev data

test_ds.num_samples

integer

-1

Number of samples to use from the dev set, -1 means all

test_ds.batch_size

integer

64

Dev set batch size

During inference, a batch of input sentences, listed in the spec files, are passed through the trained model to add punctuation and capitalize words.

Before doing inference on the model, specify the list of examples in the spec, for example:

Copy
Copied!
            

input_batch: - 'what can i do for you today' - 'how are you'

To run inference:

Copy
Copied!
            

tao punctuation_and_capitalization infer [-h] -e /specs/nlp/punctuation_and_capitalization/infer.yaml \ -r /results/punctuation_and_capitalization/infer/ \ -g 1 \ -m /results/punctuation_and_capitalization/finetune/checkpoints/finetuned-model.tlt \ -k $KEY

Output log from executing the above command:

Copy
Copied!
            

The prediction results of some sample queries with the trained model: Query : what can i do for you today Result: What can I do for you today? Query : how are you Result: How are you?

Required Arguments for Inference

  • -e: The experiment specification file to set up inference. This requires the input_batch with the list of examples to run inference on.

  • -r: Path to the directory to store the results.

  • -m: Path to the pre-trained model checkpoint from which to infer. Should be a .tlt file.

  • -k: Encryption key

Optional Arguments

  • -h, --help: Show this help message and exit

  • -g: The number of GPUs to be used for fine-tuning in a multi-GPU scenario (default: 1).

  • Other arguments to override fields in the specification file.

A pre-trained model could be exported to RIVA format (this format contains model checkpoint and model artifacts required for successful deployment of the trained .tlt models to Riva Services). For more details about Riva, see this.

Example of the spec file for model export:

Copy
Copied!
            

# Name of the .tlt EFF archive to be loaded/model to be exported. restore_from: trained-model.tlt # Set export format: RIVA export_format: RIVA # Output EFF archive containing model checkpoint and artifacts required for Riva Services export_to: exported-model.riva

Parameter

Data Type

Default

Description

restore_from

string

trained-model.tlt

Path to the pre-trained model

export_format

string

Export format: RIVA

export_to

string

exported-model.riva

Path to the exported model

To export a pre-trained model for deployment, run:

Copy
Copied!
            

### For export to Riva format tao punctuation_and_capitalization export [-h]\ -e /specs/nlp/punctuation_and_capitalization/export.yaml \ -r /results/punctuation_and_capitalization/export/ \ -m /results/punctuation_and_capitalization/train/checkpoints/trained-model.tlt \ -k $KEY \ export_format=RIVA \ export_to=my-exported-model.riva

Required Arguments for Export

  • -e: The experiment specification file to set up inference. This requires the input_batch with the list of examples to run inference on.

  • -r: Path to the directory to store the results.

  • -m: Path to the pre-trained model checkpoint from which to infer. Should be a .tlt file.

  • -k: Encryption key

Optional Arguments

  • -h, --help: Show this help message and exit

  • export_to: To change the default name of the exported model

Output log:

Copy
Copied!
            

Spec file: restore_from: path/to/trained-model.tlt export_to: exported-model.riva export_format: RIVA exp_manager: task_name: export explicit_log_dir: /results/punctuation_and_capitalization/export/ encryption_key: $KEY Experiment logs saved to '/results/punctuation_and_capitalization/export/' Exported model to '/results/punctuation_and_capitalization/export/exported-model.riva'


© Copyright 2020, NVIDIA. Last updated on Aug 24, 2021.