Text Classification - NVIDIA Docs

Introduction

The Text Classification model is a sequence classification model built on BERT-based encoders. It can be used for tasks such as text classification, sentiment analysis, and domain/intent detection for dialogue systems. The model takes a text input and classifies it into predefined categories. Most of the BERT-based encoders supported by HuggingFace, including BERT, RoBERTa, DistilBERT, and XLNet, can be used with this model.

TAO Toolkit provides a sample notebook that outlines the end-to-end workflow for training a text classification model with TAO Toolkit and deploying it in Riva format on NGC resources.

Downloading Sample Spec files

Before proceeding, download the sample spec files needed for the rest of the subtasks.

tao text_classification download_specs -r /results/text_classification/default_specs/ \
    -o /specs/nlp/text_classification

where

-r : path to the logs that the command generates
-o : path where the spec files are downloaded

Data Format

The text classification model uses the NeMo toolkit and supports NeMo’s data format. Data must be stored in tab-separated files (.tsv) with two columns: sentence and label. Each line of the data file contains one text sequence, with words separated by spaces, followed by a [TAB] and the label, i.e.:

[WORD][SPACE][WORD][SPACE][WORD][TAB][LABEL]

Labels must be integers starting from 0. Some examples taken from the SST2 dataset, which is a two-class problem:

saw how bad this movie was	0
lend some dignity to a dumb story	0
the greatest musicians	1

You may need three separate files for train, validation and test with this format.
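
The format rules above can be checked before training with a small script. The following is a minimal sketch, not part of TAO Toolkit: the function name `validate_tsv` is hypothetical, and the check that label 0 must be present is one reading of "integers starting from 0".

```python
import csv
import io


def validate_tsv(lines):
    """Validate rows of a two-column, tab-separated classification file.

    `lines` is any iterable of text lines (for example an open file).
    Returns the sorted list of label ids found; raises ValueError on the
    first malformed row.
    """
    labels = set()
    # QUOTE_NONE: treat quote characters in sentences as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if len(row) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated columns, got {len(row)}"
            )
        sentence, label = row
        if not sentence.strip():
            raise ValueError(f"line {line_no}: empty sentence")
        try:
            label_id = int(label)
        except ValueError:
            raise ValueError(f"line {line_no}: label {label!r} is not an integer") from None
        if label_id < 0:
            raise ValueError(f"line {line_no}: label {label_id} is negative")
        labels.add(label_id)
    # Assumption: "starting from 0" means label 0 must occur in the file.
    if not labels or min(labels) != 0:
        raise ValueError("labels must start from 0")
    return sorted(labels)


sample = (
    "saw how bad this movie was\t0\n"
    "lend some dignity to a dumb story\t0\n"
    "the greatest musicians\t1\n"
)
print(validate_tsv(io.StringIO(sample)))
```

To check a file on disk, pass an open file handle instead of the `io.StringIO` object.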

Dataset Conversion

If your dataset is stored in another format, you need to convert it to this format to use this model. Conversion scripts are available for the SST2 and IMDB datasets to convert them from their original format to NeMo’s format. To convert a dataset, use the dataset_convert utility as in the following example:

tao text_classification dataset_convert \
    -e /specs/nlp/text_classification/dataset_convert.yaml \
    source_data_dir=SOURCE_PATH \
    target_data_dir=TARGET_PATH \
    dataset_name=sst2

Arguments:

source_data_dir: directory of your dataset
target_data_dir: directory to save the converted dataset
dataset_name: name of the dataset to convert (“imdb” and “sst2” are currently supported)

The example above converts the SST2 dataset stored at SOURCE_PATH to NeMo’s format and saves the new dataset at TARGET_PATH.

You may download the SST2 dataset from ‘https://dl.fbaipublicfiles.com/glue/data/SST-2.zip’ and extract it into the folder specified by SOURCE_PATH. After the conversion, the TARGET_PATH should contain the following files:

.
|--TARGET_PATH
  |-- train.tsv
  |-- dev.tsv
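
For datasets other than SST2 and IMDB, the conversion has to be done by hand. The sketch below shows one way to turn a CSV file with string labels into the two-column format described under Data Format. It is an illustration only: the function name, the `text` and `label` column names, and the choice to number labels in sorted order are all assumptions, not behavior of the dataset_convert utility.

```python
import csv
import io


def convert_to_nemo_tsv(csv_text, text_col="text", label_col="label"):
    """Convert CSV content with string labels to '<sentence>\\t<label id>' lines.

    Returns (tsv_lines, label_names), where label_names[i] is the string
    label that was mapped to integer id i.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Assumption: ids are assigned in sorted order of the label names.
    label_names = sorted({row[label_col] for row in rows})
    label_to_id = {name: idx for idx, name in enumerate(label_names)}

    tsv_lines = []
    for row in rows:
        # Collapse tabs, newlines, and repeated spaces so the sentence
        # cannot break the tab-separated layout.
        sentence = " ".join(row[text_col].split())
        tsv_lines.append(f"{sentence}\t{label_to_id[row[label_col]]}")
    return tsv_lines, label_names


example_csv = (
    "text,label\n"
    "the greatest musicians,positive\n"
    "saw how bad this movie was,negative\n"
)
lines, names = convert_to_nemo_tsv(example_csv)
print(lines)
print(names)
```

The returned list of label names, written one per line, is the kind of content the optional class_labels_file setting in the training spec refers to.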

Model Training

The following is an example training spec file, train.yaml. You can change any of these parameters in the file or override them on the training command line.

trainer:
  max_epochs: 100

model:
  # Labels that will be used to "decode" predictions.
  class_labels:
    class_labels_file : null # optional to specify a file containing the list of the labels

  tokenizer:
      tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece
      vocab_file: null # path to vocab file
      tokenizer_model: null # only used if tokenizer is sentencepiece
      special_tokens: null

  language_model:
    pretrained_model_name: bert-base-uncased
    lm_checkpoint: null
    config_file: null # json file, precedence over config
    config: null

  classifier_head:
    # This comes directly from number of labels/target classes.
    num_output_layers: 2
    fc_dropout: 0.1


training_ds:
  file_path: ???
  batch_size: 64
  shuffle: true
  num_samples: -1 # number of samples to be considered, -1 means all the dataset
  num_workers: 3
  drop_last: false
  pin_memory: false

validation_ds:
  file_path: ???
  batch_size: 64
  shuffle: false
  num_samples: -1 # number of samples to be considered, -1 means all the dataset
  num_workers: 3
  drop_last: false
  pin_memory: false

optim:
  name: adam
  lr: 2e-5
  # optimizer arguments
  betas: [0.9, 0.999]
  weight_decay: 0.001

  # scheduler setup
  sched:
    name: WarmupAnnealing
    # Scheduler params
    warmup_steps: null
    warmup_ratio: 0.1
    last_epoch: -1
    # pytorch lightning args
    monitor: val_loss
    reduce_on_plateau: false

Example of the command for training the model on four GPUs for 50 epochs:

tao text_classification train -e /specs/nlp/text_classification/train.yaml \
training_ds.file_path=PATH_TO_TRAIN_FILE \
trainer.max_epochs=50 \
-g 4  \
-k $KEY

By default, the final model is saved as ‘trained-model.tlt’ when training finishes.

Required Arguments for Training

-e: The experiment specification file to set up training.
training_ds.file_path: Path to the training ‘.tsv’ file
-k: Encryption key

Optional Arguments

trainer.max_epochs: Number of training epochs.
-g: Number of GPUs to use for training
Other arguments to override fields in the specification file.

The following table lists some of the parameters you may set in the spec file or override from the command line when training a model:

Parameter	Data Type	Default	Description
model.class_labels.class_labels_file	string	null	Path to an optional file containing the labels; each line is the string label corresponding to a label
model.intent_loss_weight	float	0.6	Relation of intent to slot loss in total loss
model.tokenizer.tokenizer_name	string	Will be filled automatically based on model.language_model.pretrained_model_name	Tokenizer name
model.tokenizer.vocab_file	string	null	Path to tokenizer vocabulary
model.tokenizer.tokenizer_model	string	null	Path to tokenizer model (only for sentencepiece tokenizer)
model.tokenizer.special_tokens	string	null	Special tokens of the tokenizer if it exists
model.language_model.max_seq_length	integer	50	Maximal length of the input queries (in tokens)
model.language_model.pretrained_model_name	string	bert-base-uncased	Pre-trained language model name (choose from bert-base-cased, bert-base-uncased, megatron_bert_345m_uncased, distilbert-base-uncased and biomegatron-bert-345m-uncased)
model.language_model.lm_checkpoint	string	null	Path to the pre-trained language model checkpoint
model.language_model.config_file	string	null	Path to the pre-trained language model config file
model.language_model.config	dictionary	null	Config of the pre-trained language model
model.classifier_head.num_output_layers	integer	2	Number of fully connected layers of the classifier on top of the BERT model
model.classifier_head.fc_dropout	float	0.1	Dropout ratio of the fully connected layers
{training,validation,test}_ds.file_path	string	???	Path to the dataset ‘.tsv’ file
{training,validation,test}_ds.batch_size	integer	32	Data loader’s batch size
{training,validation,test}_ds.num_workers	integer	2	Number of worker threads for data loader
{training,validation,test}_ds.shuffle	boolean	true (training), false (test and validation)	Shuffles data for each epoch
{training,validation,test}_ds.drop_last	boolean	false	Specifies if last batch of data needs to get dropped if it is smaller than batch size
{training,validation,test}_ds.pin_memory	boolean	false	Enables pin_memory of PyTorch’s data loader to enhance speed
{training,validation,test}_ds.num_samples	integer	-1	Number of samples to be used from the dataset; -1 means all samples
optim.name	string	adam	Optimizer to use for training
optim.lr	float	2e-5	Learning rate to use for training
optim.weight_decay	float	0.01	Weight decay to use for training
optim.sched.name	string	WarmupAnnealing	Warmup schedule
optim.sched.warmup_ratio	float	0.1	Warmup ratio
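
When training is launched from a script, the command-line overrides can be assembled programmatically. The sketch below builds the argument list for the training command using only the flags documented in this section (-e, -g, -k) and `name=value` overrides. The helper `build_train_command` and the placeholder key are hypothetical; the sketch only constructs and prints the command and does not run it.

```python
import shlex


def build_train_command(spec_path, train_file, key, gpus=None, overrides=None):
    """Return the argv list for a `tao text_classification train` call."""
    cmd = [
        "tao", "text_classification", "train",
        "-e", spec_path,
        f"training_ds.file_path={train_file}",
    ]
    # Each override is passed as a single `dotted.name=value` argument.
    for name, value in (overrides or {}).items():
        cmd.append(f"{name}={value}")
    if gpus is not None:
        cmd += ["-g", str(gpus)]
    cmd += ["-k", key]
    return cmd


cmd = build_train_command(
    "/specs/nlp/text_classification/train.yaml",
    "PATH_TO_TRAIN_FILE",
    "my_key",  # placeholder encryption key
    gpus=4,
    overrides={"trainer.max_epochs": 50, "optim.lr": 2e-5},
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the job on a machine where the `tao` launcher is installed.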

Training Procedure

At the start of each training experiment, TAO Toolkit prints a log of the experiment specification, including any parameters added or overridden via the command line. It also shows additional information, such as which GPUs are available and where logs will be saved.

            GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[NeMo W 2021-01-20 19:49:30 exp_manager:304] There was no checkpoint folder at checkpoint_dir :/home/user/tao-toolkit-pyt/nlp/text_classification/entrypoint/nemo_experiments/trained-model/2021-01-20_19-49-30/checkpoints. Training from scratch.
[NeMo I 2021-01-20 19:49:30 exp_manager:194] Experiments will be logged at /home/user/tao-toolkit-pyt/nlp/text_classification/entrypoint/nemo_experiments/trained-model/2021-01-20_19-49-30

Then for each dataset provided, it shows some samples from the dataset with their corresponding inputs to the model. It also provides some stats on the lengths of sequences in the dataset.

            [NeMo I 2021-01-20 19:49:36 text_classification_dataset:120] Read 67350 examples from ../data/SST-2/train.tsv.
[NeMo I 2021-01-20 19:49:37 text_classification_dataset:233] *** Example ***
[NeMo I 2021-01-20 19:49:37 text_classification_dataset:234] example 0: ['girl-meets-girl', 'romantic', 'comedy']
[NeMo I 2021-01-20 19:49:37 text_classification_dataset:235] subtokens: [CLS] girl - meets - girl romantic comedy [SEP]
[NeMo I 2021-01-20 19:49:37 text_classification_dataset:236] input_ids: 101 2611 1011 6010 1011 2611 6298 4038 102
[NeMo I 2021-01-20 19:49:37 text_classification_dataset:237] segment_ids: 0 0 0 0 0 0 0 0 0
[NeMo I 2021-01-20 19:49:37 text_classification_dataset:238] input_mask: 1 1 1 1 1 1 1 1 1
[NeMo I 2021-01-20 19:49:37 text_classification_dataset:239] label: 1
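
The relationship between the logged fields can be reproduced with a few lines of Python. This is a simplified sketch, not the NeMo implementation: the vocabulary contains only the ids visible in the log above, the input is assumed to be already split into subtokens, and padding up to a fixed length with id 0 is an assumption about how sequences are batched.

```python
# Token ids copied from the log output above (bert-base-uncased).
VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "girl": 2611, "-": 1011, "meets": 6010,
    "romantic": 6298, "comedy": 4038,
}


def build_features(subtokens, vocab, max_seq_length, pad_id=0):
    """Build input_ids, segment_ids, and input_mask for one sequence."""
    # Reserve two positions for [CLS] and [SEP]; truncate longer inputs.
    subtokens = subtokens[: max_seq_length - 2]
    sequence = ["[CLS]"] + subtokens + ["[SEP]"]

    input_ids = [vocab[tok] for tok in sequence]
    input_mask = [1] * len(input_ids)   # 1 marks real tokens
    segment_ids = [0] * len(input_ids)  # single-sentence input: all zeros

    # Pad on the right; padded positions are masked out with 0.
    padding = max_seq_length - len(input_ids)
    input_ids += [pad_id] * padding
    input_mask += [0] * padding
    segment_ids += [0] * padding
    return input_ids, segment_ids, input_mask


tokens = ["girl", "-", "meets", "-", "girl", "romantic", "comedy"]
input_ids, segment_ids, input_mask = build_features(tokens, VOCAB, max_seq_length=12)
print(input_ids)
print(segment_ids)
print(input_mask)
```

The first nine values of each list match the logged example; the trailing zeros come from the assumed padding.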

Before training starts, information on the optimizer and scheduler will be shown in the logs:

            [NeMo I 2021-01-20 19:50:19 modelPT:830] Optimizer config = Adam (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.999]
        eps: 1e-08
        lr: 2e-05
        weight_decay: 0.01
    )
[NeMo I 2021-01-20 19:50:19 lr_scheduler:621] Scheduler "<nemo.core.optim.lr_scheduler.WarmupAnnealing object at 0x7fcd2232b160>"
    will be used during training (effective maximum steps = 1053) -
    Parameters :
    (warmup_steps: null
    warmup_ratio: 0.1
    last_epoch: -1
    max_steps: 1053
    )
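
To get a feel for what these scheduler parameters mean, the sketch below models a warmup-then-anneal schedule as a linear ramp up to the base learning rate followed by a linear decay to zero. This is an approximation for illustration; the exact formula used by NeMo's WarmupAnnealing scheduler may differ, and the function name `lr_at_step` is hypothetical. The values for lr, warmup_ratio, and max_steps are taken from the log above.

```python
import math


def lr_at_step(step, base_lr=2e-5, max_steps=1053, warmup_ratio=0.1):
    """Approximate learning rate at a given optimizer step."""
    # Assumption: warmup_steps is derived from the ratio when it is null.
    warmup_steps = int(warmup_ratio * max_steps)
    if step < warmup_steps:
        # Linear warmup from near zero up to base_lr.
        return base_lr * (step + 1) / (warmup_steps + 1)
    # Linear decay from base_lr at the end of warmup to 0 at max_steps.
    remaining = max(max_steps - step, 0)
    return base_lr * remaining / (max_steps - warmup_steps)


# 67350 training examples at batch size 64 give 1053 batches, which
# matches the max_steps value in the log (one pass over the data).
steps_per_epoch = math.ceil(67350 / 64)
print(steps_per_epoch)

for step in (0, 50, 105, 500, 1053):
    print(step, lr_at_step(step))
```

With warmup_ratio 0.1 and 1053 steps, the ramp covers roughly the first 105 updates, after which the learning rate decreases for the rest of training.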

You should next see a full printout of the number of parameters in each module and submodule, as well as the total number of trainable and non-trainable parameters in the model. For example, this model has about 110M parameters in total:

    | Name                                                   | Type                 | Params
--------------------------------------------------------------------------------------------------
0   | bert_model                                             | BertEncoder          | 109 M
1   | bert_model.embeddings                                  | BertEmbeddings       | 23.8 M
2   | bert_model.embeddings.word_embeddings                  | Embedding            | 23.4 M
3   | bert_model.embeddings.position_embeddings              | Embedding            | 393 K
4   | bert_model.embeddings.token_type_embeddings            | Embedding            | 1.5 K
5   | bert_model.embeddings.LayerNorm                        | LayerNorm            | 1.5 K
6   | bert_model.embeddings.dropout                          | Dropout              | 0
7   | bert_model.encoder                                     | BertEncoder          | 85.1 M
8   | bert_model.encoder.layer                               | ModuleList           | 85.1 M
9   | bert_model.encoder.layer.0                             | BertLayer            | 7.1 M
10  | bert_model.encoder.layer.0.attention                   | BertAttention        | 2.4 M
11  | bert_model.encoder.layer.0.attention.self              | BertSelfAttention    | 1.8 M
12  | bert_model.encoder.layer.0.attention.self.query        | Linear               | 590 K
...
212 | bert_model.encoder.layer.11.output.dropout             | Dropout              | 0
213 | bert_model.pooler                                      | BertPooler           | 590 K
214 | bert_model.pooler.dense                                | Linear               | 590 K
215 | bert_model.pooler.activation                           | Tanh                 | 0
216 | classifier                                             | SequenceClassifier   | 592 K
217 | classifier.dropout                                     | Dropout              | 0
218 | classifier.mlp                                         | MultiLayerPerceptron | 592 K
219 | classifier.mlp.layer0                                  | Linear               | 590 K
220 | classifier.mlp.layer2                                  | Linear               | 1.5 K
221 | loss                                                   | CrossEntropyLoss     | 0
222 | classification_report                                  | ClassificationReport | 0
--------------------------------------------------------------------------------------------------
110 M     Trainable params
0         Non-trainable params
110 M     Total params
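
Several rows of this summary can be verified by hand. The arithmetic below assumes the standard bert-base-uncased dimensions (vocabulary of 30522 tokens, hidden size 768, 512 positions, 2 token types) and a two-class classifier head; the helper `linear_params` is defined here for illustration only.

```python
VOCAB_SIZE = 30522     # bert-base-uncased vocabulary size
HIDDEN = 768           # hidden size of BERT-base
MAX_POSITIONS = 512    # maximum sequence positions
TOKEN_TYPES = 2        # segment A / segment B
NUM_CLASSES = 2        # SST2 is a two-class problem


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


word_embeddings = VOCAB_SIZE * HIDDEN          # row 2, "23.4 M"
position_embeddings = MAX_POSITIONS * HIDDEN   # row 3, "393 K"
token_type_embeddings = TOKEN_TYPES * HIDDEN   # row 4, "1.5 K"
layer_norm = 2 * HIDDEN                        # row 5, scale and bias, "1.5 K"
embeddings_total = (
    word_embeddings + position_embeddings + token_type_embeddings + layer_norm
)                                              # row 1, "23.8 M"

attention_query = linear_params(HIDDEN, HIDDEN)      # row 12, "590 K"
classifier_hidden = linear_params(HIDDEN, HIDDEN)    # row 219, "590 K"
classifier_output = linear_params(HIDDEN, NUM_CLASSES)  # row 220, "1.5 K"
classifier_total = classifier_hidden + classifier_output  # row 216, "592 K"

print(word_embeddings, embeddings_total)
print(attention_query, classifier_output, classifier_total)
```

The size of the last classifier layer depends on the number of classes, so it changes when the model is trained on a dataset with more labels.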

As the model starts training, you should see a progress bar per epoch.

            Epoch 0: 100%|████████████████████████████| 1067/1067 [03:10<00:00,  5.60it/s, loss=0.252, val_loss=0.258, Epoch 0, global step 1052: val_loss reached 0.25792 (best 0.25792), saving model to "/home/user/tao-toolkit-pyt/nlp/text_classification/entrypoint/nemo_experiments/trained-model/2021-01-20_20-19-44/checkpoints/trained-model---val_loss=0.26-epoch=0.ckpt" as top 3
Epoch 1: 100%|████████████████████████████| 1067/1067 [03:10<00:00,  5.60it/s, loss=0.187, val_loss=0.245, Epoch 1, global step 2105: val_loss reached 0.24499 (best 0.24499), saving model to "/home/user/tao-toolkit-pyt/nlp/text_classification/entrypoint/nemo_experiments/trained-model/2021-01-20_20-19-44/checkpoints/trained-model---val_loss=0.24-epoch=1.ckpt" as top 3
Epoch 2: 100%|████████████████████████████| 1067/1067 [03:09<00:00,  5.62it/s, loss=0.158, val_loss=0.235, Epoch 2, global step 3158: val_loss reached 0.23505 (best 0.23505), saving model to "/home/user/tao-toolkit-pyt/nlp/text_classification/entrypoint/nemo_experiments/trained-model/2021-01-20_20-19-44/checkpoints/trained-model---val_loss=0.24-epoch=2.ckpt" as top 3
...

After each epoch, you should see a summary table of metrics on the validation set.

            Validating:  100%|████████████████████████████| 14/14 [00:00<00:00, 13.94it/s]
[NeMo I 2021-01-20 19:53:32 text_classification_model:173] val_report:
    label                                                precision    recall       f1           support
    label_id: 0                                             91.97      88.32      90.11        428
    label_id: 1                                             89.15      92.57      90.83        444
    -------------------
    micro avg                                               90.48      90.48      90.48        872
    macro avg                                               90.56      90.44      90.47        872
    weighted avg                                            90.54      90.48      90.47        872

At the end of training, TAO Toolkit will save the last checkpoint at the path specified by the experiment spec file before finishing.

            Saving latest checkpoint...
[NeMo I 2021-01-20 21:09:39 train:124] Experiment logs saved to '/home/user/tao-toolkit-pyt/nlp/text_classification/entrypoint/nemo_experiments/trained-model/2021-01-20_21-06-17'
[NeMo I 2021-01-20 21:09:39 train:127] Trained model saved to '/home/user/tao-toolkit-pyt/nlp/text_classification/entrypoint/nemo_experiments/trained-model/2021-01-20_21-06-17/checkpoints/trained-model.tlt'

The output logs for the evaluation and fine-tuning look similar.

Training Suggestions

When you train this model on other data or with different batch sizes, you may need to tune at least the optimizer and scheduler settings, such as the learning rate and weight decay. Larger effective batch sizes generally need a larger learning rate. The effective batch size is the total number of samples processed per update step. For example, with a per-GPU batch size of 64, four GPUs, and accumulate_grad_batches set to two, the effective batch size is 64*4*2=512. You may also use other BERT-like models, or models of different sizes, depending on your performance requirements.
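
The effective batch size calculation can be written out as follows. This is a small illustrative helper, not a TAO Toolkit function; the dataset size of 67350 is the SST2 training set size reported in the training log earlier in this document.

```python
import math


def effective_batch_size(per_gpu_batch_size, num_gpus, accumulate_grad_batches=1):
    """Number of samples that contribute to one optimizer update."""
    return per_gpu_batch_size * num_gpus * accumulate_grad_batches


def updates_per_epoch(num_samples, per_gpu_batch_size, num_gpus, accumulate_grad_batches=1):
    """Optimizer updates needed for one pass over the data (last batch kept)."""
    size = effective_batch_size(per_gpu_batch_size, num_gpus, accumulate_grad_batches)
    return math.ceil(num_samples / size)


print(effective_batch_size(64, 4, 2))
print(updates_per_epoch(67350, 64, 4, 2))
```

Fewer updates per epoch also means fewer scheduler steps, which is one reason the learning rate and warmup settings may need retuning when the effective batch size changes.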

Model Fine-tuning

There are scenarios where users need to re-train or fine-tune a pretrained TAO model, such as trained-model.tlt, on a new dataset. TAO Toolkit provides a separate tool called finetune for this.

Example of spec file to be used for fine-tuning of a model:

trainer:
  max_epochs: 100
data_dir: ???

# Fine-tuning settings: training dataset.
finetuning_ds:
  file_path: ???
  batch_size: 64
  shuffle: false
  num_samples: -1 # number of samples to be considered, -1 means all the dataset
  num_workers: 3
  drop_last: false
  pin_memory: false

# Fine-tuning settings: validation dataset.
validation_ds:
  file_path: ???
  batch_size: 64
  shuffle: false
  num_samples: -1 # number of samples to be considered, -1 means all the dataset
  num_workers: 3
  drop_last: false
  pin_memory: false

# Fine-tuning settings: different optimizer.
optim:
  name: adam
  lr: 2e-5
  betas: [0.9, 0.9998]
  weight_decay: 0.001

Use the following command to fine-tune a pre-trained model on a training file specified by ‘finetuning_ds.file_path’:

tao text_classification finetune -e /specs/nlp/text_classification/finetune.yaml \
    -r PATH_TO_RESULT_FOLDER \
    -m PATH_OF_PRETRAINED_TLT_MODEL \
    -g 1 \
    finetuning_ds.file_path=PATH_TO_TRAIN_FILE \
    trainer.max_epochs=3 \
    -k $KEY

Required Arguments for Fine-tuning

-e: The experiment specification file to set up fine-tuning
-r: Path to the directory to store the results of the fine-tuning.
-m: Path to the pre-trained model to use for fine-tuning.
finetuning_ds.file_path: Path to the fine-tuning ‘.tsv’ file
-k: Encryption key

Optional Arguments

-h, --help: Show this help message and exit
-g: The number of GPUs to be used for fine-tuning in a multi-GPU scenario (default: 1).
Other arguments to override fields in the specification file.

Model Evaluation

The evaluation tool enables the user to evaluate a saved model in TAO format on a dataset.

Spec example to evaluate the pre-trained model on test data:

restore_from: trained-model.tlt

test_ds:
  file_path: PATH_TO_TEST_FILE
  num_workers: 2
  batch_size: 32
  shuffle: false
  num_samples: -1

Use the following command to evaluate the model:

tao text_classification evaluate \
-e /specs/nlp/text_classification/evaluate.yaml \
test_ds.file_path=PATH_TO_TEST_FILE

Required Arguments for Evaluation

-e: The experiment specification file to set up evaluation.
test_ds.file_path: Path to the test ‘.tsv’ file

The output should be similar to the training process and the metrics table is shown at the end:

            Validating:  100%|████████████████████████████| 14/14 [00:00<00:00, 13.94it/s]
[NeMo I 2021-01-20 19:53:32 text_classification_model:173] val_report:
    label                                                precision    recall       f1           support
    label_id: 0                                             91.97      88.32      90.11        428
    label_id: 1                                             89.15      92.57      90.83        444
    -------------------
    micro avg                                               90.48      90.48      90.48        872
    macro avg                                               90.56      90.44      90.47        872
    weighted avg                                            90.54      90.48      90.47        872

This table reports precision, recall, F1, and support for each class separately. It also shows the micro, macro, and weighted averages, which summarize the overall performance of the model across all classes.
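
The sketch below shows how these metrics follow from a confusion matrix. It is an independent illustration, not the NeMo ClassificationReport code. The matrix entries are not printed in the log; they are back-computed here from the supports, recalls, and precisions in the report above (378 of 428 class-0 samples and 411 of 444 class-1 samples classified correctly).

```python
def classification_report(confusion):
    """Compute per-class and averaged metrics.

    confusion[i][j] is the number of samples with true label i that were
    predicted as label j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    per_class = []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        support = sum(confusion[c])                         # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1, support))

    # Macro: unweighted mean over classes. Weighted: mean weighted by support.
    macro = tuple(sum(m[k] for m in per_class) / n for k in range(3))
    weighted = tuple(sum(m[k] * m[3] for m in per_class) / total for k in range(3))
    # Micro: for single-label classification, micro precision, recall, and
    # F1 all equal the overall accuracy.
    micro = sum(confusion[c][c] for c in range(n)) / total
    return per_class, micro, macro, weighted


# Rows: true label 0, true label 1. Derived from the report, see above.
confusion = [[378, 50],
             [33, 411]]
per_class, micro, macro, weighted = classification_report(confusion)
for label_id, (p, r, f1, support) in enumerate(per_class):
    print(label_id, round(100 * p, 2), round(100 * r, 2), round(100 * f1, 2), support)
print("micro", round(100 * micro, 2))
print("macro", [round(100 * v, 2) for v in macro])
print("weighted", [round(100 * v, 2) for v in weighted])
```

With nearly balanced classes, as here, the macro and weighted averages are close to each other; they diverge when some classes have much smaller support.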

Model Inference

The inference tool takes inputs in text format and produces the predictions of a saved model for them. To run inference on the model, specify the list of examples in the spec file “infer.yaml”, for example:

input_batch:
- "by the end of no such thing the audience , like beatrice , has a watchful affection for the monster ."
- "director rob marshall went out gunning to make a great one ."
- "uneasy mishmash of styles and genres ."
- "I love exotic science fiction / fantasy movies but this one was very unpleasant to watch . Suggestions and images of child abuse , mutilated bodies (live or dead) , other gruesome scenes , plot holes , boring acting made this a regretable experience , The basic idea of entering another person's mind is not even new to the movies or TV (An Outer Limits episode was better at exploring this idea) . i gave it 4 / 10 since some special effects were nice ."

The list of inputs specified by ‘input_batch’ is passed through the model to get the label predictions.
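
When the examples come from another program, the ‘input_batch’ section can be generated rather than typed by hand. The sketch below is one way to do that with only the standard library. It relies on JSON string literals being accepted as double-quoted YAML scalars, which holds for ordinary text; the function name `make_infer_spec` is hypothetical, and a real spec file may need additional fields beyond input_batch.

```python
import json


def make_infer_spec(examples):
    """Render a list of strings as the `input_batch` section of a spec file."""
    lines = ["input_batch:"]
    for text in examples:
        # json.dumps adds the surrounding quotes and escapes embedded
        # quotes and backslashes.
        lines.append("- " + json.dumps(text, ensure_ascii=False))
    return "\n".join(lines) + "\n"


spec_text = make_infer_spec([
    "director rob marshall went out gunning to make a great one .",
    "uneasy mishmash of styles and genres .",
])
print(spec_text)
```

The resulting text can be written to a file and passed to the inference command with -e.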

To run the inference on a trained model ‘trained-model.tlt’:

tao text_classification infer \
-e /specs/nlp/text_classification/infer.yaml \
-m trained-model.tlt

Required Arguments for Inference

-e: The experiment specification file to set up inference. This requires the input_batch with the list of examples to run inference on.
-m: Path to the pre-trained model checkpoint from which to infer. Should be a .tlt file.

The output should look similar to this:

            [NeMo I 2021-01-27 16:58:32 infer:68] Spec file:
    restore_from: trained-model.tlt
    exp_manager:
      task_name: infer
      explicit_log_dir: ./
    input_batch:
    - by the end of no such thing the audience , like beatrice , has a watchful affection
      for the monster .
    - director rob marshall went out gunning to make a great one .
    - uneasy mishmash of styles and genres .
    - I love exotic science fiction / fantasy movies but this one was very unpleasant
      to watch . Suggestions and images of child abuse , mutilated bodies (live or dead)
      , other gruesome scenes , plot holes , boring acting made this a regretable experience
      , The basic idea of entering another person's mind is not even new to the movies
      or TV (An Outer Limits episode was better at exploring this idea) . i gave it 4
      / 10 since some special effects were nice .
    encryption_key: null
...
[NeMo I 2021-01-27 16:58:50 infer:101] Query: by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .
[NeMo I 2021-01-27 16:58:50 infer:102] Predicted label: positive
[NeMo I 2021-01-27 16:58:50 infer:101] Query: director rob marshall went out gunning to make a great one .
[NeMo I 2021-01-27 16:58:50 infer:102] Predicted label: positive
[NeMo I 2021-01-27 16:58:50 infer:101] Query: uneasy mishmash of styles and genres .
[NeMo I 2021-01-27 16:58:50 infer:102] Predicted label: negative
[NeMo I 2021-01-27 16:58:50 infer:101] Query: I love exotic science fiction / fantasy movies but this one was very unpleasant to watch . Suggestions and images of child abuse , mutilated bodies (live or dead) , other gruesome scenes , plot holes , boring acting made this a regretable experience , The basic idea of entering another person's mind is not even new to the movies or TV (An Outer Limits episode was better at exploring this idea) . i gave it 4 / 10 since some special effects were nice .
[NeMo I 2021-01-27 16:58:50 infer:102] Predicted label: negative

Each query is printed along with its predicted label.
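
To collect the predictions for further processing, the query and label lines can be extracted from the log. The following sketch assumes the log lines have exactly the layout shown above (a `Query:` line followed by a `Predicted label:` line, both emitted by the `infer` module); the function name `parse_predictions` is hypothetical, and the pattern would need adjusting if the log format differs.

```python
import re

QUERY_RE = re.compile(r"infer:\d+\] Query: (.*)$")
LABEL_RE = re.compile(r"infer:\d+\] Predicted label: (.*)$")


def parse_predictions(log_lines):
    """Return a list of (query, predicted_label) pairs found in the log."""
    results = []
    pending_query = None
    for line in log_lines:
        line = line.rstrip("\n")
        match = QUERY_RE.search(line)
        if match:
            pending_query = match.group(1)
            continue
        match = LABEL_RE.search(line)
        if match and pending_query is not None:
            results.append((pending_query, match.group(1)))
            pending_query = None
    return results


log = [
    "[NeMo I 2021-01-27 16:58:50 infer:101] Query: director rob marshall went out gunning to make a great one .",
    "[NeMo I 2021-01-27 16:58:50 infer:102] Predicted label: positive",
    "[NeMo I 2021-01-27 16:58:50 infer:101] Query: uneasy mishmash of styles and genres .",
    "[NeMo I 2021-01-27 16:58:50 infer:102] Predicted label: negative",
]
predictions = parse_predictions(log)
for query, label in predictions:
    print(label, "|", query)
```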

Model Export

You may use the export tool to convert a saved TAO model into Riva format, which enables faster inference. An example of the spec file for model export:

# Name of the .tlt EFF archive to be loaded/model to be exported.
restore_from: trained-model.tlt

# Set export format to RIVA
export_format: RIVA

# Output EFF archive containing Riva file.
export_to: exported-model.riva

Parameter	Data Type	Default	Description
restore_from	string	trained-model.tlt	Path to the pre-trained model
export_format	string	ONNX	Export format, choose from: ONNX or RIVA
export_to	string	exported-model.eonnx	Path to the exported model

To export a pre-trained model to RIVA’s format, run:

tao text_classification export \
    -e /specs/nlp/text_classification/export.yaml \
    -m finetuned-model.tlt \
    -k $KEY \
    export_format=RIVA \
    export_to=exported-model.riva

Required Arguments for Export

-e: The experiment specification file to set up export.
-m: Path to the pre-trained model checkpoint to export. Should be a .tlt file.
-k: Encryption key

The output should be something like this:

            [NeMo I 2021-01-27 17:04:05 export:37] Spec file:
    restore_from: ./trained-model.tlt
    export_to: exported-model.riva
    export_format: RIVA
    exp_manager:
      task_name: export
      explicit_log_dir: ./
    encryption_key: null
...
[NeMo W 2021-02-01 16:00:22 exp_manager:27] Exp_manager is logging to `./``, but it already exists.
[NeMo W 2021-02-01 16:00:28 modelPT:193] Using /tmp/tmpmke24h_1/tokenizer.vocab_file instead of tokenizer.vocab_file.
[NeMo W 2021-02-01 16:00:35 modelPT:193] Using /tmp/tmpmke24h_1/label_ids.csv instead of ../data/SST-2/label_ids.csv.
[NeMo I 2021-02-01 16:00:37 export:52] Model restored from '/home/user/tao-toolkit-pyt/nlp/text_classification/entrypoint/nemo_experiments/trained-model/2021-01-27_16-53-38/checkpoints/trained-model.tlt'
[NeMo I 2021-02-01 16:01:08 export:66] Experiment logs saved to '.'
[NeMo I 2021-02-01 16:01:08 export:67] Exported model to './exported-model.riva'