Question Answering

Introduction

In the Question Answering (or Reading Comprehension) task, the model is given a question and a passage of content (context) that may contain an answer to the question, and it predicts the span within the text, expressed as a start and end position, that answers the question. For datasets like SQuAD 2.0, this model also supports cases where the answer is not contained in the content.

For every word in the context of a given question, the model will be trained to predict:

  • The likelihood this word is the start of the span

  • The likelihood this word is the end of the span

The model chooses the start and end words with maximal probabilities. When the content does not contain the answer, we would like the start and end span to be set for the first token.
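
The span selection described above can be sketched in a few lines of Python. This is only an illustration, not the TLT implementation: the function name `best_span`, the `max_answer_len` limit, and the constraint that the end must not precede the start are assumptions added here so that the result is always a valid span. Index 0 stands for the first token, which is the "no answer" target mentioned above.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair with the highest combined score.

    Hypothetical helper: the end index is constrained to be >= the start
    index and the span length is capped by max_answer_len (both assumptions).
    """
    best = (0, 0)  # (0, 0) is the "no answer" span on the first token
    best_score = start_scores[0] + end_scores[0]
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Token 2 is the most likely start and token 3 the most likely end.
print(best_span([0.1, 0.2, 5.0, 0.1], [0.1, 0.1, 0.2, 4.0]))
```

If the first token has the highest start and end scores, the sketch returns `(0, 0)`, which corresponds to predicting that the context contains no answer.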

A pretrained BERT encoder with two span prediction heads is used to predict the start and end positions of the answer. The span prediction heads are token classifiers consisting of a single linear layer.

TLT provides a sample notebook on NGC resources that outlines the end-to-end workflow for training a Question Answering model with TLT and deploying it in Jarvis format.

Downloading Sample Spec files

Before proceeding, download the sample spec files that are needed for the rest of the subtasks.

tlt question_answering download_specs -r /results/question_answering/default_specs/ \
                                      -o /specs/nlp/question_answering

Data Format

This model expects the dataset in SQuAD format (i.e., a JSON file for each dataset split). The code snippet below shows an example of the training file. Each title has one or multiple paragraph entries, each consisting of the “context” and question-answer entries. Each question-answer entry has:

  • A question

  • A globally unique id

  • The Boolean flag “is_impossible”, which shows whether a question is answerable or not

  • (if the question is answerable) One answer entry containing the text span and its starting character index in the context.

  • (if the question is not answerable) An empty “answers” list

The evaluation files (for validation and testing) follow the above format, except that they can provide more than one answer to the same question. The inference file also follows the above format, except that it does not require the “answers” and “is_impossible” keywords.

The following is an example of the data format (JSON file):

{
    "data": [
        {
            "title": "Super_Bowl_50",
            "paragraphs": [
                {
                    "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
                    "qas": [
                        {
                            "question": "Where did Super Bowl 50 take place?",
                            "is_impossible": false,
                            "id": "56be4db0acb8001400a502ee",
                            "answers": [
                                {
                                    "answer_start": 403,
                                    "text": "Santa Clara, California"
                                }
                            ]
                        },
                        {
                            "question": "What was the winning score of the Super Bowl 50?",
                            "is_impossible": true,
                            "id": "56be4db0acb8001400a502ez",
                            "answers": [
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
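
The format rules listed above can be checked with a short script before training. The following is a minimal sketch rather than part of TLT: the function name `check_squad` and the inline sample are hypothetical, and the code assumes that "is_impossible" is a JSON Boolean and that "answer_start" is a character offset into the context.

```python
def check_squad(squad):
    """Yield (question_id, problem) for each entry that breaks the format."""
    seen_ids = set()
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa["id"] in seen_ids:
                    yield qa["id"], "duplicate id"
                seen_ids.add(qa["id"])
                answers = qa.get("answers", [])
                if qa.get("is_impossible", False):
                    # Unanswerable questions must have an empty answers list.
                    if answers:
                        yield qa["id"], "unanswerable question has answers"
                    continue
                if not answers:
                    yield qa["id"], "answerable question has no answers"
                for answer in answers:
                    start = int(answer["answer_start"])
                    text = answer["text"]
                    # The answer text must appear at the stated offset.
                    if context[start:start + len(text)] != text:
                        yield qa["id"], "answer_start does not match text"


# Hypothetical miniature dataset used only to exercise the checker.
sample = {"data": [{"title": "t", "paragraphs": [{
    "context": "The game was played at Santa Clara, California.",
    "qas": [
        {"id": "q1", "question": "Where?", "is_impossible": False,
         "answers": [{"answer_start": 23, "text": "Santa Clara, California"}]},
        {"id": "q2", "question": "Who?", "is_impossible": True, "answers": []},
    ]}]}]}

print(list(check_squad(sample)))
```

An empty list means that no problems were found in the file.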

Dataset Conversion

To train the QA model on the SQuAD dataset, you must first download it. You can choose either SQuAD version 1.1, which contains no unanswerable questions and has 100,000+ question-answer pairs on 500+ articles, or the newer SQuAD version 2.0, which combines the 100,000 questions from SQuAD 1.1 with over 50,000 unanswerable questions. To do well on SQuAD 2.0, a system must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

After downloading the files, you should have a squad data folder that contains the following four files for training and evaluation:

|--squad
     |-- v1.1/train-v1.1.json
     |-- v1.1/dev-v1.1.json
     |-- v2.0/train-v2.0.json
     |-- v2.0/dev-v2.0.json
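
A quick way to confirm this layout before starting a run is sketched below. The helper name `missing_squad_files` is hypothetical; the file names are taken from the listing above.

```python
from pathlib import Path

# Relative paths expected inside the squad folder, as listed above.
EXPECTED_FILES = (
    "v1.1/train-v1.1.json",
    "v1.1/dev-v1.1.json",
    "v2.0/train-v2.0.json",
    "v2.0/dev-v2.0.json",
)


def missing_squad_files(squad_dir):
    """Return the expected files that are not present under squad_dir."""
    root = Path(squad_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# "squad" is a placeholder path; an empty list means the layout is complete.
print(missing_squad_files("squad"))
```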

Model Training

The following is an example of the training spec file (train.yaml). You can change any of these parameters and pass them to the training command.

trainer:
  max_epochs: 2

  # Name of the .tlt file where trained model will be saved.
  save_to: trained-model.tlt

model:

  dataset:
      do_lower_case: true
      version_2_with_negative: true

  tokenizer:
      tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece
      vocab_file: null # path to vocab file
      tokenizer_model: null # only used if tokenizer is sentencepiece
      special_tokens: null

  language_model:
    pretrained_model_name: bert-base-uncased
    lm_checkpoint: null
    config_file: null # json file, precedence over config
    config: null

  token_classifier:
    num_layers: 1
    dropout: 0.0
    num_classes: 2
    activation: relu
    log_softmax: false
    use_transformer_init: true


training_ds:
  file: ??? # e.g. squad/v2.0/train-v2.0.json
  batch_size: 12 # per GPU
  shuffle: true
  num_samples: -1

validation_ds:
  file: ??? # e.g. squad/v2.0/dev-v2.0.json
  batch_size: 12 # per GPU
  shuffle: false
  num_samples: -1

optim:
  # optimizer arguments
  name: adamw
  lr: 3e-5
  betas: [0.9, 0.999]
  weight_decay: 0.0
  # scheduler config override
  sched:
    name: SquareRootAnnealing
    warmup_steps: null
    warmup_ratio: 0.0
    last_epoch: -1

    # pytorch lightning args
    monitor: val_loss
    reduce_on_plateau: false

| Parameter | Data Type | Default | Description |
|---|---|---|---|
| trainer.max_epochs | integer | 2 | The number of epochs to train |
| save_to | string | trained-model.tlt | The filename of the trained model |
| data_dir | string | | The path to the data converted to the specified format |
| model.tokenizer.tokenizer_name | string | Will be filled automatically based on model.language_model.pretrained_model_name | The tokenizer name |
| model.tokenizer.vocab_file | string | null | The path to the tokenizer vocabulary |
| model.tokenizer.tokenizer_model | string | null | The path to the tokenizer model (for the sentencepiece tokenizer only) |
| model.tokenizer.special_tokens | string | null | Special tokens for the tokenizer (if they exist) |
| model.language_model.pretrained_model_name | string | bert-base-uncased | The pre-trained language model name (choose from bert-base-cased, bert-base-uncased, megatron-bert-345m-cased and megatron-bert-345m-uncased) |
| model.language_model.lm_checkpoint | string | null | The path to the pre-trained language model checkpoint |
| model.language_model.config_file | string | null | The path to the pre-trained language model config file |
| model.language_model.config | dictionary | null | The config of the pre-trained language model |
| model.token_classifier.num_layers | integer | 1 | The number of fully connected layers of the classifier on top of the BERT model |
| model.token_classifier.dropout | float | 0.0 | The dropout ratio of the fully connected layers |
| model.token_classifier.num_classes | integer | 2 | The number of classifiers (two for QA) |
| model.token_classifier.activation | string | relu | The activation function to use |
| model.token_classifier.log_softmax | boolean | false | A flag specifying whether to use log softmax |
| training_ds.file | string | | The training file name |
| training_ds.batch_size | integer | 12 | The training data batch size |
| training_ds.shuffle | boolean | true | A flag specifying whether to shuffle the training data |
| training_ds.num_samples | integer | -1 | The number of samples to use from the training dataset (use -1 to specify all samples) |
| validation_ds.file | string | | The validation file name |
| validation_ds.batch_size | integer | 12 | The validation data batch size |
| validation_ds.shuffle | boolean | false | A flag specifying whether to shuffle the validation data |
| validation_ds.num_samples | integer | -1 | The number of samples to use from the validation dataset (use -1 to specify all samples) |
| optim.name | string | adamw | The optimizer to use for training |
| optim.lr | float | 3e-5 | The learning rate to use for training |
| optim.weight_decay | float | 0.0 | The weight decay to use for training |
| optim.sched.name | string | SquareRootAnnealing | The warmup schedule |
| optim.sched.warmup_ratio | float | 0.0 | The warmup ratio |
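
The dotted parameter names used above (for example, optim.sched.warmup_ratio) correspond to nested keys in the YAML spec. The following sketch illustrates that mapping on a plain Python dictionary. It is not how TLT resolves parameters internally; the helper names `get_by_dotted_name` and `set_by_dotted_name` are hypothetical.

```python
def get_by_dotted_name(spec, dotted_name):
    """Look up a nested value, e.g. 'optim.sched.warmup_ratio'."""
    node = spec
    for key in dotted_name.split("."):
        node = node[key]
    return node


def set_by_dotted_name(spec, dotted_name, value):
    """Set a nested value in place, creating intermediate levels if needed."""
    keys = dotted_name.split(".")
    node = spec
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return spec


# A fragment of the training spec above, written as a dictionary.
spec = {
    "trainer": {"max_epochs": 2},
    "optim": {"lr": 3e-5, "sched": {"warmup_ratio": 0.0}},
}

set_by_dotted_name(spec, "trainer.max_epochs", 3)
print(get_by_dotted_name(spec, "trainer.max_epochs"))
print(get_by_dotted_name(spec, "optim.sched.warmup_ratio"))
```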

The following is an example of the command for training the model:

!tlt question_answering train -e /specs/nlp/question_answering/train.yaml \
                        data_dir=PATH_TO_DATA \
                        trainer.max_epochs=2 \
                        trainer.amp_level="O1" \
                        trainer.precision=16 \
                        -g 1

Note

The first time you run training, it takes an extra 5-10 minutes to process the dataset. Subsequent training runs use the processed dataset, which is automatically cached in files in the same directory as the data.

Required Arguments for Training

  • -e: The experiment specification file to set up training.

  • data_dir: The dataset directory

Optional Arguments

  • trainer.max_epochs: The number of training epochs

  • -g: The number of GPUs to use for training

  • trainer.amp_level and trainer.precision: These fields allow you to use 16-bit mixed precision to accelerate training.

Note

You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call (e.g., trainer.val_check_interval=0.25).
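
When launching runs from a script, the command line can be assembled programmatically. The sketch below only builds the argument list from the pieces shown in this section (the -e and -g options and name=value overrides); the helper name `build_tlt_command` is hypothetical and the command is not executed.

```python
import shlex


def build_tlt_command(subtask, spec_path, overrides=None, gpus=None):
    """Build the argument list for a 'tlt question_answering' call."""
    cmd = ["tlt", "question_answering", subtask, "-e", spec_path]
    if gpus is not None:
        cmd += ["-g", str(gpus)]
    # Each override is passed as name=value, as described in the note above.
    for name, value in (overrides or {}).items():
        cmd.append(f"{name}={value}")
    return cmd


cmd = build_tlt_command(
    "train",
    "/specs/nlp/question_answering/train.yaml",
    overrides={"data_dir": "PATH_TO_DATA", "trainer.val_check_interval": 0.25},
    gpus=1,
)
print(shlex.join(cmd))
```

Building a list rather than a single string avoids quoting problems if the list is later handed to a process launcher.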

Training Procedure

At the start of training, TLT will print out a log of the experiment specification, then load and preprocess the training data. For the SQuAD dataset, it can initially take several minutes to tokenize the content. To make subsequent runs faster, the preprocessed dataset is cached in files in the same directory as the original dataset. TLT then displays the detailed model architecture.

As the model starts training, you should see a progress bar per epoch. Since QA datasets like SQuAD are big, it is usually enough to train for two epochs. To follow the training progress more closely, you can add the trainer.val_check_interval parameter to the script with a value less than one (e.g., trainer.val_check_interval=0.25, which specifies four evaluations on the validation dataset per epoch of training).
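
The relationship between the interval and the number of validation runs is simple arithmetic, sketched below. The function name `validation_schedule` is hypothetical, and truncating the batch count to an integer is an assumption about how a fractional interval is turned into a step count.

```python
def validation_schedule(num_train_batches, val_check_interval):
    """Return (batches between validation runs, validation runs per epoch)."""
    if not 0.0 < val_check_interval <= 1.0:
        raise ValueError("val_check_interval must be in (0, 1]")
    # Assumption: the fractional interval is truncated to whole batches.
    every = max(1, int(num_train_batches * val_check_interval))
    return every, num_train_batches // every


# With 1000 training batches per epoch, 0.25 gives a validation run
# every 250 batches, i.e. four runs per epoch.
print(validation_schedule(1000, 0.25))
```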

At the end of training, TLT will save the best checkpoint on the validation dataset at the path specified by the experiment spec file before finishing.

TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
[NeMo W 2021-01-28 14:52:19 exp_manager:299] There was no checkpoint folder at checkpoint_dir :results/checkpoints. Training from scratch.
[NeMo I 2021-01-28 14:52:19 exp_manager:186] Experiments will be logged at results
...
Validating: 100%|███████████████████████████| 1020/1020 [01:00<00:00, 21.60it/s][NeMo I 2021-01-29 10:17:18 qa_model:175] val exact match 50.10528088941295
[NeMo I 2021-01-29 10:17:18 qa_model:176] val f1 50.10528088941295
Epoch 0:  25%|██▎      | 3770/15076 [09:18<27:54,  6.75it/s, loss=1.34, lr=3e-5]
      Epoch 0, global step 2748: val_loss reached 1.19158 (best 1.19158), saving model to...
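
If you want to track the metrics across runs, they can be pulled out of a saved log with a regular expression. This is a sketch that assumes the log lines keep the wording shown in the sample output above ("val exact match ...", "val f1 ..."); the pattern and the function name `parse_metrics` are not part of TLT.

```python
import re

# Matches e.g. "val exact match 50.1" or "test f1 50.1" in a log line.
METRIC_RE = re.compile(r"\b(val|test) (exact match|f1) ([0-9]+(?:\.[0-9]+)?)")


def parse_metrics(log_text):
    """Return a dict such as {'val_exact_match': 50.1, 'val_f1': 50.1}."""
    metrics = {}
    for split, name, value in METRIC_RE.findall(log_text):
        # Later occurrences overwrite earlier ones, so the last value wins.
        metrics[f"{split}_{name.replace(' ', '_')}"] = float(value)
    return metrics


log_text = (
    "[NeMo I 2021-01-29 10:17:18 qa_model:175] val exact match 50.10528088941295\n"
    "[NeMo I 2021-01-29 10:17:18 qa_model:176] val f1 50.10528088941295\n"
    "Epoch 0, global step 2748: val_loss reached 1.19158 (best 1.19158)\n"
)
print(parse_metrics(log_text))
```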

Model Fine-tuning

The following is an example spec for fine-tuning of the model:

trainer:
  max_epochs: 1

# Name of the .tlt file where finetuned model will be saved.
save_to: finetuned-model.tlt

# Fine-tuning settings: training dataset.
finetuning_ds:
  file: ??? # e.g. squad/v1.1/train-v1.1.json
  num_samples: 500 # DEMO purposes # -1 # number of samples to be considered, -1 means all the dataset

# Fine-tuning settings: validation dataset.
validation_ds:
  file: ??? # e.g. squad/v1.1/dev-v1.1.json
  num_samples: 500 # DEMO purposes # -1 # number of samples to be considered, -1 means all the dataset

# Fine-tuning settings: different optimizer.
optim:
  name: adamw
  lr: 5e-6

| Parameter | Data Type | Default | Description |
|---|---|---|---|
| trainer.max_epochs | integer | 2 | The number of epochs to train |
| restore_from | string | trained-model.tlt | The path to the pre-trained model |
| save_to | string | finetuned-model.tlt | The path to save the trained model to |
| finetuning_ds.file | string | | The data file for fine-tuning |
| finetuning_ds.num_samples | integer | 500 | The number of samples to use from the fine-tuning dataset (use -1 to specify all samples) |
| validation_ds.file | string | | The validation data file |
| validation_ds.num_samples | integer | 500 | The number of samples to use from the validation dataset (use -1 to specify all samples) |
| optim.name | string | adam | The optimizer to use for training |
| optim.lr | float | 1e-5 | The learning rate to use for training |

Use the following command to fine-tune the model:

!tlt question_answering finetune \
                        -e /specs/nlp/question_answering/finetune.yaml \
                        -g 1 \
                        data_dir=PATH_TO_DATA

Required Arguments for Fine-tuning

  • -e: The experiment specification file to set up fine-tuning

  • data_dir: The path to the data

Optional Arguments

  • -g: The number of GPUs to use for fine-tuning in a multi-GPU scenario (default: 1)

Note

You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call (e.g. trainer.val_check_interval=0.25).

Fine-tuning Procedure

The fine-tuning procedure and logs look similar to those described in the Model Training section, except that the model is initially loaded from a previously trained checkpoint.

Model Evaluation

The following is an example spec to evaluate the pre-trained model:

# Test settings: dataset.
test_ds:
  file: ??? # e.g. squad/v1.1/dev-v1.1.json
  batch_size: 32
  shuffle: false
  num_samples: 500 # DEMO purposes # -1 # number of samples to be considered, -1 means the whole dataset

| Parameter | Data Type | Default | Description |
|---|---|---|---|
| test_ds.file | string | | The evaluation data file |
| test_ds.batch_size | integer | 32 | The evaluation data batch size |
| test_ds.shuffle | boolean | false | A flag specifying whether to shuffle the evaluation data |
| test_ds.num_samples | integer | 500 | The number of samples to use from the evaluation dataset (use -1 to specify all samples) |

Use the following command to evaluate the model:

!tlt question_answering evaluate \
                        -e /specs/nlp/question_answering/evaluate.yaml \
                        data_dir=PATH_TO_DATA

Required Arguments for Evaluation

  • -e: The experiment specification file to set up evaluation

  • data_dir: The path to the pre-processed data to run evaluation on

Evaluation Procedure

After the previously trained model is initialized, it will run evaluation against the provided test set. For extractive QA models, where the model returns an answer span, accuracy is evaluated with two metrics: the exact match (EM) and the F1 score of the returned answer spans compared to the correct answers. The overall EM and F1 scores for a model are computed by averaging the individual example scores.

  • Exact match: If the answer span is exactly equal to the correct one, the score is 1; otherwise, it is 0. When assessing against a negative example (SQuAD 2.0), if the model predicts any text at all, it automatically receives a 0 for that example.

  • F1: The F1 score is a common metric for classification problems and is widely used in QA. It is appropriate when precision and recall matter equally. Here, it is computed over the individual words in the prediction against those in the ground-truth answer. The number of shared words between the prediction and the ground truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth. F1 = 2 * (precision * recall) / (precision + recall)
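
The two metrics can be sketched in Python as follows. This is an illustration of the definitions above, not the evaluation code used by TLT. The text normalization (lowercasing, removing punctuation and the articles "a", "an", "the") is an assumption borrowed from common SQuAD evaluation practice, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, and split into words."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, truth):
    """Return 1 if the normalized answers are identical, otherwise 0."""
    return int(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Word-level F1 between a predicted answer and a ground-truth answer."""
    pred_words, true_words = normalize(prediction), normalize(truth)
    if not pred_words or not true_words:
        # Negative example: full credit only if both answers are empty.
        return float(pred_words == true_words)
    shared = sum((Counter(pred_words) & Counter(true_words)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_words)
    recall = shared / len(true_words)
    return 2 * precision * recall / (precision + recall)


# Two of the three ground-truth words are predicted:
# precision = 2/2, recall = 2/3, F1 = 0.8.
print(exact_match("Santa Clara", "Santa Clara, California"))
print(f1_score("Santa Clara", "Santa Clara, California"))
```

When an evaluation file provides several answers for a question, a common convention is to take the maximum score over the answers. That convention is not stated in this section, so treat it as an assumption.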

Testing: 100%|████████████████████████████████| 383/383 [01:36<00:00,  3.77it/s][NeMo I 2021-01-29 10:26:40 qa_model:175] test exact match 50.11370336056599
[NeMo I 2021-01-29 10:26:40 qa_model:176] test f1 50.11370336056599
Testing: 100%|████████████████████████████████| 383/383 [02:07<00:00,  3.01it/s]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_exact_match': 50.11370336056599,
 'test_f1': 50.11370336056599,
 'test_loss': tensor(1.1229, device='cuda:0')}

Model Inference

The following is an example spec for inference:

# Name of the file containing data used as inputs during the inference.
input_file: ??? # e.g. squad/v1.1/dev-v1.1.json

# Name of output nbest list file to store predictions to
output_nbest_file: nbest.txt

# Name of output file to store predictions to
output_prediction_file: prediction.txt

| Parameter | Data Type | Default | Description |
|---|---|---|---|
| input_file | string | | The file containing the data used as inputs during the inference |
| output_nbest_file | string | nbest.txt | The name of the output nbest list file to store predictions in |
| output_prediction_file | string | prediction.txt | The name of the output file to store predictions in |

The following example shows how to run inference:

!tlt question_answering infer \
                        -e /specs/nlp/question_answering/infer.yaml \
                        -m trained-model.tlt

Required Arguments for Inference

  • -e: The experiment specification file to set up inference. This requires the input_file parameter, which names the file containing the examples to run inference on.

  • -m: The path to the pre-trained model checkpoint from which to infer. The value should be a .tlt file.

Inference Procedure

After the trained model is loaded, it will run on the input file, which is in the same format as the file used for training and evaluation. It will create a prediction.txt output file with the predicted span for each question in the input file.
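
Because the inference file does not require the “answers” and “is_impossible” keywords, an existing evaluation file can be reduced to an inference-only input. The sketch below shows one way to do this on data already loaded into memory; the function name `to_inference_format` and the inline sample are hypothetical.

```python
import copy


def to_inference_format(squad):
    """Return a copy of a SQuAD-style dict without answers or is_impossible."""
    stripped = copy.deepcopy(squad)  # leave the caller's data untouched
    for article in stripped["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                qa.pop("answers", None)
                qa.pop("is_impossible", None)
    return stripped


# Hypothetical miniature dataset used only for illustration.
labeled = {"data": [{"title": "t", "paragraphs": [{
    "context": "The game was played at Santa Clara, California.",
    "qas": [{"id": "q1", "question": "Where?", "is_impossible": False,
             "answers": [{"answer_start": 23,
                          "text": "Santa Clara, California"}]}],
}]}]}

print(to_inference_format(labeled)["data"][0]["paragraphs"][0]["qas"])
```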

Model Export

The following is an example of the spec file for model export:

# Name of the .tlt EFF archive to be loaded/model to be exported.
restore_from: trained-model.tlt

# Set export format: ONNX | JARVIS
export_format: ONNX

# Output EFF archive containing ONNX.
export_to: exported-model.eonnx

| Parameter | Data Type | Default | Description |
|---|---|---|---|
| restore_from | string | trained-model.tlt | The path to the pre-trained model |
| export_format | string | ONNX | The export format (either “ONNX” or “JARVIS”) |
| export_to | string | exported-model.eonnx | The path to the exported model |

To export a pre-trained model, run the following:

 ### For export to ONNX
!tlt question_answering export \
                        -e /specs/nlp/question_answering/export.yaml \
                        -m finetuned-model.tlt \
                        -k $KEY

Required Arguments for Export

  • -e: The experiment specification file to set up export.

  • -m: The path to the pre-trained model checkpoint to export. The file should have a .tlt extension.

  • -k: The encryption key

Model Deployment

You can use the Jarvis framework to deploy the trained model at runtime. For more details, refer to the Jarvis documentation.