Question Answering¶

Introduction¶

In the Question Answering (or Reading Comprehension) task, the model is given a question and a passage of content (context) that may contain the answer. It predicts the answer as a span within the text, identified by a start and an end position. For datasets like SQuAD 2.0, this model also supports cases where the answer is not contained in the content.

For every word in the context of a given question, the model will be trained to predict:

The likelihood this word is the start of the span
The likelihood this word is the end of the span

The model chooses the start and end words with maximal probabilities. When the content does not contain the answer, both the start and the end of the span should point to the first token.
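
The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TLT implementation: `best_span` and `max_answer_len` are hypothetical names, and the constraints that the end cannot precede the start and that a span has a maximum length are common additions that the text above does not spell out.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score.

    Index 0 stands for the first token, so the span (0, 0) is read as
    "the context does not contain the answer".
    """
    best = (0, 0)
    best_score = start_scores[0] + end_scores[0]
    for start, start_score in enumerate(start_scores):
        # Assumption: only ends at or after the start, within a length limit.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best
```

When the first token has the highest start and end scores, the function returns `(0, 0)`, which corresponds to the unanswerable case.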

A pretrained BERT encoder with two span prediction heads is used to predict the start and the end position of the answer. The span prediction heads are token classifiers consisting of a single linear layer.

TLT provides a sample notebook that outlines the end-to-end workflow for training a Question Answering model using TLT and deploying it in Jarvis format on NGC resources.

Downloading Sample Spec files¶

Before proceeding, download the sample spec files that are needed for the rest of the subtasks.

tlt question_answering download_specs -r /results/question_answering/default_specs/ \
                                      -o /specs/nlp/question_answering

Data Format¶

This model expects the dataset in SQuAD format (i.e., a JSON file for each dataset split). The code snippet below shows an example of the training file. Each title has one or multiple paragraph entries, each consisting of the “context” and question-answer entries. Each question-answer entry has:

A question
A globally unique id
The Boolean flag “is_impossible”, which shows whether a question is answerable or not
(if the question is answerable) One answer entry containing the text span and its starting character index in the context.
(if the question is not answerable) An empty “answers” list

The evaluation files (for validation and testing) follow the above format, except that they can provide more than one answer to the same question. The inference file also follows the above format, except that it does not require the “answers” and “is_impossible” keywords.

The following is an example of the data format (JSON file):

{
    "data": [
        {
            "title": "Super_Bowl_50",
            "paragraphs": [
                {
                    "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
                    "qas": [
                        {
                            "question": "Where did Super Bowl 50 take place?",
                            "is_impossible": false,
                            "id": "56be4db0acb8001400a502ee",
                            "answers": [
                                {
                                    "answer_start": 403,
                                    "text": "Santa Clara, California"
                                }
                            ]
                        },
                        {
                            "question": "What was the winning score of the Super Bowl 50?",
                            "is_impossible": true,
                            "id": "56be4db0acb8001400a502ez",
                            "answers": [
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}
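
Because each answer is stored as a text span plus its starting character index, a training or evaluation file can be sanity-checked before it is used. The following is a minimal sketch, not part of TLT; `find_offset_errors` is a hypothetical helper name. It accepts both native JSON types and string-encoded values for “is_impossible” and “answer_start”.

```python
from typing import Dict, List


def _is_true(value) -> bool:
    # Accepts both the JSON boolean true and the string "true".
    return str(value).lower() == "true"


def _matches(context: str, answer: Dict) -> bool:
    # The answer text must appear in the context at the stated offset.
    start = int(answer["answer_start"])
    text = answer["text"]
    return context[start:start + len(text)] == text


def find_offset_errors(squad: Dict) -> List[str]:
    """Return the ids of question-answer entries that look inconsistent."""
    bad_ids = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                if _is_true(qa.get("is_impossible", False)):
                    # Unanswerable questions carry an empty answers list.
                    if answers:
                        bad_ids.append(qa["id"])
                    continue
                # Answerable questions need at least one matching answer span.
                if not answers or not all(_matches(context, a) for a in answers):
                    bad_ids.append(qa["id"])
    return bad_ids
```

An empty result means that every answerable question has at least one answer and that every answer text appears in the context at its stated offset.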

Dataset Conversion¶

To train the QA model on the SQuAD dataset, you must first download it from here. You can choose either SQuAD version 1.1, which contains 100,000+ question-answer pairs on 500+ articles and no unanswerable questions, or the newer SQuAD version 2.0, which combines the 100,000 questions from SQuAD 1.1 with over 50,000 unanswerable questions. To do well on SQuAD 2.0, a system must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

After downloading the files, you should have a squad data folder that contains the following four files for training and evaluation:

|--squad
     |-- v1.1/train-v1.1.json
     |-- v1.1/dev-v1.1.json
     |-- v2.0/train-v2.0.json
     |-- v2.0/dev-v2.0.json
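
To confirm which version a downloaded split belongs to, the answerable and unanswerable questions can be counted. This is a small sketch assuming the SQuAD JSON layout described in the Data Format section; `count_questions` is a hypothetical helper, not a TLT command.

```python
from typing import Dict, Tuple


def count_questions(squad: Dict) -> Tuple[int, int]:
    """Return (answerable, unanswerable) question counts for one split."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # A missing "is_impossible" key is treated as answerable.
                if str(qa.get("is_impossible", False)).lower() == "true":
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable
```

Load a split with `json.load` and pass the resulting dictionary to `count_questions`. For a v1.1 file, the second count should be zero.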

Model Training¶

The following is an example of the training spec file (train.yaml). You can change any of these parameters and pass them to the training command.

trainer:
  max_epochs: 2

  # Name of the .tlt file where trained model will be saved.
  save_to: trained-model.tlt

model:

  dataset:
      do_lower_case: true
      version_2_with_negative: true

  tokenizer:
      tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece
      vocab_file: null # path to vocab file
      tokenizer_model: null # only used if tokenizer is sentencepiece
      special_tokens: null

  language_model:
    pretrained_model_name: bert-base-uncased
    lm_checkpoint: null
    config_file: null # json file, precedence over config
    config: null

  token_classifier:
    num_layers: 1
    dropout: 0.0
    num_classes: 2
    activation: relu
    log_softmax: false
    use_transformer_init: true


training_ds:
  file: ??? # e.g. squad/v2.0/train-v2.0.json
  batch_size: 12 # per GPU
  shuffle: true
  num_samples: -1

validation_ds:
  file: ??? # e.g. squad/v2.0/dev-v2.0.json
  batch_size: 12 # per GPU
  shuffle: false
  num_samples: -1

optim:
  # optimizer arguments
  name: adamw
  lr: 3e-5
  betas: [0.9, 0.999]
  weight_decay: 0.0
  # scheduler config override
  sched:
    name: SquareRootAnnealing
    warmup_steps: null
    warmup_ratio: 0.0
    last_epoch: -1

    # pytorch lightning args
    monitor: val_loss
    reduce_on_plateau: false

Parameter	Data Type	Default	Description
trainer.max_epochs	integer	2	The number of epochs to train
save_to	string	trained-model.tlt	The filename of the trained model
data_dir	string	–	The path to the data converted to the specified format
model.tokenizer.tokenizer_name	string	Will be filled automatically based on model.language_model.pretrained_model_name	The tokenizer name
model.tokenizer.vocab_file	string	null	The path to tokenizer vocabulary
model.tokenizer.tokenizer_model	string	null	The path to tokenizer model (for sentencepiece tokenizer only)
model.tokenizer.special_tokens	string	null	Special tokens for the tokenizer (if they exist)
model.language_model.pretrained_model_name	string	bert-base-uncased	The pre-trained language model name (choose from bert-base-cased, bert-base-uncased, megatron-bert-345m-cased and megatron-bert-345m-uncased)
model.language_model.lm_checkpoint	string	null	The path to the pre-trained language model checkpoint
model.language_model.config_file	string	null	The path to the pre-trained language model config file
model.language_model.config	dictionary	null	The config of the pre-trained language model
model.token_classifier.num_layers	integer	1	The number of fully connected layers of the classifier on top of the BERT model
model.token_classifier.dropout	float	0.0	The dropout ratio of the fully connected layers
model.token_classifier.num_classes	integer	2	The number of classes predicted per token (two for QA: span start and span end)
model.token_classifier.activation	string	relu	The activation function to use
model.token_classifier.log_softmax	boolean	false	A flag specifying whether to use log-softmax
training_ds.file	string	–	The training file name
training_ds.batch_size	integer	12	The training data batch size
training_ds.shuffle	bool	true	A flag specifying whether to shuffle the training data
training_ds.num_samples	integer	-1	The number of samples to use from the training dataset (use -1 to specify all samples)
validation_ds.file	string	–	The validation file name
validation_ds.batch_size	integer	12	The validation data batch size
validation_ds.shuffle	bool	false	A flag specifying whether to shuffle the validation data
validation_ds.num_samples	integer	-1	The number of samples to use from the validation dataset (use -1 to specify all samples)
optim.name	string	adamw	The optimizer to use for training
optim.lr	float	3e-5	The learning rate to use for training
optim.weight_decay	float	0.0	The weight decay to use for training
optim.sched.name	string	SquareRootAnnealing	The warmup schedule
optim.sched.warmup_ratio	float	0.0	The warmup ratio

The following is an example of the command for training the model:

!tlt question_answering train -e /specs/nlp/question_answering/train.yaml \
                        data_dir=PATH_TO_DATA \
                        trainer.max_epochs=2 \
                        trainer.amp_level="O1" \
                        trainer.precision=16 \
                        -g 1

Note

The first time you run training, it will take an extra 5-10 minutes to process the dataset. Later training runs will use the processed dataset, which is automatically cached in files in the same directory as the data.

Required Arguments for Training¶

-e: The experiment specification file to set up training.
data_dir: The dataset directory

Optional Arguments¶

trainer.max_epochs: The number of training epochs
-g: The number of GPUs to use for training
trainer.amp_level and trainer.precision: These fields allow you to use 16-bit mixed precision to accelerate training.

Note

You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call (e.g., trainer.val_check_interval=0.25).
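
The effect of such an override can be pictured as setting one field in the nested spec by its dotted name. The sketch below only illustrates that idea; it is not how TLT parses its arguments, and `apply_overrides` and `_parse` are hypothetical helpers.

```python
from typing import Any, Dict, List


def _parse(raw: str) -> Any:
    # Try numeric types first and fall back to the raw string.
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply "dotted.key=value" overrides to a nested dict, in place."""
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections that the spec does not define yet.
            node = node.setdefault(name, {})
        node[leaf] = _parse(raw)
    return config
```

For example, applying `trainer.val_check_interval=0.25` to a spec that only defines `trainer.max_epochs` adds the new field under `trainer` and leaves `max_epochs` untouched.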

Training Procedure¶

At the start of training, TLT will print out a log of the experiment specification, then load and preprocess the training data. For the SQuAD dataset, it can initially take several minutes to tokenize the content. To make subsequent runs faster, the preprocessed dataset is cached in files in the same directory as the original dataset. TLT will then display the detailed model architecture.

As the model starts training, you should see a progress bar per epoch. Since QA datasets like SQuAD are big, it is usually enough to train for two epochs. To follow the training progress more closely, you can add the trainer.val_check_interval parameter to the script with a value less than one (e.g., trainer.val_check_interval=0.25, which specifies four evaluations on the validation dataset per epoch of training).
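
The arithmetic behind a fractional interval can be sketched as follows. This assumes the interval is converted to a whole number of training batches by truncation, which is how PyTorch Lightning is understood to handle it; `validation_points` is a hypothetical helper and the batch count in the usage note is made up for illustration.

```python
from typing import List


def validation_points(batches_per_epoch: int, val_check_interval: float) -> List[int]:
    """Return the batch counts within one epoch after which validation runs."""
    # Assumption: the fraction is truncated to whole batches, at least one.
    every = max(1, int(batches_per_epoch * val_check_interval))
    return list(range(every, batches_per_epoch + 1, every))
```

With 1000 training batches per epoch and an interval of 0.25, validation would run after batches 250, 500, 750, and 1000.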

At the end of training, TLT will save the checkpoint that performed best on the validation dataset to the path specified in the experiment spec file before finishing.

TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
[NeMo W 2021-01-28 14:52:19 exp_manager:299] There was no checkpoint folder at checkpoint_dir :results/checkpoints. Training from scratch.
[NeMo I 2021-01-28 14:52:19 exp_manager:186] Experiments will be logged at results
...
Validating: 100%|███████████████████████████| 1020/1020 [01:00<00:00, 21.60it/s][NeMo I 2021-01-29 10:17:18 qa_model:175] val exact match 50.10528088941295
[NeMo I 2021-01-29 10:17:18 qa_model:176] val f1 50.10528088941295
Epoch 0:  25%|██▎      | 3770/15076 [09:18<27:54,  6.75it/s, loss=1.34, lr=3e-5]
      Epoch 0, global step 2748: val_loss reached 1.19158 (best 1.19158), saving model to...

Model Fine-tuning¶

The following is an example spec for fine-tuning of the model:

trainer:
  max_epochs: 1

# Name of the .tlt file where finetuned model will be saved.
save_to: finetuned-model.tlt

# Fine-tuning settings: training dataset.
finetuning_ds:
  file: ??? # e.g. squad/v1.1/train-v1.1.json
  num_samples: 500 # DEMO purposes # -1 # number of samples to be considered, -1 means all the dataset

# Fine-tuning settings: validation dataset.
validation_ds:
  file: ??? # e.g. squad/v1.1/dev-v1.1.json
  num_samples: 500 # DEMO purposes # -1 # number of samples to be considered, -1 means all the dataset

# Fine-tuning settings: different optimizer.
optim:
  name: adamw
  lr: 5e-6

Parameter	Data Type	Default	Description
trainer.max_epochs	integer	1	The number of epochs to train
restore_from	string	trained-model.tlt	The path to the pre-trained model
save_to	string	finetuned-model.tlt	The path to save trained model to
finetuning_ds.file	string	–	The data file for fine tuning
finetuning_ds.num_samples	integer	500	The number of samples to use from the fine-tuning dataset (use -1 to specify all samples)
validation_ds.file	string	–	The validation data file
validation_ds.num_samples	integer	500	The number of samples to use from the validation dataset (use -1 to specify all samples)
optim.name	string	adamw	The optimizer to use for training
optim.lr	float	5e-6	The learning rate to use for training

Use the following command to fine-tune the model:

!tlt question_answering finetune \
                        -e /specs/nlp/question_answering/finetune.yaml \
                        -g 1 \
                        data_dir=PATH_TO_DATA

Required Arguments for Fine-tuning¶

-e: The experiment specification file to set up fine-tuning
data_dir: The path to the data

Optional Arguments¶

-g: The number of GPUs to use for fine-tuning in a multi-GPU scenario (default: 1)

Note

You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call (e.g. trainer.val_check_interval=0.25).

Fine-tuning Procedure¶

The fine-tuning procedure and logs will look similar to those described in the Model Training section, except that the model is initially loaded from a previously trained checkpoint.

Model Evaluation¶

The following is an example spec to evaluate the pre-trained model:

# Test settings: dataset.
test_ds:
  file: ??? # e.g. squad/v1.1/dev-v1.1.json
  batch_size: 32
  shuffle: false
  num_samples: 500 # DEMO purposes # -1 # number of samples to be considered, -1 means the whole dataset

Parameter	Data Type	Default	Description
test_ds.file	string	–	The evaluation data file
test_ds.batch_size	integer	32	The test data batch size
test_ds.shuffle	bool	false	A flag specifying whether to shuffle the test data
test_ds.num_samples	integer	500	The number of samples to use from the test dataset (use -1 to specify all samples)

Use the following command to evaluate the model:

!tlt question_answering evaluate \
                        -e /specs/nlp/question_answering/evaluate.yaml \
                        data_dir=PATH_TO_DATA

Required Arguments for Evaluation¶

-e: The experiment specification file to set up evaluation
data_dir: The path to the pre-processed data to run evaluation on

Evaluation Procedure¶

After the previously trained model is initialized, it will run evaluation against the provided test set. For extractive QA models, where the model returns an answer span, accuracy is evaluated with two metrics: the exact match (EM) and the F1 score of the returned answer spans compared to the correct answers. The overall EM and F1 scores for a model are computed by averaging the individual example scores.

Exact match: If the answer span is exactly equal to the correct one, the score is 1; otherwise, it is 0. When assessing against a negative example (SQuAD 2.0), if the model predicts any text at all, it automatically receives a 0 for that example.
F1: The F1 score is a common metric for classification problems and is widely used in QA. It is appropriate when precision and recall matter equally. In this case, it is computed over the individual words in the prediction against those in the true answer. The number of shared words between the prediction and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth. F1 = 2 * (precision * recall) / (precision + recall)
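
The two metrics can be sketched in plain Python for a single prediction. The normalization step (lowercasing, removing punctuation and English articles) follows the convention of the public SQuAD evaluation script and is an assumption here; the scoring code that TLT actually uses may differ in detail.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if the normalized prediction equals the normalized truth."""
    return int(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Word-level F1 between the prediction and the true answer."""
    pred_tokens, true_tokens = normalize(prediction), normalize(truth)
    if not pred_tokens or not true_tokens:
        # Negative example: credit only if both sides are empty.
        return float(pred_tokens == true_tokens)
    # Shared words, counted with multiplicity.
    shared = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_tokens)
    recall = shared / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```

For the prediction "in Santa Clara" against the true answer "Santa Clara, California", two of three words are shared on each side, so precision and recall are both 2/3 and F1 is 2/3, while the exact match is 0. Averaging these per-example values over a dataset and multiplying by 100 gives scores on the percentage scale shown in the log below.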

Testing: 100%|████████████████████████████████| 383/383 [01:36<00:00,  3.77it/s][NeMo I 2021-01-29 10:26:40 qa_model:175] test exact match 50.11370336056599
[NeMo I 2021-01-29 10:26:40 qa_model:176] test f1 50.11370336056599
Testing: 100%|████████████████████████████████| 383/383 [02:07<00:00,  3.01it/s]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_exact_match': 50.11370336056599,
 'test_f1': 50.11370336056599,
 'test_loss': tensor(1.1229, device='cuda:0')}

Model Inference¶

The following is an example spec for inference:

# Name of the file containing data used as input during inference.
input_file: ??? # e.g. squad/v1.1/dev-v1.1.json

# Name of output nbest list file to store predictions to
output_nbest_file: nbest.txt

# Name of output file to store predictions to
output_prediction_file: prediction.txt

Parameter	Data Type	Default	Description
input_file	string	–	The file containing the data used as inputs during the inference
output_nbest_file	string	nbest.txt	The name of the output nbest list file to store predictions in
output_prediction_file	string	prediction.txt	The name of the output file to store predictions in

The following example shows how to run inference:

!tlt question_answering infer \
                        -e /specs/nlp/question_answering/infer.yaml \
                        -m trained-model.tlt

Required Arguments for Inference¶

-e: The experiment specification file to set up inference. This requires input_file, the file with the examples to run inference on.
-m: The path to the pre-trained model checkpoint from which to infer. The value should be a .tlt file.

Inference Procedure¶

After the trained model is loaded, it will run on the input file, which is in the same format as the files used for training and evaluation. It will write the predicted span for each question in the input file to the prediction.txt output file.
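
Since the inference file does not need the “answers” and “is_impossible” keywords, an input file for your own questions can be assembled from a context and a list of questions. The following is a minimal sketch under the format described in the Data Format section; `build_inference_data` and its id scheme are hypothetical.

```python
from typing import Dict, List


def build_inference_data(title: str, context: str, questions: List[str]) -> Dict:
    """Build a SQuAD-style dictionary that holds questions without answers."""
    qas = [
        # Hypothetical id scheme: ids only have to be globally unique.
        {"id": f"{title}-{index}", "question": question}
        for index, question in enumerate(questions)
    ]
    return {"data": [{"title": title,
                      "paragraphs": [{"context": context, "qas": qas}]}]}
```

Writing the result with `json.dump` produces a file that can be referenced by input_file in the inference spec.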

Model Export¶

The following is an example of the spec file for model export:

# Name of the .tlt EFF archive to be loaded/model to be exported.
restore_from: trained-model.tlt

# Set export format: ONNX | JARVIS
export_format: ONNX

# Output EFF archive containing ONNX.
export_to: exported-model.eonnx

Parameter	Data Type	Default	Description
restore_from	string	trained-model.tlt	The path to the pre-trained model
export_format	string	ONNX	The export format (either “ONNX” or “JARVIS”)
export_to	string	exported-model.eonnx	The path to the exported model

To export a pre-trained model, run the following:

 ### For export to ONNX
!tlt question_answering export \
                        -e /specs/nlp/question_answering/export.yaml \
                        -m finetuned-model.tlt \
                        -k $KEY

Required Arguments for Export¶

-e: The experiment specification file to set up export.
-m: The path to the pre-trained model checkpoint to export. The file should have a .tlt extension.
-k: The encryption key

Model Deployment¶

You can use the Jarvis framework to deploy the trained model at runtime. For more details, refer to the Jarvis documentation.