Question Answering¶
Introduction¶
In the Question Answering (or Reading Comprehension) task, the model is given a question and a passage of content (the context) that may contain the answer, and it predicts the span of text that answers the question, expressed as a start and an end position. For datasets like SQuAD 2.0, the model also supports cases where the answer is not contained in the content.
For every word in the context of a given question, the model will be trained to predict:
The likelihood this word is the start of the span
The likelihood this word is the end of the span
The model chooses the start and end words with maximal probabilities. When the content does not contain the answer, the start and end of the span should both point to the first token.
A pretrained BERT encoder with two span prediction heads is used to predict the start and end positions of the answer. The span prediction heads are token classifiers consisting of a single linear layer.
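The span selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than the TLT implementation: the function name `best_span`, the `max_answer_len` limit, the rule that the end may not precede the start, and the choice to add the start and end scores are all assumptions made for the example.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) token indices of the highest-scoring span.

    Index 0 is treated as the "no answer" position (assumption based on
    the description above: both heads point at the first token).
    """
    # Baseline: the score of predicting "no answer".
    best = (0, 0)
    best_score = start_scores[0] + end_scores[0]
    for start in range(1, len(start_scores)):
        # Only consider spans that end at or after their start and stay short.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_scores[start] + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Per-token scores from the two heads (made-up numbers).
print(best_span([0.1, 0.2, 2.5, 0.3, 0.1], [0.1, 0.1, 0.4, 3.0, 0.2]))
```

If the scores are log-probabilities, adding them corresponds to multiplying the start and end probabilities, which is one common way to rank candidate spans.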
TLT provides a sample notebook to outline the end-to-end workflow on how to train a Question Answering model using TLT and deploy it in Jarvis format on NGC resources.
Downloading Sample Spec files¶
Before proceeding, download the sample spec files that are needed for the rest of the subtasks.
tlt question_answering download_specs -r /results/question_answering/default_specs/ \
-o /specs/nlp/question_answering
Data Format¶
This model expects the dataset in SQuAD format (i.e., a JSON file for each dataset split). The code snippet below shows an example of the training file. Each title has one or multiple paragraph entries, each consisting of the “context” and question-answer entries. Each question-answer entry has:
A question
A globally unique id
The Boolean flag “is_impossible”, which shows whether a question is answerable or not
(if the question is answerable) One answer entry containing the text span and its starting character index in the context.
(if the question is not answerable) An empty “answers” list
The evaluation files (for validation and testing) follow the above format, except that they can provide more than one answer to the same question. The inference file also follows the above format, except that it does not require the “answers” and “is_impossible” keywords.
The following is an example of the data format (JSON file):
{
  "data": [
    {
      "title": "Super_Bowl_50",
      "paragraphs": [
        {
          "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
          "qas": [
            {
              "question": "Where did Super Bowl 50 take place?",
              "is_impossible": false,
              "id": "56be4db0acb8001400a502ee",
              "answers": [
                {
                  "answer_start": 403,
                  "text": "Santa Clara, California"
                }
              ]
            },
            {
              "question": "What was the winning score of the Super Bowl 50?",
              "is_impossible": true,
              "id": "56be4db0acb8001400a502ez",
              "answers": []
            }
          ]
        }
      ]
    }
  ]
}
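Because each answer stores both its text and its starting character index, a quick consistency check can catch malformed files before training. The following is a small sketch under the assumption that the file follows the layout shown above; the helper names `load_squad`, `iter_qas`, and `misaligned_answers` are made up for this example and are not part of TLT.

```python
import json
from typing import Iterator, List, Tuple


def load_squad(path: str) -> dict:
    """Read a SQuAD-style JSON file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_qas(squad: dict) -> Iterator[Tuple[str, dict]]:
    """Yield (context, question-answer entry) pairs."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield paragraph["context"], qa


def misaligned_answers(squad: dict) -> List[str]:
    """Return ids of questions whose answer text is not found at answer_start."""
    bad = []
    for context, qa in iter_qas(squad):
        for answer in qa.get("answers", []):
            # int() also accepts indices that were stored as strings.
            start = int(answer["answer_start"])
            if context[start:start + len(answer["text"])] != answer["text"]:
                bad.append(qa["id"])
    return bad


example = {"data": [{"title": "demo", "paragraphs": [{
    "context": "The game was played at Santa Clara, California.",
    "qas": [{"question": "Where was the game played?", "id": "q1",
             "is_impossible": False,
             "answers": [{"answer_start": 23,
                          "text": "Santa Clara, California"}]}]}]}]}
print(misaligned_answers(example))
```

For a real dataset, the same check would be run as `misaligned_answers(load_squad(path))`, where `path` points to one of the JSON splits.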
Dataset Conversion¶
To train the QA model on the SQuAD dataset, you must first download it. You can choose either SQuAD version 1.1, which has 100,000+ question-answer pairs on 500+ articles and contains no unanswerable questions, or the newer SQuAD version 2.0, which combines the 100,000 questions from SQuAD 1.1 with over 50,000 unanswerable questions. To do well on SQuAD 2.0, a system must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
After downloading the files, you should have a squad data folder that contains the following four files for training and evaluation:
|--squad
|-- v1.1/train-v1.1.json
|-- v1.1/dev-v1.1.json
|-- v2.0/train-v2.0.json
|-- v2.0/dev-v2.0.json
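A short check can confirm that the download produced the layout above before a long training run is started. This is only a convenience sketch; the function name `missing_squad_files` is hypothetical, and the relative paths are taken from the tree shown above.

```python
from pathlib import Path
from typing import List

# Relative paths expected inside the squad data folder (from the tree above).
EXPECTED_FILES = [
    "v1.1/train-v1.1.json",
    "v1.1/dev-v1.1.json",
    "v2.0/train-v2.0.json",
    "v2.0/dev-v2.0.json",
]


def missing_squad_files(squad_dir: str) -> List[str]:
    """Return the expected files that are not present under squad_dir."""
    root = Path(squad_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means that all four files are in place.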
Model Training¶
The following is an example of the config spec for training (train.yaml) file. You can change any of these parameters and pass them to the training command.
trainer:
  max_epochs: 2
# Name of the .tlt file where trained model will be saved.
save_to: trained-model.tlt
model:
  dataset:
    do_lower_case: true
    version_2_with_negative: true
  tokenizer:
    tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece
    vocab_file: null # path to vocab file
    tokenizer_model: null # only used if tokenizer is sentencepiece
    special_tokens: null
  language_model:
    pretrained_model_name: bert-base-uncased
    lm_checkpoint: null
    config_file: null # json file, precedence over config
    config: null
  token_classifier:
    num_layers: 1
    dropout: 0.0
    num_classes: 2
    activation: relu
    log_softmax: false
    use_transformer_init: true
training_ds:
  file: ??? # e.g. squad/v2.0/train-v2.0.json
  batch_size: 12 # per GPU
  shuffle: true
  num_samples: -1
validation_ds:
  file: ??? # e.g. squad/v2.0/dev-v2.0.json
  batch_size: 12 # per GPU
  shuffle: false
  num_samples: -1
optim:
  # optimizer arguments
  name: adamw
  lr: 3e-5
  betas: [0.9, 0.999]
  weight_decay: 0.0
  # scheduler config override
  sched:
    name: SquareRootAnnealing
    warmup_steps: null
    warmup_ratio: 0.0
    last_epoch: -1
    # pytorch lightning args
    monitor: val_loss
    reduce_on_plateau: false
Parameter | Data Type | Default | Description |
trainer.max_epochs | integer | 2 | The number of epochs to train |
save_to | string | trained-model.tlt | The filename of the trained model |
data_dir | string | – | The path to the data converted to the specified format |
model.tokenizer.tokenizer_name | string | Will be filled automatically based on model.language_model.pretrained_model_name | The tokenizer name |
model.tokenizer.vocab_file | string | null | The path to tokenizer vocabulary |
model.tokenizer.tokenizer_model | string | null | The path to tokenizer model (for sentencepiece tokenizer only) |
model.tokenizer.special_tokens | string | null | Special tokens for the tokenizer (if they exist) |
model.language_model.pretrained_model_name | string | bert-base-uncased | The pre-trained language model name (choose from bert-base-cased, bert-base-uncased, megatron-bert-345m-cased and megatron-bert-345m-uncased) |
model.language_model.lm_checkpoint | string | null | The path to the pre-trained language model checkpoint |
model.language_model.config_file | string | null | The path to the pre-trained language model config file |
model.language_model.config | dictionary | null | The config of the pre-trained language model |
model.token_classifier.num_layers | integer | 1 | The number of fully connected layers of the Classifier on top of the Bert model |
model.token_classifier.dropout | float | 0.0 | The dropout ratio of the fully connected layers |
model.token_classifier.num_classes | integer | 2 | The number of Classifiers (two for QA) |
model.token_classifier.activation | string | relu | The activation function to use |
model.token_classifier.log_softmax | boolean | false | A flag specifying whether to use log soft max |
training_ds.file | string | – | The training file names |
training_ds.batch_size | integer | 12 | The training data batch size |
training_ds.shuffle | bool | true | A flag specifying whether to shuffle the training data |
training_ds.num_samples | integer | -1 | The number of samples to use from the training dataset (use -1 to specify all samples) |
validation_ds.file | string | – | The validation file names |
validation_ds.batch_size | integer | 12 | The validation data batch size |
validation_ds.shuffle | bool | false | A flag specifying whether to shuffle the validation data |
validation_ds.num_samples | integer | -1 | The number of samples to use from the validation dataset (use -1 to specify all samples) |
optim.name | string | adamw | The optimizer to use for training |
optim.lr | float | 2e-5 | The learning rate to use for training |
optim.weight_decay | float | 0.0 | The weight decay to use for training |
optim.sched.name | string | SquareRootAnnealing | The warmup schedule |
optim.sched.warmup_ratio | float | 0.0 | The warmup ratio |
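The tokenizer name in the spec is written as `${model.language_model.pretrained_model_name}`, which is a reference to another field of the same config. The sketch below shows the idea of resolving such references on a plain nested dictionary. It is an illustration only: the real resolution is done by the configuration library used by the toolkit, and the helpers `lookup` and `resolve` are hypothetical names.

```python
import re
from typing import Any

# Matches references of the form ${a.b.c}.
_REFERENCE = re.compile(r"\$\{([^}]+)\}")


def lookup(cfg: dict, dotted_key: str) -> Any:
    """Follow a dotted key such as 'model.language_model.config' into cfg."""
    node = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node


def _resolve_node(cfg: dict, node: Any) -> Any:
    if isinstance(node, dict):
        return {key: _resolve_node(cfg, value) for key, value in node.items()}
    if isinstance(node, str):
        # Replace every ${...} reference with the value it points to.
        return _REFERENCE.sub(lambda m: str(lookup(cfg, m.group(1))), node)
    return node


def resolve(cfg: dict) -> dict:
    """Return a copy of cfg with all ${...} references filled in."""
    return _resolve_node(cfg, cfg)


spec = {"model": {
    "tokenizer": {
        "tokenizer_name": "${model.language_model.pretrained_model_name}",
        "vocab_file": None},
    "language_model": {"pretrained_model_name": "bert-base-uncased"}}}
print(resolve(spec)["model"]["tokenizer"]["tokenizer_name"])
```

This is why changing `pretrained_model_name` alone is enough to switch the tokenizer as well.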
The following is an example of the command for training the model:
!tlt question_answering train -e /specs/nlp/question_answering/train.yaml \
data_dir=PATH_TO_DATA \
trainer.max_epochs=2 \
trainer.amp_level="O1" \
trainer.precision=16 \
-g 1
Note
The first time you run training, it will take an extra 5-10 minutes to process the dataset. Future training runs will reuse the processed dataset, which is automatically cached in files in the same directory as the data.
Required Arguments for Training¶
-e: The experiment specification file to set up training.
data_dir: The dataset directory
Optional Arguments¶
trainer.max_epochs: The number of training epochs
-g: The number of GPUs to use for training
trainer.amp_level and trainer.precision: These fields allow you to use 16-bit mixed precision to accelerate training.
Note
You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call (e.g., trainer.val_check_interval=0.25).
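To make the override mechanism concrete, the sketch below applies `key=value` arguments with dotted keys to a nested dictionary that stands in for the loaded spec file. It is a simplified illustration and not the toolkit's own parser; `parse_value` and `apply_override` are hypothetical helpers.

```python
import ast
from typing import Any


def parse_value(text: str) -> Any:
    """Turn '2' into 2 and '0.25' into 0.25; leave other text as a string."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_override(cfg: dict, override: str) -> None:
    """Apply one 'a.b.c=value' override to a nested dictionary in place."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for part in parents:
        # Create intermediate sections that the spec file did not define.
        node = node.setdefault(part, {})
    node[leaf] = parse_value(raw_value)


spec = {"trainer": {"max_epochs": 2}}
for argument in ["trainer.val_check_interval=0.25", "trainer.max_epochs=3"]:
    apply_override(spec, argument)
print(spec)
```

The command-line value takes precedence over the value in the spec file, so the file can hold defaults while individual runs change only what they need.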
Training Procedure¶
At the start of training, TLT will print out a log of the experiment specification, then load and preprocess the training data. For the SQuAD dataset, it can initially take several minutes to tokenize the content. To make subsequent runs faster, the preprocessed dataset is cached in files in the same directory as the original dataset. TLT will then display the detailed model architecture.
As the model starts training, you should see a progress bar per epoch.
Since QA datasets like SQuAD are big, it is usually enough to train for two epochs.
If you want to better see the training progress, you can add the trainer.val_check_interval parameter to the script with a value less than one (e.g., trainer.val_check_interval=0.25, which specifies four evaluations on the validation dataset per epoch of training).
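The relationship between a fractional trainer.val_check_interval and the number of validation runs can be worked out as follows. This sketch assumes the fraction is converted to a whole number of training batches by truncation, which is how the underlying trainer is understood to behave; the function name and the batch count are made up for the example.

```python
from typing import List


def validation_points(num_train_batches: int,
                      val_check_interval: float) -> List[int]:
    """Return the batch counts within one epoch after which validation runs."""
    # Number of training batches between two validation runs (at least one).
    every = max(1, int(num_train_batches * val_check_interval))
    return list(range(every, num_train_batches + 1, every))


# With 1000 training batches per epoch, 0.25 gives four validation runs.
print(validation_points(1000, 0.25))
```

A value of 1.0 reduces to a single validation run at the end of each epoch.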
At the end of training, TLT will save the best checkpoint on the validation dataset at the path specified by the experiment spec file before finishing.
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
[NeMo W 2021-01-28 14:52:19 exp_manager:299] There was no checkpoint folder at checkpoint_dir :results/checkpoints. Training from scratch.
[NeMo I 2021-01-28 14:52:19 exp_manager:186] Experiments will be logged at results
...
Validating: 100%|███████████████████████████| 1020/1020 [01:00<00:00, 21.60it/s][NeMo I 2021-01-29 10:17:18 qa_model:175] val exact match 50.10528088941295
[NeMo I 2021-01-29 10:17:18 qa_model:176] val f1 50.10528088941295
Epoch 0: 25%|██▎ | 3770/15076 [09:18<27:54, 6.75it/s, loss=1.34, lr=3e-5]
Epoch 0, global step 2748: val_loss reached 1.19158 (best 1.19158), saving model to...
Model Fine-tuning¶
The following is an example spec for fine-tuning of the model:
trainer:
  max_epochs: 1
# Name of the .tlt file where finetuned model will be saved.
save_to: finetuned-model.tlt
# Fine-tuning settings: training dataset.
finetuning_ds:
  file: ??? # e.g. squad/v1.1/train-v1.1.json
  num_samples: 500 # DEMO purposes; -1 means the whole dataset
# Fine-tuning settings: validation dataset.
validation_ds:
  file: ??? # e.g. squad/v1.1/dev-v1.1.json
  num_samples: 500 # DEMO purposes; -1 means the whole dataset
# Fine-tuning settings: different optimizer.
optim:
  name: adamw
  lr: 5e-6
Parameter | Data Type | Default | Description |
trainer.max_epochs | integer | 2 | The number of epochs to train |
restore_from | string | trained-model.tlt | The path to the pre-trained model |
save_to | string | finetuned-model.tlt | The path to save trained model to |
finetuning_ds.file | string | – | The data file for fine tuning |
finetuning_ds.num_samples | integer | 500 | The number of samples to use from the fine-tuning dataset (use -1 to specify all samples) |
validation_ds.file | string | – | The validation data file |
validation_ds.num_samples | integer | 500 | The number of samples to use from the validation dataset (use -1 to specify all samples) |
optim.name | string | adam | The optimizer to use for training |
optim.lr | float | 1e-5 | The learning rate to use for training |
Use the following command to fine-tune the model:
!tlt question_answering finetune \
-e /specs/nlp/question_answering/finetune.yaml \
-g 1 \
data_dir=PATH_TO_DATA
Required Arguments for Fine-tuning¶
-e: The experiment specification file to set up fine-tuning
data_dir: The path to the data
Optional Arguments¶
-g: The number of GPUs to use for fine-tuning in a multi-GPU scenario (default: 1)
Note
You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call (e.g., trainer.val_check_interval=0.25).
Fine-tuning Procedure¶
The fine-tuning procedure and logs will look similar to those described in the Model Training section, except that the model is initially loaded from a previously trained checkpoint.
Model Evaluation¶
The following is an example spec to evaluate the pre-trained model:
# Test settings: dataset.
test_ds:
  file: ??? # e.g. squad/v1.1/dev-v1.1.json
  batch_size: 32
  shuffle: false
  num_samples: 500 # DEMO purposes; -1 means the whole dataset
Parameter | Data Type | Default | Description |
test_ds.file | string | – | The evaluation data file |
test_ds.batch_size | integer | 32 | The evaluation data batch size |
test_ds.shuffle | bool | false | A flag specifying whether to shuffle the evaluation data |
test_ds.num_samples | integer | 500 | The number of samples to use from the evaluation dataset (use -1 to specify all samples) |
Use the following command to evaluate the model:
!tlt question_answering evaluate \
-e /specs/nlp/question_answering/evaluate.yaml \
data_dir=PATH_TO_DATA
Required Arguments for Evaluation¶
-e: The experiment specification file to set up evaluation
data_dir: The path to the pre-processed data to run evaluation on
Evaluation Procedure¶
After the previously trained model is initialized, it will run evaluation against the provided test set. For extractive QA models, which return an answer span, accuracy is evaluated with two metrics: the exact match (EM) and the F1 score of the returned answer spans compared to the correct answers. The overall EM and F1 scores for a model are computed by averaging the individual example scores.
Exact match: If the answer span is exactly equal to the correct one, it returns 1; otherwise, it returns 0. When assessing against a negative example (SQuAD 2.0), if the model predicts any text at all, it automatically receives a 0 for that example.
F1: The F1 score is a common metric for classification problems and is widely used in QA. It is appropriate when we care equally about precision and recall. In this case, it is computed over the individual words in the prediction against those in the true answer. The number of shared words between the prediction and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth.
F1 = 2 * (precision * recall) / (precision + recall)
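The two metrics can be written out for a single prediction as in the sketch below. The normalization step (lowercasing and removing punctuation and English articles) follows the convention of the public SQuAD evaluation script and is an assumption here, since this page does not say how answers are normalized before comparison. The function names are chosen for the example.

```python
import collections
import re
import string
from typing import List

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split into words."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if the normalized answers are identical, otherwise 0."""
    return int(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Word-level F1 between a predicted answer and one reference answer."""
    pred_tokens, true_tokens = normalize(prediction), normalize(truth)
    if not pred_tokens or not true_tokens:
        # Unanswerable case: full credit only if both sides are empty.
        return float(pred_tokens == true_tokens)
    overlap = collections.Counter(pred_tokens) & collections.Counter(true_tokens)
    shared = sum(overlap.values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_tokens)
    recall = shared / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Santa Clara, California", "santa clara california"))
print(f1_score("Levi's Stadium in Santa Clara", "Santa Clara, California"))
```

In the second call, two of the five predicted words are shared with the three reference words, so precision is 2/5 and recall is 2/3. The dataset-level scores are then the averages of these per-example values.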
Testing: 100%|████████████████████████████████| 383/383 [01:36<00:00, 3.77it/s][NeMo I 2021-01-29 10:26:40 qa_model:175] test exact match 50.11370336056599
[NeMo I 2021-01-29 10:26:40 qa_model:176] test f1 50.11370336056599
Testing: 100%|████████████████████████████████| 383/383 [02:07<00:00, 3.01it/s]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_exact_match': 50.11370336056599,
'test_f1': 50.11370336056599,
'test_loss': tensor(1.1229, device='cuda:0')}
Model Inference¶
The following is an example spec for inference:
# Name of file containing data used as inputs during the inference.
input_file: ??? # e.g. squad/v1.1/dev-v1.1.json
# Name of output nbest list file to store predictions to
output_nbest_file: nbest.txt
# Name of output file to store predictions to
output_prediction_file: prediction.txt
Parameter | Data Type | Default | Description |
input_file | string | – | The file containing the data used as inputs during the inference |
output_nbest_file | string | nbest.txt | The name of the output nbest list file to store predictions in |
output_prediction_file | string | prediction.txt | The name of the output file to store predictions in |
The following example shows how to run inference:
!tlt question_answering infer \
-e /specs/nlp/question_answering/infer.yaml \
-m trained-model.tlt
Required Arguments for Inference¶
-e: The experiment specification file to set up inference. This requires the input_file field, which points to the examples to run inference on.
-m: The path to the pre-trained model checkpoint from which to infer. The value should be a .tlt file.
Inference Procedure¶
After the trained model is loaded, it will run on the input file, which is in the same format as the file used for training and evaluation. It will create a prediction.txt output file with prediction spans for each question in the input file.
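Since the inference input uses the same layout as the training and evaluation files but does not need the “answers” and “is_impossible” keywords, an existing evaluation file can be reduced to an inference-only file. The sketch below shows one way to do that; the function name `to_inference_format` and the output file name in the comment are hypothetical.

```python
import copy
import json


def to_inference_format(squad: dict) -> dict:
    """Return a copy of a SQuAD-style dict without answers or is_impossible."""
    stripped = copy.deepcopy(squad)
    for article in stripped["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                qa.pop("answers", None)
                qa.pop("is_impossible", None)
    return stripped


labelled = {"data": [{"title": "demo", "paragraphs": [{
    "context": "The game was played at Santa Clara, California.",
    "qas": [{"question": "Where was the game played?", "id": "q1",
             "is_impossible": False,
             "answers": [{"answer_start": 23,
                          "text": "Santa Clara, California"}]}]}]}]}
unlabelled = to_inference_format(labelled)
print(json.dumps(unlabelled["data"][0]["paragraphs"][0]["qas"][0]))
# To write the result to disk (hypothetical file name):
# with open("inference_input.json", "w", encoding="utf-8") as handle:
#     json.dump(unlabelled, handle)
```

The question ids are kept, so each prediction can be traced back to its question.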
Model Export¶
The following is an example of the spec file for model export:
# Name of the .tlt EFF archive to be loaded/model to be exported.
restore_from: trained-model.tlt
# Set export format: ONNX | JARVIS
export_format: ONNX
# Output EFF archive containing ONNX.
export_to: exported-model.eonnx
Parameter | Data Type | Default | Description |
restore_from | string | trained-model.tlt | The path to the pre-trained model |
export_format | string | ONNX | The export format (either “ONNX” or “JARVIS”) |
export_to | string | exported-model.eonnx | The path to the exported model |
To export a pre-trained model, run the following:
### For export to ONNX
!tlt question_answering export \
-e /specs/nlp/question_answering/export.yaml \
-m finetuned-model.tlt \
-k $KEY
Required Arguments for Export¶
-e: The experiment specification file to set up export.
-m: The path to the pre-trained model checkpoint to export. The file should have a .tlt extension.
-k: The encryption key
Model Deployment¶
You can use the Jarvis framework to deploy the trained model at runtime. For more details, refer to the Jarvis documentation.