Question Answering

In the Question Answering (or Reading Comprehension) task, the model is given a question and a passage of content (the context) that may contain the answer. The model predicts the span within the context, as a start and an end position, that answers the question. For datasets like SQuAD 2.0, the model also supports cases where the answer is not contained in the context.

For every word in the context of a given question, the model will be trained to predict:

  • The likelihood this word is the start of the span

  • The likelihood this word is the end of the span

The model chooses the start and end words with the maximal probabilities. When the context does not contain the answer, the start and end of the span should both point to the first token.
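
The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the TAO Toolkit implementation: the function name `select_span`, the `max_answer_len` limit, the requirement that the end does not precede the start, and the use of summed start and end scores are all assumptions made for the example. Index 0 plays the role of the first token, so a result of `(0, 0)` stands for "no answer".

```python
from typing import List, Tuple


def select_span(start_scores: List[float], end_scores: List[float],
                max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score.

    Index 0 is treated as the "no answer" position (hypothetical convention
    for this sketch), so (0, 0) means no answer was found in the context.
    """
    best = (0, 0)
    best_score = start_scores[0] + end_scores[0]  # score of the "no answer" span
    for start in range(1, len(start_scores)):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_scores[start] + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Token 2 is the most likely start and token 3 the most likely end.
print(select_span([0.1, 0.2, 2.0, 0.3, 0.1], [0.1, 0.1, 0.5, 1.8, 0.2]))
```

Taking the two maxima independently could return an end that lies before the start; scoring only valid (start, end) pairs avoids that case.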

A pretrained BERT encoder with two span prediction heads is used to predict the start and end positions of the answer. The span prediction heads are token classifiers consisting of a single linear layer.
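
As a rough illustration of what a single linear layer used as a token classifier computes, the sketch below maps each token vector to two numbers with plain Python lists. The name `span_logits`, the weight layout (one row per output), and the choice of output 0 as the start score and output 1 as the end score are assumptions for this example; in practice the encoder and the layer come from the pretrained model and the training framework.

```python
from typing import List, Tuple

Vector = List[float]


def span_logits(hidden_states: List[Vector], weight: List[Vector],
                bias: Vector) -> Tuple[List[float], List[float]]:
    """Apply one linear layer (hidden_size -> 2) to every token vector.

    `weight` has two rows: row 0 produces the start score and row 1 the
    end score (an assumed ordering for this sketch).
    """
    start_logits, end_logits = [], []
    for token in hidden_states:
        out = [sum(w * x for w, x in zip(row, token)) + b
               for row, b in zip(weight, bias)]
        start_logits.append(out[0])
        end_logits.append(out[1])
    return start_logits, end_logits


# Three tokens with a toy hidden size of 2.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
starts, ends = span_logits(hidden, weight=[[2.0, 0.0], [0.0, 3.0]], bias=[0.5, -0.5])
print(starts, ends)
```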

TAO Toolkit provides a sample notebook to outline the end-to-end workflow on how to train a Question Answering model using TAO Toolkit and deploy it in Riva format on NGC resources.

Before proceeding, download the sample spec files that are needed for the rest of the subtasks.

    tao question_answering download_specs \
        -r /results/question_answering/default_specs/ \
        -o /specs/nlp/question_answering

This model expects the dataset in SQuAD format (i.e., a JSON file for each dataset split). The code snippet below shows an example of the training file. Each title has one or multiple paragraph entries, each consisting of the “context” and question-answer entries. Each question-answer entry has:

  • A question

  • A globally unique id

  • The Boolean flag “is_impossible”, which shows whether a question is answerable or not

  • (if the question is answerable) One answer entry containing the text span and its starting character index in the context.

  • (if the question is not answerable) An empty “answers” list

The evaluation files (for validation and testing) follow the above format, except that they can provide more than one answer to the same question. The inference file also follows the above format, except that it does not require the “answers” and “is_impossible” keywords.
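
Since the inference file only differs from the other files by the two optional keywords, one way to derive it from an existing SQuAD-format file is to drop those keywords from every question entry. The sketch below assumes the data has already been loaded into a Python dictionary (for example with `json.load`); the helper name `to_inference_format` and the toy data are hypothetical.

```python
import copy
from typing import Any, Dict


def to_inference_format(squad: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a SQuAD-format dict without "answers" and "is_impossible"."""
    stripped = copy.deepcopy(squad)  # leave the caller's data untouched
    for article in stripped["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                qa.pop("answers", None)
                qa.pop("is_impossible", None)
    return stripped


labelled = {"data": [{"title": "demo", "paragraphs": [{
    "context": "The game was played in Santa Clara.",
    "qas": [{"question": "Where was the game played?", "id": "q1",
             "is_impossible": False,
             "answers": [{"text": "Santa Clara", "answer_start": 23}]}],
}]}]}
unlabelled = to_inference_format(labelled)
print(unlabelled["data"][0]["paragraphs"][0]["qas"][0])
```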

The following is an example of the data format (JSON file):

    {
      "data": [
        {
          "title": "Super_Bowl_50",
          "paragraphs": [
            {
              "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
              "qas": [
                {
                  "question": "Where did Super Bowl 50 take place?",
                  "is_impossible": false,
                  "id": "56be4db0acb8001400a502ee",
                  "answers": [
                    {
                      "answer_start": 403,
                      "text": "Santa Clara, California"
                    }
                  ]
                },
                {
                  "question": "What was the winning score of the Super Bowl 50?",
                  "is_impossible": true,
                  "id": "56be4db0acb8001400a502ez",
                  "answers": []
                }
              ]
            }
          ]
        }
      ]
    }
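
Because each answer stores both its text and its starting character index, a quick consistency check before training is to confirm that the text really occurs at that index in the context. The following is a minimal sketch of such a check; the helper name `find_misaligned_answers` and the toy data are hypothetical and not part of TAO Toolkit.

```python
from typing import Any, Dict, List


def find_misaligned_answers(squad: Dict[str, Any]) -> List[str]:
    """Return the ids of questions whose answer text is not found at answer_start."""
    bad_ids = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa.get("answers", []):
                    start = int(answer["answer_start"])
                    text = answer["text"]
                    if context[start:start + len(text)] != text:
                        bad_ids.append(qa["id"])
                        break  # one bad answer is enough to flag the question
    return bad_ids


context = "The game was played at Levi's Stadium in Santa Clara, California."
answer = "Santa Clara, California"
example = {"data": [{"title": "demo", "paragraphs": [{"context": context, "qas": [
    {"question": "Where was the game played?", "id": "q1", "is_impossible": False,
     "answers": [{"text": answer, "answer_start": context.index(answer)}]},
    {"question": "Who won the game?", "id": "q2", "is_impossible": True,
     "answers": []},
    {"question": "In which state was the game played?", "id": "q3",
     "is_impossible": False,
     "answers": [{"text": answer, "answer_start": 0}]},  # deliberately wrong offset
]}]}]}
misaligned = find_misaligned_answers(example)
print(misaligned)
```

Only the entry with the deliberately wrong offset is reported; unanswerable questions have an empty “answers” list and are skipped.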

To train the QA model on the SQuAD dataset, you must first download it. You can choose either SQuAD version 1.1, which contains no unanswerable questions and has 100,000+ question-answer pairs on 500+ articles, or the newer SQuAD version 2.0, which combines the 100,000 questions from SQuAD 1.1 with over 50,000 unanswerable questions. To do well on SQuAD 2.0, a system must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

After downloading the files, you should have a squad data folder that contains the following four files for training and evaluation:

    |--squad
         |-- v1.1/train-v1.1.json
         |-- v1.1/dev-v1.1.json
         |-- v2.0/train-v2.0.json
         |-- v2.0/dev-v2.0.json
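
A small helper can confirm that the folder has the layout shown above before a training run is started. This is only a convenience sketch; the function name `missing_squad_files` is hypothetical, and the relative paths are taken from the listing above.

```python
from pathlib import Path
from typing import List

# Relative paths expected inside the squad folder, as listed above.
EXPECTED_FILES = [
    "v1.1/train-v1.1.json",
    "v1.1/dev-v1.1.json",
    "v2.0/train-v2.0.json",
    "v2.0/dev-v2.0.json",
]


def missing_squad_files(squad_dir: str) -> List[str]:
    """Return the expected files that are not present under squad_dir."""
    root = Path(squad_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# An empty list means all four files were found.
print(missing_squad_files("squad"))
```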

The following is an example of the training spec file (train.yaml). You can change any of these parameters and pass them to the training command.

    trainer:
      max_epochs: 2

    # Name of the .tlt file where trained model will be saved.
    save_to: trained-model.tlt

    model:
      dataset:
        do_lower_case: true
        version_2_with_negative: true

      tokenizer:
        tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece
        vocab_file: null # path to vocab file
        tokenizer_model: null # only used if tokenizer is sentencepiece
        special_tokens: null

      language_model:
        pretrained_model_name: bert-base-uncased
        lm_checkpoint: null
        config_file: null # json file, precedence over config
        config: null

      token_classifier:
        num_layers: 1
        dropout: 0.0
        num_classes: 2
        activation: relu
        log_softmax: false
        use_transformer_init: true

    training_ds:
      file: ??? # e.g. squad/v2.0/train-v2.0.json
      batch_size: 12 # per GPU
      shuffle: true
      num_samples: -1

    validation_ds:
      file: ??? # e.g. squad/v2.0/dev-v2.0.json
      batch_size: 12 # per GPU
      shuffle: false
      num_samples: -1

    optim:
      # optimizer arguments
      name: adamw
      lr: 3e-5
      betas: [0.9, 0.999]
      weight_decay: 0.0

      # scheduler config override
      sched:
        name: SquareRootAnnealing
        warmup_steps: null
        warmup_ratio: 0.0
        last_epoch: -1

        # pytorch lightning args
        monitor: val_loss
        reduce_on_plateau: false

| Parameter | Data Type | Default | Description |
|---|---|---|---|
| trainer.max_epochs | integer | 2 | The number of epochs to train |
| save_to | string | trained-model.tlt | The filename of the trained model |
| data_dir | string | | The path to the data converted to the specified format |
| model.tokenizer.tokenizer_name | string | Will be filled automatically based on model.language_model.pretrained_model_name | The tokenizer name |
| model.tokenizer.vocab_file | string | null | The path to the tokenizer vocabulary |
| model.tokenizer.tokenizer_model | string | null | The path to the tokenizer model (for the sentencepiece tokenizer only) |
| model.tokenizer.special_tokens | string | null | Special tokens for the tokenizer (if they exist) |
| model.language_model.pretrained_model_name | string | bert-base-uncased | The pre-trained language model name (for example, bert-base-cased, bert-base-uncased, or distilbert-base-uncased) |
| model.language_model.lm_checkpoint | string | null | The path to the pre-trained language model checkpoint |
| model.language_model.config_file | string | null | The path to the pre-trained language model config file |
| model.language_model.config | dictionary | null | The config of the pre-trained language model |
| model.token_classifier.num_layers | integer | 1 | The number of fully connected layers of the classifier on top of the BERT model |
| model.token_classifier.dropout | float | 0.0 | The dropout ratio of the fully connected layers |
| model.token_classifier.num_classes | integer | 2 | The number of classes (two for QA: span start and span end) |
| model.token_classifier.activation | string | relu | The activation function to use |
| model.token_classifier.log_softmax | boolean | false | A flag specifying whether to use log softmax |
| training_ds.file | string | | The training file name |
| training_ds.batch_size | integer | 12 | The training data batch size |
| training_ds.shuffle | bool | true | A flag specifying whether to shuffle the training data |
| training_ds.num_samples | integer | -1 | The number of samples to use from the training dataset (use -1 to specify all samples) |
| validation_ds.file | string | | The validation file name |
| validation_ds.batch_size | integer | 12 | The validation data batch size |
| validation_ds.shuffle | bool | false | A flag specifying whether to shuffle the validation data |
| validation_ds.num_samples | integer | -1 | The number of samples to use from the validation dataset (use -1 to specify all samples) |
| optim.name | string | adamw | The optimizer to use for training |
| optim.lr | float | 3e-5 | The learning rate to use for training |
| optim.weight_decay | float | 0.0 | The weight decay to use for training |
| optim.sched.name | string | SquareRootAnnealing | The warmup schedule |
| optim.sched.warmup_ratio | float | 0.0 | The warmup ratio |
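
In the example spec, the dataset files are set to `???`, which indicates values that have to be supplied before training can run. Assuming the spec has been loaded into a nested Python dictionary, the sketch below lists the dotted names of such fields. The helper name `missing_fields` is hypothetical, and treating `???` as a "must be filled in" marker follows the convention of OmegaConf-style configs, which is an assumption about how the spec is interpreted.

```python
from typing import Any, Dict, List


def missing_fields(config: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return dotted paths of all entries whose value is the "???" marker."""
    missing = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            missing.extend(missing_fields(value, path))  # descend into sub-sections
        elif value == "???":
            missing.append(path)
    return missing


spec = {
    "trainer": {"max_epochs": 2},
    "training_ds": {"file": "???", "batch_size": 12},
    "validation_ds": {"file": "???", "batch_size": 12},
}
unfilled = missing_fields(spec)
print(unfilled)
```

The dotted names returned here have the same form as the parameter names in the table above.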

The following is an example of the command for training the model:

    !tao question_answering train \
        -e /specs/nlp/question_answering/train.yaml \
        data_dir=PATH_TO_DATA \
        trainer.max_epochs=2 \
        trainer.amp_level="O1" \
        trainer.precision=16 \
        -g 1

Note

The first time you run training, it takes an extra 5-10 minutes to process the dataset. Later training runs reuse the processed dataset, which is automatically cached in the same directory as the data.

Required Arguments for Training

  • -e: The experiment specification file to set up training.

  • data_dir: The dataset directory

Optional Arguments

  • trainer.max_epochs: The number of training epochs

  • -g: The number of GPUs to use for training

  • trainer.amp_level and trainer.precision: These fields allow you to use 16-bit mixed precision to accelerate training.

Note

You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call (e.g., trainer.val_check_interval=0.25).
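
To make the override mechanism more concrete, the sketch below shows how a `name=value` argument with a dotted name could be mapped onto a nested configuration dictionary. It is an illustration of the idea only, not the parser that TAO Toolkit uses; the helper name `apply_override` and the value-parsing rule (Python literals where possible, otherwise plain strings) are assumptions.

```python
import ast
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> Dict[str, Any]:
    """Set config[a][b][c] = value for an override of the form "a.b.c=value"."""
    dotted_key, _, raw_value = override.partition("=")
    try:
        value = ast.literal_eval(raw_value)  # numbers, booleans, lists, ...
    except (ValueError, SyntaxError):
        value = raw_value  # anything else is kept as a plain string
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})  # create missing sections on the way
    node[leaf] = value
    return config


config = {"trainer": {"max_epochs": 2}}
apply_override(config, "trainer.val_check_interval=0.25")
apply_override(config, "data_dir=PATH_TO_DATA")
print(config)
```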


Training Procedure

At the start of training, TAO Toolkit prints a log of the experiment specification, then loads and preprocesses the training data. For the SQuAD dataset, it can initially take several minutes to tokenize the content. To make subsequent runs faster, the preprocessed dataset is cached in the same directory as the original dataset. TAO Toolkit then displays the detailed model architecture.

As the model starts training, you should see a progress bar for each epoch. Since QA datasets like SQuAD are large, it is usually enough to train for two epochs. To follow the training progress more closely, add the trainer.val_check_interval parameter to the command with a value less than one (e.g., trainer.val_check_interval=0.25, which specifies four evaluations on the validation dataset per epoch of training).
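
The relationship between a fractional trainer.val_check_interval and the number of validation runs can be checked with a little arithmetic. The sketch below assumes that the fraction is applied to the number of training batches in an epoch, which is how PyTorch Lightning treats float values of this setting; the function name `validation_schedule` and the batch counts are made up for the example.

```python
from typing import Tuple


def validation_schedule(num_train_batches: int,
                        val_check_interval: float) -> Tuple[int, int]:
    """Return (training batches between validation runs, validation runs per epoch)."""
    batches_between_checks = max(1, int(num_train_batches * val_check_interval))
    checks_per_epoch = num_train_batches // batches_between_checks
    return batches_between_checks, checks_per_epoch


# With 1000 training batches per epoch and an interval of 0.25,
# validation runs after every 250 batches, four times per epoch.
print(validation_schedule(1000, 0.25))
```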

At the end of training, TAO Toolkit saves the checkpoint that performed best on the validation dataset to the path specified in the experiment spec file.

    TPU available: None, using: 0 TPU cores
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
    [NeMo W 2021-01-28 14:52:19 exp_manager:299] There was no checkpoint folder at checkpoint_dir :results/checkpoints. Training from scratch.
    [NeMo I 2021-01-28 14:52:19 exp_manager:186] Experiments will be logged at results
    ...
    Validating: 100%|███████████████████████████| 1020/1020 [01:00<00:00, 21.60it/s][NeMo I 2021-01-29 10:17:18 qa_model:175] val exact match 50.10528088941295
    [NeMo I 2021-01-29 10:17:18 qa_model:176] val f1 50.10528088941295
    Epoch 0: 25%|██▎ | 3770/15076 [09:18<27:54, 6.75it/s, loss=1.34, lr=3e-5]
    Epoch 0, global step 2748: val_loss reached 1.19158 (best 1.19158), saving model to...


The following is an example spec for fine-tuning of the model:

    trainer:
      max_epochs: 1

    # Name of the .tlt file where finetuned model will be saved.
    save_to: finetuned-model.tlt

    # Fine-tuning settings: training dataset.
    finetuning_ds:
      file: ??? # e.g. squad/v1.1/train-v1.1.json
      num_samples: 500 # DEMO purposes # -1 # number of samples to be considered, -1 means all the dataset

    # Fine-tuning settings: validation dataset.
    validation_ds:
      file: ??? # e.g. squad/v1.1/dev-v1.1.json
      num_samples: 500 # DEMO purposes # -1 # number of samples to be considered, -1 means all the dataset

    # Fine-tuning settings: different optimizer.
    optim:
      name: adamw
      lr: 5e-6

| Parameter | Data Type | Default | Description |
|---|---|---|---|
| trainer.max_epochs | integer | 2 | The number of epochs to train |
| restore_from | string | trained-model.tlt | The path to the pre-trained model |
| save_to | string | finetuned-model.tlt | The path to save the trained model to |
| finetuning_ds.file | string | | The data file for fine-tuning |
| finetuning_ds.num_samples | integer | 500 | The number of samples to use from the fine-tuning dataset (use -1 to specify all samples) |
| validation_ds.file | string | | The validation data file |
| validation_ds.num_samples | integer | 500 | The number of samples to use from the validation dataset (use -1 to specify all samples) |
| optim.name | string | adam | The optimizer to use for training |
| optim.lr | float | 1e-5 | The learning rate to use for training |

Use the following command to fine-tune the model:

    !tao question_answering finetune \
        -e /specs/nlp/question_answering/finetune.yaml \
        -g 1 \
        data_dir=PATH_TO_DATA

Required Arguments for Fine-tuning

  • -e: The experiment specification file to set up fine-tuning

  • data_dir: The path to the data

Optional Arguments

  • -g: The number of GPUs to use for fine-tuning in a multi-GPU scenario (default: 1)

Note

You can use other arguments to override fields in the specification file. To do so, use the name of the config parameter with a desired value and pass it as a parameter in the script call (e.g. trainer.val_check_interval=0.25).


Fine-tuning Procedure

The fine-tuning procedure and logs look similar to those described in the Model Training section, except that the model is initially loaded from a previously trained checkpoint.

The following is an example spec to evaluate the pre-trained model:

    # Test settings: dataset.
    test_ds:
      file: ??? # e.g. squad/v1.1/dev-v1.1.json
      batch_size: 32
      shuffle: false
      num_samples: 500 # DEMO purposes -1 # number of samples to be considered, -1 means the whole dataset

| Parameter | Data Type | Default | Description |
|---|---|---|---|
| test_ds.file | string | | The evaluation data file |
| test_ds.batch_size | integer | 32 | The evaluation data batch size |
| test_ds.shuffle | bool | false | A flag specifying whether to shuffle the evaluation data |
| test_ds.num_samples | integer | 500 | The number of samples to use from the evaluation dataset (use -1 to specify all samples) |

Use the following command to evaluate the model:

    !tao question_answering evaluate \
        -e /specs/nlp/question_answering/evaluate.yaml \
        data_dir=PATH_TO_DATA

Required Arguments for Evaluation

  • -e: The experiment specification file to set up evaluation

  • data_dir: The path to the pre-processed data to run evaluation on

Evaluation Procedure

After the previously trained model is initialized, it is evaluated against the provided test set. For extractive QA models, which return an answer span, accuracy is measured with two metrics: the exact match (EM) and the F1 score of the returned answer spans compared to the correct answers. The overall EM and F1 scores of a model are computed by averaging the individual example scores.

  • Exact match: If the answer span is exactly equal to the correct one, it returns 1; otherwise, it returns 0. When assessing against a negative example (SQuAD 2.0), if the model predicts any text at all, it automatically receives a 0 for that example.

  • F1: The F1 score is a common metric for classification problems and is widely used in QA. It is appropriate when precision and recall matter equally. Here, it is computed over the individual words in the prediction against those in the ground-truth answer. The number of shared words between the prediction and the ground truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth. F1 = 2 * (precision * recall) / (precision + recall)
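
The two metrics described above can be written down directly. The following is a simplified sketch rather than the evaluation code used by TAO Toolkit: it only lowercases the text and splits on whitespace (official SQuAD scoring also strips punctuation and articles), and taking the best score over several reference answers is an assumption based on the usual SQuAD convention. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def exact_match(prediction: str, truth: str) -> int:
    """1 if the predicted span equals the ground truth (ignoring case), else 0."""
    return int(prediction.strip().lower() == truth.strip().lower())


def f1_score(prediction: str, truth: str) -> float:
    """Word-level F1 between the predicted span and the ground truth."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    if not pred_tokens or not truth_tokens:
        # Negative example: full credit only if both are empty.
        return float(pred_tokens == truth_tokens)
    shared = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_tokens)
    recall = shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def score_example(prediction: str, truths: List[str]) -> Tuple[int, float]:
    """Best EM and F1 over all reference answers; no references means unanswerable."""
    candidates = truths or [""]
    em = max(exact_match(prediction, t) for t in candidates)
    f1 = max(f1_score(prediction, t) for t in candidates)
    return em, f1


# Two of the five predicted words are shared with the two-word ground truth:
# precision = 2/5, recall = 2/2.
print(f1_score("Levi's Stadium in Santa Clara", "Levi's Stadium"))
```

The overall EM and F1 of a model would then be the averages of these per-example values over the whole test set.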

    Testing: 100%|███████████████████████████████| 383/383 [01:36<00:00, 3.77it/s][NeMo I 2021-01-29 10:26:40 qa_model:175] test exact match 50.11370336056599
    [NeMo I 2021-01-29 10:26:40 qa_model:176] test f1 50.11370336056599
    Testing: 100%|████████████████████████████████| 383/383 [02:07<00:00, 3.01it/s]
    --------------------------------------------------------------------------------
    DATALOADER:0 TEST RESULTS
    {'test_exact_match': 50.11370336056599,
     'test_f1': 50.11370336056599,
     'test_loss': tensor(1.1229, device='cuda:0')}


The following is an example of the spec file for inference:

    # Name of file containing data used as inputs during the inference.
    input_file: ??? # e.g. squad/v1.1/dev-v1.1.json

    # Name of output nbest list file to store predictions to
    output_nbest_file: nbest.txt

    # Name of output file to store predictions to
    output_prediction_file: prediction.txt

| Parameter | Data Type | Default | Description |
|---|---|---|---|
| input_file | string | | The file containing the data used as inputs during the inference |
| output_nbest_file | string | nbest.txt | The name of the output nbest list file to store predictions in |
| output_prediction_file | string | prediction.txt | The name of the output file to store predictions in |

The following example shows how to run inference:

    !tao question_answering infer \
        -e /specs/nlp/question_answering/infer.yaml \
        -m trained-model.tlt

Required Arguments for Inference

  • -e: The experiment specification file to set up inference. This requires the input_file with the examples to run inference on.

  • -m: The path to the pre-trained model checkpoint from which to infer. The value should be a .tlt file.

Inference Procedure

After the trained model is loaded, it runs on the input file, which is in the same format as the files used for training and evaluation. It creates a prediction.txt output file with the predicted span for each question in the input file.

The following is an example of the spec file for model export:

    # Name of the .tlt EFF archive to be loaded/model to be exported.
    restore_from: trained-model.tlt

    # Set export format: ONNX | RIVA
    export_format: ONNX

    # Output EFF archive containing ONNX.
    export_to: exported-model.eonnx

| Parameter | Data Type | Default | Description |
|---|---|---|---|
| restore_from | string | trained-model.tlt | The path to the pre-trained model |
| export_format | string | ONNX | The export format (either “ONNX” or “RIVA”) |
| export_to | string | exported-model.eonnx | The path to the exported model |

To export a pre-trained model, run the following:

    ### For export to ONNX
    !tao question_answering export \
        -e /specs/nlp/question_answering/export.yaml \
        -m finetuned-model.tlt \
        -k $KEY

Required Arguments for Export

  • -e: The experiment specification file to set up export.

  • -m: The path to the pre-trained model checkpoint to export. The file should have a .tlt extension.

  • -k: The encryption key

You can use the Riva framework to deploy the trained model at runtime. For more details, refer to the Riva documentation.

© Copyright 2023, NVIDIA. Last updated on Sep 5, 2023.