Question Answering Model

In the Question Answering (or Reading Comprehension) task, the model is given a question and a passage of content (the context) that may contain an answer to the question, and it predicts the span within the context, defined by a start and an end position, that answers the question. For datasets like SQuAD 2.0, this model also supports the case where the context does not contain an answer.

For every word in the context of a given question, the model is trained to predict:

  • The likelihood this word is the start of the span

  • The likelihood this word is the end of the span

The model chooses the start and end words with the highest probabilities. When the context does not contain the answer, both the start and end positions should point to the first token.

A pretrained BERT encoder with two span prediction heads is used to predict the start and end positions of the answer. The span prediction heads are token classifiers, each consisting of a single linear layer.
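
The following is a minimal sketch of how such a span prediction head can be implemented on top of the BERT encoder output; the class and variable names are illustrative and do not correspond to the actual NeMo implementation.

import torch
from torch import nn

class SpanPredictionHead(nn.Module):
    """Illustrative span head: a single linear layer that maps each token's
    hidden state to a start logit and an end logit."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)  # 2 outputs per token: (start, end)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: [batch, seq_len, hidden_size] from the BERT encoder
        logits = self.qa_outputs(hidden_states)            # [batch, seq_len, 2]
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

# Pick the most probable start and end positions for the answer span.
# (When both point to the first token, the example is treated as unanswerable.)
hidden = torch.randn(1, 384, 768)  # dummy encoder output
start_logits, end_logits = SpanPredictionHead(768)(hidden)
start, end = start_logits.argmax(dim=-1), end_logits.argmax(dim=-1)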

Quick Start

from nemo.collections.nlp.models import QAModel

# to get the list of pre-trained models
QAModel.list_available_models()

# Download and load the pre-trained BERT-based model
model = QAModel.from_pretrained("qa_squadv1.1_bertbase")

# try the model on a few examples
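# "test_file" is assumed to be the path to a JSON file in SQuAD format (see Data Format below)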
model.inference(test_file)

Note

We recommend you try this model in a Jupyter notebook (it can run on Google’s Colab): NeMo/tutorials/nlp/Question_Answering_Squad.ipynb.

Connect to an instance with a GPU (Runtime -> Change runtime type -> select “GPU” for hardware accelerator)

An example script on how to train and evaluate the model can be found here: NeMo/examples/nlp/question_answering/question_answering_squad.py.

The default configuration file for the model can be found at: NeMo/examples/nlp/question_answering/conf/question_answering_squad.yaml.

Data Format

This model expects the dataset in SQuAD format (i.e., a JSON file for each dataset split). The code snippet below shows an example of the training file. Each title has one or more paragraph entries, each consisting of a “context” and question-answer entries. Each question-answer entry has:

  • A question

  • A globally unique id

  • The Boolean flag “is_impossible”, which shows whether a question is answerable or not

  • (if the question is answerable) One answer entry containing the text span and its starting character index in the context.

  • (if the question is not answerable) An empty “answers” list

The evaluation files (for validation and testing) follow the above format, except that they can provide more than one answer for the same question. The inference file also follows the above format, except that it does not require the “answers” and “is_impossible” keys.

The following is an example of the data format (JSON file):

{
    "data": [
        {
            "title": "Super_Bowl_50",
            "paragraphs": [
                {
                    "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
                    "qas": [
                        {
                            "question": "Where did Super Bowl 50 take place?",
                            "is_impossible": "false",
                            "id": "56be4db0acb8001400a502ee",
                            "answers": [
                                {
                                    "answer_start": "403",
                                    "text": "Santa Clara, California"
                                }
                            ]
                        },
                        {
                            "question": "What was the winning score of the Super Bowl 50?",
                            "is_impossible": "true",
                            "id": "56be4db0acb8001400a502ez",
                            "answers": [
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

Dataset Download

To train the QA model on the SQuAD dataset, first download it from the official SQuAD website or run:

python get_squad.py

There are two versions: SQuAD version 1.1, which does not contain unanswerable questions and has 100,000+ question-answer pairs on 500+ articles, and the newer SQuAD version 2.0, which combines the 100,000 questions from SQuAD 1.1 with over 50,000 unanswerable questions. To do well on SQuAD 2.0, a system must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

After downloading the files, you should have a squad data folder that contains the following four files for training and evaluation:

.
|--squad
     |-- v1.1/train-v1.1.json
     |-- v1.1/dev-v1.1.json
     |-- v2.0/train-v2.0.json
     |-- v2.0/dev-v2.0.json
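
As a quick sanity check of the downloaded data, the files can be inspected with a few lines of Python (the path below assumes the layout above):

import json

# Load one of the downloaded splits and count articles and questions.
with open("squad/v1.1/dev-v1.1.json") as f:
    squad_data = json.load(f)["data"]

num_questions = sum(len(paragraph["qas"])
                    for article in squad_data
                    for paragraph in article["paragraphs"])
print(f"articles: {len(squad_data)}, questions: {num_questions}")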

Model Training

In the Question Answering Model, we are training a span prediction head on top of a pre-trained language model, such as BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [NLP-QA1]. Unless the user provides a pre-trained checkpoint for the language model, the language model is initialized with the pre-trained model from HuggingFace Transformers.

An example model configuration file for training can be found at: NeMo/examples/nlp/question_answering/conf/question_answering_squad_config.yaml.

The specification can be roughly grouped into three categories:

  • Parameters that describe the training process: trainer

  • Parameters that describe the datasets: model.dataset, model.train_ds, model.validation_ds, model.test_ds

  • Parameters that describe the model: model

More details about the parameters in the spec file can be found below:

Parameter                             | Data Type | Description
--------------------------------------|-----------|-----------------------------------------------------------------------------------
pretrained_model                      | string    | Pretrained QAModel from list_available_models() or path to a .nemo file
do_training                           | bool      | If true, starts training; otherwise, skips training and continues with evaluation/inference
model.dataset.version_2_with_negative | bool      | Set to true to allow examples without an answer, e.g., for SQuAD 2.0
model.dataset.do_lower_case           | bool      | If true, converts text to lower case; only important for inference/evaluation
model.dataset.use_cache               | bool      | If true, loads preprocessed data from cache when available, otherwise saves preprocessed data for future use
training_ds.file                      | string    | The training file path
training_ds.num_samples               | integer   | The number of samples to use from the training dataset (use -1 to specify all samples)
validation_ds.file                    | string    | The validation file path
validation_ds.num_samples             | integer   | The number of samples to use from the validation dataset (use -1 to specify all samples)
test_ds.file                          | string    | The test file path (optional)
test_ds.num_samples                   | integer   | The number of samples to use from the test dataset (use -1 to specify all samples)

Example of the command for training the model:

python question_answering_squad.py \
       model.train_ds.file=<PATH_TO_TRAIN_DATA_FILE>  \
       model.validation_ds.file=<PATH_TO_VALIDATION_DATA_FILE>  \
       model.dataset.version_2_with_negative=<ALLOW_UNANSWERABLE_SAMPLES>  \
       model.dataset.do_lower_case=<DO_LOWER_CASE> \
       trainer.max_epochs=<NUM_EPOCHS> \
       trainer.gpus=[<CHANGE_TO_GPU(s)_YOU_WANT_TO_USE>]

Note

The first time training is run, it will take an extra 5-10 minutes to process the dataset. Subsequent runs reuse the processed dataset if model.dataset.use_cache=true; the cache files are automatically stored in the same directory as the data.
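
For reference, training can also be launched programmatically instead of via the example script. The sketch below assumes the example configuration file and standard NeMo/PyTorch Lightning APIs; the paths and hyperparameters are placeholders.

import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.nlp.models import QAModel

# Load the example config and point it at the SQuAD files (placeholder paths).
cfg = OmegaConf.load("conf/question_answering_squad_config.yaml")
cfg.model.train_ds.file = "squad/v1.1/train-v1.1.json"
cfg.model.validation_ds.file = "squad/v1.1/dev-v1.1.json"
cfg.model.dataset.version_2_with_negative = False

trainer = pl.Trainer(gpus=1, max_epochs=2)
model = QAModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)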

Required Arguments for Training

  • model.train_ds.file: Path to the training file in JSON format.

  • model.validation_ds.file: Path to the validation file in JSON format.

Fine-tuning Procedure

The fine-tuning procedure and logs look similar to those described in the Model Training section, except that the model is initially loaded from a previously trained checkpoint, e.g., by specifying pretrained_model=<PRETRAINED_MODEL_NAME>, as in the command sketched below.
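
For example, a fine-tuning run could be launched with a command similar to the training command above, with the addition of the pretrained_model argument (all values are placeholders):

python question_answering_squad.py \
       pretrained_model=<PRETRAINED_MODEL_NAME> \
       model.train_ds.file=<PATH_TO_NEW_TRAIN_DATA_FILE> \
       model.validation_ds.file=<PATH_TO_NEW_VALIDATION_DATA_FILE> \
       trainer.max_epochs=<NUM_EPOCHS>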

Inference

An example script on how to run inference on a few examples can be found at examples/nlp/question_answering/question_answering_squad.py.

To run inference with the pre-trained model on a few examples, run:

python question_answering_squad.py \
       pretrained_model=<PRETRAINED_MODEL> \
       model.dataset.version_2_with_negative=<ALLOW_UNANSWERABLE_SAMPLES>  \
       model.dataset.do_lower_case=<DO_LOWER_CASE>  \
       do_training=false \
       model.validation_ds.file=<PATH_TO_INFERENCE_DATA_FILE>

Required Arguments for Inference:

  • pretrained_model: Pretrained QAModel from list_available_models() or path to a .nemo file.

Model Evaluation

An example script on how to evaluate the pre-trained model can be found at examples/nlp/question_answering/question_answering_squad.py.

To run evaluation of the pre-trained model, run:

python question_answering_squad.py \
       pretrained_model=<PRETRAINED_MODEL> \
       model.dataset.version_2_with_negative=<ALLOW_UNANSWERABLE_SAMPLES>  \
       model.dataset.do_lower_case=<DO_LOWER_CASE>  \
       do_training=false \
       model.test_ds.file=<PATH_TO_TEST_DATA_FILE>

Required Arguments:

  • pretrained_model: Pretrained QAModel from list_available_models() or path to a .nemo file.

  • model.test_ds.file: Path to test file.

During evaluation of the test_ds, the script generates the following metrics:

  • Exact Match (EM)

  • F1

More details about these metrics can be found here.
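
For illustration, the sketch below shows how these metrics are typically computed for SQuAD-style answers; it is a simplified version that skips the official normalization of articles and punctuation.

from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> bool:
    # Exact Match (EM): the normalized prediction equals the normalized reference.
    return prediction.strip().lower() == ground_truth.strip().lower()

def f1_score(prediction: str, ground_truth: str) -> float:
    # Token-level F1 between the predicted and reference answer spans.
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)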

References

NLP-QA1

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.