Model Evaluation
During evaluation and inference, RETRO's input is set up differently than during training. The model's input consists of only two chunks: one for the prompt and one for the answer to be generated. Unlike in training, these chunks do not have a fixed length of 64 tokens; instead, they match the length of the tokenized prompt. Each prompt also requires context neighbors, which correspond to the first chunk. These neighbors are passed through RETRO's encoder to guide generation of the text in the second chunk.
For this zero-shot evaluation setup, RETRO is tested on the Natural Questions (NQ) and TriviaQA (TQA) test sets. Each test sample includes a question, context neighbors for the question, and ground-truth answers. As in inference, the question is placed in the first chunk, its retrieved neighbors are passed through the encoder, and the answer is generated in the second chunk.
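Open-domain QA evaluations of this kind typically score predictions with an exact-match metric after light answer normalization. The following is a minimal sketch of such scoring (the helper names are hypothetical and not part of the RETRO evaluation scripts):

```python
import re
import string

def normalize_answer(s):
    # Lowercase, strip punctuation and articles, collapse whitespace --
    # the standard normalization used for NQ/TriviaQA-style exact match.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truths):
    # A prediction counts as correct if it matches any reference answer.
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gt) for gt in ground_truths)
```

The overall score is then the fraction of test questions whose generated answer exactly matches one of the ground-truth answers.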
The test set for evaluation is a .json file containing samples in the following format:
{
    "question": "who got the first nobel prize in physics",
    "answers": [
        "Wilhelm Conrad Rontgen"
    ],
    "ctxs": [
        {
            "id": "628713",
            "title": "Nobel Prize in Physics",
            "text": "Nobel Prize in Physics The Nobel Prize in Physics () is a yearly award given by the Royal Swedish Academy of Sciences for those who have made the most outstanding contributions for ma..."
        },
        {
            "id": "284495",
            "title": "Nobel Prize",
            "text": "His son, George Paget Thomson, received the same prize in 1937 for showing that they also have the properties of waves. William Henry Bragg and his son, William Lawrence Bragg, shared..."
        },
        ...
    ]
}
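Before launching evaluation, it can be useful to sanity-check a test file by loading it and verifying that each sample carries the expected fields. This is a generic Python sketch, not part of the launcher (the function names are hypothetical):

```python
import json

def load_qa_samples(path):
    # Load the evaluation test set from a .json file.
    with open(path) as f:
        return json.load(f)

def validate_qa_samples(samples):
    # Every sample must carry a question, at least one ground-truth
    # answer, and a non-empty list of retrieved context neighbors.
    for i, s in enumerate(samples):
        assert s["question"], f"sample {i}: empty question"
        assert s["answers"], f"sample {i}: no ground-truth answers"
        assert s["ctxs"], f"sample {i}: no context neighbors"
    return len(samples)
```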
Base Model Evaluation
When running evaluation, the first step is to set the value of the stages variable in conf/config.yaml to "evaluation".
Next, define the configuration used for evaluation by setting the evaluation variable in conf/config.yaml to a specific evaluation config file path. For example, setting the evaluation variable to retro/evaluate_nq selects the configuration file conf/evaluation/retro/evaluate_nq.yaml.
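Taken together, the relevant entries in conf/config.yaml would look like the following fragment (other keys omitted):

```yaml
stages:
  - evaluation
evaluation: retro/evaluate_nq   # nq: Natural Questions; use retro/evaluate_tqa for TriviaQA
```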
The following sections describe the common and specific instructions for running evaluation on a Slurm cluster.
Common
Set the run configurations in conf/evaluation/retro/evaluate_nq.yaml to define the job-specific configuration:
run:
  name: ${.eval_name}_${.model_train_name}
  time_limit: "4:00:00"
  dependency: "singleton"
  nodes: 1
  ntasks_per_node: 1
  eval_name: eval_nq  # nq: Natural Questions; tqa: TriviaQA
  model_train_name: retro_300m
  results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}
Then, set the paths to the model's checkpoint and the test set. At the moment, this script supports only distributed checkpoints created in the training step.
checkpoint_dir: /path/to/checkpoint_dir
checkpoint_name: /checkpoint_name
qa_file_path: /path/to/qa_test.json
pred_file_path: /path/to/qa_prediction.txt
Slurm
Set the configuration for a Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
job_name_prefix: 'nemo-megatron-'
srun_args:
- "--no-container-mount-home"
To run only the evaluation pipeline and exclude the data preparation, training, conversion, and inference pipelines, set the stages section of conf/config.yaml to:
stages:
- evaluation
Then, run the following command. All configurations will be read from conf/config.yaml and conf/evaluation/retro/evaluate_nq.yaml:
python3 main.py
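Because the launcher is Hydra-based, the same values can also be overridden on the command line instead of editing the YAML files. A sketch, assuming standard Hydra override syntax and hypothetical cluster values:

```shell
# Select the evaluation stage and config, and override Slurm settings,
# without editing conf/config.yaml or conf/cluster/bcm.yaml.
python3 main.py stages=[evaluation] evaluation=retro/evaluate_nq \
    cluster.partition=batch cluster.account=my_account
```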