Model Evaluation


During evaluation and inference, RETRO’s input is set up differently than during training. The model’s input consists of only two chunks: one for the prompt and one for the answer to be generated. Unlike training, these chunks do not have a fixed length of 64 tokens; instead, they match the length of the tokenized prompt. In addition, each prompt requires context neighbors, which correspond to the first chunk. These neighbors are passed through RETRO’s encoder to generate the text for the second chunk.
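
To make this layout concrete, the sketch below outlines how such an input could be assembled. It is illustrative only: tokenizer, retrieve_neighbors, and the returned dictionary are hypothetical stand-ins, not the framework’s actual API.

# Illustrative sketch of RETRO's evaluation-time input layout.
# tokenizer and retrieve_neighbors are hypothetical stand-ins,
# not the framework's actual API.
def build_eval_input(question, tokenizer, retrieve_neighbors):
    # Chunk 1 holds the tokenized prompt. Unlike training, the chunk
    # length is not fixed at 64 tokens; it matches the prompt length.
    prompt_tokens = tokenizer.encode(question)
    chunk_length = len(prompt_tokens)

    # The retrieved context neighbors correspond to the first chunk;
    # they are fed through RETRO's encoder during generation.
    neighbors = retrieve_neighbors(question)

    return {
        "tokens": prompt_tokens,       # chunk 1: the prompt
        "chunk_length": chunk_length,  # chunk 2 (the answer) uses the same length
        "neighbors": neighbors,        # conditioning for the encoder
    }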

For this zero-shot evaluation setup, RETRO is tested on the Natural Questions (NQ) and TriviaQA (TQA) test sets. Each test sample includes a question, context neighbors for that question, and ground-truth answers. As in inference, the question is placed in the first chunk, the retrieved neighbors correspond to the first chunk and are passed through the encoder, and the answer is generated in the second chunk.

The test set for evaluation is a .json file containing samples in the following format:

{ "question": "who got the first nobel prize in physics", "answers": [ "Wilhelm Conrad Rontgen" ], "ctxs": [ { "id": "628713", "title": "Nobel Prize in Physics", "text": "Nobel Prize in Physics The Nobel Prize in Physics () is a yearly award given by the Royal Swedish Academy of Sciences for those who have made the most outstanding contributions for ma> }, { "id": "284495", "title": "Nobel Prize", "text": "His son, George Paget Thomson, received the same prize in 1937 for showing that they also have the properties of waves. William Henry Bragg and his son, William Lawrence Bragg, shared> }, ... ] }

To run evaluation, first set the value of the stages variable in conf/config.yaml to “evaluation”. Next, define the configuration used for evaluation by setting the evaluation variable in conf/config.yaml to a specific evaluation config file path. For example, setting the evaluation variable to retro/evaluate_nq selects the configuration file conf/evaluation/retro/evaluate_nq.yaml.

The following sections describe the common configuration and the Slurm-specific instructions for running evaluation.

Common

Define the job-specific configuration by setting the run section in conf/evaluation/retro/evaluate_nq.yaml:

run:
  name: ${.eval_name}_${.model_train_name}
  time_limit: "4:00:00"
  dependency: "singleton"
  nodes: 1
  ntasks_per_node: 1
  eval_name: eval_nq  # nq: Natural Questions; tqa: TriviaQA
  model_train_name: retro_300m
  results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}
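
The ${.eval_name}-style entries are OmegaConf relative interpolations, which resolve against sibling keys in the same run block. The short Python sketch below (requires the omegaconf package) shows how they resolve:

from omegaconf import OmegaConf

# Reproduce the run block above to see how ${.key} interpolations
# resolve against sibling keys.
cfg = OmegaConf.create({
    "run": {
        "name": "${.eval_name}_${.model_train_name}",
        "eval_name": "eval_nq",
        "model_train_name": "retro_300m",
    }
})
print(cfg.run.name)  # -> eval_nq_retro_300m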

Then, set the paths to the model’s checkpoint and the test set. At the moment, this script only supports distributed checkpoints created in the training step.

checkpoint_dir: /path/to/checkpoint_dir
checkpoint_name: /checkpoint_name
qa_file_path: /path/to/qa_test.json
pred_file_path: /path/to/qa_prediction.txt
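
After the evaluation job finishes, pred_file_path holds the generated answers. The snippet below is a hedged post-processing sketch that computes exact match, assuming one prediction per line in the same order as the samples in qa_file_path and the usual answer normalization; verify these assumptions against your actual output format.

import json
import re
import string

def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace
    # (the usual NQ/TriviaQA exact-match normalization).
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

# Placeholder paths matching the config values above.
with open("/path/to/qa_test.json") as f:
    samples = json.load(f)
with open("/path/to/qa_prediction.txt") as f:
    preds = [line.strip() for line in f]

hits = sum(
    any(normalize(p) == normalize(a) for a in sample["answers"])
    for p, sample in zip(preds, samples)
)
print(f"Exact match: {hits / len(samples):.3f}")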

Slurm

Set the configuration for a Slurm cluster in conf/cluster/bcm.yaml:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
job_name_prefix: 'nemo-megatron-'
srun_args:
  - "--no-container-mount-home"

To run only the evaluation pipeline and exclude the data preparation, training, conversion, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - evaluation

Then, run the following command. All the configurations will be read from conf/config.yaml and conf/evaluation/retro/evaluate_nq.yaml:

python3 main.py
