Model Inferencing

During model inferencing, RETRO's input is set up differently than during training. The model's input consists of only two chunks: one holding the prompt and one for the answer to be generated. Unlike training, these chunks do not have a fixed length of 64 tokens; instead, their length matches that of the tokenized prompt. In addition, each prompt requires context neighbors, which are associated with the first chunk. These neighbors are passed through RETRO's encoder, and the model attends to the encoded neighbors while generating the text of the second chunk.
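The sketch below illustrates this two-chunk layout. It is a rough Python illustration, not launcher code: the whitespace tokenizer and pad id are hypothetical stand-ins for the model's real tokenizer.

# Illustrative sketch only; not part of the launcher.
PAD_ID = 0

def tokenize(text):
    # Hypothetical stand-in for the model's real tokenizer.
    return [hash(word) % 50257 for word in text.split()]

def build_inference_input(prompt, neighbors):
    prompt_chunk = tokenize(prompt)          # chunk 1: the tokenized prompt
    chunk_length = len(prompt_chunk)         # chunk length follows the prompt,
                                             # not the fixed 64 tokens of training
    answer_chunk = [PAD_ID] * chunk_length   # chunk 2: filled in during generation
    # The neighbors belong to the first chunk; RETRO's encoder encodes them,
    # and generation of the second chunk attends to the encoded neighbors.
    neighbor_chunks = [tokenize(n) for n in neighbors]
    return prompt_chunk + answer_chunk, neighbor_chunks

tokens, neighbor_chunks = build_inference_input(
    "sample prompt", ["sample neighbor 1", "sample neighbor 2"])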

Base Model Inferencing

When running inference, the first step is to set the stages variable in conf/config.yaml to fw_inference. Next, select the configuration used for inferencing by setting the fw_inference variable in conf/config.yaml to a specific inference config file path. For example, setting fw_inference to retro/retro_inference selects the configuration file conf/fw_inference/retro/retro_inference.yaml.
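For reference, the relevant entries of conf/config.yaml would look roughly as follows. This is a minimal sketch that assumes the launcher's usual Hydra defaults list; all other entries are omitted:

defaults:
  - fw_inference: retro/retro_inference

stages:
  - fw_inference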

The following sections describe the common and cluster-specific instructions for running inference on a Slurm cluster.

Common

Set the run section of conf/fw_inference/retro/retro_inference.yaml to define the job-specific configuration:

run:
   name: ${.eval_name}_${.model_train_name}
   time_limit: "4:00:00"
   dependency: "singleton"
   nodes: 1
   ntasks_per_node: 1
   eval_name: retro_inference
   model_train_name: retro_300m
   results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}
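
With the values above, the leading dot in ${.eval_name} and ${.model_train_name} makes the interpolation relative to the run section, so name resolves to retro_inference_retro_300m and results_dir to ${base_results_dir}/retro_300m/retro_inference.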

Then set the path to the model's checkpoint and the prompt to generate from. Currently, the inference pipeline supports only a batch size of 1 and only distributed checkpoints created during the training step.

checkpoint_dir: /path/to/checkpoint_dir
checkpoint_name: /checkpoint_name
prompt: "sample prompt"
neighbors:
- "sample neighbor 1"
- "sample neighbor 2"

Slurm

Set the configuration for a Slurm cluster in conf/cluster/bcm.yaml:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
job_name_prefix: 'nemo-megatron-'
srun_args:
- "--no-container-mount-home"

To run only the inference pipeline and exclude the data preparation, training, and conversion pipelines, set the stages section of conf/config.yaml to:

stages:
  - fw_inference

Then, run the following command. All the configurations will be read from conf/config.yaml and conf/fw_inference/retro/retro_inference.yaml:

python3 main.py
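
Because the launcher is built on Hydra, the same settings can also be overridden on the command line instead of editing the YAML files. The following is a sketch using standard Hydra override syntax, assuming fw_inference is a config group in conf/config.yaml as sketched above:

python3 main.py \
    stages=[fw_inference] \
    fw_inference=retro/retro_inference \
    fw_inference.checkpoint_dir=/path/to/checkpoint_dir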