Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Model Inferencing
During model inferencing, the input for RETRO is set up differently than during training. Specifically, the model’s input consists of only two chunks: one for the prompt and one for the answer to be generated. Unlike in training, these chunks don’t necessarily have a fixed length of 64 tokens; instead, they match the length of the tokenized prompt. Additionally, each prompt requires context neighbors, which correspond to the first chunk. These neighbors are passed through RETRO’s encoder to generate text for the second chunk.
Base Model Inferencing
When inferencing, the first step is to set the value of the stages variable in conf/config.yaml to “fw_inference”.

Next, define the configuration used for inferencing by setting the fw_inference variable in conf/config.yaml to a specific inference config file path. For example, setting the fw_inference variable to retro/retro_inference specifies conf/fw_inference/retro/retro_inference.yaml as the configuration file (see the sketch below).
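For reference, the relevant entries in conf/config.yaml would look roughly like the following sketch, assuming the launcher’s usual Hydra defaults-list layout (other entries omitted):

defaults:
  - fw_inference: retro/retro_inference

stages:
  - fw_inference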
The following sections describe the common and cluster-specific instructions for running inference on a Slurm cluster.
Common
Set the run configurations in conf/fw_inference/retro/retro_inference.yaml to define the job-specific configuration:
run:
  name: ${.eval_name}_${.model_train_name}
  time_limit: "4:00:00"
  dependency: "singleton"
  nodes: 1
  ntasks_per_node: 1
  eval_name: retro_inference
  model_train_name: retro_300m
  results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}
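With these values, the relative interpolations ${.eval_name} and ${.model_train_name} resolve against sibling keys of the run section, so the job name and results directory become:

name: retro_inference_retro_300m
results_dir: ${base_results_dir}/retro_300m/retro_inference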
Then, set the path to the model’s checkpoint and the prompt used for generation. Currently, the inference pipeline only supports a batch size of 1 and only supports distributed checkpoints created during the training step.
checkpoint_dir: /path/to/checkpoint_dir
checkpoint_name: /checkpoint_name
prompt: "sample prompt"
neighbors:
  - "sample neighbor 1"
  - "sample neighbor 2"
Slurm
Set the configuration for a Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
job_name_prefix: 'nemo-megatron-'
srun_args:
  - "--no-container-mount-home"
To run only the inference pipeline and exclude the data preparation, training, conversion, and evaluation pipelines, set the stages section of conf/config.yaml to:
stages:
  - fw_inference
Then, run the following command. All the configurations will be read from conf/config.yaml and conf/fw_inference/retro/retro_inference.yaml:
python3 main.py
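Alternatively, the stage selection and inference config can be supplied as Hydra command-line overrides instead of being edited in conf/config.yaml (a sketch assuming the standard Hydra override syntax used by the launcher):

python3 main.py stages=[fw_inference] fw_inference=retro/retro_inference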