Model Inferencing


In the context of model inferencing, the input for RETRO is set up differently than during training. Specifically, the model’s input consists of only two chunks: one for the prompt and one for the answer to be generated. Unlike training, these chunks do not have a fixed length of 64 tokens; instead, their length matches that of the tokenized prompt. Additionally, each prompt requires context neighbors, which correspond to the first chunk. These neighbors are passed through RETRO’s encoder to condition the text generated for the second chunk.
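The sketch below illustrates this input layout in plain Python. The class and field names are hypothetical illustrations only and are not part of the launcher’s API:

from dataclasses import dataclass
from typing import List

# Hypothetical illustration of the two-chunk inference input described above.
@dataclass
class RetroInferenceInput:
    prompt: str            # first chunk: the prompt to condition on
    neighbors: List[str]   # retrieved context neighbors for the first chunk
    # The second chunk holds the generated answer and starts empty. At
    # inference time the chunk length matches the tokenized prompt rather
    # than the fixed 64 tokens used during training.

request = RetroInferenceInput(
    prompt="The highest mountain in the world is",
    neighbors=[
        "Mount Everest is Earth's highest mountain above sea level.",
        "The summit of Everest is 8,849 metres above sea level.",
    ],
)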

When inferencing, the first step is to set the stages variable in conf/config.yaml to “fw_inference”. Next, define the configuration used for inferencing by pointing the fw_inference variable in conf/config.yaml at a specific inference config file. For example, setting it to retro/retro_inference selects the configuration file conf/fw_inference/retro/retro_inference.yaml.
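For reference, the relevant portion of conf/config.yaml would then look like the sketch below. The exact layout of the surrounding defaults list varies by launcher version, so treat this as an assumption rather than the literal file contents:

defaults:
  - fw_inference: retro/retro_inference

stages:
  - fw_inference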

The following sections describe the common and specific instructions for running inference on a Slurm cluster.

Common

Set the run configurations in conf/fw_inference/retro/retro_inference.yaml to define the job-specific configuration:

run:
  name: ${.eval_name}_${.model_train_name}
  time_limit: "4:00:00"
  dependency: "singleton"
  nodes: 1
  ntasks_per_node: 1
  eval_name: retro_inference
  model_train_name: retro_300m
  results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}
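The ${.eval_name} and ${.model_train_name} references are OmegaConf relative interpolations: the leading dot resolves them against the enclosing run block, so name expands to retro_inference_retro_300m.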

Then, set the path to the model’s checkpoint and the prompt to generate from. Currently, the inference pipeline supports only a batch size of 1, and only distributed checkpoints created in the training step.

checkpoint_dir: /path/to/checkpoint_dir
checkpoint_name: /checkpoint_name
prompt: "sampleprompt"
neighbors:
  - "sampleneighbor1"
  - "sampleneighbor2"
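For illustration, a filled-in configuration might look like the following. Every value here is a hypothetical placeholder; replace them with your own checkpoint paths, prompt, and retrieved neighbors:

checkpoint_dir: /results/retro_300m/checkpoints
checkpoint_name: megatron_retro
prompt: "The highest mountain in the world is"
neighbors:
  - "Mount Everest is Earth's highest mountain above sea level."
  - "The summit of Everest is 8,849 metres above sea level."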

Slurm

Set the configuration for a Slurm cluster in conf/cluster/bcm.yaml:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
job_name_prefix: 'nemo-megatron-'
srun_args:
  - "--no-container-mount-home"
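On most clusters, partition and account must at minimum be set to values valid for your Slurm allocation; the remaining fields can usually be left at the defaults shown above.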

To run only the inference pipeline and exclude the data preparation, training, and conversion pipelines, set the stages section of conf/config.yaml to:

stages:
  - fw_inference

Then, run the following command. All the configurations will be read from conf/config.yaml and conf/fw_inference/retro/retro_inference.yaml:

python3 main.py
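When the job completes, the generated text and logs are written under the results_dir defined in the run block above. Because the launcher reads its settings through Hydra, individual values can typically also be overridden on the command line instead of editing the YAML files. The override below is a sketch that assumes standard Hydra override syntax and that the inference settings live under the fw_inference config group:

python3 main.py fw_inference.prompt="The highest mountain in the world is"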
