Checkpoint Conversion

NVIDIA provides a simple tool to convert the checkpoints from .ckpt format to .nemo format. This tool will be used at a later time for evaluation (in T5 models) and for inference.

You specify the configuration to be used for checkpoint conversion by setting the conversion configuration in conf/config.yaml to the pathname of the conversion configuration file to be used. For mT5 models the conversion configuration must be set to mt5/convert_mt5. This value is stored in conf/conversion/mt5/convert_mt5.yaml.

You must include the conversion in stages to run the conversion pipeline.

Common

To specify the input checkpoint to be used for conversion for mT5 models, set the model configuration in conf/conversion/mt5/convert_mt5.yaml:

model:
  model_type: t5 # gpt or t5, use t5 for mt5 as well
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: null
  merge_file: null
  tokenizer_model: ${data_dir}/mc4/bpe/mt5_tokenizer.model

To specify the output location and file name for the converted .nemo file for mT5 models, set the run configuration in conf/conversion/convert_mt5.yaml:

run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: mt5_390m
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_mt5.nemo # name of nemo checkpoint; must be .nemo file

Slurm

You define the configuration for a Slurm cluster in conf/cluster/bcm.yaml:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example

To run only the conversion pipeline and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - conversion

Then enter:

python3 main.py

Base Command Platform

To run the conversion script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. You can also override this configuration from the command line, using hydra. This script must be launched in a multi-node job.

To run the conversion pipeline to convert a mT5 390M checkpoint stored in /mount/results/mt5_390m/results/checkpoints, enter:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py conversion=convert_mt5 \
stages=<conversion> \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts
data_dir=/mount/data \
conversion.run.model_train_name=mt5_390m \
base_results_dir=/mount/results conversion.run.results_dir=/mount/results/mt5_390m/results/convert_nemo \
conversion.model.checkpoint_folder=/mount/results/mt5_390m/checkpoints \
conversion.model.tensor_model_parallel_size=1 conversion.model.pipeline_model_parallel_size=1 \
>> /results/convert_mt5_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_gpt3_log.txt, which you can download from NGC. You may add any other configuration required to modify the command’s behavior.