Checkpoint Conversion
NVIDIA provides a simple tool to convert checkpoints from `.ckpt` format to `.nemo` format. This tool is used later for evaluation (of T5 models) and for inference.
You specify the configuration to be used for checkpoint conversion by setting the `conversion` configuration in `conf/config.yaml` to the pathname of the conversion configuration file. For mT5 models the `conversion` configuration must be set to `mt5/convert_mt5`, which is stored in `conf/conversion/mt5/convert_mt5.yaml`.

You must include `conversion` in `stages` to run the conversion pipeline.
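Putting these two settings together, the relevant part of `conf/config.yaml` might look like the following. This is a minimal sketch assuming the launcher's Hydra defaults-list layout; other entries in the file are omitted:

```yaml
defaults:
  - conversion: mt5/convert_mt5   # selects conf/conversion/mt5/convert_mt5.yaml

stages:
  - conversion
```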
Common
To specify the input checkpoint to be used for conversion for mT5 models, set the `model` configuration in `conf/conversion/mt5/convert_mt5.yaml`:
```yaml
model:
  model_type: t5 # gpt or t5; use t5 for mT5 as well
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: null
  merge_file: null
  tokenizer_model: ${data_dir}/mc4/bpe/mt5_tokenizer.model
```
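The `model_parallel_size` field is computed by the launcher's `multiply` resolver from the tensor- and pipeline-parallel sizes. The arithmetic can be sketched in plain Python (the function name here is illustrative, not part of the launcher):

```python
def model_parallel_size(tensor_mp: int, pipeline_mp: int) -> int:
    """Total model-parallel size = tensor parallelism x pipeline parallelism."""
    return tensor_mp * pipeline_mp

# With the defaults above (1 and 1), a single GPU holds the whole model:
print(model_parallel_size(1, 1))   # 1
# A checkpoint trained with TP=2 and PP=4 would need 8 model-parallel ranks:
print(model_parallel_size(2, 4))   # 8
```

These two values must match the parallelism the checkpoint was trained with, otherwise the conversion cannot reassemble the sharded weights.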
To specify the output location and file name of the converted `.nemo` file for mT5 models, set the `run` configuration in `conf/conversion/mt5/convert_mt5.yaml`:
```yaml
run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 GPUs per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: mt5_390m
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_mt5.nemo # name of nemo checkpoint; must be a .nemo file
```
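The `nodes` and `ntasks_per_node` fields are computed with the launcher's `divide_ceil` resolver (ceiling division). A minimal Python sketch of that calculation, with an illustrative helper name:

```python
import math

def divide_ceil(a: int, b: int) -> int:
    """Ceiling division, mirroring the launcher's divide_ceil resolver."""
    return math.ceil(a / b)

# With model_parallel_size = 1 (the defaults above):
nodes = divide_ceil(1, 8)                  # 1 node (8 GPUs per node)
ntasks_per_node = divide_ceil(1, nodes)    # 1 task on that node
print(nodes, ntasks_per_node)              # 1 1

# A hypothetical model_parallel_size of 16 would need 2 nodes with 8 tasks each:
print(divide_ceil(16, 8), divide_ceil(16, divide_ceil(16, 8)))  # 2 8
```

In other words, the job requests just enough nodes to hold one model-parallel replica, and spreads the model-parallel ranks evenly across them.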
Slurm
You define the configuration for a Slurm cluster in `conf/cluster/bcm.yaml`:

```yaml
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
```
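For reference, a filled-in `conf/cluster/bcm.yaml` might look like the following; the partition and account names are hypothetical placeholders to be replaced with your site's values:

```yaml
partition: batch        # hypothetical partition name; use your cluster's
account: my_account     # hypothetical Slurm account
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
```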
Example
To run only the conversion pipeline, and not the data preparation, training, evaluation, or inference pipelines, set the `stages` section of `conf/config.yaml` to:

```yaml
stages:
  - conversion
```

Then enter:

```shell
python3 main.py
```
Base Command Platform
To run the conversion script on Base Command Platform, set the `cluster_type` configuration in `conf/config.yaml` to `bcp`. You can also override this configuration from the command line, using Hydra. This script must be launched in a multi-node job.
To run the conversion pipeline to convert an mT5 390M checkpoint stored in `/mount/results/mt5_390m/results/checkpoints`, enter:

```shell
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py conversion=mt5/convert_mt5 \
stages=[conversion] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts \
data_dir=/mount/data \
conversion.run.model_train_name=mt5_390m \
base_results_dir=/mount/results conversion.run.results_dir=/mount/results/mt5_390m/results/convert_nemo \
conversion.model.checkpoint_folder=/mount/results/mt5_390m/results/checkpoints \
conversion.model.tensor_model_parallel_size=1 conversion.model.pipeline_model_parallel_size=1 \
>> /results/convert_mt5_log.txt 2>&1
```
The command above assumes that you mounted the data workspace in `/mount/data` and the results workspace in `/mount/results`. `stdout` and `stderr` are redirected to the file `/results/convert_mt5_log.txt`, which you can download from NGC. You may add any other configuration required to modify the command's behavior.