Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Checkpoint Conversion#
NVIDIA provides a simple tool to convert checkpoints from the .ckpt
format to the .nemo format. The resulting .nemo checkpoint is used later for evaluation (for T5 models) and for inference.
You specify the configuration to be used for checkpoint conversion by setting the conversion configuration in conf/config.yaml
to the path of the desired conversion configuration file. For mT5 models the conversion configuration must be set to mt5/convert_mt5,
which selects the file conf/conversion/mt5/convert_mt5.yaml.
You must also include conversion in stages to run the conversion
pipeline, as shown in the sketch below.
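For example, the relevant entries of conf/config.yaml might look like the following minimal sketch (the actual file contains additional defaults and stage entries, which are left unchanged):

defaults:
  - conversion: mt5/convert_mt5  # selects conf/conversion/mt5/convert_mt5.yaml

stages:
  - conversion  # run the conversion pipeline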
Common#
To specify the input checkpoint to be used for conversion of mT5
models, set the model configuration in
conf/conversion/mt5/convert_mt5.yaml:
model:
  model_type: t5 # gpt or t5; use t5 for mT5 as well
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g., megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: null
  merge_file: null
  tokenizer_model: ${data_dir}/mc4/bpe/mt5_tokenizer.model
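Note that tensor_model_parallel_size and pipeline_model_parallel_size should match the parallelism used during training, so that the conversion job can locate all of the distributed checkpoint shards.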
To specify the output location and file name of the converted .nemo
file for mT5 models, set the run configuration in
conf/conversion/mt5/convert_mt5.yaml:
run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 GPUs per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: mt5_390m
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_mt5.nemo # name of nemo checkpoint; must be a .nemo file
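As a worked example of how these resolvers interact: a checkpoint trained with tensor_model_parallel_size: 2 and pipeline_model_parallel_size: 1 yields model_parallel_size = 2 * 1 = 2, so nodes resolves to divide_ceil(2, 8) = 1 and ntasks_per_node to divide_ceil(2, 1) = 2; that is, a single node running two tasks.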
Slurm#
You define the configuration for a Slurm cluster in conf/cluster/bcm.yaml:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
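In most cases, partition and account are the only values you need to fill in; they correspond to the Slurm --partition and --account options on your cluster.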
Example
To run only the conversion pipeline and not the data preparation,
training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- conversion
Then enter:
python3 main.py
Base Command Platform#
To run the conversion script on Base Command Platform, set the
cluster_type configuration in conf/config.yaml to bcp. You can
also override this configuration from the command line using Hydra. This script must be launched in a multi-node job.
To run the conversion pipeline to convert an mT5 390M checkpoint stored
in /mount/results/mt5_390m/results/checkpoints, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py conversion=mt5/convert_mt5 \
stages=[conversion] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts \
data_dir=/mount/data \
conversion.run.model_train_name=mt5_390m \
base_results_dir=/mount/results conversion.run.results_dir=/mount/results/mt5_390m/results/convert_nemo \
conversion.model.checkpoint_folder=/mount/results/mt5_390m/results/checkpoints \
conversion.model.tensor_model_parallel_size=1 conversion.model.pipeline_model_parallel_size=1 \
>> /results/convert_mt5_log.txt 2>&1
The command above assumes that you mounted the data workspace in /mount/data,
and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/convert_mt5_log.txt,
which you can download from NGC. You may add any other configuration required to modify the command's behavior.