Checkpoint Conversion

NVIDIA provides a simple tool to convert checkpoints from .ckpt format to .nemo format. The resulting .nemo checkpoint is used later for evaluation (of T5 models) and for inference.

You select the configuration to be used for checkpoint conversion by setting the conversion key in conf/config.yaml to the relative path of the conversion configuration file. For mT5 models, set conversion to mt5/convert_mt5, which points to conf/conversion/mt5/convert_mt5.yaml.
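For reference, this selection lives in the Hydra defaults list at the top of conf/config.yaml. A minimal sketch of the relevant entries (other stage entries elided, shown here only for orientation):

defaults:
  - _self_
  - cluster: bcm
  - conversion: mt5/convert_mt5
  # ... entries for other stages (data_preparation, training, evaluation, ...) elided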

You must also include conversion in the stages list of conf/config.yaml to run the conversion pipeline (see the example below).

To specify the input checkpoint to be used for conversion for mT5 models, set the model configuration in conf/conversion/mt5/convert_mt5.yaml:


model:
  model_type: t5 # gpt or t5; use t5 for mT5 models as well
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: null
  merge_file: null
  tokenizer_model: ${data_dir}/mc4/bpe/mt5_tokenizer.model
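As a concrete illustration of how model_parallel_size is resolved: the tensor and pipeline parallel sizes should generally match those used to produce the training checkpoint, and their product determines how many GPUs the conversion job needs. The values below are hypothetical, not part of the default configuration:

model:
  tensor_model_parallel_size: 2    # hypothetical: checkpoint trained with TP=2
  pipeline_model_parallel_size: 2  # hypothetical: checkpoint trained with PP=2
  # model_parallel_size resolves to multiply(2, 2) = 4,
  # i.e. the conversion job needs 4 GPUs in total.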

To specify the output location and file name for the converted .nemo file for mT5 models, set the run configuration in conf/conversion/mt5/convert_mt5.yaml:


run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 GPUs per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: mt5_390m
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_mt5.nemo # name of the nemo checkpoint; must be a .nemo file
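Continuing the hypothetical TP=2, PP=2 example above, the divide_ceil resolvers work out as follows:

# model_parallel_size = 4, with 8 GPUs per node:
#   nodes           = divide_ceil(4, 8) = 1   # one 8-GPU node suffices
#   ntasks_per_node = divide_ceil(4, 1) = 4   # one task per model-parallel rank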

You define the configuration for a Slurm cluster in conf/cluster/bcm.yaml:


partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
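For illustration only, a populated conf/cluster/bcm.yaml might look like the following; the partition and account names are placeholders for your site's values:

partition: batch           # placeholder Slurm partition name
account: my_slurm_account  # placeholder Slurm account name
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"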

Example

To run only the conversion pipeline and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:


stages:
  - conversion

Then enter:


python3 main.py
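The stages list can also chain multiple pipelines in a single launch. For example, to train and then convert in one run (a sketch, assuming both stages are configured in conf/config.yaml):

stages:
  - training
  - conversion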

To run the conversion script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. You can also override this configuration from the command line, using hydra. This script must be launched in a multi-node job.
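A minimal sketch of the relevant setting in conf/config.yaml (the same value can be passed as the Hydra command-line override cluster_type=bcp, as in the example below):

cluster_type: bcp  # bcm (Slurm) or bcp (Base Command Platform)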

To run the conversion pipeline to convert an mT5 390M checkpoint stored in /mount/results/mt5_390m/results/checkpoints, enter:


python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py \
    conversion=mt5/convert_mt5 \
    stages=[conversion] \
    cluster_type=bcp \
    launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
    data_dir=/mount/data \
    base_results_dir=/mount/results \
    conversion.run.model_train_name=mt5_390m \
    conversion.run.results_dir=/mount/results/mt5_390m/results/convert_nemo \
    conversion.model.checkpoint_folder=/mount/results/mt5_390m/results/checkpoints \
    conversion.model.tensor_model_parallel_size=1 \
    conversion.model.pipeline_model_parallel_size=1 \
    >> /results/convert_mt5_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/convert_mt5_log.txt, which you can download from NGC. You may add any other configuration required to modify the command's behavior.
