Checkpoint Conversion

In conf/config.yaml, the conversion configuration should be set to the pathname of the config file to be used for conversion. For T5 models this should be set to t5/convert_t5 which corresponds to conf/conversion/t5/convert_t5.yaml.

The conversion value must be included in stages to run the conversion pipeline.

To specify the input checkpoint to be used for conversion for T5 models, use the model configuration in conf/conversion/t5/convert_t5.yaml:

Copy
Copied!
            

model: model_type: t5 # gpt or t5 checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt) hparams_file: ${conversion.run.train_dir}/results/hparams.yaml tensor_model_parallel_size: 1 # 1 for 220m, 2 for 3b pipeline_model_parallel_size: 1 model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}} vocab_file: ${data_dir}/bpe/vocab.txt merge_file: null

To specify the output location and file name of the converted .nemo file for T5 models, use the run configuration in conf/conversion/t5/convert_t5.yaml:

Copy
Copied!
            

run: name: convert_${conversion.run.model_train_name} nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node time_limit: "2:00:00" ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}} convert_name: convert_nemo model_train_name: t5_220m train_dir: ${base_results_dir}/${.model_train_name} results_dir: ${.train_dir}/${.convert_name} output_path: ${.train_dir}/${.convert_name} nemo_file_name: megatron_t5.nemo # name of nemo checkpoint; must be .nemo file

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

Copy
Copied!
            

partition: null account: null exclusive: True gpus_per_task: null gpus_per_node: 8 mem: 0 overcommit: False job_name_prefix: "nemo-megatron-"

Example

To run only the conversion pipeline and not the data preparation, training, evaluation or inference pipelines set the conf/config.yaml file to:

Copy
Copied!
            

stages: - conversion

Then enter:

Copy
Copied!
            

python3 main.py

In order to run the conversion script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The conversion script must be launched in a multi-node job.

To run the conversion pipeline to convert a T5 220M checkpoint stored in /mount/results/t5_220m/results/checkpoints, enter:

Copy
Copied!
            

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py conversion=convert_t5 \ stages=<conversion> \ cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \ base_results_dir=/mount/results conversion.model.vocab_file=/mount/data/bpe/vocab.txt \ conversion.run.model_train_name=t5_220m conversion.run.results_dir=/mount/results/t5_220m/results/convert_nemo \ conversion.model.checkpoint_folder=/mount/results/t5_220m/checkpoints \ conversion.model.tensor_model_parallel_size=1 conversion.model.pipeline_model_parallel_size=1 \ >> /results/convert_t5_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/data_gpt3_log.txt, which you can download from NGC. You may add any other configuration required to modify the command’s behavior.

Previous Changing Embedding Type
Next Model Evaluation
© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.