Checkpoint Conversion

NVIDIA provides a simple tool to convert checkpoints from the .ckpt format to the .nemo format. The converted .nemo checkpoint is used later for evaluation (for T5 models) and for inference.

In conf/config.yaml, set the conversion configuration to the path of the conversion config file to be used. The default value is gpt3/convert_gpt3, which corresponds to conf/conversion/gpt3/convert_gpt3.yaml for GPT models.

The conversion value must be included in the stages configuration to run the conversion pipeline.
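
For reference, the relevant entries in conf/config.yaml look like the following (a minimal sketch; other defaults and stages are omitted):

defaults:
  - conversion: gpt3/convert_gpt3  # path of the conversion config, relative to conf/conversion

stages:
  - conversion  # include this value to run the conversion pipeline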

Common

To specify the input checkpoint to be used for conversion for GPT models, use the model configuration in conf/conversion/gpt3/convert_gpt3.yaml:

model:
  model_type: gpt # gpt or t5
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 2 # 1 for 126m, 2 for 5b, and 8 for 20b or larger models
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: ${data_dir}/bpe/vocab.json
  merge_file: ${data_dir}/bpe/merges.txt
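
Note that tensor_model_parallel_size and pipeline_model_parallel_size must match the parallelism used when the checkpoint was trained, since the conversion job loads one checkpoint shard per model-parallel rank.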

To specify the output location and filename of the converted .nemo file for GPT models, use the run configuration in conf/conversion/gpt3/convert_gpt3.yaml:

run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: gpt3_5b
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_gpt.nemo # name of nemo checkpoint; must be .nemo file
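
For example, with the default gpt3_5b values above, model_parallel_size resolves to 2 × 1 = 2, so nodes is divide_ceil(2, 8) = 1 and ntasks_per_node is divide_ceil(2, 1) = 2: the conversion runs as two tasks on one node, one per model-parallel rank.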

Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
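
For example, on a typical cluster the file might be filled in as follows (the partition and account names here are hypothetical placeholders; keep the remaining defaults unless your cluster requires otherwise):

partition: batch           # hypothetical Slurm partition name
account: my_account        # hypothetical Slurm account
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"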

Example

To run only the conversion pipeline and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - conversion

Then enter:

python3 main.py

Base Command Platform

To run the conversion script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. You can also override this configuration from the command line, using hydra. This script must be launched in a multi-node job.

To convert a 126M checkpoint stored in /mount/results/gpt3_126m/results/checkpoints, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=<conversion> \
  cluster_type=bcp \
  launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
  data_dir=/mount/data/the_pile_gpt3 \
  base_results_dir=/mount/results \
  conversion.run.model_train_name=gpt3_126m \
  conversion.model.vocab_file=/mount/data/bpe/vocab.json \
  conversion.model.merge_file=/mount/data/bpe/merges.txt \
  conversion.run.results_dir=/mount/results/gpt3_126m/convert_nemo \
  conversion.model.checkpoint_folder=/mount/results/gpt3_126m/results/checkpoints \
  conversion.model.tensor_model_parallel_size=1 \
  >> /results/convert_gpt3_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/convert_gpt3_log.txt, which you can download from NGC. You may add any other configuration required to modify the command's behavior.
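
If the job completes successfully, the converted checkpoint is written to the directory given by conversion.run.results_dir (here /mount/results/gpt3_126m/convert_nemo) with the file name set by conversion.run.nemo_file_name, megatron_gpt.nemo by default.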

Kubernetes

Set configuration for a Kubernetes cluster in the conf/cluster/k8s.yaml file:

pull_secret: null # Kubernetes secret for the container registry to pull private containers.
shm_size: 512Gi # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
nfs_server: null # Hostname or IP address for the NFS server where data is stored.
nfs_path: null # Path to store data in the NFS server.
ib_resource_name: "nvidia.com/hostdev" # Specify the resource name for IB devices according to kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters.
ib_count: "8" # Specify the number of IB devices to include per node in each pod.
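
For example, a filled-in configuration might look like this (the secret name, server address, and path are hypothetical placeholders):

pull_secret: my-registry-secret   # hypothetical registry secret name
shm_size: 512Gi
nfs_server: nfs.example.com       # hypothetical NFS server
nfs_path: /export/nemo            # hypothetical NFS export path
ib_resource_name: "nvidia.com/hostdev"
ib_count: "8"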

Example

Set the cluster and cluster_type settings to k8s in conf/config.yaml.

To run only the conversion pipeline and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - conversion

Then enter:

python3 main.py

This launches a Helm chart based on the conversion configuration, which spawns a pod to convert the specified model to the .nemo format. You can view the pod with kubectl get pods and read its log with kubectl logs <pod name>.

In conf/config.yaml, set the conversion configuration to the path of the config file to be used for conversion. For T5 models, set it to t5/convert_t5, which corresponds to conf/conversion/t5/convert_t5.yaml.

The conversion value must be included in stages to run the conversion pipeline.
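
The corresponding conf/config.yaml entries look like the following (a minimal sketch; other defaults and stages are omitted):

defaults:
  - conversion: t5/convert_t5

stages:
  - conversion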

Common

To specify the input checkpoint to be used for conversion for T5 models, use the model configuration in conf/conversion/t5/convert_t5.yaml:

model:
  model_type: t5 # gpt or t5
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 1 # 1 for 220m, 2 for 3b
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: ${data_dir}/bpe/vocab.txt
  merge_file: null
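
Unlike GPT, the T5 models here use a single plain-text vocabulary file (vocab.txt) rather than a vocab.json/merges.txt BPE pair, which is why merge_file is set to null.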

To specify the output location and file name of the converted .nemo file for T5 models, use the run configuration in conf/conversion/t5/convert_t5.yaml:

run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: t5_220m
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_t5.nemo # name of nemo checkpoint; must be .nemo file
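
With the default t5_220m values above, model_parallel_size resolves to 1 × 1 = 1, so the conversion runs as a single task on a single node.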

Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example

To run only the conversion pipeline and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - conversion

Then enter:

python3 main.py

Base Command Platform

To run the conversion script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. You can also override this configuration from the command line, using hydra. The conversion script must be launched in a multi-node job.

To run the conversion pipeline to convert a T5 220M checkpoint stored in /mount/results/t5_220m/results/checkpoints, enter:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py conversion=t5/convert_t5 \
  stages=<conversion> \
  cluster_type=bcp \
  launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
  data_dir=/mount/data/the_pile_t5 \
  base_results_dir=/mount/results \
  conversion.model.vocab_file=/mount/data/bpe/vocab.txt \
  conversion.run.model_train_name=t5_220m \
  conversion.run.results_dir=/mount/results/t5_220m/results/convert_nemo \
  conversion.model.checkpoint_folder=/mount/results/t5_220m/results/checkpoints \
  conversion.model.tensor_model_parallel_size=1 \
  conversion.model.pipeline_model_parallel_size=1 \
  >> /results/convert_t5_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/convert_t5_log.txt, which you can download from NGC. You may add any other configuration required to modify the command's behavior.

You specify the configuration used for checkpoint conversion by setting the conversion configuration in conf/config.yaml to the path of the conversion config file. For mT5 models, set it to mt5/convert_mt5, which corresponds to conf/conversion/mt5/convert_mt5.yaml.

You must include conversion in stages to run the conversion pipeline.

Common

To specify the input checkpoint to be used for conversion for mT5 models, set the model configuration in conf/conversion/mt5/convert_mt5.yaml:

model:
  model_type: t5 # gpt or t5, use t5 for mt5 as well
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: null
  merge_file: null
  tokenizer_model: ${data_dir}/mc4/bpe/mt5_tokenizer.model
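
mT5 models use a SentencePiece tokenizer, so vocab_file and merge_file are both null and tokenizer_model instead points to the tokenizer .model file produced during data preparation.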

To specify the output location and file name of the converted .nemo file for mT5 models, set the run configuration in conf/conversion/mt5/convert_mt5.yaml:

run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: mt5_390m
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_mt5.nemo # name of nemo checkpoint; must be .nemo file

Slurm

You define the configuration for a Slurm cluster in conf/cluster/bcm.yaml:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example

To run only the conversion pipeline and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - conversion

Then enter:

python3 main.py

Base Command Platform

To run the conversion script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. You can also override this configuration from the command line, using hydra. This script must be launched in a multi-node job.

To run the conversion pipeline to convert an mT5 390M checkpoint stored in /mount/results/mt5_390m/results/checkpoints, enter:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py conversion=mt5/convert_mt5 \
  stages=<conversion> \
  cluster_type=bcp \
  launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
  data_dir=/mount/data \
  conversion.run.model_train_name=mt5_390m \
  base_results_dir=/mount/results \
  conversion.run.results_dir=/mount/results/mt5_390m/results/convert_nemo \
  conversion.model.checkpoint_folder=/mount/results/mt5_390m/results/checkpoints \
  conversion.model.tensor_model_parallel_size=1 \
  conversion.model.pipeline_model_parallel_size=1 \
  >> /results/convert_mt5_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/convert_mt5_log.txt, which you can download from NGC. You may add any other configuration required to modify the command's behavior.
