Checkpoint Conversion

NVIDIA provides a simple tool to convert checkpoints from the .ckpt format to the .nemo format, which is used later for evaluation and inference.

To select the configuration used for checkpoint conversion, set the conversion configuration in conf/config.yaml to the path of the conversion configuration file. The default value is gpt3/convert_gpt3, which refers to conf/conversion/gpt3/convert_gpt3.yaml for GPT models.

To run the conversion pipeline, the stages configuration must include the conversion value.

Common

To specify the input checkpoint to convert for GPT models, use the model configuration in conf/conversion/gpt3/convert_gpt3.yaml:

model:
    model_type: gpt # gpt or t5
    checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
    checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
    hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
    tensor_model_parallel_size: 2 # 1 for 126m, 2 for 5b, and 8 for 20b or larger models
    pipeline_model_parallel_size: 1
    model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
    vocab_file: ${data_dir}/bpe/vocab.json
    merge_file: ${data_dir}/bpe/merges.txt
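
The checkpoint_name comment above describes two modes: latest, or a name pattern such as megatron_gpt-*last.ckpt. The following is a minimal sketch of how such a selection could work, not the launcher's actual implementation. resolve_checkpoint is a hypothetical helper, and the sketch assumes a flat folder of .ckpt files in which "latest" means the most recently modified file; it ignores any per-rank subdirectory layout a model-parallel training run may produce.

```python
import fnmatch
import os
import tempfile
from pathlib import Path


def resolve_checkpoint(checkpoint_folder, checkpoint_name="latest"):
    """Hypothetical helper: choose one .ckpt file from a flat folder."""
    folder = Path(checkpoint_folder)
    candidates = [p for p in folder.iterdir() if p.is_file() and p.suffix == ".ckpt"]
    if checkpoint_name != "latest":
        # Any value other than "latest" is treated as a shell-style name pattern.
        candidates = [p for p in candidates if fnmatch.fnmatch(p.name, checkpoint_name)]
    if not candidates:
        raise FileNotFoundError(f"no checkpoint matching {checkpoint_name!r} in {folder}")
    # Assumption: among the remaining candidates, the newest file wins.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # Demo with two empty placeholder files and explicit modification times.
    with tempfile.TemporaryDirectory() as tmp:
        names = ["megatron_gpt-step=100-last.ckpt", "megatron_gpt-step=200.ckpt"]
        for age, name in enumerate(names):
            path = Path(tmp) / name
            path.touch()
            os.utime(path, (1000 + age, 1000 + age))
        print(resolve_checkpoint(tmp).name)
        print(resolve_checkpoint(tmp, "megatron_gpt-*last.ckpt").name)
```

Passing a pattern narrows the candidates before the newest one is chosen, so a pattern that matches several files still yields a single checkpoint.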

To specify the output location and filename of the converted .nemo file for GPT models, use the run configuration in conf/conversion/gpt3/convert_gpt3.yaml:

run:
    name: convert_${conversion.run.model_train_name}
    nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
    time_limit: "2:00:00"
    ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
    convert_name: convert_nemo
    model_train_name: gpt3_5b
    train_dir: ${base_results_dir}/${.model_train_name}
    results_dir: ${.train_dir}/${.convert_name}
    output_path: ${.train_dir}/${.convert_name}
    nemo_file_name: megatron_gpt.nemo # name of nemo checkpoint; must be .nemo file
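
The interpolations in the two configurations above derive the job size and the output location from a few inputs. The sketch below works through that arithmetic in plain Python. It assumes the multiply and divide_ceil resolvers behave as their names suggest (product and ceiling division) and that the converted file is written to output_path joined with nemo_file_name; conversion_resources and nemo_output_file are hypothetical helper names, not part of the launcher.

```python
from pathlib import PurePosixPath


def divide_ceil(a, b):
    """Ceiling division on integers (assumed behavior of the divide_ceil resolver)."""
    return -(-a // b)


def conversion_resources(tensor_parallel, pipeline_parallel, gpus_per_node=8):
    """Mirror the model_parallel_size, nodes, and ntasks_per_node interpolations."""
    model_parallel_size = tensor_parallel * pipeline_parallel
    nodes = divide_ceil(model_parallel_size, gpus_per_node)
    ntasks_per_node = divide_ceil(model_parallel_size, nodes)
    return {
        "model_parallel_size": model_parallel_size,
        "nodes": nodes,
        "ntasks_per_node": ntasks_per_node,
    }


def nemo_output_file(base_results_dir, model_train_name,
                     convert_name="convert_nemo",
                     nemo_file_name="megatron_gpt.nemo"):
    """Mirror train_dir -> output_path, then append the file name (assumption)."""
    train_dir = PurePosixPath(base_results_dir) / model_train_name
    return train_dir / convert_name / nemo_file_name


if __name__ == "__main__":
    # Defaults from the configuration above: tensor parallel 2, pipeline parallel 1.
    print(conversion_resources(2, 1))
    # A larger hypothetical case that spans two 8-GPU nodes.
    print(conversion_resources(8, 2))
    print(nemo_output_file("/mount/results", "gpt3_5b"))
```

With the default values (tensor parallel size 2, pipeline parallel size 1), the job fits on a single node with two tasks.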

Slurm

Set the configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example

To run only the conversion pipeline and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - conversion

Then enter:

python3 main.py

Base Command Platform

To run the conversion script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. You can also override this configuration from the command line using Hydra. The script must be launched in a multi-node job.

To convert a 126M checkpoint stored in /mount/results/gpt3_126m/results/checkpoints, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[conversion] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results \
conversion.run.model_train_name=gpt3_126m \
conversion.model.vocab_file=/mount/data/bpe/vocab.json \
conversion.model.merge_file=/mount/data/bpe/merges.txt \
conversion.run.results_dir=/mount/results/gpt3_126m/convert_nemo \
conversion.model.checkpoint_folder=/mount/results/gpt3_126m/results/checkpoints \
conversion.model.tensor_model_parallel_size=1 \
>> /results/convert_gpt3_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/convert_gpt3_log.txt, which you can download from NGC. You may add any other configuration required to modify the command’s behavior.
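
When the same conversion has to be launched for several models, it can help to assemble the override list programmatically. The sketch below builds the argument vector for the command above from a dictionary. build_conversion_command is a hypothetical helper, and the sketch assumes that the stage list is passed in Hydra's list syntax as stages=[conversion]; the redirection to the log file is left to the calling shell.

```python
import shlex


def build_conversion_command(main_py, overrides):
    """Hypothetical helper: return argv for a conversion-only launcher run."""
    argv = ["python3", main_py, "stages=[conversion]"]
    # Each override becomes one key=value argument, as on the command line.
    argv.extend(f"{key}={value}" for key, value in overrides.items())
    return argv


overrides = {
    "cluster_type": "bcp",
    "launcher_scripts_path": "/opt/NeMo-Megatron-Launcher/launcher_scripts",
    "data_dir": "/mount/data/the_pile_gpt3",
    "base_results_dir": "/mount/results",
    "conversion.run.model_train_name": "gpt3_126m",
    "conversion.model.vocab_file": "/mount/data/bpe/vocab.json",
    "conversion.model.merge_file": "/mount/data/bpe/merges.txt",
    "conversion.run.results_dir": "/mount/results/gpt3_126m/convert_nemo",
    "conversion.model.checkpoint_folder": "/mount/results/gpt3_126m/results/checkpoints",
    "conversion.model.tensor_model_parallel_size": 1,
}

argv = build_conversion_command(
    "/opt/NeMo-Megatron-Launcher/launcher_scripts/main.py", overrides
)
# shlex.join quotes arguments such as the bracketed stage list for safe pasting.
print(shlex.join(argv))
```

Passing the list to subprocess.run, rather than pasting a string into a shell, avoids any shell interpretation of the brackets in the stages argument.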

Kubernetes

Set the configuration for a Kubernetes cluster in the conf/cluster/k8s.yaml file:

pull_secret: null  # Kubernetes secret for the container registry to pull private containers.
shm_size: 512Gi  # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
nfs_server: null  # Hostname or IP address for the NFS server where data is stored.
nfs_path: null  # Path to store data in the NFS server.
ib_resource_name: "nvidia.com/hostdev"  # Specify the resource name for IB devices according to kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters.
ib_count: "8"  # Specify the number of IB devices to include per node in each pod.

Example

Set the cluster and cluster_type settings to k8s in conf/config.yaml.

To run only the conversion pipeline and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - conversion

Then enter:

python3 main.py

This launches a Helm chart based on the conversion configuration, which spawns a pod that converts the specified model to the .nemo format. You can view the pod with kubectl get pods and read its log with kubectl logs <pod name>.