Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.

Checkpoint Conversion#

NVIDIA provides a simple tool to convert checkpoints from the .ckpt format to the .nemo format. The resulting .nemo file is used later for evaluation and inference.

In conf/config.yaml, set the conversion configuration to the path of the config file to be used for conversion. The default value is gpt3/convert_gpt3, which is stored in conf/conversion/gpt3/convert_gpt3.yaml for GPT models.

The conversion value must be included in the stages configuration to run the conversion pipeline.

Common#

To specify the input checkpoint to be used for conversion for GPT models, use the model configuration in conf/conversion/gpt3/convert_gpt3.yaml:

model:
    model_type: gpt # gpt or t5
    checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
    checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
    hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
    tensor_model_parallel_size: 2 # 1 for 126m, 2 for 5b, and 8 for 20b or larger models
    pipeline_model_parallel_size: 1
    model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
    vocab_file: ${data_dir}/bpe/vocab.json
    merge_file: ${data_dir}/bpe/merges.txt
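The `model_parallel_size` field above is computed by the launcher's `${multiply:...}` config resolver from the two parallelism settings. As a minimal sketch of that arithmetic (illustrative only, not launcher code):

```python
# Illustrative sketch of how the ${multiply:...} resolver derives
# model_parallel_size from the two parallelism settings above.
# This mirrors the arithmetic only; it is not launcher code.

def model_parallel_size(tensor_model_parallel_size: int,
                        pipeline_model_parallel_size: int) -> int:
    """Total number of model-parallel ranks (TP x PP)."""
    return tensor_model_parallel_size * pipeline_model_parallel_size

# Values suggested by the config comments:
print(model_parallel_size(1, 1))  # 126m -> 1
print(model_parallel_size(2, 1))  # 5b   -> 2
print(model_parallel_size(8, 1))  # 20b or larger -> 8
```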

To specify the output location and filename of the converted .nemo file for GPT models, use the run configuration in conf/conversion/gpt3/convert_gpt3.yaml:

run:
    name: convert_${conversion.run.model_train_name}
    nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
    time_limit: "2:00:00"
    ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
    convert_name: convert_nemo
    model_train_name: gpt3_5b
    train_dir: ${base_results_dir}/${.model_train_name}
    results_dir: ${.train_dir}/${.convert_name}
    output_path: ${.train_dir}/${.convert_name}
    nemo_file_name: megatron_gpt.nemo # name of nemo checkpoint; must be .nemo file
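The `nodes` and `ntasks_per_node` fields above use the launcher's `${divide_ceil:...}` resolver to size the job from `model_parallel_size`, assuming 8 GPUs per node. A small Python sketch of that sizing logic (illustrative only, not launcher code):

```python
import math

# Illustrative sketch of the ${divide_ceil:...} resolvers in the run
# section: nodes = ceil(model_parallel_size / gpus_per_node), then
# ntasks_per_node = ceil(model_parallel_size / nodes).

def conversion_layout(model_parallel_size: int, gpus_per_node: int = 8):
    nodes = math.ceil(model_parallel_size / gpus_per_node)
    ntasks_per_node = math.ceil(model_parallel_size / nodes)
    return nodes, ntasks_per_node

print(conversion_layout(2))   # 5b  (TP=2, PP=1) -> (1, 2)
print(conversion_layout(8))   # 20b (TP=8, PP=1) -> (1, 8)
print(conversion_layout(16))  # e.g. TP=8, PP=2  -> (2, 8)
```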

Slurm#

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
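For illustration, a populated conf/cluster/bcm.yaml might look like the following. The partition and account names below are placeholders; substitute your site's values:

```yaml
partition: batch          # placeholder; use your site's Slurm partition
account: my_account       # placeholder; use your Slurm account
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
```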

Example

To run only the conversion pipeline and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - conversion

Then enter:

python3 main.py

Base Command Platform#

To run the conversion script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. You can also override this configuration from the command line, using Hydra. This script must be launched in a multi-node job.

To convert a 126M checkpoint stored in /mount/results/gpt3_126m/results/checkpoints, use the command:

python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py stages=[conversion] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts \
data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results \
conversion.run.model_train_name=gpt3_126m \
conversion.model.vocab_file=/mount/data/bpe/vocab.json \
conversion.model.merge_file=/mount/data/bpe/merges.txt \
conversion.run.results_dir=/mount/results/gpt3_126m/convert_nemo \
conversion.model.checkpoint_folder=/mount/results/gpt3_126m/results/checkpoints conversion.model.tensor_model_parallel_size=1 \
>> /results/convert_gpt3_log.txt 2>&1

The command above assumes that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/convert_gpt3_log.txt, which you can download from NGC. You may add any other configuration required to modify the command’s behavior.

Kubernetes#

Set configuration for a Kubernetes cluster in the conf/cluster/k8s.yaml file:

pull_secret: null  # Kubernetes secret for the container registry to pull private containers.
shm_size: 512Gi  # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
nfs_server: null  # Hostname or IP address for the NFS server where data is stored.
nfs_path: null  # Path to store data in the NFS server.
ib_resource_name: "nvidia.com/hostdev"  # Specify the resource name for IB devices according to kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters.
ib_count: "8"  # Specify the number of IB devices to include per node in each pod.
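For illustration, a populated conf/cluster/k8s.yaml could look like this. The secret name, NFS server address, and NFS path below are placeholders for your environment:

```yaml
pull_secret: my-registry-secret   # placeholder; your registry pull secret
shm_size: 512Gi
nfs_server: 10.0.0.5              # placeholder; your NFS server address
nfs_path: /export/nemo            # placeholder; your NFS export path
ib_resource_name: "nvidia.com/hostdev"
ib_count: "8"
```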

Example

Set the cluster and cluster_type settings to k8s in conf/config.yaml.

To run only the conversion pipeline and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:

stages:
  - conversion

Then enter:

python3 main.py

This will launch a Helm chart based on the conversion configurations which will spawn a pod to convert the specified model to the .nemo format. The pod can be viewed with kubectl get pods and the log can be read with kubectl logs <pod name>.