Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Checkpoint Conversion#
NVIDIA provides a simple tool to convert checkpoints from the .ckpt
format to the .nemo format. The converted .nemo checkpoint is used later for evaluation and inference.
In conf/config.yaml, set the conversion configuration to the path of the configuration file to be used for conversion.
The default value is gpt3/convert_gpt3, which is stored in conf/conversion/gpt3/convert_gpt3.yaml for GPT models.
The conversion value must be included in the stages configuration to run the conversion pipeline.
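For example, a minimal conf/config.yaml excerpt that selects the default conversion configuration and enables the stage might look like the following sketch (other top-level keys are omitted here):

```yaml
# conf/config.yaml (excerpt; other keys omitted)
defaults:
  - conversion: gpt3/convert_gpt3

stages:
  - conversion
```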
Common#
To specify the input checkpoint to be used for conversion for GPT
models, use the model configuration in conf/conversion/gpt3/convert_gpt3.yaml:

```yaml
model:
  model_type: gpt # gpt or t5
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 2 # 1 for 126m, 2 for 5b, and 8 for 20b or larger models
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: ${data_dir}/bpe/vocab.json
  merge_file: ${data_dir}/bpe/merges.txt
```
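The multiply resolver computes model_parallel_size as the product of the tensor- and pipeline-parallel degrees, i.e. the total number of GPUs one model replica is sharded across. A minimal Python sketch of that arithmetic (the function name is illustrative, not the launcher's implementation):

```python
def model_parallel_size(tensor_model_parallel_size: int,
                        pipeline_model_parallel_size: int) -> int:
    """Total GPUs per model replica: TP degree times PP degree."""
    return tensor_model_parallel_size * pipeline_model_parallel_size

# The 5B config above uses TP=2, PP=1, so each replica spans 2 GPUs.
print(model_parallel_size(2, 1))  # -> 2
```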
To specify the output location and filename of the converted .nemo
file for GPT models, use the run configuration in
conf/conversion/gpt3/convert_gpt3.yaml:
```yaml
run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: gpt3_5b
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_gpt.nemo # name of nemo checkpoint; must be .nemo file
```
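The divide_ceil resolver is a ceiling division: nodes is model_parallel_size divided by the 8 GPUs per node, rounded up, and ntasks_per_node spreads the model-parallel ranks across those nodes. A small Python sketch of that logic (illustrative names, not the launcher's code):

```python
def divide_ceil(numerator: int, denominator: int) -> int:
    """Ceiling division, mirroring the launcher's ${divide_ceil:...} resolver."""
    return -(-numerator // denominator)

model_parallel_size = 16  # e.g. TP=8 * PP=2
nodes = divide_ceil(model_parallel_size, 8)                # 8 GPUs per node -> 2 nodes
ntasks_per_node = divide_ceil(model_parallel_size, nodes)  # -> 8 tasks per node
print(nodes, ntasks_per_node)  # -> 2 8
```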
Slurm#
Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml
file:
```yaml
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
```
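As a sketch, a filled-in conf/cluster/bcm.yaml for a specific cluster might look like the following; the partition and account names here are hypothetical and must be replaced with values from your own Slurm deployment:

```yaml
partition: gpu        # hypothetical Slurm partition name
account: my_account   # hypothetical Slurm account
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
```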
Example
To run only the conversion pipeline and not the data preparation,
training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
```yaml
stages:
  - conversion
```

Then enter:

```shell
python3 main.py
```
Base Command Platform#
To run the conversion script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp.
You can also override this configuration from the command line, using Hydra. This script must be launched in a multi-node job.
To convert a 126M checkpoint stored in /mount/results/gpt3_126m/results/checkpoints,
use the command:
```shell
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py stages=[conversion] \
  cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts \
  data_dir=/mount/data/the_pile_gpt3 \
  base_results_dir=/mount/results \
  conversion.run.model_train_name=gpt3_126m \
  conversion.model.vocab_file=/mount/data/bpe/vocab.json \
  conversion.model.merge_file=/mount/data/bpe/merges.txt \
  conversion.run.results_dir=/mount/results/gpt3_126m/convert_nemo \
  conversion.model.checkpoint_folder=/mount/results/gpt3_126m/results/checkpoints \
  conversion.model.tensor_model_parallel_size=1 \
  >> /results/convert_gpt3_log.txt 2>&1
```
The command above assumes that you mounted the data workspace in /mount/data, and the results workspace in /mount/results. stdout and stderr
are redirected to the file /results/convert_gpt3_log.txt, which you can download from NGC.
You may add any other configuration required to modify the command’s behavior.
Kubernetes#
Set configuration for a Kubernetes cluster in the conf/cluster/k8s.yaml
file:
```yaml
pull_secret: null # Kubernetes secret for the container registry to pull private containers.
shm_size: 512Gi # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
nfs_server: null # Hostname or IP address for the NFS server where data is stored.
nfs_path: null # Path to store data in the NFS server.
ib_resource_name: "nvidia.com/hostdev" # Specify the resource name for IB devices according to kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters.
ib_count: "8" # Specify the number of IB devices to include per node in each pod.
```
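As a sketch, a filled-in conf/cluster/k8s.yaml might look like the following; the secret name, NFS host, and export path are hypothetical and must match your own cluster:

```yaml
pull_secret: ngc-registry-secret # hypothetical secret name for nvcr.io
shm_size: 512Gi
nfs_server: 10.0.0.5             # hypothetical NFS server address
nfs_path: /export/nemo-data      # hypothetical NFS export path
ib_resource_name: "nvidia.com/hostdev"
ib_count: "8"
```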
Example
Set the cluster and cluster_type settings to k8s in conf/config.yaml.
To run only the conversion pipeline and not the data preparation,
training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
```yaml
stages:
  - conversion
```

Then enter:

```shell
python3 main.py
```
This launches a Helm chart based on the conversion configuration, which
spawns a pod to convert the specified model to the .nemo format. View the pod
with kubectl get pods and read its log with
kubectl logs <pod name>.