Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.
Checkpoint Conversion
NVIDIA provides a simple tool to convert checkpoints from the .ckpt format to the .nemo format. The converted .nemo checkpoint is used later for evaluation and inference.
In conf/config.yaml, set the conversion configuration to the path of the configuration file to use for checkpoint conversion. The default value is gpt3/convert_gpt3, which is stored in conf/conversion/gpt3/convert_gpt3.yaml for GPT models.
The conversion value must be included in the stages configuration to run the conversion pipeline.
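Both settings can also be passed as Hydra overrides on the command line. A minimal sketch, assuming main.py is run from the launcher_scripts directory:

# Select the conversion config and run only the conversion stage (sketch).
python3 main.py conversion=gpt3/convert_gpt3 stages=[conversion]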
Common
To specify the input checkpoint to be used for conversion of GPT models, use the model configuration in conf/conversion/gpt3/convert_gpt3.yaml:
model:
  model_type: gpt # gpt or t5
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 2 # 1 for 126m, 2 for 5b, and 8 for 20b or larger models
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: ${data_dir}/bpe/vocab.json
  merge_file: ${data_dir}/bpe/merges.txt
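The multiply resolver is plain arithmetic: with the 5B defaults above, model_parallel_size resolves to 2 * 1 = 2. The parallel sizes should match the ones used to train the checkpoint, so converting a larger model means overriding them. A hedged sketch using Hydra overrides, with values taken from the comments above rather than verified against any particular checkpoint:

# Convert a 20B-or-larger checkpoint (sketch; sizes must match training).
python3 main.py stages=[conversion] \
  conversion.model.tensor_model_parallel_size=8 \
  conversion.model.pipeline_model_parallel_size=1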
To specify the output location and file name of the converted .nemo file for GPT models, use the run configuration in conf/conversion/gpt3/convert_gpt3.yaml:
run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: gpt3_5b
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_gpt.nemo # name of nemo checkpoint; must be .nemo file
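divide_ceil is ceiling division, so the node and task counts follow directly from model_parallel_size. A quick shell check of the values the defaults above resolve to:

mp=2                                    # model_parallel_size for the 5B defaults
nodes=$(( (mp + 7) / 8 ))               # divide_ceil:mp,8 -> 1 node
ntasks=$(( (mp + nodes - 1) / nodes ))  # divide_ceil:mp,nodes -> 2 tasks per node
echo "nodes=$nodes ntasks_per_node=$ntasks"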
Slurm
Set the configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
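These values can be edited in place or supplied as Hydra overrides when launching. A sketch, where batch and myaccount are placeholders for your site's partition and account:

python3 main.py stages=[conversion] \
  cluster.partition=batch \
  cluster.account=myaccount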
Example
To run only the conversion pipeline, and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
  - conversion
Then enter:
python3 main.py
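On a Slurm cluster, main.py generates and submits a batch job for the stage. One hedged way to watch it, assuming the job name is the run name with the job_name_prefix from bcm.yaml prepended (the exact name may differ on your setup):

squeue -u $USER --name=nemo-megatron-convert_gpt3_5b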
Base Command Platform
To run the conversion script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. You can also override this configuration from the command line using Hydra. This script must be launched in a multi-node job.
To convert a 126M checkpoint stored in /mount/results/gpt3_126m/results/checkpoints, use the command:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py stages=[conversion] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts \
data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results \
conversion.run.model_train_name=gpt3_126m \
conversion.model.vocab_file=/mount/data/bpe/vocab.json \
conversion.model.merge_file=/mount/data/bpe/merges.txt \
conversion.run.results_dir=/mount/results/gpt3_126m/convert_nemo \
conversion.model.checkpoint_folder=/mount/results/gpt3_126m/results/checkpoints \
conversion.model.tensor_model_parallel_size=1 \
>> /results/convert_gpt3_log.txt 2>&1
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/convert_gpt3_log.txt, which you can download from NGC.
You may add any other configuration required to modify the command’s behavior.
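When the job completes, the converted checkpoint should appear under the results directory given above. A quick sanity check (the path follows the overrides used and the default nemo_file_name):

ls /mount/results/gpt3_126m/convert_nemo/megatron_gpt.nemo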
Kubernetes
Set the configuration for a Kubernetes cluster in the conf/cluster/k8s.yaml file:
pull_secret: null # Kubernetes secret for the container registry to pull private containers.
shm_size: 512Gi # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
nfs_server: null # Hostname or IP address for the NFS server where data is stored.
nfs_path: null # Path to store data in the NFS server.
ib_resource_name: "nvidia.com/hostdev" # Resource name for IB devices as registered with Kubernetes, e.g. "nvidia.com/hostdev" for Mellanox IB adapters.
ib_count: "8" # Specify the number of IB devices to include per node in each pod.
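If the conversion container is private, pull_secret must name an existing registry secret. A sketch of creating one for nvcr.io with kubectl, assuming your NGC API key is in the NGC_API_KEY environment variable and ngc-registry is a name of your choosing:

kubectl create secret docker-registry ngc-registry \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}"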
Example
Set the cluster and cluster_type settings to k8s in conf/config.yaml.
To run only the conversion pipeline, and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
  - conversion
Then enter:
python3 main.py
This launches a Helm chart based on the conversion configuration, which spawns a pod to convert the specified model to the .nemo format. You can view the pod with kubectl get pods and read its log with kubectl logs <pod name>.
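A typical monitoring flow once the chart is installed (the pod name is whatever kubectl get pods reports):

helm list
kubectl get pods
kubectl logs -f <pod name>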