Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Checkpoint Conversion
In conf/config.yaml, set the conversion configuration to the pathname of the config file to be used for conversion. For T5 models this should be set to t5/convert_t5, which corresponds to conf/conversion/t5/convert_t5.yaml. The conversion value must be included in stages to run the conversion pipeline.
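For example, the relevant entries of conf/config.yaml would look like the following (a minimal sketch; the file's other entries are omitted):

defaults:
  - conversion: t5/convert_t5

stages:
  - conversion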
Common
To specify the input checkpoint to be used for conversion for T5 models, use the model configuration in conf/conversion/t5/convert_t5.yaml:
model:
  model_type: t5 # gpt or t5
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 1 # 1 for 220m, 2 for 3b
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: ${data_dir}/bpe/vocab.txt
  merge_file: null
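These settings can also be overridden from the command line using Hydra. For example, a minimal sketch of converting a 3B checkpoint with tensor parallel size 2 (per the comment above) might look like:

python3 main.py conversion=t5/convert_t5 stages=[conversion] \
  conversion.model.tensor_model_parallel_size=2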
To specify the output location and file name of the converted .nemo file for T5 models, use the run configuration in conf/conversion/t5/convert_t5.yaml:
run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: t5_220m
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_t5.nemo # name of nemo checkpoint; must be .nemo file
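With these default values, the interpolations resolve to a single output file at ${base_results_dir}/t5_220m/convert_nemo/megatron_t5.nemo. The nodes and ntasks_per_node values are derived from model_parallel_size: with tensor and pipeline parallel sizes of 1, divide_ceil yields one task on one node.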
Slurm
Set the configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
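For example, on a cluster with a Slurm partition named batch and an account named myaccount (both hypothetical values), the file would look like this:

partition: batch # hypothetical partition name
account: myaccount # hypothetical account name
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"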
Example
To run only the conversion pipeline, and not the data preparation, training, evaluation, or inference pipelines, set the stages section of conf/config.yaml to:
stages:
- conversion
Then enter:
python3 main.py
Base Command Platform
To run the conversion script on Base Command Platform, set the cluster_type configuration in conf/config.yaml to bcp. This can also be overridden from the command line using Hydra. The conversion script must be launched in a multi-node job.
To run the conversion pipeline to convert a T5 220M checkpoint stored in /mount/results/t5_220m/results/checkpoints, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py conversion=t5/convert_t5 \
stages=[conversion] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results conversion.model.vocab_file=/mount/data/bpe/vocab.txt \
conversion.run.model_train_name=t5_220m conversion.run.results_dir=/mount/results/t5_220m/results/convert_nemo \
conversion.model.checkpoint_folder=/mount/results/t5_220m/checkpoints \
conversion.model.tensor_model_parallel_size=1 conversion.model.pipeline_model_parallel_size=1 \
>> /results/convert_t5_log.txt 2>&1
The command above assumes that you mounted the data workspace in /mount/data and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/convert_t5_log.txt, which you can download from NGC. You may add any other configuration required to modify the command's behavior.
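When the job completes, the converted checkpoint is written under output_path with the configured nemo_file_name. Assuming the default output_path interpolation together with the overrides above, it can be verified from a session with the results workspace mounted (the resolved path below is an assumption):

ls /mount/results/t5_220m/convert_nemo/megatron_t5.nemo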