Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Checkpoint Conversion#
In conf/config.yaml, set the conversion configuration to the pathname of the config file to be used for conversion.
For T5 models, set it to t5/convert_t5, which corresponds to conf/conversion/t5/convert_t5.yaml.
The conversion value must also be included in stages to run the conversion pipeline.
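Putting both settings together, the relevant portion of conf/config.yaml would look like the following sketch. This assumes the conversion entry lives in the Hydra defaults list, as in the launcher's stock config.yaml; surrounding entries are omitted:
defaults:
  - conversion: t5/convert_t5

stages:
  - conversion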
Common#
To specify the input checkpoint to be used when converting T5 models,
use the model configuration in conf/conversion/t5/convert_t5.yaml:
model:
  model_type: t5 # gpt or t5
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 1 # 1 for 220m, 2 for 3b
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: ${data_dir}/bpe/vocab.txt
  merge_file: null
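These values can also be overridden from the command line with Hydra rather than by editing the file. A minimal sketch, in which the checkpoint and vocabulary paths are placeholders rather than launcher defaults:
python3 main.py stages=[conversion] conversion=t5/convert_t5 \
  conversion.model.checkpoint_folder=/path/to/results/checkpoints \
  conversion.model.checkpoint_name=latest \
  conversion.model.vocab_file=/path/to/bpe/vocab.txt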
To specify the output location and file name of the converted .nemo
file for T5 models, use the run configuration in
conf/conversion/t5/convert_t5.yaml:
run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: t5_220m
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_t5.nemo # name of nemo checkpoint; must be .nemo file
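To see how the divide_ceil resolvers size the job: for the default 220M model, model_parallel_size resolves to 1 * 1 = 1, so nodes = divide_ceil(1, 8) = 1 and ntasks_per_node = divide_ceil(1, 1) = 1. For a 3B checkpoint with tensor_model_parallel_size: 2 (per the comment in the model section above), model_parallel_size is 2 * 1 = 2, giving nodes = divide_ceil(2, 8) = 1 and ntasks_per_node = divide_ceil(2, 1) = 2, i.e. two tasks on a single node, one per model-parallel rank.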
Slurm#
Set the configuration for a Slurm cluster in the conf/cluster/bcm.yaml
file:
partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"
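When the job is submitted, these cluster settings are rendered into sbatch directives in the generated batch script. A rough sketch of the resulting header for the conversion job above; the exact directives and job-name format depend on the launcher version, and the partition and account placeholders stand in for your site's values (mem: 0 asks Slurm for all of the node's memory):
#SBATCH --partition=<partition>
#SBATCH --account=<account>
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --mem=0
#SBATCH --time=2:00:00
#SBATCH --job-name=nemo-megatron-convert_t5_220m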
Example
To run only the conversion pipeline, without the data preparation,
training, evaluation, or inference pipelines, set the stages list in
conf/config.yaml to:
stages:
- conversion
Then enter:
python3 main.py
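Equivalently, the stage selection can be passed as a Hydra override without editing the file:
python3 main.py stages=[conversion]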
Base Command Platform#
To run the conversion script on Base Command Platform, set the
cluster_type configuration in conf/config.yaml to bcp. This can
also be overridden from the command line using Hydra. The conversion
script must be launched in a multi-node job.
To run the conversion pipeline to convert a T5 220M checkpoint stored in
/mount/results/t5_220m/results/checkpoints, enter:
python3 /opt/NeMo-Framework-Launcher/launcher_scripts/main.py conversion=t5/convert_t5 \
stages=[conversion] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Framework-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results conversion.model.vocab_file=/mount/data/bpe/vocab.txt \
conversion.run.model_train_name=t5_220m conversion.run.results_dir=/mount/results/t5_220m/results/convert_nemo \
conversion.model.checkpoint_folder=/mount/results/t5_220m/checkpoints \
conversion.model.tensor_model_parallel_size=1 conversion.model.pipeline_model_parallel_size=1 \
>> /results/convert_t5_log.txt 2>&1
The command above assumes that you mounted the data workspace in /mount/data,
and the results workspace in /mount/results. stdout and stderr are redirected to the file /results/convert_t5_log.txt,
which you can download from NGC. You may add any other configuration required to modify the command's behavior.
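Once the job completes, the converted checkpoint is written under the configured output_path with the configured nemo_file_name. A .nemo file is a tar archive, so a quick sanity check is to list its contents; the path below assumes the default output_path of ${train_dir}/${convert_name} under the mounts used above:
tar -tvf /mount/results/t5_220m/convert_nemo/megatron_t5.nemo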