Important
NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Train with Predefined Configurations
NVIDIA provides configurations for four Llama 1 model sizes (7B, 13B, 30B, and 65B), three Llama 2 model sizes (7B, 13B, and 70B), and two Llama 3 model sizes (8B and 70B). These configurations include carefully selected hyperparameters, which you can use as guidelines for any custom model configurations.
Run Training
To run Llama training, update conf/config.yaml:

defaults:
  - training: llama/llama3_8b

stages:
  - training
Specify llama1, llama2, or llama3 and the model size for the training configuration in the form llama/<llama1/llama2/llama3>_<model_size>, for example llama/llama3_8b. All pre-defined training configs are located at Llama Training Config.

Execute the launcher pipeline: python3 main.py.
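If you prefer not to edit the file for a one-off run, the training selection can also be passed as a Hydra-style command-line override (a sketch assuming the launcher's standard Hydra CLI, which is not shown on this page):

python3 main.py training=llama/llama2_7b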
Configure Training
Default configurations for model-size-specific training can be found in the folder conf/training/llama. The configuration is divided into four sections: run, trainer, exp_manager, and model.
Set the run configuration, including the experiment name, results directory, time limit, and job dependency:

run:
  name: llama2_7b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "0-04:00:00"
  dependency: "singleton"
Set the number of nodes and devices for training:
trainer:
  num_nodes: 16
  devices: 8
  max_steps: 300000 # consumed_samples = global_step * global_batch_size
  max_time: "05:23:30:00" # days:hours:minutes:seconds
Set the configuration for creating checkpoints:

exp_manager:
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 10
    mode: min
    always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
    save_nemo_on_train_end: False # not recommended when training large models on clusters with short time limits
    filename: 'megatron_llama--{val_loss:.2f}-{step}-{consumed_samples}'
    model_parallel_size: ${multiply:${training.model.tensor_model_parallel_size}, ${training.model.pipeline_model_parallel_size}}
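model_parallel_size is derived with the multiply resolver from the tensor and pipeline parallel sizes set in the model section below, so with the values shown there (1 and 1) it resolves to 1, and it updates automatically if either parallelism setting is changed.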
Set the wandb configuration:

exp_manager:
  create_wandb_logger: True
  wandb_logger_kwargs:
    project: nemo_llama
    name: ${training.run.name}
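To disable Weights & Biases logging, set the flag shown above to False; the wandb_logger_kwargs then have no effect (a minimal sketch):

exp_manager:
  create_wandb_logger: False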
Set the tensor parallel and pipeline parallel sizes:

model:
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
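The product of the two values is the number of GPUs that hold one model replica, so it should not exceed the total GPU count (num_nodes * devices). A purely illustrative example that shards each replica across 8 GPUs:

model:
  tensor_model_parallel_size: 4 # illustrative value, not a recommendation
  pipeline_model_parallel_size: 2 # 4 * 2 = 8 GPUs per model replica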
Set the data distribution configuration:

model:
  data:
    data_prefix:
      - .0333
      - ${data_dir}/my-llama_00_text_document
      - .0333
      - ${data_dir}/my-llama_00_text_document
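data_prefix is a flat list of alternating sampling weights and dataset prefixes, where each weight applies to the path that follows it; the full list typically enumerates every preprocessed shard. A hypothetical two-shard example (weights and file names are placeholders):

model:
  data:
    data_prefix:
      - .5
      - ${data_dir}/my-llama_00_text_document # placeholder shard
      - .5
      - ${data_dir}/my-llama_01_text_document # placeholder shard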