Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Train with Predefined Configurations

NVIDIA provides configurations for four Llama1 model sizes (7B, 13B, 30B, and 65B), three Llama2 model sizes (7B, 13B, and 70B), and two Llama3 model sizes (8B and 70B). These configurations include carefully selected hyperparameters that you can use as guidelines for your own custom model configurations.

Run Training

  1. To run Llama training, update conf/config.yaml:

defaults:
  - training: llama/llama3_8b

stages:
  - training
  2. Specify the Llama version (llama1, llama2, or llama3) and the model size in the training configuration: llama/<llama1/llama2/llama3>_<model_size>, as shown in the example after this list. All predefined training configurations are located at Llama Training Config.

  3. Execute the launcher pipeline: python3 main.py.
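
For example, to train with the predefined Llama2 70B configuration instead, conf/config.yaml would look like the following minimal sketch (the config name follows the llama/<llama1/llama2/llama3>_<model_size> convention described above):

defaults:
  - training: llama/llama2_70b

stages:
  - training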

Configure Training

Default configurations for model-size-specific training can be found in the folder conf/training/llama. The configuration is divided into four sections: run, trainer, exp_manager, and model. The run section below sets the job name, results directory, time limit, and job dependency:

run:
  name: llama2_7b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "0-04:00:00"
  dependency: "singleton"
  1. Set the number of nodes and devices for training:

trainer:
  num_nodes: 16
  devices: 8
  max_steps: 300000 # consumed_samples = global_step * global_batch_size
  max_time: "05:23:30:00" # days:hours:minutes:seconds
  2. Set configurations for creating a checkpoint:

exp_manager:
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 10
    mode: min
    always_save_nemo: False # saves nemo file during validation, not implemented for model parallel
    save_nemo_on_train_end: False # not recommended when training large models on clusters with short time limits
    filename: 'megatron_llama--{val_loss:.2f}-{step}-{consumed_samples}'
    model_parallel_size: ${multiply:${training.model.tensor_model_parallel_size}, ${training.model.pipeline_model_parallel_size}}
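
The model_parallel_size entry is derived from the parallelism settings in the model section via the launcher's multiply resolver, so the checkpoint callback can account for model-parallel (sharded) checkpoint files. A minimal sketch of the relationship, assuming illustrative parallelism values:

model:
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 4

exp_manager:
  checkpoint_callback_params:
    model_parallel_size: 8  # ${multiply:2, 4} = tensor parallel x pipeline parallel
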
  3. Set wandb configurations:

exp_manager:
  create_wandb_logger: True
  wandb_logger_kwargs:
    project: nemo_llama
    name: ${training.run.name}
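
If Weights & Biases logging is not needed (for example, no W&B API key is available on the cluster), it can be turned off; a minimal sketch:

exp_manager:
  create_wandb_logger: False
  # wandb_logger_kwargs are ignored when the wandb logger is disabled
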
  4. Set tensor parallel and pipeline parallel size:

model:
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
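
With both values set to 1, each model replica fits on a single GPU and only data parallelism is used. Larger models need their weights split across GPUs; a hedged, illustrative sketch (not the exact values of any predefined config):

model:
  tensor_model_parallel_size: 4   # shard each layer's weights across 4 GPUs
  pipeline_model_parallel_size: 4 # split the layer stack into 4 pipeline stages
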
  5. Set data distribution configuration:

model:
  data:
    data_prefix:
    - .0333
    - ${data_dir}/my-llama_00_text_document
    - .0333
    - ${data_dir}/my-llama_01_text_document
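
data_prefix is a flat list of alternating blend weights and dataset prefixes, where each prefix points to a preprocessed Megatron dataset (the path without the .bin/.idx extension). A hedged sketch of a two-shard blend with illustrative weights:

model:
  data:
    data_prefix:
    - .5                                      # half of the samples come from shard 00
    - ${data_dir}/my-llama_00_text_document   # refers to my-llama_00_text_document.bin/.idx
    - .5                                      # half from shard 01
    - ${data_dir}/my-llama_01_text_document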