Continual Learning with Pretrained Checkpoints#

Continual learning allows LLMs to acquire new skills and stay up to date with the rapidly evolving landscape of human knowledge. In this guide, we explore how to run continual learning from existing pretrained checkpoints with NeMo 2.0. The process applies to a variety of models, including Llama 1, Llama 2, Llama 3, Gemma, Mistral, Mixtral, and others; here we use Llama 3.1 8B to illustrate the workflow.

Acquire Pretrained Checkpoints#

Before starting continual learning, make sure you have downloaded the pretrained checkpoint. You can automatically download the Llama 3.1 8B checkpoint from Hugging Face and convert it to the NeMo format with the following script:

from nemo.collections import llm

if __name__ == "__main__":
    # import_ckpt expects a model instance built from the matching config
    llm.import_ckpt(
        model=llm.LlamaModel(llm.Llama31Config8B()),
        source="hf://meta-llama/Meta-Llama-3.1-8B",
    )
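
By default, the converted checkpoint is written under NEMO_HOME (which falls back to ~/.cache/nemo when the environment variable is unset); this is the location that the nemo:// prefix used later in this guide resolves against.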

Configure Continual Learning#

To enable continual learning, start with the pretraining recipe for your desired model.

recipe = llm.recipes.llama31_8b.pretrain_recipe(
    dir="path/to/save",
    name="llama3_continual_learning",
    num_nodes=1,
    num_gpus_per_node=8,
)

Instead of pretraining from scratch, modify the resume component of the pre-defined pretrain_recipe so that training starts from the pretrained checkpoint:

from nemo import lightning as nl
import nemo_run as run

recipe.resume = run.Config(
    nl.AutoResume,
    # Restore the model weights from the imported pretrained checkpoint
    restore_config=run.Config(nl.RestoreConfig, path="nemo://meta-llama/Meta-Llama-3.1-8B"),
    resume_if_exists=True,
)
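
The nemo:// prefix points to the checkpoint that import_ckpt placed under NEMO_HOME, while resume_if_exists=True lets an interrupted run continue from the latest checkpoint in the experiment directory rather than starting over from the pretrained weights.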

Adjust Training Configurations#

When engaging in continual learning, it is often beneficial to modify various training configurations. For example:

  • Model Parallelism: Depending on the computational resources available, you may adjust the model’s parallelism settings.

  • Data Blend: You can change the dataset or modify how data is blended during training to better suit the new training objectives.

  • Learning Rate Scheduler: Adjusting the learning rate schedule can help optimize training for the new conditions.

All of the above can be easily configured with NeMo’s recipe (the numeric values below are illustrative; adjust them for your setup):

# Modify model parallelism if needed.
# These are the recommended settings for Llama 3.1 8B; for larger models,
# increase the parallelism sizes as needed to fit the model.
recipe.trainer.strategy.tensor_model_parallel_size = 1
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.context_parallel_size = 2

# Modify the data blend if needed. The list alternates blend weights and
# dataset path prefixes: [weight1, path1, weight2, path2, ...].
new_paths = [0.3, "path/to/data1", 0.7, "path/to/data2"]
recipe.data.paths = new_paths

# Or swap the data module entirely. The sequence length and batch sizes
# below are examples; set them to match your training setup.
new_data_module = run.Config(
    llm.PreTrainingDataModule,
    paths=new_paths,
    seq_length=8192,
    global_batch_size=512,
    micro_batch_size=1,
)
recipe.data = new_data_module

# Modify the learning rate scheduler if needed. The values below are
# examples; choose ones appropriate for your run.
recipe.optim.lr_scheduler.warmup_steps = 500
recipe.optim.lr_scheduler.min_lr = 1e-5
recipe.optim.config.lr = 1e-4

Execute Continual Learning#

Please refer to the NeMo-Run quickstart (nemo2-quickstart-nemo-run) for the various ways to execute the training.
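
For instance, a minimal local launch might look like the following sketch, assuming a single node with eight GPUs and the torchrun launcher (the executor settings here are illustrative, not the only supported configuration):

import nemo_run as run

if __name__ == "__main__":
    # One task per GPU on the local node, launched through torchrun
    executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")
    run.run(recipe, executor=executor)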