Long Context Recipes#

Long context training extends the ability of large language models to handle long inputs. Longer sequence lengths benefit many NLP tasks, such as document-level summarization, long document classification, and long document question answering. NeMo Framework provides recipes to train long context (16k and 64k sequence length) models such as Llama 3, Mixtral, and Nemotron. Context Parallelism (CP) is the key technique NeMo Framework uses for long context training and inference: it shards activations along the sequence dimension across GPUs, which makes the 16k and 64k sequence lengths in the current recipes practical and has the potential to scale to even longer contexts, such as 1 million tokens.
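
As a toy illustration of the idea (not NeMo's actual implementation), context parallelism shards each layer's activations along the sequence dimension, so every GPU only holds a fraction of a long sequence:

import torch

def shard_sequence(hidden, cp_size, cp_rank):
    # hidden: [batch, seq_len, hidden_dim]
    # Keep only this rank's contiguous slice of the sequence dimension.
    # (Megatron/NeMo additionally interleave chunks so the causal-attention
    # workload is balanced across ranks; omitted here for clarity.)
    seq_len = hidden.shape[1]
    chunk = seq_len // cp_size
    return hidden[:, cp_rank * chunk:(cp_rank + 1) * chunk, :]

# A 16k-token sequence split across 2 context-parallel ranks:
# each rank stores activations for only 8k tokens.
x = torch.randn(1, 16384, 8)
print(shard_sequence(x, cp_size=2, cp_rank=0).shape)  # torch.Size([1, 8192, 8])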

Provided Long Context Recipes in NeMo#

NeMo 2.0 provides tested recipes for training long context models, available in the NeMo LLM recipes directory. All of them are currently pretraining recipes, since there is no long context fine-tuning use case yet.
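
These recipes can also be launched programmatically with NeMo Run instead of the CLI. A minimal sketch, assuming the recipe modules are exposed under nemo.collections.llm and that a single node with 8 GPUs is available (names, defaults, and attribute paths may differ between releases):

import nemo_run as run
from nemo.collections import llm

# Build the 16k Llama 3 8B pretraining recipe as a configuration object.
recipe = llm.llama3_8b_16k.pretrain_recipe(
    name="llama3_8b_16k_pretrain",   # hypothetical experiment name
    dir="/results/llama3_8b_16k",    # hypothetical log/checkpoint directory
    num_nodes=1,
    num_gpus_per_node=8,
)

# Launch locally; swap in a Slurm or other executor for multi-node runs.
run.run(recipe, executor=run.LocalExecutor())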

The following tables list the sequence lengths for which NeMo provides recipes for the Llama 3, Mixtral, and Nemotron 4 models.

Llama 3#

| Sequence Length | 8B  | 70B |
|-----------------|-----|-----|
| 16k             | Yes | Yes |
| 64k             | Yes | Yes |

Mixtral#

| Sequence Length | 8x7B |
|-----------------|------|
| 16k             | Yes  |
| 64k             | Yes  |

Nemotron 4#

| Sequence Length | 15B | 22B |
|-----------------|-----|-----|
| 16k             | Yes | Yes |
| 64k             | Yes | Yes |

Usage#

Command Line Interface#

You can use these recipes via the NeMo Run CLI. For installation instructions, see the NeMo Run CLI README.

For example:

nemo llm pretrain --factory llama3_8b_16k

  • llama3_8b_16k can be replaced with any other recipe included in NeMo 2.0 (e.g., mixtral_8x7b_16k).

  • For long context recipes, only the pretrain task is supported, since there is no long context fine-tuning dataset yet.

When launching a recipe with multiple processes (i.e., on multiple GPUs), add the -y option to the command to skip the user confirmation prompt. For example: nemo llm pretrain --factory llama3_8b -y

Customizing Parameters#

You can override any parameter in the recipe:

nemo llm pretrain --factory llama3_8b_16k trainer.max_steps=2000

To continue training from a checkpoint, add the resume.resume_from_path="to/some/path" option to the command. Other options, such as the sequence length, can be overridden in the same way. For example:

nemo llm pretrain --factory llama3_8b_16k resume.resume_from_path="to/some/path" model.config.seq_length=2048 data.seq_length=2048
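
If you launch recipes from Python instead of the CLI, the same overrides can be applied by mutating the recipe object before it is run. A minimal sketch mirroring the command above, assuming the nemo.collections.llm recipe modules and NeMo Run (the exact attribute paths are assumptions and may vary between releases):

import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b_16k.pretrain_recipe(
    name="llama3_8b_16k_resume",   # hypothetical experiment name
    dir="to/some/log/dir",         # hypothetical log/checkpoint directory
    num_nodes=1,
    num_gpus_per_node=8,
)

# Python equivalents of the CLI overrides shown above (placeholder paths).
recipe.trainer.max_steps = 2000
recipe.resume.resume_from_path = "to/some/path"
recipe.model.config.seq_length = 2048
recipe.data.seq_length = 2048

run.run(recipe, executor=run.LocalExecutor())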

For more details about running recipes, see the GitHub pre-train README.

Adding a New Recipe#

See the GitHub add-recipe README for instructions on how to add a new recipe.

Long Context Model Training Pipeline#

Typically, long context model training involves two steps on the model side: pretraining from scratch, then extending the context length of the resulting model one or more times with continued training. In NeMo 2.0, this can be done with the NeMo Run CLI as follows:

Pretraining from Scratch#

Here is an example of pretraining a Llama 3 model from scratch with a sequence length of 2048 (the default sequence length for the llama3_8b recipe is 8192):

nemo llm pretrain --factory llama3_8b model.config.seq_length=2048 data.seq_length=2048 log.log_dir="your/first/log/ckpt/dir/here"

After running the above command, you will have a pretrained model checkpoint with a sequence length of 2048 in the specified log directory.

Extending the Context Length of a Pretrained Checkpoint#

To extend the context length of the model, continue training from the first checkpoint, this time with a sequence length of 16k:

nemo llm pretrain --factory llama3_8b_16k log.log_dir="your/second/log/ckpt/dir/here" resume.resume_from_path="your/first/log/ckpt/dir/here"

Then, continue training from the 16k checkpoint to extend the context length to 64k:

nemo llm pretrain --factory llama3_8b_64k log.log_dir="your/third/log/ckpt/dir/here" resume.resume_from_path="your/second/log/ckpt/dir/here"

For other context lengths, you need to add corresponding recipes with appropriate parallelism configurations. See the GitHub add-recipe README for details.