Long Context Recipes#

Long context training extends the capability of large language models to handle long inputs. Longer sequence lengths benefit many NLP tasks, such as document-level summarization, long document classification, and long document question answering. NeMo Framework provides recipes to train long context (16k and 64k sequence length) models such as Llama 3, Mixtral, and Nemotron. Context Parallelism (CP) is the key technique NeMo Framework uses to train long context models: it shards each sequence across GPUs so that training and inference remain feasible at longer sequence lengths. CP enables the 16k and 64k sequence lengths in the current recipes and can scale to even longer sequence lengths, such as 1 million tokens.
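
As a rough illustration, the 16k and 64k recipes enable CP through the trainer's Megatron strategy. The sketch below is not copied from the shipped recipes; the recipe module path, factory arguments, and parallelism value are assumptions based on the NeMo 2.0 recipe layout, and only indicate where context_parallel_size and the sequence length are configured.

from nemo.collections.llm.recipes import llama3_8b_16k  # assumed recipe module path

# Build the 16k pretraining recipe; the factory arguments follow the usual
# pretrain_recipe signature and may differ in your NeMo version.
recipe = llama3_8b_16k.pretrain_recipe(
    name="llama3_8b_16k_example",   # hypothetical experiment name
    num_nodes=1,
    num_gpus_per_node=8,
)

# Context Parallelism splits each sequence across GPUs along the sequence
# dimension, so per-GPU activation memory stays bounded as seq_length grows.
recipe.trainer.strategy.context_parallel_size = 2   # illustrative value
recipe.model.config.seq_length = 16384
recipe.data.seq_length = 16384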

Provided Long Context Recipes in NeMo#

NeMo 2.0 provides tested recipes to train long context models. These recipes are available in the NeMo LLM recipes directory. At present, all available recipes are for pretraining, as there is no long-context fine-tuning dataset yet.

The following tables for the Llama 3, Mixtral, and Nemotron models show different sequence lengths with corresponding recipes in NeMo.

Llama 3#

| Sequence Length | 8B  | 70B |
| --------------- | --- | --- |
| 16k             | Yes | Yes |
| 64k             | Yes | Yes |

Mixtral#

| Sequence Length | 8x7B |
| --------------- | ---- |
| 16k             | Yes  |
| 64k             | Yes  |

Nemotron 4#

| Sequence Length | 15B | 22B |
| --------------- | --- | --- |
| 16k             | Yes | Yes |
| 64k             | Yes | Yes |

Use Long Context Recipes#

Note

For long context recipes, we only support the pretraining task, since there is no long-context fine-tuning dataset yet.

There are two ways to run the long context recipes: the NeMo-Run API and the NeMo-Run CLI. Both wrap the same underlying code; the NeMo-Run CLI is designed for ease of use.

Use with the NeMo-Run API#

Note

For detailed information about NeMo-Run, please refer to Quickstart with NeMo-Run. Below is a concise version that focuses on the usage of long context recipes in NeMo 2.0.

NeMo 2.0 includes an example script that demonstrates the existing pretraining recipes. For instance, you can launch the llama3_8b_16k recipe as follows:

python pretraining.py --slurm --recipe llama3_8b_16k
  • --slurm is required for a non-local Slurm job submission. Additionally, you must configure the Slurm account settings in this example Python file.

  • The model type and model size can be substituted with other recipes available in NeMo 2.0 (e.g. mixtral with model size 8x7b_16k).
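
For reference, the example script roughly follows the NeMo-Run quickstart pattern sketched below. This is a condensed, hypothetical version rather than the actual file: the recipe module path, factory arguments, and executor settings are assumptions, and the real script additionally parses the --slurm and --recipe flags and builds a Slurm executor from your account settings.

import nemo_run as run
from nemo.collections.llm.recipes import llama3_8b_16k  # assumed module path


def main():
    # Build the pretraining recipe; this returns a run.Partial wrapping llm.pretrain.
    pretrain = llama3_8b_16k.pretrain_recipe(
        name="llama3_8b_16k",
        num_nodes=1,
        num_gpus_per_node=8,
    )

    # Local execution shown for simplicity; with --slurm, the example script
    # constructs a run.SlurmExecutor from your Slurm account settings instead.
    executor = run.LocalExecutor(ntasks_per_node=8, launcher="torchrun")

    # Launch the pretraining experiment.
    run.run(pretrain, executor=executor)


if __name__ == "__main__":
    main()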

Customize Parameters#

To customize parameters, modify the example script. For instance, add pretrain.trainer.max_steps=2000 before the experiment execution part to override the maximum number of training steps for this recipe.
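
In terms of the sketch above, such overrides are plain attribute assignments on the recipe object, placed before run.run is called. The module path and factory arguments remain assumptions, and the second override is purely illustrative.

from nemo.collections.llm.recipes import llama3_8b_16k  # assumed module path

pretrain = llama3_8b_16k.pretrain_recipe(name="llama3_8b_16k",
                                         num_nodes=1, num_gpus_per_node=8)

# Override recipe fields before the experiment is executed; the attribute
# paths mirror the recipe structure (trainer, model, data, log, resume, ...).
pretrain.trainer.max_steps = 2000          # limit training to 2000 steps
pretrain.trainer.val_check_interval = 500  # illustrative additional override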

Use with the NeMo-Run CLI#

Note

The NeMo-Run CLI currently works only in a local environment. To use it with a Slurm cluster, please refer to the NeMo-Run example.

You can use these recipes via the NeMo-Run CLI. For NeMo-Run CLI installation instructions, refer to the NeMo-Run CLI README.

For example:

nemo llm pretrain --factory llama3_8b_16k
  • llama3_8b_16k can be replaced with any other recipe included in NeMo 2.0 (e.g., mixtral_8x7b_16k).

Note

When launching the recipes with multiple processes (i.e. on multiple GPUs), add the -y option to the command to avoid user confirmation prompts. For example, nemo llm pretrain --factory llama3_8b -y.

Customize Parameters#

You can override any parameter in the recipe:

nemo llm pretrain --factory llama3_8b_16k trainer.max_steps=2000

To continue training from a checkpoint, add the resume.resume_from_path="to/some/path" option to the command. You can also specify other options like sequence length in this manner. For example:

nemo llm pretrain --factory llama3_8b_16k resume.resume_from_path="to/some/path" model.config.seq_length=2048 data.seq_length=2048

For more details about running recipes, see the GitHub pre-train README.

Add a New Recipe#

See the GitHub add-recipe README for instructions on how to add a new recipe.

Train a Long Context Model Pipeline#

Typically, training a long context model involves two steps: pretraining from scratch, then extending the context length of the resulting model (possibly in several stages) through continued pretraining. In NeMo 2.0, this process can be accomplished with either the NeMo-Run API or the NeMo-Run CLI, as shown in the following steps.

Use with the NeMo-Run API#

Pretrain the Llama 3 Model from Scratch#

Here is an example of pretraining the Llama 3 model from scratch with a sequence length of 2048 using the example pretraining.py script. The default sequence length for the llama3_8b recipe is 8192. First, you need to override the seq_length used in llama3_8b with:

pretrain.model.config.seq_length=2048
pretrain.data.seq_length=2048

Then you can launch the job with:

python pretraining.py --slurm --recipe llama3_8b

After running the above command, you will have a pretrained model checkpoint with a sequence length of 2048 in your directory.
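
Expressed against the earlier sketch (again hypothetical: the module path, factory arguments, and executor are assumptions), this stage amounts to the following.

import nemo_run as run
from nemo.collections.llm.recipes import llama3_8b  # assumed module path

# Stage 1: pretrain from scratch at a short sequence length.
pretrain = llama3_8b.pretrain_recipe(name="llama3_8b_seq2k",  # hypothetical name
                                     num_nodes=1, num_gpus_per_node=8)
pretrain.model.config.seq_length = 2048   # override the recipe default of 8192
pretrain.data.seq_length = 2048
pretrain.log.log_dir = "your/first/log/ckpt/dir/here"

run.run(pretrain, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))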

Extend the Context Length of a Pretrained Checkpoint#

To extend the context length of the model, you can continue training from the checkpoint. For example, you can use a sequence length of 16k:

python pretraining.py --slurm --recipe llama3_8b_16k

Next, you can continue training from the 16k checkpoint to 64k:

python pretraining.py --slurm --recipe llama3_8b_64k

To save to and resume from different directories, add:

pretrain.log.log_dir="your/second/log/ckpt/dir/here"
pretrain.resume.resume_from_path="your/first/log/ckpt/dir/here"
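
In the sketch form, each extension stage builds the next recipe and points resume_from_path at the previous stage's checkpoint directory. The module path and factory arguments are again assumptions.

import nemo_run as run
from nemo.collections.llm.recipes import llama3_8b_16k  # assumed module path

# Stage 2: extend the context length to 16k by resuming from the 2k checkpoint.
extend_16k = llama3_8b_16k.pretrain_recipe(name="llama3_8b_16k_extend",  # hypothetical name
                                           num_nodes=1, num_gpus_per_node=8)
extend_16k.log.log_dir = "your/second/log/ckpt/dir/here"
extend_16k.resume.resume_from_path = "your/first/log/ckpt/dir/here"

run.run(extend_16k, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))

# Stage 3 repeats the pattern with the llama3_8b_64k recipe, resuming from
# the second log directory.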

For other context lengths, you need to add the corresponding recipes for different parallelism configurations or override the existing recipes. Refer to the GitHub add-recipe README for details.

Use with the NeMo-Run CLI#

Note

The NeMo-Run CLI currently works only in a local environment. To work with a Slurm cluster, please see the NeMo-Run example.

Pretrain the Llama 3 Model from Scratch#

Here is an example of pretraining the Llama 3 model from scratch with a sequence length of 2048 with the NeMo-Run CLI. The default sequence length for the llama3_8b recipe is 8192.

nemo llm pretrain --factory llama3_8b model.config.seq_length=2048 data.seq_length=2048 log.log_dir="your/first/log/ckpt/dir/here"

After running the above command, you will have a pretrained model checkpoint with a sequence length of 2048 in your directory.

Extend the Context Length of a Pretrained Checkpoint#

To extend the context length of the model, you can continue training from the checkpoint. For example, you can use a sequence length of 16k:

nemo llm pretrain --factory llama3_8b_16k log.log_dir="your/second/log/ckpt/dir/here" resume.resume_from_path="your/first/log/ckpt/dir/here"

Next, you can continue training from the 16k checkpoint to 64k:

nemo llm pretrain --factory llama3_8b_64k log.log_dir="your/third/log/ckpt/dir/here" resume.resume_from_path="your/second/log/ckpt/dir/here"

For other context lengths, you need to add the corresponding recipes for different parallelism configurations or override the existing recipes. Refer to the GitHub add-recipe README for details.