
Sequence Packing#

This section explains how to use the sequence packing training technique with Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT).

Sequence Packing for SFT/PEFT#

Overview#

When fine-tuning a large language model, whether with SFT or PEFT methods, GPU underutilization often occurs because of an inefficient data pipeline. The inefficiency arises because most fine-tuning datasets have a skewed distribution of sequence lengths, with many short sequences and a few long ones, following Zipf’s Law. Since all sequences in a micro batch must be padded to a common length, shorter sequences are filled with unused pad tokens, leading to two main inefficiencies:

  • Computation performed on the pad values is eventually ignored for model output, resulting in wasted FLOPs.

  • Micro batch size is often limited by the micro batches that contain the longest sequences, so most other micro batches leave GPU memory underutilized.

Sequence packing is a training technique where multiple training sequences (examples) are concatenated into one long sequence (pack). This method eliminates the need for padding, allowing more tokens to be processed in each micro batch. As a result, it maximizes both GPU compute and GPU memory utilization.

While sequences for pretraining can be concatenated naively, this is not the case for SFT and instruction fine-tuning, where each input sequence should be treated individually. The conventional solution is to build an extended attention mask that marks which sequence each token belongs to and masks out attention values between sequences. However, this increases the complexity of attention from \(\sum_i {s_i}^2\) to \(\Big({\sum_i {s_i}}\Big)^2\), where \(s_i\) is the length of the i-th subsequence. In practice, this puts a limit on how long a pack can be. Instead, NeMo provides a highly optimized implementation of sequence packing that uses the variable-length attention kernels in FlashAttention and TransformerEngine. With this approach, attention values between sequences are never calculated, so the complexity of attention remains at \(\sum_i {s_i}^2\). This allows packing sequences to arbitrary lengths so that GPU memory can be fully utilized.
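
To make this concrete, the following is a minimal sketch (in plain Python, not NeMo’s implementation) of how packing works: tokenized sequences are concatenated into one pack, and their boundaries are recorded as cumulative offsets. These offsets are the kind of metadata that variable-length attention kernels (for example, FlashAttention’s varlen interface) consume instead of an extended attention mask, so attention across different sequences is never computed.

from typing import List, Tuple

def pack_one(seqs: List[List[int]], pack_size: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Greedily fill a single pack and return (tokens, cu_seqlens).

    In a real pipeline, the sequences that do not fit would start the next pack.
    """
    tokens: List[int] = []
    cu_seqlens = [0]                                   # cumulative sequence boundaries
    for seq in seqs:
        if len(tokens) + len(seq) > pack_size:
            break                                      # this pack is full
        tokens.extend(seq)
        cu_seqlens.append(len(tokens))
    tokens += [pad_id] * (pack_size - len(tokens))     # only residual padding remains
    return tokens, cu_seqlens

tokens, cu_seqlens = pack_one([[5, 6, 7], [8, 9], [10, 11, 12, 13]], pack_size=8)
print(tokens)       # [5, 6, 7, 8, 9, 0, 0, 0]
print(cu_seqlens)   # [0, 3, 5] -- each token attends only within its own span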

All things considered, NeMo’s implementation of sequence packing provides [1]

  • Up to 10X performance improvement in terms of FLOPs

  • Up to 6X performance improvement in terms of training time

  • No impact on model convergence

How to run SFT/PEFT with packed sequence#

Prepare Dataset#

We provide a convenient script to pack your SFT or PEFT dataset. This script assumes that you already have a dataset file prepared for SFT/PEFT training in NeMo. If you do not, follow the SFT/PEFT data preparation steps to download and prepare the Dolly dataset as an example; you will end up with a file named training.jsonl. The rest of this tutorial also assumes that you already have a recipe for training with the unpacked dataset.

Two main steps are run in this script:

  1. The online processing code in GPTSFTDataset is run. This includes tasks such as prompt template manipulation, sequence length truncation, and tokenization. The result is an array of tokenized sequences, represented by indices.

  2. The tokenized sequences are grouped by length and a packing algorithm is run.

You can read more about packing algorithms here. Currently, two variants of first_fit are supported:

  • first_fit_decreasing sorts the sequences in decreasing order before applying the first-fit algorithm. It produces a tighter packing, but it tends to keep all the short sequences together, which may affect convergence.

  • first_fit_shuffle runs first-fit in a random order. The packing is less optimal, but it keeps the dataset order random.

The recommendation is to run first_fit_shuffle and check the packed sequence lengths. If they are close to the target pack size (i.e., the packing is efficient), use first_fit_shuffle; otherwise, try first_fit_decreasing.
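
For intuition, here is an illustrative sketch of the two variants (simplified Python, not the NeMo implementation), where each bin represents one pack with a capacity of pack_size tokens:

import random
from typing import List

def first_fit(seq_lens: List[int], pack_size: int) -> List[List[int]]:
    """Place each sequence into the first pack that still has room."""
    bins: List[List[int]] = []
    for length in seq_lens:
        for b in bins:
            if sum(b) + length <= pack_size:
                b.append(length)
                break
        else:
            bins.append([length])                  # no existing pack fits; open a new one
    return bins

def first_fit_decreasing(seq_lens: List[int], pack_size: int) -> List[List[int]]:
    # Sorting longest-first gives tighter packs but groups short sequences together.
    return first_fit(sorted(seq_lens, reverse=True), pack_size)

def first_fit_shuffle(seq_lens: List[int], pack_size: int, seed: int = 0) -> List[List[int]]:
    # Shuffling keeps the dataset order random at the cost of slightly looser packs.
    shuffled = list(seq_lens)
    random.Random(seed).shuffle(shuffled)
    return first_fit(shuffled, pack_size)

print(first_fit_shuffle([1500, 600, 400, 900, 300], pack_size=2048))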

python scripts/nlp_language_modeling/prepare_packed_ft_dataset.py \
   model.data.train_ds.file_names=[/path/to/training.jsonl] \
   model.data.train_ds.max_seq_length=2048 \
   +tokenizer_path=/path/to/tokenizer.model \
   +output_dir=/path/to/output_folder \
   +pack_sizes=[2048,4096,8192] \
   +packing_algorithm=first_fit_shuffle \
   +seed=0

The last two arguments, packing_algorithm and seed, are optional.

Note

  1. If your model or dataset requires non-default configs for conventional SFT/PEFT training in NeMo, you will need to pass in the same configs to model.data.train_ds as you would for training with an unpacked dataset.

  2. model.data.train_ds.max_seq_length is the length to which each sequence is truncated before multiple sequences are packed to the packed sequence size (pack_size). max_seq_length should be set to the same value as for the unpacked data and can be determined by examining the distribution of sequence lengths in the dataset (see the sketch after this note).

  3. pack_sizes is a list of packed sequence lengths. In this example, there will be three output files, one for each pack size. The output files are named <output_folder>/packed_{pack_size}_seed{seed}.npy. This argument is a list because you will likely want to experiment with a few pack_sizes to find out which length can fill the GPU memory without exceeding it. Adjusting pack_size is analogous to adjusting the micro batch size in the unpacked case.
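
As a starting point for choosing max_seq_length (and, indirectly, sensible pack_sizes), you can inspect the token-length distribution of your dataset. The sketch below is only illustrative: it assumes a Dolly-style training.jsonl with input and output fields and a SentencePiece tokenizer file, so adjust the field names and tokenizer to match your own setup.

import json
import numpy as np
from sentencepiece import SentencePieceProcessor

tok = SentencePieceProcessor(model_file="/path/to/tokenizer.model")

lengths = []
with open("/path/to/training.jsonl") as f:
    for line in f:
        example = json.loads(line)
        text = example["input"] + " " + example["output"]     # assumed field names
        lengths.append(len(tok.encode(text)))

lengths = np.array(lengths)
print("p50 / p95 / p99 / max:", np.percentile(lengths, [50, 95, 99]), lengths.max())
# A max_seq_length near the 95th-99th percentile keeps truncation rare; pack_sizes can
# then be chosen as the largest values that still fit in GPU memory during training.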

Adjust Training Config#

To train with packed sequences, you need to change four items in the SFT/PEFT config file.

  1. Turn on the packed_sequence flag:

    ++model.data.train_ds.packed_sequence=True
    
  2. Use the new dataset file instead of the original jsonl file:

    model.data.train_ds.file_names=output_folder/packed_{pack_size}_seed{seed}.npy
    
  3. Specify the packed sequence length. This should be one of the pack_sizes you specified during data preparation.

    model.data.train_ds.max_seq_length={pack_size}
    
  4. Adjust the batch sizes.

    • Micro batch size has to be set to 1 as a nominal constraint, because sequences are already concatenated into packs during preprocessing. To get the effect of a larger micro batch size, increase pack_size instead.

    • Global batch size has to be adjusted to maintain the training recipe. Because each pack now contains multiple sequences, the global batch size needs to be divided by the average number of sequences per pack n, where n = num_sequences_in_dataset / num_packs. This ensures that each gradient update sees (on average) the same number of tokens. The value of n is printed out when the preparation script is run (see the worked example after this list).

    model.micro_batch_size=1
    model.global_batch_size=<GBS divided by n>
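
As a hypothetical, worked example of this adjustment (the numbers below are made up; use the value of n reported by the preparation script):

# Hypothetical numbers for illustration only.
num_sequences_in_dataset = 10_000
num_packs = 2_000                                    # produced by the preparation script
n = num_sequences_in_dataset / num_packs             # average sequences per pack = 5.0

unpacked_global_batch_size = 128
packed_global_batch_size = round(unpacked_global_batch_size / n)   # 128 / 5 = 25.6 -> 26

# Use packed_global_batch_size for model.global_batch_size, rounding to a value that is
# compatible with your micro batch size and data-parallel configuration.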
    

Now you are all set to fine-tune your model with much improved throughput!

Sequence Packing for NeVA#

Sequence packing with NeVA for multimodal large language models differs from the LLM SFT/PEFT approach. For details, please refer to the documentation below.

Sequence Packing for NeVA

Footnotes