Sequence Packing#

This section explains how to use the sequence packing training technique with Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT).

Sequence Packing for SFT/PEFT#

Overview#

When fine-tuning a large language model or vision language model, whether using SFT or PEFT methods, GPU under-utilization often occurs due to an inefficient input data structure. This inefficiency arises because many fine-tuning datasets have a skewed distribution of sequence lengths, with many short sequences and a few long ones, following Zipf’s Law. Since transformer models require fixed-length inputs, shorter sequences must be padded with many padding tokens. This leads to two main inefficiencies:

  • Computation performed on the pad tokens is eventually masked out, resulting in wasted GPU computation.

  • Micro batch size is often limited by the micro batch that contains the longest sequences, so most other micro batches under-utilize GPU memory.

Sequence packing is a training technique where multiple training sequences (examples) are concatenated into one long sequence (pack). This technique greatly reduces the number of padding tokens, allowing more meaningful tokens to be processed in each micro batch. As a result, it maximizes both GPU compute and GPU memory utilization.
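
To put the padding overhead in perspective, here is a toy micro batch in which every example is padded to the length of the longest one (the sequence lengths below are hypothetical):

# Hypothetical sequence lengths, illustrating how much compute padding wastes.
seq_lens = [12, 35, 512, 2048]

padded_tokens = len(seq_lens) * max(seq_lens)   # 4 * 2048 = 8192 tokens processed
useful_tokens = sum(seq_lens)                   # 2607 tokens carry real content

print(f"wasted compute: {1 - useful_tokens / padded_tokens:.0%}")   # ~68% spent on padding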

While pretraining sequences can be concatenated naively without paying attention to sequence boundaries, this is often not acceptable for supervised and instruction fine-tuning. Because the data quality is much higher in fine-tuning workloads, each input sequence should be treated individually.

The conventional solution is to build a custom attention mask (specifically, a block-triangular mask) to mask out attention values between sequences. However, this increases the complexity of attention from \(\sum_i {s_i}^2\) to \(\Big({\sum_i {s_i}}\Big)^2\), where \(s_i\) is the length of the \(i\) th subsequence. In practice, the conventional solution puts a limit on the packed sequence size. Instead, NeMo provides a highly optimized version of sequence packing that makes use of the variable-length attention kernels in FlashAttention and TransformerEngine. Instead of providing a custom attention mask, information about sequence boundaries is passed in with the cu_seqlens variable (short for cumulative sequence lengths) [1]. With this approach, attention values between sequences are never calculated, so the complexity of attention remains at \(\sum_i {s_i}^2\). This allows the packed sequence size to grow to arbitrary lengths without affecting memory complexity, so that GPU memory can be fully utilized.
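
As a minimal sketch of the idea (illustrative code only, not NeMo or kernel source; the tensor mirrors the cumulative-sequence-length convention used by variable-length attention kernels):

import torch

# Three examples of lengths 5, 3, and 7 packed into one sequence of 15 tokens.
seq_lens = torch.tensor([5, 3, 7])
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), seq_lens.cumsum(dim=0)])
print(cu_seqlens)   # tensor([ 0,  5,  8, 15])

# A variable-length attention kernel attends only within each
# [cu_seqlens[i], cu_seqlens[i + 1]) slice of the pack, so attention between
# different examples is never computed and no block-triangular mask is needed.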

All things considered, NeMo’s implementation of sequence packing provides [2]:

  • Up to 10X performance improvement in terms of FLOPs

  • Up to 6X performance improvement in terms of training time

  • No impact on model convergence

Run SFT/PEFT with Packed Sequences in LLM#

Prepare the Dataset#

In NeMo 2.0, the packed dataset is automatically prepared before training, eliminating the need for any additional steps.

Train with Predefined Fine-Tune Recipes#

The quickest way to start fine-tuning a model with packed sequences is to use the NeMo-Run recipes. Simply set packed_sequence=True in the recipe function. The following is an example using the Llama 3 8B model.

from nemo.collections import llm

recipe = llm.llama3_8b.finetune_recipe(
    name="llama3_8b_finetuning",
    dir=f"/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    packed_sequence=True,
)
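
Once the recipe is configured, you can launch it with NeMo-Run. The snippet below is a minimal sketch assuming local execution on a single node with the torchrun launcher; adapt the executor to your environment.

import nemo_run as run

# Minimal local launch: one node, 8 GPUs, torchrun launcher.
# Swap in a different executor (e.g. a Slurm executor) for cluster runs.
run.run(
    recipe,
    executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"),
)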

Migrate Your Own Recipe#

If you already have a fine-tuning recipe that does not use packed sequences, modify it as follows to start training with packed sequences.

  1. Add packed_sequence_specs=PackedSequenceSpecs(...) to your data module. For example:

    from nemo.collections.llm.gpt.data.packed_sequence import PackedSequenceSpecs

    data = llm.DollyDataModule(
        seq_length=2048,
        micro_batch_size=1,
        global_batch_size=8,
        packed_sequence_specs=PackedSequenceSpecs(
            packed_sequence_size=2048,
            tokenizer_model_name="a_unique_tokenizer_name",
        ),
    )
    

    See here for an explanation of the allowed fields in PackedSequenceSpecs.

  2. Adjust the batch sizes.

    • Micro batch size must be set to 1. This constraint arises because samples in a micro batch are no longer stacked; they are now concatenated during the data preparation step. Consequently, micro batch size becomes irrelevant when using packed sequences, so we require users to set it to 1 to acknowledge this fact. To improve GPU memory utilization, you can increase the packed_sequence_size, achieving the same effect as increasing the micro batch size.

    • Global batch size must be adjusted to maintain the training recipe. Since each pack now contains multiple sequences, the global batch size needs to be divided by the average number of sequences per pack, n, where n = num_sequences_in_dataset / num_packs (equivalently, n = packed_sequence_size / average_seq_len). This ensures that each gradient iteration sees, on average, the same number of tokens (see the worked example after this list). The value of n is printed out during the data preparation step. You may need to run training once to obtain the value of n from the logs, then rerun your training script with the updated global batch size. Alternatively, you can start with the smallest global batch size possible such that data_parallel_size = num_gpus (i.e., no gradient accumulation), and tune your global batch size gradually.
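
For example, here is the batch size arithmetic with purely hypothetical numbers (your dataset statistics, and therefore n, will differ):

# Hypothetical numbers for illustration only.
average_seq_len = 512                        # average sequence length in your dataset
packed_sequence_size = 2048                  # configured pack size
n = packed_sequence_size / average_seq_len   # ~4 sequences per pack on average

unpacked_global_batch_size = 32
packed_global_batch_size = unpacked_global_batch_size / n   # 32 / 4 = 8 packs per step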

Now you are all set to fine-tune your model with much higher throughput!

Run SFT/PEFT with Packed Sequences in NeVA#

In NeMo 2.0, you no longer need to pre-process datasets for sequence packing before use. Instead, packing is performed on the fly. Depending on the dataset type you are using, the implementation will vary slightly. Check out our example fine-tuning script.

data = vlm.NevaMockDataModule(
    seq_length=2048,
    global_batch_size=128,
    micro_batch_size=4,
    tokenizer=None,
    image_processor=None,
    num_workers=4,
    packed_sequence=True,
)

Set packed_sequence to True to use packed sequences in NevaMockDataModule. If enabled, the entire micro batch (with micro_batch_size randomly generated samples) will be packed into one sequence with the THD layout [1]. The mock dataset is intended for quick tests and benchmarks.

data_config = vlm.ImageDataConfig(
    image_folder="/path/to/image_folder",
    conv_template="v1",
)
data = vlm.NevaLazyDataModule(
    paths="/path/to/data_json_file",
    data_config=data_config,
    seq_length=2048,
    decoder_seq_length=None,
    global_batch_size=128,
    micro_batch_size=4,
    tokenizer=None,
    image_processor=None,
    num_workers=4,
    packed_sequence=True,
    num_image_embeddings_per_tile=576,
)

Set packed_sequence to True to enable packed sequences in NevaLazyDataModule. As with the mock module, if packing is enabled, the entire micro batch will be packed into one sequence with the THD layout [1]. Note that no additional sample-selection algorithm is applied: because samples in each micro batch are drawn randomly from the entire dataset, the behavior is equivalent to randomly selecting micro_batch_size samples and packing them together. With sequence packing enabled, padding sequences to the same length within each micro batch is no longer necessary, which can improve processing speed.

You can increase the micro_batch_size when enabling packed_sequence. As long as the global batch size (global_batch_size) remains consistent, convergence behavior should be similar. Note that when pipeline parallelism (PP) is enabled, packed sequences will still be truncated to seq_length to facilitate PP communication.

config = vlm.MultiModalSampleConfig(
    image_token=vlm.ImageToken(token_str="<image>", token_id=-200),
    ignore_place_holder=-100,
    conversation_template_config=LLaVATemplateConfig(),
)

packed_sequence = True  # toggle on-the-fly sequence packing for this data module

data = vlm.EnergonMultiModalDataModule(
    path="/path/to/energon_data",
    tokenizer=tokenizer,
    image_processor=image_processor,
    seq_length=2048,
    micro_batch_size=1,
    global_batch_size=32,
    num_workers=0,
    multimodal_sample_config=config,
    task_encoder=MultiModalTaskEncoder(
        tokenizer=tokenizer,
        image_processor=image_processor,
        multimodal_sample_config=config,
        packed_sequence=True,
        packed_sequence_size=8192,
        num_image_embeddings_per_tile=576,
    ),
    packing_buffer_size=200 if packed_sequence else None,
)

To use the Energon dataset, follow the instructions here to process your dataset into the Energon format. Set packed_sequence=True and specify packing_buffer_size to enable sequence packing. The Energon dataset uses on-the-fly packing, where each worker reads packing_buffer_size samples at a time and packs them into sequences of size packed_sequence_size. Refer to the Energon user guide for more details on packing.

When using this dataset, set the micro batch size (micro_batch_size) to 1 and adjust the global batch size (global_batch_size) by dividing it by the average number of sequences per pack (n).
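
As a hypothetical illustration (the numbers are not from the source): with packed_sequence_size=8192 and an average sample length of roughly 1024 tokens, each pack holds n ≈ 8 samples, so an unpacked global batch size of 32 would become 32 / 8 = 4.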