# Packed Sequences & Long-Context Training
For what packed sequences are, the three packing paths, and when to use them, see:

- `docs/training/packed-sequences.md`
- `card.yaml` (co-located)
## Enablement

### Offline packed SFT

```python
cfg.train.micro_batch_size = 1
cfg.dataset.dataset_kwargs.pad_to_max_length = True
cfg.dataset.packed_sequence_specs.packed_sequence_size = 8192  # match seq_length
```
### VLM in-batch packing

```python
cfg.dataset.pack_sequences_in_batch = True
cfg.train.micro_batch_size = 4  # must be > 1
```
### CP + packing (finetuning)

```python
cfg.model.context_parallel_size = 4
cfg.model.calculate_per_token_loss = True
cfg.ddp.average_in_collective = False
cfg.dataset.packed_sequence_specs.pad_seq_to_mult = 2 * 4  # 2 * CP
# If sequence_parallel is also enabled, pad_seq_to_mult must include TP:
# cfg.dataset.packed_sequence_specs.pad_seq_to_mult = 2 * CP * TP
```
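The padding multiple above can be derived from the parallelism sizes rather than hard-coded; a minimal sketch (the helper name `pad_multiple` is hypothetical, not a library API):

```python
def pad_multiple(cp_size: int, tp_size: int = 1,
                 sequence_parallel: bool = False) -> int:
    """Return the multiple packed sequence lengths must be padded to.

    Context parallelism needs lengths divisible by 2 * CP; with sequence
    parallelism enabled, TP joins the divisor, giving 2 * CP * TP.
    """
    mult = 2 * cp_size
    if sequence_parallel:
        mult *= tp_size
    return mult

# CP=4 without SP -> 8; CP=4, TP=2 with SP -> 16
assert pad_multiple(4) == 8
assert pad_multiple(4, tp_size=2, sequence_parallel=True) == 16
```

Computing the value this way keeps `pad_seq_to_mult` consistent if you later change `context_parallel_size` or `tensor_model_parallel_size` in the recipe.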
## Code Anchors

- Packed sequence dataset: `src/megatron/bridge/data/datasets/packed_sequence.py`
- SFT dataset: `src/megatron/bridge/data/datasets/sft.py`
- Packed seq utils: `src/megatron/bridge/training/utils/packed_seq_utils.py`
- GPT step (packing logic): `src/megatron/bridge/training/gpt_step.py`
- VLM step (packing logic): `src/megatron/bridge/training/vlm_step.py`
- Finetune utils: `src/megatron/bridge/recipes/utils/finetune_utils.py`
- Functional test: `tests/functional_tests/training/test_seqpacking_cp_example.py`
## Pitfalls

- **MBS constraint:** Offline packed SFT requires `micro_batch_size == 1`. VLM in-batch packing requires `micro_batch_size > 1`. Mixing these up produces silent data corruption.
- **CP divisibility:** `seq_length` must be divisible by `2 * context_parallel_size`. When sequence parallelism (SP) is also enabled, the divisor becomes `2 * CP * TP`. Violations cause assertion errors during initialization.
- **Per-token loss with CP:** Finetuning with `CP > 1` requires `calculate_per_token_loss=True` and `average_in_collective=False`. Without these, loss scaling is wrong across CP ranks.
- **MTP incompatibility:** Sequence packing for finetuning is documented as unsupported with multi-token prediction.
- **Model-family opt-outs:** Several model families explicitly disable packing: Qwen3-Next SFT, GLM-4.5 SFT/PEFT, Qwen3.5-VL. Check model-specific recipes before assuming packing is available.
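The micro-batch-size pitfall can be caught with an explicit guard before training starts; a minimal sketch (the helper `validate_packing_cfg` is hypothetical, not a library API):

```python
def validate_packing_cfg(micro_batch_size: int,
                         offline_packed: bool,
                         vlm_in_batch: bool) -> None:
    """Fail fast on the mutually exclusive micro-batch-size constraints,
    instead of letting a wrong MBS silently corrupt the packed data."""
    if offline_packed and micro_batch_size != 1:
        raise ValueError("Offline packed SFT requires micro_batch_size == 1")
    if vlm_in_batch and micro_batch_size <= 1:
        raise ValueError("VLM in-batch packing requires micro_batch_size > 1")

validate_packing_cfg(1, offline_packed=True, vlm_in_batch=False)   # OK
validate_packing_cfg(4, offline_packed=False, vlm_in_batch=True)   # OK
```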
## Verification

For offline packed SFT, verify that `cu_seqlens` and `seq_offsets` are
present in the batch dict during the forward pass. For CP + packing, look for
the `pad_seq_to_mult` validation message during config setup.
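The batch-dict check can be done with a temporary assertion in the forward step; a minimal sketch (the helper name is hypothetical; key names are the ones listed above):

```python
def assert_packed_batch(batch: dict) -> None:
    """Sanity-check that a batch produced by offline packed SFT carries
    the packing metadata the attention path needs."""
    missing = [k for k in ("cu_seqlens", "seq_offsets") if k not in batch]
    if missing:
        raise AssertionError(f"packed-sequence keys missing from batch: {missing}")

# Example: a fake batch with both keys present passes silently.
assert_packed_batch({"cu_seqlens": [0, 5, 12], "seq_offsets": [0, 5]})
```

Drop a call like this into the step function once while bringing up a packed run, then remove it; a batch missing either key means the packed dataset path was not actually taken.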