nemo_automodel.components.models.common.packing
Flash Attention packing support via monkey-patching.
When attn_implementation="flash_attention_2" and neat packing is enabled,
the collater produces an indexed attention mask [B, S] where each
position contains the 1-based document index (0 = padding). For example::
    [1, 1, 2, 2, 2, 0]  # 2 tokens in doc 1, 3 in doc 2, 1 padding
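As an illustration of the mask layout described above, here is a minimal plain-Python sketch that builds one packed row from a list of document lengths. The helper name ``build_indexed_mask`` is hypothetical and is not part of this module; the real collater works on tensors and batches.

```python
def build_indexed_mask(doc_lengths, seq_len):
    """Return one packed row: the 1-based document index per token, 0 for padding.

    Hypothetical helper for illustration only; not part of the packing module.
    """
    row = []
    for doc_index, length in enumerate(doc_lengths, start=1):
        # Every token of document `doc_index` carries that index.
        row.extend([doc_index] * length)
    if len(row) > seq_len:
        raise ValueError("documents do not fit in seq_len")
    # Remaining positions are padding and are marked with 0.
    row.extend([0] * (seq_len - len(row)))
    return row


print(build_indexed_mask([2, 3], seq_len=6))  # [1, 1, 2, 2, 2, 0]
```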
To make HuggingFace’s flash attention path use flash_attn_varlen_func
with per-document cu_seqlens, we monkey-patch two functions:
- transformers.modeling_flash_attention_utils._get_unpad_data: extracts per-document sequence lengths from the indexed mask and builds cu_seqlens.
- transformers.models.qwen3_vl.modeling_qwen3_vl.create_causal_mask: returns the 2D indexed mask as-is, bypassing 4D mask creation.
This is the same approach used by LlamaFactory.
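The general monkey-patching mechanism can be sketched without importing transformers. In the sketch below, ``fake_library`` is a stand-in namespace (an assumption for illustration, not the real transformers module), and all function names are hypothetical. The point it shows is that a patch only takes effect when the caller looks the function up on the module at call time.

```python
import types

# Stand-in for a third-party module; NOT the real transformers module.
fake_library = types.SimpleNamespace()


def _original_get_unpad_data(attention_mask):
    return "binary-mask path"


fake_library._get_unpad_data = _original_get_unpad_data


def library_forward(attention_mask):
    # The attribute is resolved on the module at call time,
    # so a later reassignment is picked up here.
    return fake_library._get_unpad_data(attention_mask)


def packed_get_unpad_data(attention_mask):
    return "indexed-mask path"


def apply_patch():
    # Monkey-patch: rebind the module attribute to the replacement.
    fake_library._get_unpad_data = packed_get_unpad_data


before = library_forward([[1, 1, 0]])
apply_patch()
after = library_forward([[1, 1, 2]])
print(before, "->", after)
```

A caller that imported the function by name before the patch (``from module import f``) would keep the old binding, which is one reason patches like this are usually applied before the model is built.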
Module Contents
Functions
Data
API
Replacement for create_causal_mask that passes through packed masks.
FA2 handles masking internally, so always pass through. For non-FA2 backends, pass through packed masks but delegate normal 2D masks to HF.
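The pass-through rule described above can be sketched as a small decision function. This is an assumption-laden illustration: the signature below is hypothetical and much simpler than the real ``create_causal_mask`` in transformers, masks are plain nested lists rather than tensors, and ``original_fn`` stands in for the unpatched Hugging Face function.

```python
def patched_create_causal_mask(attention_mask, attn_implementation, original_fn):
    """Hypothetical sketch of the pass-through logic; not the module's real signature."""
    if attn_implementation == "flash_attention_2":
        # FA2 handles masking internally, so the 2D mask is returned untouched.
        return attention_mask
    if attention_mask is not None and any(v > 1 for row in attention_mask for v in row):
        # A value above 1 marks an indexed packing mask: pass it through as-is.
        return attention_mask
    # Ordinary 0/1 mask on a non-FA2 backend: delegate to the original function.
    return original_fn(attention_mask)


def fake_original(mask):
    # Placeholder for the unpatched HF function.
    return "4D mask built by HF"


packed = [[1, 1, 2, 2, 0]]
binary = [[1, 1, 1, 0, 0]]
print(patched_create_causal_mask(binary, "flash_attention_2", fake_original))
print(patched_create_causal_mask(packed, "sdpa", fake_original))
print(patched_create_causal_mask(binary, "sdpa", fake_original))
```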
Apply monkey-patches for packed-sequence training with flash_attention_2.
Only patches when attn_implementation == "flash_attention_2".
Parameters:
The attention implementation used by the model.
Determine the attention backend from model config.
Custom models store it in backend.attn; HF models use attn_implementation.
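A minimal sketch of that lookup order, assuming config objects with plain attributes. The function name ``resolve_attn_backend`` is hypothetical, and the attribute name on a real HF config may differ (for example, a private variant of ``attn_implementation``), so treat the second lookup as an assumption.

```python
from types import SimpleNamespace


def resolve_attn_backend(config):
    """Hypothetical sketch: prefer config.backend.attn, else config.attn_implementation."""
    backend = getattr(config, "backend", None)
    if backend is not None and getattr(backend, "attn", None) is not None:
        # Custom-model style: the backend object carries the attention choice.
        return backend.attn
    # HF style: the attention implementation lives directly on the config.
    return getattr(config, "attn_implementation", None)


custom_cfg = SimpleNamespace(backend=SimpleNamespace(attn="te"))
hf_cfg = SimpleNamespace(attn_implementation="flash_attention_2")
print(resolve_attn_backend(custom_cfg))
print(resolve_attn_backend(hf_cfg))
```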
Extract per-document sequence lengths from an indexed attention mask.
Example::
    >>> get_seqlens_in_batch(torch.tensor([[1, 1, 2, 2, 2, 0],
    ...                                    [1, 2, 2, 3, 3, 3]]))
    tensor([2, 3, 1, 2, 3])
Parameters:
[B, S] integer tensor where each position contains
the 1-based document index (0 = padding).
Returns: torch.Tensor
1D tensor of all individual document lengths across the batch.
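To make the counting explicit, here is a plain-Python sketch of the same computation on nested lists. It is not the module's implementation (which operates on tensors); the name ``seqlens_in_batch`` is hypothetical.

```python
def seqlens_in_batch(mask):
    """Return the length of every document, row by row, from an indexed mask.

    Plain-Python illustration; the real function works on torch tensors.
    """
    lengths = []
    for row in mask:
        # Document indices are 1-based; 0 is padding and is never counted.
        for doc_index in range(1, max(row, default=0) + 1):
            count = row.count(doc_index)
            if count > 0:
                lengths.append(count)
    return lengths


mask = [[1, 1, 2, 2, 2, 0],
        [1, 2, 2, 3, 3, 3]]
print(seqlens_in_batch(mask))  # [2, 3, 1, 2, 3]
```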
Prepare indices and cu_seqlens for flash_attn_varlen_func.
This is a drop-in replacement for
transformers.modeling_flash_attention_utils._get_unpad_data
that handles indexed attention masks (values 1, 2, 3, …) instead of
binary (0/1) masks. Each unique non-zero value is treated as a separate
document, so flash_attn_varlen_func applies causal attention
within each document without cross-document attention.
Example::
    >>> get_unpad_data(torch.tensor([[1, 1, 2, 2, 2, 0],
    ...                              [1, 2, 2, 3, 3, 3]]))
    (tensor([0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11]),
     tensor([ 0, 2, 5, 6, 8, 11], dtype=torch.int32),
     3)
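Returns: tuple
Indices of non-padding tokens from the flattened sequence, followed, as the example shows, by the int32 cu_seqlens tensor of cumulative document lengths and the maximum document length in the batch.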
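The three values in the example can be reproduced with a plain-Python sketch on nested lists. This is an illustration under the assumption that documents are counted row by row; the name ``unpad_data`` is hypothetical and the real function returns torch tensors.

```python
from itertools import accumulate


def unpad_data(mask):
    """Return (indices, cu_seqlens, max_seqlen) for an indexed mask given as lists."""
    # Positions of non-padding tokens in the flattened [B * S] sequence.
    flat = [value for row in mask for value in row]
    indices = [i for i, value in enumerate(flat) if value != 0]

    # Per-document lengths, row by row (document indices are 1-based).
    seqlens = []
    for row in mask:
        for doc_index in range(1, max(row, default=0) + 1):
            count = row.count(doc_index)
            if count > 0:
                seqlens.append(count)

    # Cumulative boundaries: document k spans cu_seqlens[k]..cu_seqlens[k + 1].
    cu_seqlens = [0] + list(accumulate(seqlens))
    max_seqlen = max(seqlens, default=0)
    return indices, cu_seqlens, max_seqlen


mask = [[1, 1, 2, 2, 2, 0],
        [1, 2, 2, 3, 3, 3]]
indices, cu_seqlens, max_seqlen = unpad_data(mask)
print(indices)      # [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11]
print(cu_seqlens)   # [0, 2, 5, 6, 8, 11]
print(max_seqlen)   # 3
```

Because padding is dropped from ``indices``, the boundaries in ``cu_seqlens`` refer to positions in the unpadded token stream, not in the original flattened batch.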
Return True iff attention_mask is an Automodel-style indexed packing mask.
The Automodel neat_packed_vlm_collater (and the LLM equivalent) encode
packed-sample boundaries by marking document i (1-based) with the
integer i and using 0 for padding (e.g. [1, 1, 1, 2, 2, 3, 3, 0, 0]).
Any value greater than 1 is therefore a sufficient signal that two or
more documents are packed into the same row. A standard 0/1 attention mask
never has values > 1.
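The detection rule amounts to a single check, sketched here on nested lists. The name ``is_indexed_packed_mask`` is hypothetical and the real function operates on tensors.

```python
def is_indexed_packed_mask(attention_mask):
    """Return True if any entry exceeds 1, i.e. at least two documents share a row."""
    if attention_mask is None:
        return False
    return any(value > 1 for row in attention_mask for value in row)


print(is_indexed_packed_mask([[1, 1, 1, 2, 2, 3, 3, 0, 0]]))  # True
print(is_indexed_packed_mask([[1, 1, 1, 0, 0]]))              # False
```

Note that a row holding a single document, such as ``[1, 1, 1, 0, 0]``, looks identical to a standard 0/1 mask and is reported as not packed, which is harmless because both interpretations describe the same attention pattern.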