nemo_automodel.components.models.bagel.modeling_qwen2_packed
Qwen2 language backbone with packed-sequence attention and MoT shell.
Stage 1 uses PackedAttention + Qwen2DecoderLayer. The
PackedAttentionMoT / Qwen2MoTDecoderLayer shells are defined so that
the *_moe_gen parameter siblings exist in the module tree and survive
checkpoint round-tripping; they remain dormant in Stage 1 when
packed_gen_token_indexes is empty.
Module Contents
Classes
Functions
Data
API
Bases: ModelOutput
BAGEL packed decoder output with optional past key-value cache.
Dict-backed KV cache, one entry per layer (BAGEL inference helper).
Bases: _PackedAttentionBase
MoT variant: adds *_moe_gen siblings of every projection and QK-norm.
Bases: Module
Standard (non-MoT) packed Qwen2 decoder block.
Bases: Qwen2PreTrainedModel
Packed-sequence Qwen2 LM head wrapper.
Seed *_moe_gen parameters from their UND (understanding) siblings (Stage 1 cold-start).
Bases: Module
SwiGLU MLP used in Qwen2: down_proj(act_fn(gate_proj(x)) * up_proj(x)).
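The gated data flow of a SwiGLU MLP can be sketched in plain Python. This is a minimal illustration, not the module's code: it assumes SiLU as the activation, uses nested lists in place of tensors, and the helper names `silu`, `matvec`, and `swiglu_mlp` are hypothetical.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def swiglu_mlp(x, gate_w, up_w, down_w):
    """Compute down_proj(silu(gate_proj(x)) * up_proj(x)) without biases."""
    gate = [silu(g) for g in matvec(gate_w, x)]   # gating branch
    up = matvec(up_w, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gate
    return matvec(down_w, hidden)                 # project back down


# Toy weights: hidden size 2, intermediate size 1.
out = swiglu_mlp([1.0, 3.0], gate_w=[[1.0, 0.0]], up_w=[[0.0, 1.0]], down_w=[[2.0]])
```

The activation is applied to the gate branch only; the up branch stays linear until the elementwise product.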
Bases: Module
MoT decoder: every norm/MLP is duplicated into *_moe_gen siblings.
Bases: Qwen2PreTrainedModel
Packed-sequence Qwen2 backbone.
Selects Qwen2DecoderLayer or Qwen2MoTDecoderLayer for each layer based
on config.layer_module (string -> class). When the MoT variant is
active, self.use_moe is True and an extra norm_moe_gen sibling is
created for the final RMSNorm.
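One way to picture the string-to-class selection is a small registry lookup. The sketch below is an assumption about the general shape, not the module's code: the two classes are empty placeholders, and `DECODER_LAYER_REGISTRY`, `resolve_decoder_layer`, and `build_layers` are hypothetical names.

```python
class Qwen2DecoderLayer:
    """Placeholder standing in for the standard packed decoder block."""


class Qwen2MoTDecoderLayer:
    """Placeholder standing in for the MoT decoder block."""


# Hypothetical registry mapping config.layer_module strings to classes.
DECODER_LAYER_REGISTRY = {
    "Qwen2DecoderLayer": Qwen2DecoderLayer,
    "Qwen2MoTDecoderLayer": Qwen2MoTDecoderLayer,
}


def resolve_decoder_layer(layer_module: str) -> type:
    """Look up the decoder layer class named by layer_module."""
    try:
        return DECODER_LAYER_REGISTRY[layer_module]
    except KeyError:
        known = ", ".join(sorted(DECODER_LAYER_REGISTRY))
        raise ValueError(
            f"Unknown layer_module {layer_module!r}; expected one of: {known}"
        ) from None


def build_layers(layer_module: str, num_hidden_layers: int):
    """Instantiate the layer stack and report whether the MoT variant is active."""
    layer_cls = resolve_decoder_layer(layer_module)
    use_moe = layer_cls is Qwen2MoTDecoderLayer
    return [layer_cls() for _ in range(num_hidden_layers)], use_moe


layers, use_moe = build_layers("Qwen2MoTDecoderLayer", num_hidden_layers=2)
```

Failing loudly on an unknown string keeps a typo in the config from silently falling back to the non-MoT block.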
Copy UND weights into MoE-gen siblings (Stage 1 cold-start seeding).
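The cold-start seeding can be illustrated on a state-dict-like mapping. This is a sketch under stated assumptions: the key layout is hypothetical, plain lists stand in for tensors, and `seed_moe_gen` is an illustrative helper rather than the method documented here. The only rule it relies on is that each sibling's name is the source name plus the `_moe_gen` suffix.

```python
import copy

MOE_GEN_SUFFIX = "_moe_gen"


def seed_moe_gen(state: dict) -> dict:
    """Return a copy of state whose *_moe_gen entries mirror their siblings.

    A key such as 'layers.0.mlp_moe_gen.gate_proj.weight' is seeded from
    'layers.0.mlp.gate_proj.weight'. Entries without a sibling are left as-is.
    """
    seeded = dict(state)
    for name in state:
        if MOE_GEN_SUFFIX not in name:
            continue
        source = name.replace(MOE_GEN_SUFFIX, "")
        if source in state:
            # Deep copy so later updates to one branch do not alias the other.
            seeded[name] = copy.deepcopy(state[source])
    return seeded


# Hypothetical key names, chosen only to show the suffix convention.
state = {
    "layers.0.self_attn.q_proj.weight": [1.0, 2.0],
    "layers.0.self_attn.q_proj_moe_gen.weight": [0.0, 0.0],
    "layers.0.mlp.gate_proj.weight": [3.0],
    "layers.0.mlp_moe_gen.gate_proj.weight": [0.0],
    "orphan_moe_gen.weight": [9.0],
}
seeded = seed_moe_gen(state)
```

Copying rather than sharing the values matters because the two branches are expected to diverge once the generation branch starts training.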
Bases: PreTrainedModel
Abstract base class — mirrors HF Qwen2PreTrainedModel flags.
Bases: Module
Qwen2 RMSNorm (equivalent to T5LayerNorm).
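For reference, RMSNorm rescales a vector by the reciprocal of its root mean square and applies a learned per-channel weight, with no mean subtraction and no bias. The sketch below uses plain lists instead of tensors; `rms_norm` and the default `eps` are illustrative assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by 1 / sqrt(mean(x**2) + eps), then by a per-channel weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for w, v in zip(weight, x)]


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
constant = rms_norm([2.0, 2.0], weight=[1.0, 1.0], eps=0.0)
```

Because the mean is not subtracted, a constant input keeps its sign and is scaled to unit magnitude, whereas a standard LayerNorm would map it to zeros.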
Bases: Module
Qwen2 rotary embedding — delegates inv_freq init to HF ROPE_INIT_FUNCTIONS.
DIVERGENCE: transformers 5.x removed "default" from ROPE_INIT_FUNCTIONS
so we fall back to a local copy of the pre-5.x default implementation when
the rope_type is unspecified.
Bases: Module
Common init for PackedAttention / PackedAttentionMoT (QKV shapes, RoPE, QK-norm).
Local “default” RoPE init — transformers 5.x dropped it from ROPE_INIT_FUNCTIONS.
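The unscaled RoPE initialisation computes one inverse frequency per pair of head dimensions, `1 / theta ** (2i / head_dim)`. The following is a minimal sketch of that formula in plain Python; `default_inv_freq` and its default `rope_theta` are assumptions for illustration, and the module's local copy may differ in signature and return values.

```python
def default_inv_freq(head_dim: int, rope_theta: float = 10000.0):
    """Inverse frequencies for unscaled RoPE: theta ** (-2i / head_dim)."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even for rotary embeddings")
    return [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, head_dim, 2)]


inv_freq = default_inv_freq(head_dim=4, rope_theta=10000.0)
```

The frequencies decrease geometrically from 1.0, so early dimension pairs rotate quickly with position and later pairs rotate slowly.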
Return a dict with rope_theta / scaling info, handling transformers 4.x and 5.x.
DIVERGENCE: upstream BAGEL was written against transformers 4.4x where
Qwen2Config exposes rope_theta and rope_scaling as top-level
attributes. transformers 5.x moves these into a single rope_parameters
dict on Qwen2Config (Llama still keeps the old layout). The Automodel (AM) container runs
transformers 5.x, so we normalize here instead of hard-coding one schema.
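The normalization described above can be sketched as a function that prefers the `rope_parameters` dict and otherwise falls back to the legacy top-level attributes. This is an assumption about the general shape only: `normalize_rope_params` is a hypothetical name, `SimpleNamespace` objects stand in for real config instances, and the 10000.0 fallback for a missing `rope_theta` is illustrative.

```python
from types import SimpleNamespace


def normalize_rope_params(config) -> dict:
    """Return RoPE settings as one dict regardless of config schema."""
    rope_parameters = getattr(config, "rope_parameters", None)
    if isinstance(rope_parameters, dict) and rope_parameters:
        # Newer layout: everything already lives in a single dict.
        params = dict(rope_parameters)
    else:
        # Legacy layout: rope_theta and rope_scaling are top-level attributes.
        params = {"rope_theta": getattr(config, "rope_theta", 10000.0)}
        rope_scaling = getattr(config, "rope_scaling", None)
        if rope_scaling:
            params.update(rope_scaling)
    params.setdefault("rope_type", "default")
    return params


legacy_cfg = SimpleNamespace(rope_theta=1_000_000.0, rope_scaling=None)
new_cfg = SimpleNamespace(
    rope_parameters={"rope_theta": 1_000_000.0, "rope_type": "default"}
)
legacy_params = normalize_rope_params(legacy_cfg)
new_params = normalize_rope_params(new_cfg)
```

Both stand-in configs resolve to the same dict, so downstream code can read `rope_theta` and `rope_type` from one place.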
Apply Rotary Position Embedding to query and key tensors.
Rotates half the hidden dims of the input.
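The two helpers above fit together as follows. This sketch works on a single head vector with plain lists rather than batched tensors, and it assumes the half-split layout where the first half of the dimensions pairs with the second half; the function names are illustrative.

```python
import math


def rotate_half(x):
    """Map [x1, x2] to [-x2, x1], where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def apply_rotary(x, cos, sin):
    """Rotate x by the per-dimension angles encoded in cos and sin."""
    rotated = rotate_half(x)
    return [v * c + r * s for v, c, r, s in zip(x, cos, rotated, sin)]


# A quarter turn applied to one 2-dimensional pair.
theta = math.pi / 2
query = [1.0, 0.0]
rotated_query = apply_rotary(query, cos=[math.cos(theta)] * 2, sin=[math.sin(theta)] * 2)
```

The same `cos` and `sin` values are applied to queries and keys, and each rotation preserves the vector's norm.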