nemo_automodel.components.models.bagel.modeling_qwen2_packed


Qwen2 language backbone with packed-sequence attention and MoT shell.

Stage 1 uses PackedAttention + Qwen2DecoderLayer. The PackedAttentionMoT / Qwen2MoTDecoderLayer shells are defined so that the *_moe_gen parameter siblings exist in the module tree and survive checkpoint round-tripping; they remain dormant in Stage 1 when packed_gen_token_indexes is empty.
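The dormant-sibling behaviour described above can be illustrated with a minimal sketch. It is not the module's actual forward pass: `route`, the tensor sizes, and the single pair of projections are hypothetical stand-ins, and only the `*_moe_gen` naming and the empty generation index come from the description. With no generation tokens, the sibling still lives in the module tree and state dict but receives no meaningful gradient.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden = 8
q_proj = nn.Linear(hidden, hidden)          # UND (understanding) projection
q_proj_moe_gen = nn.Linear(hidden, hidden)  # generation sibling, same shape


def route(packed, und_idx, gen_idx):
    """Hypothetical router: each packed token goes through the projection
    that owns its index."""
    out = packed.new_zeros(packed.shape)
    out[und_idx] = q_proj(packed[und_idx])
    out[gen_idx] = q_proj_moe_gen(packed[gen_idx])
    return out


packed = torch.randn(5, hidden)             # 5 packed tokens
und_idx = torch.arange(5)                   # every token is an UND token
gen_idx = torch.empty(0, dtype=torch.long)  # Stage 1: no generation tokens

out = route(packed, und_idx, gen_idx)
out.sum().backward()

gen_grad = q_proj_moe_gen.weight.grad
gen_is_dormant = gen_grad is None or not bool(gen_grad.any())
in_state_dict = "weight" in q_proj_moe_gen.state_dict()
```

Because the sibling is a registered submodule, its weights are saved and loaded with the rest of the model even though no token is routed through it.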

Module Contents

Classes

| Name | Description |
| --- | --- |
| BaseNavitOutputWithPast | BAGEL packed decoder output with optional past key-value cache. |
| NaiveCache | Dict-backed KV cache, one entry per layer (BAGEL inference helper). |
| PackedAttention | BAGEL’s packed-sequence attention (UND path, no MoT). |
| PackedAttentionMoT | MoT variant: adds *_moe_gen siblings of every projection and QK-norm. |
| Qwen2DecoderLayer | Standard (non-MoT) packed Qwen2 decoder block. |
| Qwen2ForCausalLM | Packed-sequence Qwen2 LM head wrapper. |
| Qwen2MLP | SwiGLU MLP used in Qwen2 (gate_proj * up_proj -> down_proj). |
| Qwen2MoTDecoderLayer | MoT decoder: every norm/MLP is duplicated into *_moe_gen siblings. |
| Qwen2Model | Packed-sequence Qwen2 backbone. |
| Qwen2PreTrainedModel | Abstract base class; mirrors HF Qwen2PreTrainedModel flags. |
| Qwen2RMSNorm | Qwen2 RMSNorm (equivalent to T5LayerNorm). |
| Qwen2RotaryEmbedding | Qwen2 rotary embedding; delegates inv_freq init to HF ROPE_INIT_FUNCTIONS. |
| _PackedAttentionBase | Common init for PackedAttention / PackedAttentionMoT (QKV shapes, RoPE, QK-norm). |

Functions

| Name | Description |
| --- | --- |
| _compute_default_rope_parameters | Local “default” RoPE init; transformers 5.x dropped it from ROPE_INIT_FUNCTIONS. |
| _extract_rope_config | Return a dict with rope_theta / scaling info, handling transformers 4.x and 5.x. |
| _flash_attn_varlen | - |
| _pad_sequence | - |
| apply_rotary_pos_emb | Apply Rotary Position Embedding to query and key tensors. |
| rotate_half | Rotates half the hidden dims of the input. |

Data

_DECODER_LAYER_DICT

__all__

_flex_attention

API

class nemo_automodel.components.models.bagel.modeling_qwen2_packed.BaseNavitOutputWithPast(
packed_query_sequence: torch.FloatTensor = None,
past_key_values: typing.Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None
)
Dataclass

Bases: ModelOutput

BAGEL packed decoder output with optional past key-value cache.

packed_query_sequence: FloatTensor = None
past_key_values: Optional[NaiveCache] = None
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache(
num_layers: int
)

Dict-backed KV cache, one entry per layer (BAGEL inference helper).

key_cache: dict[int, Optional[Tensor]] = {k: None for k in range(num_layers)}
num_layers: int
seq_lens: int
value_cache: dict[int, Optional[Tensor]] = {k: None for k in range(num_layers)}
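As a rough illustration of the dict-backed layout listed above, the following sketch mirrors the documented defaults in plain Python. `SketchCache` is a hypothetical stand-in rather than the real class, lists stand in for tensors, and `seq_lens` is not modeled.

```python
from typing import Dict, List, Optional


class SketchCache:
    """Hypothetical stand-in: one key slot and one value slot per layer."""

    def __init__(self, num_layers: int) -> None:
        self.num_layers = num_layers
        # Every layer starts empty, matching the documented default dicts.
        self.key_cache: Dict[int, Optional[List[float]]] = {
            k: None for k in range(num_layers)
        }
        self.value_cache: Dict[int, Optional[List[float]]] = {
            k: None for k in range(num_layers)
        }


cache = SketchCache(num_layers=2)
cache.key_cache[0] = [0.1, 0.2]    # layer 0 keys after a first step
cache.value_cache[0] = [0.3, 0.4]  # layer 0 values after a first step
```

Keying by layer index lets each decoder layer read and overwrite its own slot without touching the others.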
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttention(
config: transformers.Qwen2Config,
layer_idx: typing.Optional[int] = None
)

Bases: _PackedAttentionBase

BAGEL’s packed-sequence attention (UND path, no MoT).

k_norm
q_norm
nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttention.forward(
args = (),
kwargs = {}
)
nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttention.forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor],
packed_query_indexes: torch.Tensor,
past_key_values: typing.Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: typing.Optional[torch.Tensor] = None,
packed_key_value_indexes: typing.Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True
) -> typing.Tuple[torch.Tensor, typing.Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache]]
nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttention.forward_train(
packed_sequence: torch.Tensor,
sample_lens: typing.List[int],
attention_mask,
packed_position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor]
) -> torch.Tensor
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttentionMoT(
config: transformers.Qwen2Config,
layer_idx: typing.Optional[int] = None
)

Bases: _PackedAttentionBase

MoT variant: adds *_moe_gen siblings of every projection and QK-norm.

k_norm
k_norm_moe_gen
k_proj_moe_gen
o_proj_moe_gen
q_norm
q_norm_moe_gen
q_proj_moe_gen
v_proj_moe_gen
nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttentionMoT.forward(
args = (),
kwargs = {}
)
nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttentionMoT.forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor],
packed_query_indexes: torch.Tensor,
past_key_values: typing.Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: typing.Optional[torch.Tensor] = None,
packed_key_value_indexes: typing.Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True,
mode: str = 'und',
packed_vae_token_indexes: typing.Optional[torch.Tensor] = None,
packed_text_indexes: typing.Optional[torch.Tensor] = None
) -> typing.Tuple[torch.Tensor, typing.Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache]]
nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttentionMoT.forward_train(
packed_sequence: torch.Tensor,
sample_lens: typing.List[int],
attention_mask,
packed_position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor],
packed_und_token_indexes: torch.LongTensor,
packed_gen_token_indexes: torch.LongTensor
) -> torch.Tensor
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2DecoderLayer(
config: transformers.Qwen2Config,
layer_idx: typing.Optional[int] = None
)

Bases: Module

Standard (non-MoT) packed Qwen2 decoder block.

hidden_size = config.hidden_size
input_layernorm
mlp = Qwen2MLP(config)
post_attention_layernorm
self_attn = PackedAttention(config, layer_idx)
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2DecoderLayer.forward(
args = (),
kwargs = {}
)
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2DecoderLayer.forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor],
packed_query_indexes: torch.Tensor,
past_key_values: typing.Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: typing.Optional[torch.Tensor] = None,
packed_key_value_indexes: typing.Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True
) -> typing.Tuple[torch.Tensor, typing.Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache]]
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2DecoderLayer.forward_train(
packed_sequence: torch.Tensor,
sample_lens: typing.List[int],
attention_mask,
packed_position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor]
) -> torch.Tensor
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM(
config: transformers.Qwen2Config
)

Bases: Qwen2PreTrainedModel

Packed-sequence Qwen2 LM head wrapper.

_tied_weights_keys = ['lm_head.weight']
lm_head
model = Qwen2Model(config)
vocab_size = config.vocab_size
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM.forward(
args = (),
kwargs = {}
) -> typing.Union[torch.Tensor, nemo_automodel.components.models.bagel.modeling_qwen2_packed.BaseNavitOutputWithPast]
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM.forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_ids: torch.Tensor,
packed_query_indexes: torch.Tensor,
past_key_values: typing.Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: typing.Optional[torch.Tensor] = None,
packed_key_value_indexes: typing.Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True,
mode: str = 'und',
packed_vae_token_indexes: typing.Optional[torch.Tensor] = None,
packed_text_indexes: typing.Optional[torch.Tensor] = None
) -> nemo_automodel.components.models.bagel.modeling_qwen2_packed.BaseNavitOutputWithPast
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM.forward_train(
packed_sequence: torch.Tensor,
sample_lens: typing.List[int],
attention_mask,
packed_position_ids: torch.Tensor,
packed_und_token_indexes: typing.Optional[torch.LongTensor] = None,
packed_gen_token_indexes: typing.Optional[torch.LongTensor] = None
) -> torch.Tensor
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM.get_decoder() -> torch.nn.Module
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM.get_input_embeddings() -> torch.nn.Module
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM.get_output_embeddings() -> torch.nn.Module
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM.init_moe() -> None

Seed *_moe_gen parameters from their UND siblings (Stage 1 cold-start).

nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM.set_decoder(
decoder: torch.nn.Module
) -> None
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM.set_input_embeddings(
value: torch.nn.Module
) -> None
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM.set_output_embeddings(
new_embeddings: torch.nn.Module
) -> None
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2MLP(
config: transformers.Qwen2Config
)

Bases: Module

SwiGLU MLP used in Qwen2 (gate_proj * up_proj -> down_proj).

act_fn = ACT2FN[config.hidden_act]
down_proj
gate_proj
hidden_size = config.hidden_size
intermediate_size = config.intermediate_size
up_proj
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2MLP.forward(
hidden_state: torch.Tensor
) -> torch.Tensor
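The SwiGLU data flow in Qwen2MLP can be sketched as follows. This is a minimal illustration under assumptions: `SketchSwiGLU` is a hypothetical class, the activation is fixed to SiLU (the real module looks it up via `ACT2FN[config.hidden_act]`), and the bias-free projections are assumed rather than confirmed by this page.

```python
import torch
from torch import nn
import torch.nn.functional as F


class SketchSwiGLU(nn.Module):
    """Hypothetical stand-in: act(gate_proj(x)) * up_proj(x) -> down_proj."""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        # bias=False is an assumption made for this sketch.
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        gated = F.silu(self.gate_proj(hidden_state)) * self.up_proj(hidden_state)
        return self.down_proj(gated)


mlp = SketchSwiGLU(hidden_size=8, intermediate_size=16)
y = mlp(torch.randn(3, 8))  # 3 packed tokens in, 3 packed tokens out
```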
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2MoTDecoderLayer(
config: transformers.Qwen2Config,
layer_idx: typing.Optional[int] = None,
attn_module: type = PackedAttentionMoT
)

Bases: Module

MoT decoder: every norm/MLP is duplicated into *_moe_gen siblings.

freeze_und = getattr(config, 'freeze_und', False)
hidden_size = config.hidden_size
input_layernorm
input_layernorm_moe_gen
mlp = Qwen2MLP(config)
mlp_moe_gen = Qwen2MLP(config)
post_attention_layernorm
post_attention_layernorm_moe_gen
self_attn = attn_module(config, layer_idx)
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2MoTDecoderLayer.forward(
args = (),
kwargs = {}
)
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2MoTDecoderLayer.forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor],
packed_query_indexes: torch.Tensor,
past_key_values: typing.Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: typing.Optional[torch.Tensor] = None,
packed_key_value_indexes: typing.Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True,
mode: str = 'und',
packed_vae_token_indexes: typing.Optional[torch.Tensor] = None,
packed_text_indexes: typing.Optional[torch.Tensor] = None
) -> typing.Tuple[torch.Tensor, typing.Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache]]
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2MoTDecoderLayer.forward_train(
packed_sequence: torch.Tensor,
sample_lens: typing.List[int],
attention_mask,
packed_position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor],
packed_und_token_indexes: torch.LongTensor,
packed_gen_token_indexes: torch.LongTensor
) -> torch.Tensor
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2Model(
config: transformers.Qwen2Config
)

Bases: Qwen2PreTrainedModel

Packed-sequence Qwen2 backbone.

Selects Qwen2DecoderLayer or Qwen2MoTDecoderLayer for each layer based on config.layer_module (a string mapped to a class). When the MoT variant is active, self.use_moe is True and an extra norm_moe_gen sibling is created for the final RMSNorm.
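The string-to-class selection can be pictured with a small registry. The sketch below is illustrative only: `PlainLayer`, `MoTLayer`, `REGISTRY`, and `build_layers` are hypothetical names, the `partial` argument is a placeholder, and the real `_DECODER_LAYER_DICT` contents are not reproduced here. The `'Mo' in layer_module_name` test is taken from the documented `use_moe` attribute.

```python
from functools import partial
from types import SimpleNamespace


class PlainLayer:
    """Hypothetical stand-in for the non-MoT decoder block."""

    def __init__(self, config, layer_idx=None):
        self.layer_idx = layer_idx


class MoTLayer(PlainLayer):
    """Hypothetical stand-in for the MoT decoder block."""

    def __init__(self, config, layer_idx=None, attn_module=None):
        super().__init__(config, layer_idx)
        self.attn_module = attn_module


# Illustrative registry; the placeholder attn_module value is an assumption.
REGISTRY = {
    "Qwen2DecoderLayer": PlainLayer,
    "Qwen2MoTDecoderLayer": partial(MoTLayer, attn_module="mot-attention"),
}


def build_layers(config):
    name = config.layer_module
    layer_cls = REGISTRY[name]  # string -> class (or partial)
    use_moe = "Mo" in name      # same substring test as the documented attribute
    layers = [layer_cls(config, idx) for idx in range(config.num_hidden_layers)]
    return layers, use_moe


mot_cfg = SimpleNamespace(layer_module="Qwen2MoTDecoderLayer", num_hidden_layers=2)
plain_cfg = SimpleNamespace(layer_module="Qwen2DecoderLayer", num_hidden_layers=2)
mot_layers, mot_flag = build_layers(mot_cfg)
plain_layers, plain_flag = build_layers(plain_cfg)
```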

embed_tokens
layers
norm
norm_moe_gen
padding_idx = config.pad_token_id
rotary_emb = Qwen2RotaryEmbedding(config=config)
use_moe = 'Mo' in layer_module_name
vocab_size = config.vocab_size
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2Model.forward(
args = (),
kwargs = {}
)
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2Model.forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_ids: torch.Tensor,
packed_query_indexes: torch.Tensor,
past_key_values: typing.Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: typing.Optional[torch.Tensor] = None,
packed_key_value_indexes: typing.Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True,
mode: str = 'und',
packed_vae_token_indexes: typing.Optional[torch.Tensor] = None,
packed_text_indexes: typing.Optional[torch.Tensor] = None
) -> nemo_automodel.components.models.bagel.modeling_qwen2_packed.BaseNavitOutputWithPast
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2Model.forward_train(
packed_sequence: torch.Tensor,
sample_lens: typing.List[int],
attention_mask,
packed_position_ids: torch.Tensor,
packed_und_token_indexes: typing.Optional[torch.LongTensor] = None,
packed_gen_token_indexes: typing.Optional[torch.LongTensor] = None
) -> torch.Tensor
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2Model.init_moe() -> None

Copy UND weights into MoE-gen siblings (Stage 1 cold-start seeding).
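One plausible way to seed the siblings is to match parameters by name, as in the sketch below. This is an assumption about the mechanism, not the confirmed implementation: `TinyMoTBlock` and `seed_moe_gen` are hypothetical, and the sketch relies only on the `*_moe_gen` naming convention documented on this page.

```python
import torch
from torch import nn


class TinyMoTBlock(nn.Module):
    """Hypothetical block with one UND projection and its generation sibling."""

    def __init__(self, hidden: int = 4) -> None:
        super().__init__()
        self.q_proj = nn.Linear(hidden, hidden)
        self.q_proj_moe_gen = nn.Linear(hidden, hidden)


def seed_moe_gen(module: nn.Module) -> None:
    """Copy each UND parameter into the sibling whose name adds '_moe_gen'."""
    state = module.state_dict()
    with torch.no_grad():
        for name, param in module.named_parameters():
            if "_moe_gen" in name:
                param.copy_(state[name.replace("_moe_gen", "")])


block = TinyMoTBlock()
seed_moe_gen(block)
```

After seeding, both branches start from identical weights, so the generation branch begins training from the UND branch's values instead of a fresh random init.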

class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2PreTrainedModel()

Bases: PreTrainedModel

Abstract base class; mirrors HF Qwen2PreTrainedModel flags.

_no_split_modules = ['Qwen2DecoderLayer', 'Qwen2MoTDecoderLayer']
_skip_keys_device_placement = 'past_key_values'
base_model_prefix = 'model'
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2PreTrainedModel._init_weights(
module: torch.nn.Module
) -> None
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2RMSNorm(
hidden_size: int,
eps: float = 1e-06
)

Bases: Module

Qwen2 RMSNorm (equivalent to T5LayerNorm).

weight = nn.Parameter(torch.ones(hidden_size))
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2RMSNorm.extra_repr() -> str
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2RMSNorm.forward(
hidden_states: torch.Tensor
) -> torch.Tensor
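For orientation, RMSNorm rescales each hidden vector by the reciprocal of its root mean square and then applies a learned per-channel weight. The sketch below follows the widely used formulation. `SketchRMSNorm` is a hypothetical stand-in, and the float32 upcast is an assumption about numerical handling rather than something this page states.

```python
import torch
from torch import nn


class SketchRMSNorm(nn.Module):
    """Hypothetical stand-in: x * rsqrt(mean(x^2) + eps) * weight."""

    def __init__(self, hidden_size: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        input_dtype = hidden_states.dtype
        x = hidden_states.to(torch.float32)  # assumed upcast for stability
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x.to(input_dtype)


norm = SketchRMSNorm(hidden_size=2)
normed = norm(torch.tensor([[3.0, 4.0]]))
```

Unlike LayerNorm, no mean is subtracted and no bias is added, so the output of an all-ones weight has a mean square of approximately one.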
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2RotaryEmbedding(
config: transformers.Qwen2Config,
device: typing.Optional[torch.device] = None
)

Bases: Module

Qwen2 rotary embedding; delegates inv_freq init to HF ROPE_INIT_FUNCTIONS.

DIVERGENCE: transformers 5.x removed "default" from ROPE_INIT_FUNCTIONS so we fall back to a local copy of the pre-5.x default implementation when the rope_type is unspecified.

max_seq_len_cached = config.max_position_embeddings
original_inv_freq = self.inv_freq
original_max_seq_len = config.max_position_embeddings
rope_type
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2RotaryEmbedding._dynamic_frequency_update(
position_ids: torch.Tensor,
device: torch.device
) -> None
nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2RotaryEmbedding.forward(
x: torch.Tensor,
position_ids: torch.Tensor
) -> typing.Tuple[torch.Tensor, torch.Tensor]
class nemo_automodel.components.models.bagel.modeling_qwen2_packed._PackedAttentionBase(
config: transformers.Qwen2Config,
layer_idx: typing.Optional[int] = None
)

Bases: Module

Common init for PackedAttention / PackedAttentionMoT (QKV shapes, RoPE, QK-norm).

attention_dropout = config.attention_dropout
head_dim = self.hidden_size // self.num_heads
hidden_size = config.hidden_size
is_causal = getattr(config, 'is_causal', True)
k_proj
max_position_embeddings = config.max_position_embeddings
num_heads = config.num_attention_heads
num_key_value_groups = self.num_heads // self.num_key_value_heads
num_key_value_heads = config.num_key_value_heads
o_proj
q_proj
rope_theta
v_proj
nemo_automodel.components.models.bagel.modeling_qwen2_packed._compute_default_rope_parameters(
config: transformers.Qwen2Config,
device: typing.Optional[torch.device] = None,
seq_len: typing.Optional[int] = None
) -> typing.Tuple[torch.Tensor, float]

Local “default” RoPE init; transformers 5.x dropped it from ROPE_INIT_FUNCTIONS.
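The standard default RoPE initialization computes one inverse frequency per pair of head dimensions, `1 / theta^(2i / d)`, and returns it together with an attention scaling factor of 1.0. The sketch below shows that arithmetic with plain Python lists standing in for tensors. `default_inv_freq` is a hypothetical helper, and taking `rope_theta` and `head_dim` as direct arguments (instead of a config) is a simplification.

```python
from typing import List, Tuple


def default_inv_freq(rope_theta: float, head_dim: int) -> Tuple[List[float], float]:
    """Hypothetical helper: unscaled RoPE inverse frequencies."""
    # One frequency per (even, odd) pair of dimensions: theta^(-i / head_dim)
    # for i = 0, 2, 4, ...
    inv_freq = [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, head_dim, 2)]
    attention_factor = 1.0  # the default variant applies no extra scaling
    return inv_freq, attention_factor


inv_freq, attention_factor = default_inv_freq(rope_theta=10000.0, head_dim=4)
```

With `rope_theta=10000.0` and `head_dim=4`, the two frequencies are 1.0 and 0.01.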

nemo_automodel.components.models.bagel.modeling_qwen2_packed._extract_rope_config(
config: transformers.Qwen2Config
) -> dict

Return a dict with rope_theta / scaling info, handling transformers 4.x and 5.x.

DIVERGENCE: upstream BAGEL was written against transformers 4.4x, where Qwen2Config exposes rope_theta and rope_scaling as top-level attributes. transformers 5.x moves these into a single rope_parameters dict on Qwen2Config (Llama still keeps the old layout). The AutoModel (AM) container runs transformers 5.x, so we normalize here instead of hard-coding one schema.
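A normalization of this kind might look like the sketch below. It is an assumption-laden illustration, not the function's real body: `normalize_rope` is hypothetical, the returned key names are chosen for the example, and the presence of a `rope_theta` key inside `rope_parameters` is assumed.

```python
from types import SimpleNamespace


def normalize_rope(config) -> dict:
    """Hypothetical normalizer covering both config layouts."""
    params = getattr(config, "rope_parameters", None)
    if isinstance(params, dict):
        # Newer layout: a single dict carries theta and any scaling fields.
        scaling = {k: v for k, v in params.items() if k != "rope_theta"}
        return {"rope_theta": params.get("rope_theta"), "rope_scaling": scaling or None}
    # Older layout: top-level attributes on the config object.
    return {
        "rope_theta": getattr(config, "rope_theta", None),
        "rope_scaling": getattr(config, "rope_scaling", None),
    }


old_style = SimpleNamespace(rope_theta=1_000_000.0, rope_scaling=None)
new_style = SimpleNamespace(rope_parameters={"rope_theta": 1_000_000.0})
old_result = normalize_rope(old_style)
new_result = normalize_rope(new_style)
```

Callers then read one dict shape regardless of which transformers version produced the config.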

nemo_automodel.components.models.bagel.modeling_qwen2_packed._flash_attn_varlen(
args = (),
kwargs = {}
)
nemo_automodel.components.models.bagel.modeling_qwen2_packed._pad_sequence(
tensor: torch.Tensor,
pad_size: int
) -> torch.Tensor
nemo_automodel.components.models.bagel.modeling_qwen2_packed.apply_rotary_pos_emb(
q: torch.Tensor,
k: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
position_ids: typing.Optional[torch.Tensor] = None,
unsqueeze_dim: int = 1
) -> typing.Tuple[torch.Tensor, torch.Tensor]

Apply Rotary Position Embedding to query and key tensors.

nemo_automodel.components.models.bagel.modeling_qwen2_packed.rotate_half(
x: torch.Tensor
) -> torch.Tensor

Rotates half the hidden dims of the input.
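The two helpers above work together. The sketch below follows the formulation commonly used in Hugging Face Qwen2/Llama code and may differ in detail from this module's packed variant. The function names are suffixed with `_sketch` to mark them as stand-ins, and the unused `position_ids` parameter is omitted.

```python
import torch


def rotate_half_sketch(x: torch.Tensor) -> torch.Tensor:
    """Split the last dim in two halves (x1, x2) and return (-x2, x1)."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_sketch(q, k, cos, sin, unsqueeze_dim=1):
    """Rotate q and k by the angles encoded in cos/sin."""
    cos = cos.unsqueeze(unsqueeze_dim)  # broadcast over the heads dimension
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = q * cos + rotate_half_sketch(q) * sin
    k_embed = k * cos + rotate_half_sketch(k) * sin
    return q_embed, k_embed


rotated = rotate_half_sketch(torch.tensor([1.0, 2.0, 3.0, 4.0]))

# A zero rotation angle (cos = 1, sin = 0) must leave q and k unchanged.
q = torch.randn(1, 2, 3, 4)  # (batch, heads, seq, head_dim)
k = torch.randn(1, 2, 3, 4)
cos = torch.ones(1, 3, 4)
sin = torch.zeros(1, 3, 4)
q_out, k_out = apply_rotary_sketch(q, k, cos, sin)
```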

nemo_automodel.components.models.bagel.modeling_qwen2_packed._DECODER_LAYER_DICT = {'Qwen2DecoderLayer': Qwen2DecoderLayer, 'Qwen2MoTDecoderLayer': partial(Qwen2Mo...
nemo_automodel.components.models.bagel.modeling_qwen2_packed.__all__ = ['Qwen2RMSNorm', 'Qwen2RotaryEmbedding', 'Qwen2MLP', 'PackedAttention', 'PackedA...
nemo_automodel.components.models.bagel.modeling_qwen2_packed._flex_attention = FlexAttention.flex_attn