nemo_automodel.components.models.bagel.modeling_qwen2_packed#

Qwen2 language backbone with packed-sequence attention and a Mixture-of-Transformer-Experts (MoT) shell.

Stage 1 uses PackedAttention + Qwen2DecoderLayer. The PackedAttentionMoT / Qwen2MoTDecoderLayer shells are defined so that the *_moe_gen parameter siblings exist in the module tree and survive checkpoint round-tripping; they remain dormant in Stage 1 when packed_gen_token_indexes is empty.

Module Contents#

Classes#

Qwen2RMSNorm

Qwen2 RMSNorm (equivalent to T5LayerNorm).

Qwen2RotaryEmbedding

Qwen2 rotary embedding — delegates inv_freq init to HF ROPE_INIT_FUNCTIONS.

Qwen2MLP

SwiGLU MLP used in Qwen2: down_proj(act(gate_proj(x)) * up_proj(x)).

NaiveCache

Dict-backed KV cache, one entry per layer (BAGEL inference helper).

BaseNavitOutputWithPast

BAGEL packed decoder output with optional past key-value cache.

_PackedAttentionBase

Common init for PackedAttention / PackedAttentionMoT (QKV shapes, RoPE, QK-norm).

PackedAttention

BAGEL's packed-sequence attention (understanding (UND) path, no MoT).

PackedAttentionMoT

MoT variant: adds *_moe_gen siblings of every projection and QK-norm.

Qwen2DecoderLayer

Standard (non-MoT) packed Qwen2 decoder block.

Qwen2MoTDecoderLayer

MoT decoder: every norm/MLP is duplicated into *_moe_gen siblings.

Qwen2PreTrainedModel

Abstract base class — mirrors HF Qwen2PreTrainedModel flags.

Qwen2Model

Packed-sequence Qwen2 backbone.

Qwen2ForCausalLM

Packed-sequence Qwen2 LM head wrapper.

Functions#

_flash_attn_varlen

_extract_rope_config

Return a dict with rope_theta / scaling info, handling transformers 4.x and 5.x.

_compute_default_rope_parameters

Local “default” RoPE init — transformers 5.x dropped it from ROPE_INIT_FUNCTIONS.

rotate_half

Rotates half the hidden dims of the input.

apply_rotary_pos_emb

Apply Rotary Position Embedding to query and key tensors.

_pad_sequence

Data#

API#

nemo_automodel.components.models.bagel.modeling_qwen2_packed.__all__#

[‘Qwen2RMSNorm’, ‘Qwen2RotaryEmbedding’, ‘Qwen2MLP’, ‘PackedAttention’, ‘PackedAttentionMoT’, ‘Qwen2


nemo_automodel.components.models.bagel.modeling_qwen2_packed._flash_attn_varlen(*args, **kwargs)#
nemo_automodel.components.models.bagel.modeling_qwen2_packed._flex_attention#

None

class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2RMSNorm(hidden_size: int, eps: float = 1e-06)#

Bases: torch.nn.Module

Qwen2 RMSNorm (equivalent to T5LayerNorm).

Initialization

forward(hidden_states: torch.Tensor) torch.Tensor#
extra_repr() str#
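
As a rough illustration of the normalization this class performs, the sketch below applies RMS normalization to a plain Python list. It is framework-free, and the name `rms_norm` is hypothetical; the actual module operates on `torch.Tensor` inputs, and the per-channel `weight` argument here is an assumption about how the scale is stored.

```python
import math
from typing import List


def rms_norm(hidden_states: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Scale a vector by the reciprocal of its root mean square, then by a per-channel weight.

    Hypothetical stand-in for the tensor version: no mean subtraction and no bias.
    """
    variance = sum(v * v for v in hidden_states) / len(hidden_states)
    inv_rms = 1.0 / math.sqrt(variance + eps)
    return [w * v * inv_rms for v, w in zip(hidden_states, weight)]


# After normalization the mean of squares is approximately 1.
normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
```

Unlike a standard LayerNorm, this form does not re-center the input, which is why only a scale (and no bias) is involved.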
nemo_automodel.components.models.bagel.modeling_qwen2_packed._extract_rope_config(config: transformers.Qwen2Config) dict#

Return a dict with rope_theta / scaling info, handling transformers 4.x and 5.x.

DIVERGENCE: upstream BAGEL was written against transformers 4.4x where Qwen2Config exposes rope_theta and rope_scaling as top-level attributes. transformers 5.x moves these into a single rope_parameters dict on Qwen2Config (Llama still keeps the old layout). AM’s container runs transformers 5.x, so we normalize here instead of hard-coding one schema.
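
The normalization described above might look roughly like the following sketch. The helper name `extract_rope_config`, the fallback value for `rope_theta`, and the exact keys inside `rope_parameters` are assumptions for illustration; `SimpleNamespace` objects stand in for real `Qwen2Config` instances.

```python
from types import SimpleNamespace
from typing import Any, Dict


def extract_rope_config(config: Any) -> Dict[str, Any]:
    """Normalize either config layout into one dict (hypothetical helper)."""
    rope_parameters = getattr(config, "rope_parameters", None)
    if isinstance(rope_parameters, dict):
        # Newer layout: everything lives in a single dict.
        return dict(rope_parameters)
    # Older layout: top-level rope_theta plus an optional rope_scaling dict.
    # The 10000.0 fallback is an assumption, not taken from the module.
    normalized: Dict[str, Any] = {"rope_theta": getattr(config, "rope_theta", 10000.0)}
    rope_scaling = getattr(config, "rope_scaling", None)
    if rope_scaling:
        normalized.update(rope_scaling)
    return normalized


# Stand-ins for the two config layouts.
old_style = SimpleNamespace(rope_theta=1000000.0, rope_scaling=None)
new_style = SimpleNamespace(rope_parameters={"rope_theta": 1000000.0, "rope_type": "default"})
```

Callers can then read `rope_theta` from the returned dict without caring which transformers version produced the config.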

nemo_automodel.components.models.bagel.modeling_qwen2_packed._compute_default_rope_parameters(
config: transformers.Qwen2Config,
device: Optional[torch.device] = None,
seq_len: Optional[int] = None,
) Tuple[torch.Tensor, float]#

Local “default” RoPE init — transformers 5.x dropped it from ROPE_INIT_FUNCTIONS.
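
For orientation, the unscaled RoPE initialization is commonly computed as inverse frequencies `1 / theta^(2i / d)` together with an attention scaling factor of 1.0, which matches the `Tuple[torch.Tensor, float]` return type above. The sketch below uses plain lists; the name `default_inv_freq` and its parameters are hypothetical and do not mirror the real function's signature.

```python
from typing import List, Tuple


def default_inv_freq(head_dim: int, rope_theta: float = 10000.0) -> Tuple[List[float], float]:
    """Return (inverse frequencies, attention scaling factor) for unscaled RoPE.

    Hypothetical list-based stand-in for the tensor version.
    """
    # One frequency per pair of channels: theta^(-i / head_dim) for i = 0, 2, 4, ...
    inv_freq = [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, head_dim, 2)]
    attention_factor = 1.0  # the unscaled variant applies no extra scaling
    return inv_freq, attention_factor


inv_freq, attention_factor = default_inv_freq(head_dim=8)
```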

class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2RotaryEmbedding(
config: transformers.Qwen2Config,
device: Optional[torch.device] = None,
)#

Bases: torch.nn.Module

Qwen2 rotary embedding — delegates inv_freq init to HF ROPE_INIT_FUNCTIONS.

DIVERGENCE: transformers 5.x removed "default" from ROPE_INIT_FUNCTIONS so we fall back to a local copy of the pre-5.x default implementation when the rope_type is unspecified.

Initialization

_dynamic_frequency_update(
position_ids: torch.Tensor,
device: torch.device,
) None#
forward(
x: torch.Tensor,
position_ids: torch.Tensor,
) Tuple[torch.Tensor, torch.Tensor]#
nemo_automodel.components.models.bagel.modeling_qwen2_packed.rotate_half(x: torch.Tensor) torch.Tensor#

Rotates half the hidden dims of the input.
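
A minimal list-based sketch of the operation, assuming the usual convention of splitting the last dimension into two halves and returning the negated second half followed by the first:

```python
from typing import List


def rotate_half(x: List[float]) -> List[float]:
    """Split x into halves (x1, x2) and return (-x2, x1); list stand-in for the tensor op."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + x1


rotated = rotate_half([1.0, 2.0, 3.0, 4.0])
```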

nemo_automodel.components.models.bagel.modeling_qwen2_packed.apply_rotary_pos_emb(
q: torch.Tensor,
k: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
position_ids: Optional[torch.Tensor] = None,
unsqueeze_dim: int = 1,
) Tuple[torch.Tensor, torch.Tensor]#

Apply Rotary Position Embedding to query and key tensors.
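
The sketch below shows the usual form of this computation, `x * cos + rotate_half(x) * sin`, for a single head and a single position using plain lists. The helper names `apply_rotary` and `cos_sin` are hypothetical, and the way the cos/sin tables are built (angles duplicated across both halves) is an assumption about the convention in use.

```python
import math
from typing import List, Tuple


def rotate_half(x: List[float]) -> List[float]:
    """Return (-x2, x1) for x split into halves (x1, x2)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def cos_sin(position: int, inv_freq: List[float]) -> Tuple[List[float], List[float]]:
    """Build cos/sin tables for one position (hypothetical helper)."""
    angles = [position * f for f in inv_freq]
    angles = angles + angles  # duplicate so the table covers the full head dim
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


def apply_rotary(
    q: List[float], k: List[float], cos: List[float], sin: List[float]
) -> Tuple[List[float], List[float]]:
    """Rotate query and key vectors by the position-dependent angles."""
    q_embed = [x * c + r * s for x, r, c, s in zip(q, rotate_half(q), cos, sin)]
    k_embed = [x * c + r * s for x, r, c, s in zip(k, rotate_half(k), cos, sin)]
    return q_embed, k_embed


cos, sin = cos_sin(position=3, inv_freq=[1.0, 0.01])
q_rot, k_rot = apply_rotary([1.0, 0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5], cos, sin)
```

Because each channel pair is rotated by an angle, vector norms are preserved, and position 0 leaves the inputs unchanged.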

class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2MLP(config: transformers.Qwen2Config)#

Bases: torch.nn.Module

SwiGLU MLP used in Qwen2: down_proj(act(gate_proj(x)) * up_proj(x)).

Initialization

forward(hidden_state: torch.Tensor) torch.Tensor#
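
To make the gating order concrete, here is a small framework-free sketch of a SwiGLU block on a single vector. It assumes SiLU as the activation (the usual choice for SwiGLU) and bias-free projections; the helper names and the toy weights are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def silu(v: float) -> float:
    """SiLU / Swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_mlp(x: List[float], gate_proj: Matrix, up_proj: Matrix, down_proj: Matrix) -> List[float]:
    """Compute down_proj(silu(gate_proj(x)) * up_proj(x))."""
    gate = [silu(v) for v in matvec(gate_proj, x)]
    up = matvec(up_proj, x)
    return matvec(down_proj, [g * u for g, u in zip(gate, up)])


# Toy weights: hidden size 2, intermediate size 3.
x = [1.0, -1.0]
gate_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
up_proj = [[2.0, 0.0], [0.0, 2.0], [1.0, -1.0]]
down_proj = [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
out = swiglu_mlp(x, gate_proj, up_proj, down_proj)
```

The activation is applied to the gate branch only; the up branch stays linear and is multiplied element-wise before the down projection.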
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache(num_layers: int)#

Dict-backed KV cache, one entry per layer (BAGEL inference helper).

Initialization

property num_layers: int#
property seq_lens: int#
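
A dict-backed cache of this kind could be sketched as follows. The class name `NaiveCacheSketch`, the attribute names `key_cache` and `value_cache`, and the rule that `seq_lens` is read from layer 0 are all assumptions for illustration; nested lists stand in for tensors.

```python
from typing import Any, Dict, Optional


class NaiveCacheSketch:
    """Hypothetical per-layer KV cache: one dict slot per layer, empty until filled."""

    def __init__(self, num_layers: int) -> None:
        self.key_cache: Dict[int, Optional[Any]] = {i: None for i in range(num_layers)}
        self.value_cache: Dict[int, Optional[Any]] = {i: None for i in range(num_layers)}

    @property
    def num_layers(self) -> int:
        return len(self.key_cache)

    @property
    def seq_lens(self) -> int:
        # Assumption: every layer caches the same number of tokens, so layer 0 suffices.
        first = self.key_cache[0]
        return 0 if first is None else len(first)


cache = NaiveCacheSketch(num_layers=2)
empty_len = cache.seq_lens
cache.key_cache[0] = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # three cached tokens
cache.value_cache[0] = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
```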
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.BaseNavitOutputWithPast#

Bases: transformers.utils.ModelOutput

BAGEL packed decoder output with optional past key-value cache.

packed_query_sequence: torch.FloatTensor#

None

past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache]#

None

nemo_automodel.components.models.bagel.modeling_qwen2_packed._pad_sequence(tensor: torch.Tensor, pad_size: int) torch.Tensor#
class nemo_automodel.components.models.bagel.modeling_qwen2_packed._PackedAttentionBase(
config: transformers.Qwen2Config,
layer_idx: Optional[int] = None,
)#

Bases: torch.nn.Module

Common init for PackedAttention / PackedAttentionMoT (QKV shapes, RoPE, QK-norm).

Initialization

class nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttention(
config: transformers.Qwen2Config,
layer_idx: Optional[int] = None,
)#

Bases: nemo_automodel.components.models.bagel.modeling_qwen2_packed._PackedAttentionBase

BAGEL's packed-sequence attention (understanding (UND) path, no MoT).

Initialization

forward(*args, **kwargs)#
forward_train(
packed_sequence: torch.Tensor,
sample_lens: List[int],
attention_mask,
packed_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
) torch.Tensor#
forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
packed_query_indexes: torch.Tensor,
past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: Optional[torch.Tensor] = None,
packed_key_value_indexes: Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True,
) Tuple[torch.Tensor, Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache]]#
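
To illustrate what `sample_lens` conveys in the training path, the sketch below derives cumulative offsets (the form variable-length attention kernels typically expect) and a block-diagonal causal mask from a list of sample lengths. This is an assumption-laden illustration: the helper names are hypothetical, and the real `attention_mask` may encode patterns other than per-sample causal attention.

```python
from itertools import accumulate
from typing import List


def cumulative_offsets(sample_lens: List[int]) -> List[int]:
    """Return start offsets of each sample in the packed sequence, plus the total length."""
    return [0] + list(accumulate(sample_lens))


def block_causal_mask(sample_lens: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query q may attend to key k (same sample, k not after q)."""
    offsets = cumulative_offsets(sample_lens)
    total = offsets[-1]
    sample_of = [0] * total
    for sample_id, (start, end) in enumerate(zip(offsets, offsets[1:])):
        for pos in range(start, end):
            sample_of[pos] = sample_id
    return [
        [sample_of[q] == sample_of[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


# Two samples of lengths 2 and 3 packed into one sequence of length 5.
offsets = cumulative_offsets([2, 3])
mask = block_causal_mask([2, 3])
```

Packing this way avoids padding tokens while still preventing tokens of one sample from attending to another.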
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttentionMoT(
config: transformers.Qwen2Config,
layer_idx: Optional[int] = None,
)#

Bases: nemo_automodel.components.models.bagel.modeling_qwen2_packed._PackedAttentionBase

MoT variant: adds *_moe_gen siblings of every projection and QK-norm.

Initialization

forward(*args, **kwargs)#
forward_train(
packed_sequence: torch.Tensor,
sample_lens: List[int],
attention_mask,
packed_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
packed_und_token_indexes: torch.LongTensor,
packed_gen_token_indexes: torch.LongTensor,
) torch.Tensor#
forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
packed_query_indexes: torch.Tensor,
past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: Optional[torch.Tensor] = None,
packed_key_value_indexes: Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True,
mode: str = 'und',
packed_vae_token_indexes: Optional[torch.Tensor] = None,
packed_text_indexes: Optional[torch.Tensor] = None,
) Tuple[torch.Tensor, Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache]]#
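
The role of `packed_und_token_indexes` and `packed_gen_token_indexes` can be pictured as index-based routing: each token in the packed sequence goes through either the base projection or its `*_moe_gen` sibling. The sketch below is a hypothetical, list-based illustration of that idea (the function name and the toy projections are not from the module); with an empty generation index list, only the base path runs, which corresponds to the dormant Stage 1 case.

```python
from typing import Callable, List, Sequence


def route_by_index(
    packed: Sequence[float],
    und_indexes: Sequence[int],
    gen_indexes: Sequence[int],
    und_fn: Callable[[float], float],
    gen_fn: Callable[[float], float],
) -> List[float]:
    """Apply und_fn to understanding tokens and gen_fn to generation tokens, in place order."""
    out = [0.0] * len(packed)
    for i in und_indexes:
        out[i] = und_fn(packed[i])
    for i in gen_indexes:
        out[i] = gen_fn(packed[i])
    return out


packed = [1.0, 2.0, 3.0, 4.0]
mixed = route_by_index(packed, [0, 1, 3], [2], lambda v: v * 10.0, lambda v: -v)
# Stage 1 style call: no generation tokens, so gen_fn is never used.
und_only = route_by_index(packed, [0, 1, 2, 3], [], lambda v: v * 10.0, lambda v: -v)
```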
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2DecoderLayer(
config: transformers.Qwen2Config,
layer_idx: Optional[int] = None,
)#

Bases: torch.nn.Module

Standard (non-MoT) packed Qwen2 decoder block.

Initialization

forward(*args, **kwargs)#
forward_train(
packed_sequence: torch.Tensor,
sample_lens: List[int],
attention_mask,
packed_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
) torch.Tensor#
forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
packed_query_indexes: torch.Tensor,
past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: Optional[torch.Tensor] = None,
packed_key_value_indexes: Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True,
) Tuple[torch.Tensor, Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache]]#
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2MoTDecoderLayer(
config: transformers.Qwen2Config,
layer_idx: Optional[int] = None,
attn_module: type = PackedAttentionMoT,
)#

Bases: torch.nn.Module

MoT decoder: every norm/MLP is duplicated into *_moe_gen siblings.

Initialization

forward(*args, **kwargs)#
forward_train(
packed_sequence: torch.Tensor,
sample_lens: List[int],
attention_mask,
packed_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
packed_und_token_indexes: torch.LongTensor,
packed_gen_token_indexes: torch.LongTensor,
) torch.Tensor#
forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
packed_query_indexes: torch.Tensor,
past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: Optional[torch.Tensor] = None,
packed_key_value_indexes: Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True,
mode: str = 'und',
packed_vae_token_indexes: Optional[torch.Tensor] = None,
packed_text_indexes: Optional[torch.Tensor] = None,
) Tuple[torch.Tensor, Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache]]#
nemo_automodel.components.models.bagel.modeling_qwen2_packed._DECODER_LAYER_DICT#

None

class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2PreTrainedModel#

Bases: transformers.modeling_utils.PreTrainedModel

Abstract base class — mirrors HF Qwen2PreTrainedModel flags.

config_class#

None

base_model_prefix#

'model'

supports_gradient_checkpointing#

True

_no_split_modules#

['Qwen2DecoderLayer', 'Qwen2MoTDecoderLayer']

_skip_keys_device_placement#

'past_key_values'

_supports_flash_attn_2#

True

_supports_cache_class#

True

_init_weights(module: torch.nn.Module) None#
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2Model(config: transformers.Qwen2Config)#

Bases: nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2PreTrainedModel

Packed-sequence Qwen2 backbone.

Selects Qwen2DecoderLayer or Qwen2MoTDecoderLayer for each layer based on config.layer_module (string -> class). When the MoT variant is active, self.use_moe is True and an extra norm_moe_gen sibling is created for the final RMSNorm.
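
The string-to-class selection can be sketched with a small registry. Everything here is illustrative: the registry name `LAYER_REGISTRY`, the stub layer classes, and the way `use_moe` is derived are assumptions, and a `SimpleNamespace` stands in for the real config.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple


class DecoderLayerSketch:
    """Stub for the standard decoder block."""

    def __init__(self, config: Any, layer_idx: int) -> None:
        self.layer_idx = layer_idx


class MoTDecoderLayerSketch(DecoderLayerSketch):
    """Stub for the MoT decoder block."""


# Hypothetical registry mapping the config string to a layer class.
LAYER_REGISTRY = {
    "Qwen2DecoderLayer": DecoderLayerSketch,
    "Qwen2MoTDecoderLayer": MoTDecoderLayerSketch,
}


def build_layers(config: Any) -> Tuple[List[DecoderLayerSketch], bool]:
    """Instantiate one layer per index and report whether the MoT variant is active."""
    layer_cls = LAYER_REGISTRY[config.layer_module]
    layers = [layer_cls(config, idx) for idx in range(config.num_hidden_layers)]
    use_moe = layer_cls is MoTDecoderLayerSketch
    return layers, use_moe


config = SimpleNamespace(layer_module="Qwen2MoTDecoderLayer", num_hidden_layers=2)
layers, use_moe = build_layers(config)
```

Keeping the choice in the config means a checkpoint's layer layout can be recovered from the config alone.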

Initialization

init_moe() None#

Copy UND weights into MoE-gen siblings (Stage 1 cold-start seeding).
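
One way to picture this seeding step is as a name-based copy over the parameter dictionary: each `*_moe_gen` entry takes the values of the entry whose name is the same with `_moe_gen` removed. The sketch below uses a plain dict of lists; the function name `seed_moe_gen`, the example parameter names, and the name-matching rule itself are assumptions for illustration.

```python
from typing import Dict, List


def seed_moe_gen(state: Dict[str, List[float]]) -> None:
    """Overwrite every *_moe_gen entry with a copy of its base sibling (hypothetical helper)."""
    for name in state:
        if "_moe_gen" in name:
            source_name = name.replace("_moe_gen", "")
            # Copy the values so later updates to one sibling do not affect the other.
            state[name] = list(state[source_name])


# Hypothetical parameter names following the *_moe_gen sibling pattern.
state = {
    "layers.0.mlp.down_proj.weight": [1.0, 2.0],
    "layers.0.mlp_moe_gen.down_proj.weight": [0.0, 0.0],
    "norm.weight": [1.0],
    "norm_moe_gen.weight": [0.5],
}
seed_moe_gen(state)
```

Starting the generation branch from the understanding weights gives both branches identical outputs at initialization, after which they are free to diverge.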

forward(*args, **kwargs)#
forward_train(
packed_sequence: torch.Tensor,
sample_lens: List[int],
attention_mask,
packed_position_ids: torch.Tensor,
packed_und_token_indexes: Optional[torch.LongTensor] = None,
packed_gen_token_indexes: Optional[torch.LongTensor] = None,
) torch.Tensor#
forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_ids: torch.Tensor,
packed_query_indexes: torch.Tensor,
past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: Optional[torch.Tensor] = None,
packed_key_value_indexes: Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True,
mode: str = 'und',
packed_vae_token_indexes: Optional[torch.Tensor] = None,
packed_text_indexes: Optional[torch.Tensor] = None,
) nemo_automodel.components.models.bagel.modeling_qwen2_packed.BaseNavitOutputWithPast#
class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM(config: transformers.Qwen2Config)#

Bases: nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2PreTrainedModel

Packed-sequence Qwen2 LM head wrapper.

Initialization

_tied_weights_keys#

['lm_head.weight']

init_moe() None#

Seed *_moe_gen parameters from their UND siblings (Stage 1 cold-start).

get_input_embeddings() torch.nn.Module#
set_input_embeddings(value: torch.nn.Module) None#
get_output_embeddings() torch.nn.Module#
set_output_embeddings(new_embeddings: torch.nn.Module) None#
set_decoder(decoder: torch.nn.Module) None#
get_decoder() torch.nn.Module#
forward(
*args,
**kwargs,
) Union[torch.Tensor, nemo_automodel.components.models.bagel.modeling_qwen2_packed.BaseNavitOutputWithPast]#
forward_train(
packed_sequence: torch.Tensor,
sample_lens: List[int],
attention_mask,
packed_position_ids: torch.Tensor,
packed_und_token_indexes: Optional[torch.LongTensor] = None,
packed_gen_token_indexes: Optional[torch.LongTensor] = None,
) torch.Tensor#
forward_inference(
packed_query_sequence: torch.Tensor,
query_lens: torch.Tensor,
packed_query_position_ids: torch.Tensor,
packed_query_indexes: torch.Tensor,
past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
key_values_lens: Optional[torch.Tensor] = None,
packed_key_value_indexes: Optional[torch.Tensor] = None,
update_past_key_values: bool = True,
is_causal: bool = True,
mode: str = 'und',
packed_vae_token_indexes: Optional[torch.Tensor] = None,
packed_text_indexes: Optional[torch.Tensor] = None,
) nemo_automodel.components.models.bagel.modeling_qwen2_packed.BaseNavitOutputWithPast#