nemo_automodel.components.models.bagel.modeling_qwen2_packed#
Qwen2 language backbone with packed-sequence attention and MoT shell.
Stage 1 uses PackedAttention + Qwen2DecoderLayer. The
PackedAttentionMoT / Qwen2MoTDecoderLayer shells are defined so that
the *_moe_gen parameter siblings exist in the module tree and survive
checkpoint round-tripping; they remain dormant in Stage 1 when
packed_gen_token_indexes is empty.
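The dormant-sibling behavior described above can be illustrated with a minimal sketch. `TinyMoTLinear` and its attribute names are hypothetical stand-ins, not the module's real classes; the sketch only assumes that tokens are routed to the base or `*_moe_gen` projection by index, and that an empty generation index list leaves the sibling unused.

```python
import torch
import torch.nn as nn


class TinyMoTLinear(nn.Module):
    """Hypothetical stand-in: one projection plus its *_moe_gen sibling."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.proj_moe_gen = nn.Linear(dim, dim, bias=False)

    def forward(self, packed, und_idx, gen_idx):
        out = packed.new_zeros(packed.shape)
        # Understanding (UND) tokens go through the base projection.
        out[und_idx] = self.proj(packed[und_idx])
        # Generation tokens go through the sibling; skipped when there are none.
        if gen_idx.numel() > 0:
            out[gen_idx] = self.proj_moe_gen(packed[gen_idx])
        return out


torch.manual_seed(0)
layer = TinyMoTLinear(4)
packed = torch.randn(5, 4)                  # 5 packed tokens, hidden size 4
und_idx = torch.arange(5)                   # every token is an UND token
gen_idx = torch.empty(0, dtype=torch.long)  # Stage 1 case: no generation tokens

out = layer(packed, und_idx, gen_idx)
out.sum().backward()

# The sibling never ran, so it receives no gradient ...
gen_is_dormant = layer.proj_moe_gen.weight.grad is None
# ... but it is still part of the module tree and the checkpoint.
in_state_dict = "proj_moe_gen.weight" in layer.state_dict()
```

In this toy setup the sibling parameters are saved and loaded like any others, which is the property the module description relies on for checkpoint round-tripping.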
Module Contents#
Classes#
Qwen2 RMSNorm (equivalent to T5LayerNorm). |
|
Qwen2 rotary embedding: delegates inv_freq init to HF ROPE_INIT_FUNCTIONS. |
|
SwiGLU MLP used in Qwen2 (gate_proj * up_proj -> down_proj). |
|
Dict-backed KV cache, one entry per layer (BAGEL inference helper). |
|
BAGEL packed decoder output with optional past key-value cache. |
|
Common init for PackedAttention / PackedAttentionMoT (QKV shapes, RoPE, QK-norm). |
|
BAGEL's packed-sequence attention (UND path, no MoT). |
|
MoT variant: adds *_moe_gen siblings of every projection and QK-norm. |
|
Standard (non-MoT) packed Qwen2 decoder block. |
|
MoT decoder: every norm/MLP is duplicated into *_moe_gen siblings. |
|
Abstract base class; mirrors HF Qwen2PreTrainedModel flags. |
|
Packed-sequence Qwen2 backbone. |
|
Packed-sequence Qwen2 LM head wrapper. |
Functions#
Return a dict with rope_theta / scaling info, handling transformers 4.x and 5.x. |
|
Local "default" RoPE init; transformers 5.x dropped it from ROPE_INIT_FUNCTIONS. |
|
Rotates half the hidden dims of the input. |
|
Apply Rotary Position Embedding to query and key tensors. |
|
Data#
API#
- nemo_automodel.components.models.bagel.modeling_qwen2_packed.__all__#
['Qwen2RMSNorm', 'Qwen2RotaryEmbedding', 'Qwen2MLP', 'PackedAttention', 'PackedAttentionMoT', 'Qwen2…
- nemo_automodel.components.models.bagel.modeling_qwen2_packed._flash_attn_varlen(*args, **kwargs)#
- nemo_automodel.components.models.bagel.modeling_qwen2_packed._flex_attention#
None
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2RMSNorm(hidden_size: int, eps: float = 1e-06)#
Bases: torch.nn.Module
Qwen2 RMSNorm (equivalent to T5LayerNorm).
Initialization
- forward(hidden_states: torch.Tensor) torch.Tensor#
- extra_repr() str#
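For reference, RMSNorm scales each token by the reciprocal root-mean-square of its features and then applies a learned per-feature weight. The function below is a minimal sketch of that formula with a hypothetical name; the float32 upcast follows the common Hugging Face pattern and is an assumption about this class, not something the page states.

```python
import torch


def rms_norm_sketch(hidden_states: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """y = weight * x / sqrt(mean(x^2) + eps), computed over the last dimension."""
    input_dtype = hidden_states.dtype
    x = hidden_states.to(torch.float32)  # assumed upcast for numerical stability
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    x = x * torch.rsqrt(variance + eps)
    return weight * x.to(input_dtype)


x = torch.tensor([[3.0, 4.0]])
y = rms_norm_sketch(x, weight=torch.ones(2))
# With a unit weight, the output has a root-mean-square of (almost exactly) 1.
rms_after = y.pow(2).mean(dim=-1).sqrt()
```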
- nemo_automodel.components.models.bagel.modeling_qwen2_packed._extract_rope_config(config: transformers.Qwen2Config) dict#
Return a dict with rope_theta / scaling info, handling transformers 4.x and 5.x.
DIVERGENCE: upstream BAGEL was written against transformers 4.4x, where Qwen2Config exposes rope_theta and rope_scaling as top-level attributes. transformers 5.x moves these into a single rope_parameters dict on Qwen2Config (Llama still keeps the old layout). AM's container runs transformers 5.x, so we normalize here instead of hard-coding one schema.
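The normalization described above might look roughly like the following sketch. The function name, the returned key names, the keys inside `rope_parameters`, and the fallback value of 10000.0 are all assumptions made for illustration; `SimpleNamespace` objects stand in for real `Qwen2Config` instances.

```python
from types import SimpleNamespace


def extract_rope_config_sketch(config) -> dict:
    """Normalize both config layouts into one dict (hypothetical helper)."""
    params = getattr(config, "rope_parameters", None)
    if isinstance(params, dict):
        # 5.x-style layout: everything lives in a single dict.
        scaling = {k: v for k, v in params.items() if k != "rope_theta"}
        return {
            "rope_theta": params.get("rope_theta", 10000.0),
            "rope_scaling": scaling or None,
        }
    # 4.x-style layout: top-level attributes on the config.
    return {
        "rope_theta": getattr(config, "rope_theta", 10000.0),
        "rope_scaling": getattr(config, "rope_scaling", None),
    }


old_style = SimpleNamespace(rope_theta=1000000.0, rope_scaling=None)
new_style = SimpleNamespace(rope_parameters={"rope_theta": 1000000.0, "rope_type": "default"})

old_cfg = extract_rope_config_sketch(old_style)
new_cfg = extract_rope_config_sketch(new_style)
```

Both calls report the same `rope_theta`, so downstream code does not need to know which transformers version produced the config.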
- nemo_automodel.components.models.bagel.modeling_qwen2_packed._compute_default_rope_parameters(
- config: transformers.Qwen2Config,
- device: Optional[torch.device] = None,
- seq_len: Optional[int] = None,
Local "default" RoPE init; transformers 5.x dropped it from ROPE_INIT_FUNCTIONS.
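The standard (unscaled) RoPE initialization computes one inverse frequency per pair of head dimensions. The pure-Python sketch below shows only that formula; the function name is hypothetical, and the real helper takes a config, device, and sequence length and may return additional values.

```python
from typing import List


def default_inv_freq_sketch(head_dim: int, base: float = 10000.0) -> List[float]:
    """inv_freq[j] = 1 / base^(2j / head_dim) for j = 0 .. head_dim/2 - 1."""
    return [1.0 / (base ** (i / head_dim)) for i in range(0, head_dim, 2)]


inv_freq = default_inv_freq_sketch(head_dim=8)
# Frequencies decay geometrically from 1.0; with base 10000 and head_dim 8
# each step divides the previous value by 10.
```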
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2RotaryEmbedding(
- config: transformers.Qwen2Config,
- device: Optional[torch.device] = None,
Bases: torch.nn.Module
Qwen2 rotary embedding: delegates inv_freq init to HF ROPE_INIT_FUNCTIONS.
DIVERGENCE: transformers 5.x removed "default" from ROPE_INIT_FUNCTIONS, so we fall back to a local copy of the pre-5.x default implementation when the rope_type is unspecified.
Initialization
- _dynamic_frequency_update(
- position_ids: torch.Tensor,
- device: torch.device,
- forward(
- x: torch.Tensor,
- position_ids: torch.Tensor,
- nemo_automodel.components.models.bagel.modeling_qwen2_packed.rotate_half(x: torch.Tensor) torch.Tensor#
Rotates half the hidden dims of the input.
- nemo_automodel.components.models.bagel.modeling_qwen2_packed.apply_rotary_pos_emb(
- q: torch.Tensor,
- k: torch.Tensor,
- cos: torch.Tensor,
- sin: torch.Tensor,
- position_ids: Optional[torch.Tensor] = None,
- unsqueeze_dim: int = 1,
Apply Rotary Position Embedding to query and key tensors.
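The two helpers above follow the usual rotate-half formulation of RoPE. The sketch below uses hypothetical function names and assumes a packed layout of `[tokens, heads, head_dim]` for `q` and `k` with `cos`/`sin` of shape `[tokens, head_dim]`; the `position_ids` argument from the documented signature is omitted here.

```python
import torch


def rotate_half_sketch(x: torch.Tensor) -> torch.Tensor:
    """Map (x1, x2) -> (-x2, x1), where x1/x2 are the two halves of the last dim."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat((-x2, x1), dim=-1)


def apply_rope_sketch(q, k, cos, sin, unsqueeze_dim: int = 1):
    """Rotate q and k; cos/sin get a broadcast axis over the heads dimension."""
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_rot = q * cos + rotate_half_sketch(q) * sin
    k_rot = k * cos + rotate_half_sketch(k) * sin
    return q_rot, k_rot


torch.manual_seed(0)
tokens, heads, head_dim = 3, 2, 4
q = torch.randn(tokens, heads, head_dim)
k = torch.randn(tokens, heads, head_dim)

positions = torch.arange(tokens, dtype=torch.float32)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
freqs = torch.outer(positions, inv_freq)   # [tokens, head_dim / 2]
emb = torch.cat((freqs, freqs), dim=-1)    # [tokens, head_dim]
cos, sin = emb.cos(), emb.sin()

q_rot, k_rot = apply_rope_sketch(q, k, cos, sin)
rotated = rotate_half_sketch(torch.tensor([1.0, 2.0, 3.0, 4.0]))
```

Because each dimension pair is rotated by an angle, vector norms are preserved and the token at position 0 is left unchanged.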
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2MLP(config: transformers.Qwen2Config)#
Bases: torch.nn.Module
SwiGLU MLP used in Qwen2 (gate_proj * up_proj -> down_proj).
Initialization
- forward(hidden_state: torch.Tensor) torch.Tensor#
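The shorthand "gate_proj * up_proj -> down_proj" can be spelled out as follows. This is a minimal sketch with a hypothetical class name; it assumes SiLU as the gate activation (the usual Qwen2 choice) and bias-free linear layers, and takes sizes directly instead of a `Qwen2Config`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUSketch(nn.Module):
    """down_proj(silu(gate_proj(x)) * up_proj(x))"""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        gated = F.silu(self.gate_proj(hidden_state)) * self.up_proj(hidden_state)
        return self.down_proj(gated)


mlp = SwiGLUSketch(hidden_size=8, intermediate_size=16)
out = mlp(torch.randn(5, 8))  # 5 packed tokens in, 5 packed tokens out
```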
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache(num_layers: int)#
Dict-backed KV cache, one entry per layer (BAGEL inference helper).
Initialization
- property num_layers: int#
- property seq_lens: int#
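A dict-backed cache with one slot per layer could be organized as in the sketch below. The class name, the `key_cache` / `value_cache` attributes, and the `append` method are hypothetical; plain lists of per-token entries stand in for key/value tensors, and only the two documented properties are mirrored.

```python
from typing import Dict, List, Optional


class DictCacheSketch:
    """One key entry and one value entry per layer, empty (None) until first use."""

    def __init__(self, num_layers: int) -> None:
        self.key_cache: Dict[int, Optional[List[float]]] = {i: None for i in range(num_layers)}
        self.value_cache: Dict[int, Optional[List[float]]] = {i: None for i in range(num_layers)}

    @property
    def num_layers(self) -> int:
        return len(self.key_cache)

    @property
    def seq_lens(self) -> int:
        # Number of cached tokens, read from layer 0 (0 while the cache is empty).
        first = self.key_cache[0]
        return 0 if first is None else len(first)

    def append(self, layer_idx: int, keys: List[float], values: List[float]) -> None:
        # Extend the layer's entries, creating them on first use.
        self.key_cache[layer_idx] = (self.key_cache[layer_idx] or []) + keys
        self.value_cache[layer_idx] = (self.value_cache[layer_idx] or []) + values


cache = DictCacheSketch(num_layers=2)
empty_len = cache.seq_lens
cache.append(0, keys=[0.1, 0.2, 0.3], values=[1.0, 2.0, 3.0])
filled_len = cache.seq_lens
```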
Bases: transformers.utils.ModelOutput
BAGEL packed decoder output with optional past key-value cache.
None
None
- nemo_automodel.components.models.bagel.modeling_qwen2_packed._pad_sequence(tensor: torch.Tensor, pad_size: int) torch.Tensor#
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed._PackedAttentionBase(
- config: transformers.Qwen2Config,
- layer_idx: Optional[int] = None,
Bases: torch.nn.Module
Common init for PackedAttention / PackedAttentionMoT (QKV shapes, RoPE, QK-norm).
Initialization
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttention(
- config: transformers.Qwen2Config,
- layer_idx: Optional[int] = None,
Bases: nemo_automodel.components.models.bagel.modeling_qwen2_packed._PackedAttentionBase
BAGEL's packed-sequence attention (UND path, no MoT).
Initialization
- forward(*args, **kwargs)#
- forward_train(
- packed_sequence: torch.Tensor,
- sample_lens: List[int],
- attention_mask,
- packed_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
- forward_inference(
- packed_query_sequence: torch.Tensor,
- query_lens: torch.Tensor,
- packed_query_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
- packed_query_indexes: torch.Tensor,
- past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
- key_values_lens: Optional[torch.Tensor] = None,
- packed_key_value_indexes: Optional[torch.Tensor] = None,
- update_past_key_values: bool = True,
- is_causal: bool = True,
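In a packed sequence, several samples are concatenated along one token axis and `sample_lens` records where each one ends. The sketch below shows the bookkeeping that variable-length attention kernels commonly rely on: cumulative offsets and a per-token sample id. The helper names are hypothetical, and the rule of causal attention within a sample is an illustrative assumption; in the documented API the mask is supplied separately through `attention_mask`.

```python
from itertools import accumulate
from typing import List


def cumulative_offsets(sample_lens: List[int]) -> List[int]:
    """Start offset of every sample, plus the total length as the last entry."""
    return [0] + list(accumulate(sample_lens))


def token_sample_ids(sample_lens: List[int]) -> List[int]:
    """Which sample each packed token belongs to."""
    return [i for i, n in enumerate(sample_lens) for _ in range(n)]


def can_attend(query_pos: int, key_pos: int, sample_ids: List[int]) -> bool:
    """Assumed rule: same sample only, and no attending to later tokens."""
    return sample_ids[query_pos] == sample_ids[key_pos] and key_pos <= query_pos


sample_lens = [3, 2, 4]
offsets = cumulative_offsets(sample_lens)
ids = token_sample_ids(sample_lens)
```

Here token 3 (the first token of the second sample) cannot attend to token 2 even though it comes earlier, because the two belong to different samples.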
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed.PackedAttentionMoT(
- config: transformers.Qwen2Config,
- layer_idx: Optional[int] = None,
Bases: nemo_automodel.components.models.bagel.modeling_qwen2_packed._PackedAttentionBase
MoT variant: adds *_moe_gen siblings of every projection and QK-norm.
Initialization
- forward(*args, **kwargs)#
- forward_train(
- packed_sequence: torch.Tensor,
- sample_lens: List[int],
- attention_mask,
- packed_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
- packed_und_token_indexes: torch.LongTensor,
- packed_gen_token_indexes: torch.LongTensor,
- forward_inference(
- packed_query_sequence: torch.Tensor,
- query_lens: torch.Tensor,
- packed_query_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
- packed_query_indexes: torch.Tensor,
- past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
- key_values_lens: Optional[torch.Tensor] = None,
- packed_key_value_indexes: Optional[torch.Tensor] = None,
- update_past_key_values: bool = True,
- is_causal: bool = True,
- mode: str = 'und',
- packed_vae_token_indexes: Optional[torch.Tensor] = None,
- packed_text_indexes: Optional[torch.Tensor] = None,
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2DecoderLayer(
- config: transformers.Qwen2Config,
- layer_idx: Optional[int] = None,
Bases: torch.nn.Module
Standard (non-MoT) packed Qwen2 decoder block.
Initialization
- forward(*args, **kwargs)#
- forward_train(
- packed_sequence: torch.Tensor,
- sample_lens: List[int],
- attention_mask,
- packed_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
- forward_inference(
- packed_query_sequence: torch.Tensor,
- query_lens: torch.Tensor,
- packed_query_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
- packed_query_indexes: torch.Tensor,
- past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
- key_values_lens: Optional[torch.Tensor] = None,
- packed_key_value_indexes: Optional[torch.Tensor] = None,
- update_past_key_values: bool = True,
- is_causal: bool = True,
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2MoTDecoderLayer(
- config: transformers.Qwen2Config,
- layer_idx: Optional[int] = None,
- attn_module: type = PackedAttentionMoT,
Bases: torch.nn.Module
MoT decoder: every norm/MLP is duplicated into *_moe_gen siblings.
Initialization
- forward(*args, **kwargs)#
- forward_train(
- packed_sequence: torch.Tensor,
- sample_lens: List[int],
- attention_mask,
- packed_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
- packed_und_token_indexes: torch.LongTensor,
- packed_gen_token_indexes: torch.LongTensor,
- forward_inference(
- packed_query_sequence: torch.Tensor,
- query_lens: torch.Tensor,
- packed_query_position_embeddings: Tuple[torch.Tensor, torch.Tensor],
- packed_query_indexes: torch.Tensor,
- past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
- key_values_lens: Optional[torch.Tensor] = None,
- packed_key_value_indexes: Optional[torch.Tensor] = None,
- update_past_key_values: bool = True,
- is_causal: bool = True,
- mode: str = 'und',
- packed_vae_token_indexes: Optional[torch.Tensor] = None,
- packed_text_indexes: Optional[torch.Tensor] = None,
- nemo_automodel.components.models.bagel.modeling_qwen2_packed._DECODER_LAYER_DICT#
None
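The rendered value of this registry is not shown on the page. A string-to-class lookup of the kind that `config.layer_module` implies could be sketched as below; the dictionary keys, the placeholder classes, and the `use_moe` rule are assumptions for illustration only.

```python
from typing import List, Tuple


class DecoderLayerSketch:
    """Placeholder for the standard decoder block."""

    def __init__(self, layer_idx: int) -> None:
        self.layer_idx = layer_idx


class MoTDecoderLayerSketch(DecoderLayerSketch):
    """Placeholder for the MoT decoder block."""


# Hypothetical registry: config string -> layer class.
LAYER_REGISTRY = {
    "Qwen2DecoderLayer": DecoderLayerSketch,
    "Qwen2MoTDecoderLayer": MoTDecoderLayerSketch,
}


def build_layers(layer_module: str, num_layers: int) -> Tuple[List[DecoderLayerSketch], bool]:
    """Instantiate the selected class once per layer and report whether it is MoT."""
    layer_cls = LAYER_REGISTRY[layer_module]
    layers = [layer_cls(layer_idx=i) for i in range(num_layers)]
    use_moe = layer_cls is MoTDecoderLayerSketch
    return layers, use_moe


layers, use_moe = build_layers("Qwen2MoTDecoderLayer", num_layers=2)
```

Keeping the mapping in a module-level dict means a checkpoint's config can select the block type by name without importing the class directly.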
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2PreTrainedModel#
Bases: transformers.modeling_utils.PreTrainedModel
Abstract base class; mirrors HF Qwen2PreTrainedModel flags.
- config_class#
None
- base_model_prefix#
'model'
- supports_gradient_checkpointing#
True
- _no_split_modules#
['Qwen2DecoderLayer', 'Qwen2MoTDecoderLayer']
- _skip_keys_device_placement#
'past_key_values'
- _supports_flash_attn_2#
True
- _supports_cache_class#
True
- _init_weights(module: torch.nn.Module) None#
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2Model(config: transformers.Qwen2Config)#
Bases: nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2PreTrainedModel
Packed-sequence Qwen2 backbone.
Selects Qwen2DecoderLayer or Qwen2MoTDecoderLayer per-layer based on config.layer_module (string -> class). When the MoT variant is active, self.use_moe == True and an extra norm_moe_gen sibling is created for the final RMSNorm.
Initialization
- init_moe() None#
Copy UND weights into MoE-gen siblings (Stage 1 cold-start seeding).
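One way to seed the siblings is to walk the named parameters and copy each `*_moe_gen` tensor from the parameter whose name matches once the suffix is removed. The sketch below uses a hypothetical toy module and helper; the name-matching rule is an assumption about how the real `init_moe` pairs parameters.

```python
from typing import List

import torch
import torch.nn as nn


class TinyBlock(nn.Module):
    """Toy module with one UND layer and its *_moe_gen sibling."""

    def __init__(self) -> None:
        super().__init__()
        self.mlp = nn.Linear(4, 4)
        self.mlp_moe_gen = nn.Linear(4, 4)


def seed_moe_gen_sketch(model: nn.Module) -> List[str]:
    """Copy every UND parameter into its *_moe_gen sibling; return the names copied."""
    state = model.state_dict()
    copied = []
    with torch.no_grad():
        for name, param in model.named_parameters():
            if "_moe_gen" in name:
                source_name = name.replace("_moe_gen", "")
                param.copy_(state[source_name])
                copied.append(name)
    return copied


torch.manual_seed(0)
block = TinyBlock()
copied = seed_moe_gen_sketch(block)
```

After seeding, both branches start from identical weights, so the generation branch begins as a copy of the understanding branch and only diverges once it receives gradients.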
- forward(*args, **kwargs)#
- forward_train(
- packed_sequence: torch.Tensor,
- sample_lens: List[int],
- attention_mask,
- packed_position_ids: torch.Tensor,
- packed_und_token_indexes: Optional[torch.LongTensor] = None,
- packed_gen_token_indexes: Optional[torch.LongTensor] = None,
- forward_inference(
- packed_query_sequence: torch.Tensor,
- query_lens: torch.Tensor,
- packed_query_position_ids: torch.Tensor,
- packed_query_indexes: torch.Tensor,
- past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
- key_values_lens: Optional[torch.Tensor] = None,
- packed_key_value_indexes: Optional[torch.Tensor] = None,
- update_past_key_values: bool = True,
- is_causal: bool = True,
- mode: str = 'und',
- packed_vae_token_indexes: Optional[torch.Tensor] = None,
- packed_text_indexes: Optional[torch.Tensor] = None,
- class nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2ForCausalLM(config: transformers.Qwen2Config)#
Bases: nemo_automodel.components.models.bagel.modeling_qwen2_packed.Qwen2PreTrainedModel
Packed-sequence Qwen2 LM head wrapper.
Initialization
- _tied_weights_keys#
['lm_head.weight']
- init_moe() None#
Seed *_moe_gen parameters from their UND siblings (Stage 1 cold-start).
- get_input_embeddings() torch.nn.Module#
- set_input_embeddings(value: torch.nn.Module) None#
- get_output_embeddings() torch.nn.Module#
- set_output_embeddings(new_embeddings: torch.nn.Module) None#
- set_decoder(decoder: torch.nn.Module) None#
- get_decoder() torch.nn.Module#
- forward(
- *args,
- **kwargs,
- forward_train(
- packed_sequence: torch.Tensor,
- sample_lens: List[int],
- attention_mask,
- packed_position_ids: torch.Tensor,
- packed_und_token_indexes: Optional[torch.LongTensor] = None,
- packed_gen_token_indexes: Optional[torch.LongTensor] = None,
- forward_inference(
- packed_query_sequence: torch.Tensor,
- query_lens: torch.Tensor,
- packed_query_position_ids: torch.Tensor,
- packed_query_indexes: torch.Tensor,
- past_key_values: Optional[nemo_automodel.components.models.bagel.modeling_qwen2_packed.NaiveCache] = None,
- key_values_lens: Optional[torch.Tensor] = None,
- packed_key_value_indexes: Optional[torch.Tensor] = None,
- update_past_key_values: bool = True,
- is_causal: bool = True,
- mode: str = 'und',
- packed_vae_token_indexes: Optional[torch.Tensor] = None,
- packed_text_indexes: Optional[torch.Tensor] = None,