bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention#
Custom SelfAttention for ERNIE 4.5 VL vision encoder.
Overrides the standard MCore SelfAttention to apply absolute 2D RoPE positional embeddings instead of the standard relative RoPE.
ERNIE ViT uses non-interleaved RoPE (rotate_half style, splitting at
the midpoint: [-x2, x1]), corresponding to rotary_interleaved=False
in MCore. The RoPE frequencies are pre-computed as absolute position
embeddings based on 2D (height, width) grid coordinates.
This approach mirrors Qwen3VLSelfAttention but with ERNIE-specific non-interleaved rotation.
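The difference between the two rotation layouts can be illustrated with a small sketch. It uses plain Python lists rather than tensors so the element order is easy to read; `rotate_half` and `rotate_interleaved` are illustrative helper names, not functions exported by this module.

```python
def rotate_half(x):
    """Non-interleaved rotation: split at the midpoint and return [-x2, x1]."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + list(x1)


def rotate_interleaved(x):
    """Interleaved rotation, shown only for contrast: each pair (a, b) becomes (-b, a)."""
    out = []
    for i in range(0, len(x), 2):
        out.extend([-x[i + 1], x[i]])
    return out


x = [1.0, 2.0, 3.0, 4.0]
print(rotate_half(x))         # [-3.0, -4.0, 1.0, 2.0]
print(rotate_interleaved(x))  # [-2.0, 1.0, -4.0, 3.0]
```

With rotary_interleaved=False, element i is paired with element i + head_dim/2, so both halves of the frequency vector must carry the same angles.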
Module Contents#
Classes#
ErnieVLSelfAttention | SelfAttention with absolute 2D RoPE for ERNIE ViT.
Functions#
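_apply_rotary_pos_emb_thd_absolute | Apply RoPE to thd (packed) format tensors using absolute position embeddings.
apply_rotary_pos_emb_absolute | Apply absolute RoPE, routing to bshd or thd format as appropriate.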
API#
- bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention._apply_rotary_pos_emb_thd_absolute(
    t: torch.Tensor,
    cu_seqlens: torch.Tensor,
    freqs: torch.Tensor,
    rotary_interleaved: bool = False,
  )
Apply RoPE to thd (packed) format tensors using absolute position embeddings.
- Parameters:
t – Input tensor of shape [total_tokens, num_heads, head_dim].
cu_seqlens – Cumulative sequence lengths (currently unused, kept for API consistency).
freqs – Rotary embedding frequencies of shape [total_tokens, 1, 1, head_dim].
rotary_interleaved – Whether to use interleaved rotation.
- Returns:
Tensor of shape [total_tokens, num_heads, head_dim] with RoPE applied.
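As a rough illustration of the shapes documented above, the following PyTorch sketch applies non-interleaved RoPE to a packed tensor. It is not the module's implementation: `apply_rope_thd_sketch` and `rotate_half` are hypothetical names, the interleaved case is left out, and cu_seqlens is omitted because the documentation states it is unused.

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Non-interleaved rotation: split the last dim at the midpoint -> [-x2, x1].
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope_thd_sketch(t: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # t:     [total_tokens, num_heads, head_dim]
    # freqs: [total_tokens, 1, 1, head_dim], one row of angles per packed token
    angles = freqs.squeeze(1)            # -> [total_tokens, 1, head_dim]
    cos = torch.cos(angles).to(t.dtype)  # broadcasts over num_heads
    sin = torch.sin(angles).to(t.dtype)
    return t * cos + rotate_half(t) * sin


total_tokens, num_heads, head_dim = 6, 2, 8
t = torch.randn(total_tokens, num_heads, head_dim)
# Angles for head_dim // 2 channel pairs, tiled 2x to cover the full head_dim.
freqs = torch.randn(total_tokens, 1, 1, head_dim // 2).repeat(1, 1, 1, 2)
out = apply_rope_thd_sketch(t, freqs)
print(out.shape)  # same shape as t
```

Because every token carries its own row of angles, no per-sequence offset is needed, which is consistent with cu_seqlens being kept only for API consistency.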
- bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention.apply_rotary_pos_emb_absolute(
    t: torch.Tensor,
    freqs: torch.Tensor,
    config,
    cu_seqlens: Optional[torch.Tensor] = None,
  )
Apply absolute RoPE, routing to bshd or thd format as appropriate.
For ERNIE ViT, the freqs tensor has shape [total_tokens, 1, 1, head_dim] (absolute position embeddings, where the raw frequencies of shape [head_dim//2] are tiled 2x to cover the full head_dim), unlike standard relative RoPE where freqs is [max_seqlen, 1, 1, rotary_dim].
- Parameters:
t – Input tensor (Q or K).
freqs – Pre-computed RoPE frequencies.
config – TransformerConfig (used for rotary_interleaved flag).
cu_seqlens – If provided, indicates packed sequence (thd) format.
- Returns:
Tensor with RoPE applied, same shape as input.
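The routing described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the function's actual body: `route_absolute_rope_sketch` is a hypothetical name, the config argument is dropped and the non-interleaved rotation is hard-coded, and the unpacked layout is assumed to be [seq_len, batch, num_heads, head_dim].

```python
from typing import Optional

import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Non-interleaved rotation: [-x2, x1] with the split at the midpoint.
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def route_absolute_rope_sketch(
    t: torch.Tensor,
    freqs: torch.Tensor,
    cu_seqlens: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    if cu_seqlens is not None:
        # Packed (thd) input: t is [total_tokens, num_heads, head_dim],
        # so drop one singleton axis from freqs to line the shapes up.
        angles = freqs.squeeze(1)
    else:
        # Unpacked input, assumed to be [seq_len, batch, num_heads, head_dim];
        # freqs [seq_len, 1, 1, head_dim] already broadcasts against it.
        angles = freqs
    cos, sin = torch.cos(angles).to(t.dtype), torch.sin(angles).to(t.dtype)
    return t * cos + rotate_half(t) * sin


tokens, heads, head_dim = 6, 2, 8
raw = torch.randn(tokens, 1, 1, head_dim // 2)  # raw angles, head_dim // 2 per token
freqs = torch.cat((raw, raw), dim=-1)           # tiled 2x to cover head_dim

q_thd = torch.randn(tokens, heads, head_dim)
cu_seqlens = torch.tensor([0, 4, 6], dtype=torch.int32)  # two packed images: 4 + 2 tokens

out_thd = route_absolute_rope_sketch(q_thd, freqs, cu_seqlens)
out_unpacked = route_absolute_rope_sketch(q_thd.unsqueeze(1), freqs)  # batch axis of size 1
```

In this sketch both paths produce the same values for the same tokens; only the tensor layout differs.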
- class bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention.ErnieVLSelfAttention#
Bases: megatron.core.transformer.attention.SelfAttention
SelfAttention with absolute 2D RoPE for ERNIE ViT.
Overrides the standard MCore SelfAttention.forward() to apply apply_rotary_pos_emb_absolute instead of the standard apply_rotary_pos_emb, which expects relative position embeddings.
This is necessary because ERNIE ViT pre-computes absolute 2D (H, W) position embeddings and passes them as rotary_pos_emb through the TransformerBlock, rather than using the standard MCore RoPE infrastructure that computes frequencies from sequential position IDs.
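One plausible way to pre-compute such absolute 2D frequencies is sketched below. Everything specific here is an assumption made for illustration: the name `build_2d_freqs_sketch`, the base `theta=10000.0`, the row-major token order, and the split of the rotary channels into a height part and a width part (a scheme used by some other vision encoders). The actual ERNIE ViT computation may differ.

```python
import torch


def build_2d_freqs_sketch(grid_h: int, grid_w: int, head_dim: int,
                          theta: float = 10000.0) -> torch.Tensor:
    # Assumed layout: head_dim // 4 angles from the row index and
    # head_dim // 4 from the column index, giving head_dim // 2 raw angles.
    quarter = head_dim // 4
    inv_freq = 1.0 / (theta ** (torch.arange(quarter, dtype=torch.float32) / quarter))

    # Row-major (h, w) coordinates for every patch in the grid.
    hs = torch.arange(grid_h).repeat_interleave(grid_w).float()
    ws = torch.arange(grid_w).repeat(grid_h).float()

    raw = torch.cat((torch.outer(hs, inv_freq), torch.outer(ws, inv_freq)), dim=-1)
    freqs = torch.cat((raw, raw), dim=-1)  # tile 2x to cover the full head_dim
    return freqs[:, None, None, :]         # [total_tokens, 1, 1, head_dim]


freqs = build_2d_freqs_sketch(grid_h=2, grid_w=3, head_dim=8)
print(freqs.shape)  # torch.Size([6, 1, 1, 8])
```

A tensor of this shape is what would be passed as rotary_pos_emb; the angles depend only on each patch's grid coordinates, not on its position in the packed sequence.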
- forward(
    hidden_states: torch.Tensor,
    attention_mask: torch.Tensor,
    key_value_states: Optional[torch.Tensor] = None,
    inference_context: Optional[megatron.core.transformer.attention.BaseInferenceContext] = None,
    rotary_pos_emb: Optional[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]] = None,
    rotary_pos_cos: Optional[torch.Tensor] = None,
    rotary_pos_sin: Optional[torch.Tensor] = None,
    attention_bias: Optional[torch.Tensor] = None,
    packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
    sequence_len_offset: Optional[int] = None,
    *,
    inference_params: Optional[megatron.core.transformer.attention.BaseInferenceContext] = None,
    rotary_pos_cos_sin: Optional[torch.Tensor] = None,
  )
Forward pass with absolute 2D RoPE for vision encoder.
The main difference from the parent class is in the RoPE application section: we use apply_rotary_pos_emb_absolute, which handles absolute position embeddings properly for both bshd and thd formats.
- Parameters:
hidden_states – Input tensor [seq_len, batch, hidden_size].
attention_mask – Attention mask (typically None for ViT).
rotary_pos_emb – Pre-computed absolute 2D RoPE frequencies.
packed_seq_params – Parameters for per-image packed sequence attention.
(other args) – See parent class SelfAttention.
- Returns:
Tuple of (output, bias) where output is [seq_len, batch, hidden_size].