bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention#

Custom SelfAttention for the ERNIE 4.5 VL vision encoder.

Overrides the standard MCore SelfAttention to apply absolute 2D RoPE instead of the standard relative RoPE.

ERNIE ViT uses non-interleaved RoPE (rotate_half style: the head dimension is split at the midpoint into halves x1 and x2, and the rotated vector is [-x2, x1]), which corresponds to rotary_interleaved=False in MCore. The RoPE frequencies are pre-computed as absolute position embeddings from 2D (height, width) grid coordinates.

This approach mirrors Qwen3VLSelfAttention but with ERNIE-specific non-interleaved rotation.
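The difference between the two rotation styles can be illustrated with a minimal sketch. The helper names below (rotate_half, rotate_interleaved) are illustrative and are not taken from this module; only the [-x2, x1] midpoint split is stated above.

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Non-interleaved rotation: split the last dim at the midpoint, return [-x2, x1]."""
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def rotate_interleaved(x: torch.Tensor) -> torch.Tensor:
    """Interleaved rotation, shown for contrast: each adjacent pair (a, b) becomes (-b, a)."""
    x_even = x[..., 0::2]
    x_odd = x[..., 1::2]
    return torch.stack((-x_odd, x_even), dim=-1).flatten(-2)


x = torch.tensor([1.0, 2.0, 3.0, 4.0])
half = rotate_half(x)          # tensor([-3., -4.,  1.,  2.])
inter = rotate_interleaved(x)  # tensor([-2.,  1., -4.,  3.])
```

With rotary_interleaved=False, element i is paired with element i + head_dim/2, so the frequency tensor must repeat the same angles in both halves of the head dimension.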

Module Contents#

Classes#

ErnieVLSelfAttention

SelfAttention with absolute 2D RoPE for ERNIE ViT.

Functions#

_apply_rotary_pos_emb_thd_absolute

Apply RoPE to thd (packed) format tensors using absolute position embeddings.

apply_rotary_pos_emb_absolute

Apply absolute RoPE, routing to bshd or thd format as appropriate.

API#

bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention._apply_rotary_pos_emb_thd_absolute(
t: torch.Tensor,
cu_seqlens: torch.Tensor,
freqs: torch.Tensor,
rotary_interleaved: bool = False,
) → torch.Tensor#

Apply RoPE to thd (packed) format tensors using absolute position embeddings.

Parameters:
  • t – Input tensor of shape [total_tokens, num_heads, head_dim].

  • cu_seqlens – Cumulative sequence lengths (currently unused, kept for API consistency).

  • freqs – Rotary embedding frequencies of shape [total_tokens, 1, 1, head_dim].

  • rotary_interleaved – Whether to use interleaved rotation.

Returns:

Tensor of shape [total_tokens, num_heads, head_dim] with RoPE applied.
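The shapes above suggest a simple broadcast: one row of angles per token, shared across all heads. The following is a minimal sketch of the non-interleaved case only, written from the documented shapes rather than from the actual implementation. The function name apply_rope_thd_absolute_sketch is hypothetical, and cu_seqlens is omitted because the documentation notes that it is unused.

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Non-interleaved rotation: [-x2, x1] with the split at the midpoint."""
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope_thd_absolute_sketch(t: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: apply per-token (absolute) RoPE to a packed tensor.

    t:     [total_tokens, num_heads, head_dim]
    freqs: [total_tokens, 1, 1, head_dim]
    """
    # Drop one singleton dim so freqs broadcasts over heads: [total_tokens, 1, head_dim].
    angles = freqs.squeeze(1)
    cos = torch.cos(angles).to(t.dtype)
    sin = torch.sin(angles).to(t.dtype)
    return t * cos + rotate_half(t) * sin


total_tokens, num_heads, head_dim = 6, 2, 8
t = torch.randn(total_tokens, num_heads, head_dim)
# Assumed layout: the same head_dim // 2 angles repeated in both halves.
raw = torch.randn(total_tokens, head_dim // 2)
freqs = torch.cat((raw, raw), dim=-1)[:, None, None, :]
out = apply_rope_thd_absolute_sketch(t, freqs)
```

Because every token carries its own angles, no per-sequence slicing by cu_seqlens is needed, which is consistent with the parameter being unused.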

bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention.apply_rotary_pos_emb_absolute(
t: torch.Tensor,
freqs: torch.Tensor,
config,
cu_seqlens: Optional[torch.Tensor] = None,
) → torch.Tensor#

Apply absolute RoPE, routing to bshd or thd format as appropriate.

For ERNIE ViT, the freqs tensor has shape [total_tokens, 1, 1, head_dim]: one absolute position embedding per token, where the raw per-token frequencies of shape [head_dim//2] are tiled 2x to cover the full head_dim. This differs from standard relative RoPE, where freqs has shape [max_seqlen, 1, 1, rotary_dim].

Parameters:
  • t – Input tensor (Q or K).

  • freqs – Pre-computed RoPE frequencies.

  • config – TransformerConfig (used for rotary_interleaved flag).

  • cu_seqlens – If provided, indicates packed sequence (thd) format.

Returns:

Tensor with RoPE applied, same shape as input.
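The routing described here can be sketched as a single branch on cu_seqlens. This is an illustration under stated assumptions, not the module's code: the function name is hypothetical, the unpacked path is assumed to use the MCore layout [seq_len, batch, num_heads, head_dim], only the non-interleaved case is covered, and a SimpleNamespace stands in for TransformerConfig.

```python
from types import SimpleNamespace
from typing import Optional

import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_absolute_rope_sketch(
    t: torch.Tensor,
    freqs: torch.Tensor,
    config,
    cu_seqlens: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """Hypothetical sketch of routing between the unpacked and packed (thd) paths."""
    if config.rotary_interleaved:
        raise NotImplementedError("this sketch covers only rotary_interleaved=False")
    if cu_seqlens is None:
        # Unpacked path (assumed layout [seq_len, batch, heads, dim]):
        # freqs [seq_len, 1, 1, dim] broadcasts as-is.
        angles = freqs
    else:
        # Packed path ([total_tokens, heads, dim]): drop one singleton dim.
        angles = freqs.squeeze(1)
    return t * angles.cos().to(t.dtype) + rotate_half(t) * angles.sin().to(t.dtype)


config = SimpleNamespace(rotary_interleaved=False)  # stand-in for TransformerConfig
tokens, heads, dim = 5, 2, 8
raw = torch.randn(tokens, dim // 2)
freqs = torch.cat((raw, raw), dim=-1)[:, None, None, :]

t_thd = torch.randn(tokens, heads, dim)
t_unpacked = t_thd[:, None]  # add a batch dim of size 1
out_thd = apply_absolute_rope_sketch(t_thd, freqs, config, cu_seqlens=torch.tensor([0, tokens]))
out_unpacked = apply_absolute_rope_sketch(t_unpacked, freqs, config)
```

In this sketch both paths produce the same values for a single sequence; only the tensor layout differs.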

class bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention.ErnieVLSelfAttention#

Bases: megatron.core.transformer.attention.SelfAttention

SelfAttention with absolute 2D RoPE for ERNIE ViT.

Overrides the standard MCore SelfAttention.forward() so that it calls apply_rotary_pos_emb_absolute instead of the standard apply_rotary_pos_emb, which expects relative position embeddings.

This is necessary because ERNIE ViT pre-computes absolute 2D (H, W) position embeddings and passes them as rotary_pos_emb through the TransformerBlock, rather than using the standard MCore RoPE infrastructure that computes frequencies from sequential position IDs.
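To make the idea of absolute 2D frequencies concrete, here is a minimal sketch of how such a tensor could be built for one image. It follows the Qwen2-VL style of splitting the angle channels between height and width, which is an assumption and not confirmed for ERNIE. The function name, the theta value, and the row-major token order are all illustrative, and any patch-merging reorder is ignored.

```python
import torch


def build_absolute_2d_freqs_sketch(
    grid_h: int, grid_w: int, head_dim: int, theta: float = 10000.0
) -> torch.Tensor:
    """Hypothetical helper: per-token angles from (h, w) grid coordinates.

    Returns a tensor of shape [grid_h * grid_w, 1, 1, head_dim].
    """
    quarter = head_dim // 4
    # Assumed frequency schedule: quarter frequencies each for height and width.
    inv_freq = 1.0 / (theta ** (torch.arange(quarter, dtype=torch.float32) / quarter))
    hh, ww = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    angles_h = torch.outer(hh.flatten().float(), inv_freq)  # [tokens, head_dim // 4]
    angles_w = torch.outer(ww.flatten().float(), inv_freq)  # [tokens, head_dim // 4]
    raw = torch.cat((angles_h, angles_w), dim=-1)            # [tokens, head_dim // 2]
    tiled = torch.cat((raw, raw), dim=-1)                    # tiled 2x -> [tokens, head_dim]
    return tiled[:, None, None, :]


freqs = build_absolute_2d_freqs_sketch(grid_h=2, grid_w=3, head_dim=8)
```

The angles depend only on each patch's (h, w) coordinate, not on its index in the packed sequence, which is why frequencies derived from sequential position IDs do not apply here.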

forward(
hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
key_value_states: Optional[torch.Tensor] = None,
inference_context: Optional[megatron.core.transformer.attention.BaseInferenceContext] = None,
rotary_pos_emb: Optional[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]] = None,
rotary_pos_cos: Optional[torch.Tensor] = None,
rotary_pos_sin: Optional[torch.Tensor] = None,
attention_bias: Optional[torch.Tensor] = None,
packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
sequence_len_offset: Optional[int] = None,
*,
inference_params: Optional[megatron.core.transformer.attention.BaseInferenceContext] = None,
rotary_pos_cos_sin: Optional[torch.Tensor] = None,
) → Tuple[torch.Tensor, torch.Tensor]#

Forward pass with absolute 2D RoPE for vision encoder.

The main difference from the parent class is in the RoPE application step: this method calls apply_rotary_pos_emb_absolute, which applies absolute position embeddings in both bshd and thd formats.

Parameters:
  • hidden_states – Input tensor [seq_len, batch, hidden_size].

  • attention_mask – Attention mask (typically None for ViT).

  • rotary_pos_emb – Pre-computed absolute 2D RoPE frequencies.

  • packed_seq_params – Parameters for per-image packed sequence attention.

  • (other args) – See parent class SelfAttention.

Returns:

Tuple of (output, bias) where output is [seq_len, batch, hidden_size].
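For per-image packed attention, the sequence boundaries are typically expressed as cumulative token counts. The sketch below computes them from per-image patch grids. The helper name is hypothetical, and the assumption that each image contributes h * w tokens (no temporal dimension) is for illustration only. In MCore, values like these would normally populate fields of PackedSeqParams such as cu_seqlens_q and cu_seqlens_kv, which is not shown here.

```python
from typing import List, Tuple

import torch


def build_cu_seqlens_sketch(grid_hw: List[Tuple[int, int]]) -> torch.Tensor:
    """Hypothetical helper: cumulative sequence lengths for packed images.

    grid_hw: one (height, width) patch grid per image.
    Returns an int32 tensor of shape [num_images + 1], starting at 0.
    """
    lengths = torch.tensor([h * w for h, w in grid_hw], dtype=torch.int32)
    cu_seqlens = torch.zeros(len(grid_hw) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, dim=0, dtype=torch.int32)
    return cu_seqlens


# Two images with 2x3 and 4x4 patch grids: 6 and 16 tokens.
cu_seqlens = build_cu_seqlens_sketch([(2, 3), (4, 4)])  # tensor([0, 6, 22])
```

Tokens in the range cu_seqlens[i] to cu_seqlens[i + 1] belong to image i, so attention stays within each image.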