`bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention`#

Custom SelfAttention for ERNIE 4.5 VL vision encoder.

Overrides the standard MCore SelfAttention to apply absolute 2D RoPE positional embeddings instead of the standard relative RoPE.

ERNIE ViT uses non-interleaved RoPE (rotate_half style, splitting at the midpoint: [-x2, x1]), corresponding to rotary_interleaved=False in MCore. The RoPE frequencies are pre-computed as absolute position embeddings based on 2D (height, width) grid coordinates.
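The non-interleaved rotation described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the module's code: the helper names `rotate_half` and `apply_rope` and the toy values are chosen here for the example.

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Non-interleaved rotation: split the last dim at the midpoint, return [-x2, x1]."""
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(t: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate t by the angles in freqs (freqs must broadcast to t's shape)."""
    return t * torch.cos(freqs) + rotate_half(t) * torch.sin(freqs)


# Toy vector with head_dim = 4 (illustrative values only).
t = torch.tensor([1.0, 2.0, 3.0, 4.0])
# rotate_half(t) -> tensor([-3., -4., 1., 2.])

# Raw angles cover head_dim // 2 entries and are tiled 2x, so element i
# and element i + head_dim // 2 are rotated together by the same angle.
raw = torch.tensor([0.5, 0.25])
freqs = torch.cat((raw, raw), dim=-1)
rotated = apply_rope(t, freqs)
```

Because each pair `(t[i], t[i + head_dim // 2])` undergoes a plain 2D rotation, the vector norm is unchanged, which is a convenient sanity check when porting weights between implementations.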

This approach mirrors Qwen3VLSelfAttention but with ERNIE-specific non-interleaved rotation.

Module Contents#

Classes#

ErnieVLSelfAttention

SelfAttention with absolute 2D RoPE for ERNIE ViT.

Functions#

`_apply_rotary_pos_emb_thd_absolute`	Apply RoPE to `thd` (packed) format tensors using absolute position embeddings.
`apply_rotary_pos_emb_absolute`	Apply absolute RoPE, routing to bshd or thd format as appropriate.

API#

bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention._apply_rotary_pos_emb_thd_absolute( t: torch.Tensor, cu_seqlens: torch.Tensor, freqs: torch.Tensor, rotary_interleaved: bool = False, ) → torch.Tensor#

Apply RoPE to thd (packed) format tensors using absolute position embeddings.

Parameters:

t – Input tensor of shape [total_tokens, num_heads, head_dim].
cu_seqlens – Cumulative sequence lengths (currently unused, kept for API consistency).
freqs – Rotary embedding frequencies of shape [total_tokens, 1, 1, head_dim].
rotary_interleaved – Whether to use interleaved rotation. Defaults to False, the non-interleaved (rotate_half) style that ERNIE ViT uses.

Returns:

Tensor of shape [total_tokens, num_heads, head_dim] with RoPE applied.
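The following sketch shows, under stated assumptions, why `cu_seqlens` can go unused in the packed case: with absolute embeddings, every token already carries its own row of `freqs`, so no per-sequence position offset has to be recovered. The function `apply_rope_thd_absolute_sketch` is a hypothetical stand-in that covers only the non-interleaved case. The real function may differ in detail.

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Non-interleaved rotation: [-x2, x1] with a midpoint split."""
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope_thd_absolute_sketch(t: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in: t is [total_tokens, num_heads, head_dim],
    freqs is [total_tokens, 1, 1, head_dim]."""
    # Drop one singleton axis so freqs broadcasts over the head dimension.
    freqs = freqs.squeeze(1)  # [total_tokens, 1, head_dim]
    cos = torch.cos(freqs).to(t.dtype)
    sin = torch.sin(freqs).to(t.dtype)
    return t * cos + rotate_half(t) * sin


torch.manual_seed(0)
total_tokens, num_heads, head_dim = 5, 2, 4
t = torch.randn(total_tokens, num_heads, head_dim)
raw = torch.randn(total_tokens, 1, 1, head_dim // 2)
freqs = torch.cat((raw, raw), dim=-1)  # tiled 2x to full head_dim

# Two packed "images" with 3 and 2 tokens.
cu_seqlens = torch.tensor([0, 3, 5], dtype=torch.int32)

# Applying RoPE to the whole pack at once ...
packed = apply_rope_thd_absolute_sketch(t, freqs)

# ... matches applying it sequence by sequence using cu_seqlens.
bounds = cu_seqlens.tolist()
per_image = torch.cat(
    [
        apply_rope_thd_absolute_sketch(t[s:e], freqs[s:e])
        for s, e in zip(bounds[:-1], bounds[1:])
    ],
    dim=0,
)
```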

bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention.apply_rotary_pos_emb_absolute( t: torch.Tensor, freqs: torch.Tensor, config, cu_seqlens: Optional[torch.Tensor] = None, ) → torch.Tensor#

Apply absolute RoPE, routing to bshd or thd format as appropriate.

For ERNIE ViT, freqs holds absolute position embeddings of shape [total_tokens, 1, 1, head_dim]: the raw frequencies of shape [head_dim//2] are tiled 2x to cover the full head_dim. Standard relative RoPE instead uses freqs of shape [max_seqlen, 1, 1, rotary_dim], indexed by sequential position.
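To make the shapes concrete, here is one way such a tensor could be built from a (height, width) patch grid. Several details are assumptions made for illustration and are not confirmed by this page: the function name `build_absolute_2d_freqs`, the base `theta=10000.0`, row-major token order, and the choice to take the first half of the raw angles from the row index and the second half from the column index.

```python
import torch


def build_absolute_2d_freqs(
    grid_h: int, grid_w: int, head_dim: int, theta: float = 10000.0
) -> torch.Tensor:
    """Hypothetical builder for [grid_h * grid_w, 1, 1, head_dim] absolute angles."""
    assert head_dim % 4 == 0, "this sketch needs head_dim divisible by 4"
    dim = head_dim // 2  # number of raw angles per token
    # head_dim // 4 inverse frequencies, shared by the row and column axes.
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

    # Row-major (h, w) coordinates for every patch token (assumed ordering).
    rows = torch.arange(grid_h).repeat_interleave(grid_w).float()
    cols = torch.arange(grid_w).repeat(grid_h).float()

    row_angles = torch.outer(rows, inv_freq)  # [H*W, head_dim // 4]
    col_angles = torch.outer(cols, inv_freq)  # [H*W, head_dim // 4]
    raw = torch.cat((row_angles, col_angles), dim=-1)  # [H*W, head_dim // 2]

    # Tile 2x so the angles cover the full head_dim (rotate_half layout).
    tiled = torch.cat((raw, raw), dim=-1)  # [H*W, head_dim]
    return tiled[:, None, None, :]  # [H*W, 1, 1, head_dim]


freqs = build_absolute_2d_freqs(grid_h=2, grid_w=3, head_dim=8)
```

The first axis counts tokens actually present in the batch rather than positions up to a maximum sequence length, which is the key difference from the relative RoPE layout.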

Parameters:

t – Input tensor (Q or K).
freqs – Pre-computed RoPE frequencies.
config – TransformerConfig (used for rotary_interleaved flag).
cu_seqlens – If provided, indicates packed sequence (thd) format.

Returns:

Tensor with RoPE applied, same shape as input.
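The routing can be pictured roughly as follows. This is a sketch under assumptions, not the module's implementation: `apply_absolute_rope_sketch` is a hypothetical name, the `SimpleNamespace` stands in for a TransformerConfig, only the non-interleaved case is handled, and the unpacked input is assumed to be 4-D with layout [seq_len, batch, num_heads, head_dim].

```python
from types import SimpleNamespace
from typing import Optional

import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Non-interleaved rotation: [-x2, x1] with a midpoint split."""
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_absolute_rope_sketch(
    t: torch.Tensor,
    freqs: torch.Tensor,
    config,
    cu_seqlens: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """Hypothetical router: packed (thd) input when cu_seqlens is given, 4-D otherwise."""
    if config.rotary_interleaved:
        raise NotImplementedError("this sketch covers only the non-interleaved case")
    if cu_seqlens is not None:
        # thd: t is [total_tokens, num_heads, head_dim], so drop one
        # singleton axis from freqs to make the shapes line up.
        freqs = freqs.squeeze(1)
    # Otherwise freqs [seq_len, 1, 1, head_dim] broadcasts over batch and heads.
    cos = torch.cos(freqs).to(t.dtype)
    sin = torch.sin(freqs).to(t.dtype)
    return t * cos + rotate_half(t) * sin


torch.manual_seed(0)
config = SimpleNamespace(rotary_interleaved=False)  # stand-in for TransformerConfig
raw = torch.randn(5, 1, 1, 2)
freqs = torch.cat((raw, raw), dim=-1)  # [total_tokens, 1, 1, head_dim]

q_thd = torch.randn(5, 2, 4)  # packed: [total_tokens, num_heads, head_dim]
q_4d = q_thd.unsqueeze(1)  # unpacked with batch size 1
cu_seqlens = torch.tensor([0, 3, 5], dtype=torch.int32)

out_thd = apply_absolute_rope_sketch(q_thd, freqs, config, cu_seqlens=cu_seqlens)
out_4d = apply_absolute_rope_sketch(q_4d, freqs, config)
```

In both branches the output has the same shape as the input, and for a batch size of 1 the two paths give the same values once the batch axis is removed.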

class bridge.models.ernie_vl.modeling_ernie45_vl.vision_attention.ErnieVLSelfAttention#

Bases: megatron.core.transformer.attention.SelfAttention

SelfAttention with absolute 2D RoPE for ERNIE ViT.

Overrides the standard MCore SelfAttention.forward() to call apply_rotary_pos_emb_absolute instead of the standard apply_rotary_pos_emb, which expects relative position embeddings.

This is necessary because ERNIE ViT pre-computes absolute 2D (H, W) position embeddings and passes them as rotary_pos_emb through the TransformerBlock, rather than using the standard MCore RoPE infrastructure that computes frequencies from sequential position IDs.

forward( hidden_states: torch.Tensor, attention_mask: torch.Tensor, key_value_states: Optional[torch.Tensor] = None, inference_context: Optional[megatron.core.transformer.attention.BaseInferenceContext] = None, rotary_pos_emb: Optional[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]] = None, rotary_pos_cos: Optional[torch.Tensor] = None, rotary_pos_sin: Optional[torch.Tensor] = None, attention_bias: Optional[torch.Tensor] = None, packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None, sequence_len_offset: Optional[int] = None, *, inference_params: Optional[megatron.core.transformer.attention.BaseInferenceContext] = None, rotary_pos_cos_sin: Optional[torch.Tensor] = None, ) → Tuple[torch.Tensor, torch.Tensor]#

Forward pass with absolute 2D RoPE for vision encoder.

The main difference from the parent class is in the RoPE application section: we use apply_rotary_pos_emb_absolute which handles absolute position embeddings properly for both bshd and thd formats.

Parameters:

hidden_states – Input tensor [seq_len, batch, hidden_size].
attention_mask – Attention mask (typically None for ViT).
rotary_pos_emb – Pre-computed absolute 2D RoPE frequencies.
packed_seq_params – Parameters for per-image packed sequence attention.
(other args) – See parent class SelfAttention.

Returns:

Tuple of (output, bias) where output is [seq_len, batch, hidden_size].
