bridge.models.ernie_vl.modeling_ernie45_vl.vision_layer_spec#

Layer spec for the ERNIE 4.5 VL Megatron-native Vision Transformer (ViT).

Provides get_ernie_vit_layer_spec(), which returns a ModuleSpec for a single ViT transformer layer built from Transformer Engine modules.

The spec is identical to the standard MCore ViT spec from megatron.core.models.vision.vit_layer_specs.get_vit_layer_with_transformer_engine_spec, except that self_attention.module is overridden with ErnieVLSelfAttention, which applies absolute 2D RoPE in the non-interleaved rotate_half style.
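
To make "non-interleaved rotate_half style" concrete, here is a minimal pure-Python sketch. It is not taken from ErnieVLSelfAttention: the function names are hypothetical, and the way angles_2d splits the frequency pairs between the row and column index (and the base of 10000) is an assumption borrowed from common 2D RoPE layouts, not a confirmed detail of ERNIE 4.5 VL.

```python
import math


def rotate_half(x):
    """Non-interleaved rotation: split x into halves [a, b] and return [-b, a]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def apply_rope(x, angles):
    """Rotate x (even length d) by d/2 angles, pairing x[i] with x[i + d/2]."""
    half = len(x) // 2
    assert len(angles) == half
    # Each angle is used twice: once for each element of its pair.
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rot = rotate_half(x)
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rot, sin)]


def angles_2d(row, col, dim, base=10000.0):
    """ASSUMED 2D layout: the first half of the pairs is driven by the absolute
    row index, the second half by the absolute column index."""
    per_axis = (dim // 2) // 2
    inv_freq = [base ** (-i / per_axis) for i in range(per_axis)]
    return [row * f for f in inv_freq] + [col * f for f in inv_freq]


# Rotate an 8-dimensional query vector for the patch at row 3, column 5.
q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
q_rot = apply_rope(q, angles_2d(3, 5, dim=len(q)))
```

In the interleaved variant, by contrast, each rotation pairs adjacent elements (x[0] with x[1], x[2] with x[3], and so on) rather than elements half a vector apart. The two conventions are not interchangeable for a given set of pretrained weights.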

Architecture details:

- Attention: TELayerNormColumnParallelLinear (fused QKV + LN) + TEDotProductAttention + TERowParallelLinear
- MLP: TELayerNormColumnParallelLinear (fused fc1 + LN) + TERowParallelLinear
- Mask type: AttnMaskType.no_mask (bidirectional attention for ViT)
- pre_mlp_layernorm: IdentityOp (LN is fused into TE linear layers)
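
The layout listed above, and the single override that distinguishes the ERNIE spec, can be sketched with plain dataclasses. This is an illustration only: Spec is a stand-in for MCore's ModuleSpec, module names are written as strings instead of real classes, and the submodule field names (linear_qkv, core_attention, linear_proj, linear_fc1, linear_fc2) follow MCore naming conventions as an assumption rather than being quoted from this module's source.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Spec:
    """Stand-in for MCore's ModuleSpec: a module name plus named submodules."""
    module: str
    params: dict = field(default_factory=dict)
    submodules: dict = field(default_factory=dict)


def base_vit_layer_spec() -> Spec:
    """Stand-in for the standard MCore ViT Transformer Engine layer spec."""
    return Spec(
        module="TransformerLayer",
        submodules={
            "self_attention": Spec(
                module="SelfAttention",
                params={"attn_mask_type": "no_mask"},  # bidirectional attention
                submodules={
                    "linear_qkv": "TELayerNormColumnParallelLinear",  # LN fused in
                    "core_attention": "TEDotProductAttention",
                    "linear_proj": "TERowParallelLinear",
                },
            ),
            # No separate LayerNorm before the MLP: it is fused into linear_fc1.
            "pre_mlp_layernorm": "IdentityOp",
            "mlp": Spec(
                module="MLP",
                submodules={
                    "linear_fc1": "TELayerNormColumnParallelLinear",  # LN fused in
                    "linear_fc2": "TERowParallelLinear",
                },
            ),
        },
    )


def ernie_vit_layer_spec() -> Spec:
    """Reuse the base spec and swap only the self-attention module."""
    spec = base_vit_layer_spec()
    attn = spec.submodules["self_attention"]
    spec.submodules["self_attention"] = replace(attn, module="ErnieVLSelfAttention")
    return spec


ernie_spec = ernie_vit_layer_spec()
```

Because only the module of self_attention changes, the linear layers, mask type, and MLP stay exactly as in the base spec.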

Module Contents#

Functions#

get_ernie_vit_layer_spec

Return a TransformerLayer ModuleSpec for ERNIE ViT.

API#

bridge.models.ernie_vl.modeling_ernie45_vl.vision_layer_spec.get_ernie_vit_layer_spec()#

Return a TransformerLayer ModuleSpec for ERNIE ViT.

This reuses the standard MCore ViT TE spec and overrides only the self-attention module, replacing it with ErnieVLSelfAttention so that absolute 2D RoPE embeddings are applied instead of the standard relative RoPE.

Returns:

Spec for one ERNIE ViT transformer layer.

Return type:

ModuleSpec
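
One way to check which module the returned spec selects at each level is to walk it recursively. The helper below is a hypothetical sketch, not part of this module. It assumes only that a spec exposes module and submodules attributes, as ModuleSpec does, and it is demonstrated here on stand-in objects so that it runs without Megatron installed.

```python
from dataclasses import fields, is_dataclass
from types import SimpleNamespace


def describe_spec(spec, name="layer", indent=0):
    """Return indented lines naming the module chosen at each level of a spec."""
    pad = "  " * indent
    module = getattr(spec, "module", None)
    if module is None:
        # Leaf: a bare class (or other object) rather than a nested spec.
        return [f"{pad}{name}: {getattr(spec, '__name__', repr(spec))}"]
    lines = [f"{pad}{name}: {getattr(module, '__name__', repr(module))}"]
    subs = getattr(spec, "submodules", None)
    if subs is None:
        return lines
    if is_dataclass(subs):
        items = [(f.name, getattr(subs, f.name)) for f in fields(subs)]
    else:
        items = list(vars(subs).items())
    for child_name, child in items:
        lines.extend(describe_spec(child, child_name, indent + 1))
    return lines


# Stand-in classes and spec objects, used only for this demonstration.
class FakeLayer: pass
class FakeAttention: pass
class FakeLinear: pass

demo_spec = SimpleNamespace(
    module=FakeLayer,
    submodules=SimpleNamespace(
        self_attention=SimpleNamespace(
            module=FakeAttention,
            submodules=SimpleNamespace(linear_qkv=FakeLinear),
        ),
    ),
)
demo_lines = describe_spec(demo_spec)
```

With the package installed, passing the result of get_ernie_vit_layer_spec() to describe_spec in place of demo_spec should list ErnieVLSelfAttention under self_attention, assuming the spec is structured as described above.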