bridge.models.ernie_vl.modeling_ernie45_vl.model#

Megatron-Core compatible ERNIE 4.5 VL MoE model.

This module wraps the HuggingFace ERNIE 4.5 VL MoE vision encoder and resampler with a Megatron-Core GPT language model to create a distributable VLM.

Architecture:

  • Vision Tower: Ernie4_5_VLMoeVisionTransformerPretrainedModel (HF, replicated across TP)

  • Resampler: Ernie4_5_VLMoeVariableResolutionResamplerModel (HF, replicated across TP)

  • Language Model: MCoreGPTModel (Megatron-Core, distributed across TP/PP/EP) with custom ErnieMultiTypeMoE layers supporting dual-pool MoE:

      • text_moe_layer: 64 experts (FFN=1536) for text tokens

      • vision_moe_layer: 64 experts (FFN=512) for vision tokens

      • shared_experts: shared MLP for all tokens

Module Contents#

Classes#

_MgVitTowerAdapter

Thin adapter that makes the MG-native ErnieVLVisionModel compatible with the HF-bound get_image_features / get_video_features methods.

ErnieMultimodalRotaryEmbedding

ERNIE-specific 3D M-RoPE with interleaved H/W frequency allocation.

Ernie45VLModel

ERNIE 4.5 VL MoE Model (Vision-Language with Mixture of Experts).

Functions#

_normalize_hf_config

Ensure the HF config has a text_config attribute.

_normalize_vision_config

Ensure the vision config has all attributes required by the transformers-builtin vision model classes (Ernie4_5_VLMoeVisionBlock, Ernie4_5_VLMoeVisionTransformerPretrainedModel, Ernie4_5_VLMoeVariableResolutionResamplerModel).

API#

bridge.models.ernie_vl.modeling_ernie45_vl.model._normalize_hf_config(hf_config)#

Ensure the HF config has a text_config attribute.

The transformers-builtin Ernie4_5_VLMoeVariableResolutionResamplerModel accesses config.text_config.hidden_size and config.text_config.rms_norm_eps. The nested config (Instruct model) has text_config as a sub-object, but the flat config (Thinking model) stores all LLM fields directly on the top-level config.

For flat configs, we set text_config to point to the config itself so that config.text_config.hidden_size resolves to config.hidden_size.
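The aliasing described above can be sketched in a few lines. This is an illustration only, not the module's actual implementation: normalize_hf_config_sketch is a hypothetical name, SimpleNamespace stands in for the real HF config classes, and the hidden_size and rms_norm_eps values are made up.

```python
from types import SimpleNamespace


def normalize_hf_config_sketch(hf_config):
    """Give flat configs a text_config attribute that aliases the config itself."""
    if getattr(hf_config, "text_config", None) is None:
        # Flat layout: LLM fields live on the top-level object, so alias it.
        hf_config.text_config = hf_config
    return hf_config


# Flat (Thinking-style) config: no text_config sub-object. Values are illustrative.
flat = SimpleNamespace(hidden_size=2560, rms_norm_eps=1e-5)
# Nested (Instruct-style) config: text_config is already a sub-object.
nested = SimpleNamespace(text_config=SimpleNamespace(hidden_size=2560, rms_norm_eps=1e-5))

flat = normalize_hf_config_sketch(flat)
nested = normalize_hf_config_sketch(nested)
```

After the call, flat.text_config.hidden_size and nested.text_config.hidden_size both resolve, so downstream code can read config.text_config uniformly without branching on the config layout.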

bridge.models.ernie_vl.modeling_ernie45_vl.model._normalize_vision_config(vision_config, hf_config=None)#

Ensure the vision config has all attributes required by the transformers-builtin vision model classes (Ernie4_5_VLMoeVisionBlock, Ernie4_5_VLMoeVisionTransformerPretrainedModel, Ernie4_5_VLMoeVariableResolutionResamplerModel).

The Thinking model’s custom DFNRopeVisionTransformerConfig (auto_map) uses mlp_ratio + embed_dim instead of intermediate_size, omits rms_norm_eps, and omits temporal_merge_size. This function adds the missing attributes so the same config object works with the transformers-builtin vision model code.
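A minimal sketch of this kind of attribute back-filling follows. It assumes that intermediate_size can be derived as embed_dim * mlp_ratio; the default values for rms_norm_eps and temporal_merge_size are placeholders chosen for illustration, and the real function may take them from hf_config instead. normalize_vision_config_sketch is a hypothetical name.

```python
from types import SimpleNamespace


def normalize_vision_config_sketch(vision_config, default_eps=1e-6, default_temporal_merge=2):
    """Add attributes the builtin vision classes read, if the config lacks them."""
    if not hasattr(vision_config, "intermediate_size"):
        # Assumption: MLP width is embed_dim scaled by mlp_ratio.
        vision_config.intermediate_size = int(vision_config.embed_dim * vision_config.mlp_ratio)
    if not hasattr(vision_config, "rms_norm_eps"):
        vision_config.rms_norm_eps = default_eps  # placeholder default
    if not hasattr(vision_config, "temporal_merge_size"):
        vision_config.temporal_merge_size = default_temporal_merge  # placeholder default
    return vision_config


# Custom-style config that only carries embed_dim and mlp_ratio (illustrative values).
custom = normalize_vision_config_sketch(SimpleNamespace(embed_dim=1280, mlp_ratio=4))
# Config that already has every attribute: nothing is overwritten.
kept = normalize_vision_config_sketch(
    SimpleNamespace(intermediate_size=99, rms_norm_eps=1e-5, temporal_merge_size=1)
)
```

Only missing attributes are filled in, so a config that already matches the builtin layout passes through unchanged.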

class bridge.models.ernie_vl.modeling_ernie45_vl.model._MgVitTowerAdapter(
mg_vision_model: megatron.bridge.models.ernie_vl.modeling_ernie45_vl.vision_model.ErnieVLVisionModel,
)#

Bases: torch.nn.Module

Thin adapter that makes the MG-native ErnieVLVisionModel compatible with the HF-bound get_image_features / get_video_features methods.

The HF methods call self.vision_tower(pixel_values, grid_thw, return_dict=True) and expect a BaseModelOutputWithPooling with .last_hidden_state. They also access self.vision_tower.spatial_merge_size.

This adapter wraps ErnieVLVisionModel to match that interface exactly.
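The interface-matching idea can be sketched without any framework dependency. Everything here is a stand-in: FakeVisionModel and OutputWithLastHiddenState are hypothetical substitutes for ErnieVLVisionModel and BaseModelOutputWithPooling, nested lists replace tensors, and the assumption that the wrapped model returns its features directly is not confirmed by this page.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class OutputWithLastHiddenState:
    """Hypothetical stand-in for BaseModelOutputWithPooling."""
    last_hidden_state: Any


class FakeVisionModel:
    """Hypothetical stand-in for ErnieVLVisionModel; returns features directly."""
    spatial_merge_size = 2

    def __call__(self, pixel_values, grid_thw):
        return [[float(v) for v in row] for row in pixel_values]


class VitTowerAdapterSketch:
    """Wraps a vision model so callers can use the HF-style call convention."""

    def __init__(self, vision_model):
        self.vision_model = vision_model
        # HF-bound methods read this attribute from the tower itself.
        self.spatial_merge_size = vision_model.spatial_merge_size

    def __call__(self, pixel_values, grid_thw, return_dict=True, **kwargs):
        hidden = self.vision_model(pixel_values, grid_thw)
        if return_dict:
            return OutputWithLastHiddenState(last_hidden_state=hidden)
        return (hidden,)


tower = VitTowerAdapterSketch(FakeVisionModel())
out = tower([[1, 2], [3, 4]], grid_thw=[[1, 2, 1]], return_dict=True)
```

The caller only touches out.last_hidden_state and tower.spatial_merge_size, which are the two access points the HF methods rely on.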

forward(pixel_values, grid_thw, return_dict=True, **kwargs)#

class bridge.models.ernie_vl.modeling_ernie45_vl.model.ErnieMultimodalRotaryEmbedding(freq_allocation: int = 20, **kwargs)#

Bases: megatron.core.models.common.embeddings.rotary_pos_embedding.MultimodalRotaryEmbedding

ERNIE-specific 3D M-RoPE with interleaved H/W frequency allocation.

ERNIE 4.5 VL uses a custom RoPE layout that differs from the standard Qwen2VL-style contiguous block layout used by MultimodalRotaryEmbedding.

Standard (Qwen2VL) layout with mrope_section=[22, 22, 20]:

  • head dims [0:44] -> T (temporal) axis, freq bands 0-21

  • head dims [44:88] -> H (height) axis, freq bands 0-21

  • head dims [88:128] -> W (width) axis, freq bands 0-19

ERNIE layout with freq_allocation=20:

  • head dims [0:44] -> H,W interleaved: even freq bands -> H, odd -> W

  • head dims [44:88] -> (same interleaving continues)

  • head dims [88:128] -> T (temporal) axis, freq bands 44-63

More precisely, for freq band index f (0..63):

  • f in {0,2,4,…,42} (even, f<44) -> H position

  • f in {1,3,5,…,43} (odd, f<44) -> W position

  • f in {44,45,…,63} (last 20) -> T position

For text tokens (T=H=W=p), both layouts produce identical results since all axes have the same position value. The difference only manifests for image/video tokens where T, H, W have distinct values.
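The band-to-axis rule for the ERNIE layout, and the reason text tokens are unaffected by it, can be checked with a small plain-Python sketch. The helper names ernie_axis_for_band and band_positions are hypothetical, and num_bands=64 is assumed from the 128-dimensional head in the layout description.

```python
def ernie_axis_for_band(f, num_bands=64, freq_allocation=20):
    """Return which position axis ('T', 'H', or 'W') drives frequency band f."""
    hw_bands = num_bands - freq_allocation  # 44 bands shared by H and W
    if f >= hw_bands:
        return "T"  # the last freq_allocation bands carry the temporal position
    return "H" if f % 2 == 0 else "W"  # even bands -> H, odd bands -> W


def band_positions(t, h, w, num_bands=64, freq_allocation=20):
    """Position value fed to each frequency band for a single token."""
    pos = {"T": t, "H": h, "W": w}
    return [pos[ernie_axis_for_band(f, num_bands, freq_allocation)] for f in range(num_bands)]


layout = [ernie_axis_for_band(f) for f in range(64)]
text_token = band_positions(7, 7, 7)   # text: T = H = W = p
image_token = band_positions(0, 3, 5)  # image: distinct T, H, W
```

For the text token every band receives the same position, so any assignment of bands to axes gives the same rotation; for the image token the per-band positions differ, which is where the interleaved layout diverges from the contiguous one.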

This subclass overrides forward() to implement ERNIE’s interleaved layout while reusing the parent’s inv_freq and infrastructure.

forward(
position_ids: torch.Tensor,
mrope_section,
cp_group=None,
) → torch.Tensor#

Compute ERNIE-style interleaved M-RoPE embeddings.

Parameters:
  • position_ids – [3, batch, seq_len] where axis 0=T, 1=H, 2=W

  • mrope_section – Ignored (kept for API compatibility). ERNIE uses freq_allocation instead.

  • cp_group – Context parallel group.

Returns:

RoPE embedding of shape [seq_len, batch, 1, head_dim].

Return type:

Tensor

class bridge.models.ernie_vl.modeling_ernie45_vl.model.Ernie45VLModel(
config: megatron.bridge.models.gpt_provider.GPTModelProvider,
pre_process: bool = True,
post_process: bool = True,
vp_stage: Optional[int] = None,
)#

Bases: megatron.core.transformer.module.MegatronModule

ERNIE 4.5 VL MoE Model (Vision-Language with Mixture of Experts).

This model combines:

  • A HuggingFace ERNIE 4.5 vision encoder (32-layer ViT with 2D RoPE)

  • A variable-resolution resampler (spatial + temporal merging)

  • A Megatron-Core GPT language model with heterogeneous dual-pool MoE

The vision tower and resampler are borrowed directly from HuggingFace and replicated across TP ranks. The language model uses standard Megatron-Core distributed infrastructure.

Parameters:
  • config (GPTModelProvider) – Language model provider configuration.

  • pre_process (bool) – Include embedding layer (used with pipeline parallelism).

  • post_process (bool) – Include output layer (used with pipeline parallelism).

  • vp_stage (int, optional) – Virtual pipeline stage index.

property decoder#

Expose language model decoder for mcore inference compatibility.

set_input_tensor(input_tensor) → None#

Set model chunk input tensor.

_normalize_pixel_values(pixel_values: torch.Tensor) → torch.Tensor#

Normalize raw pixel patches for the vision encoder.

The ERNIE 4.5 VL processor outputs raw pixel patches (0-255 range, do_rescale=False, do_normalize=False). This method applies CLIP normalization on-device, matching the HF custom model’s vision_forward() + add_image_preprocess() logic:

pixel_values = pixel_values / 255.0
pixel_values = (pixel_values - CLIP_MEAN) / CLIP_STD

Parameters:

  • pixel_values – Raw pixel patches [total_patches, C*patch_size^2]. Values in 0-255 range (any dtype).

Returns:

Normalized pixel patches in bfloat16, values in ~(-2, 2.5) range.
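The two-step normalization can be traced for a single RGB pixel in plain Python. This sketch assumes CLIP_MEAN and CLIP_STD are the widely used OpenAI CLIP per-channel statistics, which this page does not state explicitly; normalize_pixel is a hypothetical helper, and the real method works on flattened patch tensors and casts to bfloat16.

```python
# Assumed values: OpenAI CLIP per-channel statistics in RGB order.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Rescale one raw RGB pixel from 0-255 to 0-1, then standardize per channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


black = normalize_pixel((0, 0, 0))        # lower extreme of the input range
white = normalize_pixel((255, 255, 255))  # upper extreme of the input range
```

Under these assumed constants, the extremes black and white bound the normalized values, and both stay inside the approximate (-2, 2.5) range quoted above.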

forward(
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
pixel_values: Optional[torch.Tensor] = None,
pixel_values_videos: Optional[torch.FloatTensor] = None,
image_grid_thw: Optional[torch.LongTensor] = None,
video_grid_thw: Optional[torch.LongTensor] = None,
mm_token_type_ids: Optional[torch.IntTensor] = None,
moe_mm_token_type_ids: Optional[torch.IntTensor] = None,
labels: torch.Tensor = None,
inference_context: megatron.core.inference.contexts.BaseInferenceContext = None,
packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
extra_block_kwargs: dict = None,
runtime_gather_output: Optional[bool] = None,
*,
inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
loss_mask: Optional[torch.Tensor] = None,
) → torch.Tensor#

Forward pass for ERNIE 4.5 VL MoE.

Parameters:
  • input_ids – Token IDs [batch_size, seq_len].

  • pixel_values – Image pixel values for the vision encoder.

  • pixel_values_videos – Video pixel values for the vision encoder.

  • image_grid_thw – Grid dimensions (T, H, W) per image [num_images, 3].

  • video_grid_thw – Grid dimensions (T, H, W) per video [num_videos, 3].

  • mm_token_type_ids – Token type IDs for M-RoPE computation (0=text, 1=image, 2=video).

  • moe_mm_token_type_ids – Token type IDs for MoE routing (0=text, 1/2=vision).

  • labels – Labels for language modeling loss.

  • loss_mask – Mask for loss computation.
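To illustrate the token type encoding, the sketch below partitions one sequence into text and vision positions from its type IDs. It only demonstrates the ID convention listed above (0=text, 1/2=vision); how ErnieMultiTypeMoE actually dispatches tokens to its expert pools is not described on this page, and split_by_pool is a hypothetical helper.

```python
TEXT, IMAGE, VIDEO = 0, 1, 2


def split_by_pool(token_type_ids):
    """Return (text_indices, vision_indices) for one sequence of type IDs."""
    text_idx = [i for i, t in enumerate(token_type_ids) if t == TEXT]
    vision_idx = [i for i, t in enumerate(token_type_ids) if t in (IMAGE, VIDEO)]
    return text_idx, vision_idx


# Illustrative sequence: text, an image span, text, a video span, text.
moe_mm_token_type_ids = [0, 0, 1, 1, 1, 0, 2, 2, 0]
text_idx, vision_idx = split_by_pool(moe_mm_token_type_ids)
```

Image and video tokens land in the same vision group here, which matches the "1/2=vision" convention for moe_mm_token_type_ids, while mm_token_type_ids keeps the image/video distinction for M-RoPE.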

freeze(
freeze_language_model: bool,
freeze_vision_model: bool,
freeze_vision_projection: bool,
)#

Freeze model modules.

Parameters:
  • freeze_language_model – Freeze the language model module.

  • freeze_vision_model – Freeze the vision encoder (patch_embed + blocks).

  • freeze_vision_projection – Freeze the resampler / projector.