`bridge.models.ernie_vl.modeling_ernie45_vl.model`#

Megatron-Core compatible ERNIE 4.5 VL MoE model.

This module wraps the HuggingFace ERNIE 4.5 VL MoE vision encoder and resampler with a Megatron-Core GPT language model to create a distributable VLM.

Architecture:

- Vision Tower: Ernie4_5_VLMoeVisionTransformerPretrainedModel (HF, replicated across TP)
- Resampler: Ernie4_5_VLMoeVariableResolutionResamplerModel (HF, replicated across TP)
- Language Model: MCoreGPTModel (Megatron-Core, distributed across TP/PP/EP) with custom ErnieMultiTypeMoE layers supporting dual-pool MoE:
  * text_moe_layer: 64 experts (FFN=1536) for text tokens
  * vision_moe_layer: 64 experts (FFN=512) for vision tokens
  * shared_experts: shared MLP for all tokens
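The dual-pool split can be pictured with a small sketch. The pool sizes come from the architecture summary; `ExpertPool`, `select_pool`, and `partition_tokens` are hypothetical names used only for illustration, and the token-type convention (0 = text, nonzero = vision) is assumed from the `moe_mm_token_type_ids` description in `forward()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertPool:
    """Hypothetical description of one expert pool (not a library class)."""
    name: str
    num_experts: int
    ffn_hidden_size: int


# Sizes taken from the architecture summary.
TEXT_POOL = ExpertPool("text_moe_layer", num_experts=64, ffn_hidden_size=1536)
VISION_POOL = ExpertPool("vision_moe_layer", num_experts=64, ffn_hidden_size=512)


def select_pool(token_type_id: int) -> ExpertPool:
    """Assumed convention: 0 selects the text pool, 1 or 2 the vision pool."""
    return TEXT_POOL if token_type_id == 0 else VISION_POOL


def partition_tokens(token_type_ids):
    """Split token indices by pool; shared experts would process every index."""
    text_idx = [i for i, t in enumerate(token_type_ids) if t == 0]
    vision_idx = [i for i, t in enumerate(token_type_ids) if t != 0]
    return text_idx, vision_idx


text_idx, vision_idx = partition_tokens([0, 0, 1, 1, 2, 0])
```

This only shows which pool a token would be dispatched to; expert selection within a pool (the router itself) is not modeled here.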

Module Contents#

Classes#

`_MgVitTowerAdapter`	Thin adapter that makes the MG-native ErnieVLVisionModel compatible with the HF-bound `get_image_features` / `get_video_features` methods.
`ErnieMultimodalRotaryEmbedding`	ERNIE-specific 3D M-RoPE with interleaved H/W frequency allocation.
`Ernie45VLModel`	ERNIE 4.5 VL MoE Model (Vision-Language with Mixture of Experts).

Functions#

`_normalize_hf_config`	Ensure the HF config has a `text_config` attribute.
`_normalize_vision_config`	Ensure the vision config has all attributes required by the transformers-builtin vision model classes (Ernie4_5_VLMoeVisionBlock, Ernie4_5_VLMoeVisionTransformerPretrainedModel, Ernie4_5_VLMoeVariableResolutionResamplerModel).

API#

bridge.models.ernie_vl.modeling_ernie45_vl.model._normalize_hf_config(hf_config)#

Ensure the HF config has a text_config attribute.

The transformers-builtin Ernie4_5_VLMoeVariableResolutionResamplerModel accesses config.text_config.hidden_size and config.text_config.rms_norm_eps. The nested config (Instruct model) has text_config as a sub-object, but the flat config (Thinking model) stores all LLM fields directly on the top-level config.

For flat configs, we set text_config to point to the config itself so that config.text_config.hidden_size resolves to config.hidden_size.
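A minimal sketch of the aliasing described here, using `SimpleNamespace` objects in place of real HF configs. The function name `normalize_hf_config_sketch`, the return value, and the field values are illustrative assumptions, not the library's implementation.

```python
from types import SimpleNamespace


def normalize_hf_config_sketch(hf_config):
    """Illustrative stand-in: alias text_config to the config itself if missing."""
    if getattr(hf_config, "text_config", None) is None:
        # Flat config: LLM fields live on the top level, so point back at it.
        hf_config.text_config = hf_config
    return hf_config


# Placeholder values; real configs carry many more fields.
flat = SimpleNamespace(hidden_size=2560, rms_norm_eps=1e-5)
nested = SimpleNamespace(
    text_config=SimpleNamespace(hidden_size=2560, rms_norm_eps=1e-5)
)

normalize_hf_config_sketch(flat)
normalize_hf_config_sketch(nested)
```

After normalization, `config.text_config.hidden_size` resolves on both layouts, so downstream code does not need to branch on the config style.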

bridge.models.ernie_vl.modeling_ernie45_vl.model._normalize_vision_config(vision_config, hf_config=None)#

Ensure the vision config has all attributes required by the transformers-builtin vision model classes (Ernie4_5_VLMoeVisionBlock, Ernie4_5_VLMoeVisionTransformerPretrainedModel, Ernie4_5_VLMoeVariableResolutionResamplerModel).

The Thinking model’s custom DFNRopeVisionTransformerConfig (auto_map) uses mlp_ratio + embed_dim instead of intermediate_size, omits rms_norm_eps, and omits temporal_merge_size. This function adds the missing attributes so the same config object works with the transformers-builtin vision model code.
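The attribute back-filling can be sketched as follows. This is an assumption-laden illustration: deriving `intermediate_size` as `embed_dim * mlp_ratio` is the conventional reading of those two fields, the default values for `rms_norm_eps` and `temporal_merge_size` are placeholders, and the role of the real function's `hf_config` argument is not described above, so it is omitted.

```python
from types import SimpleNamespace


def normalize_vision_config_sketch(
    vision_config,
    default_eps=1e-6,            # placeholder default, not confirmed by the source
    default_temporal_merge=2,    # placeholder default, not confirmed by the source
):
    """Illustrative stand-in: add attributes the builtin vision classes expect."""
    if not hasattr(vision_config, "intermediate_size"):
        # Assumed derivation: MLP width = embedding width * MLP ratio.
        vision_config.intermediate_size = int(
            vision_config.embed_dim * vision_config.mlp_ratio
        )
    if not hasattr(vision_config, "rms_norm_eps"):
        vision_config.rms_norm_eps = default_eps
    if not hasattr(vision_config, "temporal_merge_size"):
        vision_config.temporal_merge_size = default_temporal_merge
    return vision_config


# Placeholder field values for a Thinking-style vision config.
cfg = normalize_vision_config_sketch(SimpleNamespace(embed_dim=1280, mlp_ratio=4))
```

Attributes that already exist are left untouched, so a config that is already in the builtin format passes through unchanged.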

class bridge.models.ernie_vl.modeling_ernie45_vl.model._MgVitTowerAdapter( mg_vision_model: megatron.bridge.models.ernie_vl.modeling_ernie45_vl.vision_model.ErnieVLVisionModel, )#

Bases: torch.nn.Module

Thin adapter that makes the MG-native ErnieVLVisionModel compatible with the HF-bound get_image_features / get_video_features methods.

The HF methods call self.vision_tower(pixel_values, grid_thw, return_dict=True) and expect a BaseModelOutputWithPooling with .last_hidden_state. They also access self.vision_tower.spatial_merge_size.

This adapter wraps ErnieVLVisionModel to match that interface exactly.
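The interface contract can be illustrated without torch. In this sketch `FakeVisionModel` stands in for ErnieVLVisionModel, `OutputWithLastHiddenState` stands in for the transformers `BaseModelOutputWithPooling`, and the inner model's call signature and return type are assumptions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class OutputWithLastHiddenState:
    """Stand-in for transformers' BaseModelOutputWithPooling."""
    last_hidden_state: Any
    pooler_output: Any = None


class FakeVisionModel:
    """Hypothetical stand-in for ErnieVLVisionModel."""
    spatial_merge_size = 2

    def __call__(self, pixel_values, grid_thw):
        # Pretend to encode patches; the real model returns hidden states.
        return [v * 2 for v in pixel_values]


class VitTowerAdapterSketch:
    """Exposes the call shape the HF-bound feature methods rely on."""

    def __init__(self, vision_model):
        self.vision_model = vision_model
        # The HF methods read this attribute from the tower directly.
        self.spatial_merge_size = vision_model.spatial_merge_size

    def __call__(self, pixel_values, grid_thw, return_dict=True, **kwargs):
        hidden = self.vision_model(pixel_values, grid_thw)
        if return_dict:
            return OutputWithLastHiddenState(last_hidden_state=hidden)
        return (hidden,)


adapter = VitTowerAdapterSketch(FakeVisionModel())
out = adapter([1, 2], grid_thw=[(1, 1, 2)], return_dict=True)
```

The caller only touches `.last_hidden_state` and `.spatial_merge_size`, which is why a thin wrapper is enough.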

Initialization

forward(pixel_values, grid_thw, return_dict=True, **kwargs)#

class bridge.models.ernie_vl.modeling_ernie45_vl.model.ErnieMultimodalRotaryEmbedding(freq_allocation: int = 20, **kwargs)#

Bases: megatron.core.models.common.embeddings.rotary_pos_embedding.MultimodalRotaryEmbedding

ERNIE-specific 3D M-RoPE with interleaved H/W frequency allocation.

ERNIE 4.5 VL uses a custom RoPE layout that differs from the standard Qwen2VL-style contiguous block layout used by MultimodalRotaryEmbedding.

Standard (Qwen2VL) layout with mrope_section=[22, 22, 20]:

    head dims [0:44] -> T (temporal) axis, freq bands 0-21
    head dims [44:88] -> H (height) axis, freq bands 0-21
    head dims [88:128] -> W (width) axis, freq bands 0-19

ERNIE layout with freq_allocation=20:

    head dims [0:44] -> H,W interleaved: even freq bands -> H, odd -> W
    head dims [44:88] -> (same interleaving continues)
    head dims [88:128] -> T (temporal) axis, freq bands 44-63

More precisely, for freq band index f (0..63):

    f in {0,2,4,…,42} (even, f<44) -> H position
    f in {1,3,5,…,43} (odd, f<44) -> W position
    f in {44,45,…,63} (last 20) -> T position
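The per-band rule can be written as a small function. `axis_for_band` is a hypothetical helper that encodes only the mapping stated here (64 bands, the last `freq_allocation` bands temporal, the rest alternating H/W); it is not the library's code.

```python
def axis_for_band(f, num_bands=64, freq_allocation=20):
    """Return which position axis ('T', 'H', or 'W') drives frequency band f."""
    if not 0 <= f < num_bands:
        raise ValueError(f"band index {f} outside [0, {num_bands})")
    hw_bands = num_bands - freq_allocation  # 44 bands shared by H and W
    if f >= hw_bands:
        return "T"
    return "H" if f % 2 == 0 else "W"


layout = [axis_for_band(f) for f in range(64)]
counts = {axis: layout.count(axis) for axis in "THW"}
```

With the defaults this yields 22 bands for H, 22 for W, and 20 for T, the same per-axis budget as `mrope_section=[22, 22, 20]` but with a different ordering.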

For text tokens (T=H=W=p), both layouts produce identical results since all axes have the same position value. The difference only manifests for image/video tokens where T, H, W have distinct values.
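This equivalence can be checked with a simplified per-band view. The sketch compares which position value each of the 64 frequency bands would receive under a contiguous T/H/W sectioning versus the ERNIE interleaving; since both layouts share the same `inv_freq`, equal positions per band imply equal rotary angles. The helper names and the example positions are illustrative assumptions.

```python
def ernie_axis(f, num_bands=64, freq_allocation=20):
    """ERNIE rule: last freq_allocation bands -> T, the rest alternate H/W."""
    if f >= num_bands - freq_allocation:
        return "T"
    return "H" if f % 2 == 0 else "W"


def contiguous_axis(f, mrope_section=(22, 22, 20)):
    """Simplified contiguous rule: first section -> T, second -> H, third -> W."""
    t, h, _ = mrope_section
    if f < t:
        return "T"
    if f < t + h:
        return "H"
    return "W"


def band_positions(pos, axis_fn, num_bands=64):
    """Position value fed to each frequency band under a given layout."""
    return [pos[axis_fn(f)] for f in range(num_bands)]


text_pos = {"T": 7, "H": 7, "W": 7}    # text token: all axes share one position
image_pos = {"T": 0, "H": 3, "W": 5}   # image token: axes differ

text_same = band_positions(text_pos, ernie_axis) == band_positions(text_pos, contiguous_axis)
image_same = band_positions(image_pos, ernie_axis) == band_positions(image_pos, contiguous_axis)
```

Here `text_same` is true and `image_same` is false, which is why a layout mismatch would go unnoticed on text-only inputs.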

This subclass overrides forward() to implement ERNIE’s interleaved layout while reusing the parent’s inv_freq and infrastructure.

Initialization

forward( position_ids: torch.Tensor, mrope_section, cp_group=None, ) → torch.Tensor#

Compute ERNIE-style interleaved M-RoPE embeddings.

Parameters:

position_ids – [3, batch, seq_len] where axis 0=T, 1=H, 2=W
mrope_section – Ignored (kept for API compatibility). ERNIE uses freq_allocation instead.
cp_group – Context parallel group.

Returns:

RoPE embedding of shape [seq_len, batch, 1, head_dim].

Return type:

Tensor

class bridge.models.ernie_vl.modeling_ernie45_vl.model.Ernie45VLModel( config: megatron.bridge.models.gpt_provider.GPTModelProvider, pre_process: bool = True, post_process: bool = True, vp_stage: Optional[int] = None, )#

Bases: megatron.core.transformer.module.MegatronModule

ERNIE 4.5 VL MoE Model (Vision-Language with Mixture of Experts).

This model combines:

A HuggingFace ERNIE 4.5 vision encoder (32-layer ViT with 2D RoPE)
A variable-resolution resampler (spatial + temporal merging)
A Megatron-Core GPT language model with heterogeneous dual-pool MoE

The vision tower and resampler are borrowed directly from HuggingFace and replicated across TP ranks. The language model uses standard Megatron-Core distributed infrastructure.

Parameters:

config (GPTModelProvider) – Language model provider configuration.
pre_process (bool) – Include embedding layer (used with pipeline parallelism).
post_process (bool) – Include output layer (used with pipeline parallelism).
vp_stage (int, optional) – Virtual pipeline stage index.

Initialization

property decoder#

Expose language model decoder for mcore inference compatibility.

set_input_tensor(input_tensor) → None#

Set model chunk input tensor.

_normalize_pixel_values(pixel_values: torch.Tensor) → torch.Tensor#

Normalize raw pixel patches for the vision encoder.

The ERNIE 4.5 VL processor outputs raw pixel patches (0-255 range, do_rescale=False, do_normalize=False). This method applies CLIP normalization on-device, matching the HF custom model’s vision_forward() + add_image_preprocess() logic:

pixel_values = pixel_values / 255.0
pixel_values = (pixel_values - CLIP_MEAN) / CLIP_STD

Parameters:

pixel_values – Raw pixel patches [total_patches, C*patch_size^2]. Values in 0-255 range (any dtype).

Returns:

Normalized pixel patches in bfloat16, values in ~(-2, 2.5) range.
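A scalar sketch of the two normalization steps, useful for checking the stated output range. It assumes `CLIP_MEAN` and `CLIP_STD` are the standard OpenAI CLIP per-channel statistics in RGB order; how channels are laid out inside the flattened patch dimension is not specified above, so the sketch works on a single value with an explicit channel index.

```python
# Standard OpenAI CLIP statistics (assumed to match the module's constants).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(value, channel):
    """Rescale a raw 0-255 value to [0, 1], then apply CLIP mean/std."""
    rescaled = value / 255.0
    return (rescaled - CLIP_MEAN[channel]) / CLIP_STD[channel]


# Extremes over all three channels.
lo = min(normalize_pixel(0, c) for c in range(3))
hi = max(normalize_pixel(255, c) for c in range(3))
```

Under these constants the extremes are roughly -1.79 and 2.15, consistent with the approximate (-2, 2.5) range quoted for the output.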

forward( input_ids: torch.LongTensor = None, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, inputs_embeds: Optional[torch.FloatTensor] = None, pixel_values: Optional[torch.Tensor] = None, pixel_values_videos: Optional[torch.FloatTensor] = None, image_grid_thw: Optional[torch.LongTensor] = None, video_grid_thw: Optional[torch.LongTensor] = None, mm_token_type_ids: Optional[torch.IntTensor] = None, moe_mm_token_type_ids: Optional[torch.IntTensor] = None, labels: torch.Tensor = None, inference_context: megatron.core.inference.contexts.BaseInferenceContext = None, packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None, extra_block_kwargs: dict = None, runtime_gather_output: Optional[bool] = None, *, inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None, loss_mask: Optional[torch.Tensor] = None, ) → torch.Tensor#

Forward pass for ERNIE 4.5 VL MoE.

Parameters:

input_ids – Token IDs [batch_size, seq_len].
pixel_values – Image pixel values for the vision encoder.
pixel_values_videos – Video pixel values for the vision encoder.
image_grid_thw – Grid dimensions (T, H, W) per image [num_images, 3].
video_grid_thw – Grid dimensions (T, H, W) per video [num_videos, 3].
mm_token_type_ids – Token type IDs for M-RoPE computation (0=text, 1=image, 2=video).
moe_mm_token_type_ids – Token type IDs for MoE routing (0=text, 1/2=vision).
labels – Labels for language modeling loss.
loss_mask – Mask for loss computation.
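When preparing inputs, a quick consistency check between `pixel_values` and `image_grid_thw` can catch collation mistakes early. The sketch assumes the convention common to Qwen2-VL-style processors: grid entries are in patch units and the patch rows of all images are concatenated, so the row count equals the sum of T*H*W. Both helper names are hypothetical.

```python
def patches_per_item(grid_thw):
    """Number of patches each image or video contributes: T * H * W."""
    return [t * h * w for t, h, w in grid_thw]


def check_patch_count(num_patch_rows, grid_thw):
    """Raise if the patch rows do not match the grid description."""
    expected = sum(patches_per_item(grid_thw))
    if num_patch_rows != expected:
        raise ValueError(
            f"pixel_values has {num_patch_rows} rows, grid_thw implies {expected}"
        )
    return expected


# Two images: a 16x24 patch grid and an 8x8 patch grid, one frame each.
image_grid_thw = [(1, 16, 24), (1, 8, 8)]
total = check_patch_count(448, image_grid_thw)
```

The same check applies to `pixel_values_videos` with `video_grid_thw` under the same assumption.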

freeze( freeze_language_model: bool, freeze_vision_model: bool, freeze_vision_projection: bool, )#

Freeze model modules.

Parameters:

freeze_language_model – Freeze the language model module.
freeze_vision_model – Freeze the vision encoder (patch_embed + blocks).
freeze_vision_projection – Freeze the resampler / projector.
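Selective freezing is typically done by turning off `requires_grad` on the parameters of the chosen submodules. The sketch below shows that pattern with plain-Python stand-ins so it runs without torch; `Param`, `Module`, the dictionary keys, and `freeze_sketch` are all hypothetical, and whether the library's `freeze()` works exactly this way is an assumption.

```python
class Param:
    """Stand-in for torch.nn.Parameter: only tracks requires_grad."""
    def __init__(self):
        self.requires_grad = True


class Module:
    """Stand-in for torch.nn.Module exposing parameters()."""
    def __init__(self, num_params):
        self._params = [Param() for _ in range(num_params)]

    def parameters(self):
        return iter(self._params)


def freeze_sketch(modules, freeze_language_model, freeze_vision_model,
                  freeze_vision_projection):
    """Disable gradients on each selected submodule (illustrative only)."""
    selected = {
        "language_model": freeze_language_model,
        "vision_model": freeze_vision_model,
        "vision_projection": freeze_vision_projection,
    }
    for name, should_freeze in selected.items():
        if should_freeze:
            for p in modules[name].parameters():
                p.requires_grad = False


modules = {
    "language_model": Module(3),
    "vision_model": Module(2),
    "vision_projection": Module(1),
}
# Example: train only the language model and the projector.
freeze_sketch(modules, freeze_language_model=False, freeze_vision_model=True,
              freeze_vision_projection=False)
```

A common fine-tuning setup keeps the vision encoder frozen while the resampler and language model remain trainable, as in the call above.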
