bridge.models.ernie_vl.modeling_ernie45_vl.model
Megatron-Core compatible ERNIE 4.5 VL MoE model.
This module wraps the HuggingFace ERNIE 4.5 VL MoE vision encoder and resampler with a Megatron-Core GPT language model to create a distributable VLM.
Architecture:
- Vision Tower: Ernie4_5_VLMoeVisionTransformerPretrainedModel (HF, replicated across TP)
- Resampler: Ernie4_5_VLMoeVariableResolutionResamplerModel (HF, replicated across TP)
- Language Model: MCoreGPTModel (Megatron-Core, distributed across TP/PP/EP) with custom ErnieMultiTypeMoE layers supporting dual-pool MoE:
  * text_moe_layer: 64 experts (FFN=1536) for text tokens
  * vision_moe_layer: 64 experts (FFN=512) for vision tokens
  * shared_experts: shared MLP for all tokens
Module Contents
Classes
_MgVitTowerAdapter | Thin adapter that makes the MG-native ErnieVLVisionModel compatible with the HF-bound get_image_features/get_video_features methods. |
ErnieMultimodalRotaryEmbedding | ERNIE-specific 3D M-RoPE with interleaved H/W frequency allocation. |
Ernie45VLModel | ERNIE 4.5 VL MoE Model (Vision-Language with Mixture of Experts). |
Functions
_normalize_hf_config | Ensure the HF config has a text_config attribute. |
_normalize_vision_config | Ensure the vision config has all attributes required by the transformers-builtin vision model classes (Ernie4_5_VLMoeVisionBlock, Ernie4_5_VLMoeVisionTransformerPretrainedModel, Ernie4_5_VLMoeVariableResolutionResamplerModel). |
API
- bridge.models.ernie_vl.modeling_ernie45_vl.model._normalize_hf_config(hf_config)
Ensure the HF config has a text_config attribute.
The transformers-builtin Ernie4_5_VLMoeVariableResolutionResamplerModel accesses config.text_config.hidden_size and config.text_config.rms_norm_eps. The nested config (Instruct model) has text_config as a sub-object, but the flat config (Thinking model) stores all LLM fields directly on the top-level config.
For flat configs, we set text_config to point to the config itself so that config.text_config.hidden_size resolves to config.hidden_size.
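The aliasing described above can be illustrated with a minimal sketch. It is not the module's implementation: `normalize_hf_config` is a stand-in name, `SimpleNamespace` replaces the real HF config classes, and the field values are made up for the example.

```python
from types import SimpleNamespace


def normalize_hf_config(hf_config):
    """Give flat configs a ``text_config`` alias that points back at themselves."""
    if getattr(hf_config, "text_config", None) is None:
        # Flat layout: LLM fields live on the top-level object, so alias it.
        hf_config.text_config = hf_config
    return hf_config


# Nested layout: text_config is already a sub-object and is left untouched.
nested = SimpleNamespace(text_config=SimpleNamespace(hidden_size=2560, rms_norm_eps=1e-5))
# Flat layout: LLM fields sit directly on the top-level config (illustrative values).
flat = SimpleNamespace(hidden_size=2560, rms_norm_eps=1e-5)

normalize_hf_config(nested)
normalize_hf_config(flat)

# Both layouts can now be read through the same attribute path.
print(nested.text_config.hidden_size, flat.text_config.hidden_size)
```

After the call, code that reads `config.text_config.hidden_size` no longer needs to know which layout it was given.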
- bridge.models.ernie_vl.modeling_ernie45_vl.model._normalize_vision_config(vision_config, hf_config=None)
Ensure the vision config has all attributes required by the transformers-builtin vision model classes (Ernie4_5_VLMoeVisionBlock, Ernie4_5_VLMoeVisionTransformerPretrainedModel, Ernie4_5_VLMoeVariableResolutionResamplerModel).
The Thinking model's custom DFNRopeVisionTransformerConfig (auto_map) uses mlp_ratio + embed_dim instead of intermediate_size, omits rms_norm_eps, and omits temporal_merge_size. This function adds the missing attributes so the same config object works with the transformers-builtin vision model code.
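A rough sketch of this kind of attribute back-filling follows. Several parts are assumptions made for illustration: the function name, the fallback values `default_eps` and `default_temporal_merge`, the choice to borrow `rms_norm_eps` from the parent config, and the example dimensions.

```python
from types import SimpleNamespace


def normalize_vision_config(vision_config, hf_config=None,
                            default_eps=1e-6, default_temporal_merge=2):
    """Add attributes the builtin vision classes read, if they are missing."""
    if not hasattr(vision_config, "intermediate_size"):
        # Derive the MLP width from the ratio-style fields.
        vision_config.intermediate_size = int(
            vision_config.embed_dim * vision_config.mlp_ratio
        )
    if not hasattr(vision_config, "rms_norm_eps"):
        # Assumption: prefer the parent config's value when one is available.
        vision_config.rms_norm_eps = getattr(hf_config, "rms_norm_eps", default_eps)
    if not hasattr(vision_config, "temporal_merge_size"):
        # Assumption: the fallback merge size is a placeholder, not a documented default.
        vision_config.temporal_merge_size = default_temporal_merge
    return vision_config


# A ratio-style vision config (illustrative numbers) and its parent config.
thinking_style = SimpleNamespace(embed_dim=1280, mlp_ratio=4)
parent = SimpleNamespace(rms_norm_eps=1e-5)
normalize_vision_config(thinking_style, parent)

# Without a parent config the hard-coded fallback is used instead.
standalone = normalize_vision_config(SimpleNamespace(embed_dim=1280, mlp_ratio=4))
```

Attributes that already exist are never overwritten, so a config that is already in the builtin shape passes through unchanged.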
- class bridge.models.ernie_vl.modeling_ernie45_vl.model._MgVitTowerAdapter(
- mg_vision_model: megatron.bridge.models.ernie_vl.modeling_ernie45_vl.vision_model.ErnieVLVisionModel,
)
Bases: torch.nn.Module
Thin adapter that makes the MG-native ErnieVLVisionModel compatible with the HF-bound get_image_features/get_video_features methods.
The HF methods call self.vision_tower(pixel_values, grid_thw, return_dict=True) and expect a BaseModelOutputWithPooling with .last_hidden_state. They also access self.vision_tower.spatial_merge_size.
This adapter wraps ErnieVLVisionModel to match that interface exactly.
Initialization
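The adapter pattern can be sketched without any framework dependency. Everything below is hypothetical: `PoolingOutput` stands in for `BaseModelOutputWithPooling`, `FakeVisionModel` stands in for `ErnieVLVisionModel`, and the wrapped model's call signature is assumed. The documented class subclasses `torch.nn.Module`; this sketch does not.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class PoolingOutput:
    """Stand-in for BaseModelOutputWithPooling; only the field used here."""
    last_hidden_state: Any


class VitTowerAdapter:
    """Framework-free sketch of an adapter exposing the HF vision-tower interface."""

    def __init__(self, mg_vision_model):
        self.model = mg_vision_model
        # HF callers read this attribute directly off the tower.
        self.spatial_merge_size = mg_vision_model.spatial_merge_size

    def __call__(self, pixel_values, grid_thw, return_dict=True, **kwargs):
        hidden = self.model(pixel_values, grid_thw)  # assumed call signature
        return PoolingOutput(last_hidden_state=hidden) if return_dict else (hidden,)


class FakeVisionModel:
    """Hypothetical encoder used only to exercise the adapter."""
    spatial_merge_size = 2

    def __call__(self, pixel_values, grid_thw):
        return [[float(v) for v in row] for row in pixel_values]


tower = VitTowerAdapter(FakeVisionModel())
out = tower([[1, 2], [3, 4]], grid_thw=[[1, 2, 1]], return_dict=True)
print(out.last_hidden_state, tower.spatial_merge_size)
```

The caller only ever touches `.last_hidden_state` and `.spatial_merge_size`, which is why a thin wrapper is enough to swap the encoder implementation.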
- forward(pixel_values, grid_thw, return_dict=True, **kwargs)
- class bridge.models.ernie_vl.modeling_ernie45_vl.model.ErnieMultimodalRotaryEmbedding(freq_allocation: int = 20, **kwargs)
Bases: megatron.core.models.common.embeddings.rotary_pos_embedding.MultimodalRotaryEmbedding
ERNIE-specific 3D M-RoPE with interleaved H/W frequency allocation.
ERNIE 4.5 VL uses a custom RoPE layout that differs from the standard Qwen2VL-style contiguous block layout used by MultimodalRotaryEmbedding.
Standard (Qwen2VL) layout with mrope_section=[22, 22, 20]:
  head dims [0:44] -> T (temporal) axis, freq bands 0-21
  head dims [44:88] -> H (height) axis, freq bands 0-21
  head dims [88:128] -> W (width) axis, freq bands 0-19
ERNIE layout with freq_allocation=20:
  head dims [0:44] -> H,W interleaved: even freq bands -> H, odd -> W
  head dims [44:88] -> (same interleaving continues)
  head dims [88:128] -> T (temporal) axis, freq bands 44-63
More precisely, for freq band index f (0..63):
  f in {0,2,4,…,42} (even, f<44) -> H position
  f in {1,3,5,…,43} (odd, f<44) -> W position
  f in {44,45,…,63} (last 20) -> T position
For text tokens (T=H=W=p), both layouts produce identical results since all axes have the same position value. The difference only manifests for image/video tokens where T, H, W have distinct values.
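The band-to-axis rule above can be written as a small lookup. This is a sketch with hypothetical helper names. The contiguous variant assumes the three sections are laid out back to back over global band indices 0..63 (T first, then H, then W).

```python
def ernie_axis_for_band(f, num_bands=64, freq_allocation=20):
    """Return which position axis ('T', 'H', 'W') drives frequency band f."""
    hw_bands = num_bands - freq_allocation  # 44 for the documented setup
    if f >= hw_bands:
        return "T"
    return "H" if f % 2 == 0 else "W"


def contiguous_axis_for_band(f, mrope_section=(22, 22, 20)):
    """Contiguous block layout: first the T bands, then H, then W."""
    t, h, _ = mrope_section
    if f < t:
        return "T"
    return "H" if f < t + h else "W"


ernie = [ernie_axis_for_band(f) for f in range(64)]
standard = [contiguous_axis_for_band(f) for f in range(64)]

# Both layouts hand out the same number of bands per axis; only the order differs.
print({axis: ernie.count(axis) for axis in "THW"})
print({axis: standard.count(axis) for axis in "THW"})
```

Because a band's rotation frequency depends on its index, moving T from the lowest-index bands to the highest-index bands changes which frequencies encode time versus space, even though the band counts per axis match.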
This subclass overrides forward() to implement ERNIE's interleaved layout while reusing the parent's inv_freq and infrastructure.
Initialization
- forward(
- position_ids: torch.Tensor,
- mrope_section,
- cp_group=None,
)
Compute ERNIE-style interleaved M-RoPE embeddings.
- Parameters:
position_ids – [3, batch, seq_len] where axis 0=T, 1=H, 2=W
mrope_section – Ignored (kept for API compatibility). ERNIE uses freq_allocation instead.
cp_group – Context parallel group.
- Returns:
RoPE embedding of shape [seq_len, batch, 1, head_dim].
- Return type:
Tensor
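A NumPy sketch of the shape flow may help here. It is an approximation under stated assumptions, not the Megatron-Core code: it returns rotation angles only, duplicates the 64 bands to fill a 128-wide head dimension, uses a base of 10000 for `inv_freq`, and ignores context parallelism. The function name is hypothetical.

```python
import numpy as np


def ernie_mrope_angles(position_ids, inv_freq, freq_allocation=20):
    """position_ids: [3, batch, seq_len] (T, H, W); inv_freq: [num_bands]."""
    num_bands = inv_freq.shape[0]
    hw_bands = num_bands - freq_allocation
    # Axis index (0=T, 1=H, 2=W) chosen for every band; the tail stays on T.
    axis = np.zeros(num_bands, dtype=np.int64)
    axis[0:hw_bands:2] = 1  # even bands below the cutoff follow H
    axis[1:hw_bands:2] = 2  # odd bands below the cutoff follow W
    # pos[b, s, f] is the position on the axis assigned to band f.
    pos = np.transpose(position_ids, (1, 2, 0))[:, :, axis].astype(np.float64)
    freqs = pos * inv_freq  # [batch, seq_len, num_bands]
    # Assumed convention: duplicate the bands to fill head_dim = 2 * num_bands.
    emb = np.concatenate([freqs, freqs], axis=-1)
    return np.transpose(emb, (1, 0, 2))[:, :, None, :]  # [seq_len, batch, 1, head_dim]


inv_freq = 1.0 / (10000.0 ** (np.arange(64) / 64.0))  # assumed RoPE base
text_ids = np.broadcast_to(np.arange(5), (3, 1, 5))   # T=H=W for text tokens
emb = ernie_mrope_angles(text_ids, inv_freq)
print(emb.shape)
```

For the text-only input above, every band sees the same position value, so the result matches plain 1D RoPE angles regardless of how bands are assigned to axes.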
- class bridge.models.ernie_vl.modeling_ernie45_vl.model.Ernie45VLModel(
- config: megatron.bridge.models.gpt_provider.GPTModelProvider,
- pre_process: bool = True,
- post_process: bool = True,
- vp_stage: Optional[int] = None,
)
Bases: megatron.core.transformer.module.MegatronModule
ERNIE 4.5 VL MoE Model (Vision-Language with Mixture of Experts).
This model combines:
A HuggingFace ERNIE 4.5 vision encoder (32-layer ViT with 2D RoPE)
A variable-resolution resampler (spatial + temporal merging)
A Megatron-Core GPT language model with heterogeneous dual-pool MoE
The vision tower and resampler are borrowed directly from HuggingFace and replicated across TP ranks. The language model uses standard Megatron-Core distributed infrastructure.
- Parameters:
config (GPTModelProvider) – Language model provider configuration.
pre_process (bool) – Include embedding layer (used with pipeline parallelism).
post_process (bool) – Include output layer (used with pipeline parallelism).
vp_stage (int, optional) – Virtual pipeline stage index.
Initialization
- property decoder
Expose language model decoder for mcore inference compatibility.
- set_input_tensor(input_tensor) → None
Set model chunk input tensor.
- _normalize_pixel_values(pixel_values: torch.Tensor) → torch.Tensor
Normalize raw pixel patches for the vision encoder.
The ERNIE 4.5 VL processor outputs raw pixel patches (0-255 range, do_rescale=False, do_normalize=False). This method applies CLIP normalization on-device, matching the HF custom model's vision_forward() + add_image_preprocess() logic:
    pixel_values = pixel_values / 255.0
    pixel_values = (pixel_values - CLIP_MEAN) / CLIP_STD
- Parameters:
pixel_values – Raw pixel patches [total_patches, C*patch_size^2]. Values in 0-255 range (any dtype).
- Returns:
Normalized pixel patches in bfloat16, values in ~(-2, 2.5) range.
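The two-step normalization can be sketched in NumPy. The assumptions here are the usual published OpenAI CLIP mean and standard deviation for `CLIP_MEAN` and `CLIP_STD`, a patch size of 14, and a channel-major flattening of each patch row. The sketch returns float32 because NumPy has no bfloat16 dtype.

```python
import numpy as np

# Assumed values: the commonly published OpenAI CLIP per-channel statistics.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def normalize_pixel_patches(pixel_values, patch_size=14, channels=3):
    """pixel_values: [total_patches, C * patch_size**2], values in 0..255."""
    x = np.asarray(pixel_values, dtype=np.float32) / 255.0
    # Assumed layout: each row is flattened channel-major as (C, patch, patch).
    x = x.reshape(-1, channels, patch_size * patch_size)
    x = (x - CLIP_MEAN[None, :, None]) / CLIP_STD[None, :, None]
    return x.reshape(-1, channels * patch_size * patch_size)


raw = np.random.default_rng(0).integers(0, 256, size=(4, 3 * 14 * 14))
normalized = normalize_pixel_patches(raw)
print(normalized.shape, float(normalized.min()), float(normalized.max()))
```

With these statistics a black pixel maps to roughly -1.8 and a white pixel to roughly 2.1, which is consistent with the approximate output range quoted above.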
- forward(
- input_ids: torch.LongTensor = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- pixel_values: Optional[torch.Tensor] = None,
- pixel_values_videos: Optional[torch.FloatTensor] = None,
- image_grid_thw: Optional[torch.LongTensor] = None,
- video_grid_thw: Optional[torch.LongTensor] = None,
- mm_token_type_ids: Optional[torch.IntTensor] = None,
- moe_mm_token_type_ids: Optional[torch.IntTensor] = None,
- labels: torch.Tensor = None,
- inference_context: megatron.core.inference.contexts.BaseInferenceContext = None,
- packed_seq_params: megatron.core.packed_seq_params.PackedSeqParams = None,
- extra_block_kwargs: dict = None,
- runtime_gather_output: Optional[bool] = None,
- *,
- inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
- loss_mask: Optional[torch.Tensor] = None,
)
Forward pass for ERNIE 4.5 VL MoE.
- Parameters:
input_ids – Token IDs [batch_size, seq_len].
pixel_values – Image pixel values for the vision encoder.
pixel_values_videos – Video pixel values for the vision encoder.
image_grid_thw – Grid dimensions (T, H, W) per image [num_images, 3].
video_grid_thw – Grid dimensions (T, H, W) per video [num_videos, 3].
mm_token_type_ids – Token type IDs for M-RoPE computation (0=text, 1=image, 2=video).
moe_mm_token_type_ids – Token type IDs for MoE routing (0=text, 1/2=vision).
labels – Labels for language modeling loss.
loss_mask – Mask for loss computation.
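To show how moe_mm_token_type_ids could drive the dual-pool MoE, here is a toy NumPy sketch. It is not the ErnieMultiTypeMoE implementation. The three callables stand in for the text pool, the vision pool and the shared experts, per-pool expert routing is omitted, and adding the shared output to the routed output is an assumption.

```python
import numpy as np


def dual_pool_ffn(hidden, moe_mm_token_type_ids, text_pool, vision_pool, shared):
    """hidden: [tokens, dim]; type ids: [tokens] with 0=text, 1/2=vision."""
    is_vision = np.asarray(moe_mm_token_type_ids) != 0
    out = np.empty_like(hidden)
    # Each token visits exactly one routed pool, selected by its modality.
    out[~is_vision] = text_pool(hidden[~is_vision])
    out[is_vision] = vision_pool(hidden[is_vision])
    # The shared experts see every token regardless of modality.
    return out + shared(hidden)


hidden = np.ones((4, 2))
type_ids = [0, 1, 0, 2]  # text, image, text, video
result = dual_pool_ffn(
    hidden,
    type_ids,
    text_pool=lambda x: 2.0 * x,    # placeholder for the text expert pool
    vision_pool=lambda x: 3.0 * x,  # placeholder for the vision expert pool
    shared=lambda x: x,             # placeholder for the shared MLP
)
print(result[:, 0])
```

Image and video tokens (ids 1 and 2) take the same branch here, which mirrors the 1/2=vision convention in the parameter description.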
- freeze(
- freeze_language_model: bool,
- freeze_vision_model: bool,
- freeze_vision_projection: bool,
)
Freeze model modules.
- Parameters:
freeze_language_model – Freeze the language model module.
freeze_vision_model – Freeze the vision encoder (patch_embed + blocks).
freeze_vision_projection – Freeze the resampler / projector.
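Freezing of this kind usually comes down to switching off requires_grad on the chosen submodules. The sketch below is duck-typed so it runs without PyTorch. The attribute names `language_model`, `vision_tower` and `resampler` and the underscore-prefixed stub classes are hypothetical and exist only for the demonstration.

```python
def freeze_modules(model, freeze_language_model, freeze_vision_model,
                   freeze_vision_projection):
    """Disable gradients on the selected submodules."""
    selected = []
    # Attribute names below are hypothetical placeholders.
    if freeze_language_model:
        selected.append(model.language_model)
    if freeze_vision_model:
        selected.append(model.vision_tower)
    if freeze_vision_projection:
        selected.append(model.resampler)
    for module in selected:
        # Works with anything exposing parameters(), e.g. torch.nn.Module.
        for param in module.parameters():
            param.requires_grad = False


class _Param:
    requires_grad = True


class _Module:
    def __init__(self, n):
        self._params = [_Param() for _ in range(n)]

    def parameters(self):
        return iter(self._params)


class _Model:
    def __init__(self):
        self.language_model = _Module(2)
        self.vision_tower = _Module(2)
        self.resampler = _Module(1)


demo = _Model()
freeze_modules(demo, freeze_language_model=False,
               freeze_vision_model=True, freeze_vision_projection=True)
```

A common use of such a split is training only the language model while keeping the vision encoder and resampler fixed, as in the call above.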