bridge.models.ernie_vl.ernie45_vl_bridge#
Megatron Bridge for ERNIE 4.5 VL MoE (Vision-Language with Mixture of Experts).
This bridge handles conversion between HuggingFace Ernie4_5_VLMoeForConditionalGeneration and Megatron-Core Ernie45VLModel formats, including:
Language model weights with heterogeneous dual-pool MoE:
text_moe: 64 experts with intermediate_size=1536 (text tokens) -> mapped to ErnieMultiTypeMoE.text_moe_layer (standard MoELayer)
vision_moe: 64 experts with intermediate_size=512 (vision tokens) -> mapped to ErnieMultiTypeMoE.vision_moe_layer (standard MoELayer)
shared_experts: 2 shared experts with intermediate_size=3072 -> mapped to ErnieMultiTypeMoE.shared_experts
Vision encoder weights:
HF ViT (use_mg_vit=False): replicated across TP ranks via ReplicatedMapping
MG ViT (use_mg_vit=True): TP-sharded with ConcatenatedQKVMapping for fused QKV
Resampler / projector weights (replicated across TP ranks)
3D M-RoPE position embedding configuration
The ErnieMultiTypeMoE module contains two separate MoELayer instances (one per expert pool), each with its own router and SequentialMLP experts. This gives both pools full TP support through standard Megatron-Core infrastructure.
HF on-disk (safetensors) keys, after _checkpoint_conversion_mapping reversal:
model.layers.{i}.mlp.gate.weight (text router)
model.layers.{i}.mlp.gate.weight_1 (vision router)
model.layers.{i}.mlp.moe_statics.e_score_correction_bias (concat text+vision)
model.layers.{i}.mlp.experts.{j}.gate_proj.weight (j=0..N-1 text, j=N..2N-1 vision)
model.layers.{i}.mlp.experts.{j}.up_proj.weight
model.layers.{i}.mlp.experts.{j}.down_proj.weight
model.layers.{i}.mlp.shared_experts.{gate,up,down}_proj.weight
model.vision_model.**
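The flat expert indexing in the key list above (j=0..N-1 for text, j=N..2N-1 for vision) can be inverted with a small helper. This is a minimal sketch for illustration only; `split_flat_expert_index` is a hypothetical name and is not part of the bridge.

```python
from typing import Tuple


def split_flat_expert_index(j: int, num_experts_per_pool: int) -> Tuple[str, int]:
    """Map a flat on-disk expert index to (pool name, index within that pool).

    Hypothetical helper: text experts occupy 0..N-1 and vision experts
    occupy N..2N-1, where N is num_experts_per_pool.
    """
    if not 0 <= j < 2 * num_experts_per_pool:
        raise ValueError(f"expert index {j} is outside 0..{2 * num_experts_per_pool - 1}")
    if j < num_experts_per_pool:
        return "text", j
    return "vision", j - num_experts_per_pool
```

With N=64, index 63 is the last text expert and index 64 is the first vision expert.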
Megatron Weight Naming (per-expert SequentialMLP within ErnieMultiTypeMoE):
language_model.decoder.layers.{i}.mlp.text_moe_layer.router.weight
language_model.decoder.layers.{i}.mlp.text_moe_layer.router.expert_bias
language_model.decoder.layers.{i}.mlp.text_moe_layer.experts.local_experts.{j}.linear_fc1.weight
language_model.decoder.layers.{i}.mlp.text_moe_layer.experts.local_experts.{j}.linear_fc2.weight
language_model.decoder.layers.{i}.mlp.vision_moe_layer.router.weight
language_model.decoder.layers.{i}.mlp.vision_moe_layer.router.expert_bias
language_model.decoder.layers.{i}.mlp.vision_moe_layer.experts.local_experts.{j}.linear_fc1.weight
language_model.decoder.layers.{i}.mlp.vision_moe_layer.experts.local_experts.{j}.linear_fc2.weight
language_model.decoder.layers.{i}.mlp.shared_experts.linear_fc1.weight
language_model.decoder.layers.{i}.mlp.shared_experts.linear_fc2.weight
MG-native ViT Weight Naming (use_mg_vit=True, TP-sharded):
vision_model.decoder.layers.{i}.self_attention.linear_qkv.weight (fused QKV, ConcatenatedQKVMapping)
vision_model.decoder.layers.{i}.self_attention.linear_qkv.bias
vision_model.decoder.layers.{i}.self_attention.linear_qkv.layer_norm_weight (fused norm1)
vision_model.decoder.layers.{i}.self_attention.linear_qkv.layer_norm_bias
vision_model.decoder.layers.{i}.self_attention.linear_proj.weight
vision_model.decoder.layers.{i}.self_attention.linear_proj.bias
vision_model.decoder.layers.{i}.mlp.linear_fc1.weight
vision_model.decoder.layers.{i}.mlp.linear_fc1.bias
vision_model.decoder.layers.{i}.mlp.linear_fc1.layer_norm_weight (fused norm2)
vision_model.decoder.layers.{i}.mlp.linear_fc1.layer_norm_bias
vision_model.decoder.layers.{i}.mlp.linear_fc2.weight
vision_model.decoder.layers.{i}.mlp.linear_fc2.bias
vision_model.patch_embed.proj.weight (replicated)
vision_model.decoder.final_layernorm.weight
vision_model.decoder.final_layernorm.bias
Note on Expert Parallelism:
EP>1 is supported for dual-pool MoE. The bridge handles the expert offset
between text and vision pools correctly: text experts use indices 0..N-1 and
vision experts use N..2N-1 in HF on-disk format. The framework's
_megatron_local_name_to_global function handles SequentialMLP-style expert
numbering, and gather_from_ep_ranks preserves pool offsets when
reconstructing HF parameter names during export.
Module Contents#
Classes#
_OffsetGatedMLPMapping | GatedMLPMapping with expert index offset for vision pool.
_OffsetRowParallelMapping | RowParallelMapping with expert index offset for vision pool.
_ConcatBiasMapping | Mapping for the concatenated text+vision expert bias tensor.
Ernie45VLBridge | Megatron Bridge for ERNIE 4.5 VL MoE Conditional Generation.
Functions#
_offset_gather_from_ep_ranks | EP all-gather with pool offset for dual-pool MoE vision experts.
_resolve_with_offset | Resolve wildcard captures, shifting the 2nd capture (expert index) for HF side.
Data#
API#
- bridge.models.ernie_vl.ernie45_vl_bridge.logger#
'getLogger(…)'
- bridge.models.ernie_vl.ernie45_vl_bridge._ERNIE45_VL_MOE_HF_CLASS_NAME#
'Ernie4_5_VLMoeForConditionalGeneration'
- bridge.models.ernie_vl.ernie45_vl_bridge._offset_gather_from_ep_ranks(
- mapping,
- megatron_weights: Optional[torch.Tensor],
- megatron_module,
- hf_param_name: Optional[str] = None,
EP all-gather with pool offset for dual-pool MoE vision experts.
Per EP rank i the HF expert index is: expert_offset + local_expert_number + num_experts_per_rank * i
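The index formula above can be checked with a few lines of arithmetic. The sketch below is illustrative only; `hf_expert_indices` and `ep_size` are hypothetical names, and the real function also gathers tensors across ranks, which is not shown here.

```python
from typing import List


def hf_expert_indices(
    expert_offset: int,
    local_expert_number: int,
    num_experts_per_rank: int,
    ep_size: int,
) -> List[int]:
    """Return the HF expert index held by each EP rank for one local expert slot.

    Implements: expert_offset + local_expert_number + num_experts_per_rank * i
    for EP rank i in 0..ep_size-1.
    """
    return [
        expert_offset + local_expert_number + num_experts_per_rank * rank
        for rank in range(ep_size)
    ]
```

For example, with 64 experts per pool and an assumed EP size of 4 (16 experts per rank), local expert 3 of the vision pool (offset 64) corresponds to HF indices 67, 83, 99, and 115, while the same slot in the text pool (offset 0) corresponds to 3, 19, 35, and 51.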
- bridge.models.ernie_vl.ernie45_vl_bridge._resolve_with_offset(
- megatron_pattern: str,
- hf_pattern,
- captures: Tuple[str, ...],
- expert_offset: int,
Resolve wildcard captures, shifting the 2nd capture (expert index) for HF side.
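One way to picture this resolution step is shown below. It is a simplified sketch, not the bridge's code: it assumes both patterns are plain strings with `*` wildcards (the real `hf_pattern` is untyped and may take other forms), and `resolve_with_offset_sketch` is a hypothetical name.

```python
from typing import Tuple


def _fill(pattern: str, captures: Tuple[str, ...]) -> str:
    """Replace each '*' in the pattern, left to right, with the next capture."""
    for capture in captures:
        pattern = pattern.replace("*", capture, 1)
    return pattern


def resolve_with_offset_sketch(
    megatron_pattern: str,
    hf_pattern: str,
    captures: Tuple[str, ...],
    expert_offset: int,
) -> Tuple[str, str]:
    """Resolve both patterns; only the HF side gets the shifted expert index."""
    hf_captures = list(captures)
    if len(hf_captures) >= 2:
        # The 2nd capture is the expert index; shift it for the HF name only.
        hf_captures[1] = str(int(hf_captures[1]) + expert_offset)
    return _fill(megatron_pattern, captures), _fill(hf_pattern, tuple(hf_captures))
```

Under these assumptions, captures `("3", "5")` with an offset of 64 leave the Megatron name at local expert 5 and produce expert 69 in the HF name.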
- class bridge.models.ernie_vl.ernie45_vl_bridge._OffsetGatedMLPMapping(
- megatron_param: str,
- gate: str,
- up: str,
- expert_offset: int = 0,
Bases:
megatron.bridge.models.conversion.param_mapping.GatedMLPMapping
GatedMLPMapping with expert index offset for vision pool.
Handles both directions:
resolve(): shifts expert index for HF side only
gather_from_ep_ranks(): reconstructs offset HF indices during EP export
Initialization
- resolve(captures: Tuple[str, ...])#
- gather_from_ep_ranks(
- megatron_weights,
- megatron_module,
- hf_param_name=None,
- class bridge.models.ernie_vl.ernie45_vl_bridge._OffsetRowParallelMapping(
- megatron_param: str,
- hf_param: str,
- expert_offset: int = 0,
Bases:
megatron.bridge.models.conversion.param_mapping.RowParallelMapping
RowParallelMapping with expert index offset for vision pool.
Used for vision expert down_proj (linear_fc2), which is always row-parallel in SequentialMLP. Using explicit RowParallelMapping avoids the AutoMapping delegation issue where the delegate's gather_from_ep_ranks bypasses offset logic.
Initialization
- resolve(captures: Tuple[str, ...])#
- gather_from_ep_ranks(
- megatron_weights,
- megatron_module,
- hf_param_name=None,
- class bridge.models.ernie_vl.ernie45_vl_bridge._ConcatBiasMapping(
- megatron_param: str,
- hf_param: str,
- slice_name: str,
- num_experts: int,
Bases:
megatron.bridge.models.conversion.param_mapping.AutoMapping
Mapping for the concatenated text+vision expert bias tensor.
The on-disk HF format stores a single moe_statics.e_score_correction_bias tensor of shape [2, num_experts], where row 0 is the text pool bias and row 1 is the vision pool bias. This mapping extracts the appropriate row based on slice_name.
For export (megatron_to_hf), the text mapping buffers its bias in a class-level dict keyed by resolved HF param name. The vision mapping retrieves the buffered text bias, stacks the two into [2, N], and returns the merged tensor. This ensures only one entry per HF key.
Initialization
- _export_buffer: Dict[str, torch.Tensor]#
None
- classmethod clear_export_buffer()#
Remove any stale entries from the class-level export buffer.
- resolve(captures: Tuple[str, ...])#
- hf_to_megatron(hf_weights, megatron_module)#
Extract the text or vision slice from the concatenated bias.
On-disk shape is [2, num_experts]: row 0 = text, row 1 = vision.
- megatron_to_hf(megatron_weights, megatron_module)#
Export text+vision expert bias as concatenated [2, N] tensor.
The text mapping buffers its bias; the vision mapping retrieves it and stacks the two into [2, N]. If the text bias is not yet buffered (which shouldn't happen in practice), the mapping falls back to exporting its bias as-is.
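The import slicing and the buffered export described for this mapping can be sketched without tensors. The example below uses nested Python lists as a stand-in for `[2, N]` tensors; `ConcatBiasSketch`, `import_slice`, and `export` are hypothetical names, and returning an empty dict from the text side is an assumption made to show the "one entry per HF key" behaviour.

```python
from typing import Dict, List, Union

Bias = List[float]


class ConcatBiasSketch:
    """Toy model of the text/vision bias split and merge (not the real class)."""

    # Class-level buffer shared by the text and vision instances.
    _export_buffer: Dict[str, Bias] = {}

    def __init__(self, slice_name: str) -> None:
        if slice_name not in ("text", "vision"):
            raise ValueError(f"unknown slice_name: {slice_name}")
        self.slice_name = slice_name

    @classmethod
    def clear_export_buffer(cls) -> None:
        cls._export_buffer.clear()

    def import_slice(self, hf_bias: List[Bias]) -> Bias:
        """Pick row 0 (text) or row 1 (vision) from the [2, N] on-disk bias."""
        return list(hf_bias[0 if self.slice_name == "text" else 1])

    def export(self, hf_key: str, bias: Bias) -> Dict[str, Union[Bias, List[Bias]]]:
        if self.slice_name == "text":
            # Text side: buffer the bias and emit nothing yet (assumption).
            self._export_buffer[hf_key] = list(bias)
            return {}
        text_bias = self._export_buffer.pop(hf_key, None)
        if text_bias is None:
            # Fallback: no buffered text bias, export the vision bias as-is.
            return {hf_key: list(bias)}
        return {hf_key: [text_bias, list(bias)]}
```

Clearing the buffer before each export run, as stream_weights_megatron_to_hf is documented to do, keeps a stale text bias from an earlier run from being merged into a new one.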
- class bridge.models.ernie_vl.ernie45_vl_bridge.Ernie45VLBridge#
Bases:
megatron.bridge.models.conversion.model_bridge.MegatronModelBridge
Megatron Bridge for ERNIE 4.5 VL MoE Conditional Generation.
This bridge handles the conversion between HuggingFace Ernie4_5_VLMoeForConditionalGeneration and Megatron-Core Ernie45VLModel formats, including weight mappings and configuration translation for this vision-language MoE model.
Key architectural features handled:
Heterogeneous dual-pool MoE via ErnieMultiTypeMoE:
text_moe_layer: standard Megatron MoELayer (TP support)
vision_moe_layer: standard Megatron MoELayer (TP support)
Shared experts across modalities
3D Multimodal RoPE (M-RoPE)
Variable-resolution vision resampler (spatial + temporal merging)
GQA with configurable query/KV heads
HF on-disk per-expert weights <-> Megatron per-expert SequentialMLP weights
Example:
from megatron.bridge import AutoBridge
bridge = AutoBridge.from_hf_pretrained("baidu/ERNIE-4.5-VL-28B-A3B-Instruct")
provider = bridge.to_megatron_provider()
- static _get_text_config(hf_config)#
Extract the text/language config from either nested or flat HF config.
The transformers-builtin Ernie4_5_VLMoeConfig (model_type=ernie4_5_vl_moe) uses a nested text_config sub-object, while the custom auto_map config Ernie4_5_VLMoEConfig (model_type=ernie4_5_moe_vl, e.g. the Thinking model) uses a flat layout where all LLM fields live directly on the top-level config.
Returns the appropriate config object (nested text_config or the config itself).
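The nested-versus-flat lookup can be pictured as follows. This is a minimal sketch under the assumption that the nested layout exposes a `text_config` attribute and the flat layout does not; `get_text_config_sketch` is a hypothetical name, and `hidden_size` is used only as an illustrative field.

```python
from types import SimpleNamespace


def get_text_config_sketch(hf_config):
    """Return the nested text_config if present, otherwise the config itself."""
    text_config = getattr(hf_config, "text_config", None)
    return text_config if text_config is not None else hf_config


# Stand-ins for the two layouts; real configs carry many more fields.
nested = SimpleNamespace(text_config=SimpleNamespace(hidden_size=8))
flat = SimpleNamespace(hidden_size=8)
```

Either way the caller reads LLM fields from the returned object, so the rest of the conversion does not need to know which layout was loaded.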
- static _get_num_experts(text_config) int#
Extract the per-pool number of experts as an int.
The nested config stores moe_num_experts as a plain int (e.g. 4), while the flat/Thinking config stores it as a list [64, 64] (text pool, vision pool; both values are always equal).
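Normalizing the two representations to a single int could look like the sketch below. It assumes only what is stated above (an int, or a list whose entries are equal); `get_num_experts_sketch` is a hypothetical name, and taking the first list entry is an assumption.

```python
from types import SimpleNamespace


def get_num_experts_sketch(text_config) -> int:
    """Return the per-pool expert count from an int or a [text, vision] list."""
    value = text_config.moe_num_experts
    if isinstance(value, (list, tuple)):
        if not value:
            raise ValueError("moe_num_experts list is empty")
        # Both pools are documented as equal, so the first entry is used.
        return int(value[0])
    return int(value)
```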
- provider_bridge(
- hf_pretrained: megatron.bridge.models.hf_pretrained.vlm.PreTrainedVLM,
Create an Ernie45VLModelProvider from a HuggingFace pretrained model.
Maps HuggingFace Ernie4_5_VLMoeConfig fields to Megatron provider parameters, including vision config, MoE settings, M-RoPE sections, and token IDs.
Supports both nested config (transformers builtin, model_type=ernie4_5_vl_moe) and flat config (auto_map custom, model_type=ernie4_5_moe_vl).
- Parameters:
hf_pretrained: HuggingFace pretrained VLM model.
- Returns:
Ernie45VLModelProvider configured with the HF model's parameters.
- stream_weights_megatron_to_hf(*args, **kwargs)#
Override to clear the _ConcatBiasMapping export buffer before each run.
- mapping_registry() megatron.bridge.models.conversion.mapping_registry.MegatronMappingRegistry#
Return MegatronMappingRegistry with parameter mappings for ERNIE 4.5 VL MoE.
Uses the HF on-disk (safetensors) key format, which differs from the in-memory state_dict() format due to HuggingFace's _checkpoint_conversion_mapping.
On-disk format:
No language_model. prefix: model.layers.* not model.language_model.layers.*
Per-expert flat-indexed weights: experts.{j}.gate_proj.weight
Text experts indices 0..N-1, vision experts indices N..2N-1
Single gate.weight (text router) and gate.weight_1 (vision router)
Concatenated moe_statics.e_score_correction_bias for text+vision
model.vision_model.** not model.vision_tower.**
Resampler: spatial_linear.0/2/3 not spatial_linear.fc1/fc2/ln (same for temporal_linear)
- Returns:
MegatronMappingRegistry with all parameter mappings.
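Two of the on-disk differences listed for mapping_registry are simple prefix changes, which the sketch below illustrates. It covers only the `language_model.` and `vision_tower` renames named above, ignores the expert, router, bias, and resampler differences, and is not how the bridge or HuggingFace performs the conversion; `to_on_disk_key` is a hypothetical name and the key suffixes in the usage note are made up for illustration.

```python
# (in-memory prefix, on-disk prefix) pairs taken from the list above.
_PREFIX_RENAMES = (
    ("model.language_model.", "model."),
    ("model.vision_tower.", "model.vision_model."),
)


def to_on_disk_key(in_memory_key: str) -> str:
    """Rewrite an in-memory state_dict key to its on-disk prefix form."""
    for old_prefix, new_prefix in _PREFIX_RENAMES:
        if in_memory_key.startswith(old_prefix):
            return new_prefix + in_memory_key[len(old_prefix):]
    return in_memory_key
```

For instance, a key beginning with `model.language_model.layers.0.` would become one beginning with `model.layers.0.`, and keys that match neither prefix pass through unchanged.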