bridge.models.ernie_vl.ernie45_vl_bridge#

Megatron Bridge for ERNIE 4.5 VL MoE (Vision-Language with Mixture of Experts).

This bridge handles conversion between HuggingFace Ernie4_5_VLMoeForConditionalGeneration and Megatron-Core Ernie45VLModel formats, including:

  • Language model weights with heterogeneous dual-pool MoE:

    • text_moe: 64 experts with intermediate_size=1536 (text tokens) -> mapped to ErnieMultiTypeMoE.text_moe_layer (standard MoELayer)

    • vision_moe: 64 experts with intermediate_size=512 (vision tokens) -> mapped to ErnieMultiTypeMoE.vision_moe_layer (standard MoELayer)

    • shared_experts: 2 shared experts with intermediate_size=3072 -> mapped to ErnieMultiTypeMoE.shared_experts

  • Vision encoder weights:

    • HF ViT (use_mg_vit=False): replicated across TP ranks via ReplicatedMapping

    • MG ViT (use_mg_vit=True): TP-sharded with ConcatenatedQKVMapping for fused QKV

  • Resampler / projector weights (replicated across TP ranks)

  • 3D M-RoPE position embedding configuration

The ErnieMultiTypeMoE module contains two separate MoELayer instances (one per expert pool), each with its own router and SequentialMLP experts. This gives both pools full TP support through standard Megatron-Core infrastructure.

HF on-disk (safetensors) keys, after _checkpoint_conversion_mapping reversal:

  • model.layers.{i}.mlp.gate.weight (text router)

  • model.layers.{i}.mlp.gate.weight_1 (vision router)

  • model.layers.{i}.mlp.moe_statics.e_score_correction_bias (concat text+vision)

  • model.layers.{i}.mlp.experts.{j}.gate_proj.weight (j=0..N-1 text, j=N..2N-1 vision)

  • model.layers.{i}.mlp.experts.{j}.up_proj.weight

  • model.layers.{i}.mlp.experts.{j}.down_proj.weight

  • model.layers.{i}.mlp.shared_experts.{gate,up,down}_proj.weight

  • model.vision_model.**

Megatron Weight Naming (per-expert SequentialMLP within ErnieMultiTypeMoE):

  • language_model.decoder.layers.{i}.mlp.text_moe_layer.router.weight

  • language_model.decoder.layers.{i}.mlp.text_moe_layer.router.expert_bias

  • language_model.decoder.layers.{i}.mlp.text_moe_layer.experts.local_experts.{j}.linear_fc1.weight

  • language_model.decoder.layers.{i}.mlp.text_moe_layer.experts.local_experts.{j}.linear_fc2.weight

  • language_model.decoder.layers.{i}.mlp.vision_moe_layer.router.weight

  • language_model.decoder.layers.{i}.mlp.vision_moe_layer.router.expert_bias

  • language_model.decoder.layers.{i}.mlp.vision_moe_layer.experts.local_experts.{j}.linear_fc1.weight

  • language_model.decoder.layers.{i}.mlp.vision_moe_layer.experts.local_experts.{j}.linear_fc2.weight

  • language_model.decoder.layers.{i}.mlp.shared_experts.linear_fc1.weight

  • language_model.decoder.layers.{i}.mlp.shared_experts.linear_fc2.weight

MG-native ViT Weight Naming (use_mg_vit=True, TP-sharded):

  • vision_model.decoder.layers.{i}.self_attention.linear_qkv.weight (fused QKV, ConcatenatedQKVMapping)

  • vision_model.decoder.layers.{i}.self_attention.linear_qkv.bias

  • vision_model.decoder.layers.{i}.self_attention.linear_qkv.layer_norm_weight (fused norm1)

  • vision_model.decoder.layers.{i}.self_attention.linear_qkv.layer_norm_bias

  • vision_model.decoder.layers.{i}.self_attention.linear_proj.weight

  • vision_model.decoder.layers.{i}.self_attention.linear_proj.bias

  • vision_model.decoder.layers.{i}.mlp.linear_fc1.weight

  • vision_model.decoder.layers.{i}.mlp.linear_fc1.bias

  • vision_model.decoder.layers.{i}.mlp.linear_fc1.layer_norm_weight (fused norm2)

  • vision_model.decoder.layers.{i}.mlp.linear_fc1.layer_norm_bias

  • vision_model.decoder.layers.{i}.mlp.linear_fc2.weight

  • vision_model.decoder.layers.{i}.mlp.linear_fc2.bias

  • vision_model.patch_embed.proj.weight (replicated)

  • vision_model.decoder.final_layernorm.weight

  • vision_model.decoder.final_layernorm.bias

Note on Expert Parallelism: EP>1 is supported for dual-pool MoE. The bridge handles the expert offset between text and vision pools correctly: text experts use indices 0..N-1 and vision experts use N..2N-1 in HF on-disk format. The framework’s _megatron_local_name_to_global function handles SequentialMLP-style expert numbering, and gather_from_ep_ranks preserves pool offsets when reconstructing HF parameter names during export.
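The flat index convention described above (text experts at 0..N-1, vision experts at N..2N-1) can be illustrated with a small sketch. The helper names `hf_expert_index` and `split_hf_expert_index` are hypothetical and are not part of the bridge; they only show the arithmetic.

```python
from typing import Tuple


def hf_expert_index(pool: str, local_index: int, num_experts: int) -> int:
    """Flat HF on-disk expert index for an expert in the given pool (illustrative helper)."""
    if not 0 <= local_index < num_experts:
        raise ValueError(f"local_index {local_index} out of range for {num_experts} experts")
    # Text experts start at 0, vision experts start right after the text pool.
    offset = {"text": 0, "vision": num_experts}[pool]
    return offset + local_index


def split_hf_expert_index(hf_index: int, num_experts: int) -> Tuple[str, int]:
    """Inverse mapping: flat HF index -> (pool name, index within that pool)."""
    if not 0 <= hf_index < 2 * num_experts:
        raise ValueError(f"hf_index {hf_index} out of range for two pools of {num_experts}")
    pool = "text" if hf_index < num_experts else "vision"
    return pool, hf_index % num_experts
```

With N=64, the first vision expert lands at flat index 64 and the last at 127.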

Module Contents#

Classes#

_OffsetGatedMLPMapping

GatedMLPMapping with expert index offset for vision pool.

_OffsetRowParallelMapping

RowParallelMapping with expert index offset for vision pool.

_ConcatBiasMapping

Mapping for the concatenated text+vision expert bias tensor.

Ernie45VLBridge

Megatron Bridge for ERNIE 4.5 VL MoE Conditional Generation.

Functions#

_offset_gather_from_ep_ranks

EP all-gather with pool offset for dual-pool MoE vision experts.

_resolve_with_offset

Resolve wildcard captures, shifting the 2nd capture (expert index) for HF side.

Data#

API#

bridge.models.ernie_vl.ernie45_vl_bridge.logger#

‘getLogger()’

bridge.models.ernie_vl.ernie45_vl_bridge._ERNIE45_VL_MOE_HF_CLASS_NAME#

‘Ernie4_5_VLMoeForConditionalGeneration’

bridge.models.ernie_vl.ernie45_vl_bridge._offset_gather_from_ep_ranks(
mapping,
megatron_weights: Optional[torch.Tensor],
megatron_module,
hf_param_name: Optional[str] = None,
) -> Dict[str, torch.Tensor]#

EP all-gather with pool offset for dual-pool MoE vision experts.

Per EP rank i the HF expert index is: expert_offset + local_expert_number + num_experts_per_rank * i
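As a worked illustration of this formula, the sketch below lists the HF expert indices owned by one EP rank. The function name `hf_expert_indices_for_rank` is hypothetical, and the sketch assumes experts are split evenly and contiguously across EP ranks.

```python
from typing import List


def hf_expert_indices_for_rank(
    ep_rank: int,
    num_experts_per_rank: int,
    expert_offset: int = 0,
) -> List[int]:
    """HF expert indices held by one EP rank (illustrative, assumes an even contiguous split)."""
    return [
        expert_offset + local_expert_number + num_experts_per_rank * ep_rank
        for local_expert_number in range(num_experts_per_rank)
    ]
```

For example, with 64 experts per pool and EP=4 (16 experts per rank), rank 1 holds text experts 16..31 (offset 0) and vision experts 80..95 (offset 64).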

bridge.models.ernie_vl.ernie45_vl_bridge._resolve_with_offset(
megatron_pattern: str,
hf_pattern,
captures: Tuple[str, ...],
expert_offset: int,
) -> Tuple[str, ...]#

Resolve wildcard captures, shifting the 2nd capture (expert index) for HF side.
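A minimal sketch of the idea, assuming single `*` wildcards, a plain string `hf_pattern`, and a two-element return value; the real function may handle other pattern types and return shapes. `fill_wildcards` and `resolve_with_offset_sketch` are hypothetical names.

```python
from typing import Tuple


def fill_wildcards(pattern: str, captures: Tuple[str, ...]) -> str:
    """Replace each '*' in the pattern, left to right, with the next capture."""
    resolved = pattern
    for value in captures:
        resolved = resolved.replace("*", value, 1)
    return resolved


def resolve_with_offset_sketch(
    megatron_pattern: str,
    hf_pattern: str,
    captures: Tuple[str, ...],
    expert_offset: int,
) -> Tuple[str, str]:
    """Resolve both patterns; only the HF side gets the shifted expert index."""
    shifted = list(captures)
    if len(shifted) >= 2:
        # The 2nd capture is the expert index; shift it into the vision range.
        shifted[1] = str(int(shifted[1]) + expert_offset)
    return fill_wildcards(megatron_pattern, captures), fill_wildcards(hf_pattern, tuple(shifted))
```

The Megatron name keeps the pool-local expert index, while the HF name uses the flat index, so local vision expert 5 with an offset of 64 resolves to HF expert 69.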

class bridge.models.ernie_vl.ernie45_vl_bridge._OffsetGatedMLPMapping(
megatron_param: str,
gate: str,
up: str,
expert_offset: int = 0,
)#

Bases: megatron.bridge.models.conversion.param_mapping.GatedMLPMapping

GatedMLPMapping with expert index offset for vision pool.

Handles both directions:

  • resolve(): shifts expert index for HF side only

  • gather_from_ep_ranks(): reconstructs offset HF indices during EP export

Initialization

resolve(captures: Tuple[str, ...])#

gather_from_ep_ranks(
megatron_weights,
megatron_module,
hf_param_name=None,
)#

class bridge.models.ernie_vl.ernie45_vl_bridge._OffsetRowParallelMapping(
megatron_param: str,
hf_param: str,
expert_offset: int = 0,
)#

Bases: megatron.bridge.models.conversion.param_mapping.RowParallelMapping

RowParallelMapping with expert index offset for vision pool.

Used for the vision expert down_proj (linear_fc2), which is always row-parallel in SequentialMLP. Using an explicit RowParallelMapping avoids the AutoMapping delegation issue, in which the delegate’s gather_from_ep_ranks bypasses the offset logic.

Initialization

resolve(captures: Tuple[str, ...])#

gather_from_ep_ranks(
megatron_weights,
megatron_module,
hf_param_name=None,
)#

class bridge.models.ernie_vl.ernie45_vl_bridge._ConcatBiasMapping(
megatron_param: str,
hf_param: str,
slice_name: str,
num_experts: int,
)#

Bases: megatron.bridge.models.conversion.param_mapping.AutoMapping

Mapping for the concatenated text+vision expert bias tensor.

The on-disk HF format stores a single moe_statics.e_score_correction_bias tensor of shape [2, num_experts] where row 0 is the text pool bias and row 1 is the vision pool bias. This mapping extracts the appropriate row based on slice_name.

For export (megatron_to_hf), the text mapping buffers its bias in a class-level dict keyed by resolved HF param name. The vision mapping retrieves the buffered text bias, stacks them into [2, N], and returns the merged tensor. This ensures only one entry per HF key.

Initialization

_export_buffer: Dict[str, torch.Tensor]#

None

classmethod clear_export_buffer()#

Remove any stale entries from the class-level export buffer.

resolve(captures: Tuple[str, ...])#

hf_to_megatron(hf_weights, megatron_module)#

Extract the text or vision slice from the concatenated bias.

On-disk shape is [2, num_experts]: row 0 = text, row 1 = vision.

megatron_to_hf(megatron_weights, megatron_module)#

Export text+vision expert bias as concatenated [2, N] tensor.

The text mapping buffers its bias; the vision mapping retrieves it and stacks the two into [2, N]. If the text bias has not been buffered yet (which should not happen in practice), the mapping falls back to exporting the tensor as-is.
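The buffer-then-merge pattern described for _ConcatBiasMapping can be sketched as follows. This is a simplified illustration, not the bridge's implementation: it uses plain Python lists instead of torch tensors, the class `ConcatBiasExportSketch` and function `split_bias` are hypothetical, and it assumes the text mapping emits nothing on its own.

```python
from typing import Dict, List

Bias = List[float]


class ConcatBiasExportSketch:
    # Class-level buffer keyed by resolved HF param name, shared by text and vision mappings.
    _export_buffer: Dict[str, Bias] = {}

    @classmethod
    def clear_export_buffer(cls) -> None:
        """Drop stale entries, e.g. before starting a new export run."""
        cls._export_buffer.clear()

    @classmethod
    def export(cls, hf_key: str, slice_name: str, bias: Bias) -> Dict[str, object]:
        if slice_name == "text":
            # Assumption: the text side only buffers, so the HF key is emitted once.
            cls._export_buffer[hf_key] = bias
            return {}
        text_bias = cls._export_buffer.get(hf_key)
        if text_bias is None:
            # Fallback: no buffered text bias, export the vision bias as-is.
            return {hf_key: bias}
        # Row 0 = text pool, row 1 = vision pool, i.e. shape [2, N].
        return {hf_key: [text_bias, bias]}


def split_bias(stacked: List[Bias], slice_name: str) -> Bias:
    """Import direction: pick row 0 for text, row 1 for vision."""
    return stacked[0] if slice_name == "text" else stacked[1]
```

Because the buffer lives at class level, it persists between export runs unless it is cleared, which is the reason a clear_export_buffer hook is useful.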

class bridge.models.ernie_vl.ernie45_vl_bridge.Ernie45VLBridge#

Bases: megatron.bridge.models.conversion.model_bridge.MegatronModelBridge

Megatron Bridge for ERNIE 4.5 VL MoE Conditional Generation.

This bridge handles the conversion between HuggingFace Ernie4_5_VLMoeForConditionalGeneration and Megatron-Core Ernie45VLModel formats, including weight mappings and configuration translation for this vision-language MoE model.

Key architectural features handled:

  • Heterogeneous dual-pool MoE via ErnieMultiTypeMoE:

    • text_moe_layer: standard Megatron MoELayer (TP support)

    • vision_moe_layer: standard Megatron MoELayer (TP support)

  • Shared experts across modalities

  • 3D Multimodal RoPE (M-RoPE)

  • Variable-resolution vision resampler (spatial + temporal merging)

  • GQA with configurable query/KV heads

  • HF on-disk per-expert weights <-> Megatron per-expert SequentialMLP weights

Example:

    from megatron.bridge import AutoBridge

    bridge = AutoBridge.from_hf_pretrained("baidu/ERNIE-4.5-VL-28B-A3B-Instruct")
    provider = bridge.to_megatron_provider()

static _get_text_config(hf_config)#

Extract the text/language config from either nested or flat HF config.

The transformers-builtin Ernie4_5_VLMoeConfig (model_type=ernie4_5_vl_moe) uses a nested text_config sub-object, while the custom auto_map config Ernie4_5_VLMoEConfig (model_type=ernie4_5_moe_vl, e.g. the Thinking model) uses a flat layout where all LLM fields live directly on the top-level config.

Returns the appropriate config object (nested text_config or the config itself).
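One plausible way to express the nested-versus-flat selection is shown below. It is a sketch under the assumption that the presence of a `text_config` attribute distinguishes the two layouts; the function name `get_text_config_sketch` and the stand-in config objects are hypothetical.

```python
from types import SimpleNamespace


def get_text_config_sketch(hf_config):
    """Return the nested text_config if present, otherwise the config itself."""
    nested = getattr(hf_config, "text_config", None)
    return nested if nested is not None else hf_config


# Stand-ins for the two layouts (field values are illustrative only).
nested_cfg = SimpleNamespace(
    model_type="ernie4_5_vl_moe",
    text_config=SimpleNamespace(hidden_size=2560),
)
flat_cfg = SimpleNamespace(model_type="ernie4_5_moe_vl", hidden_size=2560)
```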

static _get_num_experts(text_config) -> int#

Extract the per-pool number of experts as an int.

The nested config stores moe_num_experts as a plain int (e.g. 4), while the flat/Thinking config stores it as a list [64, 64] (text pool, vision pool – both values are always equal).
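The int-or-list normalization could look roughly like this. The name `get_num_experts_sketch` is hypothetical, and raising on unequal pool sizes is an assumption added for illustration; the source only states that both values are always equal.

```python
from types import SimpleNamespace


def get_num_experts_sketch(text_config) -> int:
    """Normalize moe_num_experts (int or [text, vision] list) to a per-pool int."""
    value = text_config.moe_num_experts
    if isinstance(value, (list, tuple)):
        # Assumption: reject unequal pools rather than silently picking one.
        if len(set(value)) != 1:
            raise ValueError(f"expected equal pool sizes, got {value!r}")
        return int(value[0])
    return int(value)
```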

provider_bridge(
hf_pretrained: megatron.bridge.models.hf_pretrained.vlm.PreTrainedVLM,
) -> megatron.bridge.models.ernie_vl.ernie45_vl_provider.Ernie45VLModelProvider#

Create an Ernie45VLModelProvider from a HuggingFace pretrained model.

Maps HuggingFace Ernie4_5_VLMoeConfig fields to Megatron provider parameters, including vision config, MoE settings, M-RoPE sections, and token IDs.

Supports both nested config (transformers builtin, model_type=ernie4_5_vl_moe) and flat config (auto_map custom, model_type=ernie4_5_moe_vl).

Parameters:

hf_pretrained – HuggingFace pretrained VLM model.

Returns:

Ernie45VLModelProvider configured with the HF model’s parameters.

stream_weights_megatron_to_hf(*args, **kwargs)#

Override to clear the _ConcatBiasMapping export buffer before each run.

mapping_registry() -> megatron.bridge.models.conversion.mapping_registry.MegatronMappingRegistry#

Return MegatronMappingRegistry with parameter mappings for ERNIE 4.5 VL MoE.

Uses the HF on-disk (safetensors) key format, which differs from the in-memory state_dict() format due to HuggingFace’s _checkpoint_conversion_mapping.

On-disk format:

  • No language_model. prefix: model.layers.* not model.language_model.layers.*

  • Per-expert flat-indexed weights: experts.{j}.gate_proj.weight

  • Text experts indices 0..N-1, vision experts indices N..2N-1

  • Single gate.weight (text router) and gate.weight_1 (vision router)

  • Concatenated moe_statics.e_score_correction_bias for text+vision

  • model.vision_model.** not model.vision_tower.**

  • Resampler: spatial_linear.0/2/3 not spatial_linear.fc1/fc2/ln (same for temporal_linear)

Returns:

MegatronMappingRegistry with all parameter mappings.
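To make two of the on-disk format differences listed for mapping_registry concrete, the sketch below translates in-memory key prefixes to their on-disk form. It is illustrative only: `in_memory_to_on_disk_key` is a hypothetical helper, it covers just the `language_model.layers` and `vision_tower` prefixes named in the list, and it does not handle expert, router, bias, or resampler renames.

```python
from typing import Tuple

# (in-memory prefix, on-disk prefix) pairs taken from the format notes.
_PREFIX_RENAMES: Tuple[Tuple[str, str], ...] = (
    ("model.language_model.layers.", "model.layers."),
    ("model.vision_tower.", "model.vision_model."),
)


def in_memory_to_on_disk_key(key: str) -> str:
    """Rewrite a known in-memory prefix to its on-disk form; leave other keys unchanged."""
    for src, dst in _PREFIX_RENAMES:
        if key.startswith(src):
            return dst + key[len(src):]
    return key
```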