bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer#
Custom MoE modules for ERNIE 4.5 VL MoE dual-pool architecture.
ERNIE 4.5 VL uses a heterogeneous dual-pool MoE in which every transformer layer except layer 0 (which is dense) contains:
text_moe_layer: 64 experts with intermediate_size=1536 for text tokens
vision_moe_layer: 64 experts with intermediate_size=512 for vision tokens
shared_experts: 2 shared experts with intermediate_size=3072 for all tokens
The two pools have separate routers and expert sets. Each token is dispatched to one pool according to its modality, given by token_type_ids (0 = text, 1 = vision; forward() also treats 2 as vision).
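The dispatch rule can be illustrated without Megatron. The following is a minimal sketch, not the module's implementation: plain Python lists stand in for tensors, and `text_pool`, `vision_pool`, and `shared` are hypothetical callables standing in for the two MoELayer pools and the shared expert MLP. It assumes that any nonzero token type means vision and that "combining" the shared-expert output means element-wise addition.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def dual_pool_forward(
    hidden: Sequence[Vector],
    token_type_ids: Sequence[int],
    text_pool: Expert,
    vision_pool: Expert,
    shared: Expert,
) -> List[Vector]:
    """Send each token to exactly one pool, then add the shared-expert output."""
    if len(hidden) != len(token_type_ids):
        raise ValueError("one token type id is required per token")
    outputs = []
    for vec, type_id in zip(hidden, token_type_ids):
        # Assumption: 0 selects the text pool, anything else the vision pool.
        pool = text_pool if type_id == 0 else vision_pool
        routed = pool(vec)
        common = shared(vec)  # shared experts see every token
        # Assumption: outputs are combined by element-wise addition.
        outputs.append([r + c for r, c in zip(routed, common)])
    return outputs


def scale(factor: float) -> Expert:
    """Toy expert that multiplies every component by a constant."""
    return lambda vec: [factor * x for x in vec]


hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
token_type_ids = [0, 1, 0]  # text, vision, text
out = dual_pool_forward(hidden, token_type_ids, scale(10.0), scale(100.0), scale(1.0))
print(out)
```

With these toy experts, the text tokens are scaled by 10 and the vision token by 100 before the shared contribution is added, which makes it easy to see which pool handled each position.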
Module hierarchy (MoE layers):
decoder.layers.{i}.mlp = ErnieMultiTypeMoE
    .text_moe_layer = MoELayer (standard Megatron)
        .router = TopKRouter
        .experts = SequentialMLP
            .local_experts.{j} = MLP (with linear_fc1, linear_fc2)
    .vision_moe_layer = MoELayer (standard Megatron)
        .router = TopKRouter
        .experts = SequentialMLP
            .local_experts.{j} = MLP (with linear_fc1, linear_fc2)
    .shared_experts = SharedExpertMLP
        .linear_fc1, .linear_fc2
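When inspecting checkpoints or writing weight mappings, it can help to enumerate the dotted module paths implied by the hierarchy above. The helper below is a hypothetical illustration that only formats strings from that hierarchy; `expert_module_paths` and its `num_local_experts` argument are not part of this module, and the number of experts held locally depends on the expert-parallel configuration.

```python
from typing import List


def expert_module_paths(layer_idx: int, num_local_experts: int) -> List[str]:
    """List the dotted module paths under one dual-pool MoE layer."""
    base = f"decoder.layers.{layer_idx}.mlp"
    paths = []
    for pool in ("text_moe_layer", "vision_moe_layer"):
        # Each pool has its own router and its own set of local experts.
        paths.append(f"{base}.{pool}.router")
        for j in range(num_local_experts):
            for linear in ("linear_fc1", "linear_fc2"):
                paths.append(f"{base}.{pool}.experts.local_experts.{j}.{linear}")
    # The shared experts sit beside the two pools, not inside them.
    for linear in ("linear_fc1", "linear_fc2"):
        paths.append(f"{base}.shared_experts.{linear}")
    return paths


paths = expert_module_paths(layer_idx=1, num_local_experts=2)
for path in paths:
    print(path)
```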
Communication pattern for moe_mm_token_type_ids:
Megatron-Core’s TransformerBlock / TransformerLayer do not propagate extra
kwargs to MLP layers. To pass moe_mm_token_type_ids from
Ernie45VLModel.forward() to ErnieMultiTypeMoE.forward() we use a
module-level context variable _current_moe_mm_token_type_ids that is set
before the language model forward and cleared afterwards.
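The set/clear pattern described above can be sketched in plain Python. The function bodies below are assumptions about how such a module-level slot is typically managed, not the module's source; `moe_token_types` and `moe_layer_forward` are hypothetical names added for illustration, and a list stands in for the tensor.

```python
from contextlib import contextmanager
from typing import Iterator, List, Optional

# Module-level slot playing the role of _current_moe_mm_token_type_ids.
_current_moe_mm_token_type_ids: Optional[List[int]] = None


def set_moe_mm_token_type_ids(token_type_ids: List[int]) -> None:
    """Store the ids so nested layers can read them without extra kwargs."""
    global _current_moe_mm_token_type_ids
    _current_moe_mm_token_type_ids = token_type_ids


def clear_moe_mm_token_type_ids() -> None:
    """Reset the slot so stale ids never leak into a later forward pass."""
    global _current_moe_mm_token_type_ids
    _current_moe_mm_token_type_ids = None


@contextmanager
def moe_token_types(token_type_ids: List[int]) -> Iterator[None]:
    """Hypothetical helper: clear the slot even if the forward pass raises."""
    set_moe_mm_token_type_ids(token_type_ids)
    try:
        yield
    finally:
        clear_moe_mm_token_type_ids()


def moe_layer_forward() -> Optional[List[int]]:
    """Stand-in for a nested MoE layer that is called without token type ids."""
    return _current_moe_mm_token_type_ids


with moe_token_types([0, 1, 1, 0]):
    seen_inside = moe_layer_forward()  # the ids are visible during the forward
seen_after = moe_layer_forward()  # the slot is empty again afterwards
```

Wrapping the set and clear calls in try/finally is a design choice of this sketch: it keeps the slot from retaining ids from a failed forward pass. A module-level variable like this is also not safe for concurrent forward passes in the same process.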
Module Contents#
Classes#
| MultiTypeMoeSubmodules | Submodule specs for the dual-pool MoE layer. |
| ErnieMultiTypeMoE | Dual-pool Mixture of Experts layer for ERNIE 4.5 VL. |
Functions#
| set_moe_mm_token_type_ids | Set the current moe_mm_token_type_ids for MoE routing. |
| clear_moe_mm_token_type_ids | Clear the current moe_mm_token_type_ids after forward pass. |
Data#
API#
- bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer._current_moe_mm_token_type_ids: torch.Tensor | None#
None
- bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer.set_moe_mm_token_type_ids(token_type_ids)#
Set the current moe_mm_token_type_ids for MoE routing.
Called by Ernie45VLModel.forward() before language_model.forward().
- bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer.clear_moe_mm_token_type_ids()#
Clear the current moe_mm_token_type_ids after forward pass.
Called by Ernie45VLModel.forward() after language_model.forward().
- class bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer.MultiTypeMoeSubmodules#
Submodule specs for the dual-pool MoE layer.
- Attributes:
text_moe_layer – Spec for the text MoE pool (larger FFN).
vision_moe_layer – Spec for the vision MoE pool (smaller FFN).
shared_experts – Spec for the shared expert MLP.
- text_moe_layer: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- vision_moe_layer: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#
None
- shared_experts#
None
- class bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer.ErnieMultiTypeMoE(
- config: megatron.core.transformer.transformer_config.TransformerConfig,
- submodules: Optional[bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer.MultiTypeMoeSubmodules] = None,
- layer_number: Optional[int] = None,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
- is_mtp_layer: bool = False,
- name: str | None = None,
)#
Bases: megatron.core.transformer.module.MegatronModule
Dual-pool Mixture of Experts layer for ERNIE 4.5 VL.
Routes text tokens to text_moe_layer and vision tokens to vision_moe_layer, then combines outputs with shared expert output.
Each pool is a standard Megatron MoELayer with its own router and experts, so tensor parallelism (TP) and expert parallelism (EP) are supported natively.
- Parameters:
config – TransformerConfig with moe_intermediate_size as a tuple/list of [text_ffn_size, vision_ffn_size].
submodules – MultiTypeMoeSubmodules containing specs for both pools.
layer_number – Layer index in the transformer stack.
pg_collection – Process group collection for parallelism.
is_mtp_layer – Whether this MoE is used inside an MTP layer.
name – Optional module instance name passed top-down by Megatron-Core.
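Because moe_intermediate_size carries two sizes rather than one, code that builds the two pools has to split it. The sketch below shows one way to validate and unpack such a value; `split_moe_intermediate_size` is a hypothetical helper, not a function of this module, and the sizes 1536 and 512 are the text and vision expert sizes listed in the module description.

```python
from typing import Sequence, Tuple


def split_moe_intermediate_size(value: Sequence[int]) -> Tuple[int, int]:
    """Return (text_ffn_size, vision_ffn_size) from a two-element tuple or list."""
    if not isinstance(value, (tuple, list)):
        raise TypeError("moe_intermediate_size must be a tuple or list")
    if len(value) != 2:
        raise ValueError("expected [text_ffn_size, vision_ffn_size]")
    text_ffn_size, vision_ffn_size = int(value[0]), int(value[1])
    if text_ffn_size <= 0 or vision_ffn_size <= 0:
        raise ValueError("FFN sizes must be positive")
    return text_ffn_size, vision_ffn_size


# The first entry configures the text pool, the second the vision pool.
text_ffn_size, vision_ffn_size = split_moe_intermediate_size([1536, 512])
print(text_ffn_size, vision_ffn_size)
```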
Initialization
- forward(
- hidden_states: torch.Tensor,
- token_type_ids: torch.Tensor = None,
- padding_mask: Optional[torch.Tensor] = None,
)#
Forward pass for dual-pool MoE.
- Parameters:
hidden_states – Input tensor [seq_len, batch, hidden_size]. When Sequence Parallel (SP) is enabled, seq_len is the local partition size (full_seq_len / tp_size).
token_type_ids – Modality indicator [batch, seq_len]: 0 = text token -> text_moe_layer; 1 or 2 = vision token -> vision_moe_layer. When SP is enabled, this must already be sliced to match the local sequence partition (done by Ernie45VLModel.forward()).
padding_mask – Optional padding mask [batch, seq_len] passed by Megatron’s TransformerLayer. Forwarded to each MoE pool’s router for filtering out padding tokens during routing.
- Returns:
Tuple of (output, bias). bias is always None.
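The requirement that token_type_ids match the local sequence partition under SP can be sketched as follows. This is an illustration under stated assumptions, not the slicing code in Ernie45VLModel.forward(): it assumes each tensor-parallel rank owns one contiguous chunk of the sequence, uses nested lists in place of a [batch, seq_len] tensor, and `slice_for_sequence_parallel` is a hypothetical name.

```python
from typing import List


def slice_for_sequence_parallel(
    token_type_ids: List[List[int]], tp_rank: int, tp_size: int
) -> List[List[int]]:
    """Keep only the sequence positions owned by one tensor-parallel rank."""
    if not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank must be in [0, tp_size)")
    if not token_type_ids:
        return []
    full_seq_len = len(token_type_ids[0])
    if full_seq_len % tp_size != 0:
        raise ValueError("seq_len must be divisible by tp_size")
    local_len = full_seq_len // tp_size  # full_seq_len / tp_size per rank
    start = tp_rank * local_len
    # Slice the sequence dimension (second dimension) for every batch row.
    return [row[start:start + local_len] for row in token_type_ids]


# Two sequences of length 8, split across four ranks of two positions each.
ids = [
    [0, 0, 1, 1, 2, 2, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 0],
]
local_rank1 = slice_for_sequence_parallel(ids, tp_rank=1, tp_size=4)
local_rank2 = slice_for_sequence_parallel(ids, tp_rank=2, tp_size=4)
print(local_rank1, local_rank2)
```

After slicing, the ids have shape [batch, local_seq_len], which lines up with the local hidden_states of shape [local_seq_len, batch, hidden_size] once the batch and sequence axes are accounted for.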
- set_layer_number(layer_number: int)#
Set the layer number for both MoE pools.