bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer#

Custom MoE modules for ERNIE 4.5 VL MoE dual-pool architecture.

ERNIE 4.5 VL uses a heterogeneous dual-pool MoE in which each transformer layer (except layer 0, which is dense) contains:

  • text_moe_layer: 64 experts with intermediate_size=1536 for text tokens

  • vision_moe_layer: 64 experts with intermediate_size=512 for vision tokens

  • shared_experts: 2 shared experts with intermediate_size=3072 for all tokens

Each pool has its own router and its own expert set. Tokens are dispatched to a pool based on modality (token_type_ids: 0 = text, 1 or 2 = vision).

Module hierarchy (MoE layers):

  decoder.layers.{i}.mlp = ErnieMultiTypeMoE
    .text_moe_layer = MoELayer (standard Megatron)
      .router = TopKRouter
      .experts = SequentialMLP
        .local_experts.{j} = MLP (with linear_fc1, linear_fc2)
    .vision_moe_layer = MoELayer (standard Megatron)
      .router = TopKRouter
      .experts = SequentialMLP
        .local_experts.{j} = MLP (with linear_fc1, linear_fc2)
    .shared_experts = SharedExpertMLP
      .linear_fc1, .linear_fc2

Communication pattern for moe_mm_token_type_ids:

Megatron-Core’s TransformerBlock and TransformerLayer do not propagate extra kwargs to MLP layers. To pass moe_mm_token_type_ids from Ernie45VLModel.forward() to ErnieMultiTypeMoE.forward(), this module uses a module-level variable, _current_moe_mm_token_type_ids, which is set before the language model forward pass and cleared afterwards.
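The set-then-clear pattern described above can be illustrated with a minimal, dependency-free sketch. All names here (`_current_ids`, `set_ids`, `clear_ids`, `moe_forward`, `model_forward`) are hypothetical stand-ins, plain integer sequences replace tensors, and the `try`/`finally` is an assumption of this sketch rather than something the module is documented to do.

```python
from typing import Optional, Sequence

# Module-level slot standing in for _current_moe_mm_token_type_ids.
# A plain sequence of ints replaces torch.Tensor to keep the sketch dependency-free.
_current_ids: Optional[Sequence[int]] = None


def set_ids(ids: Sequence[int]) -> None:
    """Publish the ids so deeply nested layers can read them."""
    global _current_ids
    _current_ids = ids


def clear_ids() -> None:
    """Reset the slot so stale ids cannot leak into a later forward pass."""
    global _current_ids
    _current_ids = None


def moe_forward(hidden: Sequence[float]) -> str:
    """Stand-in for the MoE layer: it gets no ids argument and reads the slot."""
    if _current_ids is None:
        return "no ids available"
    return f"routing {len(hidden)} tokens with ids {list(_current_ids)}"


def model_forward(hidden: Sequence[float], ids: Sequence[int]) -> str:
    """Stand-in for the outer model forward: set, run, always clear."""
    set_ids(ids)
    try:
        return moe_forward(hidden)
    finally:
        # Assumption of this sketch: clear even if the inner forward raises.
        clear_ids()
```

Because the slot is shared module state, a pattern like this is only safe when forward passes in a process do not interleave.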

Module Contents#

Classes#

MultiTypeMoeSubmodules

Submodule specs for the dual-pool MoE layer.

ErnieMultiTypeMoE

Dual-pool Mixture of Experts layer for ERNIE 4.5 VL.

Functions#

set_moe_mm_token_type_ids

Set the current moe_mm_token_type_ids for MoE routing.

clear_moe_mm_token_type_ids

Clear the current moe_mm_token_type_ids after forward pass.

Data#

API#

bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer._current_moe_mm_token_type_ids: torch.Tensor | None#

None

bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer.set_moe_mm_token_type_ids(token_type_ids)#

Set the current moe_mm_token_type_ids for MoE routing.

Called by Ernie45VLModel.forward() before language_model.forward().

bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer.clear_moe_mm_token_type_ids()#

Clear the current moe_mm_token_type_ids after forward pass.

Called by Ernie45VLModel.forward() after language_model.forward().

class bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer.MultiTypeMoeSubmodules#

Submodule specs for the dual-pool MoE layer.

Attributes:
  • text_moe_layer – Spec for the text MoE pool (larger FFN).

  • vision_moe_layer – Spec for the vision MoE pool (smaller FFN).

  • shared_experts – Spec for the shared expert MLP.

text_moe_layer: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

vision_moe_layer: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

shared_experts: Union[megatron.core.transformer.spec_utils.ModuleSpec, type]#

None

class bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer.ErnieMultiTypeMoE(
config: megatron.core.transformer.transformer_config.TransformerConfig,
submodules: Optional[bridge.models.ernie_vl.modeling_ernie45_vl.ernie_moe_layer.MultiTypeMoeSubmodules] = None,
layer_number: Optional[int] = None,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
is_mtp_layer: bool = False,
name: str | None = None,
)#

Bases: megatron.core.transformer.module.MegatronModule

Dual-pool Mixture of Experts layer for ERNIE 4.5 VL.

Routes text tokens to text_moe_layer and vision tokens to vision_moe_layer, then combines outputs with shared expert output.

Each pool is a standard Megatron MoELayer with its own router and experts, supporting TP and EP parallelism natively.
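The routing described above can be sketched with boolean masks in PyTorch. This is a toy under stated assumptions, not the layer's implementation: `ToyDualPool` is a hypothetical class, a single `nn.Linear` stands in for each routed `MoELayer` and for the shared experts, and combining by summation is assumed from the phrase "combines outputs with shared expert output".

```python
import torch
from torch import nn


class ToyDualPool(nn.Module):
    """Toy stand-in: one Linear per pool replaces each routed MoELayer."""

    def __init__(self, hidden_size: int) -> None:
        super().__init__()
        self.text_pool = nn.Linear(hidden_size, hidden_size)
        self.vision_pool = nn.Linear(hidden_size, hidden_size)
        self.shared = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor, token_type_ids: torch.Tensor) -> torch.Tensor:
        # hidden_states: [seq_len, batch, hidden]; token_type_ids: [batch, seq_len]
        seq_len, batch, hidden = hidden_states.shape
        flat = hidden_states.reshape(-1, hidden)
        # Transpose ids to [seq_len, batch] so they flatten in the same token order.
        ids = token_type_ids.transpose(0, 1).reshape(-1)
        text_mask = ids == 0
        vision_mask = ~text_mask  # ids 1 or 2

        # Each token is processed by exactly one pool.
        routed = torch.zeros_like(flat)
        routed[text_mask] = self.text_pool(flat[text_mask])
        routed[vision_mask] = self.vision_pool(flat[vision_mask])

        # Assumption: the shared-expert output is added to the routed output.
        out = routed + self.shared(flat)
        return out.reshape(seq_len, batch, hidden)
```

The transpose matters because the hidden states are sequence-first while the ids are batch-first; flattening them without it would pair tokens with the wrong modality flags.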

Parameters:
  • config – TransformerConfig with moe_intermediate_size as a tuple/list of [text_ffn_size, vision_ffn_size].

  • submodules – MultiTypeMoeSubmodules containing specs for both pools.

  • layer_number – Layer index in the transformer stack.

  • pg_collection – Process group collection for parallelism.

  • is_mtp_layer – Whether this MoE is used inside an MTP layer.

  • name – Optional module instance name passed top-down by Megatron-Core.
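One way a two-element moe_intermediate_size could be turned into per-pool settings is sketched below. `ToyConfig` and `split_pool_configs` are hypothetical; the real TransformerConfig has many more fields, and copying the config once per pool is an assumption of this sketch, not a documented behavior of ErnieMultiTypeMoE.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical stand-in for TransformerConfig; only two fields are modeled."""

    moe_intermediate_size: Sequence[int]       # [text_ffn_size, vision_ffn_size]
    moe_ffn_hidden_size: Optional[int] = None  # per-pool expert FFN width


def split_pool_configs(config: ToyConfig) -> Tuple[ToyConfig, ToyConfig]:
    """Return (text_config, vision_config), each carrying its own FFN width."""
    sizes = tuple(config.moe_intermediate_size)
    if len(sizes) != 2:
        raise ValueError(f"expected [text_ffn_size, vision_ffn_size], got {sizes!r}")
    text_size, vision_size = sizes
    return (
        replace(config, moe_ffn_hidden_size=text_size),
        replace(config, moe_ffn_hidden_size=vision_size),
    )


# Sizes taken from the module overview: 1536 for text experts, 512 for vision experts.
text_cfg, vision_cfg = split_pool_configs(ToyConfig(moe_intermediate_size=[1536, 512]))
```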

Initialization

forward(
hidden_states: torch.Tensor,
token_type_ids: torch.Tensor = None,
padding_mask: Optional[torch.Tensor] = None,
)#

Forward pass for dual-pool MoE.

Parameters:
  • hidden_states – Input tensor [seq_len, batch, hidden_size]. When Sequence Parallel (SP) is enabled, seq_len is the local partition size (full_seq_len / tp_size).

  • token_type_ids – Modality indicator [batch, seq_len]: 0 = text token -> text_moe_layer; 1 or 2 = vision token -> vision_moe_layer. When SP is enabled, this must already be sliced to match the local sequence partition (done by Ernie45VLModel.forward()).

  • padding_mask – Optional padding mask [batch, seq_len] passed by Megatron’s TransformerLayer. Forwarded to each MoE pool’s router for filtering out padding tokens during routing.

Returns:

Tuple of (output, bias). bias is always None.
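The Sequence Parallel note on token_type_ids can be made concrete with a small sketch. It assumes each tensor-parallel rank owns one contiguous chunk of the sequence dimension; `slice_for_sequence_parallel` is a hypothetical helper, and nested lists stand in for a [batch, seq_len] tensor.

```python
from typing import List


def slice_for_sequence_parallel(
    token_type_ids: List[List[int]], tp_rank: int, tp_size: int
) -> List[List[int]]:
    """Keep only this rank's contiguous chunk of the sequence dimension."""
    full_seq_len = len(token_type_ids[0])
    if full_seq_len % tp_size != 0:
        raise ValueError("sequence length must be divisible by tp_size")
    local_len = full_seq_len // tp_size  # matches full_seq_len / tp_size above
    start = tp_rank * local_len
    return [row[start:start + local_len] for row in token_type_ids]


# One sample of 8 tokens split across 4 ranks: each rank sees 2 ids.
ids = [[0, 0, 1, 1, 2, 2, 0, 0]]
local_ids = slice_for_sequence_parallel(ids, tp_rank=1, tp_size=4)
```

After slicing, the ids have shape [batch, full_seq_len / tp_size], which lines up with the local seq_len of hidden_states on the same rank.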

set_layer_number(layer_number: int)#

Set the layer number for both MoE pools.