bridge.models.ernie_vl.ernie45_vl_provider#

Provider for ERNIE 4.5 VL MoE model.

Maps HuggingFace Ernie4_5_VLMoeConfig to Megatron-Core TransformerConfig and provides model instantiation logic for the dual-pool MoE architecture.

The language model uses a custom ErnieMultiTypeMoE layer that contains text_moe_layer and vision_moe_layer as separate MoELayer instances, each with its own router, experts, and expert-parallel (EP) support.
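The dual-pool idea can be illustrated with a toy, framework-free sketch. This is not the ErnieMultiTypeMoE implementation: `ToyMoEPool`, `dual_pool_forward`, and the per-token `is_vision` flag are hypothetical names chosen for illustration, and tokens are plain floats rather than tensors.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ToyMoEPool:
    """One expert pool: a router that picks an expert index, plus its experts."""

    router: Callable[[float], int]
    experts: Sequence[Callable[[float], float]]

    def __call__(self, token: float) -> float:
        # Top-1 routing: the router selects exactly one expert for the token.
        return self.experts[self.router(token)](token)


def dual_pool_forward(
    tokens: Sequence[float],
    is_vision: Sequence[bool],
    text_pool: ToyMoEPool,
    vision_pool: ToyMoEPool,
) -> List[float]:
    """Send each token to the pool matching its modality, preserving order."""
    return [
        (vision_pool if vis else text_pool)(tok)
        for tok, vis in zip(tokens, is_vision)
    ]


# Hypothetical pools: two text experts, one vision expert.
text_pool = ToyMoEPool(
    router=lambda t: 0 if t < 0 else 1,
    experts=[lambda t: t - 1.0, lambda t: t + 1.0],
)
vision_pool = ToyMoEPool(router=lambda t: 0, experts=[lambda t: t * 10.0])

outputs = dual_pool_forward(
    [1.0, 2.0, -3.0], [False, True, False], text_pool, vision_pool
)
print(outputs)
```

Each pool keeps its own router, so a vision token never competes with text tokens for expert capacity in this sketch.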

Module Contents#

Classes#

Ernie45VLModelProvider

Model provider for ERNIE 4.5 VL MoE.

API#

class bridge.models.ernie_vl.ernie45_vl_provider.Ernie45VLModelProvider#

Bases: megatron.bridge.models.gpt_provider.GPTModelProvider

Model provider for ERNIE 4.5 VL MoE.

This provider extends GPTModelProvider with ERNIE 4.5 VL-specific fields:

  • Vision configuration for the ViT encoder and resampler

  • 3D M-RoPE parameters (mrope_section)

  • Dual-pool MoE configuration (moe_intermediate_size as tuple)

  • Custom decoder layer spec with ErnieMultiTypeMoE

  • Token IDs for image/video placeholder tokens

  • Freeze options for vision/language components

scatter_embedding_sequence_parallel: bool#

False

position_embedding_type: str#

'mrope'

mrope_section: List[int]#

'field(…)'

vision_config: Any#

'field(…)'

hf_config: Any#

None

moe_intermediate_size: Tuple[int, int]#

(1536, 512)
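As a rough illustration of a tuple-valued moe_intermediate_size, the sketch below unpacks the two entries into named per-pool sizes. The (text, vision) ordering is an assumption that this page does not state, and `PoolFFNSizes` and `split_moe_intermediate_size` are hypothetical helpers, not part of the provider.

```python
from typing import NamedTuple, Tuple


class PoolFFNSizes(NamedTuple):
    """Expert FFN hidden sizes for the two pools (ordering is an assumption)."""

    text: int
    vision: int


def split_moe_intermediate_size(sizes: Tuple[int, int]) -> PoolFFNSizes:
    """Unpack a 2-tuple into named per-pool sizes, rejecting other lengths."""
    if len(sizes) != 2:
        raise ValueError(f"expected 2 sizes (one per pool), got {len(sizes)}")
    text, vision = sizes
    return PoolFFNSizes(text=text, vision=vision)


# Documented default value of moe_intermediate_size.
pool_sizes = split_moe_intermediate_size((1536, 512))
print(pool_sizes)
```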

image_start_token_id: int#

101304

image_end_token_id: int#

101305

image_token_id: int#

100295

video_start_token_id: int#

101306

video_end_token_id: int#

101307

video_token_id: int#

103367
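One common use of placeholder token IDs is to locate the positions where visual features replace text embeddings. The sketch below builds such a mask with the default IDs listed above. `placeholder_mask` is a hypothetical helper, and treating only image_token_id and video_token_id (not the start/end markers) as placeholders is an assumption.

```python
from typing import List, Sequence

# Default IDs documented on this provider.
IMAGE_TOKEN_ID = 100295
VIDEO_TOKEN_ID = 103367


def placeholder_mask(
    input_ids: Sequence[int],
    placeholder_ids: Sequence[int] = (IMAGE_TOKEN_ID, VIDEO_TOKEN_ID),
) -> List[bool]:
    """Mark positions whose token ID is an image or video placeholder."""
    ids = set(placeholder_ids)
    return [tok in ids for tok in input_ids]


# 101304 / 101305 are the default image start/end IDs; 7 and 8 stand in
# for arbitrary text token IDs.
example_ids = [7, 101304, 100295, 100295, 101305, 8]
mask = placeholder_mask(example_ids)
print(mask)
```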

freeze_language_model: bool#

False

freeze_vision_model: bool#

False

freeze_vision_projection: bool#

False
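Freeze flags like these usually translate into disabling gradients on the matching submodule's parameters. The sketch below shows that pattern with stand-in objects. `ToyParam`, `apply_freeze`, and the component names are hypothetical, and the provider's actual freezing logic is not shown on this page.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyParam:
    """Stand-in for a framework parameter with a requires_grad switch."""

    requires_grad: bool = True


def apply_freeze(
    components: Dict[str, List[ToyParam]],
    *,
    freeze_language_model: bool = False,
    freeze_vision_model: bool = False,
    freeze_vision_projection: bool = False,
) -> None:
    """Disable gradients for every parameter of each frozen component."""
    flags = {
        "language_model": freeze_language_model,
        "vision_model": freeze_vision_model,
        "vision_projection": freeze_vision_projection,
    }
    for name, frozen in flags.items():
        if frozen:
            for param in components.get(name, []):
                param.requires_grad = False


components = {
    name: [ToyParam(), ToyParam()]
    for name in ("language_model", "vision_model", "vision_projection")
}
apply_freeze(components, freeze_vision_model=True)
trainable = {
    name: all(p.requires_grad for p in params)
    for name, params in components.items()
}
print(trainable)
```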

use_mg_vit: bool#

False

transformer_layer_spec: Union[megatron.core.transformer.spec_utils.ModuleSpec, Callable[[megatron.bridge.models.gpt_provider.GPTModelProvider], megatron.core.transformer.spec_utils.ModuleSpec]]#

None
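The Union type above means the field can hold either a ready-made spec or a callable that builds one from the provider. A minimal sketch of resolving such a field follows, using stand-in classes. `ToySpec`, `ToyProvider`, `num_experts`, and `resolve_layer_spec` are hypothetical, and the documented default of None is not handled here.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class ToySpec:
    """Stand-in for ModuleSpec."""

    module: str


@dataclass
class ToyProvider:
    """Stand-in provider carrying a spec-or-factory field."""

    num_experts: int  # hypothetical field, used only by the factory below
    transformer_layer_spec: Union[ToySpec, Callable[["ToyProvider"], ToySpec]]


def resolve_layer_spec(provider: ToyProvider) -> ToySpec:
    """Return the spec directly, or call the factory with the provider."""
    spec = provider.transformer_layer_spec
    if isinstance(spec, ToySpec):
        return spec
    return spec(provider)


static_provider = ToyProvider(num_experts=8, transformer_layer_spec=ToySpec("fixed"))
lazy_provider = ToyProvider(
    num_experts=8,
    transformer_layer_spec=lambda p: ToySpec(f"moe-{p.num_experts}"),
)
print(resolve_layer_spec(static_provider), resolve_layer_spec(lazy_provider))
```

Passing a callable defers spec construction until the provider's other fields are final, which is useful when the layer layout depends on them.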

provide(
pre_process=None,
post_process=None,
vp_stage=None,
) → megatron.bridge.models.ernie_vl.modeling_ernie45_vl.model.Ernie45VLModel#

Build the composite VLM model (vision + resampler + language model).

Parameters:
  • pre_process – Whether to include pre-processing (embedding + vision). If None, enabled on the first pipeline-parallel (PP) stage.

  • post_process – Whether to include post-processing (output layer). If None, enabled on the last PP stage.

  • vp_stage – Virtual pipeline stage index.

Returns:

Configured ERNIE 4.5 VL MoE model instance.

Return type:

Ernie45VLModel
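The stage-dependent defaults for pre_process and post_process can be sketched as below. This is an illustration under assumptions, not the provider's code: `resolve_stage_flags`, `pp_rank`, and `pp_size` are hypothetical names, and virtual pipeline stages (vp_stage) are ignored for simplicity.

```python
from typing import List, Optional, Tuple


def resolve_stage_flags(
    pre_process: Optional[bool],
    post_process: Optional[bool],
    *,
    pp_rank: int,
    pp_size: int,
) -> Tuple[bool, bool]:
    """Fill in None flags from the pipeline position; keep explicit values."""
    if pre_process is None:
        pre_process = pp_rank == 0  # first stage owns embedding + vision
    if post_process is None:
        post_process = pp_rank == pp_size - 1  # last stage owns the output layer
    return pre_process, post_process


# With 4 pipeline stages and no explicit flags, only the ends are special.
flags_by_rank: List[Tuple[bool, bool]] = [
    resolve_stage_flags(None, None, pp_rank=rank, pp_size=4) for rank in range(4)
]
print(flags_by_rank)
```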

provide_language_model(
pre_process=None,
post_process=None,
vp_stage=None,
) → megatron.core.models.gpt.GPTModel#

Build only the language model (MCoreGPTModel) for weight conversion.

This method calls GPTModelProvider.provide(), which builds a standard MCoreGPTModel using the custom ErnieMultiTypeMoE layer spec supplied through transformer_layer_spec. In the resulting model, each MoE transformer layer has both text_moe_layer and vision_moe_layer as proper submodules.

Parameters:
  • pre_process – Whether to include pre-processing.

  • post_process – Whether to include post-processing.

  • vp_stage – Virtual pipeline stage index.

Returns:

Configured Megatron-Core GPT model instance with dual-pool MoE.

Return type:

MCoreGPTModel