bridge.models.ernie.ernie_45_bridge#

Megatron Bridge for ERNIE 4.5 text-only MoE model.

Maps HuggingFace Ernie4_5_MoeForCausalLM weights and config to Megatron-Core GPTModel with single-pool MoE (64 experts, top-6 routing, shared experts, expert bias for aux-free load balancing).

Module Contents#

Classes#

_PPSafeMixin

Mixin that makes megatron_to_hf safe for PP export of MoE-only params.

_PPSafeAutoMapping

AutoMapping that skips export for missing parameters.

_PPSafeReplicatedMapping

ReplicatedMapping that skips export for missing parameters.

_PPSafeGatedMLPMapping

GatedMLPMapping that skips export for missing parameters.

_SqueezeBiasMapping

Mapping for the single-pool expert bias tensor.

Ernie45Bridge

Megatron Bridge for ERNIE 4.5 text-only MoE Causal LM.

Functions#

_ernie45_decoder_block_spec

Create a decoder block spec that respects moe_layer_freq.

Data#

_ERNIE45_MOE_HF_CLASS_NAME

API#

bridge.models.ernie.ernie_45_bridge._ernie45_decoder_block_spec(
config: megatron.bridge.models.gpt_provider.GPTModelProvider,
vp_stage: int | None = None,
)#

Create a decoder block spec that respects moe_layer_freq.

The default GPTModelProvider.transformer_layer_spec calls get_gpt_layer_with_transformer_engine_spec, which returns a single MoE layer spec that is applied uniformly to all layers and ignores moe_layer_freq.

ERNIE 4.5 has mixed dense/MoE layers (layer 0 is dense, layers 1-N are MoE). This function uses get_gpt_decoder_block_spec, which calls get_gpt_decoder_layer_specs. That code path parses config.moe_layer_freq and creates per-layer specs (dense where the pattern is 0, MoE where it is 1).
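The per-layer pattern described above can be illustrated with a small standalone sketch. It does not call Megatron-Core. The helper names expand_moe_layer_pattern and ernie_style_pattern are hypothetical, and the integer convention shown (every N-th layer is MoE) is an assumption about how moe_layer_freq is interpreted.

```python
from typing import List, Union


def expand_moe_layer_pattern(
    moe_layer_freq: Union[int, List[int]], num_layers: int
) -> List[int]:
    """Return a per-layer pattern: 1 marks an MoE layer, 0 marks a dense layer."""
    if isinstance(moe_layer_freq, int):
        # Assumed integer convention: layers whose index is a multiple of
        # moe_layer_freq are MoE, all others are dense.
        return [1 if i % moe_layer_freq == 0 else 0 for i in range(num_layers)]
    if len(moe_layer_freq) != num_layers:
        raise ValueError("pattern length must equal num_layers")
    return list(moe_layer_freq)


def ernie_style_pattern(num_layers: int) -> List[int]:
    """Hypothetical helper: layer 0 dense, every remaining layer MoE."""
    return [0] + [1] * (num_layers - 1)


pattern = expand_moe_layer_pattern(ernie_style_pattern(4), num_layers=4)
kinds = ["moe" if flag else "dense" for flag in pattern]
```

A uniform layer spec corresponds to treating every entry as 1, which is why a dense first layer requires the per-layer path.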

bridge.models.ernie.ernie_45_bridge._ERNIE45_MOE_HF_CLASS_NAME#

'Ernie4_5_MoeForCausalLM'

class bridge.models.ernie.ernie_45_bridge._PPSafeMixin#

Mixin that makes megatron_to_hf safe for PP export of MoE-only params.

When moe_layer_freq makes some layers dense and others MoE, MoE-only parameters (router weight, expert bias, shared/routed expert weights) do not exist on dense layers. With PP > 1, broadcast_from_pp_rank raises ValueError because no PP rank owns the tensor.

This mixin catches that error and returns {} so the conversion loop simply omits the parameter from the output.

Must be listed before the base mapping class in the MRO so that super().megatron_to_hf resolves to the concrete mapping’s method.
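The pattern can be sketched in isolation as follows. This is a minimal illustration, not the bridge's actual code: BaseMapping is a hypothetical stand-in for a concrete mapping class, and its ValueError imitates the failure raised when no PP rank owns the tensor.

```python
class BaseMapping:
    """Hypothetical stand-in for a concrete mapping class."""

    def __init__(self, owned_by_some_rank: bool):
        self.owned_by_some_rank = owned_by_some_rank

    def megatron_to_hf(self, megatron_weights, megatron_module):
        if not self.owned_by_some_rank:
            # Imitates the error described above: no PP rank owns the tensor.
            raise ValueError("no PP rank owns this tensor")
        return {"hf.param.name": megatron_weights}


class SafeMixin:
    """Catches the ValueError and returns an empty dict instead."""

    def megatron_to_hf(self, megatron_weights, megatron_module):
        try:
            # super() resolves to the next class in the MRO, i.e. BaseMapping,
            # only because the mixin is listed first in the bases.
            return super().megatron_to_hf(megatron_weights, megatron_module)
        except ValueError:
            return {}


class SafeMapping(SafeMixin, BaseMapping):
    pass


present = SafeMapping(owned_by_some_rank=True).megatron_to_hf(1.0, None)
missing = SafeMapping(owned_by_some_rank=False).megatron_to_hf(None, None)
```

One trade-off of this design is that any ValueError raised by the wrapped method is treated as a missing parameter, so unrelated errors of that type would also be silenced.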

megatron_to_hf(megatron_weights, megatron_module)#

class bridge.models.ernie.ernie_45_bridge._PPSafeAutoMapping#

Bases: bridge.models.ernie.ernie_45_bridge._PPSafeMixin, megatron.bridge.models.conversion.param_mapping.AutoMapping

AutoMapping that skips export for missing parameters.

class bridge.models.ernie.ernie_45_bridge._PPSafeReplicatedMapping#

Bases: bridge.models.ernie.ernie_45_bridge._PPSafeMixin, megatron.bridge.models.conversion.param_mapping.ReplicatedMapping

ReplicatedMapping that skips export for missing parameters.

class bridge.models.ernie.ernie_45_bridge._PPSafeGatedMLPMapping#

Bases: bridge.models.ernie.ernie_45_bridge._PPSafeMixin, megatron.bridge.models.conversion.param_mapping.GatedMLPMapping

GatedMLPMapping that skips export for missing parameters.

class bridge.models.ernie.ernie_45_bridge._SqueezeBiasMapping#

Bases: bridge.models.ernie.ernie_45_bridge._PPSafeReplicatedMapping

Mapping for the single-pool expert bias tensor.

The HF text-only model stores moe_statics.e_score_correction_bias with shape [1, num_experts] (1 expert group for text-only). Megatron stores router.expert_bias as a flat [num_experts] tensor.

This mapping squeezes dim-0 on import and unsqueezes on export.

Inherits from _PPSafeReplicatedMapping to gracefully skip dense layers during PP export.
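The shape change in each direction can be sketched with plain Python lists standing in for tensors. The function names squeeze_dim0 and unsqueeze_dim0 are hypothetical, and the real mapping presumably operates on tensors rather than lists.

```python
from typing import List


def squeeze_dim0(hf_bias: List[List[float]]) -> List[float]:
    """Import direction: [1, num_experts] -> [num_experts]."""
    if len(hf_bias) != 1:
        raise ValueError(f"expected a single expert group, got {len(hf_bias)}")
    return list(hf_bias[0])


def unsqueeze_dim0(megatron_bias: List[float]) -> List[List[float]]:
    """Export direction: [num_experts] -> [1, num_experts]."""
    return [list(megatron_bias)]


hf_bias = [[0.1, -0.2, 0.0]]          # shape [1, 3]
megatron_bias = squeeze_dim0(hf_bias)  # shape [3]
round_trip = unsqueeze_dim0(megatron_bias)
```

The two operations are inverses, so an import followed by an export reproduces the original HF layout.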

hf_to_megatron(hf_weights, megatron_module)#

megatron_to_hf(megatron_weights, megatron_module)#

class bridge.models.ernie.ernie_45_bridge.Ernie45Bridge#

Bases: megatron.bridge.models.conversion.model_bridge.MegatronModelBridge

Megatron Bridge for ERNIE 4.5 text-only MoE Causal LM.

This bridge handles the conversion between HuggingFace Ernie4_5_MoeForCausalLM and Megatron-Core GPTModel formats with single-pool MoE architecture.

Key architectural features:

  • Single-pool MoE: 64 experts, top-6 routing, shared experts

  • Softmax routing with expert bias for aux-free load balancing

  • Interleaved RoPE (base=500000)

  • GQA with 20 query heads, 4 KV heads, kv_channels=128

  • RMSNorm, SiLU-gated MLP

  • Router gate weight stored as [H, E] in HF (transposed for Megatron [E, H])
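The router gate layout change in the last item amounts to a matrix transpose. A minimal sketch with nested lists standing in for tensors, using a toy size of H=2 and E=3 (the transpose helper is hypothetical and is not part of the bridge):

```python
from typing import List

Matrix = List[List[float]]


def transpose(weight: Matrix) -> Matrix:
    """Swap the two axes of a rectangular matrix."""
    return [list(column) for column in zip(*weight)]


# HF layout [H, E]: one row per hidden unit, one column per expert.
hf_gate = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

# Megatron layout [E, H]: one row per expert.
megatron_gate = transpose(hf_gate)

# Exporting back to HF applies the same transpose again.
exported = transpose(megatron_gate)
```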

Example:

    from megatron.bridge import AutoBridge

    bridge = AutoBridge.from_hf_pretrained("baidu/ERNIE-4.5-0.3B-PT")
    provider = bridge.to_megatron_provider()

static _get_num_experts(hf_config) → int#

Extract num_experts as an int.

The config may store moe_num_experts as a plain int or as a list: [N] for a single pool, or [N, N] for a dual pool, in which case the first element is used.
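The normalization described here can be sketched as a standalone function. The name normalize_num_experts is hypothetical, and the actual method reads the value from hf_config rather than taking it as an argument.

```python
from typing import List, Union


def normalize_num_experts(moe_num_experts: Union[int, List[int]]) -> int:
    """Return the expert count as an int, whether given as an int or a list."""
    if isinstance(moe_num_experts, (list, tuple)):
        if not moe_num_experts:
            raise ValueError("moe_num_experts list is empty")
        # Single pool [N] and dual pool [N, N] both use the first entry.
        return int(moe_num_experts[0])
    return int(moe_num_experts)


plain = normalize_num_experts(64)
single_pool = normalize_num_experts([64])
dual_pool = normalize_num_experts([64, 64])
```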

provider_bridge(hf_pretrained)#

Convert HuggingFace ERNIE 4.5 MoE config to GPTModelProvider.

Uses super().provider_bridge() for standard CONFIG_MAPPING fields (hidden_size, num_layers, rope_theta, tie_word_embeddings, etc.) and then overrides ERNIE-specific settings.

mapping_registry() → megatron.bridge.models.conversion.mapping_registry.MegatronMappingRegistry#

Return MegatronMappingRegistry with parameter mappings for ERNIE 4.5 MoE.