bridge.models.exaone.exaone4_provider
Model provider and custom layer specifications for EXAONE 4.0.
EXAONE 4.0 uses a pure Post-LayerNorm architecture:

    h = x + Attn(x)      # no pre-norm before attention
    h = PostAttnNorm(h)  # RMSNorm after residual add
    o = h + MLP(h)       # no pre-norm before MLP
    o = PostFFNNorm(o)   # RMSNorm after residual add
This requires a custom layer spec because the standard Megatron GPT spec assumes Pre-LN (fusing layernorm into the column-parallel linear via TELayerNormColumnParallelLinear). EXAONE instead needs:
Plain column-parallel linears for QKV and FC1 (no fused pre-norm)
Row-parallel linears with post-layernorm for output projection and FC2
The Post-LN implementation reuses the TERowParallelLinearLayerNorm pattern established by the Gemma2 bridge.
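The Post-LN data flow written out at the top of this module can be illustrated with a minimal pure-Python sketch. It follows the four equations exactly as stated there and is not the Megatron implementation: `rms_norm` omits the learned scale, and `toy_attn` / `toy_mlp` are hypothetical stand-ins for the real attention and MLP sublayers.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """RMSNorm without a learned scale: x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in x]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise residual add."""
    return [p + q for p, q in zip(a, b)]


def post_ln_block(
    x: Vector,
    attn: Callable[[Vector], Vector],
    mlp: Callable[[Vector], Vector],
) -> Vector:
    """One layer following the equations in the module description."""
    h = add(x, attn(x))  # no pre-norm before attention
    h = rms_norm(h)      # PostAttnNorm
    o = add(h, mlp(h))   # no pre-norm before MLP
    o = rms_norm(o)      # PostFFNNorm
    return o


# Hypothetical stand-ins for the sublayers (simple scalings, for illustration only).
def toy_attn(v: Vector) -> Vector:
    return [0.5 * t for t in v]


def toy_mlp(v: Vector) -> Vector:
    return [2.0 * t for t in v]


out = post_ln_block([1.0, -2.0, 3.0, -4.0], toy_attn, toy_mlp)
out_rms = math.sqrt(sum(v * v for v in out) / len(out))
print(out, out_rms)
```

Because the last operation is a normalization, the layer output has a root-mean-square close to 1 regardless of how large the sublayer outputs are. In a Pre-LN layer the sublayer input is normalized instead and the residual stream itself is left unnormalized.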
Module Contents
Functions
exaone4_layer_spec | EXAONE 4.0 layer specification with pure Post-LayerNorm.
API
- bridge.models.exaone.exaone4_provider.exaone4_layer_spec(
- config: megatron.bridge.models.gpt_provider.GPTModelProvider,
EXAONE 4.0 layer specification with pure Post-LayerNorm.
Key differences from standard GPT layer spec:
linear_qkv: TEColumnParallelLinear (no fused pre-norm, since no input_layernorm)
linear_proj: TERowParallelLinearLayerNorm (post-attention norm)
linear_fc1: TEColumnParallelLinear (no fused pre-norm, since no pre_feedforward_layernorm)
linear_fc2: TERowParallelLinearLayerNorm (post-feedforward norm)
QK layernorm is handled by qk_layernorm=True in TransformerConfig
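The differences listed above can be summarized as a side-by-side mapping. The sketch below records class names as plain strings and imports nothing from Megatron or Transformer Engine. The EXAONE 4.0 column is taken from the list above. In the standard column, the fused `TELayerNormColumnParallelLinear` entries come from the module description, while the `TERowParallelLinear` entries are an assumption about the default Transformer Engine GPT spec.

```python
from typing import Dict, Tuple

# Submodule name -> linear class name (strings only, nothing is instantiated).
STANDARD_PRE_LN: Dict[str, str] = {
    "linear_qkv": "TELayerNormColumnParallelLinear",  # pre-norm fused into QKV
    "linear_proj": "TERowParallelLinear",             # assumption: default TE spec
    "linear_fc1": "TELayerNormColumnParallelLinear",  # pre-norm fused into FC1
    "linear_fc2": "TERowParallelLinear",              # assumption: default TE spec
}

EXAONE4_POST_LN: Dict[str, str] = {
    "linear_qkv": "TEColumnParallelLinear",       # no fused pre-norm
    "linear_proj": "TERowParallelLinearLayerNorm",  # post-attention norm
    "linear_fc1": "TEColumnParallelLinear",       # no fused pre-norm
    "linear_fc2": "TERowParallelLinearLayerNorm",   # post-feedforward norm
}


def changed_submodules(
    base: Dict[str, str], custom: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return {submodule: (base_class, custom_class)} for entries that differ."""
    return {
        name: (base[name], custom[name])
        for name in base
        if base[name] != custom[name]
    }


diff = changed_submodules(STANDARD_PRE_LN, EXAONE4_POST_LN)
for name, (old, new) in diff.items():
    print(f"{name}: {old} -> {new}")
```

Under these assumptions all four linear submodules change, which is consistent with the module needing its own layer spec instead of a small override of the standard one.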
- Parameters:
config – Reserved for future use (e.g., 32B hybrid attention with layer-wise branching between local and global attention).
- Returns:
ModuleSpec for the EXAONE 4.0 transformer layer.