`bridge.models.gemma.gemma4_provider`#

Gemma 4 text-only model providers.

Gemma4DenseProvider: dense models (E2B, E4B, and 31B). Builds a GPTModel with a local spec, dual RoPE, per-layer embeddings (PLE), and shared KV.

Gemma4ModelProvider: MoE models (26B-A4B and similar). Extends GPTModelProvider with a TE-based layer spec, dual RoPE, and a softcapped output layer.

Module Contents#

Classes#

`Gemma4DenseProvider`	Gemma 4 dense E2B, E4B, and 31B provider for clean Megatron-Core.
`Gemma4ModelProvider`	Configuration and provider for Megatron Core Gemma 4 MoE models.

Functions#

`_validate_gemma4_moe_orchestration`	Reject MCore execution modes bypassed by Gemma 4’s custom MoE forward.
`_install_gemma4_dense_load_state_aliases`	Translate Gemma4 Dense checkpoint attention aliases before load_state_dict.

API#

bridge.models.gemma.gemma4_provider._validate_gemma4_moe_orchestration( provider: megatron.bridge.models.gpt_provider.GPTModelProvider, ) → None#

Reject MCore execution modes bypassed by Gemma 4’s custom MoE forward.

bridge.models.gemma.gemma4_provider._install_gemma4_dense_load_state_aliases( model: torch.nn.Module, ) → None#

Translate Gemma4 Dense checkpoint attention aliases before load_state_dict.

Gemma4 Dense saves sliding and global attention tensors under separate names in dist-checkpoints because the two layer types have different sharded shapes. Once a dist-checkpoint load has materialized a regular state_dict, PyTorch module loading expects keys under the real module attribute name, `self_attention`, so the aliases must be translated back before load_state_dict runs.
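The renaming step described above can be sketched as a plain key rewrite. This is only an illustration under stated assumptions: the alias names `self_attention_sliding` and `self_attention_global` and the helper `translate_attention_aliases` are hypothetical, since this page does not list the real checkpoint names or show how the function hooks into loading.

```python
from collections import OrderedDict

# Hypothetical alias names; the real checkpoint names are not given on this page.
ALIASES = ("self_attention_sliding", "self_attention_global")
TARGET = "self_attention"  # real module attribute name, per the description above


def translate_attention_aliases(state_dict):
    """Return a copy of state_dict whose alias key segments are renamed to TARGET."""
    translated = OrderedDict()
    for key, value in state_dict.items():
        # Rename whole dotted segments only, so unrelated keys are left untouched.
        parts = [TARGET if part in ALIASES else part for part in key.split(".")]
        new_key = ".".join(parts)
        if new_key in translated:
            raise KeyError(f"alias collision on {new_key!r}")
        translated[new_key] = value
    return translated


checkpoint = {
    "decoder.layers.0.self_attention_sliding.linear_qkv.weight": "w0",
    "decoder.layers.5.self_attention_global.linear_qkv.weight": "w5",
    "embedding.word_embeddings.weight": "emb",
}
for name in translate_attention_aliases(checkpoint):
    print(name)
```

Because each layer is either sliding or global, the two aliases should never map onto the same key; the collision check is there only to fail loudly if that assumption is wrong.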

class bridge.models.gemma.gemma4_provider.Gemma4DenseProvider#

Bases: megatron.bridge.models.gpt_provider.GPTModelProvider

Gemma 4 dense E2B, E4B, and 31B provider for clean Megatron-Core.

All Gemma4-specific settings are encoded here as dataclass fields so that no Gemma4-specific CLI arguments are required.
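To illustrate the dataclass-field approach, here is a minimal sketch using a hypothetical stand-in class, `ToyDenseProvider`, that mirrors a handful of the documented defaults. It is not the real provider and does not import Megatron Bridge; the override values are illustrative and do not correspond to a released model size.

```python
from dataclasses import dataclass, replace


@dataclass
class ToyDenseProvider:
    """Hypothetical stand-in that copies a few documented default fields."""

    num_layers: int = 42
    hidden_size: int = 2560
    ffn_hidden_size: int = 10240
    num_attention_heads: int = 8
    num_query_groups: int = 2
    kv_channels: int = 256


# The defaults alone describe a full configuration, with no CLI flags involved.
default = ToyDenseProvider()

# A variant overrides only the fields that differ (illustrative values).
variant = replace(default, num_layers=4, hidden_size=512, ffn_hidden_size=2048)

print(default.num_layers, variant.num_layers, variant.kv_channels)
```

Keeping the settings as fields means a different model size can be expressed as a small set of overrides rather than a separate argument list.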

num_layers: int#: 42

hidden_size: int#: 2560

ffn_hidden_size: int#: 10240

num_attention_heads: int#: 8

num_query_groups: int#: 2

kv_channels: int#: 256

seq_length: int#: 131072

vocab_size: int#: 262143

make_vocab_size_divisible_by: int#: 128

normalization: str#: 'RMSNorm'

layernorm_epsilon: float#: 1e-06

gated_linear_unit: bool#: True

add_bias_linear: bool#: False

activation_func: Callable#: 'field(…)'

scale_embeddings_by_hidden_size: bool#: True

share_embeddings_and_output_weights: bool#: True

position_embedding_type: str#: 'rope'

rotary_percent: float#: 1.0

attention_dropout: float#: 0.0

hidden_dropout: float#: 0.0

window_size: Optional[Tuple[int, int]]#: (511, 0)

window_attn_skip_freq: Union[int, List[int]]#: 6

bf16: bool#: True

fp16: bool#: False

params_dtype: torch.dtype#: None

autocast_dtype: torch.dtype#: None

use_cpu_initialization: bool#: False

global_kv_channels: int#: 512

num_global_query_groups: int#: 2

attention_k_eq_v: bool#: False

sliding_window_rope_base: float#: 10000.0

full_attention_rope_base: float#: 1000000.0

full_attention_rope_partial_factor: float#: 0.25

num_kv_shared_layers: int#: 18

use_double_wide_mlp: bool#: False

per_layer_embed_vocab_size: int#: 262144

per_layer_embed_dim: int#: 256

final_logit_softcapping: float | None#: 30.0

num_moe_experts: Optional[int]#: None

moe_router_topk: Optional[int]#: None

moe_ffn_hidden_size: Optional[int]#: None
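The attention-related defaults above can be read together as a per-layer plan. The sketch below assumes the Megatron-Core convention in which an integer `window_attn_skip_freq` of N places one full-attention layer after every N-1 sliding-window layers, and it assumes that each layer type uses the RoPE base named after it. The helper `layer_plan` is hypothetical and is not part of this module.

```python
SLIDING_ROPE_BASE = 10000.0    # sliding_window_rope_base default
FULL_ROPE_BASE = 1000000.0     # full_attention_rope_base default


def layer_plan(num_layers=42, skip_freq=6):
    """Return a list of (attention kind, RoPE base), one entry per layer."""
    plan = []
    for layer_number in range(1, num_layers + 1):  # 1-based layer numbers
        # Assumption: every skip_freq-th layer uses full (global) attention.
        if layer_number % skip_freq == 0:
            plan.append(("full", FULL_ROPE_BASE))
        else:
            plan.append(("sliding", SLIDING_ROPE_BASE))
    return plan


plan = layer_plan()
full_layers = [i + 1 for i, (kind, _) in enumerate(plan) if kind == "full"]
print(full_layers)  # [6, 12, 18, 24, 30, 36, 42]
```

Under these assumptions the default 42-layer configuration has 7 full-attention layers and 35 sliding-window layers.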

finalize() → None#

_ensure_finalized() → None#

provide( pre_process: Optional[bool] = None, post_process: Optional[bool] = None, vp_stage: Optional[int] = None, ) → torch.nn.Module#

build( pre_process: bool = True, post_process: bool = True, ) → torch.nn.Module#

Build a Gemma-4 Dense GPTModel and attach Bridge-specific components.
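The `final_logit_softcapping` default of 30.0 listed among the fields suggests the built model caps its output logits. A minimal sketch of the idea follows, assuming the tanh-based form used in earlier Gemma releases; this page does not confirm the exact formula used here, and the scalar helper `softcap` is hypothetical.

```python
import math


def softcap(logit, cap=30.0):
    """Squash a logit smoothly into [-cap, cap]; pass through when cap is None."""
    if cap is None:
        return logit
    # Near zero this is close to the identity; large values saturate at +/- cap.
    return cap * math.tanh(logit / cap)


for value in (0.0, 1.0, 30.0, 1000.0):
    print(value, softcap(value))
```

Small logits pass through almost unchanged while extreme ones saturate at the cap, which bounds the range of the output distribution without a hard clip.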

class bridge.models.gemma.gemma4_provider.Gemma4ModelProvider#