Megatron Bridge Conversion Technical Details#
Megatron Bridge provides a robust, parallelism-aware pathway to convert models and checkpoints between 🤗 Hugging Face Transformers and Megatron-Core formats. This page dives into the architecture, data flows, and per-parameter conversion engine, with examples.
Megatron Bridge performs on-the-fly, model-parallel-aware, per-parameter conversion—unlike traditional converters that require a single GPU and full in-memory loading of both Megatron-Core and HF models.
For API-centric usage, see the guide: Bridge with 🤗 Hugging Face
Architecture at a glance#
flowchart LR
HF[Hugging Face<br/>PreTrained Model/Config] -->|detect| AutoBridge
AutoBridge -->|select| Bridge[MegatronModelBridge<br/>architecture-specific]
Bridge -->|provider bridge| Provider[Model Provider<br/>TransformerConfig builder]
Provider -->|instantiate| Megatron[Distributed Megatron Model]
Bridge -->|mapping registry| Registry[MegatronMappingRegistry]
Registry -->|maps to| Mappings[Param Mapping<br/>Auto Row Col QKV ...]
HF <-->|convert per-parameter| Bridge
Megatron --|TP/PP/VPP-aware|--> Mappings
Key components:
AutoBridge: Detects HF architecture, constructs the appropriate bridge, exposes high-level convert/save APIs. See bridge.models.conversion.auto_bridge.
MegatronModelBridge: Orchestrates conversion, builds conversion tasks, handles per-parameter streaming. See model_bridge.py.
MegatronMappingRegistry: Registry of parameter-name mappings; resolves concrete
MegatronParamMappingfor each weight. See mapping_registry.py.Param Mappings: Implement parameter transformations and parallel distribution (Auto, ColumnParallel, RowParallel, QKV, GatedMLP, Replicated, custom). See param_mapping.py.
Model Providers: Build
TransformerConfig-compatible providers for Megatron-Core and instantiate distributed models. See models/.Specific model bridge definitions: Architecture-specific bridges live under their model folders, for example LlamaBridge and Qwen3Bridge.
Conversion workflow#
The conversion engine is powered by two pieces: the architecture-specific Megatron model bridge and the parameter mappings.
(1) Config mapping + model creation: You specify a config mapping and parameter mapping strategy (name patterns + mapping types). The bridge loads the HF config, translates it into a Megatron provider, and the provider instantiates a (possibly distributed) Megatron model. With TP/PP/EP enabled, each rank holds only a shard of the model.
from megatron.bridge import AutoBridge
# Build bridge and instantiate Megatron model(s)
bridge = AutoBridge.from_hf_pretrained("meta-llama/Llama-3.2-1B")
provider = bridge.to_megatron_provider()
provider.finalize()
megatron_model = provider.provide_distributed_model(wrap_with_ddp=False)
(2) Gather all params: After the model is created, the bridge enumerates all named parameters and buffers across PP ranks. It then sorts them to produce a deterministic global order, ensuring every rank uses the same mapping order for collective operations during conversion.
(3) Resolve mappings: Using the global Megatron parameter names, the bridge queries the mapping registry to resolve a concrete mapping per parameter. For example, in Qwen3, decoder.layers.0.self_attention.linear_qkv.weight matches a QKVMapping pattern. Resolution always starts from the Megatron name; HF names are derived via wildcard substitution. Only the referenced HF tensors are fetched from safetensors—the full HF model is never fully loaded.
from megatron.bridge.models.conversion.mapping_registry import MegatronMappingRegistry
from megatron.bridge.models.conversion.param_mapping import AutoMapping, QKVMapping
registry = MegatronMappingRegistry(
AutoMapping(
megatron_param="decoder.layers.*.mlp.linear_fc2.weight",
hf_param="model.layers.*.mlp.down_proj.weight",
),
QKVMapping(
megatron_param="decoder.layers.*.self_attention.linear_qkv.weight",
q="model.layers.*.self_attn.q_proj.weight",
k="model.layers.*.self_attn.k_proj.weight",
v="model.layers.*.self_attn.v_proj.weight",
),
)
# Example: "decoder.layers.0.self_attention.linear_qkv.weight" → QKVMapping
(4) Create conversion tasks: The bridge pairs every Megatron parameter with its resolved mapping and associated metadata (owning module, tensor handle, parallelism context). These per-parameter tasks become the unit of work for conversion.
(5) Execute conversions: For HF→Megatron or Megatron→HF, the bridge iterates tasks and calls the mapping’s hf_to_megatron or megatron_to_hf routines. Conversion streams parameter-by-parameter to minimize memory.
# HF → Megatron streaming import (internally iterates conversion tasks)
bridge.load_hf_weights(megatron_model)
(6) Mapping semantics: Each mapping handles the necessary distribution logic—broadcasting across PP, scattering/gathering across TP/EP—and applies structural transforms (e.g., QKV fuse/split, gated-MLP concat/split, row/column parallel splits).
Properties:
Per-parameter streaming: Only the weight currently being processed is kept in memory.
Parallelism aware: Distribution respects TP (tensor), PP (pipeline), VPP (virtual pipeline), and expert parallel settings.
Deterministic mapping: Names resolve via
MegatronMappingRegistry, including wildcard support.
HF → Megatron (import)#
sequenceDiagram
participant U as User
participant AB as AutoBridge
participant B as MegatronModelBridge
participant R as MappingRegistry
participant M as Megatron Model(s)
U->>AB: from_hf_pretrained(path_or_id)
AB->>B: select bridge for architecture
U->>AB: load_hf_weights(model)
AB->>B: load_hf_weights(model)
B->>M: enumerate params/buffers across PP ranks
B->>B: gather global param names (all ranks)
B->>B: sort names deterministically
B->>R: mapping_registry()
B->>R: build_conversion_tasks(megatron_params, hf_state)
loop for each global Megatron param name
R-->>B: resolve mapping by Megatron name
B->>B: fetch referenced HF tensor(s) (safetensors)
B->>B: apply transforms (QKV/GatedMLP/Row/Column)
B->>M: scatter/gather TP/EP, broadcast PP
B->>M: copy into destination tensor
end
B-->>U: model with loaded weights
Megatron → HF (export)#
sequenceDiagram
participant U as User
participant AB as AutoBridge
participant B as MegatronModelBridge
participant R as MappingRegistry
participant M as Megatron Model(s)
U->>AB: export_hf_weights(model) / save_hf_pretrained(model, path)
AB->>B: export_hf_weights(model)
B->>M: enumerate params/buffers across PP ranks
B->>B: gather global param names (all ranks)
B->>B: sort names deterministically
B->>R: mapping_registry()
B->>R: build_conversion_tasks(megatron_params)
loop for each global Megatron param name
R-->>B: resolve mapping by Megatron name
B->>M: read shard(s) from owning PP rank
B->>M: gather across TP/EP as needed
B->>B: apply inverse transforms (split QKV/GatedMLP/Row/Column)
alt streaming
B-->>U: yield (hf_name, tensor)
else saving
B->>U: write to safetensors / HF dir
end
end
B-->>U: export complete
Param mappings and parallelism#
Mapping types available via param_mapping.py:
AutoMapping: General-purpose 1:1 parameter mapping with automatic TP-type detection; dispatches to ColumnParallelMapping, RowParallelMapping, or ReplicatedMapping based on the layer/module type (wildcards supported). Participates in PP broadcast and EP gather when applicable.
ColumnParallelMapping: Splits along the output dimension (dim 0) under TP. Participates in PP broadcast and EP gather when applicable.
RowParallelMapping: Splits along the input dimension (dim 1) under TP. Participates in PP broadcast and EP gather when applicable.
QKVMapping: Fuses/splits HF Q, K, V projections to Megatron’s interleaved QKV format and vice versa. Uses PP broadcast as needed and delegates TP to the underlying mapping.
GatedMLPMapping: Concatenates/splits gate and up projections. Participates in PP broadcast and EP gather when applicable.
ReplicatedMapping: Keeps parameters fully replicated across TP ranks (e.g., LayerNorm). Participates in PP broadcast and EP gather when applicable.
Note: If you need a one-to-many or many-to-one mapping that is not covered by QKVMapping or GatedMLPMapping, implement a custom mapping.
Example Mapping - ColumnParallelMapping: PP, TP, EP in practice#
HF → Megatron (import):
HF tensors are available from storage to all ranks; TP rank 0 reads the full tensor and performs the split/scatter.
TP: Rank 0 splits along dim 0 into
tp_sizechunks and scatters shards to TP ranks so each rank receives a tensor matching its local parameter shape/dtype/device.PP: No PP collectives are needed; the owning PP stage writes its shard directly.
EP: For expert parameters, each EP rank receives its local experts by name; no cross-EP collectives are required on import.
Megatron → HF (export):
Only the owning PP stage initially holds the local Megatron shard; it broadcasts to all PP ranks before TP gather.
PP: The owning PP stage first broadcasts the tensor to all PP ranks so every rank participates in the collectives.
TP: All TP shards are gathered and concatenated along dim 0 to reconstruct the full tensor.
EP: For expert parameters, shards are gathered across EP ranks and one HF tensor per expert is emitted with the correct names.
Let total experts be E and EP size be S (assume E % S == 0). Each EP rank owns E/S experts. For a given local expert index L on each EP rank, the global expert ids are L, L+E/S, …, L+(S-1)*E/S. We gather tensors from all EP ranks and emit one HF tensor per global expert id by substituting that id into the HF parameter name.
This mirrors ColumnParallelMapping.hf_to_megatron and ColumnParallelMapping.megatron_to_hf in param_mapping.py.
Implementation notes (from code):
Dtype handling: When HF and Megatron dtypes differ, weights are cast to the Megatron parameter dtype with a warning before TP scatter (see ColumnParallelMapping.hf_to_megatron in param_mapping.py).
FP8 export: Tensors are dequantized on export when using FP8 tensor classes (see
maybe_dequantizein param_mapping.py).MoE experts: Expert parameter names are normalized for lookup and expert shards are gathered across EP ranks and re-emitted per global expert id (see
gather_from_ep_ranksin param_mapping.py).
Architecture-specific bridge example: Qwen3#
Embedded from src/megatron/bridge/models/qwen/qwen3_bridge.py:
1@MegatronModelBridge.register_bridge(source=Qwen3ForCausalLM, target=GPTModel, model_type="qwen3")
2class Qwen3Bridge(MegatronModelBridge):
3 """
4 Megatron Bridge for Qwen3 Causal LM.
5
6 This bridge handles the conversion between HuggingFace Qwen3ForCausalLM
7 and Megatron-Core GPTModel formats. Qwen3 differs from Qwen2 by using
8 QK layernorm and no QKV bias.
9
10 Example:
11 >>> from megatron.bridge import AutoBridge
12 >>> bridge = AutoBridge.from_hf_pretrained("Qwen/Qwen3-1.7B")
13 >>> provider = bridge.to_megatron_provider()
14 """
15
16 def provider_bridge(self, hf_pretrained):
17 """Convert HuggingFace Qwen3 config to GPTModelProvider."""
18 provider = super().provider_bridge(hf_pretrained)
19
20 provider.normalization = "RMSNorm"
21 provider.gated_linear_unit = True
22 provider.add_bias_linear = False
23 provider.add_qkv_bias = False # Qwen3 does NOT have QKV bias (unlike Qwen2)
24 provider.hidden_dropout = 0.0
25 provider.qk_layernorm = True # Qwen3 uses QK layernorm
26 provider.autocast_dtype = torch.bfloat16
27
28 return provider
29
30 def mapping_registry(self) -> MegatronMappingRegistry:
31 """Return the MegatronMappingRegistry for Qwen3 parameter conversion.
32
33 Covers all Megatron-Core parameter names for both the standard decoder
34 layers and the MTP (Multi-Token Prediction) transformer layers that are
35 present when ``mtp_num_layers >= 1``.
36
37 Simple 1:1 renames are expressed as :class:`AutoMapping` entries.
38 The fused QKV matrix is handled by :class:`QKVMapping` and the gated
39 MLP gate+up projection by :class:`GatedMLPMapping`.
40 """
41 # Dictionary maps Megatron parameter names -> HF parameter names
42 # Supports wildcard (*) patterns for layer-specific parameters
43 param_mappings = {
44 # Embedding and output
45 "embedding.word_embeddings.weight": "model.embed_tokens.weight",
46 "output_layer.weight": "lm_head.weight",
47 "decoder.final_layernorm.weight": "model.norm.weight",
48 # Decoder layer attention norms and projections
49 "decoder.layers.*.self_attention.linear_qkv.layer_norm_weight": "model.layers.*.input_layernorm.weight",
50 "decoder.layers.*.mlp.linear_fc1.layer_norm_weight": "model.layers.*.post_attention_layernorm.weight",
51 "decoder.layers.*.self_attention.q_layernorm.weight": "model.layers.*.self_attn.q_norm.weight", # Qwen3 specific
52 "decoder.layers.*.self_attention.k_layernorm.weight": "model.layers.*.self_attn.k_norm.weight", # Qwen3 specific
53 "decoder.layers.*.self_attention.linear_proj.weight": "model.layers.*.self_attn.o_proj.weight",
54 "decoder.layers.*.mlp.linear_fc2.weight": "model.layers.*.mlp.down_proj.weight",
55 # MTP projection and norms (used when mtp_num_layers >= 1)
56 "mtp.layers.0.eh_proj.weight": "mtp.fc.weight",
57 "mtp.layers.0.enorm.weight": "mtp.pre_fc_norm_embedding.weight",
58 "mtp.layers.0.hnorm.weight": "mtp.pre_fc_norm_hidden.weight",
59 "mtp.layers.0.final_layernorm.weight": "mtp.norm.weight",
60 # MTP transformer layer attention
61 "mtp.layers.0.transformer_layer.self_attention.linear_qkv.layer_norm_weight": "mtp.layers.0.input_layernorm.weight",
62 "mtp.layers.0.transformer_layer.self_attention.q_layernorm.weight": "mtp.layers.0.self_attn.q_norm.weight",
63 "mtp.layers.0.transformer_layer.self_attention.k_layernorm.weight": "mtp.layers.0.self_attn.k_norm.weight",
64 "mtp.layers.0.transformer_layer.self_attention.linear_proj.weight": "mtp.layers.0.self_attn.o_proj.weight",
65 # MTP transformer layer MLP
66 "mtp.layers.0.transformer_layer.mlp.linear_fc1.layer_norm_weight": "mtp.layers.0.post_attention_layernorm.weight",
67 "mtp.layers.0.transformer_layer.mlp.linear_fc2.weight": "mtp.layers.0.mlp.down_proj.weight",
68 }
69
70 mapping_list = []
71 # Convert each dictionary entry to AutoMapping(megatron_param, hf_param)
72 for megatron_param, hf_param in param_mappings.items():
73 mapping_list.append(AutoMapping(megatron_param=megatron_param, hf_param=hf_param))
74
75 # Add special mappings that require parameter concatenation/transformation
76 mapping_list.extend(
77 [
78 # QKV: Combine separate Q, K, V matrices into single QKV matrix
79 # Note: Qwen3 does NOT have bias in QKV projections (unlike Qwen2)
80 QKVMapping(
81 megatron_param="decoder.layers.*.self_attention.linear_qkv.weight",
82 q="model.layers.*.self_attn.q_proj.weight",
83 k="model.layers.*.self_attn.k_proj.weight",
84 v="model.layers.*.self_attn.v_proj.weight",
85 ),
86 # Gated MLP: Combine gate and up projection matrices into single FC1 matrix
87 GatedMLPMapping(
88 megatron_param="decoder.layers.*.mlp.linear_fc1.weight",
89 gate="model.layers.*.mlp.gate_proj.weight",
90 up="model.layers.*.mlp.up_proj.weight",
91 ),
92 # MTP QKV: same split/merge as decoder layers
93 QKVMapping(
94 megatron_param="mtp.layers.*.transformer_layer.self_attention.linear_qkv.weight",
95 q="mtp.layers.*.self_attn.q_proj.weight",
96 k="mtp.layers.*.self_attn.k_proj.weight",
97 v="mtp.layers.*.self_attn.v_proj.weight",
98 ),
99 # MTP Gated MLP
100 GatedMLPMapping(
101 megatron_param="mtp.layers.0.transformer_layer.mlp.linear_fc1.weight",
102 gate="mtp.layers.0.mlp.gate_proj.weight",
103 up="mtp.layers.0.mlp.up_proj.weight",
104 ),
105 ]
106 )
107
108 return MegatronMappingRegistry(*mapping_list)
Notes:
provider_bridge: Translate HF config into a Megatron-compatible provider, including architecture quirks (e.g.,qk_layernorm=True).mapping_registry: Define exact name patterns and transformation mappings. Wildcards*apply the same rule across layers.