bridge.models.deepseek.deepseek_v4_bridge#
Bridge for the DeepSeek-V4 model family.
The bridge covers DeepSeek-V4 variants that share the deepseek_v4 HF config
schema. It derives dimension- and layer-dependent fields from the HF config and
dispatches checkpoint import by tensor dtype so FP8 and FP8+MXFP4 formats can
share the same conversion path.
Checkpoint format notes: DeepSeek-V4 uses a custom serialisation format that differs from standard HuggingFace Transformers naming conventions:
embed.weight (not model.embed_tokens.weight)
head.weight (not lm_head.weight)
norm.weight (not model.norm.weight)
layers.N.attn_norm.weight / layers.N.ffn_norm.weight
layers.N.attn.wq_a / wq_b / wkv / wo_a / wo_b …
layers.N.ffn.gate / experts / shared_experts …
layers.N.hc_attn_fn / hc_attn_base / hc_attn_scale (Hyper-Connections)
layers.N.hc_ffn_fn / hc_ffn_base / hc_ffn_scale
hc_head_fn / hc_head_base / hc_head_scale (global HC head, learned output contraction)
mtp.N.* (MTP layers)
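The naming scheme above can be sketched as a small key classifier. This is an illustration only, not the bridge's mapping registry: `V4_TO_STANDARD`, `standard_name`, and `classify_v4_key` are hypothetical names, and only the three top-level renames quoted in the notes are encoded.

```python
import re
from typing import Optional, Tuple

# Top-level names quoted in the notes above: V4 key -> standard Transformers key.
V4_TO_STANDARD = {
    "embed.weight": "model.embed_tokens.weight",
    "head.weight": "lm_head.weight",
    "norm.weight": "model.norm.weight",
}

# "layers.N.<rest>" and "mtp.N.<rest>" share the same indexed shape.
_INDEXED_KEY = re.compile(r"^(layers|mtp)\.(\d+)\.(.+)$")


def standard_name(key: str) -> Optional[str]:
    """Return the standard Transformers name for a top-level V4 key, if known."""
    return V4_TO_STANDARD.get(key)


def classify_v4_key(key: str) -> Tuple[str, Optional[int], str]:
    """Return (group, index, remainder) for a V4 checkpoint key."""
    if key in V4_TO_STANDARD:
        return ("top_level", None, key)
    match = _INDEXED_KEY.match(key)
    if match:
        group = "layer" if match.group(1) == "layers" else "mtp"
        return (group, int(match.group(2)), match.group(3))
    if key.startswith("hc_head_"):
        return ("hc_head", None, key)  # global HC head tensors
    return ("unknown", None, key)
```

A classifier like this is mainly useful for sanity-checking a shard index before conversion, for example by counting how many keys fall into the "unknown" group.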
Quantisation schemes: Two on-disk formats coexist in this family. The bridge dispatches purely on tensor dtype, so the same code path handles both:
Released variant | Attn / shared experts | Routed experts
Flash (post-trained) | FP8_E4M3 + F8_E8M0 (…) | MXFP4 packed I8 + F8_E8M0
Flash-Base / Pro / Pro-Base (raw) | FP8_E4M3 + F32 (…) | FP8_E4M3 + F32 (…)
All scale tensors use 128x128 block-tile geometry (scale.shape[i] == ceil(weight.shape[i]/128)),
except on the MXFP4 expert path, where each row carries one scale per 32-element K-tile.
maybe_modify_loaded_hf_weight converts both F8_E8M0 and F32 scales to
F32 via .to(torch.float32) and selects the matching tile expansion automatically.
All weights are dequantised to bfloat16 during import.
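The scale geometry and the dtype-only dispatch described above can be sketched in plain Python. The function names are hypothetical, dtypes are represented by the string labels used in the table rather than real torch dtypes, and the "passthrough" branch for unscaled tensors is an assumption.

```python
import math

BLOCK = 128    # FP8 block-tile edge, from the notes above
MX_TILE = 32   # MXFP4 K-tile width, from the notes above


def fp8_scale_shape(weight_shape):
    """Shape of the 128x128 block-scale tensor for an FP8 weight."""
    return tuple(math.ceil(dim / BLOCK) for dim in weight_shape)


def mxfp4_scale_shape(rows, k):
    """One scale per row per 32-element K-tile (k is the unpacked K length)."""
    return (rows, math.ceil(k / MX_TILE))


def pick_dequant_path(weight_dtype, scale_dtype):
    """Choose a dequantisation path from dtype labels alone."""
    if weight_dtype == "I8" and scale_dtype == "F8_E8M0":
        return "mxfp4"        # packed 4-bit routed experts
    if weight_dtype == "FP8_E4M3" and scale_dtype in ("F8_E8M0", "F32"):
        return "fp8_block"    # 128x128 block-scaled FP8
    return "passthrough"      # assumption: tensor has no scale to apply
```

Because the choice depends only on dtypes, the same function covers both rows of the table without knowing which released variant the checkpoint came from.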
MoE router note: Hash-routing layers (layer_number <= moe_n_hash_layers)
contain a tid2eid buffer (int32 vocab→expert lookup table). Buffers are not
parameters, so Megatron does not expose them via named_parameters().
The bridge handles tid2eid via maybe_modify_loaded_hf_weight() and
a dedicated _Tid2EidMapping that writes it into state_dict directly.
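As a rough illustration of what a hash-routing layer does with tid2eid, the sketch below uses a plain list as the lookup table. It assumes layer_number is 1-based and that the table holds a single expert id per token id; the real buffer's shape is not stated here, and both function names are hypothetical.

```python
def is_hash_routing_layer(layer_number, moe_n_hash_layers):
    """Apply the condition quoted above (layer_number assumed 1-based)."""
    return layer_number <= moe_n_hash_layers


def route_by_token_id(token_ids, tid2eid):
    """Look up the expert for each token id in a vocab-sized table.

    Assumption: one expert id per vocabulary entry. No learned router
    weights are involved, which is why the table is a buffer rather
    than a parameter.
    """
    return [tid2eid[token_id] for token_id in token_ids]
```

For a toy vocabulary of eight tokens spread over four experts, `route_by_token_id([5, 0, 3], [0, 1, 2, 3, 0, 1, 2, 3])` selects experts 1, 0 and 3.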
Megatron-Core prerequisites:
HyperConnectionModule
DSv4HybridSelfAttention / CompressedSparseAttention / CSAIndexer / Compressor
Hash-routing tid2eid support and SwiGLU clamp
Separate MTP e_proj / h_proj modules with hyper-connections
Module Contents#
Classes#
_HCAlphaMapping | Map Megatron’s three scalar HC alpha parameters to/from the V4 checkpoint’s 3-element hc_*_scale tensor.
_HCAlphaSecondaryMapping | Secondary mapping for alpha_post (index=1) or alpha_res (index=2).
_ReplicatedOptional | ReplicatedMapping for CSA-optional weights (compressor / indexer).
DeepSeekV4Bridge | Megatron Bridge implementation for DeepSeek-V4 causal language models.
Functions#
_dsv4_use_mxfp4_export | Routed DSv4 experts use packed MXFP4; all other scaled weights export as FP8.
Data#
API#
- bridge.models.deepseek.deepseek_v4_bridge._DSV4_LAYER_TYPE_TO_COMPRESS_RATIO#
None
- bridge.models.deepseek.deepseek_v4_bridge._DSV4_COMPRESS_RATIO_TO_LAYER_TYPE#
None
- bridge.models.deepseek.deepseek_v4_bridge._dsv4_num_hash_layers(hf_config) int#
- bridge.models.deepseek.deepseek_v4_bridge._dsv4_compress_ratios(hf_config) list[int]#
- bridge.models.deepseek.deepseek_v4_bridge._dsv4_use_mxfp4_export(hf_param: str, weight: torch.Tensor, source_scale: torch.Tensor)#
Routed DSv4 experts use packed MXFP4; all other scaled weights export as FP8.
- class bridge.models.deepseek.deepseek_v4_bridge._HCAlphaMapping(megatron_pre: str, megatron_post: str, megatron_res: str, hf_param: str)#
Bases:
megatron.bridge.models.conversion.param_mapping.MegatronParamMapping
Map Megatron’s three scalar HC alpha parameters to/from the V4 checkpoint’s 3-element hc_*_scale tensor.
V4 checkpoint : layers.N.hc_attn_scale shape [3] = [alpha_pre, alpha_post, alpha_res]
Megatron : three separate nn.Parameter([1]) tensors
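The layout difference can be pictured with plain lists standing in for tensors. This is a minimal sketch of the split and pack directions, not the class's actual hf_to_megatron / megatron_to_hf code; both function names are hypothetical.

```python
def split_hc_scale(hc_scale):
    """Import direction: one 3-element tensor -> three 1-element parameters."""
    if len(hc_scale) != 3:
        raise ValueError("expected [alpha_pre, alpha_post, alpha_res]")
    alpha_pre, alpha_post, alpha_res = hc_scale
    return [alpha_pre], [alpha_post], [alpha_res]


def pack_hc_scale(alpha_pre, alpha_post, alpha_res):
    """Export direction: three 1-element parameters -> one 3-element tensor."""
    return [alpha_pre[0], alpha_post[0], alpha_res[0]]
```

The two functions are inverses, so a checkpoint value survives an import followed by an export unchanged.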
Initialization
- static _resolve_single(pattern: str, captures) str#
- resolve(captures)#
- hf_to_megatron(hf_weights, megatron_module)#
- megatron_to_hf(megatron_weights, megatron_module)#
- class bridge.models.deepseek.deepseek_v4_bridge._HCAlphaSecondaryMapping(megatron_param: str, hf_scale_param: str, index: int)#
Bases:
megatron.bridge.models.conversion.param_mapping.MegatronParamMapping
Secondary mapping for alpha_post (index=1) or alpha_res (index=2).
Import: extracts element [index] from the 3-element hc_*_scale tensor.
Export: returns {} because the primary _HCAlphaMapping (alpha_pre) already exports all three alpha values together. This mapping only suppresses the “No mapping found” warning for the secondary Megatron params during export.
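A toy version of the import/export asymmetry, again with lists in place of tensors. The function names are hypothetical and the sketch only mirrors the behaviour described above: import picks one element, export contributes nothing.

```python
def secondary_import(hc_scale, index):
    """Import: take element [index] of the shared 3-element scale tensor."""
    if index not in (1, 2):
        raise ValueError("index must be 1 (alpha_post) or 2 (alpha_res)")
    return [hc_scale[index]]


def secondary_export(_megatron_value):
    """Export: emit nothing; the primary mapping already wrote all three values."""
    return {}
```

Merging the empty dict into the exporter's output leaves the primary mapping's single hc_*_scale entry untouched, so the tensor is never written twice.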
Initialization
- hf_to_megatron(hf_weights, megatron_module)#
- resolve(captures)#
- megatron_to_hf(megatron_weights, megatron_module)#
- class bridge.models.deepseek.deepseek_v4_bridge._ReplicatedOptional(megatron_param: str, hf_param: str)#
Bases:
megatron.bridge.models.conversion.param_mapping.ReplicatedMapping
ReplicatedMapping for CSA-optional weights (compressor / indexer).
Sets allow_hf_name_mismatch=True so the export path does not validate the HF key against the real checkpoint’s key set. Compressor and indexer weights only exist on non-hash layers; when we build a tiny smoke-test model whose layer indices don’t match the production compress_ratios, a strict hf_keys check would wrongly skip those weights.
resolve_wildcards() uses type(self)(…) which preserves this subclass, so allow_hf_name_mismatch stays True after wildcard expansion.
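The type(self)(…) point is ordinary Python behaviour and can be shown with toy classes. `ToyMapping`, `ToyOptionalMapping`, and `expand_wildcard` are hypothetical stand-ins, not the real ReplicatedMapping API.

```python
class ToyMapping:
    """Stand-in for a mapping base class with wildcard expansion."""

    def __init__(self, megatron_param, hf_param):
        self.megatron_param = megatron_param
        self.hf_param = hf_param
        self.allow_hf_name_mismatch = False

    def expand_wildcard(self, layer):
        # type(self) is the runtime class, so a subclass instance
        # produces another subclass instance, re-running its __init__.
        return type(self)(
            self.megatron_param.replace("*", str(layer)),
            self.hf_param.replace("*", str(layer)),
        )


class ToyOptionalMapping(ToyMapping):
    """Subclass that opts out of strict HF key validation."""

    def __init__(self, megatron_param, hf_param):
        super().__init__(megatron_param, hf_param)
        self.allow_hf_name_mismatch = True
```

Had the base class constructed `ToyMapping(...)` by name instead, the expanded mapping would silently fall back to `allow_hf_name_mismatch = False`.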
Initialization
- class bridge.models.deepseek.deepseek_v4_bridge.DeepSeekV4Bridge#
Bases:
megatron.bridge.models.conversion.model_bridge.MegatronModelBridge
Megatron Bridge implementation for DeepSeek-V4 causal language models.
- static generate_pipeline_layout(num_layers: int, pp: int, mtp_layers: int = 1)#
Generate a pipeline-parallel layout for DSv4 models.
DSv4 with hash MoE routing requires an explicit pipeline layout when PP > 1. The layout distributes decoder layers across PP stages, placing the embedding on the first stage and MTP + loss on the last stage.
- Parameters:
num_layers – Number of decoder layers (e.g. 43 for Flash, 61 for Pro).
pp – Pipeline parallel size.
mtp_layers – Number of MTP layers (default 1).
- Returns:
List of lists, where each inner list describes one pipeline stage.
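To make the shape of such a layout concrete, here is a minimal sketch under stated assumptions: decoder layers are split as evenly as possible, and each stage is a list of the string tokens "embedding", "decoder", "mtp" and "loss". Neither the balancing rule nor the token names are confirmed by this page, and `sketch_pipeline_layout` is a hypothetical helper, not the method itself.

```python
def sketch_pipeline_layout(num_layers, pp, mtp_layers=1):
    """Return one list of layer tokens per pipeline stage."""
    if pp < 1:
        raise ValueError("pp must be >= 1")
    base, extra = divmod(num_layers, pp)
    stages = []
    for stage in range(pp):
        # Earlier stages absorb the remainder, one extra layer each.
        count = base + (1 if stage < extra else 0)
        stages.append(["decoder"] * count)
    stages[0].insert(0, "embedding")                      # first stage
    stages[-1].extend(["mtp"] * mtp_layers + ["loss"])    # last stage
    return stages
```

With num_layers=43 and pp=4 this yields decoder counts of 11, 11, 11 and 10, with the embedding prepended to the first stage and one MTP layer plus the loss appended to the last.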
- provider_bridge(hf_pretrained: megatron.bridge.models.hf_pretrained.causal_lm.PreTrainedCausalLM)#
- classmethod megatron_to_hf_config(provider: megatron.bridge.models.mla_provider.MLAModelProvider)#
- maybe_modify_loaded_hf_weight(hf_param, hf_state_dict: Mapping[str, torch.Tensor])#
Dequantise quantized weights using their accompanying block-scale tensor.
V4 stores attention/embedding weights as float8_e4m3fn with 128x128-block scales, and expert FFN weights as packed MXFP4 (I8, two nibbles per byte) with one F8_E8M0 scale per 32-element tile. When hf_param is a dict (GatedMLPMapping etc.), each key is dequantised individually so expert gate/up weights are also handled.
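The MXFP4 path can be illustrated on a single row with plain Python integers. The FP4 E2M1 value table and the E8M0 rule (value = 2 ** (byte - 127)) follow the standard microscaling format definitions; the low-nibble-first packing order is an assumption, and the function names are hypothetical rather than taken from the bridge.

```python
# FP4 E2M1 magnitudes indexed by the low three bits; bit 3 is the sign.
_E2M1 = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e8m0(byte):
    """E8M0 is exponent-only: 2 ** (byte - 127), with 0xFF reserved for NaN."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)


def decode_fp4(nibble):
    """Decode one 4-bit E2M1 value."""
    magnitude = _E2M1[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def dequant_mxfp4_row(packed, scales, tile=32):
    """Unpack two FP4 values per byte, then scale each 32-element tile."""
    values = []
    for byte in packed:
        byte &= 0xFF                           # accept signed int8 inputs
        values.append(decode_fp4(byte & 0xF))  # low nibble first (assumption)
        values.append(decode_fp4(byte >> 4))
    return [v * decode_e8m0(scales[i // tile]) for i, v in enumerate(values)]
```

A production path would do the same unpacking with vectorised tensor operations and cast the result to bfloat16; the per-element loop here is only meant to make the bit layout readable.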
- mapping_registry() megatron.bridge.models.conversion.mapping_registry.MegatronMappingRegistry#
- maybe_modify_converted_hf_weight(task: megatron.bridge.models.conversion.model_bridge.WeightConversionTask, converted_weights_dict: Dict[str, torch.Tensor], hf_state_dict: Mapping[str, torch.Tensor])#
Recreate DSv4 quantized weight/scale pairs expected by the source shard index.