bridge.models.deepseek.deepseek_v4_bridge#

Bridge for the DeepSeek-V4 model family.

The bridge covers DeepSeek-V4 variants that share the deepseek_v4 HF config schema. It derives dimension- and layer-dependent fields from the HF config and dispatches checkpoint import by tensor dtype so FP8 and FP8+MXFP4 formats can share the same conversion path.

Checkpoint format notes: DeepSeek-V4 uses a custom serialisation format that differs from standard HuggingFace Transformers naming conventions:

  • embed.weight (not model.embed_tokens.weight)

  • head.weight (not lm_head.weight)

  • norm.weight (not model.norm.weight)

  • layers.N.attn_norm.weight / layers.N.ffn_norm.weight

  • layers.N.attn.wq_a / wq_b / wkv / wo_a / wo_b …

  • layers.N.ffn.gate / experts / shared_experts …

  • layers.N.hc_attn_fn / hc_attn_base / hc_attn_scale (Hyper-Connections)

  • layers.N.hc_ffn_fn / hc_ffn_base / hc_ffn_scale

  • hc_head_fn / hc_head_base / hc_head_scale (global HC head, learned output contraction)

  • mtp.N.* (MTP layers)
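The naming differences listed above can be summarised in a small lookup plus a key classifier. This is only an illustrative sketch: `HF_TO_V4_TOP_LEVEL` and `classify_v4_key` are hypothetical names, not part of the bridge, and the classifier assumes nothing beyond the `layers.N.` and `mtp.N.` prefixes shown in the list.

```python
import re
from typing import Optional, Tuple

# Standard HF Transformers name -> DeepSeek-V4 checkpoint name (from the list above).
HF_TO_V4_TOP_LEVEL = {
    "model.embed_tokens.weight": "embed.weight",
    "lm_head.weight": "head.weight",
    "model.norm.weight": "norm.weight",
}

_LAYER_KEY = re.compile(r"^layers\.(\d+)\.(.+)$")
_MTP_KEY = re.compile(r"^mtp\.(\d+)\.(.+)$")


def classify_v4_key(key: str) -> Tuple[str, Optional[int], str]:
    """Split a V4 checkpoint key into (group, index, remainder).

    group is "layer" for layers.N.*, "mtp" for mtp.N.*, and "global"
    for everything else (embed.weight, hc_head_fn, ...).
    """
    match = _LAYER_KEY.match(key)
    if match:
        return ("layer", int(match.group(1)), match.group(2))
    match = _MTP_KEY.match(key)
    if match:
        return ("mtp", int(match.group(1)), match.group(2))
    return ("global", None, key)
```

For example, `classify_v4_key("layers.3.attn_norm.weight")` separates the layer index from the per-layer suffix, which is the shape of information a wildcard mapping needs.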

Quantisation schemes: Two on-disk formats coexist in this family. The bridge dispatches purely on tensor dtype, so the same code path handles both:

Released variant                     Attn / shared experts     Routed experts
-----------------------------------  ------------------------  -------------------------
Flash (post-trained)                 FP8_E4M3 + F8_E8M0 (…)    MXFP4 packed I8 + F8_E8M0
Flash-Base / Pro / Pro-Base (raw)    FP8_E4M3 + F32 (…)        FP8_E4M3 + F32 (…)
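A dtype-only dispatch over the formats in the table might look like the sketch below. It is an illustration rather than the bridge's code: `select_dequant_path` and the returned path labels are hypothetical, and the dtype arguments are plain strings using the table's spelling, not `torch.dtype` objects.

```python
from typing import Optional

SCALE_DTYPES = ("F8_E8M0", "F32")


def select_dequant_path(weight_dtype: str, scale_dtype: Optional[str]) -> str:
    """Pick a dequantisation path from dtype labels alone (no variant name needed)."""
    if scale_dtype is None:
        # No accompanying scale tensor: treat the weight as unquantised.
        return "passthrough"
    if scale_dtype not in SCALE_DTYPES:
        raise ValueError(f"unexpected scale dtype: {scale_dtype}")
    if weight_dtype == "FP8_E4M3":
        # Attention / shared experts everywhere, and routed experts in the raw variants.
        return "fp8_block_128x128"
    if weight_dtype == "I8":
        # Packed MXFP4 routed experts in the post-trained Flash variant.
        return "mxfp4_row_k32"
    raise ValueError(f"unexpected quantised weight dtype: {weight_dtype}")
```

Because the decision depends only on the tensors themselves, a single checkpoint that mixes FP8 attention weights with MXFP4 experts needs no per-variant branch.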

All scale tensors use 128x128 block-tile geometry (scale.shape[i] == ceil(weight.shape[i]/128)), except on the MXFP4 expert path, where the scale is per-row over 32-element K-tiles. maybe_modify_loaded_hf_weight casts both F8_E8M0 and F32 scales to F32 via .to(torch.float32) and selects the tile expansion automatically. All weights are dequantised to bfloat16 during import.
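The two scale geometries can be checked with a few lines of plain Python. The sketch below uses nested lists instead of tensors so it runs without PyTorch. The helper names are hypothetical, and it assumes dequantisation multiplies each weight element by the scale of its tile and that K counts unpacked elements.

```python
import math
from typing import List, Sequence, Tuple

BLOCK = 128   # block-tile edge for FP8 weights
K_TILE = 32   # elements per scale along K on the MXFP4 expert path


def block_scale_shape(weight_shape: Sequence[int], block: int = BLOCK) -> Tuple[int, ...]:
    """scale.shape[i] == ceil(weight.shape[i] / block)."""
    return tuple(math.ceil(dim / block) for dim in weight_shape)


def mxfp4_scale_shape(rows: int, k: int, tile: int = K_TILE) -> Tuple[int, int]:
    """One scale per row per 32-element K-tile (k counts unpacked elements)."""
    return (rows, math.ceil(k / tile))


def dequantise_block(
    weight: List[List[float]], scale: List[List[float]], block: int = BLOCK
) -> List[List[float]]:
    """Expand each block scale over its tile and multiply element-wise."""
    return [
        [value * scale[r // block][c // block] for c, value in enumerate(row)]
        for r, row in enumerate(weight)
    ]
```

A weight whose dimensions are not multiples of 128 still gets a scale entry for the partial tile, which is why the shape rule uses `ceil` rather than integer division.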

MoE router note: Hash-routing layers (layer_number <= moe_n_hash_layers) contain a tid2eid buffer (int32 vocab→expert lookup table). Buffers are not parameters, so Megatron does not expose them via named_parameters(). The bridge handles tid2eid via maybe_modify_loaded_hf_weight() and a dedicated _Tid2EidMapping that writes it into state_dict directly.
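As a rough illustration of what the tid2eid table does, the sketch below models it as a plain list indexed by token id. The helper names and the toy table are hypothetical. The real buffer is an int32 tensor whose exact shape is not described here, and the layer test simply restates the condition given above.

```python
from typing import List, Sequence


def is_hash_routing_layer(layer_number: int, moe_n_hash_layers: int) -> bool:
    """Restates the condition from the note: layer_number <= moe_n_hash_layers."""
    return layer_number <= moe_n_hash_layers


def route_by_hash(token_ids: Sequence[int], tid2eid: Sequence[List[int]]) -> List[List[int]]:
    """Look up the expert ids for each token id; no learned gate is involved."""
    return [list(tid2eid[token_id]) for token_id in token_ids]


# Toy table for a 4-token vocabulary (assumption: a fixed set of expert ids per token).
toy_tid2eid = [[0, 1], [2, 3], [0, 3], [1, 2]]
routed = route_by_hash([2, 0, 2], toy_tid2eid)
```

Since the table is a fixed lookup rather than a trained weight, it lives outside the parameter list, which is why the bridge needs a dedicated mapping to carry it across.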

Megatron-Core prerequisites:

  • HyperConnectionModule

  • DSv4HybridSelfAttention / CompressedSparseAttention / CSAIndexer / Compressor

  • Hash-routing tid2eid support and SwiGLU clamp

  • Separate MTP e_proj / h_proj modules with hyper-connections

Module Contents#

Classes#

_HCAlphaMapping

Map Megatron’s three scalar HC alpha parameters to/from the V4 checkpoint’s 3-element hc_*_scale tensor.

_HCAlphaSecondaryMapping

Secondary mapping for alpha_post (index=1) or alpha_res (index=2).

_ReplicatedOptional

ReplicatedMapping for CSA-optional weights (compressor / indexer).

DeepSeekV4Bridge

Megatron Bridge implementation for DeepSeek-V4 causal language models.

Functions#

_dsv4_num_hash_layers

_dsv4_compress_ratios

_dsv4_use_mxfp4_export

Routed DSv4 experts use packed MXFP4; all other scaled weights export as FP8.

Data#

API#

bridge.models.deepseek.deepseek_v4_bridge._DSV4_LAYER_TYPE_TO_COMPRESS_RATIO#

None

bridge.models.deepseek.deepseek_v4_bridge._DSV4_COMPRESS_RATIO_TO_LAYER_TYPE#

None

bridge.models.deepseek.deepseek_v4_bridge._dsv4_num_hash_layers(hf_config) → int#
bridge.models.deepseek.deepseek_v4_bridge._dsv4_compress_ratios(hf_config) → list[int]#
bridge.models.deepseek.deepseek_v4_bridge._dsv4_use_mxfp4_export(
hf_param: str,
weight: torch.Tensor,
source_scale: torch.Tensor,
) → bool#

Routed DSv4 experts use packed MXFP4; all other scaled weights export as FP8.

class bridge.models.deepseek.deepseek_v4_bridge._HCAlphaMapping(
megatron_pre: str,
megatron_post: str,
megatron_res: str,
hf_param: str,
)#

Bases: megatron.bridge.models.conversion.param_mapping.MegatronParamMapping

Map Megatron’s three scalar HC alpha parameters to/from the V4 checkpoint’s 3-element hc_*_scale tensor.

V4 checkpoint : layers.N.hc_attn_scale shape [3] = [alpha_pre, alpha_post, alpha_res]
Megatron      : three separate nn.Parameter([1]) tensors
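The correspondence between the 3-element checkpoint tensor and the three one-element Megatron parameters can be sketched with plain lists. `split_hc_scale` and `merge_hc_alphas` are hypothetical helpers used only for illustration. The element order is the one stated above.

```python
from typing import Dict, List, Sequence

ALPHA_ORDER = ("alpha_pre", "alpha_post", "alpha_res")


def split_hc_scale(hc_scale: Sequence[float]) -> Dict[str, List[float]]:
    """Import direction: one shape-[3] tensor -> three shape-[1] values."""
    if len(hc_scale) != len(ALPHA_ORDER):
        raise ValueError(f"expected 3 elements, got {len(hc_scale)}")
    return {name: [value] for name, value in zip(ALPHA_ORDER, hc_scale)}


def merge_hc_alphas(alphas: Dict[str, Sequence[float]]) -> List[float]:
    """Export direction: three shape-[1] values -> one shape-[3] tensor."""
    return [alphas[name][0] for name in ALPHA_ORDER]
```

The two helpers are inverses, so an import followed by an export reproduces the original hc_*_scale values.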

Initialization

static _resolve_single(pattern: str, captures) str#
resolve(captures)#
hf_to_megatron(hf_weights, megatron_module)#
megatron_to_hf(megatron_weights, megatron_module)#
class bridge.models.deepseek.deepseek_v4_bridge._HCAlphaSecondaryMapping(
megatron_param: str,
hf_scale_param: str,
index: int,
)#

Bases: megatron.bridge.models.conversion.param_mapping.MegatronParamMapping

Secondary mapping for alpha_post (index=1) or alpha_res (index=2).

Import: extracts element [index] from the 3-element hc_*_scale tensor.

Export: returns {} because the primary _HCAlphaMapping (alpha_pre) already exports all three alpha values together. This mapping just suppresses the “No mapping found” warning for the secondary Megatron params during export.
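The asymmetry between import and export can be illustrated with a stand-in class. Everything here is a hypothetical sketch: `SecondaryAlphaSketch` and `collect_export` are not the bridge's classes, the parameter names are placeholders, and the real mapping operates on tensors and Megatron modules rather than lists.

```python
from typing import Dict, List, Sequence


class SecondaryAlphaSketch:
    """Imports one element of the scale tensor; contributes nothing on export."""

    def __init__(self, megatron_param: str, hf_scale_param: str, index: int) -> None:
        if index not in (1, 2):
            raise ValueError("index must be 1 (alpha_post) or 2 (alpha_res)")
        self.megatron_param = megatron_param
        self.hf_scale_param = hf_scale_param
        self.index = index

    def hf_to_megatron(self, hf_scale: Sequence[float]) -> List[float]:
        return [hf_scale[self.index]]

    def megatron_to_hf(self, megatron_value: Sequence[float]) -> Dict[str, List[float]]:
        # The primary mapping already wrote the full 3-element tensor.
        return {}


def collect_export(primary: Dict[str, List[float]], secondaries, values) -> Dict[str, List[float]]:
    """Merge exports; the empty dicts leave the primary's entry untouched."""
    exported = dict(primary)
    for mapping in secondaries:
        exported.update(mapping.megatron_to_hf(values[mapping.megatron_param]))
    return exported
```

Returning an empty dict rather than omitting the mapping means every Megatron parameter still has a registered owner, so nothing is reported as unmapped.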

Initialization

hf_to_megatron(hf_weights, megatron_module)#
resolve(captures)#
megatron_to_hf(megatron_weights, megatron_module)#
class bridge.models.deepseek.deepseek_v4_bridge._ReplicatedOptional(megatron_param: str, hf_param: str)#

Bases: megatron.bridge.models.conversion.param_mapping.ReplicatedMapping

ReplicatedMapping for CSA-optional weights (compressor / indexer).

Sets allow_hf_name_mismatch=True so the export path does not validate the HF key against the real checkpoint’s key set. Compressor and indexer weights only exist on non-hash layers; when we build a tiny smoke-test model whose layer indices don’t match the production compress_ratios, a strict hf_keys check would wrongly skip those weights.

resolve_wildcards() uses type(self)(…) which preserves this subclass, so allow_hf_name_mismatch stays True after wildcard expansion.
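Why `type(self)(…)` matters can be seen in a minimal stand-in hierarchy. The classes below are hypothetical and do not mirror the real `ReplicatedMapping` signature. They only demonstrate that constructing the resolved copy through `type(self)` keeps the subclass and therefore the flag it sets.

```python
from typing import Sequence


def fill_wildcards(pattern: str, captures: Sequence[str]) -> str:
    """Replace each '*' in order with the corresponding capture."""
    for capture in captures:
        pattern = pattern.replace("*", str(capture), 1)
    return pattern


class ReplicatedSketch:
    def __init__(self, megatron_param: str, hf_param: str) -> None:
        self.megatron_param = megatron_param
        self.hf_param = hf_param
        self.allow_hf_name_mismatch = False

    def resolve(self, captures: Sequence[str]) -> "ReplicatedSketch":
        # type(self) rather than ReplicatedSketch: subclasses survive expansion.
        return type(self)(
            fill_wildcards(self.megatron_param, captures),
            fill_wildcards(self.hf_param, captures),
        )


class OptionalSketch(ReplicatedSketch):
    def __init__(self, megatron_param: str, hf_param: str) -> None:
        super().__init__(megatron_param, hf_param)
        self.allow_hf_name_mismatch = True


resolved = OptionalSketch("decoder.layers.*.example", "layers.*.example").resolve(["3"])
```

Had `resolve` hard-coded the base class, `resolved` would silently fall back to strict key validation after expansion.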

Initialization

class bridge.models.deepseek.deepseek_v4_bridge.DeepSeekV4Bridge#

Bases: megatron.bridge.models.conversion.model_bridge.MegatronModelBridge

Megatron Bridge implementation for DeepSeek-V4 causal language models.

static generate_pipeline_layout(
num_layers: int,
pp: int,
mtp_layers: int = 1,
) → list[list[str]]#

Generate a pipeline-parallel layout for DSv4 models.

DSv4 with hash MoE routing requires an explicit pipeline layout when PP > 1. The layout distributes decoder layers across PP stages, placing the embedding on the first stage and MTP + loss on the last stage.

Parameters:
  • num_layers – Number of decoder layers (e.g. 43 for Flash, 61 for Pro).

  • pp – Pipeline parallel size.

  • mtp_layers – Number of MTP layers (default 1).

Returns:

List of lists, where each inner list describes one pipeline stage.
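To make the shape of the return value concrete, here is a simplified sketch of such a layout generator. It is not the bridge's algorithm: the function name, the stage labels ("embedding", "decoder", "mtp", "loss") and the even-split balancing policy are all assumptions chosen for illustration.

```python
from typing import List


def sketch_pipeline_layout(num_layers: int, pp: int, mtp_layers: int = 1) -> List[List[str]]:
    """Spread decoder layers over pp stages; embedding first, MTP + loss last."""
    if pp < 1 or num_layers < pp:
        raise ValueError("need at least one decoder layer per pipeline stage")
    base, extra = divmod(num_layers, pp)
    layout: List[List[str]] = []
    for stage in range(pp):
        # Assumed policy: earlier stages absorb the remainder layers.
        stage_layers = ["decoder"] * (base + (1 if stage < extra else 0))
        if stage == 0:
            stage_layers.insert(0, "embedding")
        if stage == pp - 1:
            stage_layers += ["mtp"] * mtp_layers + ["loss"]
        layout.append(stage_layers)
    return layout


layout = sketch_pipeline_layout(num_layers=43, pp=4)
```

A production layout would likely give the first and last stages fewer decoder layers to offset the embedding, MTP and loss work; the even split here keeps the sketch short.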

provider_bridge(
hf_pretrained: megatron.bridge.models.hf_pretrained.causal_lm.PreTrainedCausalLM,
) → megatron.bridge.models.mla_provider.MLAModelProvider#
classmethod megatron_to_hf_config(
provider: megatron.bridge.models.mla_provider.MLAModelProvider,
) → dict#
maybe_modify_loaded_hf_weight(
hf_param,
hf_state_dict: Mapping[str, torch.Tensor],
)#

Dequantise quantized weights using their accompanying block-scale tensor.

V4 stores attention/embedding weights as float8_e4m3fn with 128x128-block scales, and expert FFN weights as packed MXFP4 (I8, 2 nibbles/byte) with F8_E8M0 per-32-element scales. When hf_param is a dict (as for GatedMLPMapping etc.), each key is dequantised individually so that expert gate/up weights are also handled.
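The MXFP4 expert path can be sketched in plain Python for a single row. The E2M1 value table and the E8M0 power-of-two decoding follow the OCP Microscaling format definitions. The nibble order (low nibble first) and the helper names are assumptions, and the bridge itself works on tensors rather than lists.

```python
from typing import List, Sequence

# Magnitudes of the 8 non-negative FP4 (E2M1) codes; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_e8m0(byte: int) -> float:
    """E8M0 is an exponent-only scale: 2 ** (byte - 127); 0xFF encodes NaN."""
    if byte == 0xFF:
        return float("nan")
    return 2.0 ** (byte - 127)


def unpack_mxfp4_row(packed: Sequence[int], scales: Sequence[int], tile: int = 32) -> List[float]:
    """Unpack 2 FP4 values per byte and apply one scale per `tile` elements."""
    values: List[float] = []
    for byte in packed:
        byte &= 0xFF  # int8 storage: reinterpret as unsigned
        values.append(decode_e2m1(byte & 0x0F))  # assumption: low nibble first
        values.append(decode_e2m1(byte >> 4))
    return [value * decode_e8m0(scales[i // tile]) for i, value in enumerate(values)]
```

Each packed byte yields two elements, so a row with K unpacked elements occupies K/2 bytes and needs ceil(K/32) scale bytes.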

mapping_registry() → megatron.bridge.models.conversion.mapping_registry.MegatronMappingRegistry#
maybe_modify_converted_hf_weight(
task: megatron.bridge.models.conversion.model_bridge.WeightConversionTask,
converted_weights_dict: Dict[str, torch.Tensor],
hf_state_dict: Mapping[str, torch.Tensor],
) → Dict[str, torch.Tensor]#

Recreate DSv4 quantized weight/scale pairs expected by the source shard index.