`bridge.models.deepseek.deepseek_v4_bridge`#

Bridge for the DeepSeek-V4 model family.

The bridge covers DeepSeek-V4 variants that share the deepseek_v4 HF config schema. It derives dimension- and layer-dependent fields from the HF config and dispatches checkpoint import by tensor dtype so FP8 and FP8+MXFP4 formats can share the same conversion path.

Checkpoint format notes: DeepSeek-V4 uses a custom serialisation format that differs from standard HuggingFace Transformers naming conventions:

embed.weight (not model.embed_tokens.weight)
head.weight (not lm_head.weight)
norm.weight (not model.norm.weight)
layers.N.attn_norm.weight / layers.N.ffn_norm.weight
layers.N.attn.wq_a / wq_b / wkv / wo_a / wo_b …
layers.N.ffn.gate / experts / shared_experts …
layers.N.hc_attn_fn / hc_attn_base / hc_attn_scale (Hyper-Connections)
layers.N.hc_ffn_fn / hc_ffn_base / hc_ffn_scale
hc_head_fn / hc_head_base / hc_head_scale (global HC head, learned output contraction)
mtp.N.* (MTP layers)
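As a rough illustration of the naming differences listed above, the sketch below groups V4 checkpoint keys by prefix. The helper `classify_v4_key` and its group labels are hypothetical and are not part of the bridge; only the key patterns come from the list above.

```python
import re

# Top-level V4 names and the conventional Transformers names they replace,
# exactly as listed in the notes above.
V4_TO_CONVENTIONAL = {
    "embed.weight": "model.embed_tokens.weight",
    "head.weight": "lm_head.weight",
    "norm.weight": "model.norm.weight",
}

# Per-layer and MTP keys carry a numeric index after the prefix.
_INDEXED_KEY = re.compile(r"^(layers|mtp)\.(\d+)\.(.+)$")


def classify_v4_key(key):
    """Return (group, index, remainder) for a V4 checkpoint key.

    Hypothetical helper for illustration; the group labels are invented here.
    """
    if key in V4_TO_CONVENTIONAL:
        return ("top_level", None, key)
    if key.startswith("hc_head_"):
        return ("hc_head", None, key)
    match = _INDEXED_KEY.match(key)
    if match:
        group, index, remainder = match.groups()
        return (group, int(index), remainder)
    return ("unknown", None, key)
```

For example, `classify_v4_key("layers.3.attn.wq_a")` separates the layer index from the per-layer suffix, which is the shape of information a wildcard mapping needs.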

Quantisation schemes: Two on-disk formats coexist in this family. The bridge dispatches purely on tensor dtype, so the same code path handles both:

Released variant	Attn / shared experts	Routed experts
Flash (post-trained)	FP8_E4M3 + F8_E8M0 (…)	MXFP4 packed I8 + F8_E8M0
Flash-Base / Pro / Pro-Base (raw)	FP8_E4M3 + F32 (…)	FP8_E4M3 + F32 (…)

All scale tensors use a 128x128 block-tile geometry (scale.shape[i] == ceil(weight.shape[i]/128)), except on the MXFP4 expert path, where the scale is per-row over 32-element K-tiles. maybe_modify_loaded_hf_weight converts both F8_E8M0 and F32 scales to F32 via .to(torch.float32) and selects the tile expansion automatically. All weights are dequantised to bfloat16 during import.
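The 128x128 block-tile geometry can be sketched in plain Python. This is an illustration under stated assumptions, not the bridge's implementation: nested lists stand in for tensors, the weight values are assumed to be already decoded from FP8 to floats, and the function names are hypothetical.

```python
import math

BLOCK = 128  # tile edge stated in the notes above


def expected_scale_shape(weight_shape, block=BLOCK):
    """scale.shape[i] == ceil(weight.shape[i] / block) for every dimension."""
    return tuple(math.ceil(dim / block) for dim in weight_shape)


def dequantise_block_scaled(weight, scale, block=BLOCK):
    """Multiply each element by the scale of the tile that contains it.

    `weight` is a list of rows of already-decoded floats (assumption);
    `scale` is a list of rows with one entry per block x block tile.
    """
    return [
        [value * scale[r // block][c // block] for c, value in enumerate(row)]
        for r, row in enumerate(weight)
    ]
```

The `block` argument is exposed only so that the tile lookup can be tried on a small matrix; the text above fixes it at 128 for the FP8 paths.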

MoE router note: Hash-routing layers (layer_number <= moe_n_hash_layers) contain a tid2eid buffer (int32 vocab→expert lookup table). Buffers are not parameters, so Megatron does not expose them via named_parameters(). The bridge handles tid2eid via maybe_modify_loaded_hf_weight() and a dedicated _Tid2EidMapping that writes it into state_dict directly.

Megatron-Core prerequisites:

HyperConnectionModule
DSv4HybridSelfAttention / CompressedSparseAttention / CSAIndexer / Compressor
Hash-routing tid2eid support and SwiGLU clamp
Separate MTP e_proj / h_proj modules with hyper-connections

Module Contents#

Classes#

`_HCAlphaMapping`	Map Megatron’s three scalar HC alpha parameters to/from the V4 checkpoint’s 3-element hc_*_scale tensor.
`_HCAlphaSecondaryMapping`	Secondary mapping for alpha_post (index=1) or alpha_res (index=2).
`_ReplicatedOptional`	ReplicatedMapping for CSA-optional weights (compressor / indexer).
`DeepSeekV4Bridge`	Megatron Bridge implementation for DeepSeek-V4 causal language models.

Functions#

`deepseek_v4_supports_blackwell_fused_kernels`	Return whether DSv4 Blackwell-only fused kernels should default on.
`deepseek_v4_supports_fused_dsa_kernels`	Return whether DSv4 fused DSA kernels can be enabled.
`set_deepseek_v4_pipeline_model_parallel_layout`	Set an even DSv4 pipeline layout with MTP and loss on the last stage.
`_dsv4_num_hash_layers`
`_dsv4_compress_ratios`
`_dsv4_use_mxfp4_export`	Routed DSv4 experts use packed MXFP4; all other scaled weights export as FP8.

Data#

`_DSV4_LAYER_TYPE_TO_COMPRESS_RATIO`
`_DSV4_COMPRESS_RATIO_TO_LAYER_TYPE`

API#

bridge.models.deepseek.deepseek_v4_bridge._DSV4_LAYER_TYPE_TO_COMPRESS_RATIO#: None

bridge.models.deepseek.deepseek_v4_bridge._DSV4_COMPRESS_RATIO_TO_LAYER_TYPE#: None

bridge.models.deepseek.deepseek_v4_bridge.deepseek_v4_supports_blackwell_fused_kernels() → bool#

Return whether DSv4 Blackwell-only fused kernels should default on.

bridge.models.deepseek.deepseek_v4_bridge.deepseek_v4_supports_fused_dsa_kernels() → bool#

Return whether DSv4 fused DSA kernels can be enabled.

bridge.models.deepseek.deepseek_v4_bridge.set_deepseek_v4_pipeline_model_parallel_layout( model_cfg: megatron.bridge.models.mla_provider.MLAModelProvider, ) → None#

Set an even DSv4 pipeline layout with MTP and loss on the last stage.

DeepSeek-V4 uses hash-routed MoE layers that must co-locate with the embedding on the first pipeline stage, so an explicit pipeline_model_parallel_layout is required whenever pipeline_model_parallel_size > 1. This builds an even decoder split with the embedding on the first stage and the MTP/loss layers on the last stage.

Parameters:

model_cfg – The DeepSeek-V4 model provider to configure in place.

bridge.models.deepseek.deepseek_v4_bridge._dsv4_num_hash_layers(hf_config) → int#

bridge.models.deepseek.deepseek_v4_bridge._dsv4_compress_ratios(hf_config) → list[int]#

bridge.models.deepseek.deepseek_v4_bridge._dsv4_use_mxfp4_export( hf_param: str, weight: torch.Tensor, source_scale: torch.Tensor, ) → bool#

Routed DSv4 experts use packed MXFP4; all other scaled weights export as FP8.

class bridge.models.deepseek.deepseek_v4_bridge._HCAlphaMapping( megatron_pre: str, megatron_post: str, megatron_res: str, hf_param: str, )#

Bases: megatron.bridge.models.conversion.param_mapping.MegatronParamMapping

Map Megatron’s three scalar HC alpha parameters to/from the V4 checkpoint’s 3-element hc_*_scale tensor.

V4 checkpoint: layers.N.hc_attn_scale, shape [3] = [alpha_pre, alpha_post, alpha_res]
Megatron: three separate nn.Parameter([1]) tensors
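The layout relationship described here amounts to a split on import and a concatenation on export. The sketch below uses plain lists in place of tensors and hypothetical function names; it only illustrates the element order given above.

```python
def split_hc_scale(hc_scale):
    """Import direction: one 3-element scale -> three 1-element values.

    Lists stand in for tensors here (assumption for illustration).
    """
    if len(hc_scale) != 3:
        raise ValueError("expected [alpha_pre, alpha_post, alpha_res]")
    alpha_pre, alpha_post, alpha_res = hc_scale
    return [alpha_pre], [alpha_post], [alpha_res]


def merge_hc_alphas(alpha_pre, alpha_post, alpha_res):
    """Export direction: three 1-element values -> one 3-element scale."""
    return [alpha_pre[0], alpha_post[0], alpha_res[0]]
```

A round trip through both functions returns the original 3-element list, which is the property the mapping relies on.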

Initialization

static _resolve_single(pattern: str, captures) → str#

resolve(captures)#

hf_to_megatron(hf_weights, megatron_module)#

megatron_to_hf(megatron_weights, megatron_module)#

class bridge.models.deepseek.deepseek_v4_bridge._HCAlphaSecondaryMapping( megatron_param: str, hf_scale_param: str, index: int, )#

Bases: megatron.bridge.models.conversion.param_mapping.MegatronParamMapping

Secondary mapping for alpha_post (index=1) or alpha_res (index=2).

Import: extracts element [index] from the 3-element hc_*_scale tensor.
Export: returns {} because the primary _HCAlphaMapping (alpha_pre) already exports all three alpha values together. This mapping just suppresses the “No mapping found” warning for the secondary Megatron params during export.
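The division of labour between the primary and secondary mappings can be sketched as follows. All function names are hypothetical and lists stand in for tensors; the point is only that the secondary export contributes nothing, so the HF key is written exactly once.

```python
def secondary_import(hc_scale, index):
    """Import: pick alpha_post (index 1) or alpha_res (index 2)."""
    if index not in (1, 2):
        raise ValueError("secondary mappings cover index 1 and index 2 only")
    return [hc_scale[index]]


def secondary_export():
    """Export: nothing to emit, the primary mapping writes all three values."""
    return {}


def primary_export(hf_param, alpha_pre, alpha_post, alpha_res):
    """Export: the primary mapping emits the full 3-element scale."""
    return {hf_param: [alpha_pre[0], alpha_post[0], alpha_res[0]]}


def collect_export(hf_param, alpha_pre, alpha_post, alpha_res):
    """Merge the outputs of one primary and two secondary mappings."""
    exported = {}
    exported.update(primary_export(hf_param, alpha_pre, alpha_post, alpha_res))
    exported.update(secondary_export())  # alpha_post mapping
    exported.update(secondary_export())  # alpha_res mapping
    return exported
```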

Initialization

hf_to_megatron(hf_weights, megatron_module)#

resolve(captures)#

megatron_to_hf(megatron_weights, megatron_module)#

class bridge.models.deepseek.deepseek_v4_bridge._ReplicatedOptional(megatron_param: str, hf_param: str)#

Bases: megatron.bridge.models.conversion.param_mapping.ReplicatedMapping

ReplicatedMapping for CSA-optional weights (compressor / indexer).

Sets allow_hf_name_mismatch=True so the export path does not validate the HF key against the real checkpoint’s key set. Compressor and indexer weights only exist on non-hash layers; when we build a tiny smoke-test model whose layer indices don’t match the production compress_ratios, a strict hf_keys check would wrongly skip those weights.

resolve_wildcards() uses type(self)(…) which preserves this subclass, so allow_hf_name_mismatch stays True after wildcard expansion.
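The subclass-preservation point is a general Python idiom. In the sketch below, `BaseMapping` and `OptionalMapping` are hypothetical stand-ins rather than the real ReplicatedMapping classes, and the wildcard substitution is simplified to a single `*` replacement.

```python
class BaseMapping:
    """Stand-in for a mapping with wildcard patterns (hypothetical)."""

    def __init__(self, megatron_param, hf_param):
        self.megatron_param = megatron_param
        self.hf_param = hf_param
        self.allow_hf_name_mismatch = False

    def resolve_wildcards(self, index):
        # type(self)(...) constructs the same class as the instance,
        # so subclass __init__ behaviour is re-applied to the copy.
        return type(self)(
            self.megatron_param.replace("*", str(index)),
            self.hf_param.replace("*", str(index)),
        )


class OptionalMapping(BaseMapping):
    """Stand-in for a subclass that relaxes HF key validation."""

    def __init__(self, megatron_param, hf_param):
        super().__init__(megatron_param, hf_param)
        self.allow_hf_name_mismatch = True
```

Had `resolve_wildcards` hard-coded `BaseMapping(...)` instead, the resolved copy would silently fall back to strict key validation.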

Initialization

class bridge.models.deepseek.deepseek_v4_bridge.DeepSeekV4Bridge#

Bases: megatron.bridge.models.conversion.model_bridge.MegatronModelBridge

Megatron Bridge implementation for DeepSeek-V4 causal language models.

static generate_pipeline_layout( num_layers: int, pp: int, mtp_layers: int = 1, ) → list[list[str]]#

Generate a pipeline-parallel layout for DSv4 models.

DSv4 with hash MoE routing requires an explicit pipeline layout when PP > 1. The layout distributes decoder layers across PP stages, placing the embedding on the first stage and MTP + loss on the last stage.

Parameters:

num_layers – Number of decoder layers (e.g. 43 for Flash, 61 for Pro).
pp – Pipeline parallel size.
mtp_layers – Number of MTP layers (default 1).

Returns:

List of lists, where each inner list describes one pipeline stage.
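A minimal sketch of such a layout is shown below. It is not the bridge's implementation: the function name is hypothetical, the stage-entry strings ("embedding", "decoder", "mtp", "loss") are assumed for illustration, and giving any remainder layers to the earliest stages is one possible reading of "even"; the real method may balance stages differently.

```python
def sketch_even_layout(num_layers, pp, mtp_layers=1):
    """Spread decoder layers over pp stages (hypothetical illustration).

    The embedding goes on the first stage; MTP layers and the loss go on
    the last stage, as described above.
    """
    if pp < 1 or num_layers < pp:
        raise ValueError("need at least one decoder layer per stage")
    base, extra = divmod(num_layers, pp)
    layout = []
    for stage in range(pp):
        # ASSUMPTION: remainder layers are assigned to the earliest stages.
        count = base + (1 if stage < extra else 0)
        layout.append(["decoder"] * count)
    layout[0].insert(0, "embedding")
    layout[-1].extend(["mtp"] * mtp_layers)
    layout[-1].append("loss")
    return layout
```

With 43 decoder layers and four stages, this sketch yields per-stage decoder counts that differ by at most one, with the extra entries appearing only on the first and last stages.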

provider_bridge( hf_pretrained: megatron.bridge.models.hf_pretrained.causal_lm.PreTrainedCausalLM, ) → megatron.bridge.models.mla_provider.MLAModelProvider#

classmethod megatron_to_hf_config( provider: megatron.bridge.models.mla_provider.MLAModelProvider, ) → dict#

maybe_modify_loaded_hf_weight( hf_param, hf_state_dict: Mapping[str, torch.Tensor], )#

Dequantise quantized weights using their accompanying block-scale tensor.

V4 stores attention/embedding weights as float8_e4m3fn with 128x128-block scales, and expert FFN weights as packed MXFP4 (I8, 2 nibbles per byte) with F8_E8M0 per-32-element scales. When hf_param is a dict (GatedMLPMapping etc.), each key is dequantised individually so that expert gate/up weights are also handled.
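The packed MXFP4 path can be illustrated with a pure-Python decoder for a single row. This is a sketch under assumptions, not the bridge's code: the function names are hypothetical, the low nibble is assumed to hold the first element of each byte, and the reserved NaN encoding of E8M0 is not handled. The E2M1 value table and the E8M0 bias of 127 follow the general microscaling format definitions.

```python
# FP4 E2M1 magnitudes, indexed by the low three bits of a nibble.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code; bit 3 is the sign."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def decode_e8m0(code):
    """E8M0 is exponent-only: value = 2 ** (code - 127)."""
    return 2.0 ** (code - 127)


def unpack_mxfp4_row(packed_bytes, scale_codes, tile=32):
    """Unpack two FP4 values per byte and apply one scale per `tile` values."""
    values = []
    for byte in packed_bytes:
        byte &= 0xFF  # treat signed I8 storage as raw bits
        values.append(decode_e2m1(byte & 0x0F))  # ASSUMPTION: low nibble first
        values.append(decode_e2m1(byte >> 4))
    return [v * decode_e8m0(scale_codes[i // tile]) for i, v in enumerate(values)]
```

Each packed byte expands to two values, so a row of K logical elements is stored in K/2 bytes and needs K/32 scale codes.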

mapping_registry() → megatron.bridge.models.conversion.mapping_registry.MegatronMappingRegistry#

maybe_modify_converted_hf_weight( task: megatron.bridge.models.conversion.model_bridge.WeightConversionTask, converted_weights_dict: Dict[str, torch.Tensor], hf_state_dict: Mapping[str, torch.Tensor], ) → Dict[str, torch.Tensor]#

Recreate DSv4 quantized weight/scale pairs expected by the source shard index.

When task.weight_dtype is set, requantization is skipped and the weights are returned unchanged, because the generic export path casts the dtype.
