bridge.models.deepseek.deepseek_v3_bridge
Module Contents
Classes
DeepSeekV3Bridge | Megatron Bridge for DeepSeek-V3.
Data
API
- bridge.models.deepseek.deepseek_v3_bridge.__all__
['DeepSeekV3Bridge', '_dequant_fp8_blockwise']
- bridge.models.deepseek.deepseek_v3_bridge._dequant_fp8_blockwise
None
- class bridge.models.deepseek.deepseek_v3_bridge.DeepSeekV3Bridge
Bases: megatron.bridge.models.conversion.model_bridge.MegatronModelBridge
Megatron Bridge for DeepSeek-V3.
- provider_bridge(
- hf_pretrained: megatron.bridge.models.hf_pretrained.causal_lm.PreTrainedCausalLM,
- classmethod megatron_to_hf_config(
- provider: megatron.bridge.models.mla_provider.MLAModelProvider,
- mapping_registry() -> megatron.bridge.models.conversion.mapping_registry.MegatronMappingRegistry
- maybe_modify_loaded_hf_weight(
- hf_param: Union[str, dict[str, str]],
- hf_state_dict: Mapping[str, torch.Tensor],
Load HF weights and dequantize FP8 tensors on the fly.
DeepSeek-V3 ships linear weights as float8_e4m3fn with per-block scale factors stored in <key>_scale_inv (128x128 blocks). The true bf16 weight is:
    w_bf16 = fp8_weight.float() * scale_inv_block
Without this override, the bridge would do a bare .to(bf16) cast in ColumnParallelMapping.hf_to_megatron (param_mapping.py:905), discarding the per-block scales; the resulting model produces random-looking logits.
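The per-block scaling described above can be illustrated with a minimal sketch. It is not the bridge's implementation: it uses NumPy with float32 arrays as a stand-in for the FP8 payload, and the helper name dequant_blockwise is hypothetical. It assumes one scale factor per 128x128 tile, laid out as a (ceil(rows/128), ceil(cols/128)) grid.

```python
import numpy as np

BLOCK = 128  # tile edge length, per the docstring (128x128 blocks)


def dequant_blockwise(weight_q: np.ndarray, scale_inv: np.ndarray,
                      block: int = BLOCK) -> np.ndarray:
    """Multiply every (block x block) tile of weight_q by its scale factor.

    Hypothetical helper for illustration; weight_q stands in for the FP8
    tensor and scale_inv for the matching <key>_scale_inv tensor.
    """
    rows, cols = weight_q.shape
    # Repeat each scale along both axes so it covers its whole tile, then
    # crop, because the last tile in either dimension may be partial.
    scale_full = np.repeat(np.repeat(scale_inv, block, axis=0), block, axis=1)
    scale_full = scale_full[:rows, :cols]
    return weight_q.astype(np.float32) * scale_full


# Toy example: a 256x384 weight has a 2x3 grid of 128x128 tiles.
weight_q = np.ones((256, 384), dtype=np.float32)
scale_inv = np.arange(1, 7, dtype=np.float32).reshape(2, 3)
w = dequant_blockwise(weight_q, scale_inv)
```

Skipping the multiplication, as a bare dtype cast would, leaves every tile at the wrong magnitude, which is consistent with the random-looking logits noted above.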
- static _maybe_dequantize_fp8(
- weight: torch.Tensor,
- param_name: str,
- hf_state_dict: Mapping[str, torch.Tensor],
Dequantize weight if it is stored as FP8 with a matching *_scale_inv.
- maybe_modify_converted_hf_weight(
- task: megatron.bridge.models.conversion.model_bridge.WeightConversionTask,
- converted_weights_dict: Dict[str, torch.Tensor],
- hf_state_dict: Mapping[str, torch.Tensor],
Add rotary embedding inverse frequency parameter if needed.
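For context, the standard (unscaled) rotary inverse-frequency vector can be sketched as below. This is an assumption about what such a parameter typically contains, not a description of what this method computes; the function name rope_inv_freq and the values dim=64 and base=10000.0 are illustrative, and any frequency scaling the model applies is not covered.

```python
def rope_inv_freq(dim: int, base: float = 10000.0) -> list[float]:
    """Return 1 / base**(i / dim) for i = 0, 2, 4, ..., dim - 2.

    Illustrative helper: this is the textbook RoPE formula, with one
    frequency per pair of rotary dimensions.
    """
    if dim % 2:
        raise ValueError("rotary dimension must be even")
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


# A rotary dimension of 64 yields 32 frequencies, from 1.0 downward.
inv_freq = rope_inv_freq(64)
```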