nemo_automodel.components.models.deepseek_v4.model#
DeepSeek V4 Model.
Key architectural points (from official inference/model.py):
- HC (Hyper-Connections): every transformer block maintains hc_mult=4 copies of the hidden state. The embedding output is expanded: [B, S, dim] -> [B, S, hc_mult, dim]. hc_pre reduces [B, S, hc_mult, dim] -> [B, S, dim] before attn/ffn; hc_post expands [B, S, dim] -> [B, S, hc_mult, dim] after attn/ffn. Full HC requires the hc_split_sinkhorn CUDA kernel. Current fallback: mean-pooling for hc_pre, broadcast add for hc_post (see the sketch after this list).
- HC parameters (ALL layers, stored in float32):
  - hc_attn_fn: [mix_hc, hc_mult * dim], where mix_hc = (2 + hc_mult) * hc_mult = 24
  - hc_attn_base: [mix_hc]
  - hc_attn_scale: [3]
  - hc_ffn_fn: [mix_hc, hc_mult * dim]
  - hc_ffn_base: [mix_hc]
  - hc_ffn_scale: [3]
- Gate hash layers (layer_idx < num_hash_layers): instead of score-based routing, the gate uses a fixed token-id -> expert-id lookup table (tid2eid: [vocab_size, n_activated_experts]).
- All layers use MoE FFN (no dense layers). Compress-ratio sliding-window attention is not yet implemented.
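A minimal sketch of the fallback collapse / expand path described above, using the documented shapes; the tensor sizes are illustrative and the full hyper-connection mixer (which needs the hc_split_sinkhorn kernel) is not reproduced here:

```python
import torch

B, S, dim, hc_mult = 2, 16, 64, 4          # illustrative sizes, not the real config values
mix_hc = (2 + hc_mult) * hc_mult           # = 24, matches the hc_*_fn leading dimension

# HC mixer parameters are kept in float32 (shapes as documented above)
hc_attn_fn = torch.zeros(mix_hc, hc_mult * dim, dtype=torch.float32)
hc_attn_base = torch.zeros(mix_hc, dtype=torch.float32)
hc_attn_scale = torch.zeros(3, dtype=torch.float32)

# The hidden state carries hc_mult replicas after the embedding expansion
h = torch.randn(B, S, hc_mult, dim)        # [B, S, hc_mult, dim]

# Fallback hc_pre: mean-pool the replicas into a single stream for attn / ffn
x = h.mean(dim=2)                          # [B, S, dim]

sublayer_out = x                           # stand-in for the attention / FFN output

# Fallback hc_post: broadcast-add the sub-layer output back onto every replica
h = h + sublayer_out.unsqueeze(2)          # [B, S, hc_mult, dim]
```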
Module Contents#
Classes#
| Class | Description |
|---|---|
| DeepseekV4Block | Single transformer block for DeepSeek V4. |
| DeepseekV4HashGate | Hash gate for first num_hash_layers: routes tokens via a fixed lookup table. |
Data#
API#
- class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block(
- layer_idx: int,
- config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig,
- backend: nemo_automodel.components.models.common.BackendConfig,
Bases: torch.nn.Module

Single transformer block for DeepSeek V4.

Uses HuggingFace transformers PR 45616’s HyperConnection decoder-layer pattern: two `DeepseekV4HyperConnection` modules own the collapse / expand mixer weights at the attention and FFN sites respectively. The checkpoint’s flat `hc_attn_*` / `hc_ffn_*` keys are routed into `attn_hc.*` / `ffn_hc.*` by the state-dict adapter.

Initialization
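A rough sketch of the kind of key remapping the state-dict adapter performs; the exact layer prefixes and the adapter's API are assumptions, only the `hc_attn_*` / `hc_ffn_*` to `attn_hc.*` / `ffn_hc.*` routing comes from the docstring:

```python
import re
import torch

def remap_hc_keys(state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Route flat hc_attn_* / hc_ffn_* checkpoint keys into the attn_hc.* / ffn_hc.*
    sub-modules of DeepseekV4Block (illustrative only; surrounding key names are assumed)."""
    remapped = {}
    for key, value in state_dict.items():
        # e.g. "layers.3.hc_attn_fn" -> "layers.3.attn_hc.fn"
        new_key = re.sub(r"hc_attn_(fn|base|scale)$", r"attn_hc.\1", key)
        new_key = re.sub(r"hc_ffn_(fn|base|scale)$", r"ffn_hc.\1", new_key)
        remapped[new_key] = value
    return remapped
```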
- forward(
- x: torch.Tensor,
- position_embeddings: tuple[torch.Tensor, torch.Tensor],
- position_embeddings_compress: tuple[torch.Tensor, torch.Tensor] | None = None,
- rotary_compress: torch.nn.Module | None = None,
- attention_mask: torch.Tensor | None = None,
- padding_mask: torch.Tensor | None = None,
- input_ids: torch.Tensor | None = None,
- **attn_kwargs: Any,
- init_weights(buffer_device: torch.device) -> None#
- class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate(
- config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig,
Bases: torch.nn.Module

Hash gate for first num_hash_layers: routes tokens via a fixed lookup table.
Instead of computing routing scores, the gate uses tid2eid[token_id] to pre-assign expert indices. The routing weight is still computed from the gate weight but the selection is deterministic per token id.
tid2eid shape: [vocab_size, n_activated_experts] (int32, non-trainable)
Signature matches `components.moe.layers.Gate` (`forward(x, token_mask, cp_mesh)` returning `(weights, indices, aux_loss)`), so the generic MoE module can call it interchangeably. The per-forward `input_ids` needed for the tid2eid lookup is stashed on the module by the enclosing Block via `set_input_ids` immediately before the MoE call.
Initialization
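A minimal sketch of the deterministic routing described above. The tid2eid lookup and shapes follow the docstring; how the routing weights are derived from the gate weight (softmax over scores here) is an assumption, not the actual implementation:

```python
import torch
import torch.nn.functional as F

vocab_size, dim, n_experts, n_activated = 1000, 64, 8, 2   # illustrative sizes

# Fixed token-id -> expert-id lookup table: [vocab_size, n_activated_experts], int32, non-trainable
tid2eid = torch.randint(0, n_experts, (vocab_size, n_activated), dtype=torch.int32)
gate_weight = torch.randn(n_experts, dim)                  # assumed linear gate weight

def hash_gate_forward(x: torch.Tensor, input_ids: torch.Tensor):
    """x: [T, dim] flattened tokens; input_ids: [T] token ids (stashed via set_input_ids)."""
    indices = tid2eid[input_ids.long()]                     # deterministic selection: [T, n_activated]
    scores = F.softmax(x @ gate_weight.t(), dim=-1)         # routing weights still come from the gate
    weights = torch.gather(scores, 1, indices.long())       # pick the weights of the pre-assigned experts
    return weights, indices

x = torch.randn(5, dim)
ids = torch.randint(0, vocab_size, (5,))
weights, indices = hash_gate_forward(x, ids)
```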
- set_input_ids(input_ids: torch.Tensor | None) -> None#
Stash the current batch’s input_ids for the next `forward` call.
- update_bias() -> None#
No-op for compat with callers that walk MoE gates and call update_bias.
- init_weights(buffer_device: torch.device | None = None) -> None#
- forward(
- x: torch.Tensor,
- token_mask: torch.Tensor | None = None,
- cp_mesh: DeviceMesh | None = None,
- class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model(
- config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
- backend: nemo_automodel.components.models.common.BackendConfig,
- *,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- moe_overrides: dict | None = None,
Bases: torch.nn.Module

Initialization
- forward(
- input_ids: torch.Tensor | None = None,
- *,
- inputs_embeds: torch.Tensor | None = None,
- position_ids: torch.Tensor | None = None,
- attention_mask: torch.Tensor | None = None,
- padding_mask: torch.Tensor | None = None,
- **attn_kwargs: Any,
- update_moe_gate_bias() -> None#
- init_weights(buffer_device: torch.device | None = None) -> None#
- class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM(
- config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs,
Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

- _keep_in_fp32_modules_strict#

['attn_hc.fn', 'attn_hc.base', 'attn_hc.scale', 'ffn_hc.fn', 'ffn_hc.base', 'ffn_hc.scale', 'hc_head…
- classmethod from_config(
- config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs,
- classmethod from_pretrained(
- pretrained_model_name_or_path: str,
- *model_args,
- **kwargs,
- get_input_embeddings()#
- set_input_embeddings(value)#
- get_output_embeddings()#
- set_output_embeddings(new_embeddings)#
- forward(
- input_ids: torch.Tensor,
- *,
- position_ids: torch.Tensor | None = None,
- attention_mask: torch.Tensor | None = None,
- padding_mask: torch.Tensor | None = None,
- **attn_kwargs: Any,
- update_moe_gate_bias() -> None#
- initialize_weights(
- buffer_device: torch.device | None = None,
- dtype: torch.dtype = torch.bfloat16,
- nemo_automodel.components.models.deepseek_v4.model.ModelClass#
None
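A hedged usage sketch of the documented entry points. The checkpoint path is a placeholder, and the exact return type of `forward` is not specified in this page, so it is only labeled as model outputs here:

```python
import torch
from nemo_automodel.components.models.deepseek_v4.model import DeepseekV4ForCausalLM

# Load from a pretrained checkpoint directory (path is a placeholder)
model = DeepseekV4ForCausalLM.from_pretrained("/path/to/deepseek-v4-checkpoint")

input_ids = torch.randint(0, 100, (1, 16))
outputs = model(input_ids)   # forward also accepts optional position_ids / attention_mask / padding_mask
```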