nemo_automodel.components.models.deepseek_v4.model#

DeepSeek V4 Model.

Key architectural points (from the official inference/model.py):

HC (Hyper-Connections): every transformer block maintains hc_mult=4 copies of the hidden state.

- The embedding output is expanded: [B,S,dim] -> [B,S,hc_mult,dim].
- hc_pre reduces [B,S,hc_mult,dim] -> [B,S,dim] before attn/ffn; hc_post expands [B,S,dim] -> [B,S,hc_mult,dim] after attn/ffn.
- Full HC requires the hc_split_sinkhorn CUDA kernel; the current fallback uses mean-pooling for hc_pre and a broadcast add for hc_post.
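A minimal sketch of the fallback path in plain PyTorch (the sublayer stand-in and all sizes are illustrative, not the real module):

```python
import torch
import torch.nn as nn

# Illustrative fallback only: full HC needs the hc_split_sinkhorn CUDA kernel.
B, S, hc_mult, dim = 2, 16, 4, 128
sublayer = nn.Linear(dim, dim)          # stand-in for the attn/ffn sublayer

h = torch.randn(B, S, hc_mult, dim)     # expanded hidden state
x = h.mean(dim=2)                       # hc_pre fallback: [B,S,hc_mult,dim] -> [B,S,dim]
y = sublayer(x)                         # sublayer output, [B,S,dim]
h = h + y.unsqueeze(2)                  # hc_post fallback: broadcast add into all hc_mult copies
```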

HC parameters (ALL layers, stored in float32):

- hc_attn_fn : [mix_hc, hc_mult*dim], where mix_hc = (2 + hc_mult) * hc_mult = 24
- hc_attn_base : [mix_hc]
- hc_attn_scale : [3]
- hc_ffn_fn : [mix_hc, hc_mult*dim]
- hc_ffn_base : [mix_hc]
- hc_ffn_scale : [3]
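For concreteness, a sketch of how these shapes fit together; hc_mult = 4 per the above, while the dim value and the registration style are illustrative assumptions:

```python
import torch
import torch.nn as nn

hc_mult, dim = 4, 7168                   # dim is illustrative
mix_hc = (2 + hc_mult) * hc_mult         # = 24

hc_attn_fn    = nn.Parameter(torch.zeros(mix_hc, hc_mult * dim, dtype=torch.float32))
hc_attn_base  = nn.Parameter(torch.zeros(mix_hc, dtype=torch.float32))
hc_attn_scale = nn.Parameter(torch.zeros(3, dtype=torch.float32))
# hc_ffn_fn / hc_ffn_base / hc_ffn_scale mirror the same shapes at the FFN site.
```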

Gate hash layers (layer_idx < num_hash_layers): Instead of score-based routing, the gate uses a fixed token-id -> expert-id lookup table (tid2eid: [vocab_size, n_activated_experts]).

All layers use MoE FFN (no dense layers). Compress-ratio sliding-window attention is not yet implemented.

Module Contents#

Classes#

DeepseekV4Block

Single transformer block for DeepSeek V4.

DeepseekV4HashGate

Hash gate for the first num_hash_layers layers: routes tokens via a fixed lookup table.

DeepseekV4Model

DeepseekV4ForCausalLM

Data#

API#

class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block(
layer_idx: int,
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
)#

Bases: torch.nn.Module

Single transformer block for DeepSeek V4.

Uses HuggingFace transformers PR 45616’s HyperConnection decoder-layer pattern: two DeepseekV4HyperConnection modules own the collapse / expand mixer weights at the attention and FFN sites respectively. The checkpoint’s flat hc_attn_* / hc_ffn_* keys are routed into attn_hc.* / ffn_hc.* by the state-dict adapter.
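A hedged sketch of the kind of key remapping this implies (the real adapter lives elsewhere in the package and may differ; only the flat-to-nested key names follow the docstring):

```python
def remap_hc_keys(state_dict: dict) -> dict:
    """Route flat hc_attn_* / hc_ffn_* checkpoint keys into the
    attn_hc.* / ffn_hc.* submodules (illustrative, not the real adapter)."""
    out = {}
    for key, value in state_dict.items():
        for flat, nested in (("hc_attn_", "attn_hc."), ("hc_ffn_", "ffn_hc.")):
            if flat in key:
                key = key.replace(flat, nested)  # e.g. "layers.0.hc_attn_fn" -> "layers.0.attn_hc.fn"
                break
        out[key] = value
    return out
```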

Initialization

forward(
x: torch.Tensor,
position_embeddings: tuple[torch.Tensor, torch.Tensor],
position_embeddings_compress: tuple[torch.Tensor, torch.Tensor] | None = None,
rotary_compress: torch.nn.Module | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
input_ids: torch.Tensor | None = None,
**attn_kwargs: Any,
) torch.Tensor#
init_weights(buffer_device: torch.device) None#
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
)#

Bases: torch.nn.Module

Hash gate for the first num_hash_layers layers: routes tokens via a fixed lookup table.

Instead of computing routing scores, the gate uses tid2eid[token_id] to pre-assign expert indices. The routing weight is still computed from the gate weight but the selection is deterministic per token id.

tid2eid shape: [vocab_size, n_activated_experts] (int32, non-trainable)

Signature matches components.moe.layers.Gate — forward(x, token_mask, cp_mesh) returning (weights, indices, aux_loss) — so the generic MoE module can call it interchangeably. The per-forward input_ids needed for the tid2eid lookup is stashed on the module by the enclosing DeepseekV4Block via set_input_ids() immediately before the MoE call.
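A minimal sketch of the lookup-based routing described above (shapes follow the docstring; the gate weight and the softmax normalization of the gathered scores are assumptions, not necessarily what the real gate does):

```python
import torch

vocab_size, n_activated, dim, n_experts = 1024, 8, 64, 64
tid2eid = torch.randint(0, n_experts, (vocab_size, n_activated), dtype=torch.int32)  # fixed, non-trainable table
gate_weight = torch.randn(n_experts, dim)

input_ids = torch.randint(0, vocab_size, (4, 16))    # stashed via set_input_ids before the MoE call
x = torch.randn(4 * 16, dim)                         # flattened token hidden states

indices = tid2eid[input_ids.reshape(-1)].long()      # [T, n_activated]: experts pre-assigned per token id
scores = x @ gate_weight.t()                         # [T, n_experts]: weights still come from the gate weight
weights = scores.gather(1, indices).softmax(dim=-1)  # [T, n_activated]; selection is fixed per token id
```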

Initialization

set_input_ids(input_ids: torch.Tensor | None) None#

Stash the current batch’s input_ids for the next forward call.

update_bias() None#

No-op; kept for compatibility with callers that walk MoE gates and call update_bias.

init_weights(buffer_device: torch.device | None = None) None#
forward(
x: torch.Tensor,
token_mask: torch.Tensor | None = None,
cp_mesh: DeviceMesh | None = None,
) tuple[torch.Tensor, torch.Tensor, None]#
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
backend: nemo_automodel.components.models.common.BackendConfig,
*,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
moe_overrides: dict | None = None,
)#

Bases: torch.nn.Module

Initialization

forward(
input_ids: torch.Tensor | None = None,
*,
inputs_embeds: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
**attn_kwargs: Any,
) torch.Tensor#
update_moe_gate_bias() None#
init_weights(buffer_device: torch.device | None = None) None#
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

_keep_in_fp32_modules_strict#

[‘attn_hc.fn’, ‘attn_hc.base’, ‘attn_hc.scale’, ‘ffn_hc.fn’, ‘ffn_hc.base’, ‘ffn_hc.scale’, ‘hc_head…

classmethod from_config(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#
classmethod from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs,
)#
get_input_embeddings()#
set_input_embeddings(value)#
get_output_embeddings()#
set_output_embeddings(new_embeddings)#
forward(
input_ids: torch.Tensor,
*,
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
**attn_kwargs: Any,
) torch.Tensor#
update_moe_gate_bias() None#
initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16,
) None#
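A hedged end-to-end usage sketch; that DeepseekV4Config is default-constructible and exposes a vocab_size field are assumptions — consult the config module for the actual fields:

```python
import torch
from nemo_automodel.components.models.deepseek_v4.config import DeepseekV4Config
from nemo_automodel.components.models.deepseek_v4.model import DeepseekV4ForCausalLM

config = DeepseekV4Config()                 # assumed default-constructible
model = DeepseekV4ForCausalLM.from_config(config)
model.initialize_weights(dtype=torch.bfloat16)

input_ids = torch.randint(0, config.vocab_size, (1, 32))
logits = model(input_ids)                   # [B, S, vocab_size]
```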
nemo_automodel.components.models.deepseek_v4.model.ModelClass#

None