nemo_automodel.components.models.deepseek_v4.model#

DeepSeek V4 Model.

Key architectural points (from the official inference/model.py):

HC (Hyper-Connections): every transformer block maintains hc_mult=4 copies of the hidden state.

- The embedding output is expanded: [B,S,dim] -> [B,S,hc_mult,dim].
- hc_pre reduces [B,S,hc_mult,dim] -> [B,S,dim] before attn/ffn; hc_post expands [B,S,dim] -> [B,S,hc_mult,dim] after attn/ffn.
- Full HC requires the hc_split_sinkhorn CUDA kernel; the current fallback uses mean-pooling for hc_pre and a broadcast add for hc_post.
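A minimal sketch of the fallback path in plain PyTorch (the sublayer stand-in and all sizes are illustrative, not the real module):

```python
import torch
import torch.nn as nn

# Illustrative fallback only: full HC needs the hc_split_sinkhorn CUDA kernel.
B, S, hc_mult, dim = 2, 16, 4, 128
sublayer = nn.Linear(dim, dim)          # stand-in for the attn/ffn sublayer

h = torch.randn(B, S, hc_mult, dim)     # expanded hidden state
x = h.mean(dim=2)                       # hc_pre fallback: [B,S,hc_mult,dim] -> [B,S,dim]
y = sublayer(x)                         # sublayer output, [B,S,dim]
h = h + y.unsqueeze(2)                  # hc_post fallback: broadcast add into all hc_mult copies
```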

HC parameters (ALL layers, stored in float32):

- hc_attn_fn : [mix_hc, hc_mult*dim], where mix_hc = (2 + hc_mult) * hc_mult = 24
- hc_attn_base : [mix_hc]
- hc_attn_scale : [3]
- hc_ffn_fn : [mix_hc, hc_mult*dim]
- hc_ffn_base : [mix_hc]
- hc_ffn_scale : [3]
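For concreteness, a sketch of how these shapes fit together; hc_mult = 4 per the above, while the dim value and the registration style are illustrative assumptions:

```python
import torch
import torch.nn as nn

hc_mult, dim = 4, 7168                   # dim is illustrative
mix_hc = (2 + hc_mult) * hc_mult         # = 24

hc_attn_fn    = nn.Parameter(torch.zeros(mix_hc, hc_mult * dim, dtype=torch.float32))
hc_attn_base  = nn.Parameter(torch.zeros(mix_hc, dtype=torch.float32))
hc_attn_scale = nn.Parameter(torch.zeros(3, dtype=torch.float32))
# hc_ffn_fn / hc_ffn_base / hc_ffn_scale mirror the same shapes at the FFN site.
```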

Gate hash layers (layer_idx < num_hash_layers): Instead of score-based routing, the gate uses a fixed token-id -> expert-id lookup table (tid2eid: [vocab_size, n_activated_experts]).

All layers use MoE FFN (no dense layers). Compress-ratio sliding-window attention is not yet implemented.

Module Contents#

Classes#

DeepseekV4Block

Single transformer block for DeepSeek V4.

DeepseekV4HashGate

Hash gate for the first num_hash_layers layers: routes tokens via a fixed lookup table.

DeepseekV4Model

DeepseekV4ForCausalLM

Data#

API#

class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block(
layer_idx: int,
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
)#

Bases: torch.nn.Module

Single transformer block for DeepSeek V4.

Uses HuggingFace transformers PR 45616’s HyperConnection decoder-layer pattern: two DeepseekV4HyperConnection modules own the collapse / expand mixer weights at the attention and FFN sites respectively. The checkpoint’s flat hc_attn_* / hc_ffn_* keys are routed into attn_hc.* / ffn_hc.* by the state-dict adapter.
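A hedged sketch of the kind of key remapping this implies (the real adapter lives elsewhere in the package and may differ; only the flat-to-nested key names follow the docstring):

```python
def remap_hc_keys(state_dict: dict) -> dict:
    """Route flat hc_attn_* / hc_ffn_* checkpoint keys into the
    attn_hc.* / ffn_hc.* submodules (illustrative, not the real adapter)."""
    out = {}
    for key, value in state_dict.items():
        for flat, nested in (("hc_attn_", "attn_hc."), ("hc_ffn_", "ffn_hc.")):
            if flat in key:
                key = key.replace(flat, nested)  # e.g. "layers.0.hc_attn_fn" -> "layers.0.attn_hc.fn"
                break
        out[key] = value
    return out
```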

Initialization

forward(
x: torch.Tensor,
position_embeddings: tuple[torch.Tensor, torch.Tensor],
position_embeddings_compress: tuple[torch.Tensor, torch.Tensor] | None = None,
rotary_compress: torch.nn.Module | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
input_ids: torch.Tensor | None = None,
**attn_kwargs: Any,
) torch.Tensor#
init_weights(buffer_device: torch.device) None#
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
)#

Bases: torch.nn.Module

Hash gate for the first num_hash_layers layers: routes tokens via a fixed lookup table.

Instead of computing routing scores, the gate uses tid2eid[token_id] to pre-assign expert indices. The routing weight is still computed from the gate weight but the selection is deterministic per token id.

tid2eid shape: [vocab_size, n_activated_experts] (int32, non-trainable)

Signature matches components.moe.layers.Gate — forward(x, token_mask, cp_mesh) returning (weights, indices, aux_loss) — so the generic MoE module can call it interchangeably. The per-forward input_ids needed for the tid2eid lookup is stashed on the module by the enclosing DeepseekV4Block via set_input_ids() immediately before the MoE call.
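A minimal sketch of the lookup-based routing described above (shapes follow the docstring; the gate weight and the softmax normalization of the gathered scores are assumptions, not necessarily what the real gate does):

```python
import torch

vocab_size, n_activated, dim, n_experts = 1024, 8, 64, 64
tid2eid = torch.randint(0, n_experts, (vocab_size, n_activated), dtype=torch.int32)  # fixed, non-trainable table
gate_weight = torch.randn(n_experts, dim)

input_ids = torch.randint(0, vocab_size, (4, 16))    # stashed via set_input_ids before the MoE call
x = torch.randn(4 * 16, dim)                         # flattened token hidden states

indices = tid2eid[input_ids.reshape(-1)].long()      # [T, n_activated]: experts pre-assigned per token id
scores = x @ gate_weight.t()                         # [T, n_experts]: weights still come from the gate weight
weights = scores.gather(1, indices).softmax(dim=-1)  # [T, n_activated]; selection is fixed per token id
```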

Initialization

set_input_ids(input_ids: torch.Tensor | None) None#

Stash the current batch’s input_ids for the next forward call.

update_bias() None#

No-op; kept for compatibility with callers that walk MoE gates and call update_bias.

init_weights(buffer_device: torch.device | None = None) None#
forward(
x: torch.Tensor,
token_mask: torch.Tensor | None = None,
cp_mesh: DeviceMesh | None = None,
) tuple[torch.Tensor, torch.Tensor, None]#
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
backend: nemo_automodel.components.models.common.BackendConfig,
*,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
moe_overrides: dict | None = None,
)#

Bases: torch.nn.Module

Initialization

forward(
input_ids: torch.Tensor | None = None,
*,
inputs_embeds: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
**attn_kwargs: Any,
) torch.Tensor#
update_moe_gate_bias() None#
init_weights(buffer_device: torch.device | None = None) None#
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

_keep_in_fp32_modules_strict#

[‘attn_hc.fn’, ‘attn_hc.base’, ‘attn_hc.scale’, ‘ffn_hc.fn’, ‘ffn_hc.base’, ‘ffn_hc.scale’, ‘hc_head…

classmethod from_config(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#
classmethod from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs,
)#
get_input_embeddings()#
set_input_embeddings(value)#
get_output_embeddings()#
set_output_embeddings(new_embeddings)#
forward(
input_ids: torch.Tensor,
*,
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
**attn_kwargs: Any,
) torch.Tensor#
update_moe_gate_bias() None#
initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16,
) None#
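A hedged end-to-end usage sketch; that DeepseekV4Config is default-constructible and exposes a vocab_size field are assumptions — consult the config module for the actual fields:

```python
import torch
from nemo_automodel.components.models.deepseek_v4.config import DeepseekV4Config
from nemo_automodel.components.models.deepseek_v4.model import DeepseekV4ForCausalLM

config = DeepseekV4Config()                 # assumed default-constructible
model = DeepseekV4ForCausalLM.from_config(config)
model.initialize_weights(dtype=torch.bfloat16)

input_ids = torch.randint(0, config.vocab_size, (1, 32))
logits = model(input_ids)                   # [B, S, vocab_size]
```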
nemo_automodel.components.models.deepseek_v4.model.ModelClass#

None