nemo_automodel.components.models.deepseek_v4.model

DeepSeek V4 Model.

Key architectural points (from official inference/model.py):

HC (Hyper-Connections):
  Every transformer block maintains hc_mult=4 copies of the hidden state.
  The embedding output is expanded: [B,S,dim] -> [B,S,hc_mult,dim].
  hc_pre reduces [B,S,hc_mult,dim] -> [B,S,dim] before attn/ffn.
  hc_post expands [B,S,dim] -> [B,S,hc_mult,dim] after attn/ffn.
  Full HC requires the hc_split_sinkhorn CUDA kernel.
  Current fallback: mean-pooling for hc_pre, broadcast add for hc_post.
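The fallback path described above can be illustrated for a single token with plain Python lists. This is a minimal sketch, not the module's code: the function names are hypothetical, expansion by repetition is an assumption, and "broadcast add" is read here as adding the sublayer output to every copy of the state.

```python
def hc_expand(h, hc_mult=4):
    """Expand one token's embedding [dim] -> [hc_mult, dim] (assumed: by repetition)."""
    return [list(h) for _ in range(hc_mult)]


def hc_pre_mean(state):
    """Fallback hc_pre: mean-pool the copies, [hc_mult, dim] -> [dim]."""
    hc_mult = len(state)
    return [sum(column) / hc_mult for column in zip(*state)]


def hc_post_broadcast_add(state, out):
    """Fallback hc_post: add the sublayer output [dim] to each of the hc_mult copies."""
    return [[s + o for s, o in zip(copy, out)] for copy in state]


state = hc_expand([1.0, 2.0])            # 4 copies of a dim=2 vector
pooled = hc_pre_mean(state)              # what attn/ffn would receive
sublayer_out = [0.5, 0.5]                # stand-in for the attn/ffn output
new_state = hc_post_broadcast_add(state, sublayer_out)
```

The full HC path replaces both helpers with learned mixing, which this sketch does not attempt to reproduce.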

HC parameters (ALL layers, stored in float32):
  hc_attn_fn    : [mix_hc, hc_mult*dim]  where mix_hc = (2+hc_mult)*hc_mult = 24
  hc_attn_base  : [mix_hc]
  hc_attn_scale : [3]
  hc_ffn_fn     : [mix_hc, hc_mult*dim]
  hc_ffn_base   : [mix_hc]
  hc_ffn_scale  : [3]
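The HC parameter shapes follow directly from hc_mult and dim. The helper below is a hypothetical illustration of that arithmetic (it is not part of the module), and dim=8 is an arbitrary toy value.

```python
def hc_param_shapes(hc_mult: int, dim: int) -> dict:
    """Return the per-layer HC parameter shapes keyed by checkpoint-style name."""
    mix_hc = (2 + hc_mult) * hc_mult
    shapes = {}
    for site in ("attn", "ffn"):
        shapes[f"hc_{site}_fn"] = (mix_hc, hc_mult * dim)
        shapes[f"hc_{site}_base"] = (mix_hc,)
        shapes[f"hc_{site}_scale"] = (3,)
    return shapes


shapes = hc_param_shapes(hc_mult=4, dim=8)
```

With hc_mult=4 this gives mix_hc = (2 + 4) * 4 = 24, matching the value quoted above.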

Gate hash layers (layer_idx < num_hash_layers): Instead of score-based routing, the gate uses a fixed token-id -> expert-id lookup table (tid2eid: [vocab_size, n_activated_experts]).
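As a rough illustration of the lookup-based routing, the sketch below uses nested lists in place of tensors. The helper names and the toy table are hypothetical; only the layer_idx < num_hash_layers condition and the tid2eid shape come from the description above.

```python
def uses_hash_routing(layer_idx: int, num_hash_layers: int) -> bool:
    """Hash routing applies only to the first num_hash_layers layers."""
    return layer_idx < num_hash_layers


def hash_route(input_ids, tid2eid):
    """Return the pre-assigned expert ids for each token id."""
    return [tid2eid[token_id] for token_id in input_ids]


# Toy table: vocab_size=4, n_activated_experts=2.
tid2eid = [[0, 1], [2, 3], [1, 2], [0, 3]]
routes = hash_route([3, 1, 3], tid2eid)
```

Because the table is fixed, the same token id always lands on the same experts regardless of its hidden state.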

All layers use MoE FFN (no dense layers). Compress-ratio sliding-window attention is not yet implemented.

Module Contents

Classes

| Name | Description |
| --- | --- |
| DeepseekV4Block | Single transformer block for DeepSeek V4. |
| DeepseekV4CausalLMOutput | Output of DeepseekV4ForCausalLM.forward. |
| DeepseekV4ForCausalLM | - |
| DeepseekV4HashGate | Hash gate for first num_hash_layers: routes tokens via a fixed lookup table. |
| DeepseekV4Model | - |

Data

ModelClass

API

class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block(
layer_idx: int,
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Single transformer block for DeepSeek V4.

Uses the HyperConnection decoder-layer pattern from HuggingFace transformers PR 45616: two DeepseekV4HyperConnection modules own the collapse / expand mixer weights, one at the attention site and one at the FFN site. The checkpoint's flat hc_attn_* / hc_ffn_* keys are routed into attn_hc.* / ffn_hc.* by the state-dict adapter.

attn_hc = DeepseekV4HyperConnection(**hc_kwargs)
ffn_hc = DeepseekV4HyperConnection(**hc_kwargs)
hc_mult = config.hc_mult
input_layernorm
is_hash_routing_layer
mlp = MoE(moe_config, backend)
post_attention_layernorm
self_attn
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block.forward(
x: torch.Tensor,
position_embeddings: tuple[torch.Tensor, torch.Tensor],
position_ids: torch.Tensor | None = None,
position_embeddings_compress: tuple[torch.Tensor, torch.Tensor] | None = None,
rotary_compress: torch.nn.Module | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
input_ids: torch.Tensor | None = None,
attn_kwargs: typing.Any = {}
) -> torch.Tensor
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block.init_weights(
buffer_device: torch.device
) -> None
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block.set_activation_checkpointing(
enabled: bool = True
) -> None

Enable block-local checkpointing that avoids replaying MoE dispatch.

class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4CausalLMOutput(
mtp_per_depth_h: typing.Optional[list[torch.Tensor]] = None,
mtp_loss_scaling_factor: typing.Optional[float] = None
)
Dataclass

Bases: CausalLMOutputWithPast

Output of DeepseekV4ForCausalLM.forward.

Subclasses transformers.modeling_outputs.CausalLMOutputWithPast so the standard logits / hidden_states fields are present (the recipe’s fused cross-entropy path requires "hidden_states" in out and reads the final hidden states off the output) while the DSV4-specific MTP fields are carried as declared dataclass fields. As required by ModelOutput, every field after the first declares a None default.
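The field layout can be sketched with standard-library dataclasses. Both classes below are hypothetical stand-ins (plain lists replace tensors, and the real base class has more fields); they only illustrate that the MTP fields are appended after the inherited ones and that each carries a None default. The dict-style access that ModelOutput provides is not reproduced here.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BaseOutputSketch:
    # Hypothetical stand-in for CausalLMOutputWithPast.
    loss: Optional[float] = None
    logits: Optional[list] = None
    hidden_states: Optional[list] = None


@dataclass
class V4OutputSketch(BaseOutputSketch):
    # DSV4-specific MTP fields, each defaulting to None.
    mtp_per_depth_h: Optional[list] = None
    mtp_loss_scaling_factor: Optional[float] = None


out = V4OutputSketch(logits=[0.1, 0.9], hidden_states=[[1.0, 2.0]])
field_names = [f.name for f in fields(out)]
```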

mtp_loss_scaling_factor: Optional[float] = None
mtp_per_depth_h: Optional[list[Tensor]] = None
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)

Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin

_keep_in_fp32_modules_strict
backend = backend or BackendConfig()
lm_head
model
mtp
mtp_config
state_dict_adapter
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM._build_mtp_embed_inputs_for_pp(
input_ids: torch.Tensor
) -> tuple[torch.Tensor, ...]
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM._is_pipeline_parallel_stage() -> bool
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.customize_pipeline_stage_modules(
module_names_per_stage: list[list[str]],
layers_prefix: str,
text_model: torch.nn.Module | None = None
) -> list[list[str]]

Keep DSV4 non-layer PP dependencies with the stages that need them.

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.forward(
input_ids: torch.Tensor,
mtp_embed_inputs: torch.Tensor = (),
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
logits_to_keep: typing.Union[int, torch.Tensor] = 0,
output_hidden_states: typing.Optional[bool] = None,
attn_kwargs: typing.Any = {}
) -> 'DeepseekV4CausalLMOutput' | tuple[torch.Tensor, ...] | torch.Tensor
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.from_config(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)
classmethod
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.from_pretrained(
pretrained_model_name_or_path: str,
model_args = (),
kwargs = {}
)
classmethod
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.get_input_embeddings()
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.get_output_embeddings()
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.get_pipeline_stage_metas(
is_first: bool,
microbatch_size: int,
seq_len: int,
dtype: torch.dtype
) -> tuple[tuple[torch.Tensor, ...], tuple[torch.Tensor, ...]]

Return PP input/output meta tensors for DSV4’s HC and MTP contract.

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16
) -> None
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.prepare_model_inputs_for_cp(
input_ids: torch.Tensor,
kwargs: typing.Any = {}
) -> dict[str, typing.Any]

Model-owned context-parallel batch prep (Miles-style contiguous shard).

Returns the keys cp_utils.make_cp_batch_and_ctx needs to delegate CP sharding back to this model: a _cp_make_batch_fn callable (with the config-derived per-rank shard multiple bound) plus a flag asking the recipe to keep the full logits in the autograd graph so every CP rank’s backward reaches all parameters even when its local loss is fully masked. DSV4 embeds internally, so (unlike VLM models) this does not pre-embed — it leaves input_ids for the sharding callable.
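To make "contiguous shard" concrete, the sketch below splits one sequence into equal contiguous chunks, one per CP rank, after padding to a per-rank shard multiple. It is an assumption-laden illustration: the function name, the padding behavior, and the pad_id are hypothetical, and the actual _cp_make_batch_fn callable is not shown in this page.

```python
def contiguous_cp_shard(input_ids, cp_size, cp_rank, shard_multiple=1, pad_id=0):
    """Pad so each rank's chunk is a multiple of shard_multiple, then slice this rank's chunk."""
    unit = cp_size * shard_multiple
    padded_len = -(-len(input_ids) // unit) * unit  # round up to a multiple of unit
    padded = list(input_ids) + [pad_id] * (padded_len - len(input_ids))
    chunk = padded_len // cp_size
    return padded[cp_rank * chunk:(cp_rank + 1) * chunk]


ids = list(range(10))
shard0 = contiguous_cp_shard(ids, cp_size=2, cp_rank=0, shard_multiple=4)
shard1 = contiguous_cp_shard(ids, cp_size=2, cp_rank=1, shard_multiple=4)
```

In this toy case the last rank holds mostly padding, which is the situation where a rank's local loss can be fully masked.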

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.set_input_embeddings(
value
)
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.set_output_embeddings(
new_embeddings
)
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.update_moe_gate_bias() -> None
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
moe_config: nemo_automodel.components.moe.config.MoEConfig
)

Bases: Module

Hash gate for first num_hash_layers: routes tokens via a fixed lookup table.

Instead of computing routing scores, the gate uses tid2eid[token_id] to pre-assign expert indices. The routing weight is still computed from the gate weight but the selection is deterministic per token id.

tid2eid shape: [vocab_size, n_activated_experts] (int64 runtime, non-trainable)

Signature matches components.moe.layers.Gate.forward(x, token_mask, cp_mesh), returning (weights, indices, aux_loss), so the generic MoE module can call it interchangeably. The per-forward input_ids needed for the tid2eid lookup are stashed on the module by the enclosing Block via set_input_ids immediately before the MoE call.
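The stash-then-forward contract can be sketched with a small stand-in class. This is not the module's implementation: it assumes a softmax score function (the real gate reads score_func from moe_config), omits norm_topk_prob and route_scale, and assumes the stashed ids are cleared after use.

```python
import math


class HashGateSketch:
    """Hypothetical list-based stand-in for a lookup-table gate."""

    def __init__(self, weight, tid2eid):
        self.weight = weight            # [n_experts, dim]
        self.tid2eid = tid2eid          # [vocab_size, topk]
        self._pending_input_ids = None

    def set_input_ids(self, input_ids):
        self._pending_input_ids = input_ids

    def forward(self, x, token_mask=None, cp_mesh=None):
        if self._pending_input_ids is None:
            raise RuntimeError("set_input_ids must be called before forward")
        weights, indices = [], []
        for token_vec, token_id in zip(x, self._pending_input_ids):
            logits = [sum(w * v for w, v in zip(row, token_vec)) for row in self.weight]
            peak = max(logits)
            exps = [math.exp(value - peak) for value in logits]
            total = sum(exps)
            chosen = self.tid2eid[token_id]                   # selection: table lookup
            indices.append(list(chosen))
            weights.append([exps[e] / total for e in chosen])  # weight: from gate scores
        self._pending_input_ids = None  # assumption: the stash is consumed
        return weights, indices, None   # no aux loss


gate = HashGateSketch(weight=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
                      tid2eid=[[0, 1], [1, 2]])
gate.set_input_ids([1, 0])
weights, indices, aux_loss = gate.forward([[1.0, 0.0], [0.0, 0.0]])
```

Keeping input_ids out of the forward signature is what lets the generic MoE module call this gate exactly as it would a score-based one.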

_pending_input_ids: Tensor | None = None
n_experts = moe_config.n_routed_experts
norm_topk_prob = moe_config.norm_topk_prob
route_scale = moe_config.route_scale
score_func = moe_config.score_func
topk = moe_config.n_activated_experts
weight
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.forward(
x: torch.Tensor,
token_mask: torch.Tensor | None = None,
cp_mesh: 'DeviceMesh | None' = None
) -> tuple[torch.Tensor, torch.Tensor, None]
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.init_weights(
buffer_device: torch.device | None = None
) -> None
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.set_input_ids(
input_ids: torch.Tensor | None
) -> None

Stash the current batch’s input_ids for the next forward call.

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.update_bias() -> None

No-op for compat with callers that walk MoE gates and call update_bias.

class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model(
config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
backend: nemo_automodel.components.models.common.BackendConfig,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
moe_overrides: dict | None = None
)

Bases: Module

embed_tokens
hc_head
layers = nn.ModuleDict()
max_seq_len = config.max_position_embeddings
moe_config = moe_config or MoEConfig(**moe_defaults)
norm
rotary_emb
rotary_emb_compress
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model.forward(
input_ids: torch.Tensor | None = None,
inputs_embeds: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
return_hc_hidden: bool = False,
attn_kwargs: typing.Any = {}
) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model.init_weights(
buffer_device: torch.device | None = None
) -> None
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model.update_moe_gate_bias() -> None
nemo_automodel.components.models.deepseek_v4.model.ModelClass = DeepseekV4ForCausalLM