nemo_automodel.components.models.deepseek_v4.model

DeepSeek V4 Model.

Key architectural points (from official inference/model.py):

HC (Hyper-Connections): Every transformer block maintains hc_mult=4 copies of the hidden state. The embedding output is expanded: [B,S,dim] -> [B,S,hc_mult,dim]. hc_pre reduces [B,S,hc_mult,dim] -> [B,S,dim] before attn/ffn. hc_post expands [B,S,dim] -> [B,S,hc_mult,dim] after attn/ffn. Full HC requires the hc_split_sinkhorn CUDA kernel. Current fallback: mean-pooling for hc_pre, broadcast add for hc_post.
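The shape flow of the fallback path described above can be sketched as follows. This is a minimal illustration, not the module's code: the helper names (hc_expand, hc_pre_fallback, hc_post_fallback) are hypothetical, and expanding the embedding by repeating it hc_mult times is an assumption, since the text only states the resulting shape.

```python
import torch


def hc_expand(h: torch.Tensor, hc_mult: int) -> torch.Tensor:
    """[B, S, dim] -> [B, S, hc_mult, dim]; repeating the input is an assumption."""
    return h.unsqueeze(2).expand(-1, -1, hc_mult, -1).contiguous()


def hc_pre_fallback(h_hc: torch.Tensor) -> torch.Tensor:
    """[B, S, hc_mult, dim] -> [B, S, dim] by mean-pooling over the hc_mult copies."""
    return h_hc.mean(dim=2)


def hc_post_fallback(h_hc: torch.Tensor, out: torch.Tensor) -> torch.Tensor:
    """Broadcast-add the sublayer output [B, S, dim] onto every copy."""
    return h_hc + out.unsqueeze(2)


B, S, dim, hc_mult = 2, 5, 8, 4          # toy sizes, chosen for illustration
emb = torch.randn(B, S, dim)             # stand-in for the embedding output
h_hc = hc_expand(emb, hc_mult)           # [B, S, hc_mult, dim]
x = hc_pre_fallback(h_hc)                # [B, S, dim], fed to attn/ffn
sub_out = torch.zeros_like(x)            # stand-in for the attn/ffn output
h_hc = hc_post_fallback(h_hc, sub_out)   # back to [B, S, hc_mult, dim]
```

Because the fallback treats all copies identically, it preserves the tensor contract between blocks but does not reproduce the learned mixing of full HC.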

HC parameters (ALL layers, stored in float32), where mix_hc = (2+hc_mult)*hc_mult = 24:

hc_attn_fn : [mix_hc, hc_mult*dim]
hc_attn_base : [mix_hc]
hc_attn_scale : [3]
hc_ffn_fn : [mix_hc, hc_mult*dim]
hc_ffn_base : [mix_hc]
hc_ffn_scale : [3]
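As a quick check of the arithmetic, the per-layer HC parameter shapes can be derived from hc_mult and dim. The helper name hc_param_shapes and the value dim=16 are placeholders for illustration only; the real dim comes from the model config.

```python
def hc_param_shapes(hc_mult: int, dim: int) -> dict[str, tuple[int, ...]]:
    """Return the documented HC parameter shapes for one transformer block."""
    mix_hc = (2 + hc_mult) * hc_mult  # 24 when hc_mult == 4
    shapes: dict[str, tuple[int, ...]] = {}
    for site in ("attn", "ffn"):  # one parameter set per site
        shapes[f"hc_{site}_fn"] = (mix_hc, hc_mult * dim)
        shapes[f"hc_{site}_base"] = (mix_hc,)
        shapes[f"hc_{site}_scale"] = (3,)
    return shapes


shapes = hc_param_shapes(hc_mult=4, dim=16)  # dim=16 is a toy value
for name, shape in shapes.items():
    print(f"{name:14s} {shape}")
```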

Gate hash layers (layer_idx < num_hash_layers): Instead of score-based routing, the gate uses a fixed token-id -> expert-id lookup table (tid2eid: [vocab_size, n_activated_experts]).

All layers use MoE FFN (no dense layers). Compress-ratio sliding-window attention is not yet implemented.

Module Contents

Classes

Name	Description
`DeepseekV4Block`	Single transformer block for DeepSeek V4.
`DeepseekV4CausalLMOutput`	Output of DeepseekV4ForCausalLM.forward.
`DeepseekV4ForCausalLM`	-
`DeepseekV4HashGate`	Hash gate for first num_hash_layers: routes tokens via a fixed lookup table.
`DeepseekV4Model`	-

Data

ModelClass

API

class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block(
    layer_idx: int,
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Single transformer block for DeepSeek V4.

Uses the HyperConnection decoder-layer pattern from HuggingFace transformers PR 45616: two DeepseekV4HyperConnection modules own the collapse / expand mixer weights at the attention and FFN sites, respectively. The checkpoint’s flat hc_attn_* / hc_ffn_* keys are routed into attn_hc.* / ffn_hc.* by the state-dict adapter.

attn_hc

= DeepseekV4HyperConnection(**hc_kwargs)

ffn_hc

= DeepseekV4HyperConnection(**hc_kwargs)

hc_mult

= config.hc_mult

input_layernorm

is_hash_routing_layer

mlp

= MoE(moe_config, backend)

post_attention_layernorm

self_attn

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block.forward(
    x: torch.Tensor,
    position_embeddings: tuple[torch.Tensor, torch.Tensor],
    position_ids: torch.Tensor | None = None,
    position_embeddings_compress: tuple[torch.Tensor, torch.Tensor] | None = None,
    rotary_compress: torch.nn.Module | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    input_ids: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block.init_weights(
    buffer_device: torch.device
) -> None

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block.set_activation_checkpointing(
    enabled: bool = True
) -> None

Enable block-local checkpointing that avoids replaying MoE dispatch.

class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4CausalLMOutput(
    mtp_per_depth_h: typing.Optional[list[torch.Tensor]] = None,
    mtp_loss_scaling_factor: typing.Optional[float] = None
)

Dataclass

Bases: CausalLMOutputWithPast

Output of DeepseekV4ForCausalLM.forward.

Subclasses transformers.modeling_outputs.CausalLMOutputWithPast so that the standard logits / hidden_states fields are present: the recipe’s fused cross-entropy path requires "hidden_states" in out and reads the final hidden states off the output. The DSV4-specific MTP fields are carried as declared dataclass fields. As required by ModelOutput, every field after the first declares a None default.

mtp_loss_scaling_factor

Optional[float] = None

mtp_per_depth_h

Optional[list[Tensor]] = None

class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)

Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin

_keep_in_fp32_modules_strict

backend

= backend or BackendConfig()

lm_head

model

mtp

mtp_config

state_dict_adapter

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM._build_mtp_embed_inputs_for_pp(
    input_ids: torch.Tensor
) -> tuple[torch.Tensor, ...]

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM._is_pipeline_parallel_stage() -> bool

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.customize_pipeline_stage_modules(
    module_names_per_stage: list[list[str]],
    layers_prefix: str,
    text_model: torch.nn.Module | None = None
) -> list[list[str]]

Keep DSV4 non-layer PP dependencies with the stages that need them.

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.forward(
    input_ids: torch.Tensor,
    mtp_embed_inputs: torch.Tensor = (),
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    output_hidden_states: typing.Optional[bool] = None,
    attn_kwargs: typing.Any = {}
) -> 'DeepseekV4CausalLMOutput' | tuple[torch.Tensor, ...] | torch.Tensor

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.from_config(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)

classmethod

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.from_pretrained(
    pretrained_model_name_or_path: str,
    model_args = (),
    kwargs = {}
)

classmethod

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.get_input_embeddings()

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.get_output_embeddings()

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.get_pipeline_stage_metas(
    is_first: bool,
    microbatch_size: int,
    seq_len: int,
    dtype: torch.dtype
) -> tuple[tuple[torch.Tensor, ...], tuple[torch.Tensor, ...]]

Return PP input/output meta tensors for DSV4’s HC and MTP contract.

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.prepare_model_inputs_for_cp(
    input_ids: torch.Tensor,
    kwargs: typing.Any = {}
) -> dict[str, typing.Any]

Model-owned context-parallel batch prep (Miles-style contiguous shard).

Returns the keys that cp_utils.make_cp_batch_and_ctx needs to delegate CP sharding back to this model: a _cp_make_batch_fn callable (with the config-derived per-rank shard multiple bound), plus a flag asking the recipe to keep the full logits in the autograd graph so that every CP rank’s backward pass reaches all parameters even when its local loss is fully masked. DSV4 embeds internally, so (unlike VLM models) this does not pre-embed; it leaves input_ids for the sharding callable.
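The idea of a contiguous shard with a per-rank shard multiple can be sketched in plain Python. This is an assumption-laden illustration, not the signature or behavior of _cp_make_batch_fn: the function name contiguous_cp_shard, the right-padding with pad_id, and the rule that the padded length is a multiple of cp_size * shard_multiple are all hypothetical.

```python
def contiguous_cp_shard(
    input_ids: list[int],
    cp_size: int,
    cp_rank: int,
    shard_multiple: int = 1,
    pad_id: int = 0,
) -> list[int]:
    """Return the contiguous slice of a (padded) sequence owned by one CP rank."""
    unit = cp_size * shard_multiple
    # Round the length up so each rank's shard is a multiple of shard_multiple.
    padded_len = -(-len(input_ids) // unit) * unit
    padded = input_ids + [pad_id] * (padded_len - len(input_ids))
    per_rank = padded_len // cp_size
    start = cp_rank * per_rank
    return padded[start:start + per_rank]


tokens = list(range(1, 11))  # 10 toy token ids
shards = [
    contiguous_cp_shard(tokens, cp_size=2, cp_rank=r, shard_multiple=4)
    for r in range(2)
]
print(shards)
```

In this toy run the last rank holds mostly padding, which is the kind of situation where a rank’s local loss can be fully masked.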

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.set_input_embeddings(
    value
)

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.set_output_embeddings(
    new_embeddings
)

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.update_moe_gate_bias() -> None

class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig
)

Bases: Module

Hash gate for first num_hash_layers: routes tokens via a fixed lookup table.

Instead of computing routing scores, the gate uses tid2eid[token_id] to pre-assign expert indices. The routing weight is still computed from the gate weight but the selection is deterministic per token id.

tid2eid shape: [vocab_size, n_activated_experts] (int64 runtime, non-trainable)

Signature matches components.moe.layers.Gate: forward(x, token_mask, cp_mesh) returning (weights, indices, aux_loss), so the generic MoE module can call it interchangeably. The per-forward input_ids needed for the tid2eid lookup are stashed on the module by the enclosing Block via set_input_ids immediately before the MoE call.
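The lookup-based routing described above can be sketched as a standalone function. This is a simplified illustration under stated assumptions, not the class’s forward: hash_gate_sketch is a hypothetical name, the softmax score function and the top-k renormalization are assumed choices (the real gate reads score_func, norm_topk_prob, and route_scale from moe_config), and token_mask / cp_mesh handling is omitted.

```python
import torch


def hash_gate_sketch(
    x: torch.Tensor,          # [T, dim] flattened token activations
    input_ids: torch.Tensor,  # [T] token ids for the same tokens
    weight: torch.Tensor,     # [n_experts, dim] gate weight
    tid2eid: torch.Tensor,    # [vocab_size, topk] int64 lookup table
    route_scale: float = 1.0,
):
    # Expert selection is a pure table lookup: same token id, same experts.
    indices = tid2eid[input_ids]                          # [T, topk]
    # Routing weights still come from the gate weight (softmax is assumed here).
    scores = torch.softmax(x.float() @ weight.float().t(), dim=-1)
    weights = scores.gather(-1, indices)                  # [T, topk]
    weights = weights / weights.sum(-1, keepdim=True)     # assumed normalization
    return weights * route_scale, indices, None           # aux_loss is None


torch.manual_seed(0)
vocab_size, n_experts, topk, dim = 10, 6, 2, 4            # toy sizes
tid2eid = torch.stack([torch.randperm(n_experts)[:topk] for _ in range(vocab_size)])
weight = torch.randn(n_experts, dim)
input_ids = torch.tensor([3, 7, 3])                       # token 3 appears twice
x = torch.randn(input_ids.numel(), dim)
weights, indices, aux_loss = hash_gate_sketch(x, input_ids, weight, tid2eid)
```

The two occurrences of token id 3 select the same experts, while their routing weights may differ because they depend on the activations x.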

_pending_input_ids

Tensor | None = None

n_experts

= moe_config.n_routed_experts

norm_topk_prob

= moe_config.norm_topk_prob

route_scale

= moe_config.route_scale

score_func

= moe_config.score_func

topk

= moe_config.n_activated_experts

weight

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.forward(
    x: torch.Tensor,
    token_mask: torch.Tensor | None = None,
    cp_mesh: 'DeviceMesh | None' = None
) -> tuple[torch.Tensor, torch.Tensor, None]

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.init_weights(
    buffer_device: torch.device | None = None
) -> None

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.set_input_ids(
    input_ids: torch.Tensor | None
) -> None

Stash the current batch’s input_ids for the next forward call.

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.update_bias() -> None

No-op, kept for compatibility with callers that walk MoE gates and call update_bias.

class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    moe_overrides: dict | None = None
)

Bases: Module

embed_tokens

hc_head

layers

= nn.ModuleDict()

max_seq_len

= config.max_position_embeddings

moe_config

= moe_config or MoEConfig(**moe_defaults)

norm

rotary_emb

rotary_emb_compress

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model.forward(
    input_ids: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    return_hc_hidden: bool = False,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model.init_weights(
    buffer_device: torch.device | None = None
) -> None

nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model.update_moe_gate_bias() -> None

nemo_automodel.components.models.deepseek_v4.model.ModelClass = DeepseekV4ForCausalLM