# nemo_automodel.components.models.deepseek_v4.model

DeepSeek V4 Model.

Key architectural points (from official inference/model.py):

HC (Hyper-Connections):

- Every transformer block maintains `hc_mult=4` copies of the hidden state.
- The embedding output is expanded: `[B,S,dim] -> [B,S,hc_mult,dim]`.
- `hc_pre` reduces `[B,S,hc_mult,dim] -> [B,S,dim]` before attn/ffn.
- `hc_post` expands `[B,S,dim] -> [B,S,hc_mult,dim]` after attn/ffn.
- Full HC requires the `hc_split_sinkhorn` CUDA kernel.
- Current fallback: mean-pooling for `hc_pre`, broadcast add for `hc_post`.
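
The fallback described above can be illustrated for a single token position with plain Python lists. This is only a sketch: it assumes that "broadcast add" means adding the same sublayer output to each of the `hc_mult` copies, and the function names `hc_pre_fallback` and `hc_post_fallback` are hypothetical, not part of the module.

```python
def hc_pre_fallback(copies):
    """Mean-pool hc_mult copies, shaped [hc_mult][dim], into one [dim] vector."""
    hc_mult = len(copies)
    dim = len(copies[0])
    return [sum(copy[d] for copy in copies) / hc_mult for d in range(dim)]


def hc_post_fallback(copies, sublayer_out):
    """Broadcast-add a [dim] sublayer output onto each of the hc_mult copies."""
    return [[c + o for c, o in zip(copy, sublayer_out)] for copy in copies]


# One token position with hc_mult=4 copies and an illustrative dim=2.
copies = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reduced = hc_pre_fallback(copies)                  # [hc_mult, dim] -> [dim]
sublayer_out = [0.5 * v for v in reduced]          # stand-in for attention or FFN
expanded = hc_post_fallback(copies, sublayer_out)  # [dim] -> [hc_mult, dim]
```

The number of copies is unchanged by the round trip, so the next block again sees `hc_mult` hidden-state copies per token.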

HC parameters (all layers, stored in float32), where `mix_hc = (2 + hc_mult) * hc_mult = 24`:

| Parameter       | Shape                     |
| --------------- | ------------------------- |
| `hc_attn_fn`    | `[mix_hc, hc_mult * dim]` |
| `hc_attn_base`  | `[mix_hc]`                |
| `hc_attn_scale` | `[3]`                     |
| `hc_ffn_fn`     | `[mix_hc, hc_mult * dim]` |
| `hc_ffn_base`   | `[mix_hc]`                |
| `hc_ffn_scale`  | `[3]`                     |
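
As a quick check of the shape arithmetic, the per-block HC parameter shapes can be computed directly from `hc_mult` and `dim`. The helper name `hc_param_shapes` and the value `dim=4096` are illustrative assumptions, not values taken from the model config.

```python
def hc_param_shapes(hc_mult, dim):
    """Return the HC parameter shapes for one block, keyed by parameter name."""
    mix_hc = (2 + hc_mult) * hc_mult
    shapes = {}
    for site in ("attn", "ffn"):
        shapes[f"hc_{site}_fn"] = (mix_hc, hc_mult * dim)
        shapes[f"hc_{site}_base"] = (mix_hc,)
        shapes[f"hc_{site}_scale"] = (3,)
    return shapes


# hc_mult=4 comes from the text above; dim=4096 is only an example value.
shapes = hc_param_shapes(hc_mult=4, dim=4096)
```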

Gate hash layers (`layer_idx < num_hash_layers`):
Instead of score-based routing, the gate uses a fixed token-id -> expert-id
lookup table (`tid2eid`: `[vocab_size, n_activated_experts]`).

All layers use MoE FFN (no dense layers).
Compress-ratio sliding-window attention is not yet implemented.

## Module Contents

### Classes

| Name                                                                                                       | Description                                                                    |
| ---------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ |
| [`DeepseekV4Block`](#nemo_automodel-components-models-deepseek_v4-model-DeepseekV4Block)                   | Single transformer block for DeepSeek V4.                                      |
| [`DeepseekV4CausalLMOutput`](#nemo_automodel-components-models-deepseek_v4-model-DeepseekV4CausalLMOutput) | Output of DeepseekV4ForCausalLM.forward.                                       |
| [`DeepseekV4ForCausalLM`](#nemo_automodel-components-models-deepseek_v4-model-DeepseekV4ForCausalLM)       | -                                                                              |
| [`DeepseekV4HashGate`](#nemo_automodel-components-models-deepseek_v4-model-DeepseekV4HashGate)             | Hash gate for first num\_hash\_layers: routes tokens via a fixed lookup table. |
| [`DeepseekV4Model`](#nemo_automodel-components-models-deepseek_v4-model-DeepseekV4Model)                   | -                                                                              |

### Data

[`ModelClass`](#nemo_automodel-components-models-deepseek_v4-model-ModelClass)

### API

```python
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block(
    layer_idx: int,
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** `Module`

Single transformer block for DeepSeek V4.

Follows the HyperConnection decoder-layer pattern from Hugging Face
transformers PR 45616: two `DeepseekV4HyperConnection` modules own the
collapse / expand mixer weights, one at the attention site and one at the
FFN site. The state-dict adapter routes the checkpoint's flat
`hc_attn_*` / `hc_ffn_*` keys into `attn_hc.*` / `ffn_hc.*`.
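
The key routing can be pictured as a prefix rewrite on the last component of each checkpoint key. The sketch below is an assumption about the general shape of that mapping, not the adapter's actual code: the `layers.N.` key layout, the leaf names after `attn_hc.` / `ffn_hc.`, and the helper `route_hc_key` are all hypothetical.

```python
def route_hc_key(key):
    """Map a flat checkpoint key to a nested module path (hypothetical naming)."""
    prefix, _, leaf = key.rpartition(".")
    for flat, nested in (("hc_attn_", "attn_hc."), ("hc_ffn_", "ffn_hc.")):
        if leaf.startswith(flat):
            leaf = nested + leaf[len(flat):]
            break
    return f"{prefix}.{leaf}" if prefix else leaf


# Illustrative keys only; real checkpoint key names may differ.
flat_keys = ["layers.0.hc_attn_fn", "layers.0.hc_ffn_scale", "layers.0.attn.wq.weight"]
routed = [route_hc_key(k) for k in flat_keys]
```

Keys that do not start with one of the two flat prefixes pass through unchanged.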

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block.forward(
    x: torch.Tensor,
    position_embeddings: tuple[torch.Tensor, torch.Tensor],
    position_ids: torch.Tensor | None = None,
    position_embeddings_compress: tuple[torch.Tensor, torch.Tensor] | None = None,
    rotary_compress: torch.nn.Module | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    input_ids: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block.init_weights(
    buffer_device: torch.device
) -> None
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Block.set_activation_checkpointing(
    enabled: bool = True
) -> None
```

Enable block-local checkpointing that avoids replaying MoE dispatch.

```python
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4CausalLMOutput(
    mtp_per_depth_h: typing.Optional[list[torch.Tensor]] = None,
    mtp_loss_scaling_factor: typing.Optional[float] = None
)
```

Dataclass

**Bases:** `CausalLMOutputWithPast`

Output of DeepseekV4ForCausalLM.forward.

Subclasses `transformers.modeling_outputs.CausalLMOutputWithPast` so that the
standard `logits` / `hidden_states` fields are present: the recipe's
fused cross-entropy path requires `"hidden_states" in out` and reads the
final hidden states off the output. The DSV4-specific MTP fields are
carried as declared dataclass fields. As required by `ModelOutput`, every
field after the first declares a `None` default.
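
The default-value requirement has a close analogue in the standard library's `dataclasses` module, which the following sketch uses as a stand-in. `BaseOutput` and `OutputWithMTP` are hypothetical classes, not the `transformers` types; the sketch only shows why fields added in a subclass need defaults once the parent's fields have them.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BaseOutput:
    # Stand-in for the parent output class; every field defaults to None.
    logits: Optional[list] = None
    hidden_states: Optional[list] = None


@dataclass
class OutputWithMTP(BaseOutput):
    # Extra fields also carry defaults because the inherited fields do.
    mtp_per_depth_h: Optional[list] = None
    mtp_loss_scaling_factor: Optional[float] = None


out = OutputWithMTP(logits=[0.1, 0.9], mtp_loss_scaling_factor=0.3)
field_names = [f.name for f in fields(out)]

# A field without a default after defaulted parent fields is rejected.
try:
    @dataclass
    class BrokenOutput(BaseOutput):
        mtp_per_depth_h: list
    defaults_required = False
except TypeError:
    defaults_required = True
```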

```python
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `Module`, [MoEFSDPSyncMixin](/nemo-automodel/nemo_automodel/components/moe/fsdp_mixin#nemo_automodel-components-moe-fsdp_mixin-MoEFSDPSyncMixin)

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM._build_mtp_embed_inputs_for_pp(
    input_ids: torch.Tensor
) -> tuple[torch.Tensor, ...]
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM._is_pipeline_parallel_stage() -> bool
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.customize_pipeline_stage_modules(
    module_names_per_stage: list[list[str]],
    layers_prefix: str,
    text_model: torch.nn.Module | None = None
) -> list[list[str]]
```

Keep DSV4 non-layer PP dependencies with the stages that need them.

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.forward(
    input_ids: torch.Tensor,
    mtp_embed_inputs: torch.Tensor = (),
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    output_hidden_states: typing.Optional[bool] = None,
    attn_kwargs: typing.Any = {}
) -> 'DeepseekV4CausalLMOutput' | tuple[torch.Tensor, ...] | torch.Tensor
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.from_config(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)
```

classmethod

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.from_pretrained(
    pretrained_model_name_or_path: str,
    model_args = (),
    kwargs = {}
)
```

classmethod

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.get_input_embeddings()
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.get_output_embeddings()
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.get_pipeline_stage_metas(
    is_first: bool,
    microbatch_size: int,
    seq_len: int,
    dtype: torch.dtype
) -> tuple[tuple[torch.Tensor, ...], tuple[torch.Tensor, ...]]
```

Return PP input/output meta tensors for DSV4's HC and MTP contract.

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.prepare_model_inputs_for_cp(
    input_ids: torch.Tensor,
    kwargs: typing.Any = {}
) -> dict[str, typing.Any]
```

Model-owned context-parallel batch prep (Miles-style contiguous shard).

Returns the keys that `cp_utils.make_cp_batch_and_ctx` needs in order to
delegate CP sharding back to this model: a `_cp_make_batch_fn` callable (with
the config-derived per-rank shard multiple bound) and a flag asking the recipe
to keep the full logits in the autograd graph. The flag ensures that every CP
rank's backward pass reaches all parameters even when its local loss is fully
masked. DSV4 embeds internally, so unlike VLM models this method does not
pre-embed; it leaves `input_ids` for the sharding callable.
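
To make "contiguous shard" concrete, the sketch below splits one token sequence into equal contiguous slices, one per CP rank. It is an illustration under stated assumptions rather than the behaviour of `_cp_make_batch_fn`: the function name `contiguous_cp_shard`, the padding rule (round the length up to a multiple of `cp_size * shard_multiple`), and `pad_id=0` are all hypothetical.

```python
def contiguous_cp_shard(token_ids, cp_size, cp_rank, shard_multiple=1, pad_id=0):
    """Return this rank's contiguous slice after padding to an even split."""
    chunk = cp_size * shard_multiple
    padded_len = -(-len(token_ids) // chunk) * chunk  # round up to a multiple of chunk
    padded = token_ids + [pad_id] * (padded_len - len(token_ids))
    per_rank = padded_len // cp_size
    return padded[cp_rank * per_rank:(cp_rank + 1) * per_rank]


tokens = list(range(1, 11))  # 10 illustrative token ids
shards = [
    contiguous_cp_shard(tokens, cp_size=2, cp_rank=rank, shard_multiple=4)
    for rank in range(2)
]
```

In this example the last rank holds mostly padding, which is the kind of situation in which a rank's local loss can be fully masked.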

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.set_input_embeddings(
    value
)
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.set_output_embeddings(
    new_embeddings
)
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4ForCausalLM.update_moe_gate_bias() -> None
```

```python
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig
)
```

**Bases:** `Module`

Hash gate for first num\_hash\_layers: routes tokens via a fixed lookup table.

Instead of computing routing scores, the gate uses tid2eid\[token\_id] to
pre-assign expert indices.  The routing weight is still computed from the
gate weight but the *selection* is deterministic per token id.

tid2eid shape: \[vocab\_size, n\_activated\_experts]  (int64 runtime, non-trainable)

The signature matches `components.moe.layers.Gate`:
`forward(x, token_mask, cp_mesh)` returns `(weights, indices, aux_loss)`, so
the generic MoE module can call either gate interchangeably. The per-forward
`input_ids` needed for the `tid2eid` lookup are stashed on the module by the
enclosing Block via `set_input_ids` immediately before the MoE call.
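
The stash-then-forward pattern and the table lookup can be sketched in pure Python. `ToyHashGate` is a hypothetical stand-in, not the real class: it works on lists instead of tensors, and its weight computation (dot-product scores normalised over the selected experts) is an assumed placeholder for whatever scoring the actual gate applies.

```python
class ToyHashGate:
    """Pure-Python stand-in for a hash gate; not the real DeepseekV4HashGate."""

    def __init__(self, tid2eid, gate_weight):
        self.tid2eid = tid2eid          # [vocab_size][n_activated_experts]
        self.gate_weight = gate_weight  # [n_experts][dim]
        self._input_ids = None

    def set_input_ids(self, input_ids):
        # Stash the token ids so forward() can keep the generic gate signature.
        self._input_ids = input_ids

    def forward(self, x, token_mask=None, cp_mesh=None):
        if self._input_ids is None:
            raise RuntimeError("call set_input_ids before forward")
        weights, indices = [], []
        for token_vec, token_id in zip(x, self._input_ids):
            experts = self.tid2eid[token_id]  # selection is a table lookup only
            # Assumed scoring: dot product with each selected expert's gate row.
            scores = [
                sum(w * v for w, v in zip(self.gate_weight[e], token_vec))
                for e in experts
            ]
            total = sum(scores)
            weights.append([s / total for s in scores])
            indices.append(list(experts))
        return weights, indices, None  # no aux loss


gate = ToyHashGate(
    tid2eid=[[0, 1], [2, 3], [1, 3]],
    gate_weight=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],
)
gate.set_input_ids([2, 0])
weights, indices, aux_loss = gate.forward([[1.0, 3.0], [2.0, 2.0]])
```

Changing the hidden states `x` while keeping the same token ids changes `weights` but never `indices`, which is the sense in which the selection is deterministic per token id.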

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.forward(
    x: torch.Tensor,
    token_mask: torch.Tensor | None = None,
    cp_mesh: 'DeviceMesh | None' = None
) -> tuple[torch.Tensor, torch.Tensor, None]
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.init_weights(
    buffer_device: torch.device | None = None
) -> None
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.set_input_ids(
    input_ids: torch.Tensor | None
) -> None
```

Stash the current batch's input\_ids for the next `forward` call.

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4HashGate.update_bias() -> None
```

No-op, kept for compatibility with callers that walk MoE gates and call `update_bias`.

```python
class nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    moe_overrides: dict | None = None
)
```

**Bases:** `Module`

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model.forward(
    input_ids: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    return_hc_hidden: bool = False,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model.init_weights(
    buffer_device: torch.device | None = None
) -> None
```

```python
nemo_automodel.components.models.deepseek_v4.model.DeepseekV4Model.update_moe_gate_bias() -> None
```

```python
nemo_automodel.components.models.deepseek_v4.model.ModelClass = DeepseekV4ForCausalLM
```