> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# nemo_automodel.components.models.deepseek_v4.layers

DeepSeek V4 Attention Layer.

Architecture (from official inference/model.py):

KV path (K = V, single latent):
x  -> wkv   \[hidden -> head\_dim]        # single KV head, K = V = kv
-> kv\_norm (RMSNorm on head\_dim)
-> apply\_rotary\_emb on last rope\_head\_dim dims
K = V = kv  (one latent vector serves both key and value)

Output path (grouped):
o \[bsz, seq, n\_heads, head\_dim]
-> reshape \[bsz, seq, n\_groups, n\_heads\_per\_group \* head\_dim]
-> wo\_a einsum per group: \[n\_heads\_per\_group \* head\_dim] -> \[o\_lora\_rank]
-> reshape \[bsz, seq, n\_groups \* o\_lora\_rank]
-> wo\_b \[n\_groups \* o\_lora\_rank -> hidden]

attn\_sink: learnable per-head scalar bias added to attention-sink position score.

HC (Hyper-Connections):
Each Block maintains hc\_mult=4 copies of the hidden state.
hc\_pre  reduces \[bsz, seq, hc\_mult, dim] -> \[bsz, seq, dim] via Sinkhorn mixing.
hc\_post expands \[bsz, seq, dim] -> \[bsz, seq, hc\_mult, dim].
See `DeepseekV4HyperConnection.compute_weights` and
`optimized_kernels.dsv4_sinkhorn_normalize` for the torch reference and
optional TileKernels Sinkhorn path.

Compress-ratio attention (Compressor + Indexer) is wired into
DeepseekV4Attention.forward for layers with compress\_ratio > 0.
All layers share the same sliding-window causal mask on the local KV path.

## Module Contents

### Classes

| Name                                                                                                          | Description                                                                 |
| ------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| [`DeepseekV4Attention`](#nemo_automodel-components-models-deepseek_v4-layers-DeepseekV4Attention)             | Sliding-window attention + Compressor + Indexer + attention sink.           |
| [`DeepseekV4Compressor`](#nemo_automodel-components-models-deepseek_v4-layers-DeepseekV4Compressor)           | HF PR 45616 port.  Long-range KV branch.  Pools `compress_ratio` tokens     |
| [`DeepseekV4FP32Parameter`](#nemo_automodel-components-models-deepseek_v4-layers-DeepseekV4FP32Parameter)     | Callable holder for fp32 tensors that need their own FSDP unit.             |
| [`DeepseekV4GroupedLinear`](#nemo_automodel-components-models-deepseek_v4-layers-DeepseekV4GroupedLinear)     | Block-diagonal grouped linear (HF PR 45616 port).                           |
| [`DeepseekV4HyperConnection`](#nemo_automodel-components-models-deepseek_v4-layers-DeepseekV4HyperConnection) | Per-site HyperConnection mixer (attention or FFN).  Ported from             |
| [`DeepseekV4HyperHead`](#nemo_automodel-components-models-deepseek_v4-layers-DeepseekV4HyperHead)             | Final HC-stream collapse before the shared RMSNorm + `lm_head`.             |
| [`DeepseekV4Indexer`](#nemo_automodel-components-models-deepseek_v4-layers-DeepseekV4Indexer)                 | HF PR 45616 port.  Picks the top-k compressed positions per query when      |
| [`DeepseekV4RotaryEmbedding`](#nemo_automodel-components-models-deepseek_v4-layers-DeepseekV4RotaryEmbedding) | V4 rotary embedding.  Produces `(cos, sin)` sized to `qk_rope_head_dim`     |
| [`DeepseekV4TrainCache`](#nemo_automodel-components-models-deepseek_v4-layers-DeepseekV4TrainCache)           | Training-only cache shim mirroring the three methods `DeepseekV4Compressor` |

### Functions

| Name                                                                                                                              | Description                                                                 |
| --------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| [`_apply_partial_rope`](#nemo_automodel-components-models-deepseek_v4-layers-_apply_partial_rope)                                 | Split `x` along its last dim into nope (first) and rope (last               |
| [`_apply_partial_rope_interleaved`](#nemo_automodel-components-models-deepseek_v4-layers-_apply_partial_rope_interleaved)         | Interleaved RoPE on the last `rope_head_dim` dims of `x` (pairs are         |
| [`_build_indexer_topk_compressed_mask`](#nemo_automodel-components-models-deepseek_v4-layers-_build_indexer_topk_compressed_mask) | Build the additive compressed-position mask for Indexer-selected pool IDs.  |
| [`_dsv4_kernel_backend`](#nemo_automodel-components-models-deepseek_v4-layers-_dsv4_kernel_backend)                               | Use TileLang DSV4 kernels only when the attention backend requests them.    |
| [`_overlap_transform`](#nemo_automodel-components-models-deepseek_v4-layers-_overlap_transform)                                   | Reshape `[B, S, ratio, 2*head_dim]` -> `[B, S, 2*ratio, head_dim]` with the |
| [`_overlap_transform_with_cp`](#nemo_automodel-components-models-deepseek_v4-layers-_overlap_transform_with_cp)                   | -                                                                           |
| [`_pool_windows`](#nemo_automodel-components-models-deepseek_v4-layers-_pool_windows)                                             | Softmax-gated sum-pool over `ratio` consecutive tokens.                     |
| [`_query_positions`](#nemo_automodel-components-models-deepseek_v4-layers-_query_positions)                                       | -                                                                           |
| [`_rms_norm_last_dim`](#nemo_automodel-components-models-deepseek_v4-layers-_rms_norm_last_dim)                                   | RMS-normalize the last dim without materializing an `x.square()` tensor.    |
| [`_rope_pool_positions`](#nemo_automodel-components-models-deepseek_v4-layers-_rope_pool_positions)                               | -                                                                           |
| [`_rotate_half`](#nemo_automodel-components-models-deepseek_v4-layers-_rotate_half)                                               | Rotate half the hidden dims of the input (Llama / GPT-NeoX style).          |
| [`_yarn_correction_dim`](#nemo_automodel-components-models-deepseek_v4-layers-_yarn_correction_dim)                               | -                                                                           |
| [`_yarn_correction_range`](#nemo_automodel-components-models-deepseek_v4-layers-_yarn_correction_range)                           | -                                                                           |
| [`_yarn_linear_ramp`](#nemo_automodel-components-models-deepseek_v4-layers-_yarn_linear_ramp)                                     | -                                                                           |
| [`apply_rotary_pos_emb`](#nemo_automodel-components-models-deepseek_v4-layers-apply_rotary_pos_emb)                               | Port of transformers.models.llama.modeling\_llama.apply\_rotary\_pos\_emb.  |
| [`build_causal_padding_mask`](#nemo_automodel-components-models-deepseek_v4-layers-build_causal_padding_mask)                     | Build a 4D additive causal+padding (+optional sliding-window) mask          |
| [`build_packed_causal_padding_mask`](#nemo_automodel-components-models-deepseek_v4-layers-build_packed_causal_padding_mask)       | Build a 4D additive block-causal mask from packed-sequence lengths.         |
| [`eager_attention_with_sink`](#nemo_automodel-components-models-deepseek_v4-layers-eager_attention_with_sink)                     | Eager attention with per-head sink: appends an extra softmax column         |
| [`repeat_kv`](#nemo_automodel-components-models-deepseek_v4-layers-repeat_kv)                                                     | Port of transformers.models.llama.modeling\_llama.repeat\_kv.               |

### API

```python
class nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4Attention(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    layer_idx: int,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None
)
```

**Bases:** `Module`

Sliding-window attention + Compressor + Indexer + attention sink.

Single-head KV (`num_key_value_heads=1`), grouped low-rank output via
:class:`DeepseekV4GroupedLinear`.  `compress_ratio == 0` layers skip
the compressor / indexer and run pure SWA.

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4Attention.forward(
    hidden_states: torch.Tensor,
    position_embeddings: tuple[torch.Tensor, torch.Tensor],
    attention_mask: torch.Tensor | None = None,
    position_embeddings_compress: typing.Optional[tuple[torch.Tensor, torch.Tensor]] = None,
    rotary_compress: torch.nn.Module | None = None,
    start_pos: int = 0,
    position_ids: torch.Tensor | None = None,
    kwargs: typing.Any = {}
) -> tuple[torch.Tensor, torch.Tensor | None]
```

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4Attention.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None
```

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4Attention.setup_cp_attention(
    cp_mesh
) -> None
```

Model-owned context-parallel hook, called by `moe.parallelizer.apply_cp`.

DSV4 runs Miles-style CP (contiguous query shard + all-gathered K/V), so there
is no TE `DotProductAttention` to configure -- we just record the CP process
group that `forward` uses to all-gather K/V across CP ranks.

```python
class nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4Compressor(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    compress_ratio: int,
    head_dim: int,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None
)
```

**Bases:** `Module`

HF PR 45616 port.  Long-range KV branch.  Pools `compress_ratio` tokens
into one compressed KV; when `ratio == 4` the Indexer narrows the pool.

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4Compressor._compute_fsdp_group_has_complete_hca_window(
    local_has_complete_hca_window: bool,
    device: torch.device
) -> bool
```

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4Compressor._set_hca_param_sync_group(
    process_group
) -> None
```

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4Compressor.forward(
    hidden_states: torch.Tensor,
    q_residual: torch.Tensor | None,
    rotary: torch.nn.Module,
    position_embeddings: tuple[torch.Tensor, torch.Tensor],
    cache: nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4TrainCache,
    layer_idx: int,
    start_pos: int,
    enable_hca_fsdp_graph_alignment: bool = False,
    position_ids: torch.Tensor | None = None,
    cp_group = None
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4FP32Parameter(
    value: torch.Tensor
)
```

**Bases:** `Module`

Callable holder for fp32 tensors that need their own FSDP unit.

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4FP32Parameter.forward(
    reference: torch.Tensor | None = None
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4GroupedLinear(
    in_features_per_group: int,
    out_features: int,
    n_groups: int,
    bias: bool = False
)
```

**Bases:** `Linear`

Block-diagonal grouped linear (HF PR 45616 port).

`weight` parameter has the standard `nn.Linear` shape
`[out_features, in_features_per_group]` so quantizers keyed on
`nn.Linear.weight` still find it; `forward` does per-group bmm.

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4GroupedLinear.forward(
    x: torch.Tensor
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4HyperConnection(
    hc_mult: int,
    hidden_size: int,
    hc_sinkhorn_iters: int,
    hc_eps: float,
    rms_norm_eps: float,
    sinkhorn_backend: str = 'torch'
)
```

**Bases:** `Module`

Per-site HyperConnection mixer (attention or FFN).  Ported from
`transformers/src/transformers/models/deepseek_v4/modular_deepseek_v4.py`
class `DeepseekV4HyperConnection`.

Owns `fn` (packed linear), `base` (bias), and `scale` (scalar
per-head gains).  `compute_weights` produces three mixer tensors:

* `pre`   \[B, S, H]       : sigmoid-gated collapse weights
* `post`  \[B, S, H]       : sigmoid-gated expand weights
* `comb`  \[B, S, H, H]    : doubly-stochastic combination matrix
  from Sinkhorn-normalising sigmoid gates

All math runs in fp32 regardless of the outer cast policy; parameters
cast themselves via `.float()` on each forward.  HF lists these params
in `_keep_in_fp32_modules_strict` — the KAutomodel adapter does the
same via submodule-name matching.

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4HyperConnection.compute_weights(
    hidden_streams: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]
```

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4HyperConnection.forward(
    hidden_streams: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]
```

```python
class nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4HyperHead(
    hc_mult: int,
    hidden_size: int,
    hc_eps: float,
    rms_norm_eps: float
)
```

**Bases:** `Module`

Final HC-stream collapse before the shared RMSNorm + `lm_head`.
Ported from `modular_deepseek_v4.py` class `DeepseekV4HyperHead`.

Sigmoid-weighted sum over the `hc_mult` streams (no Sinkhorn).  Used
once at the end of `DeepseekV4Model.forward` to go from
`[B, S, H, D]` back to `[B, S, D]`.

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4HyperHead.forward(
    x: torch.Tensor
) -> torch.Tensor
```

```python
class nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4Indexer(
    config: nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None
)
```

**Bases:** `Module`

HF PR 45616 port.  Picks the top-k compressed positions per query when
`compress_ratio == 4`.  Owns its own pool at `index_head_dim` plus a
query projection + weights\_proj head-mixer.

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4Indexer.forward(
    hidden_states: torch.Tensor,
    q_residual: torch.Tensor,
    rotary: torch.nn.Module,
    position_embeddings: tuple[torch.Tensor, torch.Tensor],
    cache: nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4TrainCache,
    layer_idx: int,
    start_pos: int,
    position_ids: torch.Tensor | None = None,
    cp_group = None
) -> torch.LongTensor
```

```python
class nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4RotaryEmbedding(
    rope_theta: float,
    head_dim: int,
    partial_rotary_factor: float,
    attention_scaling: float = 1.0,
    device: torch.device | None = None,
    rope_scaling: dict | None = None
)
```

**Bases:** `Module`

V4 rotary embedding.  Produces `(cos, sin)` sized to `qk_rope_head_dim`
(via `partial_rotary_factor = qk_rope_head_dim / head_dim`), matching HF.

YaRN: when `rope_scaling` is a YaRN-typed dict
(`&#123;"type": "yarn", "factor": F, "original_max_position_embeddings": L0,
"beta_fast": ..., "beta_slow": ...&#125;`), modify `inv_freq` per
`dsv4flash/inference/model.py:precompute_freqs_cis` — frequency
interpolation with a smooth linear ramp between beta\_fast/beta\_slow
correction dims.  Used by the compress-rope (theta=160000) on layers
with `compress_ratio &gt; 0`.  The main rope (theta=10000, used only on
sliding-window layers) gets `rope_scaling=None` because the reference
builds it with `original_seq_len=0` for those layers.

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4RotaryEmbedding.forward(
    x: torch.Tensor,
    position_ids: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]
```

```python
class nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4TrainCache()
```

Training-only cache shim mirroring the three methods `DeepseekV4Compressor`
/ `DeepseekV4Indexer` call on `DeepseekV4Cache`.

KAutomodel training forward is stateless — we never persist KV or compressor
windows across calls.  Each `DeepseekV4Attention.forward` creates a fresh
cache instance, which holds per-layer scratch dicts for the duration of the
call.  When a full window hasn't accumulated yet we return an empty tensor
and let the downstream code handle it.

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4TrainCache._branch_state(
    state_key: str,
    layer_idx: int
) -> dict
```

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4TrainCache.accumulate_windows(
    kv: torch.Tensor,
    gate: torch.Tensor,
    layer_idx: int,
    state_key: str,
    ratio: int,
    start_pos: int
) -> tuple[torch.Tensor, torch.Tensor, int]
```

```python
nemo_automodel.components.models.deepseek_v4.layers.DeepseekV4TrainCache.update_pool(
    new_pooled: torch.Tensor,
    layer_idx: int,
    state_key: str
) -> torch.Tensor
```

```python
nemo_automodel.components.models.deepseek_v4.layers._apply_partial_rope(
    x: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
    rope_head_dim: int
) -> torch.Tensor
```

Split `x` along its last dim into nope (first) and rope (last
`rope_head_dim`) slices, rotate only the rope slice with INTERLEAVED
pair-RoPE (pairs `(2k, 2k+1)`), concat back.

The DSV4-Flash released checkpoint uses interleaved RoPE end-to-end
(see `dsv4flash/inference/model.py:apply_rotary_emb` — complex
multiplication on `view_as_complex` of pairs).  HF transformers PR
45616 / PR 45643 ship a Llama-style `rotate_half` here instead, which
pairs `(d, d+rd/2)`.  Same algebra but a different dim-to-frequency
mapping — the released weights expect the interleaved layout, so the
Llama-style helper produces wrong activations on the released checkpoint
(verified empirically: kv\_post\_rope cosine drops from 0.9999 to 0.866
after one block under Llama-style; matches at >0.999 under interleaved).

```python
nemo_automodel.components.models.deepseek_v4.layers._apply_partial_rope_interleaved(
    x: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
    rope_head_dim: int
) -> torch.Tensor
```

Interleaved RoPE on the last `rope_head_dim` dims of `x` (pairs are
`(2k, 2k+1)`).  Matches the DeepSeek inference reference's complex-mul
formulation in `dsv4flash/inference/model.py:apply_rotary_emb`: the
released DSV4-Flash weights were trained with this layout, NOT the
Llama-style `rotate_half` layout HF transformers PR 45616/45643 still
uses (pairs `(d, d+rd/2)`).

Inverse rotation: pass `-sin` instead of `sin` (caller's
responsibility — same as our existing inverse-rope call site).

**Parameters:**

`[..., rope_head_dim]` (or larger trailing dim with rope on the
last `rope_head_dim` slice).  Typical attention-layout shapes:
`[B, H, S, D]` for q/k or `[B, 1, S, D]` for shared-KV.

shape `[B, S, rope_head_dim]` produced by the Llama-style
`cat([freqs, freqs], -1)` rotary; we take the first half which
contains the unique per-pair frequencies (the second half is a
duplicate that the Llama-style helper needs and we don't).

Must be even.

```python
nemo_automodel.components.models.deepseek_v4.layers._build_indexer_topk_compressed_mask(
    attention_mask: torch.Tensor,
    indexer_topk: torch.Tensor,
    n_pooled: int
) -> torch.Tensor
```

Build the additive compressed-position mask for Indexer-selected pool IDs.

```python
nemo_automodel.components.models.deepseek_v4.layers._dsv4_kernel_backend(
    backend: nemo_automodel.components.models.common.BackendConfig
) -> str
```

Use TileLang DSV4 kernels only when the attention backend requests them.

```python
nemo_automodel.components.models.deepseek_v4.layers._overlap_transform(
    tensor: torch.Tensor,
    head_dim: int,
    fill_value: float
) -> torch.Tensor
```

Reshape `[B, S, ratio, 2*head_dim]` -> `[B, S, 2*ratio, head_dim]` with the
cross-window overlap from the DeepSeek inference reference (`Compressor.overlap_transform`
in `dsv4flash/inference/model.py:307-314`).

Window 0 has no previous block, so its `[:ratio]` slice is left at `fill_value`
(`0` for the kv tensor, `-inf` for the score tensor so softmax masks it out).

```python
nemo_automodel.components.models.deepseek_v4.layers._overlap_transform_with_cp(
    tensor: torch.Tensor,
    head_dim: int,
    fill_value: float,
    cp_group
) -> torch.Tensor
```

```python
nemo_automodel.components.models.deepseek_v4.layers._pool_windows(
    kv: torch.Tensor,
    gate: torch.Tensor,
    ape: torch.Tensor,
    ratio: int,
    head_dim: int,
    overlap: bool = False,
    cp_group = None
) -> torch.Tensor
```

Softmax-gated sum-pool over `ratio` consecutive tokens.

Non-overlap mode (HF PR 45616 layout, ratio==128 in V4-Flash):
Input  `kv`/`gate` of shape `[B, length, head_dim]`.
Reshape to `[B, length/ratio, ratio, head_dim]` and pool over the `ratio` axis.

Overlap mode (DeepSeek inference reference layout, ratio==4 in V4-Flash):
Input  `kv`/`gate` of shape `[B, length, 2*head_dim]` (`wkv`/`wgate`
project to `2*head_dim` so each window can carry both its own kv and a
half-overlap into the next window).
Reshape to `[B, length/ratio, ratio, 2*head_dim]`, apply :func:`_overlap_transform`
to remap to `[B, length/ratio, 2*ratio, head_dim]`, then pool over the `2*ratio`
axis.  Each compressed token thus aggregates `2*ratio = 8` raw tokens — the
`ratio` tokens of the current window plus the `ratio` tokens of the previous
window — giving smoother compression boundaries that the released checkpoint
was trained under.

HF PR 45616 omits the overlap path entirely; the released DSV4-Flash safetensors
have `ape`/`wkv`/`wgate` shapes that only match the overlap layout (`[ratio,
2*head_dim]` and `[2*head_dim, hidden]`), so we must support it here to load
the released weights.

```python
nemo_automodel.components.models.deepseek_v4.layers._query_positions(
    position_ids: torch.Tensor | None,
    batch: int,
    seq_len: int,
    device: torch.device,
    cp_group = None
) -> torch.Tensor
```

```python
nemo_automodel.components.models.deepseek_v4.layers._rms_norm_last_dim(
    x: torch.Tensor,
    eps: float
) -> torch.Tensor
```

RMS-normalize the last dim without materializing an `x.square()` tensor.

```python
nemo_automodel.components.models.deepseek_v4.layers._rope_pool_positions(
    pool_length: int,
    pool_base: int,
    ratio: int,
    device: torch.device,
    batch: int
) -> torch.Tensor
```

```python
nemo_automodel.components.models.deepseek_v4.layers._rotate_half(
    x: torch.Tensor
) -> torch.Tensor
```

Rotate half the hidden dims of the input (Llama / GPT-NeoX style).

```python
nemo_automodel.components.models.deepseek_v4.layers._yarn_correction_dim(
    num_rotations: float,
    dim: int,
    base: float,
    max_seq_len: int
) -> float
```

```python
nemo_automodel.components.models.deepseek_v4.layers._yarn_correction_range(
    low_rot: float,
    high_rot: float,
    dim: int,
    base: float,
    max_seq_len: int
) -> tuple[int, int]
```

```python
nemo_automodel.components.models.deepseek_v4.layers._yarn_linear_ramp(
    min_v: float,
    max_v: float,
    dim: int,
    device = None
) -> torch.Tensor
```

```python
nemo_automodel.components.models.deepseek_v4.layers.apply_rotary_pos_emb(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
    position_ids: torch.Tensor | None = None,
    unsqueeze_dim: int = 1
) -> tuple[torch.Tensor, torch.Tensor]
```

Port of transformers.models.llama.modeling\_llama.apply\_rotary\_pos\_emb.

```python
nemo_automodel.components.models.deepseek_v4.layers.build_causal_padding_mask(
    attention_mask: torch.Tensor | None,
    seq_len: int,
    dtype: torch.dtype,
    device: torch.device,
    batch_size: int = 1,
    sliding_window: int | None = None
) -> torch.Tensor | None
```

Build a 4D additive causal+padding (+optional sliding-window) mask
compatible with `eager_attention_with_sink`.

Mirrors HF's `create_sliding_window_causal_mask` (used in
`DeepseekV4Model.forward`): each query at position `i` attends only to
keys at positions `[max(0, i - sliding_window + 1), i]`.  The DSV4-Flash
weights were trained with this banding on every layer, so dropping it makes
the softmax see a different distribution than training and degrades loss.

Returns:
`[B, 1, S, S]` additive mask of `dtype` (0 where keep, large negative
where mask).

```python
nemo_automodel.components.models.deepseek_v4.layers.build_packed_causal_padding_mask(
    seq_lens: torch.Tensor,
    seq_len: int,
    dtype: torch.dtype,
    device: torch.device,
    sliding_window: int | None = None
) -> torch.Tensor
```

Build a 4D additive block-causal mask from packed-sequence lengths.

```python
nemo_automodel.components.models.deepseek_v4.layers.eager_attention_with_sink(
    module: torch.nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: torch.Tensor | None,
    scaling: float,
    dropout: float = 0.0,
    kwargs = {}
) -> tuple[torch.Tensor, torch.Tensor]
```

Eager attention with per-head sink: appends an extra softmax column
whose logit is `module.sinks[h]` and whose value-slot is zero.  Ported
verbatim from HF PR 45616.

```python
nemo_automodel.components.models.deepseek_v4.layers.repeat_kv(
    hidden_states: torch.Tensor,
    n_rep: int
) -> torch.Tensor
```

Port of transformers.models.llama.modeling\_llama.repeat\_kv.