nemo_automodel.components.models.deepseek_v4.optimized_kernels#

Optional DeepSeek V4 optimized kernel dispatch.

The torch implementations below are kept as the numerical reference. Optional TileLang-backed paths are sourced from:

  • Sinkhorn: imported from DeepSeek TileKernels (tile_kernels.modeling.mhc.ops.sinkhorn_normalize). No TileKernels source is vendored in AutoModel.
    Upstream source: https://github.com/deepseek-ai/TileKernels/blob/36d9e45d38e204ebb87e6f6e833821eee0482fe5/tile_kernels/modeling/mhc/ops/sinkhorn.py
    Upstream license: MIT, copyright 2026 DeepSeek.

  • Sparse attention and indexer: Miles DeepSeek V4 ops, vendored and adapted in nemo_automodel.components.models.deepseek_v4.kernels.
    Upstream source: https://github.com/yueming-yuan/miles/tree/e561465d0b9bbf06188b7a5e2020dc7fd691f732/miles_plugins/models/deepseek_v4/ops
    Upstream license: Apache-2.0, copyright 2025 Zhipu AI.
    See nemo_automodel/components/models/deepseek_v4/kernels/__init__.py for the per-file attribution.

Those packages are imported with safe_import so environments without TileLang still import the model and use the existing torch path.
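The optional-import pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the actual safe_import helper, whose signature and return value may differ; try_import, hypothetical_fast_kernels, and row_sum are hypothetical names introduced for the example.

```python
import importlib
from types import ModuleType
from typing import List, Optional, Tuple


def try_import(module_name: str) -> Tuple[bool, Optional[ModuleType]]:
    """Return (True, module) if the import succeeds, else (False, None)."""
    try:
        return True, importlib.import_module(module_name)
    except ImportError:
        return False, None


# Hypothetical optional dependency; it is not expected to be installed.
HAS_FAST, fast_mod = try_import("hypothetical_fast_kernels")


def row_sum(rows: List[List[float]]) -> List[float]:
    """Use the optional fast path when present, else the pure-Python reference."""
    if HAS_FAST:
        return fast_mod.row_sum(rows)  # hypothetical accelerated entry point
    return [sum(r) for r in rows]


print(row_sum([[1, 2], [3]]))
```

Because the availability flag is computed once at import time, callers can branch on it without wrapping every call site in its own try/except.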

Module Contents#

Classes#

_Dsv4TileKernelsSinkhorn

TileKernels Sinkhorn wrapper that accepts non-contiguous backward gradients.

Functions#

is_dsv4_kernel_available

Return whether the optional TileLang kernel package for name is importable.

_all_cuda

_should_use_tilelang

sinkhorn_normalize_torch

Torch reference for TileKernels MHC Sinkhorn normalization.

_tile_kernels_sinkhorn_contiguous_grad

dsv4_sinkhorn_normalize

Normalize HyperConnection combination logits with torch or TileKernels.

build_dsv4_sparse_topk_indices

Build Miles-style top-k key indices for DSV4 local-window + compressed KV attention.

sparse_attention_torch

Miles sparse MQA torch reference.

dense_attention_topk_torch

Dense torch oracle for the Miles top-k sparse-attention contract.

dsv4_sparse_attention

Run DSV4 sparse attention through Miles TileLang kernels or torch fallback.

indexer_scores_torch

Torch reference for the Miles DSV4 C4 indexer score kernel.

extract_indexer_topk_scores_torch

Extract top-k score values, masking -1 entries with -inf.

dsv4_indexer_scores

Run DSV4 C4 indexer scores through Miles TileLang kernels or torch fallback.

dsv4_indexer_topk_scores

Run DSV4 C4 top-k indexer scores through Miles autograd kernels or torch fallback.

Data#

API#

nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4SparseAttentionBackend#

None

nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4IndexerBackend#

None

nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4SinkhornBackend#

None

nemo_automodel.components.models.deepseek_v4.optimized_kernels.is_dsv4_kernel_available(
name: Literal['sinkhorn', 'sparse_attn', 'indexer'],
) bool#

Return whether the optional TileLang kernel package for name is importable.

nemo_automodel.components.models.deepseek_v4.optimized_kernels._all_cuda(*tensors: torch.Tensor) bool#

nemo_automodel.components.models.deepseek_v4.optimized_kernels._should_use_tilelang(
backend: str,
*,
available: bool,
kernel_name: str,
tensors: tuple[torch.Tensor, ...],
require_bf16: bool = False,
) bool#

nemo_automodel.components.models.deepseek_v4.optimized_kernels.sinkhorn_normalize_torch(
x: torch.Tensor,
repeat: int,
eps: float,
) torch.Tensor#

Torch reference for TileKernels MHC Sinkhorn normalization.
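For orientation, a generic Sinkhorn-Knopp iteration in torch might look like the sketch below. It is not the reference implementation: exponentiating the logits first, the row-then-column ordering, and the placement of eps in the denominators are all assumptions, and sinkhorn_sketch is a hypothetical name.

```python
import torch


def sinkhorn_sketch(logits: torch.Tensor, repeat: int, eps: float) -> torch.Tensor:
    """Alternate row and column normalization over the last two dimensions.

    Assumption: logits are mapped to positive weights with a stabilized exp.
    """
    m = torch.exp(logits - logits.amax(dim=(-2, -1), keepdim=True))
    for _ in range(repeat):
        m = m / (m.sum(dim=-1, keepdim=True) + eps)  # rows sum to ~1
        m = m / (m.sum(dim=-2, keepdim=True) + eps)  # columns sum to ~1
    return m


torch.manual_seed(0)
logits = 0.5 * torch.randn(2, 4, 4, dtype=torch.float64)
# Many iterations are used here only to show convergence on a square matrix.
out = sinkhorn_sketch(logits, repeat=200, eps=1e-12)
print(out.sum(dim=-1))
print(out.sum(dim=-2))
```

For a square matrix with strictly positive entries, the iterates approach a doubly stochastic matrix, so both printed sums should be close to one.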

class nemo_automodel.components.models.deepseek_v4.optimized_kernels._Dsv4TileKernelsSinkhorn#

Bases: torch.autograd.Function

TileKernels Sinkhorn wrapper that accepts non-contiguous backward gradients.

The upstream high-level wrapper launches the backward kernel with grad_output as-is. DSV4 consumes HC combinations through transposed matmul sites, so autograd can provide a transposed gradient layout. The low-level TileKernels backward kernel requires contiguous row-major inputs.
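The layout issue can be illustrated with a toy torch.autograd.Function that calls .contiguous() on the incoming gradient before using it. This is only a sketch of the idea: ContiguousGradSquare is a hypothetical elementwise op standing in for the Sinkhorn kernel, and the real wrapper's forward and backward bodies are not shown in this page.

```python
import torch


class ContiguousGradSquare(torch.autograd.Function):
    """Toy op (elementwise square) whose backward normalizes the gradient layout."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # A low-level kernel that assumes row-major input would go here.
        # .contiguous() returns the tensor itself if it is already contiguous.
        grad_output = grad_output.contiguous()
        return 2.0 * x * grad_output


torch.manual_seed(0)
x = torch.randn(3, 4, dtype=torch.float64, requires_grad=True)
w = torch.randn(3, 5, dtype=torch.float64)

# The output is consumed through a transpose before the matmul, so autograd
# may hand backward() a transposed view rather than a row-major tensor.
loss = (ContiguousGradSquare.apply(x).transpose(0, 1) @ w).sum()
loss.backward()
print(x.grad.shape)
```

Normalizing the layout inside backward keeps the fix local to the wrapper, so call sites do not need to know about the kernel's memory-layout requirement.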

static forward(
ctx: torch.autograd.function.FunctionCtx,
x: torch.Tensor,
repeat: int,
eps: float,
) torch.Tensor#

static backward(
ctx: torch.autograd.function.FunctionCtx,
grad_output: torch.Tensor,
) tuple[torch.Tensor, None, None]#

nemo_automodel.components.models.deepseek_v4.optimized_kernels._tile_kernels_sinkhorn_contiguous_grad(
x: torch.Tensor,
repeat: int,
eps: float,
) torch.Tensor#

nemo_automodel.components.models.deepseek_v4.optimized_kernels.dsv4_sinkhorn_normalize(
x: torch.Tensor,
*,
backend: nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4SinkhornBackend,
repeat: int,
eps: float,
) torch.Tensor#

Normalize HyperConnection combination logits with torch or TileKernels.

nemo_automodel.components.models.deepseek_v4.optimized_kernels.build_dsv4_sparse_topk_indices(
*,
batch_size: int,
seq_len: int,
key_len: int,
window_size: int,
device: torch.device,
attention_mask: torch.Tensor | None = None,
compress_ratio: int = 0,
compressed_topk: torch.Tensor | None = None,
n_pooled: int = 0,
) torch.Tensor#

Build Miles-style top-k key indices for DSV4 local-window + compressed KV attention.
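As a rough illustration of the local-window part of such an index tensor, the sketch below builds causal window indices and pads out-of-range slots with -1. It covers only the window, ignores attention_mask and the compressed-KV entries, and assumes that each query attends to its own position and the window_size - 1 positions before it; local_window_indices is a hypothetical helper, and the real function's layout may differ.

```python
import torch


def local_window_indices(batch_size: int, seq_len: int, window_size: int) -> torch.Tensor:
    """Return [B, S, window_size] key indices; -1 marks slots before position 0."""
    pos = torch.arange(seq_len).unsqueeze(-1)       # [S, 1] query positions
    offs = torch.arange(window_size).unsqueeze(0)   # [1, W] offsets inside the window
    idx = pos - (window_size - 1) + offs            # keys s-W+1 .. s for query s
    idx = torch.where(idx >= 0, idx, torch.full_like(idx, -1))
    return idx.unsqueeze(0).expand(batch_size, -1, -1)


idx = local_window_indices(batch_size=1, seq_len=4, window_size=3)
print(idx[0].tolist())
```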

nemo_automodel.components.models.deepseek_v4.optimized_kernels.sparse_attention_torch(
q: torch.Tensor,
kv: torch.Tensor,
sinks: torch.Tensor,
topk_idxs: torch.Tensor,
sm_scale: float,
) torch.Tensor#

Miles sparse MQA torch reference.

Parameters:
  • q – Query tensor with shape [B, S, H, D].

  • kv – Single-head KV tensor with shape [B, K, D].

  • sinks – Per-head attention sink logits with shape [H].

  • topk_idxs – Key indices with shape [B, S, K_top]; -1 masks an entry.

  • sm_scale – Attention scaling factor.
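The shapes above suggest a gather-then-softmax computation, sketched below. This is not the reference code: it assumes the single-head kv tensor serves as both keys and values, and that each head's sink logit joins the softmax as an extra column that is dropped before the weighted sum. sparse_mqa_sketch is a hypothetical name.

```python
import torch


def sparse_mqa_sketch(q, kv, sinks, topk_idxs, sm_scale):
    """q: [B, S, H, D], kv: [B, K, D], sinks: [H], topk_idxs: [B, S, K_top]."""
    B, S, H, D = q.shape
    valid = topk_idxs >= 0                      # -1 marks a masked slot
    safe = topk_idxs.clamp(min=0)               # any in-range index; masked below
    batch = torch.arange(B, device=kv.device).view(B, 1, 1)
    gathered = kv[batch, safe]                  # [B, S, K_top, D]

    scores = torch.einsum("bshd,bskd->bshk", q, gathered) * sm_scale
    scores = scores.masked_fill(~valid.unsqueeze(2), float("-inf"))

    # Assumption: the sink is one extra logit per head in the softmax.
    sink = sinks.to(scores.dtype).view(1, 1, H, 1).expand(B, S, H, 1)
    probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)[..., :-1]
    return torch.einsum("bshk,bskd->bshd", probs, gathered)


torch.manual_seed(0)
q = torch.randn(1, 2, 3, 4)
kv = torch.randn(1, 5, 4)
sinks = torch.full((3,), -1e9)                  # effectively disables the sink
topk = torch.tensor([[[4, -1], [-1, -1]]])      # second query is fully masked
out = sparse_mqa_sketch(q, kv, sinks, topk, sm_scale=0.5)
print(out.shape)
```

Under these assumptions a fully masked query yields zeros rather than NaN, because the sink column absorbs all of the probability mass.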

nemo_automodel.components.models.deepseek_v4.optimized_kernels.dense_attention_topk_torch(
q: torch.Tensor,
kv: torch.Tensor,
sinks: torch.Tensor,
topk_idxs: torch.Tensor,
sm_scale: float,
) torch.Tensor#

Dense torch oracle for the Miles top-k sparse-attention contract.
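One way a dense oracle could consume the same top-k contract is to expand the index tensor into a boolean mask over all keys, as sketched below. Whether the actual oracle does this is an assumption; topk_to_dense_mask is a hypothetical helper.

```python
import torch


def topk_to_dense_mask(topk_idxs: torch.Tensor, key_len: int) -> torch.Tensor:
    """Convert [B, S, K_top] indices (-1 = masked) to a [B, S, key_len] bool mask."""
    B, S, _ = topk_idxs.shape
    # Route -1 entries to a dummy trailing slot, then drop that slot.
    slot = torch.where(topk_idxs >= 0, topk_idxs, torch.full_like(topk_idxs, key_len))
    mask = torch.zeros(B, S, key_len + 1, dtype=torch.bool, device=topk_idxs.device)
    mask.scatter_(-1, slot, True)
    return mask[..., :key_len]


dense = topk_to_dense_mask(torch.tensor([[[0, 2, -1], [-1, -1, 1]]]), key_len=4)
print(dense[0].tolist())
```

The resulting mask can be applied to a full [B, S, H, K] score matrix, which gives a slow but simple check for a sparse implementation.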

nemo_automodel.components.models.deepseek_v4.optimized_kernels.dsv4_sparse_attention(
q: torch.Tensor,
kv: torch.Tensor,
sinks: torch.Tensor,
topk_idxs: torch.Tensor,
sm_scale: float,
*,
backend: nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4SparseAttentionBackend,
) torch.Tensor#

Run DSV4 sparse attention through Miles TileLang kernels or torch fallback.

nemo_automodel.components.models.deepseek_v4.optimized_kernels.indexer_scores_torch(
q: torch.Tensor,
pooled_kv: torch.Tensor,
weights: torch.Tensor,
softmax_scale: float,
) torch.Tensor#

Torch reference for the Miles DSV4 C4 indexer score kernel.

nemo_automodel.components.models.deepseek_v4.optimized_kernels.extract_indexer_topk_scores_torch(
logits: torch.Tensor,
topk_indices: torch.Tensor,
) torch.Tensor#

Extract top-k score values, masking -1 entries with -inf.
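The behavior described here can be sketched with torch.gather. The sketch assumes that logits and topk_indices share all leading dimensions and that the indices address the last dimension of logits; extract_topk_scores_sketch is a hypothetical name.

```python
import torch


def extract_topk_scores_sketch(logits: torch.Tensor, topk_indices: torch.Tensor) -> torch.Tensor:
    """Gather logits at topk_indices along the last dim; -1 entries become -inf."""
    valid = topk_indices >= 0
    # Clamp so gather never sees a negative index; those slots are overwritten next.
    gathered = torch.gather(logits, -1, topk_indices.clamp(min=0))
    return gathered.masked_fill(~valid, float("-inf"))


logits = torch.tensor([[[1.0, 2.0, 3.0]]])
picked = extract_topk_scores_sketch(logits, torch.tensor([[[2, -1]]]))
print(picked)
```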

nemo_automodel.components.models.deepseek_v4.optimized_kernels.dsv4_indexer_scores(
q: torch.Tensor,
pooled_kv: torch.Tensor,
weights: torch.Tensor,
*,
compress_ratio: int,
softmax_scale: float,
backend: nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4IndexerBackend,
) torch.Tensor#

Run DSV4 C4 indexer scores through Miles TileLang kernels or torch fallback.

nemo_automodel.components.models.deepseek_v4.optimized_kernels.dsv4_indexer_topk_scores(
q: torch.Tensor,
pooled_kv: torch.Tensor,
weights: torch.Tensor,
topk_indices: torch.Tensor,
*,
compress_ratio: int,
softmax_scale: float,
backend: nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4IndexerBackend,
) torch.Tensor#

Run DSV4 C4 top-k indexer scores through Miles autograd kernels or torch fallback.