nemo_automodel.components.models.deepseek_v4.optimized_kernels


Optional DeepSeek V4 optimized kernel dispatch.

The torch implementations below are kept as the numerical reference. Optional TileLang-backed paths are sourced from:

Those packages are imported with safe_import so environments without TileLang still import the model and use the existing torch path.
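The optional-import behaviour described above can be sketched with the standard library alone. The helper name `try_import` and the module name `hypothetical_tilelang_kernels` are made up for illustration; this is not the library's `safe_import`, whose exact signature is not shown on this page.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def try_import(module_name: str) -> Tuple[bool, Optional[ModuleType]]:
    """Hypothetical stand-in for a safe-import helper.

    Returns (True, module) when the import succeeds and (False, None)
    when the package is missing, instead of raising ImportError.
    """
    try:
        return True, importlib.import_module(module_name)
    except ImportError:
        return False, None


# A made-up package name that is not expected to be installed.
HAS_FAST_KERNELS, fast_kernels = try_import("hypothetical_tilelang_kernels")


def active_path() -> str:
    """Report which code path a caller would take in this environment."""
    return "optimized" if HAS_FAST_KERNELS else "reference"
```

Because the failure is captured in a flag rather than raised, importing the surrounding module still succeeds when the optional package is absent, and callers can branch on the flag.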

Module Contents

Classes

| Name | Description |
| --- | --- |
| _Dsv4TileKernelsSinkhorn | TileKernels Sinkhorn wrapper that accepts non-contiguous backward gradients. |

Functions

| Name | Description |
| --- | --- |
| _all_cuda | - |
| _should_use_tilelang | - |
| _tile_kernels_sinkhorn_contiguous_grad | - |
| build_dsv4_sparse_topk_indices | Build Miles-style top-k key indices for DSV4 local-window + compressed KV attention. |
| dense_attention_topk_torch | Dense torch oracle for the Miles top-k sparse-attention contract. |
| dsv4_indexer_scores | Run DSV4 C4 indexer scores through Miles TileLang kernels or torch fallback. |
| dsv4_indexer_topk_scores | Run DSV4 C4 top-k indexer scores through Miles autograd kernels or torch fallback. |
| dsv4_sinkhorn_normalize | Normalize HyperConnection combination logits with torch or TileKernels. |
| dsv4_sparse_attention | Run DSV4 sparse attention through Miles TileLang kernels or torch fallback. |
| extract_indexer_topk_scores_torch | Extract top-k score values, masking -1 entries with -inf. |
| indexer_scores_torch | Torch reference for the Miles DSV4 C4 indexer score kernel. |
| is_dsv4_kernel_available | Return whether the optional TileLang kernel package for name is importable. |
| sinkhorn_normalize_torch | Torch reference for TileKernels MHC Sinkhorn normalization. |
| sparse_attention_torch | Miles sparse MQA torch reference. |

Data

Dsv4IndexerBackend

Dsv4SinkhornBackend

Dsv4SparseAttentionBackend

API

class nemo_automodel.components.models.deepseek_v4.optimized_kernels._Dsv4TileKernelsSinkhorn()

Bases: Function

TileKernels Sinkhorn wrapper that accepts non-contiguous backward gradients.

The upstream high-level wrapper launches the backward kernel with grad_output as-is. DSV4 consumes HC combinations through transposed matmul sites, so autograd can provide a transposed gradient layout. The low-level TileKernels backward kernel requires contiguous row-major inputs.
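To make the layout issue concrete, here is a dependency-free sketch of what "contiguous row-major" means in terms of strides, and why a transposed view fails that check until it is copied. All function names are hypothetical, and the model is simplified to 2-D buffers; it does not reproduce the TileKernels or torch implementation.

```python
from typing import List, Sequence, Tuple


def row_major_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Strides (in elements) of a densely packed row-major array."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_row_major(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """True when the strides match the dense row-major layout."""
    return tuple(strides) == row_major_strides(shape)


def transposed(shape: Sequence[int], strides: Sequence[int]):
    """A transpose is a view: same buffer, reversed shape and strides."""
    return tuple(reversed(shape)), tuple(reversed(strides))


def to_row_major(buffer: List[float], shape, strides) -> List[float]:
    """Copy a strided 2-D view into a new dense row-major buffer."""
    rows, cols = shape
    return [buffer[r * strides[0] + c * strides[1]]
            for r in range(rows) for c in range(cols)]
```

A gradient that arrives as such a transposed view would need the equivalent of `to_row_major` (in torch, a `.contiguous()` call) before being handed to a kernel that indexes memory as dense row-major.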

nemo_automodel.components.models.deepseek_v4.optimized_kernels._Dsv4TileKernelsSinkhorn.backward(
ctx: torch.autograd.function.FunctionCtx,
grad_output: torch.Tensor
) -> tuple[torch.Tensor, None, None]
staticmethod
nemo_automodel.components.models.deepseek_v4.optimized_kernels._Dsv4TileKernelsSinkhorn.forward(
ctx: torch.autograd.function.FunctionCtx,
x: torch.Tensor,
repeat: int,
eps: float
) -> torch.Tensor
staticmethod
nemo_automodel.components.models.deepseek_v4.optimized_kernels._all_cuda(
*tensors: torch.Tensor
) -> bool
nemo_automodel.components.models.deepseek_v4.optimized_kernels._should_use_tilelang(
backend: str,
available: bool,
kernel_name: str,
tensors: tuple[torch.Tensor, ...],
require_bf16: bool = False
) -> bool
nemo_automodel.components.models.deepseek_v4.optimized_kernels._tile_kernels_sinkhorn_contiguous_grad(
x: torch.Tensor,
repeat: int,
eps: float
) -> torch.Tensor
nemo_automodel.components.models.deepseek_v4.optimized_kernels.build_dsv4_sparse_topk_indices(
batch_size: int,
seq_len: int,
key_len: int,
window_size: int,
device: torch.device,
attention_mask: torch.Tensor | None = None,
compress_ratio: int = 0,
compressed_topk: torch.Tensor | None = None,
n_pooled: int = 0,
vanilla_key_len: int | None = None,
q_positions: torch.Tensor | None = None
) -> torch.Tensor

Build Miles-style top-k key indices for DSV4 local-window + compressed KV attention.
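As a rough illustration of the local-window half of this index layout, the sketch below lists, for each query position, the causal keys inside a sliding window and pads unused slots with -1. The ordering of indices, the position of the padding, and the omission of the compressed-KV entries are all assumptions made for brevity; the function name is hypothetical.

```python
from typing import List


def local_window_indices(seq_len: int, window_size: int) -> List[List[int]]:
    """Per-query key indices for a causal sliding window, padded with -1.

    Row t holds the keys in [t - window_size + 1, t], clipped at 0.
    Early positions see fewer keys, so their rows end in -1 padding.
    """
    rows = []
    for t in range(seq_len):
        start = max(0, t - window_size + 1)
        keys = list(range(start, t + 1))
        rows.append(keys + [-1] * (window_size - len(keys)))
    return rows
```

For `seq_len=4` and `window_size=2` this yields `[[0, -1], [0, 1], [1, 2], [2, 3]]`. Every row has the same width, which is what lets the result be stored as a rectangular index tensor.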

nemo_automodel.components.models.deepseek_v4.optimized_kernels.dense_attention_topk_torch(
q: torch.Tensor,
kv: torch.Tensor,
sinks: torch.Tensor,
topk_idxs: torch.Tensor,
sm_scale: float
) -> torch.Tensor

Dense torch oracle for the Miles top-k sparse-attention contract.

nemo_automodel.components.models.deepseek_v4.optimized_kernels.dsv4_indexer_scores(
q: torch.Tensor,
pooled_kv: torch.Tensor,
weights: torch.Tensor,
compress_ratio: int,
softmax_scale: float,
backend: nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4IndexerBackend,
query_start: int = 0,
query_total_len: int | None = None
) -> torch.Tensor

Run DSV4 C4 indexer scores through Miles TileLang kernels or torch fallback.

nemo_automodel.components.models.deepseek_v4.optimized_kernels.dsv4_indexer_topk_scores(
q: torch.Tensor,
pooled_kv: torch.Tensor,
weights: torch.Tensor,
topk_indices: torch.Tensor,
compress_ratio: int,
softmax_scale: float,
backend: nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4IndexerBackend
) -> torch.Tensor

Run DSV4 C4 top-k indexer scores through Miles autograd kernels or torch fallback.

nemo_automodel.components.models.deepseek_v4.optimized_kernels.dsv4_sinkhorn_normalize(
x: torch.Tensor,
backend: nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4SinkhornBackend,
repeat: int,
eps: float
) -> torch.Tensor

Normalize HyperConnection combination logits with torch or TileKernels.

nemo_automodel.components.models.deepseek_v4.optimized_kernels.dsv4_sparse_attention(
q: torch.Tensor,
kv: torch.Tensor,
sinks: torch.Tensor,
topk_idxs: torch.Tensor,
sm_scale: float,
backend: nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4SparseAttentionBackend
) -> torch.Tensor

Run DSV4 sparse attention through Miles TileLang kernels or torch fallback.

nemo_automodel.components.models.deepseek_v4.optimized_kernels.extract_indexer_topk_scores_torch(
logits: torch.Tensor,
topk_indices: torch.Tensor
) -> torch.Tensor

Extract top-k score values, masking -1 entries with -inf.
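The masking rule can be shown on a single row using plain Python lists in place of tensors. The function name is hypothetical and the sketch covers only the stated contract: gather by index, with -1 mapped to -inf.

```python
import math
from typing import List


def extract_topk_scores(logits: List[float], topk_indices: List[int]) -> List[float]:
    """Gather logits at the given indices; a -1 index yields -inf.

    The explicit check matters: a naive gather with index -1 would
    silently return the last element instead of a masked value.
    """
    return [logits[i] if i >= 0 else -math.inf for i in topk_indices]
```

Using -inf for masked slots means they contribute zero weight to any softmax taken over the extracted scores.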

nemo_automodel.components.models.deepseek_v4.optimized_kernels.indexer_scores_torch(
q: torch.Tensor,
pooled_kv: torch.Tensor,
weights: torch.Tensor,
softmax_scale: float
) -> torch.Tensor

Torch reference for the Miles DSV4 C4 indexer score kernel.
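This page does not state the score formula. The sketch below assumes a common form for sparse-attention indexers: a per-head ReLU of the query-key dot product, combined by per-head weights and scaled. Treat the formula, the argument layout, and the function name as assumptions rather than the reference implementation.

```python
from typing import List


def indexer_scores_one_query(
    q_heads: List[List[float]],    # [H][D] query vectors for one position
    pooled_kv: List[List[float]],  # [K][D] pooled key vectors
    head_weights: List[float],     # [H] per-head mixing weights
    softmax_scale: float,
) -> List[float]:
    """Assumed form: score[k] = scale * sum_h w[h] * relu(q[h] . kv[k])."""
    scores = []
    for key in pooled_kv:
        total = 0.0
        for q_vec, weight in zip(q_heads, head_weights):
            dot = sum(a * b for a, b in zip(q_vec, key))
            total += weight * max(dot, 0.0)  # ReLU drops negative matches
        scores.append(softmax_scale * total)
    return scores
```

Under this assumption, each key gets one scalar score per query position, so a top-k over the key axis selects which pooled entries the attention step reads.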

nemo_automodel.components.models.deepseek_v4.optimized_kernels.is_dsv4_kernel_available(
name: typing.Literal['sinkhorn', 'sparse_attn', 'indexer']
) -> bool

Return whether the optional TileLang kernel package for name is importable.

nemo_automodel.components.models.deepseek_v4.optimized_kernels.sinkhorn_normalize_torch(
x: torch.Tensor,
repeat: int,
eps: float
) -> torch.Tensor

Torch reference for TileKernels MHC Sinkhorn normalization.
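For orientation, here is a generic Sinkhorn iteration in plain Python: exponentiate the logits, then alternately normalize rows and columns `repeat` times, with `eps` guarding the divisions. Whether the reference exponentiates first, which axis it normalizes first, and where it applies `eps` are assumptions here; the function name is hypothetical.

```python
import math
from typing import List


def sinkhorn_normalize(logits: List[List[float]], repeat: int, eps: float) -> List[List[float]]:
    """Push exp(logits) toward a doubly stochastic matrix.

    Each iteration divides every row by its sum, then every column by
    its sum. The result approaches row and column sums of 1.
    """
    m = [[math.exp(v) for v in row] for row in logits]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(repeat):
        # Row step: each row sums to (almost) 1.
        m = [[v / (sum(row) + eps) for v in row] for row in m]
        # Column step: each column sums to (almost) 1.
        col_sums = [sum(m[r][c] for r in range(n_rows)) for c in range(n_cols)]
        m = [[m[r][c] / (col_sums[c] + eps) for c in range(n_cols)]
             for r in range(n_rows)]
    return m
```

With enough iterations on a square input, both the row sums and the column sums are close to 1, which is the property a mixing matrix needs in order to redistribute streams without changing their total mass.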

nemo_automodel.components.models.deepseek_v4.optimized_kernels.sparse_attention_torch(
q: torch.Tensor,
kv: torch.Tensor,
sinks: torch.Tensor,
topk_idxs: torch.Tensor,
sm_scale: float
) -> torch.Tensor

Miles sparse MQA torch reference.

Parameters:

q (torch.Tensor): Query tensor with shape [B, S, H, D].

kv (torch.Tensor): Single-head KV tensor with shape [B, K, D].

sinks (torch.Tensor): Per-head attention sink logits with shape [H].

topk_idxs (torch.Tensor): Key indices with shape [B, S, K_top]; -1 masks an entry.

sm_scale (float): Attention scaling factor.
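The parameter contract above can be illustrated for one query position and one head. The sketch assumes that the single-head `kv` row serves as both key and value, and that the sink logit adds to the softmax denominator without contributing a value vector. Neither detail is confirmed on this page, and the function name is hypothetical.

```python
import math
from typing import List


def sparse_attention_one_head(
    q_vec: List[float],     # [D] query for one position and head
    kv: List[List[float]],  # [K][D] shared key/value rows (assumed)
    sink_logit: float,      # this head's sink logit
    topk_idx: List[int],    # selected key indices; -1 means masked
    sm_scale: float,
) -> List[float]:
    """Softmax attention over the selected keys plus a value-less sink."""
    selected = [i for i in topk_idx if i >= 0]
    logits = [sm_scale * sum(a * b for a, b in zip(q_vec, kv[i])) for i in selected]
    peak = max(logits + [sink_logit])  # subtract the max for stability
    weights = [math.exp(l - peak) for l in logits]
    denom = sum(weights) + math.exp(sink_logit - peak)
    out = [0.0] * len(q_vec)
    for w, i in zip(weights, selected):
        for d in range(len(q_vec)):
            out[d] += (w / denom) * kv[i][d]
    return out
```

Under these assumptions the attention weights over real keys sum to less than 1, with the remainder absorbed by the sink, and a row whose indices are all -1 returns a zero vector rather than dividing by zero.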

nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4IndexerBackend = Literal['torch', 'tilelang', 'auto']
nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4SinkhornBackend = Literal['torch', 'tilelang', 'auto']
nemo_automodel.components.models.deepseek_v4.optimized_kernels.Dsv4SparseAttentionBackend = Literal['torch', 'sparse_torch', 'tilelang', 'auto']
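One plausible way to resolve an `'auto'` backend value is sketched below. The rule shown is an assumption, not documented behaviour: prefer `'tilelang'` only when the kernel package is importable and the inputs are on CUDA, and raise when `'tilelang'` is requested explicitly but is unavailable. The actual `_should_use_tilelang` may behave differently, for example by also checking dtype or by falling back silently. The function name is hypothetical.

```python
def resolve_backend(backend: str, kernel_available: bool, on_cuda: bool) -> str:
    """Map a requested backend to the one that would actually run.

    Assumed policy:
      - 'auto' picks 'tilelang' when usable, otherwise 'torch'.
      - explicit 'tilelang' without the kernel package is an error.
      - any other explicit value is returned unchanged.
    """
    if backend == "auto":
        return "tilelang" if (kernel_available and on_cuda) else "torch"
    if backend == "tilelang" and not kernel_available:
        raise RuntimeError("tilelang backend requested but kernels are not importable")
    return backend
```

Keeping the decision in one small function makes the fallback policy easy to test on machines without a GPU or the optional packages.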