nemo_automodel.components.models.deepseek_v4.optimized_kernels
Optional DeepSeek V4 optimized kernel dispatch.
The torch implementations below are kept as the numerical reference. Optional TileLang-backed paths are sourced from:
- Sinkhorn: imported from DeepSeek TileKernels tile_kernels.modeling.mhc.ops.sinkhorn_normalize. No TileKernels source is vendored in AutoModel.
  Upstream source: https://github.com/deepseek-ai/TileKernels/blob/36d9e45d38e204ebb87e6f6e833821eee0482fe5/tile_kernels/modeling/mhc/ops/sinkhorn.py
  Upstream license: MIT, copyright 2026 DeepSeek.
- Sparse attention and indexer: vendored/adapted Miles DeepSeek V4 ops in nemo_automodel.components.models.deepseek_v4.kernels.
  Upstream source: https://github.com/yueming-yuan/miles/tree/e561465d0b9bbf06188b7a5e2020dc7fd691f732/miles_plugins/models/deepseek_v4/ops
  Upstream license: Apache-2.0, copyright 2025 Zhipu AI.
  See nemo_automodel/components/models/deepseek_v4/kernels/__init__.py for the per-file attribution.
Those packages are imported with safe_import so environments without
TileLang still import the model and use the existing torch path.
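The optional-import behavior described above can be sketched with the standard library alone. This is an illustration, not AutoModel's code: `try_import` and `pick_backend` are hypothetical names, and `try_import` only mimics the idea of safe_import (report availability instead of raising), since the real signature is not shown on this page.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def try_import(name: str) -> Tuple[bool, Optional[ModuleType]]:
    """Hypothetical stand-in for safe_import: never raises on a missing package."""
    try:
        return True, importlib.import_module(name)
    except ImportError:
        return False, None


def pick_backend(package: str) -> str:
    """Return "tilelang" when the optional package imports, else "torch"."""
    available, _module = try_import(package)
    return "tilelang" if available else "torch"


# A package that does not exist falls back to the torch path.
print(pick_backend("definitely_not_installed_pkg_xyz"))
# A stdlib module stands in here for an installed optional package.
print(pick_backend("json"))
```

Resolving availability once at import time keeps the model importable everywhere and leaves the choice of kernel path to the call site.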
API
Bases: Function
TileKernels Sinkhorn wrapper that accepts non-contiguous backward gradients.
The upstream high-level wrapper launches the backward kernel with
grad_output as-is. DSV4 consumes HC combinations through transposed
matmul sites, so autograd can provide a transposed gradient layout. The
low-level TileKernels backward kernel requires contiguous row-major inputs.
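A minimal sketch of the layout issue, assuming the fix is to make the incoming gradient contiguous before the kernel launch. `ContiguousGradDouble` is a hypothetical toy op (y = 2x), not the TileKernels wrapper; the multiplication in `backward` stands in for a kernel that assumes contiguous row-major memory.

```python
import torch


class ContiguousGradDouble(torch.autograd.Function):
    """Toy op (y = 2 * x) whose backward insists on a contiguous gradient."""

    @staticmethod
    def forward(ctx, x):
        return x * 2.0

    @staticmethod
    def backward(ctx, grad_output):
        # Stand-in for a kernel launch that assumes contiguous row-major memory.
        grad_output = grad_output.contiguous()
        assert grad_output.is_contiguous()
        return grad_output * 2.0


x = torch.randn(3, 4, requires_grad=True)
y = ContiguousGradDouble.apply(x)
# Consuming y through a transpose can hand backward a non-contiguous gradient.
w = torch.randn(3, 5)
loss = (y.t() @ w).sum()
loss.backward()
print(x.grad.shape)
```

Calling `.contiguous()` is a no-op when the gradient is already contiguous, so the extra copy is only paid on the transposed path.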
Build Miles-style top-k key indices for DSV4 local-window + compressed KV attention.
Dense torch oracle for the Miles top-k sparse-attention contract.
Run DSV4 C4 indexer scores through Miles TileLang kernels or torch fallback.
Run DSV4 C4 top-k indexer scores through Miles autograd kernels or torch fallback.
Normalize HyperConnection combination logits with torch or TileKernels.
Run DSV4 sparse attention through Miles TileLang kernels or torch fallback.
Extract top-k score values, masking -1 entries with -inf.
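The masking rule above can be sketched in a few lines of torch. The function name, argument names, and shapes here are assumptions for illustration; only the behavior (gather at the top-k indices, -inf where the index is -1) comes from the description.

```python
import torch


def gather_topk_scores(scores: torch.Tensor, topk_idx: torch.Tensor) -> torch.Tensor:
    """Gather scores at topk_idx along the last dim; -1 indices yield -inf."""
    invalid = topk_idx < 0
    # gather rejects negative indices, so point invalid slots at index 0 first.
    safe_idx = topk_idx.clamp(min=0)
    values = scores.gather(-1, safe_idx)
    return values.masked_fill(invalid, float("-inf"))


scores = torch.tensor([[0.5, 2.0, -1.0, 3.0]])
topk_idx = torch.tensor([[3, 1, -1]])
topk_values = gather_topk_scores(scores, topk_idx)
print(topk_values)
```

Filling with -inf means a later softmax assigns the masked slots zero probability.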
Torch reference for the Miles DSV4 C4 indexer score kernel.
Return whether the optional TileLang kernel package for name is importable.
Torch reference for TileKernels MHC Sinkhorn normalization.
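For orientation, a generic Sinkhorn-Knopp normalization in torch looks roughly like the sketch below. It is not the TileKernels contract: the function name, the iteration count, the epsilon, and the row-then-column order are all assumptions chosen for illustration.

```python
import torch


def sinkhorn_sketch(logits: torch.Tensor, n_iters: int = 20, eps: float = 1e-6) -> torch.Tensor:
    """Generic Sinkhorn-Knopp on the last two dims; not the TileKernels contract."""
    # Subtract the per-matrix max before exp for numerical stability.
    m = torch.exp(logits - logits.amax(dim=(-2, -1), keepdim=True))
    for _ in range(n_iters):
        m = m / (m.sum(dim=-1, keepdim=True) + eps)  # rows sum to ~1
        m = m / (m.sum(dim=-2, keepdim=True) + eps)  # columns sum to ~1
    return m


out = sinkhorn_sketch(torch.rand(2, 4, 4))
print(out.sum(dim=-1), out.sum(dim=-2))
```

After enough iterations both the row sums and the column sums approach 1, which is the property a fused kernel would be checked against.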
Miles sparse MQA torch reference.
Parameters:
- Query tensor with shape [B, S, H, D].
- Single-head KV tensor with shape [B, K, D].
- Per-head attention sink logits with shape [H].
- Key indices with shape [B, S, K_top]; -1 masks an entry.
- Attention scaling factor.
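The shapes above suggest a dense computation along the following lines. This is a sketch under stated assumptions, not the Miles reference: the function and argument names are hypothetical, the single KV head is assumed to serve as both key and value, and the sink is assumed to add one extra softmax logit per head that carries no value.

```python
import torch


def sparse_mqa_sketch(q, kv, attn_sink, topk_idxs, scale):
    """q: [B,S,H,D], kv: [B,K,D], attn_sink: [H], topk_idxs: [B,S,K_top] -> [B,S,H,D]."""
    B, S, H, D = q.shape
    invalid = topk_idxs < 0                      # [B,S,K_top]
    safe_idx = topk_idxs.clamp(min=0)
    # Gather the selected KV rows for every query position: [B,S,K_top,D].
    batch = torch.arange(B, device=q.device).view(B, 1, 1)
    sel = kv[batch, safe_idx]
    # Assumption: the single KV head serves as both key and value.
    logits = torch.einsum("bshd,bskd->bshk", q, sel) * scale
    logits = logits.masked_fill(invalid.unsqueeze(2), float("-inf"))
    # Assumption: the sink adds one extra logit per head that carries no value.
    sink = attn_sink.view(1, 1, H, 1).expand(B, S, H, 1)
    probs = torch.softmax(torch.cat([logits, sink], dim=-1), dim=-1)
    return torch.einsum("bshk,bskd->bshd", probs[..., :-1], sel)


B, S, H, D, K = 1, 2, 3, 4, 5
q = torch.randn(B, S, H, D)
kv = torch.randn(B, K, D)
attn_sink = torch.zeros(H)
# Second query position has every entry masked with -1.
topk_idxs = torch.tensor([[[0, 2, -1], [-1, -1, -1]]])
out = sparse_mqa_sketch(q, kv, attn_sink, topk_idxs, D ** -0.5)
print(out.shape)
```

Under these assumptions a fully masked query position puts all probability mass on the sink and returns zeros rather than NaN.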