nemo_automodel.components.models.deepseek_v4.kernels.sparse_attention
Autograd wrapper for vendored Miles DeepSeek V4 sparse-attention kernels.
Attribution:
Upstream project: Miles, https://github.com/yueming-yuan/miles
Upstream revision: e561465d0b9bbf06188b7a5e2020dc7fd691f732, deepseek-v4 branch
Upstream license: Apache-2.0, copyright 2025 Zhipu AI
Original source: https://github.com/yueming-yuan/miles/blob/e561465d0b9bbf06188b7a5e2020dc7fd691f732/miles_plugins/models/deepseek_v4/ops/attention_core.py
Module Contents
Classes
| DeepSeekV4SparseAttention | TileLang sparse MQA attention with custom backward. |
| DeepSeekV4SparseAttentionHeadChunked | TileLang sparse attention with smaller head groups and fp32 KV-grad accumulation. |
Functions
| sparse_attn_tilelang | Run vendored Miles DeepSeek V4 TileLang sparse attention. |
| sparse_attn_tilelang_head_chunked | Run vendored Miles sparse attention in TileLang head chunks. |
API
- class nemo_automodel.components.models.deepseek_v4.kernels.sparse_attention.DeepSeekV4SparseAttention
Bases: torch.autograd.Function
TileLang sparse MQA attention with custom backward.
- static forward(
    ctx,
    q: torch.Tensor,
    kv: torch.Tensor,
    attn_sink: torch.Tensor,
    topk_idxs: torch.Tensor,
    sm_scale: float | None = None,
)
Run the vendored sparse attention forward kernel.
- static backward(
    ctx,
    grad_output: torch.Tensor,
)
Run the vendored sparse attention backward kernel.
- nemo_automodel.components.models.deepseek_v4.kernels.sparse_attention.sparse_attn_tilelang(
    q: torch.Tensor,
    kv: torch.Tensor,
    attn_sink: torch.Tensor,
    topk_idxs: torch.Tensor,
    sm_scale: float | None = None,
)
Run vendored Miles DeepSeek V4 TileLang sparse attention.
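The page does not document tensor shapes or the exact math, so the following is only a plain-Python sketch of what top-k sparse MQA attention with an attention sink commonly computes, for a single head and a single query position. Everything in it is an assumption rather than the vendored kernel's contract: the helper name `sparse_attn_reference` is hypothetical, the shared `kv` row is used as both key and value, `attn_sink` is treated as an extra logit that enters the softmax denominator but contributes no value, negative or out-of-range indices are skipped, and `sm_scale=None` falls back to `1/sqrt(d)`.

```python
import math


def sparse_attn_reference(q, kv, attn_sink, topk_idxs, sm_scale=None):
    """Hypothetical reference for one head and one query position.

    q:          list[float], query vector of length d
    kv:         list[list[float]], one shared vector per KV position (MQA);
                used as both key and value here (assumption)
    attn_sink:  float, extra logit in the softmax denominator (assumption)
    topk_idxs:  list[int], selected KV positions; invalid entries are skipped
                (assumption)
    """
    d = len(q)
    if sm_scale is None:
        sm_scale = 1.0 / math.sqrt(d)  # assumed default, not confirmed by the docs

    # Keep only indices that point at a real KV position.
    idxs = [i for i in topk_idxs if 0 <= i < len(kv)]

    # Scaled dot-product logits against the selected KV rows only.
    logits = [sm_scale * sum(a * b for a, b in zip(q, kv[i])) for i in idxs]

    # Numerically stable softmax whose denominator also includes the sink logit.
    m = max(logits + [attn_sink])
    weights = [math.exp(l - m) for l in logits]
    denom = sum(weights) + math.exp(attn_sink - m)

    # Weighted sum of the selected KV rows; the sink adds no value vector.
    out = [0.0] * d
    for w, i in zip(weights, idxs):
        for j in range(d):
            out[j] += (w / denom) * kv[i][j]
    return out
```

Under these assumptions the sink only absorbs probability mass: a very negative `attn_sink` recovers ordinary softmax over the selected positions, while a large one shrinks the output toward zero.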
- class nemo_automodel.components.models.deepseek_v4.kernels.sparse_attention.DeepSeekV4SparseAttentionHeadChunked
Bases: torch.autograd.Function
TileLang sparse attention with smaller head groups and fp32 KV-grad accumulation.
- static forward(
    ctx,
    q: torch.Tensor,
    kv: torch.Tensor,
    attn_sink: torch.Tensor,
    topk_idxs: torch.Tensor,
    max_heads_per_kernel: int,
    sm_scale: float | None = None,
)
Run the vendored sparse attention forward kernel over head chunks.
- static backward(
    ctx,
    grad_output: torch.Tensor,
)
Run chunked backward and accumulate shared KV gradients in fp32.
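Because the KV tensor is shared across heads in MQA, each head chunk yields a partial KV gradient and the partials have to be summed. The sketch below illustrates why the running sum might be kept in fp32 rather than in a lower-precision dtype. It is a standard-library emulation, not the module's code: the assumption that the low-precision dtype is bfloat16 is mine, and the helpers `round_bf16`, `round_fp32`, and `accumulate` are hypothetical names.

```python
import struct


def round_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_bf16(x):
    """Emulate bfloat16 rounding (nearest, ties to even) for finite inputs."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the dropped 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(partials, rounder):
    """Sum partial gradients, rounding the running total after every add."""
    acc = 0.0
    for p in partials:
        acc = rounder(acc + p)
    return acc


# One large partial followed by eight small ones, all representable in bf16.
partials = [1.0] + [round_bf16(0.001)] * 8

# Accumulating directly in bf16: each small partial is rounded away.
sum_bf16 = accumulate(partials, round_bf16)

# Accumulating in fp32 and rounding to bf16 once at the end.
sum_fp32_then_bf16 = round_bf16(accumulate(partials, round_fp32))
```

In this toy case the bf16 running sum stays at 1.0, while the fp32 running sum keeps the small contributions and rounds to 1.0078125 at the end.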
- nemo_automodel.components.models.deepseek_v4.kernels.sparse_attention.sparse_attn_tilelang_head_chunked(
    q: torch.Tensor,
    kv: torch.Tensor,
    attn_sink: torch.Tensor,
    topk_idxs: torch.Tensor,
    max_heads_per_kernel: int,
    sm_scale: float | None = None,
)
Run vendored Miles sparse attention in TileLang head chunks.
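The head-chunking idea can be sketched without any tensor library: split the head axis into groups of at most `max_heads_per_kernel`, run a kernel per group, and concatenate the per-head outputs in order. The helpers `head_chunks` and `run_head_chunked` are hypothetical, and allowing a shorter final chunk is an assumption; the vendored kernels may require the head count to divide evenly.

```python
def head_chunks(num_heads, max_heads_per_kernel):
    """Return (start, end) head ranges of at most max_heads_per_kernel heads."""
    if max_heads_per_kernel <= 0:
        raise ValueError("max_heads_per_kernel must be positive")
    return [
        (start, min(start + max_heads_per_kernel, num_heads))
        for start in range(0, num_heads, max_heads_per_kernel)
    ]


def run_head_chunked(per_head_inputs, max_heads_per_kernel, kernel):
    """Apply `kernel` to each head chunk and concatenate results in head order.

    `kernel` stands in for a per-chunk attention call: it receives a list of
    per-head inputs and must return one output per head.
    """
    outputs = []
    for start, end in head_chunks(len(per_head_inputs), max_heads_per_kernel):
        outputs.extend(kernel(per_head_inputs[start:end]))
    return outputs
```

For example, `head_chunks(64, 16)` yields four ranges of 16 heads each, so a kernel that only supports small head groups would be launched four times instead of once.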