# nemo_automodel.components.speculative.dflash.draft_qwen3

DFlash draft model (Qwen3-style).

Ported from SpecForge's `specforge/modeling/draft/dflash.py`. DFlash drafts a
whole block of `block_size` tokens in parallel: the block's first position
holds the real anchor token and the rest are `MASK` tokens, and the draft
predicts the whole block in a single non-causal forward conditioned on the
target model's context hidden states.
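
As a rough illustration of the block layout described above, the sketch below builds the token ids for one draft block. `MASK_ID` and `make_draft_block` are hypothetical names chosen for this example and are not part of the module; the real `MASK` id would come from the tokenizer or config.

```python
MASK_ID = -1  # hypothetical placeholder; not the id used by the library


def make_draft_block(anchor_token: int, block_size: int, mask_id: int = MASK_ID) -> list[int]:
    """Return [anchor, MASK, MASK, ...] with exactly block_size entries."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    # Position 0 holds the real anchor token; the remaining positions are
    # MASK placeholders that the draft fills in with one forward pass.
    return [anchor_token] + [mask_id] * (block_size - 1)


block = make_draft_block(anchor_token=1234, block_size=4)
print(block)  # [1234, -1, -1, -1]
```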

The draft attention is therefore **not causal**: a draft block's queries attend
to (a) the projected target-hidden context strictly before the block's anchor
and (b) the other (noise) tokens of the same block, in both directions. The
attention mask that enforces this is built by the trainer wrapper in
`nemo_automodel.components.speculative.dflash.core`.
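
The visibility rule above can be sketched as a boolean mask over keys laid out as `[context | noise]`. This is only an illustration under stated assumptions: `build_dflash_mask` is a hypothetical helper, a query is allowed to see its own position, and the mask actually built in `core` may differ in layout, dtype (for example an additive float mask), and batching.

```python
def build_dflash_mask(ctx_len: int, anchors: list[int], block_size: int) -> list[list[bool]]:
    """Rows are noise queries, columns are keys laid out as [context | noise].

    True means the query may attend to that key. anchors[b] is the context
    position of block b's anchor token.
    """
    num_noise = len(anchors) * block_size
    mask = []
    for q in range(num_noise):
        q_block = q // block_size
        # (a) context keys strictly before this block's anchor
        row = [k < anchors[q_block] for k in range(ctx_len)]
        # (b) noise keys of the same block, visible in both directions
        row += [k // block_size == q_block for k in range(num_noise)]
        mask.append(row)
    return mask


mask = build_dflash_mask(ctx_len=4, anchors=[2, 3], block_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# 11001100
# 11001100
# 11100011
# 11100011
```

Each row has no lower-triangular structure: within a block every query sees every block position, which is what makes the forward pass non-causal.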

## Module Contents

### Classes

| Name                                                                                                           | Description                                                                   |
| -------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| [`Qwen3DFlashAttention`](#nemo_automodel-components-speculative-dflash-draft_qwen3-Qwen3DFlashAttention)       | Non-causal attention whose keys/values are `[context \| noise-block]`.        |
| [`Qwen3DFlashDecoderLayer`](#nemo_automodel-components-speculative-dflash-draft_qwen3-Qwen3DFlashDecoderLayer) | A DFlash decoder block: non-causal attention over `[context \| noise]` + MLP. |
| [`Qwen3DFlashDraftModel`](#nemo_automodel-components-speculative-dflash-draft_qwen3-Qwen3DFlashDraftModel)     | DFlash draft model: a small non-causal Qwen3 stack over `[context \| noise]`. |

### Functions

| Name                                                                                                           | Description                                                                  |
| -------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
| [`apply_rotary_pos_emb`](#nemo_automodel-components-speculative-dflash-draft_qwen3-apply_rotary_pos_emb)       | Apply RoPE where queries (draft block) are a suffix of the key positions.    |
| [`build_target_layer_ids`](#nemo_automodel-components-speculative-dflash-draft_qwen3-build_target_layer_ids)   | Pick `num_draft_layers` target layers spread across the target's depth.      |
| [`extract_context_feature`](#nemo_automodel-components-speculative-dflash-draft_qwen3-extract_context_feature) | Concatenate the selected target layers' hidden states along the feature dim. |
| [`sample`](#nemo_automodel-components-speculative-dflash-draft_qwen3-sample)                                   | Greedy (temperature \~ 0) or temperature sampling over the last dim.         |

### API

```python
class nemo_automodel.components.speculative.dflash.draft_qwen3.Qwen3DFlashAttention(
    config: transformers.models.qwen3.configuration_qwen3.Qwen3Config,
    layer_idx: int
)
```

**Bases:** `Module`

Non-causal attention whose keys/values are `[context | noise-block]`.

Queries come from the draft (noise) tokens only; keys and values are the
concatenation of the projected target-hidden context and the noise tokens.
The bidirectional/block structure is supplied entirely by `attention_mask`.

```python
nemo_automodel.components.speculative.dflash.draft_qwen3.Qwen3DFlashAttention.forward(
    hidden_states: torch.Tensor,
    target_hidden: torch.Tensor,
    position_embeddings: typing.Tuple[torch.Tensor, torch.Tensor],
    attention_mask: typing.Optional[torch.Tensor],
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    kwargs = {}
) -> typing.Tuple[torch.Tensor, typing.Optional[torch.Tensor]]
```

```python
class nemo_automodel.components.speculative.dflash.draft_qwen3.Qwen3DFlashDecoderLayer(
    config: transformers.models.qwen3.configuration_qwen3.Qwen3Config,
    layer_idx: int
)
```

**Bases:** `GradientCheckpointingLayer`

A DFlash decoder block: non-causal attention over `[context | noise]` + MLP.

```python
nemo_automodel.components.speculative.dflash.draft_qwen3.Qwen3DFlashDecoderLayer.forward(
    target_hidden: typing.Optional[torch.Tensor] = None,
    hidden_states: typing.Optional[torch.Tensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_value: typing.Optional[transformers.cache_utils.Cache] = None,
    use_cache: typing.Optional[bool] = False,
    cache_position: typing.Optional[torch.LongTensor] = None,
    position_embeddings: typing.Optional[typing.Tuple[torch.Tensor, torch.Tensor]] = None,
    kwargs = {}
) -> torch.Tensor
```

```python
class nemo_automodel.components.speculative.dflash.draft_qwen3.Qwen3DFlashDraftModel(
    config
)
```

**Bases:** `Qwen3PreTrainedModel`

DFlash draft model: a small non-causal Qwen3 stack over `[context | noise]`.

```python
nemo_automodel.components.speculative.dflash.draft_qwen3.Qwen3DFlashDraftModel.forward(
    position_ids: torch.LongTensor,
    attention_mask: typing.Optional[torch.Tensor] = None,
    noise_embedding: typing.Optional[torch.Tensor] = None,
    target_hidden: typing.Optional[torch.Tensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    use_cache: bool = False,
    kwargs = {}
) -> torch.Tensor
```

```python
nemo_automodel.components.speculative.dflash.draft_qwen3.Qwen3DFlashDraftModel.spec_generate(
    target: torch.nn.Module,
    input_ids: torch.LongTensor,
    max_new_tokens: int,
    stop_token_ids: typing.Optional[list[int]],
    temperature: float
) -> torch.LongTensor
```

Block-parallel speculative decoding: draft a block, verify with the target, accept the matching prefix.
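
The "accept the matching prefix" step can be sketched as a comparison between the drafted tokens and the tokens the target produces at the same positions. `accept_matching_prefix` is a hypothetical helper for illustration; how `spec_generate` handles the first mismatching position (for instance, whether the target's token there becomes the next anchor) is not specified on this page and is left out.

```python
def accept_matching_prefix(draft_tokens: list[int], target_tokens: list[int]) -> int:
    """Count how many leading draft tokens agree with the target's tokens."""
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break  # everything from the first disagreement onward is discarded
        accepted += 1
    return accepted


draft = [11, 22, 33, 44]
target = [11, 22, 99, 44]
n = accept_matching_prefix(draft, target)
print(n, draft[:n])  # 2 [11, 22]
```

Note that the final `44` is not accepted even though it matches, because acceptance stops at the first mismatch.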

```python
nemo_automodel.components.speculative.dflash.draft_qwen3.apply_rotary_pos_emb(
    q,
    k,
    cos,
    sin,
    unsqueeze_dim = 1
)
```

Apply RoPE where queries (draft block) are a suffix of the key positions.

The keys span `[context | noise-block]` while the queries are only the
noise block, so `q` is rotated with the trailing `q_len` slice of the
rotary tables and `k` with the full table.
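
To make the suffix slicing concrete, here is a minimal single-head sketch that uses plain Python lists in place of tensors. All helper names are hypothetical, the rotate-half RoPE convention is an assumption, and the `unsqueeze_dim` broadcasting of the real function is omitted.

```python
import math


def rotary_tables(num_positions: int, head_dim: int, base: float = 10000.0):
    """Build cos/sin tables with one row of length head_dim per position."""
    half = head_dim // 2
    inv_freq = [base ** (-i / half) for i in range(half)]
    cos, sin = [], []
    for pos in range(num_positions):
        angles = [pos * f for f in inv_freq] * 2  # repeat to cover head_dim
        cos.append([math.cos(a) for a in angles])
        sin.append([math.sin(a) for a in angles])
    return cos, sin


def rotate_half(x: list[float]) -> list[float]:
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def rope(vectors, cos, sin):
    """Rotate each vector with the table row at the same index."""
    out = []
    for x, c, s in zip(vectors, cos, sin):
        r = rotate_half(x)
        out.append([xi * ci + ri * si for xi, ci, ri, si in zip(x, c, r, s)])
    return out


def apply_rope_suffix_queries(q, k, cos, sin):
    q_len = len(q)
    q_rot = rope(q, cos[-q_len:], sin[-q_len:])  # trailing q_len rows only
    k_rot = rope(k, cos, sin)                    # full table
    return q_rot, k_rot


# 5 key positions ([3 context | 2 noise]); queries cover only the 2 noise positions.
k = [[float(p + d) for d in range(4)] for p in range(5)]
q = [row[:] for row in k[-2:]]
cos, sin = rotary_tables(num_positions=5, head_dim=4)
q_rot, k_rot = apply_rope_suffix_queries(q, k, cos, sin)
print(q_rot == k_rot[-2:])  # True
```

The check at the end relies on `q` being a copy of the last two key vectors: because both are rotated with the same table rows, the results coincide, which confirms that the queries received the positions of the key suffix.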

```python
nemo_automodel.components.speculative.dflash.draft_qwen3.build_target_layer_ids(
    num_target_layers: int,
    num_draft_layers: int
) -> list[int]
```

Pick `num_draft_layers` target layers spread across the target's depth.
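
This page does not state the spacing rule, so the following is only one plausible way to spread indices across the target's depth. `spread_layer_ids` is a hypothetical stand-in; the endpoints and rounding used by `build_target_layer_ids` may well differ.

```python
def spread_layer_ids(num_target_layers: int, num_draft_layers: int) -> list[int]:
    """Hypothetical even spacing over layer indices 0 .. num_target_layers - 1."""
    if not 1 <= num_draft_layers <= num_target_layers:
        raise ValueError("need 1 <= num_draft_layers <= num_target_layers")
    if num_draft_layers == 1:
        return [num_target_layers // 2]  # assumption: a single pick lands mid-depth
    last = num_target_layers - 1
    # Integer arithmetic keeps the result deterministic and strictly increasing.
    return [(i * last) // (num_draft_layers - 1) for i in range(num_draft_layers)]


print(spread_layer_ids(32, 4))  # [0, 10, 20, 31]
```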

```python
nemo_automodel.components.speculative.dflash.draft_qwen3.extract_context_feature(
    hidden_states: list[torch.Tensor],
    layer_ids: list[int]
) -> torch.Tensor
```

Concatenate the selected target layers' hidden states along the feature dim.

`hidden_states` follows HF's `output_hidden_states` convention where
index 0 is the embedding output, so layer `i`'s output is at index
`i + 1`.
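
The `i + 1` offset is easy to get wrong, so here is a small sketch using nested lists in place of tensors (no batch dimension). `concat_selected_layers` is a hypothetical name; with real tensors the same idea would correspond to a `torch.cat` along the last dimension.

```python
def concat_selected_layers(hidden_states, layer_ids):
    """hidden_states[e][t] is the feature vector of token t at entry e.

    Entry 0 is the embedding output, so layer i's output lives at entry i + 1.
    """
    selected = [hidden_states[i + 1] for i in layer_ids]
    seq_len = len(selected[0])
    # Join the chosen layers' features for each token along the feature axis.
    return [[v for layer in selected for v in layer[t]] for t in range(seq_len)]


hidden_states = [
    [[0.0, 0.1], [0.2, 0.3]],  # entry 0: embedding output
    [[1.0, 1.1], [1.2, 1.3]],  # entry 1: layer 0 output
    [[2.0, 2.1], [2.2, 2.3]],  # entry 2: layer 1 output
]
features = concat_selected_layers(hidden_states, layer_ids=[0, 1])
print(features)  # [[1.0, 1.1, 2.0, 2.1], [1.2, 1.3, 2.2, 2.3]]
```

The resulting feature width is `len(layer_ids)` times the hidden size, and the embedding output at entry 0 is never selected by a non-negative layer id.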

```python
nemo_automodel.components.speculative.dflash.draft_qwen3.sample(
    logits: torch.Tensor,
    temperature: float = 0.0
) -> torch.Tensor
```

Greedy (temperature \~ 0) or temperature sampling over the last dim.
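
A minimal sketch of the two modes, operating on a single 1-D list of logits rather than the last dimension of a tensor. `sample_token` is a hypothetical name, and the `1e-5` threshold used to decide when temperature counts as "about zero" is an assumption, not the library's value.

```python
import math
import random


def sample_token(logits, temperature=0.0, rng=None):
    """Return an index into logits: argmax if temperature ~ 0, else a random draw."""
    if temperature < 1e-5:  # assumed cutoff for the greedy path
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(x - peak) for x in scaled]
    rng = rng or random
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0]
print(sample_token(logits))  # 1
token = sample_token(logits, temperature=0.8, rng=random.Random(0))
print(0 <= token < len(logits))  # True
```

Lower temperatures sharpen the distribution toward the argmax, and the greedy branch avoids dividing by a temperature of zero.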