# nemo_automodel.components.models.qwen3_5.model

Qwen3.5 dense causal LM with Megatron-style MTP support.

## Module Contents

### Classes

| Name                                                                                                                       | Description                                                            |
| -------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| [`Fp32SafeQwen3_5TextRotaryEmbedding`](#nemo_automodel-components-models-qwen3_5-model-Fp32SafeQwen3_5TextRotaryEmbedding) | Ensure inv\_freq stays in float32 across `.to(dtype)` calls.           |
| [`Qwen3_5CausalLMOutputWithPast`](#nemo_automodel-components-models-qwen3_5-model-Qwen3_5CausalLMOutputWithPast)           | Qwen3.5 causal-LM output extended with MTP auxiliary hidden states.    |
| [`Qwen3_5DenseBlock`](#nemo_automodel-components-models-qwen3_5-model-Qwen3_5DenseBlock)                                   | Qwen3.5 dense decoder block on top of the Qwen3-Next `Block`.          |
| [`Qwen3_5DenseMTPSublayer`](#nemo_automodel-components-models-qwen3_5-model-Qwen3_5DenseMTPSublayer)                       | One full-attention Qwen3.5 dense MTP sublayer.                         |
| [`Qwen3_5DenseTextBackbone`](#nemo_automodel-components-models-qwen3_5-model-Qwen3_5DenseTextBackbone)                     | Qwen3.5 dense text decoder rebuilt on the Qwen3-Next `Block`.          |
| [`Qwen3_5ForCausalLM`](#nemo_automodel-components-models-qwen3_5-model-Qwen3_5ForCausalLM)                                 | Qwen3.5 dense causal LM with optional Megatron-style MTP head.         |
| [`Qwen3_5ForConditionalGeneration`](#nemo_automodel-components-models-qwen3_5-model-Qwen3_5ForConditionalGeneration)       | Qwen3.5/Qwen3.6 dense VLM with optional Megatron-style MTP head.       |
| [`Qwen3_5Model`](#nemo_automodel-components-models-qwen3_5-model-Qwen3_5Model)                                             | Thin VLM wrapper exposing `language_model` internals as properties and routing the forward pass. |

### Functions

| Name                                                                                                         | Description                                                          |
| ------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------- |
| [`_default_init_device`](#nemo_automodel-components-models-qwen3_5-model-_default_init_device)               | -                                                                    |
| [`_dense_moe_config`](#nemo_automodel-components-models-qwen3_5-model-_dense_moe_config)                     | Trivial MoEConfig for the dense Qwen3.5 backbone.                    |
| [`_make_full_attention_config`](#nemo_automodel-components-models-qwen3_5-model-_make_full_attention_config) | -                                                                    |
| [`_mtp_block_causal_mask`](#nemo_automodel-components-models-qwen3_5-model-_mtp_block_causal_mask)           | Build a 4D block-causal attention mask from an indexed packing mask. |
| [`_qwen3_5_backend`](#nemo_automodel-components-models-qwen3_5-model-_qwen3_5_backend)                       | Return a Qwen3.5 backend with TE fused RoPE disabled.                |
| [`_resolve_mtp_num_layers`](#nemo_automodel-components-models-qwen3_5-model-_resolve_mtp_num_layers)         | -                                                                    |
| [`_rolled_embed_inputs`](#nemo_automodel-components-models-qwen3_5-model-_rolled_embed_inputs)               | -                                                                    |
| [`_split_qwen3_5_position_ids`](#nemo_automodel-components-models-qwen3_5-model-_split_qwen3_5_position_ids) | -                                                                    |
| [`build_mtp_config_from_hf`](#nemo_automodel-components-models-qwen3_5-model-build_mtp_config_from_hf)       | Build Qwen3.5 MTP runtime config from HF-style config fields.        |
| [`build_qwen3_5_dense_mtp`](#nemo_automodel-components-models-qwen3_5-model-build_qwen3_5_dense_mtp)         | Construct dense Qwen3.5 MTP blocks.                                  |

### Data

[`ModelClass`](#nemo_automodel-components-models-qwen3_5-model-ModelClass)

### API

```python
class nemo_automodel.components.models.qwen3_5.model.Fp32SafeQwen3_5TextRotaryEmbedding()
```

**Bases:** `Qwen3_5TextRotaryEmbedding`

Ensure inv\_freq stays in float32 across `.to(dtype)` calls.

```python
nemo_automodel.components.models.qwen3_5.model.Fp32SafeQwen3_5TextRotaryEmbedding._apply(
    fn: typing.Any,
    recurse: bool = True
)
```

```python
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast(
    rope_deltas: torch.Tensor | None = None,
    mtp_per_depth_h: list[torch.Tensor] | None = None,
    mtp_loss_scaling_factor: float | None = None
)
```

Dataclass

**Bases:** `CausalLMOutputWithPast`

Qwen3.5 causal-LM output extended with MTP auxiliary hidden states.

```python
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseBlock(
    layer_idx,
    config,
    moe_config,
    backend
)
```

**Bases:** [Block](/nemo-automodel/nemo_automodel/components/models/qwen3_next/model#nemo_automodel-components-models-qwen3_next-model-Block)

Qwen3.5 dense decoder block on top of the Qwen3-Next `Block`.

Identical to `Qwen3_5MoeBlock` except that the MLP is a plain dense `MLP`
(no experts). The CP-aware GatedDeltaNet is built natively for
linear-attention layers, and `forward` passes the NEAT-packing kwargs through.

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseBlock.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseBlock.init_weights(
    buffer_device: torch.device
)
```

```python
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseMTPSublayer(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    layer_idx: int,
    has_fusion: bool = False,
    has_final_norm: bool = False,
    dtype: torch.dtype = torch.bfloat16
)
```

**Bases:** `Qwen3_5DecoderLayer`

One full-attention Qwen3.5 dense MTP sublayer.

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseMTPSublayer.forward(
    hidden_states: torch.Tensor,
    embed_input: torch.Tensor | None = None,
    rotary_emb: torch.nn.Module,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    past_key_values: typing.Any | None = None,
    kwargs: typing.Any = {}
) -> torch.Tensor
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseMTPSublayer.init_weights(
    buffer_device: torch.device | None = None
) -> None
```

```python
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** `Module`

Qwen3.5 dense text decoder rebuilt on the Qwen3-Next `Block`.

Native counterpart of `Qwen3_5MoeTextModelBackend` for the dense model:
reuses the same blocks/GatedDeltaNet/norm/rotary so dense and MoE share one
code path, with the fp32 `SSMGate` built at construction (no runtime patch).

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.forward(
    input_ids: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    cache_position: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    past_key_values: typing.Any | None = None,
    use_cache: bool | None = None,
    output_hidden_states: bool | None = None,
    attn_kwargs: typing.Any = {}
) -> transformers.modeling_outputs.BaseModelOutputWithPast
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.get_input_embeddings() -> torch.nn.Module
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.init_weights(
    buffer_device: torch.device | None = None
) -> None
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.set_input_embeddings(
    value: torch.nn.Module
) -> None
```

```python
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    mtp_loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None,
    kwargs: typing.Any = {}
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `Module`

Qwen3.5 dense causal LM with optional Megatron-style MTP head.

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.forward(
    input_ids: torch.LongTensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.LongTensor | None = None,
    past_key_values: typing.Any | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    use_cache: bool | None = None,
    logits_to_keep: int | torch.Tensor = 0,
    kwargs: typing.Any = {}
) -> nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.from_config(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs: typing.Any = {}
)
```

classmethod

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.from_pretrained(
    pretrained_model_name_or_path: str,
    model_args: typing.Any = (),
    kwargs: typing.Any = {}
)
```

classmethod

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.get_input_embeddings() -> torch.nn.Module
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.get_output_embeddings() -> torch.nn.Module
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.set_input_embeddings(
    value: torch.nn.Module
) -> None
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.set_output_embeddings(
    new_embeddings: torch.nn.Module
) -> None
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.tie_weights() -> None
```

```python
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5Config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    mtp_loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None,
    kwargs: typing.Any = {}
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `HFQwen3_5ForConditionalGeneration`

Qwen3.5/Qwen3.6 dense VLM with optional Megatron-style MTP head.

The base VLM stays on the upstream HF implementation so image/video feature
insertion, M-RoPE position handling, and generation helpers remain intact.
MTP is added as an auxiliary train-time module over the final language
hidden states, matching the dense text-only MTP architecture.

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration._pop_staged_vlm_media(
    input_ids: torch.Tensor | None,
    kwargs: dict[str, typing.Any]
) -> tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.forward(
    input_ids: torch.LongTensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.LongTensor | None = None,
    past_key_values: typing.Any | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    pixel_values: torch.Tensor | None = None,
    pixel_values_videos: torch.FloatTensor | None = None,
    image_grid_thw: torch.LongTensor | None = None,
    video_grid_thw: torch.LongTensor | None = None,
    mm_token_type_ids: torch.IntTensor | None = None,
    use_cache: bool | None = None,
    logits_to_keep: int | torch.Tensor = 0,
    padding_mask: torch.Tensor | None = None,
    kwargs: typing.Any = {}
) -> nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.from_config(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5Config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs: typing.Any = {}
)
```

classmethod

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    model_args: typing.Any = (),
    kwargs: typing.Any = {}
)
```

classmethod

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None
```

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.prepare_model_inputs_for_cp(
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    pixel_values: torch.Tensor | None = None,
    pixel_values_videos: torch.Tensor | None = None,
    image_grid_thw: torch.Tensor | None = None,
    image_grid_hws: torch.Tensor | None = None,
    video_grid_thw: torch.Tensor | None = None,
    mm_token_type_ids: torch.Tensor | None = None,
    kwargs: typing.Any = {}
) -> dict[str, torch.Tensor]
```

Build full-sequence multimodal embeddings and mRoPE positions before CP sharding.

The VLM->LM multimodal scatter and mRoPE `get_rope_index` must run on the
*full* (unsharded) sequence; context-parallel sharding then happens on the
returned `inputs_embeds` / `position_ids` via `make_cp_batch_and_ctx`.
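The ordering described above (build on the full sequence first, shard afterwards) can be sketched as follows. `shard_for_cp` is a hypothetical helper that uses a plain contiguous split; the layout that `make_cp_batch_and_ctx` actually produces may differ, and real mRoPE position IDs may carry extra dimensions.

```python
import torch


def shard_for_cp(full: torch.Tensor, cp_size: int, cp_rank: int, seq_dim: int = 1) -> torch.Tensor:
    """Illustrative contiguous split along the sequence dimension."""
    return torch.chunk(full, cp_size, dim=seq_dim)[cp_rank]


batch, seq_len, hidden, cp_size = 1, 8, 4, 2

# Step 1: build embeddings and positions on the full, unsharded sequence.
inputs_embeds = torch.randn(batch, seq_len, hidden)
position_ids = torch.arange(seq_len).unsqueeze(0)

# Step 2: only then slice per rank, so every shard keeps its global positions.
position_shards = [shard_for_cp(position_ids, cp_size, r) for r in range(cp_size)]
embed_shards = [shard_for_cp(inputs_embeds, cp_size, r) for r in range(cp_size)]
```

If positions were computed after sharding instead, each rank would restart its positions at zero and the rotary phases would no longer match the unsharded run.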

```python
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5Model()
```

**Bases:** `HFQwen3_5Model`

Thin VLM wrapper that exposes `language_model` internals as properties and
routes the forward pass: through the HF vision+scatter path when media is
present, otherwise directly through the NeMo dense backbone. Mirrors `Qwen3_5MoeModel`.

```python
nemo_automodel.components.models.qwen3_5.model.Qwen3_5Model.forward(
    input_ids = None,
    attention_mask = None,
    position_ids = None,
    past_key_values = None,
    inputs_embeds = None,
    pixel_values = None,
    pixel_values_videos = None,
    image_grid_thw = None,
    video_grid_thw = None,
    cache_position = None,
    kwargs = {}
)
```
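The routing rule in the class description can be sketched as a small dispatcher. This is an illustration only: `route_forward`, `vision_path`, and `text_path` are hypothetical names, and the check for "media present" is assumed to mean that either pixel tensor is not `None`.

```python
from typing import Any, Callable


def route_forward(
    vision_path: Callable[..., Any],
    text_path: Callable[..., Any],
    pixel_values: Any = None,
    pixel_values_videos: Any = None,
    **kwargs: Any,
) -> Any:
    """Send media batches to the vision path and text-only batches to the backbone."""
    has_media = pixel_values is not None or pixel_values_videos is not None
    if has_media:
        return vision_path(
            pixel_values=pixel_values,
            pixel_values_videos=pixel_values_videos,
            **kwargs,
        )
    # Text-only batches skip the vision tower and the multimodal scatter.
    return text_path(**kwargs)


# Stand-in callables that only report which path was taken.
chosen = route_forward(lambda **kw: "vision", lambda **kw: "text", input_ids=[1, 2, 3])
```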

```python
nemo_automodel.components.models.qwen3_5.model._default_init_device() -> torch.device
```

```python
nemo_automodel.components.models.qwen3_5.model._dense_moe_config(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    dtype: torch.dtype
) -> nemo_automodel.components.moe.layers.MoEConfig
```

Trivial MoEConfig for the dense Qwen3.5 backbone.

The dense model has no experts (`num_experts` is 0/absent), so `Block`
builds a dense `MLP` and never consults this config; it is only required to
satisfy `Block.__init__`'s signature.

```python
nemo_automodel.components.models.qwen3_5.model._make_full_attention_config(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    layer_idx: int
) -> transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig
```

```python
nemo_automodel.components.models.qwen3_5.model._mtp_block_causal_mask(
    packing_mask: torch.Tensor,
    inputs_embeds: torch.Tensor
) -> torch.Tensor
```

Build a 4D block-causal attention mask from an indexed packing mask.

`packing_mask` is `[B, S]` with the 1-based document index per token
(0 = padding). The returned bool mask `[B, 1, S, S]` (`True` = attend)
keeps attention causal *and* within each packed document, matching the
backbone's packed-sequence semantics. Used for the MTP sublayers, which run
SDPA self-attention over the same packed batch (NVBugs 6330129).
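The mask semantics described above can be sketched in a few lines. `block_causal_mask` is a hypothetical helper, not the function's actual body: it omits the `inputs_embeds` argument and leaves padding query rows fully masked, which the real implementation may handle differently.

```python
import torch


def block_causal_mask(packing_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative helper: [B, S] document indices -> [B, 1, S, S] bool mask."""
    seq_len = packing_mask.shape[-1]
    # same_doc[b, i, j] is True when query i and key j share a document index.
    same_doc = packing_mask.unsqueeze(-1) == packing_mask.unsqueeze(-2)
    # Padding tokens carry index 0 and must never be attended to.
    key_is_real = (packing_mask != 0).unsqueeze(-2)
    # Lower-triangular mask: query i may only see keys j <= i.
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=packing_mask.device)
    )
    return (same_doc & key_is_real & causal).unsqueeze(1)


# Two packed documents of length 2 followed by one padding token.
packing = torch.tensor([[1, 1, 2, 2, 0]])
mask = block_causal_mask(packing)
```

In this example the third token (the first token of document 2) attends only to itself, even though two earlier tokens precede it in the packed row.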

```python
nemo_automodel.components.models.qwen3_5.model._qwen3_5_backend(
    backend: nemo_automodel.components.models.common.BackendConfig | None = None
) -> nemo_automodel.components.models.common.BackendConfig
```

Return a Qwen3.5 backend with TE fused RoPE disabled.

Qwen3.5 VLM training can feed full-attention layers in packed/THD shape via
the shared Qwen3-Next attention block. TE fused RoPE expects 4D inputs there,
so keep the non-fused RoPE path while preserving the rest of the backend
selection (TE Linear, attention backend, etc.).
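The "override one flag, keep the rest" pattern described above can be sketched with a frozen dataclass. `SketchBackend` and its field names (`linear`, `attn`, `rope_fusion`) are assumptions made for illustration; they are not taken from `BackendConfig`.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchBackend:
    """Hypothetical stand-in; field names are not taken from BackendConfig."""

    linear: str = "te"
    attn: str = "te"
    rope_fusion: bool = True


def with_unfused_rope(backend: SketchBackend | None = None) -> SketchBackend:
    # Fall back to defaults when no backend is passed, then override only the RoPE flag.
    base = backend if backend is not None else SketchBackend()
    return replace(base, rope_fusion=False)


original = SketchBackend(linear="te", attn="sdpa")
resolved = with_unfused_rope(original)
```

Returning a copy leaves the caller's backend object untouched, so other models that share it keep their own RoPE setting.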

```python
nemo_automodel.components.models.qwen3_5.model._resolve_mtp_num_layers(
    config: typing.Any,
    override: int | None = None
) -> int
```

```python
nemo_automodel.components.models.qwen3_5.model._rolled_embed_inputs(
    inputs_embeds: torch.Tensor,
    num_depths: int
) -> tuple[torch.Tensor, ...]
```

```python
nemo_automodel.components.models.qwen3_5.model._split_qwen3_5_position_ids(
    position_ids: torch.Tensor | None,
    batch_size: int,
    seq_len: int,
    device: torch.device,
    past_key_values: typing.Any | None = None
) -> tuple[torch.Tensor, torch.Tensor | None]
```

```python
nemo_automodel.components.models.qwen3_5.model.build_mtp_config_from_hf(
    config: typing.Any,
    loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None
) -> nemo_automodel.components.models.common.mtp.MTPConfig
```

Build Qwen3.5 MTP runtime config from HF-style config fields.

```python
nemo_automodel.components.models.qwen3_5.model.build_qwen3_5_dense_mtp(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    mtp_config: nemo_automodel.components.models.common.mtp.MTPConfig,
    dtype: torch.dtype
) -> nemo_automodel.components.models.common.mtp.MTPModule
```

Construct dense Qwen3.5 MTP blocks.

Qwen3.5 MTP follows Megatron Bridge: each depth is one full-attention
Qwen3.5 decoder block, regardless of the backbone's GatedDeltaNet layers.

```python
nemo_automodel.components.models.qwen3_5.model.ModelClass = Qwen3_5ForCausalLM
```