nemo_automodel.components.models.qwen3_5.model

Qwen3.5 dense causal LM with Megatron-style MTP support.

Module Contents

Classes

Name	Description
`Fp32SafeQwen3_5TextRotaryEmbedding`	Ensure inv_freq stays in float32 across `.to(dtype)` calls.
`Qwen3_5CausalLMOutputWithPast`	Qwen3.5 causal-LM output extended with MTP auxiliary hidden states.
`Qwen3_5DenseBlock`	Qwen3.5 dense decoder block on top of the Qwen3-Next `Block`.
`Qwen3_5DenseMTPSublayer`	One full-attention Qwen3.5 dense MTP sublayer.
`Qwen3_5DenseTextBackbone`	Qwen3.5 dense text decoder rebuilt on the Qwen3-Next `Block`.
`Qwen3_5ForCausalLM`	Qwen3.5 dense causal LM with optional Megatron-style MTP head.
`Qwen3_5ForConditionalGeneration`	Qwen3.5/Qwen3.6 dense VLM with optional Megatron-style MTP head.
`Qwen3_5Model`	Thin VLM wrapper exposing `language_model` internals as properties and routing the forward pass.

Functions

Name	Description
`_default_init_device`	-
`_dense_moe_config`	Trivial MoEConfig for the dense Qwen3.5 backbone.
`_make_full_attention_config`	-
`_mtp_block_causal_mask`	Build a 4D block-causal attention mask from an indexed packing mask.
`_qwen3_5_backend`	Return a Qwen3.5 backend with TE fused RoPE disabled.
`_resolve_mtp_num_layers`	-
`_rolled_embed_inputs`	-
`_split_qwen3_5_position_ids`	-
`build_mtp_config_from_hf`	Build Qwen3.5 MTP runtime config from HF-style config fields.
`build_qwen3_5_dense_mtp`	Construct dense Qwen3.5 MTP blocks.

Data

ModelClass

API

class nemo_automodel.components.models.qwen3_5.model.Fp32SafeQwen3_5TextRotaryEmbedding()

Bases: Qwen3_5TextRotaryEmbedding

Ensure inv_freq stays in float32 across .to(dtype) calls.

nemo_automodel.components.models.qwen3_5.model.Fp32SafeQwen3_5TextRotaryEmbedding._apply(
    fn: typing.Any,
    recurse: bool = True
)
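
The `_apply` override is the hook that makes this possible: `nn.Module.to`, `.half()`, and `.bfloat16()` all route through `_apply`, so a buffer restored there survives every dtype cast. A minimal sketch of the pattern follows, assuming a toy module; `Fp32SafeRotarySketch`, its `dim` and `base` arguments, and the extra `scale` parameter are illustrative and not taken from the actual implementation.

```python
import torch
from torch import nn


class Fp32SafeRotarySketch(nn.Module):
    """Toy rotary module whose inv_freq buffer stays in float32 across casts."""

    def __init__(self, dim: int = 8, base: float = 10000.0):
        super().__init__()
        exponents = torch.arange(0, dim, 2, dtype=torch.float32) / dim
        self.register_buffer("inv_freq", 1.0 / (base**exponents), persistent=False)
        # An ordinary parameter, to show that everything else is still cast.
        self.scale = nn.Parameter(torch.ones(1))

    def _apply(self, fn, *args, **kwargs):
        # Hold on to the float32 values before the generic cast touches them.
        inv_freq_fp32 = self.inv_freq.to(torch.float32)
        module = super()._apply(fn, *args, **kwargs)
        # Follow any device move, but restore the float32 dtype and values.
        self.inv_freq = inv_freq_fp32.to(device=self.inv_freq.device)
        return module


rope = Fp32SafeRotarySketch().to(torch.bfloat16)
print(rope.inv_freq.dtype, rope.scale.dtype)
```

After the cast, `inv_freq` should still report float32 while `scale` follows the requested bfloat16.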

class nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast(
    rope_deltas: torch.Tensor | None = None,
    mtp_per_depth_h: list[torch.Tensor] | None = None,
    mtp_loss_scaling_factor: float | None = None
)

Dataclass

Bases: CausalLMOutputWithPast

Qwen3.5 causal-LM output extended with MTP auxiliary hidden states.

mtp_loss_scaling_factor

float | None = None

mtp_per_depth_h

list[Tensor] | None = None

rope_deltas

Tensor | None = None
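
One common way to consume these fields in a training loop is the DeepSeek-V3/Megatron-style convention, where the per-depth MTP losses are averaged, scaled, and added to the main loss. The sketch below assumes that convention rather than documenting this module's behavior; `combine_losses` and its arguments are hypothetical, and the per-depth losses are taken as already computed from `mtp_per_depth_h`.

```python
from __future__ import annotations


def combine_losses(main_loss: float, mtp_losses: list[float], scaling_factor: float | None) -> float:
    """Add the averaged, scaled MTP losses to the main next-token loss."""
    if not mtp_losses or scaling_factor is None:
        # No MTP head (or MTP disabled): the main loss is used unchanged.
        return main_loss
    mtp_term = sum(mtp_losses) / len(mtp_losses)
    return main_loss + scaling_factor * mtp_term


# Two MTP depths with losses 3.0 and 5.0, scaled by the default factor 0.1.
total = combine_losses(main_loss=2.0, mtp_losses=[3.0, 5.0], scaling_factor=0.1)
print(total)
```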

class nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseBlock(
    layer_idx,
    config,
    moe_config,
    backend
)

Bases: Block

Qwen3.5 dense decoder block on top of the Qwen3-Next Block.

Identical to Qwen3_5MoeBlock except that the MLP is a dense MLP (no experts). The CP-aware GatedDeltaNet is built natively for linear-attention layers, and forward threads the NEAT-packing kwargs through.

linear_attn

= CPAwareGatedDeltaNet(config, layer_idx)

nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseBlock.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    **attn_kwargs: typing.Any
) -> torch.Tensor

nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseBlock.init_weights(
    buffer_device: torch.device
)

class nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseMTPSublayer(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    layer_idx: int,
    has_fusion: bool = False,
    has_final_norm: bool = False,
    dtype: torch.dtype = torch.bfloat16
)

Bases: Qwen3_5DecoderLayer

One full-attention Qwen3.5 dense MTP sublayer.

eh_proj

enorm

final_layernorm

hnorm

nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseMTPSublayer.forward(
    hidden_states: torch.Tensor,
    embed_input: torch.Tensor | None = None,
    rotary_emb: torch.nn.Module,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    past_key_values: typing.Any | None = None,
    **kwargs: typing.Any
) -> torch.Tensor

nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseMTPSublayer.init_weights(
    buffer_device: torch.device | None = None
) -> None

class nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Qwen3.5 dense text decoder rebuilt on the Qwen3-Next Block.

Native counterpart of Qwen3_5MoeTextModelBackend for the dense model: reuses the same blocks/GatedDeltaNet/norm/rotary so dense and MoE share one code path, with the fp32 SSMGate built at construction (no runtime patch).

embed_tokens

layers

norm

padding_idx

= getattr(config, 'pad_token_id', None)

rotary_emb

= Fp32SafeQwen3_5TextRotaryEmbedding(config=config)

vocab_size

= config.vocab_size

nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.forward(
    input_ids: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    cache_position: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    past_key_values: typing.Any | None = None,
    use_cache: bool | None = None,
    output_hidden_states: bool | None = None,
    **attn_kwargs: typing.Any
) -> transformers.modeling_outputs.BaseModelOutputWithPast

nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.get_input_embeddings() -> torch.nn.Module

nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.init_weights(
    buffer_device: torch.device | None = None
) -> None

nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.set_input_embeddings(
    value: torch.nn.Module
) -> None

class nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    mtp_loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None,
    **kwargs: typing.Any
)

Bases: HFCheckpointingMixin, Module

Qwen3.5 dense causal LM with optional Megatron-style MTP head.

backend

= _qwen3_5_backend(backend)

lm_head

model

= Qwen3_5DenseTextBackbone(config, self.backend)

mtp

mtp_config

state_dict_adapter

vocab_size

= config.vocab_size

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.forward(
    input_ids: torch.LongTensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.LongTensor | None = None,
    past_key_values: typing.Any | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    use_cache: bool | None = None,
    logits_to_keep: int | torch.Tensor = 0,
    **kwargs: typing.Any
) -> nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.from_config(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs: typing.Any
)

classmethod

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args: typing.Any,
    **kwargs: typing.Any
)

classmethod

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.get_input_embeddings() -> torch.nn.Module

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.get_output_embeddings() -> torch.nn.Module

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.set_input_embeddings(
    value: torch.nn.Module
) -> None

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.set_output_embeddings(
    new_embeddings: torch.nn.Module
) -> None

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.tie_weights() -> None

class nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5Config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    mtp_loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None,
    **kwargs: typing.Any
)

Bases: HFCheckpointingMixin, HFQwen3_5ForConditionalGeneration

Qwen3.5/Qwen3.6 dense VLM with optional Megatron-style MTP head.

The base VLM stays on the upstream HF implementation so image/video feature insertion, M-RoPE position handling, and generation helpers remain intact. MTP is added as an auxiliary train-time module over the final language hidden states, matching the dense text-only MTP architecture.

_pp_keep_self_forward

bool = True

backend

= _qwen3_5_backend(backend)

lm_head

= self.lm_head.to(dtype)

mtp

mtp_config

state_dict_adapter

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration._pop_staged_vlm_media(
    input_ids: torch.Tensor | None,
    kwargs: dict[str, typing.Any]
) -> tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.forward(
    input_ids: torch.LongTensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.LongTensor | None = None,
    past_key_values: typing.Any | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    pixel_values: torch.Tensor | None = None,
    pixel_values_videos: torch.FloatTensor | None = None,
    image_grid_thw: torch.LongTensor | None = None,
    video_grid_thw: torch.LongTensor | None = None,
    mm_token_type_ids: torch.IntTensor | None = None,
    use_cache: bool | None = None,
    logits_to_keep: int | torch.Tensor = 0,
    padding_mask: torch.Tensor | None = None,
    **kwargs: typing.Any
) -> nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.from_config(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5Config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs: typing.Any
)

classmethod

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args: typing.Any,
    **kwargs: typing.Any
)

classmethod

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None

nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.prepare_model_inputs_for_cp(
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    pixel_values: torch.Tensor | None = None,
    pixel_values_videos: torch.Tensor | None = None,
    image_grid_thw: torch.Tensor | None = None,
    image_grid_hws: torch.Tensor | None = None,
    video_grid_thw: torch.Tensor | None = None,
    mm_token_type_ids: torch.Tensor | None = None,
    **kwargs: typing.Any
) -> dict[str, torch.Tensor]

Build full-sequence multimodal embeddings and mRoPE positions before CP sharding.

The VLM->LM multimodal scatter and mRoPE get_rope_index must run on the full (unsharded) sequence; context-parallel sharding then happens on the returned inputs_embeds / position_ids via make_cp_batch_and_ctx.
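
To see why the order matters, consider a toy version of the flow: image features are scattered into placeholder positions on the full sequence, and only then is the sequence split across context-parallel ranks. The sketch uses plain `torch.chunk` as a stand-in for `make_cp_batch_and_ctx`, whose actual sharding layout may differ; `IMAGE_TOKEN_ID`, `scatter_then_shard`, and the tensor shapes are assumptions made for illustration.

```python
import torch

IMAGE_TOKEN_ID = 99  # hypothetical placeholder id, not the real token id


def scatter_then_shard(input_ids, text_embeds, image_feats, cp_size):
    """Fill image placeholders on the full sequence, then split along the sequence dim."""
    is_image = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1).expand_as(text_embeds)
    # masked_scatter consumes image_feats in order, so it needs every placeholder
    # of the sequence at once; a rank holding only a slice could not align them.
    full_embeds = text_embeds.masked_scatter(is_image, image_feats)
    return full_embeds, torch.chunk(full_embeds, cp_size, dim=1)


input_ids = torch.tensor([[1, 99, 99, 2, 3, 99, 4, 5]])  # [B=1, S=8]
text_embeds = torch.zeros(1, 8, 4)                        # [B, S, H]
# Three image feature rows filled with 1s, 2s, and 3s respectively.
image_feats = torch.arange(1, 4, dtype=torch.float32).repeat_interleave(4).view(3, 4)
full_embeds, shards = scatter_then_shard(input_ids, text_embeds, image_feats, cp_size=2)
print([tuple(s.shape) for s in shards])
```

In this toy run the third image feature lands at position 5 of the full sequence, which ends up in the second shard; running the scatter per shard would require each rank to know how many placeholders precede its slice.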

class nemo_automodel.components.models.qwen3_5.model.Qwen3_5Model()

Bases: HFQwen3_5Model

Thin VLM wrapper that exposes language_model internals as properties and routes the forward pass: the HF vision+scatter path when media is present, otherwise the NeMo dense backbone directly. Mirrors Qwen3_5MoeModel.
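
The routing decision can be pictured as a simple presence check on the media inputs. This is a sketch under the assumption that "media is present" means either `pixel_values` or `pixel_values_videos` was passed; the real condition may consider other inputs, and `choose_forward_path` and its return labels are hypothetical.

```python
def choose_forward_path(pixel_values=None, pixel_values_videos=None) -> str:
    """Pick the forward path based on whether any image or video input was given."""
    has_media = pixel_values is not None or pixel_values_videos is not None
    # Media needs the HF vision tower and feature scatter; text-only batches skip it.
    return "hf_vision_scatter" if has_media else "dense_backbone"


print(choose_forward_path())
print(choose_forward_path(pixel_values=[[0.0]]))
```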

nemo_automodel.components.models.qwen3_5.model.Qwen3_5Model.forward(
    input_ids = None,
    attention_mask = None,
    position_ids = None,
    past_key_values = None,
    inputs_embeds = None,
    pixel_values = None,
    pixel_values_videos = None,
    image_grid_thw = None,
    video_grid_thw = None,
    cache_position = None,
    **kwargs
)

nemo_automodel.components.models.qwen3_5.model._default_init_device() -> torch.device

nemo_automodel.components.models.qwen3_5.model._dense_moe_config(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    dtype: torch.dtype
) -> nemo_automodel.components.moe.layers.MoEConfig

Trivial MoEConfig for the dense Qwen3.5 backbone.

The dense model has no experts (num_experts is 0/absent), so Block builds a dense MLP and never consults this config; it is only required to satisfy Block.__init__’s signature.

nemo_automodel.components.models.qwen3_5.model._make_full_attention_config(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    layer_idx: int
) -> transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig

nemo_automodel.components.models.qwen3_5.model._mtp_block_causal_mask(
    packing_mask: torch.Tensor,
    inputs_embeds: torch.Tensor
) -> torch.Tensor

Build a 4D block-causal attention mask from an indexed packing mask.

packing_mask is [B, S] with the 1-based document index per token (0 = padding). The returned bool mask [B, 1, S, S] (True = attend) keeps attention causal and within each packed document, matching the backbone’s packed-sequence semantics. Used for the MTP sublayers, which run SDPA self-attention over the same packed batch (NVBugs 6330129).
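
The mask construction described above can be sketched as follows. This is an illustrative reimplementation, not the function's actual body: `block_causal_mask_sketch` is a hypothetical name, and it omits the `inputs_embeds` argument that the real function accepts.

```python
import torch


def block_causal_mask_sketch(packing_mask: torch.Tensor) -> torch.Tensor:
    """Return a [B, 1, S, S] bool mask (True = attend) from a [B, S] document-index mask."""
    seq_len = packing_mask.shape[1]
    query_doc = packing_mask[:, :, None]  # [B, S, 1]
    key_doc = packing_mask[:, None, :]    # [B, 1, S]
    # Same document on both sides, and not padding (index 0).
    same_doc = (query_doc == key_doc) & (query_doc != 0)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=packing_mask.device))
    return (same_doc & causal)[:, None, :, :]


# Two packed documents of two tokens each, followed by one padding token.
packing_mask = torch.tensor([[1, 1, 2, 2, 0]])
mask = block_causal_mask_sketch(packing_mask)
print(mask[0, 0].int())
```

In this sketch padding rows attend to nothing. A fully masked row yields NaNs in softmax-based attention, so a real implementation may need to treat padding rows specially; that detail is not covered here.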

nemo_automodel.components.models.qwen3_5.model._qwen3_5_backend(
    backend: nemo_automodel.components.models.common.BackendConfig | None = None
) -> nemo_automodel.components.models.common.BackendConfig

Return a Qwen3.5 backend with TE fused RoPE disabled.

Qwen3.5 VLM training can feed full-attention layers in packed/THD shape via the shared Qwen3-Next attention block. Transformer Engine (TE) fused RoPE expects 4D inputs there, so this function keeps the non-fused RoPE path while preserving the rest of the backend selection (TE Linear, attention backend, etc.).
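
The "override one switch, keep the rest" behavior can be sketched with a small dataclass. `BackendSketch` is a hypothetical stand-in for `BackendConfig`, and its field names (`attn`, `linear`, `rope_fusion`) are assumptions chosen for illustration rather than the documented fields.

```python
from __future__ import annotations

import dataclasses


@dataclasses.dataclass
class BackendSketch:
    """Hypothetical stand-in for BackendConfig; field names are illustrative."""

    attn: str = "te"
    linear: str = "te"
    rope_fusion: bool = True


def qwen3_5_backend_sketch(backend: BackendSketch | None = None) -> BackendSketch:
    """Copy the backend (or build a default one) with fused RoPE switched off."""
    backend = backend if backend is not None else BackendSketch()
    # dataclasses.replace returns a new object, so the caller's config is not mutated.
    return dataclasses.replace(backend, rope_fusion=False)


user_backend = BackendSketch(attn="sdpa", linear="te", rope_fusion=True)
resolved = qwen3_5_backend_sketch(user_backend)
print(resolved, user_backend.rope_fusion)
```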

nemo_automodel.components.models.qwen3_5.model._resolve_mtp_num_layers(
    config: typing.Any,
    override: int | None = None
) -> int

nemo_automodel.components.models.qwen3_5.model._rolled_embed_inputs(
    inputs_embeds: torch.Tensor,
    num_depths: int
) -> tuple[torch.Tensor, ...]

nemo_automodel.components.models.qwen3_5.model._split_qwen3_5_position_ids(
    position_ids: torch.Tensor | None,
    batch_size: int,
    seq_len: int,
    device: torch.device,
    past_key_values: typing.Any | None = None
) -> tuple[torch.Tensor, torch.Tensor | None]

nemo_automodel.components.models.qwen3_5.model.build_mtp_config_from_hf(
    config: typing.Any,
    loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None
) -> nemo_automodel.components.models.common.mtp.MTPConfig

Build Qwen3.5 MTP runtime config from HF-style config fields.

nemo_automodel.components.models.qwen3_5.model.build_qwen3_5_dense_mtp(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    mtp_config: nemo_automodel.components.models.common.mtp.MTPConfig,
    dtype: torch.dtype
) -> nemo_automodel.components.models.common.mtp.MTPModule

Construct dense Qwen3.5 MTP blocks.

Qwen3.5 MTP follows Megatron Bridge: each depth is one full-attention Qwen3.5 decoder block, regardless of the backbone’s GatedDeltaNet layers.
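
One way to force a full-attention block regardless of the backbone's layer pattern is to copy the config and mark the target layer as full attention. The sketch below assumes the config carries a `layer_types` list with `"linear_attention"` and `"full_attention"` entries; `full_attention_config_sketch` is a hypothetical helper, and a `SimpleNamespace` stands in for the real text config.

```python
import copy
from types import SimpleNamespace


def full_attention_config_sketch(config, layer_idx: int):
    """Return a copy of config whose layer `layer_idx` is marked as full attention."""
    cfg = copy.deepcopy(config)
    layer_types = list(cfg.layer_types)
    # Pad the list if the MTP layer index lies past the backbone layers.
    layer_types += ["full_attention"] * (layer_idx + 1 - len(layer_types))
    layer_types[layer_idx] = "full_attention"
    cfg.layer_types = layer_types
    return cfg


backbone = SimpleNamespace(layer_types=["linear_attention"] * 3 + ["full_attention"])
mtp_cfg = full_attention_config_sketch(backbone, layer_idx=1)
print(mtp_cfg.layer_types, backbone.layer_types)
```

Working on a deep copy keeps the backbone's own layer pattern untouched, so the MTP block and the backbone can be built from the same source config.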

nemo_automodel.components.models.qwen3_5.model.ModelClass = Qwen3_5ForCausalLM