nemo_automodel.components.models.qwen3_5.model

Qwen3.5 dense causal LM with Megatron-style MTP support.

Module Contents

Classes

Name                                Description
Fp32SafeQwen3_5TextRotaryEmbedding  Ensure inv_freq stays in float32 across .to(dtype) calls.
Qwen3_5CausalLMOutputWithPast       Qwen3.5 causal-LM output extended with MTP auxiliary hidden states.
Qwen3_5DenseBlock                   Qwen3.5 dense decoder block on top of the Qwen3-Next Block.
Qwen3_5DenseMTPSublayer             One full-attention Qwen3.5 dense MTP sublayer.
Qwen3_5DenseTextBackbone            Qwen3.5 dense text decoder rebuilt on the Qwen3-Next Block.
Qwen3_5ForCausalLM                  Qwen3.5 dense causal LM with optional Megatron-style MTP head.
Qwen3_5ForConditionalGeneration     Qwen3.5/Qwen3.6 dense VLM with optional Megatron-style MTP head.
Qwen3_5Model                        Thin VLM wrapper exposing language_model internals as properties and routing the forward.

Functions

Name                         Description
_default_init_device         -
_dense_moe_config            Trivial MoEConfig for the dense Qwen3.5 backbone.
_make_full_attention_config  -
_mtp_block_causal_mask       Build a 4D block-causal attention mask from an indexed packing mask.
_qwen3_5_backend             Return a Qwen3.5 backend with TE fused RoPE disabled.
_resolve_mtp_num_layers      -
_rolled_embed_inputs         -
_split_qwen3_5_position_ids  -
build_mtp_config_from_hf     Build Qwen3.5 MTP runtime config from HF-style config fields.
build_qwen3_5_dense_mtp      Construct dense Qwen3.5 MTP blocks.

Data

ModelClass

API

class nemo_automodel.components.models.qwen3_5.model.Fp32SafeQwen3_5TextRotaryEmbedding()

Bases: Qwen3_5TextRotaryEmbedding

Ensure inv_freq stays in float32 across .to(dtype) calls.

nemo_automodel.components.models.qwen3_5.model.Fp32SafeQwen3_5TextRotaryEmbedding._apply(
fn: typing.Any,
recurse: bool = True
)
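
The override above hooks into nn.Module._apply, which is what .to(dtype) goes through. A minimal sketch of that pattern follows, assuming a toy module rather than the actual Qwen3_5TextRotaryEmbedding; the class name Fp32InvFreqSketch and its constructor arguments are hypothetical, and the real class may restore the buffer differently.

```python
import torch
from torch import nn


class Fp32InvFreqSketch(nn.Module):
    """Toy module (hypothetical) whose `inv_freq` buffer stays float32 across casts."""

    def __init__(self, dim: int = 8, base: float = 10000.0):
        super().__init__()
        exponents = torch.arange(0, dim, 2, dtype=torch.float32) / dim
        self.register_buffer("inv_freq", 1.0 / (base**exponents), persistent=False)
        self.proj = nn.Linear(dim, dim)  # ordinary parameter, cast as usual

    def _apply(self, fn, *args, **kwargs):
        # Save the full-precision values before nn.Module casts every tensor.
        fp32_copy = self.inv_freq.detach().clone().float()
        module = super()._apply(fn, *args, **kwargs)
        # Follow any device move, but put the float32 values back.
        self.inv_freq = fp32_copy.to(device=self.inv_freq.device)
        return module


module = Fp32InvFreqSketch().to(torch.bfloat16)
print(module.inv_freq.dtype, module.proj.weight.dtype)
```

In this sketch the Linear weights follow the cast to bfloat16 while inv_freq does not, so the rotary frequencies are not rounded to a lower precision.
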
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast(
rope_deltas: torch.Tensor | None = None,
mtp_per_depth_h: list[torch.Tensor] | None = None,
mtp_loss_scaling_factor: float | None = None
)
Dataclass

Bases: CausalLMOutputWithPast

Qwen3.5 causal-LM output extended with MTP auxiliary hidden states.

mtp_loss_scaling_factor
float | None = None
mtp_per_depth_h
list[Tensor] | None = None
rope_deltas
Tensor | None = None
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseBlock(
layer_idx,
config,
moe_config,
backend
)

Bases: Block

Qwen3.5 dense decoder block on top of the Qwen3-Next Block.

Identical to Qwen3_5MoeBlock except that the MLP is a plain dense MLP (no experts). The CP-aware GatedDeltaNet is built natively for linear-attention layers, and forward passes the NEAT-packing kwargs through.

linear_attn
= CPAwareGatedDeltaNet(config, layer_idx)
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseBlock.forward(
x: torch.Tensor,
freqs_cis: torch.Tensor,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
attn_kwargs: typing.Any = {}
) -> torch.Tensor
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseBlock.init_weights(
buffer_device: torch.device
)
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseMTPSublayer(
config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
layer_idx: int,
has_fusion: bool = False,
has_final_norm: bool = False,
dtype: torch.dtype = torch.bfloat16
)

Bases: Qwen3_5DecoderLayer

One full-attention Qwen3.5 dense MTP sublayer.

eh_proj
enorm
final_layernorm
hnorm
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseMTPSublayer.forward(
hidden_states: torch.Tensor,
embed_input: torch.Tensor | None = None,
rotary_emb: torch.nn.Module,
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
past_key_values: typing.Any | None = None,
kwargs: typing.Any = {}
) -> torch.Tensor
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseMTPSublayer.init_weights(
buffer_device: torch.device | None = None
) -> None
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone(
config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Module

Qwen3.5 dense text decoder rebuilt on the Qwen3-Next Block.

Native counterpart of Qwen3_5MoeTextModelBackend for the dense model: reuses the same blocks/GatedDeltaNet/norm/rotary so dense and MoE share one code path, with the fp32 SSMGate built at construction (no runtime patch).

embed_tokens
layers
norm
padding_idx
= getattr(config, 'pad_token_id', None)
rotary_emb
= Fp32SafeQwen3_5TextRotaryEmbedding(config=config)
vocab_size
= config.vocab_size
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.forward(
input_ids: torch.Tensor | None = None,
inputs_embeds: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
cache_position: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
past_key_values: typing.Any | None = None,
use_cache: bool | None = None,
output_hidden_states: bool | None = None,
attn_kwargs: typing.Any = {}
) -> transformers.modeling_outputs.BaseModelOutputWithPast
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.get_input_embeddings() -> torch.nn.Module
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.init_weights(
buffer_device: torch.device | None = None
) -> None
nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseTextBackbone.set_input_embeddings(
value: torch.nn.Module
) -> None
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM(
config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
mtp_loss_scaling_factor: float = 0.1,
num_nextn_predict_layers: int | None = None,
kwargs: typing.Any = {}
)

Bases: HFCheckpointingMixin, Module

Qwen3.5 dense causal LM with optional Megatron-style MTP head.

backend
= _qwen3_5_backend(backend)
lm_head
model
= Qwen3_5DenseTextBackbone(config, self.backend)
mtp
mtp_config
state_dict_adapter
vocab_size
= config.vocab_size
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.forward(
input_ids: torch.LongTensor | None = None,
attention_mask: torch.Tensor | None = None,
position_ids: torch.LongTensor | None = None,
past_key_values: typing.Any | None = None,
inputs_embeds: torch.FloatTensor | None = None,
labels: torch.LongTensor | None = None,
use_cache: bool | None = None,
logits_to_keep: int | torch.Tensor = 0,
kwargs: typing.Any = {}
) -> nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.from_config(
config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs: typing.Any = {}
)
classmethod
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.from_pretrained(
pretrained_model_name_or_path: str,
model_args: typing.Any = (),
kwargs: typing.Any = {}
)
classmethod
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.get_input_embeddings() -> torch.nn.Module
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.get_output_embeddings() -> torch.nn.Module
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16
) -> None
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.set_input_embeddings(
value: torch.nn.Module
) -> None
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.set_output_embeddings(
new_embeddings: torch.nn.Module
) -> None
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM.tie_weights() -> None
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration(
config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5Config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
mtp_loss_scaling_factor: float = 0.1,
num_nextn_predict_layers: int | None = None,
kwargs: typing.Any = {}
)

Bases: HFCheckpointingMixin, HFQwen3_5ForConditionalGeneration

Qwen3.5/Qwen3.6 dense VLM with optional Megatron-style MTP head.

The base VLM stays on the upstream HF implementation so image/video feature insertion, M-RoPE position handling, and generation helpers remain intact. MTP is added as an auxiliary train-time module over the final language hidden states, matching the dense text-only MTP architecture.

_pp_keep_self_forward
bool = True
backend
= _qwen3_5_backend(backend)
lm_head
= self.lm_head.to(dtype)
mtp
mtp_config
state_dict_adapter
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration._pop_staged_vlm_media(
input_ids: torch.Tensor | None,
kwargs: dict[str, typing.Any]
) -> tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.forward(
input_ids: torch.LongTensor | None = None,
attention_mask: torch.Tensor | None = None,
position_ids: torch.LongTensor | None = None,
past_key_values: typing.Any | None = None,
inputs_embeds: torch.FloatTensor | None = None,
labels: torch.LongTensor | None = None,
pixel_values: torch.Tensor | None = None,
pixel_values_videos: torch.FloatTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
video_grid_thw: torch.LongTensor | None = None,
mm_token_type_ids: torch.IntTensor | None = None,
use_cache: bool | None = None,
logits_to_keep: int | torch.Tensor = 0,
padding_mask: torch.Tensor | None = None,
kwargs: typing.Any = {}
) -> nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.from_config(
config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5Config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs: typing.Any = {}
)
classmethod
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.from_pretrained(
pretrained_model_name_or_path: str,
model_args: typing.Any = (),
kwargs: typing.Any = {}
)
classmethod
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16
) -> None
nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration.prepare_model_inputs_for_cp(
input_ids: torch.Tensor,
attention_mask: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
pixel_values: torch.Tensor | None = None,
pixel_values_videos: torch.Tensor | None = None,
image_grid_thw: torch.Tensor | None = None,
image_grid_hws: torch.Tensor | None = None,
video_grid_thw: torch.Tensor | None = None,
mm_token_type_ids: torch.Tensor | None = None,
kwargs: typing.Any = {}
) -> dict[str, torch.Tensor]

Build full-sequence multimodal embeddings and mRoPE positions before CP sharding.

The VLM->LM multimodal scatter and mRoPE get_rope_index must run on the full (unsharded) sequence; context-parallel sharding then happens on the returned inputs_embeds / position_ids via make_cp_batch_and_ctx.
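
The ordering constraint can be illustrated with a toy example. This is only a sketch under stated assumptions: it uses torch.Tensor.masked_scatter for the multimodal scatter and contiguous torch.chunk for the sharding, whereas the real sharding is done by make_cp_batch_and_ctx (not called here) and may lay out chunks differently. All tensor names and sizes are made up for illustration.

```python
import torch

batch, seq_len, hidden, cp_size = 1, 8, 4, 2

# Text embeddings for the full sequence; positions 2..5 are image placeholders.
inputs_embeds = torch.zeros(batch, seq_len, hidden)
image_positions = torch.tensor([[False, False, True, True, True, True, False, False]])
image_features = torch.arange(4 * hidden, dtype=torch.float32).reshape(4, hidden) + 1.0

# Step 1: scatter vision features into the full, unsharded sequence.
scatter_mask = image_positions.unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(scatter_mask, image_features)

# Step 2: only now split along the sequence dimension, one chunk per CP rank.
shards = inputs_embeds.chunk(cp_size, dim=1)
print([tuple(shard.shape) for shard in shards])
```

Here the image span straddles both shards, so a rank that held only its own slice of the sequence would not know which rows of image_features belong to it. Running the scatter first avoids that bookkeeping.
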

class nemo_automodel.components.models.qwen3_5.model.Qwen3_5Model()

Bases: HFQwen3_5Model

Thin VLM wrapper that exposes language_model internals as properties and routes the forward pass: the HF vision+scatter path when media is present, otherwise the NeMo dense backbone directly. Mirrors Qwen3_5MoeModel.

nemo_automodel.components.models.qwen3_5.model.Qwen3_5Model.forward(
input_ids = None,
attention_mask = None,
position_ids = None,
past_key_values = None,
inputs_embeds = None,
pixel_values = None,
pixel_values_videos = None,
image_grid_thw = None,
video_grid_thw = None,
cache_position = None,
kwargs = {}
)
nemo_automodel.components.models.qwen3_5.model._default_init_device() -> torch.device
nemo_automodel.components.models.qwen3_5.model._dense_moe_config(
config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
dtype: torch.dtype
) -> nemo_automodel.components.moe.layers.MoEConfig

Trivial MoEConfig for the dense Qwen3.5 backbone.

The dense model has no experts (num_experts is 0/absent), so Block builds a dense MLP and never consults this config; it is only required to satisfy Block.__init__’s signature.

nemo_automodel.components.models.qwen3_5.model._make_full_attention_config(
config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
layer_idx: int
) -> transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig
nemo_automodel.components.models.qwen3_5.model._mtp_block_causal_mask(
packing_mask: torch.Tensor,
inputs_embeds: torch.Tensor
) -> torch.Tensor

Build a 4D block-causal attention mask from an indexed packing mask.

packing_mask is [B, S] with the 1-based document index per token (0 = padding). The returned bool mask [B, 1, S, S] (True = attend) keeps attention causal and within each packed document, matching the backbone’s packed-sequence semantics. Used for the MTP sublayers, which run SDPA self-attention over the same packed batch (NVBugs 6330129).
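
A minimal sketch of the mask semantics described above, assuming padding tokens attend to nothing; the function name block_causal_mask_sketch is hypothetical, it omits the inputs_embeds argument of the real function, and the actual implementation may treat padding rows differently.

```python
import torch


def block_causal_mask_sketch(packing_mask: torch.Tensor) -> torch.Tensor:
    """Return a [B, 1, S, S] bool mask (True = attend) from [B, S] document ids."""
    doc_q = packing_mask.unsqueeze(-1)  # [B, S, 1], document id of each query
    doc_k = packing_mask.unsqueeze(-2)  # [B, 1, S], document id of each key
    same_doc = doc_q == doc_k
    not_padding = (doc_q != 0) & (doc_k != 0)  # assumption: id 0 never attends
    seq_len = packing_mask.shape[-1]
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=packing_mask.device)
    )
    return (same_doc & not_padding & causal).unsqueeze(1)


packing_mask = torch.tensor([[1, 1, 2, 2, 0]])  # two documents, then one pad token
mask = block_causal_mask_sketch(packing_mask)
print(mask[0, 0].int())
```

A boolean mask of this shape can be passed as attn_mask to torch.nn.functional.scaled_dot_product_attention, where True marks positions that take part in attention. Rows that are entirely False (the padding row here) can produce NaN outputs in SDPA, so a real implementation needs some handling for them.
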

nemo_automodel.components.models.qwen3_5.model._qwen3_5_backend(
backend: nemo_automodel.components.models.common.BackendConfig | None = None
) -> nemo_automodel.components.models.common.BackendConfig

Return a Qwen3.5 backend with TE fused RoPE disabled.

Qwen3.5 VLM training can feed full-attention layers in packed/THD shape via the shared Qwen3-Next attention block. TE fused RoPE expects 4D inputs there, so keep the non-fused RoPE path while preserving the rest of the backend selection (TE Linear, attention backend, etc.).
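
The "override one setting, keep the rest" behavior can be sketched with dataclasses.replace. ToyBackendConfig and its fields (linear, attn, rope_fusion) are hypothetical stand-ins; the real BackendConfig may use different field names and may not be a dataclass.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToyBackendConfig:
    """Hypothetical stand-in; the real BackendConfig fields may differ."""

    linear: str = "te"
    attn: str = "te"
    rope_fusion: bool = True


def qwen3_5_backend_sketch(backend: Optional[ToyBackendConfig] = None) -> ToyBackendConfig:
    """Return a copy of `backend` with fused RoPE off and everything else unchanged."""
    backend = backend if backend is not None else ToyBackendConfig()
    return replace(backend, rope_fusion=False)


user_backend = ToyBackendConfig(linear="te", attn="flex", rope_fusion=True)
resolved = qwen3_5_backend_sketch(user_backend)
print(resolved)
```

Returning a modified copy rather than mutating the argument leaves the caller's backend object untouched, which matters if the same config is shared with other models.
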

nemo_automodel.components.models.qwen3_5.model._resolve_mtp_num_layers(
config: typing.Any,
override: int | None = None
) -> int
nemo_automodel.components.models.qwen3_5.model._rolled_embed_inputs(
inputs_embeds: torch.Tensor,
num_depths: int
) -> tuple[torch.Tensor, ...]
nemo_automodel.components.models.qwen3_5.model._split_qwen3_5_position_ids(
position_ids: torch.Tensor | None,
batch_size: int,
seq_len: int,
device: torch.device,
past_key_values: typing.Any | None = None
) -> tuple[torch.Tensor, torch.Tensor | None]
nemo_automodel.components.models.qwen3_5.model.build_mtp_config_from_hf(
config: typing.Any,
loss_scaling_factor: float = 0.1,
num_nextn_predict_layers: int | None = None
) -> nemo_automodel.components.models.common.mtp.MTPConfig

Build Qwen3.5 MTP runtime config from HF-style config fields.

nemo_automodel.components.models.qwen3_5.model.build_qwen3_5_dense_mtp(
config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
mtp_config: nemo_automodel.components.models.common.mtp.MTPConfig,
dtype: torch.dtype
) -> nemo_automodel.components.models.common.mtp.MTPModule

Construct dense Qwen3.5 MTP blocks.

Qwen3.5 MTP follows Megatron Bridge: each depth is one full-attention Qwen3.5 decoder block, regardless of the backbone’s GatedDeltaNet layers.
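
One way to force a full-attention layer regardless of the backbone layout is to derive a per-layer config copy. The sketch below assumes the config carries a layer_types list with "linear_attention" and "full_attention" entries, which this page does not confirm; full_attention_config_sketch is a hypothetical helper and not the actual _make_full_attention_config.

```python
import copy
from types import SimpleNamespace


def full_attention_config_sketch(config, layer_idx: int):
    """Copy `config` so that layer `layer_idx` is typed as full attention."""
    mtp_layer_config = copy.deepcopy(config)  # leave the backbone config untouched
    layer_types = list(mtp_layer_config.layer_types)
    # Assumption: MTP layers may be indexed past the backbone depth, so pad if needed.
    layer_types += ["full_attention"] * (layer_idx + 1 - len(layer_types))
    layer_types[layer_idx] = "full_attention"
    mtp_layer_config.layer_types = layer_types
    return mtp_layer_config


backbone_config = SimpleNamespace(
    layer_types=["linear_attention", "linear_attention", "linear_attention", "full_attention"]
)
mtp_layer_config = full_attention_config_sketch(backbone_config, layer_idx=1)
print(mtp_layer_config.layer_types)
```
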

nemo_automodel.components.models.qwen3_5.model.ModelClass = Qwen3_5ForCausalLM