nemo_automodel.components.models.qwen3_5.model

Qwen3.5 dense causal LM with Megatron-style MTP support.

Module Contents

Classes

Qwen3_5CausalLMOutputWithPast

Qwen3.5 causal-LM output extended with MTP auxiliary hidden states.

Qwen3_5DenseMTPSublayer

One full-attention Qwen3.5 dense MTP sublayer.

Qwen3_5ForCausalLM

Qwen3.5 dense causal LM with optional Megatron-style MTP head.

Qwen3_5ForConditionalGeneration

Qwen3.5/Qwen3.6 dense VLM with optional Megatron-style MTP head.

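Functions

_resolve_mtp_num_layers

_default_init_device

build_mtp_config_from_hf

Build Qwen3.5 MTP runtime config from HF-style config fields.

_make_full_attention_config

_split_qwen3_5_position_ids

_rolled_embed_inputs

build_qwen3_5_dense_mtp

Construct dense Qwen3.5 MTP blocks.

Data

ModelClass

API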

class nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast

Bases: transformers.modeling_outputs.CausalLMOutputWithPast

Qwen3.5 causal-LM output extended with MTP auxiliary hidden states.

rope_deltas: torch.Tensor | None = None

mtp_per_depth_h: list[torch.Tensor] | None = None

mtp_loss_scaling_factor: float | None = None
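The two MTP fields suggest how a training loop might fold the auxiliary objective into the main loss. The sketch below assumes the Megatron/DeepSeek-V3 convention, in which the per-depth losses are averaged and multiplied by the scaling factor. `combine_mtp_loss` and the plain-float losses are hypothetical; this page does not state how the framework actually consumes `mtp_per_depth_h`.

```python
from typing import List, Optional


def combine_mtp_loss(
    main_loss: float,
    per_depth_losses: List[float],
    scaling_factor: Optional[float],
) -> float:
    """Return main_loss + scaling_factor * mean(per_depth_losses).

    Hypothetical helper: assumes each entry of per_depth_losses is a
    cross-entropy already computed from one element of mtp_per_depth_h.
    """
    # No MTP head (or MTP disabled): the main loss passes through unchanged.
    if not per_depth_losses or scaling_factor is None:
        return main_loss
    mtp_mean = sum(per_depth_losses) / len(per_depth_losses)
    return main_loss + scaling_factor * mtp_mean


# Two MTP depths with the documented default scaling factor of 0.1.
total = combine_mtp_loss(2.0, [3.0, 5.0], 0.1)
```

Averaging over depths keeps the auxiliary term on the same scale whether one or several MTP depths are configured.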

nemo_automodel.components.models.qwen3_5.model._resolve_mtp_num_layers(
    config: Any,
    override: int | None = None,
) -> int

nemo_automodel.components.models.qwen3_5.model._default_init_device() -> torch.device

nemo_automodel.components.models.qwen3_5.model.build_mtp_config_from_hf(
    config: Any,
    *,
    loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None,
) -> nemo_automodel.components.models.common.mtp.MTPConfig

Build Qwen3.5 MTP runtime config from HF-style config fields.
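As a rough illustration of what building a runtime config from HF-style fields could look like, the sketch below resolves the MTP depth and carries the loss scaling factor through. Everything except the two keyword names taken from the signature is an assumption: `MTPConfigSketch`, its field names, the precedence of the override, and the config attribute names probed are all hypothetical, not the library's actual logic.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class MTPConfigSketch:
    """Hypothetical stand-in for MTPConfig; the real fields are not shown here."""

    num_layers: int
    loss_scaling_factor: float


def build_mtp_config_sketch(
    config: Any,
    *,
    loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: Optional[int] = None,
) -> MTPConfigSketch:
    # Assumption: an explicit override wins over whatever the HF config says.
    if num_nextn_predict_layers is not None:
        depth = num_nextn_predict_layers
    else:
        # Assumption: the HF config exposes the depth under one of these names.
        depth = 0
        for name in ("num_nextn_predict_layers", "mtp_num_hidden_layers"):
            value = getattr(config, name, None)
            if value is not None:
                depth = int(value)
                break
    if depth < 0:
        raise ValueError(f"MTP depth must be non-negative, got {depth}")
    return MTPConfigSketch(num_layers=depth, loss_scaling_factor=loss_scaling_factor)


hf_like = SimpleNamespace(mtp_num_hidden_layers=1)
from_config = build_mtp_config_sketch(hf_like)
overridden = build_mtp_config_sketch(hf_like, num_nextn_predict_layers=0)
```

In this sketch a depth of 0 means no MTP head is built, which is one plausible reading of the "optional" MTP head described for the model classes.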

nemo_automodel.components.models.qwen3_5.model._make_full_attention_config(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    layer_idx: int,
) -> transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig

nemo_automodel.components.models.qwen3_5.model._split_qwen3_5_position_ids(
    position_ids: torch.Tensor | None,
    *,
    batch_size: int,
    seq_len: int,
    device: torch.device,
    past_key_values: Any | None = None,
) -> tuple[torch.Tensor, torch.Tensor | None]

nemo_automodel.components.models.qwen3_5.model._rolled_embed_inputs(
    inputs_embeds: torch.Tensor,
    num_depths: int,
) -> tuple[torch.Tensor, ...]
class nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseMTPSublayer(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    layer_idx: int,
    *,
    has_fusion: bool = False,
    has_final_norm: bool = False,
    dtype: torch.dtype = torch.bfloat16,
)

Bases: transformers.models.qwen3_5.modeling_qwen3_5.Qwen3_5DecoderLayer

One full-attention Qwen3.5 dense MTP sublayer.

Initialization

forward(
    hidden_states: torch.Tensor,
    *,
    embed_input: torch.Tensor | None = None,
    rotary_emb: torch.nn.Module,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    past_key_values: Any | None = None,
    **kwargs: Any,
) -> torch.Tensor

init_weights(buffer_device: torch.device | None = None) -> None

nemo_automodel.components.models.qwen3_5.model.build_qwen3_5_dense_mtp(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    mtp_config: nemo_automodel.components.models.common.mtp.MTPConfig,
    dtype: torch.dtype,
) -> nemo_automodel.components.models.common.mtp.MTPModule

Construct dense Qwen3.5 MTP blocks.

Qwen3.5 MTP follows Megatron Bridge: each depth is one full-attention Qwen3.5 decoder block, regardless of the backbone’s GatedDeltaNet layers.
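One way to force a full-attention block regardless of the backbone's layer mix is to patch a copy of the config before constructing the decoder layer. The sketch below assumes the config carries a per-layer `layer_types` list with `"linear_attention"` and `"full_attention"` entries, as HF hybrid-attention configs commonly do; `make_full_attention_config_sketch` is a hypothetical illustration and not the body of `_make_full_attention_config`.

```python
import copy
from types import SimpleNamespace


def make_full_attention_config_sketch(config, layer_idx: int):
    """Return a copy of config whose layer_idx entry is full attention.

    Assumption: config.layer_types is a list of per-layer attention kinds.
    """
    patched = copy.deepcopy(config)  # never mutate the backbone's config
    layer_types = list(patched.layer_types)
    # Pad so that an MTP layer index past the backbone depth is still valid.
    while len(layer_types) <= layer_idx:
        layer_types.append("full_attention")
    layer_types[layer_idx] = "full_attention"
    patched.layer_types = layer_types
    return patched


# A toy backbone pattern: three linear-attention layers, then full attention.
backbone = SimpleNamespace(
    layer_types=["linear_attention", "linear_attention", "linear_attention", "full_attention"]
)
mtp_cfg = make_full_attention_config_sketch(backbone, layer_idx=1)
```

Working on a deep copy keeps the backbone's own layer pattern untouched, so only the MTP block sees the forced full-attention entry.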

class nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    *,
    mtp_loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None,
    **kwargs: Any,
)

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module

Qwen3.5 dense causal LM with optional Megatron-style MTP head.

Initialization

class ModelCapabilities

Declared parallelism capabilities for this model class.

supports_tp: bool = True

supports_cp: bool = False

supports_pp: bool = True

supports_ep: bool = False
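Declared flags like these lend themselves to a pre-flight check before a parallel layout is applied. The sketch below mirrors the four flags in a stand-in class and reports which requested modes are not declared as supported; `ModelCapabilitiesSketch`, `unsupported_modes`, and the `{mode: size}` request format are hypothetical, and the framework's own validation may work differently.

```python
class ModelCapabilitiesSketch:
    """Stand-in mirroring the flags documented above."""

    supports_tp = True
    supports_cp = False
    supports_pp = True
    supports_ep = False


def unsupported_modes(capabilities, requested: dict) -> list:
    """Return the requested parallel modes that the model does not declare."""
    missing = []
    for mode, size in requested.items():
        # A size of 1 means the mode is effectively off, so it needs no support.
        if size > 1 and not getattr(capabilities, f"supports_{mode}", False):
            missing.append(mode)
    return missing


# Request tensor, pipeline, and context parallelism; expert parallelism is off.
problems = unsupported_modes(
    ModelCapabilitiesSketch, {"tp": 2, "pp": 2, "cp": 2, "ep": 1}
)
```

Here only context parallelism would be flagged, since the request sets `cp` above 1 while `supports_cp` is `False`.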

classmethod from_config(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs: Any,
)

classmethod from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args: Any,
    **kwargs: Any,
)

get_input_embeddings() -> torch.nn.Module

set_input_embeddings(value: torch.nn.Module) -> None

get_output_embeddings() -> torch.nn.Module

set_output_embeddings(new_embeddings: torch.nn.Module) -> None

tie_weights() -> None

forward(
    input_ids: torch.LongTensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.LongTensor | None = None,
    past_key_values: Any | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    use_cache: bool | None = None,
    logits_to_keep: int | torch.Tensor = 0,
    **kwargs: Any,
) -> nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast

initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16,
) -> None

class nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5Config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    *,
    mtp_loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None,
    **kwargs: Any,
)

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, transformers.models.qwen3_5.modeling_qwen3_5.Qwen3_5ForConditionalGeneration

Qwen3.5/Qwen3.6 dense VLM with optional Megatron-style MTP head.

The base VLM stays on the upstream HF implementation so image/video feature insertion, M-RoPE position handling, and generation helpers remain intact. MTP is added as an auxiliary train-time module over the final language hidden states, matching the dense text-only MTP architecture.

Initialization

_pp_keep_self_forward: bool = True

class ModelCapabilities

Declared parallelism capabilities for this model class.

supports_tp: bool = True

supports_cp: bool = False

supports_pp: bool = True

supports_ep: bool = False

classmethod from_config(
    config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5Config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs: Any,
)

classmethod from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args: Any,
    **kwargs: Any,
)

_pop_staged_vlm_media(
    input_ids: torch.Tensor | None,
    kwargs: dict[str, Any],
) -> tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]

forward(
    input_ids: torch.LongTensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.LongTensor | None = None,
    past_key_values: Any | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    pixel_values: torch.Tensor | None = None,
    pixel_values_videos: torch.FloatTensor | None = None,
    image_grid_thw: torch.LongTensor | None = None,
    video_grid_thw: torch.LongTensor | None = None,
    mm_token_type_ids: torch.IntTensor | None = None,
    use_cache: bool | None = None,
    logits_to_keep: int | torch.Tensor = 0,
    padding_mask: torch.Tensor | None = None,
    **kwargs: Any,
) -> nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast

initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16,
) -> None

nemo_automodel.components.models.qwen3_5.model.ModelClass

None