nemo_automodel.components.models.qwen3_5.model#
Qwen3.5 dense causal LM with Megatron-style multi-token prediction (MTP) support.
Module Contents#
Classes#
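| Class | Summary |
|---|---|
| Qwen3_5CausalLMOutputWithPast | Qwen3.5 causal-LM output extended with MTP auxiliary hidden states. |
| Qwen3_5DenseMTPSublayer | One full-attention Qwen3.5 dense MTP sublayer. |
| Qwen3_5ForCausalLM | Qwen3.5 dense causal LM with optional Megatron-style MTP head. |
| Qwen3_5ForConditionalGeneration | Qwen3.5/Qwen3.6 dense VLM with optional Megatron-style MTP head. |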
Functions#
| Function | Summary |
|---|---|
| build_mtp_config_from_hf | Build Qwen3.5 MTP runtime config from HF-style config fields. |
| build_qwen3_5_dense_mtp | Construct dense Qwen3.5 MTP blocks. |
Data#
API#
- class nemo_automodel.components.models.qwen3_5.model.Qwen3_5CausalLMOutputWithPast#
Bases: transformers.modeling_outputs.CausalLMOutputWithPast
Qwen3.5 causal-LM output extended with MTP auxiliary hidden states.
- rope_deltas: torch.Tensor | None#
None
- mtp_per_depth_h: list[torch.Tensor] | None#
None
- mtp_loss_scaling_factor: float | None#
None
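The two MTP fields above carry what a training loop needs to add an auxiliary loss: one hidden-state tensor per MTP depth and a scaling factor. The sketch below shows one common way to combine per-depth losses with the main language-model loss (the mean over depths, multiplied by the scaling factor). The function name `combine_mtp_loss` and the averaging rule are assumptions for illustration; this page does not state how the library combines them.

```python
from __future__ import annotations


def combine_mtp_loss(
    main_loss: float,
    per_depth_losses: list[float],
    scaling_factor: float = 0.1,
) -> float:
    """Add the scaled mean of the per-depth MTP losses to the main LM loss.

    Hypothetical helper: the averaging rule is an assumption, not taken
    from the documented module.
    """
    if not per_depth_losses:
        # No MTP depths configured: the auxiliary term vanishes.
        return main_loss
    mtp_mean = sum(per_depth_losses) / len(per_depth_losses)
    return main_loss + scaling_factor * mtp_mean


# Two MTP depths with losses 3.0 and 5.0, default scaling of 0.1.
total = combine_mtp_loss(2.0, [3.0, 5.0])
no_mtp = combine_mtp_loss(2.0, [])
```

With a scaling factor of 0.1 the auxiliary term stays small relative to the main loss, which matches the default shown for `mtp_loss_scaling_factor` in the constructors on this page.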
- nemo_automodel.components.models.qwen3_5.model._resolve_mtp_num_layers(
- config: Any,
- override: int | None = None,
- nemo_automodel.components.models.qwen3_5.model._default_init_device() -> torch.device#
- nemo_automodel.components.models.qwen3_5.model.build_mtp_config_from_hf(
- config: Any,
- *,
- loss_scaling_factor: float = 0.1,
- num_nextn_predict_layers: int | None = None,
Build Qwen3.5 MTP runtime config from HF-style config fields.
- nemo_automodel.components.models.qwen3_5.model._make_full_attention_config(
- config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
- layer_idx: int,
- nemo_automodel.components.models.qwen3_5.model._split_qwen3_5_position_ids(
- position_ids: torch.Tensor | None,
- *,
- batch_size: int,
- seq_len: int,
- device: torch.device,
- past_key_values: Any | None = None,
- nemo_automodel.components.models.qwen3_5.model._rolled_embed_inputs(
- inputs_embeds: torch.Tensor,
- num_depths: int,
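`_rolled_embed_inputs` has no docstring on this page. Judging only by its name and by the usual MTP convention, it plausibly produces one shifted copy of the input per depth, so that depth k sees the input that lies k + 1 positions ahead. The sketch below illustrates that idea on a plain list of token ids instead of an embedding tensor. The function name `rolled_inputs`, the padding value, and the choice to pad instead of wrapping around are all assumptions, not the library's behavior.

```python
from __future__ import annotations


def rolled_inputs(tokens: list[int], num_depths: int, pad: int = 0) -> list[list[int]]:
    """Return one left-shifted copy of `tokens` per MTP depth.

    Depth k (0-based) is shifted left by k + 1 positions; vacated
    trailing positions are filled with `pad`. Hypothetical helper.
    """
    rolled = []
    for depth in range(num_depths):
        # Never shift by more than the sequence length.
        shift = min(depth + 1, len(tokens))
        rolled.append(tokens[shift:] + [pad] * shift)
    return rolled


shifted = rolled_inputs([10, 11, 12, 13], num_depths=2)
```

Here `shifted[0]` is `[11, 12, 13, 0]` and `shifted[1]` is `[12, 13, 0, 0]`. In a real training loop the padded tail positions would normally be excluded from the loss.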
- class nemo_automodel.components.models.qwen3_5.model.Qwen3_5DenseMTPSublayer(
- config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
- layer_idx: int,
- *,
- has_fusion: bool = False,
- has_final_norm: bool = False,
- dtype: torch.dtype = torch.bfloat16,
Bases: transformers.models.qwen3_5.modeling_qwen3_5.Qwen3_5DecoderLayer
One full-attention Qwen3.5 dense MTP sublayer.
Initialization
- forward(
- hidden_states: torch.Tensor,
- *,
- embed_input: torch.Tensor | None = None,
- rotary_emb: torch.nn.Module,
- position_ids: torch.Tensor | None = None,
- attention_mask: torch.Tensor | None = None,
- past_key_values: Any | None = None,
- **kwargs: Any,
- init_weights(buffer_device: torch.device | None = None) -> None#
- nemo_automodel.components.models.qwen3_5.model.build_qwen3_5_dense_mtp(
- config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
- mtp_config: nemo_automodel.components.models.common.mtp.MTPConfig,
- dtype: torch.dtype,
Construct dense Qwen3.5 MTP blocks.
Qwen3.5 MTP follows Megatron Bridge: each depth is one full-attention Qwen3.5 decoder block, regardless of the backbone’s GatedDeltaNet layers.
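The rule above can be sketched as a small planning function: whatever mix of layer types the backbone uses, every MTP depth gets a full-attention block. The names `BlockSpec` and `plan_mtp_blocks` and the layer-type strings `"linear_attention"` and `"full_attention"` are assumptions made for illustration; the real builder returns actual modules and takes an `MTPConfig`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BlockSpec:
    """Hypothetical description of one MTP block."""

    depth: int
    layer_type: str


def plan_mtp_blocks(backbone_layer_types: list[str], num_depths: int) -> list[BlockSpec]:
    """Plan one full-attention block per MTP depth.

    `backbone_layer_types` is accepted only to make the point explicit:
    it does not influence the MTP block type.
    """
    return [BlockSpec(depth=d, layer_type="full_attention") for d in range(num_depths)]


# A backbone that mixes GatedDeltaNet-style and full-attention layers.
backbone = ["linear_attention"] * 3 + ["full_attention"]
plan = plan_mtp_blocks(backbone, num_depths=2)
```

Both entries of `plan` are full-attention blocks, even though three of the four backbone layers are not.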
- class nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForCausalLM(
- config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- *,
- mtp_loss_scaling_factor: float = 0.1,
- num_nextn_predict_layers: int | None = None,
- **kwargs: Any,
Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module
Qwen3.5 dense causal LM with optional Megatron-style MTP head.
Initialization
- class ModelCapabilities#
Declared parallelism capabilities for this model class.
- supports_tp: bool#
True
- supports_cp: bool#
False
- supports_pp: bool#
True
- supports_ep: bool#
False
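The capability flags above (tensor and pipeline parallelism supported, context and expert parallelism not) lend themselves to a pre-flight check before a job is launched. The sketch below mirrors the four flags in a stand-in dataclass and reports which requested parallelism axes are unsupported. `Capabilities` and `unsupported_axes` are hypothetical names; how the library itself consumes `ModelCapabilities` is not described on this page.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Stand-in mirroring the documented ModelCapabilities flags."""

    supports_tp: bool = True
    supports_cp: bool = False
    supports_pp: bool = True
    supports_ep: bool = False


def unsupported_axes(
    caps: Capabilities, *, tp: int = 1, cp: int = 1, pp: int = 1, ep: int = 1
) -> list[str]:
    """Return the parallelism axes requested with size > 1 that `caps` rejects."""
    requested = {"tp": tp, "cp": cp, "pp": pp, "ep": ep}
    return [
        axis
        for axis, size in requested.items()
        if size > 1 and not getattr(caps, f"supports_{axis}")
    ]


problems = unsupported_axes(Capabilities(), tp=2, cp=2)
```

For this request `problems` is `["cp"]`: tensor parallelism of size 2 is accepted, context parallelism of size 2 is not.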
- classmethod from_config(
- config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5TextConfig,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs: Any,
- classmethod from_pretrained(
- pretrained_model_name_or_path: str,
- *model_args: Any,
- **kwargs: Any,
- get_input_embeddings() -> torch.nn.Module#
- set_input_embeddings(value: torch.nn.Module) -> None#
- get_output_embeddings() -> torch.nn.Module#
- set_output_embeddings(new_embeddings: torch.nn.Module) -> None#
- tie_weights() -> None#
- forward(
- input_ids: torch.LongTensor | None = None,
- attention_mask: torch.Tensor | None = None,
- position_ids: torch.LongTensor | None = None,
- past_key_values: Any | None = None,
- inputs_embeds: torch.FloatTensor | None = None,
- labels: torch.LongTensor | None = None,
- use_cache: bool | None = None,
- logits_to_keep: int | torch.Tensor = 0,
- **kwargs: Any,
- initialize_weights(
- buffer_device: torch.device | None = None,
- dtype: torch.dtype = torch.bfloat16,
- class nemo_automodel.components.models.qwen3_5.model.Qwen3_5ForConditionalGeneration(
- config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5Config,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- *,
- mtp_loss_scaling_factor: float = 0.1,
- num_nextn_predict_layers: int | None = None,
- **kwargs: Any,
Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, transformers.models.qwen3_5.modeling_qwen3_5.Qwen3_5ForConditionalGeneration
Qwen3.5/Qwen3.6 dense VLM with optional Megatron-style MTP head.
The base VLM stays on the upstream HF implementation so image/video feature insertion, M-RoPE position handling, and generation helpers remain intact. MTP is added as an auxiliary train-time module over the final language hidden states, matching the dense text-only MTP architecture.
Initialization
- _pp_keep_self_forward: bool#
True
- class ModelCapabilities#
Declared parallelism capabilities for this model class.
- supports_tp: bool#
True
- supports_cp: bool#
False
- supports_pp: bool#
True
- supports_ep: bool#
False
- classmethod from_config(
- config: transformers.models.qwen3_5.configuration_qwen3_5.Qwen3_5Config,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs: Any,
- classmethod from_pretrained(
- pretrained_model_name_or_path: str,
- *model_args: Any,
- **kwargs: Any,
- _pop_staged_vlm_media(
- input_ids: torch.Tensor | None,
- kwargs: dict[str, Any],
- forward(
- input_ids: torch.LongTensor | None = None,
- attention_mask: torch.Tensor | None = None,
- position_ids: torch.LongTensor | None = None,
- past_key_values: Any | None = None,
- inputs_embeds: torch.FloatTensor | None = None,
- labels: torch.LongTensor | None = None,
- pixel_values: torch.Tensor | None = None,
- pixel_values_videos: torch.FloatTensor | None = None,
- image_grid_thw: torch.LongTensor | None = None,
- video_grid_thw: torch.LongTensor | None = None,
- mm_token_type_ids: torch.IntTensor | None = None,
- use_cache: bool | None = None,
- logits_to_keep: int | torch.Tensor = 0,
- padding_mask: torch.Tensor | None = None,
- **kwargs: Any,
- initialize_weights(
- buffer_device: torch.device | None = None,
- dtype: torch.dtype = torch.bfloat16,
- nemo_automodel.components.models.qwen3_5.model.ModelClass#
None