nemo_automodel.components.models.qwen3_5_moe.model


Qwen3.5-MoE (VL) NeMo Automodel support.

Module Contents

Classes

| Name | Description |
| --- | --- |
| Fp32SafeQwen3_5MoeTextRotaryEmbedding | Ensure inv_freq stays in float32 across .to(dtype) calls. |
| Fp32SafeQwen3_5MoeVisionRotaryEmbedding | Ensure the vision rotary inv_freq buffer remains float32. |
| Qwen3_5MoeBlock | Block that uses the Qwen3.5-MoE native GatedDeltaNet (separate in_proj_qkv, in_proj_z, in_proj_b, in_proj_a). |
| Qwen3_5MoeCausalLMOutputWithPast | Qwen3.5-MoE output extended with MTP auxiliary hidden states. |
| Qwen3_5MoeForConditionalGeneration | Qwen3.5-MoE VL conditional generation model using NeMo backend components. |
| Qwen3_5MoeMTPSublayer | One full-attention Qwen3.5-MoE MTP sublayer. |
| Qwen3_5MoeModel | Thin wrapper that exposes language_model internals as properties expected by the NeMo training loop (e.g. model.layers). |
| Qwen3_5MoeTextModelBackend | Qwen3.5-MoE text decoder rebuilt on top of the Qwen3-Next Block. |

Functions

| Name | Description |
| --- | --- |
| _default_init_device | - |
| _freqs_cis_from_rotary | - |
| _make_missing | - |
| _make_mtp_block_config | - |
| _qwen3_5_moe_backend | Return a Qwen3.5-MoE backend with TE fused RoPE disabled. |
| _resolve_mtp_num_layers | - |
| _rolled_embed_inputs | - |
| _split_qwen3_5_moe_position_ids | - |
| build_mtp_config_from_hf | Build Qwen3.5-MoE MTP runtime config from HF-style config fields. |
| build_qwen3_5_moe_mtp | Construct Qwen3.5-MoE MTP blocks. |

Data

ModelClass

_QWEN3_5_MOE_HF_AVAILABLE

API

class nemo_automodel.components.models.qwen3_5_moe.model.Fp32SafeQwen3_5MoeTextRotaryEmbedding()

Bases: Qwen3_5MoeTextRotaryEmbedding

Ensure inv_freq stays in float32 across .to(dtype) calls.

nemo_automodel.components.models.qwen3_5_moe.model.Fp32SafeQwen3_5MoeTextRotaryEmbedding._apply(
fn: typing.Any,
recurse: bool = True
)
class nemo_automodel.components.models.qwen3_5_moe.model.Fp32SafeQwen3_5MoeVisionRotaryEmbedding()

Bases: Qwen3_5MoeVisionRotaryEmbedding

Ensure the vision rotary inv_freq buffer remains float32.

nemo_automodel.components.models.qwen3_5_moe.model.Fp32SafeQwen3_5MoeVisionRotaryEmbedding._apply(
fn: typing.Any,
recurse: bool = True
)
class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeBlock(
layer_idx,
config,
moe_config,
backend
)

Bases: Block

Block that uses the Qwen3.5-MoE native GatedDeltaNet, which has separate in_proj_qkv, in_proj_z, in_proj_b, and in_proj_a projections.

linear_attn
= CPAwareGatedDeltaNet(config, layer_idx)
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeBlock.forward(
x: torch.Tensor,
freqs_cis: torch.Tensor,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
attn_kwargs: typing.Any = {}
) -> torch.Tensor

Mirror Block.forward, but thread NEAT-packing kwargs into CPAwareGatedDeltaNet.

The parent Block.forward calls linear_attn with only hidden_states and attention_mask. For packed sequences, the gated_delta_rule kernel additionally needs cu_seqlens / indices to reset state at document boundaries (issue #2131). These are derived once per forward pass from the indexed attention mask.
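
To make the role of cu_seqlens concrete, here is a minimal pure-Python sketch of deriving cumulative document boundaries from an indexed mask. The mask convention is an assumption made for illustration (each token carries a positive document index, 0 marks padding, padding only trails the row), and cu_seqlens_from_indexed_mask is a hypothetical helper, not a function of this module.

```python
def cu_seqlens_from_indexed_mask(indexed_mask: list[int]) -> list[int]:
    """Return cumulative sequence boundaries for one packed row.

    Assumed convention: tokens of the same document share a positive index,
    documents are contiguous, and 0 marks trailing padding.
    """
    valid = [doc_id for doc_id in indexed_mask if doc_id != 0]
    cu_seqlens = [0]
    for position in range(1, len(valid)):
        # A change of document index marks the start of a new document.
        if valid[position] != valid[position - 1]:
            cu_seqlens.append(position)
    if valid:
        cu_seqlens.append(len(valid))
    return cu_seqlens


# Three packed documents of lengths 3, 2 and 4, followed by two padding tokens.
row = [1, 1, 1, 2, 2, 3, 3, 3, 3, 0, 0]
print(cu_seqlens_from_indexed_mask(row))  # [0, 3, 5, 9]
```

Each consecutive pair in the result delimits one document, which is where a recurrent kernel would reset its state.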

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeBlock.init_weights(
buffer_device: torch.device
)
class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeCausalLMOutputWithPast(
mtp_per_depth_h: list[torch.Tensor] | None = None,
mtp_loss_scaling_factor: float | None = None
)
Dataclass

Bases: CausalLMOutputWithPast

Qwen3.5-MoE output extended with MTP auxiliary hidden states.

mtp_loss_scaling_factor
float | None = None
mtp_per_depth_h
list[Tensor] | None = None
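
This page does not state how mtp_loss_scaling_factor is applied. One common convention for multi-token prediction is to add the averaged per-depth auxiliary loss, scaled by the factor, to the main loss. The sketch below shows only that convention with plain floats; combine_mtp_loss is a hypothetical helper, and whether the actual training loop averages or sums across depths is an assumption.

```python
def combine_mtp_loss(main_loss: float, depth_losses: list[float], scaling_factor: float) -> float:
    """Add the scaled mean of per-depth MTP losses to the main loss (assumed convention)."""
    if not depth_losses:
        # No MTP depths: the auxiliary term vanishes.
        return main_loss
    auxiliary = sum(depth_losses) / len(depth_losses)
    return main_loss + scaling_factor * auxiliary


total = combine_mtp_loss(main_loss=2.0, depth_losses=[3.0, 5.0], scaling_factor=0.1)
print(total)
```
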
class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration(
config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeConfig,
moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
mtp_loss_scaling_factor: float = 0.1,
num_nextn_predict_layers: int | None = None,
kwargs = {}
)

Bases: HFCheckpointingMixin, HFQwen3_5MoeForConditionalGeneration, MoEFSDPSyncMixin

Qwen3.5-MoE VL conditional generation model using NeMo backend components.

_pp_keep_self_forward
bool = True
lm_head
mtp
mtp_config
pad_token_id
= pad_token_id if pad_token_id is not None else -1
state_dict_adapter
vocab_size
= text_config.vocab_size
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration.forward(
input_ids: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
inputs_embeds: torch.Tensor | None = None,
cache_position: torch.Tensor | None = None,
logits_to_keep: typing.Union[int, torch.Tensor] = 0,
output_hidden_states: typing.Optional[bool] = None,
kwargs: typing.Any = {}
)
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration.from_config(
config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeConfig,
moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
kwargs = {}
)
classmethod
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration.from_pretrained(
pretrained_model_name_or_path: str,
model_args = (),
kwargs = {}
)
classmethod
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration.initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16
) -> None
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration.prepare_model_inputs_for_cp(
input_ids: torch.Tensor,
attention_mask: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
pixel_values: torch.Tensor | None = None,
pixel_values_videos: torch.Tensor | None = None,
image_grid_thw: torch.Tensor | None = None,
image_grid_hws: torch.Tensor | None = None,
video_grid_thw: torch.Tensor | None = None,
mm_token_type_ids: torch.Tensor | None = None,
kwargs: typing.Any = {}
) -> dict[str, torch.Tensor]

Build full-sequence multimodal embeddings and mRoPE positions before CP sharding.
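
A small sketch can illustrate why positions are built on the full sequence before context-parallel (CP) sharding. It is simplified on purpose: positions are 1D integers rather than mRoPE's multi-axis positions, sharding is assumed to be contiguous chunks (real CP layouts may differ), and shard_for_cp is a hypothetical helper.

```python
def shard_for_cp(full_sequence: list, cp_size: int, cp_rank: int) -> list:
    """Return the contiguous chunk owned by cp_rank (assumed sharding layout)."""
    if len(full_sequence) % cp_size != 0:
        raise ValueError("sequence length must be divisible by cp_size")
    chunk = len(full_sequence) // cp_size
    return full_sequence[cp_rank * chunk : (cp_rank + 1) * chunk]


tokens = ["t0", "t1", "img", "img", "img", "img", "t2", "t3"]

# Positions computed once over the full sequence, then sharded with the tokens.
full_positions = list(range(len(tokens)))
rank1_positions = shard_for_cp(full_positions, cp_size=2, cp_rank=1)

# Positions computed after sharding would restart at 0 on every rank.
rank1_local_positions = list(range(len(shard_for_cp(tokens, cp_size=2, cp_rank=1))))

print(rank1_positions)        # [4, 5, 6, 7]
print(rank1_local_positions)  # [0, 1, 2, 3]
```

In this sketch only the first variant keeps rank 1's tokens at their true offsets; the same reasoning applies to image tokens that straddle a shard boundary.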

class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeMTPSublayer(
layer_idx: int,
config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeTextConfig,
moe_config: nemo_automodel.components.moe.layers.MoEConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
has_fusion: bool = False,
has_final_norm: bool = False,
dtype: torch.dtype = torch.bfloat16
)

Bases: Qwen3_5MoeBlock

One full-attention Qwen3.5-MoE MTP sublayer.

eh_proj
enorm
final_layernorm
hnorm
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeMTPSublayer.forward(
hidden_states: torch.Tensor,
embed_input: torch.Tensor | None = None,
rotary_emb: torch.nn.Module,
position_ids: torch.Tensor,
attention_mask: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
attn_kwargs: typing.Any = {}
) -> torch.Tensor
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeMTPSublayer.init_weights(
buffer_device: torch.device
) -> None
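
The attribute names eh_proj, enorm, and hnorm suggest the fusion step used by multi-token-prediction layers in other models: normalize the token embedding and the incoming hidden state separately, concatenate them, and project back to the hidden size. The sketch below shows that general technique only. SketchMTPFusion is hypothetical, nn.LayerNorm stands in for whatever norm the real sublayer uses, and the concatenation order is an assumption.

```python
import torch
from torch import nn


class SketchMTPFusion(nn.Module):
    """Hypothetical fusion step: project [norm(embed); norm(hidden)] to hidden size."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.enorm = nn.LayerNorm(hidden_size, eps=eps)  # norm for token embeddings
        self.hnorm = nn.LayerNorm(hidden_size, eps=eps)  # norm for hidden states
        self.eh_proj = nn.Linear(2 * hidden_size, hidden_size, bias=False)

    def forward(self, hidden_states: torch.Tensor, embed_input: torch.Tensor) -> torch.Tensor:
        # Concatenate along the feature axis, then project 2H -> H.
        fused = torch.cat([self.enorm(embed_input), self.hnorm(hidden_states)], dim=-1)
        return self.eh_proj(fused)


fusion = SketchMTPFusion(hidden_size=16)
hidden = torch.randn(2, 5, 16)
embeds = torch.randn(2, 5, 16)
fused = fusion(hidden, embeds)
print(fused.shape)
```
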
class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeModel()

Bases: HFQwen3_5MoeModel

Thin wrapper that exposes language_model internals as properties expected by the NeMo training loop (e.g. model.layers).
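
The wrapper pattern described above can be sketched with read-only properties that delegate to an inner object. SketchLanguageModel and SketchWrapper are hypothetical classes; only model.layers is named on this page, and exposing embed_tokens the same way is an assumption.

```python
class SketchLanguageModel:
    """Hypothetical inner text model holding the real submodules."""

    def __init__(self):
        self.layers = ["layer0", "layer1"]
        self.embed_tokens = "embed"


class SketchWrapper:
    """Expose inner attributes at the top level without copying them."""

    def __init__(self):
        self.language_model = SketchLanguageModel()

    @property
    def layers(self):
        return self.language_model.layers

    @property
    def embed_tokens(self):
        return self.language_model.embed_tokens


model = SketchWrapper()
print(model.layers)
```

Because the properties return the inner objects themselves, code that iterates model.layers operates on the same layers the inner model runs.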

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeModel.forward(
input_ids = None,
attention_mask = None,
position_ids = None,
past_key_values = None,
inputs_embeds = None,
pixel_values = None,
pixel_values_videos = None,
image_grid_thw = None,
video_grid_thw = None,
cache_position = None,
kwargs = {}
)
class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeTextModelBackend(
config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeTextConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
moe_overrides: dict | None = None
)

Bases: Module

Qwen3.5-MoE text decoder rebuilt on top of the Qwen3-Next Block.

embed_tokens
layers
moe_config
= moe_config or MoEConfig(**moe_defaults)
norm
padding_idx
= getattr(config, 'pad_token_id', None)
rotary_emb
vocab_size
= config.vocab_size
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeTextModelBackend.forward(
input_ids: torch.Tensor | None = None,
inputs_embeds: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
position_ids: torch.Tensor | None = None,
cache_position: torch.Tensor | None = None,
padding_mask: torch.Tensor | None = None,
past_key_values: typing.Any | None = None,
use_cache: bool | None = None,
attn_kwargs: typing.Any = {}
) -> transformers.models.qwen3_5_moe.modeling_qwen3_5_moe.Qwen3_5MoeModelOutputWithPast
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeTextModelBackend.get_input_embeddings() -> torch.nn.Module
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeTextModelBackend.init_weights(
buffer_device: torch.device | None = None
) -> None
nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeTextModelBackend.set_input_embeddings(
value: torch.nn.Module
) -> None
nemo_automodel.components.models.qwen3_5_moe.model._default_init_device() -> torch.device
nemo_automodel.components.models.qwen3_5_moe.model._freqs_cis_from_rotary(
rotary_emb: torch.nn.Module,
hidden_states: torch.Tensor,
position_ids: torch.Tensor
) -> torch.Tensor
nemo_automodel.components.models.qwen3_5_moe.model._make_missing(
name: str
)
nemo_automodel.components.models.qwen3_5_moe.model._make_mtp_block_config(
config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeTextConfig,
layer_idx: int
) -> transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeTextConfig
nemo_automodel.components.models.qwen3_5_moe.model._qwen3_5_moe_backend(
backend: nemo_automodel.components.models.common.BackendConfig | None = None
) -> nemo_automodel.components.models.common.BackendConfig

Return a Qwen3.5-MoE backend with TE fused RoPE disabled.

The Qwen3.5 full-attention blocks reuse Qwen3-Next attention, and VLM/packed execution can present THD-shaped (packed, 3D) q/k tensors. Transformer Engine (TE) fused RoPE expects 4D inputs in this path, so this helper selects non-fused RoPE while preserving the rest of the backend.
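
The "change one setting, preserve the rest" behavior can be sketched with dataclasses.replace. SketchBackendConfig and its fields, including rope_fusion, are hypothetical stand-ins and are not claimed to match the real BackendConfig.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchBackendConfig:
    """Hypothetical backend settings; field names are illustrative only."""

    attn: str = "te"
    linear: str = "te"
    rope_fusion: bool = True


def with_unfused_rope(backend: Optional[SketchBackendConfig] = None) -> SketchBackendConfig:
    """Return a copy with fused RoPE disabled and every other field untouched."""
    base = backend if backend is not None else SketchBackendConfig()
    return replace(base, rope_fusion=False)


original = SketchBackendConfig(attn="sdpa", linear="torch")
adjusted = with_unfused_rope(original)
print(adjusted)
```

Returning a copy keeps the caller's config unchanged, which is useful when the same backend object is shared across models.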

nemo_automodel.components.models.qwen3_5_moe.model._resolve_mtp_num_layers(
config: typing.Any,
override: int | None = None
) -> int
nemo_automodel.components.models.qwen3_5_moe.model._rolled_embed_inputs(
inputs_embeds: torch.Tensor,
num_depths: int
) -> tuple[torch.Tensor, ...]
nemo_automodel.components.models.qwen3_5_moe.model._split_qwen3_5_moe_position_ids(
position_ids: torch.Tensor | None,
batch_size: int,
seq_len: int,
device: torch.device,
cache_position: torch.Tensor | None = None
) -> torch.Tensor
nemo_automodel.components.models.qwen3_5_moe.model.build_mtp_config_from_hf(
config: typing.Any,
loss_scaling_factor: float = 0.1,
num_nextn_predict_layers: int | None = None
) -> nemo_automodel.components.models.common.mtp.MTPConfig

Build Qwen3.5-MoE MTP runtime config from HF-style config fields.
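
As a rough illustration of building a runtime config from HF-style fields, the sketch below lets an explicit override win and otherwise reads a field from the config object. SketchMTPConfig, its field names, and the config attribute read by getattr are assumptions; the real MTPConfig and the HF field it consults may differ.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class SketchMTPConfig:
    """Hypothetical runtime config for multi-token prediction."""

    num_layers: int
    loss_scaling_factor: float


def sketch_build_mtp_config(
    config: Any,
    loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: Optional[int] = None,
) -> SketchMTPConfig:
    if num_nextn_predict_layers is not None:
        # An explicit override takes precedence over the checkpoint config.
        num_layers = num_nextn_predict_layers
    else:
        # Assumed field name; a missing or None field means MTP is disabled.
        num_layers = getattr(config, "num_nextn_predict_layers", None) or 0
    return SketchMTPConfig(num_layers=int(num_layers), loss_scaling_factor=loss_scaling_factor)


hf_like = SimpleNamespace(num_nextn_predict_layers=1)
print(sketch_build_mtp_config(hf_like))
print(sketch_build_mtp_config(hf_like, num_nextn_predict_layers=2))
```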

nemo_automodel.components.models.qwen3_5_moe.model.build_qwen3_5_moe_mtp(
config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeTextConfig,
mtp_config: nemo_automodel.components.models.common.mtp.MTPConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
moe_config: nemo_automodel.components.moe.layers.MoEConfig,
dtype: torch.dtype
) -> nemo_automodel.components.models.common.mtp.MTPModule

Construct Qwen3.5-MoE MTP blocks.

nemo_automodel.components.models.qwen3_5_moe.model.ModelClass = Qwen3_5MoeForConditionalGeneration
nemo_automodel.components.models.qwen3_5_moe.model._QWEN3_5_MOE_HF_AVAILABLE = True