nemo_automodel.components.models.qwen3_5_moe.model

Qwen3.5-MoE (VL) NeMo Automodel support.

Module Contents

Classes

Name	Description
`Fp32SafeQwen3_5MoeTextRotaryEmbedding`	Ensure inv_freq stays in float32 across `.to(dtype)` calls.
`Fp32SafeQwen3_5MoeVisionRotaryEmbedding`	Ensure the vision rotary inv_freq buffer remains float32.
`Qwen3_5MoeBlock`	Block that uses the Qwen3.5-MoE native GatedDeltaNet (separate in_proj_qkv, in_proj_z, in_proj_b, in_proj_a).
`Qwen3_5MoeCausalLMOutputWithPast`	Qwen3.5-MoE output extended with MTP auxiliary hidden states.
`Qwen3_5MoeForConditionalGeneration`	Qwen3.5-MoE VL conditional generation model using NeMo backend components.
`Qwen3_5MoeMTPSublayer`	One full-attention Qwen3.5-MoE MTP sublayer.
`Qwen3_5MoeModel`	Thin wrapper that exposes `language_model` internals as properties expected by the NeMo training loop (e.g. `model.layers`).
`Qwen3_5MoeTextModelBackend`	Qwen3.5-MoE text decoder rebuilt on top of the Qwen3-Next Block.

Functions

Name	Description
`_default_init_device`	-
`_freqs_cis_from_rotary`	-
`_make_missing`	-
`_make_mtp_block_config`	-
`_qwen3_5_moe_backend`	Return a Qwen3.5-MoE backend with TE fused RoPE disabled.
`_resolve_mtp_num_layers`	-
`_rolled_embed_inputs`	-
`_split_qwen3_5_moe_position_ids`	-
`build_mtp_config_from_hf`	Build Qwen3.5-MoE MTP runtime config from HF-style config fields.
`build_qwen3_5_moe_mtp`	Construct Qwen3.5-MoE MTP blocks.

Data

ModelClass

_QWEN3_5_MOE_HF_AVAILABLE

API

class nemo_automodel.components.models.qwen3_5_moe.model.Fp32SafeQwen3_5MoeTextRotaryEmbedding()

Bases: Qwen3_5MoeTextRotaryEmbedding

Ensure inv_freq stays in float32 across .to(dtype) calls.

nemo_automodel.components.models.qwen3_5_moe.model.Fp32SafeQwen3_5MoeTextRotaryEmbedding._apply(
    fn: typing.Any,
    recurse: bool = True
)
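One likely reason to pin inv_freq to float32 is that rotary angles are the product of a position and an inverse frequency, so a small rounding error in the frequency grows with sequence position. The sketch below illustrates this in plain Python by rounding standard RoPE inverse frequencies to bfloat16. The head dimension (64), base (10000) and position (4096) are illustrative assumptions, not values taken from this module, and the helper names are hypothetical.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # bfloat16 keeps only the top 16 bits of the float32 encoding.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def rope_inv_freq(dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies: base ** (-2i / dim)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


reference = rope_inv_freq(dim=64)                # full-precision reference
downcast = [to_bfloat16(f) for f in reference]   # what a .to(bfloat16) cast would store

position = 4096  # a token far into a long sequence
max_angle_error = max(abs(position * (a - b)) for a, b in zip(reference, downcast))
print(f"max rotation-angle error at position {position}: {max_angle_error:.3f} rad")
```

Under these assumptions the angle error reaches a sizeable fraction of a radian, which is why a buffer like inv_freq is worth excluding from blanket dtype casts.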

class nemo_automodel.components.models.qwen3_5_moe.model.Fp32SafeQwen3_5MoeVisionRotaryEmbedding()

Bases: Qwen3_5MoeVisionRotaryEmbedding

Ensure the vision rotary inv_freq buffer remains float32.

nemo_automodel.components.models.qwen3_5_moe.model.Fp32SafeQwen3_5MoeVisionRotaryEmbedding._apply(
    fn: typing.Any,
    recurse: bool = True
)

class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeBlock(
    layer_idx,
    config,
    moe_config,
    backend
)

Bases: Block

Block that uses the Qwen3.5-MoE native GatedDeltaNet (separate in_proj_qkv, in_proj_z, in_proj_b, and in_proj_a projections).

linear_attn

= CPAwareGatedDeltaNet(config, layer_idx)

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeBlock.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor

Mirror Block.forward but thread NEAT-packing kwargs into CPAwareGatedDeltaNet.

The parent Block.forward calls linear_attn with only hidden_states and attention_mask. For packed sequences, the gated_delta_rule kernel additionally needs cu_seqlens / indices to reset state at document boundaries (issue #2131). These values are derived once per forward pass from the indexed attention mask.
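To make the packing metadata concrete, here is a minimal pure-Python sketch of deriving cu_seqlens and indices from one row of an indexed attention mask. It assumes a convention in which 0 marks padding and each packed document's tokens share one positive id stored contiguously. The function name and that convention are assumptions for illustration; the actual derivation in this module may differ.

```python
from itertools import groupby


def packing_metadata(mask_row: list[int]) -> tuple[list[int], list[int]]:
    """Return (cu_seqlens, indices) for one packed row.

    indices    : positions of non-padding tokens
    cu_seqlens : cumulative document boundaries over those tokens
    """
    indices = [i for i, doc_id in enumerate(mask_row) if doc_id != 0]
    cu_seqlens = [0]
    # Each run of equal document ids is one document in the pack.
    for _, run in groupby(mask_row[i] for i in indices):
        cu_seqlens.append(cu_seqlens[-1] + len(list(run)))
    return cu_seqlens, indices


# Three documents of lengths 3, 2 and 4, followed by two padding tokens.
mask = [1, 1, 1, 2, 2, 3, 3, 3, 3, 0, 0]
cu_seqlens, indices = packing_metadata(mask)
print(cu_seqlens)  # [0, 3, 5, 9]
```

Each adjacent pair in cu_seqlens delimits one document, which is the information a recurrent kernel needs in order to reset its state between documents.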

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeBlock.init_weights(
    buffer_device: torch.device
)

class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeCausalLMOutputWithPast(
    mtp_per_depth_h: list[torch.Tensor] | None = None,
    mtp_loss_scaling_factor: float | None = None
)

Dataclass

Bases: CausalLMOutputWithPast

Qwen3.5-MoE output extended with MTP auxiliary hidden states.

mtp_loss_scaling_factor

float | None = None

mtp_per_depth_h

list[Tensor] | None = None

class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration(
    config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeConfig,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    mtp_loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None,
    kwargs = {}
)

Bases: HFCheckpointingMixin, HFQwen3_5MoeForConditionalGeneration, MoEFSDPSyncMixin

Qwen3.5-MoE VL conditional generation model using NeMo backend components.

_pp_keep_self_forward

bool = True

lm_head

mtp

mtp_config

pad_token_id

= pad_token_id if pad_token_id is not None else -1

state_dict_adapter

vocab_size

= text_config.vocab_size

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration.forward(
    input_ids: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    cache_position: torch.Tensor | None = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    output_hidden_states: typing.Optional[bool] = None,
    kwargs: typing.Any = {}
)

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration.from_config(
    config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeConfig,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)

classmethod

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    model_args = (),
    kwargs = {}
)

classmethod

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeForConditionalGeneration.prepare_model_inputs_for_cp(
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    pixel_values: torch.Tensor | None = None,
    pixel_values_videos: torch.Tensor | None = None,
    image_grid_thw: torch.Tensor | None = None,
    image_grid_hws: torch.Tensor | None = None,
    video_grid_thw: torch.Tensor | None = None,
    mm_token_type_ids: torch.Tensor | None = None,
    kwargs: typing.Any = {}
) -> dict[str, torch.Tensor]

Build full-sequence multimodal embeddings and mRoPE positions before CP sharding.

class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeMTPSublayer(
    layer_idx: int,
    config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeTextConfig,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    has_fusion: bool = False,
    has_final_norm: bool = False,
    dtype: torch.dtype = torch.bfloat16
)

Bases: Qwen3_5MoeBlock

One full-attention Qwen3.5-MoE MTP sublayer.

eh_proj

enorm

final_layernorm

hnorm

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeMTPSublayer.forward(
    hidden_states: torch.Tensor,
    embed_input: torch.Tensor | None = None,
    rotary_emb: torch.nn.Module,
    position_ids: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeMTPSublayer.init_weights(
    buffer_device: torch.device
) -> None

class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeModel()

Bases: HFQwen3_5MoeModel

Thin wrapper that exposes language_model internals as properties expected by the NeMo training loop (e.g. model.layers).
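The property-forwarding pattern described above can be sketched with stand-in classes. Only model.layers is named in the description; the other forwarded attributes (embed_tokens, norm) and both class names are assumptions chosen for illustration.

```python
class SketchTextDecoder:
    """Stand-in for the inner language model (hypothetical)."""

    def __init__(self, num_layers: int) -> None:
        self.layers = [f"block_{i}" for i in range(num_layers)]
        self.embed_tokens = "embedding"
        self.norm = "final_norm"


class SketchModelWrapper:
    """Exposes selected language_model internals as read-only properties."""

    def __init__(self, language_model: SketchTextDecoder) -> None:
        self.language_model = language_model

    @property
    def layers(self):
        return self.language_model.layers

    @property
    def embed_tokens(self):
        return self.language_model.embed_tokens

    @property
    def norm(self):
        return self.language_model.norm


decoder = SketchTextDecoder(num_layers=2)
model = SketchModelWrapper(decoder)
print(model.layers)  # ['block_0', 'block_1']
```

Because the properties return the inner objects rather than copies, code that iterates over model.layers operates on the same modules the language model owns.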

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeModel.forward(
    input_ids = None,
    attention_mask = None,
    position_ids = None,
    past_key_values = None,
    inputs_embeds = None,
    pixel_values = None,
    pixel_values_videos = None,
    image_grid_thw = None,
    video_grid_thw = None,
    cache_position = None,
    kwargs = {}
)

class nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeTextModelBackend(
    config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeTextConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
    moe_overrides: dict | None = None
)

Bases: Module

Qwen3.5-MoE text decoder rebuilt on top of the Qwen3-Next Block.

embed_tokens

layers

moe_config

= moe_config or MoEConfig(**moe_defaults)

norm

padding_idx

= getattr(config, 'pad_token_id', None)

rotary_emb

vocab_size

= config.vocab_size

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeTextModelBackend.forward(
    input_ids: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    cache_position: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    past_key_values: typing.Any | None = None,
    use_cache: bool | None = None,
    attn_kwargs: typing.Any = {}
) -> transformers.models.qwen3_5_moe.modeling_qwen3_5_moe.Qwen3_5MoeModelOutputWithPast

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeTextModelBackend.get_input_embeddings() -> torch.nn.Module

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeTextModelBackend.init_weights(
    buffer_device: torch.device | None = None
) -> None

nemo_automodel.components.models.qwen3_5_moe.model.Qwen3_5MoeTextModelBackend.set_input_embeddings(
    value: torch.nn.Module
) -> None

nemo_automodel.components.models.qwen3_5_moe.model._default_init_device() -> torch.device

nemo_automodel.components.models.qwen3_5_moe.model._freqs_cis_from_rotary(
    rotary_emb: torch.nn.Module,
    hidden_states: torch.Tensor,
    position_ids: torch.Tensor
) -> torch.Tensor

nemo_automodel.components.models.qwen3_5_moe.model._make_missing(
    name: str
)

nemo_automodel.components.models.qwen3_5_moe.model._make_mtp_block_config(
    config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeTextConfig,
    layer_idx: int
) -> transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeTextConfig

nemo_automodel.components.models.qwen3_5_moe.model._qwen3_5_moe_backend(
    backend: nemo_automodel.components.models.common.BackendConfig | None = None
) -> nemo_automodel.components.models.common.BackendConfig

Return a Qwen3.5-MoE backend with TE fused RoPE disabled.

The Qwen3.5 full-attention blocks reuse Qwen3-Next attention, and VLM/packed execution can present THD-shaped (token-packed) q/k tensors. Transformer Engine (TE) fused RoPE expects 4D inputs in this path, so this helper selects non-fused RoPE while preserving the rest of the backend.
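The "override one flag, keep everything else" behavior can be sketched with a frozen dataclass and dataclasses.replace. SketchBackendConfig and its fields (attn, rms_norm, rope_fusion) are hypothetical stand-ins; the real BackendConfig may use different field names and defaults.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchBackendConfig:
    """Hypothetical stand-in for a backend config."""

    attn: str = "te"
    rms_norm: str = "te"
    rope_fusion: bool = True


def backend_without_fused_rope(
    backend: Optional[SketchBackendConfig] = None,
) -> SketchBackendConfig:
    """Return a copy of backend with fused RoPE off; fall back to defaults if None."""
    base = backend if backend is not None else SketchBackendConfig()
    # replace() builds a new instance, so the caller's config is not mutated.
    return replace(base, rope_fusion=False)


user_backend = SketchBackendConfig(attn="sdpa")
effective = backend_without_fused_rope(user_backend)
print(effective)
```

Returning a modified copy rather than mutating the argument keeps a shared backend object safe to reuse for other models.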

nemo_automodel.components.models.qwen3_5_moe.model._resolve_mtp_num_layers(
    config: typing.Any,
    override: int | None = None
) -> int

nemo_automodel.components.models.qwen3_5_moe.model._rolled_embed_inputs(
    inputs_embeds: torch.Tensor,
    num_depths: int
) -> tuple[torch.Tensor, ...]

nemo_automodel.components.models.qwen3_5_moe.model._split_qwen3_5_moe_position_ids(
    position_ids: torch.Tensor | None,
    batch_size: int,
    seq_len: int,
    device: torch.device,
    cache_position: torch.Tensor | None = None
) -> torch.Tensor

nemo_automodel.components.models.qwen3_5_moe.model.build_mtp_config_from_hf(
    config: typing.Any,
    loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None
) -> nemo_automodel.components.models.common.mtp.MTPConfig

Build Qwen3.5-MoE MTP runtime config from HF-style config fields.

nemo_automodel.components.models.qwen3_5_moe.model.build_qwen3_5_moe_mtp(
    config: transformers.models.qwen3_5_moe.configuration_qwen3_5_moe.Qwen3_5MoeTextConfig,
    mtp_config: nemo_automodel.components.models.common.mtp.MTPConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig,
    dtype: torch.dtype
) -> nemo_automodel.components.models.common.mtp.MTPModule

Construct Qwen3.5-MoE MTP blocks.

nemo_automodel.components.models.qwen3_5_moe.model.ModelClass = Qwen3_5MoeForConditionalGeneration

nemo_automodel.components.models.qwen3_5_moe.model._QWEN3_5_MOE_HF_AVAILABLE = True