nemo_automodel.components.models.qwen3_5.model
Qwen3.5 dense causal LM with Megatron-style MTP support.
Module Contents
Classes
Functions
Data
API
Bases: Qwen3_5TextRotaryEmbedding
Ensure inv_freq stays in float32 across .to(dtype) calls.
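One way such a guard could look is sketched below. This is not the library's implementation: the class name Fp32InvFreqRotary, its constructor arguments, and the choice to override nn.Module._apply are all assumptions made for illustration. The sketch saves the float32 buffer before the cast and restores it afterwards, so the values are not rounded through a lower-precision dtype.

```python
import torch
from torch import nn


class Fp32InvFreqRotary(nn.Module):
    """Hypothetical stand-in for a rotary embedding that pins inv_freq to fp32."""

    def __init__(self, dim: int = 8, base: float = 10000.0):
        super().__init__()
        exponents = torch.arange(0, dim, 2, dtype=torch.float32) / dim
        inv_freq = 1.0 / (base ** exponents)
        # Non-persistent buffer: it moves with the module but is not checkpointed.
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def _apply(self, fn, *args, **kwargs):
        # Keep an exact fp32 copy before nn.Module casts or moves the buffers.
        fp32_copy = self.inv_freq.detach().clone()
        super()._apply(fn, *args, **kwargs)
        # Follow the device change, but undo any dtype change on inv_freq.
        self.inv_freq = fp32_copy.to(device=self.inv_freq.device, dtype=torch.float32)
        return self


rotary = Fp32InvFreqRotary()
rotary.to(torch.bfloat16)
print(rotary.inv_freq.dtype)
```

Restoring from the saved copy, rather than casting the bf16 result back to float32, matters because a round trip through bf16 would keep the dtype but lose mantissa bits in the frequencies.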
Bases: CausalLMOutputWithPast
Qwen3.5 causal-LM output extended with MTP auxiliary hidden states.
Bases: Block
Qwen3.5 dense decoder block on top of the Qwen3-Next Block.
Identical to Qwen3_5MoeBlock except that the MoE MLP is replaced by a dense
MLP (no experts). The CP-aware GatedDeltaNet is built natively for
linear-attention layers, and the forward pass threads NEAT-packing kwargs.
Bases: Qwen3_5DecoderLayer
One full-attention Qwen3.5 dense MTP sublayer.
Bases: Module
Qwen3.5 dense text decoder rebuilt on the Qwen3-Next Block.
Native counterpart of Qwen3_5MoeTextModelBackend for the dense model:
reuses the same blocks/GatedDeltaNet/norm/rotary so dense and MoE share one
code path, with the fp32 SSMGate built at construction (no runtime patch).
Bases: HFCheckpointingMixin, Module
Qwen3.5 dense causal LM with optional Megatron-style MTP head.
Bases: HFCheckpointingMixin, HFQwen3_5ForConditionalGeneration
Qwen3.5/Qwen3.6 dense VLM with optional Megatron-style MTP head.
The base VLM stays on the upstream HF implementation so image/video feature insertion, M-RoPE position handling, and generation helpers remain intact. MTP is added as an auxiliary train-time module over the final language hidden states, matching the dense text-only MTP architecture.
Build full-sequence multimodal embeddings and mRoPE positions before CP sharding.
The VLM->LM multimodal scatter and mRoPE get_rope_index must run on the
full (unsharded) sequence; context-parallel sharding then happens on the
returned inputs_embeds / position_ids via make_cp_batch_and_ctx.
Bases: HFQwen3_5Model
Thin VLM wrapper that exposes language_model internals as properties and
routes the forward pass: the HF vision+scatter path when media is present,
otherwise the NeMo dense backbone directly. Mirrors Qwen3_5MoeModel.
Trivial MoEConfig for the dense Qwen3.5 backbone.
The dense model has no experts (num_experts is 0/absent), so Block
builds a dense MLP and never consults this config; it is only required to
satisfy Block.__init__’s signature.
Build a 4D block-causal attention mask from an indexed packing mask.
packing_mask is [B, S] with the 1-based document index per token
(0 = padding). The returned bool mask [B, 1, S, S] (True = attend)
keeps attention causal and within each packed document, matching the
backbone’s packed-sequence semantics. Used for the MTP sublayers, which run
SDPA self-attention over the same packed batch (NVBugs 6330129).
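The mask semantics described above can be sketched in a few lines of PyTorch. The function name block_causal_mask is hypothetical and the body is an illustration of the stated contract ([B, S] document indices in, [B, 1, S, S] bool mask out), not the module's actual code.

```python
import torch


def block_causal_mask(packing_mask: torch.Tensor) -> torch.Tensor:
    """Sketch: [B, S] 1-based document ids (0 = padding) -> [B, 1, S, S] bool mask."""
    seq_len = packing_mask.shape[1]
    query_doc = packing_mask[:, :, None]  # [B, S, 1]
    key_doc = packing_mask[:, None, :]    # [B, 1, S]

    # A query may only see keys from its own document, and never padding.
    same_doc = (query_doc == key_doc) & (query_doc != 0) & (key_doc != 0)

    # Lower-triangular mask keeps attention causal (key position <= query position).
    causal = torch.tril(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=packing_mask.device)
    )

    # Add the broadcastable head dimension: [B, S, S] -> [B, 1, S, S].
    return (same_doc & causal)[:, None, :, :]


packing = torch.tensor([[1, 1, 2, 2, 0]])  # two documents, one padding token
mask = block_causal_mask(packing)
print(mask.shape)
```

In this sketch the rows belonging to padding tokens are entirely False. A production implementation may need to treat those rows differently, since a fully masked row can yield NaNs in softmax-based attention.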
Return a Qwen3.5 backend with TE fused RoPE disabled.
Qwen3.5 VLM training can feed full-attention layers in packed/THD shape via the shared Qwen3-Next attention block. TE fused RoPE expects 4D inputs there, so keep the non-fused RoPE path while preserving the rest of the backend selection (TE Linear, attention backend, etc.).
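The "change one flag, keep everything else" behavior can be illustrated with a plain dataclass. SketchBackend and its field names (linear, attn, rms_norm, rope_fusion) are hypothetical placeholders for this example and are not claimed to match the real backend config class.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchBackend:
    """Hypothetical backend selection; field names are assumptions."""

    linear: str = "te"
    attn: str = "te"
    rms_norm: str = "te"
    rope_fusion: bool = True


def without_fused_rope(backend: SketchBackend) -> SketchBackend:
    # Copy the backend, overriding only the RoPE fusion flag; the caller's
    # object is left untouched and all other selections carry over.
    return replace(backend, rope_fusion=False)


original = SketchBackend(attn="sdpa")
patched = without_fused_rope(original)
print(patched)
```

Returning a copy instead of mutating the argument keeps the caller's backend usable for other models that can still take the fused path.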
Build Qwen3.5 MTP runtime config from HF-style config fields.
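As a rough illustration of deriving a runtime config from HF-style fields, the sketch below reads optional attributes with defaults. Everything named here is an assumption: SketchMTPConfig, build_sketch_mtp_config, the attribute names mtp_num_hidden_layers and mtp_loss_scaling_factor, and the placeholder default of 0.1 are invented for the example and may not match the real function.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class SketchMTPConfig:
    """Hypothetical runtime config for the MTP head."""

    num_layers: int              # number of MTP depths; 0 disables MTP
    loss_scaling_factor: float   # weight of the auxiliary MTP loss

    @property
    def enabled(self) -> bool:
        return self.num_layers > 0


def build_sketch_mtp_config(hf_config) -> SketchMTPConfig:
    # Attribute names and defaults below are assumptions for illustration only.
    # A missing or None field falls back to "MTP disabled".
    num_layers = getattr(hf_config, "mtp_num_hidden_layers", 0) or 0
    scale = getattr(hf_config, "mtp_loss_scaling_factor", 0.1)
    return SketchMTPConfig(num_layers=int(num_layers), loss_scaling_factor=float(scale))


cfg = build_sketch_mtp_config(SimpleNamespace(mtp_num_hidden_layers=1))
print(cfg.enabled)
```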
Construct dense Qwen3.5 MTP blocks.
Qwen3.5 MTP follows Megatron Bridge: each depth is one full-attention Qwen3.5 decoder block, regardless of the backbone’s GatedDeltaNet layers.