nemo_automodel.components.models.qwen3_omni_moe.model
Module Contents
Classes
Data
API
Bases: HFCheckpointingMixin, HFQwen3OmniMoeThinkerForConditionalGeneration, MoEFSDPSyncMixin
Qwen3OmniMoe Thinker for Conditional Generation with multimodal support.
Forward pass with multimodal fusion.
Parameters:
Input token IDs
Audio input features
Image pixel values
Video pixel values
Image grid (temporal, height, width)
Video grid (temporal, height, width)
Attention mask
Feature attention mask for audio
Audio feature lengths
Position IDs (3D for MRoPE)
Padding mask
Optional pre-computed input embeddings
Labels for loss computation
Whether to output router logits
Whether audio is in video
Seconds per grid for videos
If > 0, only compute logits for the last logits_to_keep token positions (0 = all positions). Enables memory-efficient fused cross-entropy by letting the recipe request a single-position lm_head projection alongside the final hidden states.
When set, the returned output carries the final hidden states (the input to lm_head) so the recipe can run fused linear cross-entropy.
Additional attention arguments
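The logits_to_keep behaviour described above can be illustrated with a small, dependency-free sketch. It uses plain Python lists rather than tensors, and the helper names select_positions and lm_head_project are hypothetical, introduced only for illustration; they are not part of this module. The sketch assumes that the selection happens on the final hidden states before the lm_head projection, as the parameter description suggests.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Pick the positions that would be projected to logits.

    hidden_states: list of sequences, each a list of hidden vectors.
    logits_to_keep: 0 keeps every position; k > 0 keeps only the last k.
    (Hypothetical helper, not the library's implementation.)
    """
    if logits_to_keep < 0:
        raise ValueError("logits_to_keep must be >= 0")
    if logits_to_keep == 0:
        return [list(seq) for seq in hidden_states]
    return [seq[-logits_to_keep:] for seq in hidden_states]


def lm_head_project(vectors, weight):
    """Toy lm_head: dot each hidden vector with each vocabulary row."""
    return [[sum(h * w for h, w in zip(vec, row)) for row in weight]
            for vec in vectors]


# One sequence with 4 positions, hidden size 2, toy vocabulary of size 3.
hidden = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# 0 = all positions: 4 rows of logits per sequence.
all_logits = [lm_head_project(seq, weight)
              for seq in select_positions(hidden, 0)]

# 1 = last position only: a single row of logits per sequence.
last_logits = [lm_head_project(seq, weight)
               for seq in select_positions(hidden, 1)]
```

With logits_to_keep=1 the projection runs on one position instead of four, which is where the memory saving comes from when the vocabulary is large.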
Returns: torch.Tensor | dict | CausalLMOutputWithPast
Logits tensor, a dict with loss/aux_loss if labels provided, or a
Bases: Module
Qwen3OmniMoe Thinker Text Model with MRoPE and sparse MoE layers.
visual_pos_masks (torch.Tensor of shape (batch_size, seqlen), optional):
The mask of the visual positions.
deepstack_visual_embeds (list[torch.Tensor], optional):
The DeepStack visual embeddings, with shape (num_layers, visual_seqlen, embed_dim).
The features are extracted from different visual encoder layers and fed into the decoder
hidden states, following the DeepStack paper (https://arxiv.org/abs/2406.04334).
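As a rough illustration of how visual_pos_masks and one entry of deepstack_visual_embeds might interact, the sketch below adds a layer's visual embeddings to the hidden states at the masked positions. The additive fusion and the function name add_visual_embeds are assumptions made for this example; the text above only states that the features are fed to the decoder hidden states. Plain Python lists stand in for tensors.

```python
def add_visual_embeds(hidden_states, visual_pos_masks, visual_embeds):
    """Add visual embeddings to hidden states at visual positions.

    hidden_states:    [batch][seqlen][embed_dim] floats
    visual_pos_masks: [batch][seqlen] booleans, True at visual positions
    visual_embeds:    [visual_seqlen][embed_dim] floats, ordered like the
                      True entries of the mask in row-major order
    (Hypothetical helper; additive fusion is an assumption.)
    """
    num_visual = sum(sum(1 for m in row if m) for row in visual_pos_masks)
    if num_visual != len(visual_embeds):
        raise ValueError("mask selects a different number of positions "
                         "than visual_embeds provides")

    # Copy so the caller's hidden states are left untouched.
    out = [[list(vec) for vec in seq] for seq in hidden_states]
    k = 0
    for b, mask_row in enumerate(visual_pos_masks):
        for t, is_visual in enumerate(mask_row):
            if is_visual:
                out[b][t] = [h + v for h, v in zip(out[b][t], visual_embeds[k])]
                k += 1
    return out


# One sequence of 3 positions (embed_dim 2); the last two are visual tokens.
hidden = [[[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]]
mask = [[False, True, True]]
layer_embeds = [[1.0, 2.0], [3.0, 4.0]]

fused = add_visual_embeds(hidden, mask, layer_embeds)
```

Under this assumption, a decoder would apply one entry of deepstack_visual_embeds per layer, leaving text positions unchanged.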