nemo_automodel.components.models.qwen2_5_omni.model
Qwen2.5-Omni Thinker for ASR / multimodal text generation.
Qwen2.5-Omni is the dense predecessor of Qwen3-Omni-Moe. For NeMo AutoModel we only train the Thinker (audio + image + video + text); the talker and token2wav components are dropped from the loaded checkpoint by :class:`Qwen2_5OmniStateDictAdapter`.
Compared with :mod:`nemo_automodel.components.models.qwen3_omni_moe.model`, this module is intentionally minimal:
- inherits HF’s `Qwen2_5OmniThinkerForConditionalGeneration` directly (the text backbone is a standard dense Qwen2 transformer with MRoPE, so no custom rewrite is needed);
- adds :class:`HFCheckpointingMixin` for NeMo-compatible save/load;
- attaches :class:`Qwen2_5OmniStateDictAdapter` for `thinker.*` prefix handling;
- does NOT inherit `MoEFSDPSyncMixin` (dense, no experts).
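The checkpoint filtering and `thinker.*` prefix handling described above can be pictured with a minimal sketch. This is not the real `Qwen2_5OmniStateDictAdapter`: the helper names, the example keys, and the assumption that the prefix is stripped on load and restored on save are all illustrative.

```python
# Hypothetical sketch; not the actual Qwen2_5OmniStateDictAdapter.
# Assumed prefixes of the components that are not trained.
DROPPED_PREFIXES = ("talker.", "token2wav.")
THINKER_PREFIX = "thinker."


def keep_thinker_only(state_dict):
    """Drop talker/token2wav entries and strip a leading 'thinker.' prefix."""
    result = {}
    for key, value in state_dict.items():
        if key.startswith(DROPPED_PREFIXES):
            continue  # not part of the Thinker, so it is discarded
        if key.startswith(THINKER_PREFIX):
            key = key[len(THINKER_PREFIX):]
        result[key] = value
    return result


def add_thinker_prefix(state_dict):
    """Assumed inverse for saving: put the 'thinker.' prefix back on every key."""
    return {THINKER_PREFIX + key: value for key, value in state_dict.items()}


# Illustrative keys only; real checkpoints hold tensors, not integers.
checkpoint = {
    "thinker.model.embed_tokens.weight": 1,
    "thinker.audio_tower.conv1.weight": 2,
    "talker.model.embed_tokens.weight": 3,
    "token2wav.some_module.weight": 4,
}
thinker_only = keep_thinker_only(checkpoint)
restored = add_thinker_prefix(thinker_only)
```

Keeping the key mapping in a separate adapter means the model class itself can stay a thin subclass of the HF implementation.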
Module Contents
Classes
| Qwen2_5OmniThinkerForConditionalGeneration | Qwen2.5-Omni Thinker (audio + image + video + text → text). |
Functions
| _resolve_thinker_config | Return the thinker sub-config regardless of whether a full Omni or Thinker-only config was passed in. |
Data
API
- nemo_automodel.components.models.qwen2_5_omni.model._resolve_thinker_config(
    config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
  )
Return the thinker sub-config regardless of whether a full Omni or Thinker-only config was passed in.
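A minimal sketch of this resolution logic follows. It assumes the full Omni config exposes its sub-config as a `thinker_config` attribute (as the HF `Qwen2_5OmniConfig` does); the function name without the leading underscore and the `SimpleNamespace` stand-ins are illustrative, not the module's actual code.

```python
from types import SimpleNamespace


def resolve_thinker_config(config):
    """Return config.thinker_config when present, otherwise config itself."""
    thinker = getattr(config, "thinker_config", None)
    return thinker if thinker is not None else config


# Stand-ins for the real HF config objects; field values are placeholders.
thinker_cfg = SimpleNamespace(name="thinker-only config")
full_cfg = SimpleNamespace(name="full omni config", thinker_config=thinker_cfg)

# Both entry points resolve to the same Thinker config object.
from_full = resolve_thinker_config(full_cfg)
from_thinker = resolve_thinker_config(thinker_cfg)
```

Resolving the config once at construction time lets the rest of the class assume it always holds a Thinker config.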
- class nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration(
    config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs,
  )
Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration
Qwen2.5-Omni Thinker (audio + image + video + text → text).
Initialization
- classmethod from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs,
  )
- classmethod from_config(
    config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs,
  )
- forward(
    input_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    input_features: torch.FloatTensor | None = None,
    feature_attention_mask: torch.LongTensor | None = None,
    audio_feature_lengths: torch.LongTensor | None = None,
    pixel_values: torch.FloatTensor | None = None,
    pixel_values_videos: torch.FloatTensor | None = None,
    image_grid_thw: torch.LongTensor | None = None,
    video_grid_thw: torch.LongTensor | None = None,
    video_second_per_grid: torch.Tensor | None = None,
    use_audio_in_video: bool | None = None,
    position_ids: torch.Tensor | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    **kwargs: Any,
  )
Delegate to HF Thinker forward, passing all multimodal inputs through.
The forward signature mirrors HF’s `Qwen2_5OmniThinkerForConditionalGeneration.forward`; we override only to make the call site uniform with the rest of NeMo AutoModel. Audio is mandatory for ASR; image / video paths are kept enabled so the same class supports the full Thinker modality set.
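The delegation pattern described here can be sketched with plain Python classes. `BaseThinker` and `Thinker` are hypothetical stand-ins for the HF parent and the NeMo subclass; only three of the real parameters are shown, and lists are used in place of tensors.

```python
class BaseThinker:
    """Stand-in for the HF parent class; it simply records what it receives."""

    def forward(self, input_ids=None, input_features=None, labels=None, **kwargs):
        return {
            "input_ids": input_ids,
            "input_features": input_features,
            "labels": labels,
            "extra": kwargs,
        }


class Thinker(BaseThinker):
    """Override that passes every argument unchanged to the parent forward."""

    def forward(self, input_ids=None, input_features=None, labels=None, **kwargs):
        return super().forward(
            input_ids=input_ids,
            input_features=input_features,
            labels=labels,
            **kwargs,
        )


# Audio features plus text ids, with one extra keyword passed through untouched.
out = Thinker().forward(input_ids=[1, 2], input_features=[[0.5]], use_cache=False)
```

Because nothing is transformed on the way through, the subclass behaves like the parent while exposing an explicit, uniform signature to callers.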
- nemo_automodel.components.models.qwen2_5_omni.model.ModelClass
None