nemo_automodel.components.models.qwen2_5_omni.model#

Qwen2.5-Omni Thinker for ASR / multimodal text generation.

Qwen2.5-Omni is the dense predecessor of Qwen3-Omni-Moe. For NeMo AutoModel we only train the Thinker (audio + image + video + text); the talker and token2wav components are dropped from the loaded checkpoint by :class:Qwen2_5OmniStateDictAdapter.

Compared with :mod:nemo_automodel.components.models.qwen3_omni_moe.model, this module is intentionally minimal:

  • inherits HF’s Qwen2_5OmniThinkerForConditionalGeneration directly (the text backbone is a standard dense Qwen2 transformer with MRoPE, so no custom rewrite is needed);

  • adds :class:HFCheckpointingMixin for NeMo-compatible save/load;

  • attaches :class:Qwen2_5OmniStateDictAdapter for thinker.* prefix handling;

  • does NOT inherit MoEFSDPSyncMixin (dense, no experts).
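The checkpoint handling described above (drop the talker and token2wav weights, handle the thinker.* prefix) can be pictured with a small sketch. This is not the code of Qwen2_5OmniStateDictAdapter; the function name `filter_thinker_state_dict`, the prefix constants, and the example key names are hypothetical and chosen only to illustrate the idea on a plain dictionary.

```python
from typing import Any, Dict

# Assumed prefixes, for illustration only; the real adapter may differ.
THINKER_PREFIX = "thinker."
DROPPED_PREFIXES = ("talker.", "token2wav.")


def filter_thinker_state_dict(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Keep Thinker weights only and strip the leading ``thinker.`` prefix."""
    filtered: Dict[str, Any] = {}
    for key, value in state_dict.items():
        # Talker and token2wav weights are not needed to train the Thinker.
        if key.startswith(DROPPED_PREFIXES):
            continue
        # Full-Omni checkpoints nest Thinker weights under ``thinker.``.
        if key.startswith(THINKER_PREFIX):
            key = key[len(THINKER_PREFIX):]
        filtered[key] = value
    return filtered


# Hypothetical key names standing in for a full Omni checkpoint.
full_checkpoint = {
    "thinker.model.embed_tokens.weight": 1,
    "thinker.audio_tower.conv1.weight": 2,
    "talker.model.layers.0.weight": 3,
    "token2wav.code2wav.weight": 4,
}
thinker_only = filter_thinker_state_dict(full_checkpoint)
```

Keys that already lack the prefix pass through unchanged, so in this sketch a Thinker-only checkpoint would be left as it is.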

Module Contents#

Classes#

Qwen2_5OmniThinkerForConditionalGeneration

Qwen2.5-Omni Thinker (audio + image + video + text → text).

Functions#

_resolve_thinker_config

Return the thinker sub-config regardless of whether a full Omni or Thinker-only config was passed in.

Data#

API#

nemo_automodel.components.models.qwen2_5_omni.model._resolve_thinker_config(
config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
) → transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig#

Return the thinker sub-config regardless of whether a full Omni or Thinker-only config was passed in.
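One way such a resolver could behave is sketched below. It assumes the full Omni config exposes its thinker sub-config as a `thinker_config` attribute and that a Thinker-only config has no such attribute; the stand-in objects and the name `resolve_thinker_config_sketch` are hypothetical, and the actual implementation may use an explicit type check instead.

```python
from types import SimpleNamespace


def resolve_thinker_config_sketch(config):
    """Return ``config.thinker_config`` if present, else ``config`` itself."""
    # Assumption: only the full Omni config carries a ``thinker_config`` field.
    return getattr(config, "thinker_config", config)


# Stand-ins for the two config flavours (not the real HF config classes).
thinker_cfg = SimpleNamespace(hidden_size=8)
full_cfg = SimpleNamespace(thinker_config=thinker_cfg, talker_config=SimpleNamespace())

resolved_from_full = resolve_thinker_config_sketch(full_cfg)
resolved_from_thinker = resolve_thinker_config_sketch(thinker_cfg)
```

Both calls end up with the same thinker config object, which is what lets the constructor accept either config type.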

class nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration(
config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, transformers.models.qwen2_5_omni.modeling_qwen2_5_omni.Qwen2_5OmniThinkerForConditionalGeneration

Qwen2.5-Omni Thinker (audio + image + video + text → text).


classmethod from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

classmethod from_config(
config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

forward(
input_ids: torch.Tensor | None = None,
attention_mask: torch.Tensor | None = None,
input_features: torch.FloatTensor | None = None,
feature_attention_mask: torch.LongTensor | None = None,
audio_feature_lengths: torch.LongTensor | None = None,
pixel_values: torch.FloatTensor | None = None,
pixel_values_videos: torch.FloatTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
video_grid_thw: torch.LongTensor | None = None,
video_second_per_grid: torch.Tensor | None = None,
use_audio_in_video: bool | None = None,
position_ids: torch.Tensor | None = None,
inputs_embeds: torch.FloatTensor | None = None,
labels: torch.LongTensor | None = None,
**kwargs: Any,
)#

Delegate to HF Thinker forward, passing all multimodal inputs through.

The forward signature mirrors HF’s Qwen2_5OmniThinkerForConditionalGeneration.forward; we override only to make the call site uniform with the rest of NeMo AutoModel. Audio is mandatory for ASR; image / video paths are kept enabled so the same class supports the full Thinker modality set.
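The delegation pattern described here (a mixin listed first, the upstream class second, and a forward override that only passes arguments through) can be sketched with stand-in classes. `CheckpointingMixinSketch`, `UpstreamThinkerSketch`, and `ThinkerSketch` are hypothetical names; they are not the NeMo or HF classes, and the reduced argument list is only for brevity.

```python
class CheckpointingMixinSketch:
    """Stand-in for a save/load mixin; adds no forward logic."""

    def describe_checkpoint(self) -> str:
        return "checkpointing handled by the mixin"


class UpstreamThinkerSketch:
    """Stand-in for the upstream model that owns the real forward pass."""

    def forward(self, input_ids=None, input_features=None, labels=None, **kwargs):
        received = {
            "input_ids": input_ids,
            "input_features": input_features,
            "labels": labels,
        }
        received.update(kwargs)
        # Report which inputs actually arrived, in a stable order.
        return sorted(name for name, value in received.items() if value is not None)


class ThinkerSketch(CheckpointingMixinSketch, UpstreamThinkerSketch):
    def forward(self, input_ids=None, input_features=None, labels=None, **kwargs):
        # Pure delegation: every argument is forwarded unchanged.
        return super().forward(
            input_ids=input_ids,
            input_features=input_features,
            labels=labels,
            **kwargs,
        )


model = ThinkerSketch()
seen = model.forward(input_ids=[1, 2], input_features=[0.1], pixel_values=[0.5])
```

Because the mixin defines no forward of its own, `super().forward` resolves to the upstream class, so image and video inputs passed as extra keyword arguments reach it untouched.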

nemo_automodel.components.models.qwen2_5_omni.model.ModelClass#

None