
# nemo_automodel.components.models.qwen2_5_omni.model

Qwen2.5-Omni Thinker for ASR / multimodal text generation.

Qwen2.5-Omni is the dense predecessor of Qwen3-Omni-Moe. NeMo AutoModel
trains only the Thinker (audio + image + video + text);
`Qwen2_5OmniStateDictAdapter` drops the talker and token2wav components
from the loaded checkpoint.

Compared with `nemo_automodel.components.models.qwen3_omni_moe.model`,
this module is intentionally minimal:

* inherits HF's `Qwen2_5OmniThinkerForConditionalGeneration` directly
  (the text backbone is a standard dense Qwen2 transformer with MRoPE, so
  no custom rewrite is needed);
* adds `HFCheckpointingMixin` for NeMo-compatible save/load;
* attaches `Qwen2_5OmniStateDictAdapter` for `thinker.*` prefix
  handling;
* does NOT inherit `MoEFSDPSyncMixin` (dense, no experts).
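
The checkpoint handling described above (dropping the talker and token2wav weights, and handling the `thinker.*` prefix) can be illustrated with a minimal sketch. This is not the code of `Qwen2_5OmniStateDictAdapter`: the function names, the exact prefix strings, and the example key are assumptions made for illustration only.

```python
from typing import Any, Dict

# Assumed prefixes for this sketch; the real adapter may apply different rules.
THINKER_PREFIX = "thinker."
DROPPED_PREFIXES = ("talker.", "token2wav.")


def keep_thinker_only(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Drop talker/token2wav entries and strip the `thinker.` prefix (hypothetical helper)."""
    out: Dict[str, Any] = {}
    for key, value in state_dict.items():
        if key.startswith(DROPPED_PREFIXES):
            continue  # these components are not trained, so they are skipped
        if key.startswith(THINKER_PREFIX):
            key = key[len(THINKER_PREFIX):]
        out[key] = value
    return out


def add_thinker_prefix(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Re-add the `thinker.` prefix, e.g. before saving (hypothetical helper)."""
    return {THINKER_PREFIX + key: value for key, value in state_dict.items()}


# Illustrative key names only; real checkpoints hold tensors, not strings.
full_checkpoint = {
    "thinker.model.embed_tokens.weight": "W0",
    "talker.some.weight": "W1",
    "token2wav.some.weight": "W2",
}
thinker_only = keep_thinker_only(full_checkpoint)
```

Keeping the two directions symmetric means a load followed by a save reproduces the original Thinker key names.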

## Module Contents

### Classes

| Name                                                                                                                                            | Description                                                 |
| ----------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------- |
| [`Qwen2_5OmniThinkerForConditionalGeneration`](#nemo_automodel-components-models-qwen2_5_omni-model-Qwen2_5OmniThinkerForConditionalGeneration) | Qwen2.5-Omni Thinker (audio + image + video + text → text). |

### Functions

| Name                                                                                                      | Description                                                        |
| --------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| [`_resolve_thinker_config`](#nemo_automodel-components-models-qwen2_5_omni-model-_resolve_thinker_config) | Return the thinker sub-config regardless of whether a full Omni or Thinker-only config was passed in. |

### Data

[`ModelClass`](#nemo_automodel-components-models-qwen2_5_omni-model-ModelClass)

### API

```python
class nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration(
    config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `HFQwen2_5OmniThinkerForConditionalGeneration`

Qwen2.5-Omni Thinker (audio + image + video + text → text).

```python
nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration.forward(
    input_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    input_features: torch.FloatTensor | None = None,
    feature_attention_mask: torch.LongTensor | None = None,
    audio_feature_lengths: torch.LongTensor | None = None,
    pixel_values: torch.FloatTensor | None = None,
    pixel_values_videos: torch.FloatTensor | None = None,
    image_grid_thw: torch.LongTensor | None = None,
    video_grid_thw: torch.LongTensor | None = None,
    video_second_per_grid: torch.Tensor | None = None,
    use_audio_in_video: bool | None = None,
    position_ids: torch.Tensor | None = None,
    past_key_values: typing.Any = None,
    inputs_embeds: torch.FloatTensor | None = None,
    labels: torch.LongTensor | None = None,
    use_cache: bool | None = None,
    rope_deltas: torch.LongTensor | None = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    output_hidden_states: typing.Optional[bool] = None,
    **kwargs: typing.Any
)
```

Multimodal forward that mirrors HF's Thinker but supports cut-CE.

This re-implements the body of HF's
`Qwen2_5OmniThinkerForConditionalGeneration.forward` (same
audio/image/video embedding merge and MRoPE index computation) so that it
can (a) gate the `lm_head` projection on `logits_to_keep` and
(b) surface the final hidden states (the `lm_head` input) on the
returned `transformers.modeling_outputs.CausalLMOutputWithPast`.
Together these let the recipe enable `FusedLinearCrossEntropy` (cut-CE),
which checks that `logits_to_keep` is in the signature and that the
output carries `hidden_states`.

Audio is mandatory for ASR; image / video paths are kept enabled so
the same class supports the full Thinker modality set.
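
The two conditions named above (a `logits_to_keep` parameter in the signature and `hidden_states` on the output) could be checked roughly as follows. This is a sketch under stated assumptions, not the recipe's actual check: `supports_cut_ce` and `toy_forward` are hypothetical names, and `SimpleNamespace` stands in for `CausalLMOutputWithPast`.

```python
import inspect
from types import SimpleNamespace
from typing import Any, Callable


def supports_cut_ce(forward_fn: Callable[..., Any], output: Any) -> bool:
    """Return True if both cut-CE preconditions described above hold (hypothetical helper)."""
    # Condition 1: the forward signature exposes `logits_to_keep`.
    has_param = "logits_to_keep" in inspect.signature(forward_fn).parameters
    # Condition 2: the returned output actually carries hidden states.
    has_hidden = getattr(output, "hidden_states", None) is not None
    return has_param and has_hidden


def toy_forward(input_ids=None, logits_to_keep=0, **kwargs):
    """Stand-in for a model forward; returns an object shaped like a model output."""
    return SimpleNamespace(logits=None, hidden_states=[[0.0, 0.0]])


eligible = supports_cut_ce(toy_forward, toy_forward())
```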

**Parameters:**

* `logits_to_keep`: If `0` (default), project all positions without
  slicing, because DTensor cannot slice a full range. Otherwise, slice the
  hidden states to the last `logits_to_keep` positions before `lm_head`,
  so logits are computed only for those positions.
* `output_hidden_states`: When set, the returned output carries the
  final hidden states spanning the full sequence.
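
The `logits_to_keep` rule above can be illustrated on a plain Python sequence. This is a simplified sketch: it covers only the integer case, uses a list in place of a hidden-state tensor, and `select_positions` is a hypothetical name rather than part of the module.

```python
from typing import List, Sequence


def select_positions(hidden: Sequence, logits_to_keep: int = 0) -> List:
    """Pick the positions that would be fed to `lm_head` (integer case only)."""
    if logits_to_keep == 0:
        # Default: keep every position and skip slicing entirely.
        return list(hidden)
    # Otherwise keep only the trailing `logits_to_keep` positions.
    return list(hidden[-logits_to_keep:])


positions = ["h0", "h1", "h2", "h3"]
all_positions = select_positions(positions)      # every position is projected
last_two = select_positions(positions, 2)        # only the last two are projected
```

Skipping the slice in the default case, instead of slicing with a full range, is what the description above attributes to the DTensor limitation.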

**Returns:**

`transformers.modeling_outputs.CausalLMOutputWithPast`, carrying the final hidden states when `output_hidden_states` is set.

```python
nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration.from_config(
    config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

classmethod

```python
nemo_automodel.components.models.qwen2_5_omni.model.Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

classmethod

```python
nemo_automodel.components.models.qwen2_5_omni.model._resolve_thinker_config(
    config: transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniConfig | transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig
) -> transformers.models.qwen2_5_omni.configuration_qwen2_5_omni.Qwen2_5OmniThinkerConfig
```

Return the thinker sub-config regardless of whether a full Omni or
Thinker-only config was passed in.
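
One way such a resolver could work is sketched below. It is not the module's implementation: the toy dataclasses stand in for the real `transformers` config classes, and the sketch assumes the full Omni config exposes the Thinker settings through a `thinker_config` attribute.

```python
from dataclasses import dataclass, field


@dataclass
class ToyThinkerConfig:
    """Stand-in for `Qwen2_5OmniThinkerConfig` (illustrative only)."""
    hidden_size: int = 8


@dataclass
class ToyOmniConfig:
    """Stand-in for `Qwen2_5OmniConfig`; assumed to nest the thinker config."""
    thinker_config: ToyThinkerConfig = field(default_factory=ToyThinkerConfig)


def resolve_thinker_config(config):
    """Return the nested thinker config if present, else the config itself."""
    return getattr(config, "thinker_config", config)


thinker = ToyThinkerConfig(hidden_size=16)
from_full = resolve_thinker_config(ToyOmniConfig(thinker_config=thinker))
from_thinker = resolve_thinker_config(thinker)
```

Both calls yield the same object, so downstream code can treat either input uniformly.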

```python
nemo_automodel.components.models.qwen2_5_omni.model.ModelClass = Qwen2_5OmniThinkerForConditionalGeneration
```