> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# nemo_automodel.components.datasets.audio.collate_fns

Collate functions for Qwen-Omni ASR fine-tuning (`torchcodec`-free).

These collates assume audio waveforms are already attached to each conversation
as 1-D `np.ndarray` items (see
:func:`nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset`),
so they feed the processor's `audio=` kwarg directly without going through
`qwen_omni_utils` / `torchcodec`. Label masking is delegated to the shared
marker-based :func:`nemo_automodel.components.datasets.vlm.collate_fns.build_labels_from_template`.

## Module Contents

### Functions

| Name                                                                                                                                     | Description                                                                       |
| ---------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- |
| [`_conversation_ends_with_assistant_text`](#nemo_automodel-components-datasets-audio-collate_fns-_conversation_ends_with_assistant_text) | Return True iff the last turn is an `assistant` turn with non-empty text content. |
| [`_extract_audios_from_conversation`](#nemo_automodel-components-datasets-audio-collate_fns-_extract_audios_from_conversation)           | Walk a Qwen-Omni-style conversation and collect audio payloads in order.          |
| [`_validate_and_coerce_audio_payload`](#nemo_automodel-components-datasets-audio-collate_fns-_validate_and_coerce_audio_payload)         | Coerce an audio payload to a 1-D `float32` `np.ndarray` or raise.                 |
| [`qwen2_5_omni_asr_collate_fn`](#nemo_automodel-components-datasets-audio-collate_fns-qwen2_5_omni_asr_collate_fn)                       | Collate Qwen2.5-Omni ASR conversations.                                           |
| [`qwen3_omni_asr_collate_fn`](#nemo_automodel-components-datasets-audio-collate_fns-qwen3_omni_asr_collate_fn)                           | Collate Qwen3-Omni ASR conversations into model inputs without `qwen_omni_utils`. |

### API

```python
nemo_automodel.components.datasets.audio.collate_fns._conversation_ends_with_assistant_text(
    conversation: typing.Sequence[typing.Dict[str, typing.Any]]
) -> bool
```

Return True iff the last turn is an `assistant` turn with non-empty text content.

```python
nemo_automodel.components.datasets.audio.collate_fns._extract_audios_from_conversation(
    conversation: typing.Sequence[typing.Dict[str, typing.Any]]
) -> typing.List[typing.Any]
```

Walk a Qwen-Omni-style conversation and collect audio payloads in order.

The returned list contains the raw audio objects (typically 1-D `np.ndarray`
waveforms) attached to `&#123;"type": "audio", "audio": ...&#125;` items in any
message's content list. Used by :func:`qwen3_omni_asr_collate_fn` to feed the
processor's `audio=` kwarg without going through `qwen_omni_utils`.

```python
nemo_automodel.components.datasets.audio.collate_fns._validate_and_coerce_audio_payload(
    payload: typing.Any,
    sample_index: int
) -> numpy.ndarray
```

Coerce an audio payload to a 1-D `float32` `np.ndarray` or raise.

**Parameters:**

Audio object pulled from a conversation content item.

Index of the offending sample within the batch (for error
messages).

**Returns:** `np.ndarray`

A 1-D `np.float32` `np.ndarray`.

**Raises:**

* `ValueError`: When the payload is not a numeric array or is not 1-D.

```python
nemo_automodel.components.datasets.audio.collate_fns.qwen2_5_omni_asr_collate_fn(
    examples: typing.Sequence[typing.Dict[str, typing.Any]],
    processor: typing.Any
) -> typing.Dict[str, torch.Tensor]
```

Collate Qwen2.5-Omni ASR conversations.

Thin alias over :func:`qwen3_omni_asr_collate_fn`: the body is processor-
agnostic (it only depends on the processor exposing `apply_chat_template`
and the `audio=` kwarg, both of which `Qwen2_5OmniProcessor` provides),
so the entire Qwen3-Omni-ASR path works unchanged here. We expose a
separate symbol so YAML configs can pick the right collate via
`_target_` without users having to know about the Qwen3-Omni name.

```python
nemo_automodel.components.datasets.audio.collate_fns.qwen3_omni_asr_collate_fn(
    examples: typing.Sequence[typing.Dict[str, typing.Any]],
    processor: typing.Any
) -> typing.Dict[str, torch.Tensor]
```

Collate Qwen3-Omni ASR conversations into model inputs without `qwen_omni_utils`.

Unlike `qwen3_omni_collate_fn` (in `vlm.collate_fns`), this collate is
intended for environments that lack `qwen_omni_utils` and `torchcodec`.
It assumes audio waveforms are already attached to the conversation as 1-D
`np.ndarray` items of the form `&#123;"type": "audio", "audio": waveform&#125;` (see
:func:`nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset`)
and passes them directly to the processor's `audio=` kwarg, which routes to
the bundled `WhisperFeatureExtractor`.

Label masking is delegated to :func:`build_labels_from_template`, which uses
the marker-based fast path that already supports `Qwen3OmniMoeProcessor`
via `_IMSTART_TEMPLATE_PROCESSORS`. The collate produces pre-shifted labels
(`labels[:, 1:]`) and slices same-shape tensors to `[:, :-1]` so the
downstream loss (`MaskedCrossEntropy`/`FusedLinearCrossEntropy`) consumes
them without a second internal shift.

**Parameters:**

Iterable of dicts each containing a `conversation` key, where
the last turn MUST be an `assistant` turn with non-empty text.

A `Qwen3OmniMoeProcessor` instance (or compatible mock).

**Returns:** `Dict[str, torch.Tensor]`

Dict with `input_ids`, `attention_mask`, `input_features`,

**Raises:**

* `ValueError`: If any conversation lacks a non-empty assistant turn at the
  end (the marker-based labeler would otherwise produce all-`-100`
  labels and a NaN loss).