nemo_automodel.components.datasets.audio.collate_fns

Collate functions for Qwen-Omni ASR fine-tuning (torchcodec-free).

These collates assume audio waveforms are already attached to each conversation as 1-D np.ndarray items (see :func:nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset), so they feed the processor’s audio= kwarg directly without going through qwen_omni_utils / torchcodec. Label masking is delegated to the shared marker-based :func:nemo_automodel.components.datasets.vlm.collate_fns.build_labels_from_template.

Module Contents

Functions

Name	Description
`_conversation_ends_with_assistant_text`	Return True iff the last turn is an `assistant` turn with non-empty text content.
`_extract_audios_from_conversation`	Walk a Qwen-Omni-style conversation and collect audio payloads in order.
`_validate_and_coerce_audio_payload`	Coerce an audio payload to a 1-D `float32` `np.ndarray` or raise.
`qwen2_5_omni_asr_collate_fn`	Collate Qwen2.5-Omni ASR conversations.
`qwen3_omni_asr_collate_fn`	Collate Qwen3-Omni ASR conversations into model inputs without `qwen_omni_utils`.

API

nemo_automodel.components.datasets.audio.collate_fns._conversation_ends_with_assistant_text(
    conversation: typing.Sequence[typing.Dict[str, typing.Any]]
) -> bool

Return True iff the last turn is an assistant turn with non-empty text content.

nemo_automodel.components.datasets.audio.collate_fns._extract_audios_from_conversation(
    conversation: typing.Sequence[typing.Dict[str, typing.Any]]
) -> typing.List[typing.Any]

Walk a Qwen-Omni-style conversation and collect audio payloads in order.

The returned list contains the raw audio objects (typically 1-D np.ndarray waveforms) attached to {"type": "audio", "audio": ...} items in any message’s content list. Used by :func:qwen3_omni_asr_collate_fn to feed the processor’s audio= kwarg without going through qwen_omni_utils.

nemo_automodel.components.datasets.audio.collate_fns._validate_and_coerce_audio_payload(
    payload: typing.Any,
    sample_index: int
) -> numpy.ndarray

Coerce an audio payload to a 1-D float32 np.ndarray or raise.

Parameters:

payload

Any

Audio object pulled from a conversation content item.

sample_index

int

Index of the offending sample within the batch (for error messages).

Returns: np.ndarray

A 1-D np.float32 np.ndarray.

Raises:

ValueError: When the payload is not a numeric array or is not 1-D.

nemo_automodel.components.datasets.audio.collate_fns.qwen2_5_omni_asr_collate_fn(
    examples: typing.Sequence[typing.Dict[str, typing.Any]],
    processor: typing.Any
) -> typing.Dict[str, torch.Tensor]

Collate Qwen2.5-Omni ASR conversations.

Thin alias over :func:qwen3_omni_asr_collate_fn: the body is processor- agnostic (it only depends on the processor exposing apply_chat_template and the audio= kwarg, both of which Qwen2_5OmniProcessor provides), so the entire Qwen3-Omni-ASR path works unchanged here. We expose a separate symbol so YAML configs can pick the right collate via _target_ without users having to know about the Qwen3-Omni name.

nemo_automodel.components.datasets.audio.collate_fns.qwen3_omni_asr_collate_fn(
    examples: typing.Sequence[typing.Dict[str, typing.Any]],
    processor: typing.Any
) -> typing.Dict[str, torch.Tensor]

Collate Qwen3-Omni ASR conversations into model inputs without qwen_omni_utils.

Unlike qwen3_omni_collate_fn (in vlm.collate_fns), this collate is intended for environments that lack qwen_omni_utils and torchcodec. It assumes audio waveforms are already attached to the conversation as 1-D np.ndarray items of the form {"type": "audio", "audio": waveform} (see :func:nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset) and passes them directly to the processor’s audio= kwarg, which routes to the bundled WhisperFeatureExtractor.

Label masking is delegated to :func:build_labels_from_template, which uses the marker-based fast path that already supports Qwen3OmniMoeProcessor via _IMSTART_TEMPLATE_PROCESSORS. The collate produces pre-shifted labels (labels[:, 1:]) and slices same-shape tensors to [:, :-1] so the downstream loss (MaskedCrossEntropy/FusedLinearCrossEntropy) consumes them without a second internal shift.

Parameters:

examples

Sequence[Dict[str, Any]]

Iterable of dicts each containing a conversation key, where the last turn MUST be an assistant turn with non-empty text.

processor

Any

A Qwen3OmniMoeProcessor instance (or compatible mock).

Returns: Dict[str, torch.Tensor]

Dict with input_ids, attention_mask, input_features,

Raises:

ValueError: If any conversation lacks a non-empty assistant turn at the end (the marker-based labeler would otherwise produce all--100 labels and a NaN loss).