nemo_automodel.components.datasets.audio.collate_fns

View as Markdown

Collate functions for Qwen-Omni ASR fine-tuning (torchcodec-free).

These collates assume audio waveforms are already attached to each conversation as 1-D np.ndarray items (see :func:nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset), so they feed the processor’s audio= kwarg directly without going through qwen_omni_utils / torchcodec. Label masking is delegated to the shared marker-based :func:nemo_automodel.components.datasets.vlm.collate_fns.build_labels_from_template.

Module Contents

Functions

NameDescription
_conversation_ends_with_assistant_textReturn True iff the last turn is an assistant turn with non-empty text content.
_extract_audios_from_conversationWalk a Qwen-Omni-style conversation and collect audio payloads in order.
_validate_and_coerce_audio_payloadCoerce an audio payload to a 1-D float32 np.ndarray or raise.
qwen2_5_omni_asr_collate_fnCollate Qwen2.5-Omni ASR conversations.
qwen3_omni_asr_collate_fnCollate Qwen3-Omni ASR conversations into model inputs without qwen_omni_utils.

API

nemo_automodel.components.datasets.audio.collate_fns._conversation_ends_with_assistant_text(
conversation: typing.Sequence[typing.Dict[str, typing.Any]]
) -> bool

Return True iff the last turn is an assistant turn with non-empty text content.

nemo_automodel.components.datasets.audio.collate_fns._extract_audios_from_conversation(
conversation: typing.Sequence[typing.Dict[str, typing.Any]]
) -> typing.List[typing.Any]

Walk a Qwen-Omni-style conversation and collect audio payloads in order.

The returned list contains the raw audio objects (typically 1-D np.ndarray waveforms) attached to {"type": "audio", "audio": ...} items in any message’s content list. Used by :func:qwen3_omni_asr_collate_fn to feed the processor’s audio= kwarg without going through qwen_omni_utils.

nemo_automodel.components.datasets.audio.collate_fns._validate_and_coerce_audio_payload(
payload: typing.Any,
sample_index: int
) -> numpy.ndarray

Coerce an audio payload to a 1-D float32 np.ndarray or raise.

Parameters:

payload
Any

Audio object pulled from a conversation content item.

sample_index
int

Index of the offending sample within the batch (for error messages).

Returns: np.ndarray

A 1-D np.float32 np.ndarray.

Raises:

  • ValueError: When the payload is not a numeric array or is not 1-D.
nemo_automodel.components.datasets.audio.collate_fns.qwen2_5_omni_asr_collate_fn(
examples: typing.Sequence[typing.Dict[str, typing.Any]],
processor: typing.Any
) -> typing.Dict[str, torch.Tensor]

Collate Qwen2.5-Omni ASR conversations.

Thin alias over :func:qwen3_omni_asr_collate_fn: the body is processor- agnostic (it only depends on the processor exposing apply_chat_template and the audio= kwarg, both of which Qwen2_5OmniProcessor provides), so the entire Qwen3-Omni-ASR path works unchanged here. We expose a separate symbol so YAML configs can pick the right collate via _target_ without users having to know about the Qwen3-Omni name.

nemo_automodel.components.datasets.audio.collate_fns.qwen3_omni_asr_collate_fn(
examples: typing.Sequence[typing.Dict[str, typing.Any]],
processor: typing.Any
) -> typing.Dict[str, torch.Tensor]

Collate Qwen3-Omni ASR conversations into model inputs without qwen_omni_utils.

Unlike qwen3_omni_collate_fn (in vlm.collate_fns), this collate is intended for environments that lack qwen_omni_utils and torchcodec. It assumes audio waveforms are already attached to the conversation as 1-D np.ndarray items of the form {"type": "audio", "audio": waveform} (see :func:nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset) and passes them directly to the processor’s audio= kwarg, which routes to the bundled WhisperFeatureExtractor.

Label masking is delegated to :func:build_labels_from_template, which uses the marker-based fast path that already supports Qwen3OmniMoeProcessor via _IMSTART_TEMPLATE_PROCESSORS. The collate produces pre-shifted labels (labels[:, 1:]) and slices same-shape tensors to [:, :-1] so the downstream loss (MaskedCrossEntropy/FusedLinearCrossEntropy) consumes them without a second internal shift.

Parameters:

examples
Sequence[Dict[str, Any]]

Iterable of dicts each containing a conversation key, where the last turn MUST be an assistant turn with non-empty text.

processor
Any

A Qwen3OmniMoeProcessor instance (or compatible mock).

Returns: Dict[str, torch.Tensor]

Dict with input_ids, attention_mask, input_features,

Raises:

  • ValueError: If any conversation lacks a non-empty assistant turn at the end (the marker-based labeler would otherwise produce all--100 labels and a NaN loss).