nemo_automodel.components.datasets.audio.collate_fns
nemo_automodel.components.datasets.audio.collate_fns
Collate functions for Qwen-Omni ASR fine-tuning (torchcodec-free).
These collates assume audio waveforms are already attached to each conversation
as 1-D np.ndarray items (see
:func:nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset),
so they feed the processor’s audio= kwarg directly without going through
qwen_omni_utils / torchcodec. Label masking is delegated to the shared
marker-based :func:nemo_automodel.components.datasets.vlm.collate_fns.build_labels_from_template.
Module Contents
Functions
API
Return True iff the last turn is an assistant turn with non-empty text content.
Walk a Qwen-Omni-style conversation and collect audio payloads in order.
The returned list contains the raw audio objects (typically 1-D np.ndarray
waveforms) attached to {"type": "audio", "audio": ...} items in any
message’s content list. Used by :func:qwen3_omni_asr_collate_fn to feed the
processor’s audio= kwarg without going through qwen_omni_utils.
Coerce an audio payload to a 1-D float32 np.ndarray or raise.
Parameters:
Audio object pulled from a conversation content item.
Index of the offending sample within the batch (for error messages).
Returns: np.ndarray
A 1-D np.float32 np.ndarray.
Raises:
ValueError: When the payload is not a numeric array or is not 1-D.
Collate Qwen2.5-Omni ASR conversations.
Thin alias over :func:qwen3_omni_asr_collate_fn: the body is processor-
agnostic (it only depends on the processor exposing apply_chat_template
and the audio= kwarg, both of which Qwen2_5OmniProcessor provides),
so the entire Qwen3-Omni-ASR path works unchanged here. We expose a
separate symbol so YAML configs can pick the right collate via
_target_ without users having to know about the Qwen3-Omni name.
Collate Qwen3-Omni ASR conversations into model inputs without qwen_omni_utils.
Unlike qwen3_omni_collate_fn (in vlm.collate_fns), this collate is
intended for environments that lack qwen_omni_utils and torchcodec.
It assumes audio waveforms are already attached to the conversation as 1-D
np.ndarray items of the form {"type": "audio", "audio": waveform} (see
:func:nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset)
and passes them directly to the processor’s audio= kwarg, which routes to
the bundled WhisperFeatureExtractor.
Label masking is delegated to :func:build_labels_from_template, which uses
the marker-based fast path that already supports Qwen3OmniMoeProcessor
via _IMSTART_TEMPLATE_PROCESSORS. The collate produces pre-shifted labels
(labels[:, 1:]) and slices same-shape tensors to [:, :-1] so the
downstream loss (MaskedCrossEntropy/FusedLinearCrossEntropy) consumes
them without a second internal shift.
Parameters:
Iterable of dicts each containing a conversation key, where
the last turn MUST be an assistant turn with non-empty text.
A Qwen3OmniMoeProcessor instance (or compatible mock).
Returns: Dict[str, torch.Tensor]
Dict with input_ids, attention_mask, input_features,
Raises:
ValueError: If any conversation lacks a non-empty assistant turn at the end (the marker-based labeler would otherwise produce all--100labels and a NaN loss).