nemo_automodel.components.datasets.audio.multi_en
Multi-source English ASR mix builder for Qwen-Omni fine-tuning.
Mixes several public HuggingFace speech corpora into a single
{audio, text, source} training set, normalizing every transcript to a common
uppercase / punctuation-free style. Audio is never decoded at construction time:
each source’s audio column is cast with Audio(decode=False) and decoded
lazily with soundfile inside the shared
:func:nemo_automodel.components.datasets.audio.datasets._attach_asr_transform
(so no torchcodec dependency is pulled in).
This is the first-class port of the result/data/build_train_mix.py prototype.
The key difference from the prototype: the prototype resampled by casting the audio
column to a decoding Audio feature with a target sampling rate (which triggers
HuggingFace’s default decoding path and pulls in torchcodec), whereas this builder
keeps a non-decoding Audio(decode=False) cast and lets resampling happen in the
soundfile decode path inside the lazy transform.
Module Contents
Classes
Functions
Data
API
Specification for one corpus in the English ASR mix.
Coerce a YAML/CLI source spec into a :class:Source.
The recipe config loader passes nested dataset.sources entries as plain
dicts, so a Source override written in YAML/CLI arrives here as a mapping.
Pass-through for existing :class:Source instances.
Parameters:
A :class:Source or a mapping with keys matching its fields
(name, repo, config, split, text_col are required;
limit, trust_remote_code, known_count are optional).
Returns: Source
A :class:Source instance.
Raises:
TypeError: If spec is neither a :class:Source nor a mapping.
ValueError: If the mapping contains keys that are not Source fields.
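The coercion contract described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the helper name coerce_source is hypothetical, and the field types and defaults on this stand-in Source dataclass are assumptions (only the field names come from the documentation).

```python
from collections.abc import Mapping
from dataclasses import dataclass, fields
from typing import Any, Optional, Union


@dataclass(frozen=True)
class Source:
    # Field names follow the documentation; types and defaults are assumed.
    name: str
    repo: str
    config: Optional[str]
    split: str
    text_col: str
    limit: Optional[int] = None
    trust_remote_code: bool = False
    known_count: Optional[int] = None


def coerce_source(spec: Union[Source, Mapping[str, Any]]) -> Source:
    """Hypothetical helper: coerce a YAML/CLI mapping into a Source."""
    if isinstance(spec, Source):
        return spec  # existing instances pass through unchanged
    if not isinstance(spec, Mapping):
        raise TypeError(f"expected Source or mapping, got {type(spec).__name__}")
    allowed = {f.name for f in fields(Source)}
    unknown = sorted(set(spec) - allowed)
    if unknown:
        raise ValueError(f"unknown Source fields: {unknown}")
    return Source(**spec)


# Example with placeholder values (not a real corpus).
example = coerce_source(
    {
        "name": "example",
        "repo": "example-org/example-corpus",
        "config": "clean",
        "split": "train",
        "text_col": "text",
    }
)
```

Rejecting unknown keys up front means a typo in a YAML override fails loudly instead of being silently ignored.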
Load one source and normalise it to {audio (decode=False), text, source}.
No audio is decoded: the audio column is cast with Audio(decode=False) and
only the transcript column is touched (text-only Arrow map/filter).
Raises:
ValueError: If the audio column or the source’s text column is missing.
Build the concatenated {audio, text, source} mix (before the ASR transform).
Exposed separately from :func:make_multi_en_asr_dataset so the mix
composition (source labels, normalisation, filtering) can be inspected/tested
without the lazy conversation transform that hides all columns but
conversation.
Parameters:
Source specs to mix. Defaults to the full six-source
:data:SOURCES. Pass a trimmed list for local/smoke runs.
Name of the audio column to standardise on.
Name of the transcript column to standardise on.
Global shuffle seed so consecutive examples mix sources.
None disables shuffling (single-source blocks).
Drop clips shorter than this via header-only
soundfile.info (no PCM decode). None disables the filter.
Drop clips longer than this via the same
header-only probe. Defaults to 30.0 to cap activation memory —
long clips inflate the Whisper feature extractor and can OOM a rank.
None disables the cap.
Returns: Dataset
A map-style HuggingFace Dataset with columns
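The header-only duration filter can be illustrated with a small sketch. To keep it dependency-free, it uses the standard-library wave module as a stand-in for soundfile.info, so it only handles WAV bytes. The function names (probe_duration, keep_clip, make_wav_bytes) and the min_s/max_s parameter names are hypothetical, not the module's API.

```python
import io
import wave
from typing import Optional


def make_wav_bytes(seconds: float, rate: int = 16000) -> bytes:
    """Build a silent mono 16-bit WAV in memory (fixture for the example)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(rate)
        writer.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


def probe_duration(audio_bytes: bytes) -> float:
    """Compute the duration from header fields; PCM frames are not read."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as reader:
        return reader.getnframes() / reader.getframerate()


def keep_clip(
    audio_bytes: bytes,
    min_s: Optional[float] = None,
    max_s: Optional[float] = 30.0,
) -> bool:
    """Return True if the clip passes both bounds; None disables a bound."""
    duration = probe_duration(audio_bytes)
    if min_s is not None and duration < min_s:
        return False
    if max_s is not None and duration > max_s:
        return False
    return True


short_clip = make_wav_bytes(0.5)
normal_clip = make_wav_bytes(2.0)
long_clip = make_wav_bytes(31.0)
```

Because only header fields are consulted, the cost of the filter stays roughly constant per clip regardless of clip length.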
Build the multi-source English ASR training dataset for Qwen-Omni.
Mixes the six public corpora in :data:SOURCES (or a caller-supplied subset),
normalises every transcript with :func:normalize_text, and attaches the
shared lazy ASR transform so audio is decoded with soundfile (and
resampled to sampling_rate) only on item access — never at construction
time and never via torchcodec.
Parameters:
Target sampling rate in Hz for the decoded waveform.
Optional system-turn instruction. None omits it.
User-turn instruction placed before the audio. Defaults to a generic English ASR instruction.
Drop clips shorter than this (header-only
probe). Defaults to 1.0 to dodge the Qwen-Omni Whisper
feature-extractor sub-second off-by-one.
Drop clips longer than this (header-only
probe). Defaults to 30.0 to cap activation memory and avoid
per-rank OOM from long clips in the mix. None disables the cap.
Global shuffle seed; None disables shuffling.
Optional subset of :data:SOURCES (e.g. to skip gated corpora
for a local smoke run). Defaults to all six.
Name of the audio column to standardise on.
Name of the transcript column to standardise on.
Returns: Dataset
A HuggingFace Dataset whose elements are
Normalise a transcript to UPPERCASE, punctuation-free, apostrophes kept.
Parameters:
Raw transcript (may be None).
Returns: str
The normalised transcript. Digits and intra-word apostrophes are kept;
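The normalisation style described above (uppercase, punctuation removed, digits and intra-word apostrophes kept) might look like the sketch below. It is an approximation under stated assumptions, not the module's normalize_text: returning an empty string for None, mapping the curly apostrophe to a straight one, and restricting output to ASCII letters are all guesses.

```python
import re
from typing import Optional

# Anything other than A-Z, digits, apostrophes, and spaces becomes a space.
_NOT_ALLOWED = re.compile(r"[^A-Z0-9' ]+")
# Apostrophes that are not flanked by alphanumerics on both sides.
_STRAY_APOSTROPHE = re.compile(r"(?<![A-Z0-9])'|'(?![A-Z0-9])")
_SPACES = re.compile(r"\s+")


def normalize_text_sketch(text: Optional[str]) -> str:
    """Uppercase, strip punctuation, keep digits and intra-word apostrophes."""
    if text is None:
        return ""  # assumption: a missing transcript maps to an empty string
    out = text.upper().replace("\u2019", "'")  # assumption: unify apostrophes
    out = _NOT_ALLOWED.sub(" ", out)
    out = _STRAY_APOSTROPHE.sub("", out)
    return _SPACES.sub(" ", out).strip()
```

For instance, normalize_text_sketch("It's 5 o'clock.") yields "IT'S 5 O'CLOCK", while quote-style apostrophes around a word are dropped.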