nemo_automodel.components.datasets.audio.multi_en


Multi-source English ASR mix builder for Qwen-Omni fine-tuning.

Mixes several public HuggingFace speech corpora into a single {audio, text, source} training set, normalizing every transcript to a common uppercase / punctuation-free style. Audio is never decoded at construction time: each source’s audio column is cast with Audio(decode=False) and decoded lazily with soundfile inside the shared :func:nemo_automodel.components.datasets.audio.datasets._attach_asr_transform (so no torchcodec dependency is pulled in).

This is the first-class port of the result/data/build_train_mix.py prototype. The key difference from the prototype: the prototype resampled by casting the audio column to a decoding Audio feature with a target sampling rate (which triggers HuggingFace’s default decoding path and pulls in torchcodec), whereas this builder keeps a non-decoding Audio(decode=False) cast and lets resampling happen in the soundfile decode path inside the lazy transform.
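
The lazy-decode idea described above can be illustrated with a minimal, standard-library-only sketch. It is not the builder's code: the `wave` module stands in for soundfile, `LazyAudioRows` and `make_wav_bytes` are hypothetical names, resampling is left out, and the `{"bytes", "path"}` row shape is assumed to match what a non-decoding audio column yields.

```python
import io
import struct
import wave


def make_wav_bytes(num_frames: int, sample_rate: int = 16000) -> bytes:
    """Encode `num_frames` of 16-bit mono silence as an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * num_frames)
    return buf.getvalue()


class LazyAudioRows:
    """Hypothetical container: stores encoded bytes, decodes only on item access."""

    def __init__(self, rows):
        # Each row is assumed to look like:
        # {"audio": {"bytes": <encoded file>, "path": None}, "text": <transcript>}
        self._rows = rows
        self.decode_calls = 0  # counts how often audio was actually decoded

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, idx):
        row = self._rows[idx]
        self.decode_calls += 1
        with wave.open(io.BytesIO(row["audio"]["bytes"]), "rb") as w:
            rate = w.getframerate()
            frames = w.readframes(w.getnframes())
        # Unpack little-endian 16-bit samples.
        samples = struct.unpack(f"<{len(frames) // 2}h", frames)
        return {"samples": samples, "sampling_rate": rate, "text": row["text"]}


rows = [{"audio": {"bytes": make_wav_bytes(16000), "path": None}, "text": "HELLO"}]
lazy = LazyAudioRows(rows)
calls_after_construction = lazy.decode_calls  # nothing decoded yet
item = lazy[0]                                # decoding happens here
```

Construction only stores the encoded bytes; the decode cost is paid per item, which is the property the builder relies on.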

Module Contents

Classes

| Name | Description |
| --- | --- |
| Source | Specification for one corpus in the English ASR mix. |

Functions

| Name | Description |
| --- | --- |
| _coerce_source | Coerce a YAML/CLI source spec into a :class:Source. |
| _load_and_normalize_source | Load one source and normalise it to {audio (decode=False), text, source}. |
| build_multi_en_source_mix | Build the concatenated {audio, text, source} mix (before the ASR transform). |
| make_multi_en_asr_dataset | Build the multi-source English ASR training dataset for Qwen-Omni. |
| normalize_text | Normalise a transcript to UPPERCASE, punctuation-free, apostrophes kept. |

Data

DEFAULT_USER_PROMPT

SOURCES

TARGET_SAMPLING_RATE

_PUNCT_RE

_TAG_RE

_WS_RE

logger

API

class nemo_automodel.components.datasets.audio.multi_en.Source(
name: str,
repo: str,
config: str | None,
split: str,
text_col: str,
limit: int | None = None,
trust_remote_code: bool = False,
known_count: int | None = None
)
Dataclass

Specification for one corpus in the English ASR mix.

config: str | None
known_count: int | None = None
limit: int | None = None
name: str
repo: str
split: str
text_col: str
trust_remote_code: bool = False

nemo_automodel.components.datasets.audio.multi_en._coerce_source(
spec: nemo_automodel.components.datasets.audio.multi_en.Source | collections.abc.Mapping[str, typing.Any]
) -> nemo_automodel.components.datasets.audio.multi_en.Source

Coerce a YAML/CLI source spec into a :class:Source.

The recipe config loader passes nested dataset.sources entries as plain dicts, so a Source override written in YAML/CLI arrives here as a mapping. Existing :class:Source instances are passed through unchanged.

Parameters:

spec
Source | Mapping[str, Any]

A :class:Source or a mapping with keys matching its fields (name, repo, config, split, text_col are required; limit, trust_remote_code, known_count are optional).

Returns: Source

A :class:Source instance.

Raises:

  • TypeError: If spec is neither a :class:Source nor a mapping.
  • ValueError: If the mapping contains keys that are not Source fields.
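
The coercion contract above can be sketched as follows. This is an illustration, not the module's implementation: the local `Source` dataclass only mirrors the documented fields, `coerce_source_sketch` is a hypothetical name, and the error messages are invented.

```python
from collections.abc import Mapping
from dataclasses import dataclass, fields
from typing import Any, Optional, Union


@dataclass
class Source:
    """Local stand-in that mirrors the documented Source fields."""

    name: str
    repo: str
    config: Optional[str]
    split: str
    text_col: str
    limit: Optional[int] = None
    trust_remote_code: bool = False
    known_count: Optional[int] = None


def coerce_source_sketch(spec: Union[Source, Mapping]) -> Source:
    """Return `spec` as a Source, following the documented Raises contract."""
    if isinstance(spec, Source):
        return spec  # existing instances pass through unchanged
    if not isinstance(spec, Mapping):
        raise TypeError(f"expected Source or mapping, got {type(spec).__name__}")
    allowed = {f.name for f in fields(Source)}
    unknown = set(spec) - allowed
    if unknown:
        raise ValueError(f"unknown Source fields: {sorted(unknown)}")
    # Assumption: missing required keys are left to the dataclass constructor,
    # which raises TypeError in this sketch.
    return Source(**spec)


spec: Mapping[str, Any] = {
    "name": "ami_ihm",
    "repo": "edinburghcstr/ami",
    "config": "ihm",
    "split": "train",
    "text_col": "text",
    "limit": 100,  # illustrative value
}
src = coerce_source_sketch(spec)
```

Rejecting unknown keys up front means a misspelled field in a YAML override fails loudly rather than being silently ignored.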
nemo_automodel.components.datasets.audio.multi_en._load_and_normalize_source(
src: nemo_automodel.components.datasets.audio.multi_en.Source,
audio_column: str,
text_column: str
) -> datasets.Dataset

Load one source and normalise it to {audio (decode=False), text, source}.

No audio is decoded: the audio column is cast with Audio(decode=False) and only the transcript column is touched (text-only Arrow map/filter).

Raises:

  • ValueError: If the audio column or the source’s text column is missing.
nemo_automodel.components.datasets.audio.multi_en.build_multi_en_source_mix(
sources: list[nemo_automodel.components.datasets.audio.multi_en.Source | collections.abc.Mapping[str, typing.Any]] | None = None,
audio_column: str = 'audio',
text_column: str = 'text',
shuffle_seed: int | None = 42,
min_audio_duration_seconds: float | None = 1.0,
max_audio_duration_seconds: float | None = 30.0
) -> datasets.Dataset

Build the concatenated {audio, text, source} mix (before the ASR transform).

Exposed separately from :func:make_multi_en_asr_dataset so the mix composition (source labels, normalisation, filtering) can be inspected/tested without the lazy conversation transform that hides all columns but conversation.

Parameters:

sources
list[Source | Mapping[str, Any]] | None. Defaults to None

Source specs to mix. Defaults to the full six-source :data:SOURCES. Pass a trimmed list for local/smoke runs.

audio_column
str. Defaults to 'audio'

Name of the audio column to standardise on.

text_column
str. Defaults to 'text'

Name of the transcript column to standardise on.

shuffle_seed
int | None. Defaults to 42

Global shuffle seed so consecutive examples mix sources. None disables shuffling (single-source blocks).

min_audio_duration_seconds
float | None. Defaults to 1.0

Drop clips shorter than this via header-only soundfile.info (no PCM decode). None disables the filter.

max_audio_duration_seconds
float | None. Defaults to 30.0

Drop clips longer than this via the same header-only probe. Defaults to 30.0 to cap activation memory — long clips inflate the Whisper feature extractor and can OOM a rank. None disables the cap.

Returns: Dataset

A map-style HuggingFace Dataset with columns {audio, text, source}.
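
The header-only duration filter can be sketched with the standard library. Here the `wave` module stands in for soundfile.info, all function names are hypothetical, and the boundary handling (clips exactly at a bound are kept) is an assumption read off the words "shorter than" and "longer than".

```python
import io
import wave
from typing import Optional


def make_wav_bytes(seconds: float, sample_rate: int = 16000) -> bytes:
    """Encode `seconds` of 16-bit mono silence as an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


def probe_duration_seconds(wav_bytes: bytes) -> float:
    """Compute duration from header fields; no PCM frames are read."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()


def keep_clip(
    duration: float,
    min_seconds: Optional[float] = 1.0,
    max_seconds: Optional[float] = 30.0,
) -> bool:
    """Return False for clips outside the bounds; None disables a bound."""
    if min_seconds is not None and duration < min_seconds:
        return False
    if max_seconds is not None and duration > max_seconds:
        return False
    return True


clips = {"short": make_wav_bytes(0.5), "ok": make_wav_bytes(2.0), "long": make_wav_bytes(31.0)}
kept = [name for name, data in clips.items() if keep_clip(probe_duration_seconds(data))]
```

Because only the frame count and sample rate are consulted, the filter's cost does not grow with the amount of audio decoded, which is what makes it practical to run over the whole mix at construction time.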

nemo_automodel.components.datasets.audio.multi_en.make_multi_en_asr_dataset(
sampling_rate: int = TARGET_SAMPLING_RATE,
system_prompt: str | None = None,
user_prompt: str | None = DEFAULT_USER_PROMPT,
min_audio_duration_seconds: float | None = 1.0,
max_audio_duration_seconds: float | None = 30.0,
shuffle_seed: int | None = 42,
sources: list[nemo_automodel.components.datasets.audio.multi_en.Source | collections.abc.Mapping[str, typing.Any]] | None = None,
audio_column: str = 'audio',
text_column: str = 'text'
) -> datasets.Dataset

Build the multi-source English ASR training dataset for Qwen-Omni.

Mixes the six public corpora in :data:SOURCES (or a caller-supplied subset), normalises every transcript with :func:normalize_text, and attaches the shared lazy ASR transform so audio is decoded with soundfile (and resampled to sampling_rate) only on item access — never at construction time and never via torchcodec.

Parameters:

sampling_rate
int. Defaults to TARGET_SAMPLING_RATE

Target sampling rate in Hz for the decoded waveform.

system_prompt
str | None. Defaults to None

Optional system-turn instruction. None omits it.

user_prompt
str | None. Defaults to DEFAULT_USER_PROMPT

User-turn instruction placed before the audio. Defaults to a generic English ASR instruction.

min_audio_duration_seconds
float | None. Defaults to 1.0

Drop clips shorter than this (header-only probe). Defaults to 1.0 to dodge the Qwen-Omni Whisper feature-extractor sub-second off-by-one.

max_audio_duration_seconds
float | None. Defaults to 30.0

Drop clips longer than this (header-only probe). Defaults to 30.0 to cap activation memory and avoid per-rank OOM from long clips in the mix. None disables the cap.

shuffle_seed
int | None. Defaults to 42

Global shuffle seed; None disables shuffling.

sources
list[Source | Mapping[str, Any]] | None. Defaults to None

Optional subset of :data:SOURCES (e.g. to skip gated corpora for a local smoke run). Defaults to all six.

audio_column
str. Defaults to 'audio'

Name of the audio column to standardise on.

text_column
str. Defaults to 'text'

Name of the transcript column to standardise on.

Returns: Dataset

A HuggingFace Dataset whose elements are produced lazily by the shared ASR transform on item access.
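
A local smoke run with a trimmed source list might be set up along these lines. The keyword names come from the signature above and the source fields from the SOURCES entry shown on this page; `smoke_sources`, `smoke_kwargs`, and `build_smoke_dataset` are hypothetical names, and `limit` is assumed to cap the rows taken from the corpus, which this page does not state. The builder itself is only imported inside the function, since calling it requires nemo_automodel and access to the corpus.

```python
# One source, passed as a plain mapping (the form a YAML/CLI override takes).
smoke_sources = [
    {
        "name": "ami_ihm",
        "repo": "edinburghcstr/ami",
        "config": "ihm",
        "split": "train",
        "text_col": "text",
        "limit": 100,  # illustrative value; assumed to cap the row count
    }
]

smoke_kwargs = {
    "sources": smoke_sources,
    "sampling_rate": 16000,              # same value as TARGET_SAMPLING_RATE
    "min_audio_duration_seconds": 1.0,
    "max_audio_duration_seconds": 30.0,
    "shuffle_seed": None,                # None disables shuffling
}


def build_smoke_dataset():
    """Build the dataset; needs nemo_automodel installed and the corpus reachable."""
    from nemo_automodel.components.datasets.audio.multi_en import (
        make_multi_en_asr_dataset,
    )

    return make_multi_en_asr_dataset(**smoke_kwargs)
```

Keeping the keyword arguments in a plain dict makes it easy to log or diff the exact configuration used for a run.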

nemo_automodel.components.datasets.audio.multi_en.normalize_text(
text: str
) -> str

Normalise a transcript to UPPERCASE, punctuation-free, apostrophes kept.

Parameters:

text
str

Raw transcript (may be None).

Returns: str

The normalised transcript. Digits and intra-word apostrophes are kept.
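
A sketch of the normalisation, built on the three regular expressions listed in the Data section of this page. The order of the steps, the substitution of a space for each removed tag or punctuation mark, and the empty-string result for None are assumptions, and `normalize_text_sketch` is a hypothetical name rather than the module's function.

```python
import re
from typing import Optional

# Patterns as listed for _TAG_RE, _PUNCT_RE and _WS_RE on this page.
_TAG_RE = re.compile(r"<[^>]*>")
_PUNCT_RE = re.compile(r"[^\w\s']")
_WS_RE = re.compile(r"\s+")


def normalize_text_sketch(text: Optional[str]) -> str:
    """Uppercase a transcript, dropping tags and punctuation but not apostrophes."""
    if text is None:
        return ""  # assumption: a missing transcript becomes an empty string
    text = _TAG_RE.sub(" ", text)    # remove angle-bracket tags such as <unk>
    text = _PUNCT_RE.sub(" ", text)  # remove punctuation, keep word chars and '
    text = _WS_RE.sub(" ", text).strip()  # collapse runs of whitespace
    return text.upper()


example = normalize_text_sketch("Hello, world! <unk> it's 42.")
```

Replacing removed characters with a space instead of deleting them keeps hyphenated or slash-joined words from being glued together; the real function may make a different choice.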

nemo_automodel.components.datasets.audio.multi_en.DEFAULT_USER_PROMPT = 'Transcribe the following English audio verbatim. Output only the raw transcript...
nemo_automodel.components.datasets.audio.multi_en.SOURCES: list[Source] = [Source('ami_ihm', 'edinburghcstr/ami', 'ihm', 'train', 'text', known_count=1085...
nemo_automodel.components.datasets.audio.multi_en.TARGET_SAMPLING_RATE = 16000
nemo_automodel.components.datasets.audio.multi_en._PUNCT_RE = re.compile("[^\\w\\s']")
nemo_automodel.components.datasets.audio.multi_en._TAG_RE = re.compile('<[^>]*>')
nemo_automodel.components.datasets.audio.multi_en._WS_RE = re.compile('\\s+')
nemo_automodel.components.datasets.audio.multi_en.logger = logging.getLogger(__name__)