nemo_automodel.components.datasets.audio.multi_en

Multi-source English ASR mix builder for Qwen-Omni fine-tuning.

Mixes several public HuggingFace speech corpora into a single {audio, text, source} training set, normalizing every transcript to a common uppercase / punctuation-free style. Audio is never decoded at construction time: each source’s audio column is cast with Audio(decode=False) and decoded lazily with soundfile inside the shared :func:nemo_automodel.components.datasets.audio.datasets._attach_asr_transform (so no torchcodec dependency is pulled in).

This is the first-class port of the result/data/build_train_mix.py prototype. The key difference from the prototype: the prototype resampled by casting the audio column to a decoding Audio feature with a target sampling rate (which triggers HuggingFace’s default decoding path and pulls in torchcodec), whereas this builder keeps a non-decoding Audio(decode=False) cast and lets resampling happen in the soundfile decode path inside the lazy transform.
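
The split between construction time and access time described above can be illustrated with a small, dependency-free sketch. It is not this module's implementation: the standard-library `wave` module stands in for soundfile, `LazyAudioRows` and `make_wav_bytes` are hypothetical names, the input is assumed to be 16-bit mono WAV, and resampling is left out.

```python
import io
import struct
import wave


def make_wav_bytes(seconds: float, rate: int = 16000) -> bytes:
    """Encode `seconds` of 16-bit mono silence as an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


class LazyAudioRows:
    """Hypothetical stand-in for a dataset with a lazy decode transform.

    Rows keep the *encoded* audio bytes; PCM is decoded only in __getitem__.
    """

    def __init__(self, rows):
        self._rows = list(rows)  # no audio is decoded here
        self.decode_calls = 0    # counter used to show when decoding happens

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        row = self._rows[index]
        self.decode_calls += 1
        with wave.open(io.BytesIO(row["audio"]["bytes"]), "rb") as w:
            rate = w.getframerate()
            raw = w.readframes(w.getnframes())
        # Assumes 16-bit little-endian mono samples; scale to [-1.0, 1.0).
        samples = [s / 32768.0 for (s,) in struct.iter_unpack("<h", raw)]
        return {
            "array": samples,
            "sampling_rate": rate,
            "text": row["text"],
            "source": row["source"],
        }


rows = [
    {"audio": {"bytes": make_wav_bytes(0.25), "path": None}, "text": "HELLO", "source": "demo"},
]
dataset = LazyAudioRows(rows)
calls_after_construction = dataset.decode_calls  # still 0: nothing decoded yet
item = dataset[0]                                # decoding happens here
```

The `{"bytes": ..., "path": ...}` row shape mirrors what a non-decoding audio column holds, which is why construction stays cheap regardless of corpus size.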

Module Contents

Classes

Name	Description
`Source`	Specification for one corpus in the English ASR mix.

Functions

Name	Description
`_coerce_source`	Coerce a YAML/CLI source spec into a :class:`Source`.
`_load_and_normalize_source`	Load one source and normalise it to `{audio (decode=False), text, source}`.
`build_multi_en_source_mix`	Build the concatenated `{audio, text, source}` mix (before the ASR transform).
`make_multi_en_asr_dataset`	Build the multi-source English ASR training dataset for Qwen-Omni.
`normalize_text`	Normalise a transcript to UPPERCASE, punctuation-free, apostrophes kept.

Data

DEFAULT_USER_PROMPT

SOURCES

TARGET_SAMPLING_RATE

API

class nemo_automodel.components.datasets.audio.multi_en.Source(
    name: str,
    repo: str,
    config: str | None,
    split: str,
    text_col: str,
    limit: int | None = None,
    trust_remote_code: bool = False,
    known_count: int | None = None
)

Dataclass

Specification for one corpus in the English ASR mix.

config

str | None

known_count

int | None = None

limit

int | None = None

name

str

repo

str

split

str

text_col

str

trust_remote_code

bool = False

nemo_automodel.components.datasets.audio.multi_en._coerce_source(
    spec: nemo_automodel.components.datasets.audio.multi_en.Source | collections.abc.Mapping[str, typing.Any]
) -> nemo_automodel.components.datasets.audio.multi_en.Source

Coerce a YAML/CLI source spec into a :class:Source.

The recipe config loader passes nested dataset.sources entries as plain dicts, so a Source override written in YAML/CLI arrives here as a mapping. Existing :class:Source instances are returned unchanged.

Parameters:

spec

Source | Mapping[str, Any]

A :class:Source or a mapping with keys matching its fields (name, repo, config, split, text_col are required; limit, trust_remote_code, known_count are optional).

Returns: Source

A :class:Source instance.

Raises:

TypeError: If spec is neither a :class:Source nor a mapping.
ValueError: If the mapping contains keys that are not Source fields.
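
The coercion contract above (pass-through for instances, field-checked construction for mappings) can be sketched as follows. This is an illustrative reimplementation under assumptions, not the module's code: `SourceSketch` and `coerce_source_sketch` are hypothetical names, the error messages are invented, and `"example/repo"` is a placeholder repository id.

```python
from collections.abc import Mapping
from dataclasses import dataclass, fields
from typing import Any, Optional, Union


@dataclass
class SourceSketch:
    """Hypothetical mirror of the documented Source fields."""

    name: str
    repo: str
    config: Optional[str]
    split: str
    text_col: str
    limit: Optional[int] = None
    trust_remote_code: bool = False
    known_count: Optional[int] = None


def coerce_source_sketch(spec: Union[SourceSketch, Mapping[str, Any]]) -> SourceSketch:
    """Return `spec` as a SourceSketch, validating mapping keys first."""
    if isinstance(spec, SourceSketch):
        return spec  # pass-through for existing instances
    if not isinstance(spec, Mapping):
        raise TypeError(f"expected SourceSketch or mapping, got {type(spec).__name__}")
    allowed = {f.name for f in fields(SourceSketch)}
    unknown = sorted(set(spec) - allowed)
    if unknown:
        raise ValueError(f"unknown source fields: {unknown}")
    # Missing required keys surface as the dataclass constructor's TypeError.
    return SourceSketch(**spec)


# A YAML/CLI override arrives as a plain dict like this one.
spec = {"name": "demo", "repo": "example/repo", "config": None, "split": "train", "text_col": "text"}
source = coerce_source_sketch(spec)
```

Rejecting unknown keys up front turns a misspelled override such as `text_column` into an immediate error rather than a silently ignored setting.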

nemo_automodel.components.datasets.audio.multi_en._load_and_normalize_source(
    src: nemo_automodel.components.datasets.audio.multi_en.Source,
    audio_column: str,
    text_column: str
) -> datasets.Dataset

Load one source and normalise it to {audio (decode=False), text, source}.

No audio is decoded: the audio column is cast with Audio(decode=False) and only the transcript column is touched (text-only Arrow map/filter).

Raises:

ValueError: If the audio column or the source’s text column is missing.

nemo_automodel.components.datasets.audio.multi_en.build_multi_en_source_mix(
    sources: list[nemo_automodel.components.datasets.audio.multi_en.Source | collections.abc.Mapping[str, typing.Any]] | None = None,
    audio_column: str = 'audio',
    text_column: str = 'text',
    shuffle_seed: int | None = 42,
    min_audio_duration_seconds: float | None = 1.0,
    max_audio_duration_seconds: float | None = 30.0
) -> datasets.Dataset

Build the concatenated {audio, text, source} mix (before the ASR transform).

Exposed separately from :func:make_multi_en_asr_dataset so the mix composition (source labels, normalisation, filtering) can be inspected/tested without the lazy conversation transform that hides all columns but conversation.

Parameters:

sources

list[Source | Mapping[str, Any]] | None. Defaults to None.

Source specs to mix. Defaults to the full six-source :data:SOURCES. Pass a trimmed list for local/smoke runs.

audio_column

str. Defaults to 'audio'.

Name of the audio column to standardise on.

text_column

str. Defaults to 'text'.

Name of the transcript column to standardise on.

shuffle_seed

int | None. Defaults to 42.

Global shuffle seed, applied so that consecutive examples come from different sources. None disables shuffling, leaving the mix as contiguous single-source blocks.

min_audio_duration_seconds

float | None. Defaults to 1.0.

Drop clips shorter than this via header-only soundfile.info (no PCM decode). None disables the filter.

max_audio_duration_seconds

float | None. Defaults to 30.0.

Drop clips longer than this via the same header-only probe. Defaults to 30.0 to cap activation memory — long clips inflate the Whisper feature extractor and can OOM a rank. None disables the cap.

Returns: Dataset

A map-style HuggingFace Dataset with columns {audio, text, source}.
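
The duration filter described by min_audio_duration_seconds and max_audio_duration_seconds can be sketched as a header-only probe: duration is the frame count divided by the sample rate, so no PCM has to be decoded. The sketch below is an assumption-laden illustration rather than the module's code. It uses the standard-library `wave` module as a stand-in for soundfile.info so that it runs without extra dependencies, and `make_wav_bytes`, `probe_duration_seconds`, and `keep_clip` are hypothetical names.

```python
import io
import wave
from typing import Optional


def make_wav_bytes(seconds: float, rate: int = 16000) -> bytes:
    """Encode `seconds` of 16-bit mono silence as an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


def probe_duration_seconds(wav_bytes: bytes) -> float:
    """Read only the header fields: frame count and sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()


def keep_clip(
    wav_bytes: bytes,
    min_seconds: Optional[float] = 1.0,
    max_seconds: Optional[float] = 30.0,
) -> bool:
    """Return True if the clip passes both bounds; None disables a bound."""
    duration = probe_duration_seconds(wav_bytes)
    if min_seconds is not None and duration < min_seconds:
        return False
    if max_seconds is not None and duration > max_seconds:
        return False
    return True


short_clip = make_wav_bytes(0.5)   # below the 1.0 s minimum
normal_clip = make_wav_bytes(2.0)  # inside [1.0, 30.0]
long_clip = make_wav_bytes(31.0)   # above the 30.0 s cap
```

A predicate of this shape can be applied row by row as a filter, which keeps the cost of screening a large mix proportional to the number of headers read rather than the amount of audio.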

nemo_automodel.components.datasets.audio.multi_en.make_multi_en_asr_dataset(
    sampling_rate: int = TARGET_SAMPLING_RATE,
    system_prompt: str | None = None,
    user_prompt: str | None = DEFAULT_USER_PROMPT,
    min_audio_duration_seconds: float | None = 1.0,
    max_audio_duration_seconds: float | None = 30.0,
    shuffle_seed: int | None = 42,
    sources: list[nemo_automodel.components.datasets.audio.multi_en.Source | collections.abc.Mapping[str, typing.Any]] | None = None,
    audio_column: str = 'audio',
    text_column: str = 'text'
) -> datasets.Dataset

Build the multi-source English ASR training dataset for Qwen-Omni.

Mixes the six public corpora in :data:SOURCES (or a caller-supplied subset), normalises every transcript with :func:normalize_text, and attaches the shared lazy ASR transform so audio is decoded with soundfile (and resampled to sampling_rate) only on item access — never at construction time and never via torchcodec.

Parameters:

sampling_rate

int. Defaults to TARGET_SAMPLING_RATE.

Target sampling rate in Hz for the decoded waveform.

system_prompt

str | None. Defaults to None.

Optional system-turn instruction. None omits it.

user_prompt

str | None. Defaults to DEFAULT_USER_PROMPT.

User-turn instruction placed before the audio. Defaults to a generic English ASR instruction.

min_audio_duration_seconds

float | None. Defaults to 1.0.

Drop clips shorter than this (header-only probe). Defaults to 1.0 to avoid an off-by-one in the Qwen-Omni Whisper feature extractor on sub-second clips. None disables the filter.

max_audio_duration_seconds

float | None. Defaults to 30.0.

Drop clips longer than this (header-only probe). Defaults to 30.0 to cap activation memory and avoid per-rank OOM from long clips in the mix. None disables the cap.

shuffle_seed

int | None. Defaults to 42.

Global shuffle seed; None disables shuffling.

sources

list[Source | Mapping[str, Any]] | None. Defaults to None.

Optional subset of :data:SOURCES (e.g. to skip gated corpora for a local smoke run). Defaults to all six.

audio_column

str. Defaults to 'audio'.

Name of the audio column to standardise on.

text_column

str. Defaults to 'text'.

Name of the transcript column to standardise on.

Returns: Dataset

A HuggingFace Dataset whose elements are built lazily by the shared ASR transform on item access.

nemo_automodel.components.datasets.audio.multi_en.normalize_text(
    text: str
) -> str

Normalise a transcript to UPPERCASE, punctuation-free, apostrophes kept.

Parameters:

text

str

Raw transcript (may be None).

Returns: str

The normalised transcript. Digits and intra-word apostrophes are kept; other punctuation is removed.
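
A minimal sketch of this normalisation, built from the three regular expressions listed in this module's data section, might look like the following. The order of the steps, the choice to replace matches with a space, and the handling of None as an empty string are assumptions made for illustration; `normalize_text_sketch` is a hypothetical name and the real function may differ in edge cases.

```python
import re
from typing import Optional

# Same patterns as the module-level _TAG_RE, _PUNCT_RE and _WS_RE.
_TAG_RE = re.compile(r"<[^>]*>")
_PUNCT_RE = re.compile(r"[^\w\s']")
_WS_RE = re.compile(r"\s+")


def normalize_text_sketch(text: Optional[str]) -> str:
    """Uppercase, strip tags and punctuation, keep apostrophes and digits."""
    if text is None:
        return ""  # assumption: a missing transcript becomes an empty string
    # Tags go first: the punctuation pass would otherwise strip '<' and '>'
    # and leave the tag name behind as a spurious word.
    text = _TAG_RE.sub(" ", text)
    # Everything except word characters, whitespace and apostrophes is dropped.
    text = _PUNCT_RE.sub(" ", text)
    # Collapse the runs of spaces left by the two substitutions.
    text = _WS_RE.sub(" ", text).strip()
    return text.upper()


example = normalize_text_sketch("Hello, <laugh> world! It's 9 o'clock.")
# example == "HELLO WORLD IT'S 9 O'CLOCK"
```

Note that `\w` also matches underscores and non-ASCII letters, so those pass through this sketch untouched.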

nemo_automodel.components.datasets.audio.multi_en.DEFAULT_USER_PROMPT = 'Transcribe the following English audio verbatim. Output only the raw transcript...

nemo_automodel.components.datasets.audio.multi_en.SOURCES: list[Source] = [Source('ami_ihm', 'edinburghcstr/ami', 'ihm', 'train', 'text', known_count=1085...

nemo_automodel.components.datasets.audio.multi_en.TARGET_SAMPLING_RATE = 16000

nemo_automodel.components.datasets.audio.multi_en._PUNCT_RE = re.compile("[^\\w\\s']")

nemo_automodel.components.datasets.audio.multi_en._TAG_RE = re.compile('<[^>]*>')

nemo_automodel.components.datasets.audio.multi_en._WS_RE = re.compile('\\s+')

nemo_automodel.components.datasets.audio.multi_en.logger = logging.getLogger(__name__)