> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# nemo_automodel.components.datasets.audio.multi_en

Multi-source English ASR mix builder for Qwen-Omni fine-tuning.

Mixes several public HuggingFace speech corpora into a single
`&#123;audio, text, source&#125;` training set, normalizing every transcript to a common
uppercase / punctuation-free style. Audio is never decoded at construction time:
each source's audio column is cast with `Audio(decode=False)` and decoded
lazily with `soundfile` inside the shared
:func:`nemo_automodel.components.datasets.audio.datasets._attach_asr_transform`
(so no `torchcodec` dependency is pulled in).

This is the first-class port of the `result/data/build_train_mix.py` prototype.
The key difference from the prototype: the prototype resampled by casting the audio
column to a decoding `Audio` feature with a target sampling rate (which triggers
HuggingFace's default decoding path and pulls in `torchcodec`), whereas this builder
keeps a non-decoding `Audio(decode=False)` cast and lets resampling happen in the
soundfile decode path inside the lazy transform.

## Module Contents

### Classes

| Name                                                                  | Description                                          |
| --------------------------------------------------------------------- | ---------------------------------------------------- |
| [`Source`](#nemo_automodel-components-datasets-audio-multi_en-Source) | Specification for one corpus in the English ASR mix. |

### Functions

| Name                                                                                                          | Description                                                                              |
| ------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| [`_coerce_source`](#nemo_automodel-components-datasets-audio-multi_en-_coerce_source)                         | Coerce a YAML/CLI source spec into a :class:`Source`.                                    |
| [`_load_and_normalize_source`](#nemo_automodel-components-datasets-audio-multi_en-_load_and_normalize_source) | Load one source and normalise it to `&#123;audio (decode=False), text, source&#125;`.    |
| [`build_multi_en_source_mix`](#nemo_automodel-components-datasets-audio-multi_en-build_multi_en_source_mix)   | Build the concatenated `&#123;audio, text, source&#125;` mix (before the ASR transform). |
| [`make_multi_en_asr_dataset`](#nemo_automodel-components-datasets-audio-multi_en-make_multi_en_asr_dataset)   | Build the multi-source English ASR training dataset for Qwen-Omni.                       |
| [`normalize_text`](#nemo_automodel-components-datasets-audio-multi_en-normalize_text)                         | Normalise a transcript to UPPERCASE, punctuation-free, apostrophes kept.                 |

### Data

[`DEFAULT_USER_PROMPT`](#nemo_automodel-components-datasets-audio-multi_en-DEFAULT_USER_PROMPT)

[`SOURCES`](#nemo_automodel-components-datasets-audio-multi_en-SOURCES)

[`TARGET_SAMPLING_RATE`](#nemo_automodel-components-datasets-audio-multi_en-TARGET_SAMPLING_RATE)

[`_PUNCT_RE`](#nemo_automodel-components-datasets-audio-multi_en-_PUNCT_RE)

[`_TAG_RE`](#nemo_automodel-components-datasets-audio-multi_en-_TAG_RE)

[`_WS_RE`](#nemo_automodel-components-datasets-audio-multi_en-_WS_RE)

[`logger`](#nemo_automodel-components-datasets-audio-multi_en-logger)

### API

```python
class nemo_automodel.components.datasets.audio.multi_en.Source(
    name: str,
    repo: str,
    config: str | None,
    split: str,
    text_col: str,
    limit: int | None = None,
    trust_remote_code: bool = False,
    known_count: int | None = None
)
```

Dataclass

Specification for one corpus in the English ASR mix.

```python
nemo_automodel.components.datasets.audio.multi_en._coerce_source(
    spec: nemo_automodel.components.datasets.audio.multi_en.Source | collections.abc.Mapping[str, typing.Any]
) -> nemo_automodel.components.datasets.audio.multi_en.Source
```

Coerce a YAML/CLI source spec into a :class:`Source`.

The recipe config loader passes nested `dataset.sources` entries as plain
dicts, so a `Source` override written in YAML/CLI arrives here as a mapping.
Pass-through for existing :class:`Source` instances.

**Parameters:**

A :class:`Source` or a mapping with keys matching its fields
(`name`, `repo`, `config`, `split`, `text_col` are required;
`limit`, `trust_remote_code`, `known_count` are optional).

**Returns:** `Source`

class:`Source` instance.

**Raises:**

* `TypeError`: If `spec` is neither a :class:`Source` nor a mapping.
* `ValueError`: If the mapping contains keys that are not `Source` fields.

```python
nemo_automodel.components.datasets.audio.multi_en._load_and_normalize_source(
    src: nemo_automodel.components.datasets.audio.multi_en.Source,
    audio_column: str,
    text_column: str
) -> datasets.Dataset
```

Load one source and normalise it to `&#123;audio (decode=False), text, source&#125;`.

No audio is decoded: the audio column is cast with `Audio(decode=False)` and
only the transcript column is touched (text-only Arrow `map`/`filter`).

**Raises:**

* `ValueError`: If the audio column or the source's text column is missing.

```python
nemo_automodel.components.datasets.audio.multi_en.build_multi_en_source_mix(
    sources: list[nemo_automodel.components.datasets.audio.multi_en.Source | collections.abc.Mapping[str, typing.Any]] | None = None,
    audio_column: str = 'audio',
    text_column: str = 'text',
    shuffle_seed: int | None = 42,
    min_audio_duration_seconds: float | None = 1.0,
    max_audio_duration_seconds: float | None = 30.0
) -> datasets.Dataset
```

Build the concatenated `&#123;audio, text, source&#125;` mix (before the ASR transform).

Exposed separately from :func:`make_multi_en_asr_dataset` so the mix
composition (source labels, normalisation, filtering) can be inspected/tested
without the lazy conversation transform that hides all columns but
`conversation`.

**Parameters:**

Source specs to mix. Defaults to the full six-source
:data:`SOURCES`. Pass a trimmed list for local/smoke runs.

Name of the audio column to standardise on.

Name of the transcript column to standardise on.

Global shuffle seed so consecutive examples mix sources.
`None` disables shuffling (single-source blocks).

Drop clips shorter than this via header-only
`soundfile.info` (no PCM decode). `None` disables the filter.

Drop clips longer than this via the same
header-only probe. Defaults to `30.0` to cap activation memory —
long clips inflate the Whisper feature extractor and can OOM a rank.
`None` disables the cap.

**Returns:** `Dataset`

A map-style HuggingFace `Dataset` with columns

```python
nemo_automodel.components.datasets.audio.multi_en.make_multi_en_asr_dataset(
    sampling_rate: int = TARGET_SAMPLING_RATE,
    system_prompt: str | None = None,
    user_prompt: str | None = DEFAULT_USER_PROMPT,
    min_audio_duration_seconds: float | None = 1.0,
    max_audio_duration_seconds: float | None = 30.0,
    shuffle_seed: int | None = 42,
    sources: list[nemo_automodel.components.datasets.audio.multi_en.Source | collections.abc.Mapping[str, typing.Any]] | None = None,
    audio_column: str = 'audio',
    text_column: str = 'text'
) -> datasets.Dataset
```

Build the multi-source English ASR training dataset for Qwen-Omni.

Mixes the six public corpora in :data:`SOURCES` (or a caller-supplied subset),
normalises every transcript with :func:`normalize_text`, and attaches the
shared lazy ASR transform so audio is decoded with `soundfile` (and
resampled to `sampling_rate`) only on item access — never at construction
time and never via `torchcodec`.

**Parameters:**

Target sampling rate in Hz for the decoded waveform.

Optional system-turn instruction. `None` omits it.

User-turn instruction placed before the audio. Defaults to
a generic English ASR instruction.

Drop clips shorter than this (header-only
probe). Defaults to `1.0` to dodge the Qwen-Omni Whisper
feature-extractor sub-second off-by-one.

Drop clips longer than this (header-only
probe). Defaults to `30.0` to cap activation memory and avoid
per-rank OOM from long clips in the mix. `None` disables the cap.

Global shuffle seed; `None` disables shuffling.

Optional subset of :data:`SOURCES` (e.g. to skip gated corpora
for a local smoke run). Defaults to all six.

Name of the audio column to standardise on.

Name of the transcript column to standardise on.

**Returns:** `Dataset`

A HuggingFace `Dataset` whose elements are

```python
nemo_automodel.components.datasets.audio.multi_en.normalize_text(
    text: str
) -> str
```

Normalise a transcript to UPPERCASE, punctuation-free, apostrophes kept.

**Parameters:**

Raw transcript (may be `None`).

**Returns:** `str`

The normalised transcript. Digits and intra-word apostrophes are kept;

```python
nemo_automodel.components.datasets.audio.multi_en.DEFAULT_USER_PROMPT = 'Transcribe the following English audio verbatim. Output only the raw transcript...
```

```python
nemo_automodel.components.datasets.audio.multi_en.SOURCES: list[Source] = [Source('ami_ihm', 'edinburghcstr/ami', 'ihm', 'train', 'text', known_count=1085...
```

```python
nemo_automodel.components.datasets.audio.multi_en.TARGET_SAMPLING_RATE = 16000
```

```python
nemo_automodel.components.datasets.audio.multi_en._PUNCT_RE = re.compile("[^\\w\\s']")
```

```python
nemo_automodel.components.datasets.audio.multi_en._TAG_RE = re.compile('<[^>]*>')
```

```python
nemo_automodel.components.datasets.audio.multi_en._WS_RE = re.compile('\\s+')
```

```python
nemo_automodel.components.datasets.audio.multi_en.logger = logging.getLogger(__name__)
```