Fine-Tune Qwen3-Omni for ASR#

This guide covers end-to-end ASR fine-tuning of Qwen/Qwen3-Omni-30B-A3B-Instruct on a Hugging Face audio dataset, using the NeMo AutoModel VLM training stack. The running example is the public edinburghcstr/ami meeting corpus (English IHM), but the same recipe works for any HF dataset that exposes an audio column and a transcript column (AMI, LibriSpeech, GigaSpeech, WenetSpeech, CommonVoice, …).

The workflow has two stages:

  1. Train the thinker sub-model with the FinetuneRecipeForVLM recipe.

  2. Convert the NeMo-saved thinker checkpoint into a Hugging Face-compatible Qwen3-Omni export so transformers.AutoModel* and vLLM can load it.


Data Preparation#

Built-In Builder: make_hf_audio_asr_dataset#

nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset returns a Hugging Face Dataset whose __getitem__ lazily produces a single {"conversation": [...]} dict suitable for qwen3_omni_asr_collate_fn. Key design points:

  • with_transform for lazy decoding. Building the dataset object is a constant-time metadata read; audio decode and chat-template assembly only run inside dataloader workers when a batch is fetched. Startup time is independent of split size.

  • Configurable prompt shape. Defaults are system_prompt=None and user_prompt=None, yielding the minimal user(audio) → assistant(text) conversation. Setting either or both expands the conversation: system_prompt="..." adds a system turn, and user_prompt="..." prepends a text item before the audio inside the user turn. Whitespace-only prompts are treated as absent.

  • Dataset-agnostic. Accepts any HF audio dataset that exposes an audio column and a transcript column. Defaults (audio_column="audio", text_column="text", name=None) cover AMI, LibriSpeech, GigaSpeech, and WenetSpeech out of the box; per-dataset overrides go in the recipe YAML.

from nemo_automodel.components.datasets.audio.datasets import (
    make_hf_audio_asr_dataset,
)

dataset = make_hf_audio_asr_dataset(
    path_or_dataset="edinburghcstr/ami",
    name="ihm",
    split="train",
    sampling_rate=16000,
    user_prompt="Transcribe the English audio into text.",
)
# dataset[0]["conversation"] yields:
#   [
#     {"role": "user",      "content": [{"type": "text", "text": "Transcribe…"},
#                                       {"type": "audio", "audio": np.ndarray}]},
#     {"role": "assistant", "content": [{"type": "text", "text": "..."}]},
#   ]
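
The prompt-shape rules described above can be sketched in plain Python. This is an illustration only, not the library's implementation: `build_conversation` and `_clean` are hypothetical helpers, the list-of-items layout of the system turn is an assumption, and the audio payload is a plain list here where the real builder supplies a NumPy array.

```python
from typing import Any, Optional


def _clean(prompt: Optional[str]) -> Optional[str]:
    """Treat None and whitespace-only prompts as absent."""
    if prompt is None or not prompt.strip():
        return None
    return prompt


def build_conversation(
    audio: Any,
    transcript: str,
    system_prompt: Optional[str] = None,
    user_prompt: Optional[str] = None,
) -> list:
    """Hypothetical helper mirroring the documented conversation shapes."""
    conversation = []
    system_prompt, user_prompt = _clean(system_prompt), _clean(user_prompt)

    if system_prompt is not None:
        # Assumed layout: the system turn uses the same list-of-items content.
        conversation.append(
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
        )

    user_content = []
    if user_prompt is not None:
        # The text item goes before the audio item inside the user turn.
        user_content.append({"type": "text", "text": user_prompt})
    user_content.append({"type": "audio", "audio": audio})
    conversation.append({"role": "user", "content": user_content})

    conversation.append(
        {"role": "assistant", "content": [{"type": "text", "text": transcript}]}
    )
    return conversation


minimal = build_conversation([0.0, 0.1], "hello")
prompted = build_conversation(
    [0.0, 0.1], "hello", user_prompt="Transcribe the English audio into text."
)
```

With both prompts left at None, the result is the two-turn user(audio) → assistant(text) conversation; a whitespace-only prompt produces exactly the same output.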

Built-In Collate: qwen3_omni_asr_collate_fn#

nemo_automodel.components.datasets.audio.collate_fns.qwen3_omni_asr_collate_fn batches the lazy samples into model inputs without depending on qwen_omni_utils:

  • Walks each conversation for {"type": "audio", "audio": <ndarray>} items and feeds the raw waveforms straight to Qwen3OmniMoeProcessor’s WhisperFeatureExtractor (skipping the process_mm_info helper).

  • Validates and coerces every audio payload through _validate_and_coerce_audio_payload (1-D float32; otherwise raises ValueError naming the sample index and offending shape/dtype).

  • Pins padding_side="right" so the recipe’s count_tail_padding token accounting works correctly.

  • Reuses build_labels_from_template (marker-based; Qwen3OmniMoeProcessor is in _IMSTART_TEMPLATE_PROCESSORS) and emits pre-shifted labels.

The collate is selected through the YAML’s dataloader.collate_fn._target_; it is intentionally not registered in the global COLLATE_FNS map so the existing Qwen3OmniMoeProcessor → qwen3_omni_collate_fn mapping keeps serving non-ASR VLM users that do have qwen_omni_utils installed.
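
To make "pre-shifted labels" and right padding concrete, here is a minimal pure-Python sketch. It is not the build_labels_from_template implementation: the function names, the boolean assistant mask, and the ignore value of -100 (the usual PyTorch cross-entropy convention) are all assumptions made for illustration.

```python
IGNORE_INDEX = -100  # assumed ignore value, following the common PyTorch convention


def shifted_labels(input_ids, is_assistant):
    """Label at position i is the token at i + 1, if that token is supervised."""
    assert len(input_ids) == len(is_assistant)
    labels = []
    for i in range(len(input_ids)):
        nxt = i + 1
        if nxt < len(input_ids) and is_assistant[nxt]:
            labels.append(input_ids[nxt])
        else:
            labels.append(IGNORE_INDEX)
    return labels


def pad_right(seq, length, pad_value):
    """Right-pad seq to the requested length."""
    return seq + [pad_value] * (length - len(seq))


# Two toy samples: prompt tokens are unsupervised, assistant tokens are supervised.
ids_a, mask_a = [11, 12, 21, 22], [False, False, True, True]
ids_b, mask_b = [11, 21], [False, True]

max_len = max(len(ids_a), len(ids_b))
labels_batch = [
    pad_right(shifted_labels(ids_a, mask_a), max_len, IGNORE_INDEX),
    pad_right(shifted_labels(ids_b, mask_b), max_len, IGNORE_INDEX),
]
```

Because padding sits on the right, every padded position falls at the tail of the row and carries the ignore value, which is what makes tail-padding token accounting straightforward.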

Use a Different HF Audio Dataset#

To target your own dataset, set dataset.path_or_dataset and override the defaults below only when the dataset diverges:

| Dataset | path_or_dataset | name | text_column |
|---|---|---|---|
| edinburghcstr/ami | edinburghcstr/ami | ihm or sdm | text (default) |
| openslr/librispeech_asr | openslr/librispeech_asr | optional config | text (default) |
| speechcolab/gigaspeech | speechcolab/gigaspeech | optional config | text (default) |
| mozilla-foundation/common_voice_* | mozilla-foundation/common_voice_18_0 | language code (e.g., en) | sentence |

YAML override snippet for CommonVoice (note text_column: sentence):

dataset:
  _target_: nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset
  path_or_dataset: mozilla-foundation/common_voice_18_0
  name: en
  text_column: sentence
  split: train
  sampling_rate: 16000

Audio columns are universally named audio across these datasets, so the default audio_column="audio" rarely needs an override.
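
The "defaults plus per-dataset overrides" idea can be sketched as a simple dictionary merge. This is a hypothetical helper for illustration, not part of nemo_automodel; `resolve_dataset_kwargs` and the `OVERRIDES` mapping are invented names, and only the values come from the table above.

```python
# Documented defaults of the dataset builder.
DEFAULTS = {"name": None, "audio_column": "audio", "text_column": "text"}

# Per-dataset overrides; only the fields that diverge from the defaults are listed.
OVERRIDES = {
    "edinburghcstr/ami": {"name": "ihm"},
    "mozilla-foundation/common_voice_18_0": {"name": "en", "text_column": "sentence"},
}


def resolve_dataset_kwargs(path_or_dataset, split="train"):
    """Hypothetical helper: merge the defaults with any per-dataset overrides."""
    kwargs = {"path_or_dataset": path_or_dataset, "split": split, **DEFAULTS}
    kwargs.update(OVERRIDES.get(path_or_dataset, {}))
    return kwargs


cv_kwargs = resolve_dataset_kwargs("mozilla-foundation/common_voice_18_0")
libri_kwargs = resolve_dataset_kwargs("openslr/librispeech_asr")
```

A dataset without an entry in the override map, such as LibriSpeech here, simply receives the defaults.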

Mixture of Datasets: multi_en#

For a stronger general-purpose English model, train on a mixture of public ASR corpora rather than a single dataset. nemo_automodel.components.datasets.audio.multi_en concatenates several HF sources into one training set, normalizing each to {audio, text, source} with per-source transcript cleanup (e.g. stripping GigaSpeech bracket tags such as <COMMA> / <SIL>). The default English composition is ~500k clips:

| Source | HF repo | config / split | clips |
|---|---|---|---|
| AMI IHM | edinburghcstr/ami | ihm / train | 108,502 |
| Earnings22 | sanchit-gandhi/earnings22_split | train | 52,006 |
| VoxPopuli (en) | facebook/voxpopuli | en / train (capped) | 4,000 |
| GigaSpeech (s) | speechcolab/gigaspeech | s / train | 230,068 |
| SPGISpeech (S) | kensho/spgispeech | S / train | 77,073 |
| LibriSpeech | openslr/librispeech_asr | clean / train.100 | 28,539 |
| Total | | | ~500,188 |

GigaSpeech and SPGISpeech are gated on the Hub: accept their terms (and allow trust_remote_code for GigaSpeech) before launching. The source list is fully overridable from YAML via dataset.sources (pass a trimmed list to drop gated corpora), and dataset.max_audio_duration_seconds caps clip length to bound activation memory. Ready-to-run recipe: examples/audio_finetune/qwen3_omni_asr/multi_en_sft.yaml (and examples/audio_finetune/qwen2_5_omni_asr/multi_en_sft_3b.yaml for the 3B).
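
The per-source normalization and duration cap can be pictured with the following sketch. It is not the multi_en implementation: the helper names, the tag regular expression, and the "gigaspeech" source label are assumptions. The only external convention relied on is that a decoded HF audio cell is a dict with "array" and "sampling_rate" keys.

```python
import re

# Assumption: GigaSpeech tags are upper-case tokens in angle brackets, e.g. <COMMA>, <SIL>.
_TAG_RE = re.compile(r"<[A-Z_]+>")


def clean_gigaspeech(text):
    """Drop bracket tags and collapse the whitespace they leave behind."""
    return " ".join(_TAG_RE.sub(" ", text).split())


def normalize(example, source, text_column="text"):
    """Map one raw row to the common {audio, text, source} schema."""
    text = example[text_column]
    if source == "gigaspeech":
        text = clean_gigaspeech(text)
    return {"audio": example["audio"], "text": text.strip(), "source": source}


def within_duration(example, max_seconds):
    """Keep clips no longer than max_seconds (samples / sampling_rate)."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] <= max_seconds


# A stand-in row: 2 seconds of silence at 16 kHz with GigaSpeech-style tags.
row = {
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "text": "HELLO <COMMA> WORLD <SIL>",
}
clean = normalize(row, source="gigaspeech")
```

Applying `within_duration` as a filter before training is one way to realize the max_audio_duration_seconds cap; the actual option may be enforced differently.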


Train#

Example Config#

examples/audio_finetune/qwen3_omni_asr/ami_sft.yaml is a ready-to-run full fine-tune for the 30B-A3B Omni model on a single 8-GPU node, targeting the public AMI IHM corpus. Defaults:

| Section | Setting |
|---|---|
| recipe | FinetuneRecipeForVLM |
| distributed | fsdp2, ep_size=8, tp=cp=pp=1 |
| freeze_config | freeze_vision_tower=true, freeze_audio_tower=false, freeze_language_model=false |
| step_scheduler | global_batch_size=64, local_batch_size=8, ckpt_every_steps=200, num_epochs=1 |
| optimizer | AdamW(lr=2.0e-5, betas=[0.9, 0.95], weight_decay=0.0) |
| checkpoint | result/checkpoints/..., model_save_format=safetensors, save_consolidated=final |
| dataset | make_hf_audio_asr_dataset(path_or_dataset="edinburghcstr/ami", name="ihm") |

The peft: section is intentionally omitted: both the language model and the audio tower are trainable, while the vision tower stays frozen. With ep_size=8, the MoE experts are sharded across all 8 GPUs.

Measured on 8x H100 80 GB: ~1.4 s/step steady-state, ~36–40 GB peak/GPU. One epoch over the ~69k post-1.0s-filter AMI IHM train clips finishes in ~22 min (compared to ~2 h at local_batch_size=1). Peak memory on this MoE is dominated by FSDP/expert all-gather (~36 GB), not by activations, so the batch size can be pushed this high without OOM.
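
The step-scheduler numbers can be sanity-checked with a few lines of arithmetic. The sketch below assumes the data-parallel size equals the GPU count (8) and uses 69,000 as a round stand-in for the approximate post-filter clip count; the variable names are illustrative.

```python
import math

num_gpus = 8              # assumed to equal the data-parallel size
local_batch_size = 8
global_batch_size = 64
ckpt_every_steps = 200
train_clips = 69_000      # approximate count after the 1.0 s filter

# Micro-batches accumulated per optimizer step so that each step sees a global batch.
grad_accum_steps = global_batch_size // (local_batch_size * num_gpus)

# Optimizer steps in one epoch, and how many periodic checkpoints that triggers.
steps_per_epoch = math.ceil(train_clips / global_batch_size)
periodic_checkpoints = steps_per_epoch // ckpt_every_steps

print(grad_accum_steps, steps_per_epoch, periodic_checkpoints)
```

Under these assumptions the default config needs no gradient accumulation, and one epoch is a little over a thousand optimizer steps.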

Launch#

Use the standard NeMo AutoModel CLI:

torchrun --nproc_per_node=8 --nnodes=1 -m nemo_automodel.cli.app \
    examples/audio_finetune/qwen3_omni_asr/ami_sft.yaml

Any per-field CLI override (e.g., --dataset.split 'train[:5000]') takes precedence over the corresponding YAML field. WandB logging is optional: it streams online as long as WANDB_API_KEY is set in the environment; set WANDB_MODE=offline for a dry run.
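
Conceptually, a dotted override such as dataset.split addresses one nested field of the loaded config. The sketch below shows that idea on a plain dictionary; `apply_override` is a hypothetical helper and is not how the NeMo AutoModel CLI is actually implemented.

```python
from copy import deepcopy


def apply_override(config, dotted_key, value):
    """Return a copy of config with the nested field at dotted_key set to value."""
    updated = deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


base_config = {
    "dataset": {"path_or_dataset": "edinburghcstr/ami", "name": "ihm", "split": "train"}
}
overridden = apply_override(base_config, "dataset.split", "train[:5000]")
```

Only the addressed field changes; sibling fields such as dataset.name keep their YAML values, and the original mapping is left untouched.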

What Gets Saved#

Every ckpt_every_steps steps the recipe writes a consolidated checkpoint:

epoch_E_step_S/
├── config.yaml                # snapshot of the recipe config
├── losses.json
├── dataloader/                # StatefulDataLoader state for restart
├── optim/                     # AdamW state (~30 GB / shard for 30B FT)
├── rng/                       # PyTorch + numpy + python RNG state
├── step_scheduler.pt
└── model/
    ├── shard-XXXXX-model-00001-of-00001.safetensors  # DCP sharded
    ├── consolidated/                                  # HF-format export
    │   ├── config.json                               # thinker subtree only
    │   ├── model.safetensors.index.json
    │   ├── model-00001-of-00013.safetensors
    │   └── ...
    └── chat_template.jinja, tokenizer*.json, processor_config.json

The consolidated/ directory is the artifact to use for inference. It already holds the trained weights and the right tokenizer + processor, but its config.json describes the thinker sub-model only (model_type=qwen3_omni_moe_thinker), which neither transformers.AutoConfig nor vLLM recognizes as a top-level architecture. See the Convert section for the conversion step.

Resume#

--checkpoint.restore_from <ckpt_dir> reloads the model state, optimizer, RNG, and dataloader position. Full-FT checkpoints are loaded directly into the sharded model parts. The recipe does not require the conversion step below for restart; it is needed only for external inference tooling.


Convert: Thinker → HF-Compatible Omni#

NeMo maps Qwen3OmniMoeForConditionalGeneration to a custom thinker-only class (the parent Omni model in HF has thinker / code2wav / talker sub-modules; this recipe only needs the thinker for ASR). The saved consolidated/config.json therefore carries model_type=qwen3_omni_moe_thinker, which is not registered as a top-level architecture in transformers.CONFIG_MAPPING. Loading it directly will fail with:

ValueError: The checkpoint you are trying to load has model type
`qwen3_omni_moe_thinker` but Transformers does not recognize this architecture.
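
A quick way to tell whether a checkpoint directory still needs the conversion is to read model_type from its config.json. The sketch below uses only the standard library; `needs_wrapping` is a hypothetical helper, and the demo writes a stand-in config.json into a temporary directory instead of touching a real checkpoint.

```python
import json
import tempfile
from pathlib import Path

THINKER_MODEL_TYPE = "qwen3_omni_moe_thinker"


def needs_wrapping(consolidated_dir):
    """True if config.json describes the thinker sub-model only."""
    config = json.loads((Path(consolidated_dir) / "config.json").read_text())
    return config.get("model_type") == THINKER_MODEL_TYPE


# Self-contained demo with stand-in config files.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps({"model_type": THINKER_MODEL_TYPE}))
    thinker_only = needs_wrapping(tmp)
    config_path.write_text(json.dumps({"model_type": "qwen3_omni_moe"}))
    full_omni = needs_wrapping(tmp)
```

Running such a check before handing a directory to external inference tooling surfaces the mismatch earlier than the ValueError shown above.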

Tool: tools/wrap_thinker_ckpt_as_omni.py#

tools/wrap_thinker_ckpt_as_omni.py rewraps the thinker checkpoint as a full Qwen3-Omni export by:

  1. Renaming + copying the trained thinker.* shards into the output dir.

  2. Copying the untrained code2wav.* and talker.* shards verbatim from the cached HF base model (these were never modified during ASR training).

  3. Writing a merged model.safetensors.index.json across all three buckets.

  4. Replacing the thinker-only config.json with the base model’s (model_type=qwen3_omni_moe, architectures=["Qwen3OmniMoeForConditionalGeneration"]).

  5. Copying the rest of the HF metadata (tokenizer, processor, generation config, chat template) from the base; the recipe-saved chat_template.jinja wins if present.

Memory footprint stays at roughly one shard (~5 GB) at a time, with no full-model materialization.

python tools/wrap_thinker_ckpt_as_omni.py \
    --ckpt-dir   result/checkpoints/<run>/epoch_0_step_199/model/consolidated \
    --base-dir   ~/.cache/huggingface/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/<rev> \
    --out-dir    /tmp/qwen3_omni_asr_step_199_wrapped

The output directory is a drop-in replacement for the public Qwen3-Omni snapshot; only the thinker.* weights differ.
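
The index-merging step of the tool (one model.safetensors.index.json spanning the thinker, code2wav, and talker buckets) can be pictured as a merge of two weight_map dictionaries, each mapping a tensor name to a shard file name. This is a sketch under assumptions, not the tool's code: `merge_weight_maps` is a hypothetical function, the rule that trained keys gain a thinker. prefix is a guess at the renaming, and all tensor and shard names below are made up.

```python
def merge_weight_maps(trained_thinker, base, prefix="thinker."):
    """Combine trained thinker entries with untouched base entries."""
    merged = {}
    # Keep every base tensor that lives outside the thinker (code2wav.*, talker.*).
    for key, shard in base.items():
        if not key.startswith(prefix):
            merged[key] = shard
    # Assumption: trained keys may lack the prefix, so add it unless already present.
    for key, shard in trained_thinker.items():
        full_key = key if key.startswith(prefix) else prefix + key
        merged[full_key] = shard
    return merged


# Made-up tensor and shard names, for illustration only.
base_map = {
    "thinker.model.embed.weight": "base-00001.safetensors",
    "talker.proj.weight": "base-00014.safetensors",
    "code2wav.decoder.weight": "base-00015.safetensors",
}
trained_map = {"model.embed.weight": "trained-00001.safetensors"}

merged_map = merge_weight_maps(trained_map, base_map)
```

In the merged map, thinker tensors point at the trained shards while talker and code2wav tensors still point at the base shards, which matches the statement that only the thinker.* weights differ.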


Results: AMI IHM#

End-of-epoch evaluation on the AMI IHM test split, comparing the zero-shot base Qwen3-Omni against the same model after one epoch of full fine-tuning with the recipe above (audio tower trainable). WER drops by roughly half:

AMI IHM WER: base vs. fine-tuned Qwen3-Omni

| Stage | Model | WER (AMI IHM test) |
|---|---|---|
| Before training | Base Qwen/Qwen3-Omni-30B-A3B-Instruct (zero-shot) | 15.81% |
| After training | 1 epoch full FT (audio tower trainable) | 8.31% |
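
The "roughly half" claim follows directly from the two WER figures reported for AMI IHM; the short calculation below only restates those numbers.

```python
base_wer = 15.81       # zero-shot, AMI IHM test (%)
finetuned_wer = 8.31   # after one epoch of full fine-tuning (%)

absolute_drop = base_wer - finetuned_wer   # in WER percentage points
relative_drop = absolute_drop / base_wer   # fraction of the base WER removed

print(f"absolute: {absolute_drop:.2f} points, relative: {relative_drop:.1%}")
```

The absolute drop is 7.50 points, a relative reduction of about 47%.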

Results: multi_en mixture#

Training the same model for 3 epochs on the ~500k-clip multi_en mixture (see Mixture of Datasets) lowers macro-average WER over the 7 open-ASR-leaderboard English test subsets, not just on AMI. WER below is Whisper-normalized (EnglishTextNormalizer) with greedy decoding, comparing the zero-shot base against the multi_en fine-tune:

| Subset | N | Base (zero-shot) | multi_en FT |
|---|---|---|---|
| LibriSpeech test.clean | 2,620 | 1.49 | 1.89 |
| LibriSpeech test.other | 2,939 | 2.62 | 3.54 |
| SPGISpeech | 39,341 | 3.12 | 2.11 |
| VoxPopuli | 1,842 | 7.07 | 6.67 |
| GigaSpeech | 19,931 | 8.54 | 9.46 |
| Earnings22 | 2,741 | 9.79 | 8.89 |
| AMI (IHM) | 12,643 | 11.07 | 8.22 |
| Macro avg | - | 6.24 | 5.83 |

Fine-tuning concentrates its gains on the harder conversational / domain sets (AMI −2.85, Earnings22 −0.90, SPGISpeech −1.01, VoxPopuli −0.40), while the strong base keeps a small edge on clean read speech (LibriSpeech, GigaSpeech), a hint that the mix can be rebalanced toward those styles. Net macro WER improves from 6.24% to 5.83%.
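
The macro averages and per-subset deltas quoted above can be reproduced from the table values. The macro average here is the unweighted mean over the 7 subsets, so the subset sizes (N) play no role.

```python
# (base zero-shot WER, multi_en fine-tuned WER), in percent, per test subset.
wer = {
    "LibriSpeech test.clean": (1.49, 1.89),
    "LibriSpeech test.other": (2.62, 3.54),
    "SPGISpeech": (3.12, 2.11),
    "VoxPopuli": (7.07, 6.67),
    "GigaSpeech": (8.54, 9.46),
    "Earnings22": (9.79, 8.89),
    "AMI (IHM)": (11.07, 8.22),
}

macro_base = sum(base for base, _ in wer.values()) / len(wer)
macro_ft = sum(ft for _, ft in wer.values()) / len(wer)

# Negative delta means the fine-tuned model is better on that subset.
deltas = {name: round(ft - base, 2) for name, (base, ft) in wer.items()}
improved = sorted(name for name, delta in deltas.items() if delta < 0)

print(f"macro base {macro_base:.2f}, macro FT {macro_ft:.2f}")
```

Four of the seven subsets improve, and the unweighted mean moves from 6.24 to 5.83.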

The same recipe on Qwen2.5-Omni-3B (multi_en_sft_3b.yaml) shows a much larger fine-tuning gain, since the small model’s zero-shot baseline is weaker: macro WER 8.97% → 6.55% (−2.42).