Fine-Tune Qwen3-Omni for ASR


End-to-end ASR fine-tuning of Qwen/Qwen3-Omni-30B-A3B-Instruct on a Hugging Face audio dataset, using the NeMo AutoModel VLM training stack. The running example is the public edinburghcstr/ami meeting corpus (English IHM), but the same recipe works for any HF dataset that exposes {audio, text} columns (AMI, LibriSpeech, GigaSpeech, WenetSpeech, CommonVoice, …).

The workflow has two stages:

  1. Train the thinker sub-model with the FinetuneRecipeForVLM recipe.
  2. Convert the NeMo-saved thinker checkpoint into a Hugging Face-compatible Qwen3-Omni export so transformers.AutoModel* and vLLM can load it.

Data Preparation

Built-In Builder: make_hf_audio_asr_dataset

nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset returns a Hugging Face Dataset whose __getitem__ lazily produces a single {"conversation": [...]} dict suitable for qwen3_omni_asr_collate_fn. Key design points:

  • with_transform for lazy decoding. Building the dataset object is a constant-time metadata read; audio decode and chat-template assembly only run inside dataloader workers when a batch is fetched. Startup time is independent of split size.
  • Configurable prompt shape. Defaults are system_prompt=None and user_prompt=None, yielding the minimal user(audio) → assistant(text) conversation. Setting either or both expands the conversation: system_prompt="..." adds a system turn, and user_prompt="..." prepends a text item before the audio inside the user turn. Whitespace-only prompts are treated as absent.
  • Dataset-agnostic. Accepts any HF audio dataset that exposes an audio column and a transcript column. Defaults (audio_column="audio", text_column="text", name=None) cover AMI, LibriSpeech, GigaSpeech, and WenetSpeech out of the box; per-dataset overrides go in the recipe YAML.

from nemo_automodel.components.datasets.audio.datasets import (
    make_hf_audio_asr_dataset,
)

dataset = make_hf_audio_asr_dataset(
    path_or_dataset="edinburghcstr/ami",
    name="ihm",
    split="train",
    sampling_rate=16000,
    user_prompt="Transcribe the English audio into text.",
)
# dataset[0]["conversation"] yields:
# [
#     {"role": "user", "content": [{"type": "text", "text": "Transcribe…"},
#                                  {"type": "audio", "audio": np.ndarray}]},
#     {"role": "assistant", "content": [{"type": "text", "text": "..."}]},
# ]
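
The prompt-shape rules described above can be sketched in a few lines. This is a minimal illustration, not the builder's actual code: build_conversation is a hypothetical helper, and the layout of the system turn (a list with a single text item) is an assumption.

```python
from typing import Any, Optional


def build_conversation(
    audio: Any,
    text: str,
    system_prompt: Optional[str] = None,
    user_prompt: Optional[str] = None,
) -> list:
    """Hypothetical helper mirroring the documented prompt-shape rules."""

    def _clean(prompt: Optional[str]) -> Optional[str]:
        # Whitespace-only prompts are treated as absent.
        if prompt is None or not prompt.strip():
            return None
        return prompt

    system_prompt = _clean(system_prompt)
    user_prompt = _clean(user_prompt)

    conversation = []
    if system_prompt is not None:
        # Assumed layout for the system turn: one text item.
        conversation.append(
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
        )

    user_content = []
    if user_prompt is not None:
        # The text item is placed before the audio item in the user turn.
        user_content.append({"type": "text", "text": user_prompt})
    user_content.append({"type": "audio", "audio": audio})
    conversation.append({"role": "user", "content": user_content})

    conversation.append(
        {"role": "assistant", "content": [{"type": "text", "text": text}]}
    )
    return conversation
```

With both prompts left at None, the result is the minimal two-turn user(audio), assistant(text) conversation.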

Built-In Collate: qwen3_omni_asr_collate_fn

nemo_automodel.components.datasets.audio.collate_fns.qwen3_omni_asr_collate_fn batches the lazy samples into model inputs without depending on qwen_omni_utils:

  • Walks each conversation for {"type": "audio", "audio": <ndarray>} items and feeds the raw waveforms straight to Qwen3OmniMoeProcessor’s WhisperFeatureExtractor (skipping the process_mm_info helper).
  • Validates and coerces every audio payload through _validate_and_coerce_audio_payload (1-D float32; otherwise raises ValueError naming the sample index and offending shape/dtype).
  • Pins padding_side="right" so the recipe’s count_tail_padding token accounting works correctly.
  • Reuses build_labels_from_template (marker-based; Qwen3OmniMoeProcessor is in _IMSTART_TEMPLATE_PROCESSORS) and emits pre-shifted labels.
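
To make "right padding" and "pre-shifted labels" concrete, here is a small pure-Python sketch. It assumes one common convention, in which labels[i] holds the token to be predicted at position i and every non-target position carries the ignore index -100. The helper names are hypothetical, and the real collate function may differ in detail.

```python
IGNORE_INDEX = -100  # value conventionally skipped by PyTorch cross-entropy


def pre_shifted_labels(input_ids, is_target):
    """Return labels where labels[i] is the token expected after position i.

    is_target[i] is True for tokens that belong to the assistant answer.
    """
    labels = []
    for i in range(len(input_ids)):
        nxt = i + 1
        if nxt < len(input_ids) and is_target[nxt]:
            labels.append(input_ids[nxt])
        else:
            # Prompt tokens and the final position contribute no loss.
            labels.append(IGNORE_INDEX)
    return labels


def right_pad(rows, pad_value):
    """Pad every row on the right so padding sits at the tail of each sequence."""
    width = max(len(row) for row in rows)
    return [row + [pad_value] * (width - len(row)) for row in rows]


ids = [10, 11, 12, 13]
mask = [False, False, True, True]
print(pre_shifted_labels(ids, mask))  # [-100, 12, 13, -100]
print(right_pad([[1, 2, 3], [4]], 0))  # [[1, 2, 3], [4, 0, 0]]
```

Because padding only ever appears at the tail, counting trailing pad tokens per row is enough to recover the number of real tokens.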

The collate is selected through the YAML’s dataloader.collate_fn._target_; it is intentionally not registered in the global COLLATE_FNS map so the existing Qwen3OmniMoeProcessor → qwen3_omni_collate_fn mapping keeps serving non-ASR VLM users that do have qwen_omni_utils installed.

Use a Different HF Audio Dataset

To target your own dataset, set dataset.path_or_dataset and override the defaults below only when the dataset diverges:

| Dataset | path_or_dataset | name | text_column |
|---|---|---|---|
| edinburghcstr/ami | edinburghcstr/ami | ihm or sdm | text (default) |
| openslr/librispeech_asr | openslr/librispeech_asr | optional config | text (default) |
| speechcolab/gigaspeech | speechcolab/gigaspeech | optional config | text (default) |
| mozilla-foundation/common_voice_* | mozilla-foundation/common_voice_18_0 | language code (e.g., en) | sentence |

YAML override snippet for CommonVoice (note text_column: sentence):

dataset:
  _target_: nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset
  path_or_dataset: mozilla-foundation/common_voice_18_0
  name: en
  text_column: sentence
  split: train
  sampling_rate: 16000

Audio columns are universally named audio across these datasets, so the default audio_column="audio" rarely needs an override.

Mixture of Datasets: multi_en

For a stronger general-purpose English model, train on a mixture of public ASR corpora rather than a single dataset. nemo_automodel.components.datasets.audio.multi_en concatenates several HF sources into one training set, normalizing each to {audio, text, source} with per-source transcript cleanup (e.g. stripping GigaSpeech bracket tags such as <COMMA> / <SIL>). The default English composition is ~500k clips:

| Source | HF repo | config / split | clips |
|---|---|---|---|
| AMI IHM | edinburghcstr/ami | ihm / train | 108,502 |
| Earnings22 | sanchit-gandhi/earnings22_split | train | 52,006 |
| VoxPopuli (en) | facebook/voxpopuli | en / train (capped) | 4,000 |
| GigaSpeech (s) | speechcolab/gigaspeech | s / train | 230,068 |
| SPGISpeech (S) | kensho/spgispeech | S / train | 77,073 |
| LibriSpeech | openslr/librispeech_asr | clean / train.100 | 28,539 |
| Total | | | ~500,188 |

GigaSpeech and SPGISpeech are gated on the Hub: accept their terms (and allow trust_remote_code for GigaSpeech) before launching. The source list is fully overridable from YAML via dataset.sources (pass a trimmed list to drop gated corpora), and dataset.max_audio_duration_seconds caps clip length to bound activation memory. Ready-to-run recipe: examples/audio_finetune/qwen3_omni_asr/multi_en_sft.yaml (and examples/audio_finetune/qwen2_5_omni_asr/multi_en_sft_3b.yaml for the 3B).
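
Two of the per-sample operations mentioned in this section, stripping bracket tags and capping clip length, are simple enough to sketch. The functions below are hypothetical illustrations, not multi_en's own code. They assume that tags are upper-case tokens in angle brackets and that a clip's duration is its sample count divided by the sampling rate.

```python
import re

# Matches upper-case bracket tags such as <COMMA> or <SIL>.
_TAG_RE = re.compile(r"<[A-Z_]+>")


def strip_bracket_tags(text: str) -> str:
    """Remove bracket tags and collapse the whitespace they leave behind."""
    without_tags = _TAG_RE.sub(" ", text)
    return " ".join(without_tags.split())


def within_duration_cap(num_samples: int, sampling_rate: int, max_seconds: float) -> bool:
    """Return True when the clip is no longer than max_seconds."""
    return num_samples / sampling_rate <= max_seconds


print(strip_bracket_tags("HELLO <COMMA> WORLD <SIL>"))  # HELLO WORLD
print(within_duration_cap(16000 * 31, 16000, 30.0))     # False
```

Dropping long clips before batching bounds the longest padded sequence in a batch, which is what keeps activation memory predictable.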


Train

Example Config

examples/audio_finetune/qwen3_omni_asr/ami_sft.yaml is a ready-to-run full fine-tune for the 30B-A3B Omni model on a single 8-GPU node, targeting the public AMI IHM corpus. Defaults:

| Section | Setting |
|---|---|
| recipe | FinetuneRecipeForVLM |
| distributed | fsdp2, ep_size=8, tp=cp=pp=1 |
| freeze_config | freeze_vision_tower=true, freeze_audio_tower=false, freeze_language_model=false |
| step_scheduler | global_batch_size=64, local_batch_size=8, ckpt_every_steps=200, num_epochs=1 |
| optimizer | AdamW(lr=2.0e-5, betas=[0.9, 0.95], weight_decay=0.0) |
| checkpoint | result/checkpoints/..., model_save_format=safetensors, save_consolidated=final |
| dataset | make_hf_audio_asr_dataset(path_or_dataset="edinburghcstr/ami", name="ihm") |

The peft: section is intentionally omitted: both the language model and the audio tower are trainable, while the vision tower stays frozen. With ep_size=8, the MoE experts are sharded across all 8 GPUs.

Measured on 8x H100 80 GB: ~1.4 step/s steady-state, ~36–40 GB peak/GPU. One epoch over the ~69k post-1.0s-filter AMI IHM train clips finishes in ~22 min (compared to ~2 h at local_batch_size=1). Peak memory on this MoE is dominated by FSDP/expert all-gather (~36 GB), not by activations, so the batch size can be pushed this high without OOM.
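
The batch settings imply a specific amount of gradient accumulation and a specific number of optimizer steps per epoch. The arithmetic below is a back-of-the-envelope sketch that assumes all 8 ranks act as data-parallel ranks (tp=cp=pp=1) and takes "~69k" as exactly 69,000 clips.

```python
import math

world_size = 8          # GPUs on the node
local_batch_size = 8    # samples per GPU per micro-step
global_batch_size = 64  # samples per optimizer step
train_clips = 69_000    # approximation of the "~69k" filtered AMI IHM clips

# Assumption: with tp=cp=pp=1, every rank is a data-parallel rank.
data_parallel_size = world_size
samples_per_micro_step = local_batch_size * data_parallel_size

grad_accum_steps = global_batch_size // samples_per_micro_step
steps_per_epoch = math.ceil(train_clips / global_batch_size)

print(grad_accum_steps)  # 1
print(steps_per_epoch)   # 1079
```

Under these assumptions each optimizer step is a single micro-step, so raising local_batch_size further would also require raising global_batch_size.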

Launch

Use the standard NeMo AutoModel CLI:

$ torchrun --nproc_per_node=8 --nnodes=1 -m nemo_automodel.cli.app \
    examples/audio_finetune/qwen3_omni_asr/ami_sft.yaml

Any per-field CLI override (e.g., --dataset.split 'train[:5000]') takes precedence over the corresponding YAML field. Optional WandB logging streams online as long as WANDB_API_KEY is set in the environment; set WANDB_MODE=offline for a dry run.

What Gets Saved

Every ckpt_every_steps steps the recipe writes a consolidated checkpoint:

epoch_E_step_S/
├── config.yaml          # snapshot of the recipe config
├── losses.json
├── dataloader/          # StatefulDataLoader state for restart
├── optim/               # AdamW state (~30 GB / shard for 30B FT)
├── rng/                 # PyTorch + numpy + python RNG state
├── step_scheduler.pt
└── model/
    ├── shard-XXXXX-model-00001-of-00001.safetensors   # DCP sharded
    ├── consolidated/    # HF-format export
    │   ├── config.json  # thinker subtree only
    │   ├── model.safetensors.index.json
    │   ├── model-00001-of-00013.safetensors
    │   └── ...
    └── chat_template.jinja, tokenizer*.json, processor_config.json

The consolidated/ directory is the artifact to use for inference. It already holds the trained weights and the right tokenizer + processor, but its config.json describes the thinker sub-model only (model_type=qwen3_omni_moe_thinker), which neither transformers.AutoConfig nor vLLM recognizes as a top-level architecture. See the Convert section for the conversion step.
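
A quick way to tell whether an export still needs the conversion step is to read model_type from its config.json. The helper below is a hypothetical convenience, not part of NeMo AutoModel. It relies only on the model_type value quoted above.

```python
import json
from pathlib import Path

THINKER_MODEL_TYPE = "qwen3_omni_moe_thinker"


def needs_omni_wrapping(consolidated_dir) -> bool:
    """Return True if config.json still describes the thinker sub-model only."""
    config_path = Path(consolidated_dir) / "config.json"
    config = json.loads(config_path.read_text())
    return config.get("model_type") == THINKER_MODEL_TYPE
```

Calling needs_omni_wrapping on a freshly saved consolidated/ directory should return True, and False once the directory has been rewrapped with the base model's config.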

Resume

--checkpoint.restore_from <ckpt_dir> reloads the model state, optimizer, RNG, and dataloader position. Full-FT checkpoints are loaded directly into the sharded model parts. The recipe does not require the conversion step below for restart; it is needed only for external inference tooling.


Convert: Thinker → HF-Compatible Omni

NeMo maps Qwen3OmniMoeForConditionalGeneration to a custom thinker-only class (the parent Omni model in HF has thinker / code2wav / talker sub-modules; this recipe only needs the thinker for ASR). The saved consolidated/config.json therefore carries model_type=qwen3_omni_moe_thinker, which is not registered as a top-level architecture in transformers.CONFIG_MAPPING. Loading it directly will fail with:

ValueError: The checkpoint you are trying to load has model type
`qwen3_omni_moe_thinker` but Transformers does not recognize this architecture.

Tool: tools/wrap_thinker_ckpt_as_omni.py

tools/wrap_thinker_ckpt_as_omni.py rewraps the thinker checkpoint as a full Qwen3-Omni export by:

  1. Renaming + copying the trained thinker.* shards into the output dir.
  2. Copying the untrained code2wav.* and talker.* shards verbatim from the cached HF base model (these were never modified during ASR training).
  3. Writing a merged model.safetensors.index.json across all three buckets.
  4. Replacing the thinker-only config.json with the base model’s (model_type=qwen3_omni_moe, architectures=["Qwen3OmniMoeForConditionalGeneration"]).
  5. Copying the rest of the HF metadata (tokenizer, processor, generation config, chat template) from the base; the recipe-saved chat_template.jinja wins if present.

Memory footprint stays at roughly one shard (~5 GB) at a time, with no full-model materialisation.

$ python tools/wrap_thinker_ckpt_as_omni.py \
    --ckpt-dir result/checkpoints/<run>/epoch_0_step_199/model/consolidated \
    --base-dir ~/.cache/huggingface/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/<rev> \
    --out-dir /tmp/qwen3_omni_asr_step_199_wrapped

The output directory is a drop-in replacement for the public Qwen3-Omni snapshot; only the thinker.* weights differ.
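
Step 3 of the tool, the merged index, amounts to combining two weight_map dictionaries from model.safetensors.index.json files. The sketch below is not the tool's source. It assumes that the thinker-only export may store parameter names without the thinker. prefix and that shard file names from the two sources do not clash. merge_weight_maps is a hypothetical name.

```python
def merge_weight_maps(thinker_map, base_map, thinker_prefix="thinker."):
    """Combine trained thinker entries with untouched base entries.

    Both arguments map parameter names to shard file names, as in the
    "weight_map" field of a safetensors index file.
    """
    merged = {}
    # Trained thinker tensors: add the prefix when the export lacks it.
    for key, shard in thinker_map.items():
        name = key if key.startswith(thinker_prefix) else thinker_prefix + key
        merged[name] = shard
    # Sub-modules that ASR training never modified come from the base index.
    for key, shard in base_map.items():
        if key.startswith(("code2wav.", "talker.")):
            merged[key] = shard
    return merged
```

Base entries under thinker.* are deliberately skipped, so every thinker tensor in the merged index points at a trained shard.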


Results: AMI IHM

End-of-epoch evaluation on the AMI IHM test split, comparing the zero-shot base Qwen3-Omni against the same model after one epoch of full fine-tuning with the recipe above (audio tower trainable). WER drops by roughly half:

AMI IHM WER: base vs. fine-tuned Qwen3-Omni

| Stage | Model | WER (AMI IHM test) |
|---|---|---|
| Before training | Base Qwen/Qwen3-Omni-30B-A3B-Instruct (zero-shot) | 15.81% |
| After training | 1 epoch full FT (audio tower trainable) | 8.31% |
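
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The function below is a minimal single-utterance sketch without text normalization. It is not the evaluation script behind the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c d"))  # 0.25
```

A corpus-level WER is normally computed by summing errors and reference words over all utterances before dividing, rather than by averaging per-utterance rates.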

Results: multi_en mixture

Training the same model for 3 epochs on the ~500k-clip multi_en mixture (see Mixture of Datasets) generalizes across all 7 open-ASR-leaderboard English test subsets, not just AMI. WER below is Whisper-normalized (EnglishTextNormalizer), greedy decode, comparing the zero-shot base against the multi_en fine-tune:

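| Subset | N | Base (zero-shot) WER % | multi_en FT WER % |
|---|---|---|---|
| LibriSpeech test.clean | 2,620 | 1.49 | 1.89 |
| LibriSpeech test.other | 2,939 | 2.62 | 3.54 |
| SPGISpeech | 39,341 | 3.12 | 2.11 |
| VoxPopuli | 1,842 | 7.07 | 6.67 |
| GigaSpeech | 19,931 | 8.54 | 9.46 |
| Earnings22 | 2,741 | 9.79 | 8.89 |
| AMI (IHM) | 12,643 | 11.07 | 8.22 |
| Macro avg | - | 6.24 | 5.83 |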

Fine-tuning concentrates its gains on the harder conversational / domain sets (AMI −2.85, Earnings22 −0.90, SPGISpeech −1.01, VoxPopuli −0.40), while the strong base keeps a small edge on clean read speech (LibriSpeech, GigaSpeech), a hint that the mix can be rebalanced toward those styles. Net macro WER improves from 6.24% to 5.83%.
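
The macro averages and per-subset deltas quoted above follow directly from the table: the macro average is the unweighted mean over the seven subsets, regardless of N. The snippet below simply recomputes them from the reported numbers.

```python
# WER (%) per subset, copied from the table above.
base = {
    "LibriSpeech test.clean": 1.49, "LibriSpeech test.other": 2.62,
    "SPGISpeech": 3.12, "VoxPopuli": 7.07, "GigaSpeech": 8.54,
    "Earnings22": 9.79, "AMI (IHM)": 11.07,
}
finetuned = {
    "LibriSpeech test.clean": 1.89, "LibriSpeech test.other": 3.54,
    "SPGISpeech": 2.11, "VoxPopuli": 6.67, "GigaSpeech": 9.46,
    "Earnings22": 8.89, "AMI (IHM)": 8.22,
}


def macro_avg(scores):
    """Unweighted mean across subsets (each subset counts equally)."""
    return sum(scores.values()) / len(scores)


deltas = {name: round(finetuned[name] - base[name], 2) for name in base}

print(round(macro_avg(base), 2))       # 6.24
print(round(macro_avg(finetuned), 2))  # 5.83
print(deltas["AMI (IHM)"])             # -2.85
```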

The same recipe on Qwen2.5-Omni-3B (multi_en_sft_3b.yaml) shows a much larger fine-tuning gain, since the small model’s zero-shot baseline is weaker: macro WER 8.97% → 6.55% (−2.42).