Fine-Tune Qwen3-Omni for ASR
Fine-Tune Qwen3-Omni for ASR
End-to-end ASR fine-tuning of Qwen/Qwen3-Omni-30B-A3B-Instruct on a
Hugging Face audio dataset, using the NeMo AutoModel VLM training stack. The
running example is the public
edinburghcstr/ami
meeting corpus (English IHM), but the same recipe works for any HF dataset
that exposes {audio, text} columns (AMI, LibriSpeech, GigaSpeech,
WenetSpeech, CommonVoice, β¦).
The workflow has two stages:
- Train the thinker sub-model with the
FinetuneRecipeForVLMrecipe. - Convert the NeMo-saved thinker checkpoint into a Hugging Face-compatible
Qwen3-Omni export so
transformers.AutoModel*and vLLM can load it.
Data Preparation
Built-In Builder: make_hf_audio_asr_dataset
nemo_automodel.components.datasets.audio.datasets.make_hf_audio_asr_dataset
returns a Hugging Face Dataset whose __getitem__ lazily produces a single
{"conversation": [...]} dict suitable for qwen3_omni_asr_collate_fn. Key
design points:
with_transformfor lazy decoding. Building the dataset object is a constant-time metadata read; audio decode and chat-template assembly only run inside dataloader workers when a batch is fetched. Startup time is independent of split size.- Configurable prompt shape. Defaults are
system_prompt=Noneanduser_prompt=None, yielding the minimaluser(audio) β assistant(text)conversation. Setting either or both expands the conversation:system_prompt="..."adds asystemturn,user_prompt="..."prepends a text item before the audio inside the user turn. Whitespace-only prompts are treated as absent. - Dataset-agnostic. Accepts any HF audio dataset that exposes an audio
column and a transcript column. Defaults (
audio_column="audio",text_column="text",name=None) cover AMI, LibriSpeech, GigaSpeech, and WenetSpeech out of the box; per-dataset overrides go in the recipe YAML.
Built-In Collate: qwen3_omni_asr_collate_fn
nemo_automodel.components.datasets.audio.collate_fns.qwen3_omni_asr_collate_fn
batches the lazy samples into model inputs without depending on
qwen_omni_utils:
- Walks each conversation for
{"type": "audio", "audio": <ndarray>}items and feeds the raw waveforms straight toQwen3OmniMoeProcessorβsWhisperFeatureExtractor(skipping theprocess_mm_infohelper). - Validates and coerces every audio payload through
_validate_and_coerce_audio_payload(1-Dfloat32; otherwise raisesValueErrornaming the sample index and offending shape/dtype). - Pins
padding_side="right"so the recipeβscount_tail_paddingtoken accounting works correctly. - Reuses
build_labels_from_template(marker-based;Qwen3OmniMoeProcessoris in_IMSTART_TEMPLATE_PROCESSORS) and emits pre-shifted labels.
The collate is selected through the YAMLβs dataloader.collate_fn._target_; it
is intentionally not registered in the global COLLATE_FNS map so the
existing Qwen3OmniMoeProcessor β qwen3_omni_collate_fn mapping keeps
serving non-ASR VLM users that do have qwen_omni_utils installed.
Use a Different HF Audio Dataset
To target your own dataset, set dataset.path_or_dataset and override the
defaults below only when the dataset diverges:
YAML override snippet for CommonVoice (note text_column: sentence):
Audio columns are universally named audio across these datasets, so the
default audio_column="audio" rarely needs an override.
Mixture of Datasets: multi_en
For a stronger general-purpose English model, train on a mixture of public
ASR corpora rather than a single dataset.
nemo_automodel.components.datasets.audio.multi_en concatenates several HF
sources into one training set, normalizing each to {audio, text, source} with
per-source transcript cleanup (e.g. stripping GigaSpeech bracket tags such as
<COMMA> / <SIL>). The default English composition is ~500k clips:
GigaSpeech and SPGISpeech are gated on the Hub β accept their terms (and allow
trust_remote_code for GigaSpeech) before launching. The source list is fully
overridable from YAML via dataset.sources (pass a trimmed list to drop gated
corpora), and dataset.max_audio_duration_seconds caps clip length to bound
activation memory. Ready-to-run recipe:
examples/audio_finetune/qwen3_omni_asr/multi_en_sft.yaml (and
examples/audio_finetune/qwen2_5_omni_asr/multi_en_sft_3b.yaml for the 3B).
Train
Example Config
examples/audio_finetune/qwen3_omni_asr/ami_sft.yaml is a ready-to-run full
fine-tune for the 30B-A3B Omni model on a single 8-GPU node, targeting the
public AMI IHM corpus. Defaults:
peft: is intentionally omitted β both the language model and the audio
tower are trainable; the vision tower stays frozen. With ep_size=8, the MoE
experts are sharded across all 8 GPUs.
Measured on 8x H100 80 GB: ~1.4 step/s steady-state, ~36β40 GB peak/GPU.
One epoch over the ~69k post-1.0s-filter AMI IHM train clips finishes in
~22 min (compared to ~2 h at local_batch_size=1). Peak memory on this MoE is
dominated by FSDP/expert all-gather (~36 GB), not by activations, so the batch
size can be pushed this high without OOM.
Launch
Use the standard NeMo AutoModel CLI:
Any per-field CLI override (e.g., --dataset.split 'train[:5000]') is
forwarded to the YAML. Optional WandB logging streams online as long as
WANDB_API_KEY is set in the environment; set WANDB_MODE=offline for a
dry run.
What Gets Saved
Every ckpt_every_steps steps the recipe writes a consolidated checkpoint:
The consolidated/ directory is the artifact to use for inference. It already
holds the trained weights and the right tokenizer + processor β but its
config.json describes the thinker sub-model only
(model_type=qwen3_omni_moe_thinker), which neither transformers.AutoConfig
nor vLLM recognizes as a top-level architecture. See the Convert section for the
conversion step.
Resume
--checkpoint.restore_from <ckpt_dir> reloads the model state, optimizer,
RNG, and dataloader position. Full-FT checkpoints are loaded directly into
the sharded model parts. The recipe does not require the conversion step
below for restart β only for external inference tooling.
Convert: Thinker β HF-Compatible Omni
NeMo maps Qwen3OmniMoeForConditionalGeneration to a custom thinker-only
class (the parent Omni model in HF has thinker / code2wav / talker
sub-modules; this recipe only needs the thinker for ASR). The saved
consolidated/config.json therefore carries
model_type=qwen3_omni_moe_thinker, which is not registered as a top-level
architecture in transformers.CONFIG_MAPPING. Loading it directly will
fail with:
Tool: tools/wrap_thinker_ckpt_as_omni.py
tools/wrap_thinker_ckpt_as_omni.py rewraps the thinker checkpoint as a
full Qwen3-Omni export by:
- Renaming + copying the trained
thinker.*shards into the output dir. - Copying the untrained
code2wav.*andtalker.*shards verbatim from the cached HF base model (these were never modified during ASR training). - Writing a merged
model.safetensors.index.jsonacross all three buckets. - Replacing the bogus
config.jsonwith the base modelβs (model_type=qwen3_omni_moe,architectures=["Qwen3OmniMoeForConditionalGeneration"]). - Copying the rest of the HF metadata (tokenizer, processor, generation
config, chat template) from the base; the recipe-saved
chat_template.jinjawins if present.
Memory footprint stays at roughly one shard (~5 GB) at a time β no full-model materialisation.
The output directory is a drop-in replacement for the public Qwen3-Omni
snapshot β only the thinker.* weights differ.
Results: AMI IHM
End-of-epoch evaluation on the AMI IHM test split, comparing the
zero-shot base Qwen3-Omni against the same model after one epoch of full
fine-tuning with the recipe above (audio tower trainable). WER drops by
roughly half:

Results: multi_en mixture
Training the same model for 3 epochs on the ~500k-clip multi_en mixture (see
Mixture of Datasets) generalizes across all 7
open-ASR-leaderboard
English test subsets, not just AMI. WER below is Whisper-normalized
(EnglishTextNormalizer), greedy decode, comparing the zero-shot base against
the multi_en fine-tune:
Fine-tuning concentrates its gains on the harder conversational / domain sets (AMI β2.85, Earnings22 β0.90, SPGISpeech β1.01, VoxPopuli β0.40), while the strong base keeps a small edge on clean read speech (LibriSpeech, GigaSpeech) β a hint that the mix can be rebalanced toward those styles. Net macro WER improves from 6.24% to 5.83%.
The same recipe on Qwen2.5-Omni-3B (multi_en_sft_3b.yaml) shows a much
larger fine-tuning gain, since the small modelβs zero-shot baseline is weaker:
macro WER 8.97% β 6.55% (β2.42).