Key Concepts in Speech AI#
This page introduces the fundamental concepts you’ll encounter when working with speech models in NeMo. No prior NeMo experience is required — we start from the basics of audio and work up to how NeMo structures its models.
Audio Conventions in NeMo#
- **Sampling rate** — ASR models often use 16 kHz; TTS and audio processing models may use higher rates (e.g. 22.05 kHz, 44.1 kHz). Check each model’s or preprocessor’s config for the expected sample rate.
- **Channels** — Most models use mono input, but some support multi-channel audio (e.g. for spatial or multi-mic setups). See the model and preprocessor documentation for your use case.
- **Preprocessing** — NeMo models typically include a preprocessor that converts waveform input into features (e.g. mel-spectrograms). For most setups, provide audio that already matches the model’s expected sample rate and channel layout (often mono); automatic resampling or stereo→mono downmixing is not guaranteed and depends on the collection, dataset, and preprocessor config.
- **Mel-spectrogram** — For models that use it, the preprocessor turns the raw waveform into mel-spectrogram features inside the model, not as a separate offline step.
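A quick way to sanity-check a file against these conventions is the standard library's `wave` module. The sketch below synthesizes a stereo 44.1 kHz clip in memory as a stand-in for a real recording; in practice you would resample and downmix with a tool such as librosa, torchaudio, or sox before feeding audio to a model.

```python
import io
import struct
import wave

# Synthesize a short stereo WAV in memory (a hypothetical stand-in for a real file).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)      # stereo
    w.setsampwidth(2)      # 16-bit PCM
    w.setframerate(44100)  # 44.1 kHz -- not the 16 kHz many ASR models expect
    frames = b"".join(struct.pack("<hh", i % 1000, -(i % 1000)) for i in range(441))
    w.writeframes(frames)

# Inspect the audio before feeding it to a model: NeMo will not necessarily
# resample or downmix for you, so verify the format up front.
buf.seek(0)
with wave.open(buf, "rb") as w:
    rate, channels, n = w.getframerate(), w.getnchannels(), w.getnframes()
    pcm = w.readframes(n)

print(rate, channels)  # 44100 2 -> needs resampling and stereo->mono downmixing

# Naive stereo->mono downmix: average the two channels per frame.
samples = struct.unpack("<" + "h" * (len(pcm) // 2), pcm)
mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
```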
Speech AI Tasks#
NeMo supports several speech AI tasks, each solving a different problem:
| Task | What it does | Example use case |
|---|---|---|
| ASR (Automatic Speech Recognition) | Converts spoken audio to text | Transcribing meetings, voice interfaces |
| TTS (Text-to-Speech) | Generates natural speech from text | Audiobooks, voice interfaces |
| Speaker Diarization | Determines “who spoke when” | Multi-speaker segmentation and transcription |
| Speaker Recognition | Identifies or verifies a speaker’s identity | Voice authentication, speaker search |
| Speech Enhancement | Improves audio quality (removes noise) | Preprocessing noisy recordings |
| SpeechLM | Augments LLMs with audio understanding | Audio-aware agents, speech translation, reasoning about audio |
Encoder Architectures#
The encoder converts audio features into a sequence of high-level representations:
- Transformer
The standard encoder from Vaswani et al. (2017) — stacked self-attention and feed-forward layers with no convolutions. Used in NeMo as an encoder or decoder in encoder-decoder models (e.g. Canary).
- Conformer
The original architecture from Gulati et al. (2020) that combines self-attention with convolutions for both global and local patterns.
- FastConformer
A faster variant of Conformer (Rekesh et al. (2023)) with 8× subsampling and optimized attention. NeMo’s default choice for ASR; recommended for new projects.
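To make the subsampling factor concrete, here is a rough frame-count sketch. It assumes the common 10 ms preprocessor hop (check your config) and, for comparison, the 4× subsampling used by the original Conformer; these are assumptions for illustration, not values read from any specific model.

```python
# Back-of-the-envelope frame math for Conformer-style subsampling.
# Assumptions: 16 kHz audio with a 10 ms mel-spectrogram hop, so 100 feature
# frames per second before subsampling.

def encoder_frames(duration_s: float, hop_ms: float = 10.0, subsampling: int = 8) -> int:
    feature_frames = int(duration_s * 1000 / hop_ms)  # mel-spectrogram frames
    return feature_frames // subsampling              # frames leaving the encoder

# A 10-second utterance: 1000 mel frames.
print(encoder_frames(10.0))                  # 125 encoder frames at FastConformer's 8x
print(encoder_frames(10.0, subsampling=4))   # 250 at the original Conformer's 4x
```

The 8× reduction is where much of FastConformer's speedup comes from: self-attention cost grows with sequence length, so halving the frame rate again relative to Conformer shrinks the attention workload substantially.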
How NeMo Models Work#
Every NeMo model wraps these components into a single, cohesive unit:
*(Figure: Overview of NeMo Speech.)*
NeMo models are PyTorch modules that also integrate with PyTorch Lightning for training and Hydra + OmegaConf for configuration.
Configuration with YAML#
NeMo experiments are configured with YAML files. A typical config has three main sections:
```yaml
model:
  # Model architecture, data, loss, optimizer
  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: 80
    n_layers: 17
    ...
  train_ds:
    manifest_filepath: /path/to/train_manifest.json
    batch_size: 32
  optim:
    name: adamw
    lr: 0.001

trainer:
  # PyTorch Lightning trainer settings
  devices: 4
  accelerator: gpu
  max_steps: 100000
  precision: bf16-mixed

exp_manager:
  # Experiment logging and checkpointing
  exp_dir: /path/to/experiments
  name: my_asr_experiment
```
You can override any value from the command line:
```bash
python train_script.py \
    model.optim.lr=0.0005 \
    model.train_ds.manifest_filepath=/data/train.json \
    trainer.devices=8
```
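Each dotted override addresses a nested key in the config. The sketch below mimics that merging logic in plain Python to show what happens conceptually; real NeMo scripts rely on Hydra + OmegaConf, which also handle type coercion, interpolation, and validation.

```python
# Minimal sketch of Hydra-style dotted overrides applied to a nested dict.
# This is an illustration, not how NeMo actually parses overrides.

def apply_override(cfg: dict, override: str) -> None:
    path, value = override.split("=", 1)
    keys = path.split(".")
    node = cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value  # kept as a string here; Hydra coerces types

cfg = {"model": {"optim": {"name": "adamw", "lr": 0.001}}, "trainer": {"devices": 4}}
for ov in ["model.optim.lr=0.0005", "trainer.devices=8"]:
    apply_override(cfg, ov)

print(cfg["model"]["optim"]["lr"])  # 0.0005 (as a string in this sketch)
print(cfg["trainer"]["devices"])    # 8
```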
Manifest Files#
NeMo uses manifest files (JSONL format) to describe datasets. Each line is one training example:
```json
{"audio_filepath": "/data/audio/001.wav", "text": "hello world", "duration": 2.5}
{"audio_filepath": "/data/audio/002.wav", "text": "how are you", "duration": 1.8}
```
Key fields:
- `audio_filepath`: path to the audio file
- `text`: the transcript (for ASR) or input text (for TTS)
- `duration`: audio duration in seconds
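Because each line is an independent JSON object, manifests are easy to build and validate with the standard library alone. The paths and transcripts below are illustrative:

```python
import json
import os
import tempfile

# Write a two-entry manifest: one JSON object per line (JSONL).
examples = [
    {"audio_filepath": "/data/audio/001.wav", "text": "hello world", "duration": 2.5},
    {"audio_filepath": "/data/audio/002.wav", "text": "how are you", "duration": 1.8},
]

path = os.path.join(tempfile.mkdtemp(), "train_manifest.json")
with open(path, "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Read it back and compute total audio duration -- a common sanity check.
with open(path) as f:
    entries = [json.loads(line) for line in f]

total_seconds = sum(e["duration"] for e in entries)
print(len(entries), round(total_seconds, 1))  # 2 4.3
```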
See Datasets for details on preparing datasets.
Model Checkpoints#
NeMo models are saved as .nemo files — tar archives containing model weights, configuration, and tokenizer files. You can load models in two ways:
```python
import nemo.collections.asr as nemo_asr

# From a pretrained checkpoint (downloads from HuggingFace/NGC)
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# From a local .nemo file
model = nemo_asr.models.ASRModel.restore_from("path/to/model.nemo")
```
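Since a `.nemo` file is an ordinary tar archive, you can peek inside with Python's `tarfile` module. The sketch below builds a mock archive in memory; the member names are illustrative, though real archives typically contain a weights file, a model config YAML, and tokenizer assets (exact names vary by model).

```python
import io
import tarfile

# Build a mock .nemo-style tar archive in memory (member names are made up).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("model_config.yaml", b"model: {}"),
                          ("model_weights.ckpt", b"\x00" * 16)]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# List the archive's contents, as you could with a real .nemo file on disk.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    members = tar.getnames()

print(members)  # ['model_config.yaml', 'model_weights.ckpt']
```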
See Checkpoints for more details on checkpoint formats.