> **Important:** NeMo 2.0 is an experimental feature and is currently released only in the dev container, `nvcr.io/nvidia/nemo:dev`. Refer to the NeMo 2.0 overview for information on getting started.
# Speech AI Models
NVIDIA NeMo Framework supports the training and customization of Speech AI models, specifically designed to enable voice-based interfaces for conversational AI applications. A range of speech tasks are supported, including Automatic Speech Recognition (ASR), Speaker Diarization, and Text-to-Speech (TTS), which we highlight below.
## Automatic Speech Recognition (ASR)
Automatic Speech Recognition is the task of generating transcriptions of what was spoken in an audio file.
| Model family | Decoder type | Useful links |
|---|---|---|
| Canary | AED (Attention-based Encoder-Decoder) | |
| Parakeet | CTC / RNN-T / TDT (Token-and-Duration Transducer) | |
Key features of NeMo ASR include:

- Pretrained ASR models, many topping the Hugging Face Open ASR Leaderboard
- Model checkpoints specialized for real-time speech recognition
Find more details in the Developer Docs.
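As a quick orientation, the sketch below loads a pretrained Parakeet checkpoint through the NeMo ASR collection and transcribes a local audio file. The checkpoint name and file path are illustrative placeholders, and the return type of `transcribe()` differs slightly between NeMo releases.

```python
# Minimal ASR inference sketch; assumes `nemo_toolkit` is installed.
import nemo.collections.asr as nemo_asr

# Any pretrained checkpoint from NGC or Hugging Face works here; this name is illustrative.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-ctc-1.1b")

# `transcribe` takes a list of audio file paths (16 kHz mono WAV is the safest input format).
output = asr_model.transcribe(["sample_audio.wav"])

# Depending on the NeMo version, entries are plain strings or Hypothesis objects with a `.text` field.
print(output[0])
```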
## Speaker Diarization
Speaker diarization is the process of partitioning an audio stream into segments based on the identity of each speaker. Essentially, it answers the question, “Who spoke when?”
| Model name | Useful links |
|---|---|
| MSDD (Multiscale Diarization Decoder) | |
Find more details in the Developer Docs.
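For a rough sense of the workflow, the sketch below runs the MSDD-based pipeline on a single file. It assumes a recent NeMo release in which `NeuralDiarizer.from_pretrained` is available; older releases configure the same pipeline through a YAML/OmegaConf config instead, so treat the exact entry point as version-dependent. The audio path is a placeholder.

```python
# Speaker diarization sketch, assuming a recent NeMo release with the one-line MSDD entry point.
from nemo.collections.asr.models.msdd_models import NeuralDiarizer

# Builds the full pipeline (VAD -> speaker embeddings -> MSDD) from the telephony MSDD checkpoint on NGC.
diarizer = NeuralDiarizer.from_pretrained("diar_msdd_telephonic")

# Calling the diarizer on an audio file returns speaker-labeled segments
# (a pyannote-style annotation in recent releases).
annotation = diarizer("conversation.wav")
for segment, _, speaker in annotation.itertracks(yield_label=True):
    print(f"{speaker}: {segment.start:.2f}s - {segment.end:.2f}s")
```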
## Text-To-Speech (TTS)
Text-to-Speech is a technology that converts textual inputs into natural human speech.
| Model name | Useful links |
|---|---|
| T5-TTS | |
Find more details in the Developer Docs.
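The T5-TTS recipe itself is covered in the Developer Docs; as a general illustration of inference in the NeMo TTS collection, the sketch below uses the long-standing two-stage pipeline (FastPitch spectrogram generator plus HiFi-GAN vocoder) instead. The checkpoint names are pretrained English models from NGC, and writing the waveform assumes the `soundfile` package is installed.

```python
# Two-stage TTS sketch: text -> mel spectrogram (FastPitch) -> waveform (HiFi-GAN).
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch").eval()
vocoder = HifiGanModel.from_pretrained("tts_en_hifigan").eval()

tokens = spec_generator.parse("Text-to-speech converts written text into natural speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# These pretrained English checkpoints operate at 22.05 kHz.
sf.write("speech.wav", audio.detach().cpu().numpy()[0], samplerate=22050)
```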
## Speech AI Tools
NeMo Framework also includes a large set of Speech AI tools for dataset preparation, model evaluation, and text normalization.
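As one concrete example, text normalization is provided by the companion `nemo_text_processing` package (WFST-based, installed separately with `pip install nemo_text_processing`). The sketch below expands written forms such as dates and currency into their spoken forms; the input sentence is just illustrative.

```python
# Text normalization sketch using the companion nemo_text_processing package.
from nemo_text_processing.text_normalization.normalize import Normalizer

# WFST-based normalizer: written form -> spoken form.
normalizer = Normalizer(input_case="cased", lang="en")

# Prints the fully verbalized sentence (numbers, dates, and currency spelled out).
print(normalizer.normalize("The meeting on 01/05/2024 raised $1,250."))
```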