Speech AI Models
NVIDIA NeMo Framework supports training and customizing Speech AI models designed to enable voice-based interfaces for conversational AI applications. Supported speech tasks include Automatic Speech Recognition (ASR), Speaker Diarization, and Text-to-Speech (TTS), each highlighted below.
Automatic Speech Recognition (ASR)
Automatic Speech Recognition is the task of generating transcriptions of what was spoken in an audio file.
Model family | Decoder type | Useful links |
---|---|---|
Canary | AED (Attention-based Encoder-Decoder) | |
Parakeet | | |
Key features of NeMo ASR include:
- [Pretrained ASR models](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/results.html#automatic-speech-recognition-models), many of which top the Hugging Face Open ASR Leaderboard
- Model checkpoints specialized for real-time speech recognition
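As a rough illustration of how a pretrained ASR checkpoint is typically used, the sketch below loads a model with `ASRModel.from_pretrained` and transcribes a list of audio files. The checkpoint name `nvidia/parakeet-tdt-0.6b-v2` is only an example chosen for illustration, and `hypothesis_text` and `transcribe_files` are hypothetical helper names, not part of NeMo. Depending on the NeMo version, `transcribe` may return plain strings or hypothesis objects with a `.text` attribute, so the helper accepts both.

```python
from typing import Any, List


def hypothesis_text(result: Any) -> str:
    """Return the transcript from a single transcribe() result.

    Assumption: results are either plain strings or objects that
    expose the transcript as a `.text` attribute.
    """
    return result if isinstance(result, str) else result.text


def transcribe_files(
    paths: List[str],
    model_name: str = "nvidia/parakeet-tdt-0.6b-v2",  # example checkpoint name
) -> List[str]:
    """Load a pretrained ASR checkpoint and transcribe the given audio files."""
    # Imported lazily so the helpers above can be used without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
    return [hypothesis_text(r) for r in model.transcribe(paths)]
```

Calling `transcribe_files(["sample.wav"])`, where `sample.wav` is a placeholder path, would download the checkpoint on first use and return one transcript per file. Any other checkpoint name from the pretrained model list can be passed as `model_name`.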
Find more details in the Developer Docs.
Speaker Diarization
Speaker diarization is the process of partitioning an audio stream into segments based on the identity of each speaker. Essentially, it answers the question, “Who spoke when?”
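To make the "who spoke when" output concrete, here is a minimal sketch that turns per-frame speaker labels into time segments and formats them as RTTM lines, a common interchange format for diarization results. It does not run a diarization model such as MSDD. The frame labels, the frame length, and the function names `frames_to_segments` and `to_rttm` are assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, speaker)


def frames_to_segments(labels: List[str], frame_sec: float) -> List[Segment]:
    """Merge consecutive frames that share a speaker label into segments."""
    segments: List[Segment] = []
    index = 0  # index of the first frame in the current run
    for speaker, run in groupby(labels):
        length = len(list(run))
        segments.append((index * frame_sec, (index + length) * frame_sec, speaker))
        index += length
    return segments


def to_rttm(segments: List[Segment], file_id: str) -> List[str]:
    """Format segments as RTTM lines: onset and duration are in seconds."""
    return [
        f"SPEAKER {file_id} 1 {start:.3f} {end - start:.3f} <NA> <NA> {speaker} <NA> <NA>"
        for start, end, speaker in segments
    ]
```

For example, `frames_to_segments(["A", "A", "B"], 0.5)` returns `[(0.0, 1.0, "A"), (1.0, 1.5, "B")]`: speaker A talks for the first second and speaker B for the next half second.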
Model name | Useful links |
---|---|
MSDD (Multiscale Diarization Decoder) | |
Find more details in the Developer Docs.
Text-to-Speech (TTS)
Text-to-Speech is a technology that converts textual inputs into natural human speech.
Model name | Useful links |
---|---|
T5-TTS | |
Find more details in the Developer Docs.
Speech AI Tools
NeMo Framework also includes a large set of Speech AI tools for dataset preparation, model evaluation, and text normalization.