Speech AI Models#

NVIDIA NeMo Framework supports the training and customization of Speech AI models, specifically designed to enable voice-based interfaces for conversational AI applications. A range of speech tasks is supported, including Automatic Speech Recognition (ASR), Speaker Diarization, and Text-to-Speech (TTS), which we highlight below.

Automatic Speech Recognition (ASR)#

Automatic Speech Recognition is the task of generating transcriptions of what was spoken in an audio file.

Latest ASR Models Developed by the NVIDIA NeMo Team#

| Model family | Decoder type | Useful links |
|---|---|---|
| Canary | AED (Attention-based Encoder-Decoder) | Docs, Paper, HF space |
| Parakeet | CTC, RNN-T, TDT, TDT-CTC hybrid | Docs, HF space |

Key features of NeMo ASR, along with further details, can be found in the Developer Docs.
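
For inference, the typical entry point is loading a pretrained checkpoint and calling `transcribe`. The snippet below is a minimal sketch: the checkpoint name and audio path are illustrative assumptions, and the exact return type of `transcribe` varies by model and NeMo version.

```python
# Minimal sketch: transcribing audio with a pretrained NeMo ASR model.
# The checkpoint name and audio path are illustrative assumptions; any ASR
# checkpoint listed in the table above can be substituted.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-1.1b")
outputs = asr_model.transcribe(["sample_audio.wav"])  # list of audio file paths
print(outputs[0])  # transcription text (or a Hypothesis object, depending on model/version)
```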

Speaker Diarization#

Speaker diarization is the process of partitioning an audio stream into segments based on the identity of each speaker. Essentially, it answers the question, “Who spoke when?”

Latest Speaker Diarization Models Developed by the NVIDIA NeMo Team#

| Model name | Useful links |
|---|---|
| MSDD (Multiscale Diarization Decoder) | Docs, Paper, HF space |

Find more details in the Developer Docs.
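
For offline inference, diarization in NeMo is driven by a YAML configuration. The sketch below uses the clustering-based diarizer as an example; the config file name, manifest path, and output directory are placeholder assumptions, and the full configuration schema is covered in the Developer Docs.

```python
# Minimal sketch: offline speaker diarization with NeMo's ClusteringDiarizer.
# The config file name, manifest path, and output directory are illustrative
# assumptions; see the speaker diarization docs for the full configuration schema.
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

cfg = OmegaConf.load("diar_infer_telephonic.yaml")      # example inference config shipped with NeMo
cfg.diarizer.manifest_filepath = "input_manifest.json"  # one JSON entry per audio file
cfg.diarizer.out_dir = "diarization_output"

diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM files ("who spoke when" segments) to out_dir
```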

Text-to-Speech (TTS)#

Text-to-Speech is a technology that converts textual inputs into natural human speech. NeMo 2.0 supports training and inference of Magpie-TTS models.

Latest TTS Models Developed by the NVIDIA NeMo Team#

| Model name | Useful links |
|---|---|
| Magpie-TTS | Paper, Blog post, HF Checkpoint, HF Demo |

Find more details in the Developer Docs.

Default configurations are provided for each model and are outlined in the model-specific documentation linked above. Every configuration can be modified to train on new datasets or to test new model hyperparameters.
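
As a sketch of what such a modification might look like, the snippet below loads a default YAML configuration with OmegaConf and overrides a dataset path and a hyperparameter. The file name and field names are illustrative assumptions and differ per model; consult each model's documentation for its actual configuration.

```python
# Minimal sketch: overriding a model's default configuration with OmegaConf.
# The YAML file name and field names are illustrative assumptions; each model's
# documentation lists its actual default configuration files.
from omegaconf import OmegaConf

cfg = OmegaConf.load("default_model_config.yaml")       # hypothetical local copy of a default config
cfg.model.train_ds.manifest_filepath = "my_train.json"  # point training at a new dataset
cfg.model.optim.lr = 1e-3                               # test a different hyperparameter
print(OmegaConf.to_yaml(cfg))                           # inspect the resulting configuration
```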

Speech AI Tools#

NeMo Framework also includes a large set of Speech AI tools for dataset preparation, model evaluation, and text normalization.