Speech AI Models
NVIDIA NeMo Framework supports training and customizing Speech AI models designed to power voice-based interfaces for conversational AI applications. Supported speech tasks include Automatic Speech Recognition (ASR), Speaker Diarization, and Text-to-Speech (TTS), each highlighted below.
Automatic Speech Recognition (ASR)
Automatic Speech Recognition is the task of generating transcriptions of what was spoken in an audio file.
Key features of NeMo ASR include:
Pretrained ASR models, many of which top the Hugging Face Open ASR Leaderboard (loading one is sketched below)
Model checkpoints specialized for real-time speech recognition
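As a minimal sketch, the snippet below loads a pretrained checkpoint by name and transcribes a local audio file. The model name and audio path are illustrative placeholders, not recommendations for a particular use case.

```python
# Minimal sketch: transcribe a local audio file with a pretrained NeMo ASR model.
# The model name and file path below are placeholders.
import nemo.collections.asr as nemo_asr

# Download a pretrained English ASR checkpoint by name.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")

# Transcribe one or more audio files (16 kHz mono WAV is the typical input).
transcriptions = asr_model.transcribe(["path/to/audio.wav"])
print(transcriptions[0])
```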
Find more details in the Developer Docs.
Speaker Diarization
Speaker diarization is the process of partitioning an audio stream into segments based on the identity of each speaker. Essentially, it answers the question, “Who spoke when?”
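The sketch below shows the typical offline workflow with NeMo's clustering-based diarizer: load an inference config, point it at a manifest describing the input audio, and run diarization. The config and manifest paths are placeholders you would supply yourself.

```python
# Minimal sketch of clustering-based diarization.
# Assumptions: a local copy of one of NeMo's diar_infer_*.yaml example configs
# and a manifest JSON listing the audio to process; both paths are placeholders.
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

cfg = OmegaConf.load("diar_infer_telephonic.yaml")
cfg.diarizer.manifest_filepath = "input_manifest.json"
cfg.diarizer.out_dir = "diarization_output"

# Run diarization; RTTM files with "who spoke when" segments are written
# to the output directory.
diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()
```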
Find more details in the Developer Docs.
Text-to-Speech (TTS)
Text-to-Speech technology converts text input into natural-sounding human speech.
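A typical NeMo TTS pipeline has two stages: a spectrogram generator followed by a vocoder. The sketch below wires together two pretrained checkpoints; the model names and output path are illustrative.

```python
# Minimal sketch of a two-stage TTS pipeline with pretrained checkpoints.
# Model names and the output path are placeholders.
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Stage 1: text -> mel spectrogram
spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")
# Stage 2: mel spectrogram -> waveform
vocoder = HifiGanModel.from_pretrained("tts_en_hifigan")

tokens = spec_generator.parse("Hello, welcome to NeMo text-to-speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# These checkpoints generate audio at 22.05 kHz.
sf.write("output.wav", audio.detach().cpu().numpy()[0], samplerate=22050)
```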
Find more details in the Developer Docs.
Speech AI Tools
NeMo Framework also includes a large set of Speech AI tools for dataset preparation, model evaluation, and text normalization.
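For example, text normalization converts written forms such as times and currency amounts into their spoken forms. The sketch below assumes the companion nemo_text_processing package is installed; the input sentence is illustrative.

```python
# Minimal sketch of text normalization with the nemo_text_processing package
# (installed separately from the core NeMo framework).
from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(input_case="cased", lang="en")
print(normalizer.normalize("The meeting starts at 10:30 AM and costs $5."))
# Time and currency tokens are expanded into words, e.g. "ten thirty", "five dollars".
```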