The conversational AI pipeline consists of three major stages:
Automatic Speech Recognition (ASR)
Natural Language Processing (NLP) or Natural Language Understanding (NLU)
Text-to-Speech (TTS) or voice synthesis
As you talk to a computer, the ASR stage converts the audio signal into text, the NLP stage interprets the question and generates a smart response, and finally the TTS stage converts the text into speech signals to generate audio for the user. The toolkit enables the development and training of the deep learning models involved in conversational AI and makes it easy to chain them together.
Deep learning model development for conversational AI is complex. It involves defining, building, and training several models in specific domains; experimenting several times to reach high accuracy; fine-tuning on multiple tasks and domain-specific data; ensuring training performance; and making sure the models are ready for deployment to inference applications. Neural modules are logical blocks of AI applications that take typed inputs and produce typed outputs. By separating a model into its essential components in a building-block manner, NeMo helps researchers develop state-of-the-art models for domain-specific data faster and more easily.
Collections of modules for core tasks, as well as modules specific to speech recognition, natural language, and speech synthesis, help you develop modular, flexible, and reusable pipelines.
A neural module’s inputs and outputs have a neural type that describes the semantics, the axis order and meaning, and the dimensions of the input/output tensors. This typing allows neural modules to be safely chained together to build models for applications.
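The idea of typed inputs and outputs can be illustrated with a minimal sketch. The classes below (`TensorType`, `Module`, `chain`) are hypothetical stand-ins written for this example and are not NeMo's actual neural type API; they only show how a type check at connection time can stop incompatible blocks from being chained.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class TensorType:
    """Hypothetical stand-in for a neural type: semantics plus axis order."""
    kind: str               # what the values mean, e.g. "audio_signal"
    axes: Tuple[str, ...]   # axis order, e.g. ("batch", "time")


@dataclass
class Module:
    """A block with one typed input and one typed output."""
    name: str
    input_type: TensorType
    output_type: TensorType
    fn: Callable


def chain(*modules: Module) -> Callable:
    """Compose modules, refusing to connect mismatched types."""
    for upstream, downstream in zip(modules, modules[1:]):
        if upstream.output_type != downstream.input_type:
            raise TypeError(
                f"{upstream.name} outputs {upstream.output_type}, "
                f"but {downstream.name} expects {downstream.input_type}"
            )

    def run(x):
        for module in modules:
            x = module.fn(x)
        return x

    return run


AUDIO = TensorType("audio_signal", ("batch", "time"))
SPEC = TensorType("spectrogram", ("batch", "feature", "time"))
LOGITS = TensorType("log_probs", ("batch", "time", "vocab"))

# The lambdas are placeholders that only tag the data they pass along.
preprocessor = Module("preprocessor", AUDIO, SPEC, lambda x: f"spec({x})")
encoder = Module("encoder", SPEC, LOGITS, lambda x: f"logits({x})")

pipeline = chain(preprocessor, encoder)
print(pipeline("audio"))
```

Swapping the order, as in `chain(encoder, preprocessor)`, raises a `TypeError` before any data flows, which is the safety property the typing is meant to provide.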
NeMo can be used to train new models or perform transfer learning on existing pre-trained models. Pre-trained weights per module (such as encoder, decoder) help accelerate model training for domain specific data.
ASR, NLP, and TTS pre-trained models are trained on multiple datasets (including datasets for languages such as Mandarin) and optimized for high accuracy. They can also be used for transfer learning.
NeMo supports developing models that work with Mandarin Chinese data. Tutorials help users train or fine-tune models for conversational AI with the Mandarin Chinese language. The export method provided in NeMo makes it easy to transform a trained model into an inference-ready format for deployment.
A key area of development in the toolkit is interoperability with other tools used by speech researchers. The data layer for Kaldi compatibility is one such example.
NeMo, PyTorch Lightning, And Hydra
Conversational AI architectures are typically very large and require a lot of data and compute for training. NeMo uses PyTorch Lightning for easy and performant multi-GPU/multi-node mixed precision training.
PyTorch Lightning is a high-performance PyTorch wrapper that organizes PyTorch code, scales model training, and reduces boilerplate. PyTorch Lightning has two main components, the LightningModule and the Trainer. The LightningModule is used to organize PyTorch code so that deep learning experiments can be easily understood and reproduced. The PyTorch Lightning Trainer is then able to take the LightningModule and automate everything needed for deep learning training.
NeMo models are LightningModules that come equipped with all supporting infrastructure for training and reproducibility. This includes the deep learning model architecture, data preprocessing, optimizer, checkpointing, and experiment logging. NeMo models, like LightningModules, are also PyTorch modules and are fully compatible with the broader PyTorch ecosystem. Any NeMo model can be plugged into any PyTorch workflow.
Configuring conversational AI applications is difficult due to the need to bring together many different Python libraries into one end-to-end system. NeMo uses Hydra for configuring both NeMo models and the PyTorch Lightning Trainer. Hydra is a flexible solution that makes it easy to configure all of these libraries from a configuration file or from the command-line.
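To make the "configuration file or command line" idea concrete, here is a minimal sketch of how dotted `key=value` overrides can be merged into default settings. This is a toy re-implementation for illustration only, not Hydra's code; the function `apply_overrides` and the keys in `defaults` are hypothetical.

```python
import copy
import json


def apply_overrides(config: dict, overrides: list) -> dict:
    """Return a copy of config with dotted key=value overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        try:
            node[leaf] = json.loads(raw_value)   # numbers, booleans, null
        except json.JSONDecodeError:
            node[leaf] = raw_value               # fall back to a plain string
    return result


# Hypothetical defaults, standing in for a YAML file shipped with an example script.
defaults = {
    "trainer": {"gpus": 1, "max_epochs": 5},
    "model": {"optim": {"name": "adam", "lr": 0.01}},
}

config = apply_overrides(defaults, ["trainer.gpus=2", "model.optim.lr=0.001"])
print(config["trainer"]["gpus"], config["model"]["optim"]["lr"])
```

The defaults stay untouched, so the same configuration file can be reused across experiments while each run records only the values it changed.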
Every NeMo model has an example configuration file and a corresponding script that contains all configurations needed for training to state-of-the-art accuracy. NeMo models have the same look and feel so that it is easy to do conversational AI research across multiple domains.
Using Optimized Pretrained Models With NeMo
NVIDIA GPU Cloud (NGC) is a software repository that has containers and models optimized for deep learning. NGC hosts many conversational AI models developed with NeMo that have been trained to state-of-the-art accuracy on large datasets. NeMo models on NGC can be automatically downloaded and used for transfer learning tasks. Pretrained models are the quickest way to get started with conversational AI on your own data. NeMo has many example scripts and Jupyter Notebook tutorials showing step-by-step how to fine-tune pretrained NeMo models on your own domain-specific datasets.
For BERT-based models, the model weights provided are ready for downstream NLU tasks. For speech models, it can be helpful to start with a pretrained model and then continue pretraining on your own domain-specific data. Jasper and QuartzNet pretrained weights are known to work very well as base models. For an easy-to-follow guide on transfer learning and building domain-specific ASR models, you can follow this blog. All pre-trained NeMo models can be found on the NGC NeMo Collection. Everything needed to quickly get started with NeMo ASR, NLP, and TTS models is there.
Pre-trained models are packaged as a .nemo file and contain the PyTorch checkpoint along with everything needed to use the model.
NeMo models are trained to state-of-the-art accuracy on multiple datasets so that they are robust to small differences in data. NeMo contains a large variety of models, such as speaker identification and Megatron BERT, and the best models in speech and language are constantly being added as they become available. NeMo is the premier toolkit for conversational AI model building and training.
For a list of supported models, refer to the Tutorials section.
This section helps guide your decisions by answering our most frequently asked ASR questions.
Q: Is there a way to add domain-specific vocabulary in NeMo? If so, how do I do that? A: QuartzNet and Jasper models are character-based, so the pretrained models we provide for these two architectures output only lowercase English letters and the apostrophe ('). Users can retrain them on a vocabulary that includes uppercase letters and punctuation symbols.
Q: When training, “Reference” lines and “Decoded” lines are printed out. It seems like the reference line should be the “truth” line and the decoded line should be what the ASR is transcribing. Why do even the reference lines not appear to be correct? A: Because our pre-trained models can only output lowercase letters and the apostrophe, everything else is dropped. So the model will transcribe 10 as ten. The best way forward is to prepare the training data first by converting everything to lowercase and converting numbers from digit representation to word representation using a simple library such as inflect. Then, add the uppercase letters and punctuation back using the NLP punctuation model. Here is an example of how this is incorporated: NeMo voice swap demo.
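The transcript preparation described in this answer can be sketched in plain Python. The helper `number_to_words` below is a tiny hand-written stand-in for a library such as inflect (it only covers 0 to 999 and reads longer numbers digit by digit), and `normalize_transcript` is a hypothetical name, not a NeMo function.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999 (stand-in for a library like inflect)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + number_to_words(rest)


def spell_number(match) -> str:
    """Spell small numbers as words and read longer ones digit by digit."""
    value = int(match.group())
    if value < 1000:
        return number_to_words(value)
    return " ".join(ONES[int(digit)] for digit in match.group())


def normalize_transcript(text: str) -> str:
    """Reduce text to lowercase letters, apostrophes, and single spaces."""
    text = text.lower()
    text = re.sub(r"\d+", spell_number, text)    # digits -> words
    text = re.sub(r"[^a-z' ]+", " ", text)       # drop everything else
    return " ".join(text.split())                # collapse whitespace


print(normalize_transcript("I have 10 apples, don't I?"))
```

Applying the same normalization to both the training transcripts and the references used for evaluation keeps the "Reference" and "Decoded" lines comparable.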
Q: What languages are supported in NeMo currently? A: Along with English, we provide pre-trained models for Zh, Es, Fr, De, Ru, It, Ca and Pl languages. For more information, see NeMo Speech Models.
Data augmentation in ASR is invaluable, but it increases training time if samples are augmented during training. To save training time, it is recommended to pre-process the dataset offline for a one-time preprocessing cost and then train the model on this augmented training set.
For example, processing a single sample involves:
Time stretch perturbation (sample level)
Time stretch augmentation (batch level, neural module)
A simple tutorial guides users on how to use these utilities provided in GitHub: NeMo.
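As a rough illustration of offline augmentation, the sketch below creates perturbed copies of a signal once, before training. Note the swap: it uses plain resampling with linear interpolation (speed perturbation), which also shifts pitch, rather than a pitch-preserving time stretch. The function name and the rates are illustrative choices, not NeMo utilities.

```python
def speed_perturb(samples, rate):
    """Resample a signal by linear interpolation.

    rate > 1 shortens the signal (faster speech), rate < 1 lengthens it.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    n_out = max(1, int(len(samples) / rate))
    output = []
    for i in range(n_out):
        position = i * rate                       # fractional index into the input
        left = int(position)
        right = min(left + 1, len(samples) - 1)   # clamp at the last sample
        fraction = position - left
        output.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return output


# A toy ramp signal stands in for real audio samples.
original = [float(i) for i in range(100)]

# Offline augmentation: compute the perturbed copies once and store them.
augmented_set = {rate: speed_perturb(original, rate) for rate in (0.9, 1.0, 1.1)}
for rate, signal in augmented_set.items():
    print(rate, len(signal))
```

In a real workflow the perturbed copies would be written to disk and listed in the training manifest, so the cost is paid once instead of on every epoch.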
Speech Data Explorer
Speech data explorer is a Dash-based tool for interactive exploration of ASR/TTS datasets.
Speech data explorer provides:
dataset statistics (alphabet, vocabulary, and duration-based histograms)
navigation across datasets (sorting and filtering)
inspections of individual utterances (waveform, spectrogram, and audio player)
error analysis (word error rate, character error rate, word match rate, mean word accuracy, and diff)
The tool must be installed separately before it can be used. Perform the steps here to install speech data explorer.
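For readers unfamiliar with the error metrics in the list above, word error rate and character error rate can be sketched as edit distance divided by reference length. This is a minimal illustration that assumes non-empty references; it is not the tool's own implementation, and the function names are chosen for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,          # deletion
                current[j - 1] + 1,       # insertion
                previous[j - 1] + cost,   # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(round(word_error_rate(reference, hypothesis), 3))
print(round(char_error_rate("ten", "tan"), 3))
```

Here the hypothesis has one substituted word and one deleted word against six reference words, so the word error rate is 2/6.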
Using Kaldi Formatted Data
The Kaldi Speech Recognition Toolkit project began in 2009 at Johns Hopkins University (https://www.jhu.edu/). It is a toolkit written in C++. Researchers who have used Kaldi and have datasets formatted for that toolkit can use NeMo to develop models based on that data.
To load Kaldi-formatted data, you can simply use KaldiFeatureDataLayer in place of the manifest-based data layer. It takes in the argument kaldi_dir instead of a manifest_filepath; this argument should be set to the directory that contains the Kaldi-formatted data files.
Using Speech Command Recognition Task For ASR Models
Speech Command Recognition is the task of classifying an input audio pattern into a set of discrete classes. It is a subset of ASR, sometimes referred to as Key Word Spotting, in which a model constantly analyzes speech patterns to detect certain commands. Upon detection of these commands, a specific action can be taken. An example Jupyter notebook provided in NeMo shows how to train a QuartzNet model with a modified decoder head on a speech commands dataset.
It is preferred that you use absolute paths to data_dir when preprocessing the dataset.
NLP Fine-Tuning BERT
BERT, or Bidirectional Encoder Representations from Transformers, is a neural approach to pre-train language representations which obtains near state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks, including the GLUE benchmark and SQuAD Question & Answering dataset.
BERT model checkpoints (BERT-large-uncased and BERT-base-uncased) are provided and can be used either for fine-tuning BERT on your custom dataset or for fine-tuning downstream tasks, including GLUE benchmark tasks, Question & Answering tasks, Joint Intent & Slot detection, Punctuation and Capitalization, Named Entity Recognition, and a Speech Recognition post-processing model that corrects mistakes.
Almost all NLP examples also support RoBERTa and ALBERT models for downstream fine-tuning tasks (see the list of all supported models by calling nemo.collections.nlp.modules.common.lm_utils.get_pretrained_lm_models_list()). The user needs to specify the name of the desired model when running the example scripts.
BioMegatron Medical BERT
BioMegatron is a large language model (Megatron-LM) trained on a large domain-specific text corpus (PubMed abstract + full-text-commercial). It achieves state-of-the-art results for certain tasks such as Relationship Extraction, Named Entity Recognition, and Question & Answering. Follow the tutorials to learn how to train and fine-tune BioMegatron; pretrained models are provided on NGC.
Efficient Training With NeMo
Using Mixed Precision
Mixed precision accelerates training while protecting against noticeable loss in precision. Tensor Cores are specialized hardware units, available starting with the Volta and Turing architectures, that accelerate large matrix multiply-add operations by operating on half-precision inputs and returning the result in full precision.
Neural networks, which usually rely on massive matrix multiplications, can be significantly sped up with mixed precision and Tensor Cores. However, some neural network layers are numerically more sensitive than others. Apex AMP is an NVIDIA library that maximizes the benefit of mixed precision and Tensor Core usage for a given network.
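Why some computations are numerically sensitive in half precision can be shown with a small arithmetic sketch. It uses only Python's `struct` module to round values to IEEE 754 half precision, with Python's own float standing in for full precision. It does not use Apex AMP, and the loss scaling shown is a commonly used remedy added here for illustration; the specific numbers are arbitrary.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# 1) A small weight update vanishes when the weight is kept in half precision.
weight, update = 1.0, 1e-4
fp16_weight = to_fp16(to_fp16(weight) + to_fp16(update))   # rounds back to 1.0
full_weight = weight + update                               # keeps the update

# 2) A tiny gradient underflows to zero in half precision unless it is
#    multiplied by a scale factor first and divided by it afterwards.
gradient, loss_scale = 1e-8, 1024.0
unscaled = to_fp16(gradient)                                # underflows to 0.0
rescued = to_fp16(gradient * loss_scale) / loss_scale       # close to 1e-8

print(fp16_weight, full_weight)
print(unscaled, rescued)
```

This is the reason mixed precision keeps sensitive quantities, such as accumulated weights, in full precision while running the heavy matrix math on half-precision inputs.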
This section helps guide your decisions by answering our most frequently asked multi-GPU training questions.
Q: Why is multi-GPU training preferred over other types of training? A: Multi-GPU training can reduce the total training time by distributing the workload onto multiple compute instances. This is particularly important for large neural networks which would otherwise take weeks to train until convergence. Since NeMo supports multi-GPU training, no code change is needed to move from single to multi-GPU training, only a slight change in your launch command is required.
Q: What are the advantages of mixed precision training? A: Mixed precision accelerates training while protecting against noticeable loss in precision. Tensor Cores are specialized hardware units, available starting with the Volta and Turing architectures, that accelerate large matrix multiply-add operations by operating on half-precision inputs and returning the result in full precision to prevent loss in precision. Neural networks, which usually rely on massive matrix multiplications, can be significantly sped up with mixed precision and Tensor Cores. However, some neural network layers are numerically more sensitive than others. Apex AMP is an NVIDIA library that maximizes the benefit of mixed precision and Tensor Core usage for a given network.
Q: What is the difference between multi-GPU and multi-node training? A: Multi-node training is an extension of multi-GPU training that requires a distributed compute cluster in which each node can have multiple GPUs. Multi-node training is needed to scale training beyond a single node to large numbers of GPUs.
From the framework perspective, nothing changes when moving to multi-node training. However, a master address and port need to be set up for inter-node communication. Multi-GPU training is then launched on each node with this information. To achieve full performance, also consider the underlying inter-node network topology and type, for example HPC-style hardware such as NVLink or InfiniBand networking, or Ethernet.
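The information passed to each node can be sketched as a set of environment variables. The variable names below follow the convention used by PyTorch's distributed launcher (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`, `LOCAL_RANK`); the helper function, the address, and the port are hypothetical, and how your cluster scheduler actually sets these values is an assumption to verify.

```python
def worker_environment(master_addr, master_port, num_nodes,
                       gpus_per_node, node_rank, local_rank):
    """Build the environment for one worker process (one GPU) in the job."""
    world_size = num_nodes * gpus_per_node
    global_rank = node_rank * gpus_per_node + local_rank
    return {
        "MASTER_ADDR": master_addr,        # same on every node
        "MASTER_PORT": str(master_port),   # same on every node
        "WORLD_SIZE": str(world_size),     # total number of processes
        "RANK": str(global_rank),          # unique across the whole job
        "LOCAL_RANK": str(local_rank),     # GPU index within the node
    }


# Two nodes with four GPUs each; print the settings for every worker.
for node_rank in range(2):
    for local_rank in range(4):
        env = worker_environment("10.0.0.1", 29500, 2, 4, node_rank, local_rank)
        print(node_rank, local_rank, env["RANK"], env["WORLD_SIZE"])
```

Only the master address and port have to be agreed on in advance; everything else follows from the node index and the GPU index.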
Recommendations For Optimization And FAQs
This section helps guide your decisions by answering our most frequently asked NeMo questions.
Q: Are there areas where performance can be increased? A: You should try using mixed precision for improved performance. Note that typically when using mixed precision, memory consumption is decreased and larger batch sizes could be used to further improve the performance.
When fine-tuning ASR models on your data, it is almost always possible to take advantage of NeMo’s pre-trained modules. Even if you have a different target vocabulary, or even a different language, you can still try starting with pre-trained weights from the Jasper or QuartzNet encoder and only adjust the decoder for your needs.
Q: What is the recommended sampling rate for ASR? A: The released models are based on 16 kHz audio; therefore, ensure that you use them with 16 kHz audio. Expect reduced performance for any audio that is up-sampled from a sampling frequency below 16 kHz.
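A quick way to check the sampling rate of WAV files before training or inference is Python's standard `wave` module. This is a minimal sketch; the helper names are made up for this example, and the in-memory files exist only so that the snippet is self-contained.

```python
import io
import wave

EXPECTED_RATE_HZ = 16000   # rate the released ASR models are based on


def wav_sample_rate(source) -> int:
    """Return the sample rate in Hz of a WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def make_silent_wav(sample_rate: int, seconds: float = 0.1) -> io.BytesIO:
    """Build a mono 16-bit WAV in memory, only to keep the example runnable."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    buffer.seek(0)
    return buffer


for rate in (16000, 8000):
    found = wav_sample_rate(make_silent_wav(rate))
    status = "ok" if found == EXPECTED_RATE_HZ else "mismatch"
    print(found, status)
```

With real data, pass a file path to `wav_sample_rate` instead of an in-memory buffer and flag every file whose rate differs from the expected one.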
Q: How do we use this toolkit for audio with different types of compression and frequency than the training domain for ASR? A: You have to match the compression and frequency.
Q: How do you replace the 6-gram language model of the ASR model with a custom language model? What language model format is supported in NeMo? A: NeMo’s beam search decoder with language model (LM) neural module supports the KenLM language model.
You should retrain the KenLM language model on your own dataset. Refer to KenLM’s documentation.
If you want to use a different language model, other than KenLM, you will need to implement a corresponding decoder module.
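To show what a language model contributes during decoding, here is a minimal sketch of rescoring candidate transcripts by combining the acoustic score with a language model score. The weighting scheme (an LM weight `alpha` and a word-count bonus `beta`) is a common convention and an assumption here, not NeMo's decoder API; the toy dictionary stands in for a real KenLM model, and all scores are made up.

```python
def fused_score(acoustic_logprob, lm_logprob, num_words, alpha=0.5, beta=1.0):
    """Combine acoustic and language model scores for one hypothesis."""
    return acoustic_logprob + alpha * lm_logprob + beta * num_words


def best_hypothesis(hypotheses, lm_logprobs, alpha=0.5, beta=1.0):
    """Pick the transcript with the highest fused score."""
    return max(
        hypotheses,
        key=lambda item: fused_score(
            item[1], lm_logprobs[item[0]], len(item[0].split()), alpha, beta
        ),
    )[0]


# (transcript, acoustic log-probability) pairs from a hypothetical beam search.
hypotheses = [("recognize speech", -4.0), ("wreck a nice beach", -3.8)]

# Toy language model scores; a real setup would query a trained model instead.
toy_lm = {"recognize speech": -2.0, "wreck a nice beach": -9.0}

print(best_hypothesis(hypotheses, toy_lm, alpha=0.0, beta=0.0))  # acoustic only
print(best_hypothesis(hypotheses, toy_lm))                       # with the LM
```

With the language model switched off, the acoustically stronger but implausible transcript wins; with it switched on, the more fluent one does, which is why retraining the language model on in-domain text can change results.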
A Transformer-XL example is present in OS2S. It would need to be updated to work with NeMo. Here is the code.
Q: How do I use text-to-speech (TTS)? A:
- Obtain speech data, ideally at 22050 Hz, or at a higher sample rate that you then downsample to 22050 Hz.
- If the data is below 22050 Hz but at least 16000 Hz:
  - Retrain WaveGlow on your own dataset.
  - Tweak the spectrogram generation parameters, namely the window_stride for their Fourier transforms.
- For data below 16000 Hz, look into obtaining new data.
- In terms of bitrate/quantization, the general advice is the higher the better. We have not experimented enough to state how much this impacts quality.
- For the amount of data, again, the more the better, and the more diverse in terms of phonemes the better. Aim for around 20 hours of speech after filtering for silences and non-speech audio.
- Most open speech datasets are in ~10 second format, so training spectrogram generators on audio on the order of 10 s to 20 s per sample is known to work. Additionally, the longer the speech samples, the more difficult it is to train on them.
- Audio files should be clean, with little background noise or music. Data recorded with a studio microphone is likely to be easier to train on than data captured with a phone.
- Accurate pronunciation of words is a technical challenge related to the dataset: text-to-phonetic spelling is required, so use a phonetic alphabet (notation) in which the name is pronounced correctly.
- Here are some example parameters you can use to train spectrogram generators:
  - Use a single-speaker dataset
  - Use AMP level O0
  - Trim long silences at the beginning and end
  - beta1 = 0.9
  - beta2 = 0.999
  - batch_size = 48 (per GPU)
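The data guidelines in this answer can be turned into a small pre-flight check. The sketch below is hypothetical: the `Clip` record, the function names, and the 20 second cutoff (taken from the "10 s to 20 s per sample" remark) are assumptions for illustration, and the clip list is made up.

```python
from dataclasses import dataclass

TARGET_RATE_HZ = 22050     # preferred sample rate
MIN_RATE_HZ = 16000        # below this, new data is recommended
MAX_CLIP_SECONDS = 20.0    # assumed upper bound for a single sample
TARGET_HOURS = 20.0        # suggested amount of speech


@dataclass
class Clip:
    path: str
    sample_rate: int
    seconds: float


def rate_advice(sample_rate: int) -> str:
    """Map a sample rate to the action suggested in the guidelines."""
    if sample_rate == TARGET_RATE_HZ:
        return "use as is"
    if sample_rate > TARGET_RATE_HZ:
        return "downsample to 22050 Hz"
    if sample_rate >= MIN_RATE_HZ:
        return "retrain WaveGlow and adjust spectrogram parameters"
    return "obtain new data"


def usable_hours(clips) -> float:
    """Total hours of clips that meet the minimum rate and length limits."""
    kept = [c for c in clips
            if c.sample_rate >= MIN_RATE_HZ and c.seconds <= MAX_CLIP_SECONDS]
    return sum(c.seconds for c in kept) / 3600.0


clips = [
    Clip("a.wav", 44100, 9.0),
    Clip("b.wav", 16000, 12.0),
    Clip("c.wav", 8000, 10.0),
    Clip("d.wav", 22050, 45.0),
]

for clip in clips:
    print(clip.path, rate_advice(clip.sample_rate))
hours = usable_hours(clips)
print(f"{hours:.4f} usable hours, target met: {hours >= TARGET_HOURS}")
```

Running such a check before training makes it easier to see early whether more recordings are needed or whether the vocoder has to be retrained.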
Ensure you are familiar with the following resources for NeMo.
- Domain specific, transfer learning, Docker container with Jupyter Notebooks