Speech Self-Supervised Learning

Self-Supervised Learning (SSL) refers to the problem of learning without explicit labels. As any learning process requires feedback, without explicit labels, SSL derives supervisory signals from the data itself. The general idea of SSL is to predict any hidden part (or property) of the input from the observed part of the input (e.g., filling in the blanks in a sentence or predicting whether an image is upright or inverted).
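As a minimal sketch of how a supervisory signal can be derived from the data itself, the snippet below hides a contiguous span of a toy feature sequence and keeps the hidden values as prediction targets. The function name `mask_span`, the `mask_value` placeholder, and the toy `frames` list are assumptions made for illustration; this is not NeMo's masking implementation.

```python
import random


def mask_span(frames, span_len, mask_value=0.0, rng=None):
    """Hide one contiguous span of `frames` and return (observed, targets).

    `observed` is a copy of the input with the span replaced by `mask_value`.
    `targets` maps each hidden index to its original value, which serves as
    the label the model would be asked to predict.
    """
    rng = rng or random.Random()
    if not 0 < span_len <= len(frames):
        raise ValueError("span_len must be between 1 and len(frames)")
    # Pick a start position so that the whole span fits inside the sequence.
    start = rng.randrange(len(frames) - span_len + 1)
    observed = list(frames)
    targets = {}
    for i in range(start, start + span_len):
        targets[i] = frames[i]
        observed[i] = mask_value
    return observed, targets


# Toy "audio features": one scalar per frame (an assumption for brevity).
frames = [0.1, 0.4, 0.35, 0.8, 0.6, 0.2]
observed, targets = mask_span(frames, span_len=2, rng=random.Random(0))
```

A model would receive `observed` as input and be trained to recover the values stored in `targets`, so no human-provided labels are involved.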

SSL for speech/audio understanding broadly falls into either contrastive or reconstruction-based approaches. In contrastive methods, models learn by distinguishing between true and distractor tokens (or latents). Examples of contrastive approaches include Contrastive Predictive Coding (CPC) and Masked Language Modeling (MLM). In reconstruction methods, models learn by directly estimating the missing (intentionally left out) portions of the input. Masked Reconstruction and Autoregressive Predictive Coding (APC) are a few examples.
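The two families can be contrasted with a small, hedged sketch. The contrastive loss below is an InfoNCE-style objective that scores a prediction against the true latent and a set of distractors; the reconstruction loss is a mean absolute error over the hidden frames. The dot-product similarity, the `temperature` value, the choice of mean absolute error, and all function names are assumptions chosen for illustration, not the losses used by any particular NeMo model.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(prediction, true_latent, distractors, temperature=0.1):
    """InfoNCE-style loss: negative log-probability assigned to the true latent.

    The true latent is placed at index 0 of the candidate list, and the loss
    is the cross-entropy of a softmax over similarity scores.
    """
    candidates = [true_latent] + list(distractors)
    logits = [dot(prediction, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


def reconstruction_loss(predicted_frames, hidden_frames):
    """Mean absolute error between predicted and original hidden frames."""
    total = sum(abs(p - t) for p, t in zip(predicted_frames, hidden_frames))
    return total / len(hidden_frames)


# Toy 2-dimensional latents (hypothetical values).
prediction = [1.0, 0.0]
true_latent = [0.9, 0.1]
distractors = [[-1.0, 0.2], [0.0, 1.0]]

# Low loss when the prediction is close to the true latent ...
good = contrastive_loss(prediction, true_latent, distractors)
# ... and high loss when a distractor is treated as the "true" target instead.
bad = contrastive_loss(prediction, distractors[0], [true_latent, distractors[1]])
```

The contrastive objective only needs the model to rank the true latent above the distractors, whereas the reconstruction objective requires the model to estimate the hidden values themselves.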

In the recent past, SSL has been a major contributor to improving Acoustic Modeling (AM), i.e., the encoder module of neural ASR models. Here too, the majority of the SSL effort is focused on improving AM. While AM is the usual focus of SSL in ASR, SSL can also be used to improve other parts of ASR models (e.g., the predictor module in transducer-based ASR models).

In NeMo, we provide two types of SSL models: Wav2Vec-BERT and NEST. The training scripts for them can be found at https://github.com/NVIDIA/NeMo/tree/main/examples/asr/speech_pretraining.


Resources and Documentation

Refer to the SSL-for-ASR notebook for a hands-on tutorial. If you are new to NeMo, consider trying out the ASR with NeMo tutorial. This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab.

If you are looking for information about a particular ASR model, or would like to find out more about the model architectures available in the nemo_asr collection, refer to the ASR Models page.

NeMo includes preprocessing scripts for several common ASR datasets. The ASR Datasets page contains instructions on running those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data.

Information about how to load model checkpoints (either local files or pretrained ones from NGC), as well as a list of the checkpoints available on NGC, is located on the Checkpoints page.

Documentation regarding the configuration files specific to SSL can be found on the Configuration Files page.