Important
You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.
Models#
This section provides a brief overview of models that NeMo’s audio collection currently supports.
Model Recipes can be accessed through examples/audio.
Configuration Files can be found in the examples/audio/conf directory. For detailed information about configuration files and how they should be structured, please refer to the NeMo Audio Configuration Files section.
Pretrained Model Checkpoints are available to all users, either for processing audio right away or for fine-tuning on custom datasets. Please see the Checkpoints section for instructions on how to use these pretrained models.
Encoder-Mask-Decoder Model#
The encoder-mask-decoder model is a general model consisting of an encoder, a mask estimator, a mask processor, and a decoder. The encoder processes the input audio signal and produces a latent representation. The mask estimator estimates a mask from the latent representation. The mask processor combines the mask with the latent representation to produce a processed latent representation, which the decoder converts into the output audio signal. The model can be used for tasks such as speech enhancement or speech separation. The encoder and decoder can be either learned or fixed, for example the short-time Fourier transform (STFT) and inverse STFT modules, respectively. The mask estimator can be a neural model, such as a multichannel mask estimator, or a non-neural model, such as guided source separation (GSS). The mask processor can be either simple masking or a parametric multichannel Wiener filter.
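The four stages described above can be sketched in a few lines of NumPy. This is only an illustration of the data flow, not NeMo code: the function names, the STFT parameters, and the median-based Wiener-like gain used as a mask estimator are assumptions chosen for the example, and the mask processor is plain element-wise masking.

```python
import numpy as np

N_FFT, HOP = 256, 128                # hypothetical STFT settings
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window


def encode(x):
    """Fixed encoder: STFT of a mono signal, shape (frames, bins)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] for i in range(n_frames)])
    return np.fft.rfft(frames * WINDOW, axis=-1)


def estimate_mask(latent):
    """Non-neural stand-in for the mask estimator: a Wiener-like gain in [0, 1)."""
    power = np.abs(latent) ** 2
    noise_floor = np.median(power, axis=0, keepdims=True)  # crude per-bin noise estimate
    return power / (power + noise_floor + 1e-12)


def apply_mask(latent, mask):
    """Mask processor: simple element-wise masking of the latent representation."""
    return latent * mask


def decode(latent):
    """Fixed decoder: inverse STFT via windowed overlap-add."""
    frames = np.fft.irfft(latent, n=N_FFT, axis=-1) * WINDOW
    length = (len(frames) - 1) * HOP + N_FFT
    out, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + N_FFT] += frame
        norm[i * HOP : i * HOP + N_FFT] += WINDOW ** 2
    return out / np.maximum(norm, 1e-8)


# Toy input: a 440 Hz tone in white noise, one second at 16 kHz.
rng = np.random.default_rng(0)
time_axis = np.arange(16000) / 16000
noisy = np.sin(2 * np.pi * 440 * time_axis) + 0.3 * rng.standard_normal(16000)

latent = encode(noisy)                       # encoder
mask = estimate_mask(latent)                 # mask estimator
enhanced = decode(apply_mask(latent, mask))  # mask processor + decoder
```

Swapping `estimate_mask` for a trained network, or `apply_mask` for a multichannel filter, would change the individual stages but not the overall encoder, mask estimator, mask processor, decoder sequence.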
Predictive Model#
The predictive model is similar to the encoder-mask-decoder model, but the mask estimator and mask processor are replaced by a neural estimator. The predictive model estimates the latent representation of the target output signal from the input audio signal. The model can be used for tasks such as speech enhancement or speech separation.
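To make the contrast with masking concrete, the sketch below maps the input latent directly to an estimate of the target latent. It is a toy under stated assumptions: the frame-based FFT encoder and decoder, the function names, and the single linear layer standing in for the neural estimator are all hypothetical, and the identity weights are an untrained placeholder.

```python
import numpy as np

FRAME = 256  # hypothetical frame length


def encoder(x):
    """Fixed encoder: non-overlapping frames followed by a real FFT."""
    frames = x[: len(x) // FRAME * FRAME].reshape(-1, FRAME)
    return np.fft.rfft(frames, axis=-1)


def estimator(latent, weights):
    """Stand-in for the neural estimator: predicts the target latent directly.

    Real and imaginary parts are stacked into one feature vector per frame,
    passed through a single linear layer, and split back into a complex latent.
    """
    bins = latent.shape[-1]
    features = np.concatenate([latent.real, latent.imag], axis=-1)
    out = features @ weights
    return out[:, :bins] + 1j * out[:, bins:]


def decoder(latent):
    """Fixed decoder: inverse real FFT, frames concatenated back into a signal."""
    return np.fft.irfft(latent, n=FRAME, axis=-1).reshape(-1)


rng = np.random.default_rng(0)
x = rng.standard_normal(4096)

latent = encoder(x)
weights = np.eye(2 * latent.shape[-1])  # untrained placeholder: identity mapping
y = decoder(estimator(latent, weights))
```

No mask appears anywhere in this pipeline: the estimator output is itself the latent that the decoder turns back into audio.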
Score-Based Generative Model#
The score-based generative model is a diffusion-based generative model that estimates the score function of the data distribution, that is, the gradient of its log-density. The model consists of an encoder and decoder, a neural score estimator, a stochastic differential equation (SDE) model, and a sampler.
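How the score estimator, the SDE, and the sampler fit together can be seen on a one-dimensional toy problem. The sketch assumes Gaussian data and a forward SDE dx = sigma dW, for which the score is known in closed form and can stand in for the neural score estimator. The function names, the choice of SDE, and the Euler-Maruyama sampler are assumptions made for illustration, not the configuration NeMo uses.

```python
import numpy as np


def score(x, t, mean, data_std, sigma):
    """Closed-form stand-in for the neural score estimator.

    For data ~ N(mean, data_std^2) and the forward SDE dx = sigma dW, the
    marginal at time t is N(mean, data_std^2 + sigma^2 * t).
    """
    return -(x - mean) / (data_std**2 + sigma**2 * t)


def reverse_sde_sampler(x_T, mean, data_std, sigma, rng, T=1.0, num_steps=200):
    """Euler-Maruyama discretization of the reverse-time SDE, from t=T to t=0."""
    dt = T / num_steps
    x = x_T.copy()
    for k in range(num_steps):
        t = T - k * dt
        drift = sigma**2 * score(x, t, mean, data_std, sigma)  # g(t)^2 * score
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x


rng = np.random.default_rng(0)
mean, data_std, sigma, T = 2.0, 0.5, 1.0, 1.0

# Start from the terminal marginal of the forward process.
x_T = mean + np.sqrt(data_std**2 + sigma**2 * T) * rng.standard_normal(5000)
samples = reverse_sde_sampler(x_T, mean, data_std, sigma, rng, T=T)
```

With the exact score, the samples should approximately follow the data distribution again, up to discretization error from the finite number of steps.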
Schrödinger Bridge Model#
The Schrödinger bridge model is a generative model using a data-to-data process to transform the input (degraded) audio signal into the target (clean) audio signal. The model consists of an encoder and decoder, a neural estimator, a noise schedule, and a sampler.
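The idea of a data-to-data process can be illustrated with a Brownian bridge pinned at the clean signal at t=0 and at the degraded signal at t=1. This is a simplified stand-in, not the noise schedule or sampler used in NeMo: the constant noise scale `sigma`, the function names, and the oracle that plays the role of the neural estimator are all assumptions made for the example.

```python
import numpy as np


def bridge_marginal(x_clean, x_degraded, t, sigma, rng):
    """Sample x_t from a Brownian bridge pinned at x_clean (t=0) and x_degraded (t=1)."""
    mean = (1.0 - t) * x_clean + t * x_degraded
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(x_clean.shape)


def bridge_sampler(x_degraded, estimate_clean, sigma, rng, num_steps=50):
    """Walk from the degraded signal (t=1) towards an estimate of the clean one (t=0)."""
    times = np.linspace(1.0, 0.0, num_steps + 1)
    x = x_degraded.copy()
    for t, s in zip(times[:-1], times[1:]):
        x0_hat = estimate_clean(x, t)  # stand-in for the neural estimator
        # Gaussian posterior of x_s given x_t and x_0 for Brownian motion with scale sigma.
        mean = x0_hat + (s / t) * (x - x0_hat)
        std = sigma * np.sqrt(s * (t - s) / t)
        x = mean + std * rng.standard_normal(x.shape)
    return x


rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))
degraded = clean + 0.5 * rng.standard_normal(64)

# Oracle estimator for the demo only; a trained network would predict this from (x, t).
oracle = lambda x, t: clean
restored = bridge_sampler(degraded, oracle, sigma=0.3, rng=rng)
```

Unlike the noise-to-data models, the sampler here starts from the degraded signal itself rather than from pure noise.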
Flow Matching Model#
The flow matching model is a generative model using a noise-to-data process to transform the input (degraded) audio signal into the target (clean) audio signal. The model consists of an encoder and decoder, a neural estimator, a flow model, and a sampler.
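A minimal sketch of the noise-to-data idea, assuming a straight-line path between a noise sample and a data sample and a simple Euler sampler. The function names and the oracle velocity used in place of the neural estimator are hypothetical, and how the degraded input enters the estimator is not covered here.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight-line path; a typical regression target in training."""
    return x1 - x0


def euler_sampler(x0, velocity, num_steps=20):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    dt = 1.0 / num_steps
    x = x0.copy()
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x


rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))
noise = rng.standard_normal(64)

# Oracle conditional velocity for the demo only; a trained estimator would replace it.
oracle_velocity = lambda x, t: (clean - x) / (1.0 - t)
generated = euler_sampler(noise, oracle_velocity)
```

During training, a network would be fit to `target_velocity(x0, x1)` at points produced by `interpolate`; at inference only the sampler and the learned velocity are needed.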