Models

This section provides a brief overview of models that NeMo’s audio collection currently supports.

Model Recipes can be accessed through examples/audio.
Configuration Files can be found in examples/audio/conf. For detailed information about configuration files and how they should be structured, please refer to the section NeMo Audio Configuration Files.
Pretrained Model Checkpoints are available for processing audio out of the box or for fine-tuning models on your custom datasets. Please follow the section Checkpoints for instructions on how to use those pretrained models.

Encoder-Mask-Decoder Model

The encoder-mask-decoder model is a general model consisting of an encoder, a mask estimator, a mask processor and a decoder. The encoder transforms the input audio signal into a latent representation. The mask estimator estimates a mask from the latent representation. The mask processor combines the mask with the latent representation to produce a processed latent representation. The decoder transforms the processed latent representation into the output audio signal. The model can be used for tasks such as speech enhancement or speech separation.

The encoder and decoder can be learned or fixed, for example the short-time Fourier transform (STFT) and inverse STFT modules, respectively. The mask estimator can be a neural model, such as the multi-channel mask estimator [AUDIO-3], or a non-neural model, such as guided source separation (GSS) [AUDIO-1]. The mask processor can be either simple masking or a parametric multichannel Wiener filter [AUDIO-3].
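The four-stage pipeline described above can be illustrated with a minimal sketch in plain Python. This is not NeMo code: the function names are hypothetical, a naive single-frame DFT stands in for the STFT encoder, and a simple magnitude threshold stands in for the mask estimator (an assumption made only to keep the example self-contained).

```python
import cmath
import math


def encode(frame):
    """Fixed encoder: naive DFT of one frame (stand-in for an STFT)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def estimate_mask(latent, threshold_ratio=0.5):
    """Hypothetical non-neural mask estimator: keep bins near the peak magnitude."""
    peak = max(abs(z) for z in latent)
    return [1.0 if abs(z) >= threshold_ratio * peak else 0.0 for z in latent]


def process_mask(latent, mask):
    """Mask processor: simple masking (element-wise multiplication)."""
    return [m * z for m, z in zip(mask, latent)]


def decode(latent):
    """Fixed decoder: inverse DFT, keeping the real part."""
    n = len(latent)
    return [
        (sum(z * cmath.exp(2j * math.pi * k * t / n) for k, z in enumerate(latent)) / n).real
        for t in range(n)
    ]


def enhance(frame):
    """Encoder -> mask estimator -> mask processor -> decoder."""
    latent = encode(frame)
    mask = estimate_mask(latent)
    processed = process_mask(latent, mask)
    return decode(processed)


# Toy mixture: a strong target tone plus a weaker interfering tone.
n = 16
target = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
interference = [0.2 * math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
mixture = [a + b for a, b in zip(target, interference)]

output = enhance(mixture)
max_error = max(abs(o - s) for o, s in zip(output, target))
```

In this toy setup the weaker tone falls below the threshold, so the mask removes it and the output matches the target tone. Real mask estimators produce time-frequency masks over many frames and channels.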

Predictive Model

The predictive model is similar to the encoder-mask-decoder model, but the mask estimator and mask processor are replaced by a single neural estimator. The predictive model estimates the latent representation of the target output signal directly from the input audio signal [AUDIO-2, AUDIO-5]. The model can be used for tasks such as speech enhancement or speech separation.
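As a rough illustration of the predictive structure, the sketch below maps the encoded input directly to an estimate of the target, with no mask involved. Everything here is a toy assumption: the encoder and decoder are identity functions, and a single least-squares gain stands in for the neural estimator.

```python
def encode(signal):
    """Identity encoder (assumption: the latent is the signal itself)."""
    return list(signal)


def decode(latent):
    """Identity decoder."""
    return list(latent)


def fit_gain(inputs, targets):
    """Fit a scalar gain w minimizing sum((w * x - s) ** 2) in closed form."""
    numerator = sum(x * s for x, s in zip(inputs, targets))
    denominator = sum(x * x for x in inputs)
    return numerator / denominator


def predict(latent, gain):
    """Toy estimator: predicts the target latent directly from the input latent."""
    return [gain * z for z in latent]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Toy training pair: target signal and its degraded version.
target = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
noise = [0.3, -0.2, 0.4, -0.3, 0.2, -0.4, 0.3, -0.2]
degraded = [s + n for s, n in zip(target, noise)]

gain = fit_gain(encode(degraded), encode(target))
estimate = decode(predict(encode(degraded), gain))

mse_before = mse(degraded, target)
mse_after = mse(estimate, target)
```

Because the gain is the least-squares optimum, the error of the estimate cannot exceed the error of the unprocessed input. A neural estimator plays the same role with a far more expressive mapping.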

Score-Based Generative Model

The score-based generative model is a diffusion-based generative model that estimates the score function of the data distribution [AUDIO-5, AUDIO-6]. The model consists of an encoder and a decoder, a neural score estimator, a stochastic differential equation (SDE) model and a sampler.
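To show how the score estimator, the SDE and the sampler interact, here is a minimal one-dimensional sketch. It assumes a variance-exploding SDE with a geometric noise schedule and replaces the neural score estimator with the analytic score of a single-point data distribution. The constants and function names are illustrative and are not taken from NeMo.

```python
import math
import random

SIGMA_MIN, SIGMA_MAX = 0.01, 0.5  # assumed noise schedule limits


def sigma(t):
    """Geometric noise schedule of a variance-exploding SDE, t in [0, 1]."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def diffusion(t):
    """Diffusion coefficient g(t), chosen so that g(t)^2 = d(sigma^2)/dt."""
    return sigma(t) * math.sqrt(2.0 * math.log(SIGMA_MAX / SIGMA_MIN))


def toy_score(x, t, clean):
    """Analytic score of N(clean, sigma(t)^2); stands in for a neural estimator."""
    return -(x - clean) / sigma(t) ** 2


def reverse_sde_sample(clean, num_steps=1000, seed=0):
    """Euler-Maruyama integration of the reverse-time SDE from t=1 down to t=0."""
    rng = random.Random(seed)
    dt = 1.0 / num_steps
    x = rng.gauss(0.0, SIGMA_MAX)  # start from the (approximate) prior
    for k in range(num_steps, 0, -1):
        t = k * dt
        g = diffusion(t)
        drift = g ** 2 * toy_score(x, t, clean)
        x = x + drift * dt + g * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x


clean_value = 0.3
sample = reverse_sde_sample(clean_value)
```

The sampler ends close to `clean_value`, with a residual spread on the order of `SIGMA_MIN`. In a speech enhancement setting, the score estimator would additionally be conditioned on the degraded signal.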

Schrödinger Bridge Model

The Schrödinger bridge model is a generative model that uses a data-to-data process to transform the input (degraded) audio signal into the target (clean) audio signal [AUDIO-2]. The model consists of an encoder and a decoder, a neural estimator, a noise schedule and a sampler.
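The idea of a data-to-data process can be sketched with a Brownian bridge, which is pinned to the clean signal at one end and to the degraded signal at the other. This is a simplified stand-in and not the noise schedule from [AUDIO-2]: the linear mean, the `noise_scale` parameter and the function name are assumptions for illustration.

```python
import math
import random


def bridge_sample(clean, degraded, t, noise_scale=0.1, seed=0):
    """Sample a Brownian bridge state at time t in [0, 1].

    t = 0 gives the clean signal, t = 1 gives the degraded signal, and the
    noise variance noise_scale^2 * t * (1 - t) vanishes at both endpoints.
    """
    rng = random.Random(seed)
    std = noise_scale * math.sqrt(t * (1.0 - t))
    return [
        (1.0 - t) * x + t * y + std * rng.gauss(0.0, 1.0)
        for x, y in zip(clean, degraded)
    ]


clean = [0.0, 0.5, -0.5, 0.25]
degraded = [0.1, 0.3, -0.7, 0.4]

start = bridge_sample(clean, degraded, 0.0)   # equals the clean signal
middle = bridge_sample(clean, degraded, 0.5)  # noisy mixture of both ends
end = bridge_sample(clean, degraded, 1.0)     # equals the degraded signal
```

During training, intermediate states like `middle` would be fed to the neural estimator. At inference, the sampler starts from the degraded signal and moves along the process toward the clean end.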

Flow Matching Model

The flow matching model is a generative model that uses a noise-to-data process to transform the input (degraded) audio signal into the target (clean) audio signal [AUDIO-4]. The model consists of an encoder and a decoder, a neural estimator, a flow model and a sampler.
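A minimal sketch of the flow model and sampler is given below, assuming the common linear interpolation path between a noise sample and a data sample. The analytic conditional velocity field stands in for the neural estimator, and all names are hypothetical rather than NeMo APIs.

```python
import random


def interpolate(noise, data, t):
    """Linear flow path: x_t = (1 - t) * noise + t * data."""
    return [(1.0 - t) * z + t * x for z, x in zip(noise, data)]


def toy_velocity(x_t, t, data):
    """Analytic conditional velocity (data - x_t) / (1 - t).

    Stands in for the neural estimator, which would be trained to regress
    the path velocity (data - noise).
    """
    return [(x - v) / (1.0 - t) for x, v in zip(data, x_t)]


def euler_sampler(noise, data, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    dt = 1.0 / num_steps
    x = list(noise)
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the velocity is well defined
        v = toy_velocity(x, t, data)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
data = [0.0, 0.5, -0.5, 0.25]
noise = [rng.gauss(0.0, 1.0) for _ in data]

halfway = interpolate(noise, data, 0.5)
generated = euler_sampler(noise, data)
flow_error = max(abs(g - x) for g, x in zip(generated, data))
```

With the exact velocity field, the Euler sampler lands on the data sample after the final step. A trained estimator only approximates this field and, for enhancement, is conditioned on the degraded input.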

References

[AUDIO-1]

Nobutaka Ito, Shoko Araki, and Tomohiro Nakatani. Complex angular central Gaussian mixture model for directional statistics in mask-based microphone array signal processing. In Proc. EUSIPCO. 2016.

[AUDIO-2]

Ante Jukić, Roman Korostik, Jagadeesh Balam, and Boris Ginsburg. Schrödinger bridge for generative speech enhancement. In Proc. Interspeech, 1175–1179. 2024.

[AUDIO-3]

Ante Jukić, Jagadeesh Balam, and Boris Ginsburg. Flexible multichannel speech enhancement for noise-robust frontend. In Proc. WASPAA. 2023.

[AUDIO-4]

Pin-Jui Ku, Alexander H. Liu, Roman Korostik, Sung-Feng Huang, Szu-Wei Fu, and Ante Jukić. Generative speech foundation model pretraining for high-quality speech extraction and restoration. arXiv preprint arXiv:2409.16117, 2024.

[AUDIO-5]

Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, and Timo Gerkmann. Speech enhancement and dereverberation with diffusion-based generative models. IEEE/ACM Trans. on Audio, Speech, and Language Process., 31:2351–2364, 2023.

[AUDIO-6]

Simon Welker, Julius Richter, and Timo Gerkmann. Speech enhancement with score-based generative models in the complex STFT domain. In Proc. Interspeech, 2928–2932. 2022.