NeMo Audio API#

Model Classes#

Base Classes#

class nemo.collections.audio.models.AudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: ModelPT, ABC

Base class for audio-to-audio models.

Parameters:
  • cfg – A DictConfig object with the configuration parameters.

  • trainer – A Trainer object to be used for training.

configure_callbacks()#

Create a callback to add audio/spectrogram examples to TensorBoard and W&B.

classmethod list_available_models() List[PretrainedModelInfo]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns:

List of available pre-trained models.

static match_batch_length(
input: torch.Tensor,
batch_length: int,
) torch.Tensor#

Trim or pad the output to match the batch length.

Parameters:
  • input – tensor with shape (B, C, T)

  • batch_length – int

Returns:

Tensor with shape (B, C, T), where T matches the batch length.
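
As an illustration, the trim-or-pad behavior can be sketched as follows (a generic sketch, not the exact NeMo implementation):

import torch
import torch.nn.functional as F

def match_batch_length_sketch(input: torch.Tensor, batch_length: int) -> torch.Tensor:
    # Trim the time dimension if it is too long, otherwise zero-pad at the end
    diff = batch_length - input.size(-1)
    if diff < 0:
        return input[..., :batch_length]
    return F.pad(input, (0, diff))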

multi_test_epoch_end(
outputs,
dataloader_idx: int = 0,
)#

Adds support for multiple test datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters:
  • outputs – Same as that provided by LightningModule.on_validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns:

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be pre-pended by the dataloader prefix.

multi_validation_epoch_end(
outputs,
dataloader_idx: int = 0,
)#

Adds support for multiple validation datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters:
  • outputs – Same as that provided by LightningModule.on_validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns:

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be pre-pended by the dataloader prefix.

on_after_backward()#

Zero out gradients that contain any NaN or Inf values.

process(
paths2audio_files: List[str],
output_dir: str,
batch_size: int = 1,
num_workers: int | None = None,
input_channel_selector: int | Iterable[int] | str | None = None,
input_dir: str | None = None,
) List[str]#

Takes paths to audio files and returns a list of paths to the processed audio files.

Parameters:
  • paths2audio_files – paths to audio files to be processed

  • output_dir – directory to save the processed files

  • batch_size – (int) batch size to use during inference.

  • num_workers – Number of workers for the dataloader

  • input_channel_selector (int | Iterable[int] | str) – select a single channel or a subset of channels from multi-channel audio. If set to ‘average’, it performs averaging across channels. Disabled if set to None. Defaults to None.

  • input_dir – Optional, directory that contains the input files. If provided, the output directory will mirror the input directory structure.

Returns:

Paths to processed audio signals.
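
For example, a typical offline inference call might look like the following sketch; the checkpoint path and audio file names are placeholders:

from nemo.collections.audio.models import EncMaskDecAudioToAudioModel

# Restore a trained enhancement model from a .nemo checkpoint (path is a placeholder)
model = EncMaskDecAudioToAudioModel.restore_from("enhancement_model.nemo")
model.eval()

# Process a list of noisy recordings and save the enhanced files to output_dir
processed_paths = model.process(
    paths2audio_files=["noisy_001.wav", "noisy_002.wav"],
    output_dir="enhanced/",
    batch_size=4,
    input_channel_selector="average",  # average across input channels
)
print(processed_paths)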

setup_optimization_flags()#

Utility method that must be explicitly called by the subclass in order to support optional optimization flags. This method is the only valid place to access self.cfg before DDP training occurs.

The subclass may choose not to support this method; therefore, all variables here must be checked via hasattr().

Processing Models#

class nemo.collections.audio.models.EncMaskDecAudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: AudioToAudioModel

Class for encoder-mask-decoder audio processing models.

The model consists of the following blocks:
  • encoder: transforms input multi-channel audio signal into an encoded representation (analysis transform)

  • mask_estimator: estimates a mask used by signal processor

  • mask_processor: mask-based signal processor, combines the encoded input and the estimated mask

  • decoder: transforms processor output into the time domain (synthesis transform)
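
A hypothetical configuration sketch of these four blocks, expressed with OmegaConf and modules documented later on this page, could look as follows; the exact keys and defaults depend on the model's config schema and are not validated here:

from omegaconf import OmegaConf

# Hypothetical block configuration; key names mirror the blocks listed above
cfg_blocks = OmegaConf.create(
    {
        "encoder": {"_target_": "nemo.collections.audio.modules.transforms.AudioToSpectrogram",
                    "fft_length": 512, "hop_length": 128},
        "mask_estimator": {"_target_": "nemo.collections.audio.modules.masking.MaskEstimatorRNN",
                           "num_outputs": 1, "num_subbands": 257},
        "mask_processor": {"_target_": "nemo.collections.audio.modules.masking.MaskReferenceChannel",
                           "ref_channel": 0},
        "decoder": {"_target_": "nemo.collections.audio.modules.transforms.SpectrogramToAudio",
                    "fft_length": 512, "hop_length": 128},
    }
)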

forward(input_signal, input_length=None)#

Forward pass of the model.

Parameters:
  • input_signal – Tensor that represents a batch of raw audio signals, of shape [B, T] or [B, T, C]. T here represents timesteps, with 1 second of audio represented as self.sample_rate number of floating point values.

  • input_length – Vector of length B that contains the individual lengths of the audio sequences.

Returns:

Output signal output in the time domain and the length of the output signal output_length.

property input_types: Dict[str, NeuralType]#

Define these to enable input neural type checks

classmethod list_available_models() PretrainedModelInfo | None#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns:

List of available pre-trained models.

property output_types: Dict[str, NeuralType]#

Define these to enable output neural type checks

class nemo.collections.audio.models.FlowMatchingAudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: AudioToAudioModel

This model uses a flow matching process to generate an encoded representation of the enhanced signal.

The model consists of the following blocks:
  • encoder: transforms input multi-channel audio signal into an encoded representation (analysis transform)

  • estimator: neural model, estimates a score for the diffusion process

  • flow: ordinary differential equation (ODE) defining a flow and a vector field.

  • sampler: sampler for the inference process, estimates coefficients of the target signal

  • decoder: transforms sampler output into the time domain (synthesis transform)

  • ssl_pretrain_masking: if defined, applies SSL pretraining masking for self-reconstruction during training

forward_internal(
input_signal,
input_length=None,
enable_ssl_masking=False,
)#

Internal forward pass of the model.

Parameters:
  • input_signal – Tensor that represents a batch of raw audio signals, of shape [B, T] or [B, T, C]. T here represents timesteps, with 1 second of audio represented as self.sample_rate number of floating point values.

  • input_length – Vector of length B that contains the individual lengths of the audio sequences.

  • enable_ssl_masking – Whether to enable SSL masking of the input. If using SSL pretraining, masking is applied to the input signal. If not using SSL pretraining, masking is not applied.

Returns:

Output signal output in the time domain and the length of the output signal output_length.

property input_types: Dict[str, NeuralType]#

Define these to enable input neural type checks

property output_types: Dict[str, NeuralType]#

Define these to enable output neural type checks

class nemo.collections.audio.models.PredictiveAudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: AudioToAudioModel

This model aims to directly estimate the coefficients in the encoded domain by applying a neural model.

forward(input_signal, input_length=None)#

Forward pass of the model.

Parameters:
  • input_signal – time-domain signal

  • input_length – valid length of each example in the batch

Returns:

Output signal output in the time domain and the length of the output signal output_length.

property input_types: Dict[str, NeuralType]#

Define these to enable input neural type checks

property output_types: Dict[str, NeuralType]#

Define these to enable output neural type checks

class nemo.collections.audio.models.ScoreBasedGenerativeAudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: AudioToAudioModel

This model uses a score-based diffusion process to generate an encoded representation of the enhanced signal.

The model consists of the following blocks:
  • encoder: transforms input multi-channel audio signal into an encoded representation (analysis transform)

  • estimator: neural model, estimates a score for the diffusion process

  • sde: stochastic differential equation (SDE) defining the forward and reverse diffusion process

  • sampler: sampler for the reverse diffusion process, estimates coefficients of the target signal

  • decoder: transforms sampler output into the time domain (synthesis transform)

property input_types: Dict[str, NeuralType]#

Define these to enable input neural type checks

property output_types: Dict[str, NeuralType]#

Define these to enable output neural type checks

class nemo.collections.audio.models.SchroedingerBridgeAudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: AudioToAudioModel

This model uses a Schrödinger Bridge process to generate an encoded representation of the enhanced signal.

The model consists of the following blocks:
  • encoder: transforms input audio signal into an encoded representation (analysis transform)

  • estimator: neural model, estimates the coefficients for the SB process

  • noise_schedule: defines the path between the clean and noisy signals

  • sampler: sampler for the reverse process, estimates coefficients of the target signal

  • decoder: transforms sampler output into the time domain (synthesis transform)

References

Schrödinger Bridge for Generative Speech Enhancement, https://arxiv.org/abs/2407.16074

property input_types: Dict[str, NeuralType]#

Define these to enable input neural type checks

property output_types: Dict[str, NeuralType]#

Define these to enable output neural type checks

Modules#

Features#

class nemo.collections.audio.modules.features.SpectrogramToMultichannelFeatures(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Convert a complex-valued multi-channel spectrogram to multichannel features.

Parameters:
  • num_subbands – Expected number of subbands in the input signal

  • num_input_channels – Optional, provides the number of channels of the input signal. Used to infer the number of output channels.

  • mag_reduction – Reduction across channels. Default None, will calculate magnitude of each channel.

  • mag_power – Optional, apply power on the magnitude.

  • use_ipd – Use inter-channel phase difference (IPD).

  • mag_normalization – Normalization for magnitude features

  • ipd_normalization – Normalization for IPD features

  • eps – Small regularization constant.

forward(
input: torch.Tensor,
input_length: torch.Tensor,
) torch.Tensor#

Convert input batch of C-channel spectrograms into a batch of time-frequency features with dimension num_feat. The output number of channels may be the same as input, or reduced to 1, e.g., if averaging over magnitude and not appending individual IPDs.

Parameters:
  • input – Spectrogram for C channels with F subbands and N time frames, (B, C, F, N)

  • input_length – Length of valid entries along the time dimension, shape (B,)

Returns:

num_feat_channels channels with num_feat features, shape (B, num_feat_channels, num_feat, N)
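
A minimal shape-oriented sketch of using this module; the constructor values and random data are illustrative only:

import torch
from nemo.collections.audio.modules.features import SpectrogramToMultichannelFeatures

feats = SpectrogramToMultichannelFeatures(
    num_subbands=257,
    num_input_channels=4,
    mag_reduction=None,  # keep per-channel magnitudes
    use_ipd=True,        # append inter-channel phase differences
)

spec = torch.randn(2, 4, 257, 100, dtype=torch.cfloat)  # (B, C, F, N) complex spectrogram
spec_len = torch.tensor([100, 80])                       # valid frames per example

out = feats(input=spec, input_length=spec_len)           # (B, num_feat_channels, num_feat, N)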

classmethod get_mean_std_time_channel(
input: torch.Tensor,
input_length: torch.Tensor | None = None,
eps: float = 1e-10,
) torch.Tensor#

Calculate mean and standard deviation across time and channel dimensions.

Parameters:
  • input – tensor with shape (B, C, F, T)

  • input_length – tensor with shape (B,)

Returns:

Mean and standard deviation of the input calculated across time and channel dimension, each with shape (B, 1, F, 1).

static get_mean_time_channel(
input: torch.Tensor,
input_length: torch.Tensor | None = None,
) torch.Tensor#

Calculate mean across time and channel dimensions.

Parameters:
  • input – tensor with shape (B, C, F, T)

  • input_length – tensor with shape (B,)

Returns:

Mean of input calculated across time and channel dimension with shape (B, 1, F, 1)

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

normalize_mean(
input: torch.Tensor,
input_length: torch.Tensor,
) torch.Tensor#

Mean normalization for the input tensor.

Parameters:
  • input – input tensor

  • input_length – valid length for each example

Returns:

Mean normalized input.

normalize_mean_var(
input: torch.Tensor,
input_length: torch.Tensor,
) torch.Tensor#

Mean and variance normalization for the input tensor.

Parameters:
  • input – input tensor

  • input_length – valid length for each example

Returns:

Mean and variance normalized input.

property num_channels: int#

Configured number of channels

property num_features: int#

Configured number of features

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

Masking#

class nemo.collections.audio.modules.masking.MaskEstimatorRNN(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Estimate num_outputs masks from the input spectrogram using stacked RNNs and projections.

The module is structured as follows:
input -> spatial features -> input projection -> stacked RNNs -> output projection for each output -> sigmoid

Reference:

Multi-microphone neural speech separation for far-field multi-talker speech recognition (https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8462081)

Parameters:
  • num_outputs – Number of output masks to estimate

  • num_subbands – Number of subbands of the input spectrogram

  • num_features – Number of features after the input projections

  • num_layers – Number of RNN layers

  • num_hidden_features – Number of hidden features in RNN layers

  • num_input_channels – Number of input channels

  • dropout – If non-zero, introduces dropout on the outputs of each RNN layer except the last layer, with dropout probability equal to dropout. Default: 0

  • bidirectional – If True, use bidirectional RNN.

  • rnn_type – Type of RNN, either lstm or gru. Default: lstm

  • mag_reduction – Channel-wise reduction for magnitude features

  • use_ipd – Use inter-channel phase difference (IPD) features

forward(
input: torch.Tensor,
input_length: torch.Tensor,
) Tuple[torch.Tensor, torch.Tensor]#

Estimate num_outputs masks from the input spectrogram.

Parameters:
  • input – C-channel input, shape (B, C, F, N)

  • input_length – Length of valid entries along the time dimension, shape (B,)

Returns:

Returns num_outputs masks in a tensor, shape (B, num_outputs, F, N), and output length with shape (B,)

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.modules.masking.MaskEstimatorFlexChannels(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Estimate num_outputs masks from the input spectrogram using stacked channel-wise and temporal layers.

This model uses interleaved channel blocks and temporal blocks and can process an arbitrary number of input channels. The default channel block is the transform-average-concatenate layer and the default temporal block is the Conformer encoder. Reduction from a multichannel signal to a single-channel signal is performed after channel_reduction_position blocks; only temporal blocks are used afterwards. After the sequence of blocks, the output mask is computed using an additional output temporal layer and a nonlinearity.

References

  • Yoshioka et al, VarArray: Array-Geometry-Agnostic Continuous Speech Separation, 2022

  • Jukić et al, Flexible multichannel speech enhancement for noise-robust frontend, 2023

Parameters:
  • num_outputs – Number of output masks.

  • num_subbands – Number of subbands on the input spectrogram.

  • num_blocks – Number of blocks in the model.

  • channel_reduction_position – After this block, the signal will be reduced across channels.

  • channel_reduction_type – Reduction across channels: ‘average’ or ‘attention’

  • channel_block_type – Block for channel processing: ‘transform_average_concatenate’ or ‘transform_attend_concatenate’

  • temporal_block_type – Block for temporal processing: ‘conformer_encoder’

  • temporal_block_num_layers – Number of layers for the temporal block

  • temporal_block_num_heads – Number of heads for the temporal block

  • temporal_block_dimension – The hidden size of the model

  • temporal_block_self_attention_model – Self attention model for the temporal block

  • temporal_block_att_context_size – Attention context size for the temporal block

  • mag_reduction – Channel-wise reduction for magnitude features

  • mag_power – Power to apply on magnitude features

  • use_ipd – Use inter-channel phase difference (IPD) features

  • mag_normalization – Normalize using mean (‘mean’) or mean and variance (‘mean_var’)

  • ipd_normalization – Normalize using mean (‘mean’) or mean and variance (‘mean_var’)

forward(
input: torch.Tensor,
input_length: torch.Tensor,
) Tuple[torch.Tensor, torch.Tensor]#

Estimate num_outputs masks from the input spectrogram.

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.modules.masking.MaskEstimatorGSS(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Estimate masks using guided source separation with a complex angular Central Gaussian Mixture Model (cACGMM) [1].

This module corresponds to GSS in Fig. 2 in [2].

Notation is approximately following [1], where gamma denotes the time-frequency mask, alpha denotes the mixture weights, and BM denotes the shape matrix. Additionally, the provided source activity is denoted as activity.

Parameters:
  • num_iterations – Number of iterations for the EM algorithm

  • eps – Small value for regularization

  • dtype – Data type for internal computations (default torch.cdouble)

References

[1] Ito et al., Complex Angular Central Gaussian Mixture Model for Directional Statistics in Mask-Based Microphone Array Signal Processing, 2016

[2] Boeddeker et al., Front-End Processing for the CHiME-5 Dinner Party Scenario, 2018

forward(
input: torch.Tensor,
activity: torch.Tensor,
) torch.Tensor#

Apply GSS to estimate the time-frequency masks for each output source.

Parameters:
  • input – batched C-channel input signal, shape (B, num_inputs, F, T)

  • activity – batched frame-wise activity for each output source, shape (B, num_outputs, T)

Returns:

Masks for the components of the model, shape (B, num_outputs, F, T)
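
A small usage sketch with random data, only to illustrate the expected shapes; the number of iterations and the activity pattern are illustrative:

import torch
from nemo.collections.audio.modules.masking import MaskEstimatorGSS

gss = MaskEstimatorGSS(num_iterations=3)

spec = torch.randn(1, 4, 257, 200, dtype=torch.cdouble)  # (B, num_inputs, F, T)
activity = torch.zeros(1, 2, 200)                         # (B, num_outputs, T) frame-wise activity
activity[0, 0, :120] = 1.0                                # source 1 active in the first 120 frames
activity[0, 1, 80:] = 1.0                                 # source 2 active in the last 120 frames

masks = gss(input=spec, activity=activity)                # (B, num_outputs, F, T)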

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

normalize(
x: torch.Tensor,
dim: int = 1,
) torch.Tensor#

Normalize input to have a unit L2-norm across dim. By default, normalizes across the input channels.

Parameters:
  • x – C-channel input signal, shape (B, C, F, T)

  • dim – Dimension for normalization; defaults to 1, i.e., the channel dimension for input with shape (B, C, F, T)

Returns:

Normalized signal, shape (B, C, F, T)

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

update_masks(
alpha: torch.Tensor,
activity: torch.Tensor,
log_pdf: torch.Tensor,
) torch.Tensor#

Update masks for the cACGMM.

Parameters:
  • alpha – component weights, shape (B, num_outputs, F)

  • activity – temporal activity for the components, shape (B, num_outputs, T)

  • log_pdf – logarithm of the PDF, shape (B, num_outputs, F, T)

Returns:

Masks for the components of the model, shape (B, num_outputs, F, T)

update_pdf(
z: torch.Tensor,
gamma: torch.Tensor,
zH_invBM_z: torch.Tensor,
) Tuple[torch.Tensor, torch.Tensor]#

Update PDF of the cACGMM.

Parameters:
  • z – directional statistics, shape (B, num_inputs, F, T)

  • gamma – masks, shape (B, num_outputs, F, T)

  • zH_invBM_z – energy weighted by shape matrices, shape (B, num_outputs, F, T)

Returns:

Logarithm of the PDF, shape (B, num_outputs, F, T), the energy term, shape (B, num_outputs, F, T)

update_weights(gamma: torch.Tensor) torch.Tensor#

Update weights for the individual components in the mixture model.

Parameters:

gamma – masks, shape (B, num_outputs, F, T)

Returns:

Component weights, shape (B, num_outputs, F)

class nemo.collections.audio.modules.masking.MaskReferenceChannel(*args: Any, **kwargs: Any)#

Bases: NeuralModule

A simple mask processor which applies a mask on the ref_channel of the input signal.

Parameters:
  • ref_channel – Index of the reference channel.

  • mask_min_db – Threshold mask to a minimal value before applying it, defaults to -200dB

  • mask_max_db – Threshold mask to a maximal value before applying it, defaults to 0dB

forward(
input: torch.Tensor,
input_length: torch.Tensor,
mask: torch.Tensor,
) Tuple[torch.Tensor, torch.Tensor]#

Apply the mask on the ref_channel of the input signal. This can be used to generate a multi-channel output: if the mask has M channels, the output will have M channels as well.

Parameters:
  • input – Input signal complex-valued spectrogram, shape (B, C, F, N)

  • input_length – Length of valid entries along the time dimension, shape (B,)

  • mask – Mask for M outputs, shape (B, M, F, N)

Returns:

M-channel output complex-valued spectrogram with shape (B, M, F, N)

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.modules.masking.MaskBasedBeamformer(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Multi-channel processor using masks to estimate signal statistics.

Parameters:
  • filter_type – string denoting the type of the filter. Defaults to mvdr

  • filter_beta – Parameter of the parametric multichannel Wiener filter

  • filter_rank – Parameter of the parametric multichannel Wiener filter

  • filter_postfilter – Optional, postprocessing of the filter

  • ref_channel – Optional, reference channel. If None, it will be estimated automatically

  • ref_hard – If true, hard (one-hot) reference. If false, a soft reference

  • ref_hard_use_grad – If true, use straight-through gradient when using the hard reference

  • ref_subband_weighting – If true, use subband weighting when estimating reference channel

  • num_subbands – Optional, used to determine the parameter size for reference estimation

  • mask_min_db – Threshold mask to a minimal value before applying it, defaults to -200dB

  • mask_max_db – Threshold mask to a maximal value before applying it, defaults to 0dB

  • diag_reg – Optional, diagonal regularization for the multichannel filter

  • eps – Small regularization constant to avoid division by zero

forward(
input: torch.Tensor,
mask: torch.Tensor,
mask_undesired: torch.Tensor | None = None,
input_length: torch.Tensor | None = None,
) torch.Tensor#

Apply a mask-based beamformer to the input spectrogram. This can be used to generate multi-channel output. If the mask has multiple channels, a multichannel filter is created for each mask, and the output is a concatenation of the individual outputs along the channel dimension. The total number of outputs is num_masks * M, where M is the number of channels at the filter output.

Parameters:
  • input – Input signal complex-valued spectrogram, shape (B, C, F, N)

  • mask – Mask for M output signals, shape (B, num_masks, F, N)

  • input_length – Length of valid entries along the time dimension, shape (B,)

Returns:

Multichannel output signal complex-valued spectrogram, shape (B, num_masks * M, F, N)
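
A usage sketch with random tensors, assuming the default MVDR filter; shapes follow the descriptions above:

import torch
from nemo.collections.audio.modules.masking import MaskBasedBeamformer

beamformer = MaskBasedBeamformer()  # filter_type defaults to 'mvdr'

spec = torch.randn(1, 4, 257, 100, dtype=torch.cfloat)  # (B, C, F, N) input spectrogram
mask = torch.rand(1, 1, 257, 100)                        # (B, num_masks, F, N) mask for one output
spec_len = torch.tensor([100])

out = beamformer(input=spec, mask=mask, input_length=spec_len)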

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.modules.masking.MaskBasedDereverbWPE(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Multi-channel linear prediction-based dereverberation using weighted prediction error for filter estimation.

An optional mask to estimate the signal power can be provided. If a time-frequency mask is not provided, the algorithm corresponds to the conventional WPE algorithm.

Parameters:
  • filter_length – Length of the convolutional filter for each channel in frames.

  • prediction_delay – Delay of the input signal for multi-channel linear prediction in frames.

  • num_iterations – Number of iterations for reweighting

  • mask_min_db – Threshold mask to a minimal value before applying it, defaults to -200dB

  • mask_max_db – Threshold mask to a maximal value before applying it, defaults to 0dB

  • diag_reg – Diagonal regularization for WPE

  • eps – Small regularization constant

  • dtype – Data type for internal computations

References

  • Kinoshita et al, Neural network-based spectrum estimation for online WPE dereverberation, 2017

  • Yoshioka and Nakatani, Generalization of Multi-Channel Linear Prediction Methods for Blind MIMO Impulse Response Shortening, 2012

forward(
input: torch.Tensor,
input_length: torch.Tensor | None = None,
mask: torch.Tensor | None = None,
) torch.Tensor#

Given an input signal input, apply the WPE dereverberation algorithm.

Parameters:
  • input – C-channel complex-valued spectrogram, shape (B, C, F, T)

  • input_length – Optional length for each signal in the batch, shape (B,)

  • mask – Optional mask, shape (B, 1, F, T) or (B, C, F, T)

Returns:

Processed tensor with the same number of channels as the input, shape (B, C, F, T).
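
A dereverberation sketch without a mask (i.e., conventional WPE); the filter settings and random data are illustrative:

import torch
from nemo.collections.audio.modules.masking import MaskBasedDereverbWPE

wpe = MaskBasedDereverbWPE(filter_length=10, prediction_delay=3, num_iterations=2)

spec = torch.randn(1, 2, 257, 200, dtype=torch.cfloat)  # (B, C, F, T) complex spectrogram
spec_len = torch.tensor([200])

out = wpe(input=spec, input_length=spec_len)             # same shape as the input, (B, C, F, T)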

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

Projections#

class nemo.collections.audio.modules.projections.MixtureConsistencyProjection(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Ensure estimated sources are consistent with the input mixture. Note that the input mixture is assumed to be a single-channel signal.

Parameters:
  • weighting – Optional weighting mode for the consistency constraint. If None, use uniform weighting. If power, use the power of the estimated source as the weight.

  • eps – Small positive value for regularization

Reference:

Wisdom et al, Differentiable consistency constraints for improved deep speech enhancement, 2018

forward(
mixture: torch.Tensor,
estimate: torch.Tensor,
) torch.Tensor#

Enforce mixture consistency on the estimated sources.

Parameters:
  • mixture – Single-channel mixture, shape (B, 1, F, N)

  • estimate – M estimated sources, shape (B, M, F, N)

Returns:

Source estimates consistent with the mixture, shape (B, M, F, N)
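
A minimal sketch with random spectrograms; after the projection, the source estimates sum to the input mixture (up to numerical precision):

import torch
from nemo.collections.audio.modules.projections import MixtureConsistencyProjection

projection = MixtureConsistencyProjection()

mixture = torch.randn(2, 1, 257, 100, dtype=torch.cfloat)   # (B, 1, F, N) single-channel mixture
estimate = torch.randn(2, 3, 257, 100, dtype=torch.cfloat)  # (B, M, F, N) estimates for M=3 sources

consistent = projection(mixture=mixture, estimate=estimate)  # (B, M, F, N)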

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

SSL Pretraining#

class nemo.collections.audio.modules.ssl_pretrain_masking.SSLPretrainWithMaskedPatch(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Zeroes out fixed-size time patches of the spectrogram. All samples in a batch are guaranteed to have the same number of masked time steps. Note that this may be problematic when pretraining on an unbalanced dataset.

For example, say a batch contains two spectrograms of length 87 and 276. With mask_fraction=0.7 and patch_size=10, we obtain mask_patches=7. Each of the two examples will then have 7 masked patches of 10 frames (see the arithmetic sketch after the parameter list).

Parameters:
  • patch_size (int) – maximum number of time steps in one patch. Defaults to 10.

  • mask_fraction (float) – fraction of each sample to be masked (the number of patches is rounded up). Range from 0.0 to 1.0. Defaults to 0.7.
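
A worked example of the patch-count arithmetic from the description above, assuming the number of patches is derived from the shortest spectrogram in the batch and rounded up:

import math

patch_size = 10
mask_fraction = 0.7
lengths = [87, 276]  # spectrogram lengths in the batch (from the example above)

# assumed rule: patches are computed from the shortest example and rounded up
mask_patches = math.ceil(min(lengths) * mask_fraction / patch_size)
print(mask_patches)  # 7, i.e., 7 patches of 10 masked frames for every example in the batch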

forward(input_spec, length)#

Apply Patched masking on the input_spec.

During the training stage, the mask is generated randomly, with approximately self.mask_fraction of the time frames being masked out.

In the validation stage, the masking pattern is fixed to ensure consistent evaluation of checkpoints and to prevent overfitting. Note that the same masking pattern is applied to all data, regardless of their lengths. On average, approximately self.mask_fraction of the time frames will be masked out.

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

Transforms#

class nemo.collections.audio.modules.transforms.AudioToSpectrogram(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Transform a batch of input multi-channel signals into a batch of STFT-based spectrograms.

Parameters:
  • fft_length – length of FFT

  • hop_length – length of hops/shifts of the sliding window

  • power – exponent for magnitude spectrogram. Default None will return a complex-valued spectrogram

  • magnitude_power – Transform magnitude of the spectrogram as x^magnitude_power.

  • scale – Positive scaling of the spectrogram.

forward(
input: torch.Tensor,
input_length: torch.Tensor | None = None,
) Tuple[torch.Tensor, torch.Tensor]#

Convert a batch of C-channel input signals into a batch of complex-valued spectrograms.

Parameters:
  • input – Time-domain input signal with C channels, shape (B, C, T)

  • input_length – Length of valid entries along the time dimension, shape (B,)

Returns:

Output spectrogram with F subbands and N time frames, shape (B, C, F, N) and output length with shape (B,).

get_output_length(
input_length: torch.Tensor,
) torch.Tensor#

Get length of valid frames for the output.

Parameters:

input_length – number of valid samples, shape (B,)

Returns:

Number of valid frames, shape (B,)

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

stft(x: torch.Tensor)#

Apply STFT as in torchaudio.transforms.Spectrogram(power=None)

Parameters:

x – Input time-domain signal, shape (…, T)

Returns:

Complex-valued spectrogram x_spec = STFT(x), shape (…, F, N).

class nemo.collections.audio.modules.transforms.SpectrogramToAudio(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Transform a batch of input multi-channel spectrograms into a batch of time-domain multi-channel signals.

Parameters:
  • fft_length – length of FFT

  • hop_length – length of hops/shifts of the sliding window

  • magnitude_power – Transform magnitude of the spectrogram as x^(1/magnitude_power).

  • scale – Spectrogram will be scaled with 1/scale before the inverse transform.

forward(
input: torch.Tensor,
input_length: torch.Tensor | None = None,
) torch.Tensor#

Convert input complex-valued spectrogram to a time-domain signal. Multi-channel IO is supported.

Parameters:
  • input – Input spectrogram for C channels, shape (B, C, F, N)

  • input_length – Length of valid entries along the time dimension, shape (B,)

Returns:

Time-domain signal with T time-domain samples and C channels, (B, C, T) and output length with shape (B,).
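
A round-trip sketch combining AudioToSpectrogram and SpectrogramToAudio; the FFT settings and signal lengths are illustrative, and the returned values follow the Returns descriptions above:

import torch
from nemo.collections.audio.modules.transforms import AudioToSpectrogram, SpectrogramToAudio

analysis = AudioToSpectrogram(fft_length=512, hop_length=128)
synthesis = SpectrogramToAudio(fft_length=512, hop_length=128)

audio = torch.randn(2, 1, 16000)           # (B, C, T), e.g., 1 second at 16 kHz
audio_len = torch.tensor([16000, 12000])   # valid samples per example

spec, spec_len = analysis(input=audio, input_length=audio_len)            # (B, C, F, N), (B,)
audio_hat, audio_hat_len = synthesis(input=spec, input_length=spec_len)   # (B, C, T'), (B,)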

get_output_length(
input_length: torch.Tensor,
) torch.Tensor#

Get length of valid samples for the output.

Parameters:

input_length – number of valid frames, shape (B,)

Returns:

Number of valid samples, shape (B,)

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

istft(x_spec: torch.Tensor)#

Apply iSTFT as in torchaudio.transforms.InverseSpectrogram

Parameters:

x_spec – Input complex-valued spectrogram, shape (…, F, N)

Returns:

Time-domain signal x = iSTFT(x_spec), shape (…, T).

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

Parts#

Submodules: Diffusion#

class nemo.collections.audio.parts.submodules.diffusion.StochasticDifferentialEquation(*args: Any, **kwargs: Any)#

Bases: NeuralModule, ABC

Base class for stochastic differential equations.

abstract coefficients(
state: torch.Tensor,
time: torch.Tensor,
**kwargs,
) Tuple[torch.Tensor, torch.Tensor]#
Parameters:
  • state – tensor of shape (B, C, D, T)

  • time – tensor of shape (B,)

Returns:

Tuple with drift and diffusion coefficients.

abstract copy()#

Create a copy of this SDE.

discretize(
*,
state: torch.Tensor,
time: torch.Tensor,
state_length: torch.Tensor | None = None,
**kwargs,
) Tuple[torch.Tensor, torch.Tensor]#

Assume we have the following SDE:

dx = drift(x, t) * dt + diffusion(x, t) * dwt

where wt is the standard Wiener process.

We assume the following discretization:

new_state = current_state + total_drift + total_diffusion * z_norm

where z_norm is sampled from normal distribution with zero mean and unit variance.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • state_length – length of the valid time steps for each example in the batch, shape (B,)

  • **kwargs – other parameters

Returns:

Drift and diffusion.
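
The returned terms can be combined into the update described above; a generic sketch of that step (not the exact NeMo implementation) is:

import torch

def discretized_step(state: torch.Tensor, total_drift: torch.Tensor, total_diffusion: torch.Tensor) -> torch.Tensor:
    # new_state = current_state + total_drift + total_diffusion * z_norm, with z_norm ~ N(0, I)
    z_norm = torch.randn_like(state)
    return state + total_drift + total_diffusion * z_norm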

property dt: float#

Time step for this SDE. This denotes the step size between 0 and self.time_max when using self.num_steps.

generate_time(
size: int,
device: torch.device,
) torch.Tensor#

Generate random time steps in the valid range.

Time steps are generated between self.time_min and self.time_max.

Parameters:
  • size – number of samples

  • device – device to use

Returns:

A tensor of floats with shape (size,)

prior_sampling(
prior_mean: torch.Tensor,
) torch.Tensor#

Generate a sample from the prior distribution p_T.

Parameters:

prior_mean – Mean of the prior distribution

Returns:

A sample from the prior distribution.

property time_delta: float#

Time range for this SDE.

class nemo.collections.audio.parts.submodules.diffusion.OrnsteinUhlenbeckVarianceExplodingSDE(*args: Any, **kwargs: Any)#

Bases: StochasticDifferentialEquation

This class implements the Ornstein-Uhlenbeck SDE with variance exploding noise schedule.

The SDE is given by:

dx = theta * (y - x) dt + g(t) dw

where theta is the stiffness parameter and g(t) is the diffusion coefficient:

g(t) = std_min * (std_max/std_min)^t * sqrt(2 * log(std_max/std_min))

References

Richter et al., Speech Enhancement and Dereverberation with Diffusion-based Generative Models, Tr. ASLP 2023
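
As a quick illustration of the noise schedule, the diffusion coefficient g(t) from the formula above can be evaluated as follows; the std_min and std_max values are illustrative:

import math

def diffusion_coefficient(t: float, std_min: float = 0.05, std_max: float = 0.5) -> float:
    # g(t) = std_min * (std_max / std_min)^t * sqrt(2 * log(std_max / std_min))
    ratio = std_max / std_min
    return std_min * ratio**t * math.sqrt(2.0 * math.log(ratio))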

coefficients(
state: torch.Tensor,
time: torch.Tensor,
prior_mean: torch.Tensor,
state_length: torch.Tensor | None = None,
) Tuple[torch.Tensor, torch.Tensor]#

Compute drift and diffusion coefficients for this SDE.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • prior_mean – mean of the prior distribution

  • state_length – length of the valid time steps for each example in the batch

Returns:

Drift and diffusion coefficients.

copy()#

Create a copy of this SDE.

perturb_kernel_mean(
state: torch.Tensor,
prior_mean: torch.Tensor,
time: torch.Tensor,
) torch.Tensor#

Return the mean of the perturbation kernel for this SDE.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • prior_mean – mean of the prior distribution

  • time – current time of the process, shape (B,)

Returns:

A tensor of shape (B, C, D, T)

perturb_kernel_params(
state: torch.Tensor,
prior_mean: torch.Tensor,
time: torch.Tensor,
) torch.Tensor#

Return the mean and standard deviation of the perturbation kernel for this SDE.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • prior_mean – mean of the prior distribution

  • time – current time of the process, shape (B,)

perturb_kernel_std(
time: torch.Tensor,
) torch.Tensor#

Return the standard deviation of the perturbation kernel for this SDE.

Note that the standard deviation depends on the time and the noise schedule, which is parametrized using self.stiffness, self.std_min and self.std_max.

Parameters:

time – current time of the process, shape (B,)

Returns:

A tensor of shape (B,)

prior_sampling(
prior_mean: torch.Tensor,
) torch.Tensor#

Generate a sample from the prior distribution p_T.

Parameters:

prior_mean – Mean of the prior distribution

class nemo.collections.audio.parts.submodules.diffusion.ReverseStochasticDifferentialEquation(*args: Any, **kwargs: Any)#

Bases: StochasticDifferentialEquation

coefficients(
state: torch.Tensor,
time: torch.Tensor,
score_condition: torch.Tensor | None = None,
state_length: torch.Tensor | None = None,
**kwargs,
) Tuple[torch.Tensor, torch.Tensor]#

Compute drift and diffusion coefficients for the reverse SDE.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

copy()#

Create a copy of this SDE.

discretize(
*,
state: torch.Tensor,
time: torch.Tensor,
score_condition: torch.Tensor | None = None,
state_length: torch.Tensor | None = None,
**kwargs,
) Tuple[torch.Tensor, torch.Tensor]#

Discretize the reverse SDE.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • score_condition – condition for the score estimator

  • state_length – length of the valid time steps for each example in the batch

  • **kwargs – other parameters for discretization of the forward SDE

prior_sampling(
shape: torch.Size,
device: torch.device,
) torch.Tensor#

Prior sampling is not necessary for the reverse SDE.

class nemo.collections.audio.parts.submodules.diffusion.PredictorCorrectorSampler(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Predictor-Corrector sampler for the reverse SDE.

Parameters:
  • sde – forward SDE

  • score_estimator – neural score estimator

  • predictor – predictor for the reverse process

  • corrector – corrector for the reverse process

  • num_steps – number of time steps for the reverse process

  • num_corrector_steps – number of corrector steps

  • time_max – maximum time

  • time_min – minimum time

  • snr – SNR for Annealed Langevin Dynamics

  • output_type – type of the output (‘state’ for the final state, or ‘mean’ for the mean of the final state)

References

  • Song et al., Score-based generative modeling through stochastic differential equations, 2021

class nemo.collections.audio.parts.submodules.diffusion.Predictor(*args: Any, **kwargs: Any)#

Bases: Module, ABC

Predictor for the reverse process.

Parameters:
  • sde – forward SDE

  • score_estimator – neural score estimator

abstract forward(
*,
state: torch.Tensor,
time: torch.Tensor,
score_condition: torch.Tensor | None = None,
state_length: torch.Tensor | None = None,
**kwargs,
)#

Predict the next state of the reverse process.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • score_condition – conditioning for the score estimator

  • state_length – length of the valid time steps for each example in the batch

Returns:

New state and mean.

class nemo.collections.audio.parts.submodules.diffusion.ReverseDiffusionPredictor(*args: Any, **kwargs: Any)#

Bases: Predictor

Predict the next state of the reverse process using the reverse diffusion process.

Parameters:
  • sde – forward SDE

  • score_estimator – neural score estimator

forward(
*,
state,
time,
score_condition=None,
state_length=None,
**kwargs,
)#

Predict the next state of the reverse process using the reverse diffusion process.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • score_condition – conditioning for the score estimator

  • state_length – length of the valid time steps for each example in the batch

Returns:

New state and mean of the diffusion process.

class nemo.collections.audio.parts.submodules.diffusion.Corrector(*args: Any, **kwargs: Any)#

Bases: NeuralModule, ABC

Corrector for the reverse process.

Parameters:
  • sde – forward SDE

  • score_estimator – neural score estimator

  • snr – SNR for Annealed Langevin Dynamics

  • num_steps – number of steps for the corrector

class nemo.collections.audio.parts.submodules.diffusion.AnnealedLangevinDynamics(*args: Any, **kwargs: Any)#

Bases: Corrector

Annealed Langevin Dynamics for the reverse process.

References

  • Song et al., Score-based generative modeling through stochastic differential equations, 2021

forward(
state,
time,
score_condition=None,
state_length=None,
)#

Correct the state using Annealed Langevin Dynamics.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • score_condition – conditioning for the score estimator

  • state_length – length of the valid time steps for each example in the batch

Returns:

New state and mean of the diffusion process.

References

Alg. 4 in http://arxiv.org/abs/2011.13456

Submodules: Flow#

class nemo.collections.audio.parts.submodules.flow.ConditionalFlow(time_min: float = 1e-08, time_max: float = 1.0)#

Bases: ABC

Abstract class for different conditional flow-matching (CFM) classes

Time horizon is [time_min, time_max] (time_max should be 1).

Every path is “conditioned” on the endpoints of the path; the endpoints are just our paired data samples. Subclasses need to implement mean, std, and vector_field.

flow(
*,
time: torch.Tensor,
x_start: torch.Tensor,
x_end: torch.Tensor,
point: torch.Tensor,
) torch.Tensor#

Compute the conditional flow phi_t( point | x_start, x_end). This is an affine flow.

generate_time(
batch_size: int,
rng: torch.random.Generator | None = None,
) torch.Tensor#

Randomly sample a batch of time steps from U[self.time_min, self.time_max]. Supports an external random number generator for better reproducibility.

abstract mean(
*,
time: torch.Tensor,
x_start: torch.Tensor,
x_end: torch.Tensor,
) torch.Tensor#

Return the mean of p_t(x | x_start, x_end) at time t

sample(
*,
time: torch.Tensor,
x_start: torch.Tensor,
x_end: torch.Tensor,
) torch.Tensor#

Generate a sample from p_t(x | x_start, x_end) at time t. Note that this implementation assumes all path marginals are normally distributed.

abstract std(
*,
time: torch.Tensor,
x_start: torch.Tensor,
x_end: torch.Tensor,
) torch.Tensor#

Return the standard deviation of p_t(x | x_start, x_end) at time t

abstract vector_field(
*,
time: torch.Tensor,
x_start: torch.Tensor,
x_end: torch.Tensor,
point: torch.Tensor,
) torch.Tensor#

Compute the conditional vector field v_t( point | x_start, x_end)

class nemo.collections.audio.parts.submodules.flow.OptimalTransportFlow(
time_min: float = 1e-08,
time_max: float = 1.0,
sigma_start: float = 1.0,
sigma_end: float = 0.0001,
)#

Bases: ConditionalFlow

The OT-CFM model from [Lipman et al., 2023].

For every conditional path, the following holds: p_0 = N(x_start, sigma_start), p_1 = N(x_end, sigma_end)

mean(x, t) = (time_max - t) * x_start + t * x_end (linear interpolation between x_start and x_end)

std(x, t) = (time_max - t) * sigma_start + t * sigma_end

Every conditional path is an optimal transport map from p_0(x_start, x_end) to p_1(x_start, x_end). The marginal path is not guaranteed to be an optimal transport map from p_0 to p_1.

To get the OT-CFM model from [Lipman et al., 2023], just pass zeros for x_start. To get the I-CFM model, set sigma_start=sigma_end. To get the rectified flow model, set sigma_start=sigma_end=0.

Parameters:
  • time_min – minimum time value used in the process

  • time_max – maximum time value used in the process

  • sigma_start – the standard deviation of the initial distribution

  • sigma_end – the standard deviation of the target distribution
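
The mean and std interpolation described above can be sketched directly; time_max is assumed to be 1.0 and the sigma defaults mirror the signature:

import torch

def ot_mean(x_start: torch.Tensor, x_end: torch.Tensor, t: float, time_max: float = 1.0) -> torch.Tensor:
    # linear interpolation between x_start and x_end
    return (time_max - t) * x_start + t * x_end

def ot_std(t: float, sigma_start: float = 1.0, sigma_end: float = 1e-4, time_max: float = 1.0) -> float:
    return (time_max - t) * sigma_start + t * sigma_end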

mean(
*,
x_start: torch.Tensor,
x_end: torch.Tensor,
time: torch.Tensor,
) torch.Tensor#

Return the mean of p_t(x | x_start, x_end) at time t

std(
*,
x_start: torch.Tensor,
x_end: torch.Tensor,
time: torch.Tensor,
) torch.Tensor#

Return the standard deviation of p_t(x | x_start, x_end) at time t

vector_field(
*,
x_start: torch.Tensor,
x_end: torch.Tensor,
time: torch.Tensor,
point: torch.Tensor,
eps: float = 1e-06,
) torch.Tensor#

Compute the conditional vector field v_t( point | x_start, x_end)

class nemo.collections.audio.parts.submodules.flow.ConditionalFlowMatchingSampler(
estimator: torch.nn.Module,
num_steps: int = 5,
time_min: float = 1e-08,
time_max: float = 1.0,
)#

Bases: ABC

Abstract class for different samplers that solve the ODE in CFM

Parameters:
  • estimator – the NN-based conditional vector field estimator

  • num_steps – How many time steps to iterate in the process

  • time_min – minimum time value used in the process

  • time_max – maximum time value used in the process

class nemo.collections.audio.parts.submodules.flow.ConditionalFlowMatchingEulerSampler(
estimator: torch.nn.Module,
num_steps: int = 5,
time_min: float = 1e-08,
time_max: float = 1.0,
)#

Bases: ConditionalFlowMatchingSampler

The Euler Sampler for solving the ODE in CFM on a uniform time grid
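
Conceptually, the sampler performs Euler integration of the ODE on a uniform time grid; a generic sketch (the estimator call signature is assumed, not the exact NeMo interface) is:

import torch

def euler_integrate(estimator, state: torch.Tensor, time_min: float = 1e-8, time_max: float = 1.0, num_steps: int = 5) -> torch.Tensor:
    # uniform time grid between time_min and time_max
    times = torch.linspace(time_min, time_max, num_steps + 1)
    dt = (time_max - time_min) / num_steps
    for t in times[:-1]:
        # assumed estimator interface: returns the vector field at (state, t)
        vector_field = estimator(state, t.expand(state.shape[0]))
        state = state + dt * vector_field
    return state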

Submodules: Multichannel#

class nemo.collections.audio.parts.submodules.multichannel.ChannelAugment(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Randomly permutes channels and selects a subset of them.

Parameters:
  • permute_channels (bool) – Apply a random permutation of channels.

  • num_channels_min (int) – Minimum number of channels to select.

  • num_channels_max (int) – Max number of channels to select.

  • rng – Optional, random generator.

  • seed – Optional, seed for the generator.

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.TransformAverageConcatenate(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Apply transform-average-concatenate across channels. We’re using a version from [2].

Parameters:
  • in_features – Number of input features

  • out_features – Number of output features

References

[1] Luo et al, End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation, 2019

[2] Yoshioka et al, VarArray: Array-Geometry-Agnostic Continuous Speech Separation, 2022

forward(
input: torch.Tensor,
) torch.Tensor#
Parameters:

input – shape (B, M, in_features, T)

Returns:

Output tensor with shape (B, M, out_features, T)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.TransformAttendConcatenate(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Apply transform-attend-concatenate across channels. The output is a concatenation of transformed channel and MHA over channels.

Parameters:
  • in_features – Number of input features

  • out_features – Number of output features

  • n_head – Number of heads for the MHA module

  • dropout_rate – Dropout rate for the MHA module

References

  • Jukić et al, Flexible multichannel speech enhancement for noise-robust frontend, 2023

forward(
input: torch.Tensor,
) torch.Tensor#
Parameters:

input – shape (B, M, in_features, T)

Returns:

Output tensor with shape (B, M, out_features, T)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.ChannelAveragePool(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Apply average pooling across channels.

forward(input: torch.Tensor) torch.Tensor#
Parameters:

input – shape (B, M, F, T)

Returns:

Output tensor with shape (B, F, T)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.ChannelAttentionPool(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Use attention pooling to aggregate information across channels. First apply MHA across channels and then apply averaging.

Parameters:
  • in_features – Number of input features

  • out_features – Number of output features

  • n_head – Number of heads for the MHA module

  • dropout_rate – Dropout rate for the MHA module

References

  • Wang et al, Neural speech separation using spatially distributed microphones, 2020

  • Jukić et al, Flexible multichannel speech enhancement for noise-robust frontend, 2023

forward(input: torch.Tensor) torch.Tensor#
Parameters:

input – shape (B, M, F, T)

Returns:

Output tensor with shape (B, F, T)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.ParametricMultichannelWienerFilter(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Parametric multichannel Wiener filter, with an adjustable tradeoff between noise reduction and speech distortion. It supports automatic reference channel selection based on the estimated output SNR.

Parameters:
  • beta – Parameter of the parametric filter, tradeoff between noise reduction and speech distortion (0: MVDR, 1: MWF).

  • rank – Rank assumption for the speech covariance matrix.

  • postfilter – Optional postfilter. If None, no postfilter is applied.

  • ref_channel – Optional, reference channel. If None, it will be estimated automatically.

  • ref_hard – If true, estimate a hard (one-hot) reference. If false, a soft reference.

  • ref_hard_use_grad – If true, use straight-through gradient when using the hard reference

  • ref_subband_weighting – If true, use subband weighting when estimating reference channel

  • num_subbands – Optional, used to determine the parameter size for reference estimation

  • diag_reg – Optional, diagonal regularization for the multichannel filter

  • eps – Small regularization constant to avoid division by zero

References

  • Souden et al, On Optimal Frequency-Domain Multichannel Linear Filtering for Noise Reduction, 2010

apply_ban(
input: torch.Tensor,
filter: torch.Tensor,
psd_n: torch.Tensor,
) torch.Tensor#

Apply blind analytic normalization postfilter. Note that this normalization has been derived for the GEV beamformer in [1]. More specifically, the BAN postfilter aims to scale GEV to satisfy the distortionless constraint and the final analytical expression is derived using an assumption on the norm of the transfer function. However, this may still be useful in some instances.

Parameters:
  • input – batch with M output channels (B, M, F, T)

  • filter – batch of C-input, M-output filters, shape (B, F, C, M)

  • psd_n – batch of noise PSDs, shape (B, F, C, C)

Returns:

Filtered input, shape (B, M, F, T)

References

  • Warsitz and Haeb-Umbach, Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition, 2007

apply_diag_reg(
psd: torch.Tensor,
) torch.Tensor#

Apply diagonal regularization on psd.

Parameters:

psd – tensor, shape (…, C, C)

Returns:

Tensor, same shape as input.

apply_filter(
input: torch.Tensor,
filter: torch.Tensor,
) torch.Tensor#

Apply the MIMO filter on the input.

Parameters:
  • input – batch with C input channels, shape (B, C, F, T)

  • filter – batch of C-input, M-output filters, shape (B, F, C, M)

Returns:

M-channel filter output, shape (B, M, F, T)

forward(
input: torch.Tensor,
mask_s: torch.Tensor,
mask_n: torch.Tensor,
) torch.Tensor#

Return processed signal. The output has either one channel (M=1) if a ref_channel is selected, or the same number of channels as the input (M=C) if ref_channel is None.

Parameters:
  • input – Input signal, complex tensor with shape (B, C, F, T)

  • mask_s – Mask for the desired signal, shape (B, F, T)

  • mask_n – Mask for the undesired noise, shape (B, F, T)

Returns:

Processed signal, shape (B, M, F, T)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

static trace(
x: torch.Tensor,
keepdim: bool = False,
) torch.Tensor#

Calculate trace of matrix slices over the last two dimensions in the input tensor.

Parameters:

x – tensor, shape (…, C, C)

Returns:

Trace for each (C, C) matrix. shape (…)

class nemo.collections.audio.parts.submodules.multichannel.ReferenceChannelEstimatorSNR(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Estimate a reference channel by selecting the reference that maximizes the output SNR. It returns a one-hot encoded vector or a soft reference.

A straight-through estimator is used for gradient when using hard reference.

Parameters:
  • hard – If true, use hard estimate of ref channel. If false, use a soft estimate across channels.

  • hard_use_grad – Use straight-through estimator for the gradient.

  • subband_weighting – If true, use subband weighting when adding across subband SNRs. If false, use average across subbands.

References

Boeddeker et al, Front-End Processing for the CHiME-5 Dinner Party Scenario, 2018

forward(
W: torch.Tensor,
psd_s: torch.Tensor,
psd_n: torch.Tensor,
) torch.Tensor#
Parameters:
  • W – Multichannel-input multichannel-output (MIMO) filter, shape (B, F, C, M), where C is the number of input channels and M is the number of output channels

  • psd_s – Covariance for the signal, shape (B, F, C, C)

  • psd_n – Covariance for the noise, shape (B, F, C, C)

Returns:

One-hot or soft reference channel, shape (B, M)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.WPEFilter(*args: Any, **kwargs: Any)#

Bases: NeuralModule

A weighted prediction error filter. Given input signal, and expected power of the desired signal, this class estimates a multiple-input multiple-output prediction filter and returns the filtered signal. Currently, estimation of statistics and processing is performed in batch mode.

Parameters:
  • filter_length – Length of the prediction filter in frames, per channel

  • prediction_delay – Prediction delay in frames

  • diag_reg – Diagonal regularization for the correlation matrix Q, applied as diag_reg * trace(Q) + eps

  • eps – Small positive constant for regularization

References

  • Yoshioka and Nakatani, Generalization of Multi-Channel Linear Prediction Methods for Blind MIMO Impulse Response Shortening, 2012

  • Jukić et al, Group sparsity for MIMO speech dereverberation, 2015

apply_filter(
filter: torch.Tensor,
input: torch.Tensor | None = None,
tilde_input: torch.Tensor | None = None,
) torch.Tensor#

Apply a prediction filter filter on the input input as

output(b,f) = tilde{input(b,f)} * filter(b,f)

If available, directly use the convolution matrix tilde_input.

Parameters:
  • input – Input signal, shape (B, C, F, N)

  • tilde_input – Convolution matrix for the input signal, shape (B, C, F, N, filter_length)

  • filter – Prediction filter, shape (B, C, F, C, filter_length)

Returns:

Multi-channel signal obtained by applying the prediction filter on the input signal, same shape as input (B, C, F, N)

classmethod convtensor(
x: torch.Tensor,
filter_length: int,
delay: int = 0,
n_steps: int | None = None,
) torch.Tensor#

Create a tensor equivalent of convmtx_mc for each example in the batch. The input signal tensor x has shape (B, C, F, N). Convtensor returns a view of the input signal x.

Note: We avoid reshaping the output to collapse channels and filter taps into a single dimension, e.g., (B, F, N, -1). In this way, the output is a view of the input, while an additional reshape would result in a contiguous array and more memory use.

Parameters:
  • x – input tensor, shape (B, C, F, N)

  • filter_length – length of the filter, determines the shape of the convolution tensor

  • delay – delay to add to the input signal x before constructing the convolution tensor

  • n_steps – Optional, number of time steps to keep in the output. Defaults to the number of time steps in the input tensor.

Returns:

Return a convolutional tensor with shape (B, C, F, n_steps, filter_length)

estimate_correlations(
input: torch.Tensor,
weight: torch.Tensor,
tilde_input: torch.Tensor,
input_length: torch.Tensor | None = None,
) Tuple[torch.Tensor]#
Parameters:
  • input – Input signal, shape (B, C, F, N)

  • weight – Time-frequency weight, shape (B, F, N)

  • tilde_input – Multi-channel convolution tensor, shape (B, C, F, N, filter_length)

  • input_length – Length of each input example, shape (B)

Returns:

Returns a tuple of correlation matrices for each batch.

Let X denote the input signal in a single subband, tilde{X} the corresponding multi-channel correlation matrix, and w the vector of weights.

The first output is

Q = tilde{X}^H * diag(w) * tilde{X} (1)

for each (b, f). The matrix calculated in (1) has shape (C * filter_length, C * filter_length) The output is returned in a tensor with shape (B, F, C, filter_length, C, filter_length).

The second output is

R = tilde{X}^H * diag(w) * X (2)

for each (b, f). The matrix calculated in (2) has shape (C * filter_length, C) The output is returned in a tensor with shape (B, F, C, filter_length, C). The last dimension corresponds to output channels.

estimate_filter(
Q: torch.Tensor,
R: torch.Tensor,
) torch.Tensor#
Estimate the MIMO prediction filter as

G(b,f) = Q(b,f)^{-1} R(b,f)

i.e., by solving the linear system Q(b,f) G(b,f) = R(b,f)

for each subband in each example in the batch (b, f).

Parameters:
  • Q – shape (B, F, C, filter_length, C, filter_length)

  • R – shape (B, F, C, filter_length, C)

Returns:

Complex-valued prediction filter, shape (B, C, F, C, filter_length)
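
The estimation step amounts to solving the linear system Q G = R per (b, f), with the diagonal regularization described in the class parameters. Below is a self-contained sketch with illustrative values, not the exact implementation.

```python
import torch

C, filter_length = 4, 10
n = C * filter_length
A = torch.randn(n, n, dtype=torch.cfloat)
Q = A @ A.conj().T                                  # Hermitian PSD stand-in for the correlation matrix
R = torch.randn(n, C, dtype=torch.cfloat)

# Diagonal regularization: diag_reg * trace(Q) + eps added to the diagonal.
diag_reg, eps = 1e-6, 1e-8
reg = diag_reg * Q.diagonal().sum().real + eps
G = torch.linalg.solve(Q + reg * torch.eye(n, dtype=Q.dtype), R)  # prediction filter, (n, C)
```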

forward(
input: torch.Tensor,
power: torch.Tensor,
input_length: torch.Tensor | None = None,
) torch.Tensor#

Given input and the predicted power for the desired signal, estimate the WPE filter and return the processed signal.

Parameters:
  • input – Input signal, shape (B, C, F, N)

  • power – Predicted power of the desired signal, shape (B, C, F, N)

  • input_length – Optional, length of valid frames in input. Defaults to None

Returns:

Tuple of (processed_signal, output_length). Processed signal has the same shape as the input signal (B, C, F, N), and the output length is the same as the input length.

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

classmethod permute_convtensor(x: torch.Tensor) torch.Tensor#

Reshape and permute columns to convert the result of convtensor to be equal to convmtx_mc. This is used for verification purposes and is not required for applying the filter.

Parameters:

x – output of self.convtensor, shape (B, C, F, N, filter_length)

Returns:

Output has shape (B, F, N, C*filter_length) that corresponds to the layout of convmtx_mc.

Submodules: NCSN++#

class nemo.collections.audio.parts.submodules.ncsnpp.SpectrogramNoiseConditionalScoreNetworkPlusPlus(
*args: Any,
**kwargs: Any,
)#

Bases: NeuralModule

This model handles complex-valued inputs by stacking their real and imaginary components. The stacked tensor is processed using NCSN++, and the output is projected to generate the real and imaginary components of the output channels.

Parameters:
  • in_channels – number of input complex-valued channels

  • out_channels – number of output complex-valued channels
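
A minimal sketch of the stacking described above, assuming the real and imaginary parts are concatenated along the channel dimension (the module's internal layout may differ):

```python
import torch

B, C, F, T = 2, 1, 256, 128
x = torch.randn(B, C, F, T, dtype=torch.cfloat)     # complex-valued input channels
x_stacked = torch.cat([x.real, x.imag], dim=1)      # (B, 2*C, F, T), real-valued input for NCSN++
# After processing, the output is projected to 2*out_channels real-valued maps and
# recombined into out_channels complex-valued channels.
```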

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.parts.submodules.ncsnpp.NoiseConditionalScoreNetworkPlusPlus(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Implementation of Noise Conditional Score Network (NCSN++) architecture.

References

  • Song et al., Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021

  • Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis, ICLR 2019

forward(
*,
input: torch.Tensor,
input_length: torch.Tensor | None,
condition: torch.Tensor | None = None,
)#

Forward pass of the model.

Parameters:
  • input – input tensor, shape (B, C, D, T)

  • input_length – length of the valid time steps for each example in the batch, shape (B,)

  • condition – scalar condition (time) for the model, will be embedded using self.time_embedding

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

pad_input(
input: torch.Tensor,
) torch.Tensor#

Pad input tensor to match the required dimensions across T and D.

class nemo.collections.audio.parts.submodules.ncsnpp.GaussianFourierProjection(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Gaussian Fourier embeddings for input scalars.

The input scalars are typically time or noise levels.
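
The general form of Gaussian Fourier embeddings can be sketched as below; the scale of the random frequencies and the output layout are assumptions, not necessarily this module's exact implementation.

```python
import torch

dim = 128
W = torch.randn(dim // 2) * 16.0                 # fixed random frequencies (illustrative scale)
t = torch.rand(8)                                # batch of scalars, e.g., diffusion time or noise level
proj = 2 * torch.pi * t[:, None] * W[None, :]
emb = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)  # (8, dim)
```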

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.parts.submodules.ncsnpp.ResnetBlockBigGANPlusPlus(*args: Any, **kwargs: Any)#

Bases: Module

Implementation of a ResNet block for the BigGAN model.

References

  • Song et al., Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021

  • Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis, ICLR 2019

forward(
x: torch.Tensor,
diffusion_time_embedding: torch.Tensor | None = None,
)#

Forward pass of the model.

Parameters:
  • x – input tensor

  • diffusion_time_embedding – embedding of the diffusion time step

Returns:

Output tensor

init_weights_()#

Weight initialization

Submodules: Schrödinger Bridge#

class nemo.collections.audio.parts.submodules.schroedinger_bridge.SBNoiseSchedule(*args: Any, **kwargs: Any)#

Bases: NeuralModule, ABC

Noise schedule for the Schrödinger Bridge

Parameters:
  • time_min – minimum time for the process

  • time_max – maximum time for the process

  • num_steps – number of steps for the process

  • eps – small regularization

References

Schrödinger Bridge for Generative Speech Enhancement, https://arxiv.org/abs/2407.16074

abstract alpha(time: torch.Tensor) torch.Tensor#

Return alpha for SB noise schedule.

alpha_t = exp( int_0^t f(s) ds )

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing alpha for each time.

alpha_bar_from_alpha(
alpha: torch.Tensor,
)#

Return alpha_bar for SB.

alpha_bar = alpha_t / alpha_t_max

Parameters:

alpha – tensor with alpha values

Returns:

Tensors the same size as alpha, representing alpha_bar and alpha_t_max.

property alpha_t_max#

Return alpha_t at t_max.

abstract copy()#

Return a copy of the noise schedule.

property dt: float#

Time step for the process.

abstract f(time: torch.Tensor) torch.Tensor#

Drift scaling f(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing drift scaling.

abstract g(time: torch.Tensor) torch.Tensor#

Diffusion scaling g(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing diffusion scaling.

generate_time(
size: int,
device: torch.device,
) torch.Tensor#

Generate random time steps in the valid range.

get_alphas(
time: torch.Tensor,
)#

Return alpha, alpha_bar and alpha_t_max for SB.

Parameters:

time – tensor with time steps

Returns:

Tuple of tensors with alpha, alpha_bar and alpha_t_max.

get_sigmas(
time: torch.Tensor,
)#

Return sigma, sigma_bar and sigma_t_max for SB.

Parameters:

time – tensor with time steps

Returns:

Tuple of tensors with sigma, sigma_bar and sigma_t_max.

abstract sigma(time: torch.Tensor) torch.Tensor#

Return sigma_t for SB.

sigma_t^2 = int_0^t g^2(s) / alpha_s^2 ds

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing sigma for each time.

sigma_bar_from_sigma(
sigma: torch.Tensor,
)#

Return sigma_bar_t for SB.

sigma_bar_t^2 = sigma_t_max^2 - sigma_t^2

Parameters:

sigma – tensor with sigma values

Returns:

Tensors the same size as sigma, representing sigma_bar and sigma_t_max.

property sigma_t_max#

Return sigma_t at t_max.

property time_delta: float#

Time range for the process.

class nemo.collections.audio.parts.submodules.schroedinger_bridge.SBNoiseScheduleVE(*args: Any, **kwargs: Any)#

Bases: SBNoiseSchedule

Variance exploding noise schedule for the Schrödinger Bridge.

Parameters:
  • k – defines the base for the exponential diffusion coefficient

  • c – scaling for the diffusion coefficient

  • time_min – minimum time for the process

  • time_max – maximum time for the process

  • num_steps – number of steps for the process

  • eps – small regularization

References

Schrödinger Bridge for Generative Speech Enhancement, https://arxiv.org/abs/2407.16074
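
A minimal sketch of querying the schedule through the helpers documented on the base class; the constructor values below are illustrative assumptions.

```python
import torch
from nemo.collections.audio.parts.submodules.schroedinger_bridge import SBNoiseScheduleVE

schedule = SBNoiseScheduleVE(k=2.6, c=0.4, time_min=1e-4, time_max=1.0, num_steps=50)

time = schedule.generate_time(size=8, device=torch.device('cpu'))  # random times in the valid range
alpha, alpha_bar, alpha_t_max = schedule.get_alphas(time)          # alpha_bar = alpha / alpha_t_max
sigma, sigma_bar, sigma_t_max = schedule.get_sigmas(time)          # sigma_bar^2 = sigma_t_max^2 - sigma^2
```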

alpha(time: torch.Tensor) torch.Tensor#

Return alpha for SB noise schedule.

alpha_t = exp( int_0^t f(s) ds )

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing alpha for each time.

copy()#

Return a copy of the noise schedule.

f(time: torch.Tensor) torch.Tensor#

Drift scaling f(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing drift scaling.

g(time: torch.Tensor) torch.Tensor#

Diffusion scaling g(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing diffusion scaling.

sigma(time: torch.Tensor) torch.Tensor#

Return sigma_t for SB.

sigma_t^2 = int_0^t g^2(s) / alpha_s^2 ds

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing sigma for each time.

class nemo.collections.audio.parts.submodules.schroedinger_bridge.SBNoiseScheduleVP(*args: Any, **kwargs: Any)#

Bases: SBNoiseSchedule

Variance preserving noise schedule for the Schrödinger Bridge.

Parameters:
  • beta_0 – defines the lower bound for diffusion coefficient

  • beta_1 – defines upper bound for diffusion coefficient

  • c – scaling for the diffusion coefficient

  • time_min – minimum time for the process

  • time_max – maximum time for the process

  • num_steps – number of steps for the process

  • eps – small regularization

alpha(time: torch.Tensor) torch.Tensor#

Return alpha for SB noise schedule.

alpha_t = exp( int_0^t f(s) ds )

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing alpha for each time.

copy()#

Return a copy of the noise schedule.

f(time: torch.Tensor) torch.Tensor#

Drift scaling f(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing drift scaling.

g(time: torch.Tensor) torch.Tensor#

Diffusion scaling g(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing diffusion scaling.

sigma(time: torch.Tensor) torch.Tensor#

Return sigma_t for SB.

sigma_t^2 = int_0^t g^2(s) / alpha_s^2 ds

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing sigma for each time.

class nemo.collections.audio.parts.submodules.schroedinger_bridge.SBSampler(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Schrödinger Bridge sampler.

Parameters:
  • noise_schedule – noise schedule for the bridge

  • estimator – neural estimator

  • estimator_output – defines the output of the estimator, e.g., data_prediction

  • estimator_time – time for conditioning the estimator, e.g., ‘current’ or ‘previous’. Default is ‘previous’.

  • process – defines the process, e.g., sde or ode

  • time_max – maximum time for the process

  • time_min – minimum time for the process

  • num_steps – number of steps for the process

  • eps – small regularization to prevent division by zero

References

  • Schrödinger Bridge for Generative Speech Enhancement, https://arxiv.org/abs/2407.16074

  • Schrödinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis, https://arxiv.org/abs/2312.03491
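
A construction-only sketch of wiring the sampler; the estimator is left as a placeholder (a real score/data-prediction network is required), and all values are illustrative assumptions.

```python
import torch
from nemo.collections.audio.parts.submodules.schroedinger_bridge import (
    SBNoiseScheduleVE,
    SBSampler,
)

noise_schedule = SBNoiseScheduleVE(k=2.6, c=0.4, time_min=1e-4, time_max=1.0, num_steps=50)
estimator = torch.nn.Identity()  # placeholder only; not a usable estimator

sampler = SBSampler(
    noise_schedule=noise_schedule,
    estimator=estimator,
    estimator_output='data_prediction',
    process='sde',
    num_steps=50,
)
```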

Submodules: TransformerUNet#

class nemo.collections.audio.parts.submodules.transformerunet.LearnedSinusoidalPosEmb(*args: Any, **kwargs: Any)#

Bases: Module

Learned sinusoidal embedding used to encode time-conditioning information.

forward(t: torch.Tensor) torch.Tensor#
Parameters:

t – input time tensor, shape (B)

Returns:

the encoded time conditional embedding, shape (B, D)

Return type:

fouriered

class nemo.collections.audio.parts.submodules.transformerunet.ConvPositionEmbed(*args: Any, **kwargs: Any)#

Bases: Module

Convolutional positional embedding that encodes the temporal position of each frame.

forward(x, mask=None)#
Parameters:

x – input tensor, shape (B, T, D)

Returns:

output tensor with the same shape (B, T, D)

Return type:

out

class nemo.collections.audio.parts.submodules.transformerunet.RMSNorm(*args: Any, **kwargs: Any)#

Bases: Module

Root Mean Square Layer Normalization (RMSNorm).

References

  • Zhang et al., Root Mean Square Layer Normalization, 2019

class nemo.collections.audio.parts.submodules.transformerunet.AdaptiveRMSNorm(*args: Any, **kwargs: Any)#

Bases: Module

Adaptive Root Mean Square Layer Normalization given a conditional embedding. This enables the model to consider the conditional input during normalization.

class nemo.collections.audio.parts.submodules.transformerunet.GEGLU(*args: Any, **kwargs: Any)#

Bases: Module

The GeGLU activation implementation

class nemo.collections.audio.parts.submodules.transformerunet.TransformerUNet(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Implementation of the Transformer encoder model with a U-Net structure, as used in Voicebox and Audiobox.

References

  • Le et al., Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale, 2023

  • Vyas et al., Audiobox: Unified Audio Generation with Natural Language Prompts, 2023

forward(
x,
key_padding_mask: torch.Tensor | None = None,
adaptive_rmsnorm_cond=None,
)#

Forward pass of the model.

Parameters:
  • x – input tensor, shape (B, C, D, T)

  • key_padding_mask – mask tensor indicating the padding parts, shape (B, T)

  • adaptive_rmsnorm_cond – conditional input for the model, shape (B, D)

get_alibi_bias(batch_size: int, seq_len: int)#

Return the ALiBi bias given the batch size and sequence length.

init_alibi(max_positions: int, heads: int)#

Initialize the ALiBi bias parameters.

References

  • Press et al., Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, 2021
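
As a reminder of the technique, ALiBi adds a head-specific linear penalty on query–key distance to the attention logits. The following is a generic sketch of that construction, not necessarily identical to this module's init_alibi / get_alibi_bias internals.

```python
import torch

def alibi_bias(heads: int, seq_len: int) -> torch.Tensor:
    # Geometric slopes, one per attention head (standard choice for a power-of-two number of heads).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / heads) for h in range(heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()            # (T, T) relative distances
    return -slopes[:, None, None] * distance[None, :, :]      # (heads, T, T), added to attention logits

bias = alibi_bias(heads=8, seq_len=16)
```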

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.parts.submodules.transformerunet.SpectrogramTransformerUNet(*args: Any, **kwargs: Any)#

Bases: NeuralModule

This model handles complex-valued inputs by stacking their real and imaginary components. The stacked tensor is processed using TransformerUNet, and the output is projected to generate the real and imaginary components of the output channels.

A convolutional positional embedding is applied to the input sequence.

forward(
input,
input_length=None,
condition=None,
)#

Forward pass of the model.

Parameters:
  • input – input tensor, shape (B, C, D, T)

  • input_length – length of the valid time steps for each example in the batch, shape (B,)

  • condition – scalar condition (time) for the model, will be embedded using self.time_embedding

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

Losses#

class nemo.collections.audio.losses.MAELoss(*args: Any, **kwargs: Any)#

Bases: Loss, Typing

Computes the mean absolute error (MAE) loss with weighted average across channels.

Parameters:
  • weight – weight for loss of each output channel, used for averaging the loss across channels. Defaults to None (averaging).

  • reduction – batch reduction. Defaults to mean over the batch.

  • ndim – Number of dimensions for the input signal

forward(
estimate: torch.Tensor,
target: torch.Tensor,
input_length: torch.Tensor | None = None,
mask: torch.Tensor | None = None,
) torch.Tensor#

For input batch of multi-channel signals, calculate MAE between estimate and target for each channel, perform averaging across channels (weighting optional), and apply reduction across the batch.

Parameters:
  • estimate – Estimate of the target signal

  • target – Target signal

  • input_length – Length of each example in the batch

  • mask – Mask for each signal

Returns:

Scalar loss.
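
A minimal usage sketch with random tensors; the (B, C, T) shapes, input_length values, and use of default constructor arguments are illustrative assumptions.

```python
import torch
from nemo.collections.audio.losses import MAELoss

loss_fn = MAELoss()
B, C, T = 4, 1, 16000
estimate = torch.randn(B, C, T)
target = torch.randn(B, C, T)
input_length = torch.tensor([16000, 12000, 16000, 8000])  # valid samples per example
loss = loss_fn(estimate=estimate, target=target, input_length=input_length)  # scalar
```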

property input_types#

Input types definitions for MAELoss.

property output_types#

Output types definitions for MAELoss.

loss: NeuralType(None)

class nemo.collections.audio.losses.MSELoss(*args: Any, **kwargs: Any)#

Bases: Loss, Typing

Computes MSE loss with weighted average across channels.

Parameters:
  • weight – weight for loss of each output channel, used for averaging the loss across channels. Defaults to None (averaging).

  • reduction – batch reduction. Defaults to mean over the batch.

  • ndim – Number of dimensions for the input signal

forward(
estimate: torch.Tensor,
target: torch.Tensor,
input_length: torch.Tensor | None = None,
mask: torch.Tensor | None = None,
) torch.Tensor#

For input batch of multi-channel signals, calculate MSE between estimate and target for each channel, perform averaging across channels (weighting optional), and apply reduction across the batch.

Parameters:
  • estimate – Estimate of the target signal

  • target – Target signal

  • input_length – Length of each example in the batch

  • mask – Mask for each signal

Returns:

Scalar loss.

property input_types#

Input types definitions for MSELoss.

property output_types#

Output types definitions for MSELoss.

loss: NeuralType(None)

class nemo.collections.audio.losses.SDRLoss(*args: Any, **kwargs: Any)#

Bases: Loss, Typing

Computes signal-to-distortion ratio (SDR) loss with weighted average across channels.

Parameters:
  • weight – weight for SDR of each output channel, used for averaging the loss across channels. Defaults to None (averaging).

  • reduction – batch reduction. Defaults to mean over the batch.

  • scale_invariant – If True, use scale-invariant SDR. Defaults to False.

  • remove_mean – Remove mean before calculating the loss. Defaults to True.

  • sdr_max – Soft thresholding of the loss to SDR_max.

  • eps – Small value for regularization.

forward(
estimate: torch.Tensor,
target: torch.Tensor,
input_length: torch.Tensor | None = None,
mask: torch.Tensor | None = None,
) torch.Tensor#

For input batch of multi-channel signals, calculate SDR between estimate and target for each channel, perform averaging across channels (weighting optional), and apply reduction across the batch.

Parameters:
  • estimate – Batch of signals, shape (B, C, T)

  • target – Batch of signals, shape (B, C, T)

  • input_length – Batch of lengths, shape (B,)

  • mask – Batch of temporal masks for each channel, shape (B, C, T)

Returns:

Scalar loss.
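
A minimal usage sketch; the constructor values are illustrative assumptions (the returned value is typically the negative SDR, so lower is better).

```python
import torch
from nemo.collections.audio.losses import SDRLoss

loss_fn = SDRLoss(scale_invariant=True)
B, C, T = 4, 2, 16000
estimate = torch.randn(B, C, T)
target = torch.randn(B, C, T)
loss = loss_fn(estimate=estimate, target=target)  # scalar, reduced over channels and batch
```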

property input_types#

Input types definitions for SDRLoss.

property output_types#

Output types definitions for SDRLoss.

loss: NeuralType(None)

Datasets#

NeMo Format#

class nemo.collections.audio.data.audio_to_audio.BaseAudioDataset(*args: Any, **kwargs: Any)#

Bases: Dataset

Base class of audio datasets, providing common functionality for other audio datasets.

Parameters:
  • collection – Collection of audio examples prepared from manifest files.

  • audio_processor – Used to process every example from the collection. A callable with process method. For reference, please check ASRAudioProcessor.

num_channels(signal_key) int#

Returns the number of channels for a particular signal in items prepared by this dataset.

More specifically, this will get the tensor from the first item in the dataset, check if it’s a one- or two-dimensional tensor, and return the number of channels based on the size of the first axis (shape[0]).

NOTE: This assumes that all examples have the same number of channels.

Parameters:

signal_key – string, used to select a signal from the dictionary output by __getitem__

Returns:

Number of channels for the selected signal.

abstract property output_types: Dict[str, NeuralType] | None#

Returns definitions of module output ports.

class nemo.collections.audio.data.audio_to_audio.AudioToTargetDataset(*args: Any, **kwargs: Any)#

Bases: BaseAudioDataset

A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal.

Each line of the manifest file is expected to have the following format:

```
{
    "input_key": "path/to/input.wav",
    "target_key": "path/to/path_to_target.wav",
    "duration": duration_of_input
}
```

Additionally, multiple audio files may be provided for each key in the manifest, for example:

```
{
    "input_key": "path/to/input.wav",
    "target_key": ["path/to/path_to_target_ch0.wav", "path/to/path_to_target_ch1.wav"],
    "duration": duration_of_input
}
```

Keys for input and target signals can be configured in the constructor (input_key and target_key).
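
For example, a single-line manifest entry could be created as follows; the key names must match the input_key / target_key values configured in the constructor, and the paths and duration are placeholders.

```python
import json

entry = {
    "input_key": "path/to/input.wav",
    "target_key": "path/to/path_to_target.wav",
    "duration": 3.2,  # seconds
}
with open("manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")
```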

Parameters:
  • manifest_filepath – Path to manifest file in a format described above.

  • sample_rate – Sample rate for loaded audio signals.

  • input_key – Key pointing to input audio files in the manifest

  • target_key – Key pointing to target audio files in manifest

  • audio_duration – Optional duration of each item returned by __getitem__. If None, complete audio will be loaded. If set, a random subsegment will be loaded synchronously from the input and target, i.e., with the same start and end point.

  • random_offset – If True, offset will be randomized when loading a subsegment from a file.

  • max_duration – If audio exceeds this length, do not include in dataset.

  • min_duration – If audio is less than this length, do not include in dataset.

  • max_utts – Limit number of utterances.

  • input_channel_selector – Optional, select subset of channels from each input audio file. If None, all channels will be loaded.

  • target_channel_selector – Optional, select subset of channels from each target audio file. If None, all channels will be loaded.

  • normalization_signal – Normalize audio signals with a scale that ensures the normalization signal is in range [-1, 1]. All audio signals are scaled by the same factor. Supported values are None (no normalization), ‘input_signal’, ‘target_signal’.

property output_types: Dict[str, NeuralType] | None#

Returns definitions of module output ports.

Returns:

Ordered dictionary in the following form:

```
{
    'input_signal': batched single- or multi-channel format,
    'input_length': batched original length of each input signal,
    'target_signal': batched single- or multi-channel format,
    'target_length': batched original length of each target signal
}
```

class nemo.collections.audio.data.audio_to_audio.AudioToTargetWithReferenceDataset(*args: Any, **kwargs: Any)#

Bases: BaseAudioDataset

A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal, and an additional reference signal is available.

This can be used, for example, when a reference signal is available from:

  • an enrollment utterance for the target signal

  • an echo reference from playback

  • a reference from another sensor that correlates with the target signal

Each line of the manifest file is expected to have the following format:

```
{
    "input_key": "path/to/input.wav",
    "target_key": "path/to/path_to_target.wav",
    "reference_key": "path/to/path_to_reference.wav",
    "duration": duration_of_input
}
```

Keys for input, target and reference signals can be configured in the constructor.

Parameters:
  • manifest_filepath – Path to manifest file in a format described above.

  • sample_rate – Sample rate for loaded audio signals.

  • input_key – Key pointing to input audio files in the manifest

  • target_key – Key pointing to target audio files in manifest

  • reference_key – Key pointing to reference audio files in manifest

  • audio_duration – Optional duration of each item returned by __getitem__. If None, complete audio will be loaded. If set, a random subsegment will be loaded synchronously from the input and target, i.e., with the same start and end point.

  • random_offset – If True, offset will be randomized when loading a subsegment from a file.

  • max_duration – If audio exceeds this length, do not include in dataset.

  • min_duration – If audio is less than this length, do not include in dataset.

  • max_utts – Limit number of utterances.

  • input_channel_selector – Optional, select subset of channels from each input audio file. If None, all channels will be loaded.

  • target_channel_selector – Optional, select subset of channels from each target audio file. If None, all channels will be loaded.

  • reference_channel_selector – Optional, select subset of channels from each reference audio file. If None, all channels will be loaded.

  • reference_is_synchronized – If True, it is assumed that the reference signal is synchronized with the input signal, so the same subsegment will be loaded as for input and target. If False, reference signal will be loaded independently from input and target.

  • reference_duration – Optional, can be used to set a fixed duration of the reference utterance. If None, complete audio file will be loaded.

  • normalization_signal – Normalize audio signals with a scale that ensures the normalization signal is in range [-1, 1]. All audio signals are scaled by the same factor. Supported values are None (no normalization), ‘input_signal’, ‘target_signal’, ‘reference_signal’.

property output_types: Dict[str, NeuralType] | None#

Returns definitions of module output ports.

Returns:

Ordered dictionary in the following form:

```
{
    'input_signal': batched single- or multi-channel format,
    'input_length': batched original length of each input signal,
    'target_signal': batched single- or multi-channel format,
    'target_length': batched original length of each target signal,
    'reference_signal': single- or multi-channel format,
    'reference_length': original length of each reference signal
}
```

class nemo.collections.audio.data.audio_to_audio.AudioToTargetWithEmbeddingDataset(*args: Any, **kwargs: Any)#

Bases: BaseAudioDataset

A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal, with an additional embedding signal available. It is assumed that the embedding is in the form of a vector.

Each line of the manifest file is expected to have the following format:

```
{
    "input_key": "path/to/input.wav",
    "target_key": "path/to/path_to_target.wav",
    "embedding_key": "path/to/path_to_reference.npy",
    "duration": duration_of_input
}
```

Keys for input, target and embedding signals can be configured in the constructor.

Parameters:
  • manifest_filepath – Path to manifest file in a format described above.

  • sample_rate – Sample rate for loaded audio signals.

  • input_key – Key pointing to input audio files in the manifest

  • target_key – Key pointing to target audio files in manifest

  • embedding_key – Key pointing to embedding files in manifest

  • audio_duration – Optional duration of each item returned by __getitem__. If None, complete audio will be loaded. If set, a random subsegment will be loaded synchronously from the input and target, i.e., with the same start and end point.

  • random_offset – If True, offset will be randomized when loading a subsegment from a file.

  • max_duration – If audio exceeds this length, do not include in dataset.

  • min_duration – If audio is less than this length, do not include in dataset.

  • max_utts – Limit number of utterances.

  • input_channel_selector – Optional, select subset of channels from each input audio file. If None, all channels will be loaded.

  • target_channel_selector – Optional, select subset of channels from each target audio file. If None, all channels will be loaded.

  • normalization_signal – Normalize audio signals with a scale that ensures the normalization signal is in range [-1, 1]. All audio signals are scaled by the same factor. Supported values are None (no normalization), ‘input_signal’, ‘target_signal’.

property output_types: Dict[str, NeuralType] | None#

Returns definitions of module output ports.

Returns:

Ordered dictionary in the following form:

```
{
    'input_signal': batched single- or multi-channel format,
    'input_length': batched original length of each input signal,
    'target_signal': batched single- or multi-channel format,
    'target_length': batched original length of each target signal,
    'embedding_vector': batched embedded vector format,
    'embedding_length': batched original length of each embedding vector
}
```

Lhotse Format#

class nemo.collections.audio.data.audio_to_audio_lhotse.LhotseAudioToTargetDataset(*args: Any, **kwargs: Any)#

Bases: Dataset

A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal.

Note

This is a Lhotse variant of nemo.collections.audio.data.audio_to_audio.AudioToTargetDataset.