NeMo Audio API#

Model Classes#

Base Classes#

class nemo.collections.audio.models.AudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: ModelPT, ABC

Base class for audio-to-audio models.

Parameters:
  • cfg – A DictConfig object with the configuration parameters.

  • trainer – A Trainer object to be used for training.

configure_callbacks()#

Create a callback to add audio/spectrogram examples to TensorBoard and W&B.

classmethod list_available_models() List[PretrainedModelInfo]#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns:

List of available pre-trained models.

static match_batch_length(
input: torch.Tensor,
batch_length: int,
) torch.Tensor#

Trim or pad the output to match the batch length.

Parameters:
  • input – tensor with shape (B, C, T)

  • batch_length – int

Returns:

Tensor with shape (B, C, T), where T matches the batch length.
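
As an illustration, the trim-or-pad behavior can be sketched as follows (a generic sketch, not the exact NeMo implementation):

import torch
import torch.nn.functional as F

def match_batch_length_sketch(input: torch.Tensor, batch_length: int) -> torch.Tensor:
    # Trim the time dimension if it is too long, otherwise zero-pad at the end
    diff = batch_length - input.size(-1)
    if diff < 0:
        return input[..., :batch_length]
    return F.pad(input, (0, diff))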

multi_test_epoch_end(
outputs,
dataloader_idx: int = 0,
)#

Adds support for multiple test datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters:
  • outputs – Same as that provided by LightningModule.on_validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns:

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be pre-pended by the dataloader prefix.

multi_validation_epoch_end(
outputs,
dataloader_idx: int = 0,
)#

Adds support for multiple validation datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters:
  • outputs – Same as that provided by LightningModule.on_validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns:

A dictionary of values, optionally containing a sub-dict log, such that the values in the log will be pre-pended by the dataloader prefix.

on_after_backward()#

Zero out gradients that contain any NaN or Inf values.

process(
paths2audio_files: List[str],
output_dir: str,
batch_size: int = 1,
num_workers: int | None = None,
input_channel_selector: int | Iterable[int] | str | None = None,
input_dir: str | None = None,
) List[str]#

Takes paths to audio files and returns a list of paths to the processed audio files.

Parameters:
  • paths2audio_files – paths to audio files to be processed

  • output_dir – directory to save the processed files

  • batch_size – (int) batch size to use during inference.

  • num_workers – Number of workers for the dataloader

  • input_channel_selector (int | Iterable[int] | str) – select a single channel or a subset of channels from multi-channel audio. If set to ‘average’, it performs averaging across channels. Disabled if set to None. Defaults to None.

  • input_dir – Optional, directory that contains the input files. If provided, the output directory will mirror the input directory structure.

Returns:

Paths to processed audio signals.
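
For example, a typical offline inference call might look like the following sketch; the checkpoint path and audio file names are placeholders:

from nemo.collections.audio.models import EncMaskDecAudioToAudioModel

# Restore a trained enhancement model from a .nemo checkpoint (path is a placeholder)
model = EncMaskDecAudioToAudioModel.restore_from("enhancement_model.nemo")
model.eval()

# Process a list of noisy recordings and save the enhanced files to output_dir
processed_paths = model.process(
    paths2audio_files=["noisy_001.wav", "noisy_002.wav"],
    output_dir="enhanced/",
    batch_size=4,
    input_channel_selector="average",  # average across input channels
)
print(processed_paths)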

setup_optimization_flags()#

Utility method that must be explicitly called by the subclass in order to support optional optimization flags. This method is the only valid place to access self.cfg before DDP training occurs.

The subclass may choose not to support this method; therefore, all variables here must be checked via hasattr().

Processing Models#

class nemo.collections.audio.models.EncMaskDecAudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: AudioToAudioModel

Class for encoder-mask-decoder audio processing models.

The model consists of the following blocks:
  • encoder: transforms input multi-channel audio signal into an encoded representation (analysis transform)

  • mask_estimator: estimates a mask used by signal processor

  • mask_processor: mask-based signal processor, combines the encoded input and the estimated mask

  • decoder: transforms processor output into the time domain (synthesis transform)
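
A hypothetical configuration sketch of these four blocks, expressed with OmegaConf and modules documented later on this page, could look as follows; the exact keys and defaults depend on the model's config schema and are not validated here:

from omegaconf import OmegaConf

# Hypothetical block configuration; key names mirror the blocks listed above
cfg_blocks = OmegaConf.create(
    {
        "encoder": {"_target_": "nemo.collections.audio.modules.transforms.AudioToSpectrogram",
                    "fft_length": 512, "hop_length": 128},
        "mask_estimator": {"_target_": "nemo.collections.audio.modules.masking.MaskEstimatorRNN",
                           "num_outputs": 1, "num_subbands": 257},
        "mask_processor": {"_target_": "nemo.collections.audio.modules.masking.MaskReferenceChannel",
                           "ref_channel": 0},
        "decoder": {"_target_": "nemo.collections.audio.modules.transforms.SpectrogramToAudio",
                    "fft_length": 512, "hop_length": 128},
    }
)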

forward(input_signal, input_length=None)#

Forward pass of the model.

Parameters:
  • input_signal – Tensor that represents a batch of raw audio signals, of shape [B, T] or [B, T, C]. T here represents timesteps, with 1 second of audio represented as self.sample_rate number of floating point values.

  • input_length – Vector of length B that contains the individual lengths of the audio sequences.

Returns:

Output signal output in the time domain and the length of the output signal output_length.

property input_types: Dict[str, NeuralType]#

Define these to enable input neural type checks

classmethod list_available_models() PretrainedModelInfo | None#

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns:

List of available pre-trained models.

property output_types: Dict[str, NeuralType]#

Define these to enable output neural type checks

class nemo.collections.audio.models.FlowMatchingAudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: AudioToAudioModel

This model uses a flow matching process to generate an encoded representation of the enhanced signal.

The model consists of the following blocks:
  • encoder: transforms input multi-channel audio signal into an encoded representation (analysis transform)

  • estimator: neural model, estimates a score for the diffusion process

  • flow: ordinary differential equation (ODE) defining a flow and a vector field.

  • sampler: sampler for the inference process, estimates coefficients of the target signal

  • decoder: transforms sampler output into the time domain (synthesis transform)

  • ssl_pretrain_masking: if defined, applies SSL pretraining masking for self-reconstruction during training

forward_internal(
input_signal,
input_length=None,
enable_ssl_masking=False,
)#

Internal forward pass of the model.

Parameters:
  • input_signal – Tensor that represents a batch of raw audio signals, of shape [B, T] or [B, T, C]. T here represents timesteps, with 1 second of audio represented as self.sample_rate number of floating point values.

  • input_length – Vector of length B that contains the individual lengths of the audio sequences.

  • enable_ssl_masking – Whether to enable SSL masking of the input. If using SSL pretraining, masking is applied to the input signal. If not using SSL pretraining, masking is not applied.

Returns:

Output signal output in the time domain and the length of the output signal output_length.

property input_types: Dict[str, NeuralType]#

Define these to enable input neural type checks

property output_types: Dict[str, NeuralType]#

Define these to enable output neural type checks

class nemo.collections.audio.models.PredictiveAudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: AudioToAudioModel

This model aims to directly estimate the coefficients in the encoded domain by applying a neural model.

forward(input_signal, input_length=None)#

Forward pass of the model.

Parameters:
  • input_signal – time-domain signal

  • input_length – valid length of each example in the batch

Returns:

Output signal output in the time domain and the length of the output signal output_length.

property input_types: Dict[str, NeuralType]#

Define these to enable input neural type checks

property output_types: Dict[str, NeuralType]#

Define these to enable output neural type checks

class nemo.collections.audio.models.ScoreBasedGenerativeAudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: AudioToAudioModel

This model uses a score-based diffusion process to generate an encoded representation of the enhanced signal.

The model consists of the following blocks:
  • encoder: transforms input multi-channel audio signal into an encoded representation (analysis transform)

  • estimator: neural model, estimates a score for the diffusion process

  • sde: stochastic differential equation (SDE) defining the forward and reverse diffusion process

  • sampler: sampler for the reverse diffusion process, estimates coefficients of the target signal

  • decoder: transforms sampler output into the time domain (synthesis transform)

property input_types: Dict[str, NeuralType]#

Define these to enable input neural type checks

property output_types: Dict[str, NeuralType]#

Define these to enable output neural type checks

class nemo.collections.audio.models.SchroedingerBridgeAudioToAudioModel(*args: Any, **kwargs: Any)#

Bases: AudioToAudioModel

This model uses a Schrödinger Bridge process to generate an encoded representation of the enhanced signal.

The model consists of the following blocks:
  • encoder: transforms input audio signal into an encoded representation (analysis transform)

  • estimator: neural model, estimates the coefficients for the SB process

  • noise_schedule: defines the path between the clean and noisy signals

  • sampler: sampler for the reverse process, estimates coefficients of the target signal

  • decoder: transforms sampler output into the time domain (synthesis transform)

References

Schrödinger Bridge for Generative Speech Enhancement, https://arxiv.org/abs/2407.16074

property input_types: Dict[str, NeuralType]#

Define these to enable input neural type checks

property output_types: Dict[str, NeuralType]#

Define these to enable output neural type checks

Modules#

Features#

class nemo.collections.audio.modules.features.SpectrogramToMultichannelFeatures(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Convert a complex-valued multi-channel spectrogram to multichannel features.

Parameters:
  • num_subbands – Expected number of subbands in the input signal

  • num_input_channels – Optional, provides the number of channels of the input signal. Used to infer the number of output channels.

  • mag_reduction – Reduction across channels. Default None, will calculate magnitude of each channel.

  • mag_power – Optional, apply power on the magnitude.

  • use_ipd – Use inter-channel phase difference (IPD).

  • mag_normalization – Normalization for magnitude features

  • ipd_normalization – Normalization for IPD features

  • eps – Small regularization constant.

forward(
input: torch.Tensor,
input_length: torch.Tensor,
) torch.Tensor#

Convert input batch of C-channel spectrograms into a batch of time-frequency features with dimension num_feat. The output number of channels may be the same as input, or reduced to 1, e.g., if averaging over magnitude and not appending individual IPDs.

Parameters:
  • input – Spectrogram for C channels with F subbands and N time frames, (B, C, F, N)

  • input_length – Length of valid entries along the time dimension, shape (B,)

Returns:

num_feat_channels channels with num_feat features, shape (B, num_feat_channels, num_feat, N)
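
A minimal shape-oriented sketch of using this module; the constructor values and random data are illustrative only:

import torch
from nemo.collections.audio.modules.features import SpectrogramToMultichannelFeatures

feats = SpectrogramToMultichannelFeatures(
    num_subbands=257,
    num_input_channels=4,
    mag_reduction=None,  # keep per-channel magnitudes
    use_ipd=True,        # append inter-channel phase differences
)

spec = torch.randn(2, 4, 257, 100, dtype=torch.cfloat)  # (B, C, F, N) complex spectrogram
spec_len = torch.tensor([100, 80])                       # valid frames per example

out = feats(input=spec, input_length=spec_len)           # (B, num_feat_channels, num_feat, N)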

classmethod get_mean_std_time_channel(
input: torch.Tensor,
input_length: torch.Tensor | None = None,
eps: float = 1e-10,
) torch.Tensor#

Calculate mean and standard deviation across time and channel dimensions.

Parameters:
  • input – tensor with shape (B, C, F, T)

  • input_length – tensor with shape (B,)

Returns:

Mean and standard deviation of the input calculated across time and channel dimension, each with shape (B, 1, F, 1).

static get_mean_time_channel(
input: torch.Tensor,
input_length: torch.Tensor | None = None,
) torch.Tensor#

Calculate mean across time and channel dimensions.

Parameters:
  • input – tensor with shape (B, C, F, T)

  • input_length – tensor with shape (B,)

Returns:

Mean of input calculated across time and channel dimension with shape (B, 1, F, 1)

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

normalize_mean(
input: torch.Tensor,
input_length: torch.Tensor,
) torch.Tensor#

Mean normalization for the input tensor.

Parameters:
  • input – input tensor

  • input_length – valid length for each example

Returns:

Mean normalized input.

normalize_mean_var(
input: torch.Tensor,
input_length: torch.Tensor,
) torch.Tensor#

Mean and variance normalization for the input tensor.

Parameters:
  • input – input tensor

  • input_length – valid length for each example

Returns:

Mean and variance normalized input.

property num_channels: int#

Configured number of channels

property num_features: int#

Configured number of features

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

Masking#

class nemo.collections.audio.modules.masking.MaskEstimatorRNN(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Estimate num_outputs masks from the input spectrogram using stacked RNNs and projections.

The module is structured as follows:
input -> spatial features -> input projection -> stacked RNNs -> output projection for each output -> sigmoid

Reference:

Multi-microphone neural speech separation for far-field multi-talker speech recognition (https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8462081)

Parameters:
  • num_outputs – Number of output masks to estimate

  • num_subbands – Number of subbands of the input spectrogram

  • num_features – Number of features after the input projections

  • num_layers – Number of RNN layers

  • num_hidden_features – Number of hidden features in RNN layers

  • num_input_channels – Number of input channels

  • dropout – If non-zero, introduces dropout on the outputs of each RNN layer except the last layer, with dropout probability equal to dropout. Default: 0

  • bidirectional – If True, use bidirectional RNN.

  • rnn_type – Type of RNN, either lstm or gru. Default: lstm

  • mag_reduction – Channel-wise reduction for magnitude features

  • use_ipd – Use inter-channel phase difference (IPD) features

forward(
input: torch.Tensor,
input_length: torch.Tensor,
) Tuple[torch.Tensor, torch.Tensor]#

Estimate num_outputs masks from the input spectrogram.

Parameters:
  • input – C-channel input, shape (B, C, F, N)

  • input_length – Length of valid entries along the time dimension, shape (B,)

Returns:

Returns num_outputs masks in a tensor, shape (B, num_outputs, F, N), and output length with shape (B,)

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.modules.masking.MaskEstimatorFlexChannels(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Estimate num_outputs masks from the input spectrogram using stacked channel-wise and temporal layers.

This model uses interleaved channel blocks and temporal blocks and can process an arbitrary number of input channels. The default channel block is the transform-average-concatenate layer and the default temporal block is the Conformer encoder. Reduction from a multichannel signal to a single-channel signal is performed after channel_reduction_position blocks; only temporal blocks are used afterwards. After the sequence of blocks, the output mask is computed using an additional output temporal layer and a nonlinearity.

References

  • Yoshioka et al, VarArray: Array-Geometry-Agnostic Continuous Speech Separation, 2022

  • Jukić et al, Flexible multichannel speech enhancement for noise-robust frontend, 2023

Parameters:
  • num_outputs – Number of output masks.

  • num_subbands – Number of subbands on the input spectrogram.

  • num_blocks – Number of blocks in the model.

  • channel_reduction_position – After this block, the signal will be reduced across channels.

  • channel_reduction_type – Reduction across channels: ‘average’ or ‘attention’

  • channel_block_type – Block for channel processing: ‘transform_average_concatenate’ or ‘transform_attend_concatenate’

  • temporal_block_type – Block for temporal processing: ‘conformer_encoder’

  • temporal_block_num_layers – Number of layers for the temporal block

  • temporal_block_num_heads – Number of heads for the temporal block

  • temporal_block_dimension – The hidden size of the model

  • temporal_block_self_attention_model – Self attention model for the temporal block

  • temporal_block_att_context_size – Attention context size for the temporal block

  • mag_reduction – Channel-wise reduction for magnitude features

  • mag_power – Power to apply on magnitude features

  • use_ipd – Use inter-channel phase difference (IPD) features

  • mag_normalization – Normalize using mean (‘mean’) or mean and variance (‘mean_var’)

  • ipd_normalization – Normalize using mean (‘mean’) or mean and variance (‘mean_var’)

forward(
input: torch.Tensor,
input_length: torch.Tensor,
) Tuple[torch.Tensor, torch.Tensor]#

Estimate num_outputs masks from the input spectrogram.

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.modules.masking.MaskEstimatorGSS(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Estimate masks using guided source separation with a complex angular Central Gaussian Mixture Model (cACGMM) [1].

This module corresponds to GSS in Fig. 2 in [2].

Notation is approximately following [1], where gamma denotes the time-frequency mask, alpha denotes the mixture weights, and BM denotes the shape matrix. Additionally, the provided source activity is denoted as activity.

Parameters:
  • num_iterations – Number of iterations for the EM algorithm

  • eps – Small value for regularization

  • dtype – Data type for internal computations (default torch.cdouble)

References

[1] Ito et al., Complex Angular Central Gaussian Mixture Model for Directional Statistics in Mask-Based Microphone Array Signal Processing, 2016

[2] Boeddeker et al., Front-End Processing for the CHiME-5 Dinner Party Scenario, 2018

forward(
input: torch.Tensor,
activity: torch.Tensor,
) torch.Tensor#

Apply GSS to estimate the time-frequency masks for each output source.

Parameters:
  • input – batched C-channel input signal, shape (B, num_inputs, F, T)

  • activity – batched frame-wise activity for each output source, shape (B, num_outputs, T)

Returns:

Masks for the components of the model, shape (B, num_outputs, F, T)
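
A small usage sketch with random data, only to illustrate the expected shapes; the number of iterations and the activity pattern are illustrative:

import torch
from nemo.collections.audio.modules.masking import MaskEstimatorGSS

gss = MaskEstimatorGSS(num_iterations=3)

spec = torch.randn(1, 4, 257, 200, dtype=torch.cdouble)  # (B, num_inputs, F, T)
activity = torch.zeros(1, 2, 200)                         # (B, num_outputs, T) frame-wise activity
activity[0, 0, :120] = 1.0                                # source 1 active in the first 120 frames
activity[0, 1, 80:] = 1.0                                 # source 2 active in the last 120 frames

masks = gss(input=spec, activity=activity)                # (B, num_outputs, F, T)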

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

normalize(
x: torch.Tensor,
dim: int = 1,
) torch.Tensor#

Normalize input to have a unit L2-norm across dim. By default, normalizes across the input channels.

Parameters:
  • x – C-channel input signal, shape (B, C, F, T)

  • dim – Dimension for normalization; defaults to 1, i.e., the channel dimension for input with shape (B, C, F, T)

Returns:

Normalized signal, shape (B, C, F, T)

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

update_masks(
alpha: torch.Tensor,
activity: torch.Tensor,
log_pdf: torch.Tensor,
) torch.Tensor#

Update masks for the cACGMM.

Parameters:
  • alpha – component weights, shape (B, num_outputs, F)

  • activity – temporal activity for the components, shape (B, num_outputs, T)

  • log_pdf – logarithm of the PDF, shape (B, num_outputs, F, T)

Returns:

Masks for the components of the model, shape (B, num_outputs, F, T)

update_pdf(
z: torch.Tensor,
gamma: torch.Tensor,
zH_invBM_z: torch.Tensor,
) Tuple[torch.Tensor, torch.Tensor]#

Update PDF of the cACGMM.

Parameters:
  • z – directional statistics, shape (B, num_inputs, F, T)

  • gamma – masks, shape (B, num_outputs, F, T)

  • zH_invBM_z – energy weighted by shape matrices, shape (B, num_outputs, F, T)

Returns:

Logarithm of the PDF, shape (B, num_outputs, F, T), the energy term, shape (B, num_outputs, F, T)

update_weights(gamma: torch.Tensor) torch.Tensor#

Update weights for the individual components in the mixture model.

Parameters:

gamma – masks, shape (B, num_outputs, F, T)

Returns:

Component weights, shape (B, num_outputs, F)

class nemo.collections.audio.modules.masking.MaskReferenceChannel(*args: Any, **kwargs: Any)#

Bases: NeuralModule

A simple mask processor which applies a mask on the ref_channel of the input signal.

Parameters:
  • ref_channel – Index of the reference channel.

  • mask_min_db – Threshold mask to a minimal value before applying it, defaults to -200dB

  • mask_max_db – Threshold mask to a maximal value before applying it, defaults to 0dB

forward(
input: torch.Tensor,
input_length: torch.Tensor,
mask: torch.Tensor,
) Tuple[torch.Tensor, torch.Tensor]#

Apply the mask on the ref_channel of the input signal. This can be used to generate a multi-channel output: if the mask has M channels, the output will have M channels as well.

Parameters:
  • input – Input signal complex-valued spectrogram, shape (B, C, F, N)

  • input_length – Length of valid entries along the time dimension, shape (B,)

  • mask – Mask for M outputs, shape (B, M, F, N)

Returns:

M-channel output complex-valued spectrogram with shape (B, M, F, N)

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.modules.masking.MaskBasedBeamformer(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Multi-channel processor using masks to estimate signal statistics.

Parameters:
  • filter_type – string denoting the type of the filter. Defaults to mvdr

  • filter_beta – Parameter of the parametric multichannel Wiener filter

  • filter_rank – Parameter of the parametric multichannel Wiener filter

  • filter_postfilter – Optional, postprocessing of the filter

  • ref_channel – Optional, reference channel. If None, it will be estimated automatically

  • ref_hard – If true, hard (one-hot) reference. If false, a soft reference

  • ref_hard_use_grad – If true, use straight-through gradient when using the hard reference

  • ref_subband_weighting – If true, use subband weighting when estimating reference channel

  • num_subbands – Optional, used to determine the parameter size for reference estimation

  • mask_min_db – Threshold mask to a minimal value before applying it, defaults to -200dB

  • mask_max_db – Threshold mask to a maximal value before applying it, defaults to 0dB

  • diag_reg – Optional, diagonal regularization for the multichannel filter

  • eps – Small regularization constant to avoid division by zero

forward(
input: torch.Tensor,
mask: torch.Tensor,
mask_undesired: torch.Tensor | None = None,
input_length: torch.Tensor | None = None,
) torch.Tensor#

Apply a mask-based beamformer to the input spectrogram. This can be used to generate multi-channel output. If the mask has multiple channels, a multichannel filter is created for each mask, and the output is a concatenation of the individual outputs along the channel dimension. The total number of outputs is num_masks * M, where M is the number of channels at the filter output.

Parameters:
  • input – Input signal complex-valued spectrogram, shape (B, C, F, N)

  • mask – Mask for M output signals, shape (B, num_masks, F, N)

  • input_length – Length of valid entries along the time dimension, shape (B,)

Returns:

Multichannel output signal complex-valued spectrogram, shape (B, num_masks * M, F, N)
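
A usage sketch with random tensors, assuming the default MVDR filter; shapes follow the descriptions above:

import torch
from nemo.collections.audio.modules.masking import MaskBasedBeamformer

beamformer = MaskBasedBeamformer()  # filter_type defaults to 'mvdr'

spec = torch.randn(1, 4, 257, 100, dtype=torch.cfloat)  # (B, C, F, N) input spectrogram
mask = torch.rand(1, 1, 257, 100)                        # (B, num_masks, F, N) mask for one output
spec_len = torch.tensor([100])

out = beamformer(input=spec, mask=mask, input_length=spec_len)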

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.modules.masking.MaskBasedDereverbWPE(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Multi-channel linear prediction-based dereverberation using weighted prediction error for filter estimation.

An optional mask to estimate the signal power can be provided. If a time-frequency mask is not provided, the algorithm corresponds to the conventional WPE algorithm.

Parameters:
  • filter_length – Length of the convolutional filter for each channel in frames.

  • prediction_delay – Delay of the input signal for multi-channel linear prediction in frames.

  • num_iterations – Number of iterations for reweighting

  • mask_min_db – Threshold mask to a minimal value before applying it, defaults to -200dB

  • mask_max_db – Threshold mask to a maximal value before applying it, defaults to 0dB

  • diag_reg – Diagonal regularization for WPE

  • eps – Small regularization constant

  • dtype – Data type for internal computations

References

  • Kinoshita et al, Neural network-based spectrum estimation for online WPE dereverberation, 2017

  • Yoshioka and Nakatani, Generalization of Multi-Channel Linear Prediction Methods for Blind MIMO Impulse Response Shortening, 2012

forward(
input: torch.Tensor,
input_length: torch.Tensor | None = None,
mask: torch.Tensor | None = None,
) torch.Tensor#

Given an input signal input, apply the WPE dereverberation algorithm.

Parameters:
  • input – C-channel complex-valued spectrogram, shape (B, C, F, T)

  • input_length – Optional length for each signal in the batch, shape (B,)

  • mask – Optional mask, shape (B, 1, F, T) or (B, C, F, T)

Returns:

Processed tensor with the same number of channels as the input, shape (B, C, F, T).
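
A dereverberation sketch without a mask (i.e., conventional WPE); the filter settings and random data are illustrative:

import torch
from nemo.collections.audio.modules.masking import MaskBasedDereverbWPE

wpe = MaskBasedDereverbWPE(filter_length=10, prediction_delay=3, num_iterations=2)

spec = torch.randn(1, 2, 257, 200, dtype=torch.cfloat)  # (B, C, F, T) complex spectrogram
spec_len = torch.tensor([200])

out = wpe(input=spec, input_length=spec_len)             # same shape as the input, (B, C, F, T)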

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

Projections#

class nemo.collections.audio.modules.projections.MixtureConsistencyProjection(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Ensure estimated sources are consistent with the input mixture. Note that the input mixture is assumed to be a single-channel signal.

Parameters:
  • weighting – Optional weighting mode for the consistency constraint. If None, use uniform weighting. If power, use the power of the estimated source as the weight.

  • eps – Small positive value for regularization

Reference:

Wisdom et al, Differentiable consistency constraints for improved deep speech enhancement, 2018

forward(
mixture: torch.Tensor,
estimate: torch.Tensor,
) torch.Tensor#

Enforce mixture consistency on the estimated sources.

Parameters:
  • mixture – Single-channel mixture, shape (B, 1, F, N)

  • estimate – M estimated sources, shape (B, M, F, N)

Returns:

Source estimates consistent with the mixture, shape (B, M, F, N)
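
A minimal sketch with random spectrograms; after the projection, the source estimates sum to the input mixture (up to numerical precision):

import torch
from nemo.collections.audio.modules.projections import MixtureConsistencyProjection

projection = MixtureConsistencyProjection()

mixture = torch.randn(2, 1, 257, 100, dtype=torch.cfloat)   # (B, 1, F, N) single-channel mixture
estimate = torch.randn(2, 3, 257, 100, dtype=torch.cfloat)  # (B, M, F, N) estimates for M=3 sources

consistent = projection(mixture=mixture, estimate=estimate)  # (B, M, F, N)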

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

SSL Pretraining#

class nemo.collections.audio.modules.ssl_pretrain_masking.SSLPretrainWithMaskedPatch(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Zeroes out fixed-size time patches of the spectrogram. All samples in a batch are guaranteed to have the same number of masked time steps. Note that this may be problematic when pretraining on an unbalanced dataset.

For example, say a batch contains two spectrograms of length 87 and 276. With mask_fraction=0.7 and patch_size=10, we obtain mask_patches=7. Each of the two examples will then have 7 masked patches of 10 frames (see the arithmetic sketch after the parameter list).

Parameters:
  • patch_size (int) – maximum number of time steps in one patch. Defaults to 10.

  • mask_fraction (float) – fraction of each sample to be masked (the number of patches is rounded up). Range from 0.0 to 1.0. Defaults to 0.7.
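
A worked example of the patch-count arithmetic from the description above, assuming the number of patches is derived from the shortest spectrogram in the batch and rounded up:

import math

patch_size = 10
mask_fraction = 0.7
lengths = [87, 276]  # spectrogram lengths in the batch (from the example above)

# assumed rule: patches are computed from the shortest example and rounded up
mask_patches = math.ceil(min(lengths) * mask_fraction / patch_size)
print(mask_patches)  # 7, i.e., 7 patches of 10 masked frames for every example in the batch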

forward(input_spec, length)#

Apply Patched masking on the input_spec.

During the training stage, the mask is generated randomly, with approximately self.mask_fraction of the time frames being masked out.

In the validation stage, the masking pattern is fixed to ensure consistent evaluation of checkpoints and to prevent overfitting. Note that the same masking pattern is applied to all data, regardless of their lengths. On average, approximately self.mask_fraction of the time frames will be masked out.

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

Transforms#

class nemo.collections.audio.modules.transforms.AudioToSpectrogram(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Transform a batch of input multi-channel signals into a batch of STFT-based spectrograms.

Parameters:
  • fft_length – length of FFT

  • hop_length – length of hops/shifts of the sliding window

  • power – exponent for magnitude spectrogram. Default None will return a complex-valued spectrogram

  • magnitude_power – Transform magnitude of the spectrogram as x^magnitude_power.

  • scale – Positive scaling of the spectrogram.

forward(
input: torch.Tensor,
input_length: torch.Tensor | None = None,
) Tuple[torch.Tensor, torch.Tensor]#

Convert a batch of C-channel input signals into a batch of complex-valued spectrograms.

Parameters:
  • input – Time-domain input signal with C channels, shape (B, C, T)

  • input_length – Length of valid entries along the time dimension, shape (B,)

Returns:

Output spectrogram with F subbands and N time frames, shape (B, C, F, N) and output length with shape (B,).

get_output_length(
input_length: torch.Tensor,
) torch.Tensor#

Get length of valid frames for the output.

Parameters:

input_length – number of valid samples, shape (B,)

Returns:

Number of valid frames, shape (B,)

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

stft(x: torch.Tensor)#

Apply STFT as in torchaudio.transforms.Spectrogram(power=None)

Parameters:

x – Input time-domain signal, shape (…, T)

Returns:

Complex-valued spectrogram x_spec = STFT(x), shape (…, F, N).

class nemo.collections.audio.modules.transforms.SpectrogramToAudio(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Transform a batch of input multi-channel spectrograms into a batch of time-domain multi-channel signals.

Parameters:
  • fft_length – length of FFT

  • hop_length – length of hops/shifts of the sliding window

  • magnitude_power – Transform magnitude of the spectrogram as x^(1/magnitude_power).

  • scale – Spectrogram will be scaled with 1/scale before the inverse transform.

forward(
input: torch.Tensor,
input_length: torch.Tensor | None = None,
) torch.Tensor#

Convert input complex-valued spectrogram to a time-domain signal. Multi-channel IO is supported.

Parameters:
  • input – Input spectrogram for C channels, shape (B, C, F, N)

  • input_length – Length of valid entries along the time dimension, shape (B,)

Returns:

Time-domain signal with T time-domain samples and C channels, (B, C, T) and output length with shape (B,).
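
A round-trip sketch combining AudioToSpectrogram and SpectrogramToAudio; the FFT settings and signal lengths are illustrative, and the returned values follow the Returns descriptions above:

import torch
from nemo.collections.audio.modules.transforms import AudioToSpectrogram, SpectrogramToAudio

analysis = AudioToSpectrogram(fft_length=512, hop_length=128)
synthesis = SpectrogramToAudio(fft_length=512, hop_length=128)

audio = torch.randn(2, 1, 16000)           # (B, C, T), e.g., 1 second at 16 kHz
audio_len = torch.tensor([16000, 12000])   # valid samples per example

spec, spec_len = analysis(input=audio, input_length=audio_len)            # (B, C, F, N), (B,)
audio_hat, audio_hat_len = synthesis(input=spec, input_length=spec_len)   # (B, C, T'), (B,)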

get_output_length(
input_length: torch.Tensor,
) torch.Tensor#

Get length of valid samples for the output.

Parameters:

input_length – number of valid frames, shape (B,)

Returns:

Number of valid samples, shape (B,)

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

istft(x_spec: torch.Tensor)#

Apply iSTFT as in torchaudio.transforms.InverseSpectrogram

Parameters:

x_spec – Input complex-valued spectrogram, shape (…, F, N)

Returns:

Time-domain signal x = iSTFT(x_spec), shape (…, T).

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

Parts#

Submodules: Diffusion#

class nemo.collections.audio.parts.submodules.diffusion.StochasticDifferentialEquation(*args: Any, **kwargs: Any)#

Bases: NeuralModule, ABC

Base class for stochastic differential equations.

abstract coefficients(
state: torch.Tensor,
time: torch.Tensor,
**kwargs,
) Tuple[torch.Tensor, torch.Tensor]#
Parameters:
  • state – tensor of shape (B, C, D, T)

  • time – tensor of shape (B,)

Returns:

Tuple with drift and diffusion coefficients.

abstract copy()#

Create a copy of this SDE.

discretize(
*,
state: torch.Tensor,
time: torch.Tensor,
state_length: torch.Tensor | None = None,
**kwargs,
) Tuple[torch.Tensor, torch.Tensor]#

Assume we have the following SDE:

dx = drift(x, t) * dt + diffusion(x, t) * dwt

where wt is the standard Wiener process.

We assume the following discretization:

new_state = current_state + total_drift + total_diffusion * z_norm

where z_norm is sampled from normal distribution with zero mean and unit variance.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • state_length – length of the valid time steps for each example in the batch, shape (B,)

  • **kwargs – other parameters

Returns:

Drift and diffusion.
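
The returned terms can be combined into the update described above; a generic sketch of that step (not the exact NeMo implementation) is:

import torch

def discretized_step(state: torch.Tensor, total_drift: torch.Tensor, total_diffusion: torch.Tensor) -> torch.Tensor:
    # new_state = current_state + total_drift + total_diffusion * z_norm, with z_norm ~ N(0, I)
    z_norm = torch.randn_like(state)
    return state + total_drift + total_diffusion * z_norm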

property dt: float#

Time step for this SDE. This denotes the step size between 0 and self.time_max when using self.num_steps.

generate_time(
size: int,
device: torch.device,
) torch.Tensor#

Generate random time steps in the valid range.

Time steps are generated between self.time_min and self.time_max.

Parameters:
  • size – number of samples

  • device – device to use

Returns:

A tensor of floats with shape (size,)

prior_sampling(
prior_mean: torch.Tensor,
) torch.Tensor#

Generate a sample from the prior distribution p_T.

Parameters:

prior_mean – Mean of the prior distribution

Returns:

A sample from the prior distribution.

property time_delta: float#

Time range for this SDE.

class nemo.collections.audio.parts.submodules.diffusion.OrnsteinUhlenbeckVarianceExplodingSDE(*args: Any, **kwargs: Any)#

Bases: StochasticDifferentialEquation

This class implements the Ornstein-Uhlenbeck SDE with variance exploding noise schedule.

The SDE is given by:

dx = theta * (y - x) dt + g(t) dw

where theta is the stiffness parameter and g(t) is the diffusion coefficient:

g(t) = std_min * (std_max/std_min)^t * sqrt(2 * log(std_max/std_min))

References

Richter et al., Speech Enhancement and Dereverberation with Diffusion-based Generative Models, Tr. ASLP 2023
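
As a quick illustration of the noise schedule, the diffusion coefficient g(t) from the formula above can be evaluated as follows; the std_min and std_max values are illustrative:

import math

def diffusion_coefficient(t: float, std_min: float = 0.05, std_max: float = 0.5) -> float:
    # g(t) = std_min * (std_max / std_min)^t * sqrt(2 * log(std_max / std_min))
    ratio = std_max / std_min
    return std_min * ratio**t * math.sqrt(2.0 * math.log(ratio))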

coefficients(
state: torch.Tensor,
time: torch.Tensor,
prior_mean: torch.Tensor,
state_length: torch.Tensor | None = None,
) Tuple[torch.Tensor, torch.Tensor]#

Compute drift and diffusion coefficients for this SDE.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • prior_mean – mean of the prior distribution

  • state_length – length of the valid time steps for each example in the batch

Returns:

Drift and diffusion coefficients.

copy()#

Create a copy of this SDE.

perturb_kernel_mean(
state: torch.Tensor,
prior_mean: torch.Tensor,
time: torch.Tensor,
) torch.Tensor#

Return the mean of the perturbation kernel for this SDE.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • prior_mean – mean of the prior distribution

  • time – current time of the process, shape (B,)

Returns:

A tensor of shape (B, C, D, T)

perturb_kernel_params(
state: torch.Tensor,
prior_mean: torch.Tensor,
time: torch.Tensor,
) torch.Tensor#

Return the mean and standard deviation of the perturbation kernel for this SDE.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • prior_mean – mean of the prior distribution

  • time – current time of the process, shape (B,)

perturb_kernel_std(
time: torch.Tensor,
) torch.Tensor#

Return the standard deviation of the perturbation kernel for this SDE.

Note that the standard deviation depends on the time and the noise schedule, which is parametrized using self.stiffness, self.std_min and self.std_max.

Parameters:

time – current time of the process, shape (B,)

Returns:

A tensor of shape (B,)

prior_sampling(
prior_mean: torch.Tensor,
) torch.Tensor#

Generate a sample from the prior distribution p_T.

Parameters:

prior_mean – Mean of the prior distribution

class nemo.collections.audio.parts.submodules.diffusion.ReverseStochasticDifferentialEquation(*args: Any, **kwargs: Any)#

Bases: StochasticDifferentialEquation

coefficients(
state: torch.Tensor,
time: torch.Tensor,
score_condition: torch.Tensor | None = None,
state_length: torch.Tensor | None = None,
**kwargs,
) Tuple[torch.Tensor, torch.Tensor]#

Compute drift and diffusion coefficients for the reverse SDE.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

copy()#

Create a copy of this SDE.

discretize(
*,
state: torch.Tensor,
time: torch.Tensor,
score_condition: torch.Tensor | None = None,
state_length: torch.Tensor | None = None,
**kwargs,
) Tuple[torch.Tensor, torch.Tensor]#

Discretize the reverse SDE.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • score_condition – condition for the score estimator

  • state_length – length of the valid time steps for each example in the batch

  • **kwargs – other parameters for discretization of the forward SDE

prior_sampling(
shape: torch.Size,
device: torch.device,
) torch.Tensor#

Prior sampling is not necessary for the reverse SDE.

class nemo.collections.audio.parts.submodules.diffusion.PredictorCorrectorSampler(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Predictor-Corrector sampler for the reverse SDE.

Parameters:
  • sde – forward SDE

  • score_estimator – neural score estimator

  • predictor – predictor for the reverse process

  • corrector – corrector for the reverse process

  • num_steps – number of time steps for the reverse process

  • num_corrector_steps – number of corrector steps

  • time_max – maximum time

  • time_min – minimum time

  • snr – SNR for Annealed Langevin Dynamics

  • output_type – type of the output (‘state’ for the final state, or ‘mean’ for the mean of the final state)

References

  • Song et al., Score-based generative modeling through stochastic differential equations, 2021

class nemo.collections.audio.parts.submodules.diffusion.Predictor(*args: Any, **kwargs: Any)#

Bases: Module, ABC

Predictor for the reverse process.

Parameters:
  • sde – forward SDE

  • score_estimator – neural score estimator

abstract forward(
*,
state: torch.Tensor,
time: torch.Tensor,
score_condition: torch.Tensor | None = None,
state_length: torch.Tensor | None = None,
**kwargs,
)#

Predict the next state of the reverse process.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • score_condition – conditioning for the score estimator

  • state_length – length of the valid time steps for each example in the batch

Returns:

New state and mean.

class nemo.collections.audio.parts.submodules.diffusion.ReverseDiffusionPredictor(*args: Any, **kwargs: Any)#

Bases: Predictor

Predict the next state of the reverse process using the reverse diffusion process.

Parameters:
  • sde – forward SDE

  • score_estimator – neural score estimator

forward(
*,
state,
time,
score_condition=None,
state_length=None,
**kwargs,
)#

Predict the next state of the reverse process using the reverse diffusion process.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • score_condition – conditioning for the score estimator

  • state_length – length of the valid time steps for each example in the batch

Returns:

New state and mean of the diffusion process.

class nemo.collections.audio.parts.submodules.diffusion.Corrector(*args: Any, **kwargs: Any)#

Bases: NeuralModule, ABC

Corrector for the reverse process.

Parameters:
  • sde – forward SDE

  • score_estimator – neural score estimator

  • snr – SNR for Annealed Langevin Dynamics

  • num_steps – number of steps for the corrector

class nemo.collections.audio.parts.submodules.diffusion.AnnealedLangevinDynamics(*args: Any, **kwargs: Any)#

Bases: Corrector

Annealed Langevin Dynamics for the reverse process.

References

  • Song et al., Score-based generative modeling through stochastic differential equations, 2021

forward(
state,
time,
score_condition=None,
state_length=None,
)#

Correct the state using Annealed Langevin Dynamics.

Parameters:
  • state – current state of the process, shape (B, C, D, T)

  • time – current time of the process, shape (B,)

  • score_condition – conditioning for the score estimator

  • state_length – length of the valid time steps for each example in the batch

Returns:

New state and mean of the diffusion process.

References

Alg. 4 in http://arxiv.org/abs/2011.13456

Submodules: Flow#

class nemo.collections.audio.parts.submodules.flow.ConditionalFlow(time_min: float = 1e-08, time_max: float = 1.0)#

Bases: ABC

Abstract class for different conditional flow-matching (CFM) classes

Time horizon is [time_min, time_max] (time_max should be 1).

Every path is “conditioned” on the endpoints of the path; the endpoints are just our paired data samples. Subclasses need to implement mean, std, and vector_field.

flow(
*,
time: torch.Tensor,
x_start: torch.Tensor,
x_end: torch.Tensor,
point: torch.Tensor,
) torch.Tensor#

Compute the conditional flow phi_t( point | x_start, x_end). This is an affine flow.

generate_time(
batch_size: int,
rng: torch.random.Generator | None = None,
) torch.Tensor#

Randomly sample a batch of time steps from U[self.time_min, self.time_max]. Supports an external random number generator for better reproducibility.

abstract mean(
*,
time: torch.Tensor,
x_start: torch.Tensor,
x_end: torch.Tensor,
) torch.Tensor#

Return the mean of p_t(x | x_start, x_end) at time t

sample(
*,
time: torch.Tensor,
x_start: torch.Tensor,
x_end: torch.Tensor,
) torch.Tensor#

Generate a sample from p_t(x | x_start, x_end) at time t. Note that this implementation assumes all path marginals are normally distributed.

abstract std(
*,
time: torch.Tensor,
x_start: torch.Tensor,
x_end: torch.Tensor,
) torch.Tensor#

Return the standard deviation of p_t(x | x_start, x_end) at time t

abstract vector_field(
*,
time: torch.Tensor,
x_start: torch.Tensor,
x_end: torch.Tensor,
point: torch.Tensor,
) torch.Tensor#

Compute the conditional vector field v_t( point | x_start, x_end)

class nemo.collections.audio.parts.submodules.flow.OptimalTransportFlow(
time_min: float = 1e-08,
time_max: float = 1.0,
sigma_start: float = 1.0,
sigma_end: float = 0.0001,
)#

Bases: ConditionalFlow

The OT-CFM model from [Lipman et al., 2023].

For every conditional path, the following holds: p_0 = N(x_start, sigma_start), p_1 = N(x_end, sigma_end)

mean(x, t) = (time_max - t) * x_start + t * x_end (linear interpolation between x_start and x_end)

std(x, t) = (time_max - t) * sigma_start + t * sigma_end

Every conditional path is an optimal transport map from p_0(x_start, x_end) to p_1(x_start, x_end). The marginal path is not guaranteed to be an optimal transport map from p_0 to p_1.

To get the OT-CFM model from [Lipman et al., 2023], just pass zeros for x_start. To get the I-CFM model, set sigma_start=sigma_end. To get the rectified flow model, set sigma_start=sigma_end=0.

Parameters:
  • time_min – minimum time value used in the process

  • time_max – maximum time value used in the process

  • sigma_start – the standard deviation of the initial distribution

  • sigma_end – the standard deviation of the target distribution
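
The mean and std interpolation described above can be sketched directly; time_max is assumed to be 1.0 and the sigma defaults mirror the signature:

import torch

def ot_mean(x_start: torch.Tensor, x_end: torch.Tensor, t: float, time_max: float = 1.0) -> torch.Tensor:
    # linear interpolation between x_start and x_end
    return (time_max - t) * x_start + t * x_end

def ot_std(t: float, sigma_start: float = 1.0, sigma_end: float = 1e-4, time_max: float = 1.0) -> float:
    return (time_max - t) * sigma_start + t * sigma_end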

mean(
*,
x_start: torch.Tensor,
x_end: torch.Tensor,
time: torch.Tensor,
) torch.Tensor#

Return the mean of p_t(x | x_start, x_end) at time t

std(
*,
x_start: torch.Tensor,
x_end: torch.Tensor,
time: torch.Tensor,
) torch.Tensor#

Return the standard deviation of p_t(x | x_start, x_end) at time t

vector_field(
*,
x_start: torch.Tensor,
x_end: torch.Tensor,
time: torch.Tensor,
point: torch.Tensor,
eps: float = 1e-06,
) torch.Tensor#

Compute the conditional vector field v_t( point | x_start, x_end)

class nemo.collections.audio.parts.submodules.flow.ConditionalFlowMatchingSampler(
estimator: torch.nn.Module,
num_steps: int = 5,
time_min: float = 1e-08,
time_max: float = 1.0,
)#

Bases: ABC

Abstract class for different samplers that solve the ODE in CFM

Parameters:
  • estimator – the NN-based conditional vector field estimator

  • num_steps – How many time steps to iterate in the process

  • time_min – minimum time value used in the process

  • time_max – maximum time value used in the process

class nemo.collections.audio.parts.submodules.flow.ConditionalFlowMatchingEulerSampler(
estimator: torch.nn.Module,
num_steps: int = 5,
time_min: float = 1e-08,
time_max: float = 1.0,
)#

Bases: ConditionalFlowMatchingSampler

The Euler Sampler for solving the ODE in CFM on a uniform time grid
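
Conceptually, the sampler performs Euler integration of the ODE on a uniform time grid; a generic sketch (the estimator call signature is assumed, not the exact NeMo interface) is:

import torch

def euler_integrate(estimator, state: torch.Tensor, time_min: float = 1e-8, time_max: float = 1.0, num_steps: int = 5) -> torch.Tensor:
    # uniform time grid between time_min and time_max
    times = torch.linspace(time_min, time_max, num_steps + 1)
    dt = (time_max - time_min) / num_steps
    for t in times[:-1]:
        # assumed estimator interface: returns the vector field at (state, t)
        vector_field = estimator(state, t.expand(state.shape[0]))
        state = state + dt * vector_field
    return state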

Submodules: Multichannel#

class nemo.collections.audio.parts.submodules.multichannel.ChannelAugment(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Randomly permutes channels and selects a subset of them.

Parameters:
  • permute_channels (bool) – Apply a random permutation of channels.

  • num_channels_min (int) – Minimum number of channels to select.

  • num_channels_max (int) – Max number of channels to select.

  • rng – Optional, random generator.

  • seed – Optional, seed for the generator.

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.TransformAverageConcatenate(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Apply transform-average-concatenate across channels. We’re using a version from [2].

Parameters:
  • in_features – Number of input features

  • out_features – Number of output features

References

[1] Luo et al, End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation, 2019

[2] Yoshioka et al, VarArray: Array-Geometry-Agnostic Continuous Speech Separation, 2022

forward(
input: torch.Tensor,
) torch.Tensor#
Parameters:

input – shape (B, M, in_features, T)

Returns:

Output tensor with shape (B, M, out_features, T)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.TransformAttendConcatenate(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Apply transform-attend-concatenate across channels. The output is a concatenation of transformed channel and MHA over channels.

Parameters:
  • in_features – Number of input features

  • out_features – Number of output features

  • n_head – Number of heads for the MHA module

  • dropout_rate – Dropout rate for the MHA module

References

  • Jukić et al, Flexible multichannel speech enhancement for noise-robust frontend, 2023

forward(
input: torch.Tensor,
) torch.Tensor#
Parameters:

input – shape (B, M, in_features, T)

Returns:

Output tensor with shape (B, M, out_features, T)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.ChannelAveragePool(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Apply average pooling across channels.

forward(input: torch.Tensor) torch.Tensor#
Parameters:

input – shape (B, M, F, T)

Returns:

Output tensor with shape (B, F, T)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.ChannelAttentionPool(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Use attention pooling to aggregate information across channels. First apply MHA across channels and then apply averaging.

Parameters:
  • in_features – Number of input features

  • out_features – Number of output features

  • n_head – Number of heads for the MHA module

  • dropout_rate – Dropout rate for the MHA module

References

  • Wang et al, Neural speech separation using spatially distributed microphones, 2020

  • Jukić et al, Flexible multichannel speech enhancement for noise-robust frontend, 2023

forward(input: torch.Tensor) torch.Tensor#
Parameters:

input – shape (B, M, F, T)

Returns:

Output tensor with shape (B, F, T)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.ParametricMultichannelWienerFilter(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Parametric multichannel Wiener filter, with an adjustable tradeoff between noise reduction and speech distortion. It supports automatic reference channel selection based on the estimated output SNR.

Parameters:
  • beta – Parameter of the parametric filter, tradeoff between noise reduction and speech distortion (0: MVDR, 1: MWF).

  • rank – Rank assumption for the speech covariance matrix.

  • postfilter – Optional postfilter. If None, no postfilter is applied.

  • ref_channel – Optional, reference channel. If None, it will be estimated automatically.

  • ref_hard – If true, estimate a hard (one-hot) reference. If false, a soft reference.

  • ref_hard_use_grad – If true, use straight-through gradient when using the hard reference

  • ref_subband_weighting – If true, use subband weighting when estimating reference channel

  • num_subbands – Optional, used to determine the parameter size for reference estimation

  • diag_reg – Optional, diagonal regularization for the multichannel filter

  • eps – Small regularization constant to avoid division by zero

References

  • Souden et al, On Optimal Frequency-Domain Multichannel Linear Filtering for Noise Reduction, 2010

apply_ban(
input: torch.Tensor,
filter: torch.Tensor,
psd_n: torch.Tensor,
) torch.Tensor#

Apply blind analytic normalization postfilter. Note that this normalization has been derived for the GEV beamformer in [1]. More specifically, the BAN postfilter aims to scale GEV to satisfy the distortionless constraint and the final analytical expression is derived using an assumption on the norm of the transfer function. However, this may still be useful in some instances.

Parameters:
  • input – batch with M output channels (B, M, F, T)

  • filter – batch of C-input, M-output filters, shape (B, F, C, M)

  • psd_n – batch of noise PSDs, shape (B, F, C, C)

Returns:

Filtered input, shape (B, M, F, T)

References

  • Warsitz and Haeb-Umbach, Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition, 2007

apply_diag_reg(
psd: torch.Tensor,
) torch.Tensor#

Apply diagonal regularization on psd.

Parameters:

psd – tensor, shape (…, C, C)

Returns:

Tensor, same shape as input.

apply_filter(
input: torch.Tensor,
filter: torch.Tensor,
) torch.Tensor#

Apply the MIMO filter on the input.

Parameters:
  • input – batch with C input channels, shape (B, C, F, T)

  • filter – batch of C-input, M-output filters, shape (B, F, C, M)

Returns:

M-channel filter output, shape (B, M, F, T)

forward(
input: torch.Tensor,
mask_s: torch.Tensor,
mask_n: torch.Tensor,
) torch.Tensor#

Return processed signal. The output has either one channel (M=1) if a ref_channel is selected, or the same number of channels as the input (M=C) if ref_channel is None.

Parameters:
  • input – Input signal, complex tensor with shape (B, C, F, T)

  • mask_s – Mask for the desired signal, shape (B, F, T)

  • mask_n – Mask for the undesired noise, shape (B, F, T)

Returns:

Processed signal, shape (B, M, F, T)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

static trace(
x: torch.Tensor,
keepdim: bool = False,
) torch.Tensor#

Calculate trace of matrix slices over the last two dimensions in the input tensor.

Parameters:

x – tensor, shape (…, C, C)

Returns:

Trace for each (C, C) matrix. shape (…)

class nemo.collections.audio.parts.submodules.multichannel.ReferenceChannelEstimatorSNR(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Estimate a reference channel by selecting the reference that maximizes the output SNR. It returns a one-hot encoded vector or a soft reference.

A straight-through estimator is used for gradient when using hard reference.

Parameters:
  • hard – If true, use hard estimate of ref channel. If false, use a soft estimate across channels.

  • hard_use_grad – Use straight-through estimator for the gradient.

  • subband_weighting – If true, use subband weighting when adding across subband SNRs. If false, use average across subbands.

References

Boeddeker et al, Front-End Processing for the CHiME-5 Dinner Party Scenario, 2018

forward(
W: torch.Tensor,
psd_s: torch.Tensor,
psd_n: torch.Tensor,
) torch.Tensor#
Parameters:
  • W – Multichannel-input multichannel-output (MIMO) filter, shape (B, F, C, M), where C is the number of input channels and M is the number of output channels

  • psd_s – Covariance for the signal, shape (B, F, C, C)

  • psd_n – Covariance for the noise, shape (B, F, C, C)

Returns:

One-hot or soft reference channel, shape (B, M)

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.audio.parts.submodules.multichannel.WPEFilter(*args: Any, **kwargs: Any)#

Bases: NeuralModule

A weighted prediction error filter. Given input signal, and expected power of the desired signal, this class estimates a multiple-input multiple-output prediction filter and returns the filtered signal. Currently, estimation of statistics and processing is performed in batch mode.

Parameters:
  • filter_length – Length of the prediction filter in frames, per channel

  • prediction_delay – Prediction delay in frames

  • diag_reg – Diagonal regularization for the correlation matrix Q, applied as diag_reg * trace(Q) + eps

  • eps – Small positive constant for regularization

References

  • Yoshioka and Nakatani, Generalization of Multi-Channel Linear Prediction Methods for Blind MIMO Impulse Response Shortening, 2012

  • Jukić et al, Group sparsity for MIMO speech dereverberation, 2015

apply_filter(
filter: torch.Tensor,
input: torch.Tensor | None = None,
tilde_input: torch.Tensor | None = None,
) torch.Tensor#

Apply a prediction filter filter on the input input as

output(b,f) = tilde{input(b,f)} * filter(b,f)

If available, directly use the convolution matrix tilde_input.

Parameters:
  • input – Input signal, shape (B, C, F, N)

  • tilde_input – Convolution matrix for the input signal, shape (B, C, F, N, filter_length)

  • filter – Prediction filter, shape (B, C, F, C, filter_length)

Returns:

Multi-channel signal obtained by applying the prediction filter on the input signal, same shape as input (B, C, F, N)

classmethod convtensor(
x: torch.Tensor,
filter_length: int,
delay: int = 0,
n_steps: int | None = None,
) torch.Tensor#

Create a tensor equivalent of convmtx_mc for each example in the batch. The input signal tensor x has shape (B, C, F, N). Convtensor returns a view of the input signal x.

Note: We avoid reshaping the output to collapse channels and filter taps into a single dimension, e.g., (B, F, N, -1). In this way, the output is a view of the input, while an additional reshape would result in a contiguous array and more memory use.

Parameters:
  • x – input tensor, shape (B, C, F, N)

  • filter_length – length of the filter, determines the shape of the convolution tensor

  • delay – delay to add to the input signal x before constructing the convolution tensor

  • n_steps – Optional, number of time steps to keep in the output. Defaults to the number of time steps in the input tensor.

Returns:

Return a convolutional tensor with shape (B, C, F, n_steps, filter_length)

estimate_correlations(
input: torch.Tensor,
weight: torch.Tensor,
tilde_input: torch.Tensor,
input_length: torch.Tensor | None = None,
) Tuple[torch.Tensor]#
Parameters:
  • input – Input signal, shape (B, C, F, N)

  • weight – Time-frequency weight, shape (B, F, N)

  • tilde_input – Multi-channel convolution tensor, shape (B, C, F, N, filter_length)

  • input_length – Length of each input example, shape (B)

Returns:

Returns a tuple of correlation matrices for each batch.

Let X denote the input signal in a single subband, tilde{X} the corresponding multi-channel correlation matrix, and w the vector of weights.

The first output is

Q = tilde{X}^H * diag(w) * tilde{X} (1)

for each (b, f). The matrix calculated in (1) has shape (C * filter_length, C * filter_length) The output is returned in a tensor with shape (B, F, C, filter_length, C, filter_length).

The second output is

R = tilde{X}^H * diag(w) * X (2)

for each (b, f). The matrix calculated in (2) has shape (C * filter_length, C) The output is returned in a tensor with shape (B, F, C, filter_length, C). The last dimension corresponds to output channels.

estimate_filter(
Q: torch.Tensor,
R: torch.Tensor,
) torch.Tensor#
Estimate the MIMO prediction filter as

G(b,f) = Q(b,f)^{-1} R(b,f)

i.e., by solving the linear system Q(b,f) G(b,f) = R(b,f)

for each subband in each example in the batch (b, f).

Parameters:
  • Q – shape (B, F, C, filter_length, C, filter_length)

  • R – shape (B, F, C, filter_length, C)

Returns:

Complex-valued prediction filter, shape (B, C, F, C, filter_length)
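
The estimation step amounts to solving the linear system Q G = R per (b, f), with the diagonal regularization described in the class parameters. Below is a self-contained sketch with illustrative values, not the exact implementation.

```python
import torch

C, filter_length = 4, 10
n = C * filter_length
A = torch.randn(n, n, dtype=torch.cfloat)
Q = A @ A.conj().T                                  # Hermitian PSD stand-in for the correlation matrix
R = torch.randn(n, C, dtype=torch.cfloat)

# Diagonal regularization: diag_reg * trace(Q) + eps added to the diagonal.
diag_reg, eps = 1e-6, 1e-8
reg = diag_reg * Q.diagonal().sum().real + eps
G = torch.linalg.solve(Q + reg * torch.eye(n, dtype=Q.dtype), R)  # prediction filter, (n, C)
```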

forward(
input: torch.Tensor,
power: torch.Tensor,
input_length: torch.Tensor | None = None,
) torch.Tensor#

Given input and the predicted power for the desired signal, estimate the WPE filter and return the processed signal.

Parameters:
  • input – Input signal, shape (B, C, F, N)

  • power – Predicted power of the desired signal, shape (B, C, F, N)

  • input_length – Optional, length of valid frames in input. Defaults to None

Returns:

Tuple of (processed_signal, output_length). Processed signal has the same shape as the input signal (B, C, F, N), and the output length is the same as the input length.

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

classmethod permute_convtensor(x: torch.Tensor) torch.Tensor#

Reshape and permute columns to convert the result of convtensor to be equal to convmtx_mc. This is used for verification purposes and is not required for applying the filter.

Parameters:

x – output of self.convtensor, shape (B, C, F, N, filter_length)

Returns:

Output has shape (B, F, N, C*filter_length) that corresponds to the layout of convmtx_mc.

Submodules: NCSN++#

class nemo.collections.audio.parts.submodules.ncsnpp.SpectrogramNoiseConditionalScoreNetworkPlusPlus(
*args: Any,
**kwargs: Any,
)#

Bases: NeuralModule

This model handles complex-valued inputs by stacking their real and imaginary components. The stacked tensor is processed using NCSN++, and the output is projected to generate the real and imaginary components of the output channels.

Parameters:
  • in_channels – number of input complex-valued channels

  • out_channels – number of output complex-valued channels
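
A minimal sketch of the stacking described above, assuming the real and imaginary parts are concatenated along the channel dimension (the module's internal layout may differ):

```python
import torch

B, C, F, T = 2, 1, 256, 128
x = torch.randn(B, C, F, T, dtype=torch.cfloat)     # complex-valued input channels
x_stacked = torch.cat([x.real, x.imag], dim=1)      # (B, 2*C, F, T), real-valued input for NCSN++
# After processing, the output is projected to 2*out_channels real-valued maps and
# recombined into out_channels complex-valued channels.
```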

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.parts.submodules.ncsnpp.NoiseConditionalScoreNetworkPlusPlus(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Implementation of Noise Conditional Score Network (NCSN++) architecture.

References

  • Song et al., Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021

  • Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis, ICLR 2019

forward(
*,
input: torch.Tensor,
input_length: torch.Tensor | None,
condition: torch.Tensor | None = None,
)#

Forward pass of the model.

Parameters:
  • input – input tensor, shape (B, C, D, T)

  • input_length – length of the valid time steps for each example in the batch, shape (B,)

  • condition – scalar condition (time) for the model, will be embedded using self.time_embedding

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

pad_input(
input: torch.Tensor,
) torch.Tensor#

Pad input tensor to match the required dimensions across T and D.

class nemo.collections.audio.parts.submodules.ncsnpp.GaussianFourierProjection(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Gaussian Fourier embeddings for input scalars.

The input scalars are typically time or noise levels.
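
The general form of Gaussian Fourier embeddings can be sketched as below; the scale of the random frequencies and the output layout are assumptions, not necessarily this module's exact implementation.

```python
import torch

dim = 128
W = torch.randn(dim // 2) * 16.0                 # fixed random frequencies (illustrative scale)
t = torch.rand(8)                                # batch of scalars, e.g., diffusion time or noise level
proj = 2 * torch.pi * t[:, None] * W[None, :]
emb = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)  # (8, dim)
```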

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.parts.submodules.ncsnpp.ResnetBlockBigGANPlusPlus(*args: Any, **kwargs: Any)#

Bases: Module

Implementation of a ResNet block for the BigGAN model.

References

  • Song et al., Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021

  • Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis, ICLR 2019

forward(
x: torch.Tensor,
diffusion_time_embedding: torch.Tensor | None = None,
)#

Forward pass of the model.

Parameters:
  • x – input tensor

  • diffusion_time_embedding – embedding of the diffusion time step

Returns:

Output tensor

init_weights_()#

Weight initialization

Submodules: Schrödinger Bridge#

class nemo.collections.audio.parts.submodules.schroedinger_bridge.SBNoiseSchedule(*args: Any, **kwargs: Any)#

Bases: NeuralModule, ABC

Noise schedule for the Schrödinger Bridge

Parameters:
  • time_min – minimum time for the process

  • time_max – maximum time for the process

  • num_steps – number of steps for the process

  • eps – small regularization

References

Schrödinger Bridge for Generative Speech Enhancement, https://arxiv.org/abs/2407.16074

abstract alpha(time: torch.Tensor) torch.Tensor#

Return alpha for SB noise schedule.

alpha_t = exp( int_0^t f(s) ds )

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing alpha for each time.

alpha_bar_from_alpha(
alpha: torch.Tensor,
)#

Return alpha_bar for SB.

alpha_bar = alpha_t / alpha_t_max

Parameters:

alpha – tensor with alpha values

Returns:

Tensors the same size as alpha, representing alpha_bar and alpha_t_max.

property alpha_t_max#

Return alpha_t at t_max.

abstract copy()#

Return a copy of the noise schedule.

property dt: float#

Time step for the process.

abstract f(time: torch.Tensor) torch.Tensor#

Drift scaling f(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing drift scaling.

abstract g(time: torch.Tensor) torch.Tensor#

Diffusion scaling g(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing diffusion scaling.

generate_time(
size: int,
device: torch.device,
) torch.Tensor#

Generate random time steps in the valid range.

get_alphas(
time: torch.Tensor,
)#

Return alpha, alpha_bar and alpha_t_max for SB.

Parameters:

time – tensor with time steps

Returns:

Tuple of tensors with alpha, alpha_bar and alpha_t_max.

get_sigmas(
time: torch.Tensor,
)#

Return sigma, sigma_bar and sigma_t_max for SB.

Parameters:

time – tensor with time steps

Returns:

Tuple of tensors with sigma, sigma_bar and sigma_t_max.

abstract sigma(time: torch.Tensor) torch.Tensor#

Return sigma_t for SB.

sigma_t^2 = int_0^t g^2(s) / alpha_s^2 ds

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing sigma for each time.

sigma_bar_from_sigma(
sigma: torch.Tensor,
)#

Return sigma_bar_t for SB.

sigma_bar_t^2 = sigma_t_max^2 - sigma_t^2

Parameters:

sigma – tensor with sigma values

Returns:

Tensors the same size as sigma, representing sigma_bar and sigma_t_max.

property sigma_t_max#

Return sigma_t at t_max.

property time_delta: float#

Time range for the process.

class nemo.collections.audio.parts.submodules.schroedinger_bridge.SBNoiseScheduleVE(*args: Any, **kwargs: Any)#

Bases: SBNoiseSchedule

Variance exploding noise schedule for the Schrödinger Bridge.

Parameters:
  • k – defines the base for the exponential diffusion coefficient

  • c – scaling for the diffusion coefficient

  • time_min – minimum time for the process

  • time_max – maximum time for the process

  • num_steps – number of steps for the process

  • eps – small regularization

References

Schrödinger Bridge for Generative Speech Enhancement, https://arxiv.org/abs/2407.16074
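
A minimal sketch of querying the schedule through the helpers documented on the base class; the constructor values below are illustrative assumptions.

```python
import torch
from nemo.collections.audio.parts.submodules.schroedinger_bridge import SBNoiseScheduleVE

schedule = SBNoiseScheduleVE(k=2.6, c=0.4, time_min=1e-4, time_max=1.0, num_steps=50)

time = schedule.generate_time(size=8, device=torch.device('cpu'))  # random times in the valid range
alpha, alpha_bar, alpha_t_max = schedule.get_alphas(time)          # alpha_bar = alpha / alpha_t_max
sigma, sigma_bar, sigma_t_max = schedule.get_sigmas(time)          # sigma_bar^2 = sigma_t_max^2 - sigma^2
```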

alpha(time: torch.Tensor) torch.Tensor#

Return alpha for SB noise schedule.

alpha_t = exp( int_0^t f(s) ds )

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing alpha for each time.

copy()#

Return a copy of the noise schedule.

f(time: torch.Tensor) torch.Tensor#

Drift scaling f(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing drift scaling.

g(time: torch.Tensor) torch.Tensor#

Diffusion scaling g(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing diffusion scaling.

sigma(time: torch.Tensor) torch.Tensor#

Return sigma_t for SB.

sigma_t^2 = int_0^t g^2(s) / alpha_s^2 ds

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing sigma for each time.

class nemo.collections.audio.parts.submodules.schroedinger_bridge.SBNoiseScheduleVP(*args: Any, **kwargs: Any)#

Bases: SBNoiseSchedule

Variance preserving noise schedule for the Schrödinger Bridge.

Parameters:
  • beta_0 – defines the lower bound for diffusion coefficient

  • beta_1 – defines upper bound for diffusion coefficient

  • c – scaling for the diffusion coefficient

  • time_min – minimum time for the process

  • time_max – maximum time for the process

  • num_steps – number of steps for the process

  • eps – small regularization

alpha(time: torch.Tensor) torch.Tensor#

Return alpha for SB noise schedule.

alpha_t = exp( int_0^t f(s) ds )

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing alpha for each time.

copy()#

Return a copy of the noise schedule.

f(time: torch.Tensor) torch.Tensor#

Drift scaling f(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing drift scaling.

g(time: torch.Tensor) torch.Tensor#

Diffusion scaling g(t).

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing diffusion scaling.

sigma(time: torch.Tensor) torch.Tensor#

Return sigma_t for SB.

sigma_t^2 = int_0^t g^2(s) / alpha_s^2 ds

Parameters:

time – tensor with time steps

Returns:

Tensor the same size as time, representing sigma for each time.

class nemo.collections.audio.parts.submodules.schroedinger_bridge.SBSampler(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Schrödinger Bridge sampler.

Parameters:
  • noise_schedule – noise schedule for the bridge

  • estimator – neural estimator

  • estimator_output – defines the output of the estimator, e.g., data_prediction

  • estimator_time – time for conditioning the estimator, e.g., ‘current’ or ‘previous’. Default is ‘previous’.

  • process – defines the process, e.g., sde or ode

  • time_max – maximum time for the process

  • time_min – minimum time for the process

  • num_steps – number of steps for the process

  • eps – small regularization to prevent division by zero

References

  • Schrödinger Bridge for Generative Speech Enhancement, https://arxiv.org/abs/2407.16074

  • Schrödinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis, https://arxiv.org/abs/2312.03491
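
A construction-only sketch of wiring the sampler; the estimator is left as a placeholder (a real score/data-prediction network is required), and all values are illustrative assumptions.

```python
import torch
from nemo.collections.audio.parts.submodules.schroedinger_bridge import (
    SBNoiseScheduleVE,
    SBSampler,
)

noise_schedule = SBNoiseScheduleVE(k=2.6, c=0.4, time_min=1e-4, time_max=1.0, num_steps=50)
estimator = torch.nn.Identity()  # placeholder only; not a usable estimator

sampler = SBSampler(
    noise_schedule=noise_schedule,
    estimator=estimator,
    estimator_output='data_prediction',
    process='sde',
    num_steps=50,
)
```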

Submodules: TransformerUNet#

class nemo.collections.audio.parts.submodules.transformerunet.LearnedSinusoidalPosEmb(*args: Any, **kwargs: Any)#

Bases: Module

Learned sinusoidal embedding used to encode time-conditioning information.

forward(t: torch.Tensor) torch.Tensor#
Parameters:

t – input time tensor, shape (B)

Returns:

the encoded time conditional embedding, shape (B, D)

Return type:

fouriered

class nemo.collections.audio.parts.submodules.transformerunet.ConvPositionEmbed(*args: Any, **kwargs: Any)#

Bases: Module

Convolutional positional embedding that encodes the temporal position of each frame.

forward(x, mask=None)#
Parameters:

x – input tensor, shape (B, T, D)

Returns:

output tensor with the same shape (B, T, D)

Return type:

out

class nemo.collections.audio.parts.submodules.transformerunet.RMSNorm(*args: Any, **kwargs: Any)#

Bases: Module

Root Mean Square Layer Normalization (RMSNorm).

References

  • Zhang et al., Root Mean Square Layer Normalization, 2019

class nemo.collections.audio.parts.submodules.transformerunet.AdaptiveRMSNorm(*args: Any, **kwargs: Any)#

Bases: Module

Adaptive Root Mean Square Layer Normalization given a conditional embedding. This enables the model to consider the conditional input during normalization.

class nemo.collections.audio.parts.submodules.transformerunet.GEGLU(*args: Any, **kwargs: Any)#

Bases: Module

The GeGLU activation implementation

class nemo.collections.audio.parts.submodules.transformerunet.TransformerUNet(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Implementation of the Transformer encoder model with a U-Net structure, as used in Voicebox and Audiobox.

References

  • Le et al., Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale, 2023

  • Vyas et al., Audiobox: Unified Audio Generation with Natural Language Prompts, 2023

forward(
x,
key_padding_mask: torch.Tensor | None = None,
adaptive_rmsnorm_cond=None,
)#

Forward pass of the model.

Parameters:
  • x – input tensor, shape (B, C, D, T)

  • key_padding_mask – mask tensor indicating the padding parts, shape (B, T)

  • adaptive_rmsnorm_cond – conditional input for the model, shape (B, D)

get_alibi_bias(batch_size: int, seq_len: int)#

Return the ALiBi bias given the batch size and sequence length.

init_alibi(max_positions: int, heads: int)#

Initialize the ALiBi bias parameters.

References

  • Press et al., Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, 2021
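
As a reminder of the technique, ALiBi adds a head-specific linear penalty on query–key distance to the attention logits. The following is a generic sketch of that construction, not necessarily identical to this module's init_alibi / get_alibi_bias internals.

```python
import torch

def alibi_bias(heads: int, seq_len: int) -> torch.Tensor:
    # Geometric slopes, one per attention head (standard choice for a power-of-two number of heads).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / heads) for h in range(heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()            # (T, T) relative distances
    return -slopes[:, None, None] * distance[None, :, :]      # (heads, T, T), added to attention logits

bias = alibi_bias(heads=8, seq_len=16)
```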

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

class nemo.collections.audio.parts.submodules.transformerunet.SpectrogramTransformerUNet(*args: Any, **kwargs: Any)#

Bases: NeuralModule

This model handles complex-valued inputs by stacking their real and imaginary components. The stacked tensor is processed using TransformerUNet, and the output is projected to generate the real and imaginary components of the output channels.

A convolutional positional embedding is applied to the input sequence.

forward(
input,
input_length=None,
condition=None,
)#

Forward pass of the model.

Parameters:
  • input – input tensor, shape (B, C, D, T)

  • input_length – length of the valid time steps for each example in the batch, shape (B,)

  • condition – scalar condition (time) for the model, will be embedded using self.time_embedding

property input_types: Dict[str, NeuralType]#

Returns definitions of module input ports.

property output_types: Dict[str, NeuralType]#

Returns definitions of module output ports.

Losses#

class nemo.collections.audio.losses.MAELoss(*args: Any, **kwargs: Any)#

Bases: Loss, Typing

Computes the mean absolute error (MAE) loss with weighted average across channels.

Parameters:
  • weight – weight for loss of each output channel, used for averaging the loss across channels. Defaults to None (averaging).

  • reduction – batch reduction. Defaults to mean over the batch.

  • ndim – Number of dimensions for the input signal

forward(
estimate: torch.Tensor,
target: torch.Tensor,
input_length: torch.Tensor | None = None,
mask: torch.Tensor | None = None,
) torch.Tensor#

For input batch of multi-channel signals, calculate MAE between estimate and target for each channel, perform averaging across channels (weighting optional), and apply reduction across the batch.

Parameters:
  • estimate – Estimate of the target signal

  • target – Target signal

  • input_length – Length of each example in the batch

  • mask – Mask for each signal

Returns:

Scalar loss.
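
A minimal usage sketch with random tensors; the (B, C, T) shapes, input_length values, and use of default constructor arguments are illustrative assumptions.

```python
import torch
from nemo.collections.audio.losses import MAELoss

loss_fn = MAELoss()
B, C, T = 4, 1, 16000
estimate = torch.randn(B, C, T)
target = torch.randn(B, C, T)
input_length = torch.tensor([16000, 12000, 16000, 8000])  # valid samples per example
loss = loss_fn(estimate=estimate, target=target, input_length=input_length)  # scalar
```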

property input_types#

Input types definitions for MAELoss.

property output_types#

Output types definitions for MAELoss.

loss: NeuralType(None)

class nemo.collections.audio.losses.MSELoss(*args: Any, **kwargs: Any)#

Bases: Loss, Typing

Computes MSE loss with weighted average across channels.

Parameters:
  • weight – weight for loss of each output channel, used for averaging the loss across channels. Defaults to None (averaging).

  • reduction – batch reduction. Defaults to mean over the batch.

  • ndim – Number of dimensions for the input signal

forward(
estimate: torch.Tensor,
target: torch.Tensor,
input_length: torch.Tensor | None = None,
mask: torch.Tensor | None = None,
) torch.Tensor#

For input batch of multi-channel signals, calculate MSE between estimate and target for each channel, perform averaging across channels (weighting optional), and apply reduction across the batch.

Parameters:
  • estimate – Estimate of the target signal

  • target – Target signal

  • input_length – Length of each example in the batch

  • mask – Mask for each signal

Returns:

Scalar loss.

property input_types#

Input types definitions for MSELoss.

property output_types#

Output types definitions for MSELoss.

loss: NeuralType(None)

class nemo.collections.audio.losses.SDRLoss(*args: Any, **kwargs: Any)#

Bases: Loss, Typing

Computes signal-to-distortion ratio (SDR) loss with weighted average across channels.

Parameters:
  • weight – weight for SDR of each output channel, used for averaging the loss across channels. Defaults to None (averaging).

  • reduction – batch reduction. Defaults to mean over the batch.

  • scale_invariant – If True, use scale-invariant SDR. Defaults to False.

  • remove_mean – Remove mean before calculating the loss. Defaults to True.

  • sdr_max – Soft thresholding of the loss to SDR_max.

  • eps – Small value for regularization.

forward(
estimate: torch.Tensor,
target: torch.Tensor,
input_length: torch.Tensor | None = None,
mask: torch.Tensor | None = None,
) torch.Tensor#

For input batch of multi-channel signals, calculate SDR between estimate and target for each channel, perform averaging across channels (weighting optional), and apply reduction across the batch.

Parameters:
  • estimate – Batch of signals, shape (B, C, T)

  • target – Batch of signals, shape (B, C, T)

  • input_length – Batch of lengths, shape (B,)

  • mask – Batch of temporal masks for each channel, shape (B, C, T)

Returns:

Scalar loss.
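
A minimal usage sketch; the constructor values are illustrative assumptions (the returned value is typically the negative SDR, so lower is better).

```python
import torch
from nemo.collections.audio.losses import SDRLoss

loss_fn = SDRLoss(scale_invariant=True)
B, C, T = 4, 2, 16000
estimate = torch.randn(B, C, T)
target = torch.randn(B, C, T)
loss = loss_fn(estimate=estimate, target=target)  # scalar, reduced over channels and batch
```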

property input_types#

Input types definitions for SDRLoss.

property output_types#

Output types definitions for SDRLoss.

loss: NeuralType(None)

Datasets#

NeMo Format#

class nemo.collections.audio.data.audio_to_audio.BaseAudioDataset(*args: Any, **kwargs: Any)#

Bases: Dataset

Base class of audio datasets, providing common functionality for other audio datasets.

Parameters:
  • collection – Collection of audio examples prepared from manifest files.

  • audio_processor – Used to process every example from the collection. A callable with process method. For reference, please check ASRAudioProcessor.

num_channels(signal_key) int#

Returns the number of channels for a particular signal in items prepared by this dataset.

More specifically, this will get the tensor from the first item in the dataset, check if it’s a one- or two-dimensional tensor, and return the number of channels based on the size of the first axis (shape[0]).

NOTE: This assumes that all examples have the same number of channels.

Parameters:

signal_key – string, used to select a signal from the dictionary output by __getitem__

Returns:

Number of channels for the selected signal.

abstract property output_types: Dict[str, NeuralType] | None#

Returns definitions of module output ports.

class nemo.collections.audio.data.audio_to_audio.AudioToTargetDataset(*args: Any, **kwargs: Any)#

Bases: BaseAudioDataset

A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal.

Each line of the manifest file is expected to have the following format:

```
{
    "input_key": "path/to/input.wav",
    "target_key": "path/to/path_to_target.wav",
    "duration": duration_of_input
}
```

Additionally, multiple audio files may be provided for each key in the manifest, for example:

```
{
    "input_key": "path/to/input.wav",
    "target_key": ["path/to/path_to_target_ch0.wav", "path/to/path_to_target_ch1.wav"],
    "duration": duration_of_input
}
```

Keys for input and target signals can be configured in the constructor (input_key and target_key).
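
For example, a single-line manifest entry could be created as follows; the key names must match the input_key / target_key values configured in the constructor, and the paths and duration are placeholders.

```python
import json

entry = {
    "input_key": "path/to/input.wav",
    "target_key": "path/to/path_to_target.wav",
    "duration": 3.2,  # seconds
}
with open("manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")
```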

Parameters:
  • manifest_filepath – Path to manifest file in a format described above.

  • sample_rate – Sample rate for loaded audio signals.

  • input_key – Key pointing to input audio files in the manifest

  • target_key – Key pointing to target audio files in manifest

  • audio_duration – Optional duration of each item returned by __getitem__. If None, complete audio will be loaded. If set, a random subsegment will be loaded synchronously from the input and target, i.e., with the same start and end point.

  • random_offset – If True, offset will be randomized when loading a subsegment from a file.

  • max_duration – If audio exceeds this length, do not include in dataset.

  • min_duration – If audio is less than this length, do not include in dataset.

  • max_utts – Limit number of utterances.

  • input_channel_selector – Optional, select subset of channels from each input audio file. If None, all channels will be loaded.

  • target_channel_selector – Optional, select subset of channels from each target audio file. If None, all channels will be loaded.

  • normalization_signal – Normalize audio signals with a scale that ensures the normalization signal is in range [-1, 1]. All audio signals are scaled by the same factor. Supported values are None (no normalization), ‘input_signal’, ‘target_signal’.

property output_types: Dict[str, NeuralType] | None#

Returns definitions of module output ports.

Returns:

Ordered dictionary in the following form:

```
{
    'input_signal': batched single- or multi-channel format,
    'input_length': batched original length of each input signal,
    'target_signal': batched single- or multi-channel format,
    'target_length': batched original length of each target signal
}
```

class nemo.collections.audio.data.audio_to_audio.AudioToTargetWithReferenceDataset(*args: Any, **kwargs: Any)#

Bases: BaseAudioDataset

A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal, and an additional reference signal is available.

This can be used, for example, when a reference signal is available from:

  • an enrollment utterance for the target signal

  • an echo reference from playback

  • a reference from another sensor that correlates with the target signal

Each line of the manifest file is expected to have the following format:

```
{
    "input_key": "path/to/input.wav",
    "target_key": "path/to/path_to_target.wav",
    "reference_key": "path/to/path_to_reference.wav",
    "duration": duration_of_input
}
```

Keys for input, target and reference signals can be configured in the constructor.

Parameters:
  • manifest_filepath – Path to manifest file in a format described above.

  • sample_rate – Sample rate for loaded audio signals.

  • input_key – Key pointing to input audio files in the manifest

  • target_key – Key pointing to target audio files in manifest

  • reference_key – Key pointing to reference audio files in manifest

  • audio_duration – Optional duration of each item returned by __getitem__. If None, complete audio will be loaded. If set, a random subsegment will be loaded synchronously from the input and target, i.e., with the same start and end point.

  • random_offset – If True, offset will be randomized when loading a subsegment from a file.

  • max_duration – If audio exceeds this length, do not include in dataset.

  • min_duration – If audio is less than this length, do not include in dataset.

  • max_utts – Limit number of utterances.

  • input_channel_selector – Optional, select subset of channels from each input audio file. If None, all channels will be loaded.

  • target_channel_selector – Optional, select subset of channels from each target audio file. If None, all channels will be loaded.

  • reference_channel_selector – Optional, select subset of channels from each reference audio file. If None, all channels will be loaded.

  • reference_is_synchronized – If True, it is assumed that the reference signal is synchronized with the input signal, so the same subsegment will be loaded as for input and target. If False, reference signal will be loaded independently from input and target.

  • reference_duration – Optional, can be used to set a fixed duration of the reference utterance. If None, complete audio file will be loaded.

  • normalization_signal – Normalize audio signals with a scale that ensures the normalization signal is in range [-1, 1]. All audio signals are scaled by the same factor. Supported values are None (no normalization), ‘input_signal’, ‘target_signal’, ‘reference_signal’.

property output_types: Dict[str, NeuralType] | None#

Returns definitions of module output ports.

Returns:

Ordered dictionary in the following form:

```
{
    'input_signal': batched single- or multi-channel format,
    'input_length': batched original length of each input signal,
    'target_signal': batched single- or multi-channel format,
    'target_length': batched original length of each target signal,
    'reference_signal': single- or multi-channel format,
    'reference_length': original length of each reference signal
}
```

class nemo.collections.audio.data.audio_to_audio.AudioToTargetWithEmbeddingDataset(*args: Any, **kwargs: Any)#

Bases: BaseAudioDataset

A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal, with an additional embedding signal available. It is assumed that the embedding is in the form of a vector.

Each line of the manifest file is expected to have the following format:

```
{
    "input_key": "path/to/input.wav",
    "target_key": "path/to/path_to_target.wav",
    "embedding_key": "path/to/path_to_reference.npy",
    "duration": duration_of_input
}
```

Keys for input, target and embedding signals can be configured in the constructor.

Parameters:
  • manifest_filepath – Path to manifest file in a format described above.

  • sample_rate – Sample rate for loaded audio signals.

  • input_key – Key pointing to input audio files in the manifest

  • target_key – Key pointing to target audio files in manifest

  • embedding_key – Key pointing to embedding files in manifest

  • audio_duration – Optional duration of each item returned by __getitem__. If None, complete audio will be loaded. If set, a random subsegment will be loaded synchronously from the input and target, i.e., with the same start and end point.

  • random_offset – If True, offset will be randomized when loading a subsegment from a file.

  • max_duration – If audio exceeds this length, do not include in dataset.

  • min_duration – If audio is less than this length, do not include in dataset.

  • max_utts – Limit number of utterances.

  • input_channel_selector – Optional, select subset of channels from each input audio file. If None, all channels will be loaded.

  • target_channel_selector – Optional, select subset of channels from each target audio file. If None, all channels will be loaded.

  • normalization_signal – Normalize audio signals with a scale that ensures the normalization signal is in range [-1, 1]. All audio signals are scaled by the same factor. Supported values are None (no normalization), ‘input_signal’, ‘target_signal’.

property output_types: Dict[str, NeuralType] | None#

Returns definitions of module output ports.

Returns:

Ordered dictionary in the following form:

```
{
    'input_signal': batched single- or multi-channel format,
    'input_length': batched original length of each input signal,
    'target_signal': batched single- or multi-channel format,
    'target_length': batched original length of each target signal,
    'embedding_vector': batched embedded vector format,
    'embedding_length': batched original length of each embedding vector
}
```

Lhotse Format#

class nemo.collections.audio.data.audio_to_audio_lhotse.LhotseAudioToTargetDataset(*args: Any, **kwargs: Any)#

Bases: Dataset

A dataset for audio-to-audio tasks where the goal is to use an input signal to recover the corresponding target signal.

Note

This is a Lhotse variant of nemo.collections.audio.data.audio_to_audio.AudioToTargetDataset.