NeMo SSL collection API

Model Classes

class nemo.collections.asr.models.SpeechEncDecSelfSupervisedModel(*args: Any, **kwargs: Any)

Bases: nemo.core.classes.modelPT.ModelPT, nemo.collections.asr.parts.mixins.mixins.ASRModuleMixin, nemo.core.classes.mixins.access_mixins.AccessMixin

Base class for encoder-decoder models used for self-supervised encoder pre-training

decoder_loss_step(spectrograms, spec_masks, encoded, encoded_len, targets=None, target_lengths=None)

Forward pass through all decoders and calculate corresponding losses.

Parameters
  • spectrograms – Processed spectrograms of shape [B, D, T].

  • spec_masks – Masks applied to spectrograms of shape [B, D, T].

  • encoded – The encoded features tensor of shape [B, D, T].

  • encoded_len – The lengths of the acoustic sequence after propagation through the encoder, of shape [B].

  • targets – Optional target labels of shape [B, T].

  • target_lengths – Optional target label lengths of shape [B].

Returns

A tuple of 2 elements - 1) The total sum of losses, weighted by their corresponding loss_alphas. 2) A dictionary of the unweighted losses.
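The weighting in the return value can be sketched in plain Python. The decoder names and alpha values below are purely illustrative placeholders, not NeMo internals:

```python
def combine_decoder_losses(unweighted_losses, loss_alphas):
    """Sum per-decoder losses, each weighted by its loss_alpha.

    Mirrors the return contract of decoder_loss_step: a weighted total
    plus the dictionary of unweighted losses (illustrative sketch only).
    """
    total = sum(loss_alphas[name] * loss for name, loss in unweighted_losses.items())
    return total, unweighted_losses

# Hypothetical per-decoder losses and weights.
total, losses = combine_decoder_losses(
    {"contrastive": 2.0, "mlm": 0.5},
    {"contrastive": 0.1, "mlm": 1.0},
)
# total == 0.1 * 2.0 + 1.0 * 0.5 == 0.7
```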

forward(input_signal=None, input_signal_length=None, processed_signal=None, processed_signal_length=None)

Forward pass of the model.

Parameters
  • input_signal – Tensor that represents a batch of raw audio signals, of shape [B, T]. T here represents timesteps, with 1 second of audio represented as self.sample_rate number of floating point values.

  • input_signal_length – Vector of length B, that contains the individual lengths of the audio sequences.

  • processed_signal – Tensor that represents a batch of processed audio signals, of shape [B, D, T], that has undergone processing via some DALI preprocessor.

  • processed_signal_length – Vector of length B, that contains the individual lengths of the processed audio sequences.

Returns

A tuple of 4 elements - 1) Processed spectrograms of shape [B, D, T]. 2) Masks applied to spectrograms of shape [B, D, T]. 3) The encoded features tensor of shape [B, D, T]. 4) The lengths of the acoustic sequence after propagation through the encoder, of shape [B].
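As in other NeMo ASR models, the raw and processed inputs are typically mutually exclusive: you pass either input_signal or processed_signal, not both. The checker below is a standalone sketch of that contract, not the model's actual code:

```python
def resolve_forward_inputs(input_signal=None, processed_signal=None):
    """Sketch of the forward() input contract: exactly one of the raw
    audio tensor or the preprocessed spectrogram may be supplied.
    (Illustrative only; not NeMo's implementation.)
    """
    has_raw = input_signal is not None
    has_processed = processed_signal is not None
    if has_raw == has_processed:
        raise ValueError(
            "forward() expects exactly one of input_signal or processed_signal"
        )
    return "raw" if has_raw else "processed"
```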

property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable input neural type checks

classmethod list_available_models() List[nemo.core.classes.common.PretrainedModelInfo]

This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.

Returns

List of available pre-trained models.
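A hedged sketch of how the returned entries are typically consumed. The PretrainedModelInfo stand-in, the model name, and the URL below are placeholders for illustration, not real NGC entries:

```python
from dataclasses import dataclass

# Stand-in for nemo.core.classes.common.PretrainedModelInfo (illustrative only).
@dataclass
class PretrainedModelInfo:
    pretrained_model_name: str
    description: str
    location: str

def list_available_models():
    # Hypothetical entry; real names and NGC locations come from the class itself.
    return [
        PretrainedModelInfo(
            pretrained_model_name="example_ssl_model",
            description="placeholder description",
            location="https://example.invalid/model",
        )
    ]

# Typical usage: enumerate the names of available checkpoints.
names = [m.pretrained_model_name for m in list_available_models()]
```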

multi_validation_epoch_end(outputs, dataloader_idx: int = 0)

Adds support for multiple validation datasets. Should be overridden by subclasses to obtain appropriate logs for each of the dataloaders.

Parameters
  • outputs – Same as that provided by LightningModule.on_validation_epoch_end() for a single dataloader.

  • dataloader_idx – int representing the index of the dataloader.

Returns

A dictionary of values, optionally containing a sub-dict log, such that the keys in log will be prepended with the dataloader prefix.
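The prefixing behavior can be sketched as below; the exact prefix format is an assumption here, not NeMo's documented scheme:

```python
def prefix_log_keys(log, dataloader_idx=0):
    """Sketch: prepend a dataloader prefix to each key of the 'log'
    sub-dict, so metrics from multiple validation dataloaders do not
    collide. The prefix format is hypothetical, for illustration only.
    """
    prefix = f"dataloader{dataloader_idx}_" if dataloader_idx > 0 else ""
    return {prefix + key: value for key, value in log.items()}

# With a second dataloader (idx 1), keys get a distinguishing prefix.
logs = prefix_log_keys({"val_loss": 0.42}, dataloader_idx=1)
```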

property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]

Define these to enable output neural type checks

setup_training_data(train_data_config: Optional[Union[omegaconf.DictConfig, Dict]])

Sets up the training data loader via a Dict-like object.

Parameters

train_data_config – A config that contains the information regarding construction of an ASR Training dataset.

Supported Datasets:
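A minimal train_data_config might look like the dict below (a DictConfig works the same way). The field names follow common NeMo ASR dataset configs, but the authoritative schema depends on the configured dataset class; the same shape applies to the val_data_config passed to setup_validation_data:

```python
# Hypothetical config; field names follow common NeMo ASR dataset configs,
# but the exact schema is defined by the dataset class being constructed.
train_data_config = {
    "manifest_filepath": "train_manifest.json",  # assumed path, for illustration
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
}
```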
setup_validation_data(val_data_config: Optional[Union[omegaconf.DictConfig, Dict]])

Sets up the validation data loader via a Dict-like object.

Parameters

val_data_config – A config that contains the information regarding construction of an ASR validation dataset.

Supported Datasets:

Mixins