NeMo Speech Intent Classification and Slot Filling collection API
Model Classes
- class nemo.collections.asr.models.SLUIntentSlotBPEModel(*args: Any, **kwargs: Any)
Bases: nemo.collections.asr.models.asr_model.ASRModel, nemo.collections.asr.models.asr_model.ExportableEncDecModel, nemo.collections.asr.parts.mixins.mixins.ASRModuleMixin, nemo.collections.asr.parts.mixins.mixins.ASRBPEMixin, nemo.collections.asr.parts.mixins.transcription.ASRTranscriptionMixin
Model for end-to-end speech intent classification and slot filling, formulated as a speech-to-sequence task.
- forward(input_signal=None, input_signal_length=None, target_semantics=None, target_semantics_length=None, processed_signal=None, processed_signal_length=None)
Forward pass of the model.
- Params:
input_signal: Tensor that represents a batch of raw audio signals, of shape [B, T]. T here represents timesteps, with 1 second of audio represented as self.sample_rate number of floating point values.
input_signal_length: Vector of length B, that contains the individual lengths of the audio sequences.
target_semantics: Tensor that represents a batch of semantic tokens, of shape [B, L].
target_semantics_length: Vector of length B, that contains the individual lengths of the semantic sequences.
processed_signal: Tensor that represents a batch of processed audio signals, of shape [B, D, T], that has undergone processing via some DALI preprocessor.
processed_signal_length: Vector of length B, that contains the individual lengths of the processed audio sequences.
- Returns
A tuple of 3 elements - 1) The log probabilities tensor of shape [B, T, D]. 2) The lengths of the output sequence after decoder, of shape [B]. 3) The token predictions of the model of shape [B, T].
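The shape conventions above can be illustrated with plain arithmetic. The 16 kHz sample rate and clip durations below are assumptions for the example (the real value comes from the model's own self.sample_rate), not values fixed by the API:

```python
# Illustrative shape bookkeeping for forward(); the sample rate and clip
# durations are assumptions for this example, not fixed by the API.
sample_rate = 16000          # assumed value of self.sample_rate
durations_s = [5.0, 3.2]     # two audio clips in one batch

# input_signal is [B, T]: one second of audio is sample_rate floats,
# T is the longest clip in samples (shorter clips are padded), and
# input_signal_length records each clip's true length.
input_signal_length = [int(d * sample_rate) for d in durations_s]
B = len(durations_s)
T = max(input_signal_length)

print(B, T, input_signal_length)
```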
- property input_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]
Define these to enable input neural type checks
- classmethod list_available_models() Optional[nemo.core.classes.common.PretrainedModelInfo]
This method returns a list of pre-trained models that can be instantiated directly from NVIDIA’s NGC cloud.
- Returns
List of available pre-trained models.
- property output_types: Optional[Dict[str, nemo.core.neural_types.neural_type.NeuralType]]
Define these to enable output neural type checks
- setup_test_data(test_data_config: Optional[Union[omegaconf.DictConfig, Dict]])
Sets up the test data loader via a Dict-like object.
- Parameters
test_data_config – A config that contains the information regarding construction of an ASR test dataset.
- Supported Datasets:
AudioToCharDALIDataset
- setup_training_data(train_data_config: Optional[Union[omegaconf.DictConfig, Dict]])
Sets up the training data loader via a Dict-like object.
- Parameters
train_data_config – A config that contains the information regarding construction of an ASR Training dataset.
- Supported Datasets:
AudioToCharDALIDataset
- setup_validation_data(val_data_config: Optional[Union[omegaconf.DictConfig, Dict]])
Sets up the validation data loader via a Dict-like object.
- Parameters
val_data_config – A config that contains the information regarding construction of an ASR validation dataset.
- Supported Datasets:
AudioToCharDALIDataset
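A Dict-like config for the three setup methods above might look like the sketch below. The manifest paths and the exact field names are assumptions modeled on common NeMo ASR dataset configs, not an authoritative schema; check the model's own YAML config for the exact keys it expects:

```python
# Hypothetical Dict-style config for setup_training_data(); the field
# names and manifest path are assumptions modeled on typical NeMo ASR
# dataset configs, not a guaranteed schema.
train_data_config = {
    "manifest_filepath": "train_manifest.json",  # placeholder path
    "sample_rate": 16000,
    "batch_size": 32,
    "shuffle": True,
    "num_workers": 4,
}

# setup_validation_data / setup_test_data accept the same Dict-like
# shape, typically with shuffling disabled for evaluation.
val_data_config = dict(
    train_data_config,
    manifest_filepath="val_manifest.json",  # placeholder path
    shuffle=False,
)

print(sorted(train_data_config), val_data_config["shuffle"])
```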
- transcribe(audio: Union[List[str], torch.utils.data.DataLoader], batch_size: int = 4, return_hypotheses: bool = False, num_workers: int = 0, verbose: bool = True) Union[List[str], List[Hypothesis], Tuple[List[str]], Tuple[List[Hypothesis]]]
Uses greedy decoding to transcribe audio files into SLU semantics. Use this method for debugging and prototyping.
- Parameters
audio – A single path or a list of paths to audio files, a np.ndarray audio array, or a DataLoader that yields batches consumable by the model. The recommended length per file is between 5 and 25 seconds, but a file several hours long can be passed if enough GPU memory is available.
batch_size – (int) batch size to use during inference. Larger values improve throughput but use more memory.
return_hypotheses – (bool) Return Hypothesis objects instead of text. Hypotheses allow postprocessing such as extracting timestamps or rescoring.
num_workers – (int) number of workers for DataLoader
verbose – (bool) whether to display tqdm progress bar
- Returns
A list of transcriptions (or Hypothesis objects if return_hypotheses is True) in the same order as the input audio.
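Since the model emits semantics serialized as strings, a typical follow-up step is parsing them back into structured form. The SLURP-style string below is an illustrative assumption about the output format (the real format depends on the training data), and parse_semantics is a hypothetical helper, not part of the NeMo API:

```python
import ast

# transcribe() returns semantics serialized as strings. The exact format
# depends on the training data; the SLURP-style string below is an
# illustrative assumption, not a guaranteed output of the model.
prediction = (
    "{'scenario': 'alarm', 'action': 'set', "
    "'entities': [{'type': 'time', 'filler': 'seven am'}]}"
)

def parse_semantics(text: str) -> dict:
    """Best-effort parse of a serialized semantics string (hypothetical helper)."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Greedy decoding can emit malformed strings; fall back to empty.
        return {}

result = parse_semantics(prediction)
print(result["scenario"], [e["filler"] for e in result["entities"]])
```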