Automatic Speech Recognition (ASR)
ASR, or Automatic Speech Recognition, refers to the problem of getting a program to automatically transcribe spoken language (speech-to-text). Our goal is usually to have a model that minimizes the Word Error Rate (WER) metric when transcribing speech input. In other words, given some audio file (e.g. a WAV file) containing speech, how do we transform this into the corresponding text with as few errors as possible?
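To make the metric concrete, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not NeMo's own implementation, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" involves one substitution and one deletion over six reference words. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.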
Traditional speech recognition takes a generative approach, modeling the full pipeline of how speech sounds are produced in order to evaluate a speech sample. We would start from a language model that encapsulates the most likely orderings of words (e.g. an n-gram model), move to a pronunciation model for each word in that ordering (e.g. a pronunciation table), and end with an acoustic model that translates those pronunciations to audio waveforms (e.g. a Gaussian Mixture Model).
Then, if we receive some spoken input, our goal would be to find the most likely sequence of text that would result in the given audio according to our generative pipeline of models. Overall, with traditional speech recognition, we try to model Pr(audio|transcript) * Pr(transcript), and take the argmax of this over possible transcripts.
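The reason this product is the right quantity to maximize follows from Bayes' rule. As a short worked derivation, write X for the audio and W for a candidate transcript (both symbols are introduced here for illustration and are not part of the original text):

```latex
\begin{align*}
\hat{W} &= \operatorname*{arg\,max}_{W} \Pr(W \mid X) \\
        &= \operatorname*{arg\,max}_{W} \frac{\Pr(X \mid W)\,\Pr(W)}{\Pr(X)} \\
        &= \operatorname*{arg\,max}_{W} \Pr(X \mid W)\,\Pr(W)
\end{align*}
```

The denominator Pr(X) does not depend on W, so it can be dropped without changing the argmax. The remaining factors correspond to the acoustic and pronunciation models, Pr(X|W), and the language model, Pr(W).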
Over time, neural nets advanced to the point where each component of the traditional speech recognition model could be replaced by a neural model with better performance and greater potential for generalization. For example, we could replace an n-gram model with a neural language model, replace a pronunciation table with a neural pronunciation model, and so on. However, each of these neural models needs to be trained individually on a different task, and errors in any model in the pipeline could throw off the whole prediction.
Thus, we can see the appeal of end-to-end ASR architectures: discriminative models that simply take an audio input and give a textual output, and in which all components of the architecture are trained together towards the same goal. The model’s encoder would be akin to an acoustic model for extracting speech features, which can then be directly piped to a decoder which outputs text. If desired, we could integrate a language model that would improve our predictions, as well.
And the entire end-to-end ASR model can be trained at once, which makes for a much easier pipeline to handle!
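As a rough illustration of how compact the end-to-end workflow is in practice, the sketch below loads a pretrained CTC model and transcribes audio files using `EncDecCTCModel.from_pretrained()` and `EncDecCTCModel.transcribe()`. The helper name `transcribe_files` is hypothetical, and the default model name `QuartzNet15x5Base-En` is an assumption; call `EncDecCTCModel.list_available_models()` to see which names your NeMo version actually provides.

```python
from typing import List


def transcribe_files(audio_paths: List[str], model_name: str = "QuartzNet15x5Base-En"):
    """Transcribe audio files with a pretrained end-to-end CTC model.

    `model_name` is an assumed default; check
    EncDecCTCModel.list_available_models() for valid names.
    """
    # Imported lazily so this helper can be defined without NeMo installed.
    import nemo.collections.asr as nemo_asr

    # Fetches the pretrained checkpoint on first use, then loads it.
    asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=model_name)

    # Returns one result per input file. The element type (plain string or
    # hypothesis object) depends on the installed NeMo version.
    return asr_model.transcribe(audio_paths)
```

A call such as `transcribe_files(["sample.wav"])`, where `sample.wav` is a placeholder path, would run the whole audio-to-text pipeline in one step, with no separately trained pronunciation or acoustic components to manage.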
The demo below allows evaluation of NeMo ASR models in multiple languages from the browser:
The full documentation tree is as follows:
- Models
- Datasets
- ASR Language Modeling
- Checkpoints
- Scores
- Scores with Punctuation and Capitalization
- NeMo ASR Configuration Files
- NeMo ASR collection API
- Model Classes
EncDecCTCModel
EncDecCTCModel.change_decoding_strategy()
EncDecCTCModel.change_vocabulary()
EncDecCTCModel.forward()
EncDecCTCModel.input_types
EncDecCTCModel.list_available_models()
EncDecCTCModel.multi_test_epoch_end()
EncDecCTCModel.multi_validation_epoch_end()
EncDecCTCModel.output_types
EncDecCTCModel.predict_step()
EncDecCTCModel.setup_test_data()
EncDecCTCModel.setup_training_data()
EncDecCTCModel.setup_validation_data()
EncDecCTCModel.test_dataloader()
EncDecCTCModel.test_step()
EncDecCTCModel.training_step()
EncDecCTCModel.transcribe()
EncDecCTCModel.validation_step()
EncDecCTCModelBPE
EncDecRNNTModel
EncDecRNNTModel.change_decoding_strategy()
EncDecRNNTModel.change_vocabulary()
EncDecRNNTModel.decoder_joint
EncDecRNNTModel.extract_rnnt_loss_cfg()
EncDecRNNTModel.forward()
EncDecRNNTModel.input_types
EncDecRNNTModel.list_available_models()
EncDecRNNTModel.list_export_subnets()
EncDecRNNTModel.multi_test_epoch_end()
EncDecRNNTModel.multi_validation_epoch_end()
EncDecRNNTModel.on_after_backward()
EncDecRNNTModel.output_types
EncDecRNNTModel.predict_step()
EncDecRNNTModel.setup_optim_normalization()
EncDecRNNTModel.setup_test_data()
EncDecRNNTModel.setup_training_data()
EncDecRNNTModel.setup_validation_data()
EncDecRNNTModel.test_step()
EncDecRNNTModel.training_step()
EncDecRNNTModel.transcribe()
EncDecRNNTModel.validation_step()
EncDecRNNTBPEModel
EncDecClassificationModel
EncDecClassificationModel.change_labels()
EncDecClassificationModel.forward()
EncDecClassificationModel.list_available_models()
EncDecClassificationModel.multi_test_epoch_end()
EncDecClassificationModel.multi_validation_epoch_end()
EncDecClassificationModel.output_types
EncDecClassificationModel.test_step()
EncDecClassificationModel.training_step()
EncDecClassificationModel.validation_step()
EncDecSpeakerLabelModel
EncDecSpeakerLabelModel.batch_inference()
EncDecSpeakerLabelModel.evaluation_step()
EncDecSpeakerLabelModel.extract_labels()
EncDecSpeakerLabelModel.forward()
EncDecSpeakerLabelModel.forward_for_export()
EncDecSpeakerLabelModel.get_embedding()
EncDecSpeakerLabelModel.get_label()
EncDecSpeakerLabelModel.infer_file()
EncDecSpeakerLabelModel.input_types
EncDecSpeakerLabelModel.list_available_models()
EncDecSpeakerLabelModel.multi_evaluation_epoch_end()
EncDecSpeakerLabelModel.multi_test_epoch_end()
EncDecSpeakerLabelModel.multi_validation_epoch_end()
EncDecSpeakerLabelModel.output_types
EncDecSpeakerLabelModel.setup_test_data()
EncDecSpeakerLabelModel.setup_training_data()
EncDecSpeakerLabelModel.setup_validation_data()
EncDecSpeakerLabelModel.test_dataloader()
EncDecSpeakerLabelModel.test_step()
EncDecSpeakerLabelModel.training_step()
EncDecSpeakerLabelModel.validation_step()
EncDecSpeakerLabelModel.verify_speakers()
ASRWithTTSModel
ASRWithTTSModel.ASRModelTypes
ASRWithTTSModel.asr_model
ASRWithTTSModel.enhancer_model
ASRWithTTSModel.from_asr_config()
ASRWithTTSModel.from_pretrained_models()
ASRWithTTSModel.list_available_models()
ASRWithTTSModel.multi_test_epoch_end()
ASRWithTTSModel.multi_validation_epoch_end()
ASRWithTTSModel.on_fit_start()
ASRWithTTSModel.save_asr_model_to()
ASRWithTTSModel.setup_multiple_test_data()
ASRWithTTSModel.setup_multiple_validation_data()
ASRWithTTSModel.setup_optimization()
ASRWithTTSModel.setup_test_data()
ASRWithTTSModel.setup_training_data()
ASRWithTTSModel.setup_validation_data()
ASRWithTTSModel.test_epoch_end()
ASRWithTTSModel.train()
ASRWithTTSModel.training_step()
ASRWithTTSModel.transcribe()
ASRWithTTSModel.tts_model
ASRWithTTSModel.unfreeze()
ASRWithTTSModel.val_dataloader()
ASRWithTTSModel.validation_epoch_end()
ASRWithTTSModel.validation_step()
- Modules
ConvASREncoder
ConvASRDecoder
ConvASRDecoderClassification
SpeakerDecoder
ConformerEncoder
ConformerEncoder.change_attention_model()
ConformerEncoder.disabled_deployment_input_names
ConformerEncoder.disabled_deployment_output_names
ConformerEncoder.enable_pad_mask()
ConformerEncoder.forward()
ConformerEncoder.forward_for_export()
ConformerEncoder.forward_internal()
ConformerEncoder.get_initial_cache_state()
ConformerEncoder.input_example()
ConformerEncoder.input_types
ConformerEncoder.input_types_for_export
ConformerEncoder.output_types
ConformerEncoder.output_types_for_export
ConformerEncoder.set_max_audio_length()
ConformerEncoder.setup_streaming_params()
ConformerEncoder.streaming_post_process()
ConformerEncoder.update_max_seq_length()
SqueezeformerEncoder
SqueezeformerEncoder.enable_pad_mask()
SqueezeformerEncoder.forward()
SqueezeformerEncoder.forward_for_export()
SqueezeformerEncoder.input_example()
SqueezeformerEncoder.input_types
SqueezeformerEncoder.make_pad_mask()
SqueezeformerEncoder.output_types
SqueezeformerEncoder.set_max_audio_length()
SqueezeformerEncoder.update_max_seq_length()
RNNEncoder
RNNTDecoder
RNNTDecoder.add_adapter()
RNNTDecoder.batch_concat_states()
RNNTDecoder.batch_copy_states()
RNNTDecoder.batch_initialize_states()
RNNTDecoder.batch_score_hypothesis()
RNNTDecoder.batch_select_state()
RNNTDecoder.forward()
RNNTDecoder.initialize_state()
RNNTDecoder.input_example()
RNNTDecoder.input_types
RNNTDecoder.output_types
RNNTDecoder.predict()
RNNTDecoder.score_hypothesis()
StatelessTransducerDecoder
StatelessTransducerDecoder.batch_concat_states()
StatelessTransducerDecoder.batch_copy_states()
StatelessTransducerDecoder.batch_initialize_states()
StatelessTransducerDecoder.batch_score_hypothesis()
StatelessTransducerDecoder.batch_select_state()
StatelessTransducerDecoder.forward()
StatelessTransducerDecoder.initialize_state()
StatelessTransducerDecoder.input_example()
StatelessTransducerDecoder.input_types
StatelessTransducerDecoder.output_types
StatelessTransducerDecoder.predict()
StatelessTransducerDecoder.score_hypothesis()
RNNTJoint
RNNTJoint.add_adapter()
RNNTJoint.disabled_deployment_input_names
RNNTJoint.forward()
RNNTJoint.fuse_loss_wer
RNNTJoint.fused_batch_size
RNNTJoint.input_example()
RNNTJoint.input_types
RNNTJoint.joint()
RNNTJoint.loss
RNNTJoint.num_classes_with_blank
RNNTJoint.num_extra_outputs
RNNTJoint.output_types
RNNTJoint.set_fuse_loss_wer()
RNNTJoint.set_fused_batch_size()
RNNTJoint.set_loss()
RNNTJoint.set_wer()
RNNTJoint.wer
SampledRNNTJoint
- Parts
- Mixins
- Datasets
- Audio Preprocessors
AudioToMelSpectrogramPreprocessor
AudioToMelSpectrogramPreprocessor.filter_banks
AudioToMelSpectrogramPreprocessor.get_features()
AudioToMelSpectrogramPreprocessor.input_example()
AudioToMelSpectrogramPreprocessor.input_types
AudioToMelSpectrogramPreprocessor.output_types
AudioToMelSpectrogramPreprocessor.restore_from()
AudioToMelSpectrogramPreprocessor.save_to()
AudioToMFCCPreprocessor
- Audio Augmentors
- Miscellaneous Classes
- CTC Decoding
- RNNT Decoding
RNNTDecoding
RNNTBPEDecoding
GreedyRNNTInfer
GreedyBatchedRNNTInfer
BeamRNNTInfer
BeamRNNTInfer.align_length_sync_decoding()
BeamRNNTInfer.compute_ngram_score()
BeamRNNTInfer.default_beam_search()
BeamRNNTInfer.greedy_search()
BeamRNNTInfer.input_types
BeamRNNTInfer.modified_adaptive_expansion_search()
BeamRNNTInfer.output_types
BeamRNNTInfer.prefix_search()
BeamRNNTInfer.recombine_hypotheses()
BeamRNNTInfer.resolve_joint_output()
BeamRNNTInfer.set_decoding_type()
BeamRNNTInfer.sort_nbest()
BeamRNNTInfer.time_sync_decoding()
- Hypotheses
- Adapter Networks
- Adapter Strategies
- Model Classes
- Resources and Documentation
- Example: Kinyarwanda ASR using Mozilla Common Voice Dataset
Resources and Documentation
Hands-on speech recognition tutorial notebooks can be found under the ASR tutorials folder. If you are new to NeMo, consider trying out the ASR with NeMo tutorial. This and most other tutorials can be run on Google Colab by specifying the link to the notebooks’ GitHub pages on Colab.
If you are looking for information about a particular ASR model, or would like to find out more about the model architectures available in the nemo_asr collection, refer to the Models section.
NeMo includes preprocessing scripts for several common ASR datasets. The Datasets section contains instructions on running those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data.
Information about how to load model checkpoints (either local files or pretrained ones from NGC), as well as a list of the checkpoints available on NGC, is located in the Checkpoints section.
Documentation regarding the configuration files specific to the nemo_asr models can be found in the Configuration Files section.