Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR), also known as Speech To Text (STT), refers to the problem of automatically transcribing spoken language. You can use NeMo to transcribe speech using open-sourced pretrained models in 14+ languages, or train your own ASR models.

After installing NeMo, you can transcribe an audio file as follows:


import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")
transcript = asr_model.transcribe(["path/to/audio_file.wav"])

Obtain word timestamps

You can also obtain timestamps for each word in the transcription as follows:


# import nemo_asr and instantiate asr_model as above
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")

# update decoding config to preserve alignments and compute timestamps
from omegaconf import OmegaConf, open_dict
decoding_cfg = asr_model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.preserve_alignments = True
    decoding_cfg.compute_timestamps = True
    asr_model.change_decoding_strategy(decoding_cfg)

# specify flag `return_hypotheses=True`
hypotheses = asr_model.transcribe(["path/to/audio_file.wav"], return_hypotheses=True)

# if hypotheses form a tuple (from RNNT), extract just "best" hypotheses
if type(hypotheses) == tuple and len(hypotheses) == 2:
    hypotheses = hypotheses[0]

# extract timesteps from hypothesis of first (and only) audio file
timestamp_dict = hypotheses[0].timestep
print("Hypothesis contains following timestep information :", list(timestamp_dict.keys()))

# For a FastConformer model, you can display the word timestamps as follows:
# 80ms is the duration of a timestep at the output of the encoder
time_stride = 8 * asr_model.cfg.preprocessor.window_stride

word_timestamps = timestamp_dict['word']

for stamp in word_timestamps:
    start = stamp['start_offset'] * time_stride
    end = stamp['end_offset'] * time_stride
    word = stamp['char'] if 'char' in stamp else stamp['word']
    print(f"Time : {start:0.2f} - {end:0.2f} - {word}")
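As a quick sanity check of the stride arithmetic used above, here is a standalone sketch. It assumes the typical 0.01 s preprocessor `window_stride`; your model's actual value lives in `asr_model.cfg.preprocessor.window_stride`.

```python
# Typical preprocessor hop length for Conformer-family models
# (an assumption; read it from the model config in practice).
window_stride = 0.01  # seconds per feature frame

# FastConformer subsamples the feature sequence 8x, so one
# encoder output timestep covers 8 feature frames.
subsampling_factor = 8

time_stride = subsampling_factor * window_stride  # 0.08 s, i.e. 80 ms

# A word whose start_offset is 25 timesteps therefore starts
# 25 * 0.08 = 2.0 seconds into the audio.
start_seconds = 25 * time_stride
```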

You can also transcribe speech via the command line using the following script, for example:


python <path_to_NeMo>/examples/asr/transcribe_speech.py \
    pretrained_name="stt_en_fastconformer_transducer_large" \
    audio_dir=<path_to_audio_dir>  # path to dir containing audio files to transcribe

The script will save all transcriptions in a JSONL file, where each line corresponds to an audio file in <audio_dir>. This file follows the format that NeMo commonly uses for saving model predictions, and also for storing input data for training and evaluation. You can learn more about the format that NeMo uses for these files (which we refer to as “manifest files”) here.
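For illustration, a minimal manifest entry can be built as follows. The field names (`audio_filepath`, `duration`, `text`) follow NeMo's manifest convention; the path, duration, and transcript here are placeholders.

```python
import json

# One manifest entry; "text" holds the reference transcript and is
# needed for training/evaluation, but can be omitted for transcription.
entry = {
    "audio_filepath": "path/to/audio_file.wav",
    "duration": 3.45,  # length of the audio in seconds
    "text": "the quick brown fox",
}

# A manifest file is simply one such JSON object per line (JSONL).
line = json.dumps(entry)

# Reading an entry back from a manifest line:
parsed = json.loads(line)
```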

You can also specify the files to be transcribed inside a manifest file, and pass that in using the argument dataset_manifest=<path to manifest specifying audio files to transcribe> instead of audio_dir.

You can often boost transcription accuracy by using a Language Model to help choose words that are more likely to be spoken in a sentence. Even a simple N-gram LM can yield a good improvement.

After training an N-gram LM, you can use it for transcribing audio as follows:

  1. Install the OpenSeq2Seq beam search decoding and KenLM libraries using the install_beamsearch_decoders script.

  2. Perform transcription using the eval_beamsearch_ngram script:


python eval_beamsearch_ngram.py \
    nemo_model_file=<path to the .nemo file of the model> \
    input_manifest=<path to the evaluation JSON manifest file> \
    kenlm_model_file=<path to the binary KenLM model> \
    beam_width=[<list of the beam widths, separated with commas>] \
    beam_alpha=[<list of the beam alphas, separated with commas>] \
    beam_beta=[<list of the beam betas, separated with commas>] \
    preds_output_folder=<optional folder to store the predictions> \
    probs_cache_file=null \
    decoding_mode=beamsearch_ngram \
    decoding_strategy="<Beam library such as beam, pyctcdecode or flashlight>"

See more information about LM decoding here.

It is possible to use NeMo to transcribe speech in real-time. We provide tutorial notebooks for Cache Aware Streaming and Buffered Streaming.

NeMo offers a variety of open-sourced pretrained ASR models that vary by model architecture:

  • encoder architecture (FastConformer, Conformer, Citrinet, etc.),

  • decoder architecture (Transducer, CTC & hybrid of the two),

  • size of the model (small, medium, large, etc.).

The pretrained models also vary by:

  • language (English, Spanish, etc., including some multilingual and code-switching models),

  • whether the output text contains punctuation & capitalization or not.

The NeMo ASR checkpoints can be found on HuggingFace, or on NGC. All models released by the NeMo team can be found on NGC, and some of those are also available on HuggingFace.

All NeMo ASR checkpoints open-sourced by the NeMo team follow this naming convention: stt_{language}_{encoder name}_{decoder name}_{model size}{_optional descriptor}.
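As a sketch, the convention can be unpacked like this. Note that `parse_checkpoint_name` is a hypothetical helper written for this example, not a NeMo API, and it assumes each field contains no extra underscores.

```python
def parse_checkpoint_name(name: str) -> dict:
    # Split stt_{language}_{encoder}_{decoder}_{size}{_optional descriptor}
    parts = name.split("_")
    if parts[0] != "stt":
        raise ValueError(f"not an stt checkpoint name: {name!r}")
    return {
        "language": parts[1],
        "encoder": parts[2],
        "decoder": parts[3],
        "size": parts[4],
        "descriptor": "_".join(parts[5:]) or None,
    }

info = parse_checkpoint_name("stt_en_fastconformer_transducer_large")
# -> language "en", encoder "fastconformer", decoder "transducer",
#    size "large", no optional descriptor
```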

You can load the checkpoints automatically using the ASRModel.from_pretrained() class method, for example:


import nemo.collections.asr as nemo_asr

# model will be fetched from NGC
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")

# if model name is prepended with "nvidia/", the model will be fetched from huggingface
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_fastconformer_transducer_large")

# you can also load open-sourced NeMo models released by other HF users using:
# asr_model = nemo_asr.models.ASRModel.from_pretrained("<HF username>/<model name>")

See further documentation about loading checkpoints, as well as a full list of models and their benchmark scores.

There is also more information about the ASR model architectures available in NeMo here.

You can try out transcription with a NeMo ASR model without leaving your browser, by using the HuggingFace Space embedded below.

This HuggingFace Space uses Canary-1B, the latest ASR model from NVIDIA NeMo. It sits at the top of the HuggingFace OpenASR Leaderboard at time of publishing.

Canary-1B is a multi-lingual, multi-task model, supporting automatic speech-to-text recognition (ASR) in 4 languages (English, German, French, Spanish) as well as translation between English and the 3 other supported languages.

Hands-on speech recognition tutorial notebooks can be found under the ASR tutorials folder. If you are a beginner to NeMo, consider trying out the ASR with NeMo tutorial. This and most other tutorials can be run on Google Colab by specifying the link to the notebooks’ GitHub pages on Colab.

Documentation regarding the configuration files specific to the nemo_asr models can be found in the Configuration Files section.

NeMo includes preprocessing scripts for several common ASR datasets. The Datasets section contains instructions on running those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data.

© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.