Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.
Checkpoints
There are two main ways to load pretrained checkpoints in NeMo:
- Using the restore_from() method to load a local checkpoint file (.nemo), or
- Using the from_pretrained() method to download and set up a checkpoint from NGC.
See the following sections for instructions and examples for each.
Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning.
To resume an unfinished training experiment, use the experiment manager instead and set its resume_if_exists flag to True.
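The choice between the two loading methods can be wrapped in a small helper. The following is a minimal sketch, not part of NeMo: load_checkpoint is a hypothetical name, and it assumes that anything ending in .nemo is a local file and that anything else is a pretrained model name on NGC. It only calls restore_from() and from_pretrained() on whatever model class you pass in.

```python
from pathlib import Path


def load_checkpoint(model_cls, name_or_path):
    """Load a checkpoint from a local .nemo file or by pretrained model name.

    `model_cls` is any model class exposing restore_from() and
    from_pretrained(). This helper is hypothetical, not a NeMo API.
    """
    # Assumption: anything ending in ".nemo" is a local checkpoint file.
    if str(name_or_path).endswith(".nemo"):
        path = Path(name_or_path)
        if not path.is_file():
            raise FileNotFoundError(f"Checkpoint not found: {path}")
        return model_cls.restore_from(restore_path=str(path))
    # Assumption: everything else is a pretrained model name hosted on NGC.
    return model_cls.from_pretrained(model_name=name_or_path)
```

For example, load_checkpoint(nemo_asr.models.EncDecClassificationModel, "vad_marblenet") would download from NGC, while passing a path such as "<path/to/checkpoint/file.nemo>" would restore the local file.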
Loading Local Checkpoints
NeMo automatically saves checkpoints of a model you are training in the .nemo format.
You can also manually save your model at any point using model.save_to(<checkpoint_path>.nemo).
If you have a local .nemo checkpoint that you’d like to load, use the restore_from() method:
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.<MODEL_BASE_CLASS>.restore_from(restore_path="<path/to/checkpoint/file.nemo>")
Where the model base class is the ASR model class of the original checkpoint, or the general ASRModel class.
Transcribing/Inference
The audio files should be 16 kHz mono-channel WAV files.
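Before running inference, it can help to verify that a file meets this requirement. The sketch below uses only the Python standard library wave module; the helper name is_16khz_mono_wav is hypothetical and not part of NeMo. It assumes uncompressed PCM WAV input, which is what the wave module can read.

```python
import wave


def is_16khz_mono_wav(path):
    """Return True if `path` is a PCM WAV file with one channel at 16 kHz."""
    try:
        with wave.open(str(path), "rb") as wav:
            return wav.getnchannels() == 1 and wav.getframerate() == 16000
    except (wave.Error, EOFError):
        # The file exists but is not a readable PCM WAV file.
        return False
```

A missing file still raises FileNotFoundError, so path mistakes are not silently reported as a format problem.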
Transcribe speech command segment:
After loading the model, you may run inference and transcribe a sample of speech by using its transcribe() method:
mbn_model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="<MODEL_NAME>")
mbn_model.transcribe([list of audio files], batch_size=BATCH_SIZE, logprobs=False)
Setting the logprobs argument to True returns the log probabilities instead of transcriptions. You may find more details in Modules.
Learn how to fine-tune on your own data or on subset classes in <NeMo_git_root>/tutorials/asr/Speech_Commands.ipynb
Run VAD inference:
python <NeMo-git-root>/examples/asr/speech_classification/vad_infer.py --config-path="../conf/vad" --config-name="vad_inference_postprocessing.yaml" dataset=<Path of json file of evaluation data. Audio files should have unique names>
This script performs frame-level VAD prediction and, if needed, also runs postprocessing and generates speech segments.
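The dataset argument points to a JSON manifest of the evaluation data. As a rough sketch of how such a file could be produced, the helper below writes one JSON object per line and rejects duplicate file names up front. The helper name write_vad_manifest is hypothetical. The keys (audio_filepath, offset, duration, label) and their placeholder values are assumptions based on the usual NeMo manifest layout, and "unique names" is assumed to mean the file name without its extension, so check them against the configuration file and scripts you are using.

```python
import json
from collections import Counter
from pathlib import Path


def write_vad_manifest(audio_paths, manifest_path):
    """Write one JSON object per line, one line per audio file."""
    # Audio files should have unique names, so reject duplicates early.
    # Assumption: "name" means the file name without its extension.
    stems = Counter(Path(p).stem for p in audio_paths)
    duplicates = sorted(name for name, count in stems.items() if count > 1)
    if duplicates:
        raise ValueError(f"Duplicate audio file names: {duplicates}")

    with open(manifest_path, "w", encoding="utf-8") as f:
        for p in audio_paths:
            entry = {
                "audio_filepath": str(p),
                "offset": 0,       # assumed: start at the beginning of the file
                "duration": None,  # assumed: JSON null means the whole file
                "label": "infer",  # assumed placeholder label for inference
            }
            f.write(json.dumps(entry) + "\n")
```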
See the configuration file <NeMo-git-root>/examples/asr/conf/vad/vad_inference_postprocessing.yaml and the scripts under <NeMo-git-root>/scripts/voice_activity_detection for details regarding posterior processing, postprocessing, and threshold tuning.
Posterior processing includes generating predictions with overlapping input segments. Then a smoothing filter is applied to decide the label for a frame spanned by multiple segments.
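To make the overlap-and-smooth idea concrete, here is a minimal pure-Python sketch. It is an illustration only and not NeMo's implementation: the function name, the window and shift parameters (both counted in frames), and the choice between mean and median are assumptions. Segment i is assumed to cover frames i * shift up to i * shift + window, and each frame takes the mean or median of all segment predictions that span it.

```python
from statistics import mean, median


def smooth_overlapping(segment_probs, window, shift, method="mean"):
    """Turn per-segment speech probabilities into per-frame probabilities.

    Segment i is assumed to cover frames [i * shift, i * shift + window).
    """
    if not 0 < shift <= window:
        raise ValueError("shift must be positive and no larger than window")
    if not segment_probs:
        return []
    reduce = {"mean": mean, "median": median}[method]
    n_frames = (len(segment_probs) - 1) * shift + window
    # Collect every segment prediction that spans each frame.
    votes = [[] for _ in range(n_frames)]
    for i, prob in enumerate(segment_probs):
        for frame in range(i * shift, i * shift + window):
            votes[frame].append(prob)
    # Smooth: one value per frame from all predictions covering it.
    return [reduce(v) for v in votes]
```

The per-frame values can then be compared against thresholds to decide the speech or non-speech label of each frame.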
For VAD postprocessing we introduce:
- Binarization: onset and offset thresholds for detecting the beginning and end of speech; padding durations pad_onset before and pad_offset after each speech segment;
- Filtering: min_duration_on threshold for short speech segment deletion, min_duration_off threshold for small silence deletion, and filter_speech_first to control whether to perform short speech segment deletion first.
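The two postprocessing stages can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code NeMo runs: the function names are hypothetical, frame_len is the assumed duration of one frame in seconds, the default thresholds are arbitrary example values, and the silence threshold is assumed to be called min_duration_off. Speech starts when the probability reaches onset and ends when it drops below offset.

```python
def merge_overlaps(segments):
    """Merge segments that touch or overlap (for example after padding)."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def binarize(frame_probs, frame_len, onset=0.5, offset=0.3,
             pad_onset=0.0, pad_offset=0.0):
    """Return padded (start, end) speech segments in seconds."""
    total = len(frame_probs) * frame_len
    segments, start = [], None
    for i, prob in enumerate(frame_probs):
        if start is None and prob >= onset:
            start = i * frame_len               # speech begins
        elif start is not None and prob < offset:
            segments.append((start, i * frame_len))  # speech ends
            start = None
    if start is not None:                       # speech runs to the end
        segments.append((start, total))
    padded = [(max(0.0, s - pad_onset), min(total, e + pad_offset))
              for s, e in segments]
    return merge_overlaps(padded)


def drop_short_speech(segments, min_duration_on):
    """Delete speech segments shorter than min_duration_on."""
    return [(s, e) for s, e in segments if e - s >= min_duration_on]


def fill_short_silence(segments, min_duration_off):
    """Delete silences shorter than min_duration_off by joining neighbors."""
    filled = []
    for s, e in segments:
        if filled and s - filled[-1][1] < min_duration_off:
            filled[-1] = (filled[-1][0], max(filled[-1][1], e))
        else:
            filled.append((s, e))
    return filled


def filter_segments(segments, min_duration_on=0.0, min_duration_off=0.0,
                    filter_speech_first=True):
    """Apply both filters in the order chosen by filter_speech_first."""
    if filter_speech_first:
        kept = drop_short_speech(segments, min_duration_on)
        return fill_short_silence(kept, min_duration_off)
    joined = fill_short_silence(segments, min_duration_off)
    return drop_short_speech(joined, min_duration_on)
```

The order of the two filters matters: deleting short speech first can leave a long silence in place, whereas filling short silences first can join a short segment to its neighbor so that it survives.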
Identify language of utterance
You may load the model and identify the language of an audio file by using the get_label() method:
langid_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="<MODEL_NAME>")
lang = langid_model.get_label('<audio_path>')
Or you can run batch_inference() to perform inference on a manifest with a selected batch_size and get the trained model labels and gt_labels along with the logits:
langid_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="<MODEL_NAME>")
lang_embs, logits, gt_labels, trained_labels = langid_model.batch_inference(manifest_filepath, batch_size=32)
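The returned values can be turned into label names and an accuracy figure with a few lines of Python. The sketch below is hypothetical and rests on assumptions that are not stated above: logits is taken to have one row per utterance and one column per class, gt_labels is taken to hold integer indices into trained_labels, and the helper name summarize_predictions is made up for illustration. Verify the shapes and types in your NeMo version before relying on it.

```python
def summarize_predictions(logits, gt_labels, trained_labels):
    """Map each row of logits to a label name and compute accuracy.

    Assumes logits[i][c] is the score of class c for utterance i and that
    gt_labels[i] is an integer index into trained_labels.
    """
    # Index of the highest-scoring class for every utterance.
    predicted_ids = [max(range(len(row)), key=lambda c: row[c]) for row in logits]
    predicted_names = [trained_labels[i] for i in predicted_ids]
    correct = sum(int(p == int(g)) for p, g in zip(predicted_ids, gt_labels))
    accuracy = correct / len(predicted_ids) if predicted_ids else 0.0
    return predicted_names, accuracy
```

With the variables from the call above, this would be used as summarize_predictions(logits, gt_labels, trained_labels).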
NGC Pretrained Checkpoints
The Speech Classification collection has checkpoints of several models trained on various datasets for a variety of tasks. These checkpoints are obtainable via the NGC NeMo Automatic Speech Recognition collection. The model cards on NGC contain more information about each of the checkpoints available.
The tables below list the Speech Classification models available from NGC. The models can be accessed via the from_pretrained() method inside the ASR Model class.
In general, you can load any of these models with code in the following format:
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="<MODEL_NAME>")
Where the model name is the value under the “Model Name” entry in the tables below.
For example, to load the MatchboxNet3x2x64_v1 model for speech command detection, run:
model = nemo_asr.models.EncDecClassificationModel.from_pretrained(model_name="commandrecognition_en_matchboxnet3x2x64_v1")
You can also call from_pretrained() from the specific model class (such as EncDecClassificationModel for MatchboxNet and MarbleNet) if you need access to model-specific functionality.
If you would like to programmatically list the models available for a particular base class, you can use the list_available_models() method:
nemo_asr.models.<MODEL_BASE_CLASS>.list_available_models()
Speech Classification Models
Model Name | Model Base Class | Model Card |
---|---|---|
langid_ambernet | EncDecSpeakerLabelModel | https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/langid_ambernet |
vad_multilingual_marblenet | EncDecClassificationModel | https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/vad_multilingual_marblenet |
vad_marblenet | EncDecClassificationModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:vad_marblenet |
vad_telephony_marblenet | EncDecClassificationModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:vad_telephony_marblenet |
commandrecognition_en_matchboxnet3x1x64_v1 | EncDecClassificationModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:commandrecognition_en_matchboxnet3x1x64_v1 |
commandrecognition_en_matchboxnet3x2x64_v1 | EncDecClassificationModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:commandrecognition_en_matchboxnet3x2x64_v1 |
commandrecognition_en_matchboxnet3x1x64_v2 | EncDecClassificationModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:commandrecognition_en_matchboxnet3x1x64_v2 |
commandrecognition_en_matchboxnet3x2x64_v2 | EncDecClassificationModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:commandrecognition_en_matchboxnet3x2x64_v2 |
commandrecognition_en_matchboxnet3x1x64_v2_subset_task | EncDecClassificationModel | |
commandrecognition_en_matchboxnet3x2x64_v2_subset_task | EncDecClassificationModel | |