Checkpoints#
There are two main ways to load pretrained checkpoints in NeMo:

* Using the restore_from() method to load a local checkpoint file (.nemo), or
* Using the from_pretrained() method to download and set up a checkpoint from NGC.

Refer to the following sections for instructions and examples for each.
Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning. To resume an unfinished training experiment, use the Experiment Manager instead, by setting the resume_if_exists flag to True.
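For example, relaunching a training run with resumption enabled could look like the following. This is a sketch: the script shown is one of the standard NeMo ASR training examples, and the remaining arguments depend on your experiment.

python examples/asr/asr_ctc/speech_to_text_ctc.py \
    (...other parameters...) \
    exp_manager.resume_if_exists=true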
Loading Local Checkpoints#
NeMo automatically saves checkpoints of a model during training in the .nemo format. Alternatively, to save the model manually at any point, call model.save_to("<checkpoint_path>.nemo").
If there is a local .nemo checkpoint that you’d like to load, use the restore_from() method:
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.<MODEL_BASE_CLASS>.restore_from(restore_path="<path/to/checkpoint/file.nemo>")
Where the model base class is the ASR model class of the original checkpoint, or the general ASRModel
class.
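For example, assuming a CTC-based checkpoint saved locally (the file name below is hypothetical), the concrete EncDecCTCModel class can be used directly:

import nemo.collections.asr as nemo_asr

# Hypothetical local path; EncDecCTCModel is one concrete ASR base class
ctc_model = nemo_asr.models.EncDecCTCModel.restore_from(restore_path="my_quartznet.nemo")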
Hybrid ASR-TTS Models Checkpoints#
The hybrid ASR-TTS model is a transparent wrapper around the ASR model, a text-to-mel-spectrogram generator, and an optional enhancer.
The model is saved as a single .nemo checkpoint containing all of these parts.
Because the wrapper is transparent, the ASR model can be extracted on its own after training or fine-tuning, either through the asr_model attribute (a NeMo submodel):
hybrid_model.asr_model.save_to("<asr_checkpoint_path>.nemo")
or through the convenience wrapper:
hybrid_model.save_asr_model_to("<asr_checkpoint_path>.nemo")
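Putting this together, a minimal sketch of extracting the ASR part from a hybrid checkpoint (the file names are hypothetical, and ASRWithTTSModel is assumed here to be the hybrid wrapper class):

import nemo.collections.asr as nemo_asr

# Restore the full hybrid ASR-TTS checkpoint (hypothetical path)
hybrid_model = nemo_asr.models.ASRWithTTSModel.restore_from(restore_path="hybrid_asr_tts.nemo")

# Save only the ASR submodel for standalone evaluation or fine-tuning
hybrid_model.save_asr_model_to("asr_only.nemo")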
NGC Pretrained Checkpoints#
The ASR collection has checkpoints of several models trained on various datasets for a variety of tasks. These checkpoints are obtainable via the NGC NeMo Automatic Speech Recognition collection. The model cards on NGC contain more information about each available checkpoint.
The tables below list the ASR models available from NGC. The models can be accessed via the from_pretrained()
method inside
the ASR Model class. In general, you can load any of these models with code in the following format:
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.ASRModel.from_pretrained(model_name="<MODEL_NAME>")
Where the model name is the value shown in the “Model Name” column of the tables below.
For example, to load the base English QuartzNet model for speech recognition, run:
model = nemo_asr.models.ASRModel.from_pretrained(model_name="QuartzNet15x5Base-En")
You can also call from_pretrained()
from the specific model class (such as EncDecCTCModel
for QuartzNet) if you need to access a specific model functionality.
If you would like to programmatically list the models available for a particular base class, you can use the
list_available_models()
method.
nemo_asr.models.<MODEL_BASE_CLASS>.list_available_models()
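For example, to print the NGC names of all pretrained checkpoints registered for the CTC base class (a small sketch; the returned entries expose a pretrained_model_name attribute):

import nemo.collections.asr as nemo_asr

# List every pretrained checkpoint registered for EncDecCTCModel and print its NGC name
for model_info in nemo_asr.models.EncDecCTCModel.list_available_models():
    print(model_info.pretrained_model_name)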
Transcribing/Inference#
To perform inference and transcribe a sample of speech after loading the model, use the transcribe()
method:
model.transcribe(paths2audio_files=[list of audio files], batch_size=BATCH_SIZE, logprobs=False)
Setting the argument logprobs to True returns log probabilities instead of transcriptions. For more information, see nemo.collections.asr.modules.
The audio files should be 16 kHz single-channel (mono) WAV files.
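For example, a concrete call with two hypothetical local recordings:

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")

# Hypothetical 16 kHz mono WAV files
transcriptions = model.transcribe(paths2audio_files=["sample1.wav", "sample2.wav"], batch_size=2)
print(transcriptions[0])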
Inference on long audio#
In some cases the audio is too long for standard inference, especially if you’re using a model such as Conformer, where the time and memory costs of the attention layers scale quadratically with the duration.
There are two main ways of performing inference on long audio files in NeMo:
The first way is to use buffered inference, where the audio is divided into chunks, inference is run on each chunk, and the outputs are merged afterwards. The relevant scripts for this are included with the ASR examples in the NeMo repository.
The second way, specifically for models with the Conformer/Fast Conformer encoder, is to use local attention, which makes the attention costs scale linearly with audio duration.
You can train Fast Conformer models with Longformer-style (https://arxiv.org/abs/2004.05150) local+global attention using one of the following configs: CTC config at
<NeMo_git_root>/examples/asr/conf/fastconformer/fast-conformer-long_ctc_bpe.yaml
and transducer config at <NeMo_git_root>/examples/asr/conf/fastconformer/fast-conformer-long_transducer_bpe.yaml
.
You can also convert any model trained with full-context attention to local attention, though this may degrade accuracy (increase WER) in some cases. You can switch to local attention when running the transcription or evaluation scripts in the following way:
python speech_to_text_eval.py \
(...other parameters...) \
++model_change.conformer.self_attention_model="rel_pos_local_attn" \
++model_change.conformer.att_context_size=[128, 128]
Alternatively, you can change the attention model after loading a checkpoint:
asr_model = ASRModel.from_pretrained('stt_en_conformer_ctc_large')
asr_model.change_attention_model(
self_attention_model="rel_pos_local_attn",
att_context_size=[128, 128]
)
Inference on Apple M-Series GPU#
To perform inference on an Apple Mac M-Series GPU (the mps PyTorch device), use PyTorch 2.0 or higher (see the Mac computers with Apple silicon section). The environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 should be set, since not all PyTorch operations are currently implemented for the mps device.
If the allow_mps=true flag is passed to speech_to_text_eval.py, the mps device will be selected automatically.
PYTORCH_ENABLE_MPS_FALLBACK=1 python speech_to_text_eval.py \
(...other parameters...) \
allow_mps=true
Fine-tuning on Different Datasets#
There are multiple ASR tutorials provided in the Tutorials section. Most of these tutorials demonstrate how to instantiate a pre-trained model and prepare it for fine-tuning on a dataset in the same language. A minimal sketch of the same workflow in code is shown below.
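This sketch assumes a pretrained English model and a training manifest in the standard NeMo JSON-lines format; the model name, manifest path, and hyperparameters are placeholders, not recommendations.

import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Load a pretrained model to fine-tune (any model name from the tables below works here)
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")

# Point the model at the new training data (hypothetical manifest path)
train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
asr_model.setup_training_data(train_data_config=train_cfg)

# Attach a trainer and fine-tune
trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=5)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)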
Inference Execution Flow Diagram#
When preparing your own inference scripts, please follow the execution flow diagram for the correct inference order; the diagram can be found in the examples directory for the ASR collection.
Automatic Speech Recognition Models#
Below is a list of all the ASR models that are available in NeMo for specific languages, as well as auxiliary language models for certain languages.
Language Models for ASR#
| Model Name | Model Base Class | Model Card |
|---|---|---|
| asrlm_en_transformer_large_ls | TransformerLMModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:asrlm_en_transformer_large_ls |
Speech Recognition (Languages)#
English#
Mandarin#
| Model | Model Base Class | Model Card |
|---|---|---|
| stt_zh_citrinet_512 | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_citrinet_512 |
| stt_zh_citrinet_1024_gamma_0_25 | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_citrinet_1024_gamma_0_25 |
| stt_zh_conformer_transducer_large | EncDecRNNTModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_conformer_transducer_large |
German#
| Model | Model Base Class | Model Card |
|---|---|---|
| stt_de_quartznet15x5 | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_quartznet15x5 |
| stt_de_citrinet_1024 | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_citrinet_1024 |
| stt_de_contextnet_1024 | EncDecRNNTBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_contextnet_1024 |
| stt_de_conformer_ctc_large | EncDecCTCModelBPE | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_conformer_ctc_large |
| stt_de_conformer_transducer_large | EncDecRNNTBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_conformer_transducer_large |
| stt_de_fastconformer_hybrid_large_pc | EncDecHybridRNNTCTCBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_fastconformer_hybrid_large_pc |
French#
| Model | Model Base Class | Model Card |
|---|---|---|
| stt_fr_quartznet15x5 | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_quartznet15x5 |
| stt_fr_citrinet_1024_gamma_0_25 | EncDecCTCModelBPE | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_citrinet_1024_gamma_0_25 |
| stt_fr_no_hyphen_citrinet_1024_gamma_0_25 | EncDecCTCModelBPE | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_citrinet_1024_gamma_0_25 |
| stt_fr_contextnet_1024 | EncDecRNNTBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_contextnet_1024 |
| stt_fr_conformer_ctc_large | EncDecCTCModelBPE | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_conformer_ctc_large |
| stt_fr_no_hyphen_conformer_ctc_large | EncDecCTCModelBPE | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_conformer_ctc_large |
| stt_fr_conformer_transducer_large | EncDecRNNTBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_fr_conformer_transducer_large |
Polish#
| Model | Model Base Class | Model Card |
|---|---|---|
| stt_pl_quartznet15x5 | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_pl_quartznet15x5 |
| stt_pl_fastconformer_hybrid_large_pc | EncDecHybridRNNTCTCBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_pl_fastconformer_hybrid_large_pc |
Italian#
| Model | Model Base Class | Model Card |
|---|---|---|
| stt_it_quartznet15x5 | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_it_quartznet15x5 |
| stt_it_fastconformer_hybrid_large_pc | EncDecHybridRNNTCTCBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_it_fastconformer_hybrid_large_pc |
Russian#
| Model | Model Base Class | Model Card |
|---|---|---|
| stt_ru_quartznet15x5 | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_ru_quartznet15x5 |
Spanish#
| Model | Model Base Class | Model Card |
|---|---|---|
| stt_es_quartznet15x5 | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_quartznet15x5 |
| stt_es_citrinet_512 | EncDecCTCModelBPE | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_citrinet_512 |
| stt_es_citrinet_1024_gamma_0_25 | EncDecCTCModelBPE | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_citrinet_1024_gamma_0_25 |
| stt_es_conformer_ctc_large | EncDecCTCModelBPE | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_conformer_ctc_large |
| stt_es_conformer_transducer_large | EncDecRNNTBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_conformer_transducer_large |
| stt_es_contextnet_1024 | EncDecRNNTBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_contextnet_1024 |
| stt_es_fastconformer_hybrid_large_pc | EncDecHybridRNNTCTCBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_es_fastconformer_hybrid_large_pc |
Catalan#
| Model | Model Base Class | Model Card |
|---|---|---|
| stt_ca_quartznet15x5 | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_ca_quartznet15x5 |
| stt_ca_conformer_ctc_large | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_ca_conformer_ctc_large |
| stt_ca_conformer_transducer_large | EncDecRNNTBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_ca_conformer_transducer_large |
Hindi#
| Model Name | Model Base Class | Model Card |
|---|---|---|
| stt_hi_conformer_ctc_medium | EncDecCTCModelBPE | https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_hi_conformer_ctc_medium |
Marathi#
| Model Name | Model Base Class | Model Card |
|---|---|---|
| stt_mr_conformer_ctc_medium | EncDecCTCModelBPE | https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_mr_conformer_ctc_medium |
Kinyarwanda#
| Model | Model Base Class | Model Card |
|---|---|---|
| stt_rw_conformer_ctc_large | EncDecCTCModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_rw_conformer_ctc_large |
| stt_rw_conformer_transducer_large | EncDecRNNTBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_rw_conformer_transducer_large |
Belarusian#
| Model | Model Base Class | Model Card |
|---|---|---|
| stt_by_fastconformer_hybrid_large_pc | EncDecHybridRNNTCTCBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_by_fastconformer_hybrid_large_pc |
Ukrainian#
| Model | Model Base Class | Model Card |
|---|---|---|
| stt_ua_fastconformer_hybrid_large_pc | EncDecHybridRNNTCTCBPEModel | https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_ua_fastconformer_hybrid_large_pc |