Customization#
This section demonstrates customization options for ASR models. These options can be used with both the Streaming and Offline APIs.
These examples use the Riva ASR sample clients in Python to demonstrate Riva ASR features. You can build your own speech AI applications with Riva by using the API Reference and Python libraries, and by referring to the sample clients.
Refer to the table below for supported customizations for each model. Automatic Punctuation is supported for all models.
| Model | Word Boosting | Silero VAD | Profanity Filter | Speaker Diarization |
|---|---|---|---|---|
|  | ✅ | ✅ | ✅ | ✅ |
|  | ✅ | ✅ | ✅ | ✅ |
|  | ❌ | ❌ | ✅ | ❌ |
|  | ❌ | ❌ | ✅ | ✅ |
|  | ✅ | ✅ | ✅ | ✅ |
|  | ✅ | ✅ | ✅ | ✅ |
|  | ✅ | ✅ | ✅ | ✅ |
|  | ✅ | ✅ | ❌ | ✅ |
|  | ❌ | ❌ | ❌ | ❌ |
|  | ❌ | ❌ | ✅ | ❌ |
Runtime Customizations#
Runtime customizations can be applied without NIM server redeployment. These customizations are sent as parameters in the client request and are processed dynamically by the server.
Word Boosting#
Word boosting allows you to bias the ASR engine to recognize particular words of interest at request time by assigning them higher scores when decoding the acoustic model’s output. We recommend a boost score in the range of 20 to 100.
Copy an example audio file from the NIM container to the host machine, or use your own.
docker cp $CONTAINER_ID:/opt/riva/wav/en-US_wordboosting_sample.wav .
First, run ASR on the sample audio without word boosting.
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--language-code en-US \
--input-file en-US_wordboosting_sample.wav
Output:
## aunt bertha and ab loper both transformer based language models are examples of the emerging work in using graph neural networks to design protein sequences for particular target antigens
As seen in the output, ASR struggles to recognize domain-specific terms like AntiBERTa and ABlooper. You can apply word boosting to improve ASR accuracy for these domain-specific terms.
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--language-code en-US \
--input-file en-US_wordboosting_sample.wav \
--boosted-lm-words AntiBERTa --boosted-lm-score 20 \
--boosted-lm-words ABlooper --boosted-lm-score 20
Output:
## AntiBERTa and ABlooper both transformer based language models are examples of the emerging work in using graph neural networks to design protein sequences for particular target antigens
With word boosting enabled, ASR is able to correctly transcribe the domain-specific terms AntiBERTa and ABlooper.
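If you are calling the Riva Python API directly instead of the sample client, word boosting can be applied with riva.client.add_word_boosting_to_config, the same helper used in the Speech Hints example later in this section. The following is a minimal offline-recognition sketch using the sample file copied above; adjust the server address and file path for your setup.
import riva.client

auth = riva.client.Auth(uri="0.0.0.0:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    language_code="en-US",
    max_alternatives=1,
)
# Read the sample rate and channel count from the WAV header.
riva.client.add_audio_file_specs_to_config(config, "en-US_wordboosting_sample.wav")
# Boost the domain-specific terms with a score of 20 (recommended range: 20 to 100).
riva.client.add_word_boosting_to_config(config, ["AntiBERTa", "ABlooper"], 20.0)

with open("en-US_wordboosting_sample.wav", "rb") as f:
    response = asr_service.offline_recognize(f.read(), config)
# Join the transcripts of all returned segments.
print("".join(result.alternatives[0].transcript for result in response.results))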
Additional Information About Word Boosting#
The recommended boost score range is 20 to 100. The higher the boost score, the more biased the ASR engine becomes toward recognizing this word. Negative boost scores can even discourage the ASR engine from predicting certain words.
There is no limit to the number of words that can be boosted. You should not notice any significant impact on latency, even with ~100 boosted words, except for the first request, where slightly higher latency is expected.
For Parakeet 0.6b CTC Mandarin, boosted words must be specified with a space between each Mandarin character. For example:
--boosted-lm-words "望 岳 "
Automatic Punctuation#
Automatic punctuation and capitalization can be enabled by passing the flag --automatic-punctuation.
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--input-file en-US_sample.wav \
--language-code en-US \
--automatic-punctuation
Note
--automatic-punctuation applies punctuation only to the final transcripts. If punctuation is needed for partial transcripts, pass --custom-configuration="apply_partial_pnc:true" to the above command.
The previous command prints the transcript with punctuation and capitalization as shown in the following example.
## What is natural language processing?
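The same behavior is available through the Python API via the enable_automatic_punctuation field of RecognitionConfig, which also appears in the Speech Hints example later in this section. A minimal sketch of the relevant configuration:
config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,  # punctuate and capitalize final transcripts
)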
End of Utterance#
Endpointing is the process by which Riva ASR determines when a user has finished speaking. This allows the system to segment continuous audio streams into distinct utterances for accurate transcription. Proper endpointing ensures that transcripts are generated promptly and that partial or incomplete utterances are not prematurely finalized.
Riva ASR detects endpointing primarily through silence detection. The system monitors the audio stream for periods of silence and uses configurable thresholds to determine when speech has ended. When the system detects a sufficient duration of silence (typically measured in milliseconds), it triggers the endpointing mechanism to finalize the current utterance and generate the transcript. To configure the amount of silence required before end of utterance (EOU) is detected, use the --stop_history parameter.
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--input-file en-US_sample.wav \
--language-code en-US \
--stop_history 800
Note
--stop_history specifies silence duration in milliseconds and must be a multiple of 80ms. We recommend at least 560ms for good accuracy.
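As an illustration of the constraint in the note above, the following hypothetical helper (not part of the Riva client; shown only to make the 80 ms multiple and 560 ms minimum concrete) rounds a requested silence duration to a valid --stop_history value:
def valid_stop_history(requested_ms: int, frame_ms: int = 80, minimum_ms: int = 560) -> int:
    """Round a requested end-of-utterance silence duration to a valid --stop_history value."""
    rounded = max(minimum_ms, requested_ms)
    remainder = rounded % frame_ms
    if remainder:
        rounded += frame_ms - remainder  # round up to the next 80 ms boundary
    return rounded

print(valid_stop_history(800))  # 800 (already a multiple of 80 ms)
print(valid_stop_history(500))  # 560 (raised to the recommended minimum)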
Inverse Text Normalization#
Inverse text normalization (ITN) can be enabled by passing the --no-verbatim-transcripts flag.
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--input-file <your_file_with_ITNizable_values> \
--language-code en-US \
--no-verbatim-transcripts
Note
--no-verbatim-transcripts applies ITN only to the final transcripts. If ITN is needed for partial transcripts, pass --custom-configuration="apply_partial_itn:true" to the above command. The Canary and Whisper models apply ITN to transcripts by default, and this behavior cannot be turned off.
Profanity Filter#
Riva ASR models can detect profane words in your audio data and censor them in the transcript. This feature uses a pre-defined list of profane words and is supported only for the English language.
To enable the profanity filter, pass the --profanity-filter flag to the sample client. When enabled, profane words appear with only the first letter visible, followed by asterisks in the transcript (for example, f***).
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--input-file <your_file_with_profane_words> \
--language-code en-US \
--profanity-filter
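When using the Python API directly, the same option corresponds to the profanity_filter field of RecognitionConfig, shown in the Speech Hints example later in this section:
# Assuming `config` is a riva.client.RecognitionConfig built as in the earlier examples,
# the profanity filter is controlled by a single boolean field.
config.profanity_filter = True  # censor profane words in the transcript (English only)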
Silero VAD Customization#
Profiles with vad=silero use Silero VAD to detect the start and end of an utterance. The Silero VAD parameters control how speech boundaries (start/end) are identified. The default values are optimized for typical use cases, but they can be adjusted as needed for specific scenarios.
The following parameters can be configured at runtime using the --custom-configuration option.
| Parameter | Details | Range | Default |
|---|---|---|---|
| neural_vad.onset | Minimum probability threshold to detect the start of a speech segment | 0.0 to 1.0 | 0.85 |
| neural_vad.offset | Minimum probability threshold to detect the end of a speech segment | 0.0 to 1.0 | 0.3 |
| neural_vad.min_duration_on | Minimum duration (in seconds) of speech to be considered a valid segment | > 0 | 0.2 |
| neural_vad.min_duration_off | Minimum duration (in seconds) of silence to be considered a non-speech segment | > 0 | 0.5 |
|  | Duration (in seconds) to pad before the detected speech onset | > 0 | 0.3 |
|  | Duration (in seconds) to pad after the detected speech offset | > 0 | 0.08 |
Example of runtime configuration:
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--input-file <your_speech_file> \
--language-code en-US \
--custom-configuration="neural_vad.onset:0.9,neural_vad.offset:0.4,neural_vad.min_duration_on:0.3,neural_vad.min_duration_off:0.6"
Riva NIM can also return VAD (Voice Activity Detection) probabilities. To enable this feature, add get_vad_probabilities:true to the --custom-configuration parameter in the command above. When enabled, Riva NIM generates probability values for the entire buffer, with each value representing a 32ms segment of audio. These VAD probabilities indicate the likelihood of speech, ranging from 0 to 1, for each segment across the buffer. The text following the ## symbol represents the transcript corresponding to that buffer.
Run the following command; the output appears as shown below.
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--input-file en-US_sample.wav \
--language-code en-US \
--custom-configuration="neural_vad.onset:0.9,neural_vad.offset:0.4,neural_vad.min_duration_on:0.3,neural_vad.min_duration_off:0.6,get_vad_probabilities:true"
VAD States: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.00 0.00
VAD States: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
.
.
.
##what is natural language processing
VAD States: 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.63 0.97 0.98 0.96 0.93 0.94 0.93 0.92 0.94 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.57 0.61 0.48 0.23 0.14 0.92 0.99 0.99 0.99 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.99 0.94 0.94 0.92 0.85 0.53 0.32 0.42 0.72 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.93 0.62 0.23 0.08 0.03 0.02 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
.
.
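Each value in the VAD States output covers a 32 ms segment of the buffer. As a hedged illustration (plain text parsing only, not part of the Riva client API), the printed probabilities can be converted into approximate speech segments:
def speech_segments(vad_line: str, threshold: float = 0.5, frame_sec: float = 0.032):
    """Convert one 'VAD States:' line into (start, end) times of segments above the threshold."""
    probs = [float(p) for p in vad_line.replace("VAD States:", "").split()]
    segments, start = [], None
    for i, prob in enumerate(probs):
        if prob >= threshold and start is None:
            start = round(i * frame_sec, 3)
        elif prob < threshold and start is not None:
            segments.append((start, round(i * frame_sec, 3)))
            start = None
    if start is not None:
        segments.append((start, round(len(probs) * frame_sec, 3)))
    return segments

line = "VAD States: 0.00 0.00 0.97 0.98 0.96 0.10 0.00"
print(speech_segments(line))  # [(0.064, 0.16)]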
Speaker Diarization Customization#
Profiles with diarizer=sortformer use the Sortformer model for speaker diarization. For every final transcript generated at end-of-utterance detection, speaker tags are provided for all words in the transcript. Enable speaker diarization using the --speaker-diarization flag.
The following is an example of speaker diarization.
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--input-file <your_speech_file> \
--language-code en-US \
--speaker-diarization
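When using the Python API directly, speaker diarization is enabled on the RecognitionConfig. The sketch below assumes the add_speaker_diarization_to_config helper that ships with recent python-clients releases; verify the helper and its signature against your installed version.
# Assumption: add_speaker_diarization_to_config is available in your riva-client version.
config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    language_code="en-US",
    max_alternatives=1,
)
riva.client.add_speaker_diarization_to_config(config, diarization_enable=True)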
Deploy-Time Customizations#
These customizations require model artifacts to be prepared offline and the NIM server to be redeployed. These customizations cannot be configured from the client end and are applied at the server level.
Custom Vocabulary#
The Flashlight decoder, deployed by default in Riva ASR NIM, is a lexicon-based decoder and emits only words that are present in the provided vocabulary file. This means that domain-specific words that are not present in the vocabulary file cannot be generated.
To expand the decoder vocabulary, you need to build a custom model. In the riva-build command, pass the extended vocabulary file to the --decoding_vocab=<vocabulary_file> parameter. Out-of-the-box vocabulary files for Riva languages can be found on NGC; for example, for English, the vocabulary file named flashlight_decoder_vocab.txt can be found in the Riva ASR English (en-US) LM model. For information on how to use riva-build, refer to the Deploying Custom Models as NIM section.
Custom Pronunciation (Lexicon Mapping)#
When using the Flashlight decoder, the lexicon file provides a mapping between vocabulary dictionary words and their tokenized form (for example, sentence piece tokens for many Riva models).
Modifying the lexicon file serves two purposes:
Extends the vocabulary.
Provides one or more explicit custom pronunciations for a specific word. For example:
manu ▁ma n u
manu ▁man n n ew
manu ▁man n ew
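A lexicon entry is one line containing the word followed by its token sequence, as in the example above. As a hedged sketch (the file path is a placeholder, and you should check whether your generated lexicon separates the word from its tokens with a tab or a space before editing), extra pronunciations can be appended programmatically:
# Placeholder path; the real lexicon file is produced when building the model.
lexicon_path = "lexicon.txt"
custom_entries = [
    "manu ▁ma n u",
    "manu ▁man n n ew",
    "manu ▁man n ew",
]

with open(lexicon_path, "a", encoding="utf-8") as f:
    for entry in custom_entries:
        f.write(entry + "\n")  # one word + token sequence per line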
Custom Language Models#
Introducing a language model to an ASR pipeline is an easy way to improve accuracy for natural language, and you can fine-tune it for niche settings. An n-gram language model estimates the probability distribution over groups of n or fewer consecutive words/tokens, P (word-1, …, word-n). By altering or biasing the data on which a language model is trained, you change the distribution it is estimating. As a result, it can predict different transcriptions as more likely, altering the prediction without changing the acoustic model. Riva supports n-gram models that are trained and exported from KenLM.
Custom language models can provide a permanent solution for improving the recognition of domain-specific terms and phrases. You can mix a domain-specific custom LM with a general domain LM using a process called interpolation.
To deploy a custom n-gram language model file in binary format as part of an ASR NIM, pass the binary language model file to riva-build. Use the flag --decoding_language_model_binary=<lm_binary> for CTC models and --nemo_decoder.language_model_file=<nemo LM> for RNNT models.
Inverse Text Normalization#
Riva ASR NIM implements inverse text normalization (ITN) for ASR requests. It uses weighted finite-state transducer (WFST) based models to convert spoken-domain output from an ASR model into written-domain text to improve the readability of the ASR system’s output.
Text normalization converts text from written form into its verbalized form. It is used as a preprocessing step before text-to-speech (TTS) and can also be used for preprocessing ASR training transcripts.
ITN is the reverse operation and is part of the ASR post-processing pipeline: it converts the raw spoken-domain output of the ASR model into its written form to improve text readability.
Enable ITN by passing the --no-verbatim-transcripts flag.
python3 python-clients/scripts/asr/transcribe_file.py --server 0.0.0.0:50051 \
--input-file en-US_sample.wav \
--language-code en-US \
--no-verbatim-transcripts
Note
--no-verbatim-transcripts applies ITN only to the final transcripts. If you need ITN for partial transcripts, pass --custom-configuration="apply_partial_itn:true" to the command.
Riva implements NVIDIA NeMo ITN, which is based on WFST grammars. The tool uses Pynini to construct WFSTs. You can export the created grammars and integrate them into Sparrowhawk for production. Sparrowhawk is an open-source version of the Kestrel TTS text normalization system.
For example, with a functional NeMo installation, you can export the German ITN grammars with the pynini_export.py tool.
python3 pynini_export.py --output_dir . --grammars itn_grammars --input_case cased --language de
This exports the tokenizer_and_classify and verbalize FSTs as OpenFst finite state archive (FAR) files, ready to be deployed with Riva.
[NeMo I 2022-04-12 14:43:17 tokenize_and_classify:80] Creating ClassifyFst grammars.
Created ./de/classify/tokenize_and_classify.far
Created ./de/verbalize/verbalize.far
To deploy these ITN rules with Riva, pass the FAR files to the riva-build command under these options:
riva-build speech_recognition
[--wfst_tokenizer_model WFST_TOKENIZER_MODEL]
[--wfst_verbalizer_model WFST_VERBALIZER_MODEL]
Additionally, riva-build supports the --wfst_pre_process_model and --wfst_post_process_model arguments to pass the pre- and post-processing FAR files for inverse text normalization.
To learn more about how to build grammars from the ground up, consult the NeMo Weighted Finite State Transducers (WFST) tutorial.
Details on the model architecture can be found in the paper NeMo Inverse Text Normalization: From Development To Production.
Speech Hints#
Speech hints apply out-of-vocabulary (OOV) classes as part of the ASR post-processing pipeline. They use finite state transducers (FSTs) to normalize the output into a more readable format based on the expected OOV class.
Speech hints are applied to the spoken-domain output of ASR before the generated text is passed through ITN. The phrases to apply are added to the RecognitionConfig using SpeechContext.
import riva.client

uri = "localhost:50051"  # Default value
auth = riva.client.Auth(uri=uri)
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    max_alternatives=1,
    profanity_filter=False,
    enable_automatic_punctuation=True,
    verbatim_transcripts=False,
)

my_wav_file = PATH_TO_YOUR_WAV_FILE
speech_hints = ["$OOV_ALPHA_SEQUENCE", "i worked at the $OOV_ALPHA_SEQUENCE"]
boost_lm_score = 4.0

riva.client.add_audio_file_specs_to_config(config, my_wav_file)
riva.client.add_word_boosting_to_config(config, speech_hints, boost_lm_score)
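To send the request with this configuration, the audio can be passed through offline recognition (a sketch; streaming requests work analogously):
with open(my_wav_file, "rb") as f:
    response = asr_service.offline_recognize(f.read(), config)
print("".join(result.alternatives[0].transcript for result in response.results))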
The following classes and phrases are supported:
$OOV_NUMERIC_SEQUENCE
$OOV_ALPHA_SEQUENCE
$OOV_ALPHA_NUMERIC_SEQUENCE
$ADDRESSNUM
$FULLPHONENUM
$POSTALCODE
$OOV_CLASS_ORDINAL
$MONTH
Training or Fine-Tuning an Acoustic Model#
Model fine-tuning is a set of techniques for making fine adjustments to a pre-existing model with new data. This adapts it to new situations while it retains its original capabilities.
Model training is the process of training a new model either from scratch (that is, starting from random weights) or with weights initialized from an existing model. The goal is for the model to acquire new skills without necessarily retaining the original capabilities, such as in cross-language transfer learning.
Many use cases require training new models or fine-tuning existing ones with new data. In these cases, follow these best practices. Many of these best practices also apply to inputs at inference time.
Use lossless audio formats, if possible. The use of lossy codecs, such as MP3, can reduce quality.
Augment training data. Adding background noise to audio training data can initially decrease accuracy but increase robustness.
Limit vocabulary size if using scraped text. Many online sources contain typos or ancillary pronouns and uncommon words. Removing these can improve the language model.
Use a minimum sampling rate of 16kHz, if possible, but do not resample.
If using NeMo to fine-tune ASR models, refer to the Finetuning CTC models on other languages tutorial. We recommend fine-tuning ASR models only with sufficient data, on the order of several hundred hours of speech. If such data is not available, it can be more useful to adapt the LM on an in-domain text corpus than to train the ASR model.
There is no formal guarantee that the ASR model is or is not streamable after training.
Training New Models#
Train models from scratch - End-to-end training of ASR models requires large datasets and heavy compute resources. There are more than 5,000 languages around the world, but very few languages have datasets large enough to train high-quality ASR models. For this reason, we recommend training models from scratch only when several thousand hours of transcribed speech data are available.
Cross-language transfer learning - Cross-language transfer learning is especially helpful when training new models for low-resource languages. Even when a substantial amount of data is available, cross-language transfer learning can help boost the performance further.
It is based on the idea that phoneme representation can be shared across different languages. Experiments by the NeMo team showed that on as little as 16h of target language audio data, transfer learning works substantially better than training from scratch. In the GTC 2020 talk, NVIDIA data scientists demonstrate cross-language transfer learning for a low resource language with less than 30 hours of speech data.
Fine-Tuning Existing Models#
Fine-tune acoustic models when other, easier approaches have failed to address accuracy issues caused by significant acoustic factors, such as different accents, noisy environments, or poor audio quality.
We recommend fine-tuning ASR models with sufficient data, on the order of 100 hours of speech or more. The minimum number of hours that we used for NeMo transfer learning was approximately 100 hours for the CORAAL dataset, as shown in the [Cross-Language Transfer Learning, Continuous Learning, and Domain Adaptation for End-to-End Automatic Speech Recognition](https://arxiv.org/pdf/2005.04290.pdf) paper. Our experiments demonstrate that in all three cases of cross-language transfer learning, continuous learning, and domain adaptation, transfer learning from a good base model has higher accuracy than a model trained from scratch. It is also preferred to fine-tune large models rather than training small models from scratch, even if the dataset for fine-tuning is small.
Low-resource domain adaptation - For smaller datasets, such as approximately 10 hours, take appropriate precautions to avoid overfitting to the domain and sacrificing significant accuracy in the general domains, also known as catastrophic forgetting. If you perform fine-tuning on such a small dataset, mix it with other, larger datasets (“base”). For English, for example, NeMo has a list of public datasets that you can mix with your data.
In transfer learning, continual learning is a sub-problem in which models that are trained with new domain data should still retain good performance on the original source domain.
If you are using NeMo to fine-tune ASR models, refer to the Finetuning CTC models on other languages tutorial.
Data quality and augmentation - Use lossless audio formats, if possible. The use of lossy codecs, such as MP3, can reduce quality. As a regular practice, use a minimum sampling rate of 16kHz. You can also use Opus-encoded sources with 8K, 16K, 24K, or 48K sampling rates.
Augmenting training data with noise can improve the model’s ability to cope with noisy environments. Adding background noise to audio training data can initially decrease accuracy but increase robustness.
Punctuation and Capitalization Model#
ASR systems typically generate text with no punctuation or capitalization. In Riva, the punctuation and capitalization model is responsible for formatting the text with both punctuation and capitalization.
The punctuation and capitalization model should be customized when an out-of-the-box model does not perform well in the application context, such as when applying to a new language variant.
To either train or fine-tune, and then deploy a custom punctuation and capitalization model, refer to RIVA Punctuation and NeMo Punctuation and Capitalization.
Note
All models provide punctuation support; however, only the English Parakeet CTC model requires a separate punctuation and capitalization (PnC) model. For all other models, punctuation is handled within the ASR model itself.
Deploying a Custom Acoustic Model#
If using NVIDIA NeMo, first convert the model from .nemo format to .riva format using the nemo2riva tool that is available as part of the Riva distribution. Next, use the Riva ASR NIM container and tools (riva-build and riva-deploy) for deployment. For more information, refer to the Deploying Custom Models as NIM section.
Summary of Riva ASR Customizations#
The following table lists the corresponding customizations in increasing order of difficulty and effort:
| Techniques | Difficulty | What it Does | When to Use | How to Use |
|---|---|---|---|---|
| Word boosting | Quick and easy | Extends the vocabulary while increasing the chance of recognition for a provided list of keywords. This strategy enables you to easily improve recognition of specific words at request time. | When certain words or phrases are important in a particular context, for example, attendee names in a meeting. |  |
| Custom vocabulary | Easy | Extends the vocabulary while increasing the chance of recognition for a provided list of keywords. This strategy enables you to improve recognition of specific words. | When certain words or phrases are important in a particular context, for example, attendee names in a meeting. |  |
| Custom pronunciation (Lexicon mapping) | Easy | Explicitly guides the decoder to map pronunciations (that is, token sequences) to specific words. The lexicon decoder emits only words that are present in the decoder lexicon. It is possible to modify the lexicon used by the decoder to improve recognition. | When a word can have one or more possible pronunciations. |  |
| Retrain language model | Moderate | Trains a new language model for the application domain to improve the recognition of domain-specific terms. The Riva ASR pipeline supports the use of n-gram language models. Using a language model that is tailored to your use case can greatly help in improving the accuracy of transcripts. | When domain text data is available. |  |
| Fine-tune an existing acoustic model | Moderately hard | Fine-tunes an existing acoustic model using a small amount of domain data to better suit the domain. | When you have transcribed domain audio data (10h-100h) and other easier approaches fall short. |  |
| Train a new acoustic model | Hard | Trains a new acoustic model from scratch or with cross-language transfer learning, using thousands of hours of audio data. | Recommended only when adapting Riva to a new language or dialect. |  |