Speech Recognition¶
Automatic Speech Recognition (ASR) takes an audio stream or audio buffer as input and returns one or more text transcripts, along with additional optional metadata. Riva ASR provides a full, GPU-accelerated speech recognition pipeline with optimized performance and accuracy, and supports both synchronous and streaming recognition modes.
Riva ASR features include:
Support for offline and streaming use cases
A streaming mode that returns intermediate transcripts with low latency
GPU-accelerated feature extraction
Multiple (and growing) acoustic model architecture options accelerated by NVIDIA TensorRT
Beam search decoder based on n-gram language models
Voice activity detection algorithms (CTC-based)
Automatic punctuation
Ability to return top-N transcripts from beam decoder
Word-level timestamps
Inverse Text Normalization (ITN)
For more information, refer to the Speech To Text notebook, which is an end-to-end workflow for speech recognition. This workflow starts with training in TAO Toolkit and ends with deployment using Riva.
Model Architectures¶
Citrinet¶
Citrinet is the recommended new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. Citrinet is a deep residual neural model that uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive models and sequence-to-sequence or transducer models.
Details on the model architecture can be found in the paper Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition.
Conformer-CTC¶
The Conformer-CTC model is a non-autoregressive variant of the Conformer model (https://arxiv.org/abs/2005.08100) for Automatic Speech Recognition which uses CTC loss/decoding instead of a Transducer. For more information, refer to: Conformer-CTC Model.
The model used in Riva is a large version of Conformer-CTC (around 120M parameters) trained on NeMo ASRSet. The model transcribes speech using the lowercase English alphabet, along with spaces and apostrophes.
Jasper¶
The Jasper model is an end-to-end neural acoustic model for ASR that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference, by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting the strict real-time requirements of ASR systems in deployment.
The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences corresponding to a given audio segment during a post-processing step called decoding.
Details on the model architecture can be found in the paper Jasper: An End-to-End Convolutional Neural Acoustic Model.
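The decoding step described above can be illustrated with a simplified scoring rule: each hypothesis combines the acoustic model log-probability with a weighted language model log-probability and a word insertion bonus, in the spirit of the language_model_alpha and language_model_beta parameters exposed later by riva-build. This is a minimal illustrative sketch, not Riva's implementation; the function name and example scores are made up:

```python
def combined_score(am_logprob, lm_logprob, num_words, alpha=0.5, beta=1.0):
    """Score a hypothesis: acoustic score + weighted LM score + word insertion bonus."""
    return am_logprob + alpha * lm_logprob + beta * num_words

# Two candidate transcripts for the same audio: the language model can
# flip the ranking even when the acoustic model slightly prefers the other.
hyp_a = combined_score(am_logprob=-10.0, lm_logprob=-2.0, num_words=3)  # -8.0
hyp_b = combined_score(am_logprob=-9.5, lm_logprob=-6.0, num_words=3)   # -9.5
best = "a" if hyp_a > hyp_b else "b"
```

Here hypothesis A wins despite its lower acoustic score, because the language model judges it far more probable.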
QuartzNet¶
QuartzNet is the next generation of the Jasper speech recognition model. It improves on Jasper by replacing 1D convolutions with 1D time-channel separable convolutions. Doing this effectively factorizes the convolution kernels, enabling deeper models while reducing the number of parameters by over an order of magnitude.
Details on the model architecture can be found in the paper QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions.
Normalization¶
Riva implements inverse text normalization (ITN) for ASR requests using weighted finite-state transducer (WFST) based models to convert spoken-domain output from an ASR model into written-domain text, improving the readability of the ASR system's output.
Details on the model architecture can be found in the paper NeMo Inverse Text Normalization: From Development To Production.
Languages Supported¶
| Language | Language code | Supported Architectures |
|---|---|---|
| English | en-US | Jasper, QuartzNet, Citrinet, Conformer-CTC |
| German | de-DE | Citrinet |
| Russian | ru-RU | Citrinet |
| Spanish | es-US | Citrinet |
Services¶
Riva ASR supports both offline/batch and streaming inference modes.
Offline Recognition¶
In synchronous mode, the full audio signal is first read from a file or captured from a microphone. Following the capture of the entire signal, the client makes a request to the Riva Speech Server to transcribe it. The client then waits for the response from the server.
Note
This method can have long latency since the processing of the audio signal starts after the full audio signal has been captured or read from the file.
Streaming Recognition¶
In streaming recognition mode, as soon as an audio segment of a specified length is captured or read, a request is made to the server to process that segment. On the server side, a response is returned as soon as an intermediate transcript is available.
Note
You can select the length of the audio segments based on speed and memory requirements.
Refer to the riva/proto/riva_asr.proto documentation for more details.
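The chunked streaming behavior described above can be sketched as a simple generator that slices an audio buffer into the fixed-length segments a streaming client would send; the 160 ms chunk length and 16 kHz sample rate below are illustrative assumptions, not required values:

```python
def audio_chunks(samples, chunk_seconds=0.16, sample_rate=16000):
    """Yield fixed-length audio segments, as a streaming client would send them."""
    step = int(chunk_seconds * sample_rate)  # samples per chunk (2560 here)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

# One second of audio at 16 kHz, sent in 160 ms chunks:
audio = [0] * 16000
chunks = list(audio_chunks(audio))
# 16000 / 2560 = 6.25, so 7 chunks, the last one partial (640 samples)
```

Smaller chunks mean more frequent requests and lower-latency intermediate transcripts; larger chunks reduce per-stream request overhead.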
Pipeline Configuration¶
In the simplest use case, you can deploy an ASR model without any language model as follows:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=greedy \
--acoustic_model_name=<acoustic_model_name>
where:
<encryption_key> is the encryption key used during the export of the .riva file.
<pipeline_name> and <acoustic_model_name> are optional user-defined names for the components in the model repository.
<riva_filename> is the name of the .riva file to use as input.
<rmir_filename> is the Riva RMIR file that is generated.
Note
<acoustic_model_name> is global and can conflict across model pipelines. Override this only in cases when you know what other models will be deployed and there will not be any incompatibilities in model weights or input shapes.
Upon successful completion of this command, a file named <rmir_filename> is created in the /servicemaker-dev/ folder. Since no language model is specified, the Riva greedy decoder is used to predict the transcript based on the output of the acoustic model. If your .riva archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename and riva filename; otherwise, this is unnecessary.
The following summary lists the riva-build
commands used to generate the RMIR files from the Quickstart scripts for different models, modes, and their limitations:
Citrinet-1024, Streaming Low-Latency (no limitations):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-english-asr-streaming \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_lexicon> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
Citrinet-1024, Streaming High-Throughput (no limitations):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=citrinet-1024-english-asr-streaming \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_lexicon> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
Citrinet-1024, Offline (maximum audio duration of 15 minutes):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--offline \
--name=citrinet-1024-english-asr-offline \
--ms_per_timestep=80 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--chunk_size=900 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_lexicon> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
Conformer-CTC, Streaming Low-Latency (ONNX runtime only):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=conformer-en-US-asr-streaming \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--nn.use_onnx_runtime \
--vad.vad_start_history=200 \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.16 \
--left_padding_size=1.92 \
--right_padding_size=1.92 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_lexicon> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
Conformer-CTC, Streaming High-Throughput (ONNX runtime only):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=conformer-en-US-asr-streaming-throughput \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--nn.use_onnx_runtime \
--vad.vad_start_history=200 \
--vad.residue_blanks_at_start=-2 \
--chunk_size=0.8 \
--left_padding_size=1.6 \
--right_padding_size=1.6 \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_lexicon> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
Conformer-CTC, Offline (ONNX runtime only; maximum audio duration of 3 minutes):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--offline \
--name=conformer-en-US-asr-offline \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--ms_per_timestep=40 \
--nn.use_onnx_runtime \
--vad.vad_start_history=200 \
--chunk_size=200 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=flashlight \
--flashlight_decoder.asr_model_delay=-1 \
--decoding_language_model_binary=<lm_binary> \
--decoding_vocab=<decoder_lexicon> \
--flashlight_decoder.lm_weight=0.2 \
--flashlight_decoder.word_insertion_score=0.2 \
--flashlight_decoder.beam_threshold=20. \
--language_code=en-US
QuartzNet, Streaming Low-Latency (no limitations):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=quartznet-en-US-asr-streaming \
--decoder_type=os2s \
--decoding_language_model_binary=<lm_binary> \
--language_code=en-US
QuartzNet, Streaming High-Throughput (no limitations):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=quartznet-en-US-asr-streaming-throughput \
--chunk_size=0.8 \
--left_padding_size=0.8 \
--right_padding_size=0.8 \
--decoder_type=os2s \
--decoding_language_model_binary=<lm_binary> \
--language_code=en-US
QuartzNet, Offline (maximum audio duration of 15 minutes):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--offline \
--name=quartznet-en-US-asr-offline \
--chunk_size=900 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=os2s \
--decoding_language_model_binary=<lm_binary> \
--language_code=en-US
Jasper, Streaming Low-Latency (no limitations):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=jasper-en-US-asr-streaming \
--decoder_type=os2s \
--decoding_language_model_binary=<lm_binary> \
--language_code=en-US
Jasper, Streaming High-Throughput (no limitations):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--name=jasper-en-US-asr-streaming-throughput \
--chunk_size=0.8 \
--left_padding_size=0.8 \
--right_padding_size=0.8 \
--decoder_type=os2s \
--decoding_language_model_binary=<lm_binary> \
--language_code=en-US
Jasper, Offline (maximum audio duration of 15 minutes):
riva-build speech_recognition \
<rmir_filename>:<key> <riva_filename>:<key> \
--offline \
--name=jasper-en-US-asr-offline \
--chunk_size=900 \
--left_padding_size=0. \
--right_padding_size=0. \
--decoder_type=os2s \
--decoding_language_model_binary=<lm_binary> \
--language_code=en-US
Streaming/Offline Configuration¶
By default, the Riva RMIR file is configured to be used with the Riva StreamingRecognize
RPC call, for streaming use cases.
To use the Recognize
RPC call, generate the Riva RMIR file by adding the --offline
option.
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=greedy \
--offline \
--chunk_size=900 \
--padding_size=0.
where chunk_size
specifies the maximum audio duration in seconds. This value has an impact on the GPU memory usage and
can be increased or decreased depending on deployment scenarios. Furthermore, the default streaming Riva RMIR configuration is to
provide intermediate transcripts with very low latency. For use cases where being able to support additional concurrent audio streams is more important, run:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=greedy \
--chunk_size=0.8 \
--padding_size=0.8
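The trade-off between the default low-latency streaming configuration and the high-throughput one above can be made concrete with simple arithmetic: the chunk size determines how often the server runs inference per audio stream, so a larger chunk means fewer inference calls (more concurrent streams) at the cost of less frequent intermediate transcripts. The 0.16 s and 0.8 s values below are the chunk_size values used for low-latency and high-throughput streaming configurations in this document:

```python
def inference_calls_per_second(chunk_size_s):
    """Number of inference invocations per second of streamed audio."""
    return 1.0 / chunk_size_s

low_latency = inference_calls_per_second(0.16)    # 6.25 calls per audio-second
high_throughput = inference_calls_per_second(0.8) # 1.25 calls per audio-second
```

The high-throughput configuration runs inference 5x less often per stream, freeing GPU capacity for additional concurrent audio streams.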
Citrinet and Conformer-CTC Acoustic Models¶
The Citrinet and Conformer-CTC acoustic models have different properties than Jasper and QuartzNet. We recommend the following riva-build parameters to export Citrinet or Conformer-CTC for low-latency streaming recognition:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=greedy \
--chunk_size=0.16 \
--padding_size=1.92 \
--ms_per_timestep=80 \
--greedy_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False
For high throughput streaming recognition, chunk_size and padding_size can be set as follows:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=greedy \
--chunk_size=0.8 \
--padding_size=1.6 \
--ms_per_timestep=80 \
--greedy_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False
Finally, for offline recognition, we recommend the following settings:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--offline \
--name=<pipeline_name> \
--decoder_type=greedy \
--chunk_size=900. \
--padding_size=0. \
--ms_per_timestep=80 \
--greedy_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False
Language Models¶
Riva ASR supports decoding with an n-gram language model. The n-gram language model can be provided in a few different ways:
A .riva file exported from TAO Toolkit.
A .arpa format file.
A KenLM binary format file.
For more information on building language models, see Training Language Models.
When using the Jasper or QuartzNet acoustic model, you can configure the Riva ASR pipeline to use an n-gram language model stored in .riva format by running:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<acoustic_riva_filename>:<encryption_key> \
/servicemaker-dev/<n_gram_riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=os2s
When using the Citrinet or Conformer-CTC acoustic model, specify the language model by running the following riva-build
command:
riva-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
/servicemaker-dev/<n_gram_riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=flashlight \
--chunk_size=0.16 \
--padding_size=1.92 \
--ms_per_timestep=80 \
--flashlight_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--decoding_vocab=<vocabulary_filename>
where vocabulary_filename
is the vocabulary used by the Flashlight lexicon decoder. The vocabulary file must contain one vocabulary word per line.
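Because the lexicon decoder can only emit words present in this file, it is worth verifying that your domain terms are included before building. A small sketch that writes and validates a vocabulary file in the required one-word-per-line format (the file name and word list are illustrative):

```python
words = ["hello", "world", "nvidia", "riva"]

# Write one vocabulary word per line, as the Flashlight lexicon decoder expects.
with open("decoding_vocab.txt", "w") as f:
    f.write("\n".join(words) + "\n")

# Verify every non-empty line holds exactly one token.
with open("decoding_vocab.txt") as f:
    lines = [line.strip() for line in f if line.strip()]
assert all(len(line.split()) == 1 for line in lines)
```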
When using the Jasper or QuartzNet acoustic model, you can configure the Riva ASR pipeline to use an n-gram language model stored in .arpa format by running:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=os2s \
--decoding_language_model_arpa=<arpa_filename>
When using the Citrinet or Conformer-CTC acoustic model, the language model can be specified with the following riva-build
command:
riva-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=flashlight \
--chunk_size=0.16 \
--padding_size=1.92 \
--ms_per_timestep=80 \
--flashlight_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--decoding_language_model_arpa=<arpa_filename> \
--decoding_vocab=<vocabulary_filename>
When using a KenLM binary file to specify the language model, you can generate the Riva RMIR with:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=os2s \
--decoding_language_model_binary=<KenLM_binary_filename>
when using the Jasper or QuartzNet acoustic model and with:
riva-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=flashlight \
--chunk_size=0.16 \
--padding_size=1.92 \
--ms_per_timestep=80 \
--flashlight_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--decoding_language_model_binary=<KENLM_binary_filename> \
--decoding_vocab=<vocab_filename>
when using the Citrinet or Conformer-CTC acoustic model.
The decoder language model hyperparameters can also be specified from the riva-build
command.
When using the Jasper or QuartzNet acoustic models, the language model parameters alpha, beta, and beam_search_width can be specified with:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=os2s \
--decoding_language_model_binary=<KenLM_binary_filename> \
--os2s_decoder.beam_search_width=<beam_search_width> \
--os2s_decoder.language_model_alpha=<language_model_alpha> \
--os2s_decoder.language_model_beta=<language_model_beta>
With the Citrinet or Conformer-CTC acoustic model, you can specify the Flashlight decoder hyperparameters beam_size, beam_size_token, beam_threshold, lm_weight, and word_insertion_score as follows:
riva-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=flashlight \
--chunk_size=0.16 \
--padding_size=1.92 \
--ms_per_timestep=80 \
--flashlight_decoder.asr_model_delay=-1 \
--vad.residue_blanks_at_start=-2 \
--featurizer.use_utterance_norm_params=False \
--featurizer.precalc_norm_time_steps=0 \
--featurizer.precalc_norm_params=False \
--decoding_language_model_binary=<KenLM_binary_filename> \
--decoding_vocab=<vocab_filename> \
--flashlight_decoder.beam_size=<beam_size> \
--flashlight_decoder.beam_size_token=<beam_size_token> \
--flashlight_decoder.beam_threshold=<beam_threshold> \
--flashlight_decoder.lm_weight=<lm_weight> \
--flashlight_decoder.word_insertion_score=<word_insertion_score>
Flashlight Decoder Lexicon¶
The Flashlight decoder used in Riva is a lexicon-based decoder and only emits words that are present in the provided lexicon file. The lexicon file can be specified with the parameter --decoding_vocab
in the riva-build
command. You’ll need to ensure that words of interest are in the lexicon file. The Riva Service Maker automatically tokenizes the words in the lexicon file. It’s also possible to add additional tokenizations for the words in the lexicon by performing the following steps.
The riva-build and riva-deploy commands provided in the section above generate the lexicon tokenizations in the /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt file.
To add additional tokenizations to the lexicon, copy the lexicon file:
cp /data/models/citrinet-ctc-decoder-cpu-streaming/1/lexicon.txt decoding_lexicon.txt
and modify it to add the sentencepiece tokenizations for the word of interest. For example, one could add:
manu ▁ma n u
manu ▁man n n ew
manu ▁man n ew
to file decoding_lexicon.txt
so that the word manu
is generated in the transcript if the acoustic model predicts those tokens. You’ll need to ensure that the new lines follow the indentation/space pattern like the rest of the file and that the tokens used are part of the tokenizer model. Once this is done, regenerate the model repository using that new decoding lexicon tokenization by passing --decoding_lexicon=decoding_lexicon.txt
to riva-build
instead of --decoding_vocab=decoding_vocab.txt
.
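The steps above can be scripted. Each lexicon line is the word followed by its space-separated sentencepiece tokens; the tokenizations below are the example ones from the text and must correspond to real tokens in your tokenizer model (this sketch simply appends them to the copied lexicon file):

```python
# Additional sentencepiece tokenizations for the word "manu",
# taken from the example above. These must be valid tokens of
# the deployed tokenizer model.
extra_tokenizations = [
    "manu \u2581ma n u",
    "manu \u2581man n n ew",
    "manu \u2581man n ew",
]

# Append the additional tokenizations to the copied lexicon file.
with open("decoding_lexicon.txt", "a", encoding="utf-8") as f:
    for line in extra_tokenizations:
        f.write(line + "\n")
```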
GPU-accelerated Decoder¶
The Riva ASR pipeline can also use a GPU-accelerated weighted finite-state transducer (WFST) decoder that was initially developed for Kaldi. To use the GPU decoder with a language model defined by an .arpa file, run:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoding_language_model_arpa=<decoding_lm_arpa_filename> \
--decoder_type=kaldi
where <decoding_lm_arpa_filename> is the language model .arpa file to use during the WFST decoding phase.
Note
Conversion from an .arpa
file to a WFST graph can take a very long time, especially for large language models.
Also, large language models will increase GPU memory utilization. When using the GPU decoder, it is recommended to use different language
models for the WFST decoding phase and the lattice rescoring phase. This can be achieved by using the following riva-build
command:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoding_language_model_arpa=<decoding_lm_arpa_filename> \
--rescoring_language_model_arpa=<rescoring_lm_arpa_filename> \
--decoder_type=kaldi
where:
<decoding_lm_arpa_filename> is the language model .arpa file to use during the WFST decoding phase.
<rescoring_lm_arpa_filename> is the language model to use during the lattice rescoring phase.
Typically, one would use a small language model for the WFST decoding phase (for example, a pruned 2 or 3-gram language model) and a larger language model for the lattice rescoring phase (for example, an unpruned 4-gram language model).
For advanced users, it is also possible to configure the GPU decoder by specifying the decoding WFST file and the vocabulary
directly, instead of using an .arpa
file. For example:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoding_language_model_fst=<decoding_lm_fst_filename> \
--decoding_language_model_words=<decoding_lm_words_file> \
--decoder_type=kaldi
Furthermore, you can specify the .arpa
files to use in the case where lattice rescoring is needed:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoding_language_model_fst=<decoding_lm_fst_filename> \
--decoding_language_model_carpa=<decoding_lm_carpa_filename> \
--decoding_language_model_words=<decoding_lm_words_filename> \
--rescoring_language_model_carpa=<rescoring_lm_carpa_filename> \
--decoder_type=kaldi
where:
<decoding_lm_carpa_filename> is the language model constant ARPA (.carpa) representation to use during the WFST decoding phase.
<rescoring_lm_carpa_filename> is the language model constant ARPA (.carpa) representation to use during the lattice rescoring phase.
The GPU decoder hyperparameters (default_beam, lattice_beam, word_insertion_penalty, and acoustic_scale) can be set with the riva-build command as follows:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoding_language_model_arpa=<decoding_lm_arpa_filename> \
--lattice_beam=<lattice_beam> \
--kaldi_decoder.default_beam=<default_beam> \
--kaldi_decoder.acoustic_scale=<acoustic_scale> \
--rescorer.word_insertion_penalty=<word_insertion_penalty> \
--decoder_type=kaldi
Beginning/End of Utterance Detection¶
Riva ASR uses an algorithm that detects the beginning and end of utterances. This algorithm is used to reset the ASR decoder
state, and to trigger a call to the punctuator model. By default, the beginning of an utterance is flagged when 20% of the frames in
a 300ms window has non-blank characters, and the end of an utterance is flagged when 98% of the frames in a 800ms window are
blank characters. You can tune those values for their particular use case by using the following riva-build
command:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=greedy \
--vad.vad_start_history=300 \
--vad.vad_start_th=0.2 \
--vad.vad_stop_history=800 \
--vad.vad_stop_th=0.98
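The detection rule described above amounts to a sliding-window threshold over per-frame blank/non-blank decisions. A simplified, self-contained sketch of that logic (the frame representation is illustrative; Riva's actual implementation differs):

```python
def fraction(frames, predicate):
    """Fraction of frames in the window satisfying the predicate."""
    return sum(1 for f in frames if predicate(f)) / len(frames)

def utterance_start(window_frames, start_th=0.2):
    # Start flagged when at least 20% of frames in the window are non-blank.
    return fraction(window_frames, lambda f: f != "blank") >= start_th

def utterance_end(window_frames, stop_th=0.98):
    # End flagged when at least 98% of frames in the window are blank.
    return fraction(window_frames, lambda f: f == "blank") >= stop_th

# A 300ms window with 25% non-blank frames flags an utterance start.
started = utterance_start(["a"] * 25 + ["blank"] * 75)   # True
# An 800ms window with 99% blank frames flags an utterance end.
ended = utterance_end(["blank"] * 99 + ["a"])            # True
```

Raising vad_stop_th or lengthening vad_stop_history makes end-of-utterance detection more conservative, at the cost of slower decoder resets between utterances.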
Additionally, you can disable beginning/end of utterance detection with the following command:
riva-build speech_recognition \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
--name=<pipeline_name> \
--decoder_type=greedy \
--vad.vad_type=none
Note that in this case, the decoder state only gets reset after the client has sent the full audio signal. Similarly, the punctuator model only gets called once.
Inverse Text Normalization¶
Currently, the grammars are limited to English. In a future release, additional information on training, tuning, and loading custom grammars will be available.
Selecting Custom Model at Runtime¶
When receiving requests from the client application, the Riva server selects the deployed ASR model
to use based on the RecognitionConfig
of the client request. If no models are available to fulfill
the request, an error is returned. In the case where multiple models might be able to fulfill the
client request, one model is selected at random. You can also explicitly select which ASR model
to use by setting the model
field of the RecognitionConfig
protobuf object to the value of
<pipeline_name>
which was used with the riva-build
command. This enables you to deploy
multiple ASR pipelines concurrently and select which one to use at runtime.
Generating Multiple Transcript Hypotheses¶
By default, the Riva ASR pipeline is configured to only generate the best transcript hypothesis for
each utterance. It is possible to generate multiple transcript hypotheses by passing the parameter
--max_supported_transcripts=N
to the riva-build
command, where N
is the maximum number of
hypotheses to generate. With these changes, the client application can retrieve the multiple hypotheses
by setting the max_alternatives
field of RecognitionConfig
to values greater than 1.
Training Language Models¶
Introducing a language model to an ASR pipeline is an easy way to improve accuracy for natural language, and the model can be fine-tuned for niche settings. In short, an n-gram language model estimates the probability distribution over groups of n or fewer consecutive words, P(word-1, ..., word-n). By altering or biasing the data on which a language model is trained, and thus the distribution it estimates, you can make different transcriptions more likely, altering the prediction without changing the acoustic model. Riva supports n-gram models trained and exported from either NVIDIA TAO Toolkit or KenLM.
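For intuition, here is a maximum-likelihood bigram estimate computed from raw counts. Real toolkits such as KenLM add smoothing and backoff, which this sketch deliberately omits; the toy corpus is made up:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

# "the cat" occurs 2 times and "the" occurs 3 times, so P(cat | the) = 2/3.
p_cat_given_the = p_bigram("the", "cat")
```

Training on domain-specific text shifts these probabilities, which is exactly how a biased language model steers the decoder toward in-domain transcriptions.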
TAO Toolkit Language Model¶
The general TAO Toolkit model development pipeline is outlined in the Model Overview page. To train a new language model, run:
!tao n_gram train -e /specs/nlp/lm/n_gra/train.yaml \
export_to=PATH_TO_TAO_FILE \
training_ds.data_dir=PATH_TO_DATA \
model.order=4 \
model.pruning=[0,1,1,3] \
-k $KEY
To export a pre-trained model, run:
### For export to Riva
!tao n_gram export \
-e /specs/nlp/intent_slot_classification/export.yaml \
-m PATH_TO_TAO_FILE \
export_to=PATH_TO_RIVA_FILE \
binary_type=probing \
-k $KEY
KenLM Setup¶
KenLM is the recommended tool for building language models. This toolkit supports estimating, filtering, and querying n-gram language models. To begin, first make sure you have Boost and zlib installed. Depending on your requirements, you may need additional dependencies. Double-check by referencing the dependencies list.
After all dependencies are met, create a separate directory to build KenLM.
wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
Estimating¶
The next step is to gather and process data. In most cases, KenLM expects data to be natural language (suiting your use case). Common preprocessing steps include replacing numerics and removing umlauts, punctuation or special characters. However, it is most important that your preprocessing steps are consistent between both your language and acoustic model.
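A minimal normalization pass along the lines described (lowercasing, stripping punctuation, numerics, and special characters, collapsing whitespace). The exact rules are use-case dependent and, as noted above, must match the preprocessing applied to your acoustic model's training text:

```python
import re

def normalize(text):
    """Toy text normalization for LM training data (English, apostrophes kept)."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)    # drop punctuation, digits, symbols
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

line = normalize("Hello, World! It's 5 o'clock.")
# -> "hello world it's o'clock"
```

Note that this sketch simply deletes numerics; a production pipeline would more likely verbalize them ("5" to "five") so spoken numbers remain in the language model's vocabulary.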
Assuming your current working directory is the build subdirectory of KenLM, bin/lmplz performs estimation on the corpus provided through stdin and writes the ARPA (a human-readable form of the language model) to stdout. Running bin/lmplz documents the command-line arguments; however, here are a few important ones:
-o: Required. The order of the language model. Depends on use case, but generally 3 to 8.
-S: Memory to use. A number followed by % for percentage, b for bytes, K for kilobytes, and so on. Default is 80%.
-T: Temporary file location.
--text arg: Read text from a file instead of stdin.
--arpa arg: Write ARPA to a file instead of stdout.
--prune arg: Prune n-grams with count less than or equal to the given threshold, with one value specified for each order. For example, to prune singleton trigrams, --prune 0 0 1. The sequence of values must be non-decreasing, and the last value applies to all remaining orders. Default is to not prune. Unigram pruning is not supported, so the first number must be 0.
--limit_vocab_file arg: Read allowed vocabulary separated by whitespace from the file in the argument and prune all n-grams containing vocabulary items not in the list. Can be combined with pruning.
Pruning and limiting vocabulary help get rid of typos, uncommon words, and general outliers from the dataset, making the resulting ARPA smaller and generally less overfit, but potentially at the cost of losing some jargon or colloquial language.
With the appropriate options, the language model can be estimated.
bin/lmplz -o 4 < text > text.arpa
Querying and Evaluation¶
For faster loading, convert the ARPA file to binary.
bin/build_binary text.arpa text.binary
The binary or ARPA can be queried via the command-line.
bin/query text.binary < data
Pretrained Models¶
Deployment with Citrinet is currently recommended for most users. QuartzNet 1.2 is a smaller, more efficient model and is suitable for situations where reduced accuracy is acceptable in favor of higher throughput and lower latency.
| Task | Architecture | Language | Dataset | Sampling Rate | Compatibility with TAO Toolkit 3.0-21.08 | Compatibility with NeMo 1.0.0b4 |
|---|---|---|---|---|---|---|
| Transcription | Citrinet | English | ASR Set 3.0 - 16700 Hours | 16000 | Yes | Yes |
| Transcription | Conformer-CTC | English | ASR Set 3.0 - 16700 Hours | 16000 | Yes | Yes |
| Transcription | Jasper | English | ASR Set 1.2 with Noisy (profiles: room reverb, echo, wind, keyboard, baby crying) - 7K hours | 16000 | Yes | Yes |
| Transcription | QuartzNet | English | ASR Set 1.2 | 16000 | Yes | Yes |
| Transcription | Citrinet | Spanish | ASR Set 1.0 - 1800 Hours | 16000 | Yes | Yes |
| Transcription | Citrinet | German | ASR Set 1.0 - 2300 Hours | 16000 | Yes | Yes |
| Transcription | Citrinet | Russian | ASR Set 1.0 - 1700 Hours | 16000 | Yes | Yes |