.. _asr_models:

Speech Recognition
==================

Automatic Speech Recognition (ASR) takes as input an audio stream or audio buffer and returns one
or more text transcripts, along with additional optional metadata. ASR represents a full speech
recognition pipeline that is GPU accelerated with optimized performance and accuracy. ASR supports
synchronous and streaming recognition modes.

Jarvis ASR features include:

- Support for offline and streaming use cases
- A streaming mode that returns intermediate transcripts with low latency
- GPU-accelerated feature extraction
- Multiple (and growing) acoustic model architecture options accelerated by
  `NVIDIA TensorRT <https://developer.nvidia.com/tensorrt>`_
- Beam search decoder based on n-gram language models
- Voice activity detection algorithms (CTC-based)
- Automatic punctuation
- Ability to return top-N transcripts from beam decoder
- Word-level timestamps

For more information, refer to the `Speech Recognition notebook `_, an end-to-end workflow for
speech recognition. This workflow starts with training in TLT and ends with deployment using
Jarvis.

Model Architectures
-------------------

Jasper
^^^^^^

The Jasper model is an end-to-end neural acoustic model for ASR that provides near state-of-the-art
results on LibriSpeech among end-to-end ASR models without any external data. The Jasper
architecture of convolutional layers was designed to facilitate fast GPU inference by allowing
whole sub-blocks to be fused into a single GPU kernel. This is important for meeting the strict
real-time requirements of ASR systems in deployment. The results of the acoustic model are combined
with the results of external language models to get the top-ranked word sequences corresponding to
a given audio segment during a post-processing step called decoding.

Details on the model architecture can be found in the paper
`Jasper: An End-to-End Convolutional Neural Acoustic Model <https://arxiv.org/abs/1904.03288>`_.

QuartzNet
^^^^^^^^^

QuartzNet is the next generation of the Jasper speech recognition model. It improves on Jasper by
replacing 1D convolutions with 1D time-channel separable convolutions. Doing this effectively
factorizes the convolution kernels, enabling deeper models while reducing the number of parameters
by over an order of magnitude.

Details on the model architecture can be found in the paper
`QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
<https://arxiv.org/abs/1910.10261>`_.

Services
--------

Jarvis ASR supports both offline/batch and streaming inference modes.

Offline Recognition
^^^^^^^^^^^^^^^^^^^

In synchronous mode, the full audio signal is first read from a file or captured from a microphone.
Following the capture of the entire signal, the client makes a request to the Jarvis Speech Server
to transcribe it. The client then waits for the response from the server.

**Note:** This method can have long latency since processing of the audio signal starts only after
the full audio signal has been captured or read from the file.

Streaming Recognition
^^^^^^^^^^^^^^^^^^^^^

In streaming recognition mode, as soon as an audio segment of a specified length is captured or
read, a request is made to the server to process that segment. On the server side, a response is
returned as soon as an intermediate transcript is available.

**Note:** You can select the length of the audio segments based on speed and memory requirements.

Refer to the :ref:`protobuf_docs_asr` documentation for more details.
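The following minimal Python sketch contrasts the two modes. It assumes the Jarvis Python client
bindings (the ``jarvis_api`` package shipped with the Jarvis client image), a server listening on
``localhost:50051``, and a local 16 kHz mono ``audio.wav`` file; the module, message, and field
names follow the Jarvis proto definitions but should be verified against your installed version.

.. code-block:: python

   import grpc
   import jarvis_api.audio_pb2 as ja
   import jarvis_api.jarvis_asr_pb2 as jasr
   import jarvis_api.jarvis_asr_pb2_grpc as jasr_srv

   channel = grpc.insecure_channel("localhost:50051")
   client = jasr_srv.JarvisASRStub(channel)

   config = jasr.RecognitionConfig(
       encoding=ja.AudioEncoding.LINEAR_PCM,
       sample_rate_hertz=16000,
       language_code="en-US",
       max_alternatives=1,
       enable_automatic_punctuation=True,
   )

   # Offline (Recognize): send the whole file in one request, then wait.
   with open("audio.wav", "rb") as f:
       content = f.read()
   response = client.Recognize(jasr.RecognizeRequest(config=config, audio=content))
   print(response.results[0].alternatives[0].transcript)

   # Streaming (StreamingRecognize): send fixed-size chunks and consume
   # intermediate transcripts as they arrive.
   streaming_config = jasr.StreamingRecognitionConfig(config=config, interim_results=True)

   def requests():
       # The first request carries the configuration, subsequent ones carry audio.
       yield jasr.StreamingRecognizeRequest(streaming_config=streaming_config)
       with open("audio.wav", "rb") as f:
           while True:
               chunk = f.read(4096)
               if not chunk:
                   break
               yield jasr.StreamingRecognizeRequest(audio_content=chunk)

   for resp in client.StreamingRecognize(requests()):
       for result in resp.results:
           print(result.alternatives[0].transcript)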
Pipeline Configuration
----------------------

In the simplest use case, you can deploy an ASR model without any language model as follows:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --acoustic_model_name=<acoustic_model_name>

where:

- ``<encryption_key>`` is the encryption key used during the export of the ``.ejrvs`` file.
- ``<pipeline_name>`` and ``<acoustic_model_name>`` are optional user-defined names for the
  components in the model repository.

  **Note:** ``<acoustic_model_name>`` is *global* and can conflict across model pipelines. Override
  it only when you know which other models will be deployed and are certain there are no
  incompatibilities in model weights or input shapes.

- ``<ejrvs_filename>`` is the name of the ``ejrvs`` file to use as input.
- ``<jmir_filename>`` is the Jarvis ``jmir`` file that will be generated.

Upon successful completion of this command, a file named ``<jmir_filename>`` is created in the
``/servicemaker-dev/`` folder. Since no language model is specified, the Jarvis greedy decoder is
used to predict the transcript based on the output of the acoustic model. If your ``.ejrvs``
archives are encrypted, you need to append ``:<encryption_key>`` to the JMIR filename and the
``ejrvs`` filename; otherwise, the suffix is unnecessary.

Streaming/Offline Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, the Jarvis JMIR file is configured to be used with the Jarvis ``StreamingRecognize``
RPC call, for streaming use cases. To use the ``Recognize`` RPC call instead, generate the Jarvis
JMIR file by adding the ``--offline`` option:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --offline

Furthermore, the default streaming Jarvis JMIR configuration provides intermediate transcripts with
very low latency. For use cases where supporting additional concurrent audio streams matters more
than latency, run:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --chunk_size=0.8 \
       --padding_factor=2 \
       --padding_size=0.8

Language Models
^^^^^^^^^^^^^^^

Jarvis ASR supports decoding with an n-gram language model. The n-gram language model can be stored
in ``.arpa`` format or in KenLM binary format.

To prepare the Jarvis JMIR configuration using an n-gram language model stored in ``.arpa`` format,
run:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --decoding_language_model_arpa=<arpa_filename>

To use Jarvis ASR with a KenLM binary file, generate the Jarvis JMIR with:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --decoding_language_model_binary=<kenlm_binary_filename>

The decoder language model hyper-parameters (``alpha``, ``beta``, and ``beam_search_width``) can
also be set from the ``jarvis-build`` command:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --decoding_language_model_binary=<kenlm_binary_filename> \
       --lm_decoder_cpu.beam_search_width=<beam_search_width> \
       --lm_decoder_cpu.language_model_alpha=<language_model_alpha> \
       --lm_decoder_cpu.language_model_beta=<language_model_beta>
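Conceptually, ``alpha`` weights the language model score and ``beta`` rewards word insertions
during beam search. The following sketch illustrates how these terms enter the ranking of candidate
transcripts. It uses the third-party ``kenlm`` Python bindings (``pip install kenlm``), which are
not part of Jarvis, and a hypothetical KenLM file ``lm.binary``; it is an illustration of the
scoring formula, not the Jarvis decoder implementation.

.. code-block:: python

   # Illustration only: how alpha and beta enter a beam-search score.
   # Assumes the third-party kenlm bindings and a hypothetical "lm.binary".
   import kenlm

   lm = kenlm.Model("lm.binary")
   alpha, beta = 0.5, 1.0  # example values; tune on a development set

   def combined_score(acoustic_log_prob: float, text: str) -> float:
       # Total score = acoustic score + alpha * LM log-prob + beta * word count.
       lm_log_prob = lm.score(text, bos=True, eos=True)  # log10 probability
       return acoustic_log_prob + alpha * lm_log_prob + beta * len(text.split())

   candidates = [(-12.3, "the cat sat"), (-11.9, "the cat sad")]
   best = max(candidates, key=lambda c: combined_score(*c))
   print(best[1])

Raising ``alpha`` trusts the language model more relative to the acoustic model; raising ``beta``
counteracts the decoder's bias toward shorter transcripts.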
GPU-accelerated Decoder
^^^^^^^^^^^^^^^^^^^^^^^

The Jarvis ASR pipeline can also use a GPU-accelerated weighted finite-state transducer (WFST)
decoder that was initially developed for `Kaldi <https://kaldi-asr.org/>`_. To use the GPU decoder
with a language model defined by an ``.arpa`` file, run:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --decoding_language_model_arpa=<decoding_lm_arpa_filename> \
       --gpu_decoder

where ``<decoding_lm_arpa_filename>`` is the language model ``.arpa`` file used during the WFST
decoding phase.

**Note:** Conversion from an ``.arpa`` file to a WFST graph can take a very long time, especially
for large language models. Large language models also increase GPU memory utilization.

When using the GPU decoder, it is recommended to use different language models for the WFST
decoding phase and the lattice rescoring phase. This can be achieved with the following
``jarvis-build`` command:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --decoding_language_model_arpa=<decoding_lm_arpa_filename> \
       --rescoring_language_model_arpa=<rescoring_lm_arpa_filename> \
       --gpu_decoder

where:

- ``<decoding_lm_arpa_filename>`` is the language model ``.arpa`` file used during the WFST
  decoding phase.
- ``<rescoring_lm_arpa_filename>`` is the language model used during the lattice rescoring phase.

Typically, one would use a small language model for the WFST decoding phase (for example, a pruned
2- or 3-gram language model) and a larger language model for the lattice rescoring phase (for
example, an unpruned 4-gram language model); see the sketch after this section.

For advanced users, it is also possible to configure the GPU decoder by specifying the decoding
WFST file and the vocabulary directly, instead of using an ``.arpa`` file. For example:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --decoding_language_model_fst=<decoding_lm_fst_filename> \
       --decoding_language_model_words=<decoding_lm_words_filename> \
       --gpu_decoder

Furthermore, you can specify the ``.carpa`` files to use in the case where lattice rescoring is
needed:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --decoding_language_model_fst=<decoding_lm_fst_filename> \
       --decoding_language_model_carpa=<decoding_lm_carpa_filename> \
       --decoding_language_model_words=<decoding_lm_words_filename> \
       --rescoring_language_model_carpa=<rescoring_lm_carpa_filename> \
       --gpu_decoder

where:

- ``<decoding_lm_carpa_filename>`` is the language model constant arpa (``.carpa``) representation
  to use during the WFST decoding phase.
- ``<rescoring_lm_carpa_filename>`` is the language model constant arpa (``.carpa``) representation
  to use during the lattice rescoring phase.

The GPU decoder hyper-parameters (``default_beam``, ``lattice_beam``, ``word_insertion_penalty``,
and ``acoustic_scale``) can be set with the ``jarvis-build`` command as follows:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --decoding_language_model_arpa=<decoding_lm_arpa_filename> \
       --lattice_beam=<lattice_beam> \
       --lm_decoder_gpu.default_beam=<default_beam> \
       --lm_decoder_gpu.acoustic_scale=<acoustic_scale> \
       --rescorer.word_insertion_penalty=<word_insertion_penalty> \
       --gpu_decoder
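To see why a small decoding model plus a large rescoring model is a good trade-off, the sketch
below re-ranks an n-best list in two passes. Jarvis's lattice rescoring operates on the full
lattice on the GPU; this simplified n-best variant (again using the third-party ``kenlm`` bindings
and hypothetical file names) only illustrates the two-pass idea:

.. code-block:: python

   # First pass: decode with a small, pruned LM (fast, small WFST graph).
   # Second pass: re-rank surviving hypotheses with a large, unpruned LM.
   # File names are hypothetical; kenlm is a third-party package.
   import kenlm

   small_lm = kenlm.Model("pruned_3gram.binary")
   large_lm = kenlm.Model("unpruned_4gram.binary")

   # Hypotheses as (acoustic log-prob, text) pairs from the acoustic model.
   hypotheses = [(-11.2, "the quick brown fox"), (-11.5, "the quick brown socks")]

   def first_pass(acoustic: float, text: str) -> float:
       return acoustic + small_lm.score(text, bos=True, eos=True)

   def rescore(acoustic: float, text: str) -> float:
       return acoustic + large_lm.score(text, bos=True, eos=True)

   # Keep the best few hypotheses under the cheap model ...
   n_best = sorted(hypotheses, key=lambda h: first_pass(*h), reverse=True)[:2]
   # ... then pick the final answer with the expensive model.
   best = max(n_best, key=lambda h: rescore(*h))
   print(best[1])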
Beginning/End of Utterance Detection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Jarvis ASR uses an algorithm that detects the beginning and end of utterances. This algorithm is
used to reset the ASR decoder state, and to trigger a call to the punctuator model. By default, the
beginning of an utterance is flagged when 20% of the frames in a 300 ms window have non-blank
characters, and the end of an utterance is flagged when 98% of the frames in an 800 ms window are
blank characters. You can tune those values for your particular use case with the following
``jarvis-build`` command:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --vad.vad_start_history=300 \
       --vad.vad_start_th=0.2 \
       --vad.vad_stop_history=800 \
       --vad.vad_stop_th=0.98

Additionally, it is possible to disable the beginning/end of utterance detection as follows:

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --vad.vad_type=none

Note that in this case, the decoder state is only reset once the full audio signal has been sent by
the client. Similarly, the punctuator model is only called once.

Non-English Languages
^^^^^^^^^^^^^^^^^^^^^

The default parameter values that can be provided to the ``jarvis-build`` command give accurate
transcripts for most use cases. However, for some languages like Mandarin, those parameter values
must be tuned. When transcribing Mandarin, the recommended values are:

**For streaming recognition**

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --chunk_size=1.6 \
       --padding_size=3.2 \
       --padding_factor=4 \
       --vad.vad_stop_history=1600 \
       --vad.vad_start_history=200 \
       --vad.vad_start_th=0.1

**For offline recognition**

.. prompt:: bash
   :substitutions:

   jarvis-build speech_recognition \
       /servicemaker-dev/<jmir_filename>:<encryption_key> \
       /servicemaker-dev/<ejrvs_filename>:<encryption_key> \
       --name=<pipeline_name> \
       --offline \
       --padding_size=3.2 \
       --padding_factor=2 \
       --vad.vad_stop_history=1600 \
       --vad.vad_start_history=200 \
       --vad.vad_start_th=0.1

Selecting Custom Model at Runtime
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When receiving requests from the client application, the Jarvis server selects which deployed ASR
model to use based on the ``RecognitionConfig`` of the client request. If no models are available
to fulfill the request, an error is returned. If multiple models could fulfill the client request,
one model is selected at random.

You can also explicitly select which ASR model to use by setting the ``model`` field of the
``RecognitionConfig`` protobuf object to the value of ``<pipeline_name>`` that was used with the
``jarvis-build`` command. This allows you to deploy multiple ASR pipelines concurrently and select
which one to use at runtime, as shown in the sketch following the table below.

Pretrained Models
-----------------

.. list-table::
   :widths: 10 15 10 15 5 10 10 5
   :header-rows: 1

   * - Task
     - Architecture
     - Language
     - Dataset
     - Sampling Rate
     - Compatibility with TLT 3.0
     - Compatibility with NeMo 1.0.0b4
     - Link
   * - Transcription
     - Jasper
     - English
     - ASR Set 1.2 with Noisy (profiles: room reverb, echo, wind, keyboard, baby crying) - 7K hours
     -
     - Yes
     - Yes
     - `EJRVS `_
   * - Transcription
     - QuartzNet
     - English
     - ASR Set 1.2
     -
     - Yes
     - Yes
     - `EJRVS `_ \ `JMIR `_
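As a usage note for the Selecting Custom Model at Runtime section above, the sketch below pins a
request to one deployed pipeline by setting the ``model`` field. It assumes the same ``jarvis_api``
Python bindings as the earlier client sketch, and a pipeline that was built with a hypothetical
``--name=quartznet-en-US``:

.. code-block:: python

   import grpc
   import jarvis_api.audio_pb2 as ja
   import jarvis_api.jarvis_asr_pb2 as jasr
   import jarvis_api.jarvis_asr_pb2_grpc as jasr_srv

   channel = grpc.insecure_channel("localhost:50051")
   client = jasr_srv.JarvisASRStub(channel)

   # Setting `model` pins the request to one deployed pipeline; leaving it
   # empty lets the server pick any compatible deployed model.
   config = jasr.RecognitionConfig(
       encoding=ja.AudioEncoding.LINEAR_PCM,
       sample_rate_hertz=16000,
       language_code="en-US",
       model="quartznet-en-US",  # hypothetical value passed to --name at build time
   )

   with open("audio.wav", "rb") as f:
       response = client.Recognize(jasr.RecognizeRequest(config=config, audio=f.read()))
   print(response.results[0].alternatives[0].transcript)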