Overview¶

Jarvis handles deployments of full pipelines, which can be composed of one or more NVIDIA Transfer Learning Toolkit (TLT) models and other pre- and post-processing components. Additionally, the TLT models have to be exported to an efficient inference engine and optimized for the target platform. Therefore, the Jarvis server cannot use NVIDIA NeMo or TLT models directly, because each of them represents only a single model.

The process of gathering all the required artifacts (for example, models, files, configurations, and user settings) and generating the inference engines is referred to as the Jarvis model repository generation. The Jarvis ServiceMaker Docker image has all the tools necessary to generate the Jarvis model repository and can be pulled from NGC as follows:

docker pull nvcr.io/nvidia/jarvis-service-maker:1.0.0b1-rc4

The Jarvis model repository generation is done in two phases:

Phase 1: The build phase. During the build phase, all the necessary artifacts (models, files, configurations, and user settings) required to deploy a Jarvis service are gathered together into an intermediate file called JMIR (Jarvis Model Intermediate Representation). For more information, refer to the Jarvis-Build section.

Phase 2: The deploy phase. During the deploy phase, the JMIR file is converted into the Jarvis model repository and the neural networks in TLT or NeMo format are exported and optimized to run on the target platform. The deploy phase should be executed on the physical cluster on which the Jarvis server will be deployed. For more information, refer to the Jarvis-Deploy section.

Jarvis-Build¶

The jarvis-build tool is responsible for deployment preparation. Its only output is an intermediate format (called a JMIR) of an end-to-end pipeline for the supported services within Jarvis. The tool can take multiple different types of models as inputs. Currently, the following pipelines are supported:

- speech_recognition (for ASR)
- speech_synthesis (for TTS)
- qa (for question answering)
- token_classification (for Named Entity Recognition)
- intent_slot (for joint intent and slot classification)
- text_classification
- punctuation

To run the jarvis-build tool, first launch an interactive session inside the Jarvis ServiceMaker image:
```
docker run --gpus all -it --rm -v <artifact_dir>:/servicemaker-dev -v <jarvis_repo_dir>:/data --entrypoint="/bin/bash" nvcr.io/nvidia/jarvis-service-maker:1.0.0b1-rc4
```
where:
- <artifact_dir> is the folder or Docker volume that contains the Jarvis .ejrvs file and other artifacts required to prepare the Jarvis model repository.
- <jarvis_repo_dir> is the folder or Docker volume where the Jarvis model repository will be generated.
Run the jarvis-build command from within the container.
```
jarvis-build <pipeline> /servicemaker-dev/<jmir_filename>:<encryption_key> /servicemaker-dev/<ejrvs_filename>:<encryption_key> <optional_args>
```
where:
- <pipeline> must be one of the following:
  - speech_recognition
  - speech_synthesis
  - qa
  - token_classification
  - intent_slot
  - text_classification
- <jmir_filename> is the name of the JMIR file that will be generated.
- <ejrvs_filename> is the name of the .ejrvs file(s) to use as input.
- <optional_args> are optional arguments that can be used to configure the Jarvis service. The next section covers the different ways the ASR, NLP, and TTS services can be configured.
- <encryption_key> is optional. If the .ejrvs file was generated without an encryption key, the input and output files can be specified as <ejrvs_filename> and <jmir_filename> instead of <ejrvs_filename>:<encryption_key> and <jmir_filename>:<encryption_key>.
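
Because the encryption key is optional, a wrapper script has to decide whether to append the `:<encryption_key>` suffix. The following is a minimal sketch of one way to do that. The helper name `with_key`, the file names, and the `JARVIS_KEY` environment variable are hypothetical and are not part of Jarvis. The command is only printed so that the sketch can be tried outside the container.

```shell
# Append ":<key>" to a path only when a non-empty key is given.
# (with_key is a hypothetical helper, not a Jarvis tool.)
with_key() {
    path=$1
    key=${2:-}
    if [ -n "$key" ]; then
        printf '%s:%s\n' "$path" "$key"
    else
        printf '%s\n' "$path"
    fi
}

# Hypothetical file names; the key is read from an assumed environment
# variable and may be empty.
JMIR_FILE=asr.jmir
EJRVS_FILE=asr.ejrvs
KEY=${JARVIS_KEY:-}

# Print the command instead of running it. Inside the ServiceMaker
# container, remove the leading "echo" to execute it.
echo jarvis-build speech_recognition \
    "$(with_key "/servicemaker-dev/$JMIR_FILE" "$KEY")" \
    "$(with_key "/servicemaker-dev/$EJRVS_FILE" "$KEY")"
```

With an empty key, both paths are passed without a suffix, which matches the unencrypted case described above.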

ASR¶

In the simplest use case, you can deploy an ASR model without any language model as follows:

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key>  \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name>

where:

- <encryption_key> is the encryption key used during the export of the .ejrvs file.
- <acoustic_model_name> is the name of the acoustic model, which will be used to name the Jarvis ASR model in Triton.
- <ejrvs_filename> is the name of the .ejrvs file to use as input.
- <jmir_filename> is the name of the Jarvis JMIR file that will be generated.

Upon successful completion of this command, a file named <jmir_filename> will be created in the /servicemaker-dev/ folder. Since no language model is specified, the Jarvis greedy decoder will be used to predict the transcript based on the output of the acoustic model.

Language Models¶

Jarvis ASR supports decoding with an n-gram language model. The n-gram language model can be stored in a .arpa format or a KenLM binary format.

To prepare the Jarvis JMIR configuration using an n-gram language model stored in arpa format, run:

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key>  \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--decoding_language_model_arpa=<arpa_filename>

To use Jarvis ASR with a KenLM binary file, generate the Jarvis JMIR with:

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--decoding_language_model_binary=<KenLM_binary_filename>
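
The two commands above differ only in the language model option. As a rough sketch, a script could pick the option from the file name. The helper name `lm_option` and the file names are hypothetical, and the rule "anything that does not end in .arpa is a KenLM binary" is an assumption that should be adapted to your own naming scheme.

```shell
# Print the jarvis-build language model option for a given file.
# Assumption: "*.arpa" means ARPA format, anything else is treated
# as a KenLM binary. (lm_option is a hypothetical helper.)
lm_option() {
    case "$1" in
        *.arpa) printf '%s\n' "--decoding_language_model_arpa=$1" ;;
        *)      printf '%s\n' "--decoding_language_model_binary=$1" ;;
    esac
}

# Hypothetical language model file name.
LM_FILE=my_lm.arpa

# Print the command instead of running it. Inside the ServiceMaker
# container, remove the leading "echo" to execute it.
echo jarvis-build speech_recognition \
    /servicemaker-dev/asr.jmir \
    /servicemaker-dev/asr.ejrvs \
    --acoustic_model_name=my_acoustic_model \
    "$(lm_option "$LM_FILE")"
```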

The decoder language model hyper-parameters (alpha, beta, and beam_search_width) can also be set from the jarvis-build command.

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--decoding_language_model_binary=<KenLM_binary_filename> \
--lm_decoder_cpu.beam_search_width=<beam_search_width> \
--lm_decoder_cpu.language_model_alpha=<language_model_alpha> \
--lm_decoder_cpu.language_model_beta=<language_model_beta>

Streaming/Offline Configuration¶

By default, the Jarvis JMIR file is configured to be used with the Jarvis StreamingRecognize RPC call, for streaming use cases. To use the Recognize RPC call for offline use cases, generate the Jarvis JMIR file by adding the --offline option.

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--offline

Furthermore, the default streaming Jarvis JMIR configuration is to provide intermediate transcripts with very low latency. For use cases where being able to support additional concurrent audio streams is more important, run:

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--chunk_size=0.8 \
--padding_factor=2 \
--padding_size=0.8
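
The three configurations in this section (default low-latency streaming, higher-throughput streaming, and offline) can be collected in one place. The sketch below assumes a hypothetical helper `asr_mode_options` and hypothetical mode labels; only the option names and values are taken from the commands above.

```shell
# Print the extra jarvis-build options for a serving mode.
# The mode labels are hypothetical; the options are the ones shown
# in this section.
asr_mode_options() {
    case "$1" in
        low_latency)
            # Default streaming configuration: no extra options.
            ;;
        high_throughput)
            printf '%s\n' "--chunk_size=0.8 --padding_factor=2 --padding_size=0.8"
            ;;
        offline)
            printf '%s\n' "--offline"
            ;;
        *)
            echo "unknown mode: $1" >&2
            return 1
            ;;
    esac
}

MODE=high_throughput

# The command substitution is left unquoted on purpose so that the
# options are split into separate words. The command is only printed;
# remove the leading "echo" to run it inside the container.
echo jarvis-build speech_recognition \
    /servicemaker-dev/asr.jmir \
    /servicemaker-dev/asr.ejrvs \
    --acoustic_model_name=my_acoustic_model \
    $(asr_mode_options "$MODE")
```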

GPU-accelerated Decoder¶

The Jarvis ASR pipeline can also use a GPU-accelerated weighted finite-state transducer (WFST) decoder that was initially developed for Kaldi. To use the GPU decoder, using a language model defined by an .arpa file, run:

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--decoding_language_model_arpa=<decoding_lm_arpa_filename> \
--gpu_decoder

where <decoding_lm_arpa_filename> is the language model .arpa file to use during the WFST decoding phase.

Note: Conversion from an .arpa file to a WFST graph can take a long time, especially for large language models. Also, large language models will increase GPU memory utilization. When using the GPU decoder, it is recommended to use different language models for the WFST decoding phase and the lattice rescoring phase. This can be achieved by using the following jarvis-build command:

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--decoding_language_model_arpa=<decoding_lm_arpa_filename> \
--rescoring_language_model_arpa=<rescoring_lm_arpa_filename> \
--gpu_decoder

where:

- <decoding_lm_arpa_filename> is the language model .arpa file to use during the WFST decoding phase.
- <rescoring_lm_arpa_filename> is the language model .arpa file to use during the lattice rescoring phase.

Typically, one would use a small language model for the WFST decoding phase (for example, a pruned 2-gram or 3-gram language model) and a larger language model for the lattice rescoring phase (for example, an unpruned 4-gram language model).

For advanced users, it is also possible to configure the GPU decoder by specifying the decoding WFST file and the vocabulary directly, instead of using an .arpa file. For example:

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--decoding_language_model_fst=<decoding_lm_fst_filename> \
--decoding_language_model_words=<decoding_lm_words_filename> \
--gpu_decoder

Furthermore, you can specify the .carpa files to use in the case where lattice rescoring is needed:

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--decoding_language_model_fst=<decoding_lm_fst_filename> \
--decoding_language_model_carpa=<decoding_lm_carpa_filename> \
--decoding_language_model_words=<decoding_lm_words_filename> \
--rescoring_language_model_carpa=<rescoring_lm_carpa_filename> \
--gpu_decoder

where:

- <decoding_lm_carpa_filename> is the language model in constant ARPA (.carpa) representation to use during the WFST decoding phase.
- <rescoring_lm_carpa_filename> is the language model in constant ARPA (.carpa) representation to use during the lattice rescoring phase.

The GPU decoder hyper-parameters (default_beam, lattice_beam, word_insertion_penalty, and acoustic_scale) can be set with the jarvis-build command as follows:

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--decoding_language_model_arpa=<decoding_lm_arpa_filename> \
--lattice_beam=<lattice_beam> \
--lm_decoder_gpu.default_beam=<default_beam> \
--lm_decoder_gpu.acoustic_scale=<acoustic_scale> \
--rescorer.word_insertion_penalty=<word_insertion_penalty> \
--gpu_decoder

Beginning/End of Utterance Detection¶

Jarvis ASR uses an algorithm that detects the beginning and end of utterances. This algorithm is used to reset the ASR decoder state and to trigger a call to the punctuator model. By default, the beginning of an utterance is flagged when 20% of the frames in a 300 ms window contain non-blank characters, and the end of an utterance is flagged when 98% of the frames in an 800 ms window contain blank characters. You can tune these values for your use case with the following jarvis-build command:

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--vad.vad_start_history=300 \
--vad.vad_start_th=0.2 \
--vad.vad_stop_history=800 \
--vad.vad_stop_th=0.98
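
To build intuition for the two thresholds, the check described above can be sketched as a ratio test over a window of frames. This is only an illustration and not the Jarvis implementation: the helper name `fraction_at_least` is hypothetical, the 10-frame window is an arbitrary stand-in for a 300 ms or 800 ms history, and "at least" is an assumed reading of the threshold. Thresholds are written as integer percentages (20 for vad_start_th=0.2, 98 for vad_stop_th=0.98).

```shell
# Succeeds when at least <percent>% of the frames equal <value>.
# Usage: fraction_at_least <value> <percent> <frame>...
# Frames are passed as separate arguments: 1 = non-blank, 0 = blank.
# (Hypothetical helper for illustration only.)
fraction_at_least() {
    value=$1
    percent=$2
    shift 2
    total=$#
    count=0
    for frame in "$@"; do
        if [ "$frame" = "$value" ]; then
            count=$((count + 1))
        fi
    done
    # Integer comparison of count/total >= percent/100.
    [ "$total" -gt 0 ] && [ $((count * 100)) -ge $((percent * total)) ]
}

# Start check: 2 of 10 frames are non-blank, which meets the 20% threshold.
if fraction_at_least 1 20 0 0 0 0 1 1 0 0 0 0; then
    echo "start of utterance"
fi

# Stop check: only 9 of 10 frames are blank, which is below the 98% threshold.
if ! fraction_at_least 0 98 0 0 0 0 0 1 0 0 0 0; then
    echo "utterance still in progress"
fi
```

Lowering the start threshold makes the detector trigger on fewer non-blank frames, while a longer stop history requires a longer run of blank frames before the utterance is closed.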

Additionally, it is possible to disable the beginning/end of utterance detection with the following command:

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name>  \
--vad.vad_type=none

Note that in this case, the decoder state would only get reset once the full audio signal has been sent by the client. Similarly, the punctuator model would only get called once.

Mandarin¶

The default parameter values of the jarvis-build command give accurate transcripts for most use cases. However, for some languages, such as Mandarin, those parameter values must be tuned. When transcribing Mandarin, the recommended values are:

For streaming recognition

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--chunk_size=1.6 \
--padding_size=3.2 \
--padding_factor=4  \
--vad.vad_stop_history=1600 \
--vad.vad_start_history=200 \
--vad.vad_start_th=0.1

For offline recognition

jarvis-build speech_recognition \
/servicemaker-dev/<jmir_filename>:<encryption_key> \
/servicemaker-dev/<ejrvs_filename>:<encryption_key> \
--acoustic_model_name=<acoustic_model_name> \
--padding_size=3.2 \
--padding_factor=2 \
--vad.vad_stop_history=1600 \
--vad.vad_start_history=200 \
--vad.vad_start_th=0.1

NLP¶

TODO

TTS¶

TODO

Jarvis-Deploy¶

The jarvis-deploy tool takes as input one or more Jarvis Model Intermediate Representation (JMIR) files and a target model repository directory. It is responsible for performing two functions:

Function 1: Generates a Triton ensemble and the Triton configuration files, and writes them to the target model repository directory as necessary.

Function 2: Runs any commands specified by the JMIR to perform the model conversion for TensorRT, and updates the configuration that maps GPU compute capability to artifact.

The Jarvis model repository can be generated from the Jarvis .jmir file(s) with the following command:

jarvis-deploy /servicemaker-dev/<jmir_filename>:<encryption_key> /data/models
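
The build and deploy phases can be chained in a small script so that the deploy step only runs when the build succeeds. The following is a sketch under stated assumptions: the file names, key, model name, `RUN` variable, and `build_and_deploy` function are hypothetical, and the script is meant to run inside the ServiceMaker container on the machine where the Jarvis server will be deployed. By default it only prints the commands.

```shell
# Hypothetical file names, key, and model repository path.
JMIR=/servicemaker-dev/asr.jmir
EJRVS=/servicemaker-dev/asr.ejrvs
KEY=my_key
MODEL_REPO=/data/models

# RUN defaults to "echo", so the commands are only printed (dry run).
# Set RUN to an empty string to execute them inside the container.
RUN=${RUN-echo}

build_and_deploy() {
    # Phase 1: gather the artifacts into a JMIR file.
    $RUN jarvis-build speech_recognition "${JMIR}:${KEY}" "${EJRVS}:${KEY}" \
        --acoustic_model_name=my_acoustic_model || return 1

    # Phase 2: reached only if the build step succeeded.
    $RUN jarvis-deploy "${JMIR}:${KEY}" "$MODEL_REPO"
}

build_and_deploy
```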

Direct NeMo to Jarvis ServiceMaker (no TLT):¶

Generate ONNX from your .nemo file using the convasr_to_enemo.py script.

python convasr_to_enemo.py --nemo_file=/NeMo/QuartzNet15x5Base-En.nemo --onnx_file=output/quartz.onnx  --enemo_file=/NeMo/quartznet_asr.enemo

Then follow the Jarvis-Build section, passing quartznet_asr.enemo in place of the .ejrvs file during the build phase.