# Overview

Riva handles deployments of full pipelines, which can be composed of one or more NVIDIA TAO Toolkit models and other pre-/post-processing components. Additionally, the TAO Toolkit models have to be exported to an efficient inference engine and optimized for the target platform. Therefore, the Riva server cannot use NVIDIA NeMo or TAO models directly because they represent only a single model.

The process of gathering all the required artifacts (for example, models, files, configurations, and user settings) and generating the inference engines is referred to as Riva model repository generation. The Riva ServiceMaker Docker image has all the tools necessary to generate the Riva model repository and can be pulled from NGC as follows:

Data center

docker pull nvcr.io/nvidia/riva/riva-speech:2.1.0-servicemaker


Embedded

docker pull nvcr.io/nvidia/riva/riva-speech:2.1.0-arm64-servicemaker


The Riva model repository generation is done in three phases:

Phase 1: The development phase. To create a model in Riva, the model checkpoints must be converted to .riva format. You can further develop these .riva models using TAO Toolkit or NeMo. For more information, refer to the Model Development with TAO Toolkit and Model Development with NeMo sections.

Note

For embedded, Phase 1 must be performed on the Linux x86_64 workstation itself, and not on the NVIDIA Jetson platform. After the .riva files are generated, Phase 2 and Phase 3 must be performed on the Jetson platform.

Phase 2: The build phase. During the build phase, all the necessary artifacts (models, files, configurations, and user settings) required to deploy a Riva service are gathered together into an intermediate file called RMIR (Riva Model Intermediate Representation). For more information, refer to the Riva Build section.

Phase 3: The deploy phase. During the deploy phase, the RMIR file is converted into the Riva model repository and the neural networks in TAO Toolkit or NeMo format are exported and optimized to run on the target platform. The deploy phase should be executed on the physical cluster on which the Riva server is deployed. For more information, refer to the Riva Deploy section.

Many use cases require training new models or fine-tuning existing ones with new data. In these cases, there are a few best practices to follow. Many of these best practices also apply to inputs at inference time.

• Use lossless audio formats if possible. The use of lossy codecs such as MP3 can reduce quality.

• Augment training data. Adding background noise to audio training data can initially decrease accuracy but increases robustness.

• Limit vocabulary size if using scraped text. Many online sources contain typos or ancillary pronouns and uncommon words. Removing these can improve the language model.

• Use a minimum sampling rate of 16 kHz if possible, but do not resample audio recorded at lower rates.

• If using NeMo to fine-tune ASR models, consult this tutorial. We recommend fine-tuning ASR models only with sufficient data, approximately on the order of several hundred hours of speech. If such data is not available, it may be more useful to simply adapt the LM on an in-domain text corpus rather than to train the ASR model.

• There is no guarantee that the ASR model will or will not be streamable after training. We observe that with more training (thousands of hours of speech, 100-200 epochs), models generally obtain better offline scores, and online scores do not degrade as severely (but still degrade to some extent due to differences between online and offline evaluation).
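
As a quick sanity check for the sampling-rate guideline above, the following sketch reads a WAV header and flags files below 16 kHz. It uses only Python's standard library; `check_sample_rate` is an illustrative helper, not part of Riva or TAO Toolkit.

```python
# Sketch: verify a WAV file meets the recommended 16 kHz minimum sample rate.
# check_sample_rate is a hypothetical helper, not a Riva API.
import wave

def check_sample_rate(path, minimum_hz=16000):
    """Return (rate, ok) for a WAV file; ok is True if rate >= minimum_hz."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    return rate, rate >= minimum_hz
```

Running this over a training corpus before fine-tuning makes it easy to spot files that would need to be excluded rather than resampled.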

## Model Development with TAO Toolkit

Models trained with NVIDIA TAO Toolkit typically have the .tao format. To use these models in Riva, convert the model checkpoints to .riva format for building and deploying with Riva ServiceMaker using the TAO export task.

### Exporting Models

1. Follow the TAO Toolkit Launcher Quick Start Guide instructions to set up.

2. Configure the TAO Toolkit launcher. The TAO Toolkit launcher uses Docker containers for training and export tasks. The launcher instance can be configured in the ~/.tao_mounts.json file. Configuration requires mounting at least three separate directories where data, specification files, and results are stored. A sample is provided below.

{
    "Mounts": [
        {
            "source": "~/tao/data",
            "destination": "/data"
        },
        {
            "source": "~/tao/specs",
            "destination": "/specs"
        },
        {
            "source": "~/tao/results",
            "destination": "/results"
        },
        {
            "source": "~/.cache",
            "destination": "/root/.cache"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}
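
A quick way to catch typos in this file before launching TAO jobs is to parse it and confirm the expected mount destinations are present. The sketch below uses only the standard library; `validate_tao_mounts` is an illustrative helper, not part of the TAO Toolkit launcher.

```python
# Sketch: validate a ~/.tao_mounts.json file before running the TAO launcher.
# validate_tao_mounts is a hypothetical helper, not part of TAO Toolkit.
import json

REQUIRED_DESTINATIONS = {"/data", "/specs", "/results"}

def validate_tao_mounts(path):
    """Return the configured destinations; raise ValueError if any required one is missing."""
    with open(path) as f:
        config = json.load(f)
    destinations = {m["destination"] for m in config.get("Mounts", [])}
    missing = REQUIRED_DESTINATIONS - destinations
    if missing:
        raise ValueError(f"missing mount destinations: {sorted(missing)}")
    return destinations
```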

3. Convert the TAO Toolkit checkpoints to the Riva format using tao … export. The example below demonstrates exporting a Citrinet model trained in TAO, where:

• -m is used to specify the Citrinet model checkpoints location

• -e is used to specify the path to an experiment spec file

• -r indicates where the experiment results (logs, output, model checkpoints, etc.) are stored

tao speech_to_text export -m /data/asr/citrinet.tao -e /specs/asr/speech_to_text/export.yaml -r /results/asr/speech_to_text/


Here is an example experiment spec file (export.yaml):

# Path and name of the input .nemo/.tao archive to be loaded/exported.
restore_from: /data/asr/citrinet.tao

# Name of output file (will land in the folder pointed by -r)
export_to: citrinet.riva


Note that TAO Toolkit comes with default experiment spec files that can be pulled by calling:

tao speech_to_text download_specs -o /specs/asr/speech_to_text/ -r /results/asr/speech_to_text/download_specs/


Besides speech_to_text from the ASR domain, TAO Toolkit also supports several conversational AI tasks from the NLP domain:

• intent_slot_classification

• punctuation_and_capitalization

• question_answering

• text_classification

• token_classification

More details can be found in tao --help.

## Model Development with NeMo

NeMo is an open source PyTorch-based toolkit for research in conversational AI. While TAO Toolkit is the recommended path for typical users of Riva, some developers may prefer to use NeMo because it exposes more of the model and PyTorch internals. Riva supports the ability to import models trained in NeMo.

### Export Models with NeMo2Riva

Models trained in NVIDIA NeMo have the format .nemo. To use these models in Riva, convert the model checkpoints to .riva format for building and deploying with Riva ServiceMaker using the nemo2riva tool. The nemo2riva tool is currently packaged and available via the Riva Quick Start scripts.

1. Follow the NeMo installation instructions to set up a NeMo environment (version 1.1.0 or greater). From within your NeMo environment:

pip3 install nvidia-pyindex
pip3 install nemo2riva-2.1.0-py3-none-any.whl
nemo2riva --out /NeMo/<MODEL_NAME>.riva /NeMo/<MODEL_NAME>.nemo

2. To export the HiFi-GAN model from NeMo to Riva format, run the following command after configuring the NeMo environment:

nemo2riva --out /NeMo/hifi.riva /NeMo/tts_hifigan.nemo


For additional information and usage, run:

nemo2riva --help

Usage:


nemo2riva [-h] [--out OUT] [--validate] [--schema SCHEMA] [--format FORMAT] [--verbose VERBOSE] [--key KEY] source

When converting a NeMo model to the Riva EFF input format, pass the input .nemo file as the source argument; a .riva file is created.

If no --format is passed, the Riva-preferred format for the supplied model architecture is selected automatically.

The format can also be derived from a schema if the --schema argument is supplied, or if nemo2riva is able to find the schema for this NeMo model among known models; a set of YAML files is included in the nemo2riva/validation_schemas directory, and you can add your own.

If the --key argument is passed, the model graph in the output EFF file is encrypted with that key.

positional arguments:
  source             Source .nemo file

optional arguments:
  -h, --help         Show this help message and exit
  --out OUT          Location to write the resulting Riva EFF input to (default: None)
  --validate         Validate using schemas (default: False)
  --schema SCHEMA    Schema file to use for validation (default: None)
  --format FORMAT    Force a specific export format: ONNX|TS|CKPT (default: None)
  --verbose VERBOSE  Verbose level for logging, numeric (default: None)
  --key KEY          Encryption key or file (default: None)



## Riva Build

The riva-build tool is responsible for deployment preparation. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. This tool can take multiple different types of models as inputs. Currently, the following pipelines are supported:

• speech_recognition (for ASR)

• speech_synthesis (for TTS)

• qa (for question answering)

• token_classification (for token level classification, for example, Named Entity Recognition)

• intent_slot (for joint intent and slot classification)

• text_classification

• punctuation

1. Launch an interactive session inside the Riva ServiceMaker image.

Data center

docker run --gpus all -it --rm \
    -v <artifact_dir>:/servicemaker-dev \
    -v <riva_repo_dir>:/data \
    --entrypoint="/bin/bash" \
    nvcr.io/nvidia/riva/riva-speech:2.1.0-servicemaker


Embedded

docker run --gpus all -it --rm \
    -v <artifact_dir>:/servicemaker-dev \
    -v <riva_repo_dir>:/data \
    --entrypoint="/bin/bash" \
    nvcr.io/nvidia/riva/riva-speech:2.1.0-arm64-servicemaker


where:

• <artifact_dir> is the folder or Docker volume that contains the .riva file and other artifacts required to prepare the Riva model repository.

• <riva_repo_dir> is the folder or Docker volume where the Riva model repository is generated.

2. Run the riva-build command from within the container.

riva-build <pipeline> \
/servicemaker-dev/<rmir_filename>:<encryption_key> \
/servicemaker-dev/<riva_filename>:<encryption_key> \
<optional_args>


where:

• <pipeline> must be one of the following:

• speech_recognition

• speech_synthesis

• qa

• token_classification

• intent_slot

• text_classification

• punctuation

• <rmir_filename> is the name of the RMIR file that is generated.

• <riva_filename> is the name of the riva file(s) to use as input.

• <optional_args> are optional arguments to configure the Riva service. The following section discusses the different ways the ASR, NLP, and TTS services can be configured.

• <encryption_key> is optional. If the .riva file is generated without an encryption key, the input/output files are specified with <riva_filename> instead of <riva_filename>:<encryption_key>.

By default, if a file named <rmir_filename> already exists, it will not be overwritten. To force the <rmir_filename> to be overwritten, use the -f or --force argument. For example, riva-build <pipeline> -f ...

For details about the optional parameters that can be passed to riva-build to customize the Riva pipeline, run:

riva-build <pipeline> -h
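
When scripting repository generation, it can help to assemble the riva-build invocation programmatically, handling the optional encryption-key suffix and the --force flag described above. The sketch below is a hypothetical helper (`build_command` is not part of Riva); the resulting command must still be run inside the Riva ServiceMaker container.

```python
# Sketch: compose a riva-build invocation. build_command is a hypothetical
# helper for assembling the argv list; riva-build itself runs inside the
# Riva ServiceMaker container.
def build_command(pipeline, rmir, riva, key=None, force=False, extra=()):
    """Return the riva-build argv for the given pipeline and artifacts."""
    suffix = f":{key}" if key else ""  # encryption key is optional
    cmd = ["riva-build", pipeline]
    if force:
        cmd.append("--force")  # overwrite an existing RMIR file
    cmd += [f"{rmir}{suffix}", f"{riva}{suffix}"]
    cmd += list(extra)
    return cmd
```

For example, building an ASR pipeline with an encrypted .riva input and forced overwrite yields the same shape of command shown in step 2 above.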


## Riva Deploy

The riva-deploy tool takes as input one or more Riva Model Intermediate Representation (RMIR) files and a target model repository directory. It is responsible for performing the following functions:

• Model optimization: optimize the frozen checkpoints for inference on the target GPU.

• Configuration generation: generate configuration files for the backend components, including ensembles of models.

The Riva model repository can be generated from the Riva .rmir file(s) with the following command:

riva-deploy /servicemaker-dev/<rmir_filename>:<encryption_key> /data/models


By default, if the destination folder (i.e. /data/models/ in the above example) already exists, it will not be overwritten. To force the destination folder to be overwritten, use the -f or --force parameter. For example, riva-deploy -f ...
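
The riva-deploy invocation follows the same conventions as riva-build: an optional :<encryption_key> suffix on the RMIR argument and a -f flag to overwrite an existing destination. A minimal sketch of assembling that argv (`deploy_command` is an illustrative helper, not part of the Riva tooling):

```python
# Sketch: compose a riva-deploy invocation. deploy_command is a hypothetical
# helper; the ":<key>" suffix is appended only when the .rmir is encrypted.
def deploy_command(rmir_path, model_repo="/data/models", key=None, force=False):
    """Return the riva-deploy argv for one RMIR file and a target repository."""
    arg = f"{rmir_path}:{key}" if key else rmir_path
    cmd = ["riva-deploy"]
    if force:
        cmd.append("-f")  # overwrite an existing model repository
    cmd += [arg, model_repo]
    return cmd
```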

## Deploying Your Custom Model into Riva

This section provides a brief overview of the two main tools used in the deployment process:

1. The build phase using riva-build.

2. The deploy phase using riva-deploy.

## Build Process

For your custom trained model, refer to the riva-build phase (ASR, NLP, TTS) for your model type. At the end of this phase, you’ll have the Riva Model Intermediate Representation (RMIR) archive for your custom model.

## Deploy Process

At this point, you already have your RMIR archive. Now, you have two options for deploying this RMIR.

Option 1: Use the Quick Start scripts (riva_init.sh and riva_start.sh) with the appropriate parameters in config.sh.

Option 2: Manually run riva-deploy and then start riva-server with the target model repository.

### Using riva-deploy and Riva Speech Container (Advanced)

1. Execute riva-deploy. Refer to the Riva Deploy section for a brief overview of riva-deploy.

riva-deploy -f <rmir_filename>:<encryption_key> /data/models


If your .rmir archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename; otherwise, it is unnecessary.

The above command creates the Riva model repository at /data/models. Writing to a location other than /data/models requires manually updating the artifact paths embedded in the configuration files of model repositories that contain model-specific artifacts, such as class labels. Therefore, stick with /data/models unless you are familiar with Triton Inference Server model repository configurations.

2. Manually start the riva-server Docker container using docker run.

After the Riva model repository for your custom model is generated, start the Riva server on that target repository. The following command assumes you generated the model repository at /data/models.

docker run -d --gpus 1 --init --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /data:/data \
    -p 50051:50051 \
    -e "CUDA_VISIBLE_DEVICES=0" \
    --name riva-speech \
    nvcr.io/nvidia/riva/riva-speech:2.1.0-server \
    start-riva --riva-uri=0.0.0.0:50051 --nlp_service=true --asr_service=true --tts_service=true


This command launches the Riva Speech Service API server similar to the Quick Start script riva_start.sh.

Example output:

Starting Riva Speech Services
> Waiting for Riva server to load all models...retrying in 10 seconds
> Waiting for Riva server to load all models...retrying in 10 seconds
> Waiting for Riva server to load all models...retrying in 10 seconds
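
A retry loop like the one above can be scripted by polling the gRPC port until it accepts connections. The sketch below uses only Python's standard library; `wait_for_port` is an illustrative helper, not a Riva utility, and a TCP connection only confirms the port is open, not that all models have finished loading.

```python
# Sketch: wait until the Riva server's gRPC port accepts TCP connections,
# mirroring the "Waiting for Riva server..." loop above. wait_for_port is
# a hypothetical helper, not part of the Riva tooling.
import socket
import time

def wait_for_port(host, port, timeout_s=300.0, interval_s=10.0):
    """Return True once host:port accepts a connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5.0):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```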

3. Verify that the servers have started correctly by checking that the output of docker logs riva-speech shows:

I0428 03:14:50.440955 1 riva_server.cc:71] Riva Conversational AI Server listening on 0.0.0.0:50051