Overview#

Riva handles deployments of full pipelines, which can be composed of one or more supported NVIDIA NeMo models and other pre-/post-processing components. These pipelines have to be exported to an efficient inference engine and optimized for the target platform. Therefore, the Riva server cannot use NVIDIA NeMo models directly, because each represents only a single model rather than a full pipeline.

The process of gathering all the required artifacts (for example, models, files, configurations, and user settings) and generating the inference engines is referred to as Riva model repository generation. The Riva ServiceMaker Docker image has all the tools necessary to generate the Riva model repository and can be pulled from NGC as follows:

Data center

docker pull nvcr.io/nvidia/riva/riva-speech:2.13.0-servicemaker

Embedded

docker pull nvcr.io/nvidia/riva/riva-speech:2.13.0-servicemaker-l4t-aarch64

The Riva model repository generation is done in three phases:

Phase 1: The development phase. To create a model in Riva, the model checkpoints must be converted to .riva format. You can further develop these .riva models using NeMo. For more information, refer to the Model Development with NeMo section.

Note

For embedded, Phase 1 must be performed on the Linux x86_64 workstation itself, and not on the NVIDIA Jetson platform. After the .riva files are generated, Phase 2 and Phase 3 must be performed on the Jetson platform.

Phase 2: The build phase. During the build phase, all the necessary artifacts (models, files, configurations, and user settings) required to deploy a Riva service are gathered into an intermediate file called RMIR (Riva Model Intermediate Representation). For more information, refer to the Riva Build section.

Phase 3: The deploy phase. During the deploy phase, the RMIR file is converted into the Riva model repository and the neural networks in NeMo format are exported and optimized to run on the target platform. The deploy phase should be executed on the physical cluster on which the Riva server is deployed. For more information, refer to the Riva Deploy section.
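
For embedded targets, the note above splits the phases across two machines. The following is a minimal sketch of a guard that encodes that rule, assuming `uname -m` reports `x86_64` on the workstation and `aarch64` on the Jetson platform. The function name `embedded_phase_ok` is hypothetical and is not part of Riva.

```shell
# Hypothetical helper (not part of Riva): succeed if the given phase
# (1, 2, or 3) of an embedded deployment is meant to run on this machine.
# The second argument overrides the detected architecture, which is
# convenient for checking the logic without switching machines.
embedded_phase_ok() {
    phase=$1
    arch=${2:-$(uname -m)}
    case "$phase:$arch" in
        1:x86_64) return 0 ;;             # development phase: x86_64 workstation
        2:aarch64|3:aarch64) return 0 ;;  # build and deploy phases: Jetson
        *) return 1 ;;
    esac
}
```

For example, `embedded_phase_ok 2 || echo "run the build phase on the Jetson device"` prints the reminder when invoked on an x86_64 workstation.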

Riva ServiceMaker workflow

Using Your Own Data#

Many use cases require training new models or fine-tuning existing ones with new data. In these cases, there are a few guidelines to follow. Many of these guidelines also apply to inputs at inference time.

  • Use lossless audio formats if possible. The use of lossy codecs such as MP3 can reduce quality.

  • Augment training data. Adding background noise to audio training data can initially decrease accuracy, but increase robustness.

  • Limit vocabulary size if using scraped text. Many online sources contain typos or ancillary pronouns and uncommon words. Removing these can improve the language model.

  • Use a sampling rate of at least 16 kHz if possible, but do not resample.

  • If using NeMo to fine-tune ASR models, consult this tutorial. We recommend fine-tuning ASR models only with sufficient data, on the order of several hundred hours of speech. If such data is not available, it may be more useful to adapt the language model (LM) on an in-domain text corpus than to train the ASR model.

  • There is no guarantee that the ASR model will or will not be stream-able after training. We see that with more training (thousands of hours of speech, 100-200 epochs), models generally obtain better offline scores and online scores do not degrade as severely (but still degrade to some extent due to differences between online and offline evaluation).
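
The sampling-rate guideline above can be checked before training. The sketch below reads the rate straight from a WAV header using only `od`. It assumes a canonical PCM WAV file in which the `fmt ` chunk directly follows the 12-byte RIFF header, so the 32-bit little-endian sample rate sits at byte offset 24. Files with extra chunks before `fmt ` need a real audio tool instead. The function names are hypothetical.

```shell
# Print the sample rate (Hz) stored in a canonical PCM WAV header.
# Assumption: the sample rate is the 32-bit little-endian value at offset 24.
wav_sample_rate() {
    # od prints the four bytes as unsigned decimals, for example "128 62 0 0".
    # The unquoted substitution is intentional: it splits them into $1..$4.
    set -- $(od -An -tu1 -j24 -N4 "$1")
    echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}

# Succeed only if the file meets the 16 kHz minimum recommended above.
meets_min_rate() {
    [ "$(wav_sample_rate "$1")" -ge 16000 ]
}
```

A loop such as `for f in data/*.wav; do meets_min_rate "$f" || echo "below 16 kHz: $f"; done` then lists the recordings that fall short, where `data/` is a placeholder for your own directory.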

Model Development with NeMo#

NeMo is an open source PyTorch-based toolkit for research in conversational AI that exposes more of the model and PyTorch internals. Riva supports the ability to import supported models trained in NeMo.

For more information, refer to the NeMo project page.

Export Models with NeMo2Riva#

Models trained in NVIDIA NeMo have the format .nemo. To use these models in Riva, convert the model checkpoints to .riva format for building and deploying with Riva ServiceMaker using the nemo2riva tool. The nemo2riva tool is currently packaged and available via the Riva Quick Start scripts.

  1. Follow the NeMo installation instructions to set up a NeMo environment (version 1.1.0 or greater). From within your NeMo environment, run:

    pip3 install nvidia-pyindex
    pip3 install nemo2riva-2.13.0-py3-none-any.whl
    nemo2riva --out /NeMo/<MODEL_NAME>.riva /NeMo/<MODEL_NAME>.nemo
    
  2. To export the HiFi-GAN model from NeMo to Riva format, run the following command after configuring the NeMo environment:

    nemo2riva --out /NeMo/hifi.riva /NeMo/tts_hifigan.nemo
    

For additional information and usage, run:

 nemo2riva --help
Usage:

```
nemo2riva [-h] [--out OUT] [--validate] [--schema SCHEMA] [--format FORMAT] [--verbose VERBOSE] [--key KEY] source

When converting NeMo models to Riva `.eff` input format, passing the input `.nemo` as a parameter creates `.riva`.

If no `--format` is passed, the Riva-preferred format for the supplied model architecture is selected automatically.

The format is also derived from schema if the `--schema` argument is supplied, or if `nemo2riva` is able to find the schema for this NeMo model
among known models - there is a set of YAML files in the `nemo2riva/validation_schemas` directory, or you can add your own.

If the `--key` argument is passed, the model graph in the output EFF file is encrypted with that key.

positional arguments:
  source             Source .nemo file

optional arguments:
  -h, --help         Show this help message and exit
  --out OUT          Location to write resulting Riva EFF input to (default: `None`)
  --validate         Validate using schemas (default: `False`)
  --schema SCHEMA    Schema file to use for validation (default: `None`)
  --format FORMAT    Force specific export format: `ONNX|TS|CKPT` (default: `None`)
  --verbose VERBOSE  Verbose level for logging, numeric (default: `None`)
  --key KEY          Encryption key or file (default: `None`)
```
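
When several checkpoints need converting, the `--out` and `--key` options from the usage text can be wrapped in a small loop. This is a sketch only: the function name `convert_all_nemo` and the `NEMO2RIVA` override variable are hypothetical conveniences, not part of the tool. Setting `NEMO2RIVA=echo` prints the arguments each call would receive instead of running the conversion.

```shell
# Hypothetical helper: convert every .nemo file in a directory to .riva.
# $1 = directory containing .nemo files
# $2 = optional encryption key, forwarded through --key
convert_all_nemo() {
    src_dir=$1
    key=${2:-}
    for nemo in "$src_dir"/*.nemo; do
        [ -e "$nemo" ] || continue      # no match: skip the literal pattern
        out="${nemo%.nemo}.riva"        # same path, .riva extension
        if [ -n "$key" ]; then
            "${NEMO2RIVA:-nemo2riva}" --out "$out" --key "$key" "$nemo"
        else
            "${NEMO2RIVA:-nemo2riva}" --out "$out" "$nemo"
        fi
    done
}
```

For example, `convert_all_nemo /NeMo` writes one `.riva` file next to each `.nemo` file in `/NeMo`, without encryption.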

Riva Build#

The riva-build tool is responsible for deployment preparation. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. This tool can take multiple different types of models as inputs. Currently, the following pipelines are supported:

  • speech_recognition (for ASR)

  • speech_synthesis (for TTS)

  • qa (for question answering)

  • token_classification (for token level classification, for example, Named Entity Recognition)

  • intent_slot (for joint intent and slot classification)

  • text_classification

  • punctuation

  • translation

  1. Launch an interactive session inside the Riva ServiceMaker image.

    Data center

     docker run --gpus all -it --rm \
         -v <artifact_dir>:/servicemaker-dev \
         -v <riva_repo_dir>:/data \
         --entrypoint="/bin/bash" \
         nvcr.io/nvidia/riva/riva-speech:2.13.0-servicemaker
    

    Embedded

     docker run --gpus all -it --rm \
         -v <artifact_dir>:/servicemaker-dev \
         -v <riva_repo_dir>:/data \
         --entrypoint="/bin/bash" \
         nvcr.io/nvidia/riva/riva-speech:2.13.0-servicemaker-l4t-aarch64
    

    where:

    • <artifact_dir> is the folder or Docker volume that contains the .riva file and other artifacts required to prepare the Riva model repository.

    • <riva_repo_dir> is the folder or Docker volume where the Riva model repository is generated.

  2. Run the riva-build command from within the container.

    riva-build <pipeline> \
        /servicemaker-dev/<rmir_filename>:<encryption_key> \
        /servicemaker-dev/<riva_filename>:<encryption_key> \
        <optional_args>
    

    where:

    • <pipeline> must be one of the following:

      • speech_recognition

      • speech_synthesis

      • qa

      • token_classification

      • intent_slot

      • text_classification

      • punctuation

      • translation

    • <rmir_filename> is the name of the RMIR file that is generated.

    • <riva_filename> is the name of the .riva file(s) to use as input.

    • <optional_args> are optional arguments to configure the Riva service. The following section discusses the different ways the ASR, NLP, and TTS services can be configured.

    • <encryption_key> is optional. If the .riva file is generated without an encryption key, the input/output files are specified with <riva_filename> instead of <riva_filename>:<encryption_key>.

By default, if a file named <rmir_filename> already exists, it will not be overwritten. To force the <rmir_filename> to be overwritten, use the -f or --force argument. For example, riva-build <pipeline> -f ...

For details about the optional parameters that can be passed to riva-build to customize the Riva pipeline, run:

riva-build <pipeline> -h
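
The rules above (a fixed set of pipeline names, and a `:<encryption_key>` suffix only when a key is used) can be captured in a thin wrapper. This is a minimal sketch under stated assumptions: `with_key`, `is_supported_pipeline`, `build_rmir`, and the `RIVA_BUILD` override are hypothetical names, and optional arguments are not forwarded. Setting `RIVA_BUILD=echo` gives a dry run that prints the arguments.

```shell
# Append ":<key>" only when a key is given, mirroring <file>:<encryption_key>.
with_key() {
    if [ -n "${2:-}" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# Succeed only for the pipeline names listed in this section.
is_supported_pipeline() {
    case $1 in
        speech_recognition|speech_synthesis|qa|token_classification) return 0 ;;
        intent_slot|text_classification|punctuation|translation) return 0 ;;
        *) return 1 ;;
    esac
}

# $1 = pipeline, $2 = output RMIR path, $3 = input .riva path, $4 = optional key
build_rmir() {
    pipeline=$1
    rmir=$2
    riva=$3
    key=${4:-}
    if ! is_supported_pipeline "$pipeline"; then
        echo "unsupported pipeline: $pipeline" >&2
        return 1
    fi
    "${RIVA_BUILD:-riva-build}" "$pipeline" \
        "$(with_key "$rmir" "$key")" \
        "$(with_key "$riva" "$key")"
}
```

Inside the ServiceMaker container, `build_rmir speech_recognition /servicemaker-dev/asr.rmir /servicemaker-dev/asr.riva` would build an unencrypted ASR pipeline. The `asr.*` file names are placeholders.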

Riva Deploy#

The riva-deploy tool takes as input one or more RMIR files and a target model repository directory. It is responsible for performing the following functions:

Model Optimization: Optimize the frozen checkpoints for inference on the target GPU.

Configuration Generation: Generate configuration files for backend components including ensembles of models.

The Riva model repository can be generated from the Riva .rmir file(s) with the following command:

riva-deploy /servicemaker-dev/<rmir_filename>:<encryption_key> /data/models

By default, if the destination folder (that is, /data/models/ in the above example) already exists, it will not be overwritten. To force the destination folder to be overwritten, use the -f or --force parameter. For example, riva-deploy -f ...
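
Because `-f` replaces an existing model repository, it can be useful to make the overwrite an explicit decision in scripts. The wrapper below is a sketch of that idea. `deploy_rmir` and the `RIVA_DEPLOY` override are hypothetical names. The wrapper passes `-f` only when the caller asks for it and stops early if the destination already exists.

```shell
# $1 = RMIR file (append :<encryption_key> yourself if it is encrypted)
# $2 = destination model repository directory
# $3 = pass "--force" to allow overwriting an existing destination
deploy_rmir() {
    rmir=$1
    dest=$2
    force=${3:-}
    if [ -d "$dest" ] && [ "$force" != "--force" ]; then
        echo "refusing to overwrite existing $dest (pass --force)" >&2
        return 1
    fi
    if [ "$force" = "--force" ]; then
        "${RIVA_DEPLOY:-riva-deploy}" -f "$rmir" "$dest"
    else
        "${RIVA_DEPLOY:-riva-deploy}" "$rmir" "$dest"
    fi
}
```

Setting `RIVA_DEPLOY=echo` before calling the function prints the arguments that would be passed, without running a deployment.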

Deploying Your Custom Model into Riva#

This section provides a brief overview of the two main tools used in the deployment process:

  1. The build phase using riva-build.

  2. The deploy phase using riva-deploy.

Build Process#

For your custom trained model, refer to the riva-build phase (ASR, NLP, NMT, TTS) for your model type. At the end of this phase, you will have the RMIR archive for your custom model.

Deploy Process#

At this point, you already have your RMIR archive. Now, you have two options for deploying this RMIR.

Option 1: Use the Quick Start scripts (riva_init.sh and riva_start.sh) with the appropriate parameters in config.sh.

Option 2: Manually run riva-deploy and then start riva-server with the target model repository.

Using riva-deploy and Riva Speech Container (Advanced)#

  1. Execute riva-deploy. Refer to the Riva Deploy section for a brief overview of riva-deploy.

    riva-deploy -f <rmir_filename>:<encryption_key> /data/models
    

    If your .rmir archives are encrypted, you need to include :<encryption_key> at the end of the RMIR filename. Otherwise, this is unnecessary.

    The above command creates the Riva model repository at /data/models. Writing to any location other than /data/models requires additional manual changes to the artifact directories embedded in the configuration files of the model repositories that have model-specific artifacts, such as class labels. Therefore, stick with /data/models unless you are familiar with Triton Inference Server model repository configurations.

  2. Manually start the riva-server Docker container using docker run. After the Riva model repository for your custom model is generated, start the Riva server on that target repository. The following command assumes you generated the model repository at /data/models.

    docker run -d --gpus 1 --init --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 \
            -v /data:/data             \
            -p 50051:50051                \
            -e "CUDA_VISIBLE_DEVICES=0"   \
            --name riva-speech                \
            nvcr.io/nvidia/riva/riva-speech:2.13.0 \
            start-riva  --riva-uri=0.0.0.0:50051 --nlp_service=true --asr_service=true --tts_service=true
    

    This command launches the Riva Speech Service API server similar to the Quick Start script riva_start.sh. For example:

    Starting Riva Speech Services
    > Waiting for Riva server to load all models...retrying in 10 seconds
    > Waiting for Riva server to load all models...retrying in 10 seconds
    > Waiting for Riva server to load all models...retrying in 10 seconds
    > Riva server is ready…
    
  3. Verify that the server has started correctly by checking that the output of docker logs riva-speech shows:

    I0428 03:14:50.440955 1 riva_server.cc:71] Riva Conversational AI Server listening on 0.0.0.0:50051
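
The log check in the last step can be automated by polling until the "listening" line appears. The following is a minimal sketch, assuming the container is named `riva-speech` as in the `docker run` command above. `wait_for_riva` and the `DOCKER` override variable are hypothetical names, and the default of 30 retries is an arbitrary choice.

```shell
# Poll the container logs until the server reports that it is listening.
# $1 = container name (default: riva-speech)
# $2 = maximum number of attempts (default: 30)
# $3 = seconds to sleep between attempts (default: 10)
wait_for_riva() {
    container=${1:-riva-speech}
    retries=${2:-30}
    delay=${3:-10}
    i=0
    while [ "$i" -lt "$retries" ]; do
        # 2>&1 merges both log streams so the line is found on either one.
        if ${DOCKER:-docker} logs "$container" 2>&1 |
            grep -q "Riva Conversational AI Server listening on"; then
            echo "ready"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out waiting for $container" >&2
    return 1
}
```

Calling `wait_for_riva riva-speech 60 5` waits up to about five minutes and returns a non-zero status if the line never appears, which makes it suitable as a gate before sending client requests.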