How to deploy a NeMo-finetuned NMT model on Riva Speech Skills server?#

This tutorial walks you through deploying a NeMo-finetuned NMT model on the Riva Speech Skills server.

NVIDIA Riva Overview#

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

  • Automated speech recognition (ASR)

  • Text-to-Speech synthesis (TTS)

  • Neural Machine Translation (NMT)

  • A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will deploy a NeMo-finetuned NMT model on the Riva Speech Skills server.
Refer to the “How to fine-tune a Riva NMT Bilingual model with Nvidia NeMo” tutorial in Riva NMT Tutorials to learn how to fine-tune a Riva NMT model.

For more information about Riva, refer to the Riva developer documentation.
For more information about Riva NMT, refer to the Riva NMT documentation.

NVIDIA NeMo Overview#

NVIDIA NeMo is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures.

For more information about NeMo, refer to the NeMo product page and documentation. The open-source NeMo repository can be found here.


Before we get started, ensure you have:

  • access to NVIDIA NGC and the ability to download the Riva Quick Start resources.

  • a .riva model file that you want to deploy. You can generate the .riva model file from a .nemo file with the nemo2riva tool, as explained in the “How to fine-tune a Riva NMT Bilingual model with Nvidia NeMo” tutorial in Riva NMT Tutorials.

Riva ServiceMaker#

Riva ServiceMaker is a set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva deployment to a target environment. It has two main components, riva-build and riva-deploy, which are used in turn below.

The riva-build step produces a Riva-ready version of the model. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. Let’s consider a Riva NMT model.

riva-build is responsible for the combination of one or more exported models (.riva files) into a single file containing an intermediate format called Riva Model Intermediate Representation (.rmir). This file contains a deployment-agnostic specification of the whole end-to-end pipeline along with all the assets required for the final deployment and inference. For more information, refer to the documentation.

from version import __riva_version__
import os

# ServiceMaker Docker
RIVA_SM_CONTAINER = "<add container name>"
# Example: 
# RIVA_SM_CONTAINER = f"{__riva_version__}-servicemaker"

# Directory where the .riva model is stored
MODEL_LOC = "<add path to model location>"
# Example:
# import os
# MODEL_LOC = os.getcwd() + "/NMTFinetuning/model"

# Name of the .riva file
MODEL_NAME = "<add model name>"
# Example:
# MODEL_NAME = "en_es_24x6.riva"

# Key that model is encrypted with, while exporting with TAO
KEY = "<add encryption key used for trained model>"
# Example:
# KEY = "tlt_encode"

# Get the ServiceMaker docker
! docker pull $RIVA_SM_CONTAINER

# Call riva-build command from the Riva Service Maker container.
# Example (replace `key` after each colon with the encryption key stored in KEY):
! docker run --rm --gpus 0 -v $MODEL_LOC:/data $RIVA_SM_CONTAINER bash -c \
        "riva-build translation --name en_es \
            /data/en_es_24x6.rmir:key /data/en_es_24x6.riva:key"
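The `file:key` pairing in the riva-build call is easy to mistype. As a minimal sketch, the helper below assembles the same `riva-build translation` arguments used in the example from a model file name and an encryption key. `riva_build_command` is a hypothetical name and not part of Riva; the sketch assumes the model directory is mounted at /data, as in the docker command above, and that the RMIR file should share the .riva file's base name.

```python
from pathlib import PurePosixPath


def riva_build_command(model_name, key, pipeline_name, data_dir="/data"):
    """Assemble the riva-build arguments used in the example above.

    Hypothetical helper for illustration: it only formats strings and
    does not call Docker or Riva.
    """
    riva_path = PurePosixPath(data_dir) / model_name
    rmir_path = riva_path.with_suffix(".rmir")
    return [
        "riva-build", "translation",
        "--name", pipeline_name,
        f"{rmir_path}:{key}",   # output RMIR file and its key
        f"{riva_path}:{key}",   # input .riva file and its key
    ]


command = riva_build_command("en_es_24x6.riva", "tlt_encode", "en_es")
print(" ".join(command))
```

The printed string can be pasted into the `bash -c "..."` part of the docker command, which keeps the file names and key consistent between the two arguments.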


The deployment tool, riva-deploy, takes as input one or more Riva Model Intermediate Representation (RMIR) files and a target model repository directory. It creates an ensemble configuration specifying the pipeline for the execution and finally writes all those assets to the output model repository directory.

# Call riva-deploy command from the Riva Service Maker container.
#! docker run --rm --gpus 0 -v $MODEL_LOC:/data $RIVA_SM_CONTAINER bash -c \
#        "riva-deploy -f <rmir_filename>:$KEY <riva_model_target_repository>"
# Example (replace `key` with the same encryption key that was passed to riva-build):
! docker run --rm --gpus 0 -v $MODEL_LOC:/data $RIVA_SM_CONTAINER bash -c \
        "riva-deploy -f /data/en_es_24x6.rmir:key /data/models && chmod -R 777 /data"

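After riva-deploy finishes, it can help to confirm that the model repository was actually populated. The sketch below assumes the output directory follows the usual Triton Inference Server layout, in which each model has its own subdirectory containing a config.pbtxt file. `list_deployed_models` is a hypothetical helper, not a Riva tool, and the model names in the demonstration are made up.

```python
import os
import tempfile


def list_deployed_models(repo_dir):
    """Return names of subdirectories of repo_dir that contain a config.pbtxt."""
    if not os.path.isdir(repo_dir):
        return []
    return sorted(
        name for name in os.listdir(repo_dir)
        if os.path.isfile(os.path.join(repo_dir, name, "config.pbtxt"))
    )


# Demonstration on a throwaway directory with made-up model names.
with tempfile.TemporaryDirectory() as repo:
    for name in ("model_a", "model_b"):
        os.makedirs(os.path.join(repo, name))
        open(os.path.join(repo, name, "config.pbtxt"), "w").close()
    os.makedirs(os.path.join(repo, "not_a_model"))  # no config.pbtxt inside
    found = list_deployed_models(repo)

print(found)
```

In the notebook, the same function could be pointed at `os.path.join(MODEL_LOC, "models")`; an empty list would suggest that the deploy step did not write anything.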
Start the Riva Server#

After the model repository is generated, we are ready to start the Riva server. First, download the Riva Quick Start resource from NGC.

!ngc registry resource download-version "nvidia/riva/riva_quickstart:$__riva_version__"

Next, we set the path to the Riva Quick Start Guide directory here:

RIVA_QSG_DIR = "<add path to quickstart location>"
# Example:
# RIVA_QSG_DIR = f"riva_quickstart_v{__riva_version__}"

Next, we modify the Quick Start configuration file to enable the Riva NMT service (by setting service_enabled_nmt to true), provide the encryption key (MODEL_DEPLOY_KEY), and set the path to the model repository generated in the previous step (riva_model_loc), among other configurations.

For example, if the model repository above was generated at $MODEL_LOC/models, set riva_model_loc to the same directory as MODEL_LOC.

#### snippet -> DO NOT RUN THIS BLOCK

# Enable or Disable Riva Services 
service_enabled_asr=false                                                     ## MAKE CHANGES HERE
service_enabled_nlp=false                                                     ## MAKE CHANGES HERE
service_enabled_tts=false                                                     ## MAKE CHANGES HERE
service_enabled_nmt=true                                                     ## MAKE CHANGES HERE

# Enable Riva Enterprise
# If enrolled in Enterprise, enable Riva Enterprise by setting configuration
# here. You must explicitly acknowledge you have read and agree to the EULA.
# RIVA_API_KEY=<ngc api key>
# RIVA_API_NGC_ORG=<ngc organization>
# RIVA_EULA=accept

# Language code to fetch models of a specific language
# Currently only ASR supports languages other than English
# Supported language codes: en-US, de-DE, es-US, ru-RU, zh-CN
# for any language other than English, set service_enabled_nlp and service_enabled_tts to False
# for multiple languages enter space separated language codes.

# Specify one or more GPUs to use
# specifying more than one GPU is currently an experimental feature, and may result in undefined behaviours.

# Specify the encryption key to use to deploy models
MODEL_DEPLOY_KEY="tlt_encode"                                                     ## MAKE CHANGES HERE

# Locations to use for storing models artifacts
# If an absolute path is specified, the data will be written to that location
# Otherwise, a docker volume will be used (default).
# will create a `rmir` and `models` directory in the volume or
# path specified.
# RMIR ($riva_model_loc/rmir)
# Riva uses an intermediate representation (RMIR) for models
# that are ready to deploy but not yet fully optimized for deployment. Pretrained
# versions can be obtained from NGC (by specifying NGC models below) and will be
# downloaded to $riva_model_loc/rmir by ``
# Custom models produced by NeMo or TLT and prepared using riva-build
# may also be copied manually to this location ($riva_model_loc/rmir).
# Models ($riva_model_loc/models)
# During the riva_init process, the RMIR files in $riva_model_loc/rmir
# are inspected and optimized for deployment. The optimized versions are
# stored in $riva_model_loc/models. The riva server exclusively uses these
# optimized versions.
riva_model_loc="riva-model-repo"                           ## MAKE CHANGES HERE (Replace with MODEL_LOC)                      
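The edits marked MAKE CHANGES HERE can be made by hand in a text editor. As an alternative, the following is a minimal sketch of applying them programmatically to the text of a shell-style configuration file. `set_config_values` is a hypothetical helper, and the sample text and replacement path are illustrative only; the sketch assumes each setting sits on its own line as `name=value` with no spaces inside the value.

```python
import re


def set_config_values(config_text, updates):
    """Replace `name=value` assignments named in `updates`, keeping trailing comments.

    Commented-out lines (starting with '#') are left untouched.
    """
    lines = []
    for line in config_text.splitlines():
        match = re.match(r"^(\w+)=(\S*)(.*)$", line)
        if match and match.group(1) in updates:
            name, _old_value, rest = match.groups()
            line = f"{name}={updates[name]}{rest}"
        lines.append(line)
    trailing_newline = "\n" if config_text.endswith("\n") else ""
    return "\n".join(lines) + trailing_newline


sample = (
    "service_enabled_nmt=false    ## MAKE CHANGES HERE\n"
    "# RIVA_EULA=accept\n"
    'riva_model_loc="riva-model-repo"\n'
)
updated = set_config_values(sample, {
    "service_enabled_nmt": "true",
    "riva_model_loc": '"/path/to/model/dir"',  # placeholder path
})
print(updated)
```

To use this on a real file, read its contents into a string, pass the string through the function, and write the result back after keeping a copy of the original.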

# Ensure you have permission to execute these scripts
! cd $RIVA_QSG_DIR && chmod +x ./ && chmod +x ./
# Run Riva Init. This will fetch the containers/models
! cd $RIVA_QSG_DIR && ./
# Run Riva Start. This will deploy your model(s).
! cd $RIVA_QSG_DIR && ./
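The server can take a while to load models after the start script returns, so it may be worth waiting until its gRPC port accepts connections before sending requests. The sketch below polls a TCP port with the standard library. `wait_for_port` is a hypothetical helper and the timeout values are arbitrary; an open port shows only that something is listening, not that every model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connection to host:port succeeds, False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Self-contained demonstration against a local listening socket.
with socket.socket() as listener:
    listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
    listener.listen(1)
    demo_port = listener.getsockname()[1]
    reachable = wait_for_port("127.0.0.1", demo_port, timeout_s=2.0)

print(reachable)
```

For the Riva server used in this tutorial, the call would be `wait_for_port("localhost", 50051)`, matching the address the client connects to in the inference section.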

Run Inference#

Once the Riva server is up and running with your models, you can send inference requests to it.

To send gRPC requests, install the Riva Python API bindings for the client. They are available as a pip .whl file with the Riva Quick Start, and as the nvidia-riva-client package that the cell below installs with pip.

To understand the basics of Riva NMT APIs, refer to the “How do I perform Language Translation using Riva NMT APIs with out-of-the-box models?” tutorial in Riva NMT Tutorials. We are going to use a simple code snippet from this tutorial to run inference on the Riva server.

# Install the Client API Bindings
! pip install nvidia-riva-client

Connect to the Riva Server and Run Inference#

Now we can query the Riva server. The following cell sends a translation request to the server over gRPC and prints the result.

import riva.client

auth = riva.client.Auth(uri='localhost:50051')
riva_nmt_client = riva.client.NeuralMachineTranslationClient(auth)

eng_text = "Molecular Biology is the field of biology that studies the composition, structure and interactions of cellular molecules – such as nucleic acids and proteins – that carry out the biological processes essential for the cell's functions and maintenance."

response = riva_nmt_client.translate([eng_text], 'en_es', 'en', 'es')
print("English Text: ", eng_text)
print("Translated Spanish Text: ", response.translations[0].text) # Fetch the translated text from the 1st entry of response.translations
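The `translate` call above takes a list of texts, so several sentences can be sent in one request. As a sketch of handling a longer list, the helper below splits the inputs into fixed-size batches and collects the translated strings. `translate_in_batches` is a hypothetical name and the batch size of 8 is arbitrary. The sketch assumes that `response.translations` holds one entry per input text, in input order. A stand-in function replaces the Riva client so the block runs without a server.

```python
from types import SimpleNamespace


def translate_in_batches(translate_fn, texts, batch_size=8):
    """Call translate_fn on consecutive slices of texts and return all translated strings."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = translate_fn(batch)
        results.extend(entry.text for entry in response.translations)
    return results


# Stand-in for the Riva client: "translates" by upper-casing each text.
def fake_translate(batch):
    return SimpleNamespace(
        translations=[SimpleNamespace(text=text.upper()) for text in batch]
    )


outputs = translate_in_batches(fake_translate, ["a", "b", "c"], batch_size=2)
print(outputs)
```

With the client created above, the stand-in could be replaced by `lambda batch: riva_nmt_client.translate(batch, 'en_es', 'en', 'es')`.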

You can stop all Docker containers before shutting down the Jupyter kernel. Caution: The following command will stop all running containers.

! docker stop $(docker ps -a -q)