How to Deploy a Custom Language Model (n-gram) Trained with NeMo on Riva#

This tutorial walks you through the deployment of a custom language model (n-gram) trained with NVIDIA NeMo on NVIDIA Riva.

NVIDIA Riva Overview#

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance.
Riva offers a rich set of speech and natural language understanding services such as:

Automatic speech recognition (ASR).
Text-to-speech synthesis (TTS).
A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will deploy an ASR language model (n-gram) trained with NeMo on Riva.
To understand the basics of Riva ASR APIs, refer to Getting started with Riva ASR in Python.
To see how to pretrain and fine-tune an n-gram language model for ASR with NeMo, refer to this tutorial.

For more information about Riva, refer to the Riva product page and Riva developer documentation.

NeMo (Neural Modules) and `nemo2riva`#

NVIDIA NeMo is an open-source framework for building, training, and fine-tuning GPU-accelerated speech AI and natural language understanding (NLU) models with a simple Python interface. To fine-tune a Conformer-CTC acoustic model with NeMo, refer to the Conformer-CTC fine-tuning tutorial.

The nemo2riva command-line tool exports your .nemo model to a format (.riva) that can be deployed with NVIDIA Riva. A Python .whl file for nemo2riva is included in the Riva Quick Start resource folder. You can also install nemo2riva with pip, as shown in the Conformer-CTC fine-tuning tutorial.

This tutorial takes a .riva model, the result of invoking the nemo2riva CLI tool (refer to the Conformer-CTC fine-tuning tutorial), and uses the Riva ServiceMaker framework to aggregate all the necessary artifacts for Riva deployment to a target environment. Once the model is deployed in Riva, you can issue inference requests to the server. In this tutorial, you will learn how to:

Build an .rmir model pipeline from a .riva file with Riva ServiceMaker.
Deploy the model locally on the Riva server.
Send inference requests from a demo client using Riva API bindings.

Prerequisites#

Before we get started, ensure you have:

Access to NVIDIA NGC and the ability to download the Riva Quick Start resources.
A language model file that you want to deploy.
- For more information on training and exporting an n-gram language model, refer to the NeMo Language Modeling documentation.
- The language model file can be in one of the following three formats:
  - .riva. You can convert a .nemo model file to a .riva model file with the nemo2riva command.
  - .binary. You can download a pre-trained version from the Riva ASR LM NGC model page.
  - .arpa. You can download a pre-trained version from the Riva ASR LM NGC model page.
An acoustic model file in the .riva format that you want to deploy. You can convert a .nemo model file to a .riva model file with the nemo2riva command.
- For more information on customizing a Conformer-CTC acoustic model with NeMo and exporting the resulting model with nemo2riva, refer to the Conformer-CTC fine-tuning tutorial.
- Alternatively, you can obtain a pre-trained Conformer-CTC .riva model for English ASR here.
- For more information on training NeMo models, refer to the Training section in the NeMo documentation.
- For more information on Conformer-CTC’s architecture, refer to the Conformer-CTC section of the NeMo ASR Models page.
- For more information on the configuration files necessary for training Conformer-CTC with NeMo, refer to the Conformer-CTC section of the NeMo ASR Model Configuration Files page.
Weighted Finite State Transducer (WFST) tokenizer and verbalizer files for Inverse Text Normalization (ITN).
- For more information on WFST and ITN, refer to the NeMo Inverse Text Normalization: From Development to Production paper.
- You can download pretrained WFST ITN model files from this NVIDIA GPU Cloud (NGC) model page.
A decoder vocabulary file. You can download one from the Riva ASR LM NGC model page.

Riva ServiceMaker#

Riva ServiceMaker is a set of tools that aggregates all the necessary artifacts (models, files, configurations, and user settings) for Riva deployment to a target environment. It has two main components:

Riva-Build#

This step helps build a Riva-ready version of the model. Its only output is an intermediate format (called an RMIR) of an end-to-end pipeline for the supported services within Riva. Let’s consider an ASR n-gram language model.

riva-build is responsible for the combination of one or more exported models (.riva files) into a single file containing an intermediate format called Riva Model Intermediate Representation (.rmir). This file contains a deployment-agnostic specification of the whole end-to-end pipeline along with all the assets required for the final deployment and inference. For more information, refer to the documentation.

# IMPORTANT: UPDATE THESE PATHS 

# Riva Docker
RIVA_CONTAINER = "<add container name>"

# Example: 
# RIVA_CONTAINER = f"nvcr.io/nvidia/riva/riva-speech:{__riva_version__}"

# Directory where model files are stored, 
# e.g. $MODEL_LOC/$ACOUSTIC_MODEL_NAME.riva
MODEL_LOC = "<add path to model location>"

# Name of the acoustic model .riva file
ACOUSTIC_MODEL_NAME = "<add model name>"

# Name of the language model .riva (or .arpa or .binary) file
LANGUAGE_MODEL_NAME = "<add model name>"

# Name of the decoder vocab file
DECODER_VOCAB_NAME = "<add decoder vocab file name>"

# Name of the WFST tokenizer
WFST_TOKENIZER = "<add WFST tokenizer model name>"

# Name of the WFST verbalizer
WFST_VERBALIZER = "<add WFST verbalizer model name>"

# Get the Riva Docker container
! docker pull $RIVA_CONTAINER

If it doesn’t already exist, create a sub-directory inside MODEL_LOC to store your .rmir files.

! mkdir -p $MODEL_LOC/rmir
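Before invoking riva-build, it can save a failed container run to confirm that every file named in the variables above is actually present in MODEL_LOC. The following is a minimal sketch of such a check; `missing_artifacts` is a hypothetical helper written for this tutorial and is not part of Riva or NeMo.

```python
from pathlib import Path


def missing_artifacts(model_loc, file_names):
    """Return the names in file_names that are not regular files under model_loc."""
    root = Path(model_loc)
    return [name for name in file_names if not (root / name).is_file()]


# Example usage in the notebook (assumes the variables defined earlier):
# missing = missing_artifacts(
#     MODEL_LOC,
#     [ACOUSTIC_MODEL_NAME, LANGUAGE_MODEL_NAME, DECODER_VOCAB_NAME,
#      WFST_TOKENIZER, WFST_VERBALIZER],
# )
# if missing:
#     print("Missing files in MODEL_LOC:", missing)
```

An empty list means all listed files were found. The check only looks at file presence, not at whether the files are valid models.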

Build the `.rmir` file#

Notes

If you encrypted your acoustic model and/or language model by adding the --key flag when invoking nemo2riva, or you downloaded a pre-trained model from NGC dated before 2023, you’ll need to append a colon and then the key’s value to the model’s file name in the riva-build command (for example, /servicemaker-dev/$LANGUAGE_MODEL_NAME:$KEY), as the [:key] suffix in the syntax comment in the next cell indicates. The command in that cell assumes unencrypted models and therefore omits the suffix. You might find it convenient to set a string variable named KEY and pass it into the appropriate riva-build arguments as $KEY. The standard encryption key for the older pre-trained models is tlt_encode.
If your language model is in the .arpa format, replace the /servicemaker-dev/$LANGUAGE_MODEL_NAME argument (including any :$KEY suffix) with --decoding_language_model_arpa=/servicemaker-dev/$LANGUAGE_MODEL_NAME when invoking riva-build.
If your language model is in the .binary format, replace the /servicemaker-dev/$LANGUAGE_MODEL_NAME argument (including any :$KEY suffix) with --decoding_language_model_binary=/servicemaker-dev/$LANGUAGE_MODEL_NAME when invoking riva-build.
Refer to the Riva ASR Pipeline Configuration documentation if you want to build an ASR pipeline for a supported language other than US English. To obtain the proper riva-build parameters for your particular application, select the acoustic model (the parameters below assume Conformer-CTC), language, and pipeline type (offline for the purposes of this tutorial) from the interactive web menu at the bottom of the first section of the page.
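The rules in the notes above about the language model argument can be captured in a few lines of Python. The following is a sketch only: `language_model_arg` is a hypothetical helper, and the `/servicemaker-dev` mount point is taken from the riva-build command in this tutorial. It returns the string to place where the language model argument goes in that command.

```python
def language_model_arg(file_name, key=None, mount_dir="/servicemaker-dev"):
    """Build the riva-build argument for a language model file.

    .riva files are passed positionally, with an optional ':key' suffix.
    .arpa and .binary files are passed through their dedicated flags, without a key.
    """
    path = f"{mount_dir}/{file_name}"
    lower = file_name.lower()
    if lower.endswith(".arpa"):
        return f"--decoding_language_model_arpa={path}"
    if lower.endswith(".binary"):
        return f"--decoding_language_model_binary={path}"
    if lower.endswith(".riva"):
        return f"{path}:{key}" if key else path
    raise ValueError(f"Unsupported language model format: {file_name}")


print(language_model_arg("lm.riva", key="tlt_encode"))
print(language_model_arg("lm.arpa"))
```

The two calls print `/servicemaker-dev/lm.riva:tlt_encode` and `--decoding_language_model_arpa=/servicemaker-dev/lm.arpa`. The file names `lm.riva` and `lm.arpa` are placeholders.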

# Syntax: 
# riva-build <task-name> \
#     output-dir-for-rmir/model.rmir[:key] \
#     dir-for-riva/acoustic_model.riva[:key] \
#     dir-for-riva/lm_model.riva[:key]
! docker run --rm --gpus 1 -v $MODEL_LOC:/servicemaker-dev $RIVA_CONTAINER -- \
    riva-build speech_recognition \
        /servicemaker-dev/rmir/asr_offline_riva_ngram_lm.rmir \
        /servicemaker-dev/$ACOUSTIC_MODEL_NAME \
        /servicemaker-dev/$LANGUAGE_MODEL_NAME \
        --decoding_vocab=/servicemaker-dev/$DECODER_VOCAB_NAME \
        --wfst_tokenizer_model=/servicemaker-dev/$WFST_TOKENIZER \
        --wfst_verbalizer_model=/servicemaker-dev/$WFST_VERBALIZER \
        --name=offline_riva_ngram_lm_pipeline \
        --chunk_size=4.8 \
        --left_padding_size=1.6 \
        --right_padding_size=1.6 \
        --ms_per_timestep=40 \
        --max_batch_size=16 \
        --nn.fp16_needs_obey_precision_pass \
        --language_code=en-US \
        --decoder_type=flashlight \
        --flashlight_decoder.asr_model_delay=-1 \
        --flashlight_decoder.lm_weight=0.2 \
        --flashlight_decoder.word_insertion_score=0.2 \
        --flashlight_decoder.beam_threshold=20. \
        --featurizer.use_utterance_norm_params=False \
        --featurizer.precalc_norm_time_steps=0 \
        --featurizer.precalc_norm_params=False \
        --offline

Riva-Deploy#

The deployment tool takes as input one or more RMIR files and a target model repository directory. It creates an ensemble configuration specifying the pipeline for the execution and finally writes all those assets to the output model repository directory.

Note: If you added an encryption key to your .rmir file when building it with riva-build, make sure to append a colon and then the key’s value to the model’s name in the riva-deploy command, as shown below.

# Syntax: riva-deploy -f dir-for-rmir/model.rmir[:key] output-dir-for-repository
! docker run --rm --gpus 0 -v $MODEL_LOC:/data $RIVA_CONTAINER -- \
            riva-deploy -f  /data/rmir/asr_offline_riva_ngram_lm.rmir /data/models/
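Because the command above mounts $MODEL_LOC at /data, the generated model repository should appear on the host under $MODEL_LOC/models. A quick way to confirm that riva-deploy wrote something is to list that directory. The sketch below uses a hypothetical helper, `list_model_dirs`, and assumes the repository holds one subdirectory per deployed model component; the exact directory names depend on your pipeline.

```python
from pathlib import Path


def list_model_dirs(model_loc, repo_subdir="models"):
    """Return the sorted subdirectory names under <model_loc>/<repo_subdir>."""
    repo = Path(model_loc) / repo_subdir
    if not repo.is_dir():
        return []  # riva-deploy has not created the repository yet
    return sorted(p.name for p in repo.iterdir() if p.is_dir())


# Example usage in the notebook:
# print(list_model_dirs(MODEL_LOC))
```

An empty list suggests that the deploy step did not complete or wrote to a different location.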

Start the Riva Server#

After the model repository is generated, we are ready to start the Riva server. First, download the Riva Quick Start resource from NGC. Set the path to the directory here:

# Set the Riva Quick Start directory
RIVA_DIR = "<Path to the uncompressed folder downloaded from Quick Start (include the folder name)>"

Next, we modify the config.sh file to enable only the relevant Riva service (ASR, which includes the n-gram language model), provide the encryption key if one was used, and set the path to the model repository (riva_model_loc) generated in the previous step, among other configurations.

For example, if the model repository is generated at $MODEL_LOC/models, then you can specify riva_model_loc as the same directory as MODEL_LOC.

Pretrained versions of models specified in models_asr/nlp/tts/nmt are fetched from NGC. Since we are using our custom model, we can comment it out in models_asr (along with any others that are not relevant to your use case).

config.sh snippet#

# Enable or Disable Riva Services
service_enabled_asr=true 
service_enabled_nlp=true # MAKE CHANGES HERE - SET TO FALSE
service_enabled_tts=true # MAKE CHANGES HERE - SET TO FALSE
service_enabled_nmt=true # MAKE CHANGES HERE - SET TO FALSE

...

# Locations to use for storing models artifacts
#
# If an absolute path is specified, the data will be written to that location
# Otherwise, a docker volume will be used (default).
#
# riva_init.sh will create a `rmir` and `models` directory in the volume or
# path specified.
#
# RMIR ($riva_model_loc/rmir)
# Riva uses an intermediate representation (RMIR) for models
# that are ready to deploy but not yet fully optimized for deployment. Pretrained
# versions can be obtained from NGC (by specifying NGC models below) and will be
# downloaded to $riva_model_loc/rmir by `riva_init.sh`
#
# Custom models produced by NeMo or TLT and prepared using riva-build
# may also be copied manually to this location $(riva_model_loc/rmir).
#
# Models ($riva_model_loc/models)
# During the riva_init process, the RMIR files in $riva_model_loc/rmir
# are inspected and optimized for deployment. The optimized versions are
# stored in $riva_model_loc/models. The riva server exclusively uses these
# optimized versions.
riva_model_loc="riva-model-repo"  # MAKE CHANGES HERE - REPLACE WITH $MODEL_LOC

if [[ $riva_target_gpu_family == "tegra" ]]; then
    riva_model_loc="`pwd`/model_repository"
fi

# The default RMIRs are downloaded from NGC by default in the above $riva_rmir_loc directory
# If you'd like to skip the download from NGC and use the existing RMIRs in the $riva_rmir_loc
# then set the below $use_existing_rmirs flag to true. You can also deploy your set of custom
# RMIRs by keeping them in the riva_rmir_loc dir and use this quickstart script with the
# below flag to deploy them all together.
use_existing_rmirs=false        # MAKE CHANGES HERE - set to true                    

ATTENTION: Make sure to do the following before moving forward:

Either carry out these tasks manually:

In the file navigator in Jupyter Lab, navigate to $RIVA_DIR and open config.sh
Configure settings as shown in the snippet above
- Set NLP, TTS, and NMT services to false
- Set the riva_model_loc path to the path also assigned to MODEL_LOC
- Set the variable use_existing_rmirs to true

Or run the cell below:

ENABLE_ASR = 'true'
ENABLE_NLP = 'false'
ENABLE_TTS = 'false'
ENABLE_NMT = 'false'

!sed -i "s|service_enabled_asr=.*|service_enabled_asr=$ENABLE_ASR|g" $RIVA_DIR/config.sh
!sed -i "s|service_enabled_nlp=.*|service_enabled_nlp=$ENABLE_NLP|g" $RIVA_DIR/config.sh
!sed -i "s|service_enabled_tts=.*|service_enabled_tts=$ENABLE_TTS|g" $RIVA_DIR/config.sh
!sed -i "s|service_enabled_nmt=.*|service_enabled_nmt=$ENABLE_NMT|g" $RIVA_DIR/config.sh

!sed -i "/\sriva_model_loc=.*/! s|riva_model_loc=.*|riva_model_loc=\"$MODEL_LOC\"|g" $RIVA_DIR/config.sh

!sed -i "s|use_existing_rmirs=.*|use_existing_rmirs=true|g" $RIVA_DIR/config.sh
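Whether you edited config.sh by hand or with sed, it may be worth confirming that the values took effect before running the Riva scripts. The following is a sketch, assuming config.sh uses plain `key=value` lines as in the snippet above; `read_config_values` is a hypothetical helper. It ignores indented assignments, such as the Tegra override inside the `if` block.

```python
import re
from pathlib import Path

# Matches a top-level shell assignment: name=value or name="quoted value"
_ASSIGNMENT = re.compile(r'^([A-Za-z_][A-Za-z0-9_]*)=("[^"]*"|\S*)')


def read_config_values(config_path, keys):
    """Return {key: value} for unindented key=value lines whose key is in keys."""
    values = {}
    for line in Path(config_path).read_text().splitlines():
        match = _ASSIGNMENT.match(line)  # indented lines do not match
        if match and match.group(1) in keys:
            values[match.group(1)] = match.group(2).strip('"')
    return values


# Example usage in the notebook:
# print(read_config_values(
#     f"{RIVA_DIR}/config.sh",
#     {"service_enabled_asr", "service_enabled_nlp", "service_enabled_tts",
#      "service_enabled_nmt", "riva_model_loc", "use_existing_rmirs"},
# ))
```

For this tutorial, the expected result is ASR set to true, the other three services set to false, riva_model_loc equal to MODEL_LOC, and use_existing_rmirs set to true.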

# Ensure you have permission to execute these scripts
! cd $RIVA_DIR && chmod +x ./riva_init.sh && chmod +x ./riva_start.sh

# Run Riva Init. This will fetch the containers/models
# YOU CAN SKIP THIS STEP IF YOU ALREADY RAN RIVA DEPLOY
! cd $RIVA_DIR && ./riva_init.sh config.sh

# Run Riva Start. This will deploy your model.
! cd $RIVA_DIR && ./riva_start.sh config.sh

Run Inference#

After the Riva server is up and running with your models, you can send inference requests querying the server.
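The server can take some time to load models after riva_start.sh returns. One simple readiness check is to poll the gRPC port until it accepts TCP connections. The sketch below uses only the Python standard library; `wait_for_port` is a hypothetical helper, and port 50051 is assumed from the default server address used later in this tutorial. An open port shows that the server process is listening, not necessarily that every model has finished loading.

```python
import socket
import time


def wait_for_port(host="localhost", port=50051, timeout_s=60.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example usage in the notebook:
# if not wait_for_port("localhost", 50051):
#     print("Riva server is not accepting connections yet.")
```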

To send gRPC requests, you can install the Riva Python API bindings for the client. This is available as a Python module on PyPI.

# Install the Client API Bindings
! pip install nvidia-riva-client

import riva.client

Connect to the Riva Server and Run Inference#

Calling this inference function queries the Riva server (using gRPC) to transcribe an audio file.

def run_inference(audio_file, server='localhost:50051', print_full_response=False):
    """Send an offline recognition request for audio_file to the Riva server."""
    with open(audio_file, 'rb') as fh:
        data = fh.read()

    auth = riva.client.Auth(uri=server)
    client = riva.client.ASRService(auth)
    config = riva.client.RecognitionConfig(
        language_code="en-US",
        max_alternatives=1,
        enable_automatic_punctuation=False,
    )

    response = client.offline_recognize(data, config)
    if print_full_response:
        print(response)
    elif response.results and response.results[0].alternatives:
        print(response.results[0].alternatives[0].transcript)
    else:
        # Silent or unrecognized audio can yield an empty result list
        print("No transcript returned.")

Now we can actually query the Riva server.

audio_file = "<add path to .wav (PCM-, A-Law-, or U-Law-encoded), .flac, .opus, or .ogg (Opus-encoded) file>"
run_inference(audio_file)
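If a request fails or the transcript looks wrong, inspecting the audio file header is a reasonable first check. The following sketch reads basic properties of a PCM-encoded .wav file with Python's standard `wave` module; `describe_wav` is a hypothetical helper. The `wave` module handles PCM only, so A-Law, U-Law, .flac, .opus, and .ogg files need a different tool.

```python
import wave


def describe_wav(path):
    """Return channel count, sample width, sample rate, and duration of a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate if rate else 0.0,
        }


# Example usage in the notebook:
# print(describe_wav(audio_file))
```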

You can stop the Riva server before shutting down the Jupyter kernel.

! cd $RIVA_DIR && ./riva_stop.sh 
