Deploying Custom Models as NIM#
Riva ASR NIM handles deployment of full pipelines, which can be composed of one or more supported NVIDIA NeMo models together with pre- and post-processing components. These pipelines must be exported to an efficient inference engine and optimized for the target platform. Therefore, the Riva NIM server cannot directly use NVIDIA NeMo models, because each checkpoint represents only a single model rather than a complete pipeline.
The process of gathering all the required artifacts (for example, models, files, configurations, and user settings) and generating the inference engines is referred to as the Riva model repository generation. The Riva ASR NIM Docker images have all the necessary tools to generate the Riva model repository.
Phase 1: The development phase. To create a model in Riva, the model checkpoints must be converted to .riva format. You can further develop these .riva models using NeMo. For more information, refer to the Model Development with NeMo section.
Phase 2: The build phase. During the build phase, all the necessary artifacts (models, files, configurations, and user settings) required to deploy a Riva service are gathered into an intermediate file called RMIR (Riva Model Intermediate Representation). For more information, refer to the Riva Build section.
Phase 3: The deploy phase. During the deploy phase, the RMIR file is converted into the Riva model repository, and the neural networks in NeMo format are exported and optimized to run on the target platform. The deploy phase should be executed on the physical cluster on which the Riva server is deployed. For more information, refer to the Riva Deploy section.
Model Development with NeMo#
NVIDIA NeMo is an open-source PyTorch-based toolkit for research in conversational AI that exposes more of the model and PyTorch internals. Riva supports importing supported models trained in NeMo.
Note
For more information, refer to the NVIDIA/NeMo GitHub repository.
Export Models with NeMo2Riva#
Models trained in NVIDIA NeMo are in the .nemo format. To use these models in Riva, convert the model checkpoints to the .riva format for building and deploying with Riva using the nemo2riva tool. The nemo2riva tool is packaged and available from the Python Package Index (PyPI).
Refer to the nemo2riva README for detailed instructions on exporting different model architectures from .nemo to .riva format.
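As a rough sketch, exporting a NeMo ASR checkpoint typically involves installing nemo2riva from PyPI and running it against the .nemo file. The file names and encryption key below are hypothetical placeholders, and the exact flags can vary by model architecture and tool version, so check the nemo2riva README for your model.

# Install the export tool from PyPI
pip install nemo2riva

# Export a .nemo checkpoint to .riva format, optionally protected with an encryption key
nemo2riva --out custom_asr.riva --key tlt_encode custom_asr.nemo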
Riva Build#
The riva-build tool is responsible for deployment preparation. Its output is an
intermediate format (called an RMIR) of an end-to-end pipeline for the supported
services within Riva.
Launch an interactive session inside the NIM image of the model you want to deploy.
# Create the cache directory on the host machine:
export NIM_EXPORT_PATH=~/nim_export
mkdir -p $NIM_EXPORT_PATH
chmod 777 $NIM_EXPORT_PATH

# Set the appropriate value
export CONTAINER_ID=parakeet-1-1b-ctc-en-us

docker run --gpus all -it --rm \
    -v <artifact_dir>:/riva_build_deploy \
    -v $NIM_EXPORT_PATH:/model_tar \
    --entrypoint="/bin/bash" \
    --name riva-build-deploy \
    nvcr.io/nim/nvidia/$CONTAINER_ID:latest
where:
<artifact_dir> is the folder or Docker volume that contains the .riva file and other artifacts required to prepare the Riva model repository.
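For example, if the .riva file exported with nemo2riva is stored on the host in ~/custom_models (a hypothetical path), substitute that directory for <artifact_dir> in the command above.

# Hypothetical host directory containing the exported .riva file
export ARTIFACT_DIR=~/custom_models
# ...then mount it into the container with: -v $ARTIFACT_DIR:/riva_build_deploy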
Run the riva-build command from within the container.

riva-build <pipeline> \
    /riva_build_deploy/<rmir_filename>:<encryption_key> \
    /riva_build_deploy/<riva_filename>:<encryption_key> \
    <optional_args>
where:
<pipeline> must be one of the following: speech_recognition, punctuation.

<rmir_filename> is the name of the RMIR file that is generated.

<riva_filename> is the name of the .riva file(s) to use as input.

<args> are optional arguments to configure the Riva service. The following section discusses the different ways to configure the ASR, NLP, and TTS services.

<encryption_key> is optional. If the .riva file is generated without an encryption key, specify the input/output files with <riva_filename> instead of <riva_filename>:<encryption_key>.
By default, if a file named <rmir_filename> already exists, it will not be overwritten. To force the <rmir_filename> to be overwritten, use the -f or --force argument. For example, riva-build <pipeline> -f ...
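For instance, a complete invocation for an ASR pipeline might look like the following. The file names and the encryption key tlt_encode are hypothetical placeholders; adjust them to match your model.

riva-build speech_recognition -f \
    /riva_build_deploy/custom_asr.rmir:tlt_encode \
    /riva_build_deploy/custom_asr.riva:tlt_encode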
For details about the parameters that you can pass to riva-build to customize the Riva pipeline, refer to the Pipeline Configuration section.
Riva Deploy#
The riva-deploy tool takes an RMIR file and a target model repository directory as input. It performs the following functions:
Model Optimization: Optimize the frozen checkpoints for inference on the target GPU.
Configuration Generation: Generate configuration files for backend components, including ensembles of models.
Generate the Riva model repository from the Riva .rmir file with the following command:
riva-deploy <rmir_filename>:<encryption_key> /data/models
Run this command for each RMIR file.
By default, if the destination folder (/data/models in the above example) already exists, it will not be overwritten.
To force the destination folder to be overwritten, use the -f or --force parameter. For example, riva-deploy -f ...
The command creates the Riva model repository at /data/models. If you write to any other location, you must manually update the embedded artifact directories in the configuration files of model repositories that contain model-specific artifacts, such as class labels. Therefore, use /data/models unless you are familiar with Triton Inference Server model repository configurations.
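Continuing the earlier riva-build example, a deployment command might look like the following. The RMIR file name and the key tlt_encode are hypothetical placeholders.

riva-deploy -f /riva_build_deploy/custom_asr.rmir:tlt_encode /data/models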
After the Riva model repository is generated, prepare the tar archive.
cd /data/models
tar -czf /model_tar/custom_model.tar.gz *
Exit the container, then stop and remove it.
docker stop riva-build-deploy
docker rm riva-build-deploy
Launch the Custom NIM#
After the Riva model repository tar archive for your custom model is generated, manually start the NIM Docker container using docker run.
# Run the container with the cache directory mounted in the appropriate location:
docker run -it --rm --name=$CONTAINER_ID \
--runtime=nvidia \
--gpus '"device=0"' \
--shm-size=8GB \
-e NGC_API_KEY \
-e NIM_TAGS_SELECTOR \
-e NIM_DISABLE_MODEL_DOWNLOAD=true \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-p 9000:9000 \
-p 50051:50051 \
-v $NIM_EXPORT_PATH:/opt/nim/export \
-e NIM_EXPORT_PATH=/opt/nim/export \
nvcr.io/nim/nvidia/$CONTAINER_ID:latest
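After the container starts, you can confirm that the service has loaded your custom model repository before sending transcription requests. As a rough check, assuming the standard NIM readiness endpoint is exposed on the HTTP port mapped above, query it with curl; the exact endpoint can vary by NIM version.

# Returns a ready status once the exported models have been loaded
curl -s http://localhost:9000/v1/health/ready

If the service does not report ready, inspect the container logs with docker logs $CONTAINER_ID to verify that the tar archive in $NIM_EXPORT_PATH was found and unpacked.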