Kubernetes
Contents
Kubernetes#
This section is not supported for embedded platforms.
Included in the NGC Helm Repository is a chart designed to automate for push-button deployment to a Kubernetes cluster.
The Riva Speech AI Helm Chart deploys the ASR, NLP, and TTS services automatically. The Helm chart performs a number of functions:
Pulls Docker images from NGC for the Riva Speech AI server and utility containers for downloading and converting models.
Downloads the requested model artifacts from NGC as configured in the
values.yaml
file.Generates the Triton Inference Server model repository.
Starts the Riva Speech AI server as configured in a Kubernetes pod.
Exposes the Riva Speech AI server as a configured service.
Examples of pretrained models are released with Riva for each of the services. The Helm chart comes preconfigured for downloading and deploying all of these models.
Important
The Helm chart configuration can be modified for your use case by modifying
the values.yaml
file. In this file, you can change the settings related
to which models to deploy, where to store them, and how to expose the services.
Attention
To deploy Riva, a functioning Kubernetes environment with a GPU (NVIDIA Volta or later) is required. This can be either on-premise or in a cloud provider, or within a managed Kubernetes environment so long as the environment has GPU support enabled.
Installation with Helm#
Validate Kubernetes with NVIDIA GPU support.
Kubernetes with GPU is well supported by NVIDIA. For more information, refer to the Install Kubernetes instructions to ensure that your environment is properly setup.
If using an NVIDIA A100 GPU with Multi-Instance GPU (MIG) support, refer to MIG Support in Kubernetes.
Download and modify the Helm chart for your use case.
export NGC_API_KEY=<your_api_key> helm fetch https://helm.ngc.nvidia.com/nvidia/riva/charts/riva-api-2.13.1.tgz \ --username=\$oauthtoken --password=$NGC_API_KEY --untar
The above comment creates a new directory called
riva-api
in your current working directory. Within that directory is avalues.yaml
file that can be modified to suit your use case (refer to the Kubernetes Secrets and Riva Settings sections).After the
values.yaml
file has been updated to reflect the deployment requirements, Riva can be deployed to the Kubernetes cluster:helm install riva-api riva-api
Alternatively, use the
--set
option to install without modifying thevalues.yaml
file. Ensure you set the NGC API key, email, andmodel_key_string
to the appropriate values. By default,model_key_string
istlt_encode
.helm install riva-api \ --set ngcCredentials.password=`echo -n $NGC_API_KEY | base64 -w0` \ --set ngcCredentials.email=your_email@your_domain.com \ --set modelRepoGenerator.modelDeployKey=`echo -n model_key_string | base64 -w0`
Helm configuration. The following sections highlight key areas of the
values.yaml
file and considerations for deployment. Consult the individual service documentation for more details as well as the Helm chart’svalues.yaml
file, which contains inline comments explaining the configuration options.
Note
Depending on the number of models enabled, a higher value of failureThreshold
for Startup probe in deployment.yaml
might be needed to accomodate increased startup time.
Kubernetes Secrets#
The Helm deployment uses multiple Kubernetes secrets for obtaining access to NGC:
imagepullsecret
: one for Docker imagesmodelpullsecret
: one for model artifactsriva-model-deploy-key
: one for encrypted models
The names of the secrets can be modified in the values.yaml
file, however, if you are deploying into an NVIDIA EGX™ or NVIDIA Fleet Command™
managed environment, your environment must have support for imagepullsecret
and modelpullsecret
. These secrets
are managed by the chart and can be manipulated by setting the respective values within the ngcCredentials
section
within values.yaml
.
Riva Settings#
The values.yaml
for Riva is intended to provide maximum flexibility in deployment configurations.
The replicaCount
field is used to configure the number of
identical instances (or pods) of the services that are deployed. When
load-balanced appropriately, increasing this number (as resources
permit) enables horizontal scaling for increased load.
By default, the Riva API server and Triton server are deployed inside the same container in a
single pod. This works fine in most environments with 1 GPU and numerous models.
In the case of a multi-GPU environment, models can be distributed across GPUs for better utilization of GPUs.
In such a case, the modelRepoGenerator.useSeparateTriton
flag can be used to deploy
the Triton server in a separate pod with 1 GPU each. There can be as many Triton server pods
as the number of available GPUs. The number of Triton server pods is controlled by the number
of dict entries under modelRepoGenerator.ngcModelConfigs
. For each Triton server
entry, the models
value specifies the list of models to be loaded. Deployment of each Triton server can be controlled by dict entry enabled
under each Triton. By default, a
single Triton server pod modelRepoGenerator.ngcModelConfigs.tritonGroup0
is
configured to load the default list of models.
Prebuilt models not required for your deployment can be deleted from
the default list in modelRepoGenerator.ngcModelConfigs.tritonGroup0.models
.
We recommend you remove models that are not used to reduce deployment time and GPU memory usage.
By default, models are downloaded from NGC, optimized for TensorRT (if
necessary) before the service starts, and stored in a short-lived
location. When the pod terminates, these model artifacts are deleted
and the storage is freed for other workloads. This behavior is controlled
by the modelDeployVolume
field. Refer to the Kubernetes Volumes documentation
for alternative options that can be used for persistent storage. For scale-out deployments, having a model store shared across pods greatly improves
scale-up time since the models are prebuilt and already available to
the Riva container.
Persistent storage should only be used in homogenous deployments where GPU models are identical.
Currently, provided models nearly fill a T4’s memory (16 GB). We recommend running a subset of models/services if using a single GPU.
Deploying custom RMIR models with Helm#
If you have trained a custom model, you would have generated an .rmir
file.
Perform the following steps to deploy a custom RMIR model. It’s assumed that you are using the host path for storing the model repository.
Specify the host path for the model repository. Update the
modelRepoGenerator.modelDeployVolume.hostPath.path
parameter in the values.yaml file. It uses the default value of/data/riva
. You can configure it as needed.Create a directory at the specified host path. The default value is
/data/riva
.Create a directory
rmir
at path/data/riva/rmir
.Create a directory under
/data/riva/rmir/
for keeping the custom RMIR model. Directory name should follow<model_name>_v<model_version>
format. For example/data/riva/rmir/custom_asr_model_v1.0
, where model_name iscustom_asr_model
with version1.0
Copy the custom RMIR model file inside the directory created in the previous step. For example
/data/riva/rmir/custom_asr_model_v1.0/model.rmir
. The RMIR filename can be<any>.rmir
.Add a model entry in
values.yaml
undermodelRepoGenerator.ngcModelConfigs.tritonGroup0.models
. Use<model_name>:<model_version>
as the naming format. For example:
modelRepoGenerator:
ngcModelConfigs:
tritonGroup0:
enabled: true
models:
- custom_asr_model:1.0
Configure any other values as required and install the Helm chart as per Installation with Helm.
Ingress Controller#
There is a base configuration for a simple ingress controller
using Traefik. This can be configured through the values.yaml
,
or can be replaced with any controller supporting http2
and grpc
.
Ingress controllers are found in both on-premise and cloud-based deployments.
For this to work correctly, you must have a functional name resolution using
whatever mechanism (DNS, /etc/host
files, and so on).
For any sort of multi-pod scaling, you must have a correctly configured ingress controller performing HTTP/2 or gRPC load balancing including name resolution.
Further details can be found in the ingress:
section in the values.yaml
file.