How to Deploy Riva at Scale on AWS with EKS#
This is an example of deploying and scaling Riva Speech Skills on Amazon Web Services (AWS) Elastic Kubernetes Service (EKS) with Traefik-based load balancing. It includes the following steps:
Creating the EKS cluster
Deploying the Riva API service
Deploying the Traefik edge router
Creating the IngressRoute to handle incoming requests
Deploying a sample client
Scaling the cluster
Prerequisites#
Before continuing, ensure you have:
An AWS account with the appropriate user/role privileges to manage EKS
The AWS command-line tool, configured for your account
Access to NGC and the associated command-line interface
This sample has been tested on: eksctl (v0.82.0), helm (v3.6.3), kubectl (v1.21.2), and traefik (v2.5.3).
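If you want to confirm which versions are installed locally, each of these CLI tools can report its own version; the exact output format varies by release, so this is just an optional sanity check.

# Confirm local tool versions before creating the cluster
eksctl version
helm version
kubectl version --client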
Creating the EKS Cluster#
The cluster contains three separate nodegroups:
gpu-linux-workers: A GPU-equipped nodegroup where the main Riva service runs. g5.2xlarge instances, each using an A10 GPU, provide good value and sufficient capacity for many applications. This nodegroup allows scaling from 1 to 8 nodes.
cpu-linux-lb: A general-purpose compute nodegroup for the Traefik load balancer, using an m6i.large instance.
cpu-linux-clients: Another general-purpose nodegroup with an m6i.2xlarge instance, for client applications accessing the Riva service. In this example the node is used for benchmarking; it could also be used for another service such as a node.js application.
Build a configuration that defines each of these nodegroups and save it to a file called eks_launch_conf.yaml.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: riva-cluster
  region: us-west-2
  version: "1.21"

iam:
  withOIDC: true

managedNodeGroups:
  - name: gpu-linux-workers
    labels: { role: workers }
    instanceType: g5.2xlarge
    minSize: 1
    maxSize: 8
    volumeSize: 100
    privateNetworking: true
    ssh:
      allow: true
  - name: cpu-linux-lb
    labels: { role: loadbalancers }
    instanceType: m6i.large
    desiredCapacity: 1
    volumeSize: 100
    privateNetworking: true
    ssh:
      allow: true
  - name: cpu-linux-clients
    labels: { role: clients }
    instanceType: m6i.2xlarge
    desiredCapacity: 1
    volumeSize: 100
    privateNetworking: true
    ssh:
      allow: true
Launch the cluster with the above configuration.
eksctl create cluster -f eks_launch_conf.yaml
It may take 30 minutes or more for AWS to provision all the necessary resources. When the cluster is ready, eksctl writes the new cluster's access credentials into your default Kubernetes configuration file.
Verify that the nodes now appear in Kubernetes. If so, the cluster was successfully created.
cat .kube/config

kubectl get pods -A
kubectl get nodes --show-labels
kubectl get nodes --selector role=workers
kubectl get nodes --selector role=clients
kubectl get nodes --selector role=loadbalancers
Deploying the Riva API#
The Riva Speech Skills Helm chart is designed to automate deployment to a Kubernetes cluster. After you download the Helm chart, a few minor adjustments will adapt it to the way Riva is used in the remainder of this tutorial.
Download and untar the Riva API Helm chart. Replace VERSION_TAG with the specific version needed.

export NGC_CLI_API_KEY=<your NGC API key>
export VERSION_TAG="2.17.0"
helm fetch https://helm.ngc.nvidia.com/nvidia/riva/charts/riva-api-${VERSION_TAG}.tgz --username='$oauthtoken' --password=$NGC_CLI_API_KEY
tar -xvzf riva-api-${VERSION_TAG}.tgz
In the riva-api folder, modify the following files:

values.yaml

In modelRepoGenerator.ngcModelConfigs, comment or uncomment specific models or languages, as needed.

Change service.type from LoadBalancer to ClusterIP. This directly exposes the service only to other services within the cluster, such as the proxy service to be installed below. (A sketch of the resulting values.yaml snippet follows this list.)

templates/deployment.yaml

Add a node selector constraint to ensure that Riva is only deployed on the correct GPU resources. In spec.template.spec, add:

nodeSelector:
  eks.amazonaws.com/nodegroup: gpu-linux-workers
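For reference, the relevant portion of values.yaml might look roughly like the following after these edits. The surrounding keys follow the chart version listed above and may differ in other releases, so treat this as illustrative rather than a drop-in replacement.

# values.yaml (excerpt) -- assumed surrounding structure; check your chart version
modelRepoGenerator:
  ngcModelConfigs:
    # comment or uncomment model entries here as needed
service:
  type: ClusterIP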
Enable the cluster to run containers that need NVIDIA GPUs, using the NVIDIA device plugin:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install \
    --generate-name \
    --set failOnInitError=false \
    nvdp/nvidia-device-plugin
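To confirm that the plugin registered the GPUs with Kubernetes, you can optionally check that the device plugin pod is running and that the worker node now advertises an nvidia.com/gpu resource; pod names will differ in your cluster.

# The device plugin pod should be Running
kubectl get pods -A | grep nvidia-device-plugin

# The GPU node should report an allocatable nvidia.com/gpu resource
kubectl describe nodes --selector role=workers | grep -i nvidia.com/gpu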
Ensure you are in a working directory with riva-api as a subdirectory, then install the Riva Helm chart. You can explicitly override variables from the values.yaml file, such as the modelRepoGenerator.modelDeployKey settings.

helm install riva-api riva-api/ \
    --set ngcCredentials.password=`echo -n $NGC_CLI_API_KEY | base64 -w0` \
    --set modelRepoGenerator.modelDeployKey=`echo -n tlt_encode | base64 -w0`
The Helm chart runs two containers in order: a riva-model-init container that downloads and deploys the models, followed by a riva-speech-api container to start the speech service API. Depending on the number of models, the initial model deployment could take an hour or more. To monitor the deployment, use kubectl to describe the riva-api pod and to watch the container logs.

export pod=`kubectl get pods | cut -d " " -f 1 | grep riva-api`
kubectl describe pod $pod

kubectl logs -f $pod -c riva-model-init
kubectl logs -f $pod -c riva-speech-api
Deploying the Traefik Edge Router#
Now that the Riva service is running, the cluster needs a mechanism to route requests into Riva. Had service.type been left as LoadBalancer in the values.yaml of the riva-api Helm chart, Kubernetes would have automatically created an AWS Classic Load Balancer to direct traffic into the Riva service. Instead, the open-source Traefik edge router will serve this purpose.
Download and untar the Traefik Helm chart.

helm repo add traefik https://helm.traefik.io/traefik
helm repo update
helm fetch traefik/traefik
tar -zxvf traefik-*.tgz
Modify the traefik/values.yaml file (a sketch of the result follows this list):

Set service.type to LoadBalancer to expose the service on an external IP accessible from outside the cluster. If service.type is set to ClusterIP, the service is only exposed on a cluster-internal IP.

Set nodeSelector to { eks.amazonaws.com/nodegroup: cpu-linux-lb }. Similar to what you did for the Riva API service, this tells the Traefik service to run on the cpu-linux-lb nodegroup.
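As a rough sketch, the relevant portion of traefik/values.yaml might look like the following after these edits. The key names match the Traefik chart version listed above and could differ in newer chart releases.

# traefik/values.yaml (excerpt) -- assumed layout for the chart version above
service:
  type: LoadBalancer

nodeSelector:
  eks.amazonaws.com/nodegroup: cpu-linux-lb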
Deploy the modified traefik Helm chart.

helm install traefik traefik/
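Optionally, verify that the Traefik pod was scheduled on the cpu-linux-lb nodegroup and that its service was created as expected; pod names and addresses will differ in your cluster.

# Confirm where the Traefik pod is running and how its service is exposed
kubectl get pods -o wide | grep traefik
kubectl get svc traefik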
Creating the IngressRoute#
An IngressRoute enables the Traefik load balancer to recognize incoming requests and distribute them across multiple riva-api services.

If you deployed the above traefik Helm chart with service.type set to ClusterIP, Kubernetes automatically created a local DNS entry for that service: traefik.default.svc.cluster.local. If you deployed it with service.type set to LoadBalancer, Kubernetes automatically created an external DNS entry for that service, which can be obtained from the kubectl get svc command, e.g. a7153b60c6e7a44dab6f681d15e111b5-2140342794.us-west-2.elb.amazonaws.com.
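If you deployed with service.type set to LoadBalancer, one convenient optional pattern is to capture that external hostname in an environment variable for use in later steps; the variable name here is just an example.

# Example only: store the external hostname of the Traefik service
export RIVA_INGRESS_HOST=`kubectl get svc traefik -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'`
echo $RIVA_INGRESS_HOST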
The IngressRoute definition below matches these DNS entries and directs requests to the riva-api service. You can modify the entries to support a different DNS arrangement, depending on your requirements.
Create the following riva-ingress.yaml file. You need to replace <local_or_external_IP> with the local or external DNS entry mentioned in the instructions above.

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: riva-ingressroute
spec:
  entryPoints:
    - web
  routes:
    - match: "Host(`<local_or_external_IP>`)"
      kind: Rule
      services:
        - name: riva-api
          port: 50051
          scheme: h2c
Deploy the IngressRoute.
kubectl apply -f riva-ingress.yaml
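To confirm that Traefik picked up the route, you can optionally query the IngressRoute custom resource that was just created.

# Verify that the IngressRoute exists and inspect its configuration
kubectl get ingressroute
kubectl describe ingressroute riva-ingressroute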
The Riva service is now able to serve gRPC requests from within or outside the cluster, depending on the service.type field, at the local or external address mentioned above. If you are planning to deploy your own client application in the cluster to communicate with Riva, you can send requests to that address. In the next section, you will deploy a Riva sample client and use it to test the deployment.
Deploying a Sample Client#
Riva provides a container with a set of pre-built sample clients to test the Riva services. The Riva C++ clients and Riva Python clients are also available on GitHub for those interested in adapting them.
Create the client-deployment.yaml file that defines the deployment and contains the following:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: riva-client
  labels:
    app: "rivaasrclient"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "rivaasrclient"
  template:
    metadata:
      labels:
        app: "rivaasrclient"
    spec:
      nodeSelector:
        eks.amazonaws.com/nodegroup: cpu-linux-clients
      imagePullSecrets:
        - name: imagepullsecret
      containers:
        - name: riva-client
          image: "nvcr.io/nvidia/riva/riva-speech:2.17.0"
          command: ["/bin/bash"]
          args: ["-c", "while true; do sleep 5; done"]
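The deployment above references an image pull secret named imagepullsecret so the client image can be pulled from NGC. The riva-api Helm chart may already have created a secret with that name from ngcCredentials; if it does not exist in your cluster, you can create it manually, assuming your NGC API key is still in $NGC_CLI_API_KEY.

# Create the NGC image pull secret referenced by client-deployment.yaml, if needed
kubectl create secret docker-registry imagepullsecret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password=$NGC_CLI_API_KEY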
Deploy the client service.
kubectl apply -f client-deployment.yaml
Connect to the client pod.
export cpod=`kubectl get pods | cut -d " " -f 1 | grep riva-client`
kubectl exec --stdin --tty $cpod -- /bin/bash
From inside the shell of the client pod, run the sample ASR client on an example .wav file. Specify the <local_or_external_IP> endpoint mentioned before, with port 80, as the service address.

riva_streaming_asr_client \
    --audio_file=/opt/riva/wav/en-US_sample.wav \
    --automatic_punctuation=true \
    --riva_uri=<local_or_external_IP>:80
Scaling the Cluster#
As deployed above, the EKS cluster only provisions a single GPU node, although the given configuration permits up to 8 nodes. While a single GPU can handle a large volume of requests, the cluster can easily be scaled with more nodes.
Scale the GPU nodegroup to the desired number of compute nodes (4 in this case).
eksctl scale nodegroup \
    --name=gpu-linux-workers \
    --cluster=riva-cluster \
    --nodes=4 \
    --region=us-west-2
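Scaling can take several minutes while AWS provisions the new instances and they join the cluster. You can optionally watch progress from both the eksctl and Kubernetes sides.

# Check the nodegroup capacity and wait for the new GPU nodes to become Ready
eksctl get nodegroup --cluster=riva-cluster --region=us-west-2 --name=gpu-linux-workers
kubectl get nodes --selector role=workers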
Scale the riva-api deployment to use the additional nodes.

kubectl scale deployments/riva-api --replicas=4
As with the original riva-api deployment, each replica pod downloads and initializes the necessary models prior to starting the Riva service.
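Once initialization finishes, you can confirm that the replicas are spread across the GPU nodes and that all of them report Ready; pod and node names will differ in your cluster.

# Confirm that the riva-api replicas are distributed across the GPU nodes
kubectl get deployments riva-api
kubectl get pods -o wide | grep riva-api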