How to Deploy Riva at Scale on OCI with OKE

This is an example of deploying Riva Speech Skills on Oracle Cloud Infrastructure (OCI) Oracle Container Engine for Kubernetes (OKE) with Traefik-based load balancing. It includes the following steps:

  1. Creating the OKE cluster

  2. Deploying the Riva API service

  3. Deploying the Traefik edge router

  4. Creating the IngressRoute to handle incoming requests

  5. Deploying a sample client

  6. Scaling the cluster

Prerequisites

Before continuing, ensure you have:

Creating the OKE Cluster

  1. Sign in to the OCI Console and open the hamburger menu at the top left. Under the Developer Services tab, select Kubernetes Clusters (OKE).

  2. Click ‘Create cluster’ to start the cluster creation.

    There will be two options for creating a Kubernetes Cluster: Quick create or Custom create. To make this tutorial as simple as possible, we will be using the Quick create option.

  3. After selecting the Quick create option, you can customize the cluster shape, number of nodes, and other settings. Set the Kubernetes worker nodes to ‘Public Workers’ so that you can easily SSH into them.

    Private worker nodes have private IP addresses and can only be accessed by other resources inside the VCN.

  4. For the simplest deployment, create one GPU node. This example was tested with a single worker node of shape VM.GPU3.1, which has one V100 GPU. You will learn how to scale up to multiple nodes later.

Note: Because this example uses the Quick create option, you can only choose the shape for one node pool initially. A node pool is a group of nodes within a cluster that all have the same configuration. You can add more node pools after the cluster is created.

Note: The ideal cluster configuration for this example includes:

  • A GPU-equipped node where the main Riva service will run.

  • A general-purpose compute node for the Traefik load balancer.

  • Another general-purpose node for client applications accessing the Riva service.

  5. Increase the boot volume size for each node; otherwise the Riva pods will not start. This example uses 500 GB. You can specify the size during cluster creation by clicking ‘Show advanced options’, checking ‘Specify a custom boot volume size’, and entering the desired boot volume size.

  6. After reviewing the resources, click ‘Create cluster’. Cluster creation takes around 15 minutes to complete.

Note: After you click ‘Create cluster’, the Console may report the cluster as active, but the cluster is not ready until the node pool status is also active.

Accessing the OKE Cluster

To access the cluster locally, navigate to the OKE page and click the cluster name. Click ‘Access Cluster’, select ‘Local Access’, and follow the prompts.

You can verify the cluster creation on the OCI Console by navigating to ‘Kubernetes Clusters (OKE)’ under ‘Developer Services’, where your cluster name is listed. To see your worker node instance, go to the ‘Instances’ section.
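
The Local Access prompts come down to a single OCI CLI call that writes a kubeconfig entry. The following is a minimal sketch, assuming the OCI CLI is installed and configured and that the cluster exposes a public endpoint. The function name `fetch_kubeconfig` is hypothetical, and the cluster OCID and region are placeholders to copy from the Console. The command shown in the Console prompt remains the authoritative version for your cluster.

```shell
# Hypothetical helper: write a kubeconfig entry for an OKE cluster.
# $1 = cluster OCID, $2 = region identifier (both copied from the Console).
# Assumption: the cluster has a public Kubernetes API endpoint.
fetch_kubeconfig() {
  cluster_id=$1
  region=$2
  mkdir -p "$HOME/.kube"
  oci ce cluster create-kubeconfig \
    --cluster-id "$cluster_id" \
    --file "$HOME/.kube/config" \
    --region "$region" \
    --token-version 2.0.0 \
    --kube-endpoint PUBLIC_ENDPOINT
}

# Usage (placeholders, not real values):
#   fetch_kubeconfig <cluster OCID> <region>
```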

Verify that the node now appears in Kubernetes. If so, the cluster was successfully created and you can access the cluster locally.

  cat ~/.kube/config
  kubectl get pods -A
  kubectl get nodes
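
To make the node check scriptable, the status column of `kubectl get nodes` can be parsed directly. This is a small sketch rather than an official procedure; the helper name `count_not_ready` is hypothetical, and it treats any status other than exactly `Ready` (including `Ready,SchedulingDisabled`) as not ready.

```shell
# Hypothetical helper: count nodes whose STATUS column is not exactly "Ready".
# Reads the output of `kubectl get nodes --no-headers` on standard input.
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get nodes --no-headers | count_not_ready
```

A result of 0 means every node listed is ready to accept pods.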

To SSH into a worker node, you need the private key you downloaded when you created the cluster. Connect to each worker node using its IP address and username, both of which you can find in the Console by clicking the instance you want to access.

  chmod 600 private_key.key
  ssh -i private_key.key username@ip_address

If a compute instance is created with a boot volume that is greater than or equal to 50 GB, the instance does not automatically use the entire volume. Use the oci-growfs utility to expand the root partition so that it uses the full allocated boot volume size. SSH into each worker node and run the following commands:

  sudo /usr/libexec/oci-growfs -y
  sudo systemctl restart kubelet
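
With several worker nodes, the two commands above can be run from your workstation in a loop. The sketch below assumes every node accepts the same private key and username; the function name `grow_boot_volumes` and its argument order are hypothetical.

```shell
# Hypothetical helper: expand the root filesystem and restart kubelet on each node.
# $1 = path to the SSH private key, $2 = SSH username, remaining args = node IPs.
grow_boot_volumes() {
  key=$1
  user=$2
  shift 2
  for ip in "$@"; do
    echo "Expanding boot volume on ${ip}"
    # Stop at the first node that fails so the error is not missed.
    ssh -i "$key" "${user}@${ip}" \
      'sudo /usr/libexec/oci-growfs -y && sudo systemctl restart kubelet' || return 1
  done
}

# Usage (placeholders):
#   grow_boot_volumes private_key.key username ip_address_1 ip_address_2
```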

Note: If you see an ‘Unable to expand /dev/sda3’ error, go to the OCI Console and click the hamburger menu at the top left. Under the Storage tab, navigate to ‘Block Storage’ and click ‘Boot Volumes’ in the left-hand column. Click each boot volume associated with a worker node, click ‘Edit’, change the volume size in the window that opens, and click ‘Save Changes’. A message then appears with rescan commands. SSH into each worker node and run those rescan commands, then run the oci-growfs commands above.

When you access the cluster for the first time, any GPU nodes you created are tainted by default so that pods are not scheduled onto inappropriate nodes (non-GPU workloads should not be scheduled on GPU nodes). While a node is tainted, no pod can be scheduled onto it unless you either remove the taint or add a matching toleration to the pod. If you run kubectl get pods -A and see that the CoreDNS pod is not running, the cause is usually a taint on the node.

To remove the taint:

  kubectl taint nodes node1 nvidia.com/gpu=value1:NoSchedule-

To add a matching toleration:

  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
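
In the taint command and toleration above, `node1` and `value1` stand in for the real node name and taint value. One way to look those up is to list the taints on every node. This is a sketch using kubectl's custom-columns output; the helper name `list_node_taints` is hypothetical.

```shell
# Hypothetical helper: show the taint keys, values, and effects on each node.
list_node_taints() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,KEYS:.spec.taints[*].key,VALUES:.spec.taints[*].value,EFFECTS:.spec.taints[*].effect'
}

# Usage (assumes kubectl is configured for the cluster):
#   list_node_taints
```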

Deploying the Riva API

The Riva Speech Skills Helm chart is designed to automate deployment to a Kubernetes cluster. After downloading the Helm chart, minor adjustments will adapt the chart to the way Riva will be used in the remainder of this tutorial. Depending on the number of models, this initial model deployment could take an hour or more.

  1. Download and untar the Riva API Helm chart. Replace VERSION_TAG with the desired specific version. Check out the latest on NGC.

    export NGC_CLI_API_KEY=<your NGC API key>
    export VERSION_TAG="2.17.0"
    helm fetch https://helm.ngc.nvidia.com/nvidia/riva/charts/riva-api-${VERSION_TAG}.tgz --username='$oauthtoken' --password=$NGC_CLI_API_KEY
    tar -xvzf riva-api-${VERSION_TAG}.tgz
    
  2. In the riva-api folder, modify the following file:

    • values.yaml

      • In modelRepoGenerator.ngcModelConfigs, comment or uncomment specific models or languages, as needed.

      • Change service.type from LoadBalancer to ClusterIP. This directly exposes the service only to other services within the cluster, such as the proxy service to be installed below.

  3. Enable the cluster to run containers needing NVIDIA GPUs using the nvidia device plugin:

    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo update
    helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
        --namespace nvidia-device-plugin \
        --create-namespace \
        --set failOnInitError=false
    
  4. Ensure you are in a working directory with riva-api as a subdirectory, then install the Riva Helm chart. You can explicitly override variables from the values.yaml file, such as the modelRepoGenerator.modelDeployKey settings.

    helm install riva-api riva-api/ \
        --set ngcCredentials.password=`echo -n $NGC_CLI_API_KEY | base64 -w0` \
        --set modelRepoGenerator.modelDeployKey=`echo -n tlt_encode | base64 -w0`
    
  5. The Helm chart runs two containers in order: a riva-model-init container that downloads and deploys the models, followed by a riva-speech-api container to start the speech service API. To monitor the deployment, use kubectl to describe the riva-api pod and to watch the container logs.

    export pod=`kubectl get pods | cut -d " " -f 1 | grep riva-api`
    kubectl describe pod $pod
    
    kubectl logs -f $pod -c riva-model-init
    kubectl logs -f $pod -c riva-speech-api
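
Because model deployment can take an hour or more, it can be convenient to block until the pod is ready instead of watching the logs. A minimal sketch using `kubectl wait`; the helper name `wait_for_pod_ready` and the 90-minute default timeout are assumptions to adjust for your model set.

```shell
# Hypothetical helper: block until the named pod reports the Ready condition.
# $1 = pod name, $2 = optional timeout (default 90m, an arbitrary choice).
wait_for_pod_ready() {
  pod_name=$1
  timeout=${2:-90m}
  kubectl wait --for=condition=Ready "pod/${pod_name}" --timeout="$timeout"
}

# Usage, reusing the $pod variable exported above:
#   wait_for_pod_ready "$pod"
```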
    

Deploying the Traefik edge router

Now that the Riva service is running, the cluster needs a mechanism to route requests into Riva.

In the default values.yaml of the riva-api Helm chart, service.type was set to LoadBalancer, which would have created an OCI Load Balancer to direct traffic into the Riva service. Instead, the open-source Traefik edge router will serve this purpose.

  1. Download and untar the Traefik Helm chart.

    helm repo add traefik https://helm.traefik.io/traefik
    helm repo update
    helm fetch traefik/traefik
    tar -zxvf traefik-*.tgz
    
  2. Modify the traefik/values.yaml file.

    Change service.type from LoadBalancer to ClusterIP. This exposes the service on a cluster-internal IP.

  3. Deploy the modified traefik Helm chart.

    helm install traefik traefik/
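
Both the riva-api and traefik services should now be of type ClusterIP. The sketch below checks this by reading the service type with a JSONPath query; the helper name `service_is_cluster_ip` is hypothetical, and it assumes the services live in the current namespace.

```shell
# Hypothetical helper: succeed only if the named service has type ClusterIP.
service_is_cluster_ip() {
  svc_type=$(kubectl get service "$1" -o jsonpath='{.spec.type}') || return 1
  [ "$svc_type" = "ClusterIP" ]
}

# Usage (assumes kubectl is configured for the cluster):
#   service_is_cluster_ip traefik && echo "traefik is cluster-internal"
#   service_is_cluster_ip riva-api && echo "riva-api is cluster-internal"
```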
    

Creating the IngressRoute

An IngressRoute enables the Traefik load balancer to recognize incoming requests and distribute them across multiple riva-api services.

When you deployed the traefik Helm chart above, Kubernetes automatically created a local DNS entry for that service: traefik.default.svc.cluster.local. The IngressRoute definition below matches this DNS entry and directs requests to the riva-api service. You can modify the match rule to support a different DNS arrangement, depending on your requirements.

  1. Create the following riva-ingress.yaml file:

    apiVersion: traefik.containo.us/v1alpha1
    kind: IngressRoute
    metadata:
      name: riva-ingressroute
    spec:
      entryPoints:
        - web
      routes:
        - match: "Host(`traefik.default.svc.cluster.local`)"
          kind: Rule
          services:
            - name: riva-api
              port: 50051
              scheme: h2c
    
  2. Deploy the IngressRoute.

    kubectl apply -f riva-ingress.yaml
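
If you need a different DNS arrangement, the hostname in the match rule is the only value that has to change. As a sketch, the manifest can be generated from a shell function; the name `render_ingressroute` is hypothetical, and everything other than the Host() value mirrors riva-ingress.yaml.

```shell
# Hypothetical helper: print an IngressRoute manifest for the given hostname.
# The backticks around the hostname are escaped so the shell does not
# treat them as command substitution inside the here-document.
render_ingressroute() {
  host=$1
  cat <<EOF
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: riva-ingressroute
spec:
  entryPoints:
    - web
  routes:
    - match: "Host(\`${host}\`)"
      kind: Rule
      services:
        - name: riva-api
          port: 50051
          scheme: h2c
EOF
}

# Usage (assumes kubectl is configured for the cluster):
#   render_ingressroute traefik.default.svc.cluster.local | kubectl apply -f -
```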
    

The Riva service is now able to serve gRPC requests from within the cluster at the address traefik.default.svc.cluster.local. If you are planning to deploy your own client application in the cluster to communicate with Riva, you can send requests to that address. In the next section, you will deploy a Riva sample client and use it to test the deployment.

Deploying a Sample Client

Riva provides a container with a set of pre-built sample clients to test the Riva services. The clients are also available on GitHub for those interested in adapting them.

  1. Create the client-deployment.yaml file that defines the deployment. Replace the image path in the file below. Check out NGC for the latest image tag:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: riva-client
      labels:
        app: "rivaasrclient"
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: "rivaasrclient"
      template:
        metadata:
          labels:
            app: "rivaasrclient"
        spec:
          imagePullSecrets:
          - name: imagepullsecret
          containers:
          - name: riva-client
            image: "nvcr.io/nvidia/riva/riva-speech-client:2.17.0"
            command: ["/bin/bash"]
            args: ["-c", "while true; do sleep 5; done"]
    
  2. Deploy the client.

    kubectl apply -f client-deployment.yaml
    
  3. Connect to the client pod.

    export cpod=`kubectl get pods | cut -d " " -f 1 | grep riva-client`
    kubectl exec --stdin --tty $cpod -- /bin/bash
    
  4. From inside the shell of the client pod, run the sample ASR client on an example .wav file. Specify the traefik.default.svc.cluster.local endpoint, with port 80, as the service address.

    riva_streaming_asr_client \
       --audio_file=/work/wav/sample.wav \
       --automatic_punctuation=true \
       --riva_uri=traefik.default.svc.cluster.local:80
    

    Depending on what audio file you choose, you will see an output similar to this:

    filename: /opt/riva/wav/en-US_sample.wav
    Done loading 1 files
    what
    what
    what is
    what is
    what is
    what is now
    what is natural
    what is natural
    what is natural language
    what is natural language
    what is natural language
    what is natural language
    what is natural language Processing
    what is natural language Processing
    what is natural language Processing
    what is natural language Processing
    what is natural language Processing
    what is language Processing
    what is language Processing
    What is Natural Language Processing?
    -----------------------------------------------------------
    File: /opt/riva/wav/en-US_sample.wav
    
    Final transcripts:
    0 : What is Natural Language Processing?
    
    Timestamps:
    Word                                    Start (ms)      End (ms)
    
    What                                    840             880
    is                                      1160            1200
    Natural                                 1800            2080
    Language                                2200            2520
    Processing?                             2720            3200
    
    Audio processed: 4 sec.
    -----------------------------------------------------------
    
    Not printing latency statistics because the client is run without the --simulate_realtime option and/or the number of requests sent is not equal to number of requests received. To get latency statistics, run with --simulate_realtime and set the --chunk_duration_ms to be the same as the server chunk duration
    Run time: 0.108666 sec.
    Total audio processed: 4.152 sec.
    Throughput: 38.2087 RTFX
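
The throughput figure at the end of the output is a real-time factor: seconds of audio processed divided by the wall-clock run time. A small sketch of that arithmetic, with a hypothetical helper name `rtfx`:

```shell
# Hypothetical helper: real-time factor = audio seconds / run-time seconds.
# $1 = total audio processed (sec), $2 = run time (sec).
rtfx() {
  awk -v audio="$1" -v runtime="$2" 'BEGIN { printf "%.4f\n", audio / runtime }'
}

# Usage with the figures from the sample output:
#   rtfx 4.152 0.108666
```

With the values printed above, the result is about 38.2, in line with the reported throughput; small differences in the last digits come from rounding in the printed run time.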
    

Scaling the cluster

As deployed above, the OKE cluster only provisions a single GPU node. While a single GPU can handle a large volume of requests, the cluster can easily be scaled with more nodes.

  1. Scale the GPU node pool to the desired number of compute nodes (4 in this case) through the OCI Console.

  2. Scale the riva-api deployment to use the additional nodes.

    kubectl scale deployments/riva-api --replicas=4
    

As with the original riva-api deployment, each replica pod downloads and initializes the necessary models prior to starting the Riva service.
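
Since each new replica spends time initializing models, it helps to wait for the rollout to finish before sending traffic. A minimal sketch combining `kubectl scale` with `kubectl rollout status`; the helper name `scale_and_wait` and the two-hour default timeout are assumptions.

```shell
# Hypothetical helper: scale a deployment and wait until all replicas are ready.
# $1 = deployment name, $2 = replica count, $3 = optional timeout (default 2h).
scale_and_wait() {
  deployment=$1
  replicas=$2
  timeout=${3:-2h}
  kubectl scale "deployments/${deployment}" --replicas="$replicas" || return 1
  kubectl rollout status "deployment/${deployment}" --timeout="$timeout"
}

# Usage (assumes the GPU node pool has already been scaled in the Console):
#   scale_and_wait riva-api 4
```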