Kubernetes#

This section does not apply to embedded platforms.

Included in the NGC Helm Repository is a chart designed to automate push-button deployment to a Kubernetes cluster.

The Riva AI Services Helm Chart deploys the ASR, NLP, and TTS services automatically. The Helm chart performs a number of functions:

  • Pulls Docker images from NGC for the Riva Speech Server and utility containers for downloading and converting models.

  • Downloads the requested model artifacts from NGC as configured in the values.yaml file.

  • Generates the Triton Inference Server model repository.

  • Starts the Riva Speech Server as configured in a Kubernetes pod.

  • Exposes the Riva Speech Server as a configured service.

Example pretrained models are released with Riva for each of the services. The Helm chart comes preconfigured for downloading and deploying all of these models.

Important

The Helm chart configuration can be modified for your use case by modifying the values.yaml file. In this file, you can change the settings related to which models to deploy, where to store them, and how to expose the services.

Attention

To deploy Riva, a functioning Kubernetes environment with a GPU (NVIDIA Volta or later) is required. This can be on-premise, in a cloud provider, or within a managed Kubernetes environment, so long as the environment has GPU support enabled.

Installation with Helm#

  1. Validate Kubernetes with NVIDIA GPU support.

    Kubernetes with GPUs is well supported by NVIDIA. For more information, refer to the Install Kubernetes instructions to ensure that your environment is properly set up.

    If using an NVIDIA A100 GPU with Multi-Instance GPU (MIG) support, refer to MIG Support in Kubernetes.
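    As a quick check, you can schedule a pod that requests a GPU and runs nvidia-smi; if the device plugin is working, the pod completes and prints the GPU details. This manifest is illustrative only, and the CUDA image tag is an assumption, so substitute any CUDA base image available in your environment.

    # gpu-test.yaml -- minimal pod that requests one GPU and prints nvidia-smi output
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test
    spec:
      restartPolicy: Never
      containers:
        - name: cuda
          image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # example image; use a tag available to you
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1

    Save the manifest as gpu-test.yaml, apply it with kubectl apply -f gpu-test.yaml, then check the output with kubectl logs gpu-test and delete the pod afterward.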

  2. Download and modify the Helm chart for your use.

    export NGC_API_KEY=<your_api_key>
    helm fetch https://helm.ngc.nvidia.com/nvidia/riva/charts/riva-api-2.3.0.tgz \
            --username=\$oauthtoken --password=$NGC_API_KEY --untar
    

    The above command creates a new directory called riva-api in your current working directory. Within that directory is a values.yaml file that can be modified to suit your use case (refer to the Kubernetes Secrets and Riva Settings sections below).

  3. After the values.yaml file has been updated to reflect the deployment requirements, Riva can be deployed to the Kubernetes cluster:

    helm install riva-api riva-api
    

    Alternatively, use the --set option to install without modifying the values.yaml file. Ensure you set the NGC API key, email, and model_key_string to the appropriate values. By default, model_key_string is tlt_encode.

    helm install riva-api riva-api \
        --set ngcCredentials.password=`echo -n $NGC_API_KEY | base64 -w0` \
        --set ngcCredentials.email=your_email@your_domain.com \
        --set modelRepoGenerator.modelDeployKey=`echo -n model_key_string | base64 -w0`
    
  4. Helm configuration. The following sections highlight key areas of the values.yaml file and considerations for deployment. Consult the individual service documentation for more details as well as the Helm chart’s values.yaml file, which contains inline comments explaining the configuration options.

Kubernetes Secrets#

The Helm deployment uses multiple Kubernetes secrets for obtaining access to NGC:

  • imagepullsecret: one for Docker images

  • modelpullsecret: one for model artifacts

  • riva-model-deploy-key: one for encrypted models

The names of the secrets can be modified in the values.yaml file; however, if you are deploying into an NVIDIA EGX™ or NVIDIA Fleet Command™ managed environment, your environment must have support for imagepullsecret and modelpullsecret. These secrets are managed by the chart and can be manipulated by setting the respective values in the ngcCredentials section of values.yaml.
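For reference, the corresponding block in values.yaml takes roughly the following shape. The key names mirror the --set flags used in the installation step above; the placeholder values must be filled in, with the password and model deploy key base64 encoded as in that example.

    ngcCredentials:
      password: <base64-encoded NGC API key>       # echo -n $NGC_API_KEY | base64 -w0
      email: your_email@your_domain.com
    modelRepoGenerator:
      modelDeployKey: <base64-encoded model key>   # echo -n tlt_encode | base64 -w0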

Riva Settings#

The values.yaml for Riva is intended to provide maximum flexibility in deployment configurations.

The replicaCount field is used to configure the number of identical instances (or pods) of the services that are deployed. When load-balanced appropriately, increasing this number (as resources permit) enables horizontal scaling for increased load.

Individual speech services (ASR, NLP, or TTS) can be disabled by changing the riva.speechServices.[asr|nlp|tts] key to false.

Prebuilt models not required for your deployment can be deleted from the list in modelRepoGenerator.ngcModelConfigs. We recommend you remove models and disable services that are not used to reduce deployment time and GPU memory usage.
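As an illustration, the settings described above correspond to values.yaml entries along these lines. This is a sketch only; the model names under ngcModelConfigs are placeholders, so keep the entries already present in the chart rather than copying these.

    replicaCount: 1          # number of identical Riva pods; increase to scale horizontally

    riva:
      speechServices:        # set a service to false to disable it
        asr: true
        nlp: true
        tts: false

    modelRepoGenerator:
      ngcModelConfigs:       # delete entries for models you do not need
        - nvidia/riva/rmir_asr_conformer_en_us_str:2.3.0       # placeholder model name
        - nvidia/riva/rmir_tts_fastpitch_hifigan_en_us:2.3.0   # placeholder model name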

By default, models are downloaded from NGC, optimized for TensorRT (if necessary) before the service starts, and stored in a short-lived location. When the pod terminates, these model artifacts are deleted and the storage is freed for other workloads. This behavior is controlled by the modelDeployVolume field. Refer to the Kubernetes Volumes documentation for alternative options that can be used for persistent storage. For scale-out deployments, having a model store shared across pods greatly improves scale-up time since the models are prebuilt and already available to the Riva container.

  • Persistent storage should only be used in homogeneous deployments where the GPU models are identical.

  • Currently, the provided models nearly fill a T4’s memory (16 GB). We recommend running a subset of models/services if using a single GPU.
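For example, to keep prebuilt models in shared, persistent storage rather than the default short-lived volume, modelDeployVolume can reference a PersistentVolumeClaim. This is a sketch under the assumption that the field accepts a standard Kubernetes volume definition; the claim name riva-model-store is hypothetical.

    modelDeployVolume:
      # Default behavior is an ephemeral volume that is deleted with the pod.
      # To share prebuilt models across pods, point this at persistent storage.
      persistentVolumeClaim:
        claimName: riva-model-store   # hypothetical, pre-created PVC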

Ingress Controller#

There is a base configuration for a simple ingress controller using Traefik. This can be configured through the values.yaml file, or it can be replaced with any controller that supports HTTP/2 and gRPC.

Ingress controllers are found in both on-premise and cloud-based deployments. For this to work correctly, you must have functional name resolution using whatever mechanism you prefer (DNS, /etc/hosts files, and so on).

For any sort of multi-pod scaling, you must have a correctly configured ingress controller performing HTTP/2 or gRPC load balancing, including name resolution.

Further details can be found in the ingress: section in the values.yaml file.
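A sketch of that section is shown below. The key names are illustrative only and should be confirmed against the chart's own values.yaml; the hostname is a placeholder and must resolve to your ingress endpoint.

    ingress:
      useIngress: true              # illustrative key names; confirm against the chart
      class: traefik
      hostname: riva.example.com    # placeholder; must be resolvable by clients
      tlsSecret: ""                 # optional Kubernetes TLS secret for SSL/TLS termination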

Load Balancer#

For L2 load balancing, a barebones configuration using MetalLB has been supplied and is located in the loadbalancer: section in the values.yaml file.

This is useful in on-premise deployments; however, cloud-based deployments need to use the appropriate service from their provider, as the networking is generally not exposed at this layer.

More details can be found in the loadbalancer: section in the values.yaml file.
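For context, a minimal MetalLB Layer 2 address pool looks like the following (legacy ConfigMap format; newer MetalLB releases use IPAddressPool and L2Advertisement resources instead). The address range is a placeholder for free addresses on your local network; the chart's loadbalancer: section exposes similar settings.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: metallb-system
      name: config
    data:
      config: |
        address-pools:
        - name: default
          protocol: layer2
          addresses:
          - 192.168.1.240-192.168.1.250   # placeholder range on the local network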

Autoscaling Configurations#

After deploying, you can automatically scale allocated compute resources based on observed utilization. Within the values.yaml file of the Riva Helm chart, replicaCount can be increased to enable the Horizontal Pod Autoscaler. This also requires a correctly configured ingress controller performing HTTP/2 and gRPC load balancing, including name resolution.
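As a sketch, a HorizontalPodAutoscaler targeting the Riva deployment could look like the following. The deployment name and thresholds are assumptions, and scaling on GPU metrics additionally requires a custom metrics pipeline (see the telemetry section below).

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: riva-api
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: riva-api            # assumed name of the deployment created by the chart
      minReplicas: 1
      maxReplicas: 4              # example upper bound
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70    # example threshold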

gRPC Streams with SSL/TLS#

Secure Sockets Layer (SSL) and its successor, Transport Layer Security (TLS), are protocols for establishing secure connections and encrypting the data exchanged between two endpoints; they are highly recommended with HTTP/2 and gRPC. While these protocols add security, they also add overhead. The overhead can be mitigated by establishing a gRPC stream instead of making unary calls where possible. Each time a new gRPC channel is created, there is overhead for exchanging SSL/TLS keys and establishing the TCP and HTTP/2 connections. If the client regularly exchanges messages with the server, a stream avoids repeating this overhead.

Monitoring GPU and CPU Telemetry#

When running a GPU-intensive workload, it is important to monitor hardware telemetry and factor it into your compute jobs; doing so is required to enable Horizontal Pod Autoscaling based on GPU utilization. NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for monitoring NVIDIA data center GPUs in cluster environments. Integrating GPU Telemetry into Kubernetes uses GPU temperatures and other telemetry to increase data center efficiency and minimize resource allocation. It is equally important to monitor other resources, including CPU core utilization or any custom metrics relevant to your use case.
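For instance, once dcgm-exporter and an adapter such as prometheus-adapter expose GPU metrics through the Kubernetes custom metrics API, the autoscaler sketched above could target GPU utilization instead of CPU. The metric name below is the standard dcgm-exporter gauge; wiring it into the metrics API is environment specific.

    # Replacement metrics block for the HorizontalPodAutoscaler sketch above.
    metrics:
      - type: Pods
        pods:
          metric:
            name: DCGM_FI_DEV_GPU_UTIL    # per-pod GPU utilization gauge from dcgm-exporter
          target:
            type: AverageValue
            averageValue: "80"            # example target utilization (percent)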

Load-Balancing Types#

Load-balancing is the process of allocating a fixed set of resources to an arbitrary number of incoming tasks. Most notably, for a scalable server-client application, a load balancer distributes network traffic over a set of nodes. There are several common classes of load-balancing, each with its own pros and cons.

A barebones implementation of Layer 2 (Data Link) load-balancing using MetalLB is provided (but not enabled by default). In this method, one node takes all responsibility for handling traffic, which is then spread to the pods from that node. If that node fails, another node takes over, which acts as a failover mechanism. However, this severely limits bandwidth. Additionally, Layer 2 is usually not exposed by cloud-based providers, in which case this approach is not usable.

Layer 4 (Transport) load-balancing uses network information from the transport layer, such as application ports and protocol, to direct traffic. L4 load-balancing operates on a connection level; however, gRPC uses HTTP/2, which multiplexes multiple calls on a single connection, funneling all calls on that connection to one endpoint.

Layer 7 (Application) load-balancing uses high-level application information, the content of messages, to direct traffic. This generally allows for “smarter” load-balancing decisions, as the algorithm can use additional information. It also does not suffer from the same problem as L4 load-balancing, but it comes at the cost of added latency for gRPC calls.