Kubernetes#

This section is not supported for embedded platforms.

Included in the NGC Helm Repository is a chart designed to automate push-button deployment to a Kubernetes cluster.

The Riva Speech AI Helm Chart deploys the ASR, NLP, and TTS services automatically. The Helm chart performs a number of functions:

  • Pulls Docker images from NGC for the Riva Speech AI server and utility containers for downloading and converting models.

  • Downloads the requested model artifacts from NGC as configured in the values.yaml file.

  • Generates the Triton Inference Server model repository.

  • Starts the Riva Speech AI server as configured in a Kubernetes pod.

  • Exposes the Riva Speech AI server as a configured service.

Examples of pretrained models are released with Riva for each of the services. The Helm chart comes pre-configured for downloading and deploying all of these models.

Important

The Helm chart configuration can be customized for your use case by modifying the values.yaml file. In this file, you can change the settings that control which models to deploy, where to store them, and how to expose the services.

Attention

To deploy Riva, a functioning Kubernetes environment with a GPU (NVIDIA Volta or later) is required. This can be on-premise, in a cloud provider, or within a managed Kubernetes environment, so long as the environment has GPU support enabled.

Installation with Helm#

  1. Validate Kubernetes with NVIDIA GPU support.

    Kubernetes with GPU is well supported by NVIDIA. For more information, refer to the Install Kubernetes instructions to ensure that your environment is properly set up.

    If using an NVIDIA A100 GPU with Multi-Instance GPU (MIG) support, refer to MIG Support in Kubernetes.

  2. Download and modify the Helm chart for your use case.

    export NGC_API_KEY=<your_api_key>
    helm fetch https://helm.ngc.nvidia.com/nvidia/riva/charts/riva-api-2.16.0.tgz \
            --username=\$oauthtoken --password=$NGC_API_KEY --untar
    

    The above command creates a new directory called riva-api in your current working directory. Within that directory is a values.yaml file that can be modified to suit your use case (refer to the Kubernetes Secrets and Riva Settings sections).

  3. After the values.yaml file has been updated to reflect the deployment requirements, Riva can be deployed to the Kubernetes cluster:

    helm install riva-api riva-api
    

    Alternatively, use the --set option to install without modifying the values.yaml file. Ensure you set the NGC API key, email, and model_key_string to the appropriate values. By default, model_key_string is tlt_encode.

    helm install riva-api \
        --set ngcCredentials.password="$(echo -n "$NGC_API_KEY" | base64 -w0)" \
        --set ngcCredentials.email=your_email@your_domain.com \
        --set modelRepoGenerator.modelDeployKey="$(echo -n model_key_string | base64 -w0)" riva-api
    
  4. Review the Helm configuration. The following sections highlight key areas of the values.yaml file and considerations for deployment. For more details, consult the individual service documentation and the Helm chart’s values.yaml file, which contains inline comments explaining the configuration options.
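
The base64 encoding used by the --set options in step 3 can be sketched as a small shell helper. This is a minimal sketch and not part of the Helm chart: the function name b64_oneline is hypothetical, and it strips newlines with tr instead of relying on -w0, on the assumption that some base64 implementations lack that flag.

```shell
# Encode a string as single-line base64, like `echo -n ... | base64 -w0`.
# printf '%s' avoids appending a trailing newline to the input.
b64_oneline() {
    printf '%s' "$1" | base64 | tr -d '\n'
}

# Encode the default model key named above (tlt_encode).
MODEL_KEY_B64=$(b64_oneline "tlt_encode")
echo "modelDeployKey: $MODEL_KEY_B64"
# prints: modelDeployKey: dGx0X2VuY29kZQ==
```

The encoded value can then be passed to helm install with --set modelRepoGenerator.modelDeployKey="$MODEL_KEY_B64".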

Note

Depending on the number of models enabled, a higher failureThreshold value for the startup probe in deployment.yaml might be needed to accommodate the increased startup time.
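
As a rough guide for sizing that value, Kubernetes gives a container approximately failureThreshold multiplied by periodSeconds seconds to pass its startup probe. The sketch below computes the smallest failureThreshold that covers an expected startup time. The function name and the example numbers (a 30-minute startup and a 10-second probe period) are assumptions for illustration, not values taken from the chart.

```shell
# Smallest failureThreshold such that failureThreshold * periodSeconds
# is at least the expected startup time (integer ceiling division).
# Arguments: <expected startup seconds> <probe periodSeconds>
min_failure_threshold() (
    startup_seconds=$1
    period_seconds=$2
    echo $(( (startup_seconds + period_seconds - 1) / period_seconds ))
)

# Hypothetical example: 1800 s of startup with a probe every 10 s.
min_failure_threshold 1800 10
# prints: 180
```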

Kubernetes Secrets#

The Helm deployment uses multiple Kubernetes secrets for obtaining access to NGC:

  • imagepullsecret: one for Docker images

  • modelpullsecret: one for model artifacts

  • riva-model-deploy-key: one for encrypted models

The names of the secrets can be modified in the values.yaml file. However, if you are deploying into an NVIDIA EGX™ or NVIDIA Fleet Command™ managed environment, your environment must have support for imagepullsecret and modelpullsecret. These secrets are managed by the chart and can be changed by setting the respective values in the ngcCredentials section of values.yaml.

Riva Settings#

The values.yaml for Riva is intended to provide maximum flexibility in deployment configurations.

The replicaCount field is used to configure the number of identical instances (or pods) of the services that are deployed. When load-balanced appropriately, increasing this number (as resources permit) enables horizontal scaling for increased load.

The Riva API server and Triton server are deployed as separate pods. In a multi-GPU environment, this allows models to be distributed across GPUs for better utilization. There can be as many Triton server pods as there are available GPUs. The number of Triton server pods is controlled by the number of entries under modelRepoGenerator.ngcModelConfigs. For each Triton config, the enabled flag controls whether the Triton pod is enabled or disabled, the replicaCount field controls the number of replicas, and the models field specifies the list of models to be loaded. By default, a single Triton server pod, modelRepoGenerator.ngcModelConfigs.tritonGroup0, is configured to load the default list of models.

Prebuilt models not required for your deployment can be deleted from the default list in modelRepoGenerator.ngcModelConfigs.tritonGroup0.models. We recommend you remove models that are not used to reduce deployment time and GPU memory usage.
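
The per-group enabled and replicaCount fields described above can also be set from the command line. The following is a minimal sketch that only builds the --set flags. The helper name triton_group_flags is hypothetical, and the key paths are assumed from the field names in this section, so check them against the chart's values.yaml before use.

```shell
# Print the --set flags that enable one Triton group with a replica count.
# Arguments: <group name under ngcModelConfigs> <replica count>
triton_group_flags() (
    prefix="modelRepoGenerator.ngcModelConfigs.$1"
    printf '%s\n' "--set ${prefix}.enabled=true --set ${prefix}.replicaCount=$2"
)

# Print the full command for review rather than running it.
echo "helm install riva-api riva-api $(triton_group_flags tritonGroup0 2)"
```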

By default, models are downloaded from NGC, optimized for TensorRT (if necessary) before the service starts, and stored in a short-lived location. When the pod terminates, these model artifacts are deleted and the storage is freed for other workloads. This behavior is controlled by the modelDeployVolume field. Refer to the Kubernetes Volumes documentation for alternative options that can be used for persistent storage. For scale-out deployments, having a model store shared across pods greatly improves scale-up time since the models are prebuilt and already available to the Riva container.

  • Persistent storage should only be used in homogeneous deployments where GPU models are identical.

  • Currently, the provided models nearly fill a T4’s memory (16 GB). We recommend running a subset of models/services if using a single GPU.

Deploying Custom RMIR Models with Helm#

If you have trained a custom model, you would have generated an .rmir file. Perform the following steps to deploy a custom RMIR model.

  1. Specify the mount point for the model repository. If you are using a host path, update the modelRepoGenerator.modelDeployVolume.hostPath.path parameter in the values.yaml file. The default value is /data/riva, and you can configure it as needed. If you are using an external PVC, set the persistentVolumeClaim.usePVC parameter to true in the values.yaml file and configure the required fields to mount the external PVC directory inside the container. With either method, host path or external PVC, the directory is mounted inside the container as /data.

  2. Create a directory named rmir at /data/rmir inside the model repository volume.

  3. Create a directory under /data/rmir/ for the custom RMIR model. The directory name should follow the <model_name>_v<model_version> format. For example, /data/rmir/custom_asr_model_v1.0, where model_name is custom_asr_model and the version is 1.0.

  4. Copy the custom RMIR model file into the directory created in the previous step, for example, /data/rmir/custom_asr_model_v1.0/model.rmir. The RMIR filename can be <any>.rmir.

  5. Add a model entry in values.yaml under modelRepoGenerator.ngcModelConfigs.tritonGroup0.models. Use <model_name>:<model_version> as the naming format. For example:

    modelRepoGenerator:
      ngcModelConfigs:
        tritonGroup0:
          enabled: true
          models:
          - custom_asr_model:1.0
  6. Configure any other values as required and install the Helm chart as per Installation with Helm.
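
The directory layout from the steps above can be sketched as a shell helper. This is an illustrative sketch rather than a tool shipped with Riva: the function name stage_rmir is hypothetical, and it assumes you run it where the model repository volume is reachable (for the default host path, that would be /data/riva on the node, which appears as /data inside the container).

```shell
# Copy a custom RMIR file into <repo_root>/rmir/<model_name>_v<model_version>/
# and print the matching values.yaml entry (<model_name>:<model_version>).
# Arguments: <repo root> <model name> <model version> <path to .rmir file>
stage_rmir() (
    repo_root=$1 model_name=$2 model_version=$3 rmir_file=$4
    target_dir="${repo_root}/rmir/${model_name}_v${model_version}"
    mkdir -p "$target_dir" || exit 1
    cp "$rmir_file" "$target_dir/" || exit 1
    echo "${model_name}:${model_version}"
)

# Example invocation (paths are assumptions, adjust to your volume):
#   stage_rmir /data/riva custom_asr_model 1.0 ./model.rmir
```

The printed entry is the string to add under modelRepoGenerator.ngcModelConfigs.tritonGroup0.models.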

Enabling Model Cache for Scalable Deployment#

Model cache can be used for scalable deployment. It helps address the following issues:

  • Generating optimized models for the target GPU takes time and increases server startup time.

  • Without model cache, the model repository is stored by default on host storage via modelRepoGenerator.modelDeployVolume.hostPath, or on a Persistent Volume if persistentVolumeClaim.usePVC is set to true. Unless the host storage or Persistent Volume is shareable across nodes, Triton pods cannot be scaled across nodes.

  • Deploying on another cluster requires regenerating the optimized models.

A cloud-based model cache using AWS S3 can speed up deployment and provide flexible scaling by reusing a pre-generated model repository from the cache. Follow the steps below to enable model caching.

Note

Optimized models are GPU-specific, so model cache should be used only with homogeneous GPU nodes.

Prepare Model Cache#

  1. Specify the required models in the Triton group under modelRepoGenerator.ngcModelConfigs.

  2. Initially, set cacheConfig.cacheMode to “ReadWrite” and set cacheConfig.gpuProduct to match the target GPU. This performs a one-time generation of the required models. Setting cacheConfig.gpuProduct ensures that the correct set of models is loaded on the target GPU. The value for cacheConfig.gpuProduct can be obtained by executing the command below on your target cluster.

    kubectl get node --label-columns nvidia.com/gpu.product
    
  3. Set the parameters in awsCredentials as per your AWS S3 bucket configuration. At a minimum, you need to configure the following parameters.

    • defaultRegion

    • accessKeyIdRW - With Read + Write permission

    • secretAccessKeyRW - With Read + Write permission

    • accessKeyIdRO - With Read permission. This can be same as accessKeyIdRW.

    • secretAccessKeyRO - With Read permission. This can be same as secretAccessKeyRW.

  4. Install Riva on the node with the target GPU using the helm install command. This generates optimized models for the target GPU and uploads them to the S3 bucket. This step needs to be done only once, and has to be repeated only if additional models need to be enabled or the target GPU changes.

  5. Uninstall the deployment after it is successful.

Deployment with Model Cache#

  1. Set cacheConfig.cacheMode to “ReadOnly”.

  2. Keep the rest of the configuration the same as in the previous section.

  3. Install Riva on the node with the target GPU. Each Triton server downloads the required models from S3. Host storage or a Persistent Volume is not used.
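
The two phases of the model cache workflow differ only in cacheConfig.cacheMode, which the sketch below makes explicit by building the corresponding --set flags. The helper name cache_flags is hypothetical, and it accepts only the two modes named in this section; the chart may define others. The second argument stands for the nvidia.com/gpu.product label value reported by your cluster.

```shell
# Print the --set flags for one phase of the model cache workflow.
# Arguments: <cache mode: ReadWrite or ReadOnly> <gpu product label value>
cache_flags() (
    case "$1" in
        ReadWrite|ReadOnly) ;;
        *) echo "cacheMode not handled by this sketch: $1" >&2; exit 1 ;;
    esac
    printf '%s\n' "--set cacheConfig.cacheMode=$1 --set cacheConfig.gpuProduct=$2"
)

# One-time cache preparation, then subsequent deployments.
# Replace the placeholder with the value from your cluster.
cache_flags ReadWrite "<gpu_product>"
cache_flags ReadOnly "<gpu_product>"
```

Append the printed flags to helm install riva-api riva-api, along with the awsCredentials settings, for the phase you are running.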

Ingress Controller#

There is a base configuration for a simple ingress controller using Traefik. This can be configured through the values.yaml file, or can be replaced with any controller supporting HTTP/2 and gRPC.

Ingress controllers are found in both on-premise and cloud-based deployments. For this to work correctly, you must have functional name resolution using any mechanism (DNS, /etc/hosts files, and so on).

For any sort of multi-pod scaling, you must have a correctly configured ingress controller performing HTTP/2 or gRPC load balancing including name resolution.

Further details can be found in the ingress: section in the values.yaml file.