Best Practices

Using Your Own Data

Many use cases require training new models or fine-tuning existing ones with new data. In these cases, there are a few best practices to follow. Many of these best practices also apply to inputs at inference time.

  • Use lossless audio formats if possible. The use of lossy codecs such as MP3 can reduce quality.

  • Augment training data. Adding background noise to audio training data can decrease accuracy at first, but it increases robustness.

  • Limit vocabulary size if using scraped text. Many online sources contain typos, ancillary pronouns, and uncommon words. Removing these can improve the language model.

  • Use a minimum sampling rate of 16 kHz if possible, but do not resample existing audio to reach it.
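
As an illustration of the augmentation bullet above, the following is a minimal sketch that mixes white Gaussian noise into a clip at a chosen signal-to-noise ratio. The function names, the 20 dB target, and the synthetic 440 Hz tone are assumptions made for this example only; real pipelines typically mix in recorded background noise rather than synthetic noise.

```python
import math
import random


def add_noise_at_snr(samples, snr_db, seed=None):
    """Return a copy of `samples` with white Gaussian noise added at about `snr_db`."""
    if not samples:
        return []
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for the noise power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def measured_snr_db(clean, noisy):
    """Compute the SNR actually achieved, given the clean and noisy signals."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
SAMPLE_RATE = 16000
clean = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
noisy = add_noise_at_snr(clean, snr_db=20, seed=0)
```

Calling `measured_snr_db(clean, noisy)` afterwards is a quick way to confirm that the augmented clip is close to the intended noise level before it is added to a training set.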

Autoscaling Configurations

After deploying, you can automatically scale allocated compute resources based on observed utilization. In the values.yaml file of the Riva Helm chart, replicaCount can be increased to enable the Horizontal Pod Autoscaler. This also requires a correctly configured ingress controller that performs HTTP/2 and gRPC load balancing, including name resolution.
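
To make the scaling behavior concrete, here is a small sketch of the replica calculation that the Kubernetes documentation gives for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). The function name, the metric values, and the replica bounds are assumptions for illustration and are not taken from the Riva Helm chart.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Within the tolerance band, the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical values here).
    return max(min_replicas, min(max_replicas, desired))


# Two replicas at 90% utilization against a 60% target scale up to three.
scale_up = desired_replicas(2, 90, 60)
# Four replicas at 30% utilization against a 60% target scale down to two.
scale_down = desired_replicas(4, 30, 60)
# 63% against a 60% target is inside the tolerance, so nothing changes.
unchanged = desired_replicas(3, 63, 60)
```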

gRPC Streams with SSL/TLS

Secure Sockets Layer (SSL) and its successor, Transport Layer Security (TLS), are protocols for establishing a secure connection and encrypting the data exchanged between two endpoints; they are highly recommended with HTTP/2 and gRPC. Although encryption adds security, it also adds overhead. Each time a new gRPC channel is created, the client and server must establish a TCP and HTTP/2 connection and exchange SSL/TLS keys. This overhead can be mitigated by establishing a gRPC stream instead of issuing separate unary calls where possible. If the client regularly exchanges messages with the server, a long-lived stream amortizes this setup cost.
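
The effect of connection setup can be estimated with a rough back-of-the-envelope model. The sketch below assumes one round trip for the TCP handshake, one for a full TLS 1.3 handshake, and one round trip per message; the function name and the 20 ms round-trip time are hypothetical, and real deployments will differ (session resumption, pipelined stream messages, and server processing time are all ignored).

```python
def total_latency_ms(num_messages, rtt_ms, tls_handshake_rtts=1, reuse_connection=True):
    """Estimate time spent on network round trips for `num_messages` exchanges.

    Assumed setup cost per connection: 1 RTT for TCP plus `tls_handshake_rtts`
    for TLS. Each message is assumed to cost one further RTT.
    """
    setup_ms = (1 + tls_handshake_rtts) * rtt_ms
    connections = 1 if reuse_connection else num_messages
    return connections * setup_ms + num_messages * rtt_ms


# 100 messages over a link with a 20 ms round-trip time (hypothetical numbers).
per_call = total_latency_ms(100, 20, reuse_connection=False)  # new channel per call
streamed = total_latency_ms(100, 20, reuse_connection=True)   # one long-lived stream
```

Under these assumptions the per-call approach spends 6000 ms on round trips and the single stream spends 2040 ms, so the setup cost dominates when a channel is created for every message.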


TLS support will be added in a future release.

Monitoring GPU and CPU Telemetry

When running a GPU-intensive workload, it is important to monitor hardware telemetry and factor it into your compute jobs; this telemetry is in fact required to enable Horizontal Pod Autoscaling. NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for monitoring NVIDIA data center GPUs in cluster environments. Integrating GPU telemetry into Kubernetes lets you use GPU temperatures and other metrics to increase data center efficiency and minimize resource allocation. It is equally important to monitor other resources, including CPU utilization and any custom metrics relevant to your use case.
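
As a sketch of how such telemetry might be consumed, the example below parses a snippet in the Prometheus text exposition format and averages GPU utilization across devices. The metric names DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_GPU_TEMP are ones exported by DCGM's Prometheus exporter; the sample values, the label set, and the helper names are invented for illustration, and a real setup would scrape these metrics with Prometheus rather than parse them by hand.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; values and labels are made up for this example.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 82
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 64
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 71
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-a"} 66
"""

SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Map each metric name to a list of (labels, value) samples."""
    metrics = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        labels = dict(LABEL.findall(match.group("labels") or ""))
        metrics[match.group("name")].append((labels, float(match.group("value"))))
    return metrics


def mean_value(metrics, name):
    """Average a metric across all reported devices, or None if it is absent."""
    values = [value for _, value in metrics.get(name, [])]
    return sum(values) / len(values) if values else None


metrics = parse_metrics(SAMPLE_METRICS)
avg_util = mean_value(metrics, "DCGM_FI_DEV_GPU_UTIL")
avg_temp = mean_value(metrics, "DCGM_FI_DEV_GPU_TEMP")
```

An average like `avg_util` is the kind of value an autoscaling rule could compare against a target utilization.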

Load Balancing Types

Load balancing is the process of allocating a fixed amount of resources to an arbitrary number of incoming tasks. Most notably for a scalable client-server application, a load balancer distributes network traffic over a set of nodes. There are several common classes of load balancing, each with its own pros and cons.

A barebones implementation of Layer 2 (Data-Link) load balancing using MetalLB is provided (but not enabled by default). In this method, one node takes all responsibility for handling traffic, which is then spread to the pods from that node. If that node fails, another node takes over, so this method acts as a failover mechanism. However, because all traffic passes through a single node, bandwidth is severely limited. Additionally, Layer 2 is usually not exposed by cloud-based providers, in which case this method is not usable.

Layer 4 (Transport) load balancing uses network information from the transport layer, such as application ports and protocol, to direct traffic. L4 load balancing operates at the connection level; however, gRPC uses HTTP/2, which multiplexes multiple calls on a single connection, so an L4 load balancer funnels all calls on that connection to one endpoint.

Layer 7 (Application) load balancing uses high-level application information, namely the content of messages, to direct traffic. This generally allows for “smarter” load balancing decisions, as the algorithm can use additional information. Because it can route individual calls rather than whole connections, it does not funnel every call on a connection to one endpoint as L4 load balancing does, but it comes at the cost of added latency to gRPC calls.
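
The difference between connection-level and call-level balancing can be shown with a toy simulation. The sketch below is not how any particular load balancer is implemented; the backend names, the round-robin policy, and the function names are assumptions chosen to keep the example deterministic.

```python
from collections import Counter
from itertools import cycle

BACKENDS = ["pod-a", "pod-b", "pod-c"]  # hypothetical backend pods


def l4_balance(calls_per_connection):
    """Pick a backend once per connection; every call on it goes to that backend."""
    picker = cycle(BACKENDS)
    load = Counter()
    for calls in calls_per_connection:
        backend = next(picker)
        load[backend] += calls
    return load


def l7_balance(calls_per_connection):
    """Pick a backend for every individual call, regardless of its connection."""
    picker = cycle(BACKENDS)
    load = Counter()
    for calls in calls_per_connection:
        for _ in range(calls):
            load[next(picker)] += 1
    return load


# One long-lived HTTP/2 connection multiplexing nine gRPC calls.
connections = [9]
l4_load = l4_balance(connections)  # all nine calls land on pod-a
l7_load = l7_balance(connections)  # three calls land on each pod
```

With a single multiplexed connection, the connection-level balancer leaves two of the three pods idle, while the call-level balancer spreads the work evenly.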