Best Practices#

Security#

Secure Sockets Layer (SSL) and its successor, Transport Layer Security (TLS), are protocols for authenticating endpoints and encrypting the data exchanged between them, and are highly recommended with HTTP/2 and gRPC. While these protocols add security, they also add overhead. This can be mitigated by establishing a gRPC stream instead of creating unary calls where possible. Each time a new gRPC channel is created, there is overhead required to exchange SSL/TLS keys and to establish the TCP and HTTP/2 connections. If the client regularly sends or receives messages from the server, a stream can cut this overhead.
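As a minimal sketch of this idea, the snippet below pays the connection cost once by creating a single secure channel and reusing it for many calls; the stub class and request loop are hypothetical placeholders, not part of the Riva API.

# pay the SSL/TLS handshake and TCP/HTTP/2 setup cost once
import grpc

with open("my_ssl_cert.crt", "rb") as f:
    creds = grpc.ssl_channel_credentials(f.read())
channel = grpc.secure_channel("localhost:50051", creds)

# stub = MyServiceStub(channel)    # hypothetical generated stub
# for request in requests:         # many calls reuse the one channel
#     response = stub.MyUnaryCall(request)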

Enabling Secure Connections in Server#

Riva servers can be deployed with SSL/TLS certificates to authenticate the Riva server and encrypt all data exchanged between the client and the server. To use SSL/TLS, you must provide the paths to the SSL/TLS certificate and key in the config.sh configuration file of the Quick Start scripts.

If either the ssl_server_cert or ssl_server_key variables are empty, an insecure gRPC connection is used.
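For illustration, the relevant entries in config.sh might look like the following; the file paths are placeholders, and only the ssl_server_cert and ssl_server_key variable names come from the Quick Start configuration.

# config.sh (Quick Start scripts) -- paths shown here are illustrative
ssl_server_cert="/path/to/server.crt"   # SSL/TLS certificate presented by the Riva server
ssl_server_key="/path/to/server.key"    # private key for that certificate
# leaving either variable empty falls back to an insecure gRPC connection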

Connecting to Secured Servers#

To connect to a secured Riva server, clients must provide a certificate in the server's chain of trust. Explicitly, this means that when establishing a connection to the Riva server, the client must be configured with either the certificate deployed on the server or a certificate from the service (for example, a certificate authority) used to generate the server certificate.

When switching gRPC from plain-text to secure connections, you must modify how the communication channel is created. In Python, for example:

# required imports
import grpc

# read the SSL/TLS certificate that is in the server's chain of trust
ssl_cert = "my_ssl_cert.crt"
with open(ssl_cert, 'rb') as f:
    certificates = f.read()

# create channel credentials and establish a secure connection to the Riva server
creds = grpc.ssl_channel_credentials(certificates)
channel = grpc.secure_channel("localhost:50051", creds)

Autoscaling Configurations#

After deploying, you can automatically scale the allocated compute resources based on observed utilization. Within values.yaml of the Riva Helm chart, replicaCount can be increased to enable the Horizontal Pod Autoscaler. This also requires a correctly configured ingress controller that performs HTTP/2 and gRPC load-balancing, including name resolution.
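As an illustration, the replica count is a top-level key in values.yaml; the value shown below is only an example, and the remaining keys in the file are left untouched.

# values.yaml of the Riva Helm chart (excerpt) -- example value only
replicaCount: 3   # number of Riva server pods to run behind the load balancer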

Monitoring GPU and CPU Telemetry#

When running a GPU-intensive workload, it is important to monitor hardware telemetry and factor it into your compute jobs; this is also required to enable Horizontal Pod Autoscaling. NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for monitoring NVIDIA data center GPUs in cluster environments. Integrating GPU Telemetry into Kubernetes uses GPU temperatures and other telemetry to increase data center efficiency and minimize resource allocation. It is equally important to monitor other resources as well, such as CPU core utilization or any custom metrics relevant to your use case.

Riva exposes the Triton Inference Server’s metrics API from both the Quick Start scripts and Helm deployment options. Triton provides Prometheus metrics indicating GPU and request statistics. These metrics are available at http://<hostname>:<metrics_port>/metrics where <metrics_port> is the port number that is mapped to port 8002 inside the Riva container. For example, if Riva is deployed locally and container port 8002 is mapped to port 49183, the metrics are available at http://localhost:49183/metrics.
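With the example port mapping above, the metrics can be scraped with a plain HTTP GET; the 49183 host port is simply the illustrative mapping from the previous paragraph.

# query Triton's Prometheus metrics exposed by the Riva container
curl http://localhost:49183/metrics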

For more information, refer to Triton Inference Server Metrics.

Load-Balancing Types#

Load-balancing is the process of distributing an arbitrary number of incoming tasks over a fixed set of resources. Most notably for a scalable server-client application, a load balancer distributes network traffic over a set of nodes. There are several common classes of load-balancing, each with its own pros and cons.

A barebones implementation of Layer 2 (Data Link) load-balancing using MetalLB is provided (but not enabled by default). In this method, a single node takes responsibility for handling all traffic, which is then spread to the pods from that node; if that node fails, another node takes over, so this acts as a failover mechanism. However, it severely limits bandwidth. Additionally, Layer 2 access is usually not exposed by cloud providers, in which case this method is not usable.

Layer 4 (Transport) load-balancing uses network information from the transport layer, such as application ports and protocol, to direct traffic. L4 load-balancing operates at the connection level; however, gRPC uses HTTP/2, which multiplexes multiple calls on a single connection, so all calls on that connection are funneled to one endpoint.

Layer 7 (Application) load-balancing uses high-level application information, the content of messages, to direct traffic. This generally allows for “smarter” load-balancing decisions, as the algorithm can use additional information. It also does not suffer from the same problem as L4 load-balancing, but it comes at the cost of added latency for gRPC calls.

Optimizing Resource Usage for Embedded Platforms#

On embedded platforms, it is crucial to have models with the lowest possible memory footprint. To achieve this, use a max_batch_size of 1 in the riva-build command when deploying models on embedded platforms. Refer to the ASR Pipeline Configuration and TTS Pipeline Configuration sections, where the riva-build arguments for setting max_batch_size to 1 are explained in detail.
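As a rough sketch only, a build invocation might pass the flag as shown below; the model file names and paths are placeholders, and the exact set of riva-build arguments for your pipeline is documented in the sections referenced above.

# illustrative only -- consult the ASR/TTS Pipeline Configuration sections for the full argument list
riva-build speech_recognition \
    /servicemaker-dev/asr.rmir \
    /servicemaker-dev/asr.riva \
    --max_batch_size=1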

Monitoring the resource usage of the models can be done using the built-in tegrastats utility on Jetson platforms. Launch the utility with the sudo tegrastats --interval 100 command. The following sample output shows where to read the instantaneous usage of each resource.

RAM 7912/15817MB (lfb 55x4MB) SWAP 8/7908MB (cached 0MB) CPU [0%@2265,0%@2265,83%@2265,46%@2265,83%@2265,0%@2265,9%@2265,16%@2265] EMC_FREQ 1%@2133 GR3D_FREQ 62%@1338 APE 150 MTS fg 0% bg 33% AO@41.5C GPU@42.5C Tdiode@43.25C PMIC@50C AUX@40.5C CPU@43.5C thermal@40.95C Tboard@42C GPU 2005/521 CPU 2005/675 SOC 1851/1000 CV 0/0 VDDRQ 462/173 SYS5V 1898/1564

CPU: read the CPU [0%@2265,0%@2265,83%@2265,46%@2265,83%@2265,0%@2265,9%@2265,16%@2265] field, which reports per-core utilization at the core clock frequency in MHz.

GPU: read the GR3D_FREQ 62%@1338 field, which reports GPU utilization at the GPU clock frequency in MHz.

RAM: read the RAM 7912/15817MB field, which reports used and total memory.