
(grpc-load-test-sli-guide)=

gRPC Load Test SLI Guide

This document describes which metrics to watch when load testing a self-hosted NVCF gRPC deployment, what each metric indicates, and how to interpret the saturation sequence. Values are hardware-dependent — what is transferable is the order in which signals appear and what they mean.

For run commands and cluster setup, see {ref}`self-managed-grpc-load-test`.


How Self-Hosted NVCF Handles Load

Understanding the request path helps interpret the metrics:

  • The gRPC proxy holds in-flight requests. It does not reject requests until maxRequestConcurrency is exhausted — it queues them.
  • The worker sidecar is the throughput ceiling. Its concurrency limit (maxRequestConcurrency) and inference time per request set the maximum sustainable req/s.
  • NATS dispatches work between components. It is downstream of the proxy’s internal queue and will not show pressure until the proxy itself is saturated.
  • NVCA only acts on scaling when scale-out is configured (minInstances < maxInstances).
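The second bullet implies a simple capacity bound: one worker's sustainable request rate is its concurrency limit divided by its per-request inference time. A minimal sketch of that arithmetic (the function name and figures are illustrative, not from the deployment config):

```python
# Estimate the throughput ceiling of a single worker, assuming each request
# occupies one concurrency slot for its full inference duration.
def max_sustainable_rps(max_request_concurrency: int,
                        inference_seconds: float) -> float:
    """Upper bound on req/s one worker can sustain."""
    return max_request_concurrency / inference_seconds

# Example: 8 concurrent slots at 250 ms per inference -> 32 req/s ceiling.
print(max_sustainable_rps(8, 0.25))  # -> 32.0
```

If observed throughput plateaus well below this bound, the limit is elsewhere (proxy queuing, network), not the worker itself.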

SLIs to Monitor

Group 1: Leading Indicators

These rise before errors appear. Use them to predict saturation.

nvcf_grpc_proxy_service_active_connections_total

What it is: Number of active worker connections held by the gRPC proxy.

What to look for:

  • Rises with load during healthy operation.
  • Decouples from throughput at the saturation point — connections keep rising while req/s flattens. This is the earliest saturation signal.
```
nvcf_grpc_proxy_service_active_connections_total
```
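The decoupling check can be made mechanical. A hedged sketch in Python; the sample series and the growth thresholds are invented for illustration:

```python
# Detect the point where active connections keep rising while throughput
# flattens -- the decoupling signal described above. Inputs are aligned
# per-interval samples; all values and thresholds here are illustrative.
def decoupling_index(connections, rps, conn_rise_pct=0.10, rps_flat_pct=0.05):
    """Return the first index where connections grew by more than
    conn_rise_pct while req/s grew by less than rps_flat_pct, else None."""
    for i in range(1, len(connections)):
        conn_growth = (connections[i] - connections[i - 1]) / connections[i - 1]
        rps_growth = (rps[i] - rps[i - 1]) / rps[i - 1]
        if conn_growth > conn_rise_pct and rps_growth < rps_flat_pct:
            return i
    return None

conns = [10, 20, 40, 80, 160]      # connections keep doubling
reqs  = [50, 100, 190, 198, 200]   # throughput plateaus
print(decoupling_index(conns, reqs))  # -> 3 (first plateau sample)
```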

nvcf_grpc_proxy_service_session_init_seconds_total (p95)

What it is: Time for the proxy to establish a worker session (first contact for a new connection).

What to look for:

  • Low at idle, rises when the proxy is busy competing for worker slots.
  • A rising p95 means new requests are waiting longer to get a worker session.
  • Check bucket distribution: are requests piling up in the higher latency buckets (>100ms, >250ms)?
```
histogram_quantile(0.95,
  rate(nvcf_grpc_proxy_service_session_init_seconds_total_bucket{is_reconnect="false"}[1m]))
```
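To make the bucket guidance concrete, here is a minimal Python sketch of the interpolation idea behind histogram_quantile over cumulative Prometheus buckets (the bucket bounds and counts are invented, and counter resets are ignored):

```python
# Estimate a quantile from cumulative histogram buckets by linear
# interpolation inside the first bucket that covers the target rank --
# the same idea histogram_quantile() uses.
def bucket_quantile(q, buckets):
    """buckets: sorted list of (upper_bound_seconds, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # interpolate within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Illustrative session-init buckets: most requests under 100 ms, with a
# tail piling into the >100 ms and >250 ms buckets under load.
buckets = [(0.05, 60), (0.1, 80), (0.25, 95), (0.5, 100)]
print(bucket_quantile(0.95, buckets))  # p95 lands in the 100-250 ms bucket
```

If the p95 lands in the upper buckets like this, the proxy is contending for worker slots even though the median may still look healthy.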

Group 2: Throughput and Capacity

function_request_total

What it is: Cumulative completed requests for a specific function, scraped from the gRPC Proxy (job=grpc). Filter by function_id to isolate a single function’s throughput. Labels: function_id, function_version_id, nca_id.

What to look for:

  • rate(function_request_total[1m]) gives req/s. Plot alongside VU count.
  • Throughput plateau = capacity wall. If req/s stops growing while VUs keep increasing, the system is saturated.
```
rate(function_request_total{job="grpc", function_id="<your-function-id>"}[1m])
```
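What rate() computes over this cumulative counter can be sketched in a few lines of Python (sample values are illustrative, and counter resets are ignored for simplicity):

```python
# Per-second increase of a cumulative counter over a window -- the core
# of what rate(function_request_total[1m]) reports.
def counter_rate(samples):
    """samples: list of (unix_seconds, cumulative_count), oldest first."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# 600 completed requests over a 60 s window -> 10 req/s
print(counter_rate([(0, 1200), (60, 1800)]))  # -> 10.0
```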

nvca_instance_type_allocatable

What it is: Available worker slots in the cluster fleet.

What to look for:

  • Drops as workers are allocated to new deployments.
  • If allocatable reaches 0 on a fixed cluster, new worker deployments will fail with a no-capacity error.
```
nvca_instance_type_allocatable{instance_type="<your-instance-type>"}
```

Group 3: Lagging Indicators

These confirm saturation after it has occurred. They are not useful for early warning, but they pinpoint the failure mode. k6 is the primary source for these signals.

grpc_req_duration p95 (k6)

What it is: End-to-end gRPC request latency measured by k6.

What to look for:

  • Rises steeply after the throughput plateau.
  • Use p95 > 5s as a lagging SLO threshold. By the time it rises, the capacity wall has already been hit.

k6 metric: grpc_req_duration (watch p90, p95 in k6 Cloud)

grpc_req_failed (k6)

What it is: k6 metric tracking the rate of failed gRPC requests.

What to look for:

  • Stays near zero through moderate overload. The proxy holds connections and queues requests rather than rejecting them — failures only appear once requests have been held long enough to hit the k6 client timeout.

  • Non-zero grpc_req_failed is a breaking-point signal, not an early warning. By the time it rises, the system is well past the capacity wall.

  • Error type matters:

    • context deadline exceeded — overload, expected at extreme VU counts.
    • UNAVAILABLE or connection errors — proxy or network issue unrelated to capacity.

k6 metric: grpc_req_failed (rate or count in k6 Cloud)
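The error-type triage above can be expressed as a small classifier. A sketch only; the substring matching and the labels are assumptions for illustration, not a k6 API:

```python
# Classify k6 gRPC failure messages per the guidance above: deadline
# errors indicate overload; UNAVAILABLE/connection errors point at the
# proxy or network rather than capacity.
def classify_failure(error_message: str) -> str:
    msg = error_message.lower()
    if "context deadline exceeded" in msg or "deadlineexceeded" in msg:
        return "overload"             # expected at extreme VU counts
    if "unavailable" in msg or "connection" in msg:
        return "proxy/network issue"  # unrelated to capacity
    return "unknown"

print(classify_failure("rpc error: code = DeadlineExceeded desc = context deadline exceeded"))
# -> overload
```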

function_request_latency p95 (worker-side)

What it is: Per-request latency as measured by the worker itself. The time spent inside the function from the moment the worker picks up the request.

What to look for:

  • Complements grpc_req_duration (client-side). If k6 p95 is high but worker p95 is low, the bottleneck is queuing at the proxy, not inference time.
  • Rising worker latency under load indicates the worker itself is the throughput ceiling.
```
histogram_quantile(0.95, rate(function_request_latency_bucket[1m]))
```
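Comparing the two p95s to localize the bottleneck, as described in the first bullet, can be sketched like this (the 2x queue factor is an illustrative assumption, not a documented threshold):

```python
# Localize the bottleneck by comparing client-side (k6) and worker-side
# p95 latency: a large gap means time is spent queuing at the proxy.
def locate_bottleneck(client_p95_s: float, worker_p95_s: float,
                      queue_factor: float = 2.0) -> str:
    """Rough classification; queue_factor is an assumed heuristic."""
    if client_p95_s > queue_factor * worker_p95_s:
        return "proxy queuing"    # requests wait for a worker slot
    return "worker inference"     # time is spent inside the function

print(locate_bottleneck(client_p95_s=6.0, worker_p95_s=0.4))  # -> proxy queuing
print(locate_bottleneck(client_p95_s=1.1, worker_p95_s=0.9))  # -> worker inference
```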

Group 4: Stability Signals

These should remain at zero during a clean load test. Any non-zero value warrants investigation.

| Metric | Threshold | What it means |
| --- | --- | --- |
| nvcf_grpc_proxy_service_nats_error_total | > 0 | Proxy lost connectivity to NATS |
| nvcf_grpc_proxy_service_nats_reconnect_total | > 0 | NATS connection instability |
| nvca_event_error_total{nvca_event_name="TICK_ACKNOWLEDGE_REQUEST"} | > 0 | NVCA failing to acknowledge worker heartbeats |
| nvca_container_crash_total | > 0 | Worker pod OOM or crash |
| nvca_controller_runtime_reconcile_errors_total | > 0 | k8s controller errors in NVCA |
| nvca_event_queue_length | sustained > 0 | NVCA falling behind processing heartbeat/scaling events |

NATS JetStream

NATS is the message bus between the gRPC proxy and the worker.

Early-warning signal: nvcf_grpc_proxy_service_active_connections_total decoupling from throughput is still the earliest proxy-side saturation indicator.

Envoy Gateway

Useful envoy signals during a gRPC test:

```
# Active downstream connections on the gRPC listener
envoy_listener_downstream_cx_active{envoy_listener_address="0.0.0.0_10081"}

# Overflow -- TCP connection ceiling hit (should stay 0 unless saturated)
envoy_listener_downstream_cx_overflow{envoy_listener_address="0.0.0.0_10081"}

# Envoy pod restart count
sum(increase(kube_pod_container_status_restarts_total{namespace="envoy-gateway-system"}[$__range]))
```

The Saturation Sequence

Regardless of hardware, saturation follows this order:

1. active_connections_total rises with load
2. active_connections_total growth decouples from throughput ← LEADING SIGNAL
3. Throughput (req/s) plateaus despite more VUs ← CAPACITY WALL
4. grpc_req_duration p95 rises steeply ← LAGGING SIGNAL
5. Client timeouts (context deadline exceeded) ← FAILURE VISIBLE TO CLIENTS

Steps 1-3 are observable before errors reach clients. Steps 4-5 confirm saturation is underway.


Alert Thresholds

These are relative thresholds to calibrate against your baseline, not absolute values. Hardware, workload, and deployment configuration all affect where these numbers land.

| Signal | Threshold | Action |
| --- | --- | --- |
| nvcf_grpc_proxy_service_active_connections_total | > 50% of maxRequestConcurrency, sustained 2 min | Warning: approaching saturation |
| nvcf_grpc_proxy_service_active_connections_total | > 80% of maxRequestConcurrency, sustained 1 min | Critical: at capacity |
| Throughput plateau | req/s flat while VUs still increasing | Capacity wall reached |
| session_init_seconds p95 | > 100ms | Proxy contention; investigate |
| nats_error_total | > 0 | Immediate investigation |
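The two connection-count rows above can be turned into a simple utilization check. A sketch in Python; the 50%/80% fractions come from the table, while the sustained-duration conditions (1-2 min) are omitted for brevity:

```python
# Map active connection count to the warning/critical levels from the
# threshold table, as a fraction of maxRequestConcurrency. Note: the
# table also requires the level to be *sustained*; this one-shot check
# ignores duration.
def connection_alert_level(active_connections: int,
                           max_request_concurrency: int) -> str:
    utilization = active_connections / max_request_concurrency
    if utilization > 0.8:
        return "critical"   # at capacity
    if utilization > 0.5:
        return "warning"    # approaching saturation
    return "ok"

print(connection_alert_level(45, 64))  # -> warning (about 70% utilization)
print(connection_alert_level(60, 64))  # -> critical
```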