gRPC Load Test SLI Guide#
This document describes which metrics to watch when load testing a self-hosted NVCF gRPC deployment, what each metric indicates, and how to interpret the saturation sequence. Values are hardware-dependent – what is transferable is the order in which signals appear and what they mean.
For run commands and cluster setup, see Self-Managed NVCF gRPC Load Test.
How Self-Hosted NVCF Handles Load#
Understanding the request path helps interpret the metrics:
The gRPC proxy holds in-flight requests. It does not reject requests until maxRequestConcurrency is exhausted – it queues them.
The worker sidecar is the throughput ceiling. Its concurrency limit (maxRequestConcurrency) and inference time per request set the maximum sustainable req/s.
NATS dispatches work between components. It is downstream of the proxy's internal queue and will not show pressure until the proxy itself is saturated.
NVCA only acts on scaling when scale-out is configured (minInstances < maxInstances).
SLIs to Monitor#
Group 1: Leading Indicators#
These rise before errors appear. Use them to predict saturation.
nvcf_grpc_proxy_service_active_connections_total#
What it is: Number of active worker connections held by the gRPC proxy.
What to look for:
Rises with load during healthy operation.
Decouples from throughput at the saturation point – connections keep rising while req/s flattens. This is the earliest saturation signal.
nvcf_grpc_proxy_service_active_connections_total
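The decoupling check can be made mechanical. Below is a minimal sketch, assuming you have scraped paired samples of active connections and req/s at a fixed interval; the heuristic (connections still rising while throughput growth has flattened) and the `flat_tol` threshold are illustrative choices, not official rules.

```python
def decoupled(connections, rps, window=3, flat_tol=0.05):
    """Flag the leading saturation signal: active connections keep
    rising while req/s growth has flattened over the last `window`
    samples. `connections` and `rps` are equal-length sample lists."""
    if len(rps) < window + 1:
        return False
    # Connections strictly increasing over the trailing window.
    conn_growing = all(
        connections[i] < connections[i + 1]
        for i in range(len(connections) - window - 1, len(connections) - 1)
    )
    # Relative throughput growth over the same window.
    rps_growth = (rps[-1] - rps[-window - 1]) / max(rps[-window - 1], 1e-9)
    return conn_growing and rps_growth < flat_tol
```

During a healthy ramp both series rise together and the check stays false; it trips only when connections climb against a flat req/s curve.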
nvcf_grpc_proxy_service_session_init_seconds_total (p95)#
What it is: Time for the proxy to establish a worker session (first contact for a new connection).
What to look for:
Low at idle, rises when the proxy is busy competing for worker slots.
A rising p95 means new requests are waiting longer to get a worker session.
Check bucket distribution: are requests piling up in the higher latency buckets (>100ms, >250ms)?
histogram_quantile(0.95,
rate(nvcf_grpc_proxy_service_session_init_seconds_total_bucket{is_reconnect="false"}[1m]))
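To see why a rising p95 corresponds to requests piling into the higher latency buckets, here is a minimal re-implementation of Prometheus-style `histogram_quantile` over cumulative buckets (the bucket bounds in the example are hypothetical, not the deployment's actual buckets):

```python
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted ascending,
    ending with (float('inf'), total). Linearly interpolates inside the
    bucket that crosses the target rank, as Prometheus does."""
    total = buckets[-1][1]
    if total == 0:
        return float('nan')
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # quantile falls in +Inf: return last finite bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound
```

With 1000 observations of which 900 fall under 50ms, the p95 already sits at the 100ms bound; any further shift of mass into the >100ms and >250ms buckets drags it up fast.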
Group 2: Throughput and Capacity#
function_request_total#
What it is: Cumulative completed requests for a specific function, scraped
from the gRPC Proxy (job=grpc). Filter by function_id to isolate a
single function’s throughput. Labels: function_id, function_version_id,
nca_id.
What to look for:
rate(function_request_total[1m]) gives req/s. Plot alongside VU count.
Throughput plateau = capacity wall. If req/s stops growing while VUs keep increasing, the system is saturated.
rate(function_request_total{job="grpc", function_id="<your-function-id>"}[1m])
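Since `function_request_total` is a cumulative counter, req/s is derived from per-scrape deltas. A rough sketch of what `rate()` computes over a window, including counter-reset handling (real Prometheus additionally extrapolates to the window boundaries):

```python
def counter_rate(samples, interval_s):
    """samples: consecutive scrapes of a cumulative counter at a fixed
    interval. Returns the per-second rate over the window, treating any
    decrease as a counter reset (process restart), as Prometheus does."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # On a reset the counter restarted from 0, so the whole new
        # value counts as increase.
        increase += cur - prev if cur >= prev else cur
    return increase / (interval_s * (len(samples) - 1))
```

For example, scrapes of 100, 160, 220 at 60s intervals yield 1 req/s; a mid-window restart (100, 160, 40) still produces a sane positive rate instead of a negative one.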
nvca_instance_type_allocatable#
What it is: Available worker slots in the cluster fleet.
What to look for:
Drops as workers are allocated to new deployments.
If allocatable reaches 0 on a fixed cluster, new worker deployments will fail with a no-capacity error.
nvca_instance_type_allocatable{instance_type="<your-instance-type>"}
Group 3: Lagging Indicators#
These confirm saturation after it has occurred. They are not useful for early warning, but they pin down the failure mode. k6 is the primary source for these signals.
grpc_req_duration p95 (k6)#
What it is: End-to-end gRPC request latency measured by k6.
What to look for:
Rises steeply after the throughput plateau.
Use p95 > 5s as a lagging SLO threshold. By the time it rises, the capacity wall has already been hit.
k6 metric: grpc_req_duration (watch p90, p95 in k6 Cloud)
grpc_req_failed (k6)#
What it is: k6 metric tracking the rate of failed gRPC requests.
What to look for:
Stays near zero through moderate overload. The proxy holds connections and queues requests rather than rejecting them – failures only appear once requests have been held long enough to hit the k6 client timeout.
Non-zero grpc_req_failed is a breaking-point signal, not an early warning. By the time it rises, the system is well past the capacity wall.
Error type matters:
context deadline exceeded – overload, expected at extreme VU counts.
UNAVAILABLE or connection errors – proxy or network issue unrelated to capacity.
k6 metric: grpc_req_failed (rate or count in k6 Cloud)
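When triaging a failed run, the error-type distinction above can be applied to the error strings k6 reports. A small illustrative helper (the function name and the substring matching are assumptions, not part of k6):

```python
def classify_failure(error_msg):
    """Split a gRPC failure message into the two classes that matter
    here: 'overload' (expected past the capacity wall) versus
    'infrastructure' (proxy/network problem unrelated to capacity)."""
    msg = error_msg.lower()
    if "context deadline exceeded" in msg or "deadline_exceeded" in msg:
        return "overload"        # client timeout: request held too long
    if "unavailable" in msg or "connection" in msg:
        return "infrastructure"  # proxy or network issue
    return "unknown"
```

A run that ends in mostly "overload" classifications found the capacity wall as intended; "infrastructure" errors mean the result does not measure capacity at all and the run should be repeated after fixing the proxy or network.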
function_request_latency p95 (worker-side)#
What it is: Per-request latency as measured by the worker itself. The time spent inside the function from the moment the worker picks up the request.
What to look for:
Complements grpc_req_duration (client-side). If k6 p95 is high but worker p95 is low, the bottleneck is queuing at the proxy, not inference time.
Rising worker latency under load indicates the worker itself is the throughput ceiling.
histogram_quantile(0.95, rate(function_request_latency_bucket[1m]))
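The client-versus-worker comparison can be expressed as a simple rule. A sketch, where the function name and the ratio threshold are illustrative assumptions:

```python
def locate_bottleneck(client_p95_s, worker_p95_s, queue_ratio=2.0):
    """Compare the k6 (client-side) p95 with the worker-side
    function_request_latency p95. If the client sees much more latency
    than the worker spends on inference, the gap is time spent queued
    at the proxy waiting for a worker slot."""
    if worker_p95_s > 0 and client_p95_s / worker_p95_s >= queue_ratio:
        return "proxy-queuing"    # requests wait at the proxy
    return "worker-inference"     # the worker itself is the ceiling
```

For instance, a client p95 of 4s against a worker p95 of 0.5s points at proxy queuing, while 0.6s against 0.5s points at inference time.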
Group 4: Stability Signals#
These should remain at zero during a clean load test. Any non-zero value warrants investigation.
| Metric | Threshold | What it means |
|---|---|---|
|  | > 0 | Proxy lost connectivity to NATS |
|  | > 0 | NATS connection instability |
|  | > 0 | NVCA failing to acknowledge worker heartbeats |
|  | > 0 | Worker pod OOM or crash |
|  | > 0 | k8s controller errors in NVCA |
|  | sustained > 0 | NVCA falling behind processing heartbeat/scaling events |
NATS JetStream#
NATS is the message bus between the gRPC proxy and the worker.
Early-warning signal: nvcf_grpc_proxy_service_active_connections_total
decoupling from throughput is still the earliest proxy-side saturation indicator.
Envoy Gateway#
Useful envoy signals during a gRPC test:
# Active downstream connections on the gRPC listener
envoy_listener_downstream_cx_active{envoy_listener_address="0.0.0.0_10081"}
# Overflow -- TCP connection ceiling hit (should stay 0 unless saturated)
envoy_listener_downstream_cx_overflow{envoy_listener_address="0.0.0.0_10081"}
# Envoy pod restart count
sum(increase(kube_pod_container_status_restarts_total{namespace="envoy-gateway-system"}[$__range]))
The Saturation Sequence#
Regardless of hardware, saturation follows this order:
1. active_connections_total rises with load
↓
2. active_connections_total growth decouples from throughput ← LEADING SIGNAL
↓
3. Throughput (req/s) plateaus despite more VUs ← CAPACITY WALL
↓
4. grpc_req_duration p95 rises steeply ← LAGGING SIGNAL
↓
5. Client timeouts (context deadline exceeded) ← FAILURE VISIBLE TO CLIENTS
Steps 1-3 are observable before errors reach clients. Steps 4-5 confirm saturation is underway.
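The sequence above can be encoded as a stage classifier for dashboards or automated run analysis. A minimal sketch; the boolean inputs are derived from the Group 1-3 metrics using deployment-specific thresholds (see Recommended Thresholds below), and the function itself is illustrative:

```python
def saturation_stage(conns_rising, decoupled, rps_plateau, p95_rising, timeouts):
    """Map observed signals to the furthest stage of the saturation
    sequence that is currently active. Later stages imply the earlier
    ones, so check from the end of the sequence backwards."""
    if timeouts:
        return 5  # failure visible to clients
    if p95_rising:
        return 4  # lagging signal
    if rps_plateau:
        return 3  # capacity wall
    if decoupled:
        return 2  # leading signal
    if conns_rising:
        return 1  # healthy ramp
    return 0      # idle
```

Stages 1-3 are where a test should be stopped if the goal is to find capacity without pushing errors to clients.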
Recommended Thresholds#
These are relative thresholds to calibrate against your baseline – not absolute values. Hardware, workload, and deployment configuration all affect where these numbers land.
| Signal | Threshold | Action |
|---|---|---|
|  | > 50% of  | Warning: approaching saturation |
|  | > 80% of  | Critical: at capacity |
| Throughput plateau | req/s flat while VUs still increasing | Capacity wall reached |
|  | > 100ms | Proxy contention – investigate |
|  | > 0 | Immediate investigation |