(grpc-load-test-sli-guide)=
# gRPC Load Test SLI Guide
This document describes which metrics to watch when load testing a self-hosted NVCF gRPC deployment, what each metric indicates, and how to interpret the saturation sequence. Values are hardware-dependent — what is transferable is the order in which signals appear and what they mean.
For run commands and cluster setup, see {ref}`self-managed-grpc-load-test`.
## How Self-Hosted NVCF Handles Load
Understanding the request path helps interpret the metrics:
- The gRPC proxy holds in-flight requests. It does not reject requests until
  `maxRequestConcurrency` is exhausted — it queues them.
- The worker sidecar is the throughput ceiling. Its concurrency limit
  (`maxRequestConcurrency`) and inference time per request set the maximum
  sustainable req/s (see the worked example after this list).
- NATS dispatches work between components. It is downstream of the proxy's
  internal queue and will not show pressure until the proxy itself is
  saturated.
- NVCA only acts on scaling when scale-out is configured
  (`minInstances < maxInstances`).
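
As a rough, illustrative capacity check (the concrete numbers here are assumptions, not measured values): with `maxRequestConcurrency = 8` and a mean inference time of 250 ms per request, the sustainable ceiling is

$$
\text{max req/s} \approx \frac{\text{maxRequestConcurrency}}{t_{\text{inference}}} = \frac{8}{0.25\ \text{s}} = 32
$$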
## SLIs to Monitor
### Group 1: Leading Indicators
These rise before errors appear. Use them to predict saturation.
#### `nvcf_grpc_proxy_service_active_connections_total`

**What it is:** Number of active worker connections held by the gRPC proxy.

**What to look for:**
- Rises with load during healthy operation.
- Decouples from throughput at the saturation point — connections keep rising while req/s flattens. This is the earliest saturation signal.
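
A minimal PromQL sketch for watching this divergence, assuming both metrics are scraped into Prometheus:

```promql
# Proxy-held worker connections (leading indicator)
nvcf_grpc_proxy_service_active_connections_total

# Completed-request throughput in req/s; saturation is the point where
# connections keep climbing while this curve flattens
rate(function_request_total[1m])
```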
#### `nvcf_grpc_proxy_service_session_init_seconds_total` (p95)

**What it is:** Time for the proxy to establish a worker session (first
contact for a new connection).

**What to look for:**
- Low at idle, rises when the proxy is busy competing for worker slots.
- A rising p95 means new requests are waiting longer to get a worker session.
- Check the bucket distribution: are requests piling up in the higher-latency
  buckets (>100ms, >250ms)?
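
One way to compute the p95, assuming the metric is exported as a standard Prometheus histogram (the `_bucket` series name below is inferred from the metric name, not confirmed):

```promql
# p95 session-init latency over a 5-minute window
histogram_quantile(0.95,
  sum by (le) (rate(nvcf_grpc_proxy_service_session_init_seconds_bucket[5m])))
```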
### Group 2: Throughput and Capacity
#### `function_request_total`

**What it is:** Cumulative completed requests for a specific function, scraped
from the gRPC proxy (`job=grpc`). Filter by `function_id` to isolate a single
function's throughput. Labels: `function_id`, `function_version_id`, `nca_id`.

**What to look for:**
- `rate(function_request_total[1m])` gives req/s. Plot it alongside the VU count.
- Throughput plateau = capacity wall. If req/s stops growing while VUs keep
  increasing, the system is saturated.
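
A per-function variant using the labels listed above; `<FUNCTION_ID>` is a placeholder to replace with a real ID:

```promql
# Throughput for a single function, isolated by its function_id label
rate(function_request_total{job="grpc", function_id="<FUNCTION_ID>"}[1m])
```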
#### `nvca_instance_type_allocatable`

**What it is:** Available worker slots in the cluster fleet.

**What to look for:**
- Drops as workers are allocated to new deployments.
- If allocatable reaches 0 on a fixed cluster, new worker deployments will
  fail with a no-capacity error (see the alert-style expression below).
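
A minimal alert-style expression for the zero-capacity condition (a sketch; add label matchers to scope it to your fleet as needed):

```promql
# Fires when an instance type has no worker slots left
nvca_instance_type_allocatable == 0
```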
### Group 3: Lagging Indicators

These confirm saturation after it has occurred. They are not useful for early
warning, but they confirm the failure mode. k6 is the primary source for these
signals.
#### `grpc_req_duration` p95 (k6)

**What it is:** End-to-end gRPC request latency measured by k6.

**What to look for:**
- Rises steeply after the throughput plateau.
- Use p95 > 5s as a lagging SLO threshold. By the time it rises, the capacity wall has already been hit.
k6 metric: `grpc_req_duration` (watch p90, p95 in k6 Cloud)
#### `grpc_req_failed` (k6)

**What it is:** k6 metric tracking the rate of failed gRPC requests.

**What to look for:**
- Stays near zero through moderate overload. The proxy holds connections and
  queues requests rather than rejecting them — failures only appear once
  requests have been held long enough to hit the k6 client timeout.
- A non-zero `grpc_req_failed` is a breaking-point signal, not an early
  warning. By the time it rises, the system is well past the capacity wall.
- Error type matters: `context deadline exceeded` means overload, expected at
  extreme VU counts; `UNAVAILABLE` or connection errors indicate a proxy or
  network issue unrelated to capacity.
k6 metric: `grpc_req_failed` (rate or count in k6 Cloud)
#### `function_request_latency` p95 (worker-side)

**What it is:** Per-request latency as measured by the worker itself: the time
spent inside the function from the moment the worker picks up the request.

**What to look for:**
- Complements `grpc_req_duration` (client-side). If k6 p95 is high but worker
  p95 is low, the bottleneck is queuing at the proxy, not inference time.
- Rising worker latency under load indicates the worker itself is the
  throughput ceiling.
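
If the worker exports this as a Prometheus histogram (an assumption; the exact series names are not confirmed here), the worker-side p95 can be computed the same way as the session-init quantile above:

```promql
# Worker-side p95 per-request latency over a 5-minute window
histogram_quantile(0.95,
  sum by (le) (rate(function_request_latency_bucket[5m])))
```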
### Group 4: Stability Signals
These should remain at zero during a clean load test. Any non-zero value warrants investigation.
#### NATS JetStream

NATS is the message bus between the gRPC proxy and the worker.

Early-warning signal: `nvcf_grpc_proxy_service_active_connections_total`
decoupling from throughput is still the earliest proxy-side saturation
indicator.
#### Envoy Gateway

Useful Envoy signals during a gRPC test:
## The Saturation Sequence
Regardless of hardware, saturation follows this order:

1. `nvcf_grpc_proxy_service_active_connections_total` decouples from throughput.
2. `nvcf_grpc_proxy_service_session_init_seconds_total` p95 rises.
3. `rate(function_request_total[1m])` plateaus at the capacity wall.
4. `grpc_req_duration` p95 (k6) rises steeply.
5. `grpc_req_failed` (k6) becomes non-zero.

Steps 1-3 are observable before errors reach clients. Steps 4-5 confirm
saturation is underway.
## Recommended Thresholds
These are relative thresholds to calibrate against your baseline — not absolute values. Hardware, workload, and deployment configuration all affect where these numbers land.