(grpc-load-test-sli-guide)=
This document describes which metrics to watch when load testing a self-hosted NVCF gRPC deployment, what each metric indicates, and how to interpret the saturation sequence. Values are hardware-dependent — what is transferable is the order in which signals appear and what they mean.
For run commands and cluster setup, see {ref}self-managed-grpc-load-test.
Understanding the request path helps interpret the metrics:
maxRequestConcurrency is exhausted — it queues them.maxRequestConcurrency) and inference time per request set the maximum
sustainable req/s.minInstances < maxInstances).These rise before errors appear. Use them to predict saturation.
nvcf_grpc_proxy_service_active_connections_totalWhat it is: Number of active worker connections held by the gRPC proxy.
What to look for:
nvcf_grpc_proxy_service_session_init_seconds_total (p95)What it is: Time for the proxy to establish a worker session (first contact for a new connection).
What to look for:
function_request_totalWhat it is: Cumulative completed requests for a specific function, scraped
from the gRPC Proxy (job=grpc). Filter by function_id to isolate a
single function’s throughput. Labels: function_id, function_version_id,
nca_id.
What to look for:
rate(function_request_total[1m]) gives req/s. Plot alongside VU count.nvca_instance_type_allocatableWhat it is: Available worker slots in the cluster fleet.
What to look for:
These confirm saturation after it has occurred. Not useful for early warning, but confirm the failure mode. k6 is the primary source for these signals.
grpc_req_duration p95 (k6)What it is: End-to-end gRPC request latency measured by k6.
What to look for:
k6 metric: grpc_req_duration (watch p90, p95 in k6 Cloud)
grpc_req_failed (k6)What it is: k6 metric tracking the rate of failed gRPC requests.
What to look for:
Stays near zero through moderate overload. The proxy holds connections and queues requests rather than rejecting them — failures only appear once requests have been held long enough to hit the k6 client timeout.
Non-zero grpc_req_failed is a breaking-point signal, not an early
warning. By the time it rises, the system is well past the capacity wall.
Error type matters:
context deadline exceeded — overload, expected at extreme VU counts.UNAVAILABLE or connection errors — proxy or network issue unrelated
to capacity.k6 metric: grpc_req_failed (rate or count in k6 Cloud)
function_request_latency p95 (worker-side)What it is: Per-request latency as measured by the worker itself. The time spent inside the function from the moment the worker picks up the request.
What to look for:
grpc_req_duration (client-side). If k6 p95 is high but
worker p95 is low, the bottleneck is queuing at the proxy, not inference time.These should remain at zero during a clean load test. Any non-zero value warrants investigation.
NATS is the message bus between the gRPC proxy and the worker.
Early-warning signal: nvcf_grpc_proxy_service_active_connections_total
decoupling from throughput is still the earliest proxy-side saturation indicator.
Useful envoy signals during a gRPC test:
Regardless of hardware, saturation follows this order:
Steps 1-3 are observable before errors reach clients. Steps 4-5 confirm saturation is underway.
These are relative thresholds to calibrate against your baseline — not absolute values. Hardware, workload, and deployment configuration all affect where these numbers land.