(http-load-test-sli-guide)=
This document describes which metrics to watch when load testing a self-hosted NVCF deployment using direct HTTP invocations, what each metric indicates, and how to interpret the failure sequence. Values are hardware-dependent — what is transferable is the order in which signals appear and what they mean.
For run commands and cluster setup, see {ref}`self-managed-http-load-test`.
Host header and is in
the return path for responses. It does not queue — requests pass through
immediately or are rejected. Envoy enforces TCP connection timeouts: if a
connection is held open beyond the LB timeout, Envoy closes the socket
directly, producing an EOF error at the client with no HTTP status code.

`maxRequestConcurrency`) and inference time per request set the maximum
sustainable req/s.

`minInstances < maxInstances`).

Primary data source: the Invocation Service (Prometheus `job=invocation`).
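As a rough back-of-envelope for the capacity ceiling set by worker concurrency and inference time, the sketch below applies Little's law: concurrent slots divided by time per request. The function name, the `instances` parameter, and the example numbers are hypothetical and chosen only for illustration. The real ceiling depends on your hardware and should be confirmed by measurement.

```python
def max_sustainable_rps(instances: int,
                        max_request_concurrency: int,
                        inference_seconds: float) -> float:
    """Estimate an upper bound on sustainable req/s.

    Assumes every worker slot is always busy and every request takes
    the same inference time (an idealisation, not a measured value).
    """
    if inference_seconds <= 0:
        raise ValueError("inference_seconds must be positive")
    total_slots = instances * max_request_concurrency
    return total_slots / inference_seconds


# Illustrative numbers only, not taken from any real deployment:
# 4 instances x 2 concurrent requests each, 0.5 s per request.
print(max_sustainable_rps(instances=4,
                          max_request_concurrency=2,
                          inference_seconds=0.5))  # prints 16.0
```

Offered load above this estimate cannot be drained by the workers, so it is a useful reference line when reading the request-rate and pending-request graphs.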
These rise before errors appear. Use them to predict saturation.
`axum_http_requests_total` (rate)

What it is: Request rate at the IS per method and path.

What to look for:
`rate(axum_http_requests_total{job="invocation", method="POST"}[1m])`
gives req/s. Should rise with load during healthy operation.

`axum_http_requests_pending`

What it is: In-flight HTTP requests currently held at the IS, waiting for the worker to respond.
What to look for:
`axum_http_requests_total` rate = requests
are stacking at the IS faster than workers can drain them:

`nvca_instance_type_allocatable`

What it is: Number of worker slots available in the cluster fleet.
What to look for:
These confirm saturation after it has occurred. k6 is the primary source.
`http_req_duration` p95 (k6)

What it is: End-to-end request latency measured by k6.

What to look for:

k6 metric: `http_req_duration` (watch p90, p95 in k6 Cloud)

`http_req_failed` (k6)

What it is: k6 metric tracking the rate of failed HTTP requests.
What to look for:
EOF — TCP connection closed by Envoy, not the IS.
The IS holds each direct HTTP invocation connection open while waiting for
the worker. When the Envoy connection timeout fires first, the socket is
closed directly. The IS never sends an HTTP error response. The client sees
EOF with no status code.

`http_req_failed` is a breaking-point signal, not an early
warning. The system is well past the capacity wall by the time EOF errors
appear.

k6 metric: `http_req_failed` (rate or count in k6 Cloud)

`function_request_latency` p95 (worker-side)

What it is: Per-request latency as measured by the worker itself: the time spent inside the function from the moment the worker picks up the request.
What to look for:
`http_req_duration` (client-side). If k6 p95 is high but
worker p95 is low, the bottleneck is queuing at the IS or Envoy, not
inference time.

These should be zero during a clean load test. Any non-zero value warrants investigation.
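The comparison of client-side and worker-side p95 latency described above can be sketched as a small helper. This is a minimal illustration, assuming you have already read both p95 values from k6 and from the worker metric. The function name and the `gap_ratio` cut-off are hypothetical and should be calibrated against your own baseline. It is only meaningful once the client-side p95 is already elevated.

```python
def locate_bottleneck(client_p95_ms: float,
                      worker_p95_ms: float,
                      gap_ratio: float = 2.0) -> str:
    """Classify where latency is accumulating.

    client_p95_ms: k6 http_req_duration p95 (end to end).
    worker_p95_ms: function_request_latency p95 (inside the worker).
    gap_ratio:     hypothetical threshold; client latency more than
                   gap_ratio times worker latency is treated as queuing.
    """
    if client_p95_ms < 0 or worker_p95_ms < 0:
        raise ValueError("latencies must be non-negative")
    if client_p95_ms > gap_ratio * worker_p95_ms:
        # Most of the time is spent outside the function itself.
        return "queuing (IS or Envoy)"
    # Client and worker latency track each other.
    return "inference time"


# Illustrative values only.
print(locate_bottleneck(client_p95_ms=8000, worker_p95_ms=900))
print(locate_bottleneck(client_p95_ms=1000, worker_p95_ms=900))
```

The first call reports queuing because the client sees far more latency than the worker accounts for. The second reports inference time because the two values are close.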
NATS is the message bus between the IS and the worker.
Early-warning signal: `axum_http_requests_pending` is still the earliest
IS-side queuing indicator. NATS connection counts and stream lag now provide
independent cross-checks.
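To watch the IS-side signals from a script rather than a dashboard, one option is the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The sketch below only builds the request URLs. The Prometheus address is a placeholder, and the `job="invocation"` selector on the pending metric is an assumption based on the job name used for the Invocation Service in this guide.

```python
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with your own.
PROM_URL = "http://prometheus.example:9090"

QUERIES = {
    # Request rate at the IS, as given in this guide.
    "is_request_rate":
        'rate(axum_http_requests_total{job="invocation", method="POST"}[1m])',
    # In-flight requests at the IS; the label selector is an assumption.
    "is_pending":
        'axum_http_requests_pending{job="invocation"}',
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL with the PromQL URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


for name, promql in QUERIES.items():
    print(name, instant_query_url(PROM_URL, promql))
```

Fetching these URLs at a fixed interval during the test gives a time series of rate and pending values that can be lined up against the k6 results.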
Envoy Gateway sits in both the request and return path for all HTTP invocations. It enforces TCP connection timeouts and is the direct cause of EOF failures at overload.
Useful Envoy signals during an HTTP test:
Regardless of hardware, HTTP saturation follows this order:
Steps 1-4 are observable before errors reach clients. Steps 5-6 confirm saturation is underway.
These are starting-point thresholds to calibrate against your baseline — not absolute values. Hardware, workload, and deployment configuration all affect where these numbers land.