Copyright © 2026, NVIDIA Corporation.

(grpc-load-test-sli-guide)=

gRPC Load Test SLI Guide

This document describes which metrics to watch when load testing a self-hosted NVCF gRPC deployment, what each metric indicates, and how to interpret the saturation sequence. Values are hardware-dependent — what is transferable is the order in which signals appear and what they mean.

For run commands and cluster setup, see {ref}`self-managed-grpc-load-test`.


How Self-Hosted NVCF Handles Load

Understanding the request path helps interpret the metrics:

  • The gRPC proxy holds in-flight requests. It does not reject requests until maxRequestConcurrency is exhausted — it queues them.
  • The worker sidecar is the throughput ceiling. Its concurrency limit (maxRequestConcurrency) and inference time per request set the maximum sustainable req/s.
  • NATS dispatches work between components. It is downstream of the proxy’s internal queue and will not show pressure until the proxy itself is saturated.
  • NVCA only acts on scaling when scale-out is configured (minInstances < maxInstances).
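
The worker-sidecar bullet above implies a back-of-the-envelope ceiling. As a rough sketch only, assuming every worker keeps all of its maxRequestConcurrency slots busy and inference time is roughly constant, Little's law gives an upper bound on sustainable req/s. The function and parameter names below are illustrative and not part of NVCF.

```python
def max_sustainable_rps(max_request_concurrency: int,
                        inference_seconds: float,
                        instances: int = 1) -> float:
    """Estimate the throughput ceiling via Little's law (L = lambda * W).

    Assumes all concurrency slots stay busy and inference time is constant;
    real deployments will land somewhat below this bound.
    """
    if inference_seconds <= 0:
        raise ValueError("inference_seconds must be positive")
    return instances * max_request_concurrency / inference_seconds


# Hypothetical numbers: 8 concurrent slots, 0.5 s per request, 2 workers.
print(max_sustainable_rps(8, 0.5, instances=2))  # 32.0
```

Comparing this estimate with the plateau observed during a test is one way to check whether the worker, rather than something upstream, is the limiting component.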

SLIs to Monitor

Group 1: Leading Indicators

These rise before errors appear. Use them to predict saturation.

nvcf_grpc_proxy_service_active_connections_total

What it is: Number of active worker connections held by the gRPC proxy.

What to look for:

  • Rises with load during healthy operation.
  • Decouples from throughput at the saturation point: connections keep rising while req/s flattens. This is the earliest saturation signal.

nvcf_grpc_proxy_service_active_connections_total
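
One possible way to spot the decoupling programmatically is to compare step-over-step growth of connections and throughput. The sketch below assumes you have already exported one sample per load step for each series. The growth cutoffs (10% and 2%) are arbitrary placeholders to tune against your own baseline, not recommended values.

```python
from typing import Optional, Sequence


def first_decoupling_step(connections: Sequence[float],
                          throughput: Sequence[float],
                          conn_growth: float = 0.10,
                          rps_growth: float = 0.02) -> Optional[int]:
    """Return the index of the first step where connections keep growing
    (by more than conn_growth) while req/s grows by less than rps_growth."""
    for i in range(1, min(len(connections), len(throughput))):
        prev_c, prev_t = connections[i - 1], throughput[i - 1]
        if prev_c <= 0 or prev_t <= 0:
            continue  # skip steps without a usable baseline
        c_delta = (connections[i] - prev_c) / prev_c
        t_delta = (throughput[i] - prev_t) / prev_t
        if c_delta > conn_growth and t_delta < rps_growth:
            return i
    return None


# Made-up samples, one per load step.
conns = [10, 20, 40, 80, 120]
rps = [50, 100, 190, 192, 193]
print(first_decoupling_step(conns, rps))  # 3
```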

nvcf_grpc_proxy_service_session_init_seconds_total (p95)

What it is: Time for the proxy to establish a worker session (first contact for a new connection).

What to look for:

  • Low at idle, rises when the proxy is busy competing for worker slots.
  • A rising p95 means new requests are waiting longer to get a worker session.
  • Check bucket distribution: are requests piling up in the higher latency buckets (>100ms, >250ms)?

histogram_quantile(0.95,
  rate(nvcf_grpc_proxy_service_session_init_seconds_total_bucket{is_reconnect="false"}[1m]))
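
To make the bucket-distribution check concrete, the following is a simplified sketch of how a quantile can be estimated from cumulative histogram buckets using linear interpolation inside the matching bucket. It approximates what histogram_quantile does for classic histograms but is not the Prometheus implementation, and the bucket boundaries in the example are invented.

```python
import math
from typing import Sequence, Tuple


def quantile_from_buckets(q: float,
                          buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return lower_bound
            width = upper_bound - lower_bound
            return lower_bound + width * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Invented session-init buckets in seconds: 50 ms, 100 ms, 250 ms, +Inf.
example = [(0.05, 60), (0.1, 90), (0.25, 98), (math.inf, 100)]
print(quantile_from_buckets(0.95, example))  # roughly 0.194 s
```

In this example the p95 lands in the 100-250 ms bucket, which is the kind of shift into higher buckets the checklist above asks you to look for.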

Group 2: Throughput and Capacity

function_request_total

What it is: Cumulative completed requests for a specific function, scraped from the gRPC Proxy (job=grpc). Filter by function_id to isolate a single function’s throughput. Labels: function_id, function_version_id, nca_id.

What to look for:

  • rate(function_request_total[1m]) gives req/s. Plot alongside VU count.
  • Throughput plateau = capacity wall. If req/s stops growing while VUs keep increasing, the system is saturated.

rate(function_request_total{job="grpc", function_id="<your-function-id>"}[1m])
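
Because function_request_total is cumulative, req/s has to be derived from successive scrapes. The sketch below shows the basic idea with counter-reset handling. It deliberately omits the window-edge extrapolation that Prometheus applies in rate(), so treat it as an illustration of the concept rather than an exact equivalent. The sample values are made up.

```python
from typing import Sequence, Tuple


def simple_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Per-second increase of a cumulative counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset, so the post-reset value
    counts as new increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must increase")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    return increase / elapsed


# Hypothetical scrapes every 15 s: 900 requests completed in 45 s.
scrapes = [(0, 1000), (15, 1300), (30, 1600), (45, 1900)]
print(simple_rate(scrapes))  # 20.0
```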

nvca_instance_type_allocatable

What it is: Available worker slots in the cluster fleet.

What to look for:

  • Drops as workers are allocated to new deployments.
  • If allocatable reaches 0 on a fixed cluster, new worker deployments fail with a no-capacity error.

nvca_instance_type_allocatable{instance_type="<your-instance-type>"}

Group 3: Lagging Indicators

These confirm saturation after it has occurred. They are not useful for early warning, but they confirm the failure mode. k6 is the primary source for these signals.

grpc_req_duration p95 (k6)

What it is: End-to-end gRPC request latency measured by k6.

What to look for:

  • Rises steeply after the throughput plateau.
  • Use p95 > 5s as a lagging SLO threshold. By the time it rises, the capacity wall has already been hit.

k6 metric: grpc_req_duration (watch p90, p95 in k6 Cloud)

grpc_req_failed (k6)

What it is: k6 metric tracking the rate of failed gRPC requests.

What to look for:

  • Stays near zero through moderate overload. The proxy holds connections and queues requests rather than rejecting them — failures only appear once requests have been held long enough to hit the k6 client timeout.

  • Non-zero grpc_req_failed is a breaking-point signal, not an early warning. By the time it rises, the system is well past the capacity wall.

  • Error type matters:

    • context deadline exceeded — overload, expected at extreme VU counts.
    • UNAVAILABLE or connection errors — proxy or network issue unrelated to capacity.

k6 metric: grpc_req_failed (rate or count in k6 Cloud)
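
When post-processing failures exported from a k6 run, the error-type distinction above can be applied mechanically. The following is a minimal sketch that assumes error messages arrive as plain strings containing the phrases listed above. The exact message format depends on your k6 script and gRPC client, and the category names are illustrative.

```python
def classify_grpc_failure(message: str) -> str:
    """Map a gRPC error message to a coarse failure category."""
    text = message.lower()
    if "deadline exceeded" in text:
        return "overload"          # request was held past the client timeout
    if "unavailable" in text or "connection" in text:
        return "proxy-or-network"  # not a capacity symptom
    return "unknown"


# Assumed message shapes, for illustration only.
print(classify_grpc_failure("context deadline exceeded"))          # overload
print(classify_grpc_failure("code = Unavailable: conn refused"))   # proxy-or-network
```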

function_request_latency p95 (worker-side)

What it is: Per-request latency as measured by the worker itself, that is, the time spent inside the function from the moment the worker picks up the request.

What to look for:

  • Complements grpc_req_duration (client-side). If k6 p95 is high but worker p95 is low, the bottleneck is queuing at the proxy, not inference time.
  • Rising worker latency under load indicates the worker itself is the throughput ceiling.

histogram_quantile(0.95, rate(function_request_latency_bucket[1m]))
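
The client-versus-worker comparison can be reduced to a single number: the share of client-observed latency that is not explained by the worker. This is only a rough heuristic, since percentiles are not additive and subtracting two p95 values is an approximation. The function name is hypothetical.

```python
def queue_share(client_p95_s: float, worker_p95_s: float) -> float:
    """Approximate fraction of client p95 latency spent outside the worker.

    Values near 1.0 suggest queuing at the proxy; values near 0.0 suggest
    the worker's own processing time dominates.
    """
    if client_p95_s <= 0:
        raise ValueError("client_p95_s must be positive")
    outside = max(client_p95_s - worker_p95_s, 0.0)
    return outside / client_p95_s


# Hypothetical readings: k6 p95 of 4.0 s against a worker p95 of 0.5 s.
print(queue_share(4.0, 0.5))  # 0.875
```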

Group 4: Stability Signals

These should remain at zero during a clean load test. Any non-zero value warrants investigation.

| Metric | Threshold | What it means |
| --- | --- | --- |
| nvcf_grpc_proxy_service_nats_error_total | > 0 | Proxy lost connectivity to NATS |
| nvcf_grpc_proxy_service_nats_reconnect_total | > 0 | NATS connection instability |
| nvca_event_error_total{nvca_event_name="TICK_ACKNOWLEDGE_REQUEST"} | > 0 | NVCA failing to acknowledge worker heartbeats |
| nvca_container_crash_total | > 0 | Worker pod OOM or crash |
| nvca_controller_runtime_reconcile_errors_total | > 0 | k8s controller errors in NVCA |
| nvca_event_queue_length | sustained > 0 | NVCA falling behind processing heartbeat/scaling events |
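
A post-test check over these signals could look like the sketch below. It assumes you have already computed each counter's increase over the test window (for example with increase() in your metrics backend) and passed the results in as a dictionary. nvca_event_queue_length is left out because its "sustained" condition needs a time series rather than a single number.

```python
from typing import Dict, List

# Counter names taken from the stability-signal list in this guide.
# nvca_event_error_total is assumed to be pre-filtered to
# nvca_event_name="TICK_ACKNOWLEDGE_REQUEST" by the caller.
STABILITY_COUNTERS = (
    "nvcf_grpc_proxy_service_nats_error_total",
    "nvcf_grpc_proxy_service_nats_reconnect_total",
    "nvca_event_error_total",
    "nvca_container_crash_total",
    "nvca_controller_runtime_reconcile_errors_total",
)


def unstable_signals(increases: Dict[str, float]) -> List[str]:
    """Return the stability counters that increased during the test window."""
    return [name for name in STABILITY_COUNTERS if increases.get(name, 0) > 0]


# Hypothetical result: two worker crashes, everything else clean.
print(unstable_signals({"nvca_container_crash_total": 2}))
```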

NATS JetStream

NATS is the message bus between the gRPC proxy and the worker.

Early-warning signal: nvcf_grpc_proxy_service_active_connections_total decoupling from throughput is still the earliest proxy-side saturation indicator.

Envoy Gateway

Useful Envoy signals during a gRPC test:

# Active downstream connections on the gRPC listener
envoy_listener_downstream_cx_active{envoy_listener_address="0.0.0.0_10081"}

# Overflow -- TCP connection ceiling hit (should stay 0 unless saturated)
envoy_listener_downstream_cx_overflow{envoy_listener_address="0.0.0.0_10081"}

# Envoy pod restart count
sum(increase(kube_pod_container_status_restarts_total{namespace="envoy-gateway-system"}[$__range]))

The Saturation Sequence

Regardless of hardware, saturation follows this order:

1. active_connections_total rises with load
   ↓
2. active_connections_total growth decouples from throughput ← LEADING SIGNAL
   ↓
3. Throughput (req/s) plateaus despite more VUs ← CAPACITY WALL
   ↓
4. grpc_req_duration p95 rises steeply ← LAGGING SIGNAL
   ↓
5. Client timeouts (context deadline exceeded) ← FAILURE VISIBLE TO CLIENTS

Steps 1-3 are observable before errors reach clients. Steps 4-5 confirm saturation is underway.


Recommended Thresholds

These are relative thresholds to calibrate against your baseline — not absolute values. Hardware, workload, and deployment configuration all affect where these numbers land.

| Signal | Threshold | Action |
| --- | --- | --- |
| nvcf_grpc_proxy_service_active_connections_total | > 50% of maxRequestConcurrency sustained 2 min | Warning: approaching saturation |
| nvcf_grpc_proxy_service_active_connections_total | > 80% of maxRequestConcurrency sustained 1 min | Critical: at capacity |
| Throughput plateau | req/s flat while VUs still increasing | Capacity wall reached |
| session_init_seconds p95 | > 100ms | Proxy contention: investigate |
| nats_error_total | > 0 | Immediate investigation |
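
The two active-connection thresholds combine a percentage of maxRequestConcurrency with a "sustained" duration. One way to evaluate them over exported samples is sketched below. The sample layout, function names, and the interpretation of "sustained" (every sample in the trailing window exceeds the limit) are assumptions for illustration, and the percentages should still be calibrated against your baseline.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, active connections)


def sustained_above(samples: Sequence[Sample], limit: float, window_s: float) -> bool:
    """True if every sample in the trailing window exceeds limit and the
    samples reach back at least window_s seconds."""
    if not samples:
        return False
    end = samples[-1][0]
    if end - samples[0][0] < window_s:
        return False  # not enough history to call it sustained
    recent = [value for ts, value in samples if ts >= end - window_s]
    return all(value > limit for value in recent)


def connection_alert(samples: Sequence[Sample], max_request_concurrency: int) -> str:
    """Classify active-connection pressure as ok, warning, or critical."""
    if sustained_above(samples, 0.8 * max_request_concurrency, 60):
        return "critical"
    if sustained_above(samples, 0.5 * max_request_concurrency, 120):
        return "warning"
    return "ok"


# Hypothetical samples every 30 s with maxRequestConcurrency = 100.
history = [(0, 55), (30, 60), (60, 70), (90, 75), (120, 78)]
print(connection_alert(history, 100))  # warning
```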