Self-Managed NVCF HTTP Soak Test

This guide walks through running a sustained HTTP soak test against a self-managed NVCF cluster using k6. The test sends a constant arrival-rate load to one or more deployed functions and reports success rate, latency percentiles, and throughput over an extended period (default 48 hours).

Prerequisites

Self-hosted CLI

You need a working nvcf-cli configured against your self-managed cluster. If you have not set this up yet, follow the self-hosted-cli guide to install the binary and the cli-configuration section to point it at your gateway.

Verify the CLI can reach the cluster before continuing:

$./nvcf-cli init

Deploy the load test function

Use the load_tester_supreme container for soak testing. It is purpose-built for high-throughput benchmarking and includes:

  • gRPC + HTTP + SSE endpoints in a single image
  • Tunable repeats, delay, and size fields to shape request/response profiles
  • Built-in OpenTelemetry tracing
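
As a rough illustration, these knobs are driven through the JSON request body. The message and repeats fields are used by the soak script later in this guide; the delay and size values below are placeholder assumptions -- check the container's README in the repository for their exact semantics:

$# Hypothetical request body exercising the tunable fields; delay and size
$# values here are placeholders, not verified defaults
$cat > payload.json <<'EOF'
> {"message": "randomString", "repeats": 100, "delay": 0.1, "size": 1024}
> EOF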

The source, build instructions, and registry push examples are in the nv-cloud-function-helpers repository. Build and push the image to whichever container registry your cluster has credentials for:

$git clone https://github.com/NVIDIA/nv-cloud-function-helpers.git
$cd nv-cloud-function-helpers/examples/function_samples/load_tester_supreme
$
$# Build and push (multi-arch). A multi-platform buildx image cannot be
$# loaded into the local daemon for a separate "docker tag"/"docker push",
$# so tag with your registry -- ECR, NGC, Docker Hub, etc. -- and push directly
$docker buildx build --platform linux/amd64,linux/arm64 \
>   -t <your-registry>/load_tester_supreme:latest \
>   --push .
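
Optionally, confirm that both architectures landed in the registry:

$# Should list linux/amd64 and linux/arm64 in the manifest
$docker buildx imagetools inspect <your-registry>/load_tester_supreme:latest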

To check which registries your cluster recognises, run ./nvcf-cli registry list.

Then create the function and deploy it using the CLI. For HTTP soak testing you can create multiple functions to simulate broader load:

$# Create the function (HTTP)
$./nvcf-cli function create \
> --name "load-tester-supreme" \
> --image "<your-registry>/load_tester_supreme:latest" \
> --inference-url "/echo" \
> --inference-port 8000 \
> --health-uri "/health" \
> --health-port 8000 \
> --health-timeout PT30S
$
$# Deploy (adjust GPU type and instance type for your cluster)
$./nvcf-cli function deploy create \
> --gpu L40S \
> --instance-type NCP.GPU.L40S_1x \
> --min-instances 1 \
> --max-instances 1
$
$# Generate an API key for invocations (default expiry: 24h)
$./nvcf-cli api-key generate
$
$# For soak tests longer than 24 hours, set a longer expiry:
$./nvcf-cli api-key generate --expires-in "7d"

Export the key so it can be passed to k6:

$export API_KEY=<your-nvapi-key>

Repeat the function create and function deploy create steps to create additional functions if you want to distribute load across multiple function endpoints.
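
For instance, a small loop can stamp out several identically configured functions -- a sketch using the same flags as above, with only --name varying:

$# Create four functions; deploy each afterwards with "function deploy create"
$for i in 1 2 3 4; do
>   ./nvcf-cli function create \
>     --name "load-tester-supreme-$i" \
>     --image "<your-registry>/load_tester_supreme:latest" \
>     --inference-url "/echo" \
>     --inference-port 8000 \
>     --health-uri "/health" \
>     --health-port 8000 \
>     --health-timeout PT30S
> done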

Once deployed, note the following — you will need them for the k6 script:

  • Function ID — the UUID returned by function create
  • Function Version ID — the UUID of the specific deployed version
  • Invocation host — the Host header used for invocation routing
  • API key — from ./nvcf-cli api-key generate (begins with nvapi-); export it as $API_KEY

Obtain the gateway address

Your gateway address is the external address of the Envoy Gateway deployed with the control plane. To retrieve it:

$export GATEWAY_ADDR=$(kubectl get gateway nvcf-gateway -n envoy-gateway \
> -o jsonpath='{.status.addresses[0].value}')
$echo "Gateway Address: $GATEWAY_ADDR"

On AWS EKS this is an ELB hostname (e.g. a1b2c3d4.us-east-1.elb.amazonaws.com). For a local deployment (Kind, k3d, Docker Desktop) it is typically localhost or 127.0.0.1.

The gateway uses Host header routing to direct traffic:

| Host prefix | Routes to |
| --- | --- |
| api.$GATEWAY_ADDR | NVCF management API (function CRUD) |
| invocation.$GATEWAY_ADDR | Invocation / inference service |
| api-keys.$GATEWAY_ADDR | API Keys service |

For the soak test, set:

  • BASE_URL = http://$GATEWAY_ADDR
  • INVOKE_HOST = invocation.$GATEWAY_ADDR
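
A quick sanity check that the gateway resolves Host headers (any HTTP status line back confirms routing; the exact code depends on your deployment):

$curl -si http://$GATEWAY_ADDR/ -H "Host: invocation.$GATEWAY_ADDR" | head -n 1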

The CLI saves the function and version IDs automatically. Run ./nvcf-cli status to view them at any time.

Install k6

Install k6 if you don’t have it:

$# macOS
$brew install k6
$# Linux (Debian/Ubuntu)
$sudo gpg -k
$sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg \
> --keyserver hkp://keyserver.ubuntu.com:80 \
> --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
$echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" \
> | sudo tee /etc/apt/sources.list.d/k6.list
$sudo apt-get update && sudo apt-get install k6
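
Confirm the binary is available:

$k6 version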

The k6 test script

Save the following script as k6-nvcf-http-soak.js. The script uses the constant-arrival-rate executor to guarantee an exact request rate per second regardless of response time, which is critical for soak testing where you want a steady, predictable load.

The latest version of this script is maintained in the nv-cloud-function-helpers repository.

/**
 * NVCF HTTP soak test script.
 *
 * Pass a FUNCTIONS JSON array with one entry per function you want to
 * load-test (one function per GPU node is typical). There is no limit
 * on the number of functions -- each iteration sends one request to
 * every function via http.batch().
 *
 * Total TPS = TPS_PER_FUNC × number of functions.
 *
 * Required env vars: BASE_URL, INVOKE_HOST, FUNCTIONS
 * Optional: API_KEY, TPS_PER_FUNC, PRE_VUS, MAX_VUS, REPEATS, DURATION
 *
 * Example:
 *   k6 run -e BASE_URL=http://$GATEWAY_ADDR \
 *     -e INVOKE_HOST=invocation.$GATEWAY_ADDR \
 *     -e 'FUNCTIONS=[{"funcId":"uuid1","verId":"uuid2"}]' \
 *     -e DURATION=48h \
 *     -e API_KEY=$API_KEY \
 *     k6-nvcf-http-soak.js
 */
import http from 'k6/http';
import { check } from 'k6';
import { Rate, Trend, Counter } from 'k6/metrics';

const invokeSuccess = new Rate('invoke_success');
const invokeLatency = new Trend('invoke_latency_ms');
const totalRequests = new Counter('total_requests');

// ---------- Required config ----------
const DURATION = __ENV.DURATION || '48h';
const BASE_URL = __ENV.BASE_URL || '';
const INVOKE_HOST = __ENV.INVOKE_HOST || '';

const FUNCTIONS = (() => {
  const raw = __ENV.FUNCTIONS || '';
  if (!raw) return [];
  try {
    const arr = JSON.parse(raw);
    if (!Array.isArray(arr) || arr.length === 0) return [];
    return arr;
  } catch (e) {
    console.warn('FUNCTIONS JSON parse failed: ' + e.message);
    return [];
  }
})();

// ---------- Optional parameters ----------
const API_KEY = __ENV.API_KEY || '';
const REPEATS = parseInt(__ENV.REPEATS || '100', 10);
const TPS_PER_FUNC = parseInt(__ENV.TPS_PER_FUNC || '125', 10);
const PRE_VUS = parseInt(__ENV.PRE_VUS || '250', 10);
const MAX_VUS = parseInt(__ENV.MAX_VUS || '500', 10);

export const options = {
  scenarios: {
    invoke_all_functions: {
      executor: 'constant-arrival-rate',
      rate: TPS_PER_FUNC,
      timeUnit: '1s',
      duration: DURATION,
      preAllocatedVUs: PRE_VUS,
      maxVUs: MAX_VUS,
      exec: 'invoke_all_functions',
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<5000'],
    invoke_success: ['rate>0.999'],
  },
};

export function invoke_all_functions() {
  const payload = JSON.stringify({ message: 'randomString', repeats: REPEATS });

  const requests = FUNCTIONS.map((f, i) => ({
    method: 'POST',
    url: `${BASE_URL}/echo`,
    body: payload,
    params: {
      headers: {
        'Host': INVOKE_HOST,
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${API_KEY}`,
        'Function-Id': f.funcId,
        'Function-Version-Id': f.verId,
        'Nvcf-Poll-Seconds': '5',
      },
      tags: { func: `func_${i}` },
    },
  }));

  const responses = http.batch(requests);
  responses.forEach((res, i) => {
    totalRequests.add(1);
    invokeSuccess.add(res.status === 200);
    if (res.status === 200) invokeLatency.add(res.timings.duration);

    check(res, {
      [`func_${i} status 200`]: (r) => r.status === 200,
      [`func_${i} fulfilled`]: (r) => r.headers['Nvcf-Status'] === 'fulfilled',
      [`func_${i} body not empty`]: (r) => r.body && r.body.length > 0,
    });
  });
}

Running the soak test

A typical soak test runs multiple functions at a sustained TPS for an extended period. For example, four functions at 125 TPS each (500 TPS total) for 48 hours:

$k6 run k6-nvcf-http-soak.js \
> -e BASE_URL=http://$GATEWAY_ADDR \
> -e INVOKE_HOST=invocation.$GATEWAY_ADDR \
> -e 'FUNCTIONS=[{"funcId":"<id-1>","verId":"<ver-1>"},{"funcId":"<id-2>","verId":"<ver-2>"},{"funcId":"<id-3>","verId":"<ver-3>"},{"funcId":"<id-4>","verId":"<ver-4>"}]' \
> -e DURATION=48h \
> -e TPS_PER_FUNC=125 \
> -e API_KEY=$API_KEY

Run the soak test inside tmux or screen so it survives SSH disconnections:

$tmux new -s soak
$# run k6 command above
$# Ctrl-B D to detach, tmux attach -t soak to re-attach
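
If tmux or screen is not available, nohup with a log file works as well:

$nohup k6 run k6-nvcf-http-soak.js ... > soak-run.log 2>&1 &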

Tune the load

TPS per function

The constant-arrival-rate executor guarantees a fixed number of requests per second. The total TPS is TPS_PER_FUNC × number of functions.

| TPS_PER_FUNC | Functions | Total TPS |
| --- | --- | --- |
| 1 | 1 | 1 (smoke test) |
| 25 | 4 | 100 (light load) |
| 125 | 4 | 500 (moderate soak) |
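
Before committing to a multi-day run, the first row above makes a convenient smoke test:

$k6 run k6-nvcf-http-soak.js \
> -e BASE_URL=http://$GATEWAY_ADDR \
> -e INVOKE_HOST=invocation.$GATEWAY_ADDR \
> -e 'FUNCTIONS=[{"funcId":"<id-1>","verId":"<ver-1>"}]' \
> -e DURATION=60s \
> -e TPS_PER_FUNC=1 \
> -e API_KEY=$API_KEY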

Default control plane sizing: The default resource sizing that ships with nvcf-base is designed to handle roughly 100 concurrent users. If you need to test beyond that, you will need to scale the control plane components first. Starting with 100 TPS total is a good baseline for validating a default self-managed deployment.
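
With the four-function setup from the running example, that baseline corresponds to the second row of the table above:

$# 4 functions × 25 TPS each = 100 TPS total
$k6 run k6-nvcf-http-soak.js ... -e TPS_PER_FUNC=25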

VU allocation

Each VU is a virtual user that can hold one in-flight request. The executor creates new VUs if existing ones are busy. If you see "insufficient VUs" warnings telling you to increase maxVUs, raise PRE_VUS and MAX_VUS:

$-e PRE_VUS=500 -e MAX_VUS=1000

A good rule of thumb: PRE_VUS ≈ TPS_PER_FUNC × avg_latency_seconds × 2. In-flight requests ≈ arrival rate × latency (Little's law), doubled for headroom; e.g. 125 TPS × 1 s × 2 = 250, the default.

Environment variables reference

| Variable | Description | Default |
| --- | --- | --- |
| BASE_URL | HTTP base URL of the gateway / load balancer (http://$GATEWAY_ADDR) | (required) |
| INVOKE_HOST | Host header for invocation routing (invocation.$GATEWAY_ADDR) | (required) |
| FUNCTIONS | JSON array of function objects: [{"funcId":"…","verId":"…"}, ...] | (required) |
| DURATION | Test duration (e.g. 30s, 1h, 48h) | 48h |
| API_KEY | nvapi-* bearer token (exported from ./nvcf-cli api-key generate, see above) | (required) |
| TPS_PER_FUNC | Requests per second per function | 125 |
| PRE_VUS | Pre-allocated virtual users per function scenario | 250 |
| MAX_VUS | Maximum virtual users per function scenario | 500 |
| REPEATS | Number of times the echo endpoint repeats the payload string | 100 |

Interpreting results

k6 prints a summary to stdout at the end of each run. Key metrics to monitor during a soak test:

| Metric | What to look for |
| --- | --- |
| invoke_success | Should stay above 99.9%. Drops indicate gateway or function errors. |
| invoke_latency_ms (p95) | Should stay below 5000 ms. Rising latency over time can signal memory leaks, connection exhaustion, or pod restarts. |
| http_req_duration (p50 / p95 / p99) | Overall request round-trip time including network. Compare to invoke_latency_ms to isolate gateway overhead. |
| total_requests | Total requests completed. Divide by test duration to verify actual TPS. |
| http_req_failed | Percentage of non-2xx responses. Should be 0% for a healthy cluster. |
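
For example, to verify realized throughput for the 48-hour, 500 TPS run above:

$# 86,400,000 total requests / (48 h × 3600 s) = 500 TPS
$echo $((86400000 / (48 * 3600)))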

To save results for offline analysis:

$# Plain text log
$k6 run k6-nvcf-http-soak.js ... 2>&1 | tee soak-run.log
$
$# JSON output for Grafana / post-processing
$k6 run --out json=soak-results.json k6-nvcf-http-soak.js ...
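
Each line of the JSON output is an NDJSON object; a jq filter along these lines (assuming k6's standard Point record format) pulls out raw latency samples for post-processing:

$# Extract individual invoke_latency_ms samples from the NDJSON stream
$jq -r 'select(.type=="Point" and .metric=="invoke_latency_ms") | .data.value' \
>   soak-results.json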

Verifying the endpoint manually

Before running the soak test, verify the endpoint works with curl:

$curl -i -X POST http://$GATEWAY_ADDR/echo \
> -H "Host: invocation.$GATEWAY_ADDR" \
> -H "Content-Type: application/json" \
> -H "Function-Id: <your-function-id>" \
> -H "Function-Version-Id: <your-function-version-id>" \
> -H "Authorization: Bearer $API_KEY" \
> -d '{"message": "hello", "repeats": 3}'

You should receive a 200 OK response with the Nvcf-Status: fulfilled header and the message repeated three times.