Inference Gateway (GAIE)

Inference Gateway Setup with Dynamo

Integrate Dynamo with the Gateway API Inference Extension for intelligent KV-aware request routing at the gateway layer.

EPP’s default KV-routing approach is not token-aware because the prompt is not tokenized. The Dynamo plugin, by contrast, uses a token-aware KV algorithm: it employs the Dynamo router, which implements KV routing by running your model’s tokenizer inline. The EPP plugin configuration lives in helm/dynamo-gaie/epp-config-dynamo.yaml, per EPP convention.
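To illustrate why tokenization matters for KV-aware routing, here is a minimal sketch. It is not Dynamo's actual implementation: the chained SHA-256 hashing, the block size of 4, the token IDs, and the function names are all assumptions made for illustration. The idea is that a prompt is split into fixed-size blocks of token IDs, and a router can only estimate how much of a worker's KV cache is reusable if it can compare those blocks.

```python
import hashlib
from typing import List, Sequence, Set


def block_hashes(token_ids: Sequence[int], block_size: int) -> List[str]:
    """Hash each full block of tokens, chaining in the previous block's hash.

    Chaining makes each hash identify the whole prefix up to that block,
    which is the unit a prefix (KV) cache can reuse.
    """
    hashes: List[str] = []
    parent = ""
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        payload = parent + "|" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def overlap_blocks(request_hashes: Sequence[str], cached: Set[str]) -> int:
    """Count the leading request blocks already cached on a worker."""
    count = 0
    for h in request_hashes:
        if h not in cached:
            break
        count += 1
    return count


# Hypothetical token IDs; a real router obtains them from the model's tokenizer.
BLOCK_SIZE = 4
cached_prompt = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
new_prompt = [1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102, 103]

worker_cache = set(block_hashes(cached_prompt, BLOCK_SIZE))
request = block_hashes(new_prompt, BLOCK_SIZE)
print(overlap_blocks(request, worker_cache), "of", len(request), "blocks overlap")
```

Without token IDs, the router cannot form these blocks at all, which is why an untokenized prompt cannot be matched against per-block cache state.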

Dynamo integration with the Inference Gateway supports aggregated and disaggregated serving. The EPP config is the same for both. If no prefill workers are found, the service degrades gracefully to aggregated serving. If you want to use LoRA, deploy Dynamo without the Inference Gateway.

Currently, these setups are supported only with the kGateway-based Inference Gateway.

Prerequisites

Kubernetes cluster with kubectl configured
NVIDIA GPU drivers installed on worker nodes

Installation Steps

1. Install Dynamo Platform

See the Quickstart Guide to install the Dynamo Kubernetes Platform.

2. Deploy Inference Gateway

First, deploy an inference gateway service. In this example, we’ll install the kgateway-based gateway implementation.

$ cd deploy/inference-gateway
$ export NAMESPACE=my-model # You can put the inference gateway into another namespace and then adjust your http-route.yaml
$ ./scripts/install_gaie_crd_kgateway.sh

Note: The manifest at config/manifests/gateway/kgateway/gateway.yaml uses gatewayClassName: agentgateway, but kGateway’s helm chart creates a GatewayClass named kgateway. The patch command in the script fixes this mismatch.

Verify that the Gateway is running:

$ kubectl get gateway inference-gateway
$ 
$ # Sample output
$ # NAME                CLASS      ADDRESS   PROGRAMMED   AGE
$ # inference-gateway   kgateway             True         1m
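If you want to script this check rather than read the table, the following sketch parses the JSON form of the same resource. It assumes the output of `kubectl get gateway inference-gateway -o json` and relies on the standard Gateway API `Programmed` status condition; the sample document below is a trimmed, hypothetical example, and `is_programmed` is a name chosen for illustration.

```python
import json
from typing import Any, Dict


def is_programmed(gateway: Dict[str, Any]) -> bool:
    """Return True if the Gateway has a Programmed condition with status "True"."""
    conditions = gateway.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Programmed" and c.get("status") == "True"
        for c in conditions
    )


# Trimmed, hypothetical example of `kubectl get gateway inference-gateway -o json`.
sample = json.loads("""
{
  "kind": "Gateway",
  "metadata": {"name": "inference-gateway"},
  "status": {
    "conditions": [
      {"type": "Accepted", "status": "True"},
      {"type": "Programmed", "status": "True"}
    ]
  }
}
""")
print(is_programmed(sample))
```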

3. Set up secrets

If your images are in a private registry, create the Docker registry secret:

$ kubectl create secret docker-registry docker-imagepullsecret \
>   --docker-server=$DOCKER_SERVER \
>   --docker-username=$DOCKER_USERNAME \
>   --docker-password=$DOCKER_PASSWORD \
>   --namespace=$NAMESPACE

Also create a secret that holds your Hugging Face token:

$ export HF_TOKEN=your_hf_token
$ kubectl create secret generic hf-token-secret \
>   --from-literal=HF_TOKEN=${HF_TOKEN} \
>   -n ${NAMESPACE}

Create a model configuration file for your model, similar to vllm_agg_qwen.yaml. That file demonstrates the values needed for the vLLM aggregated setup in agg.yaml. Take note of the model’s block size provided in the model card.

4. Build EPP image (Optional)

You can either use the provided Dynamo FrontEnd image as the EPP image or build your own custom Dynamo EPP image by following the steps below.

$ # export env vars
$ export DOCKER_SERVER=ghcr.io/nvidia/dynamo	# Container registry
$ export IMAGE_TAG=YOUR-TAG # Or auto from git tag
$ cd deploy/inference-gateway/epp
$ make all # Do everything in one command
$ # or make all-push to also push
$ 
$ 
$ # Or step-by-step
$ make dynamo-lib # Build Dynamo library and copy to project
$ make image-load # Build Docker image and load locally
$ make image-push # Build and push to registry
$ make info # Check image tag

All-in-one Targets

Target	Description
`make dynamo-lib`	Build Dynamo static library and copy to project
`make all`	Build Dynamo lib + Docker image + load locally
`make all-push`	Build Dynamo lib + Docker image + push to registry

5. Deploy

We recommend deploying the Inference Gateway’s Endpoint Picker (EPP) as a component managed by the Dynamo operator. Alternatively, you can deploy it as a standalone pod. Note that when deploying Dynamo with the Inference Gateway Extension, each worker must have the FrontEnd as a sidecar.

5.a. Deploy as a DGD component (recommended)

We provide an example for Qwen with vLLM below. You have to deploy the Dynamo Graph and the HttpRoute service. For the HttpRoute service, make sure to specify the namespace where your gateway (i.e., kGateway) was deployed, as shown below.

  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: inference-gateway
      namespace: my-model # the namespace where your gateway is deployed.

$ cd <dynamo-source-root>
$ kubectl apply -f examples/backends/vllm/deploy/gaie/agg.yaml -n my-model
$ kubectl apply -f examples/backends/vllm/deploy/gaie/http-route.yaml -n my-model

Examples for other models can be found in the recipes folder.

$ # Deploy the PVC. First update `storageClassName` in recipes/llama-3-70b/model-cache/model-cache.yaml to match your cluster.
$ kubectl apply -f recipes/llama-3-70b/model-cache/model-cache.yaml  -n ${NAMESPACE}
$ kubectl apply -f recipes/llama-3-70b/model-cache/model-download.yaml  -n ${NAMESPACE}

We provide llama-3-70b vLLM examples under recipes/llama-3-70b/vllm/agg/gaie/ for aggregated serving and recipes/llama-3-70b/vllm/disagg-single-node/gaie/ for disaggregated serving. Use the appropriate folder in the commands below.

$ # Deploy your Dynamo Graph.
$ 
$ # agg
$ kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/deploy.yaml -n ${NAMESPACE}
$ # Deploy the GAIE http-route CR.
$ kubectl apply -f recipes/llama-3-70b/vllm/agg/gaie/http-route.yaml -n ${NAMESPACE}
$ 
$ # or disagg
$ kubectl apply -f recipes/llama-3-70b/vllm/disagg-single-node/gaie/deploy.yaml  -n ${NAMESPACE}
$ kubectl apply -f recipes/llama-3-70b/vllm/disagg-single-node/gaie/http-route.yaml -n ${NAMESPACE}

When using GAIE, the FrontEnd does not choose the workers; routing is determined in the EPP.
The FrontEnd must run with --router-mode direct so that it respects the EPP’s routing decisions, which are passed via request headers.
Use the frontendSidecar field on a worker service to have the operator automatically inject a fully configured frontend sidecar container with all required Dynamo env vars, probes, and ports:

frontendSidecar:
  image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
  args:
    - --router-mode
    - direct
  envFromSecret: hf-token-secret

The preselected workers (decode and prefill, in the case of disaggregated serving) are passed in the request headers.
The --router-mode direct flag ensures that routing respects this selection.

Startup Probe Timeout: The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures). If your model takes longer to load, increase the failureThreshold in the EPP’s startupProbe. For example, to allow 60 minutes for startup:

extraPodSpec:
  mainContainer:
    startupProbe:
      failureThreshold: 360  # 10s × 360 = 60 minutes
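The arithmetic behind failureThreshold can be sketched as follows. The 10-second probe period comes from the default described above; the helper name `failure_threshold` is hypothetical, and you should confirm the actual probe period of your EPP before relying on the result.

```python
import math


def failure_threshold(timeout_minutes: float, period_seconds: int = 10) -> int:
    """Smallest failureThreshold whose total probe window covers the timeout."""
    if timeout_minutes <= 0 or period_seconds <= 0:
        raise ValueError("timeout and period must be positive")
    return math.ceil(timeout_minutes * 60 / period_seconds)


print(failure_threshold(30))  # the default 30-minute window
print(failure_threshold(60))  # the 60-minute example
print(failure_threshold(45))
```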

Gateway Namespace: Note that this assumes your gateway is installed into NAMESPACE=my-model (the examples’ default). If you installed it into a different namespace, adjust the HttpRoute entry in http-route.yaml.

5.b. Deploy as a standalone pod

We do not recommend this method, but hints on how to do it are provided here.

5.b.1 Deploy Your Model

5.b.2 Install the Dynamo GAIE Helm chart

$ cd deploy/inference-gateway/standalone
$ 
$ # Export the EPP image - use the Dynamo FrontEnd image or build your own EPP image (see section 4)
$ export EPP_IMAGE=<the-epp-image>

$ helm upgrade --install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml --set-string extension.image=$EPP_IMAGE

By default, the Kubernetes discovery mechanism is used. If you prefer etcd, add the --set epp.dynamo.useEtcd=true flag, as shown below.

$ helm upgrade --install dynamo-gaie ./helm/dynamo-gaie -n my-model -f ./vllm_agg_qwen.yaml --set-string extension.image=$EPP_IMAGE --set epp.dynamo.useEtcd=true

Key configurations include:

An InferenceModel resource for the Qwen model
A service for the inference gateway
Required RBAC roles and bindings
RBAC permissions
dynamoGraphDeploymentName - the name of the Dynamo Graph where your model is deployed.

Configuration: You can configure the plugin by setting environment variables, either in the EPP component of your DGD for the operator-managed installation or in your values.yaml.

Common Vars for Routing Configuration:

Set DYN_BUSY_THRESHOLD to configure the upper bound on how “full” a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. The default value is negative, which disables this check.
Set DYN_ENFORCE_DISAGG=true to strictly enforce disaggregated mode. When enabled, requests fail if prefill workers have not registered yet. Without this, requests arriving before prefill workers are discovered fall through to decode-only routing. Prefill errors always fail requests regardless of this setting.
Set DYN_OVERLAP_SCORE_WEIGHT to control how heavily the score weighs token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). A higher weight biases routing toward workers with similar cached prefixes (default: 1).
Set DYN_ROUTER_TEMPERATURE to soften or sharpen the selection curve when combining scores. Workers are sampled via softmax: a low temperature makes the router pick the top candidate deterministically, while a higher temperature lets lower-scoring workers through more often (exploration) (default: 0.0).
Set DYN_USE_KV_EVENTS=false to stop the router from listening for KV events while using KV routing (default: true). SGLang workers require --kv-events-config and TRT-LLM workers require --publish-events-and-metrics to publish KV events. For vLLM, KV events are auto-configured when prefix caching is active, but this behavior is deprecated; pass --kv-events-config explicitly.
DYN_ROUTER_REPLICA_SYNC: Enable replica synchronization (default: false)
DYN_ROUTER_TRACK_ACTIVE_BLOCKS: Track active blocks (default: true)
DYN_ROUTER_TRACK_OUTPUT_BLOCKS: Track output blocks during generation (default: false)
See the KV cache routing design for details.
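To build intuition for how the busy threshold, the overlap weight, and the temperature interact, consider the following sketch. It is an illustration under stated assumptions, not the Dynamo router's code: the cost formula (weighted prefill blocks plus active blocks, lower is better), the `Worker` fields, and returning None when every worker is over the threshold are all hypothetical choices.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    worker_id: str
    overlap_blocks: int   # request blocks already cached on this worker
    active_blocks: int    # blocks currently in use (a stand-in for load)
    load_fraction: float  # 0.0 (idle) to 1.0 (full)


def cost(worker: Worker, request_blocks: int, overlap_weight: float) -> float:
    """Assumed cost: weighted blocks still to prefill, plus current load."""
    prefill_blocks = max(request_blocks - worker.overlap_blocks, 0)
    return overlap_weight * prefill_blocks + worker.active_blocks


def pick_worker(
    workers: List[Worker],
    request_blocks: int,
    overlap_weight: float = 1.0,
    temperature: float = 0.0,
    busy_threshold: float = -1.0,
    rng: Optional[random.Random] = None,
) -> Optional[Worker]:
    # A negative threshold disables the busy check, mirroring the default.
    candidates = [
        w for w in workers
        if busy_threshold < 0 or w.load_fraction <= busy_threshold
    ]
    if not candidates:
        return None  # assumption: this sketch does not model that case
    costs = [cost(w, request_blocks, overlap_weight) for w in candidates]
    if temperature <= 0:
        return candidates[costs.index(min(costs))]  # deterministic pick
    # Softmax over negative cost; subtracting the minimum keeps exp() stable.
    lowest = min(costs)
    weights = [math.exp(-(c - lowest) / temperature) for c in costs]
    return (rng or random.Random()).choices(candidates, weights=weights, k=1)[0]


workers = [
    Worker("a", overlap_blocks=3, active_blocks=10, load_fraction=0.5),
    Worker("b", overlap_blocks=0, active_blocks=2, load_fraction=0.1),
]
print(pick_worker(workers, request_blocks=4, overlap_weight=1.0).worker_id)
print(pick_worker(workers, request_blocks=4, overlap_weight=5.0).worker_id)
```

With a weight of 1, the lightly loaded worker wins; raising the weight to 5 makes the cached prefix on the busier worker outweigh its load. In this sketch, a temperature above zero turns the pick into a weighted draw rather than a strict minimum.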

Stand-Alone installation only:

Overwrite the DYN_NAMESPACE env var if needed to match your model’s dynamo namespace.

6. Verify Installation

Check that all resources are properly deployed:

$ kubectl get inferencepool
$ kubectl get httproute
$ kubectl get service
$ kubectl get gateway

Sample output:

$ # kubectl get inferencepool
$ NAME        AGE
$ qwen-pool   33m
$ 
$ # kubectl get httproute
$ NAME        HOSTNAMES   AGE
$ qwen-route               33m

7. Usage

The Inference Gateway provides HTTP endpoints for model inference.

1: Populate gateway URL for your k8s cluster

To test the gateway in minikube, use one of the following options.

a. Use minikube tunnel to expose the gateway to the host. This requires sudo access to the host machine. Alternatively, you can use port-forward to expose the gateway to the host, as shown in alternative (b).

$ # in first terminal
$ ps aux | grep "minikube tunnel" | grep -v grep # make sure minikube tunnel is not already running.
$ minikube tunnel # start the tunnel
$ 
$ # in second terminal where you want to send inference requests
$ GATEWAY_URL=$(kubectl get svc inference-gateway -n my-model -o jsonpath='{.spec.clusterIP}') && echo $GATEWAY_URL

b. Use port-forward to expose the gateway to the host

$ # in first terminal
$ kubectl port-forward svc/inference-gateway 8000:80 -n ${NAMESPACE} # set NAMESPACE to wherever you installed the gateway, e.g. kgateway-system
$ 
$ # in second terminal where you want to send inference requests
$ GATEWAY_URL=http://localhost:8000

2: Check models deployed to inference gateway

a. Query models:

$ # in the second terminal, where GATEWAY_URL is set
$ 
$ curl $GATEWAY_URL/v1/models | jq .

Sample output:

{
  "data": [
    {
      "created": 1753768323,
      "id": "Qwen/Qwen3-0.6B",
      "object": "object",
      "owned_by": "nvidia"
    }
  ],
  "object": "list"
}

b. Send inference request to gateway:

$ MODEL_NAME="Qwen/Qwen3-0.6B"
$ curl $GATEWAY_URL/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{
>       "model": "'"${MODEL_NAME}"'",
>       "messages": [
>       {
>           "role": "user",
>           "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
>       }
>       ],
>       "stream":false,
>       "max_tokens": 30,
>       "temperature": 0.0
>     }'

Sample inference output:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "audio": null,
        "content": "<think>\nOkay, I need to develop a character background for the user's query. Let me start by understanding the requirements. The character is an",
        "function_call": null,
        "refusal": null,
        "role": "assistant",
        "tool_calls": null
      }
    }
  ],
  "created": 1753768682,
  "id": "chatcmpl-772289b8-5998-4f6d-bd61-3659b684b347",
  "model": "Qwen/Qwen3-0.6B",
  "object": "chat.completion",
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 29,
    "completion_tokens_details": null,
    "prompt_tokens": 196,
    "prompt_tokens_details": null,
    "total_tokens": 225
  }
}

If you have more than one HttpRoute running on the cluster, add the host to your HttpRoute.yaml and add the header to every request, for example curl -H "Host: llama3-70b-agg.example.com" ...

spec:
  hostnames:
    - llama3-70b-agg.example.com
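For clients that do not use curl, the same chat completion request, including the optional Host header, can be sketched with Python's standard library. The helper names `build_chat_request` and `send` are hypothetical, and the gateway URL must include a scheme such as http://, since a bare cluster IP is not accepted by urllib. The network call is left commented out because it needs a reachable gateway.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(
    gateway_url: str,
    model: str,
    prompt: str,
    host: Optional[str] = None,
    max_tokens: int = 30,
) -> urllib.request.Request:
    """Build a POST to /v1/chat/completions, optionally pinning the Host header."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    headers = {"Content-Type": "application/json"}
    if host:
        # Needed only when several HttpRoutes share the gateway.
        headers["Host"] = host
    return urllib.request.Request(
        gateway_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


req = build_chat_request(
    "http://localhost:8000",
    "Qwen/Qwen3-0.6B",
    "Hello",
    host="llama3-70b-agg.example.com",  # hypothetical hostname from the example
)
print(req.full_url)
print(req.get_header("Host"))
# reply = send(req)  # requires a reachable gateway
# print(reply["choices"][0]["message"]["content"])
```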

8. Deleting the installation

To uninstall, run:

$ kubectl delete dynamoGraphDeployment vllm-agg
$ helm uninstall dynamo-gaie -n my-model
$ 
$ # To uninstall GAIE
$ # 1. Delete the inference-gateway
$ kubectl delete gateway inference-gateway --ignore-not-found
$ 
$ # 2. Uninstall kgateway helm releases
$ helm uninstall kgateway -n kgateway-system
$ helm uninstall kgateway-crds -n kgateway-system
$ 
$ # 3. Delete the kgateway-system namespace (optional, cleans up everything in it)
$ kubectl delete namespace kgateway-system --ignore-not-found
$ 
$ # 4. Delete the Inference Extension CRDs
$ IGW_LATEST_RELEASE=v1.2.1
$ kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/${IGW_LATEST_RELEASE}/manifests.yaml --ignore-not-found
$ 
$ # 5. Delete the Gateway API CRDs
$ GATEWAY_API_VERSION=v1.4.1
$ kubectl delete -f https://github.com/kubernetes-sigs/gateway-api/releases/download/$GATEWAY_API_VERSION/standard-install.yaml --ignore-not-found

Gateway API Inference Extension Integration

This section documents the updated plugin implementation for Gateway API Inference Extension v1.2.1.

Router bookkeeping operations

The EPP performs the Dynamo router's bookkeeping operations, so the FrontEnd’s router does not have to sync its state.

Header Routing Hints

Since v1.2.1, the EPP uses a header-only approach for communicating routing decisions. The plugins set HTTP headers that are forwarded to the backend workers.

Headers Set by Dynamo Plugins

Header	Description	Set By
`x-worker-instance-id`	Primary worker ID (decode worker in disagg mode)	kv-aware-scorer
`x-prefill-instance-id`	Prefill worker ID (disaggregated mode only)	kv-aware-scorer
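As an illustration of how a component behind the gateway could interpret these headers, the sketch below turns them into a routing decision. Only the two header names come from the table above; the `RoutingDecision` type, the function name, and the example ID values are hypothetical, and this is not the FrontEnd's actual code.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

WORKER_HEADER = "x-worker-instance-id"
PREFILL_HEADER = "x-prefill-instance-id"


@dataclass(frozen=True)
class RoutingDecision:
    worker_id: str
    prefill_id: Optional[str]

    @property
    def disaggregated(self) -> bool:
        """True when a separate prefill worker was selected."""
        return self.prefill_id is not None


def parse_routing_headers(headers: Mapping[str, str]) -> RoutingDecision:
    """Extract the EPP's worker selection from request headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    worker_id = lowered.get(WORKER_HEADER)
    if not worker_id:
        raise ValueError(f"missing {WORKER_HEADER} header")
    return RoutingDecision(worker_id, lowered.get(PREFILL_HEADER) or None)


# Hypothetical ID values; the real format is defined by the Dynamo plugins.
disagg = parse_routing_headers({"X-Worker-Instance-Id": "42", "x-prefill-instance-id": "7"})
agg = parse_routing_headers({"x-worker-instance-id": "42"})
print(disagg.disaggregated, agg.disaggregated)
```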
