Inference Gateway (GAIE)
Inference Gateway (GAIE)
Inference Gateway (GAIE)
Integrate Dynamo with the Gateway API Inference Extension for intelligent KV-aware request routing at the gateway layer.
EPP’s default kv-routing approach is not token-aware because the prompt is not tokenized. But the Dynamo plugin uses a token-aware KV algorithm. It employs the dynamo router which implements kv routing by running your model’s tokenizer inline. The EPP plugin configuration lives in helm/dynamo-gaie/epp-config-dynamo.yaml per EPP convention.
Dynamo Integration with the Inference Gateway supports Aggregated and Disaggregated Serving. The epp config is the same for both. If no prefill workers found the service degrades gracefully to perform aggregated serving. If you want to use LoRA deploy Dynamo without the Inference Gateway.
Currently, these setups are only supported with the kGateway based Inference Gateway.
See Quickstart Guide to install Dynamo Kubernetes Platform.
First, deploy an inference gateway service. In this example, we’ll install kgateway based gateway implementation.
Note: The manifest at config/manifests/gateway/kgateway/gateway.yaml uses gatewayClassName: agentgateway, but kGateway’s helm chart creates a GatewayClass named kgateway. The patch command in the script fixes this mismatch.
Do not forget docker registry secret if needed.
Do not forget to include the HuggingFace token.
Create a model configuration file similar to the vllm_agg_qwen.yaml for your model. This file demonstrates the values needed for the vLLM aggregated setup in agg.yaml Take a note of the model’s block size provided in the model card.
You can either use the provided Dynamo FrontEnd image for the EPP image or you need to build your own Dynamo EPP custom image following the steps below.
We recommend deploying Inference Gateway’s Endpoint Picker as a Dynamo operator’s managed component. Alternatively, you could deploy it as a standalone pod. Note that when deploying Dynamo with the Inference Gateway Extension each worker must have the FrontEnd as a sidecar.
We provide an example for the Qwen vLLM below. You have to deploy the Dynamo Graph and the HttpRoute service. For the HttpRoute service make sure to specify the namespace where your gateway (i.e. kGateway was deployed) as shown below.
Examples for other models can be found in the recipes folder.
We provide examples for llama-3-70b vLLM under the recipes/llama-3-70b/vllm/agg/gaie/ for aggregated and recipes/llama-3-70b/vllm/disagg-single-node/gaie/ for disaggregated serving.
Use the proper folder in commands below.
--router-mode direct so that it respects the EPP’s routing decisions passed via request headers.frontendSidecar field on a worker service to have the operator automatically inject a fully configured frontend sidecar container with all required Dynamo env vars, probes, and ports:--router-mode direct flag ensures the routing respects this selection.Startup Probe Timeout: The EPP has a default startup probe timeout of 30 minutes (10s × 180 failures).
If your model takes longer to load, increase the failureThreshold in the EPP’s startupProbe. For example,
to allow 60 minutes for startup:
Gateway Namespace
Note that this assumes your gateway is installed into NAMESPACE=my-model (examples’ default)
If you installed it into a different namespace, you need to adjust the HttpRoute entry in http-route.yaml.
We do not recommend this method but there are hints on how to do this here.
By default, the Kubernetes discovery mechanism is used. If you prefer etcd, please use the --set epp.dynamo.useEtcd=true flag below.
Key configurations include:
Configuration You can configure the plugin by setting environment variables in the EPP component of your DGD in case of the operator-managed installation or in your values.yaml.
Common Vars for Routing Configuration:
DYN_BUSY_THRESHOLD to configure the upper bound on how “full” a worker can be (often derived from kv_active_blocks or other load metrics) before the router skips it. If the selected worker exceeds this value, routing falls back to the next best candidate. By default the value is negative meaning this is not enabled.DYN_ENFORCE_DISAGG=true to strictly enforce disaggregated mode. When enabled, requests fail if prefill workers have not registered yet. Without this, requests arriving before prefill workers are discovered fall through to decode-only routing. Prefill errors always fail requests regardless of this setting.DYN_OVERLAP_SCORE_WEIGHT to weigh how heavily the score uses token overlap (predicted KV cache hits) versus other factors (load, historical hit rate). Higher weight biases toward reusing workers with similar cached prefixes. (default: 1)DYN_ROUTER_TEMPERATURE to soften or sharpen the selection curve when combining scores. Low temperature makes the router pick the top candidate deterministically; higher temperature lets lower-scoring workers through more often (exploration).DYN_USE_KV_EVENTS=false if you want to disable the router listening for KV events while using kv-routing (default: true). SGLang workers require --kv-events-config and TRT-LLM workers require --publish-events-and-metrics to publish KV events. For vLLM, KV events are auto-configured when prefix caching is active (deprecated — use --kv-events-config explicitly)DYN_ROUTER_TEMPERATURE — Temperature for worker sampling via softmax (default: 0.0)DYN_ROUTER_REPLICA_SYNC — Enable replica synchronization (default: false)DYN_ROUTER_TRACK_ACTIVE_BLOCKS — Track active blocks (default: true)DYN_ROUTER_TRACK_OUTPUT_BLOCKS — Track output blocks during generation (default: false)Stand-Alone installation only:
DYN_NAMESPACE env var if needed to match your model’s dynamo namespace.Check that all resources are properly deployed:
Sample output:
The Inference Gateway provides HTTP endpoints for model inference.
To test the gateway in minikube, use the following command:
a. User minikube tunnel to expose the gateway to the host
This requires sudo access to the host machine. alternatively, you can use port-forward to expose the gateway to the host as shown in alternative (b).
b. use port-forward to expose the gateway to the host
a. Query models:
Sample output:
b. Send inference request to gateway:
Sample inference output:
If you have more than one HttpRoute running on the cluster
Add the host to your HttpRoute.yaml and add the header curl -H "Host: llama3-70b-agg.example.com" ... to every request.
If you need to uninstall run:
This section documents the updated plugin implementation for Gateway API Inference Extension v1.2.1.
EPP performs Dynamo router book keeping operations so the FrontEnd’s Router does not have to sync its state.
Since v1.2.1, the EPP uses a header-only approach for communicating routing decisions. The plugins set HTTP headers that are forwarded to the backend workers.