EKS Dynamo Networking Prerequisites
For *-eks-ubuntu-inference-dynamo recipes, AICR configures
dynamo-platform with Kubernetes-native discovery and the standard NATS
event plane for KV-cache and runtime events:
- nats on TCP 4222
This NATS dependency is new as of the Dynamo 1.2 bump, which switched discovery
to the NATS event plane. A cluster whose system-node security group only
allowlisted the pre-1.2 control-plane ports will not have 4222 open, so a
bundle that worked on Dynamo 1.0.x can start failing purely from the version
bump — add the 4222 rule below.
Frontend-to-worker inference request/response traffic is separate: Dynamo 1.2
defaults DYN_REQUEST_PLANE to TCP, and AICR does not override it to NATS. The
worker runtime relays local vLLM ZMQ KV-cache events onto the NATS-backed event
plane so the KV router or EPP can consume live cache state.
If system components and GPU workloads are on different node groups/security groups, these ports may be blocked from GPU nodes to system nodes. Typical symptoms:
- JetStream not available (NATS unreachable)
- Dynamo frontend and vLLM worker pods stuck in CrashLoopBackOff, with Exception: Failed to connect to NATS: timed out in the frontend log
- Worker startup probes failing with connection refused because the process exits before serving
- The inference-perf performance validator failing after its workload-readiness (10 min) and health (5 min) gates lapse (roughly 15 min in total) while deployment and conformance pass; the workload never reaches a ready state
You can confirm reachability directly from a GPU node before re-running. The
toleration is required because the GPU node groups on these clusters are
tainted (NoSchedule/NoExecute); without it the probe pod stays Pending
and never runs:
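A probe along these lines can be sketched as a small Python script that emits a Pod manifest. This is a minimal sketch under stated assumptions, not the recipe's own tooling: the NATS Service DNS name, the pod name, the container image, and the helper name build_probe_pod are all hypothetical and need to be adapted to the cluster. The manifest tolerates every taint so the pod can land on a tainted GPU node, and the container performs a single TCP connect to port 4222.

```python
import json

# Hypothetical values; none of these names come from the recipe itself.
NATS_HOST = "dynamo-platform-nats.dynamo-system.svc"  # assumed Service DNS name
NATS_PORT = 4222


def build_probe_pod(host, port, node_name=None, timeout_s=5):
    """Return a Pod manifest (as a dict) that attempts one TCP connect."""
    # The container exits non-zero if the connect fails or times out.
    check = (
        "import socket; "
        f"socket.create_connection(({host!r}, {port}), timeout={timeout_s}); "
        "print('reachable')"
    )
    spec = {
        "restartPolicy": "Never",
        # Tolerate every taint (NoSchedule and NoExecute included) so the
        # pod does not stay Pending on tainted GPU node groups.
        "tolerations": [{"operator": "Exists"}],
        "containers": [
            {
                "name": "probe",
                "image": "python:3.12-alpine",  # assumed image
                "command": ["python", "-c", check],
            }
        ],
    }
    if node_name:
        # Pin the probe to one GPU node via the well-known hostname label.
        spec["nodeSelector"] = {"kubernetes.io/hostname": node_name}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "nats-probe"},  # hypothetical pod name
        "spec": spec,
    }


if __name__ == "__main__":
    print(json.dumps(build_probe_pod(NATS_HOST, NATS_PORT), indent=2))
```

Since kubectl accepts JSON manifests, the output can be piped into kubectl apply -f - and the result read from the pod's logs and exit status. A pod that ends in Error with a timeout points at the security group rather than at NATS itself.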
The conformance validator’s ai-service-metrics check adds a third requirement:
it dials Prometheus over the cluster Service (typically
kube-prometheus-prometheus.monitoring.svc:9090). The orchestrator Job that
runs the check tolerates every taint and now sets a preferred
dependencyAffinity toward Prometheus, so the scheduler co-locates it with the
Prometheus pod when possible. The preference is best-effort, not required, so it
can still fall back to any worker node (e.g. if the Prometheus node is
unschedulable) — including one whose ENI is in a security group that cannot
reach the Prometheus pod.
When that happens, the dial times out after 5 s and the check is marked failed.
On a fallback placement the outcome can be non-deterministic from run to run: scheduling tie-breaks and image-locality scoring decide which node wins, so a re-run on a “freshly working” cluster is not a reliable signal that the SG topology is correct.
The preferred dependencyAffinity (issue #933,
resolved) makes this far less likely, but because it is best-effort the 9090
SG rule below remains the reliable cluster-side guarantee.
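To test a given node group without waiting for the validator, the same kind of dial can be approximated by hand. The snippet below is a minimal sketch, not the validator's actual code: can_dial is a hypothetical helper, and the 5 s default simply mirrors the timeout described above.

```python
import socket


def can_dial(host, port, timeout_s=5.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts (traffic silently dropped by a security group),
        # refused connections, and DNS resolution failures.
        return False
```

Run from a pod on each worker node group, for example as can_dial("kube-prometheus-prometheus.monitoring.svc", 9090) if that is the Service name in use, a False result on one group but not another suggests a missing security group rule rather than a Prometheus problem.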
Required Security Group Rules
Allow ingress from the GPU node security group to the system node security group on:
- TCP 4222 - NATS event plane (dynamo-platform)
- TCP 9090 - Prometheus (required for the ai-service-metrics conformance check)
The 9090 rule is required as a fallback guarantee: the orchestrator prefers
to co-locate with Prometheus, but that preference is best-effort, so it can
still land on any worker node. Every node group whose pods can host the
orchestrator must therefore be able to reach the Prometheus pod’s IP on 9090.
On clusters with separate customer/system ENI subnets (e.g. DGXC EKS), this
means the system SG must accept ingress from the customer SG (and any other
worker SG), not only from itself.
If the cluster has more than two worker security groups (e.g. a separate
inference node group), repeat the 9090 rule for each non-system SG that can
host pods — on a fallback placement the orchestrator may land on any of them.
Example:
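One way to express these rules is with boto3's EC2 client. This is a sketch under assumptions: the security group IDs are placeholders, the helper names build_ingress_permissions and apply_rules are hypothetical, and the rule descriptions are illustrative. Port 4222 is opened from the GPU node SG only, while 9090 is opened from the GPU node SG and every additional worker SG.

```python
def build_ingress_permissions(gpu_sg_id, other_worker_sg_ids=()):
    """Build the IpPermissions list for the system node security group."""

    def pairs(sg_ids, port):
        return [
            {"GroupId": sg, "Description": f"worker nodes to system nodes TCP {port}"}
            for sg in sg_ids
        ]

    return [
        {
            # NATS event plane: only GPU workloads need to reach it.
            "IpProtocol": "tcp",
            "FromPort": 4222,
            "ToPort": 4222,
            "UserIdGroupPairs": pairs([gpu_sg_id], 4222),
        },
        {
            # Prometheus: every SG that can host the orchestrator pod.
            "IpProtocol": "tcp",
            "FromPort": 9090,
            "ToPort": 9090,
            "UserIdGroupPairs": pairs([gpu_sg_id, *other_worker_sg_ids], 9090),
        },
    ]


def apply_rules(system_sg_id, gpu_sg_id, other_worker_sg_ids=()):
    """Add the ingress rules to the system node SG (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2")
    return ec2.authorize_security_group_ingress(
        GroupId=system_sg_id,
        IpPermissions=build_ingress_permissions(gpu_sg_id, other_worker_sg_ids),
    )
```

A call such as apply_rules("sg-SYSTEM", "sg-GPU", ["sg-INFERENCE"]) uses placeholder IDs that must be replaced with the cluster's real group IDs. Keeping the rule construction separate from the API call makes it easy to review the permissions before anything is changed.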