Installation Guide
This guide walks you through installing everything needed to deploy models with Dynamo on Kubernetes. Follow the steps in order — each builds on the previous one.
Prerequisites
Before you begin, make sure you have:
- A Kubernetes cluster (v1.24+) with GPU-capable nodes. See the cloud provider guides if you need to create one:
- Amazon EKS | Azure AKS | Google GKE
- For local development: Minikube Setup
- kubectl v1.24+ — Install kubectl
- Helm v3.0+ — Install Helm
Cloud provider GPU drivers: The GPU Operator (Step 1) installs GPU drivers for you. When creating your cluster’s GPU node pools, do not enable provider-managed GPU driver installation (e.g., skip AKS GPU driver install, don’t use GKE --accelerator gpu-driver-version=latest). If your nodes already have provider-managed drivers, see the GPU Operator step for how to handle this.
Verify your tools:
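```bash
kubectl version --client   # expect v1.24 or newer
helm version               # expect v3.x
```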
Overview
Every Dynamo deployment requires two Helm charts: the GPU Operator (Step 1) and the Dynamo Platform (Step 2). Everything else is optional. Decide what optional components you need before starting so you can install them in Step 3.
Grove + KAI Scheduler — Grove is the default multinode orchestrator. The operator returns a hard error on multinode deployments if neither Grove nor LeaderWorkerSet (LWS) is available. KAI Scheduler is optional but recommended alongside Grove for GPU-aware scheduling. See Grove for details.
Network Operator / RDMA — Without RDMA, disaggregated inference falls back to TCP automatically, but with severe performance degradation (~98s TTFT vs ~200-500ms with RDMA). Required for any production disaggregated deployment. Setup is cloud-provider-specific — see the Disaggregated Communication Guide and your cloud provider guide.
kube-prometheus-stack — Required for the Planner’s sla optimization mode (it reads live TTFT/ITL metrics from Prometheus). Also required for KEDA/HPA-based autoscaling. The Planner’s throughput mode can function without it using internal queue depth signals, but metrics-driven features will not work. See Metrics for details.
Shared storage — Prevents each pod from downloading model weights independently. Without it, large models (>70B) take hours to download per pod, and many replicas will hit HuggingFace rate limits. Not enforced by the operator — this is an operational concern. See Model Caching for the full walkthrough.
Step 1: Install the GPU Operator
The NVIDIA GPU Operator automates deployment of all NVIDIA software components needed to provision GPUs — drivers, container toolkit, device plugin, and monitoring.
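A minimal install sketch, assuming the standard NVIDIA Helm repository; version pinning and provider-specific values (see the notes below) are omitted:

```bash
# Add the NVIDIA Helm repository and install the GPU Operator.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Uncomment if your nodes already have provider-managed GPU drivers (see note below):
# DRIVER_FLAGS="--set driver.enabled=false"

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  ${DRIVER_FLAGS:-}
```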
If your GPU nodes already have provider-managed drivers installed (e.g., you used GKE’s --accelerator gpu-driver-version=latest), uncomment the driver.enabled=false line above so the operator doesn’t conflict with the existing drivers.
Some cloud providers require additional GPU Operator configuration. See your provider guide for details:
- AKS GPU Operator setup — skip AKS-managed GPU driver install on node pools
- EKS GPU Operator setup
- GKE GPU Operator setup — LD_LIBRARY_PATH and ldconfig init requirements
Verify the GPU Operator is running:
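```bash
kubectl get pods -n gpu-operator
# All pods should be Running or Completed. GPU nodes should now advertise the
# nvidia.com/gpu resource:
kubectl describe node <gpu-node-name> | grep -i 'nvidia.com/gpu'
```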
Step 2: Install the Dynamo Platform
Set your environment variables:
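The values below are examples; pick the namespace you want and the release you are installing:

```bash
export NAMESPACE=dynamo-system    # example namespace for the Dynamo platform
export RELEASE_VERSION=x.y.z      # replace with the Dynamo release you are installing
```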
All helm install commands can be customized with your own values file: helm install ... -f your-values.yaml
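A sketch of the platform install. The chart location (NGC) and the optional value keys are assumptions used to illustrate the shape of the command; confirm the exact chart URL, version, and keys against the release notes and the chart's values.yaml:

```bash
# Optional component flags (Step 3) -- uncomment what you need before installing.
# Value keys are illustrative; check the chart's values.yaml.
# EXTRA_FLAGS="--set grove.install=true --set kai-scheduler.install=true"
# EXTRA_FLAGS="${EXTRA_FLAGS} --set <prometheusEndpoint key>=http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"

helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace ${NAMESPACE} --create-namespace \
  ${EXTRA_FLAGS:-}
```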
Shared/Multi-Tenant Clusters: If a cluster-wide Dynamo operator is already running, do not install another one. Check with:
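```bash
# The operator Deployment name includes the Helm release name, so match loosely:
kubectl get deployments --all-namespaces | grep -i dynamo-operator
```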
Namespace-restricted mode (namespaceRestriction.enabled=true) is deprecated and will be removed in a future release. Use the default cluster-wide mode for all new deployments.
Verify the Dynamo platform is running:
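```bash
kubectl get pods -n ${NAMESPACE}
# Expect the operator and its etcd/NATS dependencies to reach Running.
```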
Step 3: Install Optional Components
The Dynamo install command above includes commented flags for each optional component. Install the component first, then uncomment the corresponding flag before running helm install in Step 2 (or run helm upgrade --reuse-values with the flag if you’ve already installed Dynamo).
Multinode:
Multinode deployments require either Grove + KAI Scheduler or an alternative orchestrator setup (LeaderWorkerSet + Volcano) to enable gang scheduling for workloads that span multiple nodes. See the Multinode Deployment Guide for details on orchestrator selection and configuration.
Grove + KAI Scheduler
There are two ways to enable Grove and KAI Scheduler, controlled by which flags you uncomment in the Dynamo install command:
- install=true — Dynamo installs and manages Grove/KAI as bundled subcharts. Simplest path; recommended for dev/testing.
- enabled=true — Tells Dynamo that Grove/KAI are already installed and externally managed. Use this when you install Grove/KAI separately (e.g., to manage their lifecycle independently or share them across namespaces). Recommended for production.
For the enabled=true path, install Grove and KAI Scheduler separately first. See the Grove installation guide and KAI Scheduler deployment guide for instructions.
Compatibility matrix:
LWS + Volcano
If you are not using Grove for multinode, you can use LeaderWorkerSet (LWS) (>= v0.7.0) with Volcano for gang scheduling. Both must be installed before deploying multinode workloads.
- Install Volcano
- Install LWS (>= v0.7.0) with Volcano gang scheduling enabled; a sketch of both installs follows this list
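A sketch of both installs, assuming the upstream community charts (the Volcano Helm repo and the LWS OCI chart on registry.k8s.io); pin versions to match the compatibility requirements above, and see the LWS docs for any values needed to wire up Volcano gang scheduling:

```bash
# Volcano gang scheduler
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano \
  --namespace volcano-system --create-namespace

# LeaderWorkerSet (>= v0.7.0)
helm install lws oci://registry.k8s.io/lws/charts/lws \
  --version 0.7.0 \
  --namespace lws-system --create-namespace
```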
See the LWS docs and Volcano docs for configuration options, and the Multinode Deployment Guide for orchestrator selection.
Network Operator / RDMA
RDMA setup is cloud-provider-specific. See the Disaggregated Communication Guide for transport options, UCX configuration, and performance expectations, and your cloud provider guide for setup instructions:
- AKS — InfiniBand + Network Operator
- EKS — EFA device plugin (also see the EFA configuration guide)
- GKE — GPUDirect-TCPXO
kube-prometheus-stack
Install Prometheus before running the Dynamo install command so you can set the endpoint in one pass:
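For example (the release name prometheus and namespace monitoring are choices, not requirements; they determine the endpoint URL you pass to Dynamo):

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```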
Then uncomment the prometheusEndpoint line in the Dynamo install command. The Dynamo operator automatically creates PodMonitors for its components. See Metrics for dashboard setup and available metrics, and Logging for the Grafana Loki + Alloy logging stack.
Shared Storage for Model Caching
Set up a ReadWriteMany PVC so all pods share downloaded model weights instead of each downloading independently. No Dynamo chart flags are needed — storage is configured in your deployment spec. Setup is cloud-provider-specific:
- AKS — Azure Files / Managed Lustre
- EKS — EFS
- GKE — Cloud Filestore (see GKE guide)
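As a starting point, a minimal ReadWriteMany claim might look like the sketch below; the StorageClass name is hypothetical and must map to an RWX-capable provisioner from the list above:

```bash
cat <<'EOF' | kubectl apply -n ${NAMESPACE} -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: shared-model-cache   # hypothetical RWX-capable StorageClass
  resources:
    requests:
      storage: 500Gi
EOF
```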
For large clusters with frequent model updates, consider Model Express for P2P model distribution. See Model Caching for the full walkthrough including the download Job and mount configuration.
Step 4: Pre-Deployment Check
Run the pre-deployment check script to validate that your cluster is ready for deployments.
This checks kubectl connectivity, default StorageClass configuration, GPU node availability, and GPU Operator status. See Pre-Deployment Checks for details.
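If you prefer to spot-check by hand, roughly equivalent commands (generic Kubernetes checks, not the script itself) are:

```bash
kubectl cluster-info                          # kubectl connectivity
kubectl get storageclass                      # one class should be marked (default)
kubectl get nodes -L nvidia.com/gpu.present   # GPU nodes (label set by the GPU Operator's feature discovery)
kubectl get pods -n gpu-operator              # GPU Operator health
```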
Next Steps
Your cluster is ready. Follow the Model Deployment Guide to deploy a model using DGDR.
Troubleshooting
“VALIDATION ERROR: Cannot install cluster-wide Dynamo operator”
Cause: Attempting cluster-wide install on a shared cluster with existing namespace-restricted operators.
Solution: Migrate the existing namespace-restricted operators to cluster-wide mode. Namespace-restricted mode is deprecated.
CRDs already exist
Cause: Installing CRDs on a cluster where they’re already present (common on shared clusters).
Solution: CRDs are installed automatically by the Helm chart. If you encounter conflicts, check existing CRDs with kubectl get crd | grep dynamo.
Pods not starting?
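Standard Kubernetes triage applies; a few first checks:

```bash
kubectl get pods -n ${NAMESPACE}
kubectl describe pod <pod-name> -n ${NAMESPACE}    # Events: scheduling, image pull, GPU resource issues
kubectl logs <pod-name> -n ${NAMESPACE} --previous # logs from a previously crashed container
```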
Bitnami etcd “unrecognized” image?
Add to the helm install command:
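The exact value keys depend on the chart version; as an illustration, the usual workaround for Bitnami-packaged subcharts is to point at Bitnami's legacy image repository and relax the chart's image-origin check (verify both keys against the platform chart's values.yaml):

```bash
  --set etcd.image.repository=bitnamilegacy/etcd \
  --set global.security.allowInsecureImages=true
```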
Clean uninstall?
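Assuming the release and namespace names used in Step 2:

```bash
helm uninstall dynamo-platform -n ${NAMESPACE}
kubectl delete namespace ${NAMESPACE}
# Helm leaves CRDs in place; list them and delete only if nothing else on the cluster uses Dynamo:
kubectl get crd | grep dynamo
```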
Advanced: Build from Source
If you need to contribute to Dynamo or use the latest unreleased features from the main branch:
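Clone the repository and install the charts from your local checkout; the chart path below is illustrative, so follow the repository's Kubernetes deployment docs for the authoritative steps:

```bash
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
helm install dynamo-platform ./deploy/cloud/helm/platform \
  --namespace ${NAMESPACE} --create-namespace   # chart path is illustrative; see repo docs
```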