This guide explains how to deploy Dynamo workloads across multiple nodes. Multinode deployments enable you to scale compute-intensive LLM workloads across multiple physical machines, maximizing GPU utilization and supporting larger models.
Dynamo supports multinode deployments through the multinode section in resource specifications. This allows you to:
For sophisticated multinode deployments, Dynamo integrates with advanced Kubernetes orchestration systems:
These systems provide enhanced scheduling capabilities including topology-aware placement, gang scheduling, and coordinated auto-scaling across multiple nodes.
Features Enabled with Grove:
KAI-Scheduler is a Kubernetes native scheduler optimized for AI workloads at large scale.
Features Enabled with KAI-Scheduler:
The default queue, dynamo, must be created in the cluster. If no queue annotation is specified on the DGD resource, the operator uses the dynamo queue by default. Custom queue names can be specified via the nvidia.com/kai-scheduler-queue annotation, but the queue must exist in the cluster before deployment.

KAI-Scheduler is optional but recommended for advanced scheduling capabilities.
LWS is a simple multinode deployment mechanism that allows you to deploy a workload across multiple nodes.
Volcano is a Kubernetes native scheduler optimized for AI workloads at scale. It is used in conjunction with LWS to provide gang scheduling support.
Dynamo automatically selects the best available orchestrator for multinode deployments using the following logic:
nvidia.com/enable-grove: "false" annotation on your DGD resource
nvidia.com/kai-scheduler-queue annotation

Default (Grove with KAI-Scheduler):
Note: The nvidia.com/kai-scheduler-queue annotation defaults to "dynamo". If you specify a custom queue name, ensure the queue exists in your cluster before deploying. You can verify available queues with kubectl get queues.
Force LWS usage:
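The annotation handling described in this section can be illustrated with a small Python sketch. It is not the operator's implementation: the function names, the returned strings, and the grove_available parameter are hypothetical, and the sketch covers only the two annotations named above (nvidia.com/enable-grove and nvidia.com/kai-scheduler-queue).

```python
# Illustrative sketch only; the real operator logic may differ.
# Function names and return values are hypothetical.

GROVE_ANNOTATION = "nvidia.com/enable-grove"
QUEUE_ANNOTATION = "nvidia.com/kai-scheduler-queue"
DEFAULT_QUEUE = "dynamo"


def select_orchestrator(annotations: dict, grove_available: bool = True) -> str:
    """Return "lws" when Grove is disabled by annotation, else "grove".

    grove_available is an assumption standing in for whatever check the
    operator performs to see which orchestrators are installed.
    """
    if annotations.get(GROVE_ANNOTATION) == "false" or not grove_available:
        return "lws"
    return "grove"


def resolve_queue(annotations: dict) -> str:
    """Return the KAI-Scheduler queue name, defaulting to "dynamo"."""
    return annotations.get(QUEUE_ANNOTATION, DEFAULT_QUEUE)


# No annotations: Grove with the default queue.
print(select_orchestrator({}), resolve_queue({}))

# Forcing LWS through the enable-grove annotation.
print(select_orchestrator({GROVE_ANNOTATION: "false"}))
```

A custom queue name returned by resolve_queue still has to exist in the cluster before deployment; the sketch does not check that.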
multinode Section

The multinode section in a resource specification defines how many physical nodes the workload should span:
The relationship between multinode.nodeCount and gpu is multiplicative:
multinode.nodeCount: Number of physical nodes
gpu: Number of GPUs per node
Total GPUs = multinode.nodeCount × gpu

Examples:

multinode.nodeCount: "2" with gpu: "4" gives 8 total GPUs (4 GPUs per node across 2 nodes)
multinode.nodeCount: "4" with gpu: "8" gives 32 total GPUs (8 GPUs per node across 4 nodes)

The tensor parallelism (tp-size or --tp) in your command/args must match the total number of GPUs.
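The arithmetic above can be checked with a short Python sketch. The helper names total_gpus and check_tp_size are hypothetical and exist only for illustration; they are not part of Dynamo.

```python
# Illustrative helpers; the names are hypothetical, not part of Dynamo.

def total_gpus(node_count: int, gpus_per_node: int) -> int:
    """Total GPUs = multinode.nodeCount x gpu."""
    return node_count * gpus_per_node


def check_tp_size(node_count: int, gpus_per_node: int, tp_size: int) -> None:
    """Raise ValueError if tp_size does not equal the total GPU count."""
    expected = total_gpus(node_count, gpus_per_node)
    if tp_size != expected:
        raise ValueError(
            f"tp-size is {tp_size}, but {node_count} nodes x "
            f"{gpus_per_node} GPUs per node gives {expected} GPUs"
        )


# The two examples from the text.
print(total_gpus(2, 4))  # 8
print(total_gpus(4, 8))  # 32

# A matching tensor parallel size passes silently.
check_tp_size(node_count=2, gpus_per_node=4, tp_size=8)
```

Running a check like this before deploying catches a mismatched tp-size early, for example tp-size 4 on a 2 × 4 layout.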
When you deploy a multinode workload, the Dynamo operator automatically applies backend-specific configurations to enable distributed execution. Understanding these automatic modifications helps troubleshoot issues and optimize your deployments.
For vLLM multinode deployments, the operator automatically selects and configures the distributed execution mode based on your parallelism settings. There are two modes:
1. Tensor/Pipeline Parallelism Mode (Single model across nodes)
Applies when world_size > GPUs_per_node, where world_size = tensor_parallel_size × pipeline_parallel_size.

The operator uses Ray for multi-node tensor/pipeline parallel deployments. Ray provides automatic placement group management and worker spawning across nodes.
Leader Node:
ray start --head --port=6379 && <original-vllm-command> --distributed-executor-backend ray

Worker Nodes:

ray start --address=<leader-hostname>:6379 --block

vLLM’s Ray executor automatically creates a placement group and spawns workers across the cluster. The --nnodes flag is NOT used with Ray; it’s only compatible with the mp backend.
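To make the leader and worker command shapes concrete, here is a minimal Python sketch that assembles the same strings shown above. The function names and the example hostname are hypothetical; the operator performs this rewriting internally and may do so differently.

```python
# Illustrative sketch of the Ray command shapes described above.
# Function names and the example hostname are hypothetical.

RAY_PORT = 6379  # port used in the commands shown in this guide


def ray_leader_command(vllm_command: str) -> str:
    """Start the Ray head, then run vLLM with the Ray executor backend."""
    return (
        f"ray start --head --port={RAY_PORT} && "
        f"{vllm_command} --distributed-executor-backend ray"
    )


def ray_worker_command(leader_hostname: str) -> str:
    """Join the leader's Ray cluster and block until the process is stopped."""
    return f"ray start --address={leader_hostname}:{RAY_PORT} --block"


print(ray_leader_command("<original-vllm-command>"))
print(ray_worker_command("leader-0"))  # "leader-0" is a made-up hostname
```

Only the leader runs the vLLM command; workers just join the Ray cluster and wait for vLLM's Ray executor to place work on them.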
2. Data Parallel Mode (Multiple model instances across nodes)
Applies when world_size × data_parallel_size > GPUs_per_node.

All Nodes (Leader and Workers):
--data-parallel-address <leader-hostname> - Address of the coordination server
--data-parallel-size-local <value> - Number of data parallel workers per node
--data-parallel-rpc-port 13445 - RPC port for data parallel coordination
--data-parallel-start-rank <value> - Starting rank for this node (calculated automatically)

Note: The operator intelligently injects these flags into your command regardless of command structure (direct Python commands or shell wrappers).
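The two mode conditions and the injected data parallel flags can be sketched in Python as follows. This is an approximation under stated assumptions, not the operator's code: the function names and mode labels are hypothetical, the tensor/pipeline check is assumed to take precedence, and the start rank is assumed to be node_rank × data-parallel-size-local, which the text does not confirm.

```python
# Illustrative sketch; names, mode labels, check ordering, and the
# start-rank formula are assumptions, not confirmed operator behavior.

def select_vllm_mode(tp: int, pp: int, dp: int, gpus_per_node: int) -> str:
    """Pick a deployment mode from the parallelism settings."""
    world_size = tp * pp
    if world_size > gpus_per_node:
        return "ray-tensor-pipeline"   # single model spans nodes
    if world_size * dp > gpus_per_node:
        return "data-parallel"         # multiple model instances span nodes
    return "single-node"               # everything fits on one node


def data_parallel_flags(leader_hostname: str, node_rank: int,
                        dp_size_local: int) -> list:
    """Build the data parallel flags listed above for one node."""
    # Assumption: each node hosts dp_size_local consecutive ranks.
    start_rank = node_rank * dp_size_local
    return [
        "--data-parallel-address", leader_hostname,
        "--data-parallel-size-local", str(dp_size_local),
        "--data-parallel-rpc-port", "13445",
        "--data-parallel-start-rank", str(start_rank),
    ]


print(select_vllm_mode(tp=16, pp=1, dp=1, gpus_per_node=8))
print(select_vllm_mode(tp=8, pp=1, dp=2, gpus_per_node=8))
print(" ".join(data_parallel_flags("leader-0", node_rank=1, dp_size_local=8)))
```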
vLLM supports two distributed executor backends: ray and mp. For multi-node deployments:
mp: uses --nnodes, --node-rank, --master-addr, --master-port. This approach is more complex to orchestrate.

The Dynamo operator uses Ray because:
multi-node-serving.sh)

When a volume mount is configured with useAsCompilationCache: true, the operator automatically sets:
VLLM_CACHE_ROOT: Environment variable pointing to the cache mount point

For SGLang multinode deployments, the operator injects distributed training parameters:
Leader node: --dist-init-addr <leader-hostname>:29500 --nnodes <count> --node-rank 0
Worker nodes: --dist-init-addr <leader-hostname>:29500 --nnodes <count> --node-rank <dynamic-rank>

The node-rank is automatically determined from the pod’s stateful identity.

Note: The operator intelligently injects these flags regardless of your command structure (direct Python commands or shell wrappers).
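As a rough illustration of the SGLang flags, the Python sketch below builds the argument list for a given node. The function names are hypothetical, and deriving the rank from a trailing ordinal in the pod name is an assumption about what "stateful identity" means here; the operator may compute it differently.

```python
import re

# Illustrative sketch; function names are hypothetical, and reading the
# rank from a trailing "-<number>" in the pod name is an assumption.

DIST_INIT_PORT = 29500  # port shown in the flags above


def pod_ordinal(pod_name: str) -> int:
    """Extract a trailing ordinal such as the 2 in "worker-2"."""
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"no trailing ordinal in pod name: {pod_name!r}")
    return int(match.group(1))


def sglang_dist_flags(leader_hostname: str, node_count: int,
                      node_rank: int) -> list:
    """Build the distributed flags for one node (rank 0 is the leader)."""
    return [
        "--dist-init-addr", f"{leader_hostname}:{DIST_INIT_PORT}",
        "--nnodes", str(node_count),
        "--node-rank", str(node_rank),
    ]


# Leader node.
print(" ".join(sglang_dist_flags("leader-0", node_count=2, node_rank=0)))

# Worker node, with the rank taken from a made-up pod name.
rank = pod_ordinal("worker-1")
print(" ".join(sglang_dist_flags("leader-0", node_count=2, node_rank=rank)))
```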
For TensorRT-LLM multinode deployments, the operator configures MPI-based communication:
mpirun command with:
OMPI_MCA_orte_keep_fqdn_hostnames=1 is added to all nodes
mpirun-ssh-key-<deployment-name>)
DynamoGraphDeployment. No manual secret creation is required.

The operator supports compilation cache volumes for backend-specific optimization:
To enable compilation cache, add a volume mount with useAsCompilationCache: true in your component specification. For vLLM, the operator will automatically configure the necessary environment variables. For other backends, volume mounts are created, but additional environment configuration may be required until upstream support is added.
For additional support and examples, see the working multinode configurations in:
These examples demonstrate proper usage of the multinode section with corresponding gpu limits and correct tp-size configuration.