Multi-Node Deployment#
Multi-node deployment enables very large models (for example, Llama 3.1 405B or DeepSeek R1) to run across multiple physical nodes when a single node’s GPU capacity is insufficient. NIM LLM uses Ray for cluster formation and vLLM for distributed model execution across the cluster.
Overview#
Multi-node deployment splits model weights across nodes using two parallelism strategies:
| Strategy | Description | Example |
|---|---|---|
| Tensor Parallelism (TP) | Splits each layer's weights across GPUs | TP=8 across 8 GPUs |
| Pipeline Parallelism (PP) | Splits the model into sequential stages across nodes | PP=2 across 2 nodes |
The most common configuration sets the tensor-parallel size to the number of GPUs per node and the pipeline-parallel size to the number of nodes.
Example: Llama 3.1 405B across 2 nodes with 8 GPUs each:

- TP = 8 (8 GPUs per node)
- PP = 2 (2 nodes)
- Total GPUs = 16

In some cases, multi-node tensor parallelism is used instead, where a single tensor-parallel group spans multiple physical nodes:

- TP = 16
- PP = 1
- Node A: 8 GPUs, Node B: 8 GPUs
- One TP group spans 2 nodes with 16 GPUs total
Note
Multi-node TP requires continuous cross-node GPU communication. Performance depends on network bandwidth, latency, and Remote Direct Memory Access (RDMA) availability.
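The sizing arithmetic behind both layouts can be checked mechanically: the product of the tensor-parallel and pipeline-parallel sizes must equal the total GPU count, and (with one TP group per node) the number of pipeline stages equals the number of nodes. A quick sketch for the first example:

```shell
# Sizing arithmetic for the TP=8, PP=2 example above
TP=8; PP=2; GPUS_PER_NODE=8
TOTAL_GPUS=$((TP * PP))
NODES=$((TOTAL_GPUS / GPUS_PER_NODE))
echo "total GPUs: $TOTAL_GPUS, nodes: $NODES"
# prints: total GPUs: 16, nodes: 2
```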
Architecture#
NIM multi-node deployments follow a leader/worker model:
- The leader pod starts a Ray head node, downloads the model, and launches the vLLM inference server with distributed execution enabled.
- Worker pods join the Ray cluster and provide GPU resources. vLLM spawns execution actors on the workers for model parallelism.
The same NIM container image is used for both leader and worker pods. The role is determined by the command injected at deployment time.
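As an illustration of this pattern, a LeaderWorkerSet manifest has roughly the following shape. This is a minimal sketch, not the manifest the Helm chart actually renders; the metadata name and container commands are placeholders:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: nim-multinode          # placeholder name
spec:
  replicas: 1                  # one leader/worker group
  leaderWorkerTemplate:
    size: 2                    # total pods per group: 1 leader + 1 worker
    leaderTemplate:
      spec:
        containers:
          - name: nim-leader
            image: <NIM_LLM_MODEL_SPECIFIC_IMAGE>   # same image as the workers
            # leader command: start the Ray head, then launch the vLLM server
    workerTemplate:
      spec:
        containers:
          - name: nim-worker
            image: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
            # worker command: join the Ray cluster at the leader's address
```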
Prerequisites#
Before deploying NIM in multi-node mode, ensure the following:
- Kubernetes cluster with GPU nodes (each node must have the same number and type of GPUs).
- LeaderWorkerSet CRD installed on the cluster. LeaderWorkerSet is a Kubernetes-native mechanism for managing leader/worker pod groups:

  ```shell
  kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml
  ```

- Shared storage (recommended): a PVC with `ReadWriteMany` access mode, `hostPath`, or NFS, so the model is downloaded only once by the leader and shared with workers.
- NGC API key stored as a Kubernetes secret:

  ```shell
  kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=<your-key>
  ```

- GPU resources: sufficient GPU count across all nodes to satisfy the TP × PP requirement.
- High-speed networking (recommended): InfiniBand or RDMA over Converged Ethernet (RoCE) for optimal NCCL performance in multi-node configurations.
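Before installing, a few commands along these lines can sanity-check the prerequisites (exact output depends on your cluster):

```shell
# Confirm the LeaderWorkerSet CRD is installed
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io

# List allocatable GPUs per node (requires the NVIDIA device plugin)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Verify the NGC secret exists
kubectl get secret ngc-api
```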
Deployment with Helm#
The NIM LLM Helm chart natively supports multi-node deployments using LeaderWorkerSet.
Minimal Configuration#
Create a `values.yaml` for your multi-node deployment:

```yaml
image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "2.0.0"
model:
  ngcAPISecret: ngc-api
multiNode:
  enabled: true
  workers: 1                  # Number of worker pods (total nodes = workers + 1 leader)
  tensorParallelSize: 8       # GPUs per tensor-parallel group
  pipelineParallelSize: 2     # Number of pipeline stages (typically = number of nodes)
resources:
  limits:
    nvidia.com/gpu: 8
  requests:
    nvidia.com/gpu: 8
persistence:
  enabled: true
  size: 200Gi
  accessMode: ReadWriteMany   # Required for multi-node with shared storage
  storageClass: <your-rwx-storage-class>
```
Deploy with:
```shell
helm install nim-llm nim-llm/ -f values.yaml
```
Helm Values Reference#
| Parameter | Description | Default |
|---|---|---|
| `multiNode.enabled` | Enable multi-node deployment mode | `false` |
| `multiNode.workers` | Number of worker pods per replica | |
| `multiNode.tensorParallelSize` | Number of GPUs per tensor-parallel group (sets `NIM_TENSOR_PARALLEL_SIZE`) | |
| `multiNode.pipelineParallelSize` | Number of pipeline stages across nodes (sets `NIM_PIPELINE_PARALLEL_SIZE`) | |
| | Ray head node communication port | `6379` |
| `model.profile` | Explicit profile name or hash (alternative to TP/PP values). Sets `NIM_MODEL_PROFILE` | |
| `model.ngcAPISecret` | Kubernetes secret name containing `NGC_API_KEY` | |
Profile Selection#
You must specify how the model profile is selected. Choose one of these approaches:
1. Set `multiNode.tensorParallelSize` and `multiNode.pipelineParallelSize` directly. The Helm chart injects the `NIM_TENSOR_PARALLEL_SIZE` and `NIM_PIPELINE_PARALLEL_SIZE` environment variables, and the correct profile is selected automatically.

   ```shell
   helm install nim-llm nim-llm/ \
     --set multiNode.enabled=true \
     --set multiNode.workers=1 \
     --set multiNode.tensorParallelSize=8 \
     --set multiNode.pipelineParallelSize=2
   ```

2. Set `model.profile` to a profile name or hash. The Helm chart injects the `NIM_MODEL_PROFILE` environment variable.

   ```shell
   helm install nim-llm nim-llm/ \
     --set multiNode.enabled=true \
     --set multiNode.workers=1 \
     --set model.profile=vllm-fp16-tp8-pp2
   ```
Warning
Multi-node deployment requires either `multiNode.tensorParallelSize`/`multiNode.pipelineParallelSize` (both > 0) or `model.profile` to be set. The Helm chart will fail to render if neither is provided.
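To discover valid profile names or hashes for your model, NIM containers ship a `list-model-profiles` utility. One way to run it locally (the image name is a placeholder, and valid NGC credentials are assumed):

```shell
# Inspect the profiles available in a NIM image
docker run --rm --gpus all \
  -e NGC_API_KEY \
  <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.0 \
  list-model-profiles
```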
Model Storage#
The leader and all worker pods must have access to the model weights at the same filesystem path. Two approaches are supported: shared storage (recommended, as described in Prerequisites) and independent per-node downloads.
Independent Downloads#
If shared storage is not available, each node downloads the model independently to local storage (an `emptyDir` volume). This works but is not recommended: every node pulls its own copy, multiplying download time and network bandwidth by the node count.
Note
Model-free NIM deployments (using model.modelPath) support both shared and independent storage. However, shared storage is strongly recommended for multi-node deployments.
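For the shared-storage approach, the PVC the chart binds to might look like the following sketch. The claim name is a placeholder, and the storage class must actually support `ReadWriteMany`:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-store        # placeholder name
spec:
  accessModes:
    - ReadWriteMany            # required so leader and workers mount the same volume
  resources:
    requests:
      storage: 200Gi           # match persistence.size in values.yaml
  storageClassName: <your-rwx-storage-class>
```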
Deployment with the NIM Operator#
The NIM Operator provides a fully automated deployment experience for multi-node NIM. The operator manages the `NIMService` custom resource and automatically:

- Generates leader and worker pod specifications
- Injects the appropriate Ray startup commands
- Handles PVC setup through `NIMCache`
- Manages probes, networking, and secrets
Refer to the NIM Operator Deployment documentation for full details on deploying with the NIM Operator.
Examples#
Example 1: Llama 3.1 405B on 2 Nodes (TP=8, PP=2)#
Two nodes, each with 8 GPUs. The model is split across 16 GPUs total.
```yaml
image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "2.0.0"
model:
  ngcAPISecret: ngc-api
multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2
resources:
  limits:
    nvidia.com/gpu: 8
  requests:
    nvidia.com/gpu: 8
persistence:
  enabled: true
  size: 200Gi
  accessMode: ReadWriteMany
  storageClass: local-nfs
imagePullSecrets:
  - name: nvcr-imagepull
```
Example 2: Multi-Node Tensor Parallelism (TP=16, PP=1)#
Two nodes, each with 8 GPUs. A single TP group spans both nodes.
```yaml
multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 16
  pipelineParallelSize: 1
```
Note
Multi-node TP requires high-bandwidth, low-latency interconnect (for example, InfiniBand with RDMA) for acceptable performance.
Example 3: Model-Free Multi-Node Deployment#
Deploy a model from HuggingFace across multiple nodes using model-free NIM:
```yaml
image:
  repository: <NIM_LLM_MODEL_FREE_IMAGE>
  tag: "2.0.0"
model:
  modelPath: "hf://meta-llama/Llama-3.1-405B-Instruct"
  ngcAPISecret: ngc-api
  hfTokenSecret: hf-token
multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2
persistence:
  enabled: true
  size: 200Gi
  accessMode: ReadWriteMany
  storageClass: local-nfs
```
Troubleshooting#
Workers Cannot Join the Ray Cluster#
- Verify that the Ray port (default `6379`) is accessible between pods. Check network policies.
- Ensure that `LWS_LEADER_ADDRESS` is being injected correctly by the LeaderWorkerSet controller.
- Check worker logs for Ray connection errors:

  ```shell
  kubectl logs <worker-pod-name>
  ```
Model Download Failures#
- Confirm that `NGC_API_KEY` is set correctly in the referenced secret.
- If using shared storage, verify the PVC is bound and has `ReadWriteMany` access mode.
- Check that the storage class supports the required access mode.
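Commands along these lines can confirm each point (secret and resource names match the ones used earlier on this page):

```shell
# Print the decoded API key from the secret to verify it is set
kubectl get secret ngc-api -o jsonpath='{.data.NGC_API_KEY}' | base64 -d

# Check PVC binding status and access modes
kubectl get pvc -o wide
```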
NCCL Communication Errors#
- For multi-node TP, ensure high-speed interconnect (InfiniBand/RoCE) is available and configured.
- Verify that NCCL environment variables are set appropriately for your network topology. You can pass these via `env` in `values.yaml`:

  ```yaml
  env:
    - name: NCCL_IB_DISABLE
      value: "0"
    - name: NCCL_DEBUG
      value: "INFO"
  ```
Pod Scheduling Issues#
- Verify that each node has the required GPU resources available.
- Check that `nodeSelector`, `affinity`, or `tolerations` are configured correctly for GPU nodes.
- Ensure `nvidia.com/gpu` resource requests match the available GPUs per node.
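These checks can be performed with commands like the following (pod name is a placeholder):

```shell
# Show allocatable GPUs per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Inspect why a pending leader or worker pod is not scheduling
kubectl describe pod <leader-or-worker-pod> | grep -A5 Events
```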