Multi-Node Deployment#
Multi-node deployment enables very large models (for example, Llama 3.1 405B or DeepSeek R1) to run across multiple physical nodes when a single node’s GPU capacity is insufficient. NIM LLM uses Ray for cluster formation and vLLM for distributed model execution across the cluster.
Multi-node deployment splits model weights across nodes using two parallelism strategies:
Pipeline Parallelism (PP): Splits the model into sequential stages, each placed on a different node. Example: PP=2 across 2 nodes.
Tensor Parallelism (TP): Splits each layer's weight tensors across GPUs. Example: TP=8 across 8 GPUs.
The most common configuration sets the tensor-parallel size to the number of GPUs per node and the pipeline-parallel size to the number of nodes.
For example, use the following settings for Llama 3.1 405B across two nodes with eight GPUs each:
TP = 8 (8 GPUs per node)
PP = 2 (2 nodes)
Total GPUs = 16
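The arithmetic above generalizes: the total GPU count must equal TP × PP. A minimal sketch (a hypothetical helper, not part of NIM) that validates a planned configuration against the cluster size:

```python
def validate_parallelism(tp: int, pp: int, nodes: int, gpus_per_node: int) -> int:
    """Check that a TP/PP plan exactly fits the cluster; return total GPUs used."""
    total_gpus = nodes * gpus_per_node
    required = tp * pp
    if required != total_gpus:
        raise ValueError(
            f"TP*PP={required} does not match {total_gpus} available GPUs"
        )
    return required

# Llama 3.1 405B example: TP=8, PP=2 on 2 nodes with 8 GPUs each
print(validate_parallelism(tp=8, pp=2, nodes=2, gpus_per_node=8))  # prints 16
```

The same check passes for the multi-node TP variant (TP=16, PP=1) on the same hardware, since both plans consume all 16 GPUs.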
In some cases, multi-node tensor parallelism is used instead, where a single tensor-parallel group spans multiple physical nodes:
TP = 16
PP = 1
Node A: 8 GPUs, Node B: 8 GPUs
One TP group spans 2 nodes with 16 GPUs total
Note
Multi-node TP requires continuous cross-node GPU communication. Performance depends on network bandwidth, latency, and Remote Direct Memory Access (RDMA) availability.
Known Limitations#
Warning
Disable structured (JSON) logging for multi-node deployments. Set model.jsonLogging=false in Helm values. The NIM JSON log formatter is not available in vLLM Ray worker processes and causes worker initialization to fail.
Architecture#
NIM multi-node deployments follow a leader/worker model:
The leader pod starts a Ray head node, downloads the model, and launches the vLLM inference server with distributed execution enabled.
Worker pods join the Ray cluster and provide GPU resources. vLLM spawns execution actors on worker nodes for model parallelism.
The following diagram shows this architecture:
The same NIM container image is used for both leader and worker pods. The role is determined by the command injected at deployment time.
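The role split can be pictured with a small sketch (illustrative only, assuming Ray's standard `ray start` CLI; the actual commands the chart injects are internal to NIM):

```python
def ray_command(role: str, leader_addr: str, port: int = 6379) -> str:
    """Return an illustrative Ray startup command for a pod, chosen by role."""
    if role == "leader":
        # The leader starts the Ray head, then launches the vLLM server
        return f"ray start --head --port={port}"
    if role == "worker":
        # Workers join the head node and block, contributing their GPUs
        return f"ray start --address={leader_addr}:{port} --block"
    raise ValueError(f"unknown role: {role}")

print(ray_command("leader", "nim-llm-leader"))
print(ray_command("worker", "nim-llm-leader"))
```

The hostname `nim-llm-leader` is a placeholder; in a real deployment the LeaderWorkerSet controller supplies the leader address to workers.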
Prerequisites#
Before deploying NIM in multi-node mode, make sure you have the following:
A Kubernetes cluster with GPU nodes, where each node has the same number and type of GPUs.
The LeaderWorkerSet CRD installed on the cluster. LeaderWorkerSet is a Kubernetes-native mechanism for managing leader/worker pod groups:
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml
Shared storage (recommended): A PVC with ReadWriteMany access mode, hostPath, or NFS, so the leader downloads the model once and shares it with workers.
High-speed networking (recommended): InfiniBand or RDMA over Converged Ethernet (RoCE) for optimal NCCL performance in multi-node configurations.
Sufficient GPU resources across all nodes to satisfy the TP × PP requirement.
An NGC API key stored as a Kubernetes secret:
kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=<your-key>
Deployment with Helm#
The NIM LLM Helm chart natively supports multi-node deployments using LeaderWorkerSet.
Minimal Configuration#
Complete the following steps to deploy the minimal multi-node Helm example:
Create a values.yaml file for your multi-node deployment:

image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "2.0.1"
model:
  ngcAPISecret: ngc-api
multiNode:
  enabled: true
  workers: 1                # Number of worker pods (total nodes = workers + 1 leader)
  tensorParallelSize: 8     # GPUs per tensor-parallel group
  pipelineParallelSize: 2   # Number of pipeline stages (typically = number of nodes)
resources:
  limits:
    nvidia.com/gpu: 8
  requests:
    nvidia.com/gpu: 8
persistence:
  enabled: true
  size: 200Gi
  accessMode: ReadWriteMany  # Required for multi-node with shared storage
  storageClass: <your-rwx-storage-class>
Install the chart:
helm install nim-llm nim-llm/ -f values.yaml
Helm Values Reference#
Use the following Helm values to configure multi-node deployment behavior, model profile selection, and Ray communication settings.
| Parameter | Description | Default |
|---|---|---|
| `multiNode.enabled` | Enable multi-node deployment mode | |
| `multiNode.workers` | Number of worker pods per replica | |
| `multiNode.tensorParallelSize` | Number of GPUs per tensor-parallel group (sets `NIM_TENSOR_PARALLEL_SIZE`) | |
| `multiNode.pipelineParallelSize` | Number of pipeline stages across nodes (sets `NIM_PIPELINE_PARALLEL_SIZE`) | |
| | Ray head node communication port | `6379` |
| `model.profile` | Explicit profile name or hash (alternative to TP/PP values). Sets `NIM_MODEL_PROFILE` | |
| `model.ngcAPISecret` | Kubernetes secret name containing `NGC_API_KEY` | |
Profile Selection#
You must specify how the model profile is selected. Choose one of these approaches:
Set multiNode.tensorParallelSize and multiNode.pipelineParallelSize directly. The Helm chart injects NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE environment variables, and the correct profile is selected automatically.
helm install nim-llm nim-llm/ \
--set multiNode.enabled=true \
--set multiNode.workers=1 \
--set multiNode.tensorParallelSize=8 \
--set multiNode.pipelineParallelSize=2
Set model.profile to a profile name or hash. The Helm chart injects the NIM_MODEL_PROFILE environment variable.
helm install nim-llm nim-llm/ \
--set multiNode.enabled=true \
--set multiNode.workers=1 \
--set model.profile=vllm-fp16-tp8-pp2
Warning
Multi-node deployment requires either tensorParallelSize/pipelineParallelSize (both > 0) or model.profile to be set. The Helm chart will fail to render if neither is provided.
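This render-time rule can be expressed as a small sketch (hypothetical Python mirroring the chart's check; it is not chart code):

```python
def resolve_profile_env(tp: int = 0, pp: int = 0, profile: str = "") -> dict:
    """Derive the environment variables the chart would inject (illustrative)."""
    if tp > 0 and pp > 0:
        # Approach 1: explicit parallelism sizes; profile is selected automatically
        return {
            "NIM_TENSOR_PARALLEL_SIZE": str(tp),
            "NIM_PIPELINE_PARALLEL_SIZE": str(pp),
        }
    if profile:
        # Approach 2: explicit profile name or hash
        return {"NIM_MODEL_PROFILE": profile}
    raise ValueError(
        "set tensorParallelSize/pipelineParallelSize (both > 0) or model.profile"
    )

print(resolve_profile_env(tp=8, pp=2))
```

Note that TP/PP values take precedence in this sketch; consult the chart itself for the authoritative precedence when both are set.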
Model Storage#
The leader and all worker nodes must have access to the model weights at the same filesystem path. There are two supported approaches:
Independent Downloads#
If shared storage is not available, each node downloads the model independently to local storage (emptyDir). This works but is not recommended because it wastes time and network bandwidth.
Note
Model-free NIM deployments (using model.modelPath) support both shared and independent storage. However, shared storage is strongly recommended for multi-node deployments.
Deployment with the NIM Operator#
The NIM Operator provides a fully automated deployment experience for multi-node NIM. The operator manages the NIMService custom resource and automatically:
Generates leader and worker pod specifications
Injects the appropriate Ray startup commands
Handles PVC setup through NIMCache
Manages probes, networking, and secrets
Refer to the NIM Operator Deployment documentation for full details on deploying with the NIM Operator.
Examples#
Use the following examples to configure common multi-node deployments with different tensor and pipeline parallelism settings.
Llama 3.1 405B on 2 Nodes (TP=8, PP=2)#
Two nodes, each with 8 GPUs. The model is split across 16 GPUs total.
image:
repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
tag: latest
model:
ngcAPISecret: ngc-api
jsonLogging: false
multiNode:
enabled: true
workers: 1
tensorParallelSize: 8
pipelineParallelSize: 2
resources:
limits:
nvidia.com/gpu: 8
requests:
nvidia.com/gpu: 8
persistence:
enabled: true
size: 200Gi
accessMode: ReadWriteMany
storageClass: local-nfs
imagePullSecrets:
- name: nvcr-imagepull
Multi-Node Tensor Parallelism (TP=16, PP=1)#
Two nodes, each with 8 GPUs. A single TP group spans both nodes.
multiNode:
enabled: true
workers: 1
tensorParallelSize: 16
pipelineParallelSize: 1
Note
Multi-node TP requires high-bandwidth, low-latency interconnect (for example, InfiniBand with RDMA) for acceptable performance.
Model-Free Multi-Node Deployment#
Deploy a model from Hugging Face across multiple nodes using model-free NIM:
image:
repository: nvcr.io/nim/nim-llm
tag: latest
model:
modelPath: "hf://meta-llama/Llama-3.1-405B-Instruct"
ngcAPISecret: ngc-api
hfTokenSecret: hf-token
jsonLogging: false
multiNode:
enabled: true
workers: 1
tensorParallelSize: 8
pipelineParallelSize: 2
persistence:
enabled: true
size: 200Gi
accessMode: ReadWriteMany
storageClass: local-nfs
Troubleshooting#
Workers Cannot Join the Ray Cluster#
Use the following checks if workers cannot join the Ray cluster:
Verify that the Ray port (default 6379) is accessible between pods. Check network policies.
Ensure that LWS_LEADER_ADDRESS is being injected correctly by the LeaderWorkerSet controller.
Check worker logs for Ray connection errors:
kubectl logs <worker-pod-name>
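The port check can be scripted; here is a minimal sketch using only Python's standard library to test TCP reachability of the Ray port (run it from inside a pod; the leader hostname is an assumption for illustration):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the Ray head port from a worker pod (hostname is illustrative)
print(port_open("nim-llm-leader", 6379))
```

A `False` result from a worker pod usually points at a NetworkPolicy blocking the port or at DNS not resolving the leader's address.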
Model Download Failures#
Use the following checks if model downloads fail:
Confirm that NGC_API_KEY is set correctly in the referenced secret.
If using shared storage, verify the PVC is bound and has ReadWriteMany access mode.
Check that the storage class supports the required access mode.
NCCL Communication Errors#
Use the following checks if NCCL communication errors occur:
For multi-node TP, ensure high-speed interconnect (InfiniBand/RoCE) is available and configured.
Verify that NCCL environment variables are set appropriately for your network topology. You can pass these via env in values.yaml:

env:
  - name: NCCL_IB_DISABLE
    value: "0"
  - name: NCCL_DEBUG
    value: "INFO"
Pod Scheduling Issues#
Use the following checks if pods cannot be scheduled:
Verify that each node has the required GPU resources available.
Check that nodeSelector, affinity, or tolerations are configured correctly for GPU nodes.
Ensure nvidia.com/gpu resource requests match the available GPUs per node.