Multi-Node Deployment#
Multi-node deployment enables very large models (for example, Llama 3.1 405B or DeepSeek R1) to run across multiple physical nodes when a single node’s GPU capacity is insufficient. NIM LLM uses Ray for cluster formation and vLLM for distributed model execution across the cluster.
Overview#
Multi-node deployment splits model weights across nodes using two parallelism strategies:
| Strategy | Description | Example |
|---|---|---|
| Tensor Parallelism (TP) | Splits each layer's weights across GPUs | TP=8 across 8 GPUs |
| Pipeline Parallelism (PP) | Splits the model into sequential stages across nodes | PP=2 across 2 nodes |
The most common configuration sets the tensor-parallel size to the number of GPUs per node and the pipeline-parallel size to the number of nodes.
Example: Llama 3.1 405B across 2 nodes with 8 GPUs each:

- TP = 8 (8 GPUs per node)
- PP = 2 (2 nodes)
- Total GPUs = 16

In some cases, multi-node tensor parallelism is used instead, where a single tensor-parallel group spans multiple physical nodes:

- TP = 16
- PP = 1
- Node A: 8 GPUs, Node B: 8 GPUs
- One TP group spans 2 nodes with 16 GPUs total
Note
Multi-node TP requires continuous cross-node GPU communication. Performance depends on network bandwidth, latency, and Remote Direct Memory Access (RDMA) availability.
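The sizing arithmetic behind both layouts can be checked mechanically: the product of the tensor-parallel and pipeline-parallel sizes must equal the total GPU count, and (with one TP group per node) the number of pipeline stages equals the number of nodes. A quick sketch for the first example:

```shell
# Sizing arithmetic for the TP=8, PP=2 example above
TP=8; PP=2; GPUS_PER_NODE=8
TOTAL_GPUS=$((TP * PP))
NODES=$((TOTAL_GPUS / GPUS_PER_NODE))
echo "total GPUs: $TOTAL_GPUS, nodes: $NODES"
# prints: total GPUs: 16, nodes: 2
```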
Architecture#
NIM multi-node deployments follow a leader/worker model:
- The leader pod starts a Ray head node, downloads the model, and launches the vLLM inference server with distributed execution enabled.
- Worker pods join the Ray cluster and provide GPU resources. vLLM spawns execution actors on the workers for model parallelism.
The same NIM container image is used for both leader and worker pods. The role is determined by the command injected at deployment time.
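As an illustration of this pattern, a LeaderWorkerSet manifest has roughly the following shape. This is a minimal sketch, not the manifest the Helm chart actually renders; the metadata name and container commands are placeholders:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: nim-multinode          # placeholder name
spec:
  replicas: 1                  # one leader/worker group
  leaderWorkerTemplate:
    size: 2                    # total pods per group: 1 leader + 1 worker
    leaderTemplate:
      spec:
        containers:
          - name: nim-leader
            image: <NIM_LLM_MODEL_SPECIFIC_IMAGE>   # same image as the workers
            # leader command: start the Ray head, then launch the vLLM server
    workerTemplate:
      spec:
        containers:
          - name: nim-worker
            image: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
            # worker command: join the Ray cluster at the leader's address
```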
Prerequisites#
Before deploying NIM in multi-node mode, ensure the following:
- Kubernetes cluster with GPU nodes (each node must have the same number and type of GPUs).
- LeaderWorkerSet CRD installed on the cluster. LeaderWorkerSet is a Kubernetes-native mechanism for managing leader/worker pod groups:

  ```shell
  kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml
  ```

- Shared storage (recommended): a PVC with `ReadWriteMany` access mode, `hostPath`, or NFS, so the model is downloaded only once by the leader and shared with workers.
- NGC API key stored as a Kubernetes secret:

  ```shell
  kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=<your-key>
  ```

- GPU resources: sufficient GPU count across all nodes to satisfy the TP × PP requirement.
- High-speed networking (recommended): InfiniBand or RDMA over Converged Ethernet (RoCE) for optimal NCCL performance in multi-node configurations.
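Before installing, a few commands along these lines can sanity-check the prerequisites (exact output depends on your cluster):

```shell
# Confirm the LeaderWorkerSet CRD is installed
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io

# List allocatable GPUs per node (requires the NVIDIA device plugin)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Verify the NGC secret exists
kubectl get secret ngc-api
```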
Deployment with Helm#
The NIM LLM Helm chart natively supports multi-node deployments using LeaderWorkerSet.
Minimal Configuration#
Create a `values.yaml` for your multi-node deployment:

```yaml
image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "2.0.0"
model:
  ngcAPISecret: ngc-api
multiNode:
  enabled: true
  workers: 1                  # Number of worker pods (total nodes = workers + 1 leader)
  tensorParallelSize: 8       # GPUs per tensor-parallel group
  pipelineParallelSize: 2     # Number of pipeline stages (typically = number of nodes)
resources:
  limits:
    nvidia.com/gpu: 8
  requests:
    nvidia.com/gpu: 8
persistence:
  enabled: true
  size: 200Gi
  accessMode: ReadWriteMany   # Required for multi-node with shared storage
  storageClass: <your-rwx-storage-class>
```
Deploy with:
```shell
helm install nim-llm nim-llm/ -f values.yaml
```
Helm Values Reference#
| Parameter | Description | Default |
|---|---|---|
| `multiNode.enabled` | Enable multi-node deployment mode | `false` |
| `multiNode.workers` | Number of worker pods per replica | |
| `multiNode.tensorParallelSize` | Number of GPUs per tensor-parallel group (sets `NIM_TENSOR_PARALLEL_SIZE`) | |
| `multiNode.pipelineParallelSize` | Number of pipeline stages across nodes (sets `NIM_PIPELINE_PARALLEL_SIZE`) | |
| | Ray head node communication port | `6379` |
| `model.profile` | Explicit profile name or hash (alternative to TP/PP values). Sets `NIM_MODEL_PROFILE` | |
| `model.ngcAPISecret` | Kubernetes secret name containing `NGC_API_KEY` | |
Profile Selection#
You must specify how the model profile is selected. Choose one of these approaches:
1. Set `multiNode.tensorParallelSize` and `multiNode.pipelineParallelSize` directly. The Helm chart injects the `NIM_TENSOR_PARALLEL_SIZE` and `NIM_PIPELINE_PARALLEL_SIZE` environment variables, and the correct profile is selected automatically.

   ```shell
   helm install nim-llm nim-llm/ \
     --set multiNode.enabled=true \
     --set multiNode.workers=1 \
     --set multiNode.tensorParallelSize=8 \
     --set multiNode.pipelineParallelSize=2
   ```

2. Set `model.profile` to a profile name or hash. The Helm chart injects the `NIM_MODEL_PROFILE` environment variable.

   ```shell
   helm install nim-llm nim-llm/ \
     --set multiNode.enabled=true \
     --set multiNode.workers=1 \
     --set model.profile=vllm-fp16-tp8-pp2
   ```
Warning
Multi-node deployment requires either `multiNode.tensorParallelSize`/`multiNode.pipelineParallelSize` (both > 0) or `model.profile` to be set. The Helm chart will fail to render if neither is provided.
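To discover valid profile names or hashes for your model, NIM containers ship a `list-model-profiles` utility. One way to run it locally (the image name is a placeholder, and valid NGC credentials are assumed):

```shell
# Inspect the profiles available in a NIM image
docker run --rm --gpus all \
  -e NGC_API_KEY \
  <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.0 \
  list-model-profiles
```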
Model Storage#
The leader and all worker pods must have access to the model weights at the same filesystem path. Two approaches are supported: shared storage (recommended, as described in Prerequisites) and independent per-node downloads.
Independent Downloads#
If shared storage is not available, each node downloads the model independently to local storage (an `emptyDir` volume). This works but is not recommended: every node pulls its own copy, multiplying download time and network bandwidth by the node count.
Note
Model-free NIM deployments (using model.modelPath) support both shared and independent storage. However, shared storage is strongly recommended for multi-node deployments.
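For the shared-storage approach, the PVC the chart binds to might look like the following sketch. The claim name is a placeholder, and the storage class must actually support `ReadWriteMany`:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-model-store        # placeholder name
spec:
  accessModes:
    - ReadWriteMany            # required so leader and workers mount the same volume
  resources:
    requests:
      storage: 200Gi           # match persistence.size in values.yaml
  storageClassName: <your-rwx-storage-class>
```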
Deployment with the NIM Operator#
The NIM Operator provides a fully automated deployment experience for multi-node NIM. The operator manages the `NIMService` custom resource and automatically:

- Generates leader and worker pod specifications
- Injects the appropriate Ray startup commands
- Handles PVC setup through `NIMCache`
- Manages probes, networking, and secrets
Refer to the NIM Operator Deployment documentation for full details on deploying with the NIM Operator.
Examples#
Example 1: Llama 3.1 405B on 2 Nodes (TP=8, PP=2)#
Two nodes, each with 8 GPUs. The model is split across 16 GPUs total.
```yaml
image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "2.0.0"
model:
  ngcAPISecret: ngc-api
multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2
resources:
  limits:
    nvidia.com/gpu: 8
  requests:
    nvidia.com/gpu: 8
persistence:
  enabled: true
  size: 200Gi
  accessMode: ReadWriteMany
  storageClass: local-nfs
imagePullSecrets:
  - name: nvcr-imagepull
```
Example 2: Multi-Node Tensor Parallelism (TP=16, PP=1)#
Two nodes, each with 8 GPUs. A single TP group spans both nodes.
```yaml
multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 16
  pipelineParallelSize: 1
```
Note
Multi-node TP requires high-bandwidth, low-latency interconnect (for example, InfiniBand with RDMA) for acceptable performance.
Example 3: Model-Free Multi-Node Deployment#
Deploy a model from HuggingFace across multiple nodes using model-free NIM:
```yaml
image:
  repository: <NIM_LLM_MODEL_FREE_IMAGE>
  tag: "2.0.0"
model:
  modelPath: "hf://meta-llama/Llama-3.1-405B-Instruct"
  ngcAPISecret: ngc-api
  hfTokenSecret: hf-token
multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2
persistence:
  enabled: true
  size: 200Gi
  accessMode: ReadWriteMany
  storageClass: local-nfs
```
Troubleshooting#
Workers Cannot Join the Ray Cluster#
- Verify that the Ray port (default `6379`) is accessible between pods. Check network policies.
- Ensure that `LWS_LEADER_ADDRESS` is being injected correctly by the LeaderWorkerSet controller.
- Check worker logs for Ray connection errors:

  ```shell
  kubectl logs <worker-pod-name>
  ```
Model Download Failures#
- Confirm that `NGC_API_KEY` is set correctly in the referenced secret.
- If using shared storage, verify the PVC is bound and has `ReadWriteMany` access mode.
- Check that the storage class supports the required access mode.
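Commands along these lines can confirm each point (secret and resource names match the ones used earlier on this page):

```shell
# Print the decoded API key from the secret to verify it is set
kubectl get secret ngc-api -o jsonpath='{.data.NGC_API_KEY}' | base64 -d

# Check PVC binding status and access modes
kubectl get pvc -o wide
```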
NCCL Communication Errors#
- For multi-node TP, ensure high-speed interconnect (InfiniBand/RoCE) is available and configured.
- Verify that NCCL environment variables are set appropriately for your network topology. You can pass these via `env` in `values.yaml`:

  ```yaml
  env:
    - name: NCCL_IB_DISABLE
      value: "0"
    - name: NCCL_DEBUG
      value: "INFO"
  ```
Pod Scheduling Issues#
- Verify that each node has the required GPU resources available.
- Check that `nodeSelector`, `affinity`, or `tolerations` are configured correctly for GPU nodes.
- Ensure `nvidia.com/gpu` resource requests match the available GPUs per node.
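These checks can be performed with commands like the following (pod name is a placeholder):

```shell
# Show allocatable GPUs per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Inspect why a pending leader or worker pod is not scheduling
kubectl describe pod <leader-or-worker-pod> | grep -A5 Events
```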