Multi-Node Deployment#

Multi-node deployment enables very large models (for example, Llama 3.1 405B or DeepSeek R1) to run across multiple physical nodes when a single node’s GPU capacity is insufficient. NIM LLM uses Ray for cluster formation and vLLM for distributed model execution across the cluster.

Multi-node deployment splits model weights across nodes using two parallelism strategies:

  • Pipeline Parallelism (PP): Splits the model into sequential stages, with each stage placed on a different node. Example: PP=2 across 2 nodes.

  • Tensor Parallelism (TP): Splits each layer's weight tensors across multiple GPUs, typically within a node. Example: TP=8 across 8 GPUs.

The most common configuration sets the tensor-parallel size to the number of GPUs per node and the pipeline-parallel size to the number of nodes.

For example, use the following settings for Llama 3.1 405B across two nodes with eight GPUs each:

  • TP = 8 (8 GPUs per node)

  • PP = 2 (2 nodes)

  • Total GPUs = 16
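
The arithmetic behind these settings can be sketched as follows (illustrative only; the chart derives the same totals from your values):

```shell
# Illustrative arithmetic: total GPUs required by a TP x PP configuration.
TP=8              # tensor-parallel size (GPUs per tensor-parallel group)
PP=2              # pipeline-parallel size (number of pipeline stages)
GPUS_PER_NODE=8

TOTAL_GPUS=$((TP * PP))                  # 16 GPUs in total
NODES=$((TOTAL_GPUS / GPUS_PER_NODE))    # 2 nodes
echo "Requires ${TOTAL_GPUS} GPUs across ${NODES} nodes"
```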

In some cases, multi-node tensor parallelism is used instead, where a single tensor-parallel group spans multiple physical nodes:

  • TP = 16

  • PP = 1

  • Node A: 8 GPUs, Node B: 8 GPUs

  • One TP group spans 2 nodes with 16 GPUs total

Note

Multi-node TP requires continuous cross-node GPU communication. Performance depends on network bandwidth, latency, and Remote Direct Memory Access (RDMA) availability.

Known Limitations#

Warning

Disable structured (JSON) logging for multi-node deployments. Set model.jsonLogging=false in Helm values. The NIM JSON log formatter is not available in vLLM Ray worker processes and causes worker initialization to fail.

Architecture#

NIM multi-node deployments follow a leader/worker model:

  • The leader pod starts a Ray head node, downloads the model, and launches the vLLM inference server with distributed execution enabled.

  • Worker pods join the Ray cluster and provide GPU resources. vLLM spawns execution actors on worker nodes for model parallelism.

The following diagram shows this architecture:

    flowchart LR
        subgraph Leader["Leader Node"]
            RH["Ray Head"]
            VLLM["vLLM Server"]
        end
        subgraph Worker1["Worker Node 1"]
            RW1["Ray Worker"]
            GPU1["GPUs"]
        end
        subgraph Worker2["Worker Node N"]
            RW2["Ray Worker"]
            GPU2["GPUs"]
        end
        RH <-->|"Ray cluster"| RW1
        RH <-->|"Ray cluster"| RW2
        VLLM -->|"NCCL"| GPU1
        VLLM -->|"NCCL"| GPU2
        style Leader fill:#76b900,stroke:#333,color:#fff
        style Worker1 fill:#1a1a2e,stroke:#333,color:#fff
        style Worker2 fill:#1a1a2e,stroke:#333,color:#fff

The same NIM container image is used for both leader and worker pods. The role is determined by the command injected at deployment time.
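
A minimal sketch of how the injected commands differ by role (the exact commands are internal to the chart; the ray start flags shown are standard Ray CLI flags, and NIM_ROLE is a hypothetical variable used here only for illustration):

```shell
# Hypothetical role switch; the real chart templates the command per pod role.
NIM_ROLE="${NIM_ROLE:-leader}"
if [ "$NIM_ROLE" = "leader" ]; then
  # Leader: start the Ray head, then launch the vLLM inference server.
  CMD="ray start --head --port=6379"
else
  # Worker: join the leader's cluster. LWS_LEADER_ADDRESS is injected
  # by the LeaderWorkerSet controller.
  CMD="ray start --address=${LWS_LEADER_ADDRESS}:6379 --block"
fi
echo "$CMD"
```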

Prerequisites#

Before deploying NIM in multi-node mode, make sure you have the following:

  • A Kubernetes cluster with GPU nodes, where each node has the same number and type of GPUs.

  • The LeaderWorkerSet CRD installed on the cluster. LeaderWorkerSet is a Kubernetes-native mechanism for managing leader/worker pod groups:

    kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml
    
  • Shared storage (recommended): A PVC with ReadWriteMany access mode, hostPath, or NFS, so the leader downloads the model once and shares it with workers.

  • High-speed networking (recommended): InfiniBand or RDMA over Converged Ethernet (RoCE) for optimal NCCL performance in multi-node configurations.

  • Sufficient GPU resources across all nodes to satisfy the TP × PP requirement.

  • An NGC API key stored as a Kubernetes secret:

    kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=<your-key>
    
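You can sanity-check these prerequisites from the command line before installing (cluster-dependent; the secret name matches the example above):

```shell
# Verify the LeaderWorkerSet CRD is installed.
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io

# Verify the NGC API key secret exists.
kubectl get secret ngc-api

# List allocatable GPUs per node to confirm capacity for TP x PP.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```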

Deployment with Helm#

The NIM LLM Helm chart natively supports multi-node deployments using LeaderWorkerSet.

Minimal Configuration#

Complete the following steps to deploy the minimal multi-node Helm example:

  1. Create a values.yaml file for your multi-node deployment:

    image:
      repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
      tag: "2.0.1"
    
    model:
      ngcAPISecret: ngc-api
    
    multiNode:
      enabled: true
      workers: 1                  # Number of worker pods (total nodes = workers + 1 leader)
      tensorParallelSize: 8       # GPUs per tensor-parallel group
      pipelineParallelSize: 2     # Number of pipeline stages (typically = number of nodes)
    
    resources:
      limits:
        nvidia.com/gpu: 8
      requests:
        nvidia.com/gpu: 8
    
    persistence:
      enabled: true
      size: 200Gi
      accessMode: ReadWriteMany   # Required for multi-node with shared storage
      storageClass: <your-rwx-storage-class>
    
  2. Install the chart:

    helm install nim-llm nim-llm/ -f values.yaml
    
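After installation, you can watch the leader and worker pods come up. The label selector assumes the chart's standard Helm labels, and the pod name is illustrative (actual names depend on the release name):

```shell
# Pods created by the LeaderWorkerSet: one leader plus `workers` workers.
kubectl get pods -l app.kubernetes.io/instance=nim-llm

# Follow the leader pod's logs for Ray head startup and model download.
kubectl logs -f nim-llm-0
```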

Helm Values Reference#

Use the following Helm values to configure multi-node deployment behavior, model profile selection, and Ray communication settings.

| Parameter | Description | Default |
|---|---|---|
| multiNode.enabled | Enable multi-node deployment mode | false |
| multiNode.workers | Number of worker pods per replica | 1 |
| multiNode.tensorParallelSize | Number of GPUs per tensor-parallel group (sets NIM_TENSOR_PARALLEL_SIZE). Set to 0 to omit. | 0 |
| multiNode.pipelineParallelSize | Number of pipeline stages across nodes (sets NIM_PIPELINE_PARALLEL_SIZE). Set to 0 to omit. | 0 |
| multiNode.ray.port | Ray head node communication port | 6379 |
| model.profile | Explicit profile name or hash (alternative to TP/PP values). Sets NIM_MODEL_PROFILE. | "" |
| model.hfTokenSecret | Kubernetes secret name containing HF_TOKEN for Hugging Face model downloads. | "" |

Profile Selection#

You must specify how the model profile is selected. Choose one of these approaches:

Option 1: Set multiNode.tensorParallelSize and multiNode.pipelineParallelSize directly. The Helm chart injects the NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE environment variables, and the correct profile is selected automatically.

helm install nim-llm nim-llm/ \
  --set multiNode.enabled=true \
  --set multiNode.workers=1 \
  --set multiNode.tensorParallelSize=8 \
  --set multiNode.pipelineParallelSize=2

Option 2: Set model.profile to a profile name or hash. The Helm chart injects the NIM_MODEL_PROFILE environment variable.

helm install nim-llm nim-llm/ \
  --set multiNode.enabled=true \
  --set multiNode.workers=1 \
  --set model.profile=vllm-fp16-tp8-pp2

Warning

Multi-node deployment requires either tensorParallelSize/pipelineParallelSize (both > 0) or model.profile to be set. The Helm chart will fail to render if neither is provided.

Model Storage#

The leader and all worker nodes must have access to the model weights at the same filesystem path. There are two supported approaches:

Shared Storage (Recommended)#

Provision shared storage (a PVC with ReadWriteMany access mode, hostPath, or NFS) mounted at the same path on the leader and all workers. The leader downloads the model once, and the workers read the same copy, as shown in the Helm examples above (persistence.accessMode: ReadWriteMany).

Independent Downloads#

If shared storage is not available, each node downloads the model independently to local storage (emptyDir). This works but is not recommended because it wastes time and network bandwidth.
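
A hedged sketch of this configuration, assuming the chart falls back to node-local scratch storage when persistence is disabled:

    # Each pod downloads the model to ephemeral local storage; no PVC is
    # shared between leader and workers. Slower startup, duplicated downloads.
    persistence:
      enabled: false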

Note

Model-free NIM deployments (using model.modelPath) support both shared and independent storage. However, shared storage is strongly recommended for multi-node deployments.

Deployment with the NIM Operator#

The NIM Operator provides a fully automated deployment experience for multi-node NIM. The operator manages the NIMService custom resource and automatically:

  • Generates leader and worker pod specifications

  • Injects the appropriate Ray startup commands

  • Handles PVC setup through NIMCache

  • Manages probes, networking, and secrets

Refer to the NIM Operator Deployment documentation for full details on deploying with the NIM Operator.

Examples#

Use the following examples to configure common multi-node deployments with different tensor and pipeline parallelism settings.

Llama 3.1 405B on 2 Nodes (TP=8, PP=2)#

Two nodes, each with 8 GPUs. The model is split across 16 GPUs total.

image:
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: latest

model:
  ngcAPISecret: ngc-api
  jsonLogging: false

multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2

resources:
  limits:
    nvidia.com/gpu: 8
  requests:
    nvidia.com/gpu: 8

persistence:
  enabled: true
  size: 200Gi
  accessMode: ReadWriteMany
  storageClass: local-nfs

imagePullSecrets:
  - name: nvcr-imagepull

Multi-Node Tensor Parallelism (TP=16, PP=1)#

Two nodes, each with 8 GPUs. A single TP group spans both nodes.

multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 16
  pipelineParallelSize: 1

Note

Multi-node TP requires high-bandwidth, low-latency interconnect (for example, InfiniBand with RDMA) for acceptable performance.

Model-Free Multi-Node Deployment#

Deploy a model from Hugging Face across multiple nodes using model-free NIM:

image:
  repository: nvcr.io/nim/nim-llm
  tag: latest

model:
  modelPath: "hf://meta-llama/Llama-3.1-405B-Instruct"
  ngcAPISecret: ngc-api
  hfTokenSecret: hf-token
  jsonLogging: false

multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2

persistence:
  enabled: true
  size: 200Gi
  accessMode: ReadWriteMany
  storageClass: local-nfs

Troubleshooting#

Workers Cannot Join the Ray Cluster#

Use the following checks if workers cannot join the Ray cluster:

  • Verify that the Ray port (default 6379) is accessible between pods. Check network policies.

  • Ensure that LWS_LEADER_ADDRESS is being injected correctly by the LeaderWorkerSet controller.

  • Check worker logs for Ray connection errors:

    kubectl logs <worker-pod-name>
    
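You can also test connectivity to the Ray port directly from a worker pod (the pod name placeholder matches the command above; this assumes nc is available in the container image):

```shell
# Check that the leader's Ray port is reachable from a worker pod.
kubectl exec <worker-pod-name> -- sh -c 'nc -zv "$LWS_LEADER_ADDRESS" 6379'
```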

Model Download Failures#

Use the following checks if model downloads fail:

  • Confirm that NGC_API_KEY is set correctly in the referenced secret.

  • If using shared storage, verify the PVC is bound and has ReadWriteMany access mode.

  • Check that the storage class supports the required access mode.

NCCL Communication Errors#

Use the following checks if NCCL communication errors occur:

  • For multi-node TP, ensure high-speed interconnect (InfiniBand/RoCE) is available and configured.

  • Verify that NCCL environment variables are set appropriately for your network topology. You can pass these via env in values.yaml:

    env:
      - name: NCCL_IB_DISABLE
        value: "0"
      - name: NCCL_DEBUG
        value: "INFO"
    

Pod Scheduling Issues#

Use the following checks if pods cannot be scheduled:

  • Verify that each node has the required GPU resources available.

  • Check that nodeSelector, affinity, or tolerations are configured correctly for GPU nodes.

  • Ensure nvidia.com/gpu resource requests match the available GPUs per node.