Multi-Node Deployment#

Multi-node deployment enables very large models (for example, Llama 3.1 405B or DeepSeek R1) to run across multiple physical nodes when a single node’s GPU capacity is insufficient. NIM LLM uses Ray for cluster formation and vLLM for distributed model execution across the cluster.

Multi-node deployment splits model weights across nodes using two parallelism strategies:

  • Pipeline Parallelism (PP): Splits the model into sequential stages, with each stage placed on a different node. Example: PP=2 across 2 nodes.

  • Tensor Parallelism (TP): Splits each layer's weight tensors across multiple GPUs, typically within a node. Example: TP=8 across 8 GPUs.

The most common configuration sets the tensor-parallel size to the number of GPUs per node and the pipeline-parallel size to the number of nodes.

For example, use the following settings for Llama 3.1 405B across two nodes with eight GPUs each:

  • TP = 8 (8 GPUs per node)

  • PP = 2 (2 nodes)

  • Total GPUs = 16
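
The arithmetic behind these settings can be sketched as follows (illustrative only; the chart derives the same totals from your values):

```shell
# Illustrative arithmetic: total GPUs required by a TP x PP configuration.
TP=8              # tensor-parallel size (GPUs per tensor-parallel group)
PP=2              # pipeline-parallel size (number of pipeline stages)
GPUS_PER_NODE=8

TOTAL_GPUS=$((TP * PP))                  # 16 GPUs in total
NODES=$((TOTAL_GPUS / GPUS_PER_NODE))    # 2 nodes
echo "Requires ${TOTAL_GPUS} GPUs across ${NODES} nodes"
```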

In some cases, multi-node tensor parallelism is used instead, where a single tensor-parallel group spans multiple physical nodes:

  • TP = 16

  • PP = 1

  • Node A: 8 GPUs, Node B: 8 GPUs

  • One TP group spans 2 nodes with 16 GPUs total

Note

Multi-node TP requires continuous cross-node GPU communication. Performance depends on network bandwidth, latency, and Remote Direct Memory Access (RDMA) availability.

Known Limitations#

Warning

Disable structured (JSON) logging for multi-node deployments. Set model.jsonLogging=false in Helm values. The NIM JSON log formatter is not available in vLLM Ray worker processes and causes worker initialization to fail.

Architecture#

NIM multi-node deployments follow a leader/worker model:

  • The leader pod starts a Ray head node, downloads the model, and launches the vLLM inference server with distributed execution enabled.

  • Worker pods join the Ray cluster and provide GPU resources. vLLM spawns execution actors on worker nodes for model parallelism.

The following diagram shows this architecture:

    flowchart LR
        subgraph Leader["Leader Node"]
            RH["Ray Head"]
            VLLM["vLLM Server"]
        end
        subgraph Worker1["Worker Node 1"]
            RW1["Ray Worker"]
            GPU1["GPUs"]
        end
        subgraph Worker2["Worker Node N"]
            RW2["Ray Worker"]
            GPU2["GPUs"]
        end
        RH <-->|"Ray cluster"| RW1
        RH <-->|"Ray cluster"| RW2
        VLLM -->|"NCCL"| GPU1
        VLLM -->|"NCCL"| GPU2
        style Leader fill:#76b900,stroke:#333,color:#fff
        style Worker1 fill:#1a1a2e,stroke:#333,color:#fff
        style Worker2 fill:#1a1a2e,stroke:#333,color:#fff

The same NIM container image is used for both leader and worker pods. The role is determined by the command injected at deployment time.
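
A minimal sketch of how the injected commands differ by role (the exact commands are internal to the chart; the ray start flags shown are standard Ray CLI flags, and NIM_ROLE is a hypothetical variable used here only for illustration):

```shell
# Hypothetical role switch; the real chart templates the command per pod role.
NIM_ROLE="${NIM_ROLE:-leader}"
if [ "$NIM_ROLE" = "leader" ]; then
  # Leader: start the Ray head, then launch the vLLM inference server.
  CMD="ray start --head --port=6379"
else
  # Worker: join the leader's cluster. LWS_LEADER_ADDRESS is injected
  # by the LeaderWorkerSet controller.
  CMD="ray start --address=${LWS_LEADER_ADDRESS}:6379 --block"
fi
echo "$CMD"
```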

Prerequisites#

Before deploying NIM in multi-node mode, make sure you have the following:

  • A Kubernetes cluster with GPU nodes, where each node has the same number and type of GPUs.

  • The LeaderWorkerSet CRD installed on the cluster. LeaderWorkerSet is a Kubernetes-native mechanism for managing leader/worker pod groups:

    kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml
    
  • Shared storage (recommended): A PVC with ReadWriteMany access mode, hostPath, or NFS, so the leader downloads the model once and shares it with workers.

  • High-speed networking (recommended): InfiniBand or RDMA over Converged Ethernet (RoCE) for optimal NCCL performance in multi-node configurations.

  • Sufficient GPU resources across all nodes to satisfy the TP × PP requirement.

  • An NGC API key stored as a Kubernetes secret:

    kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=<your-key>
    
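You can sanity-check these prerequisites from the command line before installing (cluster-dependent; the secret name matches the example above):

```shell
# Verify the LeaderWorkerSet CRD is installed.
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io

# Verify the NGC API key secret exists.
kubectl get secret ngc-api

# List allocatable GPUs per node to confirm capacity for TP x PP.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```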

Deployment with Helm#

The NIM LLM Helm chart natively supports multi-node deployments using LeaderWorkerSet.

Minimal Configuration#

Complete the following steps to deploy the minimal multi-node Helm example:

  1. Create a values.yaml file for your multi-node deployment:

    image:
      repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
      tag: "2.0.1"
    
    model:
      ngcAPISecret: ngc-api
    
    multiNode:
      enabled: true
      workers: 1                  # Number of worker pods (total nodes = workers + 1 leader)
      tensorParallelSize: 8       # GPUs per tensor-parallel group
      pipelineParallelSize: 2     # Number of pipeline stages (typically = number of nodes)
    
    resources:
      limits:
        nvidia.com/gpu: 8
      requests:
        nvidia.com/gpu: 8
    
    persistence:
      enabled: true
      size: 200Gi
      accessMode: ReadWriteMany   # Required for multi-node with shared storage
      storageClass: <your-rwx-storage-class>
    
  2. Install the chart:

    helm install nim-llm nim-llm/ -f values.yaml
    
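After installation, you can watch the leader and worker pods come up. The label selector assumes the chart's standard Helm labels, and the pod name is illustrative (actual names depend on the release name):

```shell
# Pods created by the LeaderWorkerSet: one leader plus `workers` workers.
kubectl get pods -l app.kubernetes.io/instance=nim-llm

# Follow the leader pod's logs for Ray head startup and model download.
kubectl logs -f nim-llm-0
```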

Helm Values Reference#

Use the following Helm values to configure multi-node deployment behavior, model profile selection, and Ray communication settings.

| Parameter | Description | Default |
|---|---|---|
| multiNode.enabled | Enable multi-node deployment mode | false |
| multiNode.workers | Number of worker pods per replica | 1 |
| multiNode.tensorParallelSize | Number of GPUs per tensor-parallel group (sets NIM_TENSOR_PARALLEL_SIZE). Set to 0 to omit. | 0 |
| multiNode.pipelineParallelSize | Number of pipeline stages across nodes (sets NIM_PIPELINE_PARALLEL_SIZE). Set to 0 to omit. | 0 |
| multiNode.ray.port | Ray head node communication port | 6379 |
| model.profile | Explicit profile name or hash (alternative to TP/PP values). Sets NIM_MODEL_PROFILE. | "" |
| model.hfTokenSecret | Kubernetes secret name containing HF_TOKEN for Hugging Face model downloads. | "" |

Profile Selection#

You must specify how the model profile is selected. Choose one of these approaches:

Option 1: Set multiNode.tensorParallelSize and multiNode.pipelineParallelSize directly. The Helm chart injects the NIM_TENSOR_PARALLEL_SIZE and NIM_PIPELINE_PARALLEL_SIZE environment variables, and the correct profile is selected automatically.

helm install nim-llm nim-llm/ \
  --set multiNode.enabled=true \
  --set multiNode.workers=1 \
  --set multiNode.tensorParallelSize=8 \
  --set multiNode.pipelineParallelSize=2

Option 2: Set model.profile to a profile name or hash. The Helm chart injects the NIM_MODEL_PROFILE environment variable.

helm install nim-llm nim-llm/ \
  --set multiNode.enabled=true \
  --set multiNode.workers=1 \
  --set model.profile=vllm-fp16-tp8-pp2

Warning

Multi-node deployment requires either tensorParallelSize/pipelineParallelSize (both > 0) or model.profile to be set. The Helm chart will fail to render if neither is provided.

Model Storage#

The leader and all worker nodes must have access to the model weights at the same filesystem path. There are two supported approaches:

Shared Storage (Recommended)#

Provision shared storage (a PVC with ReadWriteMany access mode, hostPath, or NFS) mounted at the same path on the leader and all workers. The leader downloads the model once, and the workers read the same copy, as shown in the Helm examples above (persistence.accessMode: ReadWriteMany).

Independent Downloads#

If shared storage is not available, each node downloads the model independently to local storage (emptyDir). This works but is not recommended because it wastes time and network bandwidth.
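
A hedged sketch of this configuration, assuming the chart falls back to node-local scratch storage when persistence is disabled:

    # Each pod downloads the model to ephemeral local storage; no PVC is
    # shared between leader and workers. Slower startup, duplicated downloads.
    persistence:
      enabled: false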

Note

Model-free NIM deployments (using model.modelPath) support both shared and independent storage. However, shared storage is strongly recommended for multi-node deployments.

Deployment with the NIM Operator#

The NIM Operator provides a fully automated deployment experience for multi-node NIM. The operator manages the NIMService custom resource and automatically:

  • Generates leader and worker pod specifications

  • Injects the appropriate Ray startup commands

  • Handles PVC setup through NIMCache

  • Manages probes, networking, and secrets

Refer to the NIM Operator Deployment documentation for full details on deploying with the NIM Operator.

Examples#

Use the following examples to configure common multi-node deployments with different tensor and pipeline parallelism settings.

Llama 3.1 405B on 2 Nodes (TP=8, PP=2)#

Two nodes, each with 8 GPUs. The model is split across 16 GPUs total.

image:
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: latest

model:
  ngcAPISecret: ngc-api
  jsonLogging: false

multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2

resources:
  limits:
    nvidia.com/gpu: 8
  requests:
    nvidia.com/gpu: 8

persistence:
  enabled: true
  size: 200Gi
  accessMode: ReadWriteMany
  storageClass: local-nfs

imagePullSecrets:
  - name: nvcr-imagepull

Multi-Node Tensor Parallelism (TP=16, PP=1)#

Two nodes, each with 8 GPUs. A single TP group spans both nodes.

multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 16
  pipelineParallelSize: 1

Note

Multi-node TP requires high-bandwidth, low-latency interconnect (for example, InfiniBand with RDMA) for acceptable performance.

Model-Free Multi-Node Deployment#

Deploy a model from Hugging Face across multiple nodes using model-free NIM:

image:
  repository: nvcr.io/nim/nim-llm
  tag: latest

model:
  modelPath: "hf://meta-llama/Llama-3.1-405B-Instruct"
  ngcAPISecret: ngc-api
  hfTokenSecret: hf-token
  jsonLogging: false

multiNode:
  enabled: true
  workers: 1
  tensorParallelSize: 8
  pipelineParallelSize: 2

persistence:
  enabled: true
  size: 200Gi
  accessMode: ReadWriteMany
  storageClass: local-nfs

Troubleshooting#

Workers Cannot Join the Ray Cluster#

Use the following checks if workers cannot join the Ray cluster:

  • Verify that the Ray port (default 6379) is accessible between pods. Check network policies.

  • Ensure that LWS_LEADER_ADDRESS is being injected correctly by the LeaderWorkerSet controller.

  • Check worker logs for Ray connection errors:

    kubectl logs <worker-pod-name>
    
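You can also test connectivity to the Ray port directly from a worker pod (the pod name placeholder matches the command above; this assumes nc is available in the container image):

```shell
# Check that the leader's Ray port is reachable from a worker pod.
kubectl exec <worker-pod-name> -- sh -c 'nc -zv "$LWS_LEADER_ADDRESS" 6379'
```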

Model Download Failures#

Use the following checks if model downloads fail:

  • Confirm that NGC_API_KEY is set correctly in the referenced secret.

  • If using shared storage, verify the PVC is bound and has ReadWriteMany access mode.

  • Check that the storage class supports the required access mode.

NCCL Communication Errors#

Use the following checks if NCCL communication errors occur:

  • For multi-node TP, ensure high-speed interconnect (InfiniBand/RoCE) is available and configured.

  • Verify that NCCL environment variables are set appropriately for your network topology. You can pass these via env in values.yaml:

    env:
      - name: NCCL_IB_DISABLE
        value: "0"
      - name: NCCL_DEBUG
        value: "INFO"
    

Pod Scheduling Issues#

Use the following checks if pods cannot be scheduled:

  • Verify that each node has the required GPU resources available.

  • Check that nodeSelector, affinity, or tolerations are configured correctly for GPU nodes.

  • Ensure nvidia.com/gpu resource requests match the available GPUs per node.