Multi-Node Deployment#
Multi-Node Deployment with Kubernetes#
Some models are too large to deploy on a single node, even when using multiple GPUs. For these models, you can split the model weights across nodes, and across the GPUs within each node, by deploying NIM on multiple nodes that all have access to the model weights.
To determine whether your model requires multi-node deployment, find the number of GPUs required for your desired model in Supported Models. If you don’t have a single node with at least the specified number of GPUs, you must use multi-node deployment.
Multi-node deployment requires coordinating the creation of NIM containers across multiple nodes and setting up a method for communication between those containers. The recommended approach for this orchestration is to use Kubernetes with the nim-deploy helm chart.
Note
To check server readiness for a multi-node deployment, perform a /v1/health/ready probe and evaluate the response. A successful probe generates the following log entry: [INFO 2025-02-13 02:36:24.635 health.py:43] Health request successful.
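For example, a minimal readiness check might look like the following sketch; the service name, namespace, and port are placeholders for your deployment:

kubectl port-forward service/<nim-service-name> 8000:8000 -n nim
curl -v http://localhost:8000/v1/health/ready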
Preparation#
Before deploying, complete the necessary prerequisites and configuration steps. Afterward, generate an image pull secret to use for deployment.
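As a rough sketch, the two secrets referenced by the examples on this page (ngc-secret for pulling images from nvcr.io and ngc-api for downloading model weights) can be created as follows. The nim namespace and the NGC_API_KEY environment variable are assumptions, and the key name expected inside the ngc-api secret can vary by chart version, so check the chart documentation:

kubectl create namespace nim
# Image pull secret referenced as "ngc-secret" in the examples below.
kubectl create secret docker-registry ngc-secret -n nim \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
# NGC API key secret referenced as "ngc-api" in the examples below.
kubectl create secret generic ngc-api -n nim \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"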
Multi-Node Models#
Note
Compatibility between versions:
LLM NIM versions earlier than 1.3.0 require Helm chart version 1.1.2.
LLM NIM versions 1.3.0 and later, up to but not including 1.7.0, require Helm chart version 1.3.0 or later.
DeepSeek R1 multi-node deployment (LLM NIM version 1.7.0) requires Helm chart version 1.7.0.
Two options exist for deploying multi-node NIMs on Kubernetes: LeaderWorkerSets and MPI Jobs using the MPI Operator.
LeaderWorkerSet#
Note
Requires Kubernetes version >1.26
LeaderWorkerSet (LWS) deployments are the recommended method for deploying Multi-Node models with NIM. To enable LWS deployments, see the installation instructions in the LWS documentation. The helm chart defaults to LWS for multi-node deployment.
With LWS deployments, you will see Leader and Worker pods that coordinate together to run your multi-node models.
LWS deployments support manual scaling and autoscaling, where the entire set of pods is treated as a single replica. However, there are some limitations to scaling when using LWS deployments. If scaling manually (autoscaling is not enabled), you cannot scale above the initial number of replicas set in the helm chart, as in the sketch below.
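For example, a set deployed with an initial replica count of 2 could be manually scaled within that limit using kubectl, since LeaderWorkerSet exposes a scale subresource; the resource name and namespace below are placeholders:

# List LeaderWorkerSet resources created by the chart.
kubectl get lws -n nim
# Scale down to 1 replica and back up to the initial count of 2.
kubectl scale lws/<lws-name> --replicas=1 -n nim
kubectl scale lws/<lws-name> --replicas=2 -n nim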
Use the following example values file to deploy the Llama 3.1 405B model using this method. Refer to the Supported Models section to determine whether your hardware is sufficient to run this model.
image:
  # Adjust to the actual location of the image and version you want
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: 1.0.3
imagePullSecrets:
  - name: ngc-secret
model:
  name: meta/llama-3_1-405b-instruct
  ngcAPISecret: ngc-api
# NVIDIA recommends using an NFS-style read-write-many storage class.
# All nodes need to mount the storage. This example assumes a storage class named "nfs" exists.
persistence:
  enabled: true
  size: 1000Gi
  accessMode: ReadWriteMany
  storageClass: nfs
  annotations:
    helm.sh/resource-policy: "keep"
# This should match `multiNode.gpusPerNode`
resources:
  limits:
    nvidia.com/gpu: 8
multiNode:
  enabled: true
  workers: 2
  gpusPerNode: 8
# Downloading the model can take a long time; allow a generous startup window.
startupProbe:
  failureThreshold: 1500
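A minimal install sketch using this values file follows; the release name, namespace, and chart reference are placeholders, and the chart version must match your NIM version as noted above:

helm install llama-405b <nim-llm-chart> -f custom-values.yaml -n nim --create-namespace
# Watch the leader and worker pods come up; the model download can take a while.
kubectl get pods -n nim -w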
MPI Job#
MPI Jobs using the MPI Operator are an alternative deployment option for clusters that don't support LeaderWorkerSet (Kubernetes versions earlier than v1.27). To enable MPI Jobs, install the MPI Operator. The following custom-values.yaml example disables LeaderWorkerSets and launches an MPI Job:
image:
  # Adjust to the actual location of the image and version you want
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: 1.0.3
imagePullSecrets:
  - name: ngc-secret
model:
  name: meta/llama-3_1-405b-instruct
  ngcAPISecret: ngc-api
# NVIDIA recommends using an NFS-style read-write-many storage class.
# All nodes need to mount the storage. This example assumes a storage class named "nfs" exists.
persistence:
  enabled: true
  size: 1000Gi
  accessMode: ReadWriteMany
  storageClass: nfs
  annotations:
    helm.sh/resource-policy: "keep"
# This should match `multiNode.gpusPerNode`
resources:
  limits:
    nvidia.com/gpu: 8
multiNode:
  enabled: true
  leaderWorkerSet:
    enabled: false
  workers: 2
  gpusPerNode: 8
# Downloading the model can take a long time; allow a generous startup window.
startupProbe:
  failureThreshold: 1500
For MPI Jobs, you will see a launcher pod and one or more worker pods deployed for your model. The launcher pod does not require any GPUs, and deployment logs are available through the launcher pod.
When deploying with MPI Jobs you can set a number of replicas; however, dynamic scaling is not supported without redeploying the helm chart. MPI Jobs also do not automatically restart, so if any pod in the multi-node set fails, the job must be manually uninstalled and reinstalled to start it back up.
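For example, you might follow the deployment logs through the launcher pod and, because failed MPI Jobs are not restarted automatically, recreate the job by reinstalling the release; the pod, release, and chart names are placeholders:

kubectl get pods -n nim
kubectl logs -f <launcher-pod-name> -n nim
# Relaunch a failed job by reinstalling the release.
helm uninstall llama-405b -n nim
helm install llama-405b <nim-llm-chart> -f custom-values.yaml -n nim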
Example: Helm chart for DeepSeek R1 using an SGLang Backend#
Note
DeepSeek R1 multi-node deployment (LLM NIM version 1.7.0) requires Helm chart version 1.7.0.
The following custom-values.yaml example deploys the DeepSeek-R1 model using LeaderWorkerSets.
image:
  repository: [container]
  tag: [container tag]
model:
  logLevel: INFO
  name: deepseek-ai/DeepSeek-R1
  ngcAPISecret: ngc-api
  jsonLogging: false
env:
  - name: NIM_MODEL_PROFILE
    value: [profile id]
  - name: NIM_USE_SGLANG
    value: "1"
  - name: NIM_MULTI_NODE
    value: "1"
  - name: NIM_TENSOR_PARALLEL_SIZE
    value: '8'
  - name: NIM_PIPELINE_PARALLEL_SIZE
    value: '2'
  - name: NGC_HOME
    value: /model-store/ngc/hub
  - name: HF_HOME
    value: /model-store/huggingface/hub
  - name: NUMBA_CACHE_DIR
    value: /tmp/numba
  - name: OUTLINES_CACHE_DIR
    value: /tmp/outlines
  - name: UCX_TLS
    value: ib,tcp,shm
  - name: UCC_TLS
    value: ucp
  - name: UCC_CONFIG_FILE
    value: " "
  # Set GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME to the network interface
  # identified below; ens6f0 is only an example.
  - name: GLOO_SOCKET_IFNAME
    value: ens6f0
  - name: NCCL_SOCKET_IFNAME
    value: ens6f0
  - name: NIM_TRUST_CUSTOM_CODE
    value: "1"
  # Derives the node rank from the LeaderWorkerSet worker index label.
  - name: NIM_NODE_RANK
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
imagePullSecrets:
  - name: ngc-secret
persistence:
  enabled: true
  existingClaim: mypvc
  # size: 1000Gi
  # accessMode: ReadWriteMany
  # storageClass: nfs-client
  # annotations:
  #   helm.sh/resource-policy: "keep"
resources:
  limits:
    nvidia.com/gpu: 8
# customCommand: [ "/opt/nim/start_server.sh" ]
# customCommand: [ "sh", "-c", "while true; do sleep 300; done" ]
multiNode:
  enabled: true
  workers: 2
  gpusPerNode: 8
  # workerCustomCommand: [ "/opt/nim/start_server.sh" ]
  # workerCustomCommand: [ "sh", "-c", "while true; do sleep 300; done" ]
readinessProbe: {}
livenessProbe: {}
startupProbe: {}
GLOO_SOCKET_IFNAME specifies the network interface used to coordinate processes across nodes. An incorrect interface can prevent nodes from discovering each other and setting up the inference job. NCCL_SOCKET_IFNAME determines the network path for inter-GPU communication across nodes. Use the following commands to get the appropriate values for GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME.
apt update
apt install iproute2
ip addr
Select the interface flagged <BROADCAST,UP,LOWER_UP> that has an IP address.
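For example, the brief listing below makes it easier to spot an interface that is UP and has an IP address; the interface name ens6f0 is only an example and varies by machine:

# Brief view: pick an interface that is UP and has an IP address.
ip -brief addr show up
# Full view for a candidate interface, including flags such as <BROADCAST,UP,LOWER_UP>.
ip addr show dev ens6f0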
If you do not have any further configuration needs, you can run the commands to launch the NIM in Kubernetes and run inference. Otherwise, refer to Deploying with Helm to learn about additional configuration options that are not limited to multi-node models (such as storage, telemetry, etc.).
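As a quick end-to-end check once the pods report ready, you can port-forward the service and send an OpenAI-compatible chat completion request; the service name, namespace, and port are placeholders, and the model field must match the model.name value in your values file:

kubectl port-forward service/<nim-service-name> 8000:8000 -n nim
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3_1-405b-instruct",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 64
      }'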
Troubleshooting Multi-Node Deployments#
The interfaces specified by GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME must match the interfaces available inside the container. If the deployment fails with GLOO- or NCCL-related interface errors, use the following script to identify the correct interface and its associated IP address for broadcasting.
import json

import psutil


def get_network_interfaces():
    # Collect the address list and link status for every interface visible to this process.
    interfaces = psutil.net_if_addrs()
    stats = psutil.net_if_stats()
    interface_info = {}
    for iface, addrs in interfaces.items():
        iface_data = {
            "status": "UP" if stats[iface].isup else "DOWN",
            "addresses": []
        }
        for addr in addrs:
            # Record the address family, address, netmask, and broadcast address.
            addr_info = {
                "family": str(addr.family),
                "address": addr.address,
                "netmask": addr.netmask,
                "broadcast": addr.broadcast
            }
            iface_data["addresses"].append(addr_info)
        interface_info[iface] = iface_data
    return interface_info


print(json.dumps(get_network_interfaces(), indent=4))
The output resembles the following:
{
    "lo": {
        "status": "UP",
        "addresses": [
            {
                "family": "2",
                "address": "127.0.0.1",
                "netmask": "255.0.0.0",
                "broadcast": null
            },
            {
                "family": "10",
                "address": "::1",
                "netmask": "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff",
                "broadcast": null
            },
            {
                "family": "17",
                "address": "00:00:00:00:00:00",
                "netmask": null,
                "broadcast": null
            }
        ]
    },
    "eth0": {
        "status": "UP",
        "addresses": [
            {
                "family": "2",
                "address": "[ip v4 address]",
                "netmask": "255.255.0.0",
                "broadcast": "172.17.255.255"
            },
            {
                "family": "17",
                "address": "02:42:ac:11:00:05",
                "netmask": null,
                "broadcast": "ff:ff:ff:ff:ff:ff"
            }
        ]
    }
}
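Because the interfaces must match what the container itself sees, it can help to run the script inside a running leader pod rather than on the host; the pod name and namespace are placeholders, and this assumes psutil is available in the container image:

kubectl cp check_interfaces.py <leader-pod-name>:/tmp/check_interfaces.py -n nim
kubectl exec -it <leader-pod-name> -n nim -- python3 /tmp/check_interfaces.py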
Multi-Node Parameters#
Large models that must span multiple nodes do not work on plain Kubernetes with the GPU Operator alone.
Optimized TensorRT profiles, whether selected automatically or by environment variable, require that you install either LeaderWorkerSets or the MPI Operator's MPIJobs. Because MPIJob is a batch-type resource that is not designed with service stability and reliability in mind, use LeaderWorkerSets if your cluster version supports them.
Multi-node deployments support only optimized profiles.
Name | Description | Value
---|---|---
`multiNode.enabled` | Enables multi-node deployments. |
 | Sets the number of seconds to wait for worker nodes to come up before failing. |
`multiNode.gpusPerNode` | Number of GPUs that are presented to each pod. In most cases, this should match the GPU count requested in `resources.limits`. |
`multiNode.workers` | Specifies how many worker pods per multi-node replica to launch. |
`multiNode.workerCustomCommand` | Sets a custom command array for the worker nodes in a LeaderWorkerSet only. |
`multiNode.leaderWorkerSet.enabled` | NVIDIA recommends you use LeaderWorkerSets for multi-node deployment where the cluster supports them. |
 | Sets the SSH private key for MPI to an existing secret. Otherwise, the Helm chart randomly generates a key during installation. |
 | Annotations applied only to worker pods for MPI Jobs. |
 | Resources section to apply only to the launcher pods in MPI Jobs. |
 | Enables optimized multi-node deployments (the only supported option). |