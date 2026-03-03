Multi-Node Deployment for NVIDIA NIM for LLMs#
Use this documentation to learn how to deploy NIM on multiple, different nodes. Some models are too large to be deployed on a single node, even when using multiple GPUs. For these models, you can split the model weights across different nodes, and across different GPUs on each node.
To determine whether your model requires multi-node deployment, find the number of GPUs required for your desired model in Supported Models for NVIDIA NIM for LLMs. If you do not have a single node with at least the specified number of GPUs, you must use multi-node deployment.
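The decision above comes down to simple arithmetic. The following is a minimal sketch, not part of NIM; the function names and the GPU counts in the usage line are illustrative assumptions, and the real required GPU count must come from the Supported Models page.

```python
import math


def nodes_needed(required_gpus: int, gpus_per_node: int) -> int:
    """Return how many nodes are needed to supply `required_gpus` GPUs."""
    if required_gpus <= 0 or gpus_per_node <= 0:
        raise ValueError("GPU counts must be positive")
    return math.ceil(required_gpus / gpus_per_node)


def needs_multi_node(required_gpus: int, gpus_per_node: int) -> bool:
    """True when no single node has enough GPUs for the model."""
    return nodes_needed(required_gpus, gpus_per_node) > 1


if __name__ == "__main__":
    # Hypothetical numbers: a model that needs 16 GPUs on 8-GPU nodes.
    print(needs_multi_node(16, 8))
```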
Multi-node deployment requires coordinating the creation of NIM containers across multiple different nodes, and setting up a method for communication between those containers.
The recommended approach for this orchestration is to use Kubernetes with the nim-deploy Helm chart.
Note
To check server readiness for a multi-node deployment, send a probe to the /v1/health/ready endpoint and evaluate the response. A successful probe generates the following log entry:
[INFO 2025-02-13 02:36:24.635 health.py:43] Health request successful..
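A readiness probe like the one described in this note could be scripted roughly as follows. This is a sketch under assumptions: it treats an HTTP 200 response as "ready", and the base URL in the usage note (host and port) is hypothetical and depends on how your service is exposed.

```python
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Probe /v1/health/ready and report whether the server answered with HTTP 200.

    `opener` is injectable so the function can be exercised without a live server.
    """
    url = base_url.rstrip("/") + "/v1/health/ready"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx HTTP status.
        return False
```

For example, `is_ready("http://localhost:8000")` would probe a service forwarded to local port 8000 (an assumed address).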
Preparation#
Before deploying, complete the necessary prerequisites and configuration steps. Then generate an image pull secret to use for deployment.
Multi-Node Models#
Note
Compatibility between versions:
LLM NIM version < 1.3.0 requires Helm chart version 1.1.2
LLM NIM version >= 1.3.0 and < 1.7.0 requires Helm chart version 1.3.0+
DeepSeek R1 multi-node deployment (LLM NIM version 1.7.0) requires Helm chart version 1.7.0
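The compatibility rules in this note can be encoded as a small lookup. The sketch below is illustrative only: the function names are made up, it assumes plain `major.minor.patch` version strings, and it returns `None` for versions the note does not cover rather than guessing.

```python
def parse_version(text: str) -> tuple:
    """Turn '1.3.0' into (1, 3, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def required_chart(nim_version: str):
    """Return the Helm chart requirement stated in the note, or None if not covered."""
    version = parse_version(nim_version)
    if version < (1, 3, 0):
        return "1.1.2"
    if version < (1, 7, 0):
        return "1.3.0 or later"
    if version == (1, 7, 0):
        # The note states this rule for DeepSeek R1 multi-node deployment.
        return "1.7.0"
    return None


if __name__ == "__main__":
    print(required_chart("1.0.3"))
```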
Two options exist for deploying multi-node NIMs on Kubernetes: LeaderWorkerSets and MPI Jobs using the MPI Operator.
LeaderWorkerSet#
Note
Requires Kubernetes version >1.26
LeaderWorkerSet (LWS) deployments are the recommended method for deploying multi-node models with NIM. To enable LWS deployments, refer to the installation instructions in the LWS documentation. The Helm chart defaults to LWS for multi-node deployment.
With LWS deployments, you will see Leader and Worker pods that coordinate together to run your multi-node models.
LWS deployments support manual scaling and autoscaling, where the entire set of pods is treated as a single replica. However, there are limitations to scaling with LWS deployments. If you scale manually (autoscaling is not enabled), you cannot scale above the initial number of replicas set in the Helm chart.
Use the following example values file to deploy the Llama 3.1 405B model using this method. Refer to the Supported Models for NVIDIA NIM for LLMs section to determine whether your hardware is sufficient to run this model.
image:
  # Adjust to the actual location of the image and version you want
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: 1.0.3
imagePullSecrets:
  - name: ngc-secret
model:
  name: meta/llama-3_1-405b-instruct
  ngcAPISecret: ngc-api
# NVIDIA recommends using an NFS-style read-write-many storage class.
# All nodes will need to mount the storage. In this example, we assume a storage class exists named "nfs".
persistence:
  enabled: true
  size: 1000Gi
  accessMode: ReadWriteMany
  storageClass: nfs
  annotations:
    helm.sh/resource-policy: "keep"
# This should match `multiNode.gpusPerNode`
resources:
  limits:
    nvidia.com/gpu: 8
multiNode:
  enabled: true
  workers: 2
  gpusPerNode: 8
# Downloading the model will take quite a long time. Give it as much time as ends up being needed.
startupProbe:
  failureThreshold: 1500
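The comment in the values file above says the GPU limit should match `multiNode.gpusPerNode`. A pre-flight check for that could look like the following sketch. It works on a plain dictionary that mirrors the values file (loading the YAML is left out to avoid extra dependencies), and the function name is hypothetical.

```python
def check_gpu_settings(values: dict) -> list:
    """Return a list of human-readable problems found in the GPU settings."""
    problems = []
    multi_node = values.get("multiNode", {})
    gpu_limit = values.get("resources", {}).get("limits", {}).get("nvidia.com/gpu")
    gpus_per_node = multi_node.get("gpusPerNode")
    if multi_node.get("enabled") and gpu_limit != gpus_per_node:
        problems.append(
            f"resources.limits nvidia.com/gpu ({gpu_limit}) does not match "
            f"multiNode.gpusPerNode ({gpus_per_node})"
        )
    return problems


# Dictionary form of the relevant keys from the example values file.
example_values = {
    "resources": {"limits": {"nvidia.com/gpu": 8}},
    "multiNode": {"enabled": True, "workers": 2, "gpusPerNode": 8},
}

if __name__ == "__main__":
    print(check_gpu_settings(example_values))
```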
MPI Job#
MPI Jobs using the MPI Operator are an alternative deployment option for clusters that do not support LeaderWorkerSet (Kubernetes version less than v1.27). To enable MPI Jobs, install the MPI operator. The following custom-values.yaml example disables LeaderWorkerSets and launches an MPI Job:
image:
  # Adjust to the actual location of the image and version you want
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: 1.0.3
imagePullSecrets:
  - name: ngc-secret
model:
  name: meta/llama-3_1-405b-instruct
  ngcAPISecret: ngc-api
# NVIDIA recommends using an NFS-style read-write-many storage class.
# All nodes will need to mount the storage. In this example, we assume a storage class exists named "nfs".
persistence:
  enabled: true
  size: 1000Gi
  accessMode: ReadWriteMany
  storageClass: nfs
  annotations:
    helm.sh/resource-policy: "keep"
# This should match `multiNode.gpusPerNode`
resources:
  limits:
    nvidia.com/gpu: 8
multiNode:
  enabled: true
  leaderWorkerSet:
    enabled: False
  workers: 2
  gpusPerNode: 8
# Downloading the model will take quite a long time. Give it as much time as ends up being needed.
startupProbe:
  failureThreshold: 1500
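To get a feel for how much time `failureThreshold: 1500` allows, the threshold can be multiplied by the probe period. The sketch below assumes a period of 10 seconds, which is the Kubernetes default for `periodSeconds`; the chart may set a different period or an initial delay, so treat the result as an estimate.

```python
from datetime import timedelta


def startup_budget(failure_threshold: int, period_seconds: int = 10) -> timedelta:
    """Approximate time a startup probe tolerates before the container is restarted.

    period_seconds=10 is an assumption (the Kubernetes default), not a chart value.
    """
    return timedelta(seconds=failure_threshold * period_seconds)


if __name__ == "__main__":
    print(startup_budget(1500))  # 4:10:00
```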
For MPI Jobs, you will see a launcher pod and one or more worker pods deployed for your model. The launcher pod does not require any GPUs, and deployment logs are available through the launcher pod.
When deploying with MPI Jobs, you can set the number of replicas. However, dynamic scaling is not supported without redeploying the Helm chart. MPI Jobs also do not restart automatically, so if any pod in the multi-node set fails, uninstall and reinstall the job to start it again.
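Because a failed MPI Job has to be uninstalled and reinstalled, it can help to keep the two Helm commands together. The sketch below only builds and prints the command lines; the release name, chart reference, values file, and namespace are all hypothetical placeholders to replace with your own.

```python
import shlex


def reinstall_commands(release: str, chart: str, values_file: str, namespace: str) -> list:
    """Build the `helm uninstall` and `helm install` argument lists for a redeploy."""
    return [
        ["helm", "uninstall", release, "--namespace", namespace],
        ["helm", "install", release, chart, "--namespace", namespace, "-f", values_file],
    ]


if __name__ == "__main__":
    # All four arguments are placeholders, not values from this documentation.
    for command in reinstall_commands("my-nim", "./nim-llm", "custom-values.yaml", "nim"):
        print(shlex.join(command))
```

Printing the commands rather than executing them leaves the decision to redeploy with the operator.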
Example: Helm chart for DeepSeek R1 using an SGLang Backend#
Note
DeepSeek R1 multi-node deployment (LLM NIM version 1.7.0) requires Helm chart version 1.7.0
This custom-values.yaml file example deploys the DeepSeek-R1 model using LeaderWorkerSets.
image:
  repository: [container]
  tag: [container tag]
model:
  logLevel: INFO
  name: deepseek-ai/DeepSeek-R1
  ngcAPISecret: ngc-api
  jsonLogging: false
env:
  - name: NIM_MODEL_PROFILE
    value: [profile id]
  - name: NIM_USE_SGLANG
    value: "1"
  - name: NIM_MULTI_NODE
    value: "1"
  - name: NIM_TENSOR_PARALLEL_SIZE
    value: '8'
  - name: NIM_PIPELINE_PARALLEL_SIZE
    value: '2'
  - name: NGC_HOME
    value: /model-store/ngc/hub
  - name: HF_HOME
    value: /model-store/huggingface/hub
  - name: NUMBA_CACHE_DIR
    value: /tmp/numba
  - name: OUTLINES_CACHE_DIR
    value: /tmp/outlines
  - name: UCX_TLS
    value: ib,tcp,shm
  - name: UCC_TLS
    value: ucp
  - name: UCC_CONFIG_FILE
    value: " "
  - name: GLOO_SOCKET_IFNAME
    value: ens6f0
  - name: NCCL_SOCKET_IFNAME
    value: ens6f0
  - name: NIM_TRUST_CUSTOM_CODE
    value: "1"
  - name: NIM_NODE_RANK
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
imagePullSecrets:
  - name: ngc-secret
persistence:
  enabled: true
  existingClaim: mypvc
  # size: 1000Gi
  # accessMode: ReadWriteMany
  # storageClass: nfs-client
  # annotations:
  #   helm.sh/resource-policy: "keep"
resources:
  limits:
    nvidia.com/gpu: 8
# customCommand: [ "/opt/nim/start_server.sh" ]
# customCommand: [ "sh", "-c", "while true; do sleep 300; done" ]
multiNode:
  enabled: true
  workers: 2
  gpusPerNode: 8
  # workerCustomCommand: [ "/opt/nim/start_server.sh" ]
  # workerCustomCommand: [ "sh", "-c", "while true; do sleep 300; done" ]
readinessProbe: {}
livenessProbe: {}
startupProbe: {}
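In the example above, the tensor-parallel size (8) times the pipeline-parallel size (2) equals `workers` (2) times `gpusPerNode` (8). Whether this equality is a strict requirement is an assumption inferred from those numbers, not something this page states. Under that assumption, a quick consistency check might look like this sketch, with hypothetical function names and a dictionary that mirrors the relevant keys:

```python
def env_int(values: dict, name: str) -> int:
    """Read an integer environment variable from the chart's `env` list."""
    for item in values.get("env", []):
        if item.get("name") == name:
            return int(item["value"])
    raise KeyError(name)


def parallelism_matches(values: dict) -> bool:
    """Check TP x PP against workers x gpusPerNode (assumed relationship)."""
    tensor_parallel = env_int(values, "NIM_TENSOR_PARALLEL_SIZE")
    pipeline_parallel = env_int(values, "NIM_PIPELINE_PARALLEL_SIZE")
    multi_node = values["multiNode"]
    total_gpus = multi_node["workers"] * multi_node["gpusPerNode"]
    return tensor_parallel * pipeline_parallel == total_gpus


# Dictionary form of the relevant keys from the DeepSeek-R1 example.
example_values = {
    "env": [
        {"name": "NIM_TENSOR_PARALLEL_SIZE", "value": "8"},
        {"name": "NIM_PIPELINE_PARALLEL_SIZE", "value": "2"},
    ],
    "multiNode": {"enabled": True, "workers": 2, "gpusPerNode": 8},
}

if __name__ == "__main__":
    print(parallelism_matches(example_values))
```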
Note
For the multi-LLM compatible NIM container, specify the HF_TOKEN and NIM_MODEL_NAME environment variables, with NIM_MODEL_NAME set to a Hugging Face URI, in addition to the rest of the environment variables in the example above. For example, add the following environment variables to run inference with the Hugging Face model hf://meta-llama/Meta-Llama-3-8B:
env:
  - name: NIM_MODEL_NAME
    value: hf://meta-llama/Meta-Llama-3-8B
  - name: HF_TOKEN
    value: x
How do you obtain the right values for GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME?
apt update
apt install iproute2
ip addr
Select the interface that shows the <BROADCAST,UP,LOWER_UP> flags and has an IP address.
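The selection rule above can also be applied programmatically to the text printed by `ip addr`. The following sketch parses that output with regular expressions; the sample text, interface names, and addresses are made up for illustration, and real output may contain details this simple parser does not handle.

```python
import re

# "2: ens6f0: <BROADCAST,MULTICAST,UP,LOWER_UP> ..." -> name and flag list
HEADER = re.compile(r"^\d+:\s+([^:@\s]+)[^:]*:\s+<([^>]*)>")
# "    inet 10.0.0.5/24 ..." (IPv4 only; "inet6" lines do not match)
INET = re.compile(r"^\s+inet\s+(\S+)")
REQUIRED_FLAGS = {"BROADCAST", "UP", "LOWER_UP"}


def candidate_interfaces(ip_addr_output: str) -> list:
    """Return (interface, IPv4 address) pairs that carry all required flags."""
    candidates = []
    current, flags = None, set()
    for line in ip_addr_output.splitlines():
        header = HEADER.match(line)
        if header:
            current, flags = header.group(1), set(header.group(2).split(","))
            continue
        inet = INET.match(line)
        if inet and current and REQUIRED_FLAGS <= flags:
            candidates.append((current, inet.group(1)))
    return candidates


# Hypothetical `ip addr` output used only to exercise the parser.
SAMPLE = "\n".join([
    "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN",
    "    inet 127.0.0.1/8 scope host lo",
    "2: ens6f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP",
    "    inet 10.0.0.5/24 brd 10.0.0.255 scope global ens6f0",
    "3: ens6f1: <BROADCAST,MULTICAST> mtu 1500 state DOWN",
])

if __name__ == "__main__":
    print(candidate_interfaces(SAMPLE))
```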
Why are GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME needed?
GLOO_SOCKET_IFNAME is used to specify the network interface responsible for coordinating processes across nodes. An incorrect interface can prevent nodes from discovering each other and setting up the inference job.
NCCL_SOCKET_IFNAME determines the network path for inter-GPU communication across nodes.
If you do not have any further configuration needs, you can run the commands to launch the NIM in Kubernetes and run inference. Otherwise, refer to Deploying with Helm to learn about additional configuration options that are not limited to multi-node models (such as storage, telemetry, etc.).
Troubleshooting Multi-Node Deployments#
The interfaces named in GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME must match interfaces that exist inside the container.
If the deployment fails with GLOO- or NCCL-related interface errors, run the following script inside the container to identify the correct interface and its associated IP address for broadcasting. In the example output that follows the script, the correct interface is eth0.
import json

import psutil  # third-party package: pip install psutil


def get_network_interfaces():
    """Collect the status and addresses of every network interface."""
    interfaces = psutil.net_if_addrs()
    stats = psutil.net_if_stats()
    interface_info = {}
    for iface, addrs in interfaces.items():
        iface_data = {
            "status": "UP" if stats[iface].isup else "DOWN",
            "addresses": []
        }
        # Record every address (IPv4, IPv6, link layer) bound to the interface.
        for addr in addrs:
            addr_info = {
                "family": str(addr.family),
                "address": addr.address,
                "netmask": addr.netmask,
                "broadcast": addr.broadcast
            }
            iface_data["addresses"].append(addr_info)
        interface_info[iface] = iface_data
    return interface_info


print(json.dumps(get_network_interfaces(), indent=4))
{
    "lo": {
        "status": "UP",
        "addresses": [
            {
                "family": "2",
                "address": "127.0.0.1",
                "netmask": "255.0.0.0",
                "broadcast": null
            },
            {
                "family": "10",
                "address": "::1",
                "netmask": "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff",
                "broadcast": null
            },
            {
                "family": "17",
                "address": "00:00:00:00:00:00",
                "netmask": null,
                "broadcast": null
            }
        ]
    },
    "eth0": {
        "status": "UP",
        "addresses": [
            {
                "family": "2",
                "address": "[ip v4 address]",
                "netmask": "255.255.0.0",
                "broadcast": "172.17.255.255"
            },
            {
                "family": "17",
                "address": "02:42:ac:11:00:05",
                "netmask": null,
                "broadcast": "ff:ff:ff:ff:ff:ff"
            }
        ]
    }
}
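Reading the JSON above by eye works for two interfaces but gets tedious on hosts with many. The sketch below filters a report of that shape for interfaces that are UP and have an IPv4 address with a broadcast address. It assumes IPv4 appears as family "2", as in the sample output, and also accepts "AddressFamily.AF_INET", which some Python versions print instead; the address in the sample data is a made-up stand-in for the placeholder above.

```python
IPV4_FAMILIES = {"2", "AddressFamily.AF_INET"}


def usable_interfaces(report: dict) -> dict:
    """Map interface name -> IPv4 address for UP interfaces that can broadcast."""
    usable = {}
    for name, data in report.items():
        if data["status"] != "UP":
            continue
        for addr in data["addresses"]:
            if addr["family"] in IPV4_FAMILIES and addr["broadcast"]:
                usable[name] = addr["address"]
    return usable


# Trimmed-down report in the same shape as the script output; addresses are hypothetical.
sample_report = {
    "lo": {
        "status": "UP",
        "addresses": [
            {"family": "2", "address": "127.0.0.1", "netmask": "255.0.0.0", "broadcast": None},
        ],
    },
    "eth0": {
        "status": "UP",
        "addresses": [
            {"family": "2", "address": "172.17.0.5", "netmask": "255.255.0.0", "broadcast": "172.17.255.255"},
            {"family": "17", "address": "02:42:ac:11:00:05", "netmask": None, "broadcast": "ff:ff:ff:ff:ff:ff"},
        ],
    },
}

if __name__ == "__main__":
    print(usable_interfaces(sample_report))
```

The loopback interface is excluded because it has no broadcast address, which leaves eth0 as the candidate for both variables.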
Multi-Node Parameters#
Large models that must span multiple nodes do not work on plain Kubernetes with the GPU Operator alone at this time.
Optimized TensorRT profiles, when selected automatically or by environment variable, require either LeaderWorkerSets or the MPI Operator’s MPIJobs to be installed.
Since MPIJob is a batch-type resource that is not designed with service stability and reliability in mind, you should use LeaderWorkerSets if your cluster version allows it.
Only optimized profiles are supported for multi-node deployment at this time.
| Name | Description | Value |
|------|-------------|-------|
| | Enables multi-node deployments. | |
| | Sets the number of seconds to wait for worker nodes to come up before failing. | |
| | Number of GPUs that will be presented to each pod. In most cases, this should match | |
| | Specifies how many worker pods per multi-node replica to launch. | |
| | Sets a custom command array for the worker nodes in a LeaderWorkerSet only. | |
| | NVIDIA recommends you use | |
| | Sets the SSH private key for MPI to an existing secret. Otherwise, the Helm chart generates a key randomly during installation. | |
| | Annotations only applied to workers for | |
| | Resources section to apply only to the launcher pods in | |
| | Enables optimized multi-node deployments (currently the only option supported). | |