Multi-Node Deployment#
Multi-Node Deployment with Kubernetes#
Some models are too large to deploy on a single node, even when using multiple GPUs. For these models, you can split the model weights across nodes, and across the GPUs within each node, by deploying NIM on multiple nodes that all have access to the model weights.
To determine whether your model requires multi-node deployment, find the number of GPUs required for your desired model in Supported Models. If you don’t have a single node with at least the specified number of GPUs, you must use multi-node deployment.
Multi-node deployment requires coordinating the creation of NIM containers across multiple nodes and setting up a method for communication between those containers. The recommended approach for this orchestration is to use Kubernetes with the nim-deploy helm chart.
Note
To check server readiness for a multi-node deployment, perform a /v1/health/ready probe and evaluate the response. A successful probe generates the following log entry: [INFO 2025-02-13 02:36:24.635 health.py:43] Health request successful.
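For example, a minimal readiness check might look like the following sketch; the service name, namespace, and port are placeholders for your deployment:

kubectl port-forward service/<nim-service-name> 8000:8000 -n nim
curl -v http://localhost:8000/v1/health/ready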
Preparation#
Before deploying, complete the necessary prerequisites and configuration steps. Afterward, generate an image pull secret to use for deployment.
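As a rough sketch, the two secrets referenced by the examples on this page (ngc-secret for pulling images from nvcr.io and ngc-api for downloading model weights) can be created as follows. The nim namespace and the NGC_API_KEY environment variable are assumptions, and the key name expected inside the ngc-api secret can vary by chart version, so check the chart documentation:

kubectl create namespace nim
# Image pull secret referenced as "ngc-secret" in the examples below.
kubectl create secret docker-registry ngc-secret -n nim \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
# NGC API key secret referenced as "ngc-api" in the examples below.
kubectl create secret generic ngc-api -n nim \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"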
Multi-Node Models#
Note
Compatibility between versions:
LLM NIM versions earlier than 1.3.0 require Helm chart version 1.1.2.
LLM NIM versions 1.3.0 and later, up to but not including 1.7.0, require Helm chart version 1.3.0 or later.
DeepSeek R1 multi-node deployment (LLM NIM version 1.7.0) requires Helm chart version 1.7.0.
Two options exist for deploying multi-node NIMs on Kubernetes: LeaderWorkerSets and MPI Jobs using the MPI Operator.
LeaderWorkerSet#
Note
Requires Kubernetes version >1.26
LeaderWorkerSet (LWS) deployments are the recommended method for deploying Multi-Node models with NIM. To enable LWS deployments, see the installation instructions in the LWS documentation. The helm chart defaults to LWS for multi-node deployment.
With LWS deployments, you will see Leader and Worker pods that coordinate together to run your multi-node models.
LWS deployments support manual scaling and autoscaling, where the entire set of pods is treated as a single replica. However, there are some limitations to scaling when using LWS deployments. If scaling manually (autoscaling is not enabled), you cannot scale above the initial number of replicas set in the helm chart, as in the sketch below.
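For example, a set deployed with an initial replica count of 2 could be manually scaled within that limit using kubectl, since LeaderWorkerSet exposes a scale subresource; the resource name and namespace below are placeholders:

# List LeaderWorkerSet resources created by the chart.
kubectl get lws -n nim
# Scale down to 1 replica and back up to the initial count of 2.
kubectl scale lws/<lws-name> --replicas=1 -n nim
kubectl scale lws/<lws-name> --replicas=2 -n nim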
Use the following example values file to deploy the Llama 3.1 405B model using this method. Refer to the Supported Models section to determine whether your hardware is sufficient to run this model.
image:
  # Adjust to the actual location of the image and version you want
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: 1.0.3
imagePullSecrets:
  - name: ngc-secret
model:
  name: meta/llama-3_1-405b-instruct
  ngcAPISecret: ngc-api
# NVIDIA recommends using an NFS-style read-write-many storage class.
# All nodes need to mount the storage. This example assumes a storage class named "nfs" exists.
persistence:
  enabled: true
  size: 1000Gi
  accessMode: ReadWriteMany
  storageClass: nfs
  annotations:
    helm.sh/resource-policy: "keep"
# This should match `multiNode.gpusPerNode`
resources:
  limits:
    nvidia.com/gpu: 8
multiNode:
  enabled: true
  workers: 2
  gpusPerNode: 8
# Downloading the model can take a long time; allow a generous startup window.
startupProbe:
  failureThreshold: 1500
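A minimal install sketch using this values file follows; the release name, namespace, and chart reference are placeholders, and the chart version must match your NIM version as noted above:

helm install llama-405b <nim-llm-chart> -f custom-values.yaml -n nim --create-namespace
# Watch the leader and worker pods come up; the model download can take a while.
kubectl get pods -n nim -w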
MPI Job#
MPI Jobs using the MPI Operator are an alternative deployment option for clusters that don't support LeaderWorkerSet (Kubernetes versions earlier than v1.27). To enable MPI Jobs, install the MPI Operator. The following custom-values.yaml example disables LeaderWorkerSets and launches an MPI Job:
image:
  # Adjust to the actual location of the image and version you want
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: 1.0.3
imagePullSecrets:
  - name: ngc-secret
model:
  name: meta/llama-3_1-405b-instruct
  ngcAPISecret: ngc-api
# NVIDIA recommends using an NFS-style read-write-many storage class.
# All nodes need to mount the storage. This example assumes a storage class named "nfs" exists.
persistence:
  enabled: true
  size: 1000Gi
  accessMode: ReadWriteMany
  storageClass: nfs
  annotations:
    helm.sh/resource-policy: "keep"
# This should match `multiNode.gpusPerNode`
resources:
  limits:
    nvidia.com/gpu: 8
multiNode:
  enabled: true
  leaderWorkerSet:
    enabled: false
  workers: 2
  gpusPerNode: 8
# Downloading the model can take a long time; allow a generous startup window.
startupProbe:
  failureThreshold: 1500
For MPI Jobs, you will see a launcher pod and one or more worker pods deployed for your model. The launcher pod does not require any GPUs, and deployment logs are available through the launcher pod.
When deploying with MPI Jobs you can set a number of replicas; however, dynamic scaling is not supported without redeploying the helm chart. MPI Jobs also do not automatically restart, so if any pod in the multi-node set fails, the job must be manually uninstalled and reinstalled to start it back up.
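For example, you might follow the deployment logs through the launcher pod and, because failed MPI Jobs are not restarted automatically, recreate the job by reinstalling the release; the pod, release, and chart names are placeholders:

kubectl get pods -n nim
kubectl logs -f <launcher-pod-name> -n nim
# Relaunch a failed job by reinstalling the release.
helm uninstall llama-405b -n nim
helm install llama-405b <nim-llm-chart> -f custom-values.yaml -n nim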
Example: Helm chart for DeepSeek R1 using an SGLang Backend#
Note
DeepSeek R1 multi-node deployment (LLM NIM version 1.7.0) requires Helm chart version 1.7.0.
The following custom-values.yaml example deploys the DeepSeek-R1 model using LeaderWorkerSets.
image:
  repository: [container]
  tag: [container tag]
model:
  logLevel: INFO
  name: deepseek-ai/DeepSeek-R1
  ngcAPISecret: ngc-api
  jsonLogging: false
env:
  - name: NIM_MODEL_PROFILE
    value: [profile id]
  - name: NIM_USE_SGLANG
    value: "1"
  - name: NIM_MULTI_NODE
    value: "1"
  - name: NIM_TENSOR_PARALLEL_SIZE
    value: '8'
  - name: NIM_PIPELINE_PARALLEL_SIZE
    value: '2'
  - name: NGC_HOME
    value: /model-store/ngc/hub
  - name: HF_HOME
    value: /model-store/huggingface/hub
  - name: NUMBA_CACHE_DIR
    value: /tmp/numba
  - name: OUTLINES_CACHE_DIR
    value: /tmp/outlines
  - name: UCX_TLS
    value: ib,tcp,shm
  - name: UCC_TLS
    value: ucp
  - name: UCC_CONFIG_FILE
    value: " "
  # Set GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME to the network interface
  # identified below; ens6f0 is only an example.
  - name: GLOO_SOCKET_IFNAME
    value: ens6f0
  - name: NCCL_SOCKET_IFNAME
    value: ens6f0
  - name: NIM_TRUST_CUSTOM_CODE
    value: "1"
  # Derives the node rank from the LeaderWorkerSet worker index label.
  - name: NIM_NODE_RANK
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
imagePullSecrets:
  - name: ngc-secret
persistence:
  enabled: true
  existingClaim: mypvc
  # size: 1000Gi
  # accessMode: ReadWriteMany
  # storageClass: nfs-client
  # annotations:
  #   helm.sh/resource-policy: "keep"
resources:
  limits:
    nvidia.com/gpu: 8
# customCommand: [ "/opt/nim/start_server.sh" ]
# customCommand: [ "sh", "-c", "while true; do sleep 300; done" ]
multiNode:
  enabled: true
  workers: 2
  gpusPerNode: 8
  # workerCustomCommand: [ "/opt/nim/start_server.sh" ]
  # workerCustomCommand: [ "sh", "-c", "while true; do sleep 300; done" ]
readinessProbe: {}
livenessProbe: {}
startupProbe: {}
GLOO_SOCKET_IFNAME specifies the network interface used to coordinate processes across nodes. An incorrect interface can prevent nodes from discovering each other and setting up the inference job. NCCL_SOCKET_IFNAME determines the network path for inter-GPU communication across nodes. Use the following commands to get the appropriate values for GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME.
apt update
apt install iproute2
ip addr
Select the interface flagged <BROADCAST,UP,LOWER_UP> that has an IP address.
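For example, the brief listing below makes it easier to spot an interface that is UP and has an IP address; the interface name ens6f0 is only an example and varies by machine:

# Brief view: pick an interface that is UP and has an IP address.
ip -brief addr show up
# Full view for a candidate interface, including flags such as <BROADCAST,UP,LOWER_UP>.
ip addr show dev ens6f0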
If you do not have any further configuration needs, you can run the commands to launch the NIM in Kubernetes and run inference. Otherwise, refer to Deploying with Helm to learn about additional configuration options that are not limited to multi-node models (such as storage, telemetry, etc.).
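As a quick end-to-end check once the pods report ready, you can port-forward the service and send an OpenAI-compatible chat completion request; the service name, namespace, and port are placeholders, and the model field must match the model.name value in your values file:

kubectl port-forward service/<nim-service-name> 8000:8000 -n nim
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3_1-405b-instruct",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 64
      }'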
Troubleshooting Multi-Node Deployments#
The interfaces specified by GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME must match the interfaces available inside the container. If the deployment fails with GLOO- or NCCL-related interface errors, use the following script to identify the correct interface and its associated IP address for broadcasting.
import json

import psutil


def get_network_interfaces():
    # Collect the address list and link status for every interface visible to this process.
    interfaces = psutil.net_if_addrs()
    stats = psutil.net_if_stats()
    interface_info = {}
    for iface, addrs in interfaces.items():
        iface_data = {
            "status": "UP" if stats[iface].isup else "DOWN",
            "addresses": []
        }
        for addr in addrs:
            # Record the address family, address, netmask, and broadcast address.
            addr_info = {
                "family": str(addr.family),
                "address": addr.address,
                "netmask": addr.netmask,
                "broadcast": addr.broadcast
            }
            iface_data["addresses"].append(addr_info)
        interface_info[iface] = iface_data
    return interface_info


print(json.dumps(get_network_interfaces(), indent=4))
The output resembles the following:
{
    "lo": {
        "status": "UP",
        "addresses": [
            {
                "family": "2",
                "address": "127.0.0.1",
                "netmask": "255.0.0.0",
                "broadcast": null
            },
            {
                "family": "10",
                "address": "::1",
                "netmask": "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff",
                "broadcast": null
            },
            {
                "family": "17",
                "address": "00:00:00:00:00:00",
                "netmask": null,
                "broadcast": null
            }
        ]
    },
    "eth0": {
        "status": "UP",
        "addresses": [
            {
                "family": "2",
                "address": "[ip v4 address]",
                "netmask": "255.255.0.0",
                "broadcast": "172.17.255.255"
            },
            {
                "family": "17",
                "address": "02:42:ac:11:00:05",
                "netmask": null,
                "broadcast": "ff:ff:ff:ff:ff:ff"
            }
        ]
    }
}
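Because the interfaces must match what the container itself sees, it can help to run the script inside a running leader pod rather than on the host; the pod name and namespace are placeholders, and this assumes psutil is available in the container image:

kubectl cp check_interfaces.py <leader-pod-name>:/tmp/check_interfaces.py -n nim
kubectl exec -it <leader-pod-name> -n nim -- python3 /tmp/check_interfaces.py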
Multi-Node Parameters#
Large models that must span multiple nodes do not work on plain Kubernetes with the GPU Operator alone.
Optimized TensorRT profiles, whether selected automatically or by environment variable, require that you install either LeaderWorkerSets or the MPI Operator's MPIJobs. Because MPIJob is a batch-type resource that is not designed with service stability and reliability in mind, use LeaderWorkerSets if your cluster version supports them.
Multi-node deployments support only optimized profiles.
Name | Description | Value
---|---|---
`multiNode.enabled` | Enables multi-node deployments. |
 | Sets the number of seconds to wait for worker nodes to come up before failing. |
`multiNode.gpusPerNode` | Number of GPUs that are presented to each pod. In most cases, this should match the GPU count requested in `resources.limits`. |
`multiNode.workers` | Specifies how many worker pods per multi-node replica to launch. |
`multiNode.workerCustomCommand` | Sets a custom command array for the worker nodes in a LeaderWorkerSet only. |
`multiNode.leaderWorkerSet.enabled` | NVIDIA recommends you use LeaderWorkerSets for multi-node deployment where the cluster supports them. |
 | Sets the SSH private key for MPI to an existing secret. Otherwise, the Helm chart randomly generates a key during installation. |
 | Annotations applied only to worker pods for MPI Jobs. |
 | Resources section to apply only to the launcher pods in MPI Jobs. |
 | Enables optimized multi-node deployments (the only supported option). |