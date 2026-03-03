Multi-Node Deployment for NVIDIA NIM for LLMs#

This page describes how to deploy NIM across multiple nodes. Some models are too large to deploy on a single node, even one with multiple GPUs. For these models, you can split the model weights across several nodes and across the GPUs on each node.

To determine whether your model requires multi-node deployment, find the number of GPUs required for your desired model in Supported Models for NVIDIA NIM for LLMs. If you do not have a single node with at least the specified number of GPUs, you must use multi-node deployment.
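
As a rough illustration of the rule above, the check can be written as a small helper. This is only a sketch: the function name `plan_nodes` and the GPU counts in the example call are hypothetical, and it assumes every node offers the same number of GPUs. Take the real GPU requirement from Supported Models for NVIDIA NIM for LLMs.

```python
import math


def plan_nodes(required_gpus: int, gpus_per_node: int) -> dict:
    """Decide whether a model fits on one node and how many nodes it needs.

    required_gpus: GPU count listed for the model in the support matrix.
    gpus_per_node: GPUs available on each node (assumed identical nodes).
    """
    if required_gpus < 1 or gpus_per_node < 1:
        raise ValueError("GPU counts must be positive integers")
    # Round up: a partially used node is still a node.
    nodes = math.ceil(required_gpus / gpus_per_node)
    return {"multi_node": nodes > 1, "nodes": nodes}


# Hypothetical numbers: a model that needs 16 GPUs, on nodes with 8 GPUs each.
print(plan_nodes(16, 8))  # prints {'multi_node': True, 'nodes': 2}
```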

Multi-node deployment requires coordinating the creation of NIM containers across multiple nodes and setting up a method for communication between those containers. The recommended approach for this orchestration is to use Kubernetes with the nim-deploy Helm chart.

Note: To check server readiness for a multi-node deployment, send a probe to /v1/health/ready and evaluate the response. A successful probe generates the following log entry: [INFO 2025-02-13 02:36:24.635 health.py:43] Health request successful.
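
A readiness probe like the one described in this note could be scripted roughly as follows. This is a minimal sketch using only the Python standard library. It assumes that an HTTP 200 answer means "ready", which the text above does not spell out. The helper names `is_ready` and `wait_until_ready`, the `opener` parameter, and the retry defaults are illustrative choices, not part of NIM.

```python
import time
import urllib.request


def is_ready(base_url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if GET <base_url>/v1/health/ready answers with HTTP 200."""
    url = base_url.rstrip("/") + "/v1/health/ready"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib raises URLError/HTTPError (both OSError subclasses) when the
        # server is unreachable, times out, or answers with an error status.
        return False


def wait_until_ready(base_url, attempts=30, delay=10.0,
                     opener=urllib.request.urlopen):
    """Poll the readiness endpoint until it succeeds or attempts run out."""
    for attempt in range(attempts):
        if is_ready(base_url, opener=opener):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next probe
    return False
```

For example, `wait_until_ready("http://localhost:8000")` would poll a server at that address, where the host and port are placeholders for wherever your deployment is exposed. The `opener` argument exists only so the function can be exercised without a live server.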

Multi-Node Models#

Note: Compatibility between versions:

LLM NIM version < 1.3.0 requires Helm chart version 1.1.2

LLM NIM version >= 1.3.0 < 1.7.0 requires Helm chart version 1.3.0+

DeepSeek R1 multi-node deployment (LLM NIM version 1.7.0) requires Helm chart version 1.7.0

Two options exist for deploying multi-node NIMs on Kubernetes: LeaderWorkerSets and MPI Jobs using the MPI Operator.

LeaderWorkerSet#

Note: Requires Kubernetes version >1.26

LeaderWorkerSet (LWS) deployments are the recommended method for deploying multi-node models with NIM. To enable LWS deployments, refer to the installation instructions in the LWS documentation. The Helm chart defaults to LWS for multi-node deployment.

With LWS deployments, you will see Leader and Worker pods that coordinate to run your multi-node models.

LWS deployments support manual scaling and autoscaling, where the entire set of pods is treated as a single replica. However, scaling with LWS deployments has limitations. If you scale manually (autoscaling is not enabled), you cannot scale above the initial number of replicas set in the Helm chart.

Use the following example values file to deploy the Llama 3.1 405B model using this method. Refer to the Supported Models for NVIDIA NIM for LLMs section to determine whether your hardware is sufficient to run this model.

```yaml
image:
  # Adjust to the actual location of the image and version you want
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: 1.0.3
imagePullSecrets:
  - name: ngc-secret
model:
  name: meta/llama-3_1-405b-instruct
  ngcAPISecret: ngc-api
# NVIDIA recommends using an NFS-style read-write-many storage class.
# All nodes will need to mount the storage. In this example, we assume a storage class named "nfs" exists.
persistence:
  enabled: true
  size: 1000Gi
  accessMode: ReadWriteMany
  storageClass: nfs
  annotations:
    helm.sh/resource-policy: "keep"
# This should match `multiNode.gpusPerNode`
resources:
  limits:
    nvidia.com/gpu: 8
multiNode:
  enabled: true
  workers: 2
  gpusPerNode: 8
# Downloading the model will take quite a long time. Give it as much time as ends up being needed.
startupProbe:
  failureThreshold: 1500
```

MPI Job#

MPI Jobs using the MPI Operator are an alternative deployment option for clusters that do not support LeaderWorkerSet (Kubernetes version less than v1.27). To enable MPI Jobs, install the MPI operator. The following custom-values.yaml example disables LeaderWorkerSets and launches an MPI Job:

```yaml
image:
  # Adjust to the actual location of the image and version you want
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: 1.0.3
imagePullSecrets:
  - name: ngc-secret
model:
  name: meta/llama-3_1-405b-instruct
  ngcAPISecret: ngc-api
# NVIDIA recommends using an NFS-style read-write-many storage class.
# All nodes will need to mount the storage. In this example, we assume a storage class named "nfs" exists.
persistence:
  enabled: true
  size: 1000Gi
  accessMode: ReadWriteMany
  storageClass: nfs
  annotations:
    helm.sh/resource-policy: "keep"
# This should match `multiNode.gpusPerNode`
resources:
  limits:
    nvidia.com/gpu: 8
multiNode:
  enabled: true
  leaderWorkerSet:
    enabled: False
  workers: 2
  gpusPerNode: 8
# Downloading the model will take quite a long time. Give it as much time as ends up being needed.
startupProbe:
  failureThreshold: 1500
```

For MPI Jobs, you will see a launcher pod and one or more worker pods deployed for your model. The launcher pod does not require any GPUs, and deployment logs are available through the launcher pod.

When deploying with MPI Jobs, you can set the number of replicas. However, dynamic scaling is not supported without redeploying the Helm chart. MPI Jobs also do not restart automatically, so if any pod in the multi-node set fails, uninstall and reinstall the job to start it again.
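
Both example values files above carry two constraints in their comments: the GPU limit should match `multiNode.gpusPerNode`, and the storage is recommended to be read-write-many because every node mounts it. The sketch below checks those two points on a values mapping that has already been loaded into a plain Python dict. The function name `check_multinode_values` is hypothetical, and the checks cover only what the comments state, not a full validation of the chart.

```python
def check_multinode_values(values: dict) -> list:
    """Return a list of problems found in a multi-node values mapping."""
    problems = []

    multi = values.get("multiNode") or {}
    if multi.get("enabled") is not True:
        problems.append("multiNode.enabled should be true")

    # The example comments say the GPU limit should match multiNode.gpusPerNode.
    limits = (values.get("resources") or {}).get("limits") or {}
    gpu_limit = limits.get("nvidia.com/gpu")
    gpus_per_node = multi.get("gpusPerNode")
    if gpu_limit != gpus_per_node:
        problems.append(
            f"resources.limits nvidia.com/gpu ({gpu_limit}) "
            f"does not match multiNode.gpusPerNode ({gpus_per_node})"
        )

    # All nodes mount the same storage, so a newly created claim is
    # recommended to use ReadWriteMany. Existing claims are not inspected.
    persistence = values.get("persistence") or {}
    if persistence.get("enabled") and "existingClaim" not in persistence:
        if persistence.get("accessMode") != "ReadWriteMany":
            problems.append("persistence.accessMode is recommended to be ReadWriteMany")

    return problems


# The relevant keys from the example values, written as a plain dict.
example = {
    "resources": {"limits": {"nvidia.com/gpu": 8}},
    "persistence": {"enabled": True, "accessMode": "ReadWriteMany", "storageClass": "nfs"},
    "multiNode": {"enabled": True, "workers": 2, "gpusPerNode": 8},
}
print(check_multinode_values(example))  # prints []
```

Keeping the check on a plain dict avoids any dependency on a YAML library. If you load the file with a YAML parser of your choice, the resulting mapping can be passed in unchanged.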

Example: Helm chart for DeepSeek R1 using an SGLang Backend#

Note: DeepSeek R1 multi-node deployment (LLM NIM version 1.7.0) requires Helm chart version 1.7.0

This example custom-values.yaml file deploys the DeepSeek-R1 model using LeaderWorkerSets.

```yaml
image:
  repository: [ container ]
  tag: [ container tag ]
model:
  logLevel: INFO
  name: deepseek-ai/DeepSeek-R1
  ngcAPISecret: ngc-api
  jsonLogging: false
env:
  - name: NIM_MODEL_PROFILE
    value: [ profile id ]
  - name: NIM_USE_SGLANG
    value: "1"
  - name: NIM_MULTI_NODE
    value: "1"
  - name: NIM_TENSOR_PARALLEL_SIZE
    value: '8'
  - name: NIM_PIPELINE_PARALLEL_SIZE
    value: '2'
  - name: NGC_HOME
    value: /model-store/ngc/hub
  - name: HF_HOME
    value: /model-store/huggingface/hub
  - name: NUMBA_CACHE_DIR
    value: /tmp/numba
  - name: OUTLINES_CACHE_DIR
    value: /tmp/outlines
  - name: UCX_TLS
    value: ib,tcp,shm
  - name: UCC_TLS
    value: ucp
  - name: UCC_CONFIG_FILE
    value: " "
  - name: GLOO_SOCKET_IFNAME
    value: ens6f0
  - name: NCCL_SOCKET_IFNAME
    value: ens6f0
  - name: NIM_TRUST_CUSTOM_CODE
    value: "1"
  - name: NIM_NODE_RANK
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
imagePullSecrets:
  - name: ngc-secret
persistence:
  enabled: true
  existingClaim: mypvc
  # size: 1000Gi
  # accessMode: ReadWriteMany
  # storageClass: nfs-client
  # annotations:
  #   helm.sh/resource-policy: "keep"
resources:
  limits:
    nvidia.com/gpu: 8
# customCommand: [ "/opt/nim/start_server.sh" ]
# customCommand: [ "sh", "-c", "while true; do sleep 300; done" ]
multiNode:
  enabled: true
  workers: 2
  gpusPerNode: 8
  # workerCustomCommand: [ "/opt/nim/start_server.sh" ]
  # workerCustomCommand: [ "sh", "-c", "while true; do sleep 300; done" ]
readinessProbe: {}
livenessProbe: {}
startupProbe: {}
```

Note: For the multi-LLM compatible NIM container, set HF_TOKEN and set NIM_MODEL_NAME to a Hugging Face URI, in addition to the rest of the environment variables in the example above.
For example, add the following environment variables to run inference with the Hugging Face model hf://meta-llama/Meta-Llama-3-8B:

```yaml
env:
  - name: NIM_MODEL_NAME
    value: hf://meta-llama/Meta-Llama-3-8B
  - name: HF_TOKEN
    value: x
```

How do you obtain the right values for GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME?

```shell
apt update
apt install iproute2
ip addr
```

Select the interface that is flagged <BROADCAST,UP,LOWER_UP> and has an IP address.

Why are GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME needed?

GLOO_SOCKET_IFNAME specifies the network interface responsible for coordinating processes across nodes. An incorrect interface can prevent nodes from discovering each other and setting up the inference job. NCCL_SOCKET_IFNAME determines the network path for inter-GPU communication across nodes.

If you do not have any further configuration needs, you can run the commands to launch the NIM in Kubernetes and run inference. Otherwise, refer to Deploying with Helm to learn about additional configuration options that are not limited to multi-node models (such as storage and telemetry).
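
The interface-selection rule described earlier in this section (flags BROADCAST, UP, and LOWER_UP, plus an IP address) can be applied to saved `ip addr` output with a short script. This is a sketch under assumptions: it handles only the usual `ip addr` text layout and only IPv4 (`inet`) addresses. The function name `candidate_interfaces` is hypothetical, and the `SAMPLE` text is illustrative rather than captured from a real node.

```python
import re

# "2: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> ..." -> name and flag list
HEADER = re.compile(r"^\d+:\s+([^:@\s]+)(?:@[^:\s]+)?:\s+<([^>]*)>")
# "    inet 172.17.0.5/16 brd ..." -> IPv4 address (inet6 lines do not match)
INET = re.compile(r"^\s+inet\s+(\d+\.\d+\.\d+\.\d+)/\d+")

REQUIRED_FLAGS = {"BROADCAST", "UP", "LOWER_UP"}


def candidate_interfaces(ip_addr_output: str) -> dict:
    """Map interface name -> IPv4 addresses for interfaces matching the rule."""
    flags_by_iface = {}
    addrs_by_iface = {}
    current = None
    for line in ip_addr_output.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            flags_by_iface[current] = set(header.group(2).split(","))
            addrs_by_iface[current] = []
            continue
        inet = INET.match(line)
        if inet and current is not None:
            addrs_by_iface[current].append(inet.group(1))
    return {
        name: addrs
        for name, addrs in addrs_by_iface.items()
        if REQUIRED_FLAGS <= flags_by_iface[name] and addrs
    }


# Illustrative input only; run `ip addr` yourself to get real data.
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
2: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:11:00:05 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.5/16 brd 172.17.255.255 scope global eth0
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
    link/ether 02:42:ac:11:00:06 brd ff:ff:ff:ff:ff:ff
"""

print(candidate_interfaces(SAMPLE))  # prints {'eth0': ['172.17.0.5']}
```

The loopback interface is dropped because it lacks the BROADCAST flag, and `eth1` is dropped because it is neither UP nor addressed. If more than one interface survives the filter, choosing between them is still a judgment call about your network layout.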

Troubleshooting Multi-Node Deployments#

The interfaces named in GLOO_SOCKET_IFNAME and NCCL_SOCKET_IFNAME must match interfaces that exist inside the container. If the deployment fails with GLOO- or NCCL-related interface errors, use the following script to identify the correct interface and its associated IP address for broadcasting.

```python
import json

import psutil


def get_network_interfaces():
    """Collect the status and addresses of every network interface."""
    interfaces = psutil.net_if_addrs()
    stats = psutil.net_if_stats()
    interface_info = {}

    for iface, addrs in interfaces.items():
        # An interface without a stats entry is reported as DOWN.
        is_up = iface in stats and stats[iface].isup
        iface_data = {
            "status": "UP" if is_up else "DOWN",
            "addresses": [],
        }
        for addr in addrs:
            addr_info = {
                "family": str(addr.family),
                "address": addr.address,
                "netmask": addr.netmask,
                "broadcast": addr.broadcast,
            }
            iface_data["addresses"].append(addr_info)
        interface_info[iface] = iface_data

    return interface_info


print(json.dumps(get_network_interfaces(), indent=4))
```

Example output, in which eth0 is the interface to use:

```json
{
    "lo": {
        "status": "UP",
        "addresses": [
            {
                "family": "2",
                "address": "127.0.0.1",
                "netmask": "255.0.0.0",
                "broadcast": null
            },
            {
                "family": "10",
                "address": "::1",
                "netmask": "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff",
                "broadcast": null
            },
            {
                "family": "17",
                "address": "00:00:00:00:00:00",
                "netmask": null,
                "broadcast": null
            }
        ]
    },
    "eth0": {
        "status": "UP",
        "addresses": [
            {
                "family": "2",
                "address": "[ip v4 address]",
                "netmask": "255.255.0.0",
                "broadcast": "172.17.255.255"
            },
            {
                "family": "17",
                "address": "02:42:ac:11:00:05",
                "netmask": null,
                "broadcast": "ff:ff:ff:ff:ff:ff"
            }
        ]
    }
}
```
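
Once the script's JSON report is available, picking an interface and turning it into the two environment entries could look roughly like this. It is a heuristic sketch, not an NVIDIA-provided tool. It treats an interface as usable when it is UP and has a non-loopback IPv4 address with a broadcast address, which is one reading of "correct interface ... for broadcasting". The function names and the sample report (including the address 172.17.0.5) are hypothetical.

```python
import ipaddress


def usable_interfaces(report: dict) -> list:
    """Return names of UP interfaces with a broadcast-capable IPv4 address."""
    names = []
    for name, data in report.items():
        if data.get("status") != "UP":
            continue
        for entry in data.get("addresses", []):
            try:
                ip = ipaddress.ip_address(str(entry.get("address", "")))
            except ValueError:
                continue  # MAC addresses and placeholders are not IP addresses
            if ip.version == 4 and not ip.is_loopback and entry.get("broadcast"):
                names.append(name)
                break
    return names


def socket_ifname_env(iface: str) -> list:
    """Build env entries that point both variables at the same interface."""
    return [
        {"name": "GLOO_SOCKET_IFNAME", "value": iface},
        {"name": "NCCL_SOCKET_IFNAME", "value": iface},
    ]


# Hypothetical report in the same shape as the script output.
report = {
    "lo": {"status": "UP", "addresses": [
        {"family": "2", "address": "127.0.0.1", "netmask": "255.0.0.0", "broadcast": None},
    ]},
    "eth0": {"status": "UP", "addresses": [
        {"family": "2", "address": "172.17.0.5", "netmask": "255.255.0.0",
         "broadcast": "172.17.255.255"},
        {"family": "17", "address": "02:42:ac:11:00:05", "netmask": None,
         "broadcast": "ff:ff:ff:ff:ff:ff"},
    ]},
}

names = usable_interfaces(report)
print(names)  # prints ['eth0']
print(socket_ifname_env(names[0]))
```

The check parses the address itself instead of relying on the "family" string, because the text form of the family value can differ between Python versions.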