Multi-Node NIM Deployment#

About Multi-Node NIM Deployment#

Multi-Node NIM Deployment addresses the problem of deploying massive LLMs whose weights cannot fit on a single GPU, or even on a single node, for example when only older GPUs with less memory are available. This feature distributes model weights and computation across several interconnected nodes.

NIM Operator supports deploying multi-node NIM on Kubernetes using LeaderWorkerSets (LWS), as supported by the NIM Helm chart.

Diagram 1. Multi-node architecture

For more information on multi-node deployments, refer to Multi-Node Deployment for NVIDIA NIM for LLMs in the NVIDIA NIM for LLMs documentation.

Note

Upgrade and ingress of multi-node NIM are supported. However, auto-scaling, DRA, and Multi-LLM NIM are not yet supported for multi-node configurations.

Example Procedure#

Summary#

To deploy a multi-node enabled NIM Service, follow these steps:

  1. Complete the prerequisites.

  2. Deploy a multi-node enabled NIM Service.

    1. Create a cache for the NIM.

    2. Create a NIM Service to consume the GPU-NIC Remote Direct Memory Access (RDMA) resource and InfiniBand (IB) network on the multi-node NIM.

  3. Display LWS and NIM Service statuses.

1. Complete the Prerequisites#

Pre-Checks#

  • Validate that the NIM can be used with multi-node deployment and that your cluster meets the GPU hardware requirements.

    Find the number of GPUs required for your desired model in Supported Models for NVIDIA NIM for LLMs: Optimized Models. If no single node has at least that many GPUs, you must use multi-node deployment. For example, a model that requires 16 GPUs on nodes with 8 GPUs each needs a two-node deployment, which corresponds to the pipeline parallelism of 2 and tensor parallelism of 8 used in the sample NIMService later in this section.

  • Ensure that the Container Storage Interface (CSI) driver supports ReadWriteMany (RWX) Persistent Volume Claims (PVCs) and that the storage capacity is sufficient for your multi-node deployment.

    Refer to the Kubernetes CSI Drivers documentation to check whether the CSI driver is persistent and supports the Read/Write Multiple Pods access mode.

    The volume must be at least 700 GB with VolumeAccessMode set to ReadWriteMany. A minimal verification sketch is shown after this list.

  • Ensure that the nodes participating in the multi-node deployment can reach each other over the network.
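
A minimal sketch for verifying the storage prerequisites is shown below. The commands list the installed CSI drivers and storage classes, and the PVC manifest is a throwaway claim used only to confirm that your class binds in ReadWriteMany mode. The class name rwx-storage-class is a placeholder, the actual cache volume must still meet the 700 GB requirement, and classes that use WaitForFirstConsumer binding stay Pending until a pod mounts the claim.

$ kubectl get csidrivers
$ kubectl get storageclass

# test-rwx-pvc.yaml -- throwaway claim; delete it after it binds
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rwx-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: rwx-storage-class   # placeholder; use your RWX-capable class
  resources:
    requests:
      storage: 10Gi

$ kubectl apply -f test-rwx-pvc.yaml
$ kubectl get pvc test-rwx-pvc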

Setup#

Perform the following steps:

  1. Configure multi-node pod network connectivity with RDMA.

    If you are using IPoIB with Mellanox NICs, you can use the following instructions:

    1. Uninstall the inbox OFED driver in favor of the NVIDIA MOFED/DOCA driver.

      Check the status of the IB interfaces to determine whether your system has the mlx5_core kernel module loaded.

      $ ip link show | grep "link/infiniband" -B 1
      
      Example output
      174: ibp24s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:2a brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      175: ibp41s0f0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 1000
          link/infiniband 00:00:09:0e:fe:80:00:00:00:00:00:00:9c:63:c0:03:00:a1:47:de brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      176: ibp41s0f1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 1000
          link/infiniband 00:00:09:42:fe:80:00:00:00:00:00:00:9c:63:c0:03:00:a1:47:df brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      177: ibp64s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:32 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      178: ibp79s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:2e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      179: ibp94s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:26 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      180: ibp154s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:5a brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      181: ibp170s0f0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 1000
          link/infiniband 00:00:08:37:fe:80:00:00:00:00:00:00:9c:63:c0:03:00:a1:46:de brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      182: ibp170s0f1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 1000
          link/infiniband 00:00:07:f6:fe:80:00:00:00:00:00:00:9c:63:c0:03:00:a1:46:df brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      183: ibp192s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:62 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      184: ibp206s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:5e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      185: ibp220s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:56 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      

      If the statuses of the IB devices are DOWN, manually bring them up by using the ip link set command:

      $ ip link set <ib-if-name> up
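      # Optional sketch: bring up every InfiniBand interface in one pass (interface names vary by system)
      $ for dev in $(ip -o link show | awk -F': ' '/link\/infiniband/ {print $2}'); do sudo ip link set "$dev" up; done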
      

      Check whether the inbox OFED driver is installed. The presence of the /usr/sbin/ofed_uninstall.sh script indicates that the inbox OFED driver is installed on the host; the script is part of the ofed-scripts RPM that is installed as part of the inbox OFED installation. If the script is present, execute it to uninstall the inbox OFED driver.

      For more detailed information, refer to the Installation Results page in the MLNX_OFED documentation, which describes the installed binary locations after a successful installation. Also refer to the MLNX_OFED uninstallation page.
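
      A minimal check-and-uninstall sketch, assuming the script path described above:

      $ if [ -x /usr/sbin/ofed_uninstall.sh ]; then sudo /usr/sbin/ofed_uninstall.sh; fi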

    2. Install NVIDIA Network Operator.

      Note

      IPoIB is preferred for multi-node setups. If the network interface is shared instead, the multi-node NIM workers can time out during weight transfer, which can cause the NIM pods to restart repeatedly and the NIMService CR to never become ready.

      To set up IPoIB Network, use the following procedure:

      1. Verify your system has Mellanox NICs installed.

        $ lspci -v | grep Mellanox
        
        Example output
        86:00.0 Network controller [0207]: Mellanox Technologies MT27620 Family
                Subsystem: Mellanox Technologies Device 0014
        86:00.1 Network controller [0207]: Mellanox Technologies MT27620 Family
                Subsystem: Mellanox Technologies Device 0014
        
      2. Deploy NVIDIA Network Operator with NFD enabled.

        NVIDIA Network Operator simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster, providing high-speed network connectivity, RDMA, and GPUDirect for workloads. The Network Operator works in conjunction with the GPU Operator to enable GPUDirect RDMA on compatible systems.

        Install the NVIDIA Network Operator using the official Helm chart with the following commands:

        1. Add repo:

          $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
            && helm repo update
          
        2. Install operator:

          $ helm install --wait network-operator \
              -n nvidia-network-operator --create-namespace \
              nvidia/network-operator
          
        3. View deployed resources:

          $ kubectl -n nvidia-network-operator get pods
          
    3. Define NICClusterPolicy.

      After the Network Operator is installed, create a NICClusterPolicy CR to install the DOCA driver (part of MLNX_OFED), the RDMA device plugin, and a secondary network with Multus, the IPoIB CNI, and the IPAM CNI plugin.

      Note

      If the MOFED/DOCA driver is already pre-installed on the host, remove the ofedDriver section from the CR sample below. This is typically required on NVIDIA DGX systems, where NVIDIA BaseOS has the MOFED drivers pre-installed.

      Configure the rdmaSharedDevicePlugin section with the IB interface on your system.

      Example NICClusterPolicy custom resource:

      apiVersion: mellanox.com/v1alpha1
      kind: NicClusterPolicy
      metadata:
        name: nic-cluster-policy
      spec:
        ofedDriver:
          image: doca-driver
          repository: nvcr.io/nvidia/mellanox
          version: 25.04-0.6.1.0-2
          forcePrecompiled: false
          imagePullSecrets: []
          terminationGracePeriodSeconds: 300
          startupProbe:
            initialDelaySeconds: 10
            periodSeconds: 20
          livenessProbe:
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            initialDelaySeconds: 10
            periodSeconds: 30
          upgradePolicy:
            autoUpgrade: true
            maxParallelUpgrades: 1
            safeLoad: false
            drain:
              enable: true
              force: true
              podSelector: ""
              timeoutSeconds: 300
              deleteEmptyDir: true
        rdmaSharedDevicePlugin:
          # [map[ifNames:[ibs1f0] name:rdma_shared_device_a]]
          image: k8s-rdma-shared-dev-plugin
          repository: ghcr.io/mellanox
          version: v1.5.3
          imagePullSecrets: []
          # The config below directly propagates to k8s-rdma-shared-device-plugin configuration.
          # Replace 'devices' with your (RDMA capable) netdevice name.
          config: |
            {
              "configList": [
                {
                  "resourceName": "rdma_shared_device_a",
                  "rdmaHcaMax": 63,
                  "selectors": {
                    "vendors": [],
                    "deviceIDs": [],
                    "drivers": [],
                    "ifNames": ["ibs1f0"],
                    "linkTypes": []
                  }
                }
              ]
            }
        secondaryNetwork:
          cniPlugins:
            image: plugins
            repository: ghcr.io/k8snetworkplumbingwg
            version: v1.6.2-update.1
            imagePullSecrets: []
          multus:
            image: multus-cni
            repository: ghcr.io/k8snetworkplumbingwg
            version: v4.1.0
            imagePullSecrets: []
          ipoib:
            image: ipoib-cni
            repository: ghcr.io/mellanox
            version: 428715a57c0b633e48ec7620f6e3af6863149ccf
          ipamPlugin:
            image: whereabouts
            repository: ghcr.io/k8snetworkplumbingwg
            version: v0.7.0
            imagePullSecrets: []
      

      Once the Network Operator has reconciled the NICClusterPolicy CR, the rdma-shared-dp-ds, kube-multus-ds, kube-ipoib-cni-ds, and whereabouts daemon sets should be deployed in the Network Operator namespace.
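
      For example, you can list the daemon sets in the Network Operator namespace to confirm they were created:

      $ kubectl -n nvidia-network-operator get daemonsets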

      Use the following command to view the RDMA resource defined by the NICClusterPolicy:

      $ kubectl get nodes -o json | jq '.items[0].status.capacity'
      
      Example output
      {
        "cpu": "224",
        "ephemeral-storage": "1845230620Ki",
        "hugepages-1Gi": "0",
        "hugepages-2Mi": "0",
        "memory": "2113470820Ki",
        "nvidia.com/gpu": "8",
        "pods": "110",
        "rdma/rdma_shared_device_a": "63"
      }
      
    4. Define an IP over IB Multus Network.

      Use the following sample CR to define an IP over IB network:

      apiVersion: mellanox.com/v1alpha1
      kind: IPoIBNetwork
      metadata:
        name: example-ipoibnetwork
      spec:
        networkNamespace: "default"
        master: "ibs1f0"
        ipam: |
          {
            "type": "whereabouts",
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "range": "192.168.5.225/28",
            "exclude": [
            "192.168.6.229/30",
            "192.168.6.236/32"
            ],
            "log_file" : "/var/log/whereabouts.log",
            "log_level" : "info",
            "gateway": "192.168.6.1"
          }
      
      • Configure .spec.networkNamespace to the namespace of the Kubernetes resources that will use the Multus network. This is also the namespace in which the NetworkAttachmentDefinition is created.

      • Configure .spec.master to the IB interface on the host node that the Multus network should use as its master.

      • Ensure the IP CIDR under .spec.ipam does not conflict with existing networks (for example, the default CNI podCidr and other networks).

      Once the above CR is created, a Multus NetworkAttachmentDefinition is created for consumption in the namespace specified in .spec.networkNamespace.

      $ kubectl get network-attachment-definition -n nim-service
      
      Example output
      NAME                    AGE
      example-ipoibnetwork    24h
      
      $ kubectl describe network-attachment-definition -n nim-service
      
      Example output
      Name:         example-ipoibnetwork
      Namespace:    nim-service
      Labels:       nvidia.network-operator.state=state-IPoIB-Network
      Annotations:  nvidia.network-operator.revision: 2162029353
      API Version:  k8s.cni.cncf.io/v1
      Kind:         NetworkAttachmentDefinition
      Metadata:
        Creation Timestamp:  2025-07-17T17:18:15Z
        Generation:          1
        Owner References:
          API Version:           mellanox.com/v1alpha1
          Block Owner Deletion:  true
          Controller:            true
          Kind:                  IPoIBNetwork
          Name:                  example-ipoibnetwork
          UID:                   38a01091-896d-47d5-82cc-37689d14b9c8
        Resource Version:        459806
        UID:                     fe4644fd-85dd-408c-bc46-04d2c532af4f
      Spec:
        Config:  { "cniVersion":"0.3.1", "name":"example-ipoibnetwork", "type":"ipoib", "master": "ibp24s0", "ipam":{"type":"whereabouts","datastore":"kubernetes","kubernetes":{"kubeconfig":"/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"},"range":"192.168.5.225/28","exclude":["192.168.6.229/30","192.168.6.236/32"],"log_file":"/var/log/whereabouts.log","log_level":"info","gateway":"192.168.6.1"} }
      Events:    <none>
      
  2. Install LeaderWorkerSet (LWS).

Note

LWS version 0.6.2 or later must be installed.

To install the LWS controller, use the following commands:

$ VERSION=v0.6.2
$ helm install lws https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/lws-chart-$VERSION.tgz \
   --namespace lws-system \
   --create-namespace \
   --wait --timeout 300s

For more information, refer to Installing LWS to a Kubernetes Cluster.
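
To verify that the LWS controller is running before you deploy a multi-node NIM Service, you can check the controller pods and the LeaderWorkerSet CRD (a quick sanity check; pod names vary, and the CRD name shown is the upstream default):

$ kubectl get pods -n lws-system
$ kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io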

2. Deploy a Multi-Node Enabled NIM Service#

  1. Create a cache for the NIM.

    1. Create a file, such as nimcache.yaml, with contents like the following sample manifest:

      # NIM Cache with LLM-Specific NIM from NGC
      apiVersion: apps.nvidia.com/v1alpha1
      kind: NIMCache
      metadata:
        name: deepseek-r1-nimcache
        namespace: nim-service
      spec:
        source:
          ngc:
            modelPuller: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            pullSecret: ngc-secret
            authSecret: ngc-api-secret
            model:
        storage:
          pvc:
            create: true
            storageClass: '' # set to the storage class that supports RWX volumes
            size: "100Gi"
            volumeAccessMode: ReadWriteMany
      
    2. Apply the manifest:

      $ kubectl apply -n nim-service -f nimcache.yaml
      

    Note

    The NIM Cache job might fail for various reasons, such as network issues. The NIM Operator automatically retries the model download up to 5 times, so you might see some cache jobs in error states.
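
    To check the cache status and any retry jobs, you can run the following (a sketch; output columns can vary by NIM Operator version):

      $ kubectl get nimcache -n nim-service
      $ kubectl get jobs -n nim-service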

  2. Create a NIM Service to consume the GPU-NIC Remote Direct Memory Access (RDMA) resource and InfiniBand (IB) network on the multi-node NIM.

    1. Create a file, such as nimservice.yaml, with contents like the following sample manifest:

      # NIM Service with multi-node deployment enabled using RDMA
      apiVersion: apps.nvidia.com/v1alpha1
      kind: NIMService
      metadata:
        name: deepseek-r1
        namespace: nim-service
      spec:
        annotations:
          k8s.v1.cni.cncf.io/networks: example-ipoibnetwork   # configured by NVIDIA Network Operator
        env:
        - name: NCCL_DEBUG
          value: "WARN"
        - name: "NCCL_NET"
          value: "Socket"
        - name: NIM_USE_SGLANG
          value: "1"
        - name: HF_HOME
          value: /model-store/huggingface/hub
        - name: NUMBA_CACHE_DIR
          value: /tmp/numba
        - name: UCX_TLS
          value: ib,tcp,shm
        - name: UCC_TLS
          value: ucp
        - name: UCC_CONFIG_FILE
          value: " "
        - name: GLOO_SOCKET_IFNAME
          value: eth0
        - name: NCCL_SOCKET_IFNAME
          value: net1
        - name: NIM_TRUST_CUSTOM_CODE
          value: "1"
        readinessProbe:
          probe:
            failureThreshold: 3
            httpGet:
              path: "/v1/health/ready"
              port: "api"
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
        startupProbe:
          probe:
            failureThreshold: 100
            httpGet:
              path: "/v1/health/ready"
              port: "api"
            initialDelaySeconds: 900
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
        image:
          repository: nvcr.io/nim/deepseek-ai/deepseek-r1
          tag: "1.7.3"
          pullPolicy: IfNotPresent
          pullSecrets:
            - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: deepseek-r1-nimcache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 8
            rdma/rdma_shared_device_a: 1    # configured through NICClusterPolicy using NVIDIA Network Operator
          requests:
            nvidia.com/gpu: 8
            rdma/rdma_shared_device_a: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
        multiNode:
          parallelism:
            pipeline: 2
            tensor: 8
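            # Assumed topology: pipeline (2) x tensor (8) = 16 GPUs in total, spread across 2 nodes with 8 GPUs each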
          mpi:
            mpiStartTimeout: 6000
      
    2. Apply the manifest:

      $ kubectl create -f nimservice.yaml -n nim-service
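      # Optional sanity check (a sketch): the NIMService exposes a ClusterIP service on port 8000, so you can
      # port-forward it and query the health endpoint used by the probes above. The service name typically
      # matches the NIMService name; confirm it with the get svc output.
      $ kubectl get svc -n nim-service
      $ kubectl -n nim-service port-forward svc/deepseek-r1 8000:8000 &
      $ curl -s http://localhost:8000/v1/health/ready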
      

3. Display LWS and NIM Service Statuses#

  • To view the status of LWS resources, use the following command:

    $ kubectl get lws
    
    Example output
    NAME                AGE
    deepseek-r1-lws     20d
    
  • To view all running LWS pods, use the following command:

    $ kubectl get pods -o wide
    
    Example output
    NAME                    READY   STATUS      RESTARTS    AGE     IP                  NODE                NOMINATED NODE      READINESS GATES
    deepseek-r1-lws-0       1/1     Running     0           30m     192.168.2.192       viking-prod-640     <none>              <none>
    deepseek-r1-lws-0-1     1/1     Running     0           30m     192.168.0.46        viking-prod-642     <none>              <none> 
    
  • To view the status of NIM Service, use the following command:

    $ kubectl get nimservice
    
    Example output
    NAME          STATUS   AGE
    deepseek-r1   Ready    20d
    

Troubleshooting#

This section describes common troubleshooting steps for identifying issues with multi-node configurations.

  • If the system comes with Intel IB devices, the irdma kernel module is likely loaded. This conflicts with the Mellanox OFED driver and causes the following failure when installing MOFED:

    Function: generate_ofed_modules_blacklist
    Unloading ib_uverbs                                    [FAILED]
    rmmod: ERROR: Module ib_uverbs is in use by: irdma
    [16-Jul-25_16:25:49] Command "/etc/init.d/openibd restart" failed with exit code: 1
    [16-Jul-25_16:25:49] Remove blacklisted mofed modules file from host
    

    To resolve the conflict, remove the irdma module and blacklist it:

    $ sudo rmmod irdma
    $ echo "blacklist irdma" | sudo tee /etc/modprobe.d/blacklist-irdma.conf
    

    Then reboot the node.

  • Ensure the LeaderWorkerSet controller is deployed on the cluster before the NIM Operator. If the LeaderWorkerSet controller is deployed after the NIM Operator, restart the NIM Operator.

  • Ensure that no existing GPU workloads are running on the nodes before deploying GPU Operator.

  • Ensure the NFS volume is accessible and has enough space, and its CSI driver is correctly set up.

  • Multi-node NIM without RoCE can experience frequent restarts of the LWS leader and worker pods due to model shard loading timeouts. NCCL over an InfiniBand connection is highly recommended.

  • If you are using InfiniBand, verify that the IB interfaces are UP and that the RDMA shared device resource is advertised on the nodes, as shown in the sketch after this list.

  • If the NIM Cache job failed while downloading the model, a new cache job is created.

    The NIM Cache job might fail for various reasons, such as network issues. The NIM Operator automatically retries the model download up to 5 times, so you might see some cache jobs in error states.
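
The following sketch shows the kind of node-level InfiniBand checks referenced above; ibstat is provided by the MOFED/DOCA tools, and the resource name matches the NICClusterPolicy example in this section:

$ ibstat | grep -E "State|Rate"
$ kubectl get nodes -o json | jq '.items[].status.allocatable' | grep rdma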

Logs and Messages to Gather#

  • NIM container logs, including PyTorch, SGLang, and NCCL output, can be viewed from the LWS leader pod, for example:

    $ kubectl logs deepseek-r1-lws-0
    
  • Kubernetes deployment logs can be viewed from the NIM Operator pod using the following command:

    $ kubectl logs -l app.kubernetes.io/instance=nim-operator
    
  • Additional information can be found by describing the NIM Service using the following command:

    $ kubectl describe nimservice
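
  • To inspect the LeaderWorkerSet and recent events for scheduling or networking problems, the following commands can also help (the LWS name matches the earlier status examples):

    $ kubectl describe lws deepseek-r1-lws
    $ kubectl get events -n nim-service --sort-by=.lastTimestamp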