Multi-Node NIM Deployment#

About Multi-Node NIM Deployment#

Multi-Node NIM Deployment addresses the problem of deploying massive LLMs whose weights cannot fit on a single GPU, or even on a single node, for example when only older GPUs with less memory are available. This feature distributes model weights and computation across several interconnected nodes.

NIM Operator supports deploying multi-node NIM on Kubernetes using LeaderWorkerSets (LWS), as supported by the NIM Helm chart.

Diagram 1. Multi-node architecture

For more information on multi-node deployments, refer to Multi-Node Deployment for NVIDIA NIM for LLMs in the NVIDIA NIM for LLMs documentation.

Note

Upgrade and ingress of multi-node NIM are supported. However, auto-scaling, DRA, and Multi-LLM NIM are not yet supported for multi-node configurations.

Example Procedure#

Summary#

To deploy a multi-node enabled NIM Service, follow these steps:

  1. Complete the prerequisites.

  2. Deploy a multi-node enabled NIM Service.

    1. Create a cache for the NIM.

    2. Create a NIM Service to consume the GPU-NIC Remote Direct Memory Access (RDMA) resource and InfiniBand (IB) network on the multi-node NIM.

  3. Display LWS and NIM Service statuses.

1. Complete the Prerequisites#

Pre-Checks#

  • Validate that the NIM can be used with multi-node deployment and that your cluster meets the GPU hardware requirements.

    Find the number of GPUs required for your desired model in Supported Models for NVIDIA NIM for LLMs: Optimized Models. If no single node has at least that many GPUs, you must use multi-node deployment. For example, a model that requires 16 GPUs on nodes with 8 GPUs each needs a two-node deployment, which corresponds to the pipeline parallelism of 2 and tensor parallelism of 8 used in the sample NIMService later in this section.

  • Ensure that the Container Storage Interface (CSI) driver supports ReadWriteMany (RWX) Persistent Volume Claims (PVCs) and that the storage capacity is sufficient for your multi-node deployment.

    Refer to the Kubernetes CSI Drivers documentation to check whether the CSI driver is persistent and supports the Read/Write Multiple Pods access mode.

    The volume must be at least 700 GB with VolumeAccessMode set to ReadWriteMany. A minimal verification sketch is shown after this list.

  • Ensure that the nodes participating in the multi-node deployment can reach each other over the network.
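
A minimal sketch for verifying the storage prerequisites is shown below. The commands list the installed CSI drivers and storage classes, and the PVC manifest is a throwaway claim used only to confirm that your class binds in ReadWriteMany mode. The class name rwx-storage-class is a placeholder, the actual cache volume must still meet the 700 GB requirement, and classes that use WaitForFirstConsumer binding stay Pending until a pod mounts the claim.

$ kubectl get csidrivers
$ kubectl get storageclass

# test-rwx-pvc.yaml -- throwaway claim; delete it after it binds
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rwx-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: rwx-storage-class   # placeholder; use your RWX-capable class
  resources:
    requests:
      storage: 10Gi

$ kubectl apply -f test-rwx-pvc.yaml
$ kubectl get pvc test-rwx-pvc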

Setup#

Perform the following steps:

  1. Configure multi-node pod network connectivity with RDMA.

    If you are using IPoIB with Mellanox NICs, you can use the following instructions:

    1. Uninstall the inbox OFED driver in favor of the NVIDIA MOFED/DOCA driver.

      Check the status of the IB interfaces to determine whether your system has the mlx5_core kernel module loaded.

      $ ip link show | grep "link/infiniband" -B 1
      
      Example output
      174: ibp24s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:2a brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      175: ibp41s0f0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 1000
          link/infiniband 00:00:09:0e:fe:80:00:00:00:00:00:00:9c:63:c0:03:00:a1:47:de brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      176: ibp41s0f1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 1000
          link/infiniband 00:00:09:42:fe:80:00:00:00:00:00:00:9c:63:c0:03:00:a1:47:df brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      177: ibp64s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:32 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      178: ibp79s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:2e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      179: ibp94s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:26 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      180: ibp154s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:5a brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      181: ibp170s0f0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 1000
          link/infiniband 00:00:08:37:fe:80:00:00:00:00:00:00:9c:63:c0:03:00:a1:46:de brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      182: ibp170s0f1: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN mode DEFAULT group default qlen 1000
          link/infiniband 00:00:07:f6:fe:80:00:00:00:00:00:00:9c:63:c0:03:00:a1:46:df brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      183: ibp192s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:62 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      184: ibp206s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:5e brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      185: ibp220s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
          link/infiniband 00:00:10:47:fe:80:00:00:00:00:00:00:5c:25:73:03:00:48:34:56 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
      

      If the statuses of the IB devices are DOWN, manually bring them up by using the ip link set command:

      $ ip link set <ib-if-name> up
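      # Optional sketch: bring up every InfiniBand interface in one pass (interface names vary by system)
      $ for dev in $(ip -o link show | awk -F': ' '/link\/infiniband/ {print $2}'); do sudo ip link set "$dev" up; done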
      

      Check whether the inbox OFED driver is installed. The presence of the /usr/sbin/ofed_uninstall.sh script indicates that the inbox OFED driver is installed on the host; the script is part of the ofed-scripts RPM that is installed as part of the inbox OFED installation. If the script is present, execute it to uninstall the inbox OFED driver.

      For more detailed information, refer to the Installation Results page in the MLNX_OFED documentation, which describes the installed binary locations after a successful installation. Also refer to the MLNX_OFED uninstallation page.
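
      A minimal check-and-uninstall sketch, assuming the script path described above:

      $ if [ -x /usr/sbin/ofed_uninstall.sh ]; then sudo /usr/sbin/ofed_uninstall.sh; fi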

    2. Install NVIDIA Network Operator.

      Note

      IPoIB is preferred for multi-node setups. If the network interface is shared instead, the multi-node NIM workers can time out during weight transfer, which can cause the NIM pods to restart repeatedly and the NIMService CR to never become ready.

      To set up IPoIB Network, use the following procedure:

      1. Verify your system has Mellanox NICs installed.

        $ lspci -v | grep Mellanox
        
        Example output
        86:00.0 Network controller [0207]: Mellanox Technologies MT27620 Family
                Subsystem: Mellanox Technologies Device 0014
        86:00.1 Network controller [0207]: Mellanox Technologies MT27620 Family
                Subsystem: Mellanox Technologies Device 0014
        
      2. Deploy NVIDIA Network Operator with NFD enabled.

        NVIDIA Network Operator simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster, providing high-speed network connectivity, RDMA, and GPUDirect for workloads. The Network Operator works in conjunction with the GPU Operator to enable GPUDirect RDMA on compatible systems.

        Install the NVIDIA Network Operator using the official Helm chart with the following commands:

        1. Add repo:

          $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
            && helm repo update
          
        2. Install operator:

          $ helm install --wait network-operator \
              -n nvidia-network-operator --create-namespace \
              nvidia/network-operator
          
        3. View deployed resources:

          $ kubectl -n nvidia-network-operator get pods
          
    3. Define NICClusterPolicy.

      After the Network Operator is installed, create a NICClusterPolicy CR to install the DOCA driver (part of MLNX_OFED), the RDMA device plugin, and a secondary network with Multus, the IPoIB CNI, and the IPAM CNI plugin.

      Note

      If the MOFED/DOCA driver is already pre-installed on the host, remove the ofedDriver section from the CR sample below. This is typically required on NVIDIA DGX systems, where NVIDIA BaseOS has the MOFED drivers pre-installed.

      Configure the rdmaSharedDevicePlugin section with the IB interface on your system.

      Example NICClusterPolicy custom resource:

      apiVersion: mellanox.com/v1alpha1
      kind: NicClusterPolicy
      metadata:
        name: nic-cluster-policy
      spec:
        ofedDriver:
          image: doca-driver
          repository: nvcr.io/nvidia/mellanox
          version: 25.04-0.6.1.0-2
          forcePrecompiled: false
          imagePullSecrets: []
          terminationGracePeriodSeconds: 300
          startupProbe:
            initialDelaySeconds: 10
            periodSeconds: 20
          livenessProbe:
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            initialDelaySeconds: 10
            periodSeconds: 30
          upgradePolicy:
            autoUpgrade: true
            maxParallelUpgrades: 1
            safeLoad: false
            drain:
              enable: true
              force: true
              podSelector: ""
              timeoutSeconds: 300
              deleteEmptyDir: true
        rdmaSharedDevicePlugin:
          # [map[ifNames:[ibs1f0] name:rdma_shared_device_a]]
          image: k8s-rdma-shared-dev-plugin
          repository: ghcr.io/mellanox
          version: v1.5.3
          imagePullSecrets: []
          # The config below directly propagates to k8s-rdma-shared-device-plugin configuration.
          # Replace 'devices' with your (RDMA capable) netdevice name.
          config: |
            {
              "configList": [
                {
                  "resourceName": "rdma_shared_device_a",
                  "rdmaHcaMax": 63,
                  "selectors": {
                    "vendors": [],
                    "deviceIDs": [],
                    "drivers": [],
                    "ifNames": ["ibs1f0"],
                    "linkTypes": []
                  }
                }
              ]
            }
        secondaryNetwork:
          cniPlugins:
            image: plugins
            repository: ghcr.io/k8snetworkplumbingwg
            version: v1.6.2-update.1
            imagePullSecrets: []
          multus:
            image: multus-cni
            repository: ghcr.io/k8snetworkplumbingwg
            version: v4.1.0
            imagePullSecrets: []
          ipoib:
            image: ipoib-cni
            repository: ghcr.io/mellanox
            version: 428715a57c0b633e48ec7620f6e3af6863149ccf
          ipamPlugin:
            image: whereabouts
            repository: ghcr.io/k8snetworkplumbingwg
            version: v0.7.0
            imagePullSecrets: []
      

      Once the Network Operator has reconciled the NICClusterPolicy CR, the rdma-shared-dp-ds, kube-multus-ds, kube-ipoib-cni-ds, and whereabouts daemon sets should be deployed in the Network Operator namespace.
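
      For example, you can list the daemon sets in the Network Operator namespace to confirm they were created:

      $ kubectl -n nvidia-network-operator get daemonsets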

      Use the following command to view the RDMA resource defined by the NICClusterPolicy:

      $ kubectl get nodes -o json | jq '.items[0].status.capacity'
      
      Example output
      {
        "cpu": "224",
        "ephemeral-storage": "1845230620Ki",
        "hugepages-1Gi": "0",
        "hugepages-2Mi": "0",
        "memory": "2113470820Ki",
        "nvidia.com/gpu": "8",
        "pods": "110",
        "rdma/rdma_shared_device_a": "63"
      }
      
    4. Define an IP over IB Multus Network.

      Use the following sample CR to define an IP over IB network:

      apiVersion: mellanox.com/v1alpha1
      kind: IPoIBNetwork
      metadata:
        name: example-ipoibnetwork
      spec:
        networkNamespace: "default"
        master: "ibs1f0"
        ipam: |
          {
            "type": "whereabouts",
            "datastore": "kubernetes",
            "kubernetes": {
              "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
            },
            "range": "192.168.5.225/28",
            "exclude": [
            "192.168.6.229/30",
            "192.168.6.236/32"
            ],
            "log_file" : "/var/log/whereabouts.log",
            "log_level" : "info",
            "gateway": "192.168.6.1"
          }
      
      • Configure .spec.networkNamespace to the namespace of the Kubernetes resources that will use the Multus network. This is also the namespace in which the NetworkAttachmentDefinition is created.

      • Configure .spec.master to the IB interface on the host node that the Multus network should use as its master.

      • Ensure the IP CIDR under .spec.ipam does not conflict with existing networks (for example, the default CNI podCidr and other networks).

      Once the above CR is created, a Multus NetworkAttachmentDefinition is created for consumption in the namespace specified in .spec.networkNamespace.

      $ kubectl get network-attachment-definition -n nim-service
      
      Example output
      NAME                    AGE
      example-ipoibnetwork    24h
      
      $ kubectl describe network-attachment-definition -n nim-service
      
      Example output
      Name:         example-ipoibnetwork
      Namespace:    nim-service
      Labels:       nvidia.network-operator.state=state-IPoIB-Network
      Annotations:  nvidia.network-operator.revision: 2162029353
      API Version:  k8s.cni.cncf.io/v1
      Kind:         NetworkAttachmentDefinition
      Metadata:
        Creation Timestamp:  2025-07-17T17:18:15Z
        Generation:          1
        Owner References:
          API Version:           mellanox.com/v1alpha1
          Block Owner Deletion:  true
          Controller:            true
          Kind:                  IPoIBNetwork
          Name:                  example-ipoibnetwork
          UID:                   38a01091-896d-47d5-82cc-37689d14b9c8
        Resource Version:        459806
        UID:                     fe4644fd-85dd-408c-bc46-04d2c532af4f
      Spec:
        Config:  { "cniVersion":"0.3.1", "name":"example-ipoibnetwork", "type":"ipoib", "master": "ibp24s0", "ipam":{"type":"whereabouts","datastore":"kubernetes","kubernetes":{"kubeconfig":"/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"},"range":"192.168.5.225/28","exclude":["192.168.6.229/30","192.168.6.236/32"],"log_file":"/var/log/whereabouts.log","log_level":"info","gateway":"192.168.6.1"} }
      Events:    <none>
      
  2. Install LeaderWorkerSet (LWS).

Note

LWS version 0.6.2 or later must be installed.

To install the LWS controller, use the following commands:

$ VERSION=v0.6.2
$ helm install lws https://github.com/kubernetes-sigs/lws/releases/download/$VERSION/lws-chart-$VERSION.tgz \
   --namespace lws-system \
   --create-namespace \
   --wait --timeout 300s

For more information, refer to Installing LWS to a Kubernetes Cluster.
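
To verify that the LWS controller is running before you deploy a multi-node NIM Service, you can check the controller pods and the LeaderWorkerSet CRD (a quick sanity check; pod names vary, and the CRD name shown is the upstream default):

$ kubectl get pods -n lws-system
$ kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io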

2. Deploy a Multi-Node Enabled NIM Service#

  1. Create a cache for the NIM.

    1. Create a file, such as nimcache.yaml, with contents like the following sample manifest:

      # NIM Cache with LLM-Specific NIM from NGC
      apiVersion: apps.nvidia.com/v1alpha1
      kind: NIMCache
      metadata:
        name: deepseek-r1-nimcache
        namespace: nim-service
      spec:
        source:
          ngc:
            modelPuller: nvcr.io/nim/deepseek-ai/deepseek-r1:1.7.3
            pullSecret: ngc-secret
            authSecret: ngc-api-secret
            model:
        storage:
          pvc:
            create: true
            storageClass: '' # set to the storage class that supports RWX volumes
            size: "100Gi"
            volumeAccessMode: ReadWriteMany
      
    2. Apply the manifest:

      $ kubectl apply -n nim-service -f nimcache.yaml
      

    Note

    The NIM Cache job might fail for various reasons, such as network issues. The NIM Operator automatically retries the model download up to 5 times, so you might see some cache jobs in error states.
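
    To check the cache status and any retry jobs, you can run the following (a sketch; output columns can vary by NIM Operator version):

      $ kubectl get nimcache -n nim-service
      $ kubectl get jobs -n nim-service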

  2. Create a NIM Service to consume the GPU-NIC Remote Direct Memory Access (RDMA) resource and InfiniBand (IB) network on the multi-node NIM.

    1. Create a file, such as nimservice.yaml, with contents like the following sample manifest:

      # NIM Service with multi-node deployment enabled using RDMA
      apiVersion: apps.nvidia.com/v1alpha1
      kind: NIMService
      metadata:
        name: deepseek-r1
        namespace: nim-service
      spec:
        annotations:
          k8s.v1.cni.cncf.io/networks: example-ipoibnetwork   # configured by NVIDIA Network Operator
        env:
        - name: NCCL_DEBUG
          value: "WARN"
        - name: "NCCL_NET"
          value: "Socket"
        - name: NIM_USE_SGLANG
          value: "1"
        - name: HF_HOME
          value: /model-store/huggingface/hub
        - name: NUMBA_CACHE_DIR
          value: /tmp/numba
        - name: UCX_TLS
          value: ib,tcp,shm
        - name: UCC_TLS
          value: ucp
        - name: UCC_CONFIG_FILE
          value: " "
        - name: GLOO_SOCKET_IFNAME
          value: eth0
        - name: NCCL_SOCKET_IFNAME
          value: net1
        - name: NIM_TRUST_CUSTOM_CODE
          value: "1"
        readinessProbe:
          probe:
            failureThreshold: 3
            httpGet:
              path: "/v1/health/ready"
              port: "api"
            initialDelaySeconds: 15
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
        startupProbe:
          probe:
            failureThreshold: 100
            httpGet:
              path: "/v1/health/ready"
              port: "api"
            initialDelaySeconds: 900
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
        image:
          repository: nvcr.io/nim/deepseek-ai/deepseek-r1
          tag: "1.7.3"
          pullPolicy: IfNotPresent
          pullSecrets:
            - ngc-secret
        authSecret: ngc-api-secret
        storage:
          nimCache:
            name: deepseek-r1-nimcache
        replicas: 1
        resources:
          limits:
            nvidia.com/gpu: 8
            rdma/rdma_shared_device_a: 1    # configured through NICClusterPolicy using NVIDIA Network Operator
          requests:
            nvidia.com/gpu: 8
            rdma/rdma_shared_device_a: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
        multiNode:
          parallelism:
            pipeline: 2
            tensor: 8
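            # Assumed topology: pipeline (2) x tensor (8) = 16 GPUs in total, spread across 2 nodes with 8 GPUs each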
          mpi:
            mpiStartTimeout: 6000
      
    2. Apply the manifest:

      $ kubectl create -f nimservice.yaml -n nim-service
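      # Optional sanity check (a sketch): the NIMService exposes a ClusterIP service on port 8000, so you can
      # port-forward it and query the health endpoint used by the probes above. The service name typically
      # matches the NIMService name; confirm it with the get svc output.
      $ kubectl get svc -n nim-service
      $ kubectl -n nim-service port-forward svc/deepseek-r1 8000:8000 &
      $ curl -s http://localhost:8000/v1/health/ready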
      

3. Display LWS and NIM Service Statuses#

  • To view the status of LWS resources, use the following command:

    $ kubectl get lws
    
    Example output
    NAME                AGE
    deepseek-r1-lws     20d
    
  • To view all running LWS pods, use the following command:

    $ kubectl get pods -o wide
    
    Example output
    NAME                    READY   STATUS      RESTARTS    AGE     IP                  NODE                NOMINATED NODE      READINESS GATES
    deepseek-r1-lws-0       1/1     Running     0           30m     192.168.2.192       viking-prod-640     <none>              <none>
    deepseek-r1-lws-0-1     1/1     Running     0           30m     192.168.0.46        viking-prod-642     <none>              <none> 
    
  • To view the status of NIM Service, use the following command:

    $ kubectl get nimservice
    
    Example output
    NAME          STATUS   AGE
    deepseek-r1   Ready    20d
    

Troubleshooting#

This section describes common troubleshooting steps for identifying issues with multi-node configurations.

  • If the system comes with Intel IB devices, the irdma kernel module is likely loaded. This conflicts with the Mellanox OFED driver and causes the following failure when installing MOFED:

    Function: generate_ofed_modules_blacklist
    Unloading ib_uverbs                                    [FAILED]
    rmmod: ERROR: Module ib_uverbs is in use by: irdma
    [16-Jul-25_16:25:49] Command "/etc/init.d/openibd restart" failed with exit code: 1
    [16-Jul-25_16:25:49] Remove blacklisted mofed modules file from host
    

    To resolve the conflict, remove the irdma module and blacklist it:

    $ sudo rmmod irdma
    $ echo "blacklist irdma" | sudo tee /etc/modprobe.d/blacklist-irdma.conf
    

    Then reboot the node.

  • Ensure the LeaderWorkerSet controller is deployed on the cluster before the NIM Operator. If the LeaderWorkerSet controller is deployed after the NIM Operator, restart the NIM Operator.

  • Ensure that no existing GPU workloads are running on the nodes before deploying GPU Operator.

  • Ensure the NFS volume is accessible and has enough space, and its CSI driver is correctly set up.

  • Multi-node NIM without RoCE can experience frequent restarts of the LWS leader and worker pods due to model shard loading timeouts. NCCL over an InfiniBand connection is highly recommended.

  • If you are using InfiniBand, verify that the IB interfaces are UP and that the RDMA shared device resource is advertised on the nodes, as shown in the sketch after this list.

  • If the NIM Cache job failed while downloading the model, a new cache job is created.

    The NIM Cache job might fail for various reasons, such as network issues. The NIM Operator automatically retries the model download up to 5 times, so you might see some cache jobs in error states.
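
The following sketch shows the kind of node-level InfiniBand checks referenced above; ibstat is provided by the MOFED/DOCA tools, and the resource name matches the NICClusterPolicy example in this section:

$ ibstat | grep -E "State|Rate"
$ kubectl get nodes -o json | jq '.items[].status.allocatable' | grep rdma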

Logs and Messages to Gather#

  • NIM container logs, including PyTorch, SGLang, and NCCL output, can be viewed from the LWS leader pod, for example:

    $ kubectl logs deepseek-r1-lws-0
    
  • Kubernetes deployment logs can be viewed from the NIM Operator pod using the following command:

    $ kubectl logs -l app.kubernetes.io/instance=nim-operator
    
  • Additional information can be found by describing the NIM Service using the following command:

    $ kubectl describe nimservice
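
  • To inspect the LeaderWorkerSet and recent events for scheduling or networking problems, the following commands can also help (the LWS name matches the earlier status examples):

    $ kubectl describe lws deepseek-r1-lws
    $ kubectl get events -n nim-service --sort-by=.lastTimestamp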