RDMA / InfiniBand on AKS
This guide covers setting up RDMA over InfiniBand on AKS for high-performance disaggregated inference with Dynamo. RDMA enables direct memory access between GPUs across nodes, bypassing CPU and kernel overhead — critical for low-latency KV cache transfer between prefill and decode workers.
Without RDMA, disaggregated inference falls back to TCP with severe performance degradation (~98s TTFT vs ~200-500ms with RDMA). See the Disaggregated Communication Guide for details on transport options and performance expectations.
The Network Operator and NicClusterPolicy steps in this guide are based on the Azure AKS RDMA InfiniBand repository. That project is open-source and not covered by Microsoft Azure support — file issues on the GitHub repository.
Prerequisites
AKS cluster with RDMA-capable nodes:
- At least 2 GPU nodes to enable cross-node RDMA communication
- ND-series VMs with Mellanox ConnectX InfiniBand NICs (e.g., Standard_ND96asr_v4, Standard_ND96isr_H100_v5)
- Ubuntu OS on the node pool (required for NVIDIA driver compatibility)
- GPU driver installation skipped on the node pool (--skip-gpu-driver-install) — see GPU Node Pool Setup
Register the AKS InfiniBand feature to ensure nodes land on the same physical InfiniBand network:
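For example, with the Azure CLI (the AKSInfinibandSupport flag name is per the Microsoft Learn docs linked below; registration can take several minutes to propagate):

```bash
az feature register --namespace Microsoft.ContainerService --name AKSInfinibandSupport
# Once the feature shows "Registered", refresh the provider
az provider register --namespace Microsoft.ContainerService
```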
Overview
The RDMA setup involves five components installed in this order:
- Network Operator — Deploys the Mellanox OFED driver and Node Feature Discovery
- NicClusterPolicy — Configures the OFED driver on InfiniBand-capable nodes
- IB Node Configuration — Loads InfiniBand kernel modules and sets memlock limits
- RDMA Shared Device Plugin — Exposes InfiniBand NICs to pods as a Kubernetes resource
- GPU Operator — Installed with RDMA-specific settings (NFD disabled, GPUDirect RDMA enabled, host MOFED)
Step 1: Install the NVIDIA Network Operator
The NVIDIA Network Operator automates deployment of networking components including Mellanox OFED drivers for InfiniBand support.
Create the namespace and label it for privileged workloads:
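For example (the namespace name matches the commands used later in this guide; the Pod Security Admission label lets the privileged driver pods be admitted):

```bash
kubectl create namespace network-operator
kubectl label namespace network-operator pod-security.kubernetes.io/enforce=privileged
```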
Add the NVIDIA Helm repo (if not already added):
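```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```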
Create a network-operator-values.yaml:
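A minimal starting point; value names vary by chart version, so treat this as a sketch:

```yaml
nfd:
  enabled: true    # the Network Operator manages Node Feature Discovery
deployCR: false    # do not auto-create a NicClusterPolicy; Step 2 applies one manually
```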
Install the Network Operator:
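For example (the release name is illustrative):

```bash
helm install network-operator nvidia/network-operator \
  --namespace network-operator \
  --values network-operator-values.yaml
```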
Verify the Network Operator pod is running:
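```bash
kubectl get pods -n network-operator
```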
Step 2: Apply the NicClusterPolicy
The NicClusterPolicy configures the OFED driver (Mellanox OFED / DOCA driver) as a DaemonSet on all InfiniBand-capable nodes.
Apply the base NicClusterPolicy using kustomize:
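For example, from a checkout of the Azure AKS RDMA InfiniBand repository (see the link under "See Also"; the overlay path is a placeholder — check the repository for its current layout):

```bash
kubectl apply -k <path-to-nicclusterpolicy-overlay>
```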
This targets nodes with Mellanox NICs (feature.node.kubernetes.io/pci-15b3.present) and installs the DOCA/OFED driver as a DaemonSet.
Wait for the MOFED driver DaemonSet to finish installing on all nodes (this may take several minutes):
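```bash
# The OFED driver pod name varies by node OS image (e.g., mofed-ubuntu22.04-ds-*)
kubectl get pods -n network-operator -w
```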
Step 3: Deploy the IB Node Configuration DaemonSet
This DaemonSet loads InfiniBand kernel modules and sets unlimited memlock limits on GPU nodes. This is required for RDMA to function — without it, InfiniBand device files may not exist and memory pinning for RDMA transfers will fail.
This step is not covered in the Azure RDMA repo but is required for a working setup. The DaemonSet loads ib_umad and rdma_ucm kernel modules, sets unlimited memlock limits for containerd and kubelet, and restarts both services to apply the changes.
Create ib-node-config.yaml:
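A sketch of what the manifest looks like, assuming the kube-system namespace and the ib-setup init container name used by the verification commands later in this guide (the images, pod label, and drop-in file names are illustrative):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ib-node-config
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: ib-node-config
  template:
    metadata:
      labels:
        name: ib-node-config
    spec:
      nodeSelector:
        agentpool: <GPU_NODE_POOL_NAME>   # AKS labels nodes with their pool name
      hostPID: true
      initContainers:
      - name: ib-setup
        image: ubuntu:22.04   # any image with a shell works; chroot into the host does the rest
        securityContext:
          privileged: true
        command: ["/bin/sh", "-c"]
        args:
        - |
          # Load the InfiniBand user-space access modules on the host
          chroot /host modprobe ib_umad
          chroot /host modprobe rdma_ucm
          # Raise the memlock limit for containerd and kubelet via systemd drop-ins
          mkdir -p /host/etc/systemd/system/containerd.service.d
          mkdir -p /host/etc/systemd/system/kubelet.service.d
          printf '[Service]\nLimitMEMLOCK=infinity\n' > /host/etc/systemd/system/containerd.service.d/10-memlock.conf
          printf '[Service]\nLimitMEMLOCK=infinity\n' > /host/etc/systemd/system/kubelet.service.d/10-memlock.conf
          # Restart both services so the new limits apply to all future pods
          chroot /host systemctl daemon-reload
          chroot /host systemctl restart containerd kubelet
        volumeMounts:
        - name: host-root
          mountPath: /host
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # keeps the pod Running after init completes
      volumes:
      - name: host-root
        hostPath:
          path: /
```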
Replace <GPU_NODE_POOL_NAME> with your GPU node pool name (e.g., ndh100pool). Apply the manifest, then wait for all pods to complete initialization:
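```bash
kubectl apply -f ib-node-config.yaml
# Label as defined in the manifest above
kubectl get pods -n kube-system -l name=ib-node-config -w
```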
What this does:
- ib_umad — InfiniBand user-space management datagram module, required for RDMA device access
- rdma_ucm — RDMA user-space connection manager
- Memlock limits — RDMA requires pinning memory pages; without unlimited memlock, large transfers fail
- Service restarts — containerd and kubelet must be restarted to pick up the new memlock limits
Step 4: Deploy the RDMA Shared Device Plugin
The RDMA Shared Device Plugin exposes InfiniBand NICs as a Kubernetes extended resource so pods can request RDMA access.
Create the ConfigMap with the device plugin configuration:
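A sketch using the resource name and rdmaHcaMax cited in the verification section below; the config format follows the upstream k8s-rdma-shared-dev-plugin, and the selector can be tightened with deviceIDs or ifNames for your NICs:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
      "configList": [
        {
          "resourceName": "hca_shared_devices_a",
          "rdmaHcaMax": 1000,
          "selectors": {
            "vendors": ["15b3"]
          }
        }
      ]
    }
```

The vendor ID 15b3 is Mellanox, matching the feature.node.kubernetes.io/pci-15b3.present label used by the NicClusterPolicy in Step 2.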
Create the DaemonSet:
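A sketch of the device plugin DaemonSet; the image reference is illustrative (pin a released tag of the upstream Mellanox k8s-rdma-shared-dev-plugin), and the pod label matches the commands used below:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rdma-shared-dp-ds
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: rdma-shared-dp-ds
  template:
    metadata:
      labels:
        name: rdma-shared-dp-ds
    spec:
      nodeSelector:
        agentpool: <GPU_NODE_POOL_NAME>
      hostNetwork: true
      containers:
      - name: rdma-shared-dp
        image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:latest   # pin a release tag in production
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: config
          mountPath: /k8s-rdma-shared-dev-plugin   # plugin reads config.json from here
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: config
        configMap:
          name: rdma-devices
```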
Replace <GPU_NODE_POOL_NAME> with your GPU node pool name (e.g., ndh100pool). Apply the ConfigMap and DaemonSet, then wait for the device plugin pods to start:
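```bash
kubectl get pods -n kube-system -l name=rdma-shared-dp-ds -w
```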
Step 5: Install the GPU Operator (RDMA-Enabled)
Install the GPU Operator with RDMA-specific values:
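For example (the release name and gpu-operator namespace are illustrative; the three RDMA-specific values are explained below):

```bash
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set nfd.enabled=false \
  --set driver.rdma.enabled=true \
  --set driver.rdma.useHostMofed=true
```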
Key differences from a standard GPU Operator install:
- nfd.enabled=false — the Network Operator already deploys Node Feature Discovery; running two NFD instances causes conflicts
- driver.rdma.enabled=true — enables GPUDirect RDMA support; causes the driver daemonset to build and load nvidia_peermem
- driver.rdma.useHostMofed=true — tells the GPU Operator to use the MOFED driver installed by the Network Operator (Step 1) rather than its own; required when the Network Operator manages OFED
Wait for the GPU Operator pods to reach Running state:
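```bash
# Namespace as used in the install command above
kubectl get pods -n gpu-operator -w
```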
Verification
1. Check that MOFED driver pods are running on all InfiniBand nodes:
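```bash
kubectl get pods -n network-operator -o wide | grep mofed
```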
2. Check that IB node config pods completed initialization:
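```bash
kubectl get pods -n kube-system -l name=ib-node-config
```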
3. Check that the RDMA Shared Device Plugin is running:
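```bash
kubectl get pods -n kube-system -l name=rdma-shared-dp-ds
```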
4. Verify RDMA resources are available on GPU nodes:
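```bash
kubectl get nodes -o custom-columns="NAME:.metadata.name,RDMA:.status.allocatable.rdma/hca_shared_devices_a"
```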
Each InfiniBand-capable node should report rdma/hca_shared_devices_a resources (typically 1k based on rdmaHcaMax: 1000).
5. Check GPU Operator pods are healthy:
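```bash
kubectl get pods -n gpu-operator
```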
Pod Resource Requests
Dynamo pods that need RDMA access should request the rdma/hca_shared_devices_a resource. When using the Dynamo operator with DGDR, this is handled automatically for disaggregated deployments on RDMA-capable clusters.
For manual DGD specs, add the resource request to your container:
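A minimal sketch (the GPU count is illustrative; extended resources are set under limits):

```yaml
resources:
  limits:
    nvidia.com/gpu: "1"                # illustrative
    rdma/hca_shared_devices_a: "1"     # grants the pod access to the shared HCA
```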
IPC_LOCK capability is not required when this setup is followed. IPC_LOCK is historically needed for RDMA because ibv_reg_mr calls mlock() to pin memory pages — but mlock() only needs the capability if the memlock rlimit would otherwise block it. The ib-node-config DaemonSet (Step 3) sets LimitMEMLOCK=infinity on the kubelet and containerd systemd units, so all pods on GPU nodes inherit an unlimited memlock limit and RDMA memory pinning works without any capability in the pod spec.
If you see ENOMEM errors from ibv_reg_mr and ib-node-config is running, verify that containerd and kubelet were restarted after the limits were applied (check the init container logs). If ib-node-config is not deployed, add IPC_LOCK to your pod’s securityContext.capabilities.add.
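For that fallback case, a minimal sketch of the capability addition:

```yaml
securityContext:
  capabilities:
    add: ["IPC_LOCK"]   # only needed when memlock limits are not unlimited
```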
Troubleshooting
MOFED pods stuck in Init or CrashLoopBackOff:
- Verify nodes are Ubuntu OS: kubectl get nodes -o custom-columns="NAME:.metadata.name,OS:.status.nodeInfo.osImage"
- Check MOFED pod logs: kubectl logs -n network-operator <mofed-pod> -c mofed-container
rdma/hca_shared_devices_a not showing on nodes:
- Check the RDMA device plugin pods are running: kubectl get pods -n kube-system -l name=rdma-shared-dp-ds
- Check device plugin logs: kubectl logs -n kube-system <rdma-shared-dp-pod>
- Verify the rdma-devices ConfigMap exists: kubectl get configmap rdma-devices -n kube-system
IB kernel modules not loading:
- Check the ib-node-config init container logs: kubectl logs -n kube-system <ib-node-config-pod> -c ib-setup
- Verify the MOFED driver is installed first (Step 2 must complete before Step 3)
Memlock errors during RDMA transfers (ENOMEM from ibv_reg_mr):
- Verify the ib-node-config DaemonSet has run on all GPU nodes and init containers completed
- Check that containerd and kubelet were restarted: kubectl logs -n kube-system <ib-node-config-pod> -c ib-setup
- Confirm the limits took effect on the kubelet process (see the check below)
- If limits are not unlimited, the ib-node-config DaemonSet needs to be re-applied and services restarted
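One way to run that check from a node shell (kubectl debug node/<node> shares the host PID namespace, so pgrep can find the host's kubelet):

```bash
grep "Max locked memory" /proc/$(pgrep -o kubelet)/limits
# Expected: Max locked memory    unlimited    unlimited    bytes
```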
GPUDirect RDMA not working — nvidia_peermem module missing:
ND-series nodes (including ND H100 v5) do not ship nvidia_peermem in the host OS. This module is required for InfiniBand adapters to directly read/write GPU memory — without it, RDMA transfers fall back to staging through host memory.
Verify whether the module is loaded:
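```bash
# Kernel modules are global, so this works from any shell or pod on the node
lsmod | grep nvidia_peermem
```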
With the GPU Operator managing drivers (driver.rdma.enabled=true), nvidia_peermem is built and loaded by the nvidia-driver-daemonset — it lives in the driver pod’s /lib/modules, not the host’s native kernel modules. Verify the driver daemonset is loading it:
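For example (app=nvidia-driver-daemonset is the GPU Operator's default pod label; namespace as installed in Step 5):

```bash
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset | grep -i peermem
```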
If this returns empty, ensure driver.rdma.enabled=true and driver.rdma.useHostMofed=true are set in your GPU Operator Helm values (see Step 5 above), then restart the driver daemonset:
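```bash
# Namespace as installed in Step 5
kubectl rollout restart daemonset/nvidia-driver-daemonset -n gpu-operator
```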
The nvidia-peermem-reloader DaemonSet from the Azure RDMA repo is designed for clusters using AKS-managed GPU drivers (without the GPU Operator). It simply runs modprobe nvidia-peermem — which will fail on ND H100 v5 nodes because the host OS doesn’t include the module. When using the GPU Operator (recommended), the operator handles nvidia_peermem automatically via driver.rdma.enabled=true.
See Also
- Azure AKS RDMA InfiniBand — GitHub
- Set up InfiniBand on Azure HPC VMs — Microsoft Learn
- Enable InfiniBand VM extension — Microsoft Learn
- NVIDIA Network Operator Documentation
- Disaggregated Communication Guide — transport options, UCX configuration, performance expectations