RDMA / InfiniBand on AKS

This guide covers setting up RDMA over InfiniBand on AKS for high-performance disaggregated inference with Dynamo. RDMA enables direct memory access between GPUs across nodes, bypassing CPU and kernel overhead — critical for low-latency KV cache transfer between prefill and decode workers.

Without RDMA, disaggregated inference falls back to TCP with severe performance degradation (~98s TTFT vs ~200-500ms with RDMA). See the Disaggregated Communication Guide for details on transport options and performance expectations.

The Network Operator and NicClusterPolicy steps in this guide are based on the Azure AKS RDMA InfiniBand repository. That project is open-source and not covered by Microsoft Azure support — file issues on the GitHub repository.

Prerequisites

AKS cluster with RDMA-capable nodes:

  • At least 2 GPU nodes to enable cross-node RDMA communication
  • ND-series VMs with Mellanox ConnectX InfiniBand NICs (e.g., Standard_ND96asr_v4, Standard_ND96isr_H100_v5)
  • Ubuntu OS on the node pool (required for NVIDIA driver compatibility)
  • GPU driver installation skipped on the node pool (--skip-gpu-driver-install) — see GPU Node Pool Setup and the example below
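
For reference, a node pool matching these prerequisites can be created along these lines. The resource group, cluster, and pool names are placeholders, and --skip-gpu-driver-install may require the aks-preview CLI extension:

$az aks nodepool add \
> --resource-group <RESOURCE_GROUP> \
> --cluster-name <CLUSTER_NAME> \
> --name ndh100pool \
> --node-count 2 \
> --node-vm-size Standard_ND96isr_H100_v5 \
> --os-sku Ubuntu \
> --skip-gpu-driver-install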

Register the AKS InfiniBand feature to ensure nodes land on the same physical InfiniBand network:

$az feature register --namespace Microsoft.ContainerService --name AKSInfinibandSupport
$az feature show --namespace Microsoft.ContainerService --name AKSInfinibandSupport --query "properties.state"
$# Wait until "Registered"
$
$az provider register --namespace Microsoft.ContainerService

Overview

The RDMA setup involves five components installed in this order:

  1. Network Operator — Deploys the Mellanox OFED driver and Node Feature Discovery
  2. NicClusterPolicy — Configures the OFED driver on InfiniBand-capable nodes
  3. IB Node Configuration — Loads InfiniBand kernel modules and sets memlock limits
  4. RDMA Shared Device Plugin — Exposes InfiniBand NICs to pods as a Kubernetes resource
  5. GPU Operator — Installed with RDMA-specific settings (NFD disabled, GPUDirect RDMA enabled, host MOFED)

Step 1: Install the NVIDIA Network Operator

The NVIDIA Network Operator automates deployment of networking components including Mellanox OFED drivers for InfiniBand support.

Create the namespace and label it for privileged workloads:

$kubectl create ns network-operator
$kubectl label --overwrite ns network-operator pod-security.kubernetes.io/enforce=privileged

Add the NVIDIA Helm repo (if not already added):

$helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$helm repo update

Create a network-operator-values.yaml:

nfd:
  deployNodeFeatureRules: false

Install the Network Operator:

$helm install network-operator nvidia/network-operator \
> --namespace network-operator \
> -f network-operator-values.yaml \
> --version v26.1.0

Verify the Network Operator pod is running:

$kubectl get pods -n network-operator

Step 2: Apply the NicClusterPolicy

The NicClusterPolicy configures the OFED driver (Mellanox OFED / DOCA driver) as a DaemonSet on all InfiniBand-capable nodes.

Apply the base NicClusterPolicy using kustomize:

$kubectl apply -k https://github.com/Azure/aks-rdma-infiniband/configs/nicclusterpolicy/base

This targets nodes with Mellanox NICs (feature.node.kubernetes.io/pci-15b3.present) and installs the DOCA/OFED driver as a DaemonSet.
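
To confirm that Node Feature Discovery has labeled your GPU nodes as Mellanox-capable before waiting on the driver rollout, you can filter nodes by the label mentioned above:

$kubectl get nodes -l feature.node.kubernetes.io/pci-15b3.present=true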

Wait for the MOFED driver DaemonSet to finish installing on all nodes (this may take several minutes):

$kubectl get pods -n network-operator -l app=mofed-ubuntu22.04-ds -w
$# Wait until all pods show Running

Step 3: Deploy the IB Node Configuration DaemonSet

This DaemonSet loads InfiniBand kernel modules and sets unlimited memlock limits on GPU nodes. This is required for RDMA to function — without it, InfiniBand device files may not exist and memory pinning for RDMA transfers will fail.

This step is not covered in the Azure RDMA repo but is required for a working setup. The DaemonSet loads ib_umad and rdma_ucm kernel modules, sets unlimited memlock limits for containerd and kubelet, and restarts both services to apply the changes.

Create ib-node-config.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ib-node-config
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: ib-node-config
  template:
    metadata:
      labels:
        app: ib-node-config
    spec:
      hostPID: true
      nodeSelector:
        kubernetes.azure.com/agentpool: <GPU_NODE_POOL_NAME>
      tolerations:
        - operator: Exists
      initContainers:
        - name: ib-setup
          image: busybox:1.36
          securityContext:
            privileged: true
          command:
            - sh
            - -c
            - |
              echo "=== IB Node Configuration ==="

              nsenter -t 1 -m -u -i -n -- modprobe ib_umad
              nsenter -t 1 -m -u -i -n -- modprobe rdma_ucm 2>/dev/null || true
              nsenter -t 1 -m -u -i -n -- modprobe ib_ucm 2>/dev/null || true
              nsenter -t 1 -m -u -i -n -- lsmod | grep ib_umad && echo "OK: ib_umad" || echo "FAIL: ib_umad"
              nsenter -t 1 -m -u -i -n -- ls /dev/infiniband/rdma_cm && echo "OK: rdma_cm device" || echo "WARN: no rdma_cm device"

              nsenter -t 1 -m -u -i -n -- sh -c 'printf "ib_umad\nrdma_ucm\n" > /etc/modules-load.d/ib-umad.conf'
              nsenter -t 1 -m -u -i -n -- sh -c 'printf "* - memlock unlimited\nroot - memlock unlimited\n" > /etc/security/limits.d/99-ib-memlock.conf'

              nsenter -t 1 -m -u -i -n -- sh -c 'mkdir -p /etc/systemd/system/containerd.service.d && printf "[Service]\nLimitMEMLOCK=infinity\n" > /etc/systemd/system/containerd.service.d/memlock.conf'
              nsenter -t 1 -m -u -i -n -- sh -c 'mkdir -p /etc/systemd/system/kubelet.service.d && printf "[Service]\nLimitMEMLOCK=infinity\n" > /etc/systemd/system/kubelet.service.d/memlock.conf'

              nsenter -t 1 -m -u -i -n -- systemctl daemon-reload
              nsenter -t 1 -m -u -i -n -- systemctl restart containerd
              nsenter -t 1 -m -u -i -n -- systemctl restart kubelet

              sleep 10

              nsenter -t 1 -m -u -i -n -- systemctl is-active containerd && echo "OK: containerd active" || echo "FAIL: containerd"
              nsenter -t 1 -m -u -i -n -- systemctl is-active kubelet && echo "OK: kubelet active" || echo "FAIL: kubelet"

              echo "=== Setup Complete ==="
      containers:
        - name: keepalive
          image: busybox:1.36
          command: ["sh", "-c", "echo IB node config active; sleep infinity"]

Replace <GPU_NODE_POOL_NAME> with your GPU node pool name (e.g., ndh100pool).

$kubectl apply -f ib-node-config.yaml

Wait for all pods to complete initialization:

$kubectl get pods -n kube-system -l app=ib-node-config -w

What this does:

  • ib_umad — InfiniBand user-space management datagram module, required for RDMA device access
  • rdma_ucm — RDMA user-space connection manager
  • Memlock limits — RDMA requires pinning memory pages; without unlimited memlock, large transfers fail
  • Service restarts — containerd and kubelet must be restarted to pick up the new memlock limits
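
To spot-check these changes on a node directly, one option is a node debug session. This is a sketch: the node name is a placeholder, and the paths assume the standard node-debug host mount at /host (/proc/modules is kernel-global, so it reflects the host kernel even inside the container):

$kubectl debug node/<gpu-node-name> -it --image=busybox:1.36 -- sh
$# Inside the debug pod:
$grep -E 'ib_umad|rdma_ucm' /proc/modules
$cat /host/etc/security/limits.d/99-ib-memlock.conf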

Step 4: Deploy the RDMA Shared Device Plugin

The RDMA Shared Device Plugin exposes InfiniBand NICs as a Kubernetes extended resource so pods can request RDMA access.

Create rdma-configmap.yaml with the device plugin configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
data:
  config.json: |
    {
      "periodicUpdateInterval": 300,
      "configList": [{
        "resourceName": "hca_shared_devices_a",
        "rdmaHcaMax": 1000,
        "selectors": {
          "vendors": ["15b3"],
          "drivers": ["mlx5_core"]
        }
      }]
    }

Create the DaemonSet as rdma-shared-dp-ds.yaml:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rdma-shared-dp-ds
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: rdma-shared-dp-ds
  template:
    metadata:
      labels:
        name: rdma-shared-dp-ds
    spec:
      hostNetwork: true
      nodeSelector:
        kubernetes.azure.com/agentpool: <GPU_NODE_POOL_NAME>
      tolerations:
        - operator: Exists
      containers:
        - name: k8s-rdma-shared-dp-ds
          image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:v1.5.3
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: plugins-registry
              mountPath: /var/lib/kubelet/plugins_registry
            - name: config
              mountPath: /k8s-rdma-shared-dev-plugin
            - name: devs
              mountPath: /dev/
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: plugins-registry
          hostPath:
            path: /var/lib/kubelet/plugins_registry
        - name: config
          configMap:
            name: rdma-devices
        - name: devs
          hostPath:
            path: /dev/

Replace <GPU_NODE_POOL_NAME> with your GPU node pool name (e.g., ndh100pool).

$kubectl apply -f rdma-configmap.yaml
$kubectl apply -f rdma-shared-dp-ds.yaml

Wait for the device plugin pods to start:

$kubectl get pods -n kube-system -l name=rdma-shared-dp-ds -w

Step 5: Install the GPU Operator (RDMA-Enabled)

Install the GPU Operator with RDMA-specific values:

$helm install gpu-operator nvidia/gpu-operator \
> --namespace gpu-operator --create-namespace \
> --set nfd.enabled=false \
> --set driver.rdma.enabled=true \
> --set driver.rdma.useHostMofed=true

Key differences from a standard GPU Operator install:

  • nfd.enabled=false — Network Operator already deploys Node Feature Discovery; running two NFD instances causes conflicts
  • driver.rdma.enabled=true — enables GPUDirect RDMA support; causes the driver daemonset to build and load nvidia_peermem
  • driver.rdma.useHostMofed=true — tells the GPU Operator to use the MOFED driver installed by the Network Operator (Step 1) rather than its own; required when the Network Operator manages OFED
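
If you prefer a values file over --set flags, the equivalent settings are (the file name here is arbitrary):

nfd:
  enabled: false
driver:
  rdma:
    enabled: true
    useHostMofed: true

$helm upgrade --install gpu-operator nvidia/gpu-operator \
> --namespace gpu-operator --create-namespace \
> -f gpu-operator-values.yaml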

Wait for the GPU Operator pods to reach Running state:

$kubectl get pods -n gpu-operator -w

Verification

1. Check that MOFED driver pods are running on all InfiniBand nodes:

$kubectl get pods -n network-operator -l app=mofed-ubuntu22.04-ds

2. Check that IB node config pods completed initialization:

$kubectl get pods -n kube-system -l app=ib-node-config

3. Check that the RDMA Shared Device Plugin is running:

$kubectl get pods -n kube-system -l name=rdma-shared-dp-ds

4. Verify RDMA resources are available on GPU nodes:

$kubectl get nodes -o json | jq '.items[] | select(.status.allocatable["rdma/hca_shared_devices_a"] != null) | {name: .metadata.name, rdma: .status.allocatable["rdma/hca_shared_devices_a"], gpu: .status.allocatable["nvidia.com/gpu"]}'

Each InfiniBand-capable node should report rdma/hca_shared_devices_a in its allocatable resources (Kubernetes displays the quantity 1000 from rdmaHcaMax as 1k).

5. Check GPU Operator pods are healthy:

$kubectl get pods -n gpu-operator
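
6. Optionally, run an end-to-end smoke test by scheduling a throwaway pod that requests the RDMA resource and lists InfiniBand devices. This is a sketch: the image is a placeholder and must contain the ibverbs CLI tools (ibv_devices, ibv_devinfo):

apiVersion: v1
kind: Pod
metadata:
  name: rdma-check
spec:
  restartPolicy: Never
  nodeSelector:
    kubernetes.azure.com/agentpool: <GPU_NODE_POOL_NAME>
  tolerations:
    - operator: Exists
  containers:
    - name: rdma-check
      # Placeholder: substitute any image that ships the ibverbs utilities
      image: <IMAGE_WITH_IBVERBS_UTILS>
      command: ["sh", "-c", "ibv_devices && ibv_devinfo"]
      resources:
        limits:
          rdma/hca_shared_devices_a: 1

$kubectl apply -f rdma-check.yaml
$kubectl logs rdma-check
$# Expect mlx5_* devices, with ibv_devinfo reporting state PORT_ACTIVE
$kubectl delete pod rdma-check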

Pod Resource Requests

Dynamo pods that need RDMA access should request the rdma/hca_shared_devices_a resource. When using the Dynamo operator with DGDR, this is handled automatically for disaggregated deployments on RDMA-capable clusters.

For manual DGD specs, add the resource request to your container:

1resources:
2 limits:
3 nvidia.com/gpu: 8
4 rdma/hca_shared_devices_a: 1

The IPC_LOCK capability is not required with this setup. IPC_LOCK is historically needed for RDMA because ibv_reg_mr pins memory pages, which the kernel accounts against the memlock rlimit — the capability is only needed when that limit would otherwise block the pinning. The ib-node-config DaemonSet (Step 3) sets LimitMEMLOCK=infinity on the kubelet and containerd systemd units, so all pods on GPU nodes inherit an unlimited memlock limit and RDMA memory pinning works without any capability in the pod spec.

If you see ENOMEM errors from ibv_reg_mr and ib-node-config is running, verify that containerd and kubelet were restarted after the limits were applied (check the init container logs). If ib-node-config is not deployed, add IPC_LOCK to your pod’s securityContext.capabilities.add.
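
If you need that fallback, it is a small addition to the container spec:

securityContext:
  capabilities:
    add: ["IPC_LOCK"]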

Troubleshooting

MOFED pods stuck in Init or CrashLoopBackOff:

  • Verify nodes are Ubuntu OS: kubectl get nodes -o custom-columns="NAME:.metadata.name,OS:.status.nodeInfo.osImage"
  • Check MOFED pod logs: kubectl logs -n network-operator <mofed-pod> -c mofed-container

rdma/hca_shared_devices_a not showing on nodes:

  • Check the RDMA device plugin pods are running: kubectl get pods -n kube-system -l name=rdma-shared-dp-ds
  • Check device plugin logs: kubectl logs -n kube-system <rdma-shared-dp-pod>
  • Verify the rdma-devices ConfigMap exists: kubectl get configmap rdma-devices -n kube-system

IB kernel modules not loading:

  • Check the ib-node-config init container logs: kubectl logs -n kube-system <ib-node-config-pod> -c ib-setup
  • Verify the MOFED driver is installed first (Step 2 must complete before Step 3)

Memlock errors during RDMA transfers (ENOMEM from ibv_reg_mr):

  • Verify the ib-node-config DaemonSet has run on all GPU nodes and init containers completed
  • Check that containerd and kubelet were restarted: kubectl logs -n kube-system <ib-node-config-pod> -c ib-setup
  • Confirm the limits took effect on the kubelet process:
    $# On a GPU node (via kubectl debug or ssh)
    $cat /proc/$(pgrep -x kubelet)/limits | grep -i memlock
    $# Should show: Max locked memory unlimited unlimited
  • If limits are not unlimited, the ib-node-config DaemonSet needs to be re-applied and services restarted

GPUDirect RDMA not working — nvidia_peermem module missing:

ND-series nodes (including ND H100 v5) do not ship nvidia_peermem in the host OS. This module is required for InfiniBand adapters to directly read/write GPU memory — without it, RDMA transfers fall back to staging through host memory.

Verify whether the module is loaded:

$# Check on a GPU node via a privileged pod or node shell
$lsmod | grep nvidia_peermem
$# If empty, the module is not loaded
$
$modinfo nvidia_peermem
$# If "Module not found", it is also not present in the host's /lib/modules

With the GPU Operator managing drivers (driver.rdma.enabled=true), nvidia_peermem is built and loaded by the nvidia-driver-daemonset — it lives in the driver pod’s /lib/modules, not the host’s native kernel modules. Verify the driver daemonset is loading it:

$kubectl exec -n gpu-operator $(kubectl get pod -n gpu-operator -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}') -- lsmod | grep nvidia_peermem

If this returns empty, ensure driver.rdma.enabled=true and driver.rdma.useHostMofed=true are set in your GPU Operator Helm values (see Step 5 above), then restart the driver daemonset:

$kubectl rollout restart daemonset/nvidia-driver-daemonset -n gpu-operator

The nvidia-peermem-reloader DaemonSet from the Azure RDMA repo is designed for clusters using AKS-managed GPU drivers (without the GPU Operator). It simply runs modprobe nvidia-peermem — which will fail on ND H100 v5 nodes because the host OS doesn’t include the module. When using the GPU Operator (recommended), the operator handles nvidia_peermem automatically via driver.rdma.enabled=true.

See Also