Multinode Networking#
Multi-node GPU training requires high-bandwidth, low-latency east-west networking between nodes. Without it, NCCL falls back to standard Ethernet and training times increase significantly. The Helm chart installs Kyverno policies that automatically inject the correct networking resources for four cloud providers.
Prerequisites#
- Kyverno installed in the cluster
- Volcano scheduler installed (multi-node jobs are scheduled by Volcano)
- Nodes with high-performance networking hardware (EFA, InfiniBand, etc.)
- The cloud provider's device plugin or network operator deployed, so that the networking resources are visible to Kubernetes
Note
The NeMo Platform does not provision or manage the underlying cloud networking infrastructure. Each cloud provider section below lists the cluster-level prerequisites your administrator must configure. Refer to your cloud provider’s documentation for setup instructions.
How It Works#
When a job requests more than one node, the jobs controller annotates the pod with nmp.nvidia.com/enable-multi-node-networking: "true". Kyverno watches for this annotation and mutates the pod spec to inject the cloud-specific device requests, environment variables, and volume mounts that NCCL needs for high-performance inter-node communication.
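As an illustrative sketch of the trigger described above (the annotation key is from this page; the pod name, image, and GPU count are hypothetical), a multi-node pod before mutation might look like:

```yaml
# The jobs controller sets this annotation on pods of multi-node jobs.
# Kyverno matches on it and mutates the spec with the provider-specific
# device requests, environment variables, and volume mounts.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-worker-0          # hypothetical
  annotations:
    nmp.nvidia.com/enable-multi-node-networking: "true"
spec:
  containers:
    - name: trainer
      image: my-registry/trainer:latest   # hypothetical
      resources:
        limits:
          nvidia.com/gpu: 8
```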
Configuration#
Enable exactly one cloud provider in the multinodeNetworking section of your values.yaml. Enabling more than one causes a Helm install error.
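For example, a values.yaml enabling only AWS might look like the following (a sketch; it assumes the other providers default to `enabled: false` and only shows the toggle fields):

```yaml
multinodeNetworking:
  aws:
    enabled: true    # exactly one provider may be enabled
  azure:
    enabled: false
  gcp:
    enabled: false
  oci:
    enabled: false
```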
Important
Device-count parameters (efaDevicesPerGPU, rdmaDevicesPerGPU) must match the hardware ratio of your instance type. A mismatch causes jobs to either fail scheduling or silently run without high-speed networking.
AWS (EFA)#
Injects Elastic Fabric Adapter device requests (vpc.amazonaws.com/efa), mounts the EFA OFI library, and configures shared memory.
```yaml
multinodeNetworking:
  aws:
    enabled: true
    efaDevicesPerGPU: 4  # must match your instance type
```
| Parameter | Description |
|---|---|
| `efaDevicesPerGPU` | EFA interfaces to request per GPU. The Kyverno policy multiplies this value by the number of GPUs requested by each container to produce the total `vpc.amazonaws.com/efa` request. |
To determine the correct value, divide the number of EFA interfaces on your instance type by the number of GPUs. For example, p5.48xlarge has 32 EFA interfaces and 8 GPUs: 32 / 8 = 4. Consult the AWS EFA instance types documentation for your instance type’s EFA count.
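To make the multiplication concrete, here is a sketch of the resource request the policy would produce for the p5.48xlarge example above (illustrative only; the policy also injects other fields not shown here):

```yaml
# With efaDevicesPerGPU: 4 and a container requesting 8 GPUs,
# the policy injects 4 × 8 = 32 EFA devices.
resources:
  limits:
    nvidia.com/gpu: 8
    vpc.amazonaws.com/efa: 32
```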
Cluster prerequisites:
- EFA-enabled instance types (e.g., `p5.48xlarge`, `p5e.48xlarge`)
- EFA device plugin deployed
- Security group permitting all traffic between nodes in the placement group
Azure (InfiniBand / RDMA)#
Injects RDMA device requests, IPC_LOCK capability, NCCL topology, and UCX environment variables.
```yaml
multinodeNetworking:
  azure:
    enabled: true
    rdmaDevicesPerGPU: 1
    rdmaDeviceName: "hca_shared_devices_a"
```
| Parameter | Description |
|---|---|
| `rdmaDevicesPerGPU` | RDMA devices per GPU. Must match your VM size. |
| `rdmaDeviceName` | Resource name exposed by the RDMA device plugin (e.g., `hca_shared_devices_a`). |
Cluster prerequisites:
- InfiniBand-capable VM sizes (e.g., `Standard_ND96asr_v4`, `Standard_ND96amsr_A100_v4`)
- RDMA device plugin or Network Operator deployed
GCP (TCPXO)#
Injects a GPUDirect-TCPXO sidecar daemon, multi-NIC annotations, and NCCL Fastrak environment variables.
```yaml
multinodeNetworking:
  gcp:
    enabled: true
```
Cluster prerequisites:
- A3+ GPU node pools with multi-NIC networking enabled
- GKE multi-network support configured with the required `gpu-nic` networks
- See the GKE GPUDirect-family documentation, specifically the TCPXO sections
OCI (SR-IOV / RDMA)#
Injects Mellanox NIC device requests (nvidia.com/mlnxnics), SR-IOV network annotations via Multus, and InfiniBand-tuned NCCL environment variables.
```yaml
multinodeNetworking:
  oci:
    enabled: true
    rdmaDevicesPerGPU: 8
```
| Parameter | Description |
|---|---|
| `rdmaDevicesPerGPU` | Mellanox RDMA NICs per GPU. Must match your bare-metal shape. |
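As a sketch of the resulting request (illustrative only; it assumes the OCI policy scales the per-GPU count by the container's GPU request, as the AWS policy does, and omits the other injected fields):

```yaml
# With rdmaDevicesPerGPU: 8 and a container requesting 8 GPUs,
# 8 × 8 = 64 Mellanox NICs would be requested.
resources:
  limits:
    nvidia.com/gpu: 8
    nvidia.com/mlnxnics: 64
```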
Cluster prerequisites:
- RDMA-capable bare-metal shapes (e.g., `BM.GPU.A100-v2.8`, `BM.GPU.H100.8`)
- Network Operator and SR-IOV device plugin deployed