Multinode Networking#

Multi-node GPU training requires high-bandwidth, low-latency east-west networking between nodes. Without it, NCCL falls back to standard Ethernet and training times increase significantly. The Helm chart installs Kyverno policies that automatically inject the correct networking resources for four cloud providers.

Prerequisites#

  • Kyverno installed in the cluster

  • Volcano scheduler installed (multi-node jobs use Volcano)

  • Nodes with high-performance networking hardware (EFA, InfiniBand, etc.)

  • The cloud provider’s device plugin or network operator deployed so that networking resources are visible to Kubernetes

Note

The NeMo Platform does not provision or manage the underlying cloud networking infrastructure. Each cloud provider section below lists the cluster-level prerequisites your administrator must configure. Refer to your cloud provider’s documentation for setup instructions.

How It Works#

When a job requests more than one node, the jobs controller annotates the pod with nmp.nvidia.com/enable-multi-node-networking: "true". Kyverno watches for this annotation and mutates the pod spec to inject the cloud-specific device requests, environment variables, and volume mounts that NCCL needs for high-performance inter-node communication.
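The trigger side of this flow can be sketched as a minimal pod manifest. Only the annotation key and value are taken from this document; the pod and container names are placeholders:

```yaml
# Sketch: a multi-node training pod as annotated by the jobs controller.
# Kyverno's mutate policies match on this annotation and then inject the
# cloud-specific device requests, env vars, and volume mounts.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-worker-0        # placeholder name
  annotations:
    nmp.nvidia.com/enable-multi-node-networking: "true"
spec:
  containers:
    - name: trainer             # placeholder; mutated by the matching policy
      image: my-training-image  # placeholder
```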

Configuration#

Enable exactly one cloud provider in the multinodeNetworking section of your values.yaml. Enabling more than one causes a Helm install error.

Important

Device-count parameters (efaDevicesPerGPU, rdmaDevicesPerGPU) must match the hardware ratio of your instance type. A mismatch causes jobs to either fail scheduling or silently run without high-speed networking.

AWS (EFA)#

Injects Elastic Fabric Adapter device requests (vpc.amazonaws.com/efa), mounts the EFA OFI library, and configures shared memory.

multinodeNetworking:
  aws:
    enabled: true
    efaDevicesPerGPU: 4  # must match your instance type

Parameters:

  • efaDevicesPerGPU: EFA interfaces to request per GPU. The Kyverno policy multiplies this value by the number of GPUs requested by each container to produce the total vpc.amazonaws.com/efa resource request.

To determine the correct value, divide the number of EFA interfaces on your instance type by the number of GPUs. For example, p5.48xlarge has 32 EFA interfaces and 8 GPUs: 32 / 8 = 4. Consult the AWS EFA instance types documentation for your instance type’s EFA count.
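Putting the multiplication together, the resulting container resources on a p5.48xlarge node would look roughly like this (the exact field placement chosen by the policy is an assumption; the resource names and counts follow the text above):

```yaml
# Sketch: container resources after the AWS Kyverno policy runs,
# with efaDevicesPerGPU: 4 on a p5.48xlarge (8 GPUs).
resources:
  limits:
    nvidia.com/gpu: 8
    vpc.amazonaws.com/efa: 32   # efaDevicesPerGPU (4) × GPUs (8)
```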

Cluster prerequisites:

  • EFA-enabled instance types (e.g., p5.48xlarge, p5e.48xlarge)

  • EFA device plugin deployed

  • Security group permitting all traffic between nodes in the placement group

  • See AWS EFA documentation

Azure (InfiniBand / RDMA)#

Injects RDMA device requests, IPC_LOCK capability, NCCL topology, and UCX environment variables.

multinodeNetworking:
  azure:
    enabled: true
    rdmaDevicesPerGPU: 1
    rdmaDeviceName: "hca_shared_devices_a"

Parameters:

  • rdmaDevicesPerGPU: RDMA devices to request per GPU. Must match the RDMA-to-GPU ratio of your VM size.

  • rdmaDeviceName: Resource name exposed by the RDMA device plugin (e.g., hca_shared_devices_a).
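A rough sketch of the mutated container on Azure. The `rdma/` resource prefix, the topology file path, and the env var shown are assumptions for illustration (the real policy sets the injected values); the IPC_LOCK capability and the RDMA device request follow the description above:

```yaml
# Sketch: container spec after the Azure Kyverno policy runs,
# with rdmaDevicesPerGPU: 1 on an 8-GPU VM.
containers:
  - name: trainer                  # placeholder name
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]          # lets NCCL/UCX pin (lock) memory for RDMA
    env:
      - name: NCCL_TOPO_FILE       # assumed example of an injected NCCL topology variable
        value: /opt/microsoft/ndv4-topo.xml   # assumed path
    resources:
      limits:
        nvidia.com/gpu: 8
        rdma/hca_shared_devices_a: 8   # rdmaDeviceName; rdmaDevicesPerGPU (1) × GPUs (8)
```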

Cluster prerequisites:

  • InfiniBand-capable VM sizes (e.g., Standard_ND96asr_v4, Standard_ND96amsr_A100_v4)

  • RDMA device plugin or Network Operator deployed

  • See Azure InfiniBand documentation

GCP (TCPXO)#

Injects a GPUDirect-TCPXO sidecar daemon, multi-NIC annotations, and NCCL Fastrak environment variables.

multinodeNetworking:
  gcp:
    enabled: true

Cluster prerequisites:

  • A3+ GPU node pools with multi-NIC networking enabled

  • GKE multi-network support configured with the required gpu-nic networks

  • See the GKE GPUDirect-family documentation, specifically the TCPXO sections
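The multi-NIC side of the injection can be sketched with GKE's multi-network pod annotations. The network and interface names below are assumptions; your cluster's gpu-nic network names come from the multi-network setup in the prerequisites:

```yaml
# Sketch: GKE multi-network annotations a TCPXO-enabled pod carries,
# keeping the default interface and attaching the gpu-nic networks.
metadata:
  annotations:
    networking.gke.io/default-interface: "eth0"
    networking.gke.io/interfaces: |
      [
        {"interfaceName": "eth0", "network": "default"},
        {"interfaceName": "eth1", "network": "gpu-nic-1"},
        {"interfaceName": "eth2", "network": "gpu-nic-2"}
      ]
```

The GPUDirect-TCPXO sidecar daemon and NCCL Fastrak environment variables are injected alongside these annotations by the policy.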

OCI (SR-IOV / RDMA)#

Injects Mellanox NIC device requests (nvidia.com/mlnxnics), SR-IOV network annotations via Multus, and InfiniBand-tuned NCCL environment variables.

multinodeNetworking:
  oci:
    enabled: true
    rdmaDevicesPerGPU: 8

Parameters:

  • rdmaDevicesPerGPU: Mellanox RDMA NICs to request per GPU. Must match the NIC-to-GPU ratio of your bare-metal shape.
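A rough sketch of the OCI injections. The Multus network attachment name is an assumption; the annotation key, the nvidia.com/mlnxnics resource, and the per-GPU multiplication follow the description above:

```yaml
# Sketch: pod spec after the OCI Kyverno policy runs,
# with rdmaDevicesPerGPU: 8 on an 8-GPU bare-metal shape.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-rdma-net   # assumed attachment name
spec:
  containers:
    - name: trainer                # placeholder name
      resources:
        limits:
          nvidia.com/gpu: 8
          nvidia.com/mlnxnics: 64  # rdmaDevicesPerGPU (8) × GPUs (8)
```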

Cluster prerequisites: