GKE TCPXO Networking Prerequisites

View as Markdown

For the H100 GKE COS training recipes (h100-gke-cos-training*, on a3-megagpu-8g nodes), GPUDirect TCPXO enables high-speed inter-node GPU communication on GKE. Without it, the NVIDIA Collective Communications Library (NCCL) falls back to TCP (~4 GB/s vs ~340 GB/s with TCPXO).

A100 (a2) exception: the a100-gke-cos-training* recipes intentionally omit the gke-nccl-tcpxo component — GPUDirect TCPXO targets H100 a3-megagpu-8g nodes, not the A100 a2-highgpu/a2-ultragpu machine family. The prerequisites below do not apply to A100 GKE recipes, and the generated A100 bundle does not install the TCPXO DaemonSets.

Infrastructure Prerequisites

GKE clusters must have multi-NIC networking configured before deploying AICR bundles:

  • Multi-NIC networking enabled (8 GPU NICs per a3-megagpu-8g node)
  • Network + GKENetworkParamSet CRs configured for GPU NICs (cluster-specific, not managed by AICR)
  • nccl-tcpxo-installer DaemonSet on GPU nodes (included in AICR bundle)
  • nri-device-injector DaemonSet on GPU nodes (included in AICR bundle)

Important: The GPU node pool must be provisioned with only the 8 GPU NIC networks (gpu-nic-0 through gpu-nic-7). Do not include a gVNIC additional network — it takes a GPU NIC PCI slot (0000:06:00.0), leaving only 7/8 GPUs available for TCPXO.

Workload Pod Configuration (NRI Profile)

The NRI profile mounts the host’s /sys and /proc/sys into the TCPXO daemon container, giving it PCI sysfs visibility without hostNetwork. This preserves pod networking (DNS, network policies, service mesh compatibility).

1apiVersion: v1
2kind: Pod
3metadata:
4 name: my-workload
5 annotations:
6 # NRI device injection for tcpxo-daemon GPU access
7 devices.gke.io/container.tcpxo-daemon: |
8 - path: /dev/nvidia0
9 - path: /dev/nvidia1
10 - path: /dev/nvidia2
11 - path: /dev/nvidia3
12 - path: /dev/nvidia4
13 - path: /dev/nvidia5
14 - path: /dev/nvidia6
15 - path: /dev/nvidia7
16 - path: /dev/nvidiactl
17 - path: /dev/nvidia-uvm
18 - path: /dev/dmabuf_import_helper
19 # Multi-NIC mapping (network names are cluster-specific)
20 networking.gke.io/default-interface: eth0
21 networking.gke.io/interfaces: |
22 [{"interfaceName":"eth0","network":"default"},
23 {"interfaceName":"eth1","network":"gpu-nic0"},
24 {"interfaceName":"eth2","network":"gpu-nic1"},
25 {"interfaceName":"eth3","network":"gpu-nic2"},
26 {"interfaceName":"eth4","network":"gpu-nic3"},
27 {"interfaceName":"eth5","network":"gpu-nic4"},
28 {"interfaceName":"eth6","network":"gpu-nic5"},
29 {"interfaceName":"eth7","network":"gpu-nic6"},
30 {"interfaceName":"eth8","network":"gpu-nic7"}]
31spec:
32 hostNetwork: false
33 containers:
34 - name: tcpxo-daemon
35 image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.20
36 securityContext:
37 capabilities:
38 add: [NET_ADMIN, NET_BIND_SERVICE]
39 volumeMounts:
40 - name: nvtcpxo-libraries
41 mountPath: /usr/local/nvidia
42 readOnly: true
43 - name: nvtcpxo-sys
44 mountPath: /hostsysfs
45 - name: nvtcpxo-proc-sys
46 mountPath: /hostprocsysfs
47 env:
48 - name: LD_LIBRARY_PATH
49 value: /usr/local/nvidia/lib64
50 - name: workload
51 # ... your training container
52 volumeMounts:
53 - name: nvtcpxo-aperture-devices
54 mountPath: /dev/aperture_devices
55 volumes:
56 - name: nvtcpxo-libraries
57 hostPath:
58 path: /home/kubernetes/bin/nvidia
59 - name: nvtcpxo-sys
60 hostPath:
61 path: /sys
62 - name: nvtcpxo-proc-sys
63 hostPath:
64 path: /proc/sys
65 - name: nvtcpxo-aperture-devices
66 hostPath:
67 path: /dev/aperture_devices

Key properties:

  • hostNetwork: false — workloads get proper pod networking
  • privileged: false — tcpxo-daemon uses only NET_ADMIN and NET_BIND_SERVICE
  • /sys mounted as /hostsysfs — provides PCI sysfs visibility for GPU enumeration
  • /proc/sys mounted as /hostprocsysfs — allows kernel network tuning
  • NRI annotations inject GPU devices and multi-NIC interfaces
  • Requires NRI device injector DaemonSet deployed on GPU nodes

See demos/workloads/training/gke-nccl-test-tcpxo.yaml for a complete 2-node NCCL benchmark example.

NCCL Plugin Version Matching

The NCCL test container image must match the cluster’s installed TCPXO plugin version. Check with:

$kubectl get ds nccl-tcpxo-installer -n kube-system \
> -o jsonpath='{.spec.template.spec.containers[?(@.name=="nccl-tcpxo-installer")].image}'

Update the nccl-plugin-gpudirecttcpx-dev image tag in your workload to match.

Running the NCCL Benchmark

aicr validate --phase performance is not yet automated for GKE — the test uses raw Pods with a TCPXO daemon sidecar, which differs from the EKS TrainJob-based approach. Run it manually with either of the two profiles below. Each pod runs a tcpxo-daemon sidecar (manages the GPUDirect TCPX data path) plus the nccl-test container.

Option 1: testdata manifests (matches validator framework)

$export NAMESPACE=nccl-perf
$export GPU_COUNT_PER_NODE=8
$export GPU_COUNT=16
$export WORKER_COUNT=2
$export TEST_TYPE=all_reduce_perf
$export MIN_MESSAGE_SIZE=1M
$export MAX_MESSAGE_SIZE=8G
$
$kubectl create ns $NAMESPACE
$envsubst < validators/performance/testdata/h100/gke/runtime.yaml | kubectl apply -f -
$
$# Wait for pods to be 2/2 Running
$kubectl get pods -n $NAMESPACE -o wide -w
$
$# Trigger the AllReduce benchmark from host-1
$kubectl exec nccl-test-host-1 -n $NAMESPACE -c nccl-test -- \
> /scripts/allreduce.sh nccl-host-1 nccl-host-2
$
$# Expected: ~340 GB/s busBW at 16 GB (AllReduce), ~100 GB/s avg
$# Clean up
$kubectl delete ns $NAMESPACE

Option 2: standalone demo manifests

NRI profile (recommended, no hostNetwork):

$kubectl create ns nccl-test
$kubectl apply -f demos/workloads/training/gke-nccl-test-tcpxo.yaml -n nccl-test
$
$# Wait for pods to be 2/2 Running
$kubectl get pods -n nccl-test -o wide -w
$
$# Trigger the AllReduce benchmark from host-1
$kubectl exec nccl-test-host-1 -n nccl-test -c nccl-test -- bash -c '
> /scripts/init_ssh.sh nccl-host-1 nccl-host-2 &&
> pushd /scripts && /scripts/gen_hostfiles.sh nccl-host-1 nccl-host-2 && popd &&
> DATA_MIN=1K DATA_MAX=16G BENCHMARK=all_reduce_perf NHOSTS=2 \
> NCCL_LIB_DIR="/usr/local/nvidia/lib64" LD_LIBRARY_PATH="/usr/local/nvidia/lib64" \
> /scripts/demo-run-nccl-test-tcpxo-via-mpi.sh'
$
$# Expected: ~340 GB/s busBW at 16 GB (AllReduce), ~100 GB/s avg
$# Clean up
$kubectl delete ns nccl-test

Interpreting results

MetricWithout TCPXOWith TCPXO
AllReduce busBW (16 GB)~4 GB/s~340 GB/s
AllReduce avg busBW~4 GB/s~100 GB/s

Troubleshooting

RxDM detects 7/8 GPUs

If RxDM reports Number of GPUs detected 7 is not equal to the actual number of GPUs 8, check the GPU node pool’s additional network configuration:

$gcloud container node-pools describe <pool-name> \
> --cluster <cluster> --region <region> --project <project> \
> --format="yaml(networkConfig.additionalNodeNetworkConfigs)"

If a gVNIC network appears in the list, it is taking a GPU NIC PCI slot. Remove the gVNIC from the node pool and reprovision the GPU nodes.

You can also verify the node NIC mapping:

$kubectl get node <gpu-node> \
> -o jsonpath='{.metadata.annotations.networking\.gke\.io/nic-info}'

All 8 GPU NIC PCI addresses should be mapped to eth1eth8. If a gVNIC is present, it typically occupies PCI 0000:06:00.0, displacing the first GPU NIC.

RxDM detects 0/8 GPUs

If RxDM reports Number of GPUs detected in the PCI tree 0, the pod is missing the /sys hostPath mount. Ensure /sys is mounted as /hostsysfs in the tcpxo-daemon container. Without it, the container network namespace hides the host PCI sysfs tree entirely.

Performance Reference

Validated on GKE 1.35 / a3-megagpu-8g (2 nodes, 16 GPUs):

ProfilehostNetworkbusBW @ 16 GBAvg busBW
NRI (recommended)false~340 GB/s~100 GB/s
Without TCPXON/A~4 GB/s~4 GB/s