GKE TCPXO Networking Prerequisites

For *-gke-cos-training* recipes, GPUDirect TCPXO enables high-speed inter-node GPU communication on GKE. Without it, NCCL falls back to TCP (~4 GB/s vs ~340 GB/s with TCPXO).

Infrastructure Prerequisites

GKE clusters must have multi-NIC networking configured before deploying AICR bundles:

  • Multi-NIC networking enabled (8 GPU NICs per a3-megagpu-8g node)
  • Network + GKENetworkParamSet CRs configured for GPU NICs (cluster-specific, not managed by AICR; see the sketch below)
  • nccl-tcpxo-installer DaemonSet on GPU nodes (included in AICR bundle)
  • nri-device-injector DaemonSet on GPU nodes (included in AICR bundle)

Important: The GPU node pool must be provisioned with only the 8 GPU NIC networks (gpu-nic-0 through gpu-nic-7). Do not include a gVNIC additional network — it takes a GPU NIC PCI slot (0000:06:00.0), leaving only 7 of the 8 GPUs usable for TCPXO.
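
The Network + GKENetworkParamSet pairing is cluster-specific; below is a minimal sketch for a single GPU NIC following GKE's multi-NIC pattern. The VPC and subnet names (gpu-vpc-0, gpu-subnet-0) are placeholders; repeat the pair for each of the 8 NICs:

apiVersion: networking.gke.io/v1
kind: Network
metadata:
  name: gpu-nic0
spec:
  type: Device                 # NIC is passed through as a device, not shared
  parametersRef:
    group: networking.gke.io
    kind: GKENetworkParamSet
    name: gpu-nic0
---
apiVersion: networking.gke.io/v1
kind: GKENetworkParamSet
metadata:
  name: gpu-nic0
spec:
  vpc: gpu-vpc-0               # placeholder VPC name
  vpcSubnet: gpu-subnet-0      # placeholder subnet name
  deviceMode: NetDevice        # expose the NIC as a netdevice to the pod

Node-pool provisioning should then attach only those eight networks. A hedged sketch (unrelated flags omitted; network and subnet names are the same placeholders):

$ gcloud container node-pools create <pool-name> \
>     --cluster <cluster> --region <region> \
>     --machine-type a3-megagpu-8g \
>     --additional-node-network network=gpu-vpc-0,subnetwork=gpu-subnet-0 \
>     --additional-node-network network=gpu-vpc-1,subnetwork=gpu-subnet-1 \
>     --additional-node-network network=gpu-vpc-2,subnetwork=gpu-subnet-2 \
>     --additional-node-network network=gpu-vpc-3,subnetwork=gpu-subnet-3 \
>     --additional-node-network network=gpu-vpc-4,subnetwork=gpu-subnet-4 \
>     --additional-node-network network=gpu-vpc-5,subnetwork=gpu-subnet-5 \
>     --additional-node-network network=gpu-vpc-6,subnetwork=gpu-subnet-6 \
>     --additional-node-network network=gpu-vpc-7,subnetwork=gpu-subnet-7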

Workload Pod Configuration (NRI Profile)

The NRI profile mounts the host’s /sys and /proc/sys into the TCPXO daemon container, giving it PCI sysfs visibility without hostNetwork. This preserves pod networking (DNS, network policies, service mesh compatibility).

apiVersion: v1
kind: Pod
metadata:
  name: my-workload
  annotations:
    # NRI device injection for tcpxo-daemon GPU access
    devices.gke.io/container.tcpxo-daemon: |
      - path: /dev/nvidia0
      - path: /dev/nvidia1
      - path: /dev/nvidia2
      - path: /dev/nvidia3
      - path: /dev/nvidia4
      - path: /dev/nvidia5
      - path: /dev/nvidia6
      - path: /dev/nvidia7
      - path: /dev/nvidiactl
      - path: /dev/nvidia-uvm
      - path: /dev/dmabuf_import_helper
    # Multi-NIC mapping (network names are cluster-specific)
    networking.gke.io/default-interface: eth0
    networking.gke.io/interfaces: |
      [{"interfaceName":"eth0","network":"default"},
       {"interfaceName":"eth1","network":"gpu-nic0"},
       {"interfaceName":"eth2","network":"gpu-nic1"},
       {"interfaceName":"eth3","network":"gpu-nic2"},
       {"interfaceName":"eth4","network":"gpu-nic3"},
       {"interfaceName":"eth5","network":"gpu-nic4"},
       {"interfaceName":"eth6","network":"gpu-nic5"},
       {"interfaceName":"eth7","network":"gpu-nic6"},
       {"interfaceName":"eth8","network":"gpu-nic7"}]
spec:
  hostNetwork: false
  containers:
  - name: tcpxo-daemon
    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.20
    securityContext:
      capabilities:
        add: [NET_ADMIN, NET_BIND_SERVICE]
    volumeMounts:
    - name: nvtcpxo-libraries
      mountPath: /usr/local/nvidia
      readOnly: true
    - name: nvtcpxo-sys
      mountPath: /hostsysfs
    - name: nvtcpxo-proc-sys
      mountPath: /hostprocsysfs
    env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/nvidia/lib64
  - name: workload
    # ... your training container
    volumeMounts:
    - name: nvtcpxo-aperture-devices
      mountPath: /dev/aperture_devices
  volumes:
  - name: nvtcpxo-libraries
    hostPath:
      path: /home/kubernetes/bin/nvidia
  - name: nvtcpxo-sys
    hostPath:
      path: /sys
  - name: nvtcpxo-proc-sys
    hostPath:
      path: /proc/sys
  - name: nvtcpxo-aperture-devices
    hostPath:
      path: /dev/aperture_devices

Key properties:

  • hostNetwork: false — workloads get proper pod networking
  • privileged: false — tcpxo-daemon uses only NET_ADMIN and NET_BIND_SERVICE
  • /sys mounted as /hostsysfs — provides PCI sysfs visibility for GPU enumeration
  • /proc/sys mounted as /hostprocsysfs — allows kernel network tuning
  • NRI annotations inject GPU devices and multi-NIC interfaces
  • Requires the nri-device-injector DaemonSet deployed on GPU nodes (see the check below)

See demos/workloads/training/gke-nccl-test-tcpxo.yaml for a complete 2-node NCCL benchmark example.
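
Both DaemonSets can be verified up front. A quick check, assuming the AICR bundle installs them in kube-system (adjust the namespace if yours differs); DESIRED and READY should match the GPU node count:

$ kubectl get ds nccl-tcpxo-installer nri-device-injector -n kube-system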

NCCL Plugin Version Matching

The NCCL test container image must match the cluster’s installed TCPXO plugin version. Check with:

$ kubectl get ds nccl-tcpxo-installer -n kube-system \
> -o jsonpath='{.spec.template.spec.containers[?(@.name=="nccl-tcpxo-installer")].image}'

Update the nccl-plugin-gpudirecttcpx-dev image tag in your workload to match.
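
One way to script the match, assuming the installer image tag doubles as the plugin version (as the check above implies) and that my-workload.yaml (a placeholder file name) pins the plugin image by tag:

$ TAG=$(kubectl get ds nccl-tcpxo-installer -n kube-system \
>     -o jsonpath='{.spec.template.spec.containers[?(@.name=="nccl-tcpxo-installer")].image}' \
>     | awk -F: '{print $NF}')
$ sed -i "s|\(nccl-plugin-gpudirecttcpx-dev:\).*|\1${TAG}|" my-workload.yaml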

Troubleshooting

RxDM detects 7/8 GPUs

If RxDM reports "Number of GPUs detected 7 is not equal to the actual number of GPUs 8", check the GPU node pool’s additional network configuration:

$ gcloud container node-pools describe <pool-name> \
> --cluster <cluster> --region <region> --project <project> \
> --format="yaml(networkConfig.additionalNodeNetworkConfigs)"

If a gVNIC network appears in the list, it is taking a GPU NIC PCI slot. Remove the gVNIC from the node pool and reprovision the GPU nodes.
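
Additional node networks are set at node-pool creation time, so removing the gVNIC generally means deleting the pool and recreating it with only the eight GPU NIC networks, as in the provisioning sketch under Infrastructure Prerequisites:

$ gcloud container node-pools delete <pool-name> \
>     --cluster <cluster> --region <region> --project <project>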

You can also verify the node NIC mapping:

$ kubectl get node <gpu-node> \
> -o jsonpath='{.metadata.annotations.networking\.gke\.io/nic-info}'

All 8 GPU NIC PCI addresses should be mapped to eth1 through eth8. If a gVNIC is present, it typically occupies PCI 0000:06:00.0, displacing the first GPU NIC.

RxDM detects 0/8 GPUs

If RxDM reports "Number of GPUs detected in the PCI tree 0", the pod is missing the /sys hostPath mount. Ensure /sys is mounted as /hostsysfs in the tcpxo-daemon container. Without it, the container network namespace hides the host PCI sysfs tree entirely.
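
The fix is the nvtcpxo-sys hostPath volume from the manifest above; the minimal fragment:

# In the tcpxo-daemon container spec
volumeMounts:
- name: nvtcpxo-sys
  mountPath: /hostsysfs

# In spec.volumes
volumes:
- name: nvtcpxo-sys
  hostPath:
    path: /sys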

Performance Reference

Validated on GKE 1.35 / a3-megagpu-8g (2 nodes, 16 GPUs):

Profile             hostNetwork   busBW @ 16 GB   Avg busBW
NRI (recommended)   false         ~340 GB/s       ~100 GB/s
Without TCPXO       N/A           ~4 GB/s         ~4 GB/s
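
To reproduce these numbers, apply the bundled benchmark and read the busbw columns from the logs (assuming the manifest creates a Job; the job name below is a placeholder, so substitute the actual name from the manifest):

$ kubectl apply -f demos/workloads/training/gke-nccl-test-tcpxo.yaml
$ kubectl logs -f job/<nccl-test-job> | grep -iE 'busbw|bandwidth'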