GCP-Based DGX Cloud Create Cluster Configuration#

This section provides specific details about configurations or customizations available in GCP-based DGX Cloud Create clusters.

GCP TCPXO Networking#

DGX Cloud Create clusters running on Google Kubernetes Engine (GKE) provide TCPXO to enable high-speed distributed computing. DGXC customers can use this fabric to enable GPUDirect RDMA, NCCL, and MPI for distributed workloads.

While many container images built for distributed computing already bundle tools like MPI and NCCL, workloads that want to take advantage of TCPXO should use the stack provided by DGX Cloud Create. DGXC automatically provides the required environment variables and the TCPXO network driver to pods launched as multi-node distributed MPIJobs or PyTorchJobs by mutating their pod definitions.

Automatic TCPXO Enablement#

TCPXO is automatically enabled for workload requests of 8 GPUs from an H100 node. TCPXO enablement consists of:

  • The insertion of a TCPXO driver sidecar container into the workload pod specification.

  • The addition of volume mounts at:

    • /dev/aperture_devices

    • /home/kubernetes/bin/nvidia

    • /proc/sys

    • /sys

  • The addition of container resource requests for networking.gke.io.networks/gpu-nic, enumerated from 0 to 7.

  • LD_LIBRARY_PATH set to /usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64, which workloads may prepend additional paths to.

  • The following environment variables are set:

    Variable                                        Setting
    ----------------------------------------------  -----------------------------------------------------
    NCCL_ALGO                                       Ring,Tree
    NCCL_BUFFSIZE                                   8388608
    NCCL_CROSS_NIC                                  0
    NCCL_DYNAMIC_CHUNK_SIZE                         524288
    NCCL_FASTRAK_CTRL_DEV                           eth0
    NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL             0
    NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING             0
    NCCL_FASTRAK_IFNAME                             eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
    NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY              /dev/aperture_devices
    NCCL_FASTRAK_NUM_FLOWS                          2
    NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS           600000
    NCCL_FASTRAK_USE_LLCM                           1
    NCCL_FASTRAK_USE_SNAP                           1
    NCCL_MIN_NCHANNELS                              4
    NCCL_NET_GDR_LEVEL                              PIX
    NCCL_NVLS_ENABLE                                0
    NCCL_P2P_NET_CHUNKSIZE                          524288
    NCCL_P2P_NVL_CHUNKSIZE                          1048576
    NCCL_P2P_PCI_CHUNKSIZE                          524288
    NCCL_PROTO                                      Simple
    NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE   /usr/local/nvidia/lib64/a3plus_guest_config.textproto
    NCCL_SOCKET_IFNAME                              eth0
    NCCL_TUNER_CONFIG_PATH                          /usr/local/nvidia/lib64/a3plus_tuner_config.textproto
    NCCL_TUNER_PLUGIN                               libnccl-tuner.so
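For orientation, the following is a minimal sketch of roughly what a mutated worker container spec can look like. The container name, volume names, and the enumerated gpu-nic resource names are illustrative assumptions; the actual sidecar and volume definitions are injected and managed by DGXC.

    # Illustrative fragment only -- the real mutation is applied automatically by DGXC.
    # Names such as "pytorch", "aperture-devices", and the gpu-nic enumeration are assumptions.
    spec:
      containers:
        - name: pytorch                      # the user's workload container
          env:
            - name: LD_LIBRARY_PATH
              value: /usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64
            - name: NCCL_FASTRAK_IFNAME
              value: eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
            # ...plus the remaining NCCL_* variables from the table above
          resources:
            requests:
              nvidia.com/gpu: 8
              networking.gke.io.networks/gpu-nic-0: "1"
              # ...one request per GPU NIC, through gpu-nic-7 (assumed enumeration)
              networking.gke.io.networks/gpu-nic-7: "1"
          volumeMounts:
            - name: aperture-devices
              mountPath: /dev/aperture_devices
            - name: nvidia-install
              mountPath: /home/kubernetes/bin/nvidia
            - name: proc-sys
              mountPath: /proc/sys
            - name: sys
              mountPath: /sys
        # ...the injected TCPXO driver sidecar container also appears here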

Some workloads will still require small modifications to take advantage of the mounted TCPXO stack:

  • Container images must be built with the GNU C library (glibc) version 2.34 or later (generally, images based on at least Ubuntu 21.10 or ubi9).

  • If the distributed job is a PyTorchJob, generally no modifications are required.

  • If the distributed job is an MPIJob, it may also be necessary to pass the environment variables defined in the table above from the launcher to the workers, for example mpirun -x LD_LIBRARY_PATH -x NCCL_ALGO -x NCCL_BUFFSIZE ... (see the sketch below).
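As a minimal sketch, assuming a Kubeflow-style MPIJob where the container name, image, process count, and application path are placeholders, the launcher command can forward the variables as follows:

    # Illustrative MPIJob launcher fragment; only the -x forwarding is the point here.
    mpiReplicaSpecs:
      Launcher:
        template:
          spec:
            containers:
              - name: launcher                           # hypothetical name
                image: my-registry/my-mpi-image:latest   # placeholder image
                command:
                  - mpirun
                  - -np
                  - "16"
                  - -x
                  - LD_LIBRARY_PATH
                  - -x
                  - NCCL_ALGO
                  - -x
                  - NCCL_BUFFSIZE
                  # ...repeat -x for the remaining NCCL_* variables in the table
                  - /workspace/my_app                    # placeholder application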

Troubleshooting#

User wants to confirm that the TCPXO plugin is being used:

This can usually be done by setting the environment variable NCCL_DEBUG=INFO for workloads that use NCCL. When the workload sends messages, it should print information in the logs like:

INFO(NCCL PLUGIN): It's a3-megagpu machine.
INFO(NCCL PLUGIN): Loading plugin: libnccl-tcpxo.so
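If the workload is defined through a pod spec, one way to set this variable is to add it to the workload container's env list, for example:

    env:
      - name: NCCL_DEBUG
        value: INFO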

User is running a single-node 8-GPU PyTorch job and sees an error at completion:

E1115 21:38:55.904000 136233596458176 torch/distributed/elastic/multiprocessing/api.py:838]
failed (exitcode: -27) local_rank: 0 (pid: 1185) of binary: /usr/bin/python

See the Opt-Out section below for remediation.

Opt-Out#

In some cases, users may want to disable TCPXO injection entirely. This is achieved by adding an annotation to workloads: runai.dgxc.nvidia.com/gcp-nccl: 'skip'. How to add this annotation depends on how the workload is submitted:

  • NVIDIA Run:ai CLI

    runai submit-dist  --annotation "runai.dgxc.nvidia.com/gcp-nccl=skip"
    
  • NVIDIA Run:ai UI

    When submitting a new distributed training workload, find the General section and select + Annotation.

  • YAML Files

    Add runai.dgxc.nvidia.com/gcp-nccl: 'skip' to the metadata annotations section of any pod specs.
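    For example, a pod template's metadata can include (a minimal fragment, other fields omitted):

      metadata:
        annotations:
          runai.dgxc.nvidia.com/gcp-nccl: 'skip'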

Note

The runai.dgxc.nvidia.com/gcp-nccl annotation also accepts the value none. This setting forces the NCCL_TUNER_PLUGIN variable to none, which means that no TCPXO plugin will be specified even if one is explicitly requested in the workload spec.