GCP-Based DGX Cloud Create Cluster Configuration#
This section provides specific details about configurations or customizations available in GCP-based DGX Cloud Create clusters.
GCP TCPXO Networking#
DGX Cloud Create clusters in Google GKE provide TCPXO to enable high-speed distributed computing. DGXC customers can use this fabric to enable GPUDirect RDMA, NCCL, and MPI for distributed workloads.
While many container images built for distributed computing already bundle tools like MPI and NCCL, workloads that want to take advantage of TCPXO should use the stack provided by DGX Cloud Create. DGXC provides the required environment variables and TCPXO network driver automatically to pods launched as multi-node distributed MPIJob or PyTorchJob by mutating their pod definitions.
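For illustration, the kind of workload that gets mutated is a multi-node distributed job such as the following PyTorchJob sketch. All names and the image are placeholders, and (as described in the next section) injection applies when each pod requests all 8 GPUs of an H100 node:

```yaml
# Sketch of a multi-node PyTorchJob whose pods would be mutated for TCPXO.
# All names and the image are placeholders.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: example-distributed-job
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.05-py3   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 8        # a full H100 node per pod
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.05-py3   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 8
```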
Automatic TCPXO Enablement#
TCPXO is automatically enabled for workloads that request all 8 GPUs of an H100 node. TCPXO enablement consists of:
- The insertion of a TCPXO driver sidecar container into the workload pod specification.
- The addition of volume mounts at:
  - /dev/aperture_devices
  - /home/kubernetes/bin/nvidia
  - /proc/sys
  - /sys
- The addition of container resource requests for networking.gke.io.networks/gpu-nic, enumerated from 0 to 7.
- LD_LIBRARY_PATH set to /usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64, which can be prepended to.
- The following environment variables set:
| Variable | Setting |
|---|---|
| NCCL_ALGO | Ring,Tree |
| NCCL_BUFFSIZE | 8388608 |
| NCCL_CROSS_NIC | 0 |
| NCCL_DYNAMIC_CHUNK_SIZE | 524288 |
| NCCL_FASTRAK_CTRL_DEV | eth0 |
| NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL | 0 |
| NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING | 0 |
| NCCL_FASTRAK_IFNAME | eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8 |
| NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY | /dev/aperture_devices |
| NCCL_FASTRAK_NUM_FLOWS | 2 |
| NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS | 600000 |
| NCCL_FASTRAK_USE_LLCM | 1 |
| NCCL_FASTRAK_USE_SNAP | 1 |
| NCCL_MIN_NCHANNELS | 4 |
| NCCL_NET_GDR_LEVEL | PIX |
| NCCL_NVLS_ENABLE | 0 |
| NCCL_P2P_NET_CHUNKSIZE | 524288 |
| NCCL_P2P_NVL_CHUNKSIZE | 1048576 |
| NCCL_P2P_PCI_CHUNKSIZE | 524288 |
| NCCL_PROTO | Simple |
| NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE | /usr/local/nvidia/lib64/a3plus_guest_config.textproto |
| NCCL_SOCKET_IFNAME | eth0 |
| NCCL_TUNER_CONFIG_PATH | /usr/local/nvidia/lib64/a3plus_tuner_config.textproto |
| NCCL_TUNER_PLUGIN | libnccl-tuner.so |
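Taken together, the following sketch gives an illustrative (not exact) view of what the mutation adds to a worker pod spec. Container, volume, and image names below are placeholders:

```yaml
# Illustrative sketch only: approximate shape of a worker pod spec after TCPXO
# injection. Container, volume, and image names are placeholders; the actual
# mutation output may differ in detail.
spec:
  containers:
    - name: pytorch                          # the user's workload container
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64
        - name: NCCL_FASTRAK_IFNAME
          value: eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
        # ...plus the remaining NCCL_* variables from the table above...
      resources:
        limits:
          nvidia.com/gpu: 8
          networking.gke.io.networks/gpu-nic-0: "1"   # enumerated through gpu-nic-7
      volumeMounts:
        - name: aperture-devices             # placeholder volume name
          mountPath: /dev/aperture_devices
        - name: nvidia-dir
          mountPath: /home/kubernetes/bin/nvidia
        - name: proc-sys
          mountPath: /proc/sys
        - name: sys
          mountPath: /sys
    - name: tcpxo-daemon                     # injected TCPXO driver sidecar (name illustrative)
      image: <tcpxo-driver-image>            # supplied by DGX Cloud Create
```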
Some workloads will still require small modifications to take advantage of the mounted TCPXO stack:
- Container images must be built against a C library (glibc) version 2.34 or later (generally, images built from at least Ubuntu 21.10 or ubi9).
- If the distributed job is a PyTorchJob, generally no modifications are required.
- If the distributed job is an MPIJob, it may also be necessary to pass the environment variables defined in the table above from the launcher to the worker nodes, for example: mpirun -x LD_LIBRARY_PATH -x NCCL_ALGO -x NCCL_BUFFSIZE .... A fuller sketch follows this list.
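For illustration, a sketch of how an MPIJob launcher container might forward these variables. The image, process count, and application binary are placeholders; extend the -x list to the variables your job needs:

```yaml
# Sketch of an MPIJob launcher container that forwards TCPXO-related
# variables to workers via mpirun -x. All names and the image are placeholders.
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: my-registry/my-mpi-image:latest   # placeholder image
              command:
                - mpirun
                - -np
                - "16"                       # placeholder process count
                - -x
                - LD_LIBRARY_PATH
                - -x
                - NCCL_ALGO
                - -x
                - NCCL_BUFFSIZE
                # ...repeat -x for the remaining NCCL_* variables as needed...
                - ./my_distributed_app       # placeholder binary
```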
Troubleshooting#
User wants to confirm that the TCPXO plugin is being used:
This can usually be done by setting the environment variable NCCL_DEBUG=INFO for workloads that use NCCL.
When the workload sends messages, it should print information in logs like:
INFO(NCCL PLUGIN): It's a3-megagpu machine.
INFO(NCCL PLUGIN): Loading plugin: libnccl-tcpxo.so
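One way to set this is on the workload container's environment (a minimal sketch; surrounding spec omitted):

```yaml
# Sketch: enable NCCL informational logging on the workload container so it
# reports which network plugin is loaded.
env:
  - name: NCCL_DEBUG
    value: INFO
```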
User is running a single node 8-GPU PyTorch job and seeing an error at finish:
E1115 21:38:55.904000 136233596458176 torch/distributed/elastic/multiprocessing/api.py:838]
failed (exitcode: -27) local_rank: 0 (pid: 1185) of binary: /usr/bin/python
See the Opt-Out section below for remediation.
Opt-Out#
In some cases, users may want to disable TCPXO injection entirely. This is achieved by adding the annotation runai.dgxc.nvidia.com/gcp-nccl: 'skip' to workloads. How to add this annotation depends on how the workload is submitted:
NVIDIA Run:ai CLI
runai submit-dist … --annotation "runai.dgxc.nvidia.com/gcp-nccl=skip"
NVIDIA Run:ai UI
When submitting a new distributed training workload, find the General section and select + Annotation.
YAML Files
Add runai.dgxc.nvidia.com/gcp-nccl: 'skip' to the metadata annotations section of any pod specs.
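For example (a minimal sketch of the relevant pod template fields; the surrounding spec is omitted):

```yaml
# Sketch: opting a pod template out of TCPXO injection.
metadata:
  annotations:
    runai.dgxc.nvidia.com/gcp-nccl: 'skip'
```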
Note
The runai.dgxc.nvidia.com/gcp-nccl annotation has another option, none. This setting ensures that the variable NCCL_TUNER_PLUGIN is set to none, which means that no TCPXO plugin will be specified even if explicitly requested from the workload spec.
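For example (a sketch, placed the same way as the skip annotation above):

```yaml
# Sketch: sets NCCL_TUNER_PLUGIN=none so that no TCPXO plugin is specified,
# even if the workload spec explicitly requests one.
metadata:
  annotations:
    runai.dgxc.nvidia.com/gcp-nccl: 'none'
```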