GCP-Based DGX Cloud Create Cluster Configuration#

This section provides specific details about configurations or customizations available in GCP-based DGX Cloud Create clusters.

GCP TCPXO Networking#

DGX Cloud Create clusters running on Google Kubernetes Engine (GKE) provide TCPXO to enable high-speed distributed computing. DGXC customers can use this fabric to enable GPUDirect RDMA, NCCL, and MPI for distributed workloads.

While many container images built for distributed computing already bundle tools like MPI and NCCL, workloads that want to take advantage of TCPXO should use the stack provided by DGX Cloud Create. DGXC automatically provides the required environment variables and the TCPXO network driver to pods launched as multi-node distributed MPIJobs or PyTorchJobs by mutating their pod definitions.

Automatic TCPXO Enablement#

TCPXO is automatically enabled for workload requests of 8 GPUs from an H100 node. TCPXO enablement consists of:

  • The insertion of a TCPXO driver sidecar container into the workload pod specification.

  • The addition of volume mounts at:

    • /dev/aperture_devices

    • /home/kubernetes/bin/nvidia

    • /proc/sys

    • /sys

  • The addition of container resource requests for networking.gke.io.networks/gpu-nic, enumerated from 0 to 7.

  • LD_LIBRARY_PATH set to /usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64, which workloads may prepend additional paths to.

  • The following environment variables are set:

    Variable                                        Setting
    ----------------------------------------------  -----------------------------------------------------
    NCCL_ALGO                                       Ring,Tree
    NCCL_BUFFSIZE                                   8388608
    NCCL_CROSS_NIC                                  0
    NCCL_DYNAMIC_CHUNK_SIZE                         524288
    NCCL_FASTRAK_CTRL_DEV                           eth0
    NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL             0
    NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING             0
    NCCL_FASTRAK_IFNAME                             eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
    NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY              /dev/aperture_devices
    NCCL_FASTRAK_NUM_FLOWS                          2
    NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS           600000
    NCCL_FASTRAK_USE_LLCM                           1
    NCCL_FASTRAK_USE_SNAP                           1
    NCCL_MIN_NCHANNELS                              4
    NCCL_NET_GDR_LEVEL                              PIX
    NCCL_NVLS_ENABLE                                0
    NCCL_P2P_NET_CHUNKSIZE                          524288
    NCCL_P2P_NVL_CHUNKSIZE                          1048576
    NCCL_P2P_PCI_CHUNKSIZE                          524288
    NCCL_PROTO                                      Simple
    NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE   /usr/local/nvidia/lib64/a3plus_guest_config.textproto
    NCCL_SOCKET_IFNAME                              eth0
    NCCL_TUNER_CONFIG_PATH                          /usr/local/nvidia/lib64/a3plus_tuner_config.textproto
    NCCL_TUNER_PLUGIN                               libnccl-tuner.so
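For orientation, the following is a minimal sketch of roughly what a mutated worker container spec can look like. The container name, volume names, and the enumerated gpu-nic resource names are illustrative assumptions; the actual sidecar and volume definitions are injected and managed by DGXC.

    # Illustrative fragment only -- the real mutation is applied automatically by DGXC.
    # Names such as "pytorch", "aperture-devices", and the gpu-nic enumeration are assumptions.
    spec:
      containers:
        - name: pytorch                      # the user's workload container
          env:
            - name: LD_LIBRARY_PATH
              value: /usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64
            - name: NCCL_FASTRAK_IFNAME
              value: eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8
            # ...plus the remaining NCCL_* variables from the table above
          resources:
            requests:
              nvidia.com/gpu: 8
              networking.gke.io.networks/gpu-nic-0: "1"
              # ...one request per GPU NIC, through gpu-nic-7 (assumed enumeration)
              networking.gke.io.networks/gpu-nic-7: "1"
          volumeMounts:
            - name: aperture-devices
              mountPath: /dev/aperture_devices
            - name: nvidia-install
              mountPath: /home/kubernetes/bin/nvidia
            - name: proc-sys
              mountPath: /proc/sys
            - name: sys
              mountPath: /sys
        # ...the injected TCPXO driver sidecar container also appears here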

Some workloads will still require small modifications to take advantage of the mounted TCPXO stack:

  • Container images must be built with the GNU C library (glibc) version 2.34 or later (generally, images based on at least Ubuntu 21.10 or ubi9).

  • If the distributed job is a PyTorchJob, generally no modifications are required.

  • If the distributed job is an MPIJob, it may also be necessary to pass the environment variables defined in the table above from the launcher to the workers, for example mpirun -x LD_LIBRARY_PATH -x NCCL_ALGO -x NCCL_BUFFSIZE ... (see the sketch below).
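As a minimal sketch, assuming a Kubeflow-style MPIJob where the container name, image, process count, and application path are placeholders, the launcher command can forward the variables as follows:

    # Illustrative MPIJob launcher fragment; only the -x forwarding is the point here.
    mpiReplicaSpecs:
      Launcher:
        template:
          spec:
            containers:
              - name: launcher                           # hypothetical name
                image: my-registry/my-mpi-image:latest   # placeholder image
                command:
                  - mpirun
                  - -np
                  - "16"
                  - -x
                  - LD_LIBRARY_PATH
                  - -x
                  - NCCL_ALGO
                  - -x
                  - NCCL_BUFFSIZE
                  # ...repeat -x for the remaining NCCL_* variables in the table
                  - /workspace/my_app                    # placeholder application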

Troubleshooting#

User wants to confirm that the TCPXO plugin is being used:

This can usually be done by setting the environment variable NCCL_DEBUG=INFO for workloads that use NCCL. When the workload sends messages, it should print information in the logs like:

INFO(NCCL PLUGIN): It's a3-megagpu machine.
INFO(NCCL PLUGIN): Loading plugin: libnccl-tcpxo.so
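If the workload is defined through a pod spec, one way to set this variable is to add it to the workload container's env list, for example:

    env:
      - name: NCCL_DEBUG
        value: INFO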

User is running a single-node 8-GPU PyTorch job and sees an error at completion:

E1115 21:38:55.904000 136233596458176 torch/distributed/elastic/multiprocessing/api.py:838]
failed (exitcode: -27) local_rank: 0 (pid: 1185) of binary: /usr/bin/python

See the Opt-Out section below for remediation.

Opt-Out#

In some cases, users may want to disable TCPXO injection entirely. This is achieved by adding an annotation to workloads: runai.dgxc.nvidia.com/gcp-nccl: 'skip'. How to add this annotation depends on how the workload is submitted:

  • NVIDIA Run:ai CLI

    runai submit-dist  --annotation "runai.dgxc.nvidia.com/gcp-nccl=skip"
    
  • NVIDIA Run:ai UI

    When submitting a new distributed training workload, find the General section and select + Annotation.

  • YAML Files

    Add runai.dgxc.nvidia.com/gcp-nccl: 'skip' to the metadata annotations section of any pod specs.
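    For example, a pod template's metadata can include (a minimal fragment, other fields omitted):

      metadata:
        annotations:
          runai.dgxc.nvidia.com/gcp-nccl: 'skip'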

Note

The runai.dgxc.nvidia.com/gcp-nccl annotation also accepts the value none. This setting forces the NCCL_TUNER_PLUGIN variable to none, which means that no TCPXO plugin will be specified even if one is explicitly requested in the workload spec.