Cloud Service Providers

Common

To set up a Slurm cluster for the NVIDIA NeMo™ framework, NVIDIA recommends Nephele. This cluster deployment tool has been tested on Azure and Oracle Cloud, and it can be hosted on a VM instance in any CSP. To get started:

  1. Clone the Nephele repo.

  2. Install the dependencies.

  3. Provide CSP credentials in nephele.conf.

  4. Change REPLICAS_x8a100 in nephele.conf to the desired number of nodes.

Finally, run ./nephele init and ./nephele create.
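A minimal sketch of this workflow is shown below. The repository URL and the dependency-install command are placeholders (check the Nephele README), and the sed edit assumes a KEY=value format in nephele.conf.

    # Clone the Nephele repo and install its dependencies (URL and install
    # command are placeholders; follow the repo's README)
    git clone <nephele-repo-url>
    cd nephele
    pip install -r requirements.txt

    # Provide CSP credentials in nephele.conf, then set the desired node count
    # (assumes a KEY=value format; here 8 nodes)
    sed -i 's/^REPLICAS_x8a100=.*/REPLICAS_x8a100=8/' nephele.conf

    # Initialize and create the cluster
    ./nephele init
    ./nephele create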

Once the cluster is up and running, NVIDIA also recommends mounting an external persistent NFS share on all nodes and using it to configure and run the NeMo framework.
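As an illustration, a share could be mounted on each node roughly as follows; the server address, export path, and mount point are placeholders.

    # Mount a persistent NFS export (placeholders) on a node
    sudo mkdir -p /mnt/nemo-workspace
    sudo mount -t nfs <nfs-server>:/nemo-workspace /mnt/nemo-workspace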

The steps above apply to all CSPs, including Azure and OCI. Some modifications are necessary for OCI, as detailed below. Note that for OCI, a custom image must be imported before running ./nephele create.

OCI

The NeMo framework supports running training and inference containers on OCI. For more details about orchestration scripts, reach out to oci_nm@nvidia.com.

GCP

To use Slurm on Google Cloud, NVIDIA recommends the Google Cloud HPC Toolkit. For Kubernetes deployments, NVIDIA recommends Google Kubernetes Engine (GKE). For more information about configuring NeMo on GCP, contact us at gcp-nemo@google.com.

AWS

For AWS, NVIDIA recommends using ParallelCluster. Details on how to launch ParallelCluster for running training with the NeMo Megatron Launcher are available in this AWS sample: aws-samples/awsome-distributed-training/nemo-launcher. A blog post will also be published soon; we will link it here once it becomes available.

To launch jobs on AWS, the EFA driver and NCCL plugin must first be installed on top of the training container. NVIDIA recommends building a new container image with Docker and then creating an Enroot image from it (a consolidated sketch of these steps follows the numbered list below). The Enroot image is a squashfs file (nemo_megatron_training.sqsh) equivalent to the Docker image and can be used with the Slurm cluster. For more information on Enroot images, see the Enroot GitHub docs.

On the scheduler node:

  1. Install Docker.

  2. Build the image with the EFA drivers and NCCL plugin from csp_tools/aws/Dockerfile:

    cd csp_tools/aws
    docker build -t nemofw-training-build:23.07-py3 .

  3. Run this command on the Docker image to create an Enroot image:

    enroot import --output nemo_megatron_training.sqsh dockerd://<image_name>:<tag>

  4. Move the .sqsh file to the root of NeMo-Megatron-Launcher.

  5. Set the container path in launcher_scripts/conf/config.yaml to the new Enroot image:

    container: /path/to/nemo_megatron_launcher/nemo_megatron_training.sqsh

  6. Copy the topology file contents for the target node type from aws-ofi-nccl and paste them into csp_tools/aws/topo.xml:

    • P5.48xlarge: https://github.com/aws/aws-ofi-nccl/blob/master/topology/p5.48xl-topo.xml

    • G5.48xlarge: https://github.com/aws/aws-ofi-nccl/blob/master/topology/g5.48xl-topo.xml

    • P4d.24xlarge: https://github.com/aws/aws-ofi-nccl/blob/master/topology/p4d-24xl-topo.xml

    • P4de.24xlarge: https://github.com/aws/aws-ofi-nccl/blob/master/topology/p4de-24xl-topo.xml
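As mentioned above, the following is a rough end-to-end sketch of steps 2 through 5 on the scheduler node. The image tag matches the build command in step 2; the launcher path is a placeholder for your actual checkout location.

    # Build the training image with the EFA driver and NCCL plugin
    cd csp_tools/aws
    docker build -t nemofw-training-build:23.07-py3 .

    # Convert the Docker image into an Enroot squashfs image
    enroot import --output nemo_megatron_training.sqsh dockerd://nemofw-training-build:23.07-py3

    # Move the .sqsh file to the root of NeMo-Megatron-Launcher (path is a placeholder)
    mv nemo_megatron_training.sqsh /path/to/nemo_megatron_launcher/

    # Finally, point the container path in launcher_scripts/conf/config.yaml at the new image:
    #   container: /path/to/nemo_megatron_launcher/nemo_megatron_training.sqsh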

Before running the cluster validation script, make sure that an NGC token has been added to ~/.config/enroot/.credentials on all nodes.
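For example, a netrc-style entry along the following lines is commonly used for NGC; the nvcr.io registry machine and the placeholder API key are assumptions to adapt to your setup.

    # Add an NGC token entry to Enroot's credentials file (assumed registry and format)
    mkdir -p ~/.config/enroot
    cat >> ~/.config/enroot/.credentials <<'EOF'
    machine nvcr.io login $oauthtoken password <NGC_API_KEY>
    EOF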

The cluster validation script at csp_tools/<csp>/cluster_validation.sh runs GPU diagnostics and tests NCCL node-to-node bus bandwidth. The logs from these tests are stored at results/cluster_validation. The script lists any nodes that fail these tests. Replace these nodes or restart them through the CSP UI.

Validation Script Usage

The script has three required configuration settings:

  • --nodes: the number of nodes

  • --nodelist: the list of node names

  • --partition: the Slurm partition that the nodes are assigned to

The values for these configurations must be in the same format as sinfo, as in this example.

PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
x8a100     up     infinite   8      idle   x8a100-[0000-0007]


To test all eight idle nodes, run the script like this.

bash cluster_validation.sh --nodes=8 --nodelist=x8a100-[0000-0007] --partition=x8a100


The script runs both the GPU diagnostics and the NCCL test by default. To run only one or the other, specify one of the following flags:

  • --dcgm: run GPU diagnostics only

  • --nccl: run NCCL test only

See bash cluster_validation.sh -h for more information.
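For example, to run only the NCCL test on two of the nodes from the sinfo example above (the node selection here is arbitrary):

    bash cluster_validation.sh --nodes=2 --nodelist=x8a100-[0000-0001] --partition=x8a100 --nccl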

Running tests manually

The cluster_validation.sh script is essentially a wrapper around the two Slurm job scripts in the CSP directories, and these jobs can also be run manually. Make sure to use the Slurm job scripts in the relevant CSP's path (csp_tools/<csp>/dcgmi_diag.sh and csp_tools/<csp>/nccl.sh).

For the GPU diagnostics job, provide these arguments when submitting the job to Slurm.

sbatch -p <partition> -w <node_list> -o <job_log_file> dcgmi_diag.sh
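For example, using the partition and node list from the sinfo example above (the log file name is arbitrary):

    sbatch -p x8a100 -w x8a100-[0000-0007] -o dcgmi_diag.log dcgmi_diag.sh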

For the NCCL test job, cluster_validation.sh performs a pair-wise sweep of the nodes, which is sufficient for validation, but a different number of nodes can also be used.

First build the test binaries.

sbatch -N 1 build-nccl-tests.sh

Then run a two-node all_reduce_perf job.

sbatch -w <node_1>,<node_2> -o <job_log_file> nccl.sh

To run the job with more nodes, simply add the node names to the -w flag in the same comma-separated list format.
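For example, a four-node run using node names expanded from the sinfo example above (the log file name is arbitrary):

    sbatch -w x8a100-0000,x8a100-0001,x8a100-0002,x8a100-0003 -o nccl_4node.log nccl.sh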

Before launching jobs, some changes must be made to the configuration.

Set NCCL Topology

The NCCL topology file is unique to each CSP and can be found in the CSP's folder (csp_tools/<csp>/topo.xml).

In launcher_scripts/conf/config.yaml, mount the directory containing the topology file.

container_mounts:
  - /path/to/nemo_megatron_launcher/csp_tools/<csp>/:/nccl

Then set the path of the file in the container.

env_vars:
  NCCL_TOPO_FILE: /nccl/topo.xml

Environment Variables

Some environment variables must be set to ensure correct behavior on CSPs. This can be done through config.yaml.

Azure Variables

Set these environment variables for Azure.

env_vars:
  UCX_IB_PCI_RELAXED_ORDERING: auto
  NCCL_IB_PCI_RELAXED_ORDERING: 2
  NCCL_IB_TIMEOUT: 22
  NCCL_DEBUG: INFO

AWS Variables

AWS recommends setting the following flag to avoid data corruption.

env_vars:
  NCCL_PROTO: simple

Setting this flag reduces training throughput by roughly 2%.
