Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Cloud Service Providers
Cluster Bring-Up
Common
For setting up a Slurm cluster for the NVIDIA NeMo™ framework, NVIDIA recommends Nephele. This cluster deployment tool has been tested on Azure and Oracle Cloud. Nephele can be hosted on a VM instance in any CSP. To get started:
Clone the Nephele repo.
Install the dependencies.
Provide CSP credentials in nephele.conf.
Change REPLICAS_x8a100 in nephele.conf to the desired number of nodes.
Finally, run ./nephele init and ./nephele create.
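Put together, the sequence looks roughly like this (the repository URL is a placeholder for illustration; use the actual location of the Nephele repo):
git clone <nephele-repo-url>
cd nephele
# edit nephele.conf: add CSP credentials and set REPLICAS_x8a100
./nephele init
./nephele create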
NVIDIA also recommends mounting an external persistent NFS once the cluster is up and running (ensure it is mounted on all nodes) and using this to configure and run the NeMo Framework.
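As a sketch, assuming an NFS server at 10.0.0.10 exporting /nemo-workspace (both placeholders), the share could be mounted on each node like this:
sudo mkdir -p /mnt/nemo-workspace
sudo mount -t nfs 10.0.0.10:/nemo-workspace /mnt/nemo-workspace
# optionally persist the mount across reboots
echo "10.0.0.10:/nemo-workspace /mnt/nemo-workspace nfs defaults 0 0" | sudo tee -a /etc/fstab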
The steps above apply to all CSPs, including Azure and OCI. Some modifications are necessary for OCI, as detailed below. Note that for OCI, a custom image must be imported before running ./nephele create.
OCI
The NeMo Framework supports running training and inference containers on OCI. For more details about orchestration scripts, reach out to oci_nm@nvidia.com.
GCP
To use Slurm on Google Cloud, NVIDIA recommends the Google Cloud HPC Toolkit. For those using Kubernetes, NVIDIA recommends Google Kubernetes Engine (GKE). For more information about configuring NeMo on GCP, go here or contact us at gcp-nemo@google.com.
AWS
For AWS, NVIDIA recommends using ParallelCluster. You can find the details on how to launch ParallelCluster for running training with the NeMo Megatron Launcher in this AWS sample: aws-samples/awsome-distributed-training/nemo-launcher. A blog post will be published soon; we will link it here once it becomes available.
To launch jobs on AWS, the EFA driver and NCCL plugin must first be installed on top of the training container. NVIDIA recommends building a new container image with Docker, then creating an Enroot image. The Enroot image will be a squashfs file (nemo_megatron_training.sqsh) equivalent to the Docker image and can be used with the Slurm cluster. For more information on Enroot images, see the Enroot GitHub docs.
On the scheduler node:
Install Docker.
Build the image with EFA drivers and the NCCL plugin from csp_tools/aws/Dockerfile:
cd csp_tools/aws
docker build -t nemofw-training-build:23.07-py3 .
Run this command on the Docker image to create an Enroot image:
enroot import --output nemo_megatron_training.sqsh dockerd://<image_name>:<tag>
Move the .sqsh file to the root of NeMo-Framework-Launcher.
Set the container path in launcher_scripts/conf/config.yaml to the new Enroot image:
container: /path/to/nemo_megatron_launcher/nemo_megatron_training.sqsh
Copy the topology file contents for the target node from aws-ofi-nccl and paste them into csp_tools/aws/topo.xml:
P5.48xlarge - p5.48xl-topo.xml
G5.48xlarge - g5.48xl-topo.xml
P4d.24xlarge - p4d-24xl-topo.xml
P4de.24xlarge - p4de-24xl-topo.xml
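As an optional sanity check before creating the Enroot image, and assuming the EFA installer used in the Dockerfile places the libfabric tools on the PATH, you can confirm they are present in the new image:
docker run --rm nemofw-training-build:23.07-py3 fi_info --version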
Cluster Validation
Before running the cluster validation script, make sure that an NGC token has been added to ~/.config/enroot/.credentials
on all nodes.
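A minimal sketch of adding the token (the API key value is a placeholder); Enroot reads this netrc-style file to authenticate against nvcr.io:
mkdir -p ~/.config/enroot
cat >> ~/.config/enroot/.credentials <<'EOF'
machine nvcr.io login $oauthtoken password <NGC_API_KEY>
EOF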
The cluster validation script at csp_tools/<csp>/cluster_validation.sh
runs GPU diagnostics and tests NCCL node-to-node bus bandwidth. The logs from these tests are stored at results/cluster_validation
. The script lists any nodes that fail these tests. Replace these nodes or restart them through the CSP UI.
Validation Script Usage
The script has three required configuration settings:
--nodes: the number of nodes
--nodelist: the list of node names
--partition: the Slurm partition that the nodes are assigned to
The values for these configurations must be in the same format as sinfo, as in this example:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
x8a100 up infinite 8 idle x8a100-[0000-0007]
To test all eight idle nodes, run the script like this.
bash cluster_validation.sh --nodes=8 --nodelist=x8a100-[0000-0007] --partition=x8a100
The script runs both the GPU diagnostics and the NCCL test by default. To run only one or the other, specify one of the following flags:
--dcgm: run GPU diagnostics only
--nccl: run NCCL test only
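For example, to re-run only the NCCL bandwidth test on two of the nodes from the sinfo example above:
bash cluster_validation.sh --nodes=2 --nodelist=x8a100-[0000-0001] --partition=x8a100 --nccl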
See bash cluster_validation.sh -h
for more information.
Running tests manually
The cluster_validation.sh
script is essentially a wrapper for the two Slurm job scripts in the CSP directories. These jobs can also be run manually. Make sure to use the Slurm job script in the relevant CSP’s path (csp_tools/<csp>/dcgmi_diag.sh
and csp_tools/<csp>/nccl.sh
).
For the GPU diagnostics job, provide these arguments when submitting the job to Slurm.
sbatch -p <partition> -w <node_list> -o <job_log_file> dcgmi_diag.sh
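For example, using the partition and nodes from the sinfo example above (the log file name is arbitrary):
sbatch -p x8a100 -w x8a100-[0000-0007] -o dcgmi_diag_%j.out dcgmi_diag.sh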
For the NCCL test job, cluster_validation.sh performs a pairwise sweep of the nodes, which is sufficient for validation, but a different number of nodes can also be used.
First build the test binaries.
sbatch -N 1 build-nccl-tests.sh
Then, run a two-node all_reduce_perf job:
sbatch -w <node_1>,<node_2> -o <job_log_file> nccl.sh
To run the job with more nodes, simply add the node names to the -w
flag in the same comma-separated list format.
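For example, a four-node run might look like this (node names are illustrative):
sbatch -w x8a100-0000,x8a100-0001,x8a100-0002,x8a100-0003 -o nccl_4node_%j.out nccl.sh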
Configuration Changes
Before launching jobs, some changes must be made to the configuration.
Set NCCL Topology
The NCCL topology file is unique for each CSP and can be found in the CSP's folder (csp_tools/<csp>/topo.xml).
In launcher_scripts/conf/config.yaml
, mount the directory containing the topology file.
container_mounts:
  - /path/to/nemo_megatron_launcher/csp_tools/<csp>/:/nccl
Then set the path of the file in the container.
env_vars:
  NCCL_TOPO_FILE: /nccl/topo.xml
Environment Variables
Some environment variables must be set to ensure correct behavior on CSPs. This can be done through config.yaml.
Azure Variables
Set these environment variables for Azure.
env_vars:
  UCX_IB_PCI_RELAXED_ORDERING: auto
  NCCL_IB_PCI_RELAXED_ORDERING: 2
  NCCL_IB_TIMEOUT: 22
  NCCL_DEBUG: INFO
AWS Variables
AWS recommends setting the following flag to avoid data corruption.
env_vars:
  NCCL_PROTO: simple
Setting this flag reduces training throughput by roughly 2%.