Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Cloud Service Providers
Cluster Bring-Up
Common
For setting up a Slurm cluster for the NVIDIA NeMo™ framework, NVIDIA recommends Nephele. This cluster deployment tool has been tested on Azure and Oracle Cloud. Nephele can be hosted on a VM instance in any CSP. To get started:
Clone the Nephele repo.
Install the dependencies.
Provide CSP credentials in nephele.conf.
Change REPLICAS_x8a100 in nephele.conf to the desired number of nodes.
Finally, run ./nephele init and ./nephele create.
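Put together, the sequence looks roughly like this (the repository URL is a placeholder for illustration; use the actual location of the Nephele repo):
git clone <nephele-repo-url>
cd nephele
# edit nephele.conf: add CSP credentials and set REPLICAS_x8a100
./nephele init
./nephele create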
NVIDIA also recommends mounting an external persistent NFS once the cluster is up and running (ensure it is mounted on all nodes) and using this to configure and run the NeMo Framework.
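As a sketch, assuming an NFS server at 10.0.0.10 exporting /nemo-workspace (both placeholders), the share could be mounted on each node like this:
sudo mkdir -p /mnt/nemo-workspace
sudo mount -t nfs 10.0.0.10:/nemo-workspace /mnt/nemo-workspace
# optionally persist the mount across reboots
echo "10.0.0.10:/nemo-workspace /mnt/nemo-workspace nfs defaults 0 0" | sudo tee -a /etc/fstab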
The steps above apply to all CSPs, including Azure and OCI. Some modifications are necessary for OCI, as detailed below. Note that for OCI, a custom image must be imported before running ./nephele create.
OCI
The NeMo Framework supports running training and inference containers on OCI. For more details about orchestration scripts, reach out to oci_nm@nvidia.com.
GCP
To use Slurm on Google Cloud, NVIDIA recommends the Google Cloud HPC Toolkit. For those using Kubernetes, NVIDIA recommends Google Kubernetes Engine (GKE). For more information about configuring NeMo on GCP, go here or contact us at gcp-nemo@google.com.
AWS
For AWS, NVIDIA recommends using ParallelCluster. You can find the details on how to launch ParallelCluster for running training with the NeMo Megatron Launcher in this AWS sample: aws-samples/awsome-distributed-training/nemo-launcher. A blog post will be published soon; we will link it here once it becomes available.
To launch jobs on AWS, the EFA driver and NCCL plugin must first be installed on top of the training container. NVIDIA recommends building a new container image with Docker, then creating an Enroot image. The Enroot image will be a squashfs file (nemo_megatron_training.sqsh) equivalent to the Docker image and can be used with the Slurm cluster. For more information on Enroot images, see the Enroot GitHub docs.
On the scheduler node:
Install Docker.
Build the image with EFA drivers and the NCCL plugin from csp_tools/aws/Dockerfile:
cd csp_tools/aws
docker build -t nemofw-training-build:23.07-py3 .
Run this command on the Docker image to create an Enroot image:
enroot import --output nemo_megatron_training.sqsh dockerd://<image_name>:<tag>
Move the .sqsh file to the root of NeMo-Framework-Launcher.
Set the container path in launcher_scripts/conf/config.yaml to the new Enroot image:
container: /path/to/nemo_megatron_launcher/nemo_megatron_training.sqsh
Copy the topology file contents for the target node from aws-ofi-nccl and paste them into csp_tools/aws/topo.xml:
P5.48xlarge - p5.48xl-topo.xml
G5.48xlarge - g5.48xl-topo.xml
P4d.24xlarge - p4d-24xl-topo.xml
P4de.24xlarge - p4de-24xl-topo.xml
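As an optional sanity check before creating the Enroot image, and assuming the EFA installer used in the Dockerfile places the libfabric tools on the PATH, you can confirm they are present in the new image:
docker run --rm nemofw-training-build:23.07-py3 fi_info --version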
Cluster Validation
Before running the cluster validation script, make sure that an NGC token has been added to ~/.config/enroot/.credentials
on all nodes.
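A minimal sketch of adding the token (the API key value is a placeholder); Enroot reads this netrc-style file to authenticate against nvcr.io:
mkdir -p ~/.config/enroot
cat >> ~/.config/enroot/.credentials <<'EOF'
machine nvcr.io login $oauthtoken password <NGC_API_KEY>
EOF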
The cluster validation script at csp_tools/<csp>/cluster_validation.sh
runs GPU diagnostics and tests NCCL node-to-node bus bandwidth. The logs from these tests are stored at results/cluster_validation
. The script lists any nodes that fail these tests. Replace these nodes or restart them through the CSP UI.
Validation Script Usage
The script has three required configuration settings:
--nodes: the number of nodes
--nodelist: the list of node names
--partition: the Slurm partition that the nodes are assigned to
The values for these configurations must be in the same format as sinfo, as in this example:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
x8a100 up infinite 8 idle x8a100-[0000-0007]
To test all eight idle nodes, run the script like this.
bash cluster_validation.sh --nodes=8 --nodelist=x8a100-[0000-0007] --partition=x8a100
The script runs both the GPU diagnostics and the NCCL test by default. To run only one or the other, specify one of the following flags:
--dcgm: run GPU diagnostics only
--nccl: run NCCL test only
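For example, to re-run only the NCCL bandwidth test on two of the nodes from the sinfo example above:
bash cluster_validation.sh --nodes=2 --nodelist=x8a100-[0000-0001] --partition=x8a100 --nccl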
See bash cluster_validation.sh -h
for more information.
Running tests manually
The cluster_validation.sh
script is essentially a wrapper for the two Slurm job scripts in the CSP directories. These jobs can also be run manually. Make sure to use the Slurm job script in the relevant CSP’s path (csp_tools/<csp>/dcgmi_diag.sh
and csp_tools/<csp>/nccl.sh
).
For the GPU diagnostics job, provide these arguments when submitting the job to Slurm.
sbatch -p <partition> -w <node_list> -o <job_log_file> dcgmi_diag.sh
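For example, using the partition and nodes from the sinfo example above (the log file name is arbitrary):
sbatch -p x8a100 -w x8a100-[0000-0007] -o dcgmi_diag_%j.out dcgmi_diag.sh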
For the NCCL test job, cluster_validation.sh performs a pairwise sweep of the nodes, which is sufficient for validation, but a different number of nodes can also be used.
First build the test binaries.
sbatch -N 1 build-nccl-tests.sh
Then, run a two-node all_reduce_perf job:
sbatch -w <node_1>,<node_2> -o <job_log_file> nccl.sh
To run the job with more nodes, simply add the node names to the -w
flag in the same comma-separated list format.
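For example, a four-node run might look like this (node names are illustrative):
sbatch -w x8a100-0000,x8a100-0001,x8a100-0002,x8a100-0003 -o nccl_4node_%j.out nccl.sh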
Configuration Changes
Before launching jobs, some changes must be made to the configuration.
Set NCCL Topology
The NCCL topology file is unique for each CSP and can be found in the CSP's folder (csp_tools/<csp>/topo.xml).
In launcher_scripts/conf/config.yaml
, mount the directory containing the topology file.
container_mounts:
  - /path/to/nemo_megatron_launcher/csp_tools/<csp>/:/nccl
Then set the path of the file in the container.
env_vars:
  NCCL_TOPO_FILE: /nccl/topo.xml
Environment Variables
Some environment variables must be set to ensure correct behavior on CSPs. This can be done through config.yaml.
Azure Variables
Set these environment variables for Azure.
env_vars:
  UCX_IB_PCI_RELAXED_ORDERING: auto
  NCCL_IB_PCI_RELAXED_ORDERING: 2
  NCCL_IB_TIMEOUT: 22
  NCCL_DEBUG: INFO
AWS Variables
AWS recommends setting the following flag to avoid data corruption.
env_vars:
  NCCL_PROTO: simple
Setting this flag reduces training throughput by roughly 2%.