Cloud Service Providers
=======================
For setting up a Slurm cluster for the NVIDIA NeMo™ framework, NVIDIA recommends Nephele. This cluster deployment tool has been tested on Azure and Oracle Cloud. Nephele can be hosted on a VM instance in any CSP. To get started:
#. Clone the Nephele repo.
#. Install the dependencies.
#. Provide CSP credentials in ``nephele.conf`` and set the desired number of nodes.
#. Run ``./nephele init`` and ``./nephele create``.
NVIDIA also recommends mounting an external persistent NFS once the cluster is up and running (ensure it is mounted on all nodes) and using this to configure and run the NeMo framework.
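One way to make the NFS mount persistent on every node is an ``/etc/fstab`` entry. This is only a sketch; the server address, export path, and mount point below are placeholders for your deployment:

```
# Hypothetical NFS server and export -- adjust to your environment
10.0.0.5:/export/nemo  /mnt/nemo  nfs  defaults  0  0
```

With this entry in place on each node, ``mount -a`` (or a reboot) brings the shared directory up, and the NeMo framework configuration and results can live under ``/mnt/nemo``.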
The steps above apply to all CSPs, including Azure and OCI. Some modifications are necessary for OCI, as detailed below. Note that for OCI, a custom image must be imported, which should be done before running ``./nephele init``.
The NeMo framework supports running training and inference containers on OCI. For more details about orchestration scripts, reach out to firstname.lastname@example.org.
For AWS, NVIDIA recommends using ParallelCluster. Details on how to launch ParallelCluster for training with the NeMo Megatron Launcher are available in this AWS sample: `aws-samples/awsome-distributed-training/nemo-launcher <https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher>`__. A blog post will be published as well; we will link it here once it becomes available.
To launch jobs on AWS, the EFA driver and NCCL plugin must first be installed on top of the training container. NVIDIA recommends building a new container image with Docker, then creating an Enroot image. The Enroot image is a squashfs file (``nemo_megatron_training.sqsh``) equivalent to the Docker image and can be used with the Slurm cluster. For more information on Enroot images, see the Enroot GitHub docs.
On the scheduler node:
#. Build the image with the EFA drivers and NCCL plugin from ``csp_tools/aws/Dockerfile``:

   .. code-block:: bash

      cd csp_tools/aws
      docker build -t nemofw-training-build:23.07-py3 .
#. Run this command on the Docker image to create an Enroot image:

   .. code-block:: bash

      enroot import --output nemo_megatron_training.sqsh dockerd://<image_name>:<tag>
#. Move the ``.sqsh`` file to the root of NeMo-Megatron-Launcher.
#. Set the container path in ``launcher_scripts/conf/config.yaml`` to the new Enroot image:
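A sketch of the corresponding ``config.yaml`` entry, assuming the ``.sqsh`` file sits at the launcher root (adjust the path to wherever you placed it):

```yaml
# Point the launcher at the local Enroot image instead of a registry image
container: /path/to/nemo_megatron_training.sqsh
```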
#. Copy the topology file contents for the target node type from aws-ofi-nccl and paste them into ``csp_tools/aws/topo.xml``:

   #. P5.48xlarge - `p5.48xl-topo.xml <https://github.com/aws/aws-ofi-nccl/blob/master/topology/p5.48xl-topo.xml>`__
   #. G5.48xlarge - `g5.48xl-topo.xml <https://github.com/aws/aws-ofi-nccl/blob/master/topology/g5.48xl-topo.xml>`__
   #. P4d.24xlarge - `p4d-24xl-topo.xml <https://github.com/aws/aws-ofi-nccl/blob/master/topology/p4d-24xl-topo.xml>`__
   #. P4de.24xlarge - `p4de-24xl-topo.xml <https://github.com/aws/aws-ofi-nccl/blob/master/topology/p4de-24xl-topo.xml>`__
Before running the cluster validation script, make sure an NGC token has been added to ``~/.config/enroot/.credentials`` on all nodes.
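Enroot reads credentials in a netrc-like format. A sketch of the entry for NGC's registry, with ``<NGC_API_KEY>`` as a placeholder for your token:

```
machine nvcr.io login $oauthtoken password <NGC_API_KEY>
```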
The cluster validation script at ``csp_tools/<csp>/cluster_validation.sh`` runs GPU diagnostics and tests NCCL node-to-node bus bandwidth. The logs from these tests are stored at ``results/cluster_validation``. The script lists any nodes that fail these tests; replace those nodes or restart them through the CSP UI.
Validation Script Usage
-----------------------
The script has three required configuration settings:
- ``--nodes``: the number of nodes
- ``--nodelist``: the list of node names
- ``--partition``: the Slurm partition that the nodes are assigned to
The values for these configurations must be in the same format as ``sinfo``, as in this example:

.. code-block:: text

   PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
   x8a100       up   infinite      8   idle x8a100-[0000-0007]
To test all eight idle nodes, run the script like this:

.. code-block:: bash

   bash cluster_validation.sh --nodes=8 --nodelist=x8a100-[0000-0007] --partition=x8a100
The script runs both the GPU diagnostics and the NCCL test by default. To run only one or the other, specify one of the following flags:
- ``--dcgm``: run GPU diagnostics only
- ``--nccl``: run NCCL test only
Run ``bash cluster_validation.sh -h`` for more information.
Running tests manually
----------------------
The ``cluster_validation.sh`` script is essentially a wrapper around the two Slurm job scripts in the CSP directories. These jobs can also be run manually. Make sure to use the Slurm job scripts in the relevant CSP's directory (``csp_tools/<csp>/``).
For the GPU diagnostics job, provide these arguments when submitting the job to Slurm:

.. code-block:: bash

   sbatch -p <partition> -w <node_list> -o <job_log_file> dcgmi_diag.sh
For the NCCL test job, ``cluster_validation.sh`` performs a pairwise sweep of the nodes, which is a sufficient test, but a different number of nodes can also be used.
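As an illustration of what a pairwise sweep means, the hypothetical helper below prints one two-node ``nccl.sh`` submission per unordered pair of nodes. It is not part of ``cluster_validation.sh``, and the node names are examples:

```shell
# Print an sbatch command for every unordered pair of nodes.
# Hypothetical helper for illustration only -- not part of cluster_validation.sh.
pairwise_nccl_cmds() {
  while [ "$#" -gt 1 ]; do
    first=$1
    shift
    for other in "$@"; do
      # Each pair gets its own two-node NCCL job and log file.
      echo "sbatch -w ${first},${other} -o nccl_${first}_${other}.log nccl.sh"
    done
  done
}

# Example: three nodes yield three pairs.
pairwise_nccl_cmds x8a100-0000 x8a100-0001 x8a100-0002
```

Piping the printed commands to ``sh`` (after the test binaries are built) would submit the whole sweep.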
First, build the test binaries:

.. code-block:: bash

   sbatch -N 1 build-nccl-tests.sh
Then, to run the NCCL test on two nodes:

.. code-block:: bash

   sbatch -w <node_1>,<node_2> -o <job_log_file> nccl.sh
To run the job with more nodes, add the node names to the ``-w`` flag in the same comma-separated list format.
Before launching jobs, some changes must be made to the configuration.
Set NCCL Topology
-----------------
The NCCL topology file is unique to each CSP and can be found in the CSP's folder (``csp_tools/<csp>/``). In ``launcher_scripts/conf/config.yaml``, mount the directory containing the topology file:
.. code-block:: yaml

   container_mounts:
     - /path/to/nemo_megatron_launcher/csp_tools/<csp>/:/nccl
Then set the path of the file in the container:

.. code-block:: yaml

   env_vars:
     NCCL_TOPO_FILE: /nccl/topo.xml
Some environment variables must be set to ensure correct behavior on CSPs. This can be done through ``launcher_scripts/conf/config.yaml``.
Set these environment variables for Azure:

.. code-block:: yaml

   env_vars:
     UCX_IB_PCI_RELAXED_ORDERING: auto
     NCCL_IB_PCI_RELAXED_ORDERING: 2
     NCCL_IB_TIMEOUT: 22
     NCCL_DEBUG: INFO
AWS recommends setting the following flag to avoid data corruption:

.. code-block:: yaml

   env_vars:
     NCCL_PROTO: simple
Setting this flag reduces training throughput by roughly 2%.