1. PyTorch and HuggingFace Accelerate with DeepSpeed on DGX Cloud

1.1. Overview

This guide introduces how to fine-tune a multilingual NMT model, namely ALMA (Advanced Language Model-based TrAnslator), on DGX Cloud. ALMA is an LLM-based many-to-many translation model. Compared to traditional encoder-decoder models, ALMA is a decoder-only model: starting from the pretrained Llama 2 model, it is first continually trained on monolingual datasets (stage 1) and then LoRA fine-tuned on a high-quality parallel dataset (stage 2), achieving excellent performance on multilingual translation tasks.

The ALMA project uses the PyTorch, Hugging Face Accelerate, and DeepSpeed libraries for training. These libraries are widely used in large-scale AI projects with multi-GPU, multi-node (MGMN) requirements, which makes ALMA representative of other popular community projects that can be run on DGX Cloud. This documentation covers:

  • Preparing the training container based on an NGC PyTorch image

  • Setting up the cluster workspace

  • Cloning the ALMA repo

  • Composing the Slurm batch script and modifying training parameters

  • Executing the LoRA fine-tuning task

  • Checking multi-node multi-GPU functionality

For the LoRA fine-tuning, the translation datasets are already included in the GitHub repo, so data preparation and cleaning are not strictly necessary.

1.2. Prerequisites

Please check that the following prerequisites are met before using this document.

  1. A DGX Cloud Slurm cluster is provisioned, and the user can launch and run jobs on the cluster (root permissions are not necessary).

  2. The user has access to at least two A100 or H100-based compute nodes on the cluster that can run jobs with high-speed multi-node compute networks.

  3. The user has read/write access to at least 100 GB of shared storage that is mounted on all nodes in the cluster and available within jobs (see the example check after this list). To identify the shared storage path, consult your cluster administrator. The path is represented by <SHARED_STORAGE_ROOT> in this document; an example might be /lustre/fs0/scratch/demo-user.

  4. The cluster has external internet access on all nodes to download datasets and pre-trained models.
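
A quick way to confirm that the shared storage path exists and has sufficient free space is to check it from a login node (a minimal check; replace the placeholder with your actual path):

df -h <SHARED_STORAGE_ROOT>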

1.2.1. Preparing a Customized Container Image

The Slurm implementation in your DGX Cloud cluster leverages Pyxis and Enroot as the container plugin and runtime. This is relevant because this section explains how to build a customized environment into a container image for use on BCM-based clusters with Slurm and Pyxis/Enroot. The build process should be completed in a local Linux environment. By building a containerized environment, we obtain a frozen environment for model training, without concerns about package compatibility.

1.2.1.1. Creating a Container Image in a Local Linux Machine

To build a customized container environment for ALMA, we need to install the Docker engine, an open source containerization technology for building and running applications, on the local machine. Once Docker is installed, we can pull container images and build a container environment that supports ALMA model training by providing a list of instructions to assemble an image. First, create a Dockerfile with the following content:

# Dockerfile
FROM nvcr.io/nvidia/pytorch:24.01-py3

RUN cd /opt && git clone https://github.com/fe1ixxu/ALMA.git
RUN cd /opt/ALMA \
      && git checkout 83ee22f \
      && bash install_alma.sh \
      && pip install transformers==4.30.0

In this Dockerfile, we select a PyTorch container image from NGC as the base image (nvcr.io/nvidia/pytorch:24.01-py3). Next, we pull the ALMA repository from GitHub and check out a fixed commit. Finally, we run the repository's installation script and install the Hugging Face transformers library at a pinned version. Execute the following command in the same folder as the Dockerfile, and a container image alma:24.01 will be generated. Note that the trailing “dot” in the command is necessary.

docker build -t alma:24.01 .

Once the container build is complete, we can execute the following command to list the container image; the output should look similar to this.

docker images | grep alma
REPOSITORY   TAG       IMAGE ID       CREATED          SIZE
alma         24.01     2dba738ae464   35 minutes ago   22.5GB
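
Optionally, you can run a quick smoke test of the image locally to confirm that the ALMA repository and the pinned transformers version are present (a minimal sketch; no GPU is required for this import check):

docker run --rm alma:24.01 \
      bash -c "ls /opt/ALMA && python -c 'import transformers; print(transformers.__version__)'"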

1.2.1.2. Setting Up Your Cluster Workspace

Before model training, we need to prepare the scripts, the dataset, and the containerized environment detailed in the previous step. For details on container and Enroot setup, refer to Setting Up NGC Integration in the Cluster User Guide.

From here, we have two options to use this image in the Slurm cluster.

1.2.1.2.1. Option 1: Push to an Accessible Container Registry

A container registry is a repository, or collection of repositories, used to store container images; the NGC Catalog is one example. If we have a container registry (denoted <YOUR_REGISTRY>) that is accessible from the cluster and that we have permission to upload to, we can push the image to it so that Pyxis/Enroot in the Slurm cluster can pull the image. If the registry requires authentication, execute docker login <YOUR_REGISTRY> and log in with your credentials first. Next, tag the image and push it to your registry. Note that you may need to tag the container image with a different name or an additional repository path. For brevity, we use <YOUR_REGISTRY>/alma:24.01 from now on.

docker tag alma:24.01 <YOUR_REGISTRY>/alma:24.01
docker push <YOUR_REGISTRY>/alma:24.01

1.2.1.2.2. Option 2: Convert to a SquashFS File and Upload to the Slurm Cluster

If uploading our container image to a container registry is not an option, we will need to convert the image into a SquashFS file using NVIDIA Enroot, a tool that turns traditional container/OS images into unprivileged sandboxes. First, check the version of Enroot in the cluster: log in to a cluster login node and execute enroot version (3.4.1 as of March 11, 2024).
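
For reference, the check and its output looked like the following at the time of writing (your cluster's version may differ):

enroot version
# 3.4.1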

Next, follow the instructions here to install the Enroot version on your local machine that matches the version in the cluster (to ensure compatibility). Once installation is complete, we can convert the alma:24.01 image into a SquashFS file alma-24.01.sqsh on our local machine with the following command.

enroot import -o alma-24.01.sqsh dockerd://alma:24.01

Once the conversion is completed, we can upload the SquashFS file to the Slurm cluster using one of the methods described in the Cluster User Guide Moving Data from Your Local Workstation to DGX Cloud. Note that the final destination of the SquashFS file in the Slurm cluster must be in a shared file system so that all compute nodes can access it when the distributed workload is launched. For example, if scp is used to upload the SquashFS file here, we can execute the following command in the local machine:

scp alma-24.01.sqsh \
      <USERNAME>@<LOGIN_NODE>:<SHARED_STORAGE_ROOT>/

where <USERNAME> is the user name in the cluster and <LOGIN_NODE> is the node address of a login node in the DGX Cloud cluster.
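
After the transfer completes, you can optionally confirm that the file arrived in the shared storage by listing it from a login node (the size shown will vary):

ls -lh <SHARED_STORAGE_ROOT>/alma-24.01.sqsh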

1.3. Running on Slurm

This section covers cluster workspace preparation, Slurm batch script configuration, and checking multi-node training functionality.

1.3.1. Enabling Slurm Commands

If Slurm commands are not enabled yet, please execute the following command.

module load slurm
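
To confirm that the Slurm commands are available and to see which partitions you can submit jobs to, a quick check such as the following can be used (output varies by cluster):

sinfo
squeue -u $USER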

1.3.2. Pulling Code From the ALMA Repository

The ALMA dataset and training scripts can be pulled directly from GitHub on the login node. The destination of the cloned repository must be on a shared file system; we use <SHARED_STORAGE_ROOT> in this case.

To pull the ALMA repository with compatible contents, first navigate to the user shared storage, clone the repository, and then check out the same commit used when building the image.

cd <SHARED_STORAGE_ROOT>
git clone https://github.com/fe1ixxu/ALMA.git
cd ALMA
git checkout 83ee22f
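
Optionally, confirm that the bundled translation data referenced by the batch script below (./human_written_data/) is present in the clone; the exact subdirectory layout depends on the commit:

ls <SHARED_STORAGE_ROOT>/ALMA/human_written_data/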

1.3.3. Running the LoRA Batch Script

Now we can prepare our batch script to submit the training workload. Save the following script as <SHARED_STORAGE_ROOT>/alma-launcher.sh. Note that the #SBATCH lines are parameters interpreted by the sbatch tool (not comments) and should be modified for your user account, number of nodes, and target Slurm partition.

#!/bin/bash

# Parameters
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --job-name=alma-training
#SBATCH --error=%x-%j.err
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --mem=0
#SBATCH --nodes=<N_NODES>
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x-%j.out
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --time=01:00:00

# setup
export TRANSFORMERS_OFFLINE=0
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0
export NCCL_ASYNC_ERROR_HANDLING=1

# Additional settings for DGX Cloud
export OMPI_MCA_coll_hcoll_enable=0
export UCX_TLS=tcp
export UCX_NET_DEVICES=eth0
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
export NCCL_DEBUG=INFO
export NCCL_PROTO=LL,LL128,Simple
export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS
export MELLANOX_VISIBLE_DEVICES=all
export PMIX_MCA_gds=hash
export PMIX_MCA_psec=native

export SHARED_STORAGE_ROOT=<SHARED_STORAGE_ROOT>

export CONTAINER_IMAGE=<IMAGE_NAME_OR_PATH>

export HF_DATASETS_CACHE=".cache/huggingface_cache/datasets"
export TRANSFORMERS_CACHE=".cache/models/"

# random port between 30000 and 50000
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
export GPUS_PER_NODE=$SLURM_GPUS_PER_NODE
export NNODES=$SLURM_NNODES
export NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)

echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"

export OUTPUT_DIR=/workspace/output

# Language pair option
# Possible options are de-en,cs-en,is-en,zh-en,ru-en,en-de,en-cs,en-is,en-zh,en-ru
export PAIRS=zh-en

export LORA_RANK=16

srun -l --container-image $CONTAINER_IMAGE \
      --container-mounts <SHARED_STORAGE_ROOT>/ALMA:/workspace \
      --container-workdir /workspace \
      --no-container-mount-home \
      bash -c 'echo "Node ID $SLURM_NODEID"; accelerate launch \
      --main_process_ip ${MASTER_ADDR} \
      --main_process_port ${MASTER_PORT} \
      --machine_rank $SLURM_NODEID \
      --num_processes $NUM_PROCESSES \
      --num_machines $NNODES \
      --use_deepspeed \
      --main_training_function main \
      --mixed_precision fp16 \
      --rdzv_backend static \
      --deepspeed_multinode_launcher standard \
      --same_network \
      --gradient_accumulation_steps 10 \
      --gradient_clipping 1.0 \
      --offload_optimizer_device none \
      --offload_param_device cpu \
      --zero3_init_flag false \
      --zero_stage 2 \
run_llmmt.py \
--model_name_or_path haoranxu/ALMA-7B-Pretrain \
--mmt_data_path ./human_written_data/ \
--use_peft \
--lora_rank ${LORA_RANK} \
--do_train \
--do_eval \
--language_pairs ${PAIRS} \
--load_best_model_at_end \
--low_cpu_mem_usage \
--fp16 \
--learning_rate 2e-3 \
--weight_decay 0.01 \
--gradient_accumulation_steps 10 \
--lr_scheduler_type inverse_sqrt \
--warmup_ratio 0.01 \
--ignore_pad_token_for_loss \
--ignore_prompt_token_for_loss \
--per_device_train_batch_size 12 \
--per_device_eval_batch_size 4 \
--evaluation_strategy steps \
--eval_steps 512 \
--save_strategy steps \
--save_steps 512 \
--save_total_limit 1 \
--logging_strategy steps \
--logging_steps 1 \
--output_dir ${OUTPUT_DIR} \
--num_train_epochs 8 \
--predict_with_generate \
--prediction_loss_only \
--max_new_tokens 256 \
--max_source_length 256 \
--seed 42 \
--overwrite_output_dir \
--num_beams 5 \
--ddp_timeout 999999 \
--report_to none \
--overwrite_cache'

In this job we only check the multi-node training functionality of Hugging Face Accelerate with DeepSpeed. Depending on your Slurm account, how the container image was prepared, and your resource preferences, the following variables must be configured in the batch script (example substitutions are shown after the list).

  • <SLURM_ACCOUNT>: The Slurm account to use for your project. Consult your cluster administrator or project manager to determine which account name to use.

  • <SHARED_STORAGE_ROOT>: The root path of user shared storage defined in earlier sections.

  • <N_NODES>: Number of nodes to be used for this training job. The job is very small and has been tested with 1 and 2 nodes.

  • <SLURM_PARTITION>: The Slurm partition(s) to use for this job. Slurm partitions are defined by cluster administrators and designated for different purposes or accounts. Note that a partition with multi-node job and GPU support must be selected.

  • <IMAGE_NAME_OR_PATH>: The container image name or path to be used in Slurm with Pyxis/Enroot. This depends on the option in the previous section on container image build.

    • If Option 1 is used, we can replace it with <YOUR_REGISTRY>/alma:24.01

    • If Option 2 is used, we will replace it with our upload destination <SHARED_STORAGE_ROOT>/alma-24.01.sqsh
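
For reference, a filled-in set of these variables might look like the following. The storage path matches the example in the Prerequisites and the partition matches the squeue output shown later; the account name is a placeholder for your own.

# Example substitutions (illustrative values; adjust for your cluster)
#SBATCH --account=demo-account
#SBATCH --nodes=2
#SBATCH --partition=polar2

export SHARED_STORAGE_ROOT=/lustre/fs0/scratch/demo-user
# Option 2 (SquashFS file uploaded to shared storage):
export CONTAINER_IMAGE=/lustre/fs0/scratch/demo-user/alma-24.01.sqsh
# Option 1 (container registry) would instead be:
# export CONTAINER_IMAGE=<YOUR_REGISTRY>/alma:24.01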

Once all variables are set, we can submit the job with the following command from the login node. In our test with a single 8×A100 node, the job takes about 30 minutes to complete a run.

cd <SHARED_STORAGE_ROOT>
sbatch alma-launcher.sh
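
If the submission succeeds, sbatch prints the assigned job ID, which also appears in the log file names (the ID below is illustrative):

Submitted batch job 531544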

1.3.4. Monitoring the Job

Once the job is submitted, its status can be viewed with squeue. To list only your own jobs, use squeue -u $USER. The output should look similar to the following.

squeue -u $USER

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
531544    polar2 alma-tra chenghan  R       0:43      1 batch-block2-1079

As designated in the batch script, the job logs can be found at <SHARED_STORAGE_ROOT>/alma-training-NNN.out and <SHARED_STORAGE_ROOT>/alma-training-NNN.err, where NNN is the job ID and the .out and .err files capture stdout and stderr, respectively. We can use tail -f <filename> to view live updates of a log file. Because we added the -l option to the srun command, each log line is prefixed with its task number, which helps identify which process produced each line in a multi-task job. For example, if <N_NODES> is set to 2, the log files will contain lines prefixed with 0: and 1:.
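
For example, to follow the stdout log of the job shown above (the job ID is illustrative):

tail -f <SHARED_STORAGE_ROOT>/alma-training-531544.out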

1.3.5. Checking Multi-Node Function

The test job performs LoRA-based fine-tuning with data parallelism in a multi-node, multi-GPU configuration. Given the log files alma-training-NNN.err and alma-training-NNN.out, there are several ways to confirm multi-node functionality and data-parallel fine-tuning. In alma-training-NNN.err, you can find informational logs similar to the following (this example was generated with single-node fine-tuning).

0: [INFO|trainer.py:1777] 2024-03-12 01:36:28,535 >> ***** Running training *****
0: [INFO|trainer.py:1778] 2024-03-12 01:36:28,535 >>   Num examples = 15,406
0: [INFO|trainer.py:1779] 2024-03-12 01:36:28,535 >>   Num Epochs = 8
0: [INFO|trainer.py:1780] 2024-03-12 01:36:28,535 >>   Instantaneous batch size per device = 12
0: [INFO|trainer.py:1781] 2024-03-12 01:36:28,535 >>   Total train batch size (w. parallel, distributed & accumulation) = 960
0: [INFO|trainer.py:1782] 2024-03-12 01:36:28,535 >>   Gradient Accumulation steps = 10
0: [INFO|trainer.py:1783] 2024-03-12 01:36:28,535 >>   Total optimization steps = 128
0: [INFO|trainer.py:1784] 2024-03-12 01:36:28,537 >>   Number of trainable parameters = 7,733,248

In our example, each training step uses N nodes, 8 GPUs per node, an instantaneous per-device batch size of 12, and 10 gradient accumulation steps. The total train batch size with parallelism, distribution, and accumulation should therefore be N × 8 × 12 × 10 = 960N, that is, 960 for one node and 1,920 for two nodes, matching the reference results below. In alma-training-NNN.err, you can also find progress-bar-like logs; the log line of the last training step will look like the following. To identify it, check the number of total optimization steps (128 in our single-node test case) and find the matching progress indicator (128/128) immediately before the training completion message.

100%|██████████| 128/128 [27:08<00:00, 12.71s/it][INFO|trainer.py:2044] 2024-03-12 02:03:37,121 >>
0:
0: Training completed. Do not forget to share your model on huggingface.co/models =)
0:
0:

Breaking down the information we have:

  • 100%: Percentage of training completed.

  • 128/128: Number of completed steps / total number of steps.

  • 27:08<00:00: Elapsed training time so far, 27 minutes and 8 seconds in this case; the value after < is the estimated time remaining.

  • 12.71s/it: Each step takes 12.71 seconds on average to complete.

1.3.6. Reference Results

Training information from the stderr log (please set the TASK_ID variable before running the command):

cat alma-training-NNN.err | grep "$TASK_ID: " | grep "trainer.py"

  • 1 node

TASK_ID=0
0: [INFO|trainer.py:1777] 2024-03-12 01:36:28,535 >> ***** Running training *****
0: [INFO|trainer.py:1778] 2024-03-12 01:36:28,535 >>   Num examples = 15,406
0: [INFO|trainer.py:1779] 2024-03-12 01:36:28,535 >>   Num Epochs = 8
0: [INFO|trainer.py:1780] 2024-03-12 01:36:28,535 >>   Instantaneous batch size per device = 12
0: [INFO|trainer.py:1781] 2024-03-12 01:36:28,535 >>   Total train batch size (w. parallel, distributed & accumulation) = 960
0: [INFO|trainer.py:1782] 2024-03-12 01:36:28,535 >>   Gradient Accumulation steps = 10
0: [INFO|trainer.py:1783] 2024-03-12 01:36:28,535 >>   Total optimization steps = 128
0: [INFO|trainer.py:1784] 2024-03-12 01:36:28,537 >>   Number of trainable parameters = 7,733,248
  • 2 nodes

TASK_ID=0
0: [INFO|trainer.py:1777] 2024-03-12 02:35:30,971 >> ***** Running training *****
0: [INFO|trainer.py:1778] 2024-03-12 02:35:30,971 >>   Num examples = 15,406
0: [INFO|trainer.py:1779] 2024-03-12 02:35:30,971 >>   Num Epochs = 8
0: [INFO|trainer.py:1780] 2024-03-12 02:35:30,971 >>   Instantaneous batch size per device = 12
0: [INFO|trainer.py:1781] 2024-03-12 02:35:30,971 >>   Total train batch size (w. parallel, distributed & accumulation) = 1,920
0: [INFO|trainer.py:1782] 2024-03-12 02:35:30,971 >>   Gradient Accumulation steps = 10
0: [INFO|trainer.py:1783] 2024-03-12 02:35:30,971 >>   Total optimization steps = 64
0: [INFO|trainer.py:1784] 2024-03-12 02:35:30,973 >>   Number of trainable parameters = 7,733,248

TASK_ID=1
1: [INFO|trainer.py:1777] 2024-03-12 02:35:30,181 >> ***** Running training *****
1: [INFO|trainer.py:1778] 2024-03-12 02:35:30,181 >>   Num examples = 15,406
1: [INFO|trainer.py:1779] 2024-03-12 02:35:30,181 >>   Num Epochs = 8
1: [INFO|trainer.py:1780] 2024-03-12 02:35:30,181 >>   Instantaneous batch size per device = 12
1: [INFO|trainer.py:1781] 2024-03-12 02:35:30,181 >>   Total train batch size (w. parallel, distributed & accumulation) = 1,920
1: [INFO|trainer.py:1782] 2024-03-12 02:35:30,181 >>   Gradient Accumulation steps = 10
1: [INFO|trainer.py:1783] 2024-03-12 02:35:30,181 >>   Total optimization steps = 64
1: [INFO|trainer.py:1784] 2024-03-12 02:35:30,183 >>   Number of trainable parameters = 7,733,248

Note the elapsed training time after the final step:

  • 1 node

100%|██████████| 128/128 [27:08<00:00, 12.72s/it]
  • 2 nodes

TASK_ID=0
100%|██████████| 64/64 [13:34<00:00, 12.72s/it]
TASK_ID=1
100%|██████████| 64/64 [13:34<00:00, 12.73s/it]