1. PyTorch and HuggingFace Accelerate with DeepSpeed on DGX Cloud
1.1. Overview
This guide introduces how to fine-tune a multilingual NMT model, ALMA (Advanced Language Model-based TrAnslator), on DGX Cloud. ALMA is an LLM-based many-to-many translation model. Unlike traditional encoder-decoder models, ALMA is a decoder-only model. Starting from the pretrained Llama 2 model, it is first trained further on monolingual datasets (stage 1) and then LoRA-tuned on a high-quality parallel dataset (stage 2), achieving strong performance on multilingual translation tasks.
The ALMA project trains with the PyTorch, Hugging Face Accelerate, and DeepSpeed libraries, which are widely used in large-scale AI projects that require multi-GPU, multi-node (MGMN) training. This makes it representative of other popular community projects that may be run on DGX Cloud. This documentation covers:
Preparing the training container based on an NGC PyTorch image
Setting up the cluster workspace
Cloning the ALMA repo
Composing the Slurm batch script and modifying training parameters
Running the LoRA fine-tuning task
Checking multi-node multi-GPU functionality
The translation datasets used for LoRA fine-tuning are already included in the GitHub repo, so data preparation and cleaning are not strictly necessary.
1.2. Prerequisites
Please check that the following prerequisites are met before using this document.
A DGX Cloud Slurm cluster is provisioned and the user can launch and run jobs on the cluster (root permissions are not necessary).
The user has access to at least two A100 or H100-based compute nodes on the cluster that can run jobs with high-speed multi-node compute networks.
The user has read/write access to at least 100GB of shared storage which is mounted and available on all nodes in the cluster and available within jobs. To identify the shared storage path, please consult with your cluster administrator. The path will be represented by <SHARED_STORAGE_ROOT> in this document. An example for <SHARED_STORAGE_ROOT> might be /lustre/fs0/scratch/demo-user.
The cluster has external internet access on all nodes to download datasets and pre-trained models.
1.2.1. Preparing a Customized Container Image
The Slurm implementation in your DGX Cloud cluster uses Pyxis and Enroot as the container plugin and runtime. This section therefore explains how to build a customized environment into a container image for use on BCM-based clusters with Slurm and Pyxis/Enroot. The build process should be completed in a local Linux environment. A containerized environment gives us a frozen set of packages, so model training can be repeated at any time without package compatibility concerns.
1.2.1.1. Creating a Container Image in a Local Linux Machine
To build a customized container environment for ALMA, we need to install Docker Engine on the local machine. Docker is an open source containerization technology for building and containerizing applications. Once Docker is installed, we can pull container images and build a container environment that supports ALMA model training by providing a list of instructions for assembling an image. First, we create a Dockerfile with the following content:
# Dockerfile
FROM nvcr.io/nvidia/pytorch:24.01-py3

RUN cd /opt && git clone https://github.com/fe1ixxu/ALMA.git
RUN cd /opt/ALMA \
    && git checkout 83ee22f \
    && bash install_alma.sh \
    && pip install transformers==4.30.0
In this Dockerfile, we select a PyTorch container image from NGC as the base image (nvcr.io/nvidia/pytorch:24.01-py3). Next, we clone the ALMA repository from GitHub and check out a fixed commit. Finally, we run the repository's installation script (install_alma.sh) and install the Hugging Face transformers library at a pinned version. Execute the following command in the same folder as the Dockerfile to generate a container image named alma:24.01. Note that the trailing dot in the command is required; it sets the build context to the current folder.
docker build -t alma:24.01 .
Once the container build is complete, the following command lists the new image. The output should look similar to this.
docker images | grep alma
REPOSITORY   TAG     IMAGE ID       CREATED          SIZE
alma         24.01   2dba738ae464   35 minutes ago   22.5GB
1.2.1.2. Setting Up Your Cluster Workspace
Before model training, we need to prepare the scripts, the dataset, and the containerized environment built in the previous step. For details on container and Enroot setup, refer to the Setting Up NGC Integration section of the Cluster User Guide.
From here, we have two options for using this image in the Slurm cluster.
1.2.1.2.1. Option 1: Push to an Accessible Container Registry
A container registry is a repository, or a collection of repositories, that stores container images. The NGC Catalog is an example of a container registry. If we have a container registry (denoted as <YOUR_REGISTRY>) that is accessible from the cluster and to which we have image upload permission, we can push the image to that registry, and Pyxis/Enroot in the Slurm cluster can pull the image from it. If the registry requires authentication, first execute docker login <YOUR_REGISTRY> and log in with your credentials. Next, tag the image and push it to your registry. Note that you may need to tag the container image with a different name or an additional repository path. For brevity, we use <YOUR_REGISTRY>/alma:24.01 from now on.
docker tag alma:24.01 <YOUR_REGISTRY>/alma:24.01
docker push <YOUR_REGISTRY>/alma:24.01
1.2.1.2.2. Option 2: Convert to a SquashFS File and Upload to the Slurm Cluster
If uploading our container image to a container registry is not an option, we need to convert the image into a SquashFS file using NVIDIA Enroot, a tool that turns traditional container/OS images into unprivileged sandboxes. First, log in to the cluster login node and execute enroot version to confirm the version of Enroot installed in the cluster (3.4.1 as of March 11, 2024).
Next, follow the Enroot installation instructions to install the same version on your local machine that the cluster uses (to ensure compatibility). Once installation is complete, we can convert the alma:24.01 image to a SquashFS file named alma-24.01.sqsh on our local machine with the following command.
enroot import -o alma-24.01.sqsh dockerd://alma:24.01
Once the conversion is completed, we can upload the SquashFS file to the Slurm cluster using one of the methods described in the Cluster User Guide Moving Data from Your Local Workstation to DGX Cloud. Note that the final destination of the SquashFS file in the Slurm cluster must be in a shared file system so that all compute nodes can access it when the distributed workload is launched. For example, if scp is used to upload the SquashFS file here, we can execute the following command in the local machine:
scp alma-24.01.sqsh \
    <USERNAME>@<LOGIN_NODE>:<SHARED_STORAGE_ROOT>/
where <USERNAME> is the user name in the cluster and <LOGIN_NODE> is the address of a login node in the DGX Cloud cluster.
1.3. Running on Slurm
This section covers cluster workspace preparation, Slurm batch script configuration, and checking multi-node training functionality.
1.3.1. Enabling Slurm Commands
If Slurm commands are not enabled yet, please execute the following command.
module load slurm
1.3.2. Pulling Code From the ALMA Repository
The ALMA dataset and training scripts can be pulled directly from GitHub to the login node. The destination of the cloned repository must be on a shared file system. We use <SHARED_STORAGE_ROOT> in this case.
To pull the ALMA repository with compatible contents, we first navigate to our shared storage and clone the repository, then check out the same commit we used when building the image.
cd <SHARED_STORAGE_ROOT>
git clone https://github.com/fe1ixxu/ALMA.git
cd ALMA
git checkout 83ee22f
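Because the container image and the cloned repository should be at the same commit, it can be worth confirming the checkout before moving on. The following is a minimal sketch and not part of the ALMA project: commit_matches is a hypothetical helper name, and the check assumes it is run from inside the ALMA clone with git available.

```shell
# Hypothetical helper: succeeds when the full commit hash in $2
# starts with the short hash in $1.
commit_matches() {
    case "$2" in
        "$1"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Short hash used in the Dockerfile and in the checkout step of this guide.
expected=83ee22f

# git rev-parse HEAD prints the full hash of the current checkout.
if actual=$(git rev-parse HEAD 2>/dev/null); then
    if commit_matches "$expected" "$actual"; then
        echo "Checkout matches $expected"
    else
        echo "Checkout is at $actual, expected $expected"
    fi
else
    echo "Not inside a Git repository; run this from the ALMA clone."
fi
```

If the check reports a different commit, repeating git checkout 83ee22f in the clone brings the scripts back in line with the image.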
1.3.3. Running the LoRA Batch Script
Now we can prepare the batch script to submit our training workload. Save the following script as <SHARED_STORAGE_ROOT>/alma-launcher.sh.
Note that the lines beginning with #SBATCH are directives interpreted by sbatch, not ordinary comments. Modify them for your Slurm account, number of nodes, and target Slurm partition.
#!/bin/bash

# Parameters
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --job-name=alma-training
#SBATCH --error=%x-%j.err
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --mem=0
#SBATCH --nodes=<N_NODES>
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x-%j.out
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --time=01:00:00

# setup
export TRANSFORMERS_OFFLINE=0
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0
export NCCL_ASYNC_ERROR_HANDLING=1

# Additional setting for DGX Cloud
export OMPI_MCA_coll_hcoll_enable=0
export UCX_TLS=tcp
export UCX_NET_DEVICES=eth0
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
export NCCL_DEBUG=INFO
export NCCL_PROTO=LL,LL128,Simple
export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS
export MELLANOX_VISIBLE_DEVICES=all
export PMIX_MCA_gds=hash
export PMIX_MCA_psec=native

export SHARED_STORAGE_ROOT=<SHARED_STORAGE_ROOT>

export CONTAINER_IMAGE=<IMAGE_NAME_OR_PATH>

export HF_DATASETS_CACHE=".cache/huggingface_cache/datasets"
export TRANSFORMERS_CACHE=".cache/models/"

# first node in the job's node list acts as the main process host
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
# random port between 30000 and 50000
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
export GPUS_PER_NODE=$SLURM_GPUS_PER_NODE
export NNODES=$SLURM_NNODES
export NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)

echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"

export OUTPUT_DIR=/workspace/output

# Language pair option
# Possible options are de-en,cs-en,is-en,zh-en,ru-en,en-de,en-cs,en-is,en-zh,en-ru
export PAIRS=zh-en

export LORA_RANK=16

srun -l --container-image $CONTAINER_IMAGE \
    --container-mounts <SHARED_STORAGE_ROOT>/ALMA:/workspace \
    --container-workdir /workspace \
    --no-container-mount-home \
    bash -c 'echo "Node ID $SLURM_NODEID"; accelerate launch \
    --main_process_ip ${MASTER_ADDR} \
    --main_process_port ${MASTER_PORT} \
    --machine_rank $SLURM_NODEID \
    --num_processes $NUM_PROCESSES \
    --num_machines $NNODES \
    --use_deepspeed \
    --main_training_function main \
    --mixed_precision fp16 \
    --rdzv_backend static \
    --deepspeed_multinode_launcher standard \
    --same_network \
    --gradient_accumulation_steps 10 \
    --gradient_clipping 1.0 \
    --offload_optimizer_device none \
    --offload_param_device cpu \
    --zero3_init_flag false \
    --zero_stage 2 \
    run_llmmt.py \
    --model_name_or_path haoranxu/ALMA-7B-Pretrain \
    --mmt_data_path ./human_written_data/ \
    --use_peft \
    --lora_rank ${LORA_RANK} \
    --do_train \
    --do_eval \
    --language_pairs ${PAIRS} \
    --load_best_model_at_end \
    --low_cpu_mem_usage \
    --fp16 \
    --learning_rate 2e-3 \
    --weight_decay 0.01 \
    --gradient_accumulation_steps 10 \
    --lr_scheduler_type inverse_sqrt \
    --warmup_ratio 0.01 \
    --ignore_pad_token_for_loss \
    --ignore_prompt_token_for_loss \
    --per_device_train_batch_size 12 \
    --per_device_eval_batch_size 4 \
    --evaluation_strategy steps \
    --eval_steps 512 \
    --save_strategy steps \
    --save_steps 512 \
    --save_total_limit 1 \
    --logging_strategy steps \
    --logging_steps 1 \
    --output_dir ${OUTPUT_DIR} \
    --num_train_epochs 8 \
    --predict_with_generate \
    --prediction_loss_only \
    --max_new_tokens 256 \
    --max_source_length 256 \
    --seed 42 \
    --overwrite_output_dir \
    --num_beams 5 \
    --ddp_timeout 999999 \
    --report_to none \
    --overwrite_cache'
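The launcher derives its rendezvous port and process count with shell arithmetic. The sketch below isolates those two calculations so they can be tried outside a job. The function names are hypothetical, and the node and GPU counts passed in stand in for the SLURM_NNODES and SLURM_GPUS_PER_NODE values that Slurm provides inside a job.

```shell
# Pick a port in the inclusive range 30000-50000, using the same formula
# as the launcher. RANDOM is a bash built-in yielding 0 to 32767.
pick_master_port() {
    echo $(( RANDOM % (50000 - 30000 + 1) + 30000 ))
}

# One training process per GPU across all nodes.
# Arguments: number of nodes, GPUs per node.
count_processes() {
    local nnodes=$1 gpus_per_node=$2
    echo $(( nnodes * gpus_per_node ))
}

echo "MASTER_PORT: $(pick_master_port)"
echo "NUM_PROCESSES for 1 node:  $(count_processes 1 8)"
echo "NUM_PROCESSES for 2 nodes: $(count_processes 2 8)"
```

With 8 GPUs per node, this gives 8 processes for a single node and 16 for two nodes, which is the value passed to accelerate launch through --num_processes.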
In this job we only check the multi-node training functionality of Hugging Face Accelerate with DeepSpeed. Set the following placeholders in the batch script to match your Slurm account, container image, and resource preferences.
<SLURM_ACCOUNT>: The Slurm account to be used for your project. Please consult with the cluster administrator or project manager to determine which account name to use.
<SHARED_STORAGE_ROOT>: The root path of user shared storage defined in earlier sections.
<N_NODES>: Number of nodes to be used for this training job. The job is very small and has been tested with 1 and 2 nodes.
<SLURM_PARTITION>: The Slurm partition(s) to use for this job. Slurm partitions are defined by cluster administrators and designated for different purposes or accounts. Note that a partition with multi-node job and GPU support must be selected.
<IMAGE_NAME_OR_PATH>: The container image name or path to be used in Slurm with Pyxis/Enroot. This depends on which option was chosen earlier for making the image available to the cluster.
If Option 1 is used, replace it with <YOUR_REGISTRY>/alma:24.01
If Option 2 is used, replace it with the upload destination <SHARED_STORAGE_ROOT>/alma-24.01.sqsh
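A placeholder that is left unreplaced is an easy mistake to make before submission. As a minimal sketch, the hypothetical helper below lists any remaining <UPPER_CASE> placeholders in a file. It is demonstrated on a small throwaway file with made-up contents rather than on the real launcher.

```shell
# Hypothetical pre-submission check: print "line:placeholder" for every
# <UPPER_CASE> placeholder still present in the file given as $1.
# Returns non-zero when none are found (grep's usual convention).
find_placeholders() {
    grep -n -o '<[A-Z][A-Z_]*>' "$1"
}

# Throwaway demo file with two unreplaced placeholders.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#SBATCH --account=my-account
#SBATCH --nodes=<N_NODES>
#SBATCH --partition=<SLURM_PARTITION>
EOF

if find_placeholders "$demo_script"; then
    echo "Replace the placeholders listed above before running sbatch."
else
    echo "No placeholders left."
fi
rm -f "$demo_script"
```

For the real script, the same function can be pointed at alma-launcher.sh in the shared storage directory.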
Once all placeholders are set, we can submit the job with the following command on the login node. In our test on a single node with 8 A100 GPUs, the job took about 30 minutes to complete.
cd <SHARED_STORAGE_ROOT>
sbatch alma-launcher.sh
1.3.4. Monitoring the Job
Once the job is submitted, its status can be viewed with squeue. To list only your own jobs, use squeue -u $USER. The output should look similar to the following.
squeue -u $USER

JOBID   PARTITION  NAME      USER      ST  TIME  NODES  NODELIST(REASON)
531544  polar2     alma-tra  chenghan  R   0:43  1      batch-block2-1079
As designated in the batch script, the job logs are written to <SHARED_STORAGE_ROOT>/alma-training-NNN.out (stdout) and <SHARED_STORAGE_ROOT>/alma-training-NNN.err (stderr), where NNN is the job ID. We can also use the command tail -f <filename> to view live updates of a log file. Because we added the -l option to the srun command, each log line is prefixed with its task number, which helps identify which process produced each line in a multi-task job. For example, if <N_NODES> is set to 2, the log files will contain lines prefixed with 0: and 1:.
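A quick way to confirm that every task wrote to a log is to list the distinct task prefixes it contains. The following is a minimal sketch: list_task_ids is a hypothetical helper, and the sample log lines are made up for illustration, modeled on the "Node ID" line that the launcher echoes.

```shell
# Print the distinct task numbers that appear as "N: " prefixes
# at the start of lines in the log file given as $1.
list_task_ids() {
    grep -oE '^[0-9]+: ' "$1" | tr -d ': ' | sort -un
}

# Made-up sample resembling a two-node log.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
0: Node ID 0
1: Node ID 1
0: a later line from task 0
EOF

list_task_ids "$sample_log"
rm -f "$sample_log"
```

For a two-node job, the function should print 0 and 1 on separate lines when run against alma-training-NNN.out or alma-training-NNN.err.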
1.3.5. Checking Multi-Node Function
The test job is a LoRA fine-tuning run that uses data parallelism in a multi-node, multi-GPU configuration. Using the log files alma-training-NNN.err and alma-training-NNN.out, there are several ways to confirm multi-node functionality and data-parallel fine-tuning. In alma-training-NNN.err, you can find informational log lines similar to the following (this example was generated with single-node fine-tuning).
0: [INFO|trainer.py:1777] 2024-03-12 01:36:28,535 >> ***** Running training *****
0: [INFO|trainer.py:1778] 2024-03-12 01:36:28,535 >> Num examples = 15,406
0: [INFO|trainer.py:1779] 2024-03-12 01:36:28,535 >> Num Epochs = 8
0: [INFO|trainer.py:1780] 2024-03-12 01:36:28,535 >> Instantaneous batch size per device = 12
0: [INFO|trainer.py:1781] 2024-03-12 01:36:28,535 >> Total train batch size (w. parallel, distributed & accumulation) = 960
0: [INFO|trainer.py:1782] 2024-03-12 01:36:28,535 >> Gradient Accumulation steps = 10
0: [INFO|trainer.py:1783] 2024-03-12 01:36:28,535 >> Total optimization steps = 128
0: [INFO|trainer.py:1784] 2024-03-12 01:36:28,537 >> Number of trainable parameters = 7,733,248
In our example, each optimization step spans N nodes, 8 GPUs per node, an instantaneous batch size of 12 per device, and 10 gradient accumulation steps. The total batch size with parallelism, distribution, and accumulation should therefore be $N \times 8 \times 12 \times 10 = 960N$.
In alma-training-NNN.err, you can also find progress-bar-style log lines. To identify the last training log line, check the total number of optimization steps ($N_s = 128$ in our single-node test case) and find the progress indicator $N_s/N_s$ just before the training completion message. The log line for the last training step looks like this.
100%|██████████| 128/128 [27:08<00:00, 12.71s/it][INFO|trainer.py:2044] 2024-03-12 02:03:37,121 >>
0:
0: Training completed. Do not forget to share your model on huggingface.co/models =)
0:
0:
Breaking down the information we have:
100%: Percentage of training completed.
128/128: Number of completed steps/total steps.
27:08<00:00: Elapsed training time so far (27 minutes and 8 seconds in this case), followed by the estimated time remaining.
12.71s/it: Each step takes 12.71 seconds on average.
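The fields described above can also be pulled out of a progress line with standard shell tools. The sketch below is illustrative only: to_seconds and seconds_per_step are hypothetical helper names, and the sample line is the final progress line from the single-node run shown earlier. The average is computed from the elapsed time and step count, so it may differ slightly from the rate printed in the progress bar.

```shell
# Convert an elapsed time written as MM:SS or HH:MM:SS into seconds.
to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# Average seconds per step, given an elapsed time ($1) and a step count ($2).
seconds_per_step() {
    awk -v t="$(to_seconds "$1")" -v n="$2" 'BEGIN { printf "%.2f\n", t / n }'
}

# Final progress line from the single-node run in this guide.
line='100%|██████████| 128/128 [27:08<00:00, 12.71s/it]'

# Find "128/128" and keep the part before the slash.
steps=$(echo "$line" | grep -oE '[0-9]+/[0-9]+' | head -n 1 | cut -d/ -f1)
# Find "[27:08<" and strip the surrounding punctuation.
elapsed=$(echo "$line" | grep -oE '\[[0-9:]+<' | tr -d '[<')

echo "steps=$steps elapsed=$elapsed ($(to_seconds "$elapsed") s)"
echo "average: $(seconds_per_step "$elapsed" "$steps") s/step"
```

Here 27:08 converts to 1628 seconds, and 1628 seconds over 128 steps is about 12.72 seconds per step.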
1.3.6. Reference Results
Training information from the stderr log (please set the TASK_ID variable before running the command):
cat alma-training-NNN.err | grep "$TASK_ID: " | grep "trainer.py"
Single-node run
TASK_ID=0
0: [INFO|trainer.py:1777] 2024-03-12 01:36:28,535 >> ***** Running training *****
0: [INFO|trainer.py:1778] 2024-03-12 01:36:28,535 >> Num examples = 15,406
0: [INFO|trainer.py:1779] 2024-03-12 01:36:28,535 >> Num Epochs = 8
0: [INFO|trainer.py:1780] 2024-03-12 01:36:28,535 >> Instantaneous batch size per device = 12
0: [INFO|trainer.py:1781] 2024-03-12 01:36:28,535 >> Total train batch size (w. parallel, distributed & accumulation) = 960
0: [INFO|trainer.py:1782] 2024-03-12 01:36:28,535 >> Gradient Accumulation steps = 10
0: [INFO|trainer.py:1783] 2024-03-12 01:36:28,535 >> Total optimization steps = 128
0: [INFO|trainer.py:1784] 2024-03-12 01:36:28,537 >> Number of trainable parameters = 7,733,248
Two-node run
TASK_ID=0
0: [INFO|trainer.py:1777] 2024-03-12 02:35:30,971 >> ***** Running training *****
0: [INFO|trainer.py:1778] 2024-03-12 02:35:30,971 >> Num examples = 15,406
0: [INFO|trainer.py:1779] 2024-03-12 02:35:30,971 >> Num Epochs = 8
0: [INFO|trainer.py:1780] 2024-03-12 02:35:30,971 >> Instantaneous batch size per device = 12
0: [INFO|trainer.py:1781] 2024-03-12 02:35:30,971 >> Total train batch size (w. parallel, distributed & accumulation) = 1,920
0: [INFO|trainer.py:1782] 2024-03-12 02:35:30,971 >> Gradient Accumulation steps = 10
0: [INFO|trainer.py:1783] 2024-03-12 02:35:30,971 >> Total optimization steps = 64
0: [INFO|trainer.py:1784] 2024-03-12 02:35:30,973 >> Number of trainable parameters = 7,733,248

TASK_ID=1
1: [INFO|trainer.py:1777] 2024-03-12 02:35:30,181 >> ***** Running training *****
1: [INFO|trainer.py:1778] 2024-03-12 02:35:30,181 >> Num examples = 15,406
1: [INFO|trainer.py:1779] 2024-03-12 02:35:30,181 >> Num Epochs = 8
1: [INFO|trainer.py:1780] 2024-03-12 02:35:30,181 >> Instantaneous batch size per device = 12
1: [INFO|trainer.py:1781] 2024-03-12 02:35:30,181 >> Total train batch size (w. parallel, distributed & accumulation) = 1,920
1: [INFO|trainer.py:1782] 2024-03-12 02:35:30,181 >> Gradient Accumulation steps = 10
1: [INFO|trainer.py:1783] 2024-03-12 02:35:30,181 >> Total optimization steps = 64
1: [INFO|trainer.py:1784] 2024-03-12 02:35:30,183 >> Number of trainable parameters = 7,733,248
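The batch sizes and step counts in these logs can be cross-checked with integer arithmetic. The sketch below is a worked example, not the Trainer's actual code. It assumes that the number of micro-batches per epoch is the example count divided by the per-pass batch (rounded up), that optimizer updates per epoch are that count divided by the accumulation steps (rounded down), and that the total is the per-epoch count times the number of epochs. These assumptions reproduce the logged values but are not taken from the library source. The function names are hypothetical.

```shell
# Effective batch size per optimizer step.
# Arguments: nodes, GPUs per node, per-device batch size, accumulation steps.
total_batch_size() {
    echo $(( $1 * $2 * $3 * $4 ))
}

# Total optimizer steps under the assumptions stated above.
# Arguments: examples, nodes, GPUs per node, per-device batch size,
#            accumulation steps, epochs.
total_steps() {
    local examples=$1 nodes=$2 gpus=$3 per_device=$4 accum=$5 epochs=$6
    local per_pass=$(( nodes * gpus * per_device ))
    # Ceiling division: micro-batches needed to cover every example once.
    local micro_batches=$(( (examples + per_pass - 1) / per_pass ))
    # Floor division: complete accumulation cycles per epoch.
    echo $(( micro_batches / accum * epochs ))
}

echo "1 node:  batch $(total_batch_size 1 8 12 10), steps $(total_steps 15406 1 8 12 10 8)"
echo "2 nodes: batch $(total_batch_size 2 8 12 10), steps $(total_steps 15406 2 8 12 10 8)"
```

This gives a batch size of 960 with 128 steps for one node and 1920 with 64 steps for two nodes, matching the logs. Doubling the node count doubles the effective batch size and halves the number of optimizer steps.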
Note the elapsed training time after the final step:
Single-node run
100%|██████████| 128/128 [27:08<00:00, 12.72s/it]
Two-node run
TASK_ID=0
100%|██████████| 64/64 [13:34<00:00, 12.72s/it]
TASK_ID=1
100%|██████████| 64/64 [13:34<00:00, 12.73s/it]
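The two elapsed times give a simple measure of how well the job scales. As a minimal sketch, the hypothetical speedup function below divides one elapsed time by another. It accepts MM:SS or HH:MM:SS strings as printed in the progress bar.

```shell
# Ratio of two elapsed times: baseline ($1) divided by scaled run ($2).
speedup() {
    awk -v base="$1" -v scaled="$2" '
        # Convert MM:SS or HH:MM:SS to seconds.
        function secs(t,    parts, n, i, s) {
            n = split(t, parts, ":")
            s = 0
            for (i = 1; i <= n; i++) s = s * 60 + parts[i]
            return s
        }
        BEGIN { printf "%.2f\n", secs(base) / secs(scaled) }'
}

# Elapsed times from the single-node and two-node runs above.
speedup 27:08 13:34
```

This prints 2.00: the two-node run finishes the same eight epochs in half the wall-clock time, because each step takes about the same time and half as many steps are needed.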