4. Workload Examples#

4.1. PyTorch and Hugging Face Accelerate with DeepSpeed on DGX Cloud#

4.1.1. Overview#

This guide introduces how to finetune a multi-lingual NMT model, namely ALMA (Advanced Language Model-based TrAnslator), on DGX Cloud. ALMA is a LLM based on a many-to-many translation model. Compared to traditional encode-decoder based models, ALMA is a decoder-only model and it is continually trained with monolingual datasets (stage 1) and LoRA tuned (stage 2) on a high quality parallel dataset based on the Llama2 pretrained model, achieving excellent performance at multilingual translation tasks.

The ALMA project uses PyTorch, Hugging Face Accelerate and DeepSpeed libraries in training, which are very popular in many large-scale AI projects with a multi-node-multi-GPU (MGMN) requirement, and it would be representative of other popular community projects to be tested on DGX Cloud. This documentation covers:

Preparing the training container based on an NGC PyTorch image
Setting up the cluster workspace
Cloning the ALMA repo
Composing the Slurm batch script and modifying training parameters
Execution of the LoRA fine-tuning task
Checking multi-node multi-GPU functionality

In the LoRA fine-tuning, the provided translation datasets have already been included in the GitHub repo, therefore data preparation and cleaning are not strictly necessary.

4.1.2. Prerequisites#

Please check that the following prerequisites are met before using this document.

A DGX Cloud Slurm cluster is provisioned and the user has access to launching and running jobs on the cluster (root permissions not necessary).
The user has access to at least two A100 or H100-based compute nodes on the cluster that can run jobs with high-speed multi-node compute networks.
The user has read/write access to at least 100GB of shared storage which is mounted and available on all nodes in the cluster and available within jobs. To identify the shared storage path, please consult with your cluster administrator. The path will be represented by <SHARED_STORAGE_ROOT> in this document. An example for <SHARED_STORAGE_ROOT> might be /lustre/fs0/scratch/demo-user.
The cluster has external internet access on all nodes to download datasets and pre-trained models.

4.1.3. Preparing a Customized Container Image#

The Slurm implementation in your DGX Cloud cluster leverages Pyxis and Enroot as the container plugin and runtime. This is relevant since in this section we explain how to build a customized environment into a container image for use on BCM-based clusters with Slurm and Pyxis/Enroot. The build process should be completed in a local Linux environment. By building a containerized environment, we have a frozen environment to perform model training whenever possible without concerns of package compatibility.

4.1.3.1. Creating a Container Image in a Local Linux Machine#

To build a customized container environment for ALMA, we need to install Docker engine in the local machine, an open source containerization technology for building and containerizing your applications. Once Docker is installed, we can pull container images and build a container environment to support ALMA model training by providing a list of instructions to assemble an image. First we create a Dockerfile with content as follows:

# Dockerfile
FROM nvcr.io/nvidia/pytorch:24.01-py3

RUN cd /opt && git clone https://github.com/fe1ixxu/ALMA.git
RUN cd /opt/ALMA \
      && git checkout 83ee22f \
      && bash install_alma.sh \
      && pip install transformers==4.30.0

In this Dockerfile, we select a PyTorch container image from NGC as the base image (nvcr.io/nvidia/pytorch:24.01-py3). Next, we pull the ALMA repository from GitHub and checkout to a fixed commit. Finally, we use the internal installation script and install Hugging Face transformers library with a frozen version. Execute the following command in the same folder with Dockerfile and a container image alma:24.01 will be generated. Note that the “dot” in the command line is necessary.

docker build -t alma:24.01 .

Once the container build is completed, we can execute the following command and the container image will be listed and looks like this.

docker images | grep alma
REPOSITORY   TAG       IMAGE ID       CREATED          SIZE
alma         24.01     2dba738ae464   35 minutes ago   22.5GB

4.1.3.2. Setting Up Your Cluster Workspace#

Before model training, we need to prepare scripts, dataset, and the containerized environment as detailed in the previous step. Please refer to these details in the Cluster User Guide Setting Up NGC Integration regarding container and Enroot setup.

From here, we have two options to use this image in the Slurm cluster.

4.1.3.2.1. Option 1: Push to an Accessible Container Registry#

A container registry is a repository or repository collection to store container images. The NGC Catalog is an example of a container registry. If we have a container registry (denoted as <YOUR_REGISTRY>) with image upload permission that is accessible by the cluster, we can push the image to the registry, where Pyxis/Enroot in the Slurm cluster can pull the image from it. If the registry requires login and password authentication, please execute docker login <YOUR_REGISTRY> and login with credentials first. Next, you can tag the image and push it to your registry. Note that you may need to tag the container image with a different name or additional repository path. For the sake of brevity, we use <YOUR_REGISTRY>/alma:24.01 from now on.

docker tag alma:24.01 <YOUR_REGISTRY>/alma:24.01
docker push <YOUR_REGISTRY>/alma:24.01

4.1.3.2.2. Option 2: Convert to a SquashFS File and Upload to the Slurm Cluster#

If uploading our container image to a container registry is not available, we will need to convert the image into a SquashFS file using NVIDIA Enroot, a tool that turns traditional container/OS images into unprivileged sandboxes. First, we will check the current version of Enroot in the cluster. Login to the cluster login node and execute enroot version to confirm the current version of Enroot in the cluster. (3.4.1 as of March 11, 2024)

Next, follow the instructions here to install the Enroot version in your local machine that corresponds to the version in the cluster (in order to ensure compatability). Once installation is completed, we can convert the alma:24.01 image to a SquashFS file alma-24.01.sqsh in our local machine with the following command.

enroot import -o alma-24.01.sqsh dockerd://alma:24.01

Once the conversion is completed, we can upload the SquashFS file to the Slurm cluster using one of the methods described in the Cluster User Guide Moving Data from Your Local Workstation to DGX Cloud. Note that the final destination of the SquashFS file in the Slurm cluster must be in a shared file system so that all compute nodes can access it when the distributed workload is launched. For example, if scp is used to upload the SquashFS file here, we can execute the following command in the local machine:

scp pytorch-vidasr-24.01.sqsh \
      <USERNAME>@<LOGIN_NODE>:<SHARED_STORAGE_ROOT>/

where <USERNAME> is the user name in the cluster and <LOGIN_NODE> is the node address of a login node in the DGX Cloud cluster.

4.1.4. Running on Slurm#

This section covers cluster workspace preparation, Slurm batch script configuration, and checking multi-node training functionality.

4.1.5. Enabling Slurm Commands#

If Slurm commands are not enabled yet, please execute the following command.

module load slurm

4.1.6. Pulling Code From the ALMA Repository#

Dataset and training scripts of ALMA can be pulled directly from GitHub to the login node. The destination of the cloned repository must be a shared file system. We use <SHARED_STORAGE_ROOT> in this case.

To pull the ALMA repository with compatible contents, we navigate to our user shared storage first and clone the repository, then checkout with the same commit we used when building the image.

cd <SHARED_STORAGE_ROOT>
git clone https://github.com/fe1ixxu/ALMA.git
cd ALMA
git checkout 83ee22f

4.1.7. Running the LoRA Batch Script#

Now we can prepare our batch script to submit our training workload. Save the following script into <SHARED_STORAGE_ROOT>/alma-launcher.sh. Note that the #SBATCH parameters which are interpreted by the sbatch tool (not comments) should be modified for the user account, number of nodes, and target Slurm partition.

#!/bin/bash

# Parameters
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --job-name=alma-training
#SBATCH --error=%x-%j.err
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --mem=0
#SBATCH --nodes=<N_NODES>
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x-%j.out
#SBATCH --exclusive
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --time=01:00:00

# setup
export TRANSFORMERS_OFFLINE=0
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0
export NCCL_ASYNC_ERROR_HANDLING=1

export SHARED_STORAGE_ROOT=<SHARED_STORAGE_ROOT>

export CONTAINER_IMAGE=<IMAGE_NAME_OR_PATH>

export HF_DATASETS_CACHE=".cache/huggingface_cache/datasets"
export TRANSFORMERS_CACHE=".cache/models/"

# random port between 30000 and 50000
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
export GPUS_PER_NODE=$SLURM_GPUS_PER_NODE
export NNODES=$SLURM_NNODES
export NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)

echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"

export OUTPUT_DIR=/workspace/output

# Language pair option
# Possible options are de-en,cs-en,is-en,zh-en,ru-en,en-de,en-cs,en-is,en-zh,en-ru
export PAIRS=zh-en

export LORA_RANK=16

srun -l --container-image $CONTAINER_IMAGE \
      --container-mounts <SHARED_STORAGE_ROOT>/ALMA:/workspace \
      --container-workdir /workspace \
      --no-container-mount-home \
      bash -c 'echo "Node ID $SLURM_NODEID"; accelerate launch \
      --main_process_ip ${MASTER_ADDR} \
      --main_process_port ${MASTER_PORT} \
      --machine_rank $SLURM_NODEID \
      --num_processes $NUM_PROCESSES \
      --num_machines $NNODES \
      --use_deepspeed \
      --main_training_function main \
      --mixed_precision fp16 \
      --rdzv_backend static \
      --deepspeed_multinode_launcher standard \
      --same_network \
      --gradient_accumulation_steps 10 \
      --gradient_clipping 1.0 \
      --offload_optimizer_device none \
      --offload_param_device cpu \
      --zero3_init_flag false \
      --zero_stage 2 \
run_llmmt.py \
--model_name_or_path haoranxu/ALMA-7B-Pretrain \
--mmt_data_path  ./human_written_data/ \
--use_peft \
--lora_rank ${LORA_RANK} \
--do_train \
--do_eval \
--language_pairs ${PAIRS} \
--load_best_model_at_end \
--low_cpu_mem_usage \
--fp16 \
--learning_rate 2e-3 \
--weight_decay 0.01 \
--gradient_accumulation_steps 10 \
--lr_scheduler_type inverse_sqrt \
--warmup_ratio 0.01 \
--ignore_pad_token_for_loss \
--ignore_prompt_token_for_loss \
--per_device_train_batch_size 12 \
--per_device_eval_batch_size 4 \
--evaluation_strategy steps \
--eval_steps 512 \
--save_strategy steps \
--save_steps 512 \
--save_total_limit 1 \
--logging_strategy steps \
--logging_steps 1 \
--output_dir ${OUTPUT_DIR} \
--num_train_epochs 8 \
--predict_with_generate \
--prediction_loss_only \
--max_new_tokens 256 \
--max_source_length 256 \
--seed 42 \
--overwrite_output_dir \
--num_beams 5 \
--ddp_timeout 999999 \
--report_to none \
--overwrite_cache'

In this job we only check the multi-node training functionality of Hugging Face Accelerate with DeepSpeed. For different settings of Slurm account, container image preparations, and resource preference, there are notable variables to be configured in this batch script.

<SLURM_ACCOUNT_NAME>: The Slurm account to be used for your project. Please consult with the cluster administrator or project manager to determine which account name to be used.
<SHARED_STORAGE_ROOT>: The root path of user shared storage defined in earlier sections.
<N_NODES>: Number of nodes to be used for this training job. The job is very small and has been tested with 1 and 2 nodes.
<SLURM_PARTITION>: The Slurm partition(s) to use for this job. Slurm partitions are defined by cluster administrators and designated for different purposes or accounts. Note that a partition with multi-node job and GPU support must be selected.
<IMAGE_NAME_OR_PATH>: The container image name or path to be used in Slurm with Pyxis/Enroot. This depends on the option in the previous section on container image build.
- If Option 1 is used, we can replace it with <YOUR_REGISTRY>/alma:24.01
- If Option 2 is used, we will replace it with our upload destination <SHARED_STORAGE_ROOT>/alma-24.01.sqsh

Once all settings are set, we can submit the job with the following command in the login node. In our test with a single 8-A100 node, the job will take about 30 minutes to complete a run.

cd <SHARED_STORAGE_ROOT>
sbatch alma-launcher.sh

4.1.8. Monitoring the Job#

Once the job is submitted, its status can be viewed with squeue. You can also choose to list only the jobs launched by yourself with the command squeue -u $USER. The output should look similar to the following.

squeue -u $USER

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
531544    polar2 alma-tra chenghan  R       0:43      1 batch-block2-1079

As designated in the batch script, the job logs can be found at <SHARED_STORAGE_ROOT>/alma-training-NNN where NNN is the job ID and .out and .err are for stdout and stderr, respectively. We can also use the command tail -f <filename> to view live updates of a log file. Since we added an option -l in the srun command, each log line is prepended with a task number, which can be helpful to identify the process source of log generation for each line in a multitask job. For example, if <N_NODES> is set to 2, there will be lines with prepended 0: and 1: in the log files.

4.1.9. Checking Multi-Node Function#

The test model is a LoRA-based fine-tuning with data parallelism in multi-node multi-GPU configuration. Given the log files in alma-training-NNN.err and alma-training-NNN.out, there are several ways to confirm its multi-node functionality and data-parallel fine-tuning. In alma-training-NNN.err, you can find a few information logs similar to this (the following example is generated with single-node fine-tuning).

[INFO|trainer.py:1777] 2024-03-12 01:36:28,535 >> ***** Running training *****
[INFO|trainer.py:1778] 2024-03-12 01:36:28,535 >>   Num examples = 15,406
[INFO|trainer.py:1779] 2024-03-12 01:36:28,535 >>   Num Epochs = 8
[INFO|trainer.py:1780] 2024-03-12 01:36:28,535 >>   Instantaneous batch size per device = 12
[INFO|trainer.py:1781] 2024-03-12 01:36:28,535 >>   Total train batch size (w. parallel, distributed & accumulation) = 960
[INFO|trainer.py:1782] 2024-03-12 01:36:28,535 >>   Gradient Accumulation steps = 10
[INFO|trainer.py:1783] 2024-03-12 01:36:28,535 >>   Total optimization steps = 128
[INFO|trainer.py:1784] 2024-03-12 01:36:28,537 >>   Number of trainable parameters = 7,733,248

In our example, each training step is formed with N nodes, 8 GPUs per node, instantaneous batch size of 12, and 10 gradient accumulation steps. The total batch size with parallel, distributed, and accumulation should be N81210=960N. In alma-training-NNN.err, you can also find a few progress bar-like logs, and the log line of the last training step will look like this. To identify the last training log line, we can check the number of total optimization steps (Ns=128 in our single-node test case) and find the progress indicator (Ns/Ns) before the training completion message.

100%|██████████| 128/128 [27:08<00:00, 12.71s/it][INFO|trainer.py:2044] 2024-03-12 02:03:37,121 >>

Training completed. Do not forget to share your model on huggingface.co/models =)



Breaking down the information we have:

100%: Percentage of training.
128/128: Number of completed steps/total steps
27:08<00:00: Elapsed time of training steps so far, which is 27 minutes and 8 seconds in this case. The later one will be an estimated time to complete.
12.71s/it: It indicates that each step takes 12.71 seconds on average to complete.

4.1.10. Reference Results#

Training information from the stderr log (please set the TASK_ID variable before running the command):

cat alma-training-NNN.err | grep "$TASK_ID: " | grep "trainer.py"

Node 0

TASK_ID=0
[INFO|trainer.py:1777] 2024-03-12 01:36:28,535 >> ***** Running training *****
[INFO|trainer.py:1778] 2024-03-12 01:36:28,535 >>   Num examples = 15,406
[INFO|trainer.py:1779] 2024-03-12 01:36:28,535 >>   Num Epochs = 8
[INFO|trainer.py:1780] 2024-03-12 01:36:28,535 >>   Instantaneous batch size per device = 12
[INFO|trainer.py:1781] 2024-03-12 01:36:28,535 >>   Total train batch size (w. parallel, distributed & accumulation) = 960
[INFO|trainer.py:1782] 2024-03-12 01:36:28,535 >>   Gradient Accumulation steps = 10
[INFO|trainer.py:1783] 2024-03-12 01:36:28,535 >>   Total optimization steps = 128
[INFO|trainer.py:1784] 2024-03-12 01:36:28,537 >>   Number of trainable parameters =  7,733,248

Node 1

TASK_ID=0
[INFO|trainer.py:1777] 2024-03-12 02:35:30,971 >> ***** Running training *****
[INFO|trainer.py:1778] 2024-03-12 02:35:30,971 >>   Num examples = 15,406
[INFO|trainer.py:1779] 2024-03-12 02:35:30,971 >>   Num Epochs = 8
[INFO|trainer.py:1780] 2024-03-12 02:35:30,971 >>   Instantaneous batch size per device = 12
[INFO|trainer.py:1781] 2024-03-12 02:35:30,971 >>   Total train batch size (w. parallel, distributed & accumulation) = 1,920
[INFO|trainer.py:1782] 2024-03-12 02:35:30,971 >>   Gradient Accumulation steps = 10
[INFO|trainer.py:1783] 2024-03-12 02:35:30,971 >>   Total optimization steps = 64
[INFO|trainer.py:1784] 2024-03-12 02:35:30,973 >>   Number of trainable parameters = 7,733,248

TASK_ID=1
[INFO|trainer.py:1777] 2024-03-12 02:35:30,181 >> ***** Running training *****
[INFO|trainer.py:1778] 2024-03-12 02:35:30,181 >>   Num examples = 15,406
[INFO|trainer.py:1779] 2024-03-12 02:35:30,181 >>   Num Epochs = 8
[INFO|trainer.py:1780] 2024-03-12 02:35:30,181 >>   Instantaneous batch size per device = 12
[INFO|trainer.py:1781] 2024-03-12 02:35:30,181 >>   Total train batch size (w. parallel, distributed & accumulation) = 1,920
[INFO|trainer.py:1782] 2024-03-12 02:35:30,181 >>   Gradient Accumulation steps = 10
[INFO|trainer.py:1783] 2024-03-12 02:35:30,181 >>   Total optimization steps = 64
[INFO|trainer.py:1784] 2024-03-12 02:35:30,183 >>   Number of trainable parameters = 7,733,248

Note the elapsed training time after the final step:

Node 0

100%|██████████| 128/128 [27:08<00:00, 12.72s/it]

Node 1

TASK_ID=0
100%|██████████| 64/64 [13:34<00:00, 12.72s/it]
TASK_ID=1
100%|██████████| 64/64 [13:34<00:00, 12.73s/it]

4.2. NeMo Framework on DGX Cloud#

4.2.1. Overview#

This guide provides a basic starting point for using NVIDIA’s framework for pre-training, fine-tuning, and deploying Large Language Models (LLMs), called NeMo Framework.

The document walks through using a DGX Cloud Slurm cluster as a user to launch a simple pretraining job, targeting synthetic data to minimize dependencies for initial use.

4.2.2. Prerequisites#

To follow this document, the following items are assumed to be true:

The user has a valid NGC key which can be generated by following these steps. Save this key for future steps during cluster setup.
A DGX Cloud Slurm cluster is provisioned and the user has access to launching and running jobs on the cluster (administrator permissions not necessary). More on cluster-specific requirements can be found later in the document.
The user has access to at least two A100 or H100-based compute nodes on the cluster.
The user has read/write access to at least 100GB of shared storage.
The user can install additional Python packages via pip on the login node (available by running module add python39 after logging in).
The user has the Slurm module configured as part of their account (typically configured by an admin at user creation time, but available by running module add slurm after logging in).

4.2.3. Setting Up Your Cluster Workspace#

Before testing can begin, a few steps need to be taken to properly configure the user workspace for interacting with NeMo Framework. The following sections assume you are connected to a login node via SSH as the user you intended to run jobs as.

4.2.3.1. Authenticating with NGC#

In order to pull the NeMo FW training container from NGC, the previously noted NGC API key needs to be added to a configuration file in the DGX Cloud Slurm cluster. Authorization can be provided by following the appropriate section in the DGX Cloud User Guide if not already completed.

4.2.3.2. Pulling the NeMo Framework repository#

The NeMo Framework which is used to launch data prep and training jobs is available on GitHub. The repository can be pulled directly from GitHub onto the login node. The location where the repository is cloned to needs to be on a Lustre filesystem, which will be accessible on all compute nodes.

This location is cluster dependent and should have been provided to you during onboarding to the cluster. Ask cluster admins for this information if not available.

Note: if your cluster was set up according to the DGX Cloud Admin Guide, your shared storage will be located at /lustre/fs0/scratch/<user-name>.

Navigate to your user’s Lustre filesystem directory. Next, clone the repository from GitHub. A git reset command is used to ensure a match with the code tested as part of this guide.

cd /lustre/fs0/scratch/<user-name>
git clone https://github.com/nvidia/nemo-megatron-launcher
cd nemo-megatron-launcher
git reset --hard 51df3f36f5bc51b7bbdc3f540a43b01cdc28c8be

4.2.3.3. Configuring NeMo Framework#

With the repository cloned, the Python dependencies used to launch the script need to be installed. To do so we need to load the python39 and slurm modules if not already loaded. DGX Cloud comes pre-configured with several modules relating to various applications like Python, Slurm, OpenMPI, etc. To load the python39 and slurm modules, run:

module add python39 slurm

You can also run the following, which will add these modules to your user profile automatically during future logins to the cluster.

module initadd python39 slurm

Install the Python dependencies for NeMo FW with the following. This assumes you are in the nemo-megatron-launcher directory that was cloned in a previous step.

pip3 install -r requirements.txt

NeMo Framework has a series of config files which are used to tailor training, fine-tuning, data prep, and more for your specific needs. The config files are located at launcher_scripts/conf inside the nemo-megatron-launcher directory. The main config file is launcher_scripts/conf/config.yaml which contains high-level configuration settings that will be used for all stages of NeMo FW. Open the config.yaml file and make the following edits:

Line 6: Change gpt3/5b to gpt3/7b_improved.
Line 32: Uncomment training. This indicates we want to run the training stage.
Line 33: Comment out conversion by adding # to the beginning of the line following the pattern of the other lines in this section. We do not want to run model conversion to convert the distributed checkpoint to the .nemo format at this time. For our performance and validation purposes we do not care about model conversion.
Line 44: Replace the ??? with the full path to the nemo-megatron-launcher/launcher_scripts directory. Again, this will be cluster dependent but must match the location of where the repository was cloned in the previous section. As an example, if your repository was cloned to /lustre/fs0/scratch/<user-name>/nemo-megatron-launcher you would enter /lustre/fs0/scratch/<user-name>/nemo-megatron-launcher/launcher_scripts. The path must end with launcher_scripts.
Line 48: replace null with /cm/shared.

In the env_vars section starting on line 56: DGX Cloud clusters need the following settings in order to use the compute network for optimal performance. Choose the tab that matches the environment you are using.

Azure A100

NCCL_TOPO_FILE: /cm/shared/etc/ndv4-topo.xml
UCX_IB_PCI_RELAXED_ORDERING: null
NCCL_IB_PCI_RELAXED_ORDERING: 1
NCCL_IB_TIMEOUT: null
NCCL_DEBUG: null
NCCL_PROTO: LL,LL128,Simple
TRANSFORMERS_OFFLINE: 0
TORCH_NCCL_AVOID_RECORD_STREAMS: 1
NCCL_NVLS_ENABLE: 0
NVTE_DP_AMAX_REDUCE_INTERVAL: 0
NVTE_ASYNC_AMAX_REDUCTION: 1
NVTE_FUSED_ATTN: 0
HYDRA_FULL_ERROR: 1
OMPI_MCA_coll_hcoll_enable: 0
UCX_TLS: rc
UCX_NET_DEVICES: mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
CUDA_DEVICE_ORDER: PCI_BUS_ID
NCCL_SOCKET_IFNAME: eth0
NCCL_ALGO: Tree,Ring,CollnetDirect,CollnetChain,NVLS
MELLANOX_VISIBLE_DEVICES: all
PMIX_MCA_gds: hash
PMIX_MCA_psec: native

OCI A100

NCCL_TOPO_FILE: null # Should be a path to an XML file describing the topology
UCX_IB_PCI_RELAXED_ORDERING: null # Needed to improve Azure performance
NCCL_IB_PCI_RELAXED_ORDERING: null # Needed to improve Azure performance
NCCL_IB_TIMEOUT: null # InfiniBand Verbs Timeout. Set to 22 for Azure
NCCL_DEBUG: null # Logging level for NCCL. Set to "INFO" for debug information
NCCL_PROTO: null # Protocol NCCL will use. Set to "simple" for AWS
TRANSFORMERS_OFFLINE: 0
TORCH_NCCL_AVOID_RECORD_STREAMS: 1
NCCL_NVLS_ENABLE: 0
NVTE_DP_AMAX_REDUCE_INTERVAL: 0 # Diable FP8 AMAX reduction in the data-parallel domain
NVTE_ASYNC_AMAX_REDUCTION: 1 # Enable asynchronous FP8 AMAX reduction
NVTE_FUSED_ATTN: 0 # Disable cudnn FA until we've tested it more

OCI H100

NCCL_TOPO_FILE: null # Should be a path to an XML file describing the topology
UCX_IB_PCI_RELAXED_ORDERING: null # Needed to improve Azure performance
NCCL_IB_PCI_RELAXED_ORDERING: null # Needed to improve Azure performance
NCCL_IB_TIMEOUT: null # InfiniBand Verbs Timeout. Set to 22 for Azure
NCCL_DEBUG: null # Logging level for NCCL. Set to "INFO" for debug information
NCCL_PROTO: null # Protocol NCCL will use. Set to "simple" for AWS
TRANSFORMERS_OFFLINE: 0
TORCH_NCCL_AVOID_RECORD_STREAMS: 1
NCCL_NVLS_ENABLE: 0
NVTE_DP_AMAX_REDUCE_INTERVAL: 0 # Diable FP8 AMAX reduction in the data-parallel domain
NVTE_ASYNC_AMAX_REDUCTION: 1 # Enable asynchronous FP8 AMAX reduction
NVTE_FUSED_ATTN: 0 # Disable cudnn FA until we've tested it more

Regardless of the cluster-specific settings to enable high-speed compute networking, set TRANSFORMERS_OFFLINE to 0 instead of 1. This will allow tokenizers to be downloaded from the internet if not found locally which is expected to be the case on new clusters.

In addition to the main config file, the cluster config might need to be updated. A DGX Cloud Slurm cluster with default configurations will not require modifications to this file. Specifically, this depends on how the cluster is configured and if there are any custom partitions or account names that need to be used. Open the cluster config file at launcher_scripts/conf/cluster/bcm.yaml. If the Slurm cluster has a custom non-default partition or account that jobs need to run on, specify those in the file on the account and partition lines.

4.2.4. Running a Training Job on Slurm Using Synthetic Data#

Similar to the core config file there are a few training-specific config changes that need to be made for this sample synthetic data-based training job.

4.2.4.1. Configuring the Training Job#

Next, we need to update some of the settings for the 7b_improved model. Open the model’s config file at launcher_scripts/conf/training/gpt3/7b_improved.yaml. Note that the launcher_scripts/conf/training/gpt3 directory contains all of the default configurations for the various model sizes that NVIDIA has validated. Make the following changes to the 7b_improved.yaml config file:

Line 8: Update this to a time_limit value of 0-02:00:00. This will end the test run at two hours.
Line 12: Update this to the number of nodes you want to test on. For example, if you have 4 nodes that you want to run this job on set this value to 4.
Line 20: Set the maximum number of steps to 2000. This is the number of steps the training will run for. The higher the number of steps, the more stable performance will be, though it will take longer to reach higher steps.
Line 21: Set the max_time value to 00:01:30:00. This will ensure the training run allows for 30 minutes of time in the overall training run to cleanly end the run and write a checkpoint to shared storage.
Line 23: Set the val_check_interval to 240. This is the number of steps the training job will take before running a validation pass. This number must be less than or equal to the maximum number of steps listed above.
Line 169: Change the data_impl value to from mmap to mock - this ensures synthetic data is generated and used.

4.2.4.2. Running the Training Job#

After updating the training config file, it’s time to launch the training job.

Navigate to the nemo-megatron-launcher/launcher_scripts directory and run:

python3 main.py

This will queue up a training job for the 7b_improved model once resources are available.

4.2.4.3. Monitoring the Training Job#

Once the training jobs are submitted, they can be viewed in the queue with squeue. The output should look similar to the following:

squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
181        defq nemo-meg demo-use  R       0:01      4 gpu[001-004]

The training logs of each job associated with a specific model can be found at nemo-megatron-launcher/launcher_scripts/results/<model name>/log-nemo-megatron-<model name>_NNN where NNN is the job number. There will be both a .out file and a .err file for the job’s stdout and stderr, respectively. For the sample we are running, <model name> will be gpt_7b_improved. To view the live progress, these output of these files can be followed with tail -f <filename>, which will display updates on the terminal as they are written to the file.

4.2.4.4. Interpreting Training Performance#

While the model is training, progress will be updated at the bottom of the log files. Open the log files and you should see a line that looks similar to this at the end:

Epoch 0: :   2%|▏         | 180/10050 [40:46<37:16:04, v_num=m9yd, reduced_train_loss=5.640, global_step=179.0, consumed_samples=92160.0, train_step_timing in s=13.40]

This is the training progress bar and status information. Breaking down the line we have:

Epoch 0: This indicates we are in the first epoch of the training dataset.
2%: The job is 2% through the maximum number of steps we wanted to train for.
40:46<37:16:04: Training has run for 40 minutes and 46 seconds so far and it is expected to take another 37 hours to finish.
v_num=m9yd: This is the validation loss but since validation passes are only done every 2,000 steps by default, this is an uninitialized value.
reduced_train_loss=5.640: This is the latest training loss. Ideally this should decrease over time while training.
global_step=179.0: This is the number of steps that have been registered in the results. It roughly matches the steps listed earlier.
consumed_samples=92160.0: This is the number of samples in the dataset that have been processed so far. If multiplied by the sequence length, typically 2048, it equals the total number of tokens the model has been trained on so far.
train_step_timing in s=13.40: This is the key information for determining training throughput. This indicates it takes 13.40 seconds for every step in the training pass. This can be used to measure and compare performance across clusters and models.

To collect the performance measurement, find the train_step_timing in s value at the end of the training log for each model. This value is used to determine the overall performance and can be labeled with seconds/iteration. The inverse, iterations/second, can also be useful.

4.3. Video Classification and ASR with Hugging Face Accelerate on DGX Cloud#

4.3.1. Overview#

This guide demonstrates how to enable distributed training of two different models with Hugging Face Accelerate. The first example is to fine-tune a video classification model with data processing and augmentation based on PyTorchVideo. The second one is quantized LoRA (QLoRA) fine-tuning of whisper-large-v2, a high-performance automatic speech recognition (ASR) model. These examples represent training throughput scalability in a Slurm cluster, and the importance of data pre-processing for efficient training.

4.3.2. Prerequisites#

Please check that the following prerequisites are met before using this guide.

A DGX Cloud Slurm cluster is provisioned and the user has access to launching and running jobs on the cluster (root permissions not necessary).
The user has access to at least two A100 or H100-based compute nodes on the cluster that can run jobs with high-speed multi-node compute networks.
The user has read/write access to at least 100GB of shared storage which is mounted and available on all nodes in the cluster and available within jobs. To identify the shared storage path, please consult with your cluster administrator. The path will be represented by <SHARED_STORAGE_ROOT> in this document. An example for <SHARED_STORAGE_ROOT> might be /lustre/fs0/scratch/demo-user.
The cluster has external internet access on all nodes to download datasets and pre-trained models.
The user has a valid Kaggle account and API token. This is required for our ASR example. To obtain a Kaggle API token, please refer to this Kaggle documentation.
The cluster login node has a Python environment available that allows users to install their own packages locally.

4.3.3. Preparing a Customized Container Image#

The Slurm implementation in your DGX Cloud cluster leverages Pyxis and Enroot as the container plugin and runtime. This section explains how to build a customized environment into a container image for use on BCM-based clusters with Slurm and Pyxis/Enroot. The build process should be completed in a local Linux environment. By building a containerized environment, we have a frozen environment to perform model training whenever possible without concerns about package compatibility.

4.3.3.1. Creating a Container Image in a Local Linux Machine#

To build a customized container environment for this workload, we need to install the Docker engine on the local machine, an open-source containerization technology for building and containerizing your applications. Once Docker is installed, we can pull container images and build a container environment to support the model training by providing a list of instructions on how to assemble an image. First, we create a Dockerfile with content as follows:

# Dockerfile
FROM nvcr.io/nvidia/pytorch:24.01-py3

RUN pip install lightning==2.2.1 \
    transformers==4.39.3 \
    evaluate==0.4.1 \
    accelerate==0.29.2 \
    jiwer==3.0.3 \
    bitsandbytes==0.43.1 \
    peft==0.10.0 \
    librosa==0.10.1 \
    datasets==2.18.0 \
    opendatasets==0.1.22 \
    gradio==4.26.0

RUN cd /opt && \
      git clone https://github.com/facebookresearch/pytorchvideo.git && \
      cd /opt/pytorchvideo && \
      git checkout 1fadaef40dd393ca09680f55582399f4679fc9b7 && \
      pip install -e .

In this Dockerfile, we select a PyTorch container image from NGC as the base image (nvcr.io/nvidia/pytorch:24.01-py3). We use pip to install the Hugging Face Transformers library and other dependencies. Next, we pull the PyTorch repository from GitHub and checkout to a fixed commit. Execute the following command in the same folder with the Dockerfile, and a container image pytorch-vidasr:24.01 will be created. Note that the “dot” in the command line is necessary.

docker build -t pytorch-vidasr:24.01 .

Once the container build is completed, we can execute the following command, and the container image will be listed and will look like this.

docker images | grep vidasr
REPOSITORY      TAG    IMAGE ID       CREATED          SIZE
pytorch-vidasr  24.01  45909445a2ee   11 minutes ago   22.7GB

4.3.3.2. Setting Up Your Cluster Workspace#

Before model training, we must prepare scripts, dataset, and the containerized environment as detailed in the previous step. Please refer to these details in the Cluster User Guide Setting Up NGC Integration regarding container and Enroot setup.

From here, we have two options to use this image in the Slurm cluster.

4.3.3.2.1. Option 1: Push to an Accessible Container Registry#

A container registry is a repository or repository collection to store container images. The NGC Catalog is an example of a container registry. If we have a container registry (denoted as <YOUR_REGISTRY>) with image upload permission that is accessible by the cluster, we can push the image to the registry, where Pyxis/Enroot in the Slurm cluster can pull the image from it. If the registry requires login and password authentication, execute docker login <YOUR_REGISTRY> and login with credentials first. Next, you can tag the image and push it to your registry. Note that you may need to tag the container image with a different name or additional repository path. For the sake of brevity, we use <YOUR_REGISTRY>/pytorch-vidasr:24.01 from now on.

docker tag pytorch-vidasr:24.01 <YOUR_REGISTRY>/pytorch-vidasr:24.01
docker push <YOUR_REGISTRY>/pytorch-vidasr:24.01

4.3.3.2.2. Option 2: Convert to a SquashFS File and Upload to the Slurm Cluster#

If uploading our container image to a container registry is unavailable, we will need to convert the image into a SquashFS file using NVIDIA Enroot. This tool turns traditional container/OS images into unprivileged sandboxes. First, we will check the current version of Enroot in the cluster. Login to the cluster login node and execute enroot version to confirm the current version of Enroot in the cluster. (3.4.1 as of March 11, 2024)

Next, follow the instructions here to install the Enroot version on your local machine that corresponds to the version in the cluster (to ensure compatability). Once installation is completed, we can convert the pytorch-vidasr:24.01 image to a SquashFS file pytorch-vidasr-24.01.sqsh on our local machine with the following command.

enroot import -o pytorch-vidasr-24.01.sqsh dockerd://pytorch-vidasr:24.01

Once the conversion is completed, we can upload the SquashFS file to the Slurm cluster using one of the methods described in the Cluster User Guide Moving Data from Your Local Workstation to DGX Cloud. Note that the final destination of the SquashFS file in the Slurm cluster must be in a shared file system so that all compute nodes can access it when the distributed workload is launched. For example, if scp is used to upload the SquashFS file here, we can execute the following command on the local machine:

scp pytorch-vidasr-24.01.sqsh \
      <USERNAME>@<LOGIN_NODE>:<SHARED_STORAGE_ROOT>/

where <USERNAME> is the user name in the cluster and <LOGIN_NODE> is the node address of a login node in the DGX Cloud cluster.

4.3.4. Running on Slurm#

This section covers cluster workspace preparation, Slurm batch script configuration, and checking multi-node training functionality.

4.3.5. Enabling Slurm Commands#

Both of the use cases require Slurm. If Slurm commands are not enabled yet, execute the following command.

module load slurm

4.3.6. Use Case 1: Fine-Tuning a Video Classification Model with Slurm#

Use SSH to access the cluster login node, where we will use the shell to execute various steps.

4.3.6.1. Workspace and Video Dataset Preparation#

We first create directories in the shared scratch space as our workspace with the following command.

mkdir -p <SHARED_STORAGE_ROOT>/videocls/hf_workspace

Next, we use a subset of the UCF101 dataset for a basic test, which can be downloaded from the Hugging Face dataset repository. We first install the huggingface_hub Python package in the login node so that we can use it to download a dataset later.

module load python3
pip install huggingface_hub

Now we can create a data preparation file with the contents below and save it to our workspace directory.

# <SHARED_STORAGE_ROOT>/videocls/data_prep.py
from huggingface_hub import hf_hub_download
import os
import pathlib

hf_dataset_identifier = "sayakpaul/ucf101-subset"
filename = "UCF101_subset.tar.gz"
file_path = hf_hub_download(repo_id=hf_dataset_identifier,
                filename=filename,
                repo_type="dataset")
os.system("tar xf %s" % file_path)

Next, we execute the script in our training folder.

cd <SHARED_STORAGE_ROOT>/videocls/hf_workspace
python ../data_prep.py

4.3.6.2. Training Script#

ViViT is a Transformer-based model for video classification from Google. It extracts spatio-temporal tokens from the input video and handles long sequences by factorizing the spatial and temporal dimensions of the input. This aspect makes it especially compelling for showcasing scaling of data distribution, as it can effectively handle very long video sequences. The Training script uses supervised fine-tuning on a pre-trained ViViT base model. The tuning is performed on a subset of the UCF101 dataset, including ten unique classes with 30 videos in each. The key steps in the script flow:

Preprocess and augment (scale, crop, flip, resize, subsample, etc.) the videos using the PyTorchVideo library.
Train the model using data-parallel distribution.

Save the following code to a script at <SHARED_STORAGE_ROOT>/videocls/hf_workspace/train_vivit.py.

# <SHARED_STORAGE_ROOT>/videocls/hf_workspace/train_vivit.py
import pathlib
import pytorchvideo.data
from pytorchvideo.transforms import (
      ApplyTransformToKey,
      Normalize,
      RandomShortSideScale,
      RemoveKey,
      ShortSideScale,
      UniformTemporalSubsample,
)

from torchvision.transforms import (
      Compose,
      Lambda,
      RandomCrop,
      RandomHorizontalFlip,
      Resize,
)

from transformers import VivitImageProcessor, VivitForVideoClassification
from transformers import TrainingArguments, Trainer
import evaluate
import torch
from torch.utils.data import SequentialSampler
import os
import numpy as np

from accelerate import Accelerator
from accelerate.data_loader import IterableDatasetShard

def preprocess_dataset(dataset_root_path, image_processor, num_frames_to_sample):
      mean = image_processor.image_mean
      std = image_processor.image_std
      if "shortest_edge" in image_processor.size:
            height = width = image_processor.size["shortest_edge"]
      else:
            height = image_processor.size["height"]
            width = image_processor.size["width"]
      resize_to = (height, width)

      sample_rate = 1
      fps = 30
      clip_duration = num_frames_to_sample * sample_rate / fps

      # Training dataset transformations
      train_transform = Compose(
                  [
                  ApplyTransformToKey(
                        key="video",
                        transform=Compose(
                              [
                              UniformTemporalSubsample(num_frames_to_sample),
                              Lambda(lambda x: x / 255.0),
                              Normalize(mean, std),
                              RandomShortSideScale(min_size=256, max_size=320),
                              RandomCrop(resize_to),
                              RandomHorizontalFlip(p=0.5),
                              ]
                              ),
                        ),
                  ]
                  )

      # Training dataset
      train_dataset = pytorchvideo.data.Ucf101(
                  data_path=os.path.join(dataset_root_path, "train"),
                  clip_sampler=pytorchvideo.data.make_clip_sampler("random", clip_duration),
                  decode_audio=False,
                  transform=train_transform,
                  )

      # Validation and evaluation datasets' transformations
      val_transform = Compose(
            [
                  ApplyTransformToKey(
                  key="video",
                  transform=Compose(
                        [
                              UniformTemporalSubsample(num_frames_to_sample),
                              Lambda(lambda x: x / 255.0),
                              Normalize(mean, std),
                              Resize(resize_to),
                              ]
                        ),
                  ),
                  ]
            )

      # Validation and evaluation datasets
      val_dataset = pytorchvideo.data.Ucf101(
                  data_path=os.path.join(dataset_root_path, "val"),
                  clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
                  decode_audio=False,
                  transform=val_transform,
                  )

      test_dataset = pytorchvideo.data.Ucf101(
                  data_path=os.path.join(dataset_root_path, "test"),
                  clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration),
                  decode_audio=False,
                  transform=val_transform,
                  )

      return train_dataset, val_dataset, test_dataset

accelerator = Accelerator()
print("Process ID: %d of %d" % (accelerator.process_index, accelerator.num_processes))
print("Available GPU devices: %d" % torch.cuda.device_count())

dataset_root_path = "UCF101_subset"
model_ckpt = "google/vivit-b-16x2-kinetics400" # pre-trained model from which to fine-tune
batch_size = 4 # Per-device batch size for training and evaluation

image_processor = VivitImageProcessor.from_pretrained(model_ckpt)

dataset_root_path = pathlib.Path(dataset_root_path)
video_count_train = len(list(dataset_root_path.glob("train/*/*.avi")))
video_count_val = len(list(dataset_root_path.glob("val/*/*.avi")))
video_count_test = len(list(dataset_root_path.glob("test/*/*.avi")))
video_total = video_count_train + video_count_val + video_count_test
print(f"Total videos: {video_total}")

all_video_file_paths = (
list(dataset_root_path.glob("train/*/*.avi"))
      + list(dataset_root_path.glob("val/*/*.avi"))
      + list(dataset_root_path.glob("test/*/*.avi"))
)

class_labels = sorted({str(path).split("/")[2] for path in all_video_file_paths})
label2id = {label: i for i, label in enumerate(class_labels)}
id2label = {i: label for label, i in label2id.items()}

print(f"Unique classes: {list(label2id.keys())}.")
model = VivitForVideoClassification.from_pretrained(
      model_ckpt,
      label2id=label2id,
      id2label=id2label,
      ignore_mismatched_sizes=True,  # provide this in order to fine-tune an already fine-tuned checkpoint
)
train_dataset, val_dataset, test_dataset = (
    preprocess_dataset(dataset_root_path, image_processor, model.config.num_frames)
    )

# Training setup

model_name = model_ckpt.split("/")[-1]
new_model_name = ("%s-finetuned-ucf101-subset-%s-n-%s-g-%d-b" %
                  (model_name, os.getenv("SLURM_NNODES"), os.getenv("SLURM_GPUS_PER_NODE"), batch_size))

args = TrainingArguments(
      new_model_name,
      remove_unused_columns=False,
      evaluation_strategy="epoch",
      save_strategy="epoch",
      save_on_each_node=False,
      learning_rate=5e-5,
      per_device_train_batch_size=batch_size,
      per_device_eval_batch_size=batch_size,
      warmup_ratio=0.1,
      logging_steps=10,
      load_best_model_at_end=True,
      metric_for_best_model="accuracy",
      push_to_hub=False,
      dataloader_num_workers=15, # Set it to 1 for single preprocess worker
      dataloader_prefetch_factor=64,
      max_steps=(train_dataset.num_videos // batch_size)*2,
)

# Next, we need to define a function for how to compute the metrics from the predictions,
# which will just use the metric we'll load now. The only preprocessing we have to do
# is to take the argmax of our predicted logits:
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
      """Computes accuracy on a batch of predictions."""
      predictions = np.argmax(eval_pred.predictions, axis=1)
      return metric.compute(predictions=predictions, references=eval_pred.label_ids)

def collate_fn(examples):
      """The collation function to be used by `Trainer` to prepare data batches."""
      # permute to (num_frames, num_channels, height, width)
      pixel_values = torch.stack(
            [example["video"].permute(1, 0, 2, 3) for example in examples]
      )
      labels = torch.tensor([example["label"] for example in examples])
      return {"pixel_values": pixel_values, "labels": labels}

trainer = Trainer(
      model,
      args,
      train_dataset=train_dataset,
      eval_dataset=val_dataset,
      tokenizer=image_processor,
      compute_metrics=compute_metrics,
      data_collator=collate_fn,
)

train_results = trainer.train()

trainer.save_model()
test_results = trainer.evaluate(test_dataset)
trainer.log_metrics("test", test_results)
trainer.save_metrics("test", test_results)
trainer.save_state()

Important variables are noted as follows.

dataloader_num_workers is set to 15 in our training arguments. Since we use a dataset with preprocessing and augmentation based on the PyTorchVideo library, more workers are required to enhance GPU utilization. We also set it to 1 and performed another test as a comparison.
We set a fixed value of max_steps to (train_dataset.num_videos / batch_size)*2, which is 150 in our case. Therefore, the completed training epoch will scale with the number of GPUs.

Note that the purpose of the provided script is to validate the data-parallel training function. To optimize for other datasets, developers can tune the training arguments and other parameters in the script as necessary.

4.3.6.3. Batch Submission Script#

Now we prepare the batch script with the content below and save it in our workspace folder (<SHARED_STORAGE_ROOT>/videocls/train-vivit-hf.sh).

#!/bin/bash

##SBATCH --job-name
##SBATCH --nodes
##SBATCH --gpus-per-node
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --time=01:00:00
#SBATCH --exclusive
#SBATCH --ntasks-per-node=1

# Environment variables added for DGX Cloud
export OMPI_MCA_coll_hcoll_enable=0
export UCX_TLS=tcp
export UCX_NET_DEVICES=eth0
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_TOPO_FILE=/cm/shared/etc/ndv4-topo.xml
export NCCL_PROTO=LL,LL128,Simple
export NCCL_ALGO=Tree,Ring,CollnetDirect,CollnetChain,NVLS
export MELLANOX_VISIBLE_DEVICES=all
export PMIX_MCA_gds=hash
export PMIX_MCA_psec=native

export SHARED_STORAGE_ROOT=<SHARED_STORAGE_ROOT>
export CONTAINER_WORKSPACE_MOUNT=$SHARED_STORAGE_ROOT/videocls/hf_workspace
export CONTAINER_IMAGE=$SHARED_STORAGE_ROOT/pytorch-vidasr-24.01.sqsh

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
export GPUS_PER_NODE=$SLURM_GPUS_PER_NODE
export NNODES=$SLURM_NNODES
export NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)
export MULTIGPU_FLAG="--multi_gpu"

if [ $NNODES == "1" ]
then
        export MULTIGPU_FLAG=""
fi

echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"

srun -l --container-image $CONTAINER_IMAGE \
        --container-mounts $CONTAINER_WORKSPACE_MOUNT:/workspace \
        --container-workdir /workspace \
        --no-container-mount-home \
        bash -c 'accelerate launch  --main_process_ip ${MASTER_ADDR} \
                                    --main_process_port ${MASTER_PORT} \
                                    --machine_rank $SLURM_NODEID \
                                    $MULTIGPU_FLAG \
                                    --same_network \
                                    --num_processes $NUM_PROCESSES \
                                    --num_cpu_threads_per_process 4 \
                                    --num_machines $NNODES train_vivit.py'

There are notable variables to be configured in this batch script for different settings of the Slurm account, Slurm partition, container image, and resource preference.

<SLURM_ACCOUNT_NAME>: The Slurm account to be used for your project. Please consult with the cluster administrator or project manager to determine which account name to use.
<SHARED_STORAGE_ROOT>: The root path of user shared storage defined in earlier sections.
<SLURM_PARTITION>: The Slurm partition(s) to use for this job. Slurm partitions are defined by cluster administrators and designated for different purposes or accounts. Note that a partition with multi-node job and GPU support must be selected.
<CONTAINER_IMAGE>: The container image name or path to be used in Slurm with Pyxis/Enroot. This depends on the option in the previous section on container image build.
- If Option 1 is used, we can replace it with <YOUR_REGISTRY>/pytorch-vidasr-24.01.sqsh
- If Option 2 is used, we will replace it with our upload destination <SHARED_STORAGE_ROOT>/pytorch-vidasr-24.01.sqsh

Several variables, such as job-name, nodes, and gpus-per-node, are not set directly in this batch script, but we can assign them in the submission commands to exercise different resource configurations. In this example, we start from one node with 1 GPU, scaling to 2, 4, and 8 GPUs, and two 8-GPU nodes. The commands to assign these variables are listed in the table below.

sbatch run configurations#
Number of nodes	GPUs per node	Job submission command (in the script folder <SHARED_STORAGE_ROOT>/videocls)
1	1	`sbatch --job-name=vivit-train-acc-n1g1b4 --nodes=1 --gpus-per-node=1 train-vivit-hf.sh`
1	2	`sbatch --job-name=vivit-train-acc-n1g2b4 --nodes=1 --gpus-per-node=2 train-vivit-hf.sh`
1	4	`sbatch --job-name=vivit-train-acc-n1g2b4 --nodes=1 --gpus-per-node=4 train-vivit-hf.sh`
1	8	`sbatch --job-name=vivit-train-acc-n1g2b4 --nodes=1 --gpus-per-node=8 train-vivit-hf.sh`
2	8	`sbatch --job-name=vivit-train-acc-n1g2b4 --nodes=2 --gpus-per-node=8 train-vivit-hf.sh`

4.3.6.4. Training Steps, Epochs, and Time#

We can retrieve training epoch information and elapsed training time of each job using the trainer_state.json in the model saving folder listed in the table below. The epoch number can be obtained by looking for the final epoch value in log_history section. Note that we only check the integer part of this value.

# trainer_state.json snippet for 1-Node, 1-GPU
{
  .....
  "log_history": [
    {
      "epoch": 0.07,
      "grad_norm": 14.361384391784668,
      "learning_rate": 3.3333333333333335e-05,
      "loss": 2.4146,
      "step": 10
    },
    {
      "epoch": 0.13,
      "grad_norm": 11.673120498657227,
      "learning_rate": 4.814814814814815e-05,
      "loss": 1.6931,
      "step": 20
    },
    {
      "epoch": 0.2,
      "grad_norm": 8.026626586914062,
      "learning_rate": 4.4444444444444447e-05,
      "loss": 1.106,
      "step": 30
    },
    .....
    {
      "epoch": 1.43,
      "grad_norm": 0.2679580748081207,
      "learning_rate": 3.7037037037037037e-06,
      "loss": 0.0442,
      "step": 140
    },
    {
      "epoch": 1.5,
      "grad_norm": 0.6468069553375244,
      "learning_rate": 0.0,
      "loss": 0.0889,
      "step": 150
    },
    {
      "epoch": 1.5,
      "eval_accuracy": 1.0,
      "eval_loss": 0.0483052060008049,
      "eval_runtime": 14.0177,
      "eval_samples_per_second": 10.629,
      "eval_steps_per_second": 2.711,
      "step": 150
    },
    {
      "epoch": 1.5,
      "step": 150,
      "total_flos": 1.537335139321774e+18,
      "train_loss": 0.5025620261828104,
      "train_runtime": 150.2643,
      "train_samples_per_second": 3.993,
      "train_steps_per_second": 0.998
    },
    {
      "epoch": 1.5,
      "eval_accuracy": 1.0,
      "eval_loss": 0.04270438104867935,
      "eval_runtime": 32.402,
      "eval_samples_per_second": 10.709,
      "eval_steps_per_second": 2.685,
      "step": 150
    }
  ],
  .....

To find the training time, we can look for several final entries in log_history. The value is recorded in train_runtime in a unit of seconds. We also use different process numbers of data loaders for comparison. With only one data loader process, data processing becomes the major bottleneck of training throughput even with single-GPU training. Using 15 processes yields significant performance gains with up to 4-GPU training.

Run configuration and training epochs#
Number of nodes	GPUs per node	Model save subfolder	Global batch size ($B_g$)	Integer part of the last epoch logging ($Ceil(150/Ceil(300/B_g))-1$)
1	1	vivit-b-16x2-kinetics400-finetuned-ucf101-subset-1-n-1-g-4-b/	4	1
1	2	vivit-b-16x2-kinetics400-finetuned-ucf101-subset-1-n-2-g-4-b/	8	3
1	4	vivit-b-16x2-kinetics400-finetuned-ucf101-subset-1-n-4-g-4-b/	16	7
1	8	vivit-b-16x2-kinetics400-finetuned-ucf101-subset-1-n-8-g-4-b/	32	14
2	8	vivit-b-16x2-kinetics400-finetuned-ucf101-subset-2-n-8-g-4-b/	64	29

Note that the subset of UCF101 dataset is a derivative of IterableDataset, in which case Hugging Face Accelerate will enable dispatch_batches mechanism by default for multi-GPU training. In other words, the designated number of data loader workers will handle all data processing and augmentation, then dispatch processed data to all GPUs. Further chunking of the dataset is recommended for practical use cases with a larger scale of dataset with more GPU resources.

Data loaders and training runtime#
Number of nodes	GPUs per node	Training runtime with 15 data loader processes (seconds)	Training runtime with 1 data loader process (seconds)
1	1	110.8	150.3
1	2	128.0	300.3
1	4	172.7	619.1
1	8	324.2	1154.0
2	8	623.1	2374.5

NOTE: Timings shown are for reference only.

4.3.7. Use Case 2: QLoRA Fine-Tuning of ASR with Slurm#

4.3.7.1. Workspace and ASR Dataset Preparation#

We create a new folder in the shared scratch space as our workspace for the ASR example with the following command.

mkdir -p <SHARED_STORAGE_ROOT>/asr

Next, we install the opendatasets Python package in a login node so that we can use it to download a dataset later.

module load python3
pip install opendatasets

Now we can download a dataset from Kaggle. A small dataset bengali-ai-asr-10k is used in our example for a quick test.

cd <SHARED_STORAGE_ROOT>/asr
python -c "import opendatasets as od;\
      od.download(\"https://www.kaggle.com/datasets/nbroad/bengali-ai-asr-10k\")"
# Depending on your local Kaggle API setup, a prompt appears for Kaggle user name and key
# For example
# Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
# Your Kaggle username:

4.3.7.2. Training Script#

The training script uses Hugging Face PEFT to tune Whisper on the Kaggle Bengali ASR dataset (1 GB). We will import the model in 8-bit and add the LoRA adapter. We will only retain the LoRA weights and train on a part of the training dataset. LoRA tuning keeps the original weights frozen and adapts the frozen weights by adding a low rank matrix to the original weights. We use a rank of size 16 for this script.

Whisper is a pre-trained model for ASR by OpenAI, and its architecture is a seq2seq model with an audio encoder and text decoder. The feature extractor turns the 1D audio signal into a log-mel spectrogram, while the encoder creates hidden states which are passed to the decoder to generate text. Unlike its predecessor ASR model, Whisper was pre-trained on a vast quantity of labeled audio transcription data (Wav2Vec2.0 was pre-trained on unlabeled data). Bengali is a good use case as based on the Whisper paper since Whisper wasn’t trained on much Bengali data. The key preprocessing step is the unique data collator, which dynamically pads all audio samples such that they have an identical input length of 30 seconds. The script can easily be modified to support larger ASR datasets to showcase further scaling of data distributed training (for example, the Bengali ASR 80GB or the librispeech-clean 30GB dataset).

Save the following script to the path <SHARED_STORAGE_ROOT>/asr/qlora-asr.py. Note that for the purpose of a quick test, we only run one training epoch and parquet in our job to observe multi-GPU scalability.

# <SHARED_STORAGE_ROOT>/asr/qlora-asr.py
from transformers import WhisperFeatureExtractor
from transformers import WhisperTokenizer
from transformers import WhisperProcessor
from transformers import WhisperForConditionalGeneration, BitsAndBytesConfig
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer, TrainerCallback, TrainingArguments, TrainerState, TrainerControl
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from peft import prepare_model_for_kbit_training
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union
from peft import LoraConfig, PeftModel, LoraModel, LoraConfig, get_peft_model
import datasets
from datasets import DatasetDict, load_dataset
from pathlib import Path
import opendatasets as od
import os
import pandas
import evaluate

from accelerate import Accelerator, DistributedDataParallelKwargs

def make_inputs_require_grad(module, input, output):
    output.requires_grad_(True)

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor ([`WhisperProcessor`])
            The processor used for processing the data.
        decoder_start_token_id (`int`)
            The begin-of-sentence of the decoder.
        forward_attention_mask (`bool`)
            Whether to return attention_mask.
    """

    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        model_input_name = self.processor.model_input_names[0]
        input_features = [
            {model_input_name: feature[model_input_name]} for feature in features
        ]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.feature_extractor.pad(
            input_features, return_tensors="pt"
        )

        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's append later anyways
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}")

        peft_model_path = os.path.join(checkpoint_folder, "adapter_model")
        kwargs["model"].save_pretrained(peft_model_path)

        pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        if os.path.exists(pytorch_model_path):
            os.remove(pytorch_model_path)
        return control

ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
device_index = Accelerator(kwargs_handlers=[ddp_kwargs]).local_process_index

device_map = {"": device_index}

model_name_or_path = "openai/whisper-large-v2"
task = "transcribe"
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name_or_path)
tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, language='bn', task=task)
processor = WhisperProcessor.from_pretrained(model_name_or_path, language='bn', task=task)

data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
)

metric = evaluate.load("wer")
model = (WhisperForConditionalGeneration.from_pretrained(model_name_or_path,
          quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map=device_map))
print(model.hf_device_map)
model = prepare_model_for_kbit_training(model)
model.model.encoder.conv1.register_forward_hook(make_inputs_require_grad)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

vectorized_datasets = DatasetDict()
train_data_dir = os.getenv("ASR_DATASETS")
validation_data_dir = os.getenv("ASR_DATASETS")

train_files = list(map(str, Path(train_data_dir).glob("train*.parquet")))
vectorized_datasets["train"] = load_dataset("parquet", data_files=train_files[:1], split="train")

eval_files = list(map(str, Path(validation_data_dir).glob("eval*.parquet")))
vectorized_datasets["eval"] = load_dataset(
    "parquet", data_files=eval_files, split="train"
)

training_args = Seq2SeqTrainingArguments(
    # change to a repo name of your choice
    output_dir="lora/%s-%s-n-%s-g" % (train_data_dir, os.getenv("SLURM_NNODES"), os.getenv("SLURM_GPUS_PER_NODE")),
    report_to="none", ### comment this out to login to wandb
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    ddp_find_unused_parameters=False,
    learning_rate=1e-5,
    warmup_steps=50,
    num_train_epochs=1,
    evaluation_strategy="steps",
    fp16=True,
    gradient_checkpointing_kwargs={'use_reentrant':False},
    per_device_eval_batch_size=8,
    logging_steps=250,
    # required as the PeftModel forward doesn't have the signature of the wrapped model's forward
    remove_unused_columns=False,
    label_names=["labels"],  # same reason as above
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=vectorized_datasets["train"],
    eval_dataset=vectorized_datasets["eval"],
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    callbacks=[SavePeftModelCallback],

)
model.config.use_cache = False

trainer.train()
trainer.save_model()

4.3.7.3. Batch Submission Script#

Now prepare a batch script with content shown below and save it in our workspace folder (<SHARED_STORAGE_ROOT>/asr/train-whisper-qlora.sh)

#!/bin/bash

##SBATCH --job-name=asr
##SBATCH --nodes=2
##SBATCH --gpus-per-node=8
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --time=01:00:00
#SBATCH --exclusive
#SBATCH --ntasks-per-node=1

export SHARED_STORAGE_ROOT=<SHARED_STORAGE_ROOT>
export CONTAINER_WORKSPACE_MOUNT=$SHARED_STORAGE_ROOT/asr
export CONTAINER_IMAGE=$SHARED_STORAGE_ROOT/pytorch-vidasr-24.01.sqsh

export ASR_DATASETS=bengali-ai-asr-10k

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1 ) + 30000 ))
export GPUS_PER_NODE=$SLURM_GPUS_PER_NODE
export NNODES=$SLURM_NNODES
export NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)
export MULTIGPU_FLAG="--multi_gpu"

if [ $NNODES == "1" ]
then
        export MULTIGPU_FLAG=""
fi

echo "MASTER_ADDR: $MASTER_ADDR"
echo "MASTER_PORT: $MASTER_PORT"
echo "Using $NNODES nodes, $NUM_PROCESSES GPUs total"

srun -l --container-image $CONTAINER_IMAGE \
        --container-mounts /cm/shared/etc:/cm/shared/etc,$CONTAINER_WORKSPACE_MOUNT:/workspace \
        --container-workdir /workspace \
        --no-container-mount-home \
        bash -c 'accelerate launch      --main_process_ip ${MASTER_ADDR} \
                                        --main_process_port ${MASTER_PORT} \
                                        --machine_rank $SLURM_NODEID \
                                        $MULTIGPU_FLAG \
                                        --same_network \
                                        --num_processes $NUM_PROCESSES \
                                        --num_cpu_threads_per_process 4 \
                                        --num_machines $NNODES qlora-asr.py'

As earlier in the Video ASR example, we need to configure some parts of the batch script for the Slurm environment.

<SLURM_ACCOUNT>: The Slurm account to be selected for the job.
<SLURM_PARTITION>: The Slurm partition to submit for the job.
<SHARED_STORAGE_ROOT>: The root of the user shared scratch space.

However, we leave the options for job-name, nodes, and gpus-per-node unset in this script and assign them by passing them as arguments to the sbatch command as shown in the following table.

Job submission configurations#
Number of nodes	GPUs per node	Job submission command (In the script folder `<SHARED_STORAGE_ROOT>/asr`)
1	1	`sbatch --job-name=asr-n1g1 --nodes=1 --gpus-per-node=1 train-whisper-qlora.sh`
1	2	`sbatch --job-name=asr-n1g2 --nodes=1 --gpus-per-node=2 train-whisper-qlora.sh`
1	4	`sbatch --job-name=asr-n1g4 --nodes=1 --gpus-per-node=4 train-whisper-qlora.sh`
1	8	`sbatch --job-name=asr-n1g8 --nodes=1 --gpus-per-node=8 train-whisper-qlora.sh`
2	8	`sbatch --job-name=asr-n2g8 --nodes=2 --gpus-per-node=8 train-whisper-qlora.sh`

4.3.7.4. Training Steps, Epochs, and Time#

The number of training steps will be inversely proportional to the number of GPUs (round to ceiling) with a fixed number of training epochs. Run the following command in the workspace folder to get the last section of the stderr of each job to obtain the training time and the number of training steps for each job.

cd <SHARED_STORAGE_ROOT>/asr
tail <job-name>_<job_id>.err

where <job-name> and <job_id> are the designated job name from the sbatch command and the job ID given by Slurm when submitted. The last line of the file should have a progress bar of 100% completion. The following is a result from a single-node, single-GPU job as an example.

100%|██████████| 125/125 [09:22<00:00,  4.50s/it]

125/125: This is the completed/total number of training steps.
09:22<00:00: The left number is the elapsed time, and the right number (right of the < symbol) is the estimated remaining training time. This example shows the final total elapsed time with no remaining training time.

The number of training steps and reference training timings are in the table below.

Job submission configurations and timings#
Number of nodes	GPUs per node	Number of steps with 1 training epoch	Elapsed time of 1 training epoch (seconds)
1	1	125	371
1	2	63	183
1	4	32	93
1	8	16	49
2	8	8	25