BioNeMo Training Scripts for SLURM
This guide provides example SLURM scripts for running BioNeMo training jobs on HPC systems. The scripts are configured for the DGX Cloud SLURM environment. While some settings are specific to our cluster resources, these scripts can be adapted to other HPC environments.
For a list of general SLURM commands, check out the SLURM Quick Start User Guide.
Example SBATCH Script
Below is a sample SLURM batch script that runs BioNeMo Evo 2 training in a containerized environment. The script demonstrates how to:
- Configure SLURM job parameters (nodes, partitions, etc.)
- Set up training hyperparameters
- Mount data and model directories
- Launch distributed training
We'll walk through each section of the script below.
#!/bin/bash
# SLURM directives
# =========================
#SBATCH --account=[INSERT ACCOUNT HERE]
#SBATCH --nodes=1
#SBATCH --partition=[INSERT PARTITION HERE]
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:15:00
#SBATCH --mem=0
#SBATCH --job-name=[INSERT JOB NAME HERE]
#SBATCH --mail-type=FAIL
#SBATCH --exclusive
set -x # Enable debugging (prints executed commands)
# Job-specific parameters
# =========================
IMAGE_NAME=[INSERT IMAGE NAME HERE]
EXPERIMENT_NAME=[INSERT PROJECT NAME HERE]
MODEL_SIZE=7b
CP_SIZE=1
TP_SIZE=1
PP_SIZE=1
MICRO_BATCH_SIZE=2
GRAD_ACC_BATCHES=1
SEQ_LEN=8192
MAX_STEPS=100
VAL_CHECK=50
CLIP_GRAD=250
EXTRA_ARGS="--enable-preemption --use-megatron-comm-overlap-llama3-8k --ckpt-async-save --overlap-grad-reduce --clip-grad $CLIP_GRAD --eod-pad-in-loss-mask"
EXTRA_ARG_DESC="BF16_perf_cg250_continue"
LR=0.0003
MIN_LR=0.00003
WU_STEPS=2500
# Random seed for reproducibility
SEED=1234
WD=0.1
ADO=0.01
HDO=0.01
# Mounts
# =========================
DATA_PATH=/lustre/.../[INSERT DATA PATH HERE]
DATA_MOUNT=/workspace/bionemo2/data
MODEL_PATH=/lustre/.../[INSERT MODEL PATH HERE]
MODEL_MOUNT=/workspace/bionemo2/model
RESULTS_PATH=$MODEL_PATH/experiments/${EXPERIMENT_NAME}
mkdir -p $RESULTS_PATH
MOUNTS=${DATA_PATH}:${DATA_MOUNT},${MODEL_PATH}:${MODEL_MOUNT},$HOME/.cache:/root/.cache
# Training command
# =========================
read -r -d '' COMMAND <<EOF
echo "*******STARTING********" \
&& echo "---------------" \
&& echo "Starting training" \
&& \
python /workspace/bionemo2/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py \
-d /workspace/bionemo2/sub-packages/bionemo-evo2/examples/configs/full_pretrain_shortphase_config.yaml \
--num-nodes=${SLURM_JOB_NUM_NODES} \
--devices=${SLURM_NTASKS_PER_NODE} \
--grad-acc-batches $GRAD_ACC_BATCHES \
--max-steps=$MAX_STEPS \
--seed $SEED \
${EXTRA_ARGS} \
--no-wandb \
--lr $LR \
--wd $WD \
--min-lr $MIN_LR \
--warmup-steps $WU_STEPS \
--attention-dropout $ADO \
--hidden-dropout $HDO \
--limit-val-batches=20 \
--val-check-interval=${VAL_CHECK} \
--experiment-dir=/workspace/bionemo2/model/checkpoints/${EXPERIMENT_NAME} \
--seq-length=${SEQ_LEN} \
--tensor-parallel-size=${TP_SIZE} \
--context-parallel-size=${CP_SIZE} \
--pipeline-model-parallel-size=${PP_SIZE} \
--workers 8 \
--micro-batch-size=${MICRO_BATCH_SIZE} \
--model-size=${MODEL_SIZE}
EOF
srun \
--output ${RESULTS_PATH}/slurm-%j.out \
--error ${RESULTS_PATH}/error-%j.out \
--container-image=$IMAGE_NAME \
--container-mounts ${MOUNTS} \
bash -c "${COMMAND}"
set +x # Disable debugging
SLURM Directives (Resource Allocation)
After the shebang line, you'll need to add some #SBATCH directives to define how SLURM manages the job.
Some of these are user-specific or project-specific, so you'll need to ask your sysadmin for the correct values.
| Directive | Description |
|---|---|
| #SBATCH --account=[INSERT ACCOUNT HERE] | Specifies the SLURM account for billing. |
| #SBATCH --nodes=1 | Requests 1 compute node. |
| #SBATCH --partition=[INSERT PARTITION HERE] | Specifies the job queue (partition). |
| #SBATCH --ntasks-per-node=8 | Requests 8 tasks per node (often maps to GPUs/CPUs). |
| #SBATCH --time=00:15:00 | Limits execution time to 15 minutes. |
| #SBATCH --mem=0 | Grants the job all available memory on each node. |
| #SBATCH --job-name=[INSERT JOB NAME HERE] | Names the job for tracking. |
| #SBATCH --mail-type=FAIL | Sends an email if the job fails. |
| #SBATCH --exclusive | Ensures the job has exclusive access to the node. |
Tip: You can check partition limits and node availability using sinfo.
These directives must be specified before the job-specific parameters below. SLURM exposes their values to our script through SLURM-provided variables, e.g. ${SLURM_JOB_NUM_NODES}.
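For example, to see each partition, its time limit, node count, and node states (partition names will vary by cluster), you can run:
# Show partitions with their time limits, node counts, and node states
sinfo -o "%P %l %D %t"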
Job-Specific Parameters
These environment variables configure training, and their values are entirely up to you; here we set the parameters that the training command below will use.
It's not strictly necessary to define them as variables, since you could pass the values directly to the training command, but using variables makes the script more readable and easier to modify.
IMAGE_NAME=[INSERT IMAGE NAME HERE]
EXPERIMENT_NAME=[INSERT PROJECT NAME HERE]
SEQ_LEN=8192
MAX_STEPS=100
...
ADO=0.01
HDO=0.01
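If you want to change these values at submission time without editing the script, one option (not part of the script above, just a sketch) is to give each variable a default via bash parameter expansion and pass overrides through sbatch's --export flag:
# In the script, fall back to a default only when the variable isn't already set
MAX_STEPS=${MAX_STEPS:-100}
LR=${LR:-0.0003}
# At submission time, export overrides (values here are only examples)
sbatch --export=ALL,MAX_STEPS=500,LR=0.0001 my_training_script.sbatch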
Mounts
SLURM jobs run in an HPC environment where filesystems are often shared across nodes.
In our case, data and models are stored on the Lustre filesystem, so you'll want to mount the correct Lustre path to the correct container path.
The Lustre filesystem is available under the /lustre directory on the host, while the container's working directories live under /workspace.
To specify a mount, give the Lustre (host) path followed by the container path, separated by a colon (:). You can specify multiple mounts by separating them with commas.
Once again, we define the paths as variables for readability and ease of modification.
DATA_PATH=/lustre/.../[INSERT DATA PATH HERE]
DATA_MOUNT=/workspace/bionemo2/data
MODEL_PATH=/lustre/.../[INSERT MODEL PATH HERE]
MODEL_MOUNT=/workspace/bionemo2/model
RESULTS_PATH=$MODEL_PATH/experiments/${EXPERIMENT_NAME}
mkdir -p $RESULTS_PATH
MOUNTS=${DATA_PATH}:${DATA_MOUNT},${MODEL_PATH}:${MODEL_MOUNT},$HOME/.cache:/root/.cache
Note that Lustre paths differ between clusters (for example, EOS and ORD), so double-check that you're mounting the correct Lustre path for the cluster you're on.
Pay special attention to RESULTS_PATH, as this will be the location of your experiment results.
It's common to make EXPERIMENT_NAME specific to the parameters of the experiment you're running, e.g.
EXPERIMENT_NAME=EVO2_SEQLEN${SEQ_LEN}_PP${PP_SIZE}_TP${TP_SIZE}_CP${CP_SIZE}_LR${LR}_MINLR${MIN_LR}_WU${WU_STEPS}_GA${GRAD_ACC_BATCHES}_...
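Before launching a long run, it can be worth a quick sanity check that the mounts resolve to the data you expect inside the container. A minimal, optional check that reuses the same container flags as the training step below might look like:
# Optional: list the mounted data and model directories inside the container
srun --container-image=$IMAGE_NAME \
  --container-mounts ${MOUNTS} \
  bash -c "ls ${DATA_MOUNT} && ls ${MODEL_MOUNT}"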
Training Command Execution
After setting up all parameters and mounts, the training script is launched within the SLURM job using a compound command. This command string, stored in the COMMAND variable, calls the Python training script with all the environment-specific arguments and hyperparameters defined earlier.
python /workspace/bionemo2/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py \
-d /workspace/bionemo2/sub-packages/bionemo-evo2/examples/configs/full_pretrain_shortphase_config.yaml \
--num-nodes=${SLURM_JOB_NUM_NODES} \
--devices=${SLURM_NTASKS_PER_NODE} \
--grad-acc-batches $GRAD_ACC_BATCHES \
--max-steps=$MAX_STEPS \
--seed $SEED \
...
--model-size=${MODEL_SIZE}
EOF
In the command above, we invoke the training script with all the environment-specific arguments and hyperparameters defined earlier, including the SLURM-provided variables and our own parameters such as gradient accumulation, learning rate, etc.
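Because the heredoc delimiter (EOF) is unquoted, the variables are expanded when COMMAND is assembled, so if you want to verify exactly what will run you can optionally print it before the srun call:
# Optional: print the fully expanded training command for debugging
echo "${COMMAND}"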
The command is executed in a containerized environment to ensure that dependencies and runtime conditions remain consistent. This is accomplished by using the srun command with container options:
srun --output ${RESULTS_PATH}/slurm-%j.out \
--error ${RESULTS_PATH}/error-%j.out \
--container-image=$IMAGE_NAME \
--container-mounts ${MOUNTS} \
bash -c "${COMMAND}"
- srun initiates the job across the allocated resources.
- The --container-image flag ensures that the job runs with the correct environment. Here you can point to an image or a .sqsh file.
- The --container-mounts flag maps the Lustre filesystem directories into the container's workspace, as specified above, ensuring that data, models, and results are accessible.
- Log redirection: the --output and --error flags redirect standard output and error messages to dedicated log files (slurm-%j.out and error-%j.out) under the results directory.
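For debugging, it can also be handy to open an interactive shell in the same container with the same mounts, assuming your cluster's srun supports --pty with the container plugin used above:
# Optional: interactive shell inside the training container for debugging
srun --pty \
  --container-image=$IMAGE_NAME \
  --container-mounts ${MOUNTS} \
  bash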
To access logs, you can use the following commands:
# Standard output logs
cat ${RESULTS_PATH}/slurm-<JOB_ID>.out
# Error logs
cat ${RESULTS_PATH}/error-<JOB_ID>.out
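While the job is still running, you can also follow the output live:
# Follow the standard output of a running job
tail -f ${RESULTS_PATH}/slurm-<JOB_ID>.out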
Using the SBATCH Script
To kick off the job, submit the script using sbatch:
sbatch my_training_script.sbatch
This will submit the job to the SLURM scheduler, and you can check its status using squeue:
squeue -u $USER
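If you need to stop a job, or inspect one after it finishes, scancel and sacct cover those cases:
# Cancel a queued or running job
scancel <JOB_ID>
# Check the state, runtime, and exit code of a finished job
sacct -j <JOB_ID> --format=JobID,JobName,State,Elapsed,ExitCode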