Confirming ARE is Operational

To confirm that ARE is working, run the two short bash scripts below. Each one exercises one of ARE's auto-resume modes: singleton dependency and requeue.

Singleton Dependency

#!/bin/bash

#SBATCH -t 00:05:00
#SBATCH --dependency=singleton
#SBATCH --comment='{"APS": {"auto_resume_mode": "singleton_dependency"}}'

DATETIME=$(date +'date_%y-%m-%d_time_%H-%M-%S')
echo 'ARE light-weight fault simulation test'

# This srun command sets the output file for the job step. The job step sleeps for 30 seconds, prints a
# log line that ARE recognizes as a segmentation fault, then sleeps for another 60 seconds before exiting with status 1.

srun --output="$(pwd)/%x_%j_$DATETIME.log" bash -c \
     "echo 'Start time: $DATETIME' && echo 'Sleeping 30 seconds' && \
     sleep 30 && echo 'Rank0: (1) segmentation fault: artificial segfault' && \
     echo 'Sleep another 60 seconds' && \
     sleep 60 && exit 1"
  1. Copy the script above and save it to a file on the Slurm cluster, e.g. sbatch_test.sh.

  2. Submit two jobs to Slurm with the script. Specify the job name (-J option) and the partition name (-p option) on the command line, because neither is set in the script. Both jobs must use the same job name, since Slurm's singleton dependency is keyed on job name and user. Submit the second job with the --hold option so that it is queued in a held state.

sbatch -p defq -J ARE-verification-test sbatch_test.sh

sbatch --hold -p defq -J ARE-verification-test sbatch_test.sh
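
Once the first job has run, one way to check that the simulated fault was actually written is to search the job-step logs for it. The following is a minimal sketch, not part of ARE or Slurm: the function name find_fault_logs is hypothetical, and it assumes the logs sit in the submission directory and follow the %x_%j_$DATETIME.log pattern set by the srun --output option in the script above.

```shell
#!/bin/bash

# Hypothetical helper (not part of ARE or Slurm): print the job-step logs for
# a given job name that contain the simulated fault line.
# Assumes log names follow "<job name>_<job id>_<timestamp>.log", as set by
# the srun --output pattern in the test script.
find_fault_logs() {
    local job_name="$1"
    local log_dir="${2:-.}"    # defaults to the current directory
    local log
    for log in "$log_dir/${job_name}"_*.log; do
        [ -e "$log" ] || continue    # skip when the glob matches nothing
        if grep -q 'segmentation fault' "$log"; then
            echo "$log"
        fi
    done
}

# Example: find_fault_logs ARE-verification-test "$(pwd)"
```

While the jobs are queued or running, squeue --name=ARE-verification-test lists both of them along with their state.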

Requeue

#!/bin/bash

#SBATCH -t 00:05:00
#SBATCH --comment='{"APS": {"auto_resume_mode": "requeue", "max_requeue_times": 1}}'

DATETIME=$(date +'date_%y-%m-%d_time_%H-%M-%S')
echo 'ARE light-weight fault simulation test (auto resume mode: requeue)'

# This srun command sets the output file for the job step. The job step sleeps for 30 seconds, prints a
# log line that ARE recognizes as a segmentation fault, then sleeps for another 60 seconds before exiting with status 1.

srun --output="$(pwd)/%x_%j_$DATETIME.log" bash -c \
     "echo 'Start time: $DATETIME' && echo 'Sleeping 30 seconds' && \
     sleep 30 && echo 'Rank0: (1) segmentation fault: artificial segfault' && \
     echo 'Sleep another 60 seconds' && \
     sleep 60 && exit 1"
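
No submission steps are listed for this mode, so the following is only a sketch under stated assumptions: the script is saved as sbatch_requeue_test.sh, the job is named ARE-requeue-test (both names are hypothetical), and the defq partition from the previous test is reused. The helper restart_count is also hypothetical; it reads the Restarts= field that scontrol show job prints. Whether a requeue triggered by ARE is reflected in that field is an assumption to verify on your cluster.

```shell
#!/bin/bash

# Hypothetical helper (not part of ARE): read `scontrol show job` output on
# stdin and print the value of its Restarts= field.
restart_count() {
    grep -o 'Restarts=[0-9]*' | head -n 1 | cut -d= -f2
}

# Possible usage on a Slurm login node (file and job names are assumptions):
#   job_id=$(sbatch --parsable -p defq -J ARE-requeue-test sbatch_requeue_test.sh)
#   scontrol show job "${job_id%%;*}" | restart_count
# --parsable makes sbatch print only the job ID (optionally followed by
# ";cluster"), which the ${job_id%%;*} expansion strips off.
```

With max_requeue_times set to 1 in the script, a count that goes from 0 to 1 after the simulated fault would be consistent with ARE having requeued the job once.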