Confirming ARE is Operational
To confirm that ARE is working, run simple bash scripts that exercise both of its auto-resume modes: singleton dependency and requeue.
Singleton Dependency
#!/bin/bash
#SBATCH -t 00:05:00
#SBATCH --dependency=singleton
#SBATCH --comment='{"APS": {"auto_resume_mode": "singleton_dependency"}}'
DATETIME=$(date +'date_%y-%m-%d_time_%H-%M-%S')
echo 'ARE light-weight fault simulation test'
# This srun command sets the output file for the job step. The job step sleeps for 30s, then prints a
# log line that ARE recognizes as a segmentation fault, then sleeps for another 60s before exiting with code 1.
srun --output="$(pwd)/%x_%j_$DATETIME.log" bash -c \
"echo 'Start time: $DATETIME' && echo 'Sleeping 30 seconds' && \
sleep 30 && echo 'Rank0: (1) segmentation fault: artificial segfault' && \
echo 'Sleep another 60 seconds' && \
sleep 60 && exit 1"
Copy the bash script above and save it to a file on the Slurm cluster, e.g. sbatch_test.sh.
Submit two jobs to Slurm with the sbatch script. You must specify the job name (with the -J option) and the partition name (with the -p option), since they are not set in the script. Use the same job name for both jobs, because a singleton dependency applies only to jobs that share a job name and user. Also make sure the second command is run with the --hold option.
sbatch -p defq -J ARE-verification-test sbatch_test.sh
sbatch --hold -p defq -J ARE-verification-test sbatch_test.sh
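Once the first job has been running for about 30 seconds, its job-step log should contain the simulated fault line. The following is a minimal sketch for checking this, not part of ARE itself. The helper name has_simulated_fault is hypothetical, and the sketch assumes the logs are in the current directory under the %x_%j_... naming pattern set by the srun command above.

```shell
#!/bin/bash
# Hypothetical helper (not part of ARE): succeed if a job-step log
# contains the simulated fault line printed by the test script.
has_simulated_fault() {
    local log_file="$1"
    grep -q 'segmentation fault: artificial segfault' "$log_file"
}

# Check every log written by the verification jobs in the current directory.
# The file name prefix assumes the job name ARE-verification-test used above.
for log in ARE-verification-test_*.log; do
    [ -e "$log" ] || continue   # skip when the glob matches no files
    if has_simulated_fault "$log"; then
        echo "$log: simulated fault line found"
    else
        echo "$log: simulated fault line not found yet"
    fi
done
```

To see the state of both jobs while the test runs, squeue --name=ARE-verification-test lists the jobs submitted under that job name.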
Requeue
#!/bin/bash
#SBATCH -t 00:05:00
#SBATCH --comment='{"APS": {"auto_resume_mode": "requeue", "max_requeue_times": 1}}'
DATETIME=$(date +'date_%y-%m-%d_time_%H-%M-%S')
echo 'ARE light-weight fault simulation test (auto resume mode: requeue)'
# This srun command sets the output file for the job step. The job step sleeps for 30s, then prints a
# log line that ARE recognizes as a segmentation fault, then sleeps for another 60s before exiting with code 1.
srun --output="$(pwd)/%x_%j_$DATETIME.log" bash -c \
"echo 'Start time: $DATETIME' && echo 'Sleeping 30 seconds' && \
sleep 30 && echo 'Rank0: (1) segmentation fault: artificial segfault' && \
echo 'Sleep another 60 seconds' && \
sleep 60 && exit 1"
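One way to check whether the job was requeued is to read the Restarts field that scontrol show job reports for a job. The sketch below is an assumption-laden example, not a documented ARE procedure. The helper name restart_count, the file name sbatch_requeue_test.sh, and the job name ARE-requeue-test are hypothetical, and the submission line assumes the requeue script is submitted the same way as the singleton one, without a second held job.

```shell
#!/bin/bash
# Hypothetical helper (not part of ARE): read `scontrol show job` output on
# stdin and print the value of the first Restarts=<n> field it contains.
restart_count() {
    sed -n 's/.*Restarts=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example usage on a cluster (file name and job name are assumptions):
#   sbatch -p defq -J ARE-requeue-test sbatch_requeue_test.sh
#   scontrol show job <jobid> | restart_count
```

If the job has been requeued once, the helper should print 1. A value of 0 after the job step has failed suggests that no requeue has happened yet.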