ARE Job Monitoring#
To have ARE monitor a job, add the following comment to the job's SBATCH file:
#SBATCH --comment='{"APS": {"auto_resume_mode": "requeue"}}'
The job log filename should be in the format:
${SLURM_JOB_NAME}_${SLURM_JOB_ID}_${DATETIME}.log, where DATETIME=`date +'date_%Y-%m-%d_time_%H-%M-%S'`
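For reference, a minimal SBATCH script sketch that enables ARE monitoring and writes output to a log file following the naming convention above could look like this (the job name, training command, and log path are placeholders, not part of ARE):
#!/bin/bash
#SBATCH --job-name=megatron
#SBATCH --comment='{"APS": {"auto_resume_mode": "requeue"}}'

# Build the log filename as ${SLURM_JOB_NAME}_${SLURM_JOB_ID}_${DATETIME}.log
DATETIME=$(date +'date_%Y-%m-%d_time_%H-%M-%S')
LOG_FILE="${SLURM_JOB_NAME}_${SLURM_JOB_ID}_${DATETIME}.log"

# Placeholder training command; send stdout/stderr to the conventionally named log file
srun python train.py 2>&1 | tee "${LOG_FILE}"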
Application Log Format Requirements#
To ensure NMC ARE can effectively monitor and manage your training jobs, follow these formatting requirements for Slurm application logs:
Job ID in filename: The log filename should include the Slurm job ID, for example 1234567 in megatron_1234567_date_2024-07-31_time_18-15-55.log.
Regular heartbeat: The application should log at least one line every two minutes. This helps NMC ARE detect application hangs.
Throughput logging: The application should periodically log iteration throughput lines containing the substring “TFLOP” so that NMC ARE can detect straggler anomalies.
Timestamps: Ensure that every log entry includes a timestamp. Illustrative log lines that satisfy these requirements are shown after this list.
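As a rough illustration only (the exact wording of these lines is hypothetical and not prescribed by NMC ARE), output that satisfies the requirements above could be produced along these lines:
# Illustrative only: emit timestamped lines in the expected shape. A real training
# application would log equivalent lines from its own training loop.
log() { echo "$(date +'%Y-%m-%d %H:%M:%S') $*"; }

log "starting training loop"                               # every entry carries a timestamp
log "iteration 100 | throughput: 420.5 TFLOP/s per GPU"    # contains "TFLOP" for straggler detection
log "heartbeat: job ${SLURM_JOB_ID} still running"         # at least one line every two minutes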
The maximum number of requeues for a job is determined by three factors (a sketch of how they combine follows this list):
System Level Configuration: A configurable upper bound set at the system level, typically on the order of dozens. This value should be large enough that it is not reached unless a job runs for an extremely long time with persistent failures.
Job Level Configuration: Individual jobs can specify their own maximum requeue limit using the max_requeue_times parameter in the JSON configuration. The system uses the smaller of the job-level setting and the system-level maximum. For example:
#SBATCH --comment='{"APS": {"auto_resume_mode": "requeue", "max_requeue_times": 10}}'
Checkpoint-based Requeue Early Stop: When a job has more than X consecutive failed attempts without successfully saving a new checkpoint, the job is considered to be crash-looping and the requeue sequence is stopped. Here X is a configurable system-level threshold.
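The combined behavior can be summarized with the following sketch. This is only an illustration of the rules described above, not ARE's actual implementation; the variable names and the example values for the system-level limits are assumptions.
# Hypothetical values for the system-level settings
SYSTEM_MAX_REQUEUES=36            # system-level upper bound (assumed value)
CRASH_LOOP_THRESHOLD=3            # X: consecutive failures tolerated without a new checkpoint (assumed value)

JOB_MAX_REQUEUES=10               # from max_requeue_times in the job's --comment JSON
REQUEUE_COUNT=0                   # how many times the job has already been requeued
FAILURES_SINCE_LAST_CHECKPOINT=0  # consecutive failed attempts with no new checkpoint saved

# Effective requeue limit is the smaller of the job-level and system-level settings
EFFECTIVE_LIMIT=$(( JOB_MAX_REQUEUES < SYSTEM_MAX_REQUEUES ? JOB_MAX_REQUEUES : SYSTEM_MAX_REQUEUES ))

# Requeue only while the effective limit has not been reached and the job is not crash-looping
if (( REQUEUE_COUNT < EFFECTIVE_LIMIT && FAILURES_SINCE_LAST_CHECKPOINT <= CRASH_LOOP_THRESHOLD )); then
  echo "requeue the job"
else
  echo "stop the requeue sequence"
fi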