ARE Job Monitoring

To have ARE monitor a job, add the following comment to the job's SBATCH file:

#SBATCH --comment='{"APS": {"auto_resume_mode": "requeue"}}'

The job log filename should be in the format:

${SLURM_JOB_NAME}_${SLURM_JOB_ID}_${DATETIME}.log, where DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
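
A minimal sbatch sketch that combines the monitoring comment with a log filename in this format (the job name, node count, and training command are illustrative assumptions, not ARE requirements):

    #!/bin/bash
    #SBATCH --job-name=megatron
    #SBATCH --nodes=2
    #SBATCH --comment='{"APS": {"auto_resume_mode": "requeue"}}'

    # Build the log filename as ${SLURM_JOB_NAME}_${SLURM_JOB_ID}_${DATETIME}.log;
    # SLURM_JOB_NAME and SLURM_JOB_ID are set by Slurm inside the job environment.
    DATETIME=$(date +'date_%y-%m-%d_time_%H-%M-%S')
    LOGFILE="${SLURM_JOB_NAME}_${SLURM_JOB_ID}_${DATETIME}.log"

    # Redirect the application's output to that file so ARE can locate and parse it.
    srun python train.py > "${LOGFILE}" 2>&1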

Application Log Format Requirements

To ensure NMC ARE can effectively monitor and manage your training jobs, follow these formatting requirements for Slurm application logs (an example excerpt follows the list):

  • Job ID in filename: The log filename should include the Slurm job ID. For example, 1234567 in: megatron_1234567_date_2024-07-31_time_18-15-55.log

  • Regular heartbeat: The application should log at least one line every two minutes. This helps NMC ARE detect application hangs.

  • Throughput logging: The application should periodically log iteration throughput lines containing the substring “TFLOP” so that NMC ARE can detect straggler anomalies.

  • Timestamps: Ensure that all log entries contain timestamps.
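
As a concrete illustration, a log excerpt along these lines would satisfy the requirements; the field names and values are hypothetical, loosely modeled on Megatron-style training output:

    [2024-07-31 18:17:02] iteration 100/50000 | elapsed time per iteration (ms): 1240.5 | throughput per GPU (TFLOP/s/GPU): 412.3
    [2024-07-31 18:18:15] iteration 120/50000 | lm loss: 2.314 | throughput per GPU (TFLOP/s/GPU): 409.8
    [2024-07-31 18:19:40] saved checkpoint at iteration 120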

The maximum number of requeues for a job is determined by three factors (a worked example follows the list):

  1. System Level Configuration: A configurable upper bound set at the system level, typically on the order of dozens. This value should be large enough that it is not reached unless a job runs for an extremely long time with persistent failures.

  2. Job Level Configuration: Individual jobs can specify their own maximum requeue limit with the max_requeue_times parameter in the JSON configuration, as in the example below. The system uses the smaller of the job-level setting and the system-level maximum.

    #SBATCH --comment='{"APS": {"auto_resume_mode": "requeue", "max_requeue_times": 10}}'
    
  3. Checkpoint-based Requeue Early Stop: When a job has more than X consecutive failed attempts without successfully saving a new checkpoint, the job is considered to be crash-looping and the requeue sequence is stopped. Here, X is a threshold configurable at the system level.
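
For illustration (all numbers here are assumed, not actual defaults): with a system-level maximum of 20, a job-level max_requeue_times of 10, and an early-stop threshold X of 3, the job can be requeued at most 10 times, and requeuing stops sooner if 3 consecutive attempts fail without writing a new checkpoint.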