Grafana Cloud Setup

  • Dashboards: Ensure dashboards for services like ADI, IRM, job manager, KPI service, watcher, and controller are set up in Grafana Cloud.

  • Logs Explorer: Use the logs explorer to monitor service logs and troubleshoot issues.

  • Job Logs: Customer job logs carry a dedicated label, appName=slurm-app-log, which can be used to filter them from service logs.
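As a rough illustration of filtering on that label, the sketch below builds a LogQL stream selector string that could be pasted into the logs explorer. The helper name `logql_selector` and the `contains` parameter are hypothetical and not part of any Grafana or NMC ARE API. Only the label appName=slurm-app-log comes from this document.

```python
def _escape(value: str) -> str:
    """Escape backslashes and double quotes for a double-quoted LogQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def logql_selector(labels: dict, contains: str = None) -> str:
    """Build a LogQL query: a stream selector plus an optional line filter.

    Hypothetical helper for illustration only.
    """
    # Stream selector, e.g. {appName="slurm-app-log"}
    matchers = [f'{key}="{_escape(val)}"' for key, val in sorted(labels.items())]
    query = "{" + ", ".join(matchers) + "}"
    # Optional line filter that keeps only lines containing the given text
    if contains:
        query += f' |= "{_escape(contains)}"'
    return query


# Select all customer job logs
job_logs_query = logql_selector({"appName": "slurm-app-log"})

# Narrow the same stream to lines containing "error" (example filter text)
job_errors_query = logql_selector({"appName": "slurm-app-log"}, contains="error")
```

The first query selects every line in the job log stream. The second adds a line filter, which is usually a quicker way to narrow down a failing job than scrolling through the full stream.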

Training Efficiency Dashboard

The Training Efficiency KPI dashboard displays the downtime for each job and the aggregated downtime across all jobs. For example, job 1778, which automatically restarted twice, shows a training time efficiency of 70.5%. The low efficiency comes from the second restart (1778_2): to illustrate the impact of a failed restart, the job was manually canceled before it reached its first checkpoint save. As expected, the performance data for this restart shows no checkpoint save. NMC ARE uses the first checkpoint save as the criterion for deciding whether a restart succeeded or failed. This criterion is configurable per system deployment.
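The following is a minimal sketch of how the restart criterion and an efficiency figure could be computed from per-attempt records. It is not the formula NMC ARE uses, which this document does not specify. It assumes that an attempt which never reaches a checkpoint save counts entirely as downtime, and that efficiency is productive time divided by total wall time. The `Attempt` record, its field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    """One run of a job; each restart is a new attempt (hypothetical record)."""
    attempt_id: str
    wall_seconds: float
    # Seconds from start to the first checkpoint save, or None if none occurred
    first_checkpoint_seconds: Optional[float]


def restart_succeeded(attempt: Attempt) -> bool:
    """Criterion from the text: an attempt succeeds if it saved a checkpoint."""
    return attempt.first_checkpoint_seconds is not None


def training_efficiency(attempts: List[Attempt]) -> float:
    """Assumed definition: productive wall time / total wall time.

    Attempts without a checkpoint save contribute only downtime.
    """
    total = sum(a.wall_seconds for a in attempts)
    if total == 0:
        return 0.0
    productive = sum(a.wall_seconds for a in attempts if restart_succeeded(a))
    return productive / total


# Hypothetical job with two restarts; the last one was canceled before
# its first checkpoint save, so it counts as a failed restart.
attempts = [
    Attempt("job_0", wall_seconds=400.0, first_checkpoint_seconds=60.0),
    Attempt("job_1", wall_seconds=400.0, first_checkpoint_seconds=55.0),
    Attempt("job_2", wall_seconds=200.0, first_checkpoint_seconds=None),
]

failed = [a.attempt_id for a in attempts if not restart_succeeded(a)]
efficiency = training_efficiency(attempts)
print(f"failed restarts: {failed}, efficiency: {efficiency:.1%}")
```

With these made-up numbers, 200 of 1000 seconds are spent in an attempt that saved no checkpoint, so the sketch reports 80.0% efficiency and flags `job_2` as a failed restart. A deployment that uses a different success criterion would only need to change `restart_succeeded`.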

Job Performance Details