Accessing the Cockpit

Use SSH port forwarding to reach the Cockpit, where you can see why a user's job failed and which anomalies were identified. First, look up the Cockpit's Kubernetes service:

$ kubectl get svc heimdall-dashboard -n heimdall
NAME                TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
heimdall-dashboard  ClusterIP   10.22.210.1   <none>        3001/TCP   31d

Forward local port 3001 to that service through a login node:

$ ssh -L 3001:<cockpit k8s service>:3001 <unix username>@<login node>

With the tunnel open, go to http://localhost:3001/mission-control/recovery-engine/dashboard/workflow in a browser.

Cockpit Dashboard
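The two commands above can be combined into small helpers. This is a minimal sketch rather than an official tool: the function names are hypothetical, and it assumes the service's ClusterIP is reachable from the login node and that `kubectl` is configured for the cluster.

```shell
# Sketch only: both function names are hypothetical helpers.

# Print the ssh command that forwards a local port to the Cockpit service.
# Arguments: <service address> <unix username> <login node> [port, default 3001]
cockpit_tunnel_cmd() {
  printf 'ssh -L %s:%s:%s %s@%s\n' "${4:-3001}" "$1" "${4:-3001}" "$2" "$3"
}

# Print the ClusterIP of the dashboard service (requires kubectl access).
cockpit_cluster_ip() {
  kubectl get svc heimdall-dashboard -n heimdall -o jsonpath='{.spec.clusterIP}'
}
```

For instance, `cockpit_tunnel_cmd 10.22.210.1 alice login01` prints `ssh -L 3001:10.22.210.1:3001 alice@login01`, where `alice` and `login01` stand in for your own username and login node.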

Monitoring and Performance

  • The Cockpit's Performance Information section records the exact time at which each checkpoint was saved. For example, the first job submitted by a user appears as 1778(0), where 1778 is the Slurm job ID and 0 indicates that this is the first job.

Performance Information
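When copying job labels out of the Cockpit for use in scripts, the `1778(0)` form can be split with plain shell parameter expansion. The sketch below is illustrative: the function name is hypothetical, and it assumes every label follows the `<Slurm job ID>(<index>)` pattern described above.

```shell
# Split a Cockpit job label such as "1778(0)" into its two parts.
# parse_job_label is a hypothetical helper, not part of the Cockpit.
parse_job_label() {
  label="$1"
  job_id="${label%%\(*}"   # text before the first "(" is the Slurm job ID
  index="${label#*\(}"     # text after the first "("
  index="${index%\)}"      # drop the trailing ")"
  printf '%s %s\n' "$job_id" "$index"
}

parse_job_label "1778(0)"   # prints: 1778 0
```

The Slurm job ID extracted this way can then be passed to standard Slurm tools such as `sacct -j 1778`.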