Accessing the Cockpit
Use SSH port forwarding to access the Cockpit, where you can see why a user's job failed and which anomalies were identified.
$ kubectl get svc heimdall-dashboard -n heimdall
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
heimdall-dashboard   ClusterIP   10.22.210.1   <none>        3001/TCP   31d
$ ssh -L 3001:<cockpit k8s service>:3001 <unix username>@<login node>
Go to http://localhost:3001/mission-control/recovery-engine/dashboard/workflow
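
The lookup and tunnel steps above could be combined in a small helper. This is only a sketch: the function name `tunnel_cmd`, the username `alice`, and the host `login01` are hypothetical placeholders, and the `-N` flag (open the tunnel without running a remote command) is an addition not used in the command above.

```shell
# Sketch: print the ssh port-forward command for the Cockpit.
# The function name and all example values are placeholders.
tunnel_cmd() {
    svc_addr="$1"       # ClusterIP (or DNS name) of the heimdall-dashboard service
    user="$2"           # Unix username on the login node
    login_node="$3"     # login node hostname
    port="${4:-3001}"   # dashboard port; 3001 in the service listing above
    printf 'ssh -N -L %s:%s:%s %s@%s\n' \
        "$port" "$svc_addr" "$port" "$user" "$login_node"
}

# Possible usage, run where kubectl can reach the cluster:
#   svc_ip=$(kubectl get svc heimdall-dashboard -n heimdall -o jsonpath='{.spec.clusterIP}')
#   tunnel_cmd "$svc_ip" alice login01
```

The helper only prints the command, so it can be reviewed before it is run.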

Monitoring and Performance
The Cockpit performance info section records the exact time at which each checkpoint was saved. For example, this link shows the first job submitted by a user, labeled 1778(0), where 1778 is the Slurm job ID and 0 indicates that this is the first job.
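
When Cockpit job labels are handled in scripts, it may help to split them into their two parts. The following is a minimal sketch, assuming every label has the form `<Slurm job ID>(<index>)` as in the example above. The function name `parse_job_label` is hypothetical and is not part of the Cockpit.

```shell
# Sketch: split a label such as "1778(0)" into the Slurm job ID and the index.
# Assumes the label always looks like <job id>(<index>).
parse_job_label() {
    label="$1"
    job_id="${label%%(*}"   # everything before the first "("
    rest="${label#*(}"      # everything after the first "("
    index="${rest%)}"       # drop the trailing ")"
    printf '%s %s\n' "$job_id" "$index"
}

# Possible usage:
#   parse_job_label '1778(0)'   # prints "1778 0"
```

Only POSIX parameter expansion is used, so the function should behave the same in `sh` and `bash`.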
