# Troubleshooting

## Jobs stuck in the queue pending
- **Check the PRS daemon.** Verify that the PRS service is running (e.g., `systemctl status nvidia-prs`).
- **Inspect the Slurm pending reason.** Run `squeue`; a pending reason of `GlobalResourceServerUnavailable` indicates a problem with PRS↔Slurm communication.
- **Verify the PDs.** Check the PRS status using `cmsh`. Ensure the PDs are operational, in the expected state (e.g., Running), and carrying the expected power budgets.
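The first two checks above can be scripted. The heredoc below is a fabricated sample of `squeue` output, used only so the filter can be shown end to end; on a live cluster you would pipe real output into the same filter, for example from `squeue --states=PENDING -o "%i %j %r"` (where `%r` prints the pending reason):

```shell
# Filter pending jobs whose reason points at PRS.
# NOTE: the heredoc is a fabricated sample of `squeue` output.
sample_squeue_output=$(cat <<'EOF'
JOBID NAME REASON
1001 train GlobalResourceServerUnavailable
1002 eval Priority
1003 bigrun GlobalResourceInsufficient
EOF
)
# Keep only jobs with a GlobalResource* reason; print job ID and reason.
printf '%s\n' "$sample_squeue_output" | awk '/GlobalResource/ {print $1, $3}'
# -> 1001 GlobalResourceServerUnavailable
# -> 1003 GlobalResourceInsufficient
```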
- **Confirm a sufficient power budget for your job.** Query the PRS status for the maximum number of nodes that can be allocated to a single job, and check that the number is reasonable.
  - If the number is as expected, the job may simply be too large; in that case `squeue` may report `GlobalResourceInsufficient` as the pending reason. Consider reducing the job size.
  - If the number seems incorrect and too small, verify the power budget and increase it if necessary.
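As a back-of-the-envelope version of this check, the relationship between the budget and the maximum job size can be sketched as follows. All numbers here are illustrative assumptions, not real defaults; substitute the values reported by the PRS status query:

```shell
# Illustrative numbers only -- substitute values from the PRS status query.
budget_watts=400000     # assumed total power budget managed by PRS
watts_per_node=6500     # assumed per-node power requirement
nodes_requested=70      # node count of the pending job

max_nodes=$(( budget_watts / watts_per_node ))   # integer division: 61
echo "at most ${max_nodes} nodes can be powered under the current budget"

if [ "$nodes_requested" -gt "$max_nodes" ]; then
  # Matches the GlobalResourceInsufficient case described above.
  echo "a ${nodes_requested}-node job will not fit: shrink the job or raise the budget"
fi
```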
- **Check recent BCM events.** For example, in `cmsh`:

  ```
  [basecm11->partition[base]]% events 100
  ```
- **Rule out other Slurm issues.** Network problems, configuration errors, or node outages can also cause jobs to remain pending.
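For the last item, one quick check is `sinfo -R`, which lists down or drained nodes together with the recorded reason. The heredoc below is a fabricated sample of that output so the filter can be demonstrated standalone:

```shell
# NOTE: fabricated sample of `sinfo -R` output; run the real command on the cluster.
sample_sinfo_output=$(cat <<'EOF'
REASON USER TIMESTAMP NODELIST
Not_responding slurm 2025-05-01T10:00:00 node[013-014]
Kill_task_failed root 2025-05-01T09:30:00 node027
EOF
)
# Skip the header, then print each out-of-service node list with its reason.
printf '%s\n' "$sample_sinfo_output" | tail -n +2 | awk '{print $NF ": " $1}'
# -> node[013-014]: Not_responding
# -> node027: Kill_task_failed
```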