Troubleshooting

Jobs stuck in the queue pending

Check the PRS daemon

Verify that the PRS service is running (e.g., systemctl status nvidia-prs).
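For example, on the head node (the unit name nvidia-prs follows the example above; adjust it if your deployment names the service differently):

systemctl status nvidia-prs       # the service should report active (running)
journalctl -u nvidia-prs -n 50    # recent daemon log lines, useful if the unit is failing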

Inspect Slurm pending reason

Run squeue and check the reason reported for pending jobs; a reason of GlobalResourceServerUnavailable indicates a problem with PRS↔Slurm communication.
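For example, the %R field prints the pending reason for jobs in the pending (PD) state:

squeue -t PD -o "%.12i %.9P %.20j %.8u %R"    # %R shows the reason each job is still pending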

Verify PDs

Verify the PRS status using cmsh. Ensure the PDs are operational, show the expected status (e.g., Running), and have the expected power budgets.
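The exact cmsh mode and object names for PRS and the PDs depend on the BCM version, so the following is only a hypothetical outline; use help inside cmsh to find the actual names on your system:

cmsh                 # start cmsh on the head node
[basecm11]% help     # list available modes; locate the one that exposes the PRS and PD objects
# From that mode, use list/show to confirm each PD is in the expected state with the expected budget.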

Confirm sufficient power budget for your job

  1. Query the PRS status for the maximum number of nodes that can be allocated to a job. Ensure the number is reasonable.

  2. If the number is as expected, the job might simply be too large; in this case squeue may report GlobalResourceInsufficient as the pending reason. Consider reducing the job size (see the example after this list).

  3. If the number is unexpectedly small, verify the power budget and increase it if necessary.
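As a sketch of step 2 (12345 is a placeholder job ID and 32 an arbitrary smaller node count):

scontrol show job 12345 | grep Reason      # an oversized job shows Reason=GlobalResourceInsufficient
scontrol update JobId=12345 NumNodes=32    # shrink the pending job, or resubmit it with a smaller --nodes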

Check recent BCM events

[basecm11->partition[base]]% events 100
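This lists the 100 most recent events for the partition; check for power- or PRS-related entries from around the time jobs started pending.
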
Rule out other Slurm issues

Network problems, configuration errors, or node outages can also cause jobs to remain pending.
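A quick way to rule out node outages:

sinfo -R                        # nodes that are down, drained, or failing, with the recorded reason
scontrol show node <nodename>   # detailed state and reason for a specific node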