Troubleshooting

Jobs stuck in the queue pending

Check the PRS daemon

Verify that the PRS service is running (e.g., systemctl status nvidia-prs).
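For example, on the head node (the unit name nvidia-prs follows the example above; adjust it if your deployment names the service differently):

systemctl status nvidia-prs       # the service should report active (running)
journalctl -u nvidia-prs -n 50    # recent daemon log lines, useful if the unit is failing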

Inspect Slurm pending reason

Run squeue and check the reason reported for pending jobs; a reason of GlobalResourceServerUnavailable indicates a problem with PRS↔Slurm communication.
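For example, the %R field prints the pending reason for jobs in the pending (PD) state:

squeue -t PD -o "%.12i %.9P %.20j %.8u %R"    # %R shows the reason each job is still pending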

Verify PDs

Verify the PRS status using cmsh. Ensure the PDs are operational, show the expected status (e.g., Running), and have the expected power budgets.
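The exact cmsh mode and object names for PRS and the PDs depend on the BCM version, so the following is only a hypothetical outline; use help inside cmsh to find the actual names on your system:

cmsh                 # start cmsh on the head node
[basecm11]% help     # list available modes; locate the one that exposes the PRS and PD objects
# From that mode, use list/show to confirm each PD is in the expected state with the expected budget.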

Confirm sufficient power budget for your job

  1. Query the PRS status for the maximum number of nodes that can be allocated to a job. Ensure the number is reasonable.

  2. If the number is as expected, the job might simply be too large; in this case squeue may report GlobalResourceInsufficient as the pending reason. Consider reducing the job size (see the example after this list).

  3. If the number is unexpectedly small, verify the power budget and increase it if necessary.
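As a sketch of step 2 (12345 is a placeholder job ID and 32 an arbitrary smaller node count):

scontrol show job 12345 | grep Reason      # an oversized job shows Reason=GlobalResourceInsufficient
scontrol update JobId=12345 NumNodes=32    # shrink the pending job, or resubmit it with a smaller --nodes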

Check recent BCM events

[basecm11->partition[base]]% events 100
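This lists the 100 most recent events for the partition; check for power- or PRS-related entries from around the time jobs started pending.
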
Rule out other Slurm issues

Network problems, configuration errors, or node outages can also cause jobs to remain pending.
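A quick way to rule out node outages:

sinfo -R                        # nodes that are down, drained, or failing, with the recorded reason
scontrol show node <nodename>   # detailed state and reason for a specific node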