Note: Torque installations are highly customized. Conventions for specifying job resources vary from site to site, and we expect that the convention for enabling MPS will likewise vary from site to site. Check with your system administrator to find out whether they already have a means of provisioning MPS on your behalf.
Tinkering with nodes outside the queuing convention is generally discouraged, since jobs are usually dispatched as nodes are released by completing jobs. It is possible to enable MPS on a per-job basis by using the Torque prologue and epilogue scripts to start and stop the nvidia-cuda-mps-control daemon. In this example, we re-use the "account" parameter to request MPS for a job, so that the following command:
qsub -A "MPS=true" …
will result in the prologue script starting MPS as shown:
# Activate MPS if requested by the user
USER=$2      # prologue argument 2: the job owner's user name
ACCTSTR=$7   # prologue argument 7: the account string from qsub -A
echo "$ACCTSTR" | grep -qi "MPS=true"
if [ $? -eq 0 ]; then
    # Put the GPU into EXCLUSIVE_PROCESS compute mode
    nvidia-smi -c 3
    USERID=`id -u "$USER"`
    export CUDA_VISIBLE_DEVICES=0
    # Launch the MPS control daemon in the background
    nvidia-cuda-mps-control -d && echo "MPS control daemon started"
    sleep 1
    # Pre-start an MPS server owned by the job's user
    echo "start_server -uid $USERID" | nvidia-cuda-mps-control && echo "MPS server started for $USER"
fi
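Once the prologue has run, CUDA processes in the job attach to the MPS server transparently. As a quick sanity check from inside a job script (a minimal sketch, not part of the prologue above; the server PID placeholder is hypothetical), you can query the control daemon directly:
# From inside the job: list the PIDs of running MPS servers
echo get_server_list | nvidia-cuda-mps-control
# List the client processes attached to one server (substitute a real PID)
echo "get_client_list <server_pid>" | nvidia-cuda-mps-control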
and the epilogue script stopping MPS as shown:
# Reset the GPU compute mode to DEFAULT
nvidia-smi -c 0
# Quit the CUDA MPS control daemon if it is running
ps aux | grep nvidia-cuda-mps-control | grep -v grep > /dev/null
if [ $? -eq 0 ]; then
    echo quit | nvidia-cuda-mps-control
fi
# Test for the presence of an MPS zombie process
ps aux | grep nvidia-cuda-mps | grep -v grep > /dev/null
if [ $? -eq 0 ]; then
    logger "`hostname` epilogue: MPS refused to quit! Marking offline"
    pbsnodes -o -N "Epilogue check: MPS did not quit" `hostname`
fi
# Simple GPU sanity check
nvidia-smi > /dev/null
if [ $? -ne 0 ]; then
    logger "`hostname` epilogue: GPUs not sane! Marking `hostname` offline"
    pbsnodes -o -N "Epilogue check: nvidia-smi failed" `hostname`
fi
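Should either epilogue check fire, the node stays offline until an administrator intervenes. A minimal sketch of the manual recovery, assuming standard Torque pbsnodes semantics (kill any stragglers, then clear the offline flag and its note):
# Run as root on the affected node once it is healthy again
pkill -f nvidia-cuda-mps        # remove any leftover MPS processes
nvidia-smi -c 0                 # restore DEFAULT compute mode
pbsnodes -c -N "" `hostname`    # clear the OFFLINE state and erase the note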