Note: Torque installations are highly customized. Conventions for specifying job resources vary from site to site, and we expect that the convention for enabling MPS will likewise vary from site to site. Check with your system administrator to find out whether they already have a means of provisioning MPS on your behalf.
Tinkering with nodes outside the queuing convention is generally discouraged, since jobs are usually dispatched as nodes are released by completing jobs. It is possible to enable MPS on a per-job basis by using the Torque prologue and epilogue scripts to start and stop the nvidia-cuda-mps-control daemon. In this example, we re-use the "account" parameter to request MPS for a job, so that the following command:
qsub -A "MPS=true" …
will result in the prologue script starting MPS as shown:
# Activate MPS if requested by the user
USER=$2      # prologue argument 2: the job owner's user name
ACCTSTR=$7   # prologue argument 7: the account string from qsub -A
echo "$ACCTSTR" | grep -qi "MPS=true"
if [ $? -eq 0 ]; then
    # Put the GPU into EXCLUSIVE_PROCESS compute mode
    nvidia-smi -c 3
    USERID=`id -u "$USER"`
    export CUDA_VISIBLE_DEVICES=0
    # Launch the MPS control daemon in the background
    nvidia-cuda-mps-control -d && echo "MPS control daemon started"
    sleep 1
    # Pre-start an MPS server owned by the job's user
    echo "start_server -uid $USERID" | nvidia-cuda-mps-control && echo "MPS server started for $USER"
fi
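Once the prologue has run, CUDA processes in the job attach to the MPS server transparently. As a quick sanity check from inside a job script (a minimal sketch, not part of the prologue above; the server PID placeholder is hypothetical), you can query the control daemon directly:
# From inside the job: list the PIDs of running MPS servers
echo get_server_list | nvidia-cuda-mps-control
# List the client processes attached to one server (substitute a real PID)
echo "get_client_list <server_pid>" | nvidia-cuda-mps-control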
and the epilogue script stopping MPS as shown:
# Reset the GPU compute mode to DEFAULT
nvidia-smi -c 0
# Quit the CUDA MPS control daemon if it is running
ps aux | grep nvidia-cuda-mps-control | grep -v grep > /dev/null
if [ $? -eq 0 ]; then
    echo quit | nvidia-cuda-mps-control
fi
# Test for the presence of an MPS zombie process
ps aux | grep nvidia-cuda-mps | grep -v grep > /dev/null
if [ $? -eq 0 ]; then
    logger "`hostname` epilogue: MPS refused to quit! Marking offline"
    pbsnodes -o -N "Epilogue check: MPS did not quit" `hostname`
fi
# Simple GPU sanity check
nvidia-smi > /dev/null
if [ $? -ne 0 ]; then
    logger "`hostname` epilogue: GPUs not sane! Marking `hostname` offline"
    pbsnodes -o -N "Epilogue check: nvidia-smi failed" `hostname`
fi
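Should either epilogue check fire, the node stays offline until an administrator intervenes. A minimal sketch of the manual recovery, assuming standard Torque pbsnodes semantics (kill any stragglers, then clear the offline flag and its note):
# Run as root on the affected node once it is healthy again
pkill -f nvidia-cuda-mps        # remove any leftover MPS processes
nvidia-smi -c 0                 # restore DEFAULT compute mode
pbsnodes -c -N "" `hostname`    # clear the OFFLINE state and erase the note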