Workload Management
Introduction
Workload management is the submission and control of work on the system. The DGX SuperPOD uses Slurm, an open-source job scheduling system for Linux clusters that is most frequently used for HPC applications. This guide covers the basics of getting started with Slurm as a user on the DGX SuperPOD, including how to use Slurm commands such as sinfo, srun, sbatch, squeue, and scancel.
The basic flow of a workload management system is that the user submits a job to the queue. A job is a collection of work to be executed. Shell scripts are the most common because a job often consists of many different commands.
The system takes all the jobs that have been submitted but are not yet running, looks at the state of the system, and then maps those jobs to the available resources. This workflow lets users in large groups manage their work, with the system determining the optimal way to order jobs for maximum system utilization (or for other metrics that system administrators can configure).
Viewing System State
To see all nodes in the cluster and their current state, ssh to the Slurm login node for your cluster and run the sinfo command:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch*    up    infinite      9 idle  dgx[1-9]
There are nine nodes available in this example, all in an idle state. If a node is busy, its state will change from idle to alloc when the node is in use:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch*    up    infinite      1 alloc dgx1
batch*    up    infinite      8 idle  dgx[2-9]
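For a node-oriented view, or to see why any nodes are unavailable, sinfo also supports a per-node listing and a reason report. The following is a minimal sketch using standard sinfo options:

# list each node on its own line with detailed state information
sinfo -N -l

# list the reason any nodes are down, drained, or failing
sinfo -R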
Running Jobs
There are three ways to run jobs under Slurm. The first is with sbatch, which queues the work in the system and returns control to the prompt. The second is with srun, which blocks while the job waits to run and then runs it to completion. The third is to submit an interactive job, where srun is used to create the job but shell access is given.
Running Jobs with sbatch
While the srun command blocks any other execution in the terminal, sbatch can be used to queue a job for execution when resources are available in the cluster. Batch submission also allows several jobs to be queued up and run as nodes become available. It is therefore good practice to encapsulate everything that must be run into a script and then execute it with sbatch.
cat script.sh
#!/bin/bash
/bin/hostname
sleep 30
sbatch script.sh
2322
squeue
JOBID PARTITION      NAME USER ST TIME NODES NODELIST(REASON)
 2322     batch script.sh user  R 0:00     1 dgx1
ls
slurm-2322.out

cat slurm-2322.out
dgx1
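By default, sbatch writes the job output to slurm-<jobid>.out in the submission directory. If a different file name is preferred, the standard --output option accepts a pattern in which %j expands to the job ID; the file name below is only illustrative:

sbatch --output=myjob-%j.out script.sh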
Running Jobs with srun
To run a job, use the srun command:
srun hostname
dgx1
This instructed Slurm to find the first available node and run hostname on it, returning the result to the command prompt. It is just as easy to use srun to run a different command, such as a Python script or a container.
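For example, assuming a Python training script named train.py exists in the current working directory (the script name is illustrative only):

srun python3 train.py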
Sometimes it is necessary to run on multiple systems:
srun --ntasks 2 -l hostname
0: dgx1
1: dgx2
Running Interactive Jobs with srun
When developing and experimenting, it is helpful to run an interactive job, which requests a resource and provides a command prompt as an interface to it:
srun --pty /bin/bash
dgx1:~$ hostname
dgx1
dgx1:~$ exit
While in interactive mode, the resource is reserved until the prompt is exited, and commands can be run in succession.
Before starting an interactive session with srun, it may be helpful to create a session on the login node with a tool like tmux or screen. This will prevent a user from losing interactive jobs if there is a network outage or the terminal is closed.
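A minimal tmux workflow might look like the following (the session name is arbitrary):

# on the login node, start a named tmux session
tmux new -s interactive

# inside the tmux session, start the interactive job
srun --pty /bin/bash

# if the connection drops, log back in to the login node and reattach
tmux attach -t interactive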
Note
Local administrative policies may restrict or prevent interactive jobs. Ask a local system administrator for specific information about running interactive jobs.
Specifying Resources when Submitting Jobs
When submitting a job with srun or sbatch, request the specific resources needed for the job. Allocations are all based on tasks. A task is a unit of execution. Multiple GPUs, CPUs, or other resources can be associated with a task, but a task cannot span a node. A single task or multiple tasks can be assigned to a node. As shown in Table 2, resources can be requested in several different ways.
Table 2. Methods to specify sbatch and srun options
sbatch/srun Option | Description
---|---
-N, --nodes= | Specify the total number of nodes to request
-n, --ntasks= | Specify the total number of tasks to request
--ntasks-per-node= | Specify the number of tasks per node
-G, --gpus= | Total number of GPUs to allocate for the job
--gpus-per-task= | Number of GPUs per task
--gpus-per-node= | Number of GPUs to be allocated per node
--exclusive | Guarantee that nodes are not shared among jobs
While there are many combinations of options, here are a few common ways to submit jobs:
Request two tasks:
srun -n 2 <cmd>
Request two nodes, eight tasks per node, and one GPU per task:
sbatch -N 2 --ntasks-per-node=8 --gpus-per-task=1 <cmd>
Request 16 nodes, eight GPUs per node:
sbatch -N 16 --gpus-per-node=8 --exclusive <cmd>
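The same options can also be embedded in the batch script itself as #SBATCH directives, which is a standard Slurm feature. A minimal sketch of a script equivalent to the two-node request above (the script contents are illustrative):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1

# launch the tasks across the allocated nodes
srun <cmd>

The script is then submitted with sbatch, with no resource options needed on the command line.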
Monitoring Jobs
To see which jobs are running in the cluster, use the squeue command:
squeue -a -l
Tue Nov 17 19:08:18 2020
JOBID PARTITION NAME   USER  STATE   TIME TIME_LIMIT NODES NODELIST(REASON)
    9     batch bash  user01 RUNNING 5:43  UNLIMITED     1 dgx1
   10     batch bash  user02 RUNNING 6:33  UNLIMITED     2 dgx[2-3]
To see just the running jobs for a particular user USERNAME:
squeue -l -u USERNAME
The squeue command has many different options available. See the man page for more details.
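For example, pending jobs can be listed with their expected start times, or the queue can be filtered by job state (USERNAME is a placeholder; both options are standard squeue flags):

# report expected start times for pending jobs
squeue --start

# show only a user's pending jobs
squeue -u USERNAME -t PENDING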
Canceling Jobs
To cancel a job, use the scancel command:
scancel JOBID
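To cancel all jobs belonging to a particular user, or only that user's pending jobs, scancel also accepts user and state filters (USERNAME is a placeholder):

# cancel every job owned by USERNAME
scancel -u USERNAME

# cancel only USERNAME's jobs that have not yet started
scancel -u USERNAME --state=PENDING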
Additional Resources
Additional resources include: