Workload Management
Introduction
Workload management is the submission and control of work on the system. The DGX SuperPOD uses Slurm, an open-source job scheduling system for Linux clusters that is most frequently used for HPC applications. This guide covers the basics of getting started with Slurm as a user on the DGX SuperPOD, including how to use Slurm commands such as sinfo, srun, sbatch, squeue, and scancel.
The basic flow of a workload management system is that the user submits a job to the queue. A job is a collection of work to be executed. Shell scripts are the most common because a job often consists of many different commands.
The system takes all the jobs that have been submitted but are not yet running, looks at the state of the system, and then maps those jobs to the available resources. This workflow lets users in large groups manage their work, with the system determining the optimal way to order jobs for maximum system utilization (or for other metrics that system administrators can configure).
Viewing System State
To see all nodes in the cluster and their current state, ssh to the Slurm login node for your cluster and run the sinfo command:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch*    up    infinite      9 idle  dgx[1-9]
There are nine nodes available in this example, all in an idle state. If a node is busy, its state will change from idle to alloc when the node is in use:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch*    up    infinite      1 alloc dgx1
batch*    up    infinite      8 idle  dgx[2-9]
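For a node-oriented view, or to see why any nodes are unavailable, sinfo also supports a per-node listing and a reason report. The following is a minimal sketch using standard sinfo options:

# list each node on its own line with detailed state information
sinfo -N -l

# list the reason any nodes are down, drained, or failing
sinfo -R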
Running Jobs
There are three ways to run jobs under Slurm. The first is with sbatch, which queues the work in the system and returns control to the prompt. The second is with srun, which blocks while the job waits to run and then runs it to completion. The third is to submit an interactive job, where srun is used to create the job but shell access is given.
Running Jobs with sbatch
While the srun command blocks any other execution in the terminal, sbatch can be used to queue a job for execution when resources are available in the cluster. Batch submission also allows several jobs to be queued up and run as nodes become available. It is therefore good practice to encapsulate everything that must be run into a script and then execute it with sbatch.
cat script.sh
#!/bin/bash
/bin/hostname
sleep 30
sbatch script.sh
2322
squeue
JOBID PARTITION      NAME USER ST TIME NODES NODELIST(REASON)
 2322     batch script.sh user  R 0:00     1 dgx1
ls
slurm-2322.out

cat slurm-2322.out
dgx1
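By default, sbatch writes the job output to slurm-<jobid>.out in the submission directory. If a different file name is preferred, the standard --output option accepts a pattern in which %j expands to the job ID; the file name below is only illustrative:

sbatch --output=myjob-%j.out script.sh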
Running Jobs with srun
To run a job, use the srun command:
srun hostname
dgx1
This instructed Slurm to find the first available node and run hostname on it, returning the result to the command prompt. It is just as easy to use srun to run a different command, such as a Python script or a container.
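For example, assuming a Python training script named train.py exists in the current working directory (the script name is illustrative only):

srun python3 train.py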
Sometimes it is necessary to run on multiple systems:
srun --ntasks 2 -l hostname
0: dgx1
1: dgx2
Running Interactive Jobs with srun
When developing and experimenting, it is helpful to run an interactive job, which requests a resource and provides a command prompt as an interface to it:
srun --pty /bin/bash
dgx1:~$ hostname
dgx1
dgx1:~$ exit
While in interactive mode, the resource is reserved until the prompt is exited, and commands can be run in succession.
Before starting an interactive session with srun, it may be helpful to create a session on the login node with a tool like tmux or screen. This will prevent a user from losing interactive jobs if there is a network outage or the terminal is closed.
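A minimal tmux workflow might look like the following (the session name is arbitrary):

# on the login node, start a named tmux session
tmux new -s interactive

# inside the tmux session, start the interactive job
srun --pty /bin/bash

# if the connection drops, log back in to the login node and reattach
tmux attach -t interactive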
Note
Local administrative policies may restrict or prevent interactive jobs. Ask a local system administrator for specific information about running interactive jobs.
Specifying Resources when Submitting Jobs
When submitting a job with srun or sbatch, request the specific resources needed for the job. Allocations are all based on tasks. A task is a unit of execution. Multiple GPUs, CPUs, or other resources can be associated with a task, but a task cannot span a node. A single task or multiple tasks can be assigned to a node. As shown in Table 2, resources can be requested in several different ways.
Table 2. Methods to specify sbatch and srun options
sbatch/srun Option | Description
---|---
-N, --nodes= | Specify the total number of nodes to request
-n, --ntasks= | Specify the total number of tasks to request
--ntasks-per-node= | Specify the number of tasks per node
-G, --gpus= | Total number of GPUs to allocate for the job
--gpus-per-task= | Number of GPUs per task
--gpus-per-node= | Number of GPUs to be allocated per node
--exclusive | Guarantee that nodes are not shared among jobs
While there are many combinations of options, here are a few common ways to submit jobs:
Request two tasks:
srun -n 2 <cmd>
Request two nodes, eight tasks per node, and one GPU per task:
sbatch -N 2 --ntasks-per-node=8 --gpus-per-task=1 <cmd>
Request 16 nodes, eight GPUs per node:
sbatch -N 16 --gpus-per-node=8 --exclusive <cmd>
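The same options can also be embedded in the batch script itself as #SBATCH directives, which is a standard Slurm feature. A minimal sketch of a script equivalent to the two-node request above (the script contents are illustrative):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1

# launch the tasks across the allocated nodes
srun <cmd>

The script is then submitted with sbatch, with no resource options needed on the command line.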
Monitoring Jobs
To see which jobs are running in the cluster, use the squeue command:
squeue -a -l
Tue Nov 17 19:08:18 2020
JOBID PARTITION NAME   USER  STATE   TIME TIME_LIMIT NODES NODELIST(REASON)
    9     batch bash  user01 RUNNING 5:43  UNLIMITED     1 dgx1
   10     batch bash  user02 RUNNING 6:33  UNLIMITED     2 dgx[2-3]
To see just the running jobs for a particular user USERNAME:
squeue -l -u USERNAME
The squeue command has many different options available. See the man page for more details.
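For example, pending jobs can be listed with their expected start times, or the queue can be filtered by job state (USERNAME is a placeholder; both options are standard squeue flags):

# report expected start times for pending jobs
squeue --start

# show only a user's pending jobs
squeue -u USERNAME -t PENDING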
Canceling Jobs
To cancel a job, use the scancel command:
scancel JOBID
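To cancel all jobs belonging to a particular user, or only that user's pending jobs, scancel also accepts user and state filters (USERNAME is a placeholder):

# cancel every job owned by USERNAME
scancel -u USERNAME

# cancel only USERNAME's jobs that have not yet started
scancel -u USERNAME --state=PENDING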
Additional Resources
Additional resources include: