Workload Management

Introduction

Workload management is the submission and control of work on the system. The DGX SuperPOD uses Slurm, an open-source job scheduling system for Linux clusters that is most frequently used for HPC applications. This guide covers the basics of getting started with Slurm as a user on the DGX SuperPOD, including how to use Slurm commands such as sinfo, srun, sbatch, squeue, and scancel.

The basic flow of a workload management system is that the user submits a job, a collection of work to be executed, to the queue. Jobs are most commonly expressed as shell scripts because a job often consists of many different commands.

The system takes all submitted jobs that are not yet running, looks at the state of the system, and maps those jobs to the available resources. This workflow lets users in large groups manage their work while the system determines the optimal way to order jobs for maximum system utilization (or other metrics that system administrators can configure).

Viewing System State

To see all nodes in the cluster and their current state, ssh to the Slurm login node for your cluster and run the sinfo command:

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
batch*    up     infinite   9      idle   dgx[1-9]

There are nine nodes available in this example, all in the idle state. When a job is allocated to a node, the node's state changes from idle to alloc:

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
batch*    up     infinite   1      alloc  dgx1
batch*    up     infinite   8      idle   dgx[2-9]
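
The sinfo command also accepts filters. As a minimal sketch (the partition name batch is taken from the example above), nodes can be listed by state or by partition:

# Show only nodes that are currently idle
sinfo --states=idle

# Show only the batch partition
sinfo --partition=batch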

Running Jobs

There are three ways to run jobs under Slurm. The first is with sbatch, which queues the work in the system and returns control to the prompt. The second is with srun, which blocks while the job waits to be scheduled and then runs it to completion. The third is to submit an interactive job, where srun creates the job but provides shell access to the allocated resources.

Running Jobs with sbatch

While the srun command blocks any other execution in the terminal, sbatch queues a job for execution when resources become available in the cluster. Batch submission also enables several jobs to queue up and run as nodes become available. It is therefore good practice to encapsulate everything that must be run into a script and then execute it with sbatch.

cat script.sh
#!/bin/bash
/bin/hostname
sleep 30
sbatch script.sh
2322
squeue
JOBID PARTITION NAME      USER ST TIME NODES NODELIST(REASON)
2322  batch     script.sh user R  0:00 1     dgx1
ls
slurm-2322.out
cat slurm-2322.out
dgx1
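
For anything beyond a trivial command, resource requests and other options can be embedded in the script itself as #SBATCH directives so they do not need to be repeated on the command line. The following is a minimal sketch; the job name, time limit, and resource values are illustrative only:

#!/bin/bash
#SBATCH --job-name=example        # name shown in squeue (illustrative)
#SBATCH --nodes=1                 # request one node
#SBATCH --ntasks-per-node=1       # run one task on that node
#SBATCH --time=00:10:00           # wall-clock limit for the job
#SBATCH --output=slurm-%j.out     # %j expands to the job ID

hostname
sleep 30

Submitting this script with sbatch works the same way as in the example above.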

Running Jobs with srun

To run a job, use the srun command:

srun hostname
dgx1

This instructed Slurm to find the first available node and run hostname on it, and the result was returned to the command prompt. It is just as easy to use srun to run a different command, such as a Python script or a container.
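
As a sketch, any command on the node's path can follow srun. The script name below is hypothetical, and the --container-image option assumes the Pyxis/Enroot container plugin is installed, which may not be the case on every system:

# Run a hypothetical Python script on one node
srun python train.py

# Run the same script inside a container image (requires the Pyxis plugin)
srun --container-image=nvcr.io/nvidia/pytorch:<tag> python train.py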

Sometimes it is necessary to run on multiple systems:

srun --ntasks 2 -l hostname
dgx1
dgx2

Running Interactive Jobs with srun

When developing and experimenting, it is helpful to run an interactive job, which requests a resource and provides a command prompt as an interface to it:

srun --pty /bin/bash
dgx1:~$ hostname
dgx1
dgx1:~$ exit

In interactive mode, the resource is reserved until the prompt is exited, and commands can be run in succession.

Before starting an interactive session with srun, it may be helpful to create a session on the login node with a tool like tmux or screen. This will prevent a user from losing interactive jobs if there is a network outage or the terminal is closed.
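
A minimal sketch of that workflow, assuming tmux is available on the login node (the session name and resource request are illustrative):

# On the login node, start a named tmux session
tmux new -s interactive

# Inside the tmux session, request one node with one GPU and open a shell on it
srun --nodes=1 --gpus=1 --pty /bin/bash

# If the connection drops, log back into the login node and reattach
tmux attach -t interactive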

Note

Local administrative policies may restrict or prevent interactive jobs. Ask a local system administrator for specific information about running interactive jobs.

Specifying Resources when Submitting Jobs

When submitting a job with srun or sbatch, request the specific resources needed for the job. Allocations are based on tasks. A task is a unit of execution; multiple GPUs, CPUs, or other resources can be associated with a task, but a task cannot span more than one node. A node can be assigned a single task or multiple tasks. As shown in Table 2, resources can be requested in several different ways.

Table 2. Methods to specify sbatch and srun options

sbatch/srun Option    Description
-N, --nodes=          Specify the total number of nodes to request
-n, --ntasks=         Specify the total number of tasks to request
--ntasks-per-node=    Specify the number of tasks per node
-G, --gpus=           Total number of GPUs to allocate for the job
--gpus-per-task=      Number of GPUs per task
--gpus-per-node=      Number of GPUs to be allocated per node
--exclusive           Guarantee that nodes are not shared among jobs

While there are many combinations of options, here are a few common ways to submit jobs; a batch-script version of these requests is sketched after the list:

  • Request two tasks:

    srun -n 2 <cmd>
    
  • Request two nodes, eight tasks per node, and one GPU per task:

    sbatch -N 2 --ntasks-per-node=8 --gpus-per-task=1 <cmd>
    
  • Request 16 nodes, eight GPUs per node:

    sbatch -N 16 --gpus-per-node=8 --exclusive <cmd>
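
The same options can also be embedded in the batch script as #SBATCH directives. The following is a minimal sketch mirroring the second example above; <cmd> stands for the actual workload, as in the list:

#!/bin/bash
#SBATCH --nodes=2              # two nodes
#SBATCH --ntasks-per-node=8    # eight tasks per node
#SBATCH --gpus-per-task=1      # one GPU per task

srun <cmd>                     # srun inherits the allocation and launches the tasks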
    

Monitoring Jobs

To see which jobs are running in the cluster, use the squeue command:

squeue -a -l
Tue Nov 17 19:08:18 2020
JOBID PARTITION NAME USER   STATE   TIME TIME_LIMIT NODES NODELIST(REASON)
9     batch     bash user01 RUNNING 5:43 UNLIMITED  1     dgx1
10    batch     Bash user02 RUNNING 6:33 UNLIMITED  2     dgx[2-3]

To see only the jobs belonging to a particular user USERNAME:

squeue -l -u USERNAME

The squeue command has many different options available. See the man page for more details.
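
As one example, the output columns can be customized with --format; the field codes below are standard, but the selection is illustrative:

# Show job ID, name, state, elapsed time, and node list for one user
squeue -u USERNAME --format="%.10i %.12j %.8T %.10M %R"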

Canceling Jobs

To cancel a job, use the scancel command:

scancel JOBID
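
scancel can also act on more than one job at a time. For example, to cancel all of your own queued and running jobs (where USERNAME is your user name):

scancel -u USERNAME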

Additional Resources

Additional resources include: