Managing Slurm

Introduction

Workload management is the submission and control of work on the system. Slurm is the workload management system used on the DGX SuperPOD. It is an open-source job scheduling system for Linux clusters, most frequently used for high-performance computing (HPC) applications. This section will cover some of the basics to get started using Slurm as a user on the DGX SuperPOD. More advanced information about Slurm usage can be found in the Slurm documentation.

In the basic flow of a workload management system, the user submits a job to the queue. A job is a collection of work to be executed. What gets submitted is a command, defined either by a shell script or a binary. Shell scripts are the most common because a job often consists of many different commands.

The system takes all the submitted jobs that are not yet running, looks at the state of the system, and maps those jobs to the available resources. This workflow allows users to submit work in large batches, and the system finds the optimal way to order the jobs to maximize system utilization or other metrics that can be configured by the system administrators.

Key Slurm terms are detailed in Table 10.

Table 10. Slurm key terms

  • job: A unit of work that can be scheduled on the cluster. Each job requests a particular number of compute nodes and may be started by Slurm once the requested number of nodes is available.

  • batch job: Submitted to the cluster with a job script, which is an executable such as a bash script. Slurm will wait for the requested number of nodes to be available, allocate those nodes to the job, and run the script on the first node in the group. The commands in the script are then responsible for running the user workload across the nodes in the allocation. A batch job is submitted using the sbatch command, which returns immediately and places the job in the queue (see the example after this table).

  • interactive job: Submitted to the cluster with a request for a pseudo-terminal, so that a user can work on the cluster interactively without having to write a script. Interactive jobs are submitted using the srun command with the --pty flag. The srun command blocks until the requested nodes are available and then provides an interactive terminal to the user (see the example after this table).

  • Slurm controller: The server that is responsible for keeping track of all the servers in the cluster, accepting job submissions, and scheduling work on the cluster. The controller runs a slurmctld daemon for managing work on the cluster and a slurmdbd daemon for keeping track of the job accounting database.

  • compute node: A compute node, or just node, is an individual server that runs jobs in the cluster. For example, a single DGX A100 system is a Slurm node. Each node runs a slurmd daemon that manages jobs running on that node.

  • login node: A server that regular users SSH into to submit work to the cluster. The login node does not run a slurmd or slurmctld daemon but has the Slurm tools installed so that users can query job information and submit work.

  • partition: A logical group of compute nodes in Slurm. Each compute node may belong to more than one partition. Jobs are submitted to run in a particular partition and will use nodes from that group.

  • queue: The list of jobs that are either currently running or waiting to be allocated nodes to run on. If resources are available when a job is submitted, it runs immediately. If there are not sufficient resources, the job is placed in the queue and waits until resources are available.
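As a sketch of these two submission modes, a minimal batch script and an interactive request are shown below. The script contents, partition name, node count, and time limit are placeholder values and should be adapted to the cluster being used:

#!/bin/bash
# example-job.sh -- a minimal Slurm batch script (illustrative values only).
# Request two nodes in the batch partition for 30 minutes; %j in the
# output file name expands to the job ID.
#SBATCH --job-name=example
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --time=00:30:00
#SBATCH --output=example-%j.out

# This part runs on the first node of the allocation;
# srun launches one task on each allocated node.
srun --ntasks-per-node=1 hostname

The script would be submitted with sbatch, which returns immediately and prints the job ID, while an interactive shell on a single node could be requested with srun:

dgxa100@pg-login-mgmt001:~$ sbatch example-job.sh
dgxa100@pg-login-mgmt001:~$ srun --partition=batch --nodes=1 --pty /bin/bash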

Checking Node Status

Use the sinfo command to check the status of all the nodes on the cluster.

dgxa100@pg-login-mgmt001:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug        up 1-00:00:00      1 drain* dgx049
debug        up 1-00:00:00      1  drain dgx017
debug        up 1-00:00:00      4  alloc dgx[070,081,090,097]
batch*       up 1-00:00:00     96  alloc dgx[001-012,018-048,050-069,091,098-129]
batch*       up 1-00:00:00     38   idle dgx[013-016,071-080,082-089,092-096,130-140]
su01         up 1-00:00:00     15  alloc dgx[001-012,018-020]
su01         up 1-00:00:00      4   idle dgx[013-016]
su02         up 1-00:00:00     20  alloc dgx[021-040]
su03         up 1-00:00:00     19  alloc dgx[041-048,050-060]
su04         up 1-00:00:00      9  alloc dgx[061-069]
su04         up 1-00:00:00     10   idle dgx[071-080]
su05         up 1-00:00:00      4  alloc dgx[091,098-100]
su05         up 1-00:00:00     13   idle dgx[082-089,092-096]
su06         up 1-00:00:00     20  alloc dgx[101-120]
su07         up 1-00:00:00      9  alloc dgx[121-129]
su07         up 1-00:00:00     11   idle dgx[130-140]

sinfo lists all the partitions in the cluster and groups the nodes in each partition by status. Status descriptions are in Table 11.

Table 11. sinfo status descriptions

  • idle: Nodes are online, not currently running a job, and available to run a job.

  • alloc: Nodes are online and allocated to a running job.

  • drain: Nodes are online but have been marked as “drain” to prevent jobs from running on them. This might be because they failed a health check, or because an administrator manually marked them to drain in order to perform maintenance.

  • drng: Nodes are “draining”: they have been marked to drain but still have a job running on them. When that job completes, no further jobs will run on them.

  • down: Nodes are not online and Slurm cannot contact them.

  • boot: Nodes are being rebooted by Slurm.
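Beyond the default summary, sinfo can filter by node state or print one line per node. For example, sinfo -t idle lists only the idle nodes, and sinfo -N -l prints a node-oriented long listing; both are standard sinfo options:

dgxa100@pg-login-mgmt001:~$ sinfo -t idle
dgxa100@pg-login-mgmt001:~$ sinfo -N -l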

sinfo can be restricted to showing information about a particular partition using the -p option:

dgxa100@pg-login-mgmt001:~$ sinfo -p batch
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up 1-00:00:00     32  alloc dgx[098-129]
batch*       up 1-00:00:00    102   idle dgx[001-016,018-048,050-069,071-080,082-089,091-096,130-140]

It can also display only the nodes that are down or drained and show the reason that they are in that state by using the -R option:

dgxa100@pg-login-mgmt001:~$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
crashing on boot     root      2020-12-04T10:33:15 dgx049
NHC: check_hw_ib:  N root      2020-12-04T10:32:42 dgx017

The sinfo -R command shows only the first few characters of the reason for the failure. To see the full reason, run the following command:

dgxa100@pg-login-mgmt001:~$ scontrol show node dgx017 | grep Reason
   Reason=NHC: check_hw_ib:  No IB port mlx5_4:1 is ACTIVE (LinkUp 40 Gb/sec). [root@2020-12-04T10:32:42]

Showing Detailed Node Information

Detailed information about a Slurm node is shown by using the scontrol command:

dgxa100@pg-login-mgmt001:~$ scontrol show node dgx017
NodeName=dgx017 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=0 CPUTot=256 CPULoad=1.78
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:8(S:0-1)
   NodeAddr=dgx017 NodeHostName=dgx017 Version=20.02.4
   OS=Linux 5.4.0-54-generic #60-Ubuntu SMP Fri Nov 6 10:37:59 UTC 2020
   RealMemory=1031000 AllocMem=0 FreeMem=1017487 Sockets=2 Boards=1
   State=DOWN+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-12-05T19:26:50 SlurmdStartTime=2020-12-05T19:30:32
   CfgTRES=cpu=256,mem=1031000M,billing=256
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=NHC: check_hw_ib:  No IB port mlx5_4:1 is ACTIVE (LinkUp 40 Gb/sec). [root@2020-12-04T10:32:42]

Draining a Node

Draining a node prevents any further jobs from running on it. A node should be drained if it is unhealthy, or for maintenance work that requires no jobs to be running.

Nodes can be drained by Slurm, by NHC (Node Health Check), or manually by an administrator.

To manually drain a node from cmsh:

[headnode01->device]% use dgx-001
[headnode01->device[dgx-001]]% drain
Engine            Node             Status           Reason
----------------  ---------------- ---------------- --------------------
slurm-dgxsuperpod dgx-001          Drained          Drained by CMDaemon

To un-drain a node that you want to run jobs on again:

[headnode01->device[dgx-001]]% undrain
Engine            Node             Status           Reason
----------------  ---------------- ---------------- ----------------
slurm-dgxsuperpod dgx-001
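On clusters where node state is managed directly through Slurm rather than through BCM, the equivalent drain and resume operations can also be performed with scontrol. The node name and reason below are illustrative, and the commands require Slurm administrator privileges:

# run as root or the slurm user
scontrol update nodename=dgx017 state=drain reason="scheduled maintenance"
scontrol update nodename=dgx017 state=resume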

Updating Slurm Configuration

Slurm is configured using cm-wlm-setup, which sets up the slurm.conf file in /cm/shared/apps/slurm/var/etc/slurm. The file is shared with every node in the cluster and has the same contents on every node.

Most of the options of slurm.conf are outside the scope of this document. To read about the available options for configuring Slurm, see the slurm.conf documentation.

Note

Since Slurm is being managed by BCM, any changes manually made to this file will be overwritten. Changes should be made using Base View or cmsh.

Slurm Prolog and Epilog

The workload manager runs prolog scripts before job execution, and epilog scripts after job execution. The purpose of these scripts can include:

  • Checking whether a node is ready before a job is run on it.

  • Preparing a node in some way to manage the job execution.

  • Cleaning up resources after the job execution has ended.

Although there are global prolog and epilog scripts, editing them should be avoided. The scripts cannot be set using Base View or cmsh. Instead, the scripts must be placed by the administrator in the software image, and the relevant nodes updated from the image.

Details of Prolog and Epilog Scripts

Even though it is not recommended, some administrators may nonetheless want to link and edit the scripts directly for their own needs, outside of the Base View or cmsh front ends. A more detailed explanation of how the prolog scripts work follows.

When Slurm is configured using cm-wlm-setup or the Base View setup wizard, it is configured to run the generic prolog located in /cm/local/apps/cmd/scripts/prolog, and the generic epilog located in /cm/local/apps/cmd/scripts/epilog. The generic prolog and epilog scripts call a sequence of scripts for a particular workload manager in special directories.

The directories have paths in the format:

  • /cm/local/apps/slurm/var/prologs/

  • /cm/local/apps/slurm/var/epilogs/

In these directories, scripts are stored with prefixes and suffixes in their names that determine how and when they run, as follows:

  • Suffixes used in the prolog/epilog directory:

    • -prejob script runs before all jobs.

    • -cmjob script runs before jobs that run in a cloud.

  • Prefixes used in the prolog/epilog directory: 00- to 99-.

Number prefixes determine the order of script execution, with lower-numbered scripts running earlier. The script names can therefore look like:

  • 01-prolog-prejob

  • 10-prolog-cmjob

Return values for the prolog/epilog scripts have these meanings:

  • 0: the next script in the directory is run.

  • A non-zero return value: no further scripts are executed from the prolog/epilog directory.
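A minimal sketch of such a script is shown below. The file name 05-prolog-prejob and the mount-point check are hypothetical; only the naming convention and the return-value handling follow the rules described above:

#!/bin/bash
# Hypothetical /cm/local/apps/slurm/var/prologs/05-prolog-prejob
# The -prejob suffix means it runs before all jobs; the 05- prefix
# determines its position in the execution order.

# Example check: refuse to start the job if local scratch is not mounted.
if ! mountpoint -q /raid; then
    echo "prolog: /raid is not mounted on $(hostname)" >&2
    exit 1   # non-zero: no further scripts in this directory are executed
fi

exit 0       # zero: the next script in the directory is run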

Often, the script in a prolog/epilog directory is not a real script but a symlink to a real file located in a different directory. That destination script then takes care of what is expected of the symlink. The names of the symlink and of the destination file usually hint at what the script is expected to do.

For example, if any health checks are marked to run as prejob checks during cm-wlm-setup configuration, then each of the PBS workload manager variants uses the symlink 01-prolog-prejob within the prolog directory /cm/local/apps/<workload manager>/var/prologs/. The symlink points to the script /cm/local/apps/cmd/scripts/prolog-prejob.

In this case, the script is expected to run before the job.

[root@headnode apps]# pwd
/cm/local/apps
[root@headnode apps]# ls -l *pbs*/var/prologs/
openpbs/var/prologs/:
total 0
lrwxrwxrwx 1 root root ... 01-prolog-prejob -> /cm/local/apps/cmd/scripts/prolog-prejob

pbspro/var/prologs/:
total 0
lrwxrwxrwx 1 root root ... 01-prolog-prejob -> /cm/local/apps/cmd/scripts/prolog-prejob

Epilog scripts (which run after a job finishes) are located in /cm/local/apps/<workload manager>/var/epilogs/. Epilog script names follow the same execution sequence pattern as prolog script names.

The 01-prolog-prejob symlink is created and removed by the cluster manager on each compute node where prejob is enabled in the workload manager entity. Each such entity provides an Enable Prejob parameter that controls whether the symlink exists:

[head->wlm[openpbs]]% get enableprejob
yes
[head->wlm[openpbs]]%

This parameter is set to yes by cm-wlm-setup when at least one health check is selected as a prejob check. If a health check was already configured as a prejob check before cm-wlm-setup was run, and the administrator kept it selected, then the prejob is considered enabled.

Workload Manager Configuration For Prolog and Epilog Scripts

The cluster manager configures generic prologs and epilogs during workload manager setup with cm-wlm-setup. The administrator can further configure prologs and epilogs using the appropriate parameters in the workload manager configuration, or by creating symlinks in the local prologs and epilogs directories.

For Slurm, generic prologs and epilogs are configured by default to run on the job's compute nodes (one run per node per job).

The following prolog and epilog parameters can be configured with cmsh or Base View (a sketch of how they map to slurm.conf follows this list):

  • Prolog Slurmctld: the fully qualified path of a program to execute before granting a new job allocation. The program is executed on the node to which the slurm server role is assigned. The path corresponds to the PrologSlurmctld parameter in slurm.conf.

  • Epilog Slurmctld: the fully qualified path of a program to execute upon termination of a job allocation. The program is executed on the node to which the slurm server role is assigned. Corresponds to the EpilogSlurmctld parameter in slurm.conf.

  • Prolog: the fully qualified path of a program to execute on job compute nodes before granting a new job or step allocation. The program corresponds to the Prolog parameter in slurm.conf and by default points to the generic prolog. This prolog runs on every node of the job if the Prolog Flags parameter (PrologFlags in slurm.conf) contains the flag Alloc (the default value); otherwise it is executed only on the first node of the job.

  • Epilog: the fully qualified path of a program to execute on job compute nodes when the job allocation is released. Corresponds to the Epilog parameter in slurm.conf.
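To see how these parameters are reflected in slurm.conf, the relevant lines can be inspected read-only (the file itself is managed by BCM and should not be edited by hand). The path and the values shown below are illustrative and may differ on a given cluster:

dgxa100@pg-login-mgmt001:~$ grep -iE '^(prolog|epilog)' /cm/shared/apps/slurm/var/etc/slurm/slurm.conf
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
PrologFlags=Alloc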

Listing Slurm Jobs in the Queue

The jobs currently running on the cluster, or waiting in the queue to run, can be shown using the squeue command:

dgxa100@pg-login-mgmt001:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             25668     debug submit_h  dgxa100 PD       0:00      1 (Priority)
             25669     debug submit_h  dgxa100 PD       0:00      1 (Priority)
             25670     debug submit_h  dgxa100 PD       0:00      1 (Priority)
             25592     debug submit_h  dgxa100  R       1:13      1 dgx081
             25593     debug submit_h  dgxa100  R       1:13      1 dgx090
             25594     debug submit_h  dgxa100  R       1:13      1 dgx097

The fifth column of the squeue output is the job state (ST in the header):

  • Jobs that are running will show in the R state. The last column lists the nodes that the job is running on.

  • Jobs that are waiting to run will show in the PD (pending) state. The last column shows the reason that the job is not running yet, such as Resources if there are not enough nodes available, or Priority if the job is waiting behind another job with higher priority.

See the squeue documentation for information about the available job states and ways to filter the output.
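For example, squeue -u <user> restricts the list to a single user's jobs, and the -p and -t options filter by partition and job state. The user and partition names below are illustrative:

dgxa100@pg-login-mgmt001:~$ squeue -u dgxa100
dgxa100@pg-login-mgmt001:~$ squeue -p batch -t PENDING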

Canceling a Slurm Job

Use the scancel command to cancel a Slurm job.

$ scancel <job-id>

If a running job is canceled, Slurm will send a SIGTERM signal to all the processes in the job. If the job processes do not end within a certain number of seconds (30s by default, configured with KillWait), then Slurm will send a SIGKILL signal.

If a pending job is canceled, Slurm will simply remove it from the queue.
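scancel also accepts filters and can send a specific signal instead of the default termination sequence. The user name, job ID, and signal below are illustrative:

dgxa100@pg-login-mgmt001:~$ scancel -u dgxa100 --state=PENDING
dgxa100@pg-login-mgmt001:~$ scancel --signal=USR1 25632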

Managing the Parameters on a Job

Each job has several configuration parameters associated with it, such as the time limit or the partition it is running in. These parameters can be viewed with the following command:

dgxa100@pg-login-mgmt001:~$ scontrol show job 25632
JobId=25632 JobName=submit_hpl_cuda11.0.sh
   UserId=dgxa100(13338) GroupId=dgxa100(13338) MCS_label=N/A
   Priority=44385 Nice=0 Account=compute-account QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2020-12-06T06:36:59 EligibleTime=2020-12-06T06:36:59
   AccrueTime=2020-12-06T06:36:59
   StartTime=2020-12-06T22:04:00 EndTime=2020-12-07T00:04:00 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-12-06T08:04:40
   Partition=debug AllocNode:Sid=pg-login-mgmt001:1107071
   ReqNodeList=dgx081 ExcNodeList=(null)
   NodeList=(null) SchedNodeList=dgx081
   NumNodes=1-1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=8:0:*:1 CoreSpec=*
   MinCPUsNode=8 MinMemoryCPU=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/mnt/test/deepops/workloads/burn-in/submit_hpl_cuda11.0.sh
   WorkDir=/mnt/test/deepops/workloads/burn-in
   StdErr=/mnt/test/deepops/workloads/burn-in/results/1node_dgxa100_20201206063703/1node_dgxa100_20201206063703-25632.out
   StdIn=/dev/null
   StdOut=/mnt/test/deepops/workloads/burn-in/results/1node_dgxa100_20201206063703/1node_dgxa100_20201206063703-25632.out
   Power=
   TresPerNode=gpu:8
   MailUser=(null) MailType=NONE

Some of these parameters can be updated dynamically. For example, to extend the time limit of a job that might not finish in the time allowed, use the scontrol update command:

dgxa100@pg-login-mgmt001:~$ scontrol update jobid=25632 timelimit=02:10:00
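Other parameters can be changed in the same way. A pending job can also be held so that it will not start, and released again later; the job ID below is illustrative:

dgxa100@pg-login-mgmt001:~$ scontrol hold 25632
dgxa100@pg-login-mgmt001:~$ scontrol release 25632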

See the scontrol documentation for more information about viewing or modifying Slurm configurations.

Additional Resources

Slurm provides many advanced features that can provide more fine-grained control over job scheduling, system use, user and group accounting, and fairness of system use. See these links for more information: