Workload Management#
Introduction#
Workload management is the submission and control of work on the system. Slurm, the workload management system used on the DGX SuperPOD, is an open-source job scheduling system for Linux clusters, most frequently used for HPC applications. It provides three key functions:
Resource allocation: Allocating exclusive or non-exclusive access to compute nodes for users to perform work
Job scheduling: Providing a framework for starting, executing, and monitoring work on allocated nodes
Queue management: Managing a queue of pending work and arbitrating contention for resources
SLURM manages compute resources through partitions (queues), nodes (compute servers), and jobs (units of work). Understanding these components and their interactions is essential for efficient cluster usage.
This guide covers some of the basics to get started using Slurm as a user on the DGX SuperPOD, including how to use Slurm commands such as sinfo, srun, sbatch, squeue, and scancel.
The basic flow of a workload management system is that the user submits a job to the queue. A job is a collection of work to be executed. Shell scripts are the most common because a job often consists of many different commands.
The system will take all the jobs submitted that are not yet running, look at the state of the system, and then map those jobs to the available resources. This workflow enables users to manage their work within large groups with the system determining the optimal way to order jobs for maximum system utilization (or other metrics that system administrators can configure).
Viewing System State#
Viewing Partition Information#
To see all nodes in the cluster and their current state, ssh to the Slurm login node for your cluster and run the sinfo command.
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 9 idle dgx[1-9]
There are nine nodes available in this example, all in the idle state. When a node is in use, its state changes from idle to alloc.
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 1 alloc dgx1
batch* up infinite 8 idle dgx[2-9]
Common node states include:
idle: Available for job allocation
allocated: Currently running jobs
mixed: Some resources like GPUs allocated, others available
down: Unavailable due to failure or maintenance
drained: Administratively unavailable
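To list only nodes in a particular state, sinfo accepts a state filter, for example:
# Show only idle nodes
sinfo -t idle
# Show only allocated nodes
sinfo -t alloc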
Viewing Node Details#
Get detailed node information:
# Show all nodes with details
sinfo -N -l
# Show specific node information
scontrol show node node_name
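For example, to check the GPU (generic resource) configuration reported for a node (the node name dgx1 is illustrative):
# Show only the GRES (GPU) lines for a node
scontrol show node dgx1 | grep -i gres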
Checking Queue Status#
View current job queue:
# Show all jobs in queue
squeue
# Show your jobs only
squeue -u $USER
# Show jobs by partition
squeue -p partition_name
# Detailed job information
squeue -l
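For pending jobs, squeue can also report the scheduler's estimated start times:
# Show estimated start times for your pending jobs
squeue --start -u $USER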
Running Jobs#
There are three ways to run jobs under Slurm. The first is with sbatch, which queues the work in the system and returns control to the prompt immediately. The second is with srun, which blocks while the job waits to be scheduled and then runs it to completion. The third is an interactive job, where srun creates the job but provides shell access to the allocated resources.
Running Jobs with sbatch#
While the srun command blocks any other execution in the terminal, sbatch queues a job for execution when resources become available in the cluster. Batch submission also allows several jobs to queue up and run as nodes become free. It is therefore good practice to encapsulate everything that must be run in a script and then submit it with sbatch.
Basic sbatch script structure (the #SBATCH lines are optional):
#!/bin/bash
#SBATCH --job-name=my_job # Job name
#SBATCH --partition=compute # Partition name
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=1 # Tasks per node
#SBATCH --time=01:00:00 # Time limit (HH:MM:SS)
#SBATCH --output=%j.out # Output file (%j = job ID)
#SBATCH --error=%j.err # Error file
#SBATCH --exclusive # Indicates no other jobs will share these nodes
# Commands to run, such as srun job steps or application launches
Example:
cat script.sh
#!/bin/bash
/bin/hostname
sleep 30

sbatch script.sh
Submitted batch job 2322
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2322 batch script.sh user R 0:00 1 dgx1
ls
slurm-2322.out
cat slurm-2322.out
dgx1
Note
Common sbatch options are described in Specifying Resources when Submitting Jobs.
Running Jobs with srun#
srun
is a versatile command that can run jobs in three different
contexts: as standalone jobs, interactively, or as job steps within
batch scripts.
Standalone Jobs#
To run a job, use the srun command:
srun hostname
dgx1
This instructed Slurm to find the first available node and run hostname on it, returning the result to the command prompt. It is just as easy to run a different command, a Python script, or a container using srun.
Sometimes it is necessary to run on multiple systems.
srun --ntasks 2 -l hostname
0: dgx1
1: dgx2
Interactive Jobs#
When developing and experimenting, it is helpful to run an interactive job, which requests a resource and provides a command prompt as an interface to it. Note that the resources are not released until the session is terminated. An interactive session can also be used to launch Jupyter notebooks.
# Basic interactive session
srun --pty bash
# Interactive session with GPU
srun --gres=gpu:1 --time=01:00:00 --pty bash
# Interactive Python session
srun --cpus-per-task=4 --mem=8G --pty python3
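As noted above, an interactive allocation can also host a Jupyter server. A minimal sketch, assuming Jupyter is installed in the environment and the chosen port is reachable from your workstation:
# Start a Jupyter server on an allocated GPU node (port 8888 is an example)
srun --gres=gpu:1 --time=02:00:00 --pty \
    jupyter lab --no-browser --ip=0.0.0.0 --port=8888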
Job Steps in Sbatch Scripts#
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
# Run parallel job step
srun my_mpi_application
Running Interactive Jobs with salloc#
salloc
allocates resources and returns a shell on the login node,
allowing multiple commands within the allocation.
# Allocate resources
salloc --nodes=1 --time=01:00:00
# Within the allocation, use srun
srun hostname
srun my_application
# Exit allocation
exit
While the allocation is active, the resources remain reserved until the shell is exited, and commands can be run in succession.
Before starting an interactive session with srun, it may be helpful to create a session on the login node with a tool like tmux or screen. This will prevent a user from losing interactive jobs if there is a network outage or the terminal is closed.
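A minimal tmux workflow on the login node might look like the following (the session name is arbitrary):
# Create or re-attach to a named tmux session on the login node
tmux new -A -s interactive
# Inside the tmux session, request the interactive job
srun --gres=gpu:1 --pty bash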
Note
Local administrative policies may restrict or prevent interactive jobs. Ask a local system administrator for specific information about running interactive jobs.
Running Jobs on Specific Queues/Partitions#
Slurm organizes resources into different queues, known as partitions. Each partition may have unique properties, such as limits on wall time, node types, or hardware features. By default, jobs are submitted to the default partition (often named batch), but you can specify a different queue if your job has particular requirements or if you want to take advantage of specialized resources.
To submit a job to a specific queue, use the --partition (or -p) option with either sbatch or srun. For example, to submit a batch job to the gpu partition:
sbatch --partition=gpu script.sh
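The same option works with srun. To see which partitions exist and their limits before choosing one:
# Summarize available partitions and their time limits
sinfo -s
# Run an interactive command on the gpu partition
srun --partition=gpu hostname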
Running a Real World Job#
Here is a complete example of a typical computational job: pretraining a Nemotron LLM with an sbatch script.
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION &
# AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# For each dataset a user elects to use, the user is responsible for
# checking if the dataset license is fit for the intended purpose.
# Parameters
#SBATCH --dependency=singleton
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --ntasks-per-node=4
#SBATCH --time=12:00:00
#SBATCH --output /tmp/nemotron-test.%j.%N.log
#SBATCH --error /tmp/nemotron-test.%j.%N.err
# setup environment variable properly
export TRANSFORMERS_OFFLINE=1
export HUGGINGFACE_HUB_CACHE=/root/.cache/huggingface/hub
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export HYDRA_FULL_ERROR=1
STAGE_PATH="/cm/shared/nemotron"
CONT="docker://nvcr.io#nvidia/nemo:25.02.rc6"
PRE_CMD="
export CUDA_DEVICE_MAX_CONNECTIONS=1;
CUDA_VISIBLE_DEVICES=0,1,2,3"
MODEL_SIZE=15b
SYNTHETIC_DATA_ENABLED=True
GBS=$(( SLURM_JOB_NUM_NODES * 32 ))
RESULT_DIR=$STAGE_PATH/results
FP8_ENABLED=${ENABLE_FP8:-False}
if [[ "${FP8_ENABLED,,}" = true ]]; then
RESULT_DIR=$RESULT_DIR/fp8
else
RESULT_DIR=$RESULT_DIR/bf16
fi
TP=4
PP=1
MP=$(( TP * PP ))
CONFIG_OVERRIDES="model.global_batch_size=$GBS \
trainer.num_nodes=${SLURM_JOB_NUM_NODES} \
trainer.devices=4 \
trainer.accelerator='gpu' \
run.results_dir=$RESULT_DIR \
model.tokenizer.model=$STAGE_PATH/nemotron_2_256k.model \
model.data.index_mapping_dir=$STAGE_PATH/dataset-index \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP \
exp_manager.checkpoint_callback_params.model_parallel_size=$MP \
model.fp8=${FP8_ENABLED} \
model.virtual_pipeline_model_parallel_size=null \
model.ub_tp_comm_overlap=False \
model.sequence_parallel=True \
model.micro_batch_size=4 \
model.tp_comm_atomic_ag=False \
model.tp_comm_atomic_rs=False \
model.mcore_gpt=True \
model.transformer_engine=True \
model.fp8_hybrid=True \
model.apply_rope_fusion=True \
+model.fp8_params=${FP8_ENABLED} \
+model.gc_interval=100 \
+model.train_samples=8154297 \
+model.lr_decay_samples=7746382 \
+model.lr_warmup_samples=4077 \
trainer.enable_checkpointing=False \
exp_manager.create_checkpoint_callback=False"
srun --gres=gpu:4 --msg-timeout=180 \
--container-image ${CONT} \
--container-mounts ${RESULT_DIR},${STAGE_PATH} \
--container-writable \
--no-container-mount-home \
--mpi=pmix \
--export=ALL \
bash -c "$PRE_CMD python3 -u \
/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path=${STAGE_PATH} \
--config-name=nemotron4-${MODEL_SIZE}-synth.yaml \
$CONFIG_OVERRIDES"
Using Checkpoints#
Checkpointing allows jobs to save their state and resume from that point if interrupted. This is crucial for long-running jobs that might be interrupted by system maintenance, preemption, or resource limits.
Why Use Checkpoints?#
Fault tolerance: Recover from unexpected job termination
Time limit management: Continue beyond partition time limits
Resource optimization: Allow preemption for higher priority jobs
Progress preservation: Avoid losing hours or days of computation
Storage Best Practices:#
Use shared storage for restart capability: Checkpoints on local scratch are lost when job ends
Consider I/O performance: Local scratch is fastest for frequent checkpoints
Plan for storage space: Checkpoints can be large, monitor usage
Use compression: Reduce checkpoint size when possible
Implement cleanup: Remove old checkpoints to save space (a cleanup sketch follows this list)
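For instance, retaining only the most recent checkpoints could be scripted roughly as follows; the checkpoint directory layout and retention count are assumptions:
# Keep the three newest checkpoint directories and remove the rest (illustrative path)
CHECKPOINT_DIR=/nfs/megatron/checkpoints/my_run
ls -1dt ${CHECKPOINT_DIR}/* | tail -n +4 | xargs -r rm -rf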
Checkpoints Example:#
This example demonstrates checkpointing only and is not a fully functional Megatron sbatch script.
#!/bin/bash
#SBATCH --partition=defq
#SBATCH -t 48:00:00
#SBATCH --exclusive --mem=0
#SBATCH --ntasks-per-node=4
CHECKPOINT_DIR="/nfs/megatron/checkpoints/${NAME}"
mkdir -p ${CHECKPOINT_DIR}
srun --mpi=pmix \
--export=ALL \
--container-writable \
bash -c "python3 -u /opt/megatron-lm/pretrain_gpt.py --load \
${CHECKPOINT_DIR} --save-interval 1000000 --exit-on-missing-checkpoint"
Job Arrays#
Job arrays allow you to submit multiple similar jobs with a single command:
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --array=1-100 # Job array indices
#SBATCH --output=output_%A_%a.out # %A = array job ID, %a = array index
#SBATCH --error=error_%A_%a.err
# Use SLURM_ARRAY_TASK_ID to differentiate tasks
echo "Processing task $SLURM_ARRAY_TASK_ID"
./process_data input_$SLURM_ARRAY_TASK_ID.dat output_$SLURM_ARRAY_TASK_ID.dat
Advanced array options:
# Array with specific indices
#SBATCH --array=1,5,10-20:2 # Indices 1, 5, 10, 12, 14, 16, 18, 20
# Limit concurrent array jobs
#SBATCH --array=1-1000%50 # Run max 50 jobs simultaneously
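A common pattern is to map the array index onto lines of an input list instead of file name suffixes; the list file name here is a placeholder:
# Pick the Nth line of a file list based on the array index
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" input_files.txt)
./process_data "$INPUT" "output_${SLURM_ARRAY_TASK_ID}.dat"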
Job Dependencies#
Create workflows by making jobs depend on others:
# Submit first job
job1=$(sbatch --parsable job1.sh)
# Submit job that depends on job1 completion
job2=$(sbatch --parsable --dependency=afterok:$job1 job2.sh)
# Submit job that runs after job2, regardless of exit status
job3=$(sbatch --parsable --dependency=afterany:$job2 job3.sh)
# Submit job that runs only if job1 fails
job4=$(sbatch --parsable --dependency=afternotok:$job1 cleanup.sh)
Dependency types:
after:jobid: Start after specified job begins
afterok:jobid: Start after specified job completes successfully
afternotok:jobid: Start after specified job fails
afterany:jobid: Start after specified job completes (any exit status)
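The Nemotron example above also uses --dependency=singleton, which holds a job until all previously submitted jobs with the same name and user have finished. To check what a submitted job is still waiting on (reusing $job2 from the example above):
# Show the dependency recorded for a job
scontrol show job $job2 | grep -i dependency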
Specifying Resources when Submitting Jobs#
When submitting a job with srun or sbatch, request the specific resources needed for the job. Allocations are based on tasks. A task is a unit of execution. Multiple GPUs, CPUs, or other resources can be associated with a task, but a task cannot span a node; a single task or multiple tasks can be assigned to a node. As shown in the table below, resources can be requested in several different ways.
| sbatch/srun Option | Description |
|---|---|
| -N, --nodes= | Specify the total number of nodes to request |
| -n, --ntasks= | Specify the total number of tasks to request |
| --ntasks-per-node= | Specify the number of tasks per node |
| -G, --gpus= | Total number of GPUs to allocate for the job |
| --gpus-per-task= | Number of GPUs per task |
| --gpus-per-node= | Number of GPUs to be allocated per node |
| --exclusive | Guarantee that nodes are not shared among jobs |
While there are many combinations of options, here are a few common ways to submit jobs:
Request two tasks.
srun -n 2 <cmd>
Request two nodes, eight tasks per node, and one GPU per task.
sbatch -N 2 --ntasks-per-node=8 --gpus-per-task=1 <cmd>
Request 16 nodes, eight GPUs per node.
sbatch -N 16 --gpus-per-node=8 --exclusive <cmd>
Monitoring Jobs#
To see which jobs are running in the cluster, use the squeue command.
squeue -a -l
Tue Nov 17 19:08:18 2020
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES NODELIST(REASON)
9 batch bash user01 RUNNING 5:43 UNLIMITED 1 dgx1
10 batch Bash user02 RUNNING 6:33 UNLIMITED 2 dgx[2-3]
To see just the running jobs for a particular user USERNAME:
squeue -l -u USERNAME
The squeue command has many different options available. See its man page for more details.
Understanding Job States#
PD (Pending): Job is waiting for resources
R (Running): Job is currently running
CG (Completing): Job is completing
CD (Completed): Job finished successfully
F (Failed): Job terminated with non-zero exit code
CA (Cancelled): Job was cancelled
TO (Timeout): Job terminated due to time limit
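To list only jobs in a particular state, for example your pending jobs:
# Show only your pending jobs
squeue -u $USER -t PD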
Canceling Jobs#
To cancel a job, use the scancel command.
scancel JOBID
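scancel also accepts filters, which is convenient for canceling several jobs at once:
# Cancel all of your jobs
scancel -u $USER
# Cancel only your pending jobs
scancel -u $USER --state=PENDING
# Cancel jobs by name
scancel --name=my_job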
Job History and Accounting#
SLURM maintains detailed accounting information for all jobs, which is essential for monitoring resource usage, debugging failed jobs, and optimizing future submissions.
Basic Job History with sacct#
The sacct command provides access to job accounting information:
# Show your recent jobs (default: today)
sacct -u $USER
# Show jobs from specific date range
sacct -u $USER --starttime=2025-06-01 --endtime=2025-06-17
# Show jobs from last week
sacct -u $USER --starttime=now-7days
# Show specific job details
sacct -j job_id --format=JobID,JobName,State,ExitCode,Start,End
# Show job array details
sacct -j job_array_id --format=JobID,JobName,State,ExitCode,Start,End
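For resource-usage debugging, fields such as Elapsed, MaxRSS, and TotalCPU are often useful (job_id is a placeholder):
# Summarize run time, peak memory, and CPU time for a finished job
sacct -j job_id --format=JobID,JobName,Elapsed,MaxRSS,TotalCPU,State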