Slurm Workload Management#

NVIDIA Mission Control utilizes Slurm, an open-source workload manager developed by SchedMD, to orchestrate and schedule jobs across the DGX SuperPOD cluster. This guide assumes that administrators are already familiar with basic Slurm operations and configuration.

Detailed instructions on managing Slurm queues, submitting and monitoring jobs, managing node states (drain, resume), and configuring job prolog and epilog scripts can be found in the official Slurm documentation from SchedMD.

Slurm and IMEX#

New to NVIDIA GB200 systems, the NVIDIA Internode Memory Exchange Service (IMEX) is a secure service that facilitates the mapping of GPU memory over NVLink between the GPUs in an NVLink domain. BCM can enable the IMEX daemon either globally, or per job for Slurm.

Running the IMEX daemon per job for Slurm is the recommended approach: it ensures that a user running a job cannot read the memory of a job run by another user on another node.

Configuring IMEX per job means that the daemon starts just before the job starts, and that the service runs only on the nodes that need it.

To run IMEX per job, any global IMEX setting must first be cleared. This is done by:

  1. Removing the nvidia-imex service entirely from CMDaemon.

  2. Stopping the service on the compute nodes, for example with pdsh or pexec, or simply carrying out a reboot. A parallel shell sketch is shown below.
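For example, assuming pdsh is configured for the cluster, the service can be stopped in parallel across a rack of compute nodes with a one-liner like the following (the node names are illustrative, reusing those from the examples later in this section):

pdsh -w b05-p1-dgx-05-c[01-18] systemctl stop nvidia-imex.service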

The value of imex in the slurmclient role can then be set:

[root@basecm11 ~]# cmsh
[basecm11]% configurationoverlay roles slurm-client-gpu
[basecm11->configurationoverlay[slurm-client-gpu]->roles]% use slurmclient
[basecm11->configurationoverlay[slurm-client-gpu]->roles[slurmclient]]% set imex yes
[basecm11->configurationoverlay*[slurm-client-gpu*]->roles*[slurmclient*]]% commit
[basecm11->configurationoverlay[slurm-client-gpu]->roles[slurmclient]]%
  3. Committing the above IMEX setting configures Slurm prolog and epilog scripts in the backend as follows:

    1. The prolog script configures the IMEX daemon with the nodes allocated to the GPU job and starts the IMEX daemon on the node.

    2. The epilog cleans up the configuration and stops the IMEX daemon on the node.
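The generated prolog and epilog are managed by CMDaemon, but their effect can be sketched in shell form. The following is an illustration only, not the actual BCM-generated script; it assumes the default /etc/nvidia-imex paths shown later in this section:

#!/bin/bash
# Illustrative per-job IMEX prolog logic (not the BCM-generated script):
# resolve each allocated host to an IP, write the IMEX node list, and
# start the daemon on this node.
for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    echo "# $host"
    getent ahostsv4 "$host" | awk '{print $1; exit}'
done > /etc/nvidia-imex/nodes_config.cfg
systemctl start nvidia-imex.service

The corresponding epilog stops the daemon (systemctl stop nvidia-imex.service) and removes the generated node list.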

Slurm Topology Block Support#

Slurm can be configured to support topology-aware resource allocation to optimize job performance. More information about topologies in Slurm is available at the Slurm Topology Guide.

Previous versions of BCM supported only the topology/tree option for Slurm configuration. BCM 11 introduces support for topology/block, which is required for BCM to enable NVLink-aware scheduling of Slurm jobs on GB200 systems.

The topology.conf file describes the cluster’s network topology for optimized job resource allocation. BCM 11 can generate different types of topology.conf to customize the behavior of the Slurm topology plugins. More information about topology.conf is available in the Slurm Topology Configuration documentation.

On the BCM head node, topology.conf is located at /cm/shared/apps/slurm/etc/<CLUSTER_NAME>/topology.conf.
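As an illustration (this is not BCM-generated output), a topology.conf for the block plugin covering a single full NVL72 rack might look like the following; the node names reuse those from the examples later in this section, and the BlockName/BlockSizes keywords follow the upstream Slurm block syntax:

# topology.conf (illustrative) for topology/block
BlockName=rack_b05 Nodes=b05-p1-dgx-05-c[01-18]
BlockSizes=18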

To support topology/block, CMDaemon introduces new entities to customize Slurm in cmsh and Base View: SlurmTreeTopologySettings and SlurmBlockTopologySettings are both valid objects for the parent SlurmTopologySettings field.

The new commands to configure SlurmBlockTopologySettings are available in cmsh and Base View under the wlm mode/section:

[basecm11->wlm[slurm]]% topologysettings
[basecm11->wlm[slurm]->topologysettings]% show
Parameter                       Value
-------------------------------- ---------------------------------------------
Topology plugin                 None
Topology source                 None
Parameters                      <submode>
Tree settings                   <submode>
Block settings                  <submode>
Topograph settings              <submode>

From the topologysettings mode, the following parameters and submodes are available:

  1. Topology plugin: defines which Slurm topology plugin will be used (tree or block). By default none is set and topology.conf is not generated.

  2. Topology source: defines where BCM takes the information for the topology construction. Valid values are internal (taken from switch and rack information) and topograph (taken from the topograph service).

  3. Parameters: submode for tuning Slurm parameters related to all topology plugins.

  4. Tree settings: submode for configuring the settings of the tree topology.

  5. Block settings: submode for configuring the settings of the block topology.

If block is chosen as the topology plugin, additional options (e.g., block size, type of racks) are available to the administrator in the block settings submode. The administrator can manually define Slurm blocks by leveraging information on the cluster layout, such as full GB200 racks (1x72 configuration), half GB200 racks (2x36 configuration), or generic node groups. A sketch of such a session follows.
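The session below selects the block plugin with the internal topology source. This is a sketch only: the parameter names are inferred from the show output above (cmsh parameters are typically the lowercased, unspaced forms of the displayed labels), so confirm them with help in your BCM version:

[basecm11->wlm[slurm]]% topologysettings
[basecm11->wlm[slurm]->topologysettings]% set topologyplugin block
[basecm11->wlm[slurm]->topologysettings*]% set topologysource internal
[basecm11->wlm[slurm]->topologysettings*]% blocksettings
[basecm11->wlm[slurm]->topologysettings*->blocksettings*]% show
[basecm11->wlm[slurm]->topologysettings*->blocksettings*]% exit
[basecm11->wlm[slurm]->topologysettings*]% commit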

Common Slurm Management Tasks#

Creating a Slurm partition#

Within BCM, you can create and manage Slurm partitions through cmsh:

[root@basecm11 ~]# cmsh
[basecm11]% wlm
[basecm11->wlm[slurm]]% jobqueue
[basecm11->wlm[slurm]->jobqueue]% help
...

=============================== jobqueue ===============================
add ........................... Create and use a jobqueue
append ........................ Append value(s) to jobqueue property
clear ......................... Clear specific jobqueue property
clone ......................... Clone and use a jobqueue
commit ........................ Commit local changes
foreach ....................... Execute a set of commands on several jobqueues
format ........................ Modify or view current list format
get ........................... Get specific jobqueue property
jobqueue ...................... Enter job queues mode
list .......................... List overview
range ......................... Set a range of several jobqueues to execute future commands on
refresh ....................... Revert local changes
remove ........................ Remove a jobqueue
removefrom .................... Remove value(s) from jobqueue property
set ........................... Set jobqueue properties
show .......................... Show jobqueue properties
sort .......................... Modify or view current list sort order
statistics .................... Get job queue statistics.
swap .......................... Swap uuid names of two jobqueue
undefine ...................... Undefine specific jobqueue property
use ........................... Use the specified jobqueue
usedby ........................ List all entities which depend on this jobqueue
validate ...................... Remote validate a jobqueue

[basecm11->wlm[slurm]->jobqueue]% add batch_long
[basecm11->wlm[slurm]->jobqueue*[batch_long*]]% show

Parameter                       Value
-------------------------------- ---------------------------------------------
Name                            batch_long
Revision
Type                            Slurm
WlmCluster                      slurm
Ordering                        0
Default                         no
Hidden                          no
Min nodes                       1
Max nodes
Default time                    UNLIMITED
Max time                        12:00:00
Priority Job Factor             1
Priority Tier                   1
OverSubscribe                   exclusive
Alternate
Grace time                      0
Preemption mode                 OFF
Require reservation             NO
Select Type Parameters
LLN                             no
TRES Billing Weights
Alloc nodes
CPU bindings                    None
QOS
Default memory per CPU          UNLIMITED
Max memory per CPU              UNLIMITED
Default memory per Node         UNLIMITED
Max memory per Node             UNLIMITED
Max CPUs per node               UNLIMITED
Default memory per GPU          UNLIMITED
Default CPU per GPU             UNLIMITED
Disable root                    no
Root only                       no
Allow groups
Allow accounts
Allow QOS                       ALL
Deny Accounts
Deny QOS
ExclusiveUser                   no
Queue State                     None
Nodesets
Overlays
Categories
Nodegroups
Compute nodes
All nodes
Options

Any of the properties above can be set through cmsh in wlm mode. The example below sets the compute nodes b08-p1-dgx-08-c[01-18] to be allocated from the batch_long partition; as always in cmsh, the change must then be committed:

[basecm11->wlm[slurm]->jobqueue[batch_long]]% set computenodes b08-p1-dgx-08-c[01-18]
[basecm11->wlm[slurm]->jobqueue*[batch_long*]]% commit
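Once committed, the partition definition can be verified from a login node with standard Slurm tooling, for example:

scontrol show partition batch_long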

Draining nodes can be done through the standard Slurm CLI tool scontrol.

scontrol update nodename=a06-p1-dgx-02-c[01-04,06-11,13-14,16-18] state=drain reason="maintenance"

Resuming nodes can be done similarly:

scontrol update nodename=a06-p1-dgx-02-c[01-04,06-11,13-14,16-18] state=resume
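The resulting node states and drain reasons can be verified with sinfo; the -R option lists down, drained, and failing nodes together with their reason:

sinfo -R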

Adding or Appending to a jobqueue#

Adding a user or group to a jobqueue (partition) with set, which replaces any existing list:

[basecm11->wlm[slurm]->jobqueue[batch_long]]% set allowaccounts <user_id>
[basecm11->wlm[slurm]->jobqueue[batch_long]]% set allowgroups <group_id>

Appending a user or group to a jobqueue (partition) with append, which adds to the existing list instead of replacing it:

[basecm11->wlm[slurm]->jobqueue[batch_long]]% append allowaccounts <user_id>
[basecm11->wlm[slurm]->jobqueue[batch_long]]% append allowgroups <group_id>
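The current list can be inspected with get, and as with any cmsh change the result must be committed:

[basecm11->wlm[slurm]->jobqueue*[batch_long*]]% get allowaccounts
[basecm11->wlm[slurm]->jobqueue*[batch_long*]]% commit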

Example: Running an Interactive Slurm Job to Confirm IMEX Setup#

NOTE: This example assumes IMEX is configured to run per-job and that you have access to the slurm-login node.

  1. Check Slurm partitions and node availability with sinfo.

root@a03-p1-aps-arm-01:~# sinfo
PARTITION           AVAIL  TIMELIMIT  NODES  STATE   NODELIST
bcm_qa_test         up     infinite   18     alloc   b08-p1-dgx-08-c[01-18]
benchmark-a07       up     12:00:00   2      drain   a07-p1-dgx-03-c[06,12]
benchmark-a07       up     12:00:00   15     idle    a07-p1-dgx-03-c[01-05,07,09-11,13-18]
benchmark-b05       up     12:00:00   18     idle    b05-p1-dgx-05-c[01-18]
benchmark-combined  up     12:00:00   2      drain   a07-p1-dgx-03-c[06,12]
benchmark-combined  up     12:00:00   33     idle    a07-p1-dgx-03-c[01-05,07,09-11,13-18],b05-p1-dgx-05-c[01-18]
computex_demo       up     infinite   1      down*   b06-p1-dgx-06-c01
computex_demo       up     infinite   16     alloc   b06-p1-dgx-06-c[10-17],b07-p1-dgx-07-c[01-08]
computex_demo       up     infinite   13     idle    b06-p1-dgx-06-c[02-03,18],b07-p1-dgx-07-c[09-18]
defq*               up     infinite   4      drain*  a05-p1-dgx-01-c[01,07,16],b06-p1-dgx-06-c04
defq*               up     infinite   4      down*   a05-p1-dgx-01-c10,b06-p1-dgx-06-c[01,06-07]
defq*               up     infinite   3      drain   a07-p1-dgx-03-c[06,12],b06-p1-dgx-06-c05
defq*               up     infinite   48     alloc   a05-p1-dgx-01-c[02-06,08-09,11-15,17-18],b06-p1-dgx-06-c[10-17],b07-p1-dgx-07-c[01-08],b08-p1-dgx-08-c[01-18]
defq*               up     infinite   48     idle    a07-p1-dgx-03-c[01-05,07,09-11,13-18],b05-p1-dgx-05-c[01-18],b06-p1-dgx-06-c[02-03,08-09,18],b07-p1-dgx-07-c[09-18]
flex-a07            up     2:00:00    2      drain   a07-p1-dgx-03-c[06,12]
flex-a07            up     2:00:00    15     idle    a07-p1-dgx-03-c[01-05,07,09-11,13-18]
flex-b05            up     2:00:00    18     idle    b05-p1-dgx-05-c[01-18]
  2. Select idle nodes to test and create a Slurm allocation on them (in this example, a full NVL72 rack is selected).

root@a03-p1-aps-arm-01:~# salloc -N 18 -w b05-p1-dgx-05-c[01-18]
salloc: Granted job allocation 3912
salloc: Waiting for resource configuration
salloc: Nodes b05-p1-dgx-05-c[01-18] are ready for job
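Optionally, the allocation can be confirmed with squeue before proceeding:

root@a03-p1-aps-arm-01:~# squeue -j 3912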
  3. SSH to a node in the allocation and run nvidia-imex-ctl -N.

root@b05-p1-dgx-05-c01:~# nvidia-imex-ctl -N
Connectivity Table Legend:
I - Invalid - Node wasn't reachable, no connection status available
N - Never Connected
R - Recovering - Connection was lost, but clean up has not yet been triggered.
D - Disconnected - Connection was lost, and clean up has been triggered.
A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication.
!V! - Version mismatch, communication disabled.
!M! - Node map mismatch, communication disabled.
C - Connected - Ready for operation

5/13/2025 14:46:11.588
Nodes:
Node #0 - 7.241.18.139 - READY - Version: 570.124.06
Node #1 - 7.241.18.140 - READY - Version: 570.124.06
Node #2 - 7.241.18.141 - READY - Version: 570.124.06
Node #3 - 7.241.18.142 - READY - Version: 570.124.06
Node #4 - 7.241.18.143 - READY - Version: 570.124.06
Node #5 - 7.241.18.144 - READY - Version: 570.124.06
Node #6 - 7.241.18.145 - READY - Version: 570.124.06
Node #7 - 7.241.18.146 - READY - Version: 570.124.06
Node #8 - 7.241.18.147 - READY - Version: 570.124.06
Node #9 - 7.241.18.148 - READY - Version: 570.124.06
Node #10 - 7.241.18.149 - READY - Version: 570.124.06
Node #11 - 7.241.18.150 - READY - Version: 570.124.06
Node #12 - 7.241.18.151 - READY - Version: 570.124.06
Node #13 - 7.241.18.152 - READY - Version: 570.124.06
Node #14 - 7.241.18.153 - READY - Version: 570.124.06
Node #15 - 7.241.18.154 - READY - Version: 570.124.06
Node #16 - 7.241.18.155 - READY - Version: 570.124.06
Node #17 - 7.241.18.156 - READY - Version: 570.124.06

Nodes From\To 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 C C C C C C C C C C C C C C C C C C
1 C C C C C C C C C C C C C C C C C C
2 C C C C C C C C C C C C C C C C C C
3 C C C C C C C C C C C C C C C C C C
4 C C C C C C C C C C C C C C C C C C
5 C C C C C C C C C C C C C C C C C C
6 C C C C C C C C C C C C C C C C C C
7 C C C C C C C C C C C C C C C C C C
8 C C C C C C C C C C C C C C C C C C
9 C C C C C C C C C C C C C C C C C C
10 C C C C C C C C C C C C C C C C C C
11 C C C C C C C C C C C C C C C C C C
12 C C C C C C C C C C C C C C C C C C
13 C C C C C C C C C C C C C C C C C C
14 C C C C C C C C C C C C C C C C C C
15 C C C C C C C C C C C C C C C C C C
16 C C C C C C C C C C C C C C C C C C
17 C C C C C C C C C C C C C C C C C C

Domain State: UP

The nvidia-imex-ctl -N command prints the full status of the current IMEX Domain. It is a useful way to confirm the health of an NVL72 system.

As stated, BCM provides a way to create IMEX domains per job in Slurm. The IMEX service and its configuration file can also be inspected directly on the compute nodes.

  1. Check nvidia-imex.service on a DGX node.

root@b05-p1-dgx-05-c01:~# systemctl status nvidia-imex.service
● nvidia-imex.service - NVIDIA IMEX service
     Loaded: loaded (/usr/lib/systemd/system/nvidia-imex.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-05-13 14:44:29 PDT; 5min ago
    Process: 99921 ExecStart=/usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg (code=exited, status=0/SUCCESS)
   Main PID: 99981 (nvidia-imex)
      Tasks: 32 (limit: 293736)
     Memory: 10.3M
     CGroup: /system.slice/nvidia-imex.service
             └─99981 /usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg

May 13 14:44:29 b05-p1-dgx-05-c01 systemd[1]: Starting nvidia-imex.service - NVIDIA IMEX service...
May 13 14:44:29 b05-p1-dgx-05-c01 systemd[1]: Started nvidia-imex.service - NVIDIA IMEX service.
  2. Check /etc/nvidia-imex/nodes_config.cfg (this list is populated with the same nodes that are in $SLURM_JOB_NODELIST).

root@b05-p1-dgx-05-c01:~# cat /etc/nvidia-imex/nodes_config.cfg
# imex nodes_config for job 3912
# b05-p1-dgx-05-c01
7.241.18.139
# b05-p1-dgx-05-c02
7.241.18.140
# b05-p1-dgx-05-c03
7.241.18.141
# b05-p1-dgx-05-c04
7.241.18.142
# b05-p1-dgx-05-c05
7.241.18.143
# b05-p1-dgx-05-c06
7.241.18.144
# b05-p1-dgx-05-c07
7.241.18.145
# b05-p1-dgx-05-c08
7.241.18.146
# b05-p1-dgx-05-c09
7.241.18.147
# b05-p1-dgx-05-c10
7.241.18.148
# b05-p1-dgx-05-c11
7.241.18.149
# b05-p1-dgx-05-c12
7.241.18.150
# b05-p1-dgx-05-c13
7.241.18.151
# b05-p1-dgx-05-c14
7.241.18.152
# b05-p1-dgx-05-c15
7.241.18.153
# b05-p1-dgx-05-c16
7.241.18.154
# b05-p1-dgx-05-c17
7.241.18.155
# b05-p1-dgx-05-c18
7.241.18.156
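To confirm that this file matches the allocation, the job's hosts can be listed from the salloc shell on the login node; the hostnames printed should match the commented entries in nodes_config.cfg:

root@a03-p1-aps-arm-01:~# scontrol show hostnames "$SLURM_JOB_NODELIST"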
  3. Check /var/log/nvidia-imex.log on a DGX node.

IMEX Log initializing at: 5/13/2025 14:44:29.455
[May 13 2025 14:44:29] [INFO] [tid 99981] IMEX version 570.124.06 is running with the following configuration options
[May 13 2025 14:44:29] [INFO] [tid 99981] Logging level = 4
[May 13 2025 14:44:29] [INFO] [tid 99981] Logging file name/path = /var/log/nvidia-imex.log
[May 13 2025 14:44:29] [INFO] [tid 99981] Append to log file = 1
[May 13 2025 14:44:29] [INFO] [tid 99981] Max Log file size = 1024 (MBs)
[May 13 2025 14:44:29] [INFO] [tid 99981] Use Syslog file = 0
[May 13 2025 14:44:29] [INFO] [tid 99981] IMEX Library communication bind interface =
[May 13 2025 14:44:29] [INFO] [tid 99981] IMEX library communication bind port = 29262
[May 13 2025 14:44:29] [INFO] [tid 99981] Identified this node as ID 0, using bind IP of '7.241.18.139', and network interface of bond0
[May 13 2025 14:44:29] [INFO] [tid 99981] NvGpu Library version matched with GPU Driver version
[May 13 2025 14:44:29] [INFO] [tid 100014] Started processing of incoming messages.
[May 13 2025 14:44:29] [INFO] [tid 100015] Started processing of incoming messages.
[May 13 2025 14:44:29] [INFO] [tid 100016] Started processing of incoming messages.
[May 13 2025 14:44:29] [INFO] [tid 99981] Creating gRPC channels to all peers (nPeers = 18).
[May 13 2025 14:44:29] [INFO] [tid 100017] Started processing of incoming messages.
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 0 with ip address 7.241.18.139. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 3 with ip address 7.241.18.142. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 2 with ip address 7.241.18.141. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 4 with ip address 7.241.18.143. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 1 with ip address 7.241.18.140. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 5 with ip address 7.241.18.144. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 7 with ip address 7.241.18.146. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 6 with ip address 7.241.18.145. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 8 with ip address 7.241.18.147. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 99981] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 9 with ip address 7.241.18.148. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 10 with ip address 7.241.18.149. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 11 with ip address 7.241.18.150. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 12 with ip address 7.241.18.151. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 13 with ip address 7.241.18.152. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 14 with ip address 7.241.18.153. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 15 with ip address 7.241.18.154. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 16 with ip address 7.241.18.155. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 99981] GPU event successfully subscribed
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 17 with ip address 7.241.18.156. Number of times connected: 1