Workload Manager Installation and Validation#

Slurm Installation#

The installation of Slurm is primarily done through the cm-wlm-setup tool included with BCM.

Note

If any images are added to the system (such as a new DGX OS image) after this installation, this wizard must be run again so that the appropriate Pyxis/Enroot and CMDaemon packages are installed.

Here is the process to install Slurm:

  1. Run cm-wlm-setup.

    Select Workload Manager
  2. Choose Setup (Step By Step) on the WLM operations window and then select OK.

    Setup (Step By Step)
  3. Choose Slurm on the Select Workload Manager screen and then select Ok.

  4. Enter WLM cluster name and then select Ok.

    Use the default cluster name of slurm.

    Slurm
  5. Choose (only) two nodes for Workload Manager server role and then select Ok.

    Only select the head node(s); not the slogin node(s).

    Workload Manager Server Role
  6. Enter the new configuration overlay name and priority for server role and then select Ok.

    Use the default values of slurm-server and 500.

    Server Role
  7. Ensure that all categories for the Workload Manager client role are unchecked and then select Ok.

    Client Role Categories
  8. Ensure that all nodes for the Workload Manager client role are unchecked and then select Ok.

    Client Role Nodes
  9. Enter the configuration overlay name and priority for client role and select Ok.

    Use the default values of slurm-client and 500.

    Client Role Priority
  10. Pick healthcheck producers that will be configured as pre-job checks and then select Ok.

    Select the following healthcheck producers to be run as pre-job checks:

    • cm-chroot-sw-img, cuda-dcgm, diskspace, dmesg, failedprejob

    • gpuhealth_quick, mysql, oomkiller, rogueprocess, schedulers

    Healthcheck Producers
  11. Pick yes for Configure GPUs? and then select Ok.

    The compute tray nodes with GPUs are selected in a different step and have their own configuration overlay.

    Configure GPUs
  12. Enter the name for the ConfigurationOverlay and then select Ok.

    Use the default value of slurm-client-gpu.

    Configuration Overlay
  13. Pick dgx (or whatever the GB200 category is named) for the Workload Manager client role and then select Ok.

    • All GPU compute tray nodes will be added and controlled at the category level.

    • Do not select individual nodes with GPUs or any control nodes.

    GB200 Category
  14. Ensure that nothing is chosen for the Workload Manager client role and then select Ok.

    Client Role Nothing Chosen
  15. Enter the new configuration overlay priority for client role and then select Ok.

    Use the default value of 450.

    Client Role Priority
  16. Leave Tune number of slots empty and then select Ok.

    Tune Number of Slots Empty
  17. Choose the categories for Workload Manager submit role and then select Ok.

    Pick dgx-gb200 and slogin.

    This allows both the GB200 nodes and the slogin nodes to submit Slurm jobs.

    Workload Manager Submit Role Categories
  18. Choose the nodes for Workload Manager submit role and then select Ok.

    Pick the head node(s).

    Workload Manager Submit Role Nodes
  19. Enter the new configuration overlay name and priority for submit role and then select Ok.

    Use the default values of slurm-submit and 500.

    Workload Manager Submit Role Name and Priority
  20. Enter the new configuration overlay name and priority for accounting role and then select Ok.

    Use the default values of slurm-accounting and 500.

    Accounting Role Name and Priority
  21. Choose the accounting nodes and then select Ok.

    Pick the head node(s).

    Select Accounting Role Nodes
  22. Choose no for activate Slurm Accounting High Availability and then select Ok.

    Slurm Accounting High Availability
  23. Choose Use accounting node on the storage server type for accounting screen and then select Ok.

    Slurm Accounting Storage Server Type
  24. Choose no for automatically run takeover on BCM failover? and then select Ok.

    Slurm Automatically Run Takeover on BCM Failover
  25. Choose no for Enable Slurm power saving features? and then select Ok.

    Slurm Power Saving Features
  26. Choose BCM autodetects GPUs for the GPU configuration method and then select Ok.

    Slurm GPU Configuration Method
  27. Choose yes for Configure Pyxis plugin? and then select Ok.

    Slurm Pyxis Plugin
  28. Do not choose anything on the Enroot settings page and then select Ok.

    Enroot Settings
  29. Select Internal for the topology source so that the generated topology is based only on cluster-internal resources.

    Slurm Topology Source of Internal
  30. Choose Block for the topology plugin and then select Ok.

    Block the Slurm Topology Plugin
  31. Choose Constrain devices for Cgroups resource constraints and then select Ok.

    Constrain Devices for Cgroups Resource Constraints
  32. Choose no for Install NVIDIA GPU packages? and then select Ok.

    The required packages are already included in the DGX OS 7 image.

    Install NVIDIA GPU Packages
  33. Use the default queue name of defq and then select Ok.

    If different queues are required, define them here, or define them later in the configurationoverlay mode within cmsh, where racks or sets of nodes can be assigned to different queues (see the sketch at the end of this procedure).

    Default Queue Name
  34. Choose Save config & deploy on the Summary screen and then select Ok.

    Save Config & Deploy
  35. Enter the filepath and then select Ok.

    Use /root/<filename>.conf.

Enter the Filepath
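
If a rack or set of nodes later needs its own queue, the assignment is made on the slurmclient role in the relevant configuration overlay, using the same cmsh pattern shown later in this section. A minimal sketch (the queue name rack1q is illustrative, and the queue itself must already be defined in Slurm):

# example only: point the GPU client overlay at a different, already-defined queue
configurationoverlay roles slurm-client-gpu
set slurmclient queues rack1q
commit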

GB200 Post cm-wlm-setup Configuration Steps#

  1. In cmsh, check the wlm and configuration overlay settings:

    cmsh; wlm; get gpuautodetect  > None
    cmsh; configurationoverlay
    
  2. Review the settings to see if the appropriate nodes/categories are consistent with the selections made in the previous section:

    [a03-p1-head-01->configurationoverlay]% ls --format name,nodes,categories -v
    name (key)           nodes                categories
    -------------------- -------------------- --------------------
    slurm-accounting
    slurm-client
    slurm-client-gpu                          dgx-gb200
    slurm-server
    slurm-submit                              slogin,dgx-gb200
    wlm-headnode-submit
    
  3. Check slurmctld and slurmdbd services to see if they are active on the headnode:

    root@bcm11-head-01:~# systemctl status slurmctld.service
    ● slurmctld.service - Slurm controller daemon
         Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; preset: enabled)
        Drop-In: /etc/systemd/system/slurmctld.service.d
                 └─99-cmd.conf
         Active: active (running) since Tue 2025-06-03 12:13:18 PDT; 20min ago
       Main PID: 43566 (slurmctld)
          Tasks: 95
         Memory: 49.9M (peak: 214.1M)
            CPU: 5.509s
         CGroup: /system.slice/slurmctld.service
                 ├─43566 /cm/local/apps/slurm/24.11/sbin/slurmctld --systemd
                 └─43635 "slurmctld: slurmscriptd"
    
    root@bcm11-head-01:~# systemctl status slurmdbd.service
    ● slurmdbd.service - Slurm DBD accounting daemon
         Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; preset: enabled)
        Drop-In: /etc/systemd/system/slurmdbd.service.d
                 └─99-cmd.conf
         Active: active (running) since Tue 2025-06-03 12:12:47 PDT; 21min ago
        Process: 43184 ExecStart=/cm/local/apps/slurm/24.11/sbin/slurmdbd (code=exited, status=0/SUCCESS)
       Main PID: 43187 (slurmdbd)
          Tasks: 16
         Memory: 9.5M (peak: 35.4M)
            CPU: 1.155s
         CGroup: /system.slice/slurmdbd.service
                 └─43187 /cm/local/apps/slurm/24.11/sbin/slurmdbd
    
  4. On the GB200 compute nodes, check if slurmd is running/active:

    root@a08-p1-dgx-04-c01:~# systemctl status slurmd.service
    ● slurmd.service - Slurm node daemon
         Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: enabled)
        Drop-In: /etc/systemd/system/slurmd.service.d
                 └─99-cmd.conf
         Active: active (running) since Tue 2025-06-03 12:23:53 PDT; 5s ago
       Main PID: 877223 (slurmd)
          Tasks: 13
         Memory: 42.5M (peak: 68.0M)
            CPU: 204ms
         CGroup: /system.slice/slurmd.service
                 └─877223 /cm/local/apps/slurm/24.11/sbin/slurmd --systemd
    
  5. Setup NVIDIA IMEX in BCM.

    IMEX must be set up for proper inter-GPU memory sharing across all nodes within an NVLink domain.

    Global IMEX will configure/populate all IMEX peers in the /etc/nvidia-imex/nodes_config.cfg file with their management interface IP address (typically bond0). This setting is used to run validation testing during initial cluster bring-up:

    # global (default)
    # dgx-gb200 is the compute node category name; substitute the actual name if it differs
    category services dgx-gb200
    add nvidia-imex
    set autostart yes
    set monitored yes
    set managed yes
    commit
    

    The final settings should resemble the following (with the global IMEX configuration):

    [a03-p1-head-01->configurationoverlay[slurm-client-gpu]->roles[slurmclient]]% show
    Parameter                          Value
    ---------------------------------- ------------------------------------------------
    Name                               slurmclient
    Revision
    Type                               SlurmClientRole
    Add services                       yes
    WLM cluster                        slurm
    Slots                              0
    All queues                         no
    Queues                             defq
    Features
    Sockets                            0
    Cores per socket                   0
    Threads per core                   0
    Boards                             0
    Sockets per board                  0
    Real memory                        0B
    Node address
    Weight                             0
    Port                               0
    Tmp disk                           0
    Reason
    CPU spec list
    Core spec count                    0
    Mem spec limit                     0B
    GPU auto detect                    BCM
    Node customizations                <0 in submode>
    Generic resources                  <1 in submode>
    Cpu bindings                       None
    Slurm hardware probe auto detect   yes
    Memory autodetection slack         2.00%
    IMEX                               no
    Write procs always                 no
    Write only Procs                   no
    Nodesets
    Power profiles                     <submode>
    Nodeset features
    
    #genericresources sub-menu
    [a03-p1-head-01->configurationoverlay[slurm-client-gpu]->roles[slurmclient]]% genericresources
    [a03-p1-head-01->configurationoverlay[slurm-client-gpu]->roles[slurmclient]->genericresources]% list
    Alias (key)        Name     Type     Count    File
    ------------------ -------- -------- -------- ----------------
    
  6. Configure Workload/per-job IMEX.

    The workload IMEX configuration is intended for after customer handoff, when IMEX is configured per Slurm job: all nodes in a job are added to a job-specific IMEX domain. In this case, be sure to make changes only to the configuration overlay. If the cluster was previously set up for global IMEX, be sure to undo the changes made to the services in the GB200 category:

    # workload
    configurationoverlay roles slurm-client-gpu
    set slurmclient imex yes
    commit
    
  7. Configure PMIX.

    Create a slurmd file with the following contents and place it in the DGX OS image at /cm/images/<dgx-os-image-name>/etc/sysconfig:

    cat /cm/images/<dgx-os-image-name>/etc/sysconfig/slurmd
    PMIX_MCA_ptl=^usock
    PMIX_MCA_psec=none
    PMIX_SYSTEM_TMPDIR=/var/empty
    PMIX_MCA_gds=hash
    
  8. Configure Enroot.

    Create the file /etc/enroot/mounts.d/30-imex.fstab and place it in the DGX OS image (similar to the PMIX configuration step):

    cat /cm/images/<dgx-os-image-name>/etc/enroot/mounts.d/30-imex.fstab
    /dev/nvidia-caps-imex-channels
    

    Update the DGX nodes with the latest image changes using the following command, and wait for all of the nodes to complete provisioning before proceeding. A verification sketch follows the command:

    [bcm11-headnode]% device
    [bcm11-headnode->device]% imageupdate -w -c dgx-gb200
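
Before moving on to validation, a quick spot check from the head node (a sketch; substitute the actual image name and category) can confirm that the image edits and the IMEX service are in place:

# confirm the PMIX and Enroot files exist in the software image
cat /cm/images/<dgx-os-image-name>/etc/sysconfig/slurmd
cat /cm/images/<dgx-os-image-name>/etc/enroot/mounts.d/30-imex.fstab

# confirm nvidia-imex and slurmd are active across the GB200 category
# (the category name comes from /etc/genders)
pdsh -g category=dgx-gb200 "systemctl is-active nvidia-imex slurmd"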
    

Validate Slurm Configuration#

After Slurm setup is completed, all the compute tray nodes should appear in Slurm:

  • Run module load slurm.

  • With the sinfo command, nodes should appear in an idle state.

  • If nodes show up in different states, get more information about why they are in that state:

    • scontrol show nodes shows the status of all nodes.

    • scontrol show nodes <node hostname> shows the details for only that node.
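
For example, from a head or slogin node (commands only; output varies per cluster):

module load slurm
sinfo                 # all GB200 compute tray nodes should be listed as idle in defq
sinfo -R              # lists the reason for any node that is down or drained
scontrol show nodes   # full per-node detail, as in the examples below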

Example: GB200 healthy:

root@a03-p1-head-01:~# scontrol show nodes b05-p1-dgx-05-c01
NodeName=b05-p1-dgx-05-c01 Arch=aarch64 CoresPerSocket=72
   CPUAlloc=0 CPUEfctv=144 CPUTot=144 CPULoad=0.24
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:4(S:0-1)
   NodeAddr=b05-p1-dgx-05-c01 NodeHostName=b05-p1-dgx-05-c01 Version=24.05.5
   OS=Linux 6.8.0-1018-nvidia-64k #20-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov  7 21:24:04 UTC 2024
   RealMemory=1700252 AllocMem=0 FreeMem=1704024 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2025-02-25T00:08:19 SlurmdStartTime=2025-02-25T02:17:43
   LastBusyTime=2025-02-25T02:17:43 ResumeAfterTime=None
   CfgTRES=cpu=144,mem=1700252M,billing=144,gres/gpu=4
   AllocTRES=
   CurrentWatts=0 AveWatts=0

Example: GB200 drain:

root@a03-p1-head-01:~# scontrol show nodes a05-p1-dgx-01-c13
NodeName=a05-p1-dgx-01-c13 Arch=aarch64 CoresPerSocket=72
   CPUAlloc=0 CPUEfctv=144 CPUTot=144 CPULoad=0.22
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:4(S:0-1)
   NodeAddr=a05-p1-dgx-01-c13 NodeHostName=a05-p1-dgx-01-c13 Version=24.05.5
   OS=Linux 6.8.0-1018-nvidia-64k #20-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov  7 21:24:04 UTC 2024
   RealMemory=1700252 AllocMem=0 FreeMem=1702114 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2025-02-25T02:12:29 SlurmdStartTime=2025-02-25T02:17:55
   LastBusyTime=2025-02-25T02:17:55 ResumeAfterTime=None
   CfgTRES=cpu=144,mem=1700252M,billing=144,gres/gpu=4
   AllocTRES=
   CurrentWatts=0 AveWatts=0

   Reason=Low socket*core*thread count, Low CPUs : Not responding [slurm@2025-02-25T00:12:46]

Slurm Troubleshooting—Drain#

If a node is in the drain state, either it was taken down due to a node failure, or it was drained on purpose to pull it out of the pool of nodes available for work while maintenance or debugging is performed.

  1. Use scontrol show nodes <node in drain> to find the reason why the node is drained.

  2. If the administrator wants to put a node into the drain state, use scontrol update nodename=<nodename> state=drain reason="maintenance".

  3. When the node is fixed, add it back to the idle queue with scontrol update nodename=<nodename> state=resume.
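
For example (the node name is illustrative):

# drain a node for maintenance, confirm its state and reason, then return it to service
scontrol update nodename=b05-p1-dgx-05-c01 state=drain reason="maintenance"
scontrol show nodes b05-p1-dgx-05-c01 | grep -E "State=|Reason="
scontrol update nodename=b05-p1-dgx-05-c01 state=resume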

Slurm Troubleshooting—Down#

If a node is in down state:

  1. Use scontrol show nodes <nodename> to see if there is a clear reason it is down.

  2. Determine if slurmd is running on the worker/compute nodes:

    1. On the node, do systemctl status slurmd.

    2. If it is not in an active state, run systemctl start slurmd.

    3. To do this cluster-wide, use pdsh -w <node-hostname(s)> with the command above.

    4. At the category level, look at /etc/genders to see what categories are available, then do pdsh -g category=dgx-gb200 <command>.
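
For example, assuming the dgx-gb200 category name from /etc/genders:

# check slurmd on every node in the category, then start it where it is not running
pdsh -g category=dgx-gb200 "systemctl is-active slurmd"
pdsh -g category=dgx-gb200 "systemctl start slurmd"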

Slurm Troubleshooting—Inval#

If a node is showing invalid, the configuration of the node does not match what Slurm expects. Commonly this is due to an incorrect GPU count or missing GPU(s).

  1. If it shows Reason=gres/gpu count reported lower than configured (0 < 8), this means Slurm is expecting 8 GPUs and sees zero.

    1. This sometimes indicates that the autodetection of GPUs failed for some reason.

    2. One reason this could fail is that the cuda-dcgm package is missing from the DGX OS image:

      root@DGX-02:~# apt install cuda-dcgm
      root@DGX-02:~# systemctl start cuda-dcgm.service
      
  2. If it shows Reason=gres/gpu count reported lower than configured (7 < 8), then a GPU has failed, perhaps due to GPU tray seating issues (this should not be an issue in the GB200 generation).
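
A few quick checks on the affected node can narrow this down (a sketch; cuda-dcgm.service is the DCGM service referenced above):

nvidia-smi -L                        # GPUs visible to the driver on the node
slurmd -C                            # node configuration that slurmd would report to the controller
systemctl status cuda-dcgm.service   # DCGM service used for GPU autodetection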

Slurm Troubleshooting—MySQL#

If the slurmctld service logs (systemctl status slurmctld or journalctl -xeu slurmctld) or the slurmdbd service logs (systemctl status slurmdbd or journalctl -xeu slurmdbd) indicate that a connection is being refused, the MySQL password may need to be reset:

root@bcm11-head-01:~# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─99-cmd.conf
     Active: active (running) since Mon 2025-06-02 18:49:14 PDT; 1min 8s ago
   Main PID: 344308 (slurmctld)
      Tasks: 83
     Memory: 33.9M (peak: 55.6M)
        CPU: 339ms
     CGroup: /system.slice/slurmctld.service
             ├─344308 /cm/local/apps/slurm/24.11/sbin/slurmctld --systemd
             └─344374 "slurmctld: slurmscriptd"

Jun 02 18:50:21 bcm11-head-01 slurmctld[344308]: slurmctld: error: Sending PersistInit msg: Connection refused
Jun 02 18:50:21 bcm11-head-01 slurmctld[344308]: slurmctld: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Jun 02 18:50:21 bcm11-head-01 slurmctld[344308]: slurmctld: error: Sending PersistInit msg: Connection refused
Jun 02 18:50:21 bcm11-head-01 slurmctld[344308]: slurmctld: error: Still don't know my ClusterID
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: error: Retrying initial connection to slurmdbd
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: error: _open_persist_conn: failed to open persistent connection to host:master:6819: Connection refused
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: error: Sending PersistInit msg: Connection refused
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: error: Sending PersistInit msg: Connection refused
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: error: Still don't know my ClusterID
  1. Set a new MySQL password for the slurm_acct_db database for the slurm user on the head node using /cm/local/apps/slurm/current/scripts/cm-restore-db-password.

    1. Specify the slurmdbd.conf path [/cm/shared/apps/slurm/etc/slurmdbd.conf].

    2. Specify the slurmdbd.conf template path [/cm/local/apps/slurm/current/templates/slurmdbd.conf.template].

    3. Set the MySQL password to match the head node password.

      Set a New MySQL Password
    4. If HA is configured, the utility will ask for the IP of the secondary head node. Leave blank if it is not.
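
After the utility completes, restarting the accounting daemons and re-checking the logs (a minimal sketch; the utility may also restart these services itself) confirms that the connection errors have cleared:

systemctl restart slurmdbd.service slurmctld.service
journalctl -u slurmctld.service --since "5 minutes ago" | grep -i "connection refused" \
  || echo "no connection-refused errors"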

Slurm Troubleshooting—Pyxis Plug-in Unavailable on GB200 Software Image#

If the software image for the GB200 compute trays was not assigned to the category when cm-wlm-setup was run, the Pyxis plug-in may be missing, as indicated by:

root@dgxos-image-ubuntu2404-aarch64:/# ls -la /cm/local/apps/slurm/current/lib64/slurm/spank_pyxis.so
/usr/bin/ls: cannot access '/cm/local/apps/slurm/current/lib64/slurm/spank_pyxis.so': No such file or directory

To correct this:

  1. Use the cm-chroot-sw-img tool to enter the GB200 software image, install pyxis-sources, and then run the command to compile and install the Pyxis plugin for Slurm.

    From the headnode OS prompt:

    cm-chroot-sw-img /cm/images/<gb200 sw img>
    apt update
    apt install pyxis-sources
    
    # This command ensures the compiler can find the necessary Slurm header files and places the compiled plugin in the directory used by the Slurm daemon.
    
    CPATH=/cm/local/apps/slurm/current/include /cm/local/apps/slurm/current/scripts/install-pyxis.sh -d /cm/local/apps/slurm/current/lib64/slurm
    
  2. Restart the compute nodes.
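
After the plugin is installed in the image, a quick check (a sketch; substitute the actual image name, and note that the cmsh reboot shown is one way to restart the category from the head node) confirms that Pyxis is available:

# confirm the plugin now exists inside the software image
ls -la /cm/images/<gb200 sw img>/cm/local/apps/slurm/current/lib64/slurm/spank_pyxis.so

# one way to restart the GB200 category from the head node
cmsh -c "device; reboot -c dgx-gb200"

# once the nodes are back up, with the slurm module loaded, the Pyxis container options should appear
srun --help | grep container-image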