Workload Manager Installation and Validation#

Slurm Installation#

The installation of Slurm is primarily done through the cm-wlm-setup tool included with BCM.

Note

If any images are added to the system (such as a new DGX OS image) after this installation, this wizard must be run again so that the appropriate Pyxis/Enroot and CMDaemon packages are installed.

Here is the process to install Slurm:

  1. Run cm-wlm-setup.

    Select Workload Manager
  2. Choose Setup (Step By Step) on the WLM operations window and then select OK.

    Setup (Step By Step)
  3. Choose Slurm on the Select Workload Manager screen and then select Ok.

  4. Enter WLM cluster name and then select Ok.

    Use the default cluster name of slurm.

    Slurm
  5. Choose (only) two nodes for Workload Manager server role and then select Ok.

    Only select the head node(s); not the slogin node(s).

    Workload Manager Server Role
  6. Enter the new configuration overlay name and priority for server role and then select Ok.

    Use the default values of slurm-server and 500.

    Server Role
  7. Ensure that all categories for the Workload Manager client role are unchecked and then select Ok.

    Client Role Categories
  8. Ensure that all nodes for the Workload Manager client role are unchecked and then select Ok.

    Client Role Nodes
  9. Enter the configuration overlay name and priority for client role and select Ok.

    Use the default values of slurm-client and 500.

    Client Role Priority
  10. Pick healthcheck producers that will be configured as pre-job checks and then select Ok.

    Select the following healthcheck producers to be run as pre-job checks:

    • cm-chroot-sw-img, cuda-dcgm, diskspace, dmesg, failedprejob

    • gpuhealth_quick, mysql, oomkiller, rogueprocess, schedulers

    Healthcheck Producers
  11. Pick yes for Configure GPUs? and then select Ok.

    The compute tray nodes with GPUs are selected in a different step and have their own configuration overlay.

    Configure GPUs
  12. Enter the name for the ConfigurationOverlay and then select Ok.

    Use the default value of slurm-client-gpu.

    Configuration Overlay
  13. Pick dgx (or whatever the GB200 category is named) for the Workload Manager client role and then select Ok.

    • All GPU compute tray nodes will be added and controlled at the category level.

    • Do not select individual nodes with GPUs or any control nodes.

    GB200 Category
  14. Ensure that nothing is chosen for the Workload Manager client role and then select Ok.

    Client Role Nothing Chosen
  15. Enter the new configuration overlay priority for client role and then select Ok.

    Use the default value of 450.

    Client Role Priority
  16. Leave Tune number of slots empty and then select Ok.

    Tune Number of Slots Empty
  17. Choose the categories for Workload Manager submit role and then select Ok.

    Pick dgx-gb200 and slogin.

    This allows both the GB200 nodes and the slogin nodes to submit Slurm jobs.

    Workload Manager Submit Role Categories
  18. Choose the nodes for Workload Manager submit role and then select Ok.

    Pick the head node(s).

    Workload Manager Submit Role Nodes
  19. Enter the new configuration overlay name and priority for submit role and then select Ok.

    Use the default values of slurm-submit and 500.

    Workload Manager Submit Role Name and Priority
  20. Enter the new configuration overlay name and priority for accounting role and then select Ok.

    Use the default values of slurm-accounting and 500.

    Accounting Role Name and Priority
  21. Choose the accounting nodes and then select Ok.

    Pick the head node(s).

    Select Accounting Role Nodes
  22. Choose no for activate Slurm Accounting High Availability and then select Ok.

    Slurm Accounting High Availability
  23. Choose Use accounting node on the storage server type for accounting screen and then select Ok.

    Slurm Accounting Storage Server Type
  24. Choose no for automatically run takeover on BCM failover? and then select Ok.

    Slurm Automatically Run Takeover on BCM Failover
  25. Choose no for Enable Slurm power saving features? and then select Ok.

    Slurm Power Saving Features
  26. Choose BCM autodetects GPUs for the GPU configuration method and then select Ok.

    Slurm GPU Configuration Method
  27. Choose yes for Configure Pyxis plugin? and then select Ok.

    Slurm Pyxis Plugin
  28. Do not choose anything on the Enroot settings page and then select Ok.

    Enroot Settings
  29. Select Internal for the topology source so that the generated topology is based only on cluster-internal resources.

    Slurm Topology Source of Internal
  30. Choose Block for the topology plugin and then select Ok.

    Block the Slurm Topology Plugin
  31. Choose Constrain devices for Cgroups resource constraints and then select Ok.

    Constrain Devices for Cgroups Resource Constraints
  32. Choose no for Install NVIDIA GPU packages? and then select Ok.

    The required packages are already included in the DGX OS 7 image.

    Install NVIDIA GPU Packages
  33. Use the default queue name of defq and then select Ok.

    If different queues are required, define them here, or define them later in the configurationoverlay mode within cmsh, where racks or sets of nodes can be assigned to different queues (see the sketch at the end of this procedure).

    Default Queue Name
  34. Choose Save config & deploy on the Summary screen and then select Ok.

    Save Config & Deploy
  35. Enter the filepath and then select Ok.

    Use /root/<filename>.conf.

Enter the Filepath
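
If a rack or set of nodes later needs its own queue, the assignment is made on the slurmclient role in the relevant configuration overlay, using the same cmsh pattern shown later in this section. A minimal sketch (the queue name rack1q is illustrative, and the queue itself must already be defined in Slurm):

# example only: point the GPU client overlay at a different, already-defined queue
configurationoverlay roles slurm-client-gpu
set slurmclient queues rack1q
commit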

GB200 Post cm-wlm-setup Configuration Steps#

  1. In cmsh, check the wlm and configuration overlay settings:

    cmsh; wlm; get gpuautodetect  > None
    cmsh; configurationoverlay
    
  2. Review the settings to see if the appropriate nodes/categories are consistent with the selections made in the previous section:

    [a03-p1-head-01->configurationoverlay]% ls --format name,nodes,categories -v
    name (key)           nodes                categories
    -------------------- -------------------- --------------------
    slurm-accounting
    slurm-client
    slurm-client-gpu                          dgx-gb200
    slurm-server
    slurm-submit                              slogin,dgx-gb200
    wlm-headnode-submit
    
  3. Check slurmctld and slurmdbd services to see if they are active on the headnode:

    root@bcm11-head-01:~# systemctl status slurmctld.service
    ● slurmctld.service - Slurm controller daemon
         Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; preset: enabled)
        Drop-In: /etc/systemd/system/slurmctld.service.d
                 └─99-cmd.conf
         Active: active (running) since Tue 2025-06-03 12:13:18 PDT; 20min ago
       Main PID: 43566 (slurmctld)
          Tasks: 95
         Memory: 49.9M (peak: 214.1M)
            CPU: 5.509s
         CGroup: /system.slice/slurmctld.service
                 ├─43566 /cm/local/apps/slurm/24.11/sbin/slurmctld --systemd
                 └─43635 "slurmctld: slurmscriptd"
    
    root@bcm11-head-01:~# systemctl status slurmdbd.service
    ● slurmdbd.service - Slurm DBD accounting daemon
         Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; preset: enabled)
        Drop-In: /etc/systemd/system/slurmdbd.service.d
                 └─99-cmd.conf
         Active: active (running) since Tue 2025-06-03 12:12:47 PDT; 21min ago
        Process: 43184 ExecStart=/cm/local/apps/slurm/24.11/sbin/slurmdbd (code=exited, status=0/SUCCESS)
       Main PID: 43187 (slurmdbd)
          Tasks: 16
         Memory: 9.5M (peak: 35.4M)
            CPU: 1.155s
         CGroup: /system.slice/slurmdbd.service
                 └─43187 /cm/local/apps/slurm/24.11/sbin/slurmdbd
    
  4. On the GB200 compute nodes, check if slurmd is running/active:

    root@a08-p1-dgx-04-c01:~# systemctl status slurmd.service
    ● slurmd.service - Slurm node daemon
         Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; preset: enabled)
        Drop-In: /etc/systemd/system/slurmd.service.d
                 └─99-cmd.conf
         Active: active (running) since Tue 2025-06-03 12:23:53 PDT; 5s ago
       Main PID: 877223 (slurmd)
          Tasks: 13
         Memory: 42.5M (peak: 68.0M)
            CPU: 204ms
         CGroup: /system.slice/slurmd.service
                 └─877223 /cm/local/apps/slurm/24.11/sbin/slurmd --systemd
    
  5. Setup NVIDIA IMEX in BCM.

    IMEX must be set up for proper inter-GPU memory sharing across all nodes within an NVLink domain.

    Global IMEX will configure/populate all IMEX peers in the /etc/nvidia-imex/nodes_config.cfg file with their management interface IP address (typically bond0). This setting is used to run validation testing during initial cluster bring-up:

    # global (default)
    # dgx-gb200 is the compute node category name; substitute the actual name if it differs
    category services dgx-gb200
    add nvidia-imex
    set autostart yes
    set monitored yes
    set managed yes
    commit
    

    The final settings should resemble the following (with the global IMEX configuration):

    [a03-p1-head-01->configurationoverlay[slurm-client-gpu]->roles[slurmclient]]% show
    Parameter                          Value
    ---------------------------------- ------------------------------------------------
    Name                               slurmclient
    Revision
    Type                               SlurmClientRole
    Add services                       yes
    WLM cluster                        slurm
    Slots                              0
    All queues                         no
    Queues                             defq
    Features
    Sockets                            0
    Cores per socket                   0
    Threads per core                   0
    Boards                             0
    Sockets per board                  0
    Real memory                        0B
    Node address
    Weight                             0
    Port                               0
    Tmp disk                           0
    Reason
    CPU spec list
    Core spec count                    0
    Mem spec limit                     0B
    GPU auto detect                    BCM
    Node customizations                <0 in submode>
    Generic resources                  <1 in submode>
    Cpu bindings                       None
    Slurm hardware probe auto detect   yes
    Memory autodetection slack         2.00%
    IMEX                               no
    Write procs always                 no
    Write only Procs                   no
    Nodesets
    Power profiles                     <submode>
    Nodeset features
    
    #genericresources sub-menu
    [a03-p1-head-01->configurationoverlay[slurm-client-gpu]->roles[slurmclient]]% genericresources
    [a03-p1-head-01->configurationoverlay[slurm-client-gpu]->roles[slurmclient]->genericresources]% list
    Alias (key)        Name     Type     Count    File
    ------------------ -------- -------- -------- ----------------
    
  6. Configure Workload/per-job IMEX.

    The workload IMEX configuration is intended for after customer handoff, when IMEX is configured per Slurm job: all nodes in a job are added to a job-specific IMEX domain. In this case, be sure to make changes only to the configuration overlay. If the cluster was previously set up for global IMEX, be sure to undo the changes made to the services in the GB200 category:

    # workload
    configurationoverlay roles slurm-client-gpu
    set slurmclient imex yes
    commit
    
  7. Configure PMIX.

    Create a slurmd file with the following contents and place it in the DGX OS image at /cm/images/<dgx-os-image-name>/etc/sysconfig:

    cat /cm/images/<dgx-os-image-name>/etc/sysconfig/slurmd
    PMIX_MCA_ptl=^usock
    PMIX_MCA_psec=none
    PMIX_SYSTEM_TMPDIR=/var/empty
    PMIX_MCA_gds=hash
    
  8. Configure Enroot.

    Create the file /etc/enroot/mounts.d/30-imex.fstab and place it in the DGX OS image (similar to the PMIX configuration step):

    cat /cm/images/<dgx-os-image-name>/etc/enroot/mounts.d/30-imex.fstab
    /dev/nvidia-caps-imex-channels
    

    Update the DGX nodes with the latest image changes using the following command, and wait for all of the nodes to complete provisioning before proceeding. A verification sketch follows the command:

    [bcm11-headnode]% device
    [bcm11-headnode->device]% imageupdate -w -c dgx-gb200
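
Before moving on to validation, a quick spot check from the head node (a sketch; substitute the actual image name and category) can confirm that the image edits and the IMEX service are in place:

# confirm the PMIX and Enroot files exist in the software image
cat /cm/images/<dgx-os-image-name>/etc/sysconfig/slurmd
cat /cm/images/<dgx-os-image-name>/etc/enroot/mounts.d/30-imex.fstab

# confirm nvidia-imex and slurmd are active across the GB200 category
# (the category name comes from /etc/genders)
pdsh -g category=dgx-gb200 "systemctl is-active nvidia-imex slurmd"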
    

Validate Slurm Configuration#

After Slurm setup is completed, all the compute tray nodes should appear in Slurm:

  • Run module load slurm.

  • With the sinfo command, nodes should appear in an idle state.

  • If nodes show up in different states, get more information about why they are in that state:

    • scontrol show nodes shows the status of all nodes.

    • scontrol show nodes <node hostname> shows the details for only that node.
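
For example, from a head or slogin node (commands only; output varies per cluster):

module load slurm
sinfo                 # all GB200 compute tray nodes should be listed as idle in defq
sinfo -R              # lists the reason for any node that is down or drained
scontrol show nodes   # full per-node detail, as in the examples below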

Example: GB200 healthy:

root@a03-p1-head-01:~# scontrol show nodes b05-p1-dgx-05-c01
NodeName=b05-p1-dgx-05-c01 Arch=aarch64 CoresPerSocket=72
   CPUAlloc=0 CPUEfctv=144 CPUTot=144 CPULoad=0.24
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:4(S:0-1)
   NodeAddr=b05-p1-dgx-05-c01 NodeHostName=b05-p1-dgx-05-c01 Version=24.05.5
   OS=Linux 6.8.0-1018-nvidia-64k #20-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov  7 21:24:04 UTC 2024
   RealMemory=1700252 AllocMem=0 FreeMem=1704024 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2025-02-25T00:08:19 SlurmdStartTime=2025-02-25T02:17:43
   LastBusyTime=2025-02-25T02:17:43 ResumeAfterTime=None
   CfgTRES=cpu=144,mem=1700252M,billing=144,gres/gpu=4
   AllocTRES=
   CurrentWatts=0 AveWatts=0

Example: GB200 drain:

root@a03-p1-head-01:~# scontrol show nodes a05-p1-dgx-01-c13
NodeName=a05-p1-dgx-01-c13 Arch=aarch64 CoresPerSocket=72
   CPUAlloc=0 CPUEfctv=144 CPUTot=144 CPULoad=0.22
   AvailableFeatures=location=local
   ActiveFeatures=location=local
   Gres=gpu:4(S:0-1)
   NodeAddr=a05-p1-dgx-01-c13 NodeHostName=a05-p1-dgx-01-c13 Version=24.05.5
   OS=Linux 6.8.0-1018-nvidia-64k #20-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov  7 21:24:04 UTC 2024
   RealMemory=1700252 AllocMem=0 FreeMem=1702114 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2025-02-25T02:12:29 SlurmdStartTime=2025-02-25T02:17:55
   LastBusyTime=2025-02-25T02:17:55 ResumeAfterTime=None
   CfgTRES=cpu=144,mem=1700252M,billing=144,gres/gpu=4
   AllocTRES=
   CurrentWatts=0 AveWatts=0

   Reason=Low socket*core*thread count, Low CPUs : Not responding [slurm@2025-02-25T00:12:46]

Slurm Troubleshooting—Drain#

If a node is in the drain state, either it was taken down due to a node failure, or it was drained on purpose to pull it out of the pool of nodes available for work while maintenance or debugging is performed.

  1. Use scontrol show nodes <node in drain> to find the reason why the node is drained.

  2. If the administrator wants to put a node into the drain state, use scontrol update nodename=<nodename> state=drain reason="maintenance".

  3. When the node is fixed, add it back to the idle queue with scontrol update nodename=<nodename> state=resume.
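
For example (the node name is illustrative):

# drain a node for maintenance, confirm its state and reason, then return it to service
scontrol update nodename=b05-p1-dgx-05-c01 state=drain reason="maintenance"
scontrol show nodes b05-p1-dgx-05-c01 | grep -E "State=|Reason="
scontrol update nodename=b05-p1-dgx-05-c01 state=resume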

Slurm Troubleshooting—Down#

If a node is in down state:

  1. Use scontrol show nodes <nodename> to see if there is a clear reason it is down.

  2. Determine if slurmd is running on the worker/compute nodes:

    1. On the node, do systemctl status slurmd.

    2. If it is not in an active state, run systemctl start slurmd.

    3. To do this cluster-wide, use pdsh -w <node-hostname(s)> with the command above.

    4. At the category level, look at /etc/genders to see what categories are available, then do pdsh -g category=dgx-gb200 <command>.
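
For example, assuming the dgx-gb200 category name from /etc/genders:

# check slurmd on every node in the category, then start it where it is not running
pdsh -g category=dgx-gb200 "systemctl is-active slurmd"
pdsh -g category=dgx-gb200 "systemctl start slurmd"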

Slurm Troubleshooting—Inval#

If a node is showing invalid, the configuration of the node does not match what Slurm expects. Commonly this is due to an incorrect GPU count or missing GPU(s).

  1. If it shows Reason=gres/gpu count reported lower than configured (0 < 8), this means Slurm is expecting 8 GPUs and sees zero.

    1. This sometimes indicates that the autodetection of GPUs failed for some reason.

    2. One reason this could fail is that the cuda-dcgm package is missing from the DGX OS image:

      root@DGX-02:~# apt install cuda-dcgm
      root@DGX-02:~# systemctl start cuda-dcgm.service
      
  2. If it shows Reason=gres/gpu count reported lower than configured (7 < 8), then a GPU has failed, perhaps due to GPU tray seating issues (this should not be an issue in the GB200 generation).
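
A few quick checks on the affected node can narrow this down (a sketch; cuda-dcgm.service is the DCGM service referenced above):

nvidia-smi -L                        # GPUs visible to the driver on the node
slurmd -C                            # node configuration that slurmd would report to the controller
systemctl status cuda-dcgm.service   # DCGM service used for GPU autodetection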

Slurm Troubleshooting—MySQL#

If the slurmctld service logs (systemctl status slurmctld or journalctl -xeu slurmctld) or the slurmdbd service logs (systemctl status slurmdbd or journalctl -xeu slurmdbd) indicate that a connection is being refused, the MySQL password may need to be reset:

root@bcm11-head-01:~# systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; preset: enabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─99-cmd.conf
     Active: active (running) since Mon 2025-06-02 18:49:14 PDT; 1min 8s ago
   Main PID: 344308 (slurmctld)
      Tasks: 83
     Memory: 33.9M (peak: 55.6M)
        CPU: 339ms
     CGroup: /system.slice/slurmctld.service
             ├─344308 /cm/local/apps/slurm/24.11/sbin/slurmctld --systemd
             └─344374 "slurmctld: slurmscriptd"

Jun 02 18:50:21 bcm11-head-01 slurmctld[344308]: slurmctld: error: Sending PersistInit msg: Connection refused
Jun 02 18:50:21 bcm11-head-01 slurmctld[344308]: slurmctld: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Jun 02 18:50:21 bcm11-head-01 slurmctld[344308]: slurmctld: error: Sending PersistInit msg: Connection refused
Jun 02 18:50:21 bcm11-head-01 slurmctld[344308]: slurmctld: error: Still don't know my ClusterID
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: error: Retrying initial connection to slurmdbd
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: error: _open_persist_conn: failed to open persistent connection to host:master:6819: Connection refused
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: error: Sending PersistInit msg: Connection refused
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: error: Sending PersistInit msg: Connection refused
Jun 02 18:50:23 bcm11-head-01 slurmctld[344308]: slurmctld: error: Still don't know my ClusterID
  1. Set a new MySQL password for the slurm_acct_db database for the slurm user on the head node using /cm/local/apps/slurm/current/scripts/cm-restore-db-password.

    1. Specify the slurmdbd.conf path [/cm/shared/apps/slurm/etc/slurmdbd.conf].

    2. Specify the slurmdbd.conf template path [/cm/local/apps/slurm/current/templates/slurmdbd.conf.template].

    3. Set the MySQL password to match the head node password.

      Set a New MySQL Password
    4. If HA is configured, the utility will ask for the IP of the secondary head node. Leave blank if it is not.
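
After the utility completes, restarting the accounting daemons and re-checking the logs (a minimal sketch; the utility may also restart these services itself) confirms that the connection errors have cleared:

systemctl restart slurmdbd.service slurmctld.service
journalctl -u slurmctld.service --since "5 minutes ago" | grep -i "connection refused" \
  || echo "no connection-refused errors"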

Slurm Troubleshooting—Pyxis Plug-in Unavailable on GB200 Software Image#

If the software image for the GB200 compute trays was not assigned to the category when cm-wlm-setup was run, the Pyxis plug-in may be missing, as indicated by:

root@dgxos-image-ubuntu2404-aarch64:/# ls -la /cm/local/apps/slurm/current/lib64/slurm/spank_pyxis.so
/usr/bin/ls: cannot access '/cm/local/apps/slurm/current/lib64/slurm/spank_pyxis.so': No such file or directory

To correct this:

  1. Use the cm-chroot-sw-img tool to enter the GB200 software image, install pyxis-sources, and then run the command to compile and install the Pyxis plugin for Slurm.

    From the headnode OS prompt:

    cm-chroot-sw-img /cm/images/<gb200 sw img>
    apt update
    apt install pyxis-sources
    
    # This command ensures the compiler can find the necessary Slurm header files and places the compiled plugin in the directory used by the Slurm daemon.
    
    CPATH=/cm/local/apps/slurm/current/include /cm/local/apps/slurm/current/scripts/install-pyxis.sh -d /cm/local/apps/slurm/current/lib64/slurm
    
  2. Restart the compute nodes.
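
After the plugin is installed in the image, a quick check (a sketch; substitute the actual image name, and note that the cmsh reboot shown is one way to restart the category from the head node) confirms that Pyxis is available:

# confirm the plugin now exists inside the software image
ls -la /cm/images/<gb200 sw img>/cm/local/apps/slurm/current/lib64/slurm/spank_pyxis.so

# one way to restart the GB200 category from the head node
cmsh -c "device; reboot -c dgx-gb200"

# once the nodes are back up, with the slurm module loaded, the Pyxis container options should appear
srun --help | grep container-image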