Slurm Updates#

Please refer to the DGX SuperPOD Release Notes for the latest validated Slurm version supported on DGX BasePOD and SuperPOD.

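Before starting, it can help to confirm which Slurm package versions the BCM repositories currently offer and compare them against the release notes. A minimal check, assuming the BCM 10 repositories are already configured on the headnode and the packages follow the slurm23.02 naming used later in this section:

    root@demeter-headnode-01:~# apt update
    root@demeter-headnode-01:~# apt-cache policy slurm23.02

The Candidate line in the apt-cache policy output shows the newest version available from the configured repositories.
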
  1. On the headnode, check the currently installed Slurm version with sinfo. If the sinfo command is not found, run module load slurm and then rerun sinfo.

    root@demeter-headnode-01:~# sinfo --version
    slurm 23.02.6
    
  2. If Slurm jobs are running, either wait for them to complete or cancel them (a hedged example of cancelling the remaining jobs follows the listing). Drain the Slurm client nodes and confirm that no jobs remain running or completing on the DGX nodes.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% drain -l slurmclient
    [demeter-headnode-01->device]% quit
    root@demeter-headnode-01:~# squeue -t R,CG
            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    
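    If the remaining jobs must be stopped rather than left to finish, they can be cancelled with scancel. A minimal sketch, assuming it is acceptable to cancel every job still running or completing (otherwise cancel individual job IDs instead):

    root@demeter-headnode-01:~# squeue -h -t R,CG -o %i | xargs -r scancel
    root@demeter-headnode-01:~# squeue -t R,CG

    The first command lists the remaining running/completing job IDs without a header and passes them to scancel; the second confirms that the queue is empty.
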
  3. Stop the Slurm controller, accounting, and client services from cmsh, then confirm that the Slurm services are stopped on both headnodes (an optional check of the DGX nodes is sketched after the listing).

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% foreach -l slurmserver ( services; stop slurmctld )
    [demeter-headnode-01->device]% foreach -l slurmaccounting ( services; stop slurmdbd )
    [demeter-headnode-01->device]% foreach -l slurmclient ( services; stop slurmd )
    [demeter-headnode-01->device]% quit
    
    root@demeter-headnode-01:~# systemctl status slurmctld.service
    root@demeter-headnode-01:~# systemctl status slurmdbd.service
    
    root@demeter-headnode-02:~# systemctl status slurmctld.service
    root@demeter-headnode-02:~# systemctl status slurmdbd.service
    
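    Optionally, confirm that slurmd is stopped on the DGX nodes as well. A quick check, assuming pdsh is installed on the headnode and the nodes are named dgx-01 through dgx-31 as in the later steps (the || true keeps pdsh from flagging the non-zero exit code that systemctl returns for inactive services):

    root@demeter-headnode-01:~# pdsh -w dgx-[01-31] 'systemctl is-active slurmd || true' | dshbak -c

    Every node should report inactive.
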
  4. Identify the Slurm packages to update. First refresh the local package index, then list the currently installed Slurm packages along with the versions they can be upgraded to (a sketch after the listing shows how to extract just the package names for the next step).

    root@demeter-headnode-01:~# apt update
    root@demeter-headnode-01:~# apt list --installed | grep slurm
    
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    
    slurm23.02-client/BCM 10.0,now 23.02.6-100806-cm10.0-9c8dc03511 amd64 [installed,upgradable to: 23.02.8-100881-cm10.0-48e305b89c]
    slurm23.02-contribs/BCM 10.0,now 23.02.6-100806-cm10.0-9c8dc03511 amd64 [installed,upgradable to: 23.02.8-100881-cm10.0-48e305b89c]
    slurm23.02-devel/BCM 10.0,now 23.02.6-100806-cm10.0-9c8dc03511 amd64 [installed,upgradable to: 23.02.8-100881-cm10.0-48e305b89c]
    slurm23.02-perlapi/BCM 10.0,now 23.02.6-100806-cm10.0-9c8dc03511 amd64 [installed,upgradable to: 23.02.8-100881-cm10.0-48e305b89c]
    slurm23.02-slurmdbd/BCM 10.0,now 23.02.6-100806-cm10.0-9c8dc03511 amd64 [installed,upgradable to: 23.02.8-100881-cm10.0-48e305b89c]
    slurm23.02/BCM 10.0,now 23.02.6-100806-cm10.0-9c8dc03511 amd64 [installed,upgradable to: 23.02.8-100881-cm10.0-48e305b89c]
    
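    The package names reported above are the same ones installed in the next step. A small sketch, assuming all of the relevant packages start with the slurm prefix, that extracts just the names so they can be pasted into the apt install command; with the packages listed above, it prints:

    root@demeter-headnode-01:~# apt list --installed 2>/dev/null | awk -F/ '/^slurm/ {print $1}' | xargs
    slurm23.02 slurm23.02-client slurm23.02-contribs slurm23.02-devel slurm23.02-perlapi slurm23.02-slurmdbd
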
  5. Install the updated Slurm packages. Select <OK> if prompted to choose which services to restart (see the note after this step for unattended installs).

    root@demeter-headnode-01:~# apt install slurm23.02-client slurm23.02-contribs slurm23.02-devel slurm23.02-perlapi slurm23.02-slurmdbd slurm23.02
    
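    On Ubuntu-based headnodes the restart prompt is typically produced by needrestart. If the update is being scripted, one hedged way to avoid the dialog, assuming needrestart is indeed the source of the prompt, is to switch it to automatic restarts for the duration of the command:

    root@demeter-headnode-01:~# NEEDRESTART_MODE=a apt install -y slurm23.02-client slurm23.02-contribs slurm23.02-devel slurm23.02-perlapi slurm23.02-slurmdbd slurm23.02

    The same behavior can be made permanent in /etc/needrestart/needrestart.conf if desired.
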
  6. Verify the updated packages are installed.

    root@demeter-headnode-01:~# apt list --installed | grep slurm
    
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    
    slurm23.02-client/BCM 10.0,now 23.02.8-100881-cm10.0-48e305b89c amd64 [installed]
    slurm23.02-contribs/BCM 10.0,now 23.02.8-100881-cm10.0-48e305b89c amd64 [installed]
    slurm23.02-devel/BCM 10.0,now 23.02.8-100881-cm10.0-48e305b89c amd64 [installed]
    slurm23.02-perlapi/BCM 10.0,now 23.02.8-100881-cm10.0-48e305b89c amd64 [installed]
    slurm23.02-slurmdbd/BCM 10.0,now 23.02.8-100881-cm10.0-48e305b89c amd64 [installed]
    slurm23.02/BCM 10.0,now 23.02.8-100881-cm10.0-48e305b89c amd64 [installed]
    
  7. On the active headnode, run systemctl daemon-reload so systemd re-reads the updated Slurm unit files, which point to the new Slurm version directory. Then restart the Slurm services on the active headnode and verify they are ‘active’.

    root@demeter-headnode-01:~# systemctl daemon-reload
    root@demeter-headnode-01:~# systemctl restart slurmdbd
    root@demeter-headnode-01:~# systemctl restart slurmctld
    

    Note

    If the slurmdbd.service does not start (i.e., become active) on the active headnode, review the slurmdbd.service file located at /lib/systemd/system/ and check the Type setting and the -D option in the ExecStart line.

    • If the Type setting is set to forking, set it to simple.

    • If the -D option is missing in the ExecStart line, then add the -D option to the ExecStart line as shown below.

    Then run systemctl daemon-reload and restart the slurmdbd.service and slurmctld.service on the active headnode. (An alternative that avoids editing the packaged unit file directly is sketched after this note.)

    root@demeter-headnode-01:~# vi /lib/systemd/system/slurmdbd.service
    [Unit]
    Description=Slurm DBD accounting daemon
    After=network-online.target munge.service mysql.service mysqld.service mariadb.service
    Wants=network-online.target
    ConditionPathExists=/etc/slurm/slurmdbd.conf
    
    [Service]
    Type=simple
    EnvironmentFile=-/etc/sysconfig/slurmdbd
    EnvironmentFile=-/etc/default/slurmdbd
    ExecStart=/cm/shared/apps/slurm/23.02.8/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS
    ExecReload=/bin/kill -HUP $MAINPID
    LimitNOFILE=65536
    TasksMax=infinity
    
    # Uncomment the following lines to disable logging through journald.
    # NOTE: It may be preferable to set these through an override file instead.
    #StandardOutput=null
    #StandardError=null
    
    [Install]
    WantedBy=multi-user.target
    
    root@demeter-headnode-01:~# systemctl daemon-reload
    root@demeter-headnode-01:~# systemctl restart slurmdbd
    root@demeter-headnode-01:~# systemctl restart slurmctld
    
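    Editing the unit file under /lib/systemd/system/ works, but the change can be overwritten by a later package update. An alternative sketch, assuming the same Type and ExecStart adjustments are required, is a systemd drop-in override, which takes precedence over the packaged file:

    root@demeter-headnode-01:~# mkdir -p /etc/systemd/system/slurmdbd.service.d
    root@demeter-headnode-01:~# cat > /etc/systemd/system/slurmdbd.service.d/override.conf << 'EOF'
    [Service]
    Type=simple
    # Clear the packaged ExecStart before redefining it with the -D option.
    ExecStart=
    ExecStart=/cm/shared/apps/slurm/23.02.8/sbin/slurmdbd -D -s $SLURMDBD_OPTIONS
    EOF
    root@demeter-headnode-01:~# systemctl daemon-reload
    root@demeter-headnode-01:~# systemctl restart slurmdbd slurmctld
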
  8. Repeat steps 4 through 7 on the passive headnode; there, only the slurmctld.service needs to be restarted. Make sure Type is set to simple and the -D option is present in the ExecStart line of the slurmdbd.service file. (A quick version check across both headnodes is sketched after the listing.)

    root@demeter-headnode-02:~# apt update
    root@demeter-headnode-02:~# apt list --installed | grep slurm
    root@demeter-headnode-02:~# apt install slurm23.02-client slurm23.02-contribs slurm23.02-devel slurm23.02-perlapi slurm23.02-slurmdbd slurm23.02
    root@demeter-headnode-02:~# apt list --installed | grep slurm
    root@demeter-headnode-02:~# systemctl daemon-reload
    root@demeter-headnode-02:~# systemctl restart slurmctld
    
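    At this point both headnodes should report the updated Slurm version. A quick check, assuming the Slurm binaries are already in the shell's PATH (run module load slurm first if they are not):

    root@demeter-headnode-01:~# sinfo --version
    root@demeter-headnode-01:~# scontrol show config | grep -i slurm_version

    root@demeter-headnode-02:~# sinfo --version
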
  9. Next, update the Slurm client package in the software image used by the DGX nodes; the updated image is then applied to the DGX nodes in the following step. Use cm-chroot-sw-img to enter and update the image, and exit the chroot when done. Select <OK> if prompted that a newer kernel is available.

    root@demeter-headnode-01:~# cm-chroot-sw-img /cm/images/dgx-os-6.3.2-h100-image/
    root@dgx-os-6:/# apt update
    root@dgx-os-6:/# apt list --installed | grep slurm
    
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    
    slurm23.02-client/BCM 10.0,now 23.02.6-100806-cm10.0-9c8dc03511 amd64 [installed,upgradable to: 23.02.8-100881-cm10.0-48e305b89c]
    root@dgx-os-6:/# apt install slurm23.02-client
    root@dgx-os-6:/# apt list --installed | grep slurm
    
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    
    slurm23.02-client/BCM 10.0,now 23.02.8-100881-cm10.0-48e305b89c amd64 [installed]
    root@dgx-os-6:/# exit
    
  10. Apply the updated image to the DGX nodes. Do not exit cmsh until the imageupdate COMPLETED notification appears.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% imageupdate -w -c dgx-h100
    [demeter-headnode-01->device]%
    Fri May 30 00:41:38 2025 [notice] demeter-headnode-01: Provisioning started: sending demeter-headnode-01:/cm/images/dgx-os-6.3.2-h100-image-MOFED to dgx-07:/, mode UPDATE, dry run = no
    ….
    #output truncated
    ….
    imageupdate -c dgx-h100 -w [ COMPLETED ]
    
  11. Because the Slurm client package was updated, run systemctl daemon-reload on the DGX nodes so systemd re-reads the updated Slurm unit files, then restart slurmd.service on the DGX nodes so the updated Slurm client version is active. Verify that Slurm is running properly on the DGX nodes by using sinfo to check that the nodes are in the idle state. (A per-node version check is sketched after the listing.)

    root@demeter-headnode-01:~# pdsh -w dgx-[01-31] 'systemctl daemon-reload && systemctl restart slurmd'
    root@demeter-headnode-01:~# sinfo
    
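    To confirm that the DGX nodes are running the updated slurmd, the controller's view of a node can be checked from the headnode. A minimal sketch, using dgx-01 as a representative node name:

    root@demeter-headnode-01:~# scontrol show node dgx-01 | grep -i version
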
  12. Start the Slurm services from cmsh so that the service state is tracked by CMDaemon and retained across CMDaemon restarts. (A sketch for returning the drained nodes to service follows the status checks.)

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% foreach -l slurmserver ( services; start slurmctld )
    [demeter-headnode-01->device]% foreach -l slurmaccounting ( services; start slurmdbd )
    [demeter-headnode-01->device]% foreach -l slurmclient ( services; start slurmd )
    [demeter-headnode-01->device]% quit
    
    root@demeter-headnode-01:~# systemctl status slurmctld.service
    root@demeter-headnode-01:~# systemctl status slurmdbd.service
    
    root@demeter-headnode-02:~# systemctl status slurmctld.service
    root@demeter-headnode-02:~# systemctl status slurmdbd.service
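
    The DGX nodes were drained in step 2 and may still show a drained state in sinfo. Assuming the same slurmclient role selection used earlier, they can be returned to service from cmsh with undrain; verify the node states with sinfo afterwards:

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% undrain -l slurmclient
    [demeter-headnode-01->device]% quit
    root@demeter-headnode-01:~# sinfo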