Slurm Setup
Update the interface names on the slogin nodes.
% device use slogin-01
If slogin-01 does not have the expected interface names, update them.
% use networkdevicename
% set networkdevicename new-name
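For example, the full rename sequence for one interface might look like the following sketch; the interfaces submode step and the placeholder interface names are assumptions for this environment, so adjust them to match your hardware.

% device use slogin-01
% interfaces
% use <old interface name>
% set networkdevicename <new interface name>
% commit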
Assign the MAC addresses to the slogin nodes.
device use slogin-01
set mac <MAC address>
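Repeat for the second login node, if present, and commit the changes; the node name slogin-02 and the MAC address are placeholders.

device use slogin-02
set mac <MAC address>
commit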
Power on and install the slogin nodes.
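One way to power the nodes on from cmsh is sketched below; the -n node list and the node names are assumptions for this environment, and the nodes then provision over the network as usual.

cmsh
device
power on -n slogin-01,slogin-02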
Run the bcm-install-slurm script.
Use the following parameters:
Installation source for the --bcm-media parameter. It can be either a USB device or a path to a .iso file.
Use the -A parameter to run the script in air-gapped mode.
If CMHA is set up but has failover ping errors, append --ignore-ha-errors.
If there is only one slogin node, append --ignore-missing-login-node.

bcm-install-slurm -A --bcm-media <path to installer image or usb device to mount>
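For example, in an air-gapped environment with a single login node and known CMHA failover ping errors, the full invocation might look like the following; the .iso path is illustrative only.

bcm-install-slurm -A --bcm-media /root/bcm-installer.iso --ignore-ha-errors --ignore-missing-login-node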
Confirm that the slurmd file is present in the DGX image before provisioning the DGX nodes; if it is not, create it.
The same file is needed for both DGX A100 and DGX H100 systems. This example is for DGX H100 systems. NCCL tests run with PMIx have been observed to need this file.
cat /cm/images/dgx-os-6.2-h100-image/etc/sysconfig/slurmd
PMIX_MCA_ptl=^usock
PMIX_MCA_psec=none
PMIX_SYSTEM_TMPDIR=/var/empty
PMIX_MCA_gds=hash
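If the file is missing, one way to create it in the image is with a heredoc; the DGX A100 image path mentioned in the comment is an assumption based on the H100 path above.

cat <<'EOF' > /cm/images/dgx-os-6.2-h100-image/etc/sysconfig/slurmd
PMIX_MCA_ptl=^usock
PMIX_MCA_psec=none
PMIX_SYSTEM_TMPDIR=/var/empty
PMIX_MCA_gds=hash
EOF
# Repeat with the DGX A100 image path, for example /cm/images/dgx-os-6.2-a100-image (path is an assumption)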
Reboot the slogin and compute nodes.
cmsh
device
reboot -c slogin
reboot -c dgx-h100
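Once the nodes come back up, their state can be confirmed from cmsh; this is an optional quick check, and the exact listing format depends on your BCM version.

cmsh -c "device; status"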
To simplify the configuration, remove the slurm-client configuration overlay and rename the slurm-client-gpu overlay to slurm-client.
cmsh
configurationoverlay
remove slurm-client
commit
use slurm-client-gpu
set name slurm-client
commit
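To verify the rename, list the configuration overlays; only the renamed slurm-client overlay (carrying the GPU settings) should remain.

cmsh
configurationoverlay
list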
For DGX A100 systems, clear the Type value and set the correct core affinity for each GPU entry for maximum performance.
cmsh
configurationoverlay
use slurm-client
roles
use slurmclient
genericresources
use gpu0
clear type
set cores 48-63,176-191
use gpu1
clear type
set cores 48-63,176-191
use gpu2
clear type
set cores 16-31,144-159
use gpu3
clear type
set cores 16-31,144-159
use gpu4
clear type
set cores 112-127,240-255
use gpu5
clear type
set cores 112-127,240-255
use gpu6
clear type
set cores 80-95,208-223
use gpu7
clear type
set cores 80-95,208-223
commit
For DGX H100 systems, generic resources are set to autodetect.
Use this script.
cmsh
wlm
set gpuautodetect nvml
commit
configurationoverlay
use slurm-client
roles
use slurmclient
set gpuautodetect nvml
commit
genericresources
foreach * (remove)
commit
add autodetected-gpus
set name gpu
set count 8
set addtogresconfig yes
commit
Note
addtogresconfig is set to YES by default and does not need to be set explicitly.

This should yield output like the following:
[vikingbcmhead-01->configurationoverlay*[slurm-client*]->roles*[slurmclient*]->genericresources*[autodetected-gpus]]% ls
Alias (key)        Name     Type     Count    File
------------------ -------- -------- -------- ----------------
autodetected-gpus  gpu      H100     8
The gres.conf file is updated automatically by BCM. These settings align with the expectations of various scripts and tools in the NVIDIA ecosystem and maximize the compatibility of this environment with them.

If the /home directory is not mounted on the nodes, increase the number of retries. Due to a race condition between the bond0 interface coming up and /home being mounted, /home will sometimes fail to mount. Increasing the number of retries should fix the issue.
category
use dgx-h100
fsmounts
use /home
set mountoptions "x-systemd.mount-timeout=150,defaults,_netdev,retry=5,vers=3"
commit
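After the nodes are rebooted with the new mount options, the mount can be spot-checked from the head node; the node name below is illustrative.

ssh dgx-h100-01 mount | grep /home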
The pod setup might leave stale repositories in an air-gapped environment. In that case, the following files need to be adjusted manually on the login nodes.
cd /etc/apt/sources.list.d/
Disable the local repository and re-enable the standard repository lists:
mv local.list local.list.disabled
mv cm.disabled cm.list
mv cm-ml.disabled cm-ml.list
mv /etc/apt/sources.disabled /etc/apt/sources.list
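After the lists are adjusted, refresh the package index on each login node to confirm that the remaining repositories resolve.

apt update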