Slurm Setup
Update the interface names on the slogin nodes.
% device use slogin-01
If slogin-01 does not have the expected interface names, update them.
% use networkdevicename
% set networkdevicename new-name
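For example, the full rename sequence for one interface might look like the following sketch; the interfaces submode step and the placeholder interface names are assumptions for this environment, so adjust them to match your hardware.

% device use slogin-01
% interfaces
% use <old interface name>
% set networkdevicename <new interface name>
% commit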
Assign the MAC addresses to the slogin nodes.
device use slogin-01
set mac <MAC address>
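Repeat for the second login node, if present, and commit the changes; the node name slogin-02 and the MAC address are placeholders.

device use slogin-02
set mac <MAC address>
commit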
Power on and install the slogin nodes.
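One way to power the nodes on from cmsh is sketched below; the -n node list and the node names are assumptions for this environment, and the nodes then provision over the network as usual.

cmsh
device
power on -n slogin-01,slogin-02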
Run the bcm-install-slurm script.
Use the following parameters:
Installation source for the --bcm-media parameter. It can be either a USB device or a path to a .iso file.
Use the -A parameter to run the script in air-gapped mode.
If CMHA is set up but has failover ping errors, append --ignore-ha-errors.
If there is only one slogin node, append --ignore-missing-login-node.

bcm-install-slurm -A --bcm-media <path to installer image or usb device to mount>
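For example, in an air-gapped environment with a single login node and known CMHA failover ping errors, the full invocation might look like the following; the .iso path is illustrative only.

bcm-install-slurm -A --bcm-media /root/bcm-installer.iso --ignore-ha-errors --ignore-missing-login-node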
Confirm that the slurmd file is present in the DGX image before provisioning the DGX nodes; if it is not, create it.
The same file is needed for both DGX A100 and DGX H100 systems. This example is for DGX H100 systems. NCCL tests run with PMIx have been observed to need this file.
cat /cm/images/dgx-os-6.2-h100-image/etc/sysconfig/slurmd
PMIX_MCA_ptl=^usock
PMIX_MCA_psec=none
PMIX_SYSTEM_TMPDIR=/var/empty
PMIX_MCA_gds=hash
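If the file is missing, one way to create it in the image is with a heredoc; the DGX A100 image path mentioned in the comment is an assumption based on the H100 path above.

cat <<'EOF' > /cm/images/dgx-os-6.2-h100-image/etc/sysconfig/slurmd
PMIX_MCA_ptl=^usock
PMIX_MCA_psec=none
PMIX_SYSTEM_TMPDIR=/var/empty
PMIX_MCA_gds=hash
EOF
# Repeat with the DGX A100 image path, for example /cm/images/dgx-os-6.2-a100-image (path is an assumption)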
Reboot the slogin and compute nodes.
cmsh
device
reboot -c slogin
reboot -c dgx-h100
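Once the nodes come back up, their state can be confirmed from cmsh; this is an optional quick check, and the exact listing format depends on your BCM version.

cmsh -c "device; status"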
To simplify the configuration, remove the slurm-client configuration overlay and rename the slurm-client-gpu overlay to slurm-client.
cmsh
configurationoverlay
remove slurm-client
commit
use slurm-client-gpu
set name slurm-client
commit
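To verify the rename, list the configuration overlays; only the renamed slurm-client overlay (carrying the GPU settings) should remain.

cmsh
configurationoverlay
list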
For DGX A100 systems, clear the Type value and set the correct core affinity for each GPU entry for maximum performance.
cmsh
configurationoverlay
use slurm-client
roles
use slurmclient
genericresources
use gpu0
clear type
set cores 48-63,176-191
use gpu1
clear type
set cores 48-63,176-191
use gpu2
clear type
set cores 16-31,144-159
use gpu3
clear type
set cores 16-31,144-159
use gpu4
clear type
set cores 112-127,240-255
use gpu5
clear type
set cores 112-127,240-255
use gpu6
clear type
set cores 80-95,208-223
use gpu7
clear type
set cores 80-95,208-223
commit
For DGX H100 systems, generic resources are set to autodetect.
Use this script.
cmsh
wlm
set gpuautodetect nvml
commit
configurationoverlay
use slurm-client
roles
use slurmclient
set gpuautodetect nvml
commit
genericresources
foreach * (remove)
commit
add autodetected-gpus
set name gpu
set count 8
set addtogresconfig yes
commit
Note
addtogresconfig is set to YES by default and does not need to be set explicitly.

This should yield output like the following:
[vikingbcmhead-01->configurationoverlay*[slurm-client*]->roles*[slurmclient*]->genericresources*[autodetected-gpus]]% ls
Alias (key)        Name     Type     Count    File
------------------ -------- -------- -------- ----------------
autodetected-gpus  gpu      H100     8
The gres.conf file is updated automatically by BCM. These settings align with the expectations of various scripts and tools in the NVIDIA ecosystem and maximize the compatibility of this environment with them.

If the /home directory is not mounted on the nodes, increase the number of retries. Due to a race condition between the bond0 interface coming up and /home being mounted, /home will sometimes fail to mount. Increasing the number of retries should fix the issue.
category
use dgx-h100
fsmounts
use /home
set mountoptions "x-systemd.mount-timeout=150,defaults,_netdev,retry=5,vers=3"
commit
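After the nodes are rebooted with the new mount options, the mount can be spot-checked from the head node; the node name below is illustrative.

ssh dgx-h100-01 mount | grep /home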
The pod setup might leave stale repositories in an air-gapped environment. In that case, the following files need to be adjusted manually on the login nodes.
cd /etc/apt/sources.list.d/
Disable the local repository and re-enable the standard repository lists:
mv local.list local.list.disabled
mv cm.disabled cm.list
mv cm-ml.disabled cm-ml.list
mv /etc/apt/sources.disabled /etc/apt/sources.list
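After the lists are adjusted, refresh the package index on each login node to confirm that the remaining repositories resolve.

apt update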