Slurm Setup

  1. Update the interface names on the slogin nodes.

    % device use slogin-01
    
    1. If slogin-01 does not have the expected interface names, update them.

      % use networkdevicename
      % set networkdevicename new-name
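
      If you are unsure what the current interface entries look like, you can list them from the device's interfaces submode first. This is a sketch using standard cmsh commands; the output will vary by site.

        % device use slogin-01
        % interfaces
        % list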
      
  2. Assign the MAC addresses to the slogin nodes.

    device use slogin-01
    set mac <MAC address>
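
    For example, with an illustrative placeholder MAC address (replace it with the node's real MAC) and a commit so the change persists:

      device use slogin-01
      set mac 04:3F:72:AA:BB:CC
      commit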
    
  3. Power on and install the slogin nodes.

  4. Run the bcm-install-slurm script.

    Use the following parameters:

    1. Installation source for the --bcm-media parameter. It can be either a USB device or a path to a .iso file.

    2. Use the -A parameter to run the script in air-gapped mode.

    3. If CMHA is set up but has failover ping errors, append --ignore-ha-errors.

    4. If there is only one slogin node, append --ignore-missing-login-node.

      bcm-install-slurm -A --bcm-media <path to installer image or usb device to mount>
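
    For example, an air-gapped installation from a local .iso, with a single slogin node and CMHA failover ping warnings ignored, might look like the following. The .iso path is a hypothetical placeholder.

      bcm-install-slurm -A \
          --bcm-media /root/bcm-installer.iso \
          --ignore-ha-errors \
          --ignore-missing-login-node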
      
  5. Confirm that the slurmd file is present in the DGX image before provisioning the DGX nodes. If it is not, create it (a sketch follows the listing below).

    The same file is needed for both DGX A100 and DGX H100 systems; this example is for DGX H100 systems. NCCL tests that use PMIx require this file.

    cat /cm/images/dgx-os-6.2-h100-image/etc/sysconfig/slurmd
    PMIX_MCA_ptl=^usock
    PMIX_MCA_psec=none
    PMIX_SYSTEM_TMPDIR=/var/empty
    PMIX_MCA_gds=hash
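
    If the file does not exist, a minimal sketch for creating it in the image, using the same path and contents shown above, is:

    printf '%s\n' \
        'PMIX_MCA_ptl=^usock' \
        'PMIX_MCA_psec=none' \
        'PMIX_SYSTEM_TMPDIR=/var/empty' \
        'PMIX_MCA_gds=hash' \
        > /cm/images/dgx-os-6.2-h100-image/etc/sysconfig/slurmd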
    
  6. Reboot the slogin and compute nodes.

    cmsh
    device
    reboot -c slogin
    reboot -c dgx-h100
    
  7. To simplify the configuration, remove the slurm-client configuration overlay and rename the slurm-client-gpu overlay to slurm-client so that it takes over that name.

    cmsh
    configurationoverlay
    remove slurm-client
    commit
    use slurm-client-gpu
    set name slurm-client
    commit
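
    To confirm that only the renamed overlay remains, you can list the configuration overlays afterwards; list is a standard cmsh command and the output will vary by site.

      cmsh
      configurationoverlay
      list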
    
  8. For DGX A100 systems, clear the Type value and set the correct core association for each GPU entry for maximum performance.

     cmsh
     configurationoverlay
     use slurm-client
     roles
     use slurmclient
     genericresources
     use gpu0
     clear type
     set cores 48-63,176-191
     use gpu1
     clear type
     set cores 48-63,176-191
     use gpu2
     clear type
     set cores 16-31,144-159
     use gpu3
     clear type
     set cores 16-31,144-159
     use gpu4
     clear type
     set cores 112-127,240-255
     use gpu5
     clear type
     set cores 112-127,240-255
     use gpu6
     clear type
     set cores 80-95,208-223
     use gpu7
     clear type
     set cores 80-95,208-223
     commit
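
     The core ranges above mirror the CPU affinity that each GPU reports on a DGX A100. As an optional sanity check (not part of the BCM procedure), you can compare them against the CPU Affinity column printed by nvidia-smi on a booted node; the node name below is a placeholder.

       ssh dgx-a100-01 "nvidia-smi topo -m"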
    
  9. For DGX H100 systems, generic resources are set to autodetect.

    Use this script.

     cmsh
     wlm
     set gpuautodetect nvml
     commit
     configurationoverlay
     use slurm-client
     roles
     use slurmclient
     set gpuautodetect nvml
     commit
     genericresources
     foreach * (remove)
     commit
     add autodetected-gpus
     set name gpu
     set count 8
     set addtogresconfig yes
     commit
    

    Note

    addtogresconfig is set to YES by default and does not need to be set explicitly.

    This should yield output like the following:

    [vikingbcmhead-01->configurationoverlay*[slurm-client*]->roles*[slurmclient*]->genericresources*[autodetected-gpus]]% ls
    Alias (key)        Name     Type     Count    File
    ------------------ -------- -------- -------- ----------------
    autodetected-gpus  gpu      H100     8
    

    BCM updates the gres.conf file automatically. These settings match the expectations of various scripts and tools in the NVIDIA ecosystem, which maximizes the compatibility of this environment with them.
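
    Once the nodes have rejoined Slurm, you can check that the GPUs show up as generic resources. This is an optional verification; the node name below is a placeholder.

      sinfo -o "%N %G"
      scontrol show node dgx-h100-01 | grep -i gres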

  10. If the /home directory is not mounted on the nodes, increase the number of mount retries. A race condition between the bond0 interface coming up and /home being mounted sometimes leaves /home unmounted; increasing the number of retries should fix the issue.

    cmsh
    category
    use dgx-h100
    fsmounts
    use /home
    set mountoptions "x-systemd.mount-timeout=150,defaults,_netdev,retry=5,vers=3"
    commit
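
    After committing the change, the next time the nodes boot you can confirm on one of them that /home is mounted. The node name below is a placeholder.

      ssh dgx-h100-01 "findmnt /home"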
    

The pod setup might leave stale repositories in an air-gapped environment. In that case, the following repository files need to be adjusted manually on the login nodes.

cd /etc/apt/sources.list.d/

Disable the local repository and re-enable the previously disabled repository files:

mv local.list local.list.disabled
mv cm.disabled cm.list
mv cm-ml.disabled cm-ml.list
mv /etc/apt/sources.disabled /etc/apt/sources.list
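
After adjusting the repository files, an optional sanity check on each login node is to refresh the package lists and confirm that no stale repositories remain:

apt-get update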