Initial Cluster Setup

The deployment stage of a DGX SuperPOD consists of using BCM to provision and manage the Slurm cluster.

  1. Configure the NFS server.

    User home directories (home/) and shared data directories (cm_shared/), such as the DGX OS image, must be shared between head nodes and stored on an NFS filesystem for high availability (HA). Because DGX SuperPOD does not mandate the nature of the NFS storage, its configuration is outside the scope of this document. This DGX SuperPOD deployment uses the NFS export path provided in the site survey, /var/nfs/general. The following parameters are recommended for the NFS server export file, /etc/exports.

    /var/nfs/general *(rw,sync,no_root_squash,no_subtree_check)
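
As a minimal sketch, the recommended export options can be checked with a small shell helper before continuing. The function name check_export is hypothetical (it is not part of BCM or nfs-utils); it only reads the file it is given and assumes the export entry sits on a single line.

```shell
# Sketch only: check_export is a hypothetical helper, not a BCM or NFS tool.
# It verifies that an exports file lists the recommended options for a path.
check_export() {
    local exports_file="$1" export_path="$2"
    local line opt
    # Take the first line whose first field is exactly the export path
    # (commented-out entries do not match because their first field differs).
    line=$(awk -v p="$export_path" '$1 == p { print; exit }' "$exports_file")
    if [ -z "$line" ]; then
        echo "missing export: $export_path"
        return 1
    fi
    for opt in rw sync no_root_squash no_subtree_check; do
        # Options are comma-separated inside parentheses, so each option
        # must be delimited by "(" or "," before and "," or ")" after.
        if ! printf '%s\n' "$line" | grep -Eq "[(,]${opt}[,)]"; then
            echo "missing option: $opt"
            return 1
        fi
    done
    echo "ok"
}
```

On the NFS server this could be called as check_export /etc/exports /var/nfs/general. After editing /etc/exports, the exports are typically reloaded with exportfs -ra.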
    
  2. Configure the DGX systems to PXE boot by default.

    1. Using either KVM or a crash cart, connect to the DGX system, enter the BIOS menu, and configure Boot Option #1 to be [NETWORK].

      _images/cluster-setup-01.png
    2. Ensure that other Boot Options are [Disabled] and go to the next screen.

    3. Set Boot Option #1 and Boot Option #2 to use IPv4 for Storage 4-2 and Storage 5-2.

      _images/cluster-setup-02.png
    4. Ensure that other Boot Options are [Disabled].

    5. Select Save & Exit.

  3. On the failover head node and the CPU nodes, ensure that Network boot is configured as the primary option. Also ensure that the Mellanox ports connected to the network on the head and CPU nodes are set to Ethernet mode.

    This is an example of a system that will boot from the network with Slot 1 Port 2 and Slot 2 Port 2.

    _images/cluster-setup-03.png
  4. Download the BCM installer ISO.

  5. Burn the ISO to a DVD or to a bootable USB device.

    Alternatively, the ISO can be mounted as virtual media and installed using the BMC. The specific mechanism for virtual media varies by vendor.

  6. Ensure that the BIOS of the target head node is configured in UEFI mode and that its boot order is configured to boot the media containing the BCM installer image.

  7. Boot the installation media.

  8. At the GRUB menu, choose Start Base Command Manager Graphical Installer.

    _images/cluster-setup-04.png
  9. Select Start installation on the splash screen.

    _images/cluster-setup-05.png
  10. Accept the terms of the NVIDIA EULA by checking I agree and then select Next.

    _images/cluster-setup-06.png
  11. Accept the terms of the Ubuntu Server EULA by checking I agree and then select Next.

    _images/cluster-setup-07.png
  12. Unless instructed otherwise, select Next without modifying the kernel modules to be loaded at boot time.

    _images/cluster-setup-08.png
  13. Verify the Hardware info is correct and then select Next.

    For example, verify that the target storage device and the cabled host network interfaces are present. In this case, three NVMe drives are the target storage device, and ens1np0 and ens2np01 are the cabled host network interfaces.

    _images/cluster-setup-09.png
  14. On the Installation source screen, choose the appropriate source and then select Next.

    Running a media integrity check is optional.

    _images/cluster-setup-10.png
  15. On the Cluster settings screen, enter the required information and then select Next.

    _images/cluster-setup-11.png
  16. On the Workload manager screen, choose None and then select Next.

    _images/cluster-setup-12.png
  17. On the Network topology screen, choose the network type for the data center environment and then select Next.

    _images/cluster-setup-13.png
  18. On the Head node screen, enter the Hostname, Administrator password, choose Other for Hardware manufacturer, and then select Next.

    _images/cluster-setup-14.png
  19. On the Compute nodes screen, accept the defaults and then select Next.

    Ensure that the Node base name is node. Other values will be updated later in the installation.

    _images/cluster-setup-15.png
  20. On the BMC Configuration screen, choose No for both Head Node and Compute Nodes, and then select Next.

    These will be updated later in the post-install stages.

    _images/cluster-setup-16.png
  21. On the Networks screen, enter the required information for internalnet, and then select Next.

    Because a Type 2 network was specified, there are no other network tabs (for example, externalnet or ipminet).

    _images/cluster-setup-17.png
  22. On the Head node interfaces screen, ensure that one interface is configured with the head node’s target internalnet IP, and then select Next.

    Other interfaces will be configured by the post-install script.

    _images/cluster-setup-18.png
  23. On the Compute node interfaces screen, leave the default entries, and then select Next.

    These will be updated after installation.

    _images/cluster-setup-19.png
  24. On the Disk layout screen, select the target install location (in this case nvme0n1) and then select Next.

    _images/cluster-setup-20.png
  25. On the Disk layout settings screen, accept defaults and then select Next.

    These settings will be updated later in the post installation steps.

    _images/cluster-setup-21.png
  26. On the Additional software screen, do not select anything, and then select Next.

    _images/cluster-setup-22.png
  27. Confirm the information on the Summary screen and then select Next.

    The Summary screen provides an opportunity to confirm the Head node and basic cluster configuration before deployment begins. This configuration will be updated/modified for DGX SuperPOD after deployment is complete. If values do not match expectations, use the Back button to navigate to the appropriate screen to correct any mistake.

    _images/cluster-setup-23.png
  28. Once the deployment is complete, select Reboot.

    _images/cluster-setup-24.png
  29. License the cluster by running the request-license command and providing the product key.

    sudo -i request-license
    
    Product Key (XXXXXX-XXXXXX-XXXXXX-XXXXXX-XXXXXX):
    
  30. Options:

    1. If using the old method of MAC to IP allocation, skip step 32.

    2. If using the new method, which automatically detects MAC addresses based on switch and switch port, proceed to the next step.

    3. Before running the network automation application, complete the following prerequisite:

      1. Copy the p2p_ethernet.csv file from the USB stick to /cm/local/apps/bcm-superpod-network/config/p2p_ethernet.csv.

        mv p2p_ethernet.csv /cm/local/apps/bcm-superpod-network/config/
        
  31. Load the bcm-superpod-network module.

    module load bcm-superpod-network
    
  32. Run the bcm-netautogen script.

    bcm-netautogen
    

    Notice that new additional information is provided.

    _images/cluster-setup-29.png

    The script extracts data from the p2p_ethernet.csv file to compute the quantities of network switches, DGX systems, IBSW, and PDUs. Provide accurate values when prompted by the menu. Future releases will update the menu to use the counts derived from the physical cable connections.

    _images/cluster-setup-25.png

    The following generated files are important and contain data:

    • Site network configuration - /cm/local/apps/bcm-superpod-network/config/network-configuration.yml

    • Site network allocations - /cm/local/apps/bcm-superpod-network/config/network-allocations.yml

    • Switch connection - /cm/local/apps/bcm-superpod-network/config/switch-connections.yml

    • IP Allocation Readme file - /cm/local/apps/bcm-superpod-network/config/ip_allocations.md
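
Before moving on, it can help to confirm that all four generated files are actually present. The following is a minimal sketch; check_netautogen_outputs is a hypothetical helper name, and the only thing it assumes is the directory and file names listed above.

```shell
# Sketch only: check_netautogen_outputs is a hypothetical helper.
# It reports which of the files listed above exist and are non-empty.
check_netautogen_outputs() {
    local config_dir="${1:-/cm/local/apps/bcm-superpod-network/config}"
    local name missing=0
    for name in network-configuration.yml network-allocations.yml \
                switch-connections.yml ip_allocations.md; do
        if [ -s "$config_dir/$name" ]; then
            echo "ok       $name"
        else
            echo "MISSING  $name"
            missing=1
        fi
    done
    # Return non-zero if any file is absent or empty.
    return "$missing"
}
```

Called without arguments, the helper checks the default configuration directory. A non-zero return status suggests that bcm-netautogen did not complete and should be rerun before continuing.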

  33. Download cumulus-linux-5.5.1-mlx-amd64.bin and image-X86_64-3.11.2016.img and move them to the following directory on the head node. Contact your TAM for access to the correct files.

    mv cumulus-linux-5.5.1-mlx-amd64.bin /cm/local/apps/cmd/etc/htdocs/switch/image/
    mv image-X86_64-3.11.2016.img /cm/local/apps/cmd/etc/htdocs/switch/image/
    
  34. Load the bcm-post-install module.

    module load bcm-post-install/
    
  35. Run the bcm-pod-setup script.

    The parameters to use are:

    • -C sets the base address of the computenet network.

    • -S sets the base address of the storagenet network.

    • -I sets the installation source.

    bcm-pod-setup -C 100.126.0.0/16 -S 100.127.0.0/16 -I /dev/sdb
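
Because the -C and -S values are typed by hand, a quick format check can catch typos before the script runs. The sketch below is an assumption-laden illustration: is_ipv4_cidr is a hypothetical helper, and it only checks the a.b.c.d/prefix syntax, not whether the address is a valid network base or matches the site survey.

```shell
# Sketch only: is_ipv4_cidr is a hypothetical helper, not part of BCM.
# Returns 0 if the argument looks like an IPv4 network in CIDR notation.
is_ipv4_cidr() {
    local cidr="$1" addr prefix octet
    # Split "address/prefix"; reject input without a slash.
    case "$cidr" in
        */*) addr=${cidr%/*}; prefix=${cidr#*/} ;;
        *) return 1 ;;
    esac
    # The prefix must be digits only and at most 32.
    case "$prefix" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$prefix" -le 32 ] || return 1
    # The address must be exactly four dot-separated groups of 1 to 3 digits.
    printf '%s\n' "$addr" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$' || return 1
    # Each octet must be at most 255.
    local IFS=.
    for octet in $addr; do
        [ "$octet" -le 255 ] || return 1
    done
    return 0
}
```

For example, is_ipv4_cidr 100.126.0.0/16 && is_ipv4_cidr 100.127.0.0/16 succeeds for the values used above, while an address without a prefix length is rejected.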
    
  36. Check the nodes and their categories.

    Extra options are used for device list to make the format more readable.

    cmsh
    [bcm-head-01]% device list -f hostname:20,category:10
    

    Result:

    hostname(key)        category
    bcm-cpu-01           default
    bcm-dgx-a100-01.     dgx-a100
    bcm-dgx-h100-01.     dgx-h100
    
  37. Confirm that the configuration is correct for bcm-dgx-h100-01 and bcm-dgx-a100-01.

    [bcm-head-01->device[bcm-dgx-h100-01]]% interfaces
    [bcm-head-01->device[bcm-dgx-h100-01]->interfaces]% list
    Type         Network device name    IP               Network          Start if
    ------------ ---------------------- ---------------- ---------------- --------
    bmc          ipmi0                  10.0.92.50       ipminet          always
    bond         bond0 [prov]           10.0.93.12       dgxnet           always
    physical     enp170s0f1np1 (bond0)  0.0.0.0                           always
    physical     enp41s0f1np1 (bond0)   0.0.0.0                           always
    physical     ibp154s0               100.126.5.14     ibnetcompute     always
    physical     ibp170s0f0             100.127.2.2      ibnetstorage     always
    physical     ibp192s0               100.126.6.14     ibnetcompute     always
    physical     ibp206s0               100.126.7.14     ibnetcompute     always
    physical     ibp220s0               100.126.8.14     ibnetcompute     always
    physical     ibp24s0                100.126.1.14     ibnetcompute     always
    physical     ibp41s0f0              100.127.1.2      ibnetstorage     always
    physical     ibp64s0                100.126.2.14     ibnetcompute     always
    physical     ibp79s0                100.126.3.14     ibnetcompute     always
    physical     ibp94s0                100.126.4.14     ibnetcompute     always
    

    Note

    Enabling the CX7 firmware upgrade

    To upgrade the mlx firmware, set the RUN_FW_UPDATER_ONBOOT flag to ‘yes’. By default, this flag is set to ‘no’. This flag can be changed in the software image.

    For example (setting in the software image):

    cat /cm/images/<dgx image>/etc/infiniband/openib.conf | grep RUN_FW_UPDATER_ONBOOT

    RUN_FW_UPDATER_ONBOOT=yes
    

    Once set, power the node off and then on again using ipmitool.
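
One possible way to change the flag in the software image is an in-place edit with sed, sketched below. The helper name enable_fw_updater is hypothetical, and the sketch assumes the flag already appears in openib.conf as a KEY=value line; if it does not, the helper stops rather than guessing where to add it.

```shell
# Sketch only: enable_fw_updater is a hypothetical helper.
# It sets RUN_FW_UPDATER_ONBOOT=yes in the given openib.conf file.
enable_fw_updater() {
    local conf="$1"
    if [ ! -f "$conf" ]; then
        echo "not found: $conf"
        return 1
    fi
    if ! grep -q '^RUN_FW_UPDATER_ONBOOT=' "$conf"; then
        echo "flag not present in $conf; edit the file manually"
        return 1
    fi
    # Rewrite the existing setting in place; the original is kept as <file>.bak.
    sed -i.bak 's/^RUN_FW_UPDATER_ONBOOT=.*/RUN_FW_UPDATER_ONBOOT=yes/' "$conf"
    # Print the resulting line so the change can be confirmed.
    grep '^RUN_FW_UPDATER_ONBOOT=' "$conf"
}
```

The helper would be called with the path inside the image, for example enable_fw_updater "/cm/images/<dgx image>/etc/infiniband/openib.conf", substituting the actual image name.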

  38. Check that the Ethernet switches are in the device list.

    cmsh >> device >> list
    
    _images/cluster-setup-26.png

    Validate that devices of type Switch are listed after executing bcm-pod-setup.

  39. Add the switch credentials for each IPMI, TOR, and SPINE switch.

    _images/cluster-setup-27.png
    commit
    quit
    
  40. To allocate IP via switch port:

    • Run bcm-pod-setup first, and wait until all the network and device objects have been added to Bright.

    • Make sure that the IPMI switch is UP in Bright before moving to the next step.

    • Based on the switch and switch port configuration for each node, navigate to device mode in cmsh, select the node, and run the following command:

      • setmacviaswitchport …… Set the MAC of a device via the MAC found on its switch ports.

        • It will access the switch and pull the MAC address based on the switch port allocation.

        _images/cluster-setup-30.png
  41. To gather UFM metrics:

    • Add UFM to Bright with its management IP address.

    • Make sure that the UFM Prometheus exporter is enabled in UFM.

    ## You can check with a curl command from Bright:
    curl http://<UFM-IP>:9001/metrics

    ## Configure Bright with the following:
    monitoring setup
    add prometheus UFM
    set urls https://<UFM-IP>:9001/metrics
    set -e NoPostAllowed yes
    nodeexecutionfilters
    active
    commit

    ## Wait (~2 minutes) for data to be collected
    get measurables

    ## To plot
    monitoring labeledentity
    list

    ## Using the index value:
    instantquery <index value>
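
To make the curl check above a little more informative, the response can be piped through a small filter that counts metric samples. This is only a sketch: count_metric_samples is a hypothetical helper, and it assumes the exporter returns the usual Prometheus text format, in which lines starting with # are comments.

```shell
# Sketch only: count_metric_samples is a hypothetical helper.
# It reads Prometheus text-format output on stdin and prints the number
# of sample lines (lines that are neither blank nor "#" comments).
count_metric_samples() {
    grep -Evc '^[[:space:]]*(#|$)'
}
```

From the head node this could be used as curl -s http://<UFM-IP>:9001/metrics | count_metric_samples. A count of 0, or a curl error, suggests that the exporter is not enabled or not reachable.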