Head Node Configuration

This section addresses configuration steps to be performed on BCM head nodes.

  1. Log in to the BCM head node assigned to externalnet.

    ssh <externalnet>
    
  2. Install the cluster license by running the request-license command. Provide the product key.

    # request-license
    Product Key (XXXXXX-XXXXXX-XXXXXX-XXXXXX-XXXXXX): 123456-123456-123456-123456
    Country Name (2 letter code): US
    State or Province Name (full name): California
    Locality Name (e.g. city): Santa Clara
    Organization Name (e.g. company): NVIDIA
    Organizational Unit Name (e.g. department): TME
    Cluster Name: BasePOD
    Private key data saved to /cm/local/apps/cmd/etc/cluster.key.new

    Warning: Permanently added 'basepod-head1' (ECDSA) to the list of known hosts.
    MAC Address of primary head node (basepod-head1) for ens1f1np1 [04:3F:72:E7:67:1F]:
    Will this cluster use a high-availability setup with 2 head nodes? [y/N] n
    Certificate request data saved to /cm/local/apps/cmd/etc/cluster.csr.new
    Submit certificate request to http://licensing.brightcomputing.com/licensing/index.cgi ? [Y/n] Y
    Contacting http://licensing.brightcomputing.com/licensing/index.cgi...

    License granted.
    License data was saved to /cm/local/apps/cmd/etc/cluster.pem.new
    Install license? [Y/n] Y
    ========= Certificate Information ========
    Version:                    7.0
    Edition:                    Advanced
    Common name:                BasePod
    Organization:               NVIDIA
    Organizational unit:        Org
    Locality:                   Santa Clara
    State:                      CA
    Country:                    US
    Serial:                     2102313
    Starting date:              21/Nov/2022
    Expiration date:            17/Jan/2029
    MAC address / Cloud ID:     04:3F:72:E7:67:1F
    Licensed tokens:            1000
    Accounting & Reporting:     Yes
    Allow edge sites:           Yes
    License type:               Commercial
    ==========================================
    Is the license information correct ? [Y/n] Y
    
  3. Back up the default software image.

    The backup image can be used to create additional software images.

    # cmsh
    % softwareimage
    % clone default-image default-image-orig
    % commit
    
  4. Wait for the ramdisk to be regenerated and the following text to be displayed.

    Mon Feb 26 11:04:14 2024 [notice] bcm10-headnode: Initial ramdisk for image default-image-orig was generated successfully
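
The wait for this notice can be scripted. The following is a minimal Python sketch, assuming the notice keeps the wording shown above. The function name `ramdisk_ready` is hypothetical, and how the event text is captured (for example, from a saved cmsh session) is left open.

```python
import re

# Pattern based on the sample notice above; the exact wording is an
# assumption and may differ between BCM versions.
NOTICE = re.compile(
    r"\[notice\] (?P<host>\S+): Initial ramdisk for image "
    r"(?P<image>\S+) was generated successfully"
)


def ramdisk_ready(lines, image):
    """Return True if any line reports a finished ramdisk for `image`."""
    for line in lines:
        match = NOTICE.search(line)
        if match and match.group("image") == image:
            return True
    return False


sample = [
    "Mon Feb 26 11:04:14 2024 [notice] bcm10-headnode: Initial ramdisk for "
    "image default-image-orig was generated successfully",
]
print(ramdisk_ready(sample, "default-image-orig"))
```

Matching on the image name matters because later steps clone further images, each of which produces its own notice.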
    
  5. Create the DGX software image by cloning the default software image.

    This software image will be further configured and provisioned onto the DGX nodes.

    % softwareimage
    % clone default-image dgx-RHEL-image
    % commit
    
  6. Add the required kernel modules to the dgx-RHEL-image software image.

    % /
    % softwareimage
    % use dgx-RHEL-image
    % kernelmodules
    % add bonding
    % softwareimage commit
    
  7. Create the dgx-a100 category, assign the DGX software image to it, and modify its disksetup.

    % category
    % add dgx-a100
    % set softwareimage dgx-rhel-image
    % set disksetup

    Modify the /var partition size to match the following:

    <partition id="a2">
            <size>58G</size>
            <type>linux</type>
            <filesystem>xfs</filesystem>
            <mountPoint>/var</mountPoint>
            <mountOptions>defaults,noatime,nodiratime</mountOptions>
    </partition>

    % commit
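
Before committing, the edited partition entry can be sanity-checked offline. The sketch below parses the fragment shown above with Python's standard XML parser. It assumes the fragment has been copied out of the disksetup into a string, and the helper name `partition_summary` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the step above; a full disksetup contains more
# partitions and enclosing elements that are not shown here.
FRAGMENT = """
<partition id="a2">
    <size>58G</size>
    <type>linux</type>
    <filesystem>xfs</filesystem>
    <mountPoint>/var</mountPoint>
    <mountOptions>defaults,noatime,nodiratime</mountOptions>
</partition>
"""


def partition_summary(xml_text):
    """Return (id, mount point, size, filesystem) for one <partition>."""
    part = ET.fromstring(xml_text.strip())
    return (
        part.get("id"),
        part.findtext("mountPoint"),
        part.findtext("size"),
        part.findtext("filesystem"),
    )


print(partition_summary(FRAGMENT))
```

A parse error here points to a malformed tag, which is easier to catch before the category is committed.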
    
  8. Create the K8s software image by cloning the default software image.

    This software image will be further configured and provisioned onto the K8s master nodes.

    % softwareimage
    % clone default-image k8s-master-image
    % commit
    
  9. Add the required kernel modules to the k8s-master-image software image.

    % /
    % softwareimage
    % use k8s-master-image
    % kernelmodules
    % add mlx5_core
    % add bonding
    % softwareimage commit
    
  10. Create the k8s-master node category and assign the k8s-master-image software image to it.

    All nodes assigned to the k8s-master category will be provisioned with the k8s-master-image software image.

    % category
    % clone default k8s-master
    % set softwareimage k8s-master-image
    % commit
    
  11. Create the DGX nodes.

    node01 was created during head node installation. Clone node01 to create the DGX nodes, which will initially be named node02, node03, node04, and node05.

    % device
    % foreach --clone node01 -n node02..node05 ()
    % commit
    
  12. Rename the DGX nodes so they are more easily identified later.

    % use node02
    % set hostname dgx01
    % use node03
    % set hostname dgx02
    % use node04
    % set hostname dgx03
    % use node05
    % set hostname dgx04
    % device commit
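
For larger clusters, typing each rename by hand becomes error-prone. The sketch below generates the same command sequence as text. The function name `rename_commands` is hypothetical, and the output is meant to be reviewed before it is pasted into cmsh.

```python
def rename_commands(first_node, prefix, count):
    """Build cmsh lines renaming consecutive nodeNN entries to <prefix>NN."""
    lines = ["device"]
    for offset in range(count):
        old = f"node{first_node + offset:02d}"
        new = f"{prefix}{offset + 1:02d}"
        lines += [f"use {old}", f"set hostname {new}"]
    lines.append("device commit")
    return lines


# node02..node05 become dgx01..dgx04, as in the step above.
for line in rename_commands(2, "dgx", 4):
    print(line)
```

The same helper covers the K8s nodes with `rename_commands(6, "knode", 3)`.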
    
  13. Clone node01 to create the K8s control plane nodes, which will initially be named node06, node07, and node08.

    % device
    % foreach --clone node01 -n node06..node08 ()
    % commit
    
  14. Rename the K8s master nodes so they are more easily identifiable.

    % device
    % use node06
    % set hostname knode01
    % use node07
    % set hostname knode02
    % use node08
    % set hostname knode03
    % device commit
    
  15. Rename node01.

    The purpose of this step is to specify that node01 is only a template.

    % device
    % use node01
    % set hostname template01
    % commit
    
  16. Assign the DGX nodes to the dgx-a100 category.

    % foreach -n dgx01..dgx04 (set category dgx-a100)
    
  17. Assign the K8S nodes to the k8s-master node category.

    % foreach -n knode01..knode03 (set category k8s-master)
    % commit
    
  18. Check the nodes and their categories.

    Extra options are used for the device list to make the format more readable.

    [bcm10-headnode->device]% device list -f hostname:20,category:10,ip:20,status:15
    hostname (key)       category   ip                   status
    -------------------- ---------- -------------------- ---------------
    bcm10-headnode                  10.184.71.4          [   UP   ], he+
    dgx01                dgx-a100   10.184.71.4          [  DOWN  ], un+
    dgx02                dgx-a100   10.184.71.5          [  DOWN  ], un+
    dgx03                dgx-a100   10.184.71.6          [  DOWN  ], un+
    dgx04                dgx-a100   10.184.71.7          [  DOWN  ], un+
    knode01              k8s-master 10.184.71.7          [  DOWN  ], un+
    knode02              k8s-master 10.184.71.7          [  DOWN  ], un+
    knode03              k8s-master 10.184.71.7          [  DOWN  ], un+
    template01           default    10.184.71.4          [  DOWN  ], un+
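
When reviewing a listing like this one, it helps to flag rows that share an IP address, since those addresses have to be made unique in the interface steps that follow. The sketch below does this on pasted text. It assumes the column layout shown above, and `duplicate_ips` is a hypothetical helper name.

```python
import re
from collections import defaultdict

IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def duplicate_ips(listing):
    """Map each IP that appears on more than one row to its hostnames."""
    hosts_by_ip = defaultdict(list)
    for line in listing.splitlines():
        match = IPV4.search(line)
        if match:  # header and separator rows contain no IP address
            hosts_by_ip[match.group()].append(line.split()[0])
    return {ip: hosts for ip, hosts in hosts_by_ip.items() if len(hosts) > 1}


# First rows of the listing above, pasted as-is.
sample = """\
hostname (key)       category   ip                   status
-------------------- ---------- -------------------- ---------------
bcm10-headnode                  10.184.71.4          [   UP   ], he+
dgx01                dgx-a100   10.184.71.4          [  DOWN  ], un+
dgx02                dgx-a100   10.184.71.5          [  DOWN  ], un+
"""
print(duplicate_ips(sample))
```

The helper reads the hostname as the first whitespace-separated token, so it does not depend on exact column widths.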
    
  19. Add a network for InfiniBand (ibnet).

    % network
    % add ibnet
    % set domainname ibnet.cluster.local
    % set baseaddress 10.126.0.0
    % set netmaskbits 16
    % set mtu 2048
    % commit
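
To see what the base address and netmask bits above imply, the network can be inspected with Python's standard `ipaddress` module. This is only an illustrative check. The two sample addresses are taken from interface listings elsewhere in this section, and `in_ibnet` is a hypothetical helper name.

```python
import ipaddress

# Base address and netmask bits from the step above.
ibnet = ipaddress.ip_network("10.126.0.0/16")


def in_ibnet(address):
    """Return True if `address` falls inside the ibnet range."""
    return ipaddress.ip_address(address) in ibnet


print(ibnet.netmask, ibnet.num_addresses)
print(in_ibnet("10.126.3.13"), in_ibnet("10.184.71.11"))
```

A /16 leaves the third octet free, which is why the per-port InfiniBand addresses later in this section can differ in that octet and still belong to ibnet.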
    
  20. Verify the results.

    [bcm10-headnode->network[ibnet]]% list -f name:20,type:10,netmaskbits:10,baseaddress:15,domainname:20
    name (key)           type       netmaskbit baseaddress     domainname
    -------------------- ---------- ---------- --------------- --------------------
    externalnet          External   26         10.184.70.192   nvidia.com
    globalnet            Global     0          0.0.0.0         cm.cluster
    ibnet                Internal   16         0.0.0.0         ibnet.cluster.local
    internalnet          Internal   26         10.184.71.0     eth.cluster
    ipminet              Internal   26         10.184.70.64    ipmi.cluster
    
  21. Configure BCM to allow MAC addresses to PXE boot.

    In the file /cm/local/apps/cmd/etc/cmd.conf, uncomment the AdvancedConfig parameter and set it to DeviceResolveAnyMAC=1.

    AdvancedConfig = { "DeviceResolveAnyMAC=1" }
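
A quick way to confirm the edit took effect is to check that the setting appears on a line that is not commented out. The sketch below works on the file's text. It assumes comment lines in cmd.conf begin with `#`, and the function name is hypothetical.

```python
def resolve_any_mac_enabled(conf_text):
    """True if an uncommented AdvancedConfig line sets DeviceResolveAnyMAC=1."""
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#"):  # assumption: '#' marks a comment line
            continue
        if line.startswith("AdvancedConfig") and "DeviceResolveAnyMAC=1" in line:
            return True
    return False


before = '# AdvancedConfig = { "DeviceResolveAnyMAC=1" }\n'
after = 'AdvancedConfig = { "DeviceResolveAnyMAC=1" }\n'
print(resolve_any_mac_enabled(before), resolve_any_mac_enabled(after))
```

On the head node, the text would come from reading /cm/local/apps/cmd/etc/cmd.conf.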
    
  22. Restart the CMDaemon to enable reliable PXE booting from bonded interfaces.

    # systemctl restart cmd
    
  23. Configure Provisioning Interfaces on the DGX Nodes.

    Use a cmsh for loop to quickly add the new physical interfaces and the bond0 interface. This will update all four DGX A100 systems.

    # cmsh
    % device
    % foreach -n dgx01..dgx04 (interfaces; add physical enp225s0f1; add physical enp97s0f1; add physical enp225s0f1np1; add physical enp97s0f1np1; commit)
    % foreach -n dgx01..dgx04 (interfaces; add bond bond0; set interfaces enp225s0f1 enp97s0f1 enp225s0f1np1 enp97s0f1np1; set network internalnet; set mode 4; set options miimon=100; commit)
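
The two foreach lines are long enough that a typo in one interface name is easy to miss. As a sketch, they can be assembled from a list of interface names. `bond_commands` is a hypothetical helper that only builds text, and the bond settings are copied from the commands above.

```python
def bond_commands(node_range, interfaces, network="internalnet"):
    """Build the two cmsh foreach lines that add interfaces and bond0."""
    add = "; ".join(f"add physical {name}" for name in interfaces)
    members = " ".join(interfaces)
    return [
        f"foreach -n {node_range} (interfaces; {add}; commit)",
        f"foreach -n {node_range} (interfaces; add bond bond0; "
        f"set interfaces {members}; set network {network}; "
        "set mode 4; set options miimon=100; commit)",
    ]


dgx_ports = ["enp225s0f1", "enp97s0f1", "enp225s0f1np1", "enp97s0f1np1"]
for line in bond_commands("dgx01..dgx04", dgx_ports):
    print(line)
```

Keeping the interface names in one list guarantees that the bond members match the physical interfaces that were added.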
    
  24. Set the physical interface MAC addresses as appropriate, and set the ipmi0 and bond0 interfaces if they should be changed.

    This will need to be repeated on each DGX system (a single system shown here).

    # cmsh
    % device
    % use dgx01
    % interfaces
    % set enp225s0f1 mac B8:CE:F6:2F:08:69
    % set enp97s0f1 mac B8:CE:F6:2D:0E:A7
    % set enp225s0f1np1 mac B8:CE:F6:2F:08:69
    % set enp97s0f1np1 mac B8:CE:F6:2D:0E:A7
    % set ipmi0 ip 10.227.20.69
    % set bond0 ip 10.227.48.13
    % commit
    % list
    Type         Network device name    IP               Network          Start if
    ------------ ---------------------- ---------------- ---------------- --------
    bmc          ipmi0                  10.184.70.75     ipminet          always
    bond         bond0                  10.184.71.11     internalnet      always
    physical     BOOTIF [prov]          10.184.71.4      internalnet      always
    physical     enp225s0f1 (bond0)     0.0.0.0                           always
    physical     enp225s0f1np1 (bond0)  0.0.0.0                           always
    physical     enp97s0f1 (bond0)      0.0.0.0                           always
    physical     enp97s0f1np1 (bond0)   0.0.0.0                           always
    
  25. Set the bond0 interface as the provisioninginterface, and remove bootif. A for loop should be used here again.

    % /                     # go to top level of cmsh
    % device
    % foreach -n dgx01..dgx04 (set provisioninginterface bond0; commit; interfaces; remove bootif; commit)
    
  26. Verify the configuration.

    % device
    % use dgx01
    % get provisioninginterface
    bond0
    % interfaces
    % list
    Type         Network device name    IP               Network          Start if
    ------------ ---------------------- ---------------- ---------------- --------
    bmc          ipmi0                  10.184.70.75     ipminet          always
    bond         bond0 [prov]           10.184.71.11     internalnet      always
    physical     enp225s0f1 (bond0)     0.0.0.0                           always
    physical     enp225s0f1np1 (bond0)  0.0.0.0                           always
    physical     enp97s0f1 (bond0)      0.0.0.0                           always
    physical     enp97s0f1np1 (bond0)   0.0.0.0                           always
    
  27. Configure Provisioning Interfaces on the K8s Nodes.

    All the following steps in this section must be run for each of the three K8s nodes. Use a cmsh for loop to quickly add the new physical interfaces and the bond0 interface. This will update all 3 knodes.

    % /                     # go to top level of cmsh
    % device
    % foreach -n knode01..knode03 (interfaces; add physical ens1f1; add physical ens2f1; add physical ens1f1np1; add physical ens2f1np1; commit)
    % foreach -n knode01..knode03 (interfaces; add bond bond0; set interfaces ens1f1np1 ens2f1np1 ens1f1 ens2f1; set network internalnet; set mode 4; set options miimon=100; commit)
    
  28. Set the physical interface MAC addresses as appropriate, and set the ipmi0 and bond0 interfaces if they should be changed. This will need to be repeated on each knode system (a single system is shown here).

    % /
    % device
    % use knode01
    % interfaces
    % set ens1f1 mac 04:3F:72:E7:64:97
    % set ens1f1np1 mac 04:3F:72:E7:64:97
    % set ens2f1 mac 0C:42:A1:79:9B:15
    % set ens2f1np1 mac 0C:42:A1:79:9B:15
    % set ipmi0 ip 10.184.70.72
    % set bond0 ip 10.227.48.30
    % commit
    % list
    Type         Network device name  IP               Network          Start if
    ------------ -------------------- ---------------- ---------------- --------
    bmc          ipmi0                10.184.70.72     ipminet          always
    bond         bond0                10.184.71.8      internalnet      always
    physical     BOOTIF [prov]        10.184.71.7      internalnet      always
    physical     ens1f1 (bond0)       0.0.0.0                           always
    physical     ens1f1np1 (bond0)    0.0.0.0                           always
    physical     ens2f1 (bond0)       0.0.0.0                           always
    physical     ens2f1np1 (bond0)    0.0.0.0                           always
    
  29. Set the bond0 interface as the provisioninginterface, and remove bootif. A for loop should be used here again.

    % /                     # go to top level of cmsh
    % device
    % foreach -n knode01..knode03 (set provisioninginterface bond0; commit; interfaces; remove bootif; commit)
    
  30. Configure InfiniBand Interfaces on DGX Nodes.

    The following procedure adds four physical InfiniBand interfaces to each DGX node. Use a cmsh for loop to quickly add the new physical InfiniBand interfaces. This will update all four DGX nodes.

    % /                     # go to top level of cmsh
    % device
    % foreach -n dgx01..dgx04 (interfaces; add physical ibp12s0; set network ibnet; add physical ibp141s0; set network ibnet; add physical ibp186s0; set network ibnet; add physical ibp75s0; set network ibnet; commit)
    
  31. Set the IP addresses for each physical InfiniBand interface. This will need to be repeated on each DGX system (a single system is shown here). Increment the fourth octet of each IP by one for each subsequent system (so ibp12s0 for dgx02 would be 10.126.0.14).

    % /                     # go to top level of cmsh
    % device
    % use dgx01
    % interfaces
    % set ibp12s0 ip 10.126.0.13
    % set ibp141s0 ip 10.126.2.13
    % set ibp186s0 ip 10.126.3.13
    % set ibp75s0 ip 10.126.1.13
    % commit
    % list
    Type         Network device name    IP               Network          Start if
    ------------ ---------------------- ---------------- ---------------- --------
    bmc          ipmi0                  10.184.70.75     ipminet          always
    bond         bond0 [prov]           10.184.71.11     internalnet      always
    physical     enp225s0f1 (bond0)     0.0.0.0                           always
    physical     enp225s0f1np1 (bond0)  0.0.0.0                           always
    physical     enp97s0f1 (bond0)      0.0.0.0                           always
    physical     enp97s0f1np1 (bond0)   0.0.0.0                           always
    physical     ibp12s0                10.126.0.13      ibnet            always
    physical     ibp141s0               10.126.2.13      ibnet            always
    physical     ibp186s0               10.126.3.13      ibnet            always
    physical     ibp75s0                10.126.1.13      ibnet            always
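
The per-node addressing rule (same first three octets per port, fourth octet incremented by one per system) can be expressed as a short Python sketch. The dgx01 addresses are the ones shown above. `ib_commands` is a hypothetical helper that only prints text to review before entering it in cmsh.

```python
import ipaddress

# Interface-to-address mapping for dgx01, taken from the step above.
DGX01_IB = {
    "ibp12s0": "10.126.0.13",
    "ibp75s0": "10.126.1.13",
    "ibp141s0": "10.126.2.13",
    "ibp186s0": "10.126.3.13",
}


def ib_commands(node_index):
    """Return cmsh lines for dgx<NN>, where node_index 1 means dgx01."""
    lines = ["device", f"use dgx{node_index:02d}", "interfaces"]
    for name, base in DGX01_IB.items():
        # Adding an integer to an address object shifts the last octet.
        ip = ipaddress.ip_address(base) + (node_index - 1)
        lines.append(f"set {name} ip {ip}")
    lines.append("commit")
    return lines


for line in ib_commands(2):
    print(line)
```

For `ib_commands(2)`, the ibp12s0 line comes out as `set ibp12s0 ip 10.126.0.14`, matching the example given in the step.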
    
  32. Identify the nodes by setting the MAC address for the provisioning interface for each node to the MAC address listed in the site survey.

    % /
    % device
    % set dgx01 mac b8:ce:f6:2f:08:69
    % set dgx02 mac 0c:42:a1:54:32:a7
    % set dgx03 mac 0c:42:a1:0a:7a:51
    % set dgx04 mac 1c:34:da:29:17:6e
    % set knode01 mac 04:3F:72:E7:64:97
    % set knode02 mac 04:3F:72:D3:FC:EB
    % set knode03 mac 04:3F:72:D3:FC:DB
    % foreach -c dgx-a100,k8s-master (get mac)
    B8:CE:F6:2F:08:69
    0C:42:A1:54:32:A7
    0C:42:A1:0A:7A:51
    1C:34:DA:29:17:6E
    04:3F:72:E7:64:97
    04:3F:72:D3:FC:EB
    04:3F:72:D3:FC:DB
    % commit
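
Site survey data is often transcribed by hand, so it can be worth validating the MAC addresses before entering them. The sketch below normalizes each address to the upper-case form that cmsh echoes back above and reports any address assigned to two nodes. Both helper names are hypothetical.

```python
import re

MAC = re.compile(r"^[0-9A-F]{2}(:[0-9A-F]{2}){5}$")


def normalize_mac(mac):
    """Upper-case a MAC address and reject malformed input."""
    value = mac.strip().upper().replace("-", ":")
    if not MAC.match(value):
        raise ValueError(f"not a MAC address: {mac!r}")
    return value


def find_duplicates(macs_by_node):
    """Return (first_node, second_node, mac) for each repeated address."""
    seen, dupes = {}, []
    for node, mac in macs_by_node.items():
        value = normalize_mac(mac)
        if value in seen:
            dupes.append((seen[value], node, value))
        else:
            seen[value] = node
    return dupes


survey = {"dgx01": "b8:ce:f6:2f:08:69", "dgx02": "0c:42:a1:54:32:a7"}
print(find_duplicates(survey))
```

An empty result means every node in the survey has a distinct, well-formed provisioning MAC.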
    
  33. Update the K8s master node software image on the BCM head node.

    # cm-chroot-sw-img /cm/images/k8s-master-image/
    # subscription-manager register
    # dnf update && dnf -y upgrade --nobest
    # exit
    
  34. Set the knode OS kernel version to the latest one available.

    If you run into networking issues, it may be caused by the DNSSEC trust anchor file. To remedy this, follow instructions under the known issues section here.

    # cmsh
    % softwareimage
    % use k8s-master-image
    % set kernelversion 5.14.0-284.11.1.el9_2.x86_64
    % commit
    
  35. Install NVIDIA software onto the DGX RHEL image.

    The NVIDIA driver will update automatically during the Kubernetes setup. Make sure the version you specify here matches the version the NVIDIA driver will upgrade to. If the driver updates to a version different from what you specified, it could potentially cause issues with the NVIDIA software.

    # cm-chroot-sw-img /cm/images/dgx-rhel-image/
    # subscription-manager register
    # sudo subscription-manager release --set 9.3
    # sudo dnf install -y https://repo.download.nvidia.com/baseos/el/el-files/9/nvidia-repo-setup-22.12-1.el9.x86_64.rpm
    # sudo dnf update -y --nobest
    # sudo dnf group install -y 'DGX A100 Configurations' --allowerasing
    # sudo /usr/bin/configure_raid_array.py -c -f
    # sudo dnf module install -y --nobest nvidia-driver:550-open/{fm,src} --allowerasing
    # sudo dnf install -y nv-persistence-mode nvidia-fm-enable
    # sudo dnf group install -y --allowerasing 'NVIDIA Container Runtime'
    # exit
    
  36. Set the DGX OS kernel version to the latest one available.

    # cmsh
    % softwareimage
    % use dgx-rhel-image
    % set kernelversion 5.14.0-284.11.1.el9_2.x86_64
    % commit
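
Picking the latest kernel by eye is unreliable because a plain text sort places 70 after 284. The sketch below compares the numeric fields instead. It assumes version strings of the form shown above, uses one illustrative older entry that is not from this guide, and is not a full RPM version comparison.

```python
import re


def kernel_key(version):
    """Sort key from the numeric parts before the '.el' distribution tag."""
    numeric = version.split(".el")[0]
    return tuple(int(part) for part in re.findall(r"\d+", numeric))


def latest_kernel(versions):
    """Return the version string with the highest numeric key."""
    return max(versions, key=kernel_key)


candidates = [
    "5.14.0-284.11.1.el9_2.x86_64",  # version used in the step above
    "5.14.0-70.13.1.el9_0.x86_64",   # illustrative older entry (assumption)
]
print(latest_kernel(candidates))
```

The list of candidates would come from the kernels actually installed in the software image.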
    
  37. Power On and Provision Cluster Nodes.

    Now that the required post-installation configuration is complete, power on and provision the cluster nodes. After the initial provisioning, power control is available from within BCM, using either cmsh or Base View. For this initial provisioning, however, the nodes must be powered on outside of BCM (that is, using the power button or a KVM). It takes several minutes for the nodes to go through their BIOS. After that, the node status progresses as the nodes are provisioned. Watch the /var/log/messages and /var/log/node-installer log files to verify that everything is proceeding smoothly.