Head Node Configuration

This section describes the configuration steps to perform on the BCM head nodes.

Use the root (not cmsh) shell.

  1. In /cm/local/apps/cmd/etc/cmd.conf, uncomment the AdvancedConfig parameter.

    AdvancedConfig = { "DeviceResolveAnyMAC=1" } # modified value
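
A quick check that the edit took effect can be scripted before restarting the CMDaemon. The following is a minimal sketch in POSIX shell, not part of BCM: the function name `advanced_config_enabled` is hypothetical, and the pattern assumes the parameter sits on a single line, as in the example above.

```shell
# Hypothetical helper: succeeds (exit status 0) only if the given file has an
# uncommented AdvancedConfig line that contains DeviceResolveAnyMAC=1.
advanced_config_enabled() {
    # Lines starting with '#' do not match because the pattern is anchored
    # to optional leading whitespace followed directly by the parameter name.
    grep -Eq '^[[:space:]]*AdvancedConfig[[:space:]]*=.*DeviceResolveAnyMAC=1' "$1"
}

# Example (run as root on the head node):
#   advanced_config_enabled /cm/local/apps/cmd/etc/cmd.conf && echo "enabled"
```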
    
  2. Restart the CMDaemon to enable reliable PXE booting from bonded interfaces.

    systemctl restart cmd
    

    Restarting the CMDaemon disconnects any open cmsh session. After the CMDaemon has restarted, type connect to reconnect, or enter exit and then start cmsh again.
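
If the restart is scripted, it can help to wait until systemd reports the service as active before reconnecting. The sketch below assumes a POSIX shell; `wait_until` is a hypothetical helper, and the limit of 60 attempts in the usage comment is an arbitrary choice.

```shell
# Hypothetical helper: run a command once per second until it succeeds or
# the given number of attempts is used up. Returns 0 on success, 1 otherwise.
wait_until() {
    attempts="$1"
    shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example (run as root on the head node):
#   systemctl restart cmd
#   wait_until 60 systemctl is-active --quiet cmd && echo "CMDaemon is active"
```

An active unit only means that systemd started the process. It does not guarantee that the CMDaemon already accepts cmsh connections, so a reconnect may still need a retry.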

  3. On the head node, set the MAC addresses on the physical interfaces. Repeat this step and the verification step that follows it for every DGX system.

    Note

    Double-check the MAC address for each interface and the IP address for the bond0 interface. Mistakes here are difficult to diagnose.

    For each DGX H100 system, set the MAC and IP addresses as in this code block. Ensure that the addresses match the site survey.

    cmsh

    [bcm10-headnode->device]% use dgx-01
    [bcm10-headnode->device[dgx-01]]% set mac 94:6D:AE:53:91:FB
    [bcm10-headnode->device*[dgx-01*]]% interfaces
    [bcm10-headnode->device*[dgx-01*]->interfaces]% set enp170s0f1np1 mac 94:6D:AE:53:91:FB
    [bcm10-headnode->device*[dgx-01*]->interfaces*]% set enp41s0f1np1 mac 94:6D:AE:53:74:0B
    [bcm10-headnode->device*[dgx-01*]->interfaces*]% set ipmi0 ip 10.133.3.39
    [bcm10-headnode->device*[dgx-01*]->interfaces*]% set bond0 ip 10.133.5.31
    [bcm10-headnode->device*[dgx-01*]->interfaces*]% exit
    [bcm10-headnode->device*[dgx-01*]]% commit
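
With many DGX systems, typing each address by hand invites mistakes. The sketch below builds the same command sequence from site-survey values and rejects malformed MAC addresses first. Both function names are hypothetical, and the interface names are taken from the example session above and may differ on other hardware. The functions only print text and never call cmsh themselves.

```shell
# Hypothetical helper: succeeds if the argument is a colon-separated,
# six-octet hexadecimal MAC address.
is_valid_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# Hypothetical helper: print the cmsh commands from the session above as one
# semicolon-separated string. Arguments: hostname, MAC of enp170s0f1np1,
# MAC of enp41s0f1np1, ipmi0 IP, bond0 IP.
dgx_cmsh_commands() {
    is_valid_mac "$2" || { echo "invalid MAC: $2" >&2; return 1; }
    is_valid_mac "$3" || { echo "invalid MAC: $3" >&2; return 1; }
    printf 'device; use %s; set mac %s; interfaces; ' "$1" "$2"
    printf 'set enp170s0f1np1 mac %s; set enp41s0f1np1 mac %s; ' "$2" "$3"
    printf 'set ipmi0 ip %s; set bond0 ip %s; exit; commit\n' "$4" "$5"
}

# Example: review the generated commands before running them.
#   dgx_cmsh_commands dgx-01 94:6D:AE:53:91:FB 94:6D:AE:53:74:0B 10.133.3.39 10.133.5.31
```

The output mirrors the interactive session. Whether it can be passed unchanged to the non-interactive mode of cmsh (`cmsh -c`) is an assumption to confirm on the installed BCM version before relying on it.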
    
  4. Verify the configuration.

    [bcm10-headnode->device]% use dgx-01
    [bcm10-headnode->device*[dgx-01]]% interfaces
    [bcm10-headnode->device[dgx-01]->interfaces]% ls
    Type         Network device name    IP               Network          Start if
    ------------ ---------------------- ---------------- ---------------- --------
    bmc          ipmi0                  10.133.3.39      ipminet          always
    bond         bond0 [prov]           10.133.5.31      dgxnet1          always
    physical     enp170s0f1np1 (bond0)  0.0.0.0                           always
    physical     enp41s0f1np1 (bond0)   0.0.0.0                           always
    physical     ibp154s0               100.126.0.17     computenet       always
    physical     ibp170s0f0             100.127.0.14     storagenet       always
    physical     ibp192s0               100.126.0.18     computenet       always
    physical     ibp206s0               100.126.0.19     computenet       always
    physical     ibp220s0               100.126.0.20     computenet       always
    physical     ibp24s0                100.126.0.13     computenet       always
    physical     ibp41s0f0              100.127.0.13     storagenet       always
    physical     ibp64s0                100.126.0.14     computenet       always
    physical     ibp79s0                100.126.0.15     computenet       always
    physical     ibp94s0                100.126.0.16     computenet       always
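
When the listing has been saved to a file, the address of a single interface can be extracted and compared with the site survey. This is a minimal sketch using awk; `iface_ip` is a hypothetical name, and the parsing assumes the column layout shown in the listing above.

```shell
# Hypothetical helper: read an interface listing on standard input and print
# the IP address shown for the named network device.
iface_ip() {
    awk -v name="$1" '
        $2 == name {
            # The IP column position shifts when a tag such as [prov] follows
            # the device name, so take the first field that looks like IPv4.
            for (i = 3; i <= NF; i++) {
                if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) {
                    print $i
                    exit
                }
            }
        }
    '
}

# Example, assuming the listing was saved to interfaces.txt:
#   iface_ip bond0 < interfaces.txt
```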
    
  5. (Optional) Delete any extra DGX nodes that will not be provisioned. The nodes can be given as a comma-separated list or as a range, as in the example below.

    [bcm10-headnode]% device
    [bcm10-headnode->device]% remove -n dgx-21..dgx-31
    [bcm10-headnode->device*]% commit
    Successfully removed 11 Devices
    Successfully committed 0 Devices
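
Because a removal is hard to undo, it can be worth listing the names a range covers before committing. The sketch below expands a range locally and does not query BCM. `expand_node_range` is a hypothetical helper. It assumes an inclusive range whose two ends share the same prefix, which matches the 11 devices removed in the example above.

```shell
# Hypothetical helper: expand a range such as dgx-21..dgx-31 into one node
# name per line. Assumes both ends share the prefix before the final '-' and
# that the range is inclusive.
expand_node_range() {
    first="${1%%..*}"          # text before "..", e.g. dgx-21
    last="${1##*..}"           # text after "..", e.g. dgx-31
    prefix="${first%-*}-"      # e.g. dgx-
    start="${first##*-}"       # e.g. 21
    end="${last##*-}"          # e.g. 31
    awk -v p="$prefix" -v s="$start" -v e="$end" -v w="${#start}" 'BEGIN {
        fmt = "%s%0" w "d\n"   # keep the zero-padded width of the start value
        for (i = s + 0; i <= e + 0; i++) printf fmt, p, i
    }'
}

# Example: count the nodes a range would remove.
#   expand_node_range dgx-21..dgx-31 | wc -l
```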
    
  6. Delete the slogin nodes and create the first Kubernetes (k8s) master node by renaming cpu-01 to knode-01. The knodes will be configured during the Kubernetes setup.

    [bcm10-headnode]% device
    [bcm10-headnode->device]% remove -n slogin-01,slogin-02
    [bcm10-headnode->device*]% set cpu-01 hostname knode-01
    [bcm10-headnode->device*]% commit
    Successfully removed 2 Devices
    Successfully committed 1 Devices
    
  7. (Optional) If the head node will use a bonded interface, run the following commands. You may need to reboot the head node and repeat the request-license steps afterward.

    [bcm10-headnode]% device
    [bcm10-headnode->device]% use bcm10-headnode
    [bcm10-headnode->device[bcm10-headnode]]% interfaces
    [bcm10-headnode->device[bcm10-headnode]->interfaces]% clear ens3f1np1 ip
    [bcm10-headnode->device*[bcm10-headnode*]->interfaces*]% clear ens3f1np1 network
    [bcm10-headnode->device*[bcm10-headnode*]->interfaces*]% add physical ens2np0
    [bcm10-headnode->device*[bcm10-headnode*]->interfaces*[ens2np0*]]% set mac 88:e9:a4:20:18:d8
    [bcm10-headnode->device*[bcm10-headnode*]->interfaces*[ens2np0*]]% add bond bond0
    [bcm10-headnode->device*[bcm10-headnode*]->interfaces*[bond0*]]% append interfaces ens3f1np1 ens2np0
    [bcm10-headnode->device*[bcm10-headnode*]->interfaces*[bond0*]]% set mode 1
    [bcm10-headnode->device*[bcm10-headnode*]->interfaces*[bond0*]]% set network internalnet
    [bcm10-headnode->device*[bcm10-headnode*]->interfaces*[bond0*]]% set ip 10.133.4.24
    [bcm10-headnode->device*[bcm10-headnode*]->interfaces*[bond0*]]% ..
    [bcm10-headnode->device*[bcm10-headnode*]->interfaces*]% ..
    [bcm10-headnode->device*[bcm10-headnode*]]% set provisioninginterface bond0
    [bcm10-headnode->device*[bcm10-headnode*]]% commit
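
Once the bond is up, the status file of the Linux bonding driver can confirm which mode is in effect. This sketch assumes the standard driver, which exposes /proc/net/bonding/<bond name>, and assumes that the cmsh mode value maps directly to the Linux bonding mode number, where mode 1 is active-backup. `bond_mode` is a hypothetical helper.

```shell
# Hypothetical helper: print the bonding mode recorded in a Linux bonding
# status file (normally /proc/net/bonding/<bond name>).
bond_mode() {
    sed -n 's/^Bonding Mode: //p' "$1"
}

# Example (on the head node, after bond0 is up):
#   bond_mode /proc/net/bonding/bond0
```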
    
  8. Power on and provision the DGX nodes.

    For initial provisioning, the DGX nodes must be powered on either directly or by using a KVM. The nodes take several minutes to complete their BIOS initialization. After that, node status progress is displayed as the nodes are provisioned. Monitor the /var/log/messages and /var/log/node-installer log files to verify that provisioning is proceeding without errors.
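
When several nodes provision at once, the logs interleave. The sketch below filters them down to one node. It assumes that the relevant log lines contain the node's hostname, which should be confirmed against the actual log format. `node_log_lines` is a hypothetical helper.

```shell
# Hypothetical helper: print the lines that mention a node name from one or
# more log files. -F treats the name as a fixed string, -h omits file names.
node_log_lines() {
    node="$1"
    shift
    grep -hF -- "$node" "$@"
}

# Example (run as root on the head node):
#   node_log_lines dgx-01 /var/log/messages /var/log/node-installer
# To follow both logs live instead (GNU tail and grep):
#   tail -F /var/log/messages /var/log/node-installer | grep --line-buffered -F dgx-01
```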