Finalize Headnode Setup

The following steps are needed to ensure successful provisioning of the control plane nodes and the GB200 NVL72 rack(s).

Setting the Bond Priority for ipminet Reachability

For the reference architecture (RA), set the bond priorities, using the gateway metrics shown below, so that bond1 on the head node can reach all the IPMI networks in the cluster.

.. code-block:: console

   cmsh; network
   use internalnet
   show
   set gatewaymetric 5
   commit
   use ipminet0
   set gatewaymetric 10
   show
   commit
   quit
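
The interactive session above can also be expressed as a single command string. The following is a minimal sketch, assuming the network names match those used above. The helper name `gateway_metric_cmds` is hypothetical and not part of cmsh; it only prints the command string so that it can be reviewed before anything is changed.

```shell
#!/usr/bin/env bash
# Build a cmsh command string that sets the gateway metric on one or
# more networks. Arguments are name=metric pairs.
# gateway_metric_cmds is a hypothetical helper, not a cmsh feature.
gateway_metric_cmds() {
    local cmds="network"
    local pair
    for pair in "$@"; do
        # ${pair%%=*} is the network name, ${pair#*=} is the metric.
        cmds="${cmds}; use ${pair%%=*}; set gatewaymetric ${pair#*=}; commit"
    done
    printf '%s\n' "$cmds"
}

# Print the command string for the values used in this section.
gateway_metric_cmds internalnet=5 ipminet0=10
```

Once reviewed, the printed string could be passed to cmsh on the head node, for example with `cmsh -c "$(gateway_metric_cmds internalnet=5 ipminet0=10)"`.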

Update Partition(base) to Complete Type 3 Network Setup for DGX GB200 Systems

In the initial setup of the head node, a Type 3 network is selected, and a management network is defined (for DGX SuperPOD). However, in the partition settings, the external network setting needs to be changed from managementnet to internalnet. After this setting is changed, managementnet can be removed from the list of networks in the cluster.

cmsh;partition;set externalnetwork internalnet;commit

Note

The head node needs to be rebooted to ensure these changes take effect.

Reference: Partition(base) settings for Type 3 networks on DGX GB200.

.. code-block:: console

   [a03-p1-head-01->partition[base]]% show

   Parameter                              Value
   --------------------------------------  ---------------------------
   Cluster name                           Equinix SV11 GB200
   Revision
   Cluster reference architecture
   Administrator e-mail
   Name                                   base
   Headnode                               a03-p1-head-01
   Node basename                          node
   Node digits                            3
   Name servers                           10.61.13.53
   Name servers from dhcp
   Time servers                           10.10.10.53,10.10.10.54
   Search domains                         nvidia.com
   Relay Host
   Externally visible IP                  0.0.0.0
   Time zone                              America/Los_Angeles
   BMC Settings                           <submode>
   SNMP Settings                          <submode>
   DPU Settings                           <submode>
   SELinux Settings                       <submode>
   Access Settings                        <submode>
   Provisioning Settings                  <submode>
   ZTP settings                           <submode>
   ZTP new switch settings                <submode>
   NetQ settings                          <submode>
   UFM settings                           <submode>
   NMX-M settings                         <submode>
   Default burn configuration             default-destructive
   External network                       internalnet   # will initially be managementnet
   Management network                     internalnet
   No zero conf                           no
   Default category                       default-ubuntu2404-aarch64
   ArchOS                                 <2 in submode>
   Sign installer certificates            AUTO
   Failover                               b03-p1-head-02
   Failover groups                        <0 in submode>
   Burn configs                           <3 in submode>
   Notes                                  <0B>
   Wlm job power usage settings           <submode>
   Leak action policies                   <5 in submode>
   Active leak action policy
   BMS                                    Cronus
   Prometheus metric forwarders           <0 in submode>
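
The check against the reference output above can be scripted. The following is a minimal sketch, assuming the `show` output has been captured as text and that its two columns are separated by at least two spaces, as in the listing above. The function names `show_value` and `check_external_network` are hypothetical.

```shell
#!/usr/bin/env bash
# Print the value of one parameter from cmsh "show" output read on stdin.
# Columns are split on runs of two or more spaces, so parameter names
# that contain single spaces ("External network") stay intact.
show_value() {
    awk -v key="$1" -F '  +' '
        { sub(/^ +/, "") }             # drop leading indentation; awk re-splits the line
        $1 == key { print $2; exit }
    '
}

# Succeed only if the external network is internalnet.
check_external_network() {
    local value
    value="$(show_value "External network")"
    if [ "$value" = "internalnet" ]; then
        echo "OK: external network is internalnet"
    else
        echo "CHECK: external network is '${value}'"
        return 1
    fi
}

# Demonstration on three lines taken from the reference output.
check_external_network <<'EOF'
Name                                   base
External network                       internalnet
Management network                     internalnet
EOF
```

Before the change described in this section, the same check would report managementnet and return a non-zero status.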

fsexports

To make the required file paths available on other networks, do the following:

  1. Verify fsexports.

    1. For initial single head node deployments, CMDaemon (cmd) on the head node defines fsexports automatically for every network that is assigned as a management/boot network.

    2. For HA setups, /cm/shared and /home come from the external NFS server, but cmd still manages the /cm/node-installer fsexports automatically for each head node.

  2. If any fsexports are missing, add them manually. This should not normally be necessary.

The fsexports must be present so that /home, /cm/shared, and /cm/node-installer are accessible on the network. If they are missing, the following example shows how to add them.

Example: Adding dgxnet to fsexports

.. code-block:: console

   # If doing this manually, replace dgxnet in this example
   # with the intended network name.
   cmsh;device use master;fsexports;
   add /cm/node-installer-ubuntu2404-aarch64 dgxnet
   ..
   add /cm/node-installer-ubuntu2404-aarch64/certificates dgxnet
   set write yes
   ..
   add /var/spool/burn dgxnet
   set write yes
   ..
   add /home dgxnet
   set write yes
   set disabled no
   ..
   add /cm/shared-ubuntu2404-aarch64 dgxnet
   set write yes
   set disabled no
   ..
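
When several networks need the same exports, the command sequence above can be generated instead of typed. The following is a minimal sketch, assuming the same five paths and settings as the example. The function name `fsexport_cmds` and its two arguments (network name and directory suffix) are hypothetical, and the certificates path is assumed to sit under the node-installer directory, as in the listings in this section.

```shell
#!/usr/bin/env bash
# Print the cmsh commands that add the fsexports for one network.
#   $1  network name, for example dgxnet
#   $2  directory suffix, for example -ubuntu2404-aarch64 (may be empty)
# fsexport_cmds is a hypothetical helper; it only prints text.
fsexport_cmds() {
    local net="$1" suffix="$2"
    cat <<EOF
device use master
fsexports
add /cm/node-installer${suffix} ${net}
..
add /cm/node-installer${suffix}/certificates ${net}
set write yes
..
add /var/spool/burn ${net}
set write yes
..
add /home ${net}
set write yes
set disabled no
..
add /cm/shared${suffix} ${net}
set write yes
set disabled no
..
commit
EOF
}

# Print the commands for dgxnet with the aarch64 suffix.
fsexport_cmds dgxnet -ubuntu2404-aarch64
```

The output is meant to be reviewed and then entered in a cmsh session. Printing rather than executing keeps the sketch free of side effects.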

Example: Completed fsexports for dgxnet

.. code-block:: console

   [head-01->device*[head-01*]->fsexports*]% ls
   Name (key)                                   Path                             Network      Hosts  Write  Disabled
   -------------------------------------------  -------------------------------  -----------  -----  -----  --------
   /cm/node-installer@internalnet               /cm/node-installer               internalnet         no     no
   /cm/node-installer/certificates@internalnet  /cm/node-installer/certificates  internalnet         yes    no
   /var/spool/burn@internalnet                  /var/spool/burn                  internalnet         yes    no
   /home@internalnet                            /home                            internalnet         yes    no
   /cm/shared@internalnet                       /cm/shared                       internalnet         yes    no
   /cm/node-installer@dgxnet                    /cm/node-installer               dgxnet              no     no
   /cm/node-installer/certificates@dgxnet       /cm/node-installer/certificates  dgxnet              yes    no
   /var/spool/burn@dgxnet                       /var/spool/burn                  dgxnet              yes    no
   /home@dgxnet                                 /home                            dgxnet              yes    no
   /cm/shared@dgxnet                            /cm/shared                       dgxnet              yes    no
   [head-01->device*[head-01*]->fsexports*]% commit
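
A listing like the one above can be checked for missing entries. The following is a minimal sketch, assuming the `ls` output has been captured as text and that the first column holds the key in the form `<path>@<network>`. The function name `missing_fsexports` is hypothetical. Keys that cmsh truncates with a trailing `+` in wide listings will not match and are reported as missing.

```shell
#!/usr/bin/env bash
# Report which export paths are missing for a network.
#   $1      network name
#   $2...   export paths that are expected to exist
# Reads fsexports "ls" output on stdin. missing_fsexports is a
# hypothetical helper name.
missing_fsexports() {
    local net="$1"; shift
    local listing path
    listing="$(cat)"
    for path in "$@"; do
        # Exact match on the first column; exit status 0 means found.
        if ! printf '%s\n' "$listing" |
            awk -v k="${path}@${net}" '$1 == k { found = 1 } END { exit !found }'; then
            echo "missing: ${path}@${net}"
        fi
    done
}

# Demonstration: /var/spool/burn is absent from this two-line listing.
missing_fsexports dgxnet /home /cm/shared /var/spool/burn <<'EOF'
/home@dgxnet                                 /home                            dgxnet              yes    no
/cm/shared@dgxnet                            /cm/shared                       dgxnet              yes    no
EOF
```

No output means every expected key was found for that network.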

The following listing reflects what is set up automatically:

Name (key)                                           Path                                 Network            Hosts  Write  Disabled
---------------------------------------------------  -----------------------------------  -----------------  -----  -----  --------
/var/spool/burn@internalnet                          /var/spool/burn                      internalnet               yes    no
/var/spool/burn@dgxnet1                              /var/spool/burn                      dgxnet1                   yes    no
/var/spool/burn@dgxnet2                              /var/spool/burn                      dgxnet2                   yes    no
/var/spool/burn@ipminet0                             /var/spool/burn                      ipminet0                  yes    no
/var/spool/burn@ipminet1                             /var/spool/burn                      ipminet1                  yes    no
/var/spool/burn@ipminet2                             /var/spool/burn                      ipminet2                  yes    no
/var/spool/burn@ipminet3                             /var/spool/burn                      ipminet3                  yes    no

/cm/node-installer-ubuntu2404-x86_64@dgxnet2         /cm/node-installer-ubuntu2404-x86_64 dgxnet2                    no     no
/cm/node-installer-ubuntu2404-x86_64/certificat+     /cm/node-installer-ubuntu2404-x86_64/certificat+ dgxnet2         yes    no
/cm/node-installer-ubuntu2404-x86_64@internalnet     /cm/node-installer-ubuntu2404-x86_64 internalnet                no     no
/cm/node-installer-ubuntu2404-x86_64/certificat+     /cm/node-installer-ubuntu2404-x86_64/certificat+ internalnet       yes    no
/cm/node-installer-ubuntu2404-x86_64@dgxnet1         /cm/node-installer-ubuntu2404-x86_64 dgxnet1                    no     no
/cm/node-installer-ubuntu2404-x86_64/certificat+     /cm/node-installer-ubuntu2404-x86_64/certificat+ dgxnet1         yes    no

/cm/node-installer-ubuntu2404-aarch64@dgxnet2        /cm/node-installer-ubuntu2404-aarch64 dgxnet2                   no     no
/cm/node-installer-ubuntu2404-aarch64/certifica+     /cm/node-installer-ubuntu2404-aarch64/certifica+ dgxnet2        yes    no
/cm/node-installer-ubuntu2404-aarch64@internaln+     /cm/node-installer-ubuntu2404-aarch64 internalnet               no     no
/cm/node-installer-ubuntu2404-aarch64/certifica+     /cm/node-installer-ubuntu2404-aarch64/certifica+ internalnet       yes    no
/cm/node-installer-ubuntu2404-aarch64@dgxnet1        /cm/node-installer-ubuntu2404-aarch64 dgxnet1                   no     no
/cm/node-installer-ubuntu2404-aarch64/certifica+     /cm/node-installer-ubuntu2404-aarch64/certifica+ dgxnet1        yes    no

/home@ipminet1                                       /home                                ipminet1                  yes    yes
/cm/shared-ubuntu2404-aarch64@ipminet1               /cm/shared-ubuntu2404-aarch64        ipminet1                  yes    yes
/cm/shared-ubuntu2404-x86_64@ipminet1                /cm/shared-ubuntu2404-x86_64         ipminet1                  yes    yes

/home@ipminet3                                       /home                                ipminet3                  yes    yes
/cm/shared-ubuntu2404-aarch64@ipminet3               /cm/shared-ubuntu2404-aarch64        ipminet3                  yes    yes
/cm/shared-ubuntu2404-x86_64@ipminet3                /cm/shared-ubuntu2404-x86_64         ipminet3                  yes    yes

/home@ipminet0                                       /home                                ipminet0                  yes    yes
/cm/shared-ubuntu2404-aarch64@ipminet0               /cm/shared-ubuntu2404-aarch64        ipminet0                  yes    yes
/cm/shared-ubuntu2404-x86_64@ipminet0                /cm/shared-ubuntu2404-x86_64         ipminet0                  yes    yes

/home@storagenet                                     /home                                storagenet                yes    yes
/cm/shared-ubuntu2404-aarch64@storagenet             /cm/shared-ubuntu2404-aarch64        storagenet                yes    yes
/cm/shared-ubuntu2404-x86_64@storagenet              /cm/shared-ubuntu2404-x86_64         storagenet                yes    yes

/home@ipminet2                                       /home                                ipminet2                  yes    yes
/cm/shared-ubuntu2404-aarch64@ipminet2               /cm/shared-ubuntu2404-aarch64        ipminet2                  yes    yes
/cm/shared-ubuntu2404-x86_64@ipminet2                /cm/shared-ubuntu2404-x86_64         ipminet2                  yes    yes

/home@dgxnet2                                        /home                                dgxnet2                   yes    yes
/cm/shared-ubuntu2404-aarch64@dgxnet2                /cm/shared-ubuntu2404-aarch64        dgxnet2                   yes    yes
/cm/shared-ubuntu2404-x86_64@dgxnet2                 /cm/shared-ubuntu2404-x86_64         dgxnet2                   yes    yes

/home@internalnet                                    /home                                internalnet               yes    yes
/cm/shared-ubuntu2404-aarch64@internalnet            /cm/shared-ubuntu2404-aarch64        internalnet               yes    yes
/cm/shared-ubuntu2404-x86_64@internalnet             /cm/shared-ubuntu2404-x86_64         internalnet               yes    yes

/home@computenet                                     /home                                computenet                yes    yes
/cm/shared-ubuntu2404-aarch64@computenet             /cm/shared-ubuntu2404-aarch64        computenet                yes    yes
/cm/shared-ubuntu2404-x86_64@computenet              /cm/shared-ubuntu2404-x86_64         computenet                yes    yes

/home@dgxnet1                                        /home                                dgxnet1                   yes    yes
/cm/shared-ubuntu2404-aarch64@dgxnet1                /cm/shared-ubuntu2404-aarch64        dgxnet1                   yes    yes
/cm/shared-ubuntu2404-x86_64@dgxnet1                 /cm/shared-ubuntu2404-x86_64         dgxnet1                   yes    yes

/home@loopback                                       /home                                loopback                  yes    yes
/cm/shared-ubuntu2404-aarch64@loopback               /cm/shared-ubuntu2404-aarch64        loopback                  yes    yes
/cm/shared-ubuntu2404-x86_64@loopback                /cm/shared-ubuntu2404-x86_64         loopback                  yes    yes

/home@failovernet                                    /home                                failovernet               yes    yes
/cm/shared-ubuntu2404-aarch64@failovernet            /cm/shared-ubuntu2404-aarch64        failovernet               yes    yes
/cm/shared-ubuntu2404-x86_64@failovernet             /cm/shared-ubuntu2404-x86_64         failovernet               yes    yes

/cm/node-installer-ubuntu2404-x86_64@internalne+      /cm/node-installer-ubuntu2404-x86_64 internalnet2               no     no
/cm/node-installer-ubuntu2404-x86_64/certificat+      /cm/node-installer-ubuntu2404-x86_64/certificat+ internalnet2       yes    no
/cm/node-installer-ubuntu2404-aarch64@internaln+      /cm/node-installer-ubuntu2404-aarch64 internalnet2               no     no
/cm/node-installer-ubuntu2404-aarch64/certifica+      /cm/node-installer-ubuntu2404-aarch64/certifica+ internalnet2       yes    no

/var/spool/burn@internalnet2                         /var/spool/burn                      internalnet2               yes    no
/home@internalnet2                                   /home                                internalnet2               yes    yes
/cm/shared-ubuntu2404-aarch64@internalnet2           /cm/shared-ubuntu2404-aarch64        internalnet2               yes    yes
/cm/shared-ubuntu2404-x86_64@internalnet2            /cm/shared-ubuntu2404-x86_64         internalnet2               yes    yes

/home@kube-default-pod                               /home                                kube-default-pod           yes    yes
/cm/shared-ubuntu2404-aarch64@kube-default-pod       /cm/shared-ubuntu2404-aarch64        kube-default-pod           yes    yes
/cm/shared-ubuntu2404-x86_64@kube-default-pod        /cm/shared-ubuntu2404-x86_64         kube-default-pod           yes    yes

/home@kube-default-service                           /home                                kube-default-service       yes    yes
/cm/shared-ubuntu2404-aarch64@kube-default-serv+     /cm/shared-ubuntu2404-aarch64        kube-default-service       yes    yes
/cm/shared-ubuntu2404-x86_64@kube-default-servi+     /cm/shared-ubuntu2404-x86_64         kube-default-service       yes    yes

/cm/node-installer-ubuntu2404-x86_64@ipminet3        /cm/node-installer-ubuntu2404-x86_64 ipminet3                   no     no
/cm/node-installer-ubuntu2404-x86_64/certificat+     /cm/node-installer-ubuntu2404-x86_64/certificat+ ipminet3        yes    no
/cm/node-installer-ubuntu2404-aarch64@ipminet3       /cm/node-installer-ubuntu2404-aarch64 ipminet3                  no     no
/cm/node-installer-ubuntu2404-aarch64/certifica+     /cm/node-installer-ubuntu2404-aarch64/certifica+ ipminet3        yes    no

/cm/node-installer-ubuntu2404-x86_64@ipminet2        /cm/node-installer-ubuntu2404-x86_64 ipminet2                   no     no
/cm/node-installer-ubuntu2404-x86_64/certificat+     /cm/node-installer-ubuntu2404-x86_64/certificat+ ipminet2        yes    no
/cm/node-installer-ubuntu2404-aarch64@ipminet2       /cm/node-installer-ubuntu2404-aarch64 ipminet2                  no     no
/cm/node-installer-ubuntu2404-aarch64/certifica+     /cm/node-installer-ubuntu2404-aarch64/certifica+ ipminet2        yes    no

/cm/node-installer-ubuntu2404-x86_64@ipminet1        /cm/node-installer-ubuntu2404-x86_64 ipminet1                   no     no
/cm/node-installer-ubuntu2404-x86_64/certificat+     /cm/node-installer-ubuntu2404-x86_64/certificat+ ipminet1        yes    no
/cm/node-installer-ubuntu2404-aarch64@ipminet1       /cm/node-installer-ubuntu2404-aarch64 ipminet1                  no     no
/cm/node-installer-ubuntu2404-aarch64/certifica+     /cm/node-installer-ubuntu2404-aarch64/certifica+ ipminet1        yes    no

/cm/node-installer-ubuntu2404-x86_64@ipminet0        /cm/node-installer-ubuntu2404-x86_64 ipminet0                   no     no
/cm/node-installer-ubuntu2404-x86_64/certificat+     /cm/node-installer-ubuntu2404-x86_64/certificat+ ipminet0        yes    no
/cm/node-installer-ubuntu2404-aarch64@ipminet0       /cm/node-installer-ubuntu2404-aarch64 ipminet0                  no     no
/cm/node-installer-ubuntu2404-aarch64/certifica+     /cm/node-installer-ubuntu2404-aarch64/certifica+ ipminet0        yes    no

Note

The "Completed fsexports for dgxnet" example above omits the -ubuntu2404-aarch64 suffix from the /cm/shared and /cm/node-installer directory names.

The following steps will help to ensure successful provisioning of the control plane nodes and the GB200 nodes.

Enable Dependable PXE Booting

  1. Use the root shell (not cmsh).

  2. In /cm/local/apps/cmd/etc/cmd.conf, add the following AdvancedConfig parameter.

    AdvancedConfig = { "DeviceResolveAnyMAC=1" } # modified value

  3. Restart CMDaemon to enable dependable PXE booting from bonded interfaces.

    # systemctl restart cmd

    Restarting CMDaemon disconnects any open cmsh session. After CMDaemon has restarted, type connect to reconnect, or enter exit and then start cmsh again.
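
The edit in step 2 can be made repeatable with a small guard. The following is a minimal sketch, assuming GNU grep and a cmd.conf in which AdvancedConfig appears at most once. The function names `has_any_mac` and `ensure_any_mac` are hypothetical. If an AdvancedConfig line with other values already exists, the sketch stops and leaves the merge to the administrator.

```shell
#!/usr/bin/env bash
# Return 0 if the file already enables DeviceResolveAnyMAC on an
# uncommented AdvancedConfig line. Hypothetical helper name.
has_any_mac() {
    grep -Eq '^[[:space:]]*AdvancedConfig[[:space:]]*=.*"DeviceResolveAnyMAC=1"' "$1"
}

# Add the parameter only when no AdvancedConfig line exists yet.
ensure_any_mac() {
    local conf="$1"
    if has_any_mac "$conf"; then
        echo "already set"
    elif grep -Eq '^[[:space:]]*AdvancedConfig[[:space:]]*=' "$conf"; then
        # Other AdvancedConfig values are present; do not guess at a merge.
        echo "edit manually: AdvancedConfig already defined"
        return 1
    else
        cp -p "$conf" "${conf}.bak"    # keep a backup next to the file
        printf '\nAdvancedConfig = { "DeviceResolveAnyMAC=1" }\n' >> "$conf"
        echo "added"
    fi
}

# Demonstration on a scratch file rather than the real cmd.conf.
demo="$(mktemp)"
printf '# scratch copy for demonstration\n' > "$demo"
ensure_any_mac "$demo"
ensure_any_mac "$demo"
rm -f "$demo" "${demo}.bak"
```

On the head node the function would be called with /cm/local/apps/cmd/etc/cmd.conf, followed by the CMDaemon restart from step 3.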

Disable Node BMC Setup in the Node-installer

The global node-installer.conf file does not override the per-architecture node-installer.conf files, so the node-installer.conf for each architecture must be modified. In multi-arch or multi-distro setups, /cm/node-installer should itself be a symlink to /cm/node-installer-<headnodedistro>-<headnodearch>.

Make the changes shown in the example below in each of the following node-installer.conf files:

  1. vi /cm/node-installer/scripts/node-installer.conf

  2. vi /cm/node-installer-ubuntu2404-aarch64/scripts/node-installer.conf

  3. vi /cm/node-installer-ubuntu2404-x86_64/scripts/node-installer.conf
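
Because /cm/node-installer may be a symlink to one of the per-architecture directories, the same file can be reachable under two paths. The following is a minimal sketch, assuming GNU `readlink -f` is available, that lists each distinct node-installer.conf once. The function name `list_installer_confs` is hypothetical.

```shell
#!/usr/bin/env bash
# Print the resolved path of every node-installer.conf under a root
# directory (default /cm), with symlinked duplicates removed.
# list_installer_confs is a hypothetical helper name.
list_installer_confs() {
    local root="${1:-/cm}" conf
    for conf in "$root"/node-installer*/scripts/node-installer.conf; do
        [ -e "$conf" ] || continue     # the glob matched nothing
        readlink -f "$conf"            # resolve symlinks to the real file
    done | sort -u
}

# Prints nothing on a machine without these directories.
list_installer_confs /cm
```

The resulting list shows which files actually need editing when the symlink is in place.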

Example: node-installer.conf settings

# Set this to false if, for some reason, the installer fails to setup
# the BMC hardware correctly. In that case do it manually, or use
# a custom finalize script.
setupBmc = false

# Set this to false if the Node Installer should just skip BMC network
# devices if they are configured but not detected. By default it will
# halt when this happens.
failOnMissingBmc = false

# Some BMC hardware have user ID's for which the user name can not be modified.
# If the user ID is set to such an ID the Node Installer would halt because it
# can not change the user name. When this setting is set to false (the default)
# the Node Installer will try to find an alternative user ID. When this setting
# is set to true, the Node Installer will only attempt to set the configured
# user ID and leave any other ID's alone.
strictBmcUserId = false
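
The three settings above can be applied to every node-installer.conf in one pass. The following is a minimal sketch, assuming GNU sed and the `key = value` layout shown in the example. The function names `set_conf_key` and `disable_bmc_setup` are hypothetical. Each file is backed up before it is changed.

```shell
#!/usr/bin/env bash
# Set "key = value" in a file, replacing an existing uncommented
# assignment or appending one if none is present.
# Assumes GNU sed (-i and -E). Hypothetical helper name.
set_conf_key() {
    local file="$1" key="$2" value="$3"
    if grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
        sed -i -E "s|^[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
    else
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Apply the settings from this section to each file given as an argument.
disable_bmc_setup() {
    local file
    for file in "$@"; do
        cp -p "$file" "${file}.bak"    # keep a backup next to the file
        set_conf_key "$file" setupBmc false
        set_conf_key "$file" failOnMissingBmc false
        set_conf_key "$file" strictBmcUserId false
    done
}

# Demonstration on a scratch file rather than a real node-installer.conf.
demo="$(mktemp)"
printf 'setupBmc = true\n' > "$demo"
disable_bmc_setup "$demo"
cat "$demo"
rm -f "$demo" "${demo}.bak"
```

On the head node the function would be called with the node-installer.conf paths listed earlier in this section.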