Finalize Headnode Setup#
The following steps are needed to ensure successful provisioning of the control plane nodes and the GB200 NVL72 rack(s).
Setting the Bond Priority for ipminet reachability#
For the reference architecture (RA), the correct bond priorities must be set, through the gateway metric of each network, so that bond1 on the head node can reach all of the IPMI networks within the cluster.
cmsh; network
use internalnet
show
set gatewaymetric 5
commit
use ipminet0
set gatewaymetric 10
show
commit
quit
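The interactive session above can also be expressed as a single command string. The following is a minimal sketch, assuming the string is later handed to cmsh non-interactively (for example with `cmsh -c`, which should be verified on the target release). The helper name `gateway_metric_commands` is hypothetical, and the metrics are the ones used above.

```python
def gateway_metric_commands(metrics):
    """Build one cmsh command string that sets gatewaymetric per network.

    metrics maps a network name to its gateway metric, in the order
    the settings should be applied.
    """
    parts = ["network"]
    for name, metric in metrics.items():
        # Each network is selected, changed, and committed in turn.
        parts += [f"use {name}", f"set gatewaymetric {metric}", "commit"]
    return "; ".join(parts)


if __name__ == "__main__":
    # Values from the interactive example: internalnet 5, ipminet0 10.
    print(gateway_metric_commands({"internalnet": 5, "ipminet0": 10}))
```

Keeping the network-to-metric mapping in one place makes it easier to review the intended values before anything is committed.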
Update Partition(base) to Complete Type 3 Network Setup for DGX GB200 Systems#
In the initial setup of the head node, a Type 3 network is selected, and a management network is defined (for DGX SuperPOD). However, in the partition settings, the external network setting needs to be changed from managementnet to internalnet. After this setting is changed, managementnet can be removed from the list of networks in the cluster.
cmsh;partition;set externalnetwork internalnet;commit
Note
The head node needs to be rebooted to ensure these changes take effect.
Reference: Partition(base) settings for Type 3 networks on DGX GB200.
[a03-p1-head-01->partition[base]]% show
Parameter Value
-------------------------------------- ---------------------------
Cluster name Equinix SV11 GB200
Revision
Cluster reference architecture
Administrator e-mail
Name base
Headnode a03-p1-head-01
Node basename node
Node digits 3
Name servers 10.61.13.53
Name servers from dhcp
Time servers 10.10.10.53,10.10.10.54
Search domains nvidia.com
Relay Host
Externally visible IP 0.0.0.0
Time zone America/Los_Angeles
BMC Settings <submode>
SNMP Settings <submode>
DPU Settings <submode>
SELinux Settings <submode>
Access Settings <submode>
Provisioning Settings <submode>
ZTP settings <submode>
ZTP new switch settings <submode>
NetQ settings <submode>
UFM settings <submode>
NMX-M settings <submode>
Default burn configuration default-destructive
External network internalnet # will initially be managementnet
Management network internalnet
No zero conf no
Default category default-ubuntu2404-aarch64
ArchOS <2 in submode>
Sign installer certificates AUTO
Failover b03-p1-head-02
Failover groups <0 in submode>
Burn configs <3 in submode>
Notes <0B>
Wlm job power usage settings <submode>
Leak action policies <5 in submode>
Active leak action policy
BMS Cronus
Prometheus metric forwarders <0 in submode>
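To confirm the change after the reboot, the `show` output can be checked programmatically. This is a minimal sketch, assuming the output has been captured as text; the helper name `partition_setting` is hypothetical. It matches on the full parameter name, so a name that is a prefix of another parameter (such as Name and Name servers) returns the first matching line.

```python
def partition_setting(show_output, parameter):
    """Return the value shown for `parameter`, or None if it is not listed."""
    for line in show_output.splitlines():
        # Drop trailing annotations such as "# will initially be ...".
        text = line.split("#", 1)[0].strip()
        if text == parameter:
            return ""  # parameter is listed but has no value
        if text.startswith(parameter + " "):
            return text[len(parameter):].strip()
    return None


if __name__ == "__main__":
    sample = (
        "External network internalnet # will initially be managementnet\n"
        "Management network internalnet\n"
    )
    for name in ("External network", "Management network"):
        print(name, "->", partition_setting(sample, name))
```

Both parameters should report internalnet once the partition change has been committed.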
fsexports#
To ensure that various file paths are available on other networks, the following needs to be done:
Verify fsexports.
For initial single headnode deployments, CMDaemon (cmd) on the head node will define fsexports automatically for any networks that are assigned as management/boot networks.
For HA setups, while /cm/shared and /home come from the external NFS server, cmd will still automatically manage the /cm/node-installer fsexports for each head node.
If any are missing, add the fsexports (this should not normally be necessary).
The fsexports must be present so that /home, /cm/shared, and /cm/node-installer are accessible on the network. If they are missing, the following example shows how to add them.
Example: Adding dgxnet to fsexports
# If doing this manually, ensure the dgxnet in this example is replaced
# with its intended network name.
cmsh;device use master;fsexports;
add /cm/node-installer-ubuntu2404-aarch64 dgxnet
..
add /cm/node-installer-ubuntu2404-aarch64/certificates dgxnet
set write yes
..
add /var/spool/burn dgxnet
set write yes
..
add /home dgxnet
set write yes
set disabled no
..
add /cm/shared-ubuntu2404-aarch64 dgxnet
set write yes
set disabled no
..
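Because the same five exports are repeated for every network, the command sequence above can be generated rather than typed. The sketch below is illustrative only: the helper name `fsexport_commands` and the `suffix` parameter are hypothetical, the certificates path follows the layout shown in the fsexports listings, and the final commit is an assumption based on the completed example.

```python
def fsexport_commands(network, suffix="-ubuntu2404-aarch64"):
    """Return the cmsh commands that add the standard exports for a network."""
    installer = f"/cm/node-installer{suffix}"
    # (path, writable, explicitly enabled), mirroring the manual example.
    exports = [
        (installer, False, False),
        (f"{installer}/certificates", True, False),
        ("/var/spool/burn", True, False),
        ("/home", True, True),
        (f"/cm/shared{suffix}", True, True),
    ]
    lines = ["device use master", "fsexports"]
    for path, write, enable in exports:
        lines.append(f"add {path} {network}")
        if write:
            lines.append("set write yes")
        if enable:
            lines.append("set disabled no")
        lines.append("..")  # leave the new export, back to fsexports
    lines.append("commit")
    return lines


if __name__ == "__main__":
    # Replace dgxnet with the intended network name.
    print("\n".join(fsexport_commands("dgxnet")))
```

The printed lines are meant to be reviewed and then entered in a cmsh session; nothing is applied by the script itself.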
Example: Completed fsexports for dgxnet
[head-01->device*[head-01*]->fsexports*]% ls
Name (key)                                  Path                            Network     Hosts Write Disabled
------------------------------------------- ------------------------------- ----------- ----- ----- --------
/cm/node-installer@internalnet              /cm/node-installer              internalnet       no    no
/cm/node-installer/certificates@internalnet /cm/node-installer/certificates internalnet       yes   no
/var/spool/burn@internalnet                 /var/spool/burn                 internalnet       yes   no
/home@internalnet                           /home                           internalnet       yes   no
/cm/shared@internalnet                      /cm/shared                      internalnet       yes   no
/cm/node-installer@dgxnet                   /cm/node-installer              dgxnet            no    no
/cm/node-installer/certificates@dgxnet      /cm/node-installer/certificates dgxnet            yes   no
/var/spool/burn@dgxnet                      /var/spool/burn                 dgxnet            yes   no
/home@dgxnet                                /home                           dgxnet            yes   no
/cm/shared@dgxnet                           /cm/shared                      dgxnet            yes   no
[head-01->device*[head-01*]->fsexports*]% commit
For comparison, the following listing reflects the fsexports that are set up automatically:
Name (key) Path Network Hosts Write Disabled
---------------------------------------------------- ---------------------------------------- --------------- ----- --------
/var/spool/burn@internalnet /var/spool/burn internalnet yes no
/var/spool/burn@dgxnet1 /var/spool/burn dgxnet1 yes no
/var/spool/burn@dgxnet2 /var/spool/burn dgxnet2 yes no
/var/spool/burn@ipminet0 /var/spool/burn ipminet0 yes no
/var/spool/burn@ipminet1 /var/spool/burn ipminet1 yes no
/var/spool/burn@ipminet2 /var/spool/burn ipminet2 yes no
/var/spool/burn@ipminet3 /var/spool/burn ipminet3 yes no
/cm/node-installer-ubuntu2404-x86_64@dgxnet2 /cm/node-installer-ubuntu2404-x86_64 dgxnet2 no no
/cm/node-installer-ubuntu2404-x86_64/certificat+ /cm/node-installer-ubuntu2404-x86_64/certificat+ dgxnet2 yes no
/cm/node-installer-ubuntu2404-x86_64@internalnet /cm/node-installer-ubuntu2404-x86_64 internalnet no no
/cm/node-installer-ubuntu2404-x86_64/certificat+ /cm/node-installer-ubuntu2404-x86_64/certificat+ internalnet yes no
/cm/node-installer-ubuntu2404-x86_64@dgxnet1 /cm/node-installer-ubuntu2404-x86_64 dgxnet1 no no
/cm/node-installer-ubuntu2404-x86_64/certificat+ /cm/node-installer-ubuntu2404-x86_64/certificat+ dgxnet1 yes no
/cm/node-installer-ubuntu2404-aarch64@dgxnet2 /cm/node-installer-ubuntu2404-aarch64 dgxnet2 no no
/cm/node-installer-ubuntu2404-aarch64/certifica+ /cm/node-installer-ubuntu2404-aarch64/certifica+ dgxnet2 yes no
/cm/node-installer-ubuntu2404-aarch64@internaln+ /cm/node-installer-ubuntu2404-aarch64 internalnet no no
/cm/node-installer-ubuntu2404-aarch64/certifica+ /cm/node-installer-ubuntu2404-aarch64/certifica+ internalnet yes no
/cm/node-installer-ubuntu2404-aarch64@dgxnet1 /cm/node-installer-ubuntu2404-aarch64 dgxnet1 no no
/cm/node-installer-ubuntu2404-aarch64/certifica+ /cm/node-installer-ubuntu2404-aarch64/certifica+ dgxnet1 yes no
/home@ipminet1 /home ipminet1 yes yes
/cm/shared-ubuntu2404-aarch64@ipminet1 /cm/shared-ubuntu2404-aarch64 ipminet1 yes yes
/cm/shared-ubuntu2404-x86_64@ipminet1 /cm/shared-ubuntu2404-x86_64 ipminet1 yes yes
/home@ipminet3 /home ipminet3 yes yes
/cm/shared-ubuntu2404-aarch64@ipminet3 /cm/shared-ubuntu2404-aarch64 ipminet3 yes yes
/cm/shared-ubuntu2404-x86_64@ipminet3 /cm/shared-ubuntu2404-x86_64 ipminet3 yes yes
/home@ipminet0 /home ipminet0 yes yes
/cm/shared-ubuntu2404-aarch64@ipminet0 /cm/shared-ubuntu2404-aarch64 ipminet0 yes yes
/cm/shared-ubuntu2404-x86_64@ipminet0 /cm/shared-ubuntu2404-x86_64 ipminet0 yes yes
/home@storagenet /home storagenet yes yes
/cm/shared-ubuntu2404-aarch64@storagenet /cm/shared-ubuntu2404-aarch64 storagenet yes yes
/cm/shared-ubuntu2404-x86_64@storagenet /cm/shared-ubuntu2404-x86_64 storagenet yes yes
/home@ipminet2 /home ipminet2 yes yes
/cm/shared-ubuntu2404-aarch64@ipminet2 /cm/shared-ubuntu2404-aarch64 ipminet2 yes yes
/cm/shared-ubuntu2404-x86_64@ipminet2 /cm/shared-ubuntu2404-x86_64 ipminet2 yes yes
/home@dgxnet2 /home dgxnet2 yes yes
/cm/shared-ubuntu2404-aarch64@dgxnet2 /cm/shared-ubuntu2404-aarch64 dgxnet2 yes yes
/cm/shared-ubuntu2404-x86_64@dgxnet2 /cm/shared-ubuntu2404-x86_64 dgxnet2 yes yes
/home@internalnet /home internalnet yes yes
/cm/shared-ubuntu2404-aarch64@internalnet /cm/shared-ubuntu2404-aarch64 internalnet yes yes
/cm/shared-ubuntu2404-x86_64@internalnet /cm/shared-ubuntu2404-x86_64 internalnet yes yes
/home@computenet /home computenet yes yes
/cm/shared-ubuntu2404-aarch64@computenet /cm/shared-ubuntu2404-aarch64 computenet yes yes
/cm/shared-ubuntu2404-x86_64@computenet /cm/shared-ubuntu2404-x86_64 computenet yes yes
/home@dgxnet1 /home dgxnet1 yes yes
/cm/shared-ubuntu2404-aarch64@dgxnet1 /cm/shared-ubuntu2404-aarch64 dgxnet1 yes yes
/cm/shared-ubuntu2404-x86_64@dgxnet1 /cm/shared-ubuntu2404-x86_64 dgxnet1 yes yes
/home@loopback /home loopback yes yes
/cm/shared-ubuntu2404-aarch64@loopback /cm/shared-ubuntu2404-aarch64 loopback yes yes
/cm/shared-ubuntu2404-x86_64@loopback /cm/shared-ubuntu2404-x86_64 loopback yes yes
/home@failovernet /home failovernet yes yes
/cm/shared-ubuntu2404-aarch64@failovernet /cm/shared-ubuntu2404-aarch64 failovernet yes yes
/cm/shared-ubuntu2404-x86_64@failovernet /cm/shared-ubuntu2404-x86_64 failovernet yes yes
/cm/node-installer-ubuntu2404-x86_64@internalne+ /cm/node-installer-ubuntu2404-x86_64 internalnet2 no no
/cm/node-installer-ubuntu2404-x86_64/certificat+ /cm/node-installer-ubuntu2404-x86_64/certificat+ internalnet2 yes no
/cm/node-installer-ubuntu2404-aarch64@internaln+ /cm/node-installer-ubuntu2404-aarch64 internalnet2 no no
/cm/node-installer-ubuntu2404-aarch64/certifica+ /cm/node-installer-ubuntu2404-aarch64/certifica+ internalnet2 yes no
/var/spool/burn@internalnet2 /var/spool/burn internalnet2 yes no
/home@internalnet2 /home internalnet2 yes yes
/cm/shared-ubuntu2404-aarch64@internalnet2 /cm/shared-ubuntu2404-aarch64 internalnet2 yes yes
/cm/shared-ubuntu2404-x86_64@internalnet2 /cm/shared-ubuntu2404-x86_64 internalnet2 yes yes
/home@kube-default-pod /home kube-default-pod yes yes
/cm/shared-ubuntu2404-aarch64@kube-default-pod /cm/shared-ubuntu2404-aarch64 kube-default-pod yes yes
/cm/shared-ubuntu2404-x86_64@kube-default-pod /cm/shared-ubuntu2404-x86_64 kube-default-pod yes yes
/home@kube-default-service /home kube-default-service yes yes
/cm/shared-ubuntu2404-aarch64@kube-default-serv+ /cm/shared-ubuntu2404-aarch64 kube-default-service yes yes
/cm/shared-ubuntu2404-x86_64@kube-default-servi+ /cm/shared-ubuntu2404-x86_64 kube-default-service yes yes
/cm/node-installer-ubuntu2404-x86_64@ipminet3 /cm/node-installer-ubuntu2404-x86_64 ipminet3 no no
/cm/node-installer-ubuntu2404-x86_64/certificat+ /cm/node-installer-ubuntu2404-x86_64/certificat+ ipminet3 yes no
/cm/node-installer-ubuntu2404-aarch64@ipminet3 /cm/node-installer-ubuntu2404-aarch64 ipminet3 no no
/cm/node-installer-ubuntu2404-aarch64/certifica+ /cm/node-installer-ubuntu2404-aarch64/certifica+ ipminet3 yes no
/cm/node-installer-ubuntu2404-x86_64@ipminet2 /cm/node-installer-ubuntu2404-x86_64 ipminet2 no no
/cm/node-installer-ubuntu2404-x86_64/certificat+ /cm/node-installer-ubuntu2404-x86_64/certificat+ ipminet2 yes no
/cm/node-installer-ubuntu2404-aarch64@ipminet2 /cm/node-installer-ubuntu2404-aarch64 ipminet2 no no
/cm/node-installer-ubuntu2404-aarch64/certifica+ /cm/node-installer-ubuntu2404-aarch64/certifica+ ipminet2 yes no
/cm/node-installer-ubuntu2404-x86_64@ipminet1 /cm/node-installer-ubuntu2404-x86_64 ipminet1 no no
/cm/node-installer-ubuntu2404-x86_64/certificat+ /cm/node-installer-ubuntu2404-x86_64/certificat+ ipminet1 yes no
/cm/node-installer-ubuntu2404-aarch64@ipminet1 /cm/node-installer-ubuntu2404-aarch64 ipminet1 no no
/cm/node-installer-ubuntu2404-aarch64/certifica+ /cm/node-installer-ubuntu2404-aarch64/certifica+ ipminet1 yes no
/cm/node-installer-ubuntu2404-x86_64@ipminet0 /cm/node-installer-ubuntu2404-x86_64 ipminet0 no no
/cm/node-installer-ubuntu2404-x86_64/certificat+ /cm/node-installer-ubuntu2404-x86_64/certificat+ ipminet0 yes no
/cm/node-installer-ubuntu2404-aarch64@ipminet0 /cm/node-installer-ubuntu2404-aarch64 ipminet0 no no
/cm/node-installer-ubuntu2404-aarch64/certifica+ /cm/node-installer-ubuntu2404-aarch64/certifica+ ipminet0 yes no
Note
The completed fsexports example for dgxnet does not show the -ubuntu2404-aarch64 suffix at the end of the /cm/shared and /cm/node-installer directory names.
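In the longer listing, many of the /home and /cm/shared exports show Disabled as yes, which is easy to miss by eye. The following sketch parses captured `ls` output and reports the disabled exports for one network. It assumes the Hosts column is empty, as it is in the listings here, so each data row has exactly five whitespace-separated fields; the function names are hypothetical.

```python
def parse_fsexports(listing):
    """Parse fsexports `ls` output into a list of dicts."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows: name, path, network, write, disabled (Hosts is empty).
        if len(fields) != 5:
            continue
        name, path, network, write, disabled = fields
        if write not in ("yes", "no") or disabled not in ("yes", "no"):
            continue  # header or separator line
        rows.append({
            "name": name,
            "path": path,
            "network": network,
            "write": write == "yes",
            "disabled": disabled == "yes",
        })
    return rows


def disabled_exports(rows, network):
    """Return the paths that are disabled on the given network."""
    return [r["path"] for r in rows
            if r["network"] == network and r["disabled"]]


if __name__ == "__main__":
    sample = (
        "/var/spool/burn@dgxnet1 /var/spool/burn dgxnet1 yes no\n"
        "/home@dgxnet1 /home dgxnet1 yes yes\n"
    )
    print(disabled_exports(parse_fsexports(sample), "dgxnet1"))
```

Names truncated with a trailing + in the listing are kept as they appear, so comparisons are best made on the path and network fields rather than on the key.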
The following steps will help to ensure successful provisioning of the control plane nodes and the GB200 nodes.
Enable Dependable PXE Booting#
Use the root (not cmsh) shell.
In /cm/local/apps/cmd/etc/cmd.conf, add the following AdvancedConfig parameter.
AdvancedConfig = { "DeviceResolveAnyMAC=1" } # modified value
Restart the CMDaemon to enable dependable PXE booting from bonded interfaces.
# systemctl restart cmd
Restarting the CMDaemon disconnects any open cmsh session. Type connect to reconnect after the CMDaemon has restarted, or enter exit and then start cmsh again.
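Before restarting the CMDaemon, it can be useful to confirm that the parameter is present and not commented out. This is a minimal sketch, assuming the AdvancedConfig directive sits on a single line as in the example above; the function name `has_advanced_config` is hypothetical.

```python
import re


def has_advanced_config(conf_text, flag="DeviceResolveAnyMAC=1"):
    """Return True if an active AdvancedConfig line contains the flag."""
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # the whole line is a comment
        match = re.match(r"AdvancedConfig\s*=\s*\{(.*)\}", stripped)
        if match and f'"{flag}"' in match.group(1):
            return True
    return False


if __name__ == "__main__":
    path = "/cm/local/apps/cmd/etc/cmd.conf"
    try:
        with open(path) as handle:
            print(has_advanced_config(handle.read()))
    except OSError as error:
        print(f"could not read {path}: {error}")
```

If AdvancedConfig already lists other values, the new entry belongs inside the same braces; the check above only looks for the quoted flag and ignores the rest.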
Disable Node BMC Setup in the Node-installer#
The global node-installer.conf file does not override the per-architecture node-installer.conf files, so the node-installer.conf for each architecture must be modified. (In multi-arch or multi-distro setups, /cm/node-installer should itself be a symlink to /cm/node-installer-<headnodedistro>-<headnodearch>.)
Make these changes to the node-installer.conf file for each architecture:
vi /cm/node-installer/scripts/node-installer.conf
vi /cm/node-installer-ubuntu2404-aarch64/scripts/node-installer.conf
vi /cm/node-installer-ubuntu2404-x86_64/scripts/node-installer.conf
Example: node-installer.conf settings
# Set this to false if, for some reason, the installer fails to setup
# the BMC hardware correctly. In that case do it manually, or use
# a custom finalize script.
setupBmc = false
# Set this to false if the Node Installer should just skip BMC network
# devices if they are configured but not detected. By default it will
# halt when this happens.
failOnMissingBmc = false
# Some BMC hardware have user ID's for which the user name can not be modified.
# If the user ID is set to such an ID the Node Installer would halt because it
# can not change the user name. When this setting is set to false (the default)
# the Node Installer will try to find an alternative user ID. When this setting
# is set to true, the Node Installer will only attempt to set the configured
# user ID and leave any other ID's alone.
strictBmcUserId = false
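Since the same three settings must be edited in several files, a quick check helps catch a file that was skipped. The sketch below assumes the simple `key = value` layout shown in the example, with `#` starting a comment; the names `EXPECTED` and `bmc_setting_mismatches` are hypothetical.

```python
EXPECTED = {
    "setupBmc": "false",
    "failOnMissingBmc": "false",
    "strictBmcUserId": "false",
}


def bmc_setting_mismatches(conf_text, expected=EXPECTED):
    """Return {key: found_value} for settings that differ from `expected`.

    A missing or commented-out setting is reported with the value None.
    """
    found = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        found[key] = value
    return {key: found.get(key)
            for key, want in expected.items()
            if found.get(key) != want}


if __name__ == "__main__":
    paths = [
        "/cm/node-installer-ubuntu2404-aarch64/scripts/node-installer.conf",
        "/cm/node-installer-ubuntu2404-x86_64/scripts/node-installer.conf",
    ]
    for path in paths:
        try:
            with open(path) as handle:
                print(path, bmc_setting_mismatches(handle.read()))
        except OSError as error:
            print(f"could not read {path}: {error}")
```

An empty dictionary for a file means all three settings match the example.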