Adding New Machines to an Existing Site
This guide covers the basic checks needed to get a machine into a state where it can be discovered by NICo auto-ingestion.
The following configuration items can cause issues and should be checked:
- Host BMC Password Requirements
- Updating the Host BMC and UEFI Firmware (Not covered in this document at this time)
- DPU BMC Password Requirements
- Updating DPU BMC Firmware
- DPU ARM OS: Checking Secure Boot Status
Host BMC Password Requirements
Note: New servers should use the default username for the server type, e.g. USERID for Lenovo, admin for NVIDIA/Vikings, and root for Dell.
You should check both the expected machines DB and the site vault pod data store for any existing data. If entries exist in both expected machines and vault, you should consider the password stored in vault as the password that should be used.
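The precedence rule above can be sketched as a small helper. This is only an illustration: the function name and its arguments are hypothetical, and it assumes you have already looked up the password (if any) in each store by hand.

```python
from typing import Optional


def choose_bmc_password(expected_machines_pw: Optional[str],
                        vault_pw: Optional[str]) -> Optional[str]:
    """Pick the password the Host BMC should end up with.

    Hypothetical helper: pass None for a store that has no entry.
    When both stores have an entry, the vault value wins, as described
    in the text. Returns None when neither store has an entry.
    """
    if vault_pw is not None:
        return vault_pw
    return expected_machines_pw
```

A return value of None means neither store knows the machine yet, so there is no stored password to reconcile.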
Check Host BMC exists in Expected Machines DB
If there is existing data in expected machines for the machine, you can either update the password in expected machines or change the password on the Host BMC to match.
- Use carbide-admin-cli to check if there is an existing entry for the Host BMC:
- If an entry exists for the machine, display the details using carbide-admin-cli:
- To update existing expected machines data:
  Note: If you only need to update the BMC password, you only need to supply the BMC MAC address and BMC password.
- To add a new machine to the expected machines DB:
Checking site vault data
To check whether the Host BMC currently has any passwords stored in vault on a site:
- Connect to the Kubernetes environment for the site you are working on
- Retrieve the decoded vault secret for the site:
- Connect to the vault pod for the site and paste in the decoded vault secret at the Token prompt:
- List the secrets in vault:
- Look for the site BMC:
- Get the current credentials set for the Host BMC if they exist:
Ensure these credentials match the credentials currently set on the host BMC. It is easier to just update the Host BMC to match vault rather than attempting to update the secret in vault.
DPU BMC Password Requirements
For a new or undiscovered DPU BMC, ensure that it is set to the default BMC username and password.
Resetting DPU BMC password to default - From DPU BMC
To reset to factory defaults from the DPU BMC:
- Log into the DPU BMC.
- Run the following command to reset to factory defaults:
- Reboot the DPU BMC:
Resetting DPU BMC password to default - From DPU ARM OS
If you don’t know the BMC password, but have access to the DPU ARM OS, you can reset to defaults as follows:
- Log into the DPU ARM OS
- Switch to root:
- Restore DPU BMC defaults:
- Restart DPU BMC:
Updating DPU firmware
Determine the DPU model
Log on to the DPU ARM OS and attempt to run the following command:
For Bluefield 2 DPUs you should expect output similar to the following:
For Bluefield 3 DPUs you should expect output similar to the following:
Checking Bluefield Firmware Versions
To check the current Bluefield firmware versions installed on a DPU:
- Log into the staging server for the site
- Set up IP, password and token environment variables:
- Check the current DPU BMC Firmware Versions:
  Bluefield 2 DPUs:
  Bluefield 3 DPUs:
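Once you have the installed version, you need to judge whether it is older than the version the NICo dev team asks for. The sketch below is one way to do that comparison. It assumes version strings contain a major.minor-build pattern such as the 24.01-5 used in this guide; the function names are hypothetical and the real version strings reported by your DPU BMC may be laid out differently.

```python
import re
from typing import Tuple


def parse_fw_version(text: str) -> Tuple[int, int, int]:
    """Extract (major, minor, build) from text containing e.g. '24.01-5'.

    Assumption: the version appears as digits, a dot, digits, a hyphen,
    digits. Raises ValueError if no such pattern is found.
    """
    match = re.search(r"(\d+)\.(\d+)-(\d+)", text)
    if match is None:
        raise ValueError(f"no major.minor-build version found in {text!r}")
    major, minor, build = match.groups()
    return int(major), int(minor), int(build)


def needs_update(installed: str, minimum: str) -> bool:
    """Return True when the installed version is older than the minimum."""
    return parse_fw_version(installed) < parse_fw_version(minimum)
```

Comparing the parsed tuples rather than the raw strings avoids mistakes such as treating 24.2 as newer than 24.10.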
Updating the Bluefield Firmware Versions
Note: If discovery is failing because the firmware revision is too low, confirm with the NICo dev team which version you should update to before proceeding.
DPU Firmware versions can be downloaded from the following locations:
The examples below install FW version 24.01-5. Confirm the correct version for your specific install with the development team before proceeding.
- Download the relevant packages for your DPU type:
  Bluefield 2:
  Bluefield 3:
- Copy the firmware package to the staging server for the site
- Set up IP, password and token environment variables:
- Initiate the DPU BMC FW Upgrade:
  Bluefield 2:
  Bluefield 3:
- Monitor the firmware update progress:
- Once the progress has reached 100% complete, initiate a reboot of the BMC:
- Once the DPU BMC has rebooted, retrieve a new BMC token and check the installed firmware version:
  Bluefield 2:
  Bluefield 3:
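The monitoring step in the procedure above amounts to polling until progress reaches 100% before rebooting the BMC. A minimal sketch of that wait loop follows. The get_percent callable is a placeholder for however you read the progress figure on your site (it is not part of any NICo or BMC tooling), and the poll interval and timeout defaults are arbitrary assumptions.

```python
import time
from typing import Callable


def wait_for_update(get_percent: Callable[[], int],
                    poll_seconds: float = 10.0,
                    timeout_seconds: float = 1800.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_percent() until it reports 100 or the timeout elapses.

    get_percent is a hypothetical callable returning the current update
    progress as an integer percentage. Returns True when the update
    completed and False when the timeout was reached first.
    """
    waited = 0.0
    while True:
        if get_percent() >= 100:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

Only reboot the BMC when the function returns True. A False result means the update did not finish in the allotted time and should be investigated first.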
DPU ARM OS: Checking Secure Boot Status
To successfully boot from the NICo BFB image, the DPU ARM OS must have Secure Boot disabled and be configured for HTTP PXE boot.
Check current secure boot settings
- Log in to the staging server for the site
- Set up the DPU IP and password environment variables:
- Check the current Secure Boot settings:
Note: If you do not see the SecureBootCurrentBoot option listed, you should install DOCA version 2.5.0.
If you see the following output, Secure Boot is enabled and needs to be disabled:
If you see "SecureBootCurrentBoot": "Disabled", no action is required. You should attempt to boot the DPU ARM OS over the network:
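As a recap of the three outcomes described in this section, the sketch below classifies a Secure Boot settings response. It assumes the response is a JSON object in which SecureBootCurrentBoot, when present, is a top-level key; the function name and the returned labels are hypothetical.

```python
import json


def secure_boot_action(response_text: str) -> str:
    """Map a Secure Boot settings response to the next step.

    Assumes response_text is a JSON object. Returns one of three
    illustrative labels:
      "install-doca": SecureBootCurrentBoot is not listed
      "none":         Secure Boot is already disabled
      "disable":      Secure Boot is enabled and must be disabled
    """
    data = json.loads(response_text)
    state = data.get("SecureBootCurrentBoot")
    if state is None:
        return "install-doca"
    if state == "Disabled":
        return "none"
    return "disable"
```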
Disable Secure Boot
To disable Secure Boot if it is enabled:
- Run the command to disable Secure Boot:
- Restart the DPU ARM OS:
- Wait for the DPU ARM OS to boot and check whether Secure Boot is still enabled:
Note: You may need to repeat these steps several times to disable Secure Boot. It may take up to 3 cycles for the setting to stick.
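The disable, restart and re-check cycle can be sketched as a bounded retry loop. The three callables are placeholders for the manual steps above (they are not real NICo or BMC functions), and the default of 3 cycles simply mirrors the note; it is not a documented hard limit.

```python
from typing import Callable


def disable_secure_boot(disable: Callable[[], None],
                        restart: Callable[[], None],
                        current_state: Callable[[], str],
                        max_cycles: int = 3) -> bool:
    """Repeat disable + restart until Secure Boot reports "Disabled".

    All three callables are hypothetical stand-ins for the manual steps:
    disable() issues the disable command, restart() restarts the DPU ARM
    OS and waits for it to boot, and current_state() returns the
    SecureBootCurrentBoot value. Returns True once the state is
    "Disabled", or False if it is still enabled after max_cycles.
    """
    for _ in range(max_cycles):
        if current_state() == "Disabled":
            return True
        disable()
        restart()
    # Final check after the last cycle.
    return current_state() == "Disabled"
```

If the function returns False, stop and investigate rather than continuing to cycle the DPU.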
If the “SecureBootCurrentBoot” setting is not shown, attempt to install DOCA 2.5.0:
- Download the BFB image on the staging server:
- Install the BFB image to the DPU ARM OS via the DPU BMC from the server with the BFB image:
- Log on to the DPU BMC and reboot the DPU ARM OS:
- After the DPU ARM OS boots, log into the DPU ARM OS using the default password
- Switch to root and set the default user's password back to the default
- Ensure that the DOCA firmware is up to date:
- Check that the DPU ARM OS is configured for HTTP boot. Log into the DPU ARM OS and switch to root:
- List the current boot order:
- If the boot order is set to something similar to the following, no action is needed and you should reboot the DPU ARM OS:
- To set the correct boot order, create the /etc/bf.cfg file with the following contents:
- Run the bfcfg command to update the boot order:
- Verify that the boot order is now set to NET-OOB-IPV4-HTTP as default:
- Reboot the DPU ARM OS from the RSHIM console and monitor the reboot/provisioning process
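The boot order verification in the steps above comes down to confirming that NET-OOB-IPV4-HTTP is the first entry. A minimal sketch of that check is shown here. It assumes you have already extracted the boot entry names, in order, from the boot order listing; the function name and the input shape are hypothetical.

```python
from typing import Sequence

# Boot entry that should be the default, as named in this guide.
DESIRED_FIRST = "NET-OOB-IPV4-HTTP"


def boot_order_ok(entries: Sequence[str]) -> bool:
    """Return True when the first boot entry is NET-OOB-IPV4-HTTP.

    entries is an ordered sequence of boot entry names (hypothetical
    input shape). An empty sequence is treated as not OK.
    """
    return len(entries) > 0 and entries[0].strip() == DESIRED_FIRST
```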
Note: If you see an error similar to the following during PXE boot, verify that Secure Boot is disabled correctly: