Bare-metal Reprovisioning

NVIDIA BlueField BMC Software v24.04
Note

Relevant for NVIDIA® BlueField®-3 and later in DPU mode only (not supported in NIC mode).

The bare-metal network re-provisioning flow provides a way to restore a BlueField-3 system without relying on external measures. It returns the system to its initial state so that the operational image can be reloaded.

To facilitate this approach, the BMC is responsible for maintaining and managing a golden image for the UEFI and the NIC. This allows the UEFI to retrieve the operational image from the network via protocols such as HTTP or PXE.

The following block diagram gives a high-level view of the system components and the data flow:

[Figure: network re-provisioning block diagram]

The entire flow of the network re-provisioning includes the following primary stages:

  1. Initial provisioning of the golden images to the BMC.

    Info

    This process usually takes place during system manufacturing.

  2. In-field update of the golden images.

  3. Configuration of the OOB network settings.

  4. Recovering the system by reinstalling the golden images.

SKU | Description
LNV0000000066 | ThinkSystem NVIDIA BlueField-3 Ethernet/InfiniBand QSFP112 2P 200G PCIe Gen5 x16
MT_0000000884 | NVIDIA BlueField-3 B3220 P-Series FHHL DPU; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
MT_0000000884-04 | NVIDIA BlueField-3 B3220 P-Series FHHL DPU; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled; Alternate for IL1
MT_0000000965 | NVIDIA BlueField-3 B3220 P-Series FHHL DPU; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Disabled
MT_0000001075 | NVIDIA BlueField-3 B3220SH E-Series FHHL Storage Controller; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 48GB on-board DDR; integrated BMC; Crypto Enabled; Secure Boot
MT_0000001083 | NVIDIA BlueField-3 B3220SH E-Series No Heatsink FHHL Storage Controller; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 48GB on-board DDR; integrated BMC; Crypto Enabled
MT_0000001101 | NVIDIA BlueField-3 B3220SH E-Series No Heatsink FHHL Storage Controller; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 48GB on-board DDR; integrated BMC; Crypto Disabled
MT_0000001102 | NVIDIA BlueField-3 B3220SH E-Series FHHL Storage Controller; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 48GB on-board DDR; integrated BMC; Crypto Disabled
ORC0000000012 | NVIDIA BlueField-3 B3220 P-Series FHHL DPU; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled

To begin the initial provisioning of the golden images, the BMC must be connected to the OOB network. The user copies the images from local storage to the BMC with a standard scp command over the network. Once the images are on the BMC, the user must log into the BMC and start the provisioning process, which transfers the golden images into the BMC's non-volatile storage using a dedicated utility provided on the BMC. The BMC must remain powered on and uninterrupted during this stage to avoid potential problems.

The current flow supports the provisioning of the golden images golden_image_arm and golden_image_nic.

To copy the golden images from the local environment into the BMC, run:

  • For golden_image_nic:

    #host> scp <nic-golden-image> root@<bmc-ip>:/tmp

  • For golden_image_arm:

    #host> scp <arm-golden-image> root@<bmc-ip>:/tmp

After copying the golden images to the BMC's /tmp directory, the user must log into the BMC and execute the following commands to provision the golden images into the BMC's non-volatile storage:

  • For golden_image_nic:

    #bmc> dpu_golden_image golden_image_nic -w /tmp/<nic-golden-image>

  • For golden_image_arm:

    #bmc> dpu_golden_image golden_image_arm -w /tmp/<arm-golden-image>

Once the golden images have been provisioned to the BMC's non-volatile storage, the user must execute the following commands to verify the accuracy and correctness of the images:

  • For golden_image_nic:

    #bmc> dpu_golden_image -v golden_image_nic
    #bmc> echo $? # Expected Output: 0

  • For golden_image_arm:

    #bmc> dpu_golden_image -v golden_image_arm
    #bmc> echo $? # Expected Output: 0

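To identify the version of the provisioned golden images, read each image back from the BMC's non-volatile storage and compute its SHA-256 checksum: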

  • For golden_image_nic:

    #bmc> dpu_golden_image golden_image_nic -r /tmp/nic_image
    #bmc> sha256sum /tmp/nic_image

  • For golden_image_arm:

    #bmc> dpu_golden_image golden_image_arm -r /tmp/arm_image
    #bmc> sha256sum /tmp/arm_image
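To compare the checksum printed on the BMC with the image that was originally copied, the same digest can be computed on the host. The following is a minimal sketch in Python; the function name sha256_of_file and the chunk size are illustrative choices and are not part of the BMC tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large images are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the value returned for the local image file matches the first field of the sha256sum output on the BMC, the stored image is byte-identical to the source file.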

  1. Confirm the identity of the host and BMC as described here.
  2. To initiate an update, run the following command from the host:

    • For NIC golden image:

      curl -k -u root:'<password>' -H "Content-Type: application/json" -X POST -d '{"TransferProtocol":"SCP", "ImageURI":"<remote server ip>/<nic golden image path>","Targets":["redfish/v1/UpdateService/FirmwareInventory/golden_image_nic"], "Username":"<username>"}' https://<bmc_ip>/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate

    • For Arm golden image:

      curl -k -u root:'<password>' -H "Content-Type: application/json" -X POST -d '{"TransferProtocol":"SCP", "ImageURI":"<remote server ip>/<arm golden image path>","Targets":["redfish/v1/UpdateService/FirmwareInventory/golden_image_arm"], "Username":"<username>"}' https://<bmc_ip>/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate

      Where:

      • ImageURI – the image URI format should be <remote_server_ip>/<golden image path>
      • username – username on the remote server
      • bmc_ip – BMC IP address
  3. After initiating the update, a new task is created for monitoring the progress:

    "@Message.ExtendedInfo": [
      {
        "@odata.type": "#Message.v1_1_1.Message",
        "Message": "The request completed successfully.",
        "MessageArgs": [],
        "MessageId": "Base.1.15.0.Success",
        "MessageSeverity": "OK",
        "Resolution": "None"
      }
    ],
    "@odata.id": "/redfish/v1/TaskService/Tasks/<Task id>",
    "@odata.type": "#Task.v1_4_3.Task",
    "Id": "<Task id>",
    "TaskState": "Running",
    "TaskStatus": "OK"

  4. To track the update progress:

    curl -k -u root:'<password>' -X GET https://<bmc_ip>/redfish/v1/TaskService/Tasks/<Task id>

    The update progress takes one of three values: 0, 10, or 100. After a successful update, the following output is expected:

    "PercentComplete": 100,
    "TaskState": "Completed",
    "TaskStatus": "OK"

    In case of a failure, it is recommended to reboot the BMC and retry the update.

    Info

    The golden image update may take 1 to 3 minutes.
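The request body from step 2 and the completion check from step 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names build_simple_update_payload and update_finished are hypothetical, and the address, path, and username in the usage note are placeholders. The field names and target URIs are the ones shown in the curl commands above.

```python
import json

# Target URIs exactly as they appear in the curl commands above.
TARGETS = {
    "nic": "redfish/v1/UpdateService/FirmwareInventory/golden_image_nic",
    "arm": "redfish/v1/UpdateService/FirmwareInventory/golden_image_arm",
}


def build_simple_update_payload(image, server_ip, image_path, username):
    """Build the JSON body for UpdateService.SimpleUpdate (hypothetical helper)."""
    if image not in TARGETS:
        raise ValueError("image must be 'nic' or 'arm'")
    return json.dumps({
        "TransferProtocol": "SCP",
        # ImageURI format: <remote_server_ip>/<golden image path>
        "ImageURI": f"{server_ip}/{image_path}",
        "Targets": [TARGETS[image]],
        "Username": username,
    })


def update_finished(task):
    """Return True once a task dict reports the success values shown in step 4."""
    return (
        task.get("PercentComplete") == 100
        and task.get("TaskState") == "Completed"
        and task.get("TaskStatus") == "OK"
    )
```

For example, build_simple_update_payload("nic", "192.0.2.10", "images/nic.bin", "user") returns a string that could be passed to the -d option of curl, and update_finished can be applied to the parsed JSON returned by the task query.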

To enhance system security, a mechanism has been introduced to control network connectivity over the OOB network. It provides an IPMI command that disables all communication between the BlueField BMC, BlueField, and the OOB management network, and a set of IPMI commands that selectively enable the network on each of these interfaces. This gives the platform's RoT complete control over which network interfaces are enabled and when.

Note

This IPMI command can only be sent by the platform's RoT. Requests from the OOB network and from BlueField are blocked.

By default, the OOB interface is enabled. However, for the host BMC to gain control over this interface, it must disable it during the initial boot. Once disabled, the interface remains in that state regardless of BMC reboots or system cold boots.

For more details, refer to "OOB Network 3-Port Switch Control".

The recovery is triggered by the following IPMI command, which can be run on the BMC through UART or over SSH via the OOB network:

#bmc> ipmitool raw 0x32 0x99 <golden_image_timeout> <timeout_from_network> <verbosity_level>

This command is designed to be executed exclusively from within the BMC since it has a potentially disruptive impact on the system. When the command is executed, it extracts the golden images from the BlueField BMC's non-volatile memory and initiates the recovery process. Once the golden images are pushed to the RShim, the RShim console output is redirected to the BMC console, enabling the user to easily monitor the progress.

Upon successful completion of this command, both the BlueField NIC and Arm execute the designated GA image fetched from a preconfigured server.

  • golden_image_timeout – timeout, in minutes, for updating the golden images. Enter 0 to use the default value (15).
  • timeout_from_network – timeout, in minutes, for booting the operational image from the network. Enter 0 to use the default value (60).
  • verbosity_level – defines the type of messages that appear during the re-provisioning process:

    • 0 – Quiet mode; only error messages appear on the screen
    • 1 – Info mode; only error messages and re-provisioning process messages appear on the screen
    • 2 – Full mode; all messages appear on the screen including BlueField RShim messages

      Info

      Reprovisioning messages have the following prefix: [<running date> GOLDEN-IMAGE-RECOVERY].
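As an illustration of how the three arguments combine, the sketch below assembles the command line in Python. The helper name recovery_command is hypothetical, and the range checks are assumptions: each ipmitool raw data argument is a single byte, so the timeouts are limited here to 0-255, and only verbosity levels 0 to 2 are described above.

```python
def recovery_command(golden_image_timeout, timeout_from_network, verbosity_level):
    """Build the argument list for the recovery IPMI command (hypothetical helper).

    A timeout of 0 selects the documented default (15 and 60 minutes respectively).
    """
    for name, value in (("golden_image_timeout", golden_image_timeout),
                        ("timeout_from_network", timeout_from_network)):
        if not 0 <= value <= 255:  # assumption: one byte per raw argument
            raise ValueError(f"{name} must be in 0..255")
    if verbosity_level not in (0, 1, 2):
        raise ValueError("verbosity_level must be 0, 1, or 2")
    # Data bytes are written in hex, matching the 0x32 0x99 notation of the command.
    data = [golden_image_timeout, timeout_from_network, verbosity_level]
    return ["ipmitool", "raw", "0x32", "0x99"] + [f"0x{b:02x}" for b in data]
```

Joining the result of recovery_command(0, 0, 2) with spaces gives a command that uses both default timeouts with full verbosity. The command is still meant to be run only from within the BMC.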

After BFB installation is complete, the BlueField BMC waits for a specific sequence of messages over the RShim log:

NIC firmware update done
Installation finished
Linux up

  • NIC firmware update done – This message indicates that the firmware update for the NIC subsystem has been successfully completed

  • Installation finished – This message signals the completion of the installation process for the BFB from the network

  • Linux up – Upon receiving this message, the BlueField BMC acknowledges that the Arm OS has booted up and is ready

BlueField BMC expects these messages in the specified order.
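The ordering requirement can be illustrated with a small Python sketch that checks whether the three messages occur in the expected order within a captured RShim log. The function name messages_in_order and the substring-matching strategy are assumptions made for illustration; the BMC's actual matching logic is not described here.

```python
EXPECTED = ("NIC firmware update done", "Installation finished", "Linux up")


def messages_in_order(log_lines, expected=EXPECTED):
    """Return True if every expected message appears, in order, in log_lines."""
    remaining = iter(expected)
    wanted = next(remaining, None)
    for line in log_lines:
        # Advance to the next expected message only when the current one is seen.
        if wanted is not None and wanted in line:
            wanted = next(remaining, None)
    return wanted is None
```

A log in which "Linux up" appears before "Installation finished", and not again afterwards, would be rejected by this check, which mirrors the requirement that the messages arrive in the specified order.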

Users can add custom entries to the RShim log from the BlueField Arm OS using the bfrshlog command. The syntax of the command is: bfrshlog <output>.

For example, to add the message "Linux up" to the RShim log, run:

bfrshlog "Linux up"

All output from the BlueField Arm console is redirected to the BlueField BMC console for monitoring purposes.

The steps of the re-provisioning process are printed with the [<running date> GOLDEN-IMAGE-RECOVERY] prefix, in the following order:

[<running date> GOLDEN-IMAGE-RECOVERY] Checking pcie slot is in reset
[<running date> GOLDEN-IMAGE-RECOVERY] Read golden images from flash
[<running date> GOLDEN-IMAGE-RECOVERY] Set FNP to 0
[<running date> GOLDEN-IMAGE-RECOVERY] Checking rshim interface after SOC hard reset
[<running date> GOLDEN-IMAGE-RECOVERY] Starting ATF/UEFI golden image update
[<running date> GOLDEN-IMAGE-RECOVERY] Finished updating ATF/UEFI golden image
[<running date> GOLDEN-IMAGE-RECOVERY] Starting NIC FW golden image update
[<running date> GOLDEN-IMAGE-RECOVERY] Finished updating NIC FW golden image
[<running date> GOLDEN-IMAGE-RECOVERY] Stop Redfish server
[<running date> GOLDEN-IMAGE-RECOVERY] Configure Recovery image to boot from network
[<running date> GOLDEN-IMAGE-RECOVERY] set FNP to 1
[<running date> GOLDEN-IMAGE-RECOVERY] Booting BFB from network
[<running date> GOLDEN-IMAGE-RECOVERY] Start Redfish server
[<running date> GOLDEN-IMAGE-RECOVERY] Set boot option to default
[<running date> GOLDEN-IMAGE-RECOVERY] Finished programming image from network. Start BlueField hard reset

A failed update prints one of the following error messages, depending on the stage that failed:

[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: aborting process! PCIE is not in reset.
[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: Reading golden_image_nic failed
[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: Reading golden_image_arm failed
[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: rshim has not started successfully
[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: pushing ATF/UEFI golden image over rshim failed
[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: programming of ATF/UEFI golden image failed
[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: pushing NIC FW golden image over rshim failed
[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: programming of NIC FW golden image failed
[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: failed to configure image to boot from network
[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: programming of image from network failed: NIC firmware update failed
[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: programming of image from network failed: Installation failed
[<running date> GOLDEN-IMAGE-RECOVERY] ERROR: programming of image from network failed: Failed to get Linux up

Due to line buffering in the BlueField Arm console, buffered output lines receive the same timestamp value in <running date> when they are redirected to the BlueField BMC console.
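When the BMC console output is captured to a file, the GOLDEN-IMAGE-RECOVERY prefix makes it straightforward to filter the recovery messages. The Python sketch below is a hypothetical helper (recovery_errors is not an NVIDIA tool). It assumes each message is on its own line and treats the timestamp as opaque text, since its exact format is not specified here.

```python
import re

# Matches "[<running date> GOLDEN-IMAGE-RECOVERY] <message>"; the timestamp is kept opaque.
PREFIX = re.compile(r"\[(?P<stamp>.*?) GOLDEN-IMAGE-RECOVERY\]\s*(?P<message>.*)")


def recovery_errors(lines):
    """Return the text of every GOLDEN-IMAGE-RECOVERY message that starts with 'ERROR:'."""
    errors = []
    for line in lines:
        match = PREFIX.search(line)
        if match and match.group("message").startswith("ERROR:"):
            # Drop the "ERROR:" tag and surrounding whitespace, keep the description.
            errors.append(match.group("message")[len("ERROR:"):].strip())
    return errors
```

An empty result for a complete capture suggests that no recovery stage reported a failure. Because buffered lines share a timestamp, the sketch relies on line order rather than on the timestamp values.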

© Copyright 2024, NVIDIA. Last updated on May 22, 2024.