Deploying NVIDIA Converged Accelerator

Note

It is recommended to upgrade your BlueField product to the latest software and firmware versions available to benefit from new features and latest bug fixes.

This section assumes that you have installed the BlueField OS BFB on your NVIDIA® Converged Accelerator using any of the following guides:

NVIDIA® CUDA® (GPU driver) must be installed in order to use the GPU. For information on how to install CUDA on your Converged Accelerator, refer to NVIDIA CUDA Installation Guide for Linux.

After installing the BFB, you may now select the mode you want your NVIDIA Converged Accelerator to operate in.

  • Standard (default) – the NVIDIA® BlueField® DPU and the GPU operate separately (GPU is owned by the host)

  • BlueField-X – the GPU is exposed to the DPU and is no longer visible on the host (GPU is owned by the DPU)

Warning

It is is important to learn your DPU's device-id for performing some of the software installations or upgrades in this guide.

To determine the device ID of the DPUs on your setup, run:

Copy
Copied!
            

mst start mst status -v

Example output:

Copy
Copied!
            

MST modules: ------------ MST PCI module is not loaded MST PCI configuration module loaded PCI devices: ------------ DEVICE_TYPE MST PCI RDMA NET NUMA BlueField2(rev:1) /dev/mst/mt41686_pciconf0.1 3b:00.1 mlx5_1 net-ens1f1 0   BlueField2(rev:1) /dev/mst/mt41686_pciconf0 3b:00.0 mlx5_0 net-ens1f0 0   BlueField3(rev:1)       /dev/mst/mt41692_pciconf0.1   e2:00.1   mlx5_1          net-ens7f1np1             4   BlueField3(rev:1)       /dev/mst/mt41692_pciconf0     e2:00.0   mlx5_0          net-ens7f0np0             4

The device IDs for the BlueField-2 and BlueField-3 DPUs in this example are /dev/mst/mt41686_pciconf0 and /dev/mst/mt41692_pciconf0 respectively.

BlueField-X Mode

  1. Run the following command from the host:

    Copy
    Copied!
                

    mlxconfig -d /dev/mst/<device-name> s PCI_DOWNSTREAM_PORT_OWNER[4]=0xF

  2. Perform a graceful shutdown and power cycle the host for the configuration to take effect.

Standard Mode

To return the DPU from BlueField-X mode to Standard mode:

  1. Run the following command from the host:

    Copy
    Copied!
                

    mlxconfig -d /dev/mst/<device-name> s PCI_DOWNSTREAM_PORT_OWNER[4]=0x0

  2. Perform a graceful shutdown and power cycle the host for the configuration to take effect.

Use the following command from host or from DPU:

Copy
Copied!
            

$ sudo mlxconfig -d /dev/mst/<device-name> q PCI_DOWNSTREAM_PORT_OWNER[4]

Example of Standard mode output:

Copy
Copied!
            

Device #1: ----------   [...]   Configurations: Next Boot PCI_DOWNSTREAM_PORT_OWNER[4] DEVICE_DEFAULT(0)

Example of BlueField-X mode output:

Copy
Copied!
            

Device #1: ---------- [...]   Configurations: Next Boot PCI_DOWNSTREAM_PORT_OWNER[4]        EMBEDDED_CPU(15)

The following are example outputs for when the DPU is configured to BlueField-X mode.

The GPU is no longer visible from the host:

Copy
Copied!
            

root@host:~# lspci | grep -i nv None

The GPU is now visible from the DPU:

Copy
Copied!
            

ubuntu@dpu:~$ lspci | grep -i nv 06:00.0 3D controller: NVIDIA Corporation GA20B8 (rev a1)

Firmware upgrade of BMC and CEC components using BMC can be performed from a remote server using the Redfish interface. The following table presents commands available for performing the upgrade:

No.

Function

Command

Required for BMC/CEC Update

Description

1

Establish Redfish connection session

Copy
Copied!
            

export token=`curl -k -H "Content-Type: application/json" -X POST https://<bmc_ip>/login -d '{"username" : "root", "password" : "<password>"}' | grep token | awk '{print $2;}' | tr -d '"'`

Where:

  • bmc_ip – BMC IP address

  • password – Password of root account

BMC

CEC

Establish Redfish connection session

2

Trigger a secure firmware update

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -H "Content-Type: application/octet-stream" -X POST -T <package_path> https://<bmc_ip>/redfish/v1/UpdateService/update

Where:

  • bmc_ip – BMC IP address

  • token – session token received when establishing connection

  • package_path – firmware update package path

BMC

CEC

Triggers the secure update and starts tracking the secure update progress

3

Track secure firmware update progress

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/TaskService/Tasks

Find the current task ID in the response and use it for checking the progress:

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/TaskService/Tasks/<task_id> | jq -r ' .PercentComplete'

Where:

  • bmc_ip – BMC IP address

  • token – session token received when establishing connection

  • task_id – Task ID

BMC

CEC

Tracks the firmware update progress

4

Reset/reboot a BMC

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -H "Content-Type: application/json" -X POST -d '{"ResetType": "GracefulRestart"}' https://<bmc_ip>/redfish/v1/Managers/Bluefield_BMC/Actions/Manager.Reset

Where:

  • bmc_ip – BMC IP address

  • token – session token received when establishing connection

BMC

Resets/reboots the BMC

5

Fetch running BMC firmware version

For BlueField-3:

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware | jq -r ' .Version'

Where:

  • bmc_ip – BMC IP address

  • token – session token received when establishing connection

BMC

Fetches the running firmware version from BMC

For BlueField-2:

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/UpdateService/FirmwareInventory

Fetch the current firmware ID and then perform:

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/UpdateService/FirmwareInventory/<firmware_id>_BMC_Firmware | jq -r ' .Version'

Where:

  • bmc_ip – BMC IP address

  • token – session token received when establishing connection

6

Fetch running CEC firmware version

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/UpdateService/FirmwareInventory/Bluefield_FW_ERoT | jq -r ' .Version'

Where:

  • bmc_ip – BMC IP address

  • token – session token received when establishing connection

CEC

Fetches the running firmware version from CEC

BMC Update

Note

Firmware update takes about 12 minutes.

After initiating the BMC secure update with the command #2 to from the previous table, a response similar to the following is received:

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -H "Content-Type: application/octet-stream" -X POST -T <package_path> https://<bmc_ip>/redfish/v1/UpdateService   { "@odata.id": "/redfish/v1/TaskService/Tasks/0", "@odata.type": "#Task.v1_4_3.Task", "Id": "0", "TaskState": "Running" }

Command #3 from the previous table can be used to track secure firmware update progress. For instance:

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/TaskService/Tasks/0 | jq -r ' .PercentComplete'   % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 2123 100 2123 0 0 38600 0 --:--:-- --:--:-- --:--:-- 37910 20

Command #3 is used to verify the task has completed because during the update procedure the reboot option is disabled. When "PercentComplete" reaches 100, command #4 is used to reboot the BMC. For example:

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/TaskService/Tasks/0 | jq -r ' .PercentComplete' % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 3822 100 3822 0 0 81319 0 --:--:-- --:--:-- --:--:-- 81319 100   curl -k -H "X-Auth-Token: <token>" -H "Content-Type: application/octet-stream" -X POST -d '{"ResetType": "GracefulRestart"}' https://<bmc_ip>/redfish/v1/Managers/Bluefield_BMC/Actions/Manager.Reset { "@Message.ExtendedInfo": [ { "@odata.type": "#Message.v1_1_1.Message", "Message": "The request completed successfully.", "MessageArgs": [], "MessageId": "Base.1.13.0.Success", "MessageSeverity": "OK", "Resolution": "None" } ] }

Command #5 can be used to verify the current BMC firmware version after reboot:

  • For BlueField-3:

    Copy
    Copied!
                

    curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware | jq -r ' .Version'   % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 513 100 513 0 0 9679 0 --:--:-- --:--:-- --:--:-- 9679

  • For BlueField-2:

    1. Fetch the firmware ID from FirmwareInventory:

      Copy
      Copied!
                  

      curl -k -H "X-Auth-Token: <token>" -X GET https:/<bmc_ip>/redfish/v1/UpdateService/FirmwareInventory/ { "@odata.id": "/redfish/v1/UpdateService/FirmwareInventory", "@odata.type": "#SoftwareInventoryCollection.SoftwareInventoryCollection", "Members": [ { "@odata.id": "/redfish/v1/UpdateService/FirmwareInventory/8c8549f3_BMC_Firmware" …

    2. Use command #5 with the fetched firmware ID in the previous step:

      Copy
      Copied!
                  

      curl -k -H "X-Auth-Token: <token>" -X GET https:/<bmc_ip>/redfish/v1/UpdateService/FirmwareInventory/8c8549f3_BMC_Firmware | jq -r ' .Version'   % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 471 100 471 0 0 622 0 --:--:-- --:--:-- --:--:-- 621 bmc-23.04

CEC Update

Note

Firmware update takes about 20 seconds.

After initiating the BMC secure update with the command #2 to from the previous table, a response similar to the following is received:

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -H "Content-Type: application/octet-stream" -X POST -T <package_path> https://<bmc_ip>/redfish/v1/UpdateService { "@odata.id": "/redfish/v1/TaskService/Tasks/0", "@odata.type": "#Task.v1_4_3.Task", "Id": "0", "TaskState": "Running" }

Command #3 can be used to track the progress of the CEC firmware update. For example:

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/TaskService/Tasks/0 | jq -r ' .PercentComplete' % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 2123 100 2123 0 0 38600 0 --:--:-- --:--:-- --:--:-- 37910 100

After the CEC secure update operation is complete, perform a graceful shutdown and a power cycle or cold reset of the BlueField-3 DPU must be manually triggered to apply the changes once the update is finished.

Command #6 can be used to verify the current CEC firmware version after reboot:

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/UpdateService/FirmwareInventory/Bluefield_FW_ERoT | jq -r ' .Version'   % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 421 100 421 0 0 1172 0 --:--:-- --:--:-- --:--:-- 1172 19-4


CEC Background Update Status

Note

This section is relevant only for BlueField-3.

BMC and CEC have an active and inactive copy of the same firmware image on their respective firmware SPI flash. The firmware update updates the inactive copy, and on a successful boot from the newly updated and active image, the inactive image (e.g., the previous active image) is updated with the latest image.

Warning

Firmware update cannot be initiated if the background copy is in progress.

To check the status of the background update:

Copy
Copied!
            

curl -k -H "X-Auth-Token: <token>" -X GET https://<bmc_ip>/redfish/v1/Chassis/Bluefield_ERoT ... "Oem": { "Nvidia": { "@odata.type": "#NvidiaChassis.v1_0_0.NvidiaChassis", "AutomaticBackgroundCopyEnabled": true, "BackgroundCopyStatus": "Completed", "InbandUpdatePolicyEnabled": true } } …

Note

The background update initially indicates InProgress while the inactive copy of the image is being updated with the copy.


Possible Error Codes

Note

This section is relevant only for BlueField-3.

Fault

Diagnosis and Possible Solution

Connection to BMC breaks during firmware package transfer

  • Redfish task URI is not returned by the Redfish server

  • The Redfish server (if operational) is in idle state

  • After a reboot of BMC, or restart/recovery of the Redfish server, the Redfish server is in idle state

A new firmware update can be attempted by the Redfish client.

Connection to BMC breaks during firmware update

  • Redfish task URI previously returned by the Redfish server is no longer accessible

  • The Redfish server (if operational) is in one of the following states:

    • In idle state, if the firmware update has completed

    • In update state, if the firmware update is still ongoing

  • After a BMC reboot, or the restart/recovery of the Redfish server, the Redfish server is in idle state

A new firmware update can be attempted by the Redfish client.

Two firmware update requests are initiated

The Redfish server blocks the second firmware update request and returns the following:

  • HTTP code 400 "Bad Request"

  • Redfish message based on standard registry entry UpdateInProgress

Check the status of the ongoing firmware update by looking at the TaskCollection resource.

Redfish task hangs

  • Redfish task URI that previously returned by the Redfish server is no longer accessible

  • PLDM-based firmware update progresses

  • After a reboot of BMC, or restart/recovery of the Redfish server, the Redfish server us in idle state

A new firmware update can be attempted by the Redfish client.

BMC-EROT communication failure during image transfer

The Redfish task monitoring the firmware update indicates a failure:

  • TaskState is set to Exception

  • TaskStatus is set to Warning

  • Messages array in the task includes an entry based on the standard registry Update.1.0.0.TransferFailed indicating the components that failed during image transfer

The Redfish client may retry the firmware update.

Firmware update fails

The Redfish task monitoring the firmware update indicates a failure:

  • TaskState is set to Exception

  • TaskStatus is set to Warning

  • Messages array in the task includes an entry describing the error

The Redfish client may retry the firmware update.

ERoT failure (not responding)

The Redfish task monitoring the firmware update indicates a failure:

  • TaskState is set to Canceled

  • TaskStatus is set to Warning

  • Messages array in the task includes an entry describing the error

  • The Redfish client reports the error

The Redfish client may retry the firmware update.

Firmware image validation failure

The Redfish task monitoring the firmware update indicates a failure:

  • TaskState is set to Exception

  • TaskStatus is set to Warning

  • Messages array in the task includes an entry based on the standard registry Update.1.0.0.VerificationFailed to indicate the component for which verification failed

  • The Redfish client reports the error

The Redfish client might retry the firmware update.

Power loss before activation command is sent

  • The Redfish server is in idle state

A new firmware update can be attempted by the Redfish client.

Firmware activation failure

The Redfish task monitoring the firmware update indicates a failure:

  • TaskState is set to Exception

  • TaskStatus is set to Warning

  • Messages array in the task includes an entry based on the standard registry Update.1.0.ActivateFailed

The Redfish client may retry the firmware update.

Push to BMC firmware package greater than 200 MB

  • No Redfish task is created

  • Messages array in the task includes an entry based on the standard registry Base.1.8.1.ResourceExhaustion and a request to retry the operation is given.

Get GPU Firmware

Copy
Copied!
            

smbpbi: (See SMBPBI spec)   root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x00 0x80 s root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5 5: 0x04 0x05 0x08 0x00 0x5f root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5 5: 0x04 0x39 0x32 0x2e 0x30 root@dpu:~# root@dpu:~# root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x01 0x80 s root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5 5: 0x04 0x05 0x08 0x01 0x5f root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5 5: 0x04 0x30 0x2e 0x36 0x42 root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x02 0x80 s root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5 5: 0x04 0x05 0x08 0x02 0x5f root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5 5: 0x04 0x2e 0x30 0x30 0x2e root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x03 0x80 s root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5 5: 0x04 0x05 0x08 0x03 0x5f root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5 5: 0x04 0x30 0x31 0x00 0x00 root@dpu:~#   39 32 2e 30 30 2e 36 42 2e 30 30 2e 30 31 00 00 → 92.00.6B.00.01


Updating GPU Firmware

Copy
Copied!
            

root@dpu:~# scp root@10.23.201.227:/<path-to-fw-bin>/1004_0230_891__92006B0001-dbg-ota.bin /tmp/gpu_images/ root@10.23.201.227's password: 1004_0230_891__92006B0001-dbg-ota.bin 100% 384KB 384.4KB/s 00:01   root@dpu:~# cat /tmp/gpu_images/progress.txt TaskState="Running" TaskStatus="OK" TaskProgress="50"   root@dpu:~# cat /tmp/gpu_images/progress.txt TaskState="Running" TaskStatus="OK" TaskProgress="50"   root@dpu:~# cat /tmp/gpu_images/progress.txt TaskState=Frimware update succeeded. TaskStatus=OK TaskProgress=100


© Copyright 2023, NVIDIA. Last updated on Feb 8, 2024.