Deploying NVIDIA Converged Accelerator
This section assumes that you have installed the BlueField OS BFB on your NVIDIA® Converged Accelerator using any of the following guides:
NVIDIA® CUDA® (GPU driver) must be installed in order to use the GPU. For information on how to install CUDA on your Converged Accelerator, refer to NVIDIA CUDA Installation Guide for Linux.
After installing the BFB, select the mode in which the NVIDIA Converged Accelerator operates:
Standard (default) – the NVIDIA® BlueField® DPU and the GPU operate separately (GPU is owned by the host)
BlueField-X – the PCIe switch is reconfigured so the GPU is dedicated to the DPU and no longer visible to the host system (GPU is owned by the DPU)
It is important to know your DPU's device ID, because some of the software installations and upgrades in this guide require it.
To determine the device ID of the DPUs on your setup, run:
mst start
mst status -v
Example output:
MST modules:
------------
MST PCI module is not loaded
MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE MST PCI RDMA NET NUMA
BlueField2(rev:1) /dev/mst/mt41686_pciconf0.1 3b:00.1 mlx5_1 net-ens1f1 0
BlueField2(rev:1) /dev/mst/mt41686_pciconf0 3b:00.0 mlx5_0 net-ens1f0 0
BlueField3(rev:1) /dev/mst/mt41692_pciconf0.1 e2:00.1 mlx5_1 net-ens7f1np1 4
BlueField3(rev:1) /dev/mst/mt41692_pciconf0 e2:00.0 mlx5_0 net-ens7f0np0 4
In this example, the device IDs of the BlueField-2 and BlueField-3 DPUs are /dev/mst/mt41686_pciconf0 and /dev/mst/mt41692_pciconf0, respectively.
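On hosts with several DPUs, the device names can be picked out of the mst status -v output with a short script. The following is a minimal Python sketch, assuming the column layout shown in the example output above. The function name list_mst_devices is hypothetical and not part of any NVIDIA tool.

```python
import re

# Device rows copied from the example output above.
SAMPLE = """\
DEVICE_TYPE MST PCI RDMA NET NUMA
BlueField2(rev:1) /dev/mst/mt41686_pciconf0.1 3b:00.1 mlx5_1 net-ens1f1 0
BlueField2(rev:1) /dev/mst/mt41686_pciconf0 3b:00.0 mlx5_0 net-ens1f0 0
BlueField3(rev:1) /dev/mst/mt41692_pciconf0.1 e2:00.1 mlx5_1 net-ens7f1np1 4
BlueField3(rev:1) /dev/mst/mt41692_pciconf0 e2:00.0 mlx5_0 net-ens7f0np0 4
"""


def list_mst_devices(status_text):
    """Return (device_type, mst_path) pairs, one per DPU (hypothetical helper)."""
    devices = []
    for line in status_text.splitlines():
        fields = line.split()
        # Only rows whose second column is an MST device path are of interest.
        if len(fields) < 2 or not fields[1].startswith("/dev/mst/"):
            continue
        # Paths with a numeric suffix such as ".1" are additional ports of the
        # same device, so they are skipped.
        if re.search(r"\.\d+$", fields[1]):
            continue
        device_type = fields[0].split("(")[0]
        devices.append((device_type, fields[1]))
    return devices


for device_type, path in list_mst_devices(SAMPLE):
    print(device_type, path)
```

On a live system, the text could instead come from running mst status -v and capturing its output.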
BlueField-X Mode
Run the following command from the host:
mlxconfig -d /dev/mst/<device-name> s PCI_DOWNSTREAM_PORT_OWNER[4]=0xF
Power cycle the host for the configuration to take effect.
Standard Mode
To return the DPU from BlueField-X mode to Standard mode:
Run the following command from the host:
mlxconfig -d /dev/mst/<device-name> s PCI_DOWNSTREAM_PORT_OWNER[4]=0x0
Power cycle the host for the configuration to take effect.
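The two mode changes differ only in the value written to PCI_DOWNSTREAM_PORT_OWNER[4]. The following Python sketch builds the corresponding mlxconfig command line from a mode name, using only the arguments shown above. The helper name mode_command and the mode keys are hypothetical.

```python
# Values taken from the commands above: 0xF selects BlueField-X, 0x0 selects Standard.
MODE_VALUES = {"bluefield-x": "0xF", "standard": "0x0"}


def mode_command(device, mode):
    """Build the mlxconfig argument list for the given mode (hypothetical helper)."""
    value = MODE_VALUES[mode.lower()]
    return ["mlxconfig", "-d", device, "s", f"PCI_DOWNSTREAM_PORT_OWNER[4]={value}"]


command = mode_command("/dev/mst/mt41686_pciconf0", "BlueField-X")
print(" ".join(command))
```

The resulting list can be run on the host, for example with subprocess.run. The host must still be power cycled afterwards for the change to take effect.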
To verify the configured operating mode, run the following command from the host or from the DPU:
$ sudo mlxconfig -d /dev/mst/<device-name> q PCI_DOWNSTREAM_PORT_OWNER[4]
Example of Standard mode output:
Device #1:
----------
Device type: BlueField2
Name: 699210020215_Ax
Description: PRIS BlueField-2
Device: /dev/mst/mt41686_pciconf0
Configurations: Next Boot
PCI_DOWNSTREAM_PORT_OWNER[4] DEVICE_DEFAULT(0)
Example of BlueField-X mode output:
Device #1:
----------
Device type: BlueField2
Name: 699210020215_Ax
Description: PRIS BlueField-2
Device: /dev/mst/mt41686_pciconf0
Configurations: Next Boot
PCI_DOWNSTREAM_PORT_OWNER[4] EMBEDDED_CPU(15)
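The query output can also be interpreted by a script. The following is a minimal Python sketch, assuming the output format of the two examples above, where 0 corresponds to Standard mode and 15 (0xF) to BlueField-X mode. The function name operating_mode is hypothetical.

```python
import re

# Mapping inferred from the example outputs above.
MODES = {0: "Standard", 15: "BlueField-X"}


def operating_mode(query_output):
    """Map the PCI_DOWNSTREAM_PORT_OWNER[4] line to a mode name (hypothetical helper)."""
    match = re.search(r"PCI_DOWNSTREAM_PORT_OWNER\[4\]\s+\w+\((\d+)\)", query_output)
    if match is None:
        raise ValueError("PCI_DOWNSTREAM_PORT_OWNER[4] not found in output")
    value = int(match.group(1))
    return MODES.get(value, f"unknown ({value})")


standard = "Configurations: Next Boot\nPCI_DOWNSTREAM_PORT_OWNER[4] DEVICE_DEFAULT(0)\n"
bluefield_x = "Configurations: Next Boot\nPCI_DOWNSTREAM_PORT_OWNER[4] EMBEDDED_CPU(15)\n"
print(operating_mode(standard))
print(operating_mode(bluefield_x))
```

Because the query reports the Next Boot configuration, the result describes the mode that applies after the next power cycle.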
The following are example outputs for when the DPU is configured to BlueField-X mode.
The GPU is no longer visible from the host.
root@host:~# lspci | grep -i nv
None
The GPU is now visible from the DPU.
ubuntu@dpu:~$ lspci | grep -i nv
06:00.0 3D controller: NVIDIA Corporation GA20B8 (rev a1)
The firmware of the BMC and CEC components can be upgraded through the BMC from a remote server using openbmctool.
The following table presents the commands available to perform the upgrade:
No. | Function | Command | Description |
1 | Trigger a BMC secure update | | Triggers BMC secure update |
2 | Track a BMC firmware update | python3 openbmctool.py -H <ip_address> -U <username> -P <password> task status -i <task-id> | Tracks the BMC firmware update |
3 | Fetch running BMC firmware version | | Fetches the running firmware version from BMC |
4 | Reset/reboot a BMC | | Reboots/resets the BMC |
5 | Trigger a CEC secure update | | Triggers CEC secure update |
6 | Track a CEC firmware update | python3 openbmctool.py -H <bmc_ip> -U <username> -P <password> apfirmware status cec | Tracks the CEC firmware update |
7 | Trigger CEC attestation/challenge-response | | Triggers CEC attestation or challenge-response |
In the example for command 7, the hex string represents the 32 bytes whose decimal values are "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32".
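The relationship between the 32 decimal byte values and their hex-string form can be checked with a few lines of Python. This is only an illustration of the encoding. Whether the tool expects a prefix, separators, or a particular letter case is not stated in this section.

```python
# The 32 bytes with decimal values 1 through 32, as described above.
challenge = bytes(range(1, 33))

# Two hex digits per byte, so the string is 64 characters long.
hex_string = challenge.hex()
print(hex_string)

# Converting back recovers the original decimal values.
decoded = list(bytes.fromhex(hex_string))
print(decoded)
```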
BMC Update
The command in line #2 in the table above can be used to track the BMC firmware update. The following example shows the completion of the first stage of the BMC secure update.
python3 openbmctool.py -H <ip_address> -U <username> -P <password> task status -i <task-id>
Attempting login...
Task Details:
TaskState="Completed"
TaskStatus="OK"
TaskProgress="100"
User root has been logged out
A BMC reboot is required to complete the BMC secure update. The reboot can be triggered once the first stage of the update has completed.
CEC Update
The command in line #6 in the table above can be used to track the CEC firmware update. The following example shows the completion of the first stage of CEC secure update:
python3 openbmctool.py -H <bmc_ip> -U <username> -P <password> apfirmware status cec
Firmware update status for the component cec as below.
TaskState=Frimware update succeeded.
TaskStatus=OK
TaskProgress=100
A power cycle (cold reset) is required to complete the CEC secure update. It can be triggered once the first stage of the update has completed.
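When the update is tracked from a script, the status fields in the BMC and CEC outputs above can be parsed the same way, even though one output quotes its values and the other does not. The following is a minimal Python sketch. The function names are hypothetical, and treating TaskStatus OK with TaskProgress 100 as the end of the first stage is an assumption based on the two example outputs.

```python
def parse_task_status(output):
    """Extract the Task* fields from status output (hypothetical helper)."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        # Lines without "=" (login messages, headers) are ignored.
        if sep and key.startswith("Task"):
            fields[key] = value.strip().strip('"')
    return fields


def first_stage_done(fields):
    # Assumption: status OK plus progress 100 marks the end of the first stage.
    return fields.get("TaskStatus") == "OK" and fields.get("TaskProgress") == "100"


bmc_output = '''Attempting login...
Task Details:
TaskState="Completed"
TaskStatus="OK"
TaskProgress="100"
User root has been logged out'''

cec_output = '''Firmware update status for the component cec as below.
TaskState=Frimware update succeeded.
TaskStatus=OK
TaskProgress=100'''

for text in (bmc_output, cec_output):
    fields = parse_task_status(text)
    print(fields["TaskState"], first_stage_done(fields))
```

Once first_stage_done returns True, the BMC reboot or the power cycle described above can be scheduled.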
Get GPU Firmware
The GPU firmware version can be read from the DPU over SMBPBI (see the SMBPBI specification):
root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x00 0x80 s
root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5
5: 0x04 0x05 0x08 0x00 0x5f
root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5
5: 0x04 0x39 0x32 0x2e 0x30
root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x01 0x80 s
root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5
5: 0x04 0x05 0x08 0x01 0x5f
root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5
5: 0x04 0x30 0x2e 0x36 0x42
root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x02 0x80 s
root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5
5: 0x04 0x05 0x08 0x02 0x5f
root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5
5: 0x04 0x2e 0x30 0x30 0x2e
root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x03 0x80 s
root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5
5: 0x04 0x05 0x08 0x03 0x5f
root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5
5: 0x04 0x30 0x31 0x00 0x00
root@dpu:~#
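Concatenating the last four bytes of each read from register 0x5d and decoding them as ASCII gives the firmware version:
39 32 2e 30 30 2e 36 42 2e 30 30 2e 30 31 00 00 → 92.00.6B.00.01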
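The decoding shown above can be reproduced with a short Python sketch that takes the four i2cget responses from register 0x5d. It assumes that the first byte of each response (0x04) is a byte count rather than version data, and that trailing zero bytes are padding. The function name decode_smbpbi_reads is hypothetical.

```python
def decode_smbpbi_reads(read_lines):
    """Join the data bytes of successive i2cget reads and decode them as ASCII."""
    data = bytearray()
    for line in read_lines:
        # Each line looks like "5: 0x04 0x39 0x32 0x2e 0x30".
        _, _, payload = line.partition(":")
        values = [int(token, 16) for token in payload.split()]
        # Assumption: the first byte (0x04) is a byte count, not version data.
        data.extend(values[1:])
    # Trailing NUL bytes are treated as padding and dropped.
    return data.rstrip(b"\x00").decode("ascii")


# Responses to the four reads of register 0x5d in the transcript above.
reads = [
    "5: 0x04 0x39 0x32 0x2e 0x30",
    "5: 0x04 0x30 0x2e 0x36 0x42",
    "5: 0x04 0x2e 0x30 0x30 0x2e",
    "5: 0x04 0x30 0x31 0x00 0x00",
]
print(decode_smbpbi_reads(reads))  # 92.00.6B.00.01
```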
Updating GPU Firmware
The following example copies the firmware image to /tmp/gpu_images/ on the DPU and then polls /tmp/gpu_images/progress.txt until the update reports completion:
root@dpu:~# scp root@10.23.201.227:/<path-to-fw-bin>/1004_0230_891__92006B0001-dbg-ota.bin /tmp/gpu_images/
root@10.23.201.227's password:
1004_0230_891__92006B0001-dbg-ota.bin 100% 384KB 384.4KB/s 00:01
root@dpu:~# cat /tmp/gpu_images/progress.txt
TaskState="Running"
TaskStatus="OK"
TaskProgress="50"
root@dpu:~# cat /tmp/gpu_images/progress.txt
TaskState="Running"
TaskStatus="OK"
TaskProgress="50"
root@dpu:~# cat /tmp/gpu_images/progress.txt
TaskState=Frimware update succeeded.
TaskStatus=OK
TaskProgress=100