Deploying NVIDIA Converged Accelerator

1.0

This section assumes that you have installed the BlueField OS BFB on your NVIDIA® Converged Accelerator using any of the following guides:

NVIDIA® CUDA® (GPU driver) must be installed in order to use the GPU. For information on how to install CUDA on your Converged Accelerator, refer to NVIDIA CUDA Installation Guide for Linux.

After installing the BFB, you can select the mode in which your NVIDIA Converged Accelerator operates:

  • Standard (default) – the NVIDIA® BlueField® DPU and the GPU operate separately (GPU is owned by the host)

  • BlueField-X – the GPU is exposed to the DPU and is no longer visible on the host (GPU is owned by the DPU)

Warning

It is important to know your DPU's device ID, because some of the software installations and upgrades in this guide require it.

To determine the device ID of the DPUs on your setup, run:

mst start
mst status -v

Example output:

MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA
BlueField2(rev:1)       /dev/mst/mt41686_pciconf0.1   3b:00.1   mlx5_1          net-ens1f1                0
BlueField2(rev:1)       /dev/mst/mt41686_pciconf0     3b:00.0   mlx5_0          net-ens1f0                0
BlueField3(rev:1)       /dev/mst/mt41692_pciconf0.1   e2:00.1   mlx5_1          net-ens7f1np1             4
BlueField3(rev:1)       /dev/mst/mt41692_pciconf0     e2:00.0   mlx5_0          net-ens7f0np0             4

The device IDs for the BlueField-2 and BlueField-3 DPUs in this example are /dev/mst/mt41686_pciconf0 and /dev/mst/mt41692_pciconf0 respectively.
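When several DPUs are installed, picking the device IDs out of the `mst status -v` listing by hand is error-prone. The following is a minimal sketch of how the listing could be filtered, assuming the column layout shown in the example output above. The function name `dpu_mst_devices` and the `SAMPLE` text are illustrative and not part of any NVIDIA tool.

```python
import re

# Device rows copied from the example output above.
SAMPLE = """\
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA
BlueField2(rev:1)       /dev/mst/mt41686_pciconf0.1   3b:00.1   mlx5_1          net-ens1f1                0
BlueField2(rev:1)       /dev/mst/mt41686_pciconf0     3b:00.0   mlx5_0          net-ens1f0                0
BlueField3(rev:1)       /dev/mst/mt41692_pciconf0.1   e2:00.1   mlx5_1          net-ens7f1np1             4
BlueField3(rev:1)       /dev/mst/mt41692_pciconf0     e2:00.0   mlx5_0          net-ens7f0np0             4
"""


def dpu_mst_devices(mst_status_output):
    """Return one MST device path per BlueField DPU found in the listing."""
    devices = []
    for line in mst_status_output.splitlines():
        fields = line.split()
        # Device rows start with a type such as "BlueField2(rev:1)".
        if len(fields) >= 2 and fields[0].startswith("BlueField"):
            path = fields[1]
            # Keep only the entry without a ".1" suffix (the second port).
            if re.fullmatch(r"/dev/mst/\w+_pciconf\d+", path):
                devices.append(path)
    return devices


for device in dpu_mst_devices(SAMPLE):
    print(device)
```

In practice the text would come from running `mst status -v` with sufficient privileges rather than from a hard-coded sample.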

BlueField-X Mode

  1. Run the following command from the host:

    mlxconfig -d /dev/mst/<device-name> s PCI_DOWNSTREAM_PORT_OWNER[4]=0xF

  2. Power cycle the host for the configuration to take effect.

Standard Mode

To return the DPU from BlueField-X mode to Standard mode:

  1. Run the following command from the host:

    mlxconfig -d /dev/mst/<device-name> s PCI_DOWNSTREAM_PORT_OWNER[4]=0x0

  2. Power cycle the host for the configuration to take effect.
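The two mode changes differ only in the value written to `PCI_DOWNSTREAM_PORT_OWNER[4]`. As a rough sketch, a script could build the command from a small lookup table instead of hard-coding it twice. The helper name `mode_command` and the dictionary `MODE_VALUES` are hypothetical. The values `0xF` and `0x0` and the example device path are taken from the commands and output shown earlier. The sketch only builds and prints the command; it does not run it.

```python
import shlex

# Values taken from the commands above:
# 0xF hands the GPU to the DPU, 0x0 restores the default owner.
MODE_VALUES = {"standard": "0x0", "bluefield-x": "0xF"}


def mode_command(device, mode):
    """Build the mlxconfig argument list that selects the given mode."""
    value = MODE_VALUES[mode.lower()]
    return ["mlxconfig", "-d", device, "s",
            f"PCI_DOWNSTREAM_PORT_OWNER[4]={value}"]


cmd = mode_command("/dev/mst/mt41692_pciconf0", "BlueField-X")
# shlex.join quotes the bracketed argument so a shell will not glob it.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on the host. A power cycle is still needed afterwards for the setting to take effect.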

To verify which mode is configured, run the following command from the host or from the DPU:

$ sudo mlxconfig -d /dev/mst/<device-name> q PCI_DOWNSTREAM_PORT_OWNER[4]

Example of Standard mode output:

Device #1:
----------
[...]
Configurations:                              Next Boot
        PCI_DOWNSTREAM_PORT_OWNER[4]         DEVICE_DEFAULT(0)

Example of BlueField-X mode output:

Device #1:
----------
[...]
Configurations:                              Next Boot
        PCI_DOWNSTREAM_PORT_OWNER[4]         EMBEDDED_CPU(15)
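The query output can also be interpreted by a script. The sketch below maps the reported owner to a mode name, assuming the output contains the parameter in the `NAME(value)` form shown in the two examples above. The function name `configured_mode` and the mapping table are illustrative. Note that the value is the "Next Boot" setting, so it reflects the running mode only after a power cycle.

```python
import re

# Owner names as they appear in the two example outputs above.
OWNER_TO_MODE = {"DEVICE_DEFAULT": "Standard", "EMBEDDED_CPU": "BlueField-X"}


def configured_mode(query_output):
    """Return the mode implied by the next-boot PCI_DOWNSTREAM_PORT_OWNER[4]."""
    match = re.search(r"PCI_DOWNSTREAM_PORT_OWNER\[4\]\s+(\w+)\((\d+)\)",
                      query_output)
    if match is None:
        raise ValueError("PCI_DOWNSTREAM_PORT_OWNER[4] not found in output")
    owner = match.group(1)
    # Any owner not shown in this guide is reported as unknown, not guessed.
    return OWNER_TO_MODE.get(owner, f"unknown ({owner})")


sample = """Configurations:                              Next Boot
        PCI_DOWNSTREAM_PORT_OWNER[4]         EMBEDDED_CPU(15)
"""
print(configured_mode(sample))
```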

The following are example outputs for when the DPU is configured to BlueField-X mode.

The GPU is no longer visible from the host.

root@host:~# lspci | grep -i nv
None

The GPU is now visible from the DPU.

ubuntu@dpu:~$ lspci | grep -i nv
06:00.0 3D controller: NVIDIA Corporation GA20B8 (rev a1)

Firmware upgrades of the BMC and CEC components are performed through the BMC and can be run from a remote server using openbmctool.

The following table presents the commands available to perform the upgrade:

No. 1
Function: Trigger a BMC secure update
Command:

    python3 openbmctool.py -H <ip_address> \
        -U <username> \
        -P <password> firmware flash bmc \
        -f <path>

    Where:

      • -H – BMC IP
      • -U – username
      • -P – password
      • -f – path to signed BMC image tar file

Description: Triggers BMC secure update

No. 2
Function: Track a BMC firmware update
Command:

    python3 openbmctool.py -H <ip_address> \
        -U <username> \
        -P <password> task status \
        -i <task-id>

    Where:

      • -H – BMC IP
      • -U – username
      • -P – password
      • -i – task ID of the triggered firmware update, displayed after the update is triggered

Description: Tracks the BMC firmware update

No. 3
Function: Fetch running BMC firmware version
Command:

    python3 openbmctool.py -H <ip_address> \
        -U <username> \
        -P <password> firmware running_version

    Where:

      • -H – BMC IP
      • -U – username
      • -P – password

Description: Fetches the running firmware version from BMC

No. 4
Function: Reset/reboot a BMC
Command:

    python3 openbmctool.py -H <ip_address> \
        -U <username> \
        -P <password> bmc reset warm

    Where:

      • -H – BMC IP
      • -U – username
      • -P – password

Description: Reboots/resets the BMC

No. 5
Function: Trigger a CEC secure update
Command:

    python3 openbmctool.py -H <ip_address> \
        -U <username> \
        -P <password> apfirmware flash cec \
        -f <path>

    Where:

      • -H – BMC IP
      • -U – username
      • -P – password
      • -f – path to signed CEC OTA image file

Description: Triggers CEC secure update

No. 6
Function: Track a CEC firmware update
Command:

    python3 openbmctool.py -H <ip_address> \
        -U <username> \
        -P <password> apfirmware status cec

    Where:

      • -H – BMC IP
      • -U – username
      • -P – password

Description: Tracks the CEC firmware update

No. 7
Function: Trigger CEC attestation/challenge-response
Command:

    python3 openbmctool.py -H <bmc_ip> -U <username> \
        -P <password> apfirmware getattestation cec \
        --pubkeyfile <public key file> \
        --randomnumbers <32-byte random number in hex format>

    Where:

      • -H – BMC IP
      • -U – username
      • -P – password
      • --pubkeyfile – (optional) NVIDIA public key certificate provided for CEC validation
      • --randomnumbers – (optional) 32-byte random number in hex format (see the example below) to use in the challenge-response. The same numbers, in the same order, can then be validated in the attestation file returned from the CEC.

    For example:

    python3 openbmctool.py -H <bmc_ip> \
        -U <username> \
        -P <password> apfirmware getattestation cec \
        --pubkeyfile pubkey.pem \
        --randomnumbers 0102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f20

    In the example above, the hex string encodes 32 bytes with the decimal values 1 through 32, in order (01 is 1, 0a is 10, 20 is 32).

Description: Triggers CEC attestation or challenge-response
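For a real challenge-response, the value passed to `--randomnumbers` would normally be unpredictable rather than a fixed sequence. The following sketch shows one way to produce a suitable 64-character hex string with Python's standard `secrets` module. The helper name `attestation_nonce` is hypothetical and not part of openbmctool.

```python
import secrets


def attestation_nonce():
    """Return 32 random bytes as a 64-character lowercase hex string."""
    return secrets.token_hex(32)


nonce = attestation_nonce()
print(nonce)

# The example value used in this guide decodes to the bytes 1 through 32.
example = "0102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f20"
assert list(bytes.fromhex(example)) == list(range(1, 33))
```

Keeping the generated value lets you compare it later against the numbers in the attestation file returned from the CEC.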

BMC Update

The command in line #2 of the table above can be used to track the BMC firmware update. The following example shows the completion of the first stage of the BMC secure update:

python3 openbmctool.py -H <ip_address> -U <username> -P <password> task status -i <task-id>
Attempting login...
Task Details:
TaskState="Completed"
TaskStatus="OK"
TaskProgress="100"
User root has been logged out

A BMC reboot is required to complete the BMC secure update. The reboot can be triggered once the first stage of the update has completed.

CEC Update

The command in line #6 in the table above can be used to track the CEC firmware update. The following example shows the completion of the first stage of CEC secure update:

python3 openbmctool.py -H <bmc_ip> -U <username> -P <password> apfirmware status cec
Firmware update status for the component cec as below.
TaskState=Frimware update succeeded.
TaskStatus=OK
TaskProgress=100

A power cycle (cold reset) is required to complete the CEC secure update. It can be triggered once the first stage of the update has completed.
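Because both updates need a follow-up action (BMC reboot or power cycle) only after the first stage finishes, a script may want to check the tracking output before proceeding. The sketch below parses the `TaskState`, `TaskStatus`, and `TaskProgress` fields, assuming each field appears on its own line and accepting both the quoted form and the unquoted form seen in the examples. The function names are illustrative.

```python
def parse_task_status(output):
    """Collect the Task* fields from tracking output into a dictionary."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.startswith("Task"):
            # Values may or may not be wrapped in double quotes.
            fields[key] = value.strip().strip('"')
    return fields


def first_stage_done(fields):
    """True when the fields report an OK status at 100 percent progress."""
    return (fields.get("TaskStatus") == "OK"
            and fields.get("TaskProgress") == "100")


bmc_output = """Attempting login...
Task Details:
TaskState="Completed"
TaskStatus="OK"
TaskProgress="100"
User root has been logged out
"""
status = parse_task_status(bmc_output)
print(status, first_stage_done(status))
```

Checking `TaskStatus` and `TaskProgress` rather than the free-text `TaskState` keeps the check independent of the exact wording of the state message.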

Get GPU Firmware

smbpbi: (See SMBPBI spec)

root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x00 0x80 s
root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5
5: 0x04 0x05 0x08 0x00 0x5f
root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5
5: 0x04 0x39 0x32 0x2e 0x30

root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x01 0x80 s
root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5
5: 0x04 0x05 0x08 0x01 0x5f
root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5
5: 0x04 0x30 0x2e 0x36 0x42

root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x02 0x80 s
root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5
5: 0x04 0x05 0x08 0x02 0x5f
root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5
5: 0x04 0x2e 0x30 0x30 0x2e

root@dpu:~# i2cset -y 3 0x4f 0x5c 0x05 0x08 0x03 0x80 s
root@dpu:~# i2cget -y 3 0x4f 0x5c ip 5
5: 0x04 0x05 0x08 0x03 0x5f
root@dpu:~# i2cget -y 3 0x4f 0x5d ip 5
5: 0x04 0x30 0x31 0x00 0x00

39 32 2e 30 30 2e 36 42 2e 30 30 2e 30 31 00 00 → 92.00.6B.00.01
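The last line of the session above joins the data bytes of the four `0x5d` reads and interprets them as ASCII. A minimal sketch of that decoding step follows, assuming the leading `0x04` of each read is a byte count, which is consistent with the session but should be confirmed against the SMBPBI specification. The function name `decode_version` is illustrative.

```python
def decode_version(reads):
    """Join the data bytes of each read and decode them as ASCII text."""
    data = bytearray()
    for read in reads:
        # Assumption: the first byte is the number of data bytes that follow.
        count, payload = read[0], read[1:]
        data += bytes(payload[:count])
    # Trailing zero bytes are padding, not part of the version string.
    return data.rstrip(b"\x00").decode("ascii")


# The four 0x5d reads from the session above.
reads = [
    [0x04, 0x39, 0x32, 0x2E, 0x30],
    [0x04, 0x30, 0x2E, 0x36, 0x42],
    [0x04, 0x2E, 0x30, 0x30, 0x2E],
    [0x04, 0x30, 0x31, 0x00, 0x00],
]
print(decode_version(reads))  # 92.00.6B.00.01
```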


Updating GPU Firmware

root@dpu:~# scp root@10.23.201.227:/<path-to-fw-bin>/1004_0230_891__92006B0001-dbg-ota.bin /tmp/gpu_images/
root@10.23.201.227's password:
1004_0230_891__92006B0001-dbg-ota.bin 100% 384KB 384.4KB/s 00:01

root@dpu:~# cat /tmp/gpu_images/progress.txt
TaskState="Running"
TaskStatus="OK"
TaskProgress="50"

root@dpu:~# cat /tmp/gpu_images/progress.txt
TaskState="Running"
TaskStatus="OK"
TaskProgress="50"

root@dpu:~# cat /tmp/gpu_images/progress.txt
TaskState=Frimware update succeeded.
TaskStatus=OK
TaskProgress=100


© Copyright 2023, NVIDIA. Last updated on Sep 9, 2023.