NVIDIA BlueField Reset and Reboot Procedures

NVIDIA BlueField DPU BSP v4.7.0

This section describes the operations required to load new NIC firmware following an NVIDIA® BlueField® NIC firmware update. This procedure removes the need for a full server power cycle.

The following steps are executed in the BlueField OS:

  1. Issue a query command to ascertain whether BlueField system reboot is supported by your environment:

    mlxfwreset -d 03:00.0 q

    If the output includes the following lines, proceed to step 2:

    3: Driver restart and PCI reset         -Supported (default)
    ...
    1: Driver is the owner                  -Supported (default)

    Note

    If the output shows "Not Supported" for these entries instead, follow the instructions in the section "BlueField System-level Reset".

  2. Issue a BlueField system reboot:

    mlxfwreset -d 03:00.0 -y -l 3 --sync 1 r
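
The check in step 1 can be scripted so that step 2 runs only when both lines report support. The following is a minimal sketch, not an official tool: `reboot_supported` is a hypothetical helper name, it assumes the query output uses exactly the wording shown above, and it assumes `q` is the query command of `mlxfwreset`.

```shell
#!/bin/sh
# Sketch: decide from the mlxfwreset query output whether the BlueField
# system reboot (reset level 3 with sync 1) is available.
# reboot_supported is a hypothetical helper, not part of the MFT tools.

reboot_supported() {
    # Read the whole query output from stdin once so it can be scanned twice.
    query_output=$(cat)

    # "-Not Supported" does not match, because the pattern requires
    # "-Supported" directly after the padding spaces.
    printf '%s\n' "$query_output" |
        grep -Eq '3: Driver restart and PCI reset[[:space:]]+-Supported' || return 1
    printf '%s\n' "$query_output" |
        grep -Eq '1: Driver is the owner[[:space:]]+-Supported' || return 1
}

# Example usage on the BlueField OS (device address taken from the steps above):
#   if mlxfwreset -d 03:00.0 q | reboot_supported; then
#       mlxfwreset -d 03:00.0 -y -l 3 --sync 1 r
#   else
#       echo "Use the system-level reset procedure instead" >&2
#   fi
```

Matching on the full description text rather than on the leading number alone avoids confusing the reset-level entry with other numbered entries in the same output.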

BlueField System-level Reset

This section describes the system-level reset required after firmware configuration changes.

The two methods for performing a BlueField system-level reset are described in the following subsections. Each method is designed for a different kind of host platform, depending on whether the host OS/CPUs and the PCIe slots share power control or are powered separately.

Within each method, the individual steps can be carried out in several ways, depending on which resources are available and supported in the user's environment.

System-level Reset for BlueField in DPU Mode with Minimal Host OS Downtime

The following is the high-level flow of the procedure:

  1. Graceful shutdown of BlueField Arm cores.

  2. Query the BlueField state to confirm that shutdown has been reached.

    Info

    In systems with multiple BlueField networking platforms, repeat steps 1 and 2 for all devices before proceeding.

  3. Warm reboot the server.

Step by step process:

  1. Graceful shutdown of BlueField Arm cores.

    Info

    This operation is expected to finish within 15 seconds.

    Options:

    • From the BlueField OS:

      shutdown -h now

      Or:

      mlxfwreset -d /dev/mst/mt*pciconf0 -l 1 -t 4 --sync 0 r

    • From the host OS:

      Info

      Not relevant when the BlueField is operating in Zero-Trust Mode.

      mlxfwreset -d <mst-device> -l 1 -t 4 r

    • Using the BlueField BMC:

      ipmitool -C 17 -I lanplus -H <bmc_ip> -U root -P <password> power soft

      Or using Redfish (BlueField-3 and above):

      curl -k -u root:<password> -H "Content-Type: application/json" -X POST https://<bmc_ip>/redfish/v1/Systems/Bluefield/Actions/ComputerSystem.Reset -d '{"ResetType": "GracefulShutdown"}'

  2. Query BlueField state. Options:

    • From the host OS:

      Info

      Not relevant when the BlueField is operating in Zero-Trust Mode.

      echo DISPLAY_LEVEL 2 > /dev/rshim0/misc
      cat /dev/rshim0/misc

      Expected output:

      INFO[BL31]: System Off

    • Utilizing the BlueField BMC:

      ipmitool -C 17 -I lanplus -H <bmc_ip> -U root -P <password> raw 0x32 0xA3

      Expected output: 06.

  3. Warm reboot the server:

    • From the host OS:

      mlxfwreset -d <mst-device> -l 4 r

      Note

      If multiple DPUs are present in the host, run this command only once. The MST device can be that of any DPU that requires the reset and was shut down in step 1.

      Or:

      reboot

      Note

      For external hosts which do not toggle PERST# in their standard reboot command, use the mlxfwreset option.
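
Because the shutdown in step 1 is asynchronous, a script should poll the state query from step 2 rather than check it once before the warm reboot. The following is a minimal sketch of such a polling loop. `wait_for_state` is a hypothetical helper, and the 30-second budget in the usage example is an assumption chosen to leave margin over the expected 15 seconds.

```shell
#!/bin/sh
# Sketch: poll a state-query command until it prints the expected value.
# wait_for_state is a hypothetical helper, not part of ipmitool or MFT.
#
# Usage: wait_for_state EXPECTED TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_for_state() {
    expected=$1
    timeout=$2
    interval=$3
    shift 3

    elapsed=0
    while :; do
        # Strip all whitespace so " 06" and "06" both compare equal to "06".
        state=$("$@" 2>/dev/null | tr -d '[:space:]')
        [ "$state" = "$expected" ] && return 0

        # Give up once the time budget is used; the caller decides what to do.
        [ "$elapsed" -ge "$timeout" ] && return 1

        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Example usage (BMC_IP and BMC_PASSWORD are placeholders set by the caller):
#   wait_for_state 06 30 2 \
#       ipmitool -C 17 -I lanplus -H "$BMC_IP" -U root -P "$BMC_PASSWORD" raw 0x32 0xA3 \
#       || { echo "BlueField did not reach shutdown state" >&2; exit 1; }
```

Passing the query command as arguments keeps the loop independent of the transport, so the same helper can wrap the BMC query for any expected state value.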

System-level Reset for BlueField in DPU Mode where Host is Down Throughout the Process

This procedure is relevant only to server platforms that have separate power control for the PCIe slots and the CPUs, so that the BlueField remains powered while the host OS/CPUs are shut down or in a similar standby state.

The following is the high-level flow of the procedure:

  1. Graceful shutdown of host OS or similar CPU standby.

  2. Graceful shutdown of BlueField Arm cores.

  3. Query the BlueField state to confirm that shutdown has been reached.

  4. Full BlueField reset.

  5. Query the BlueField state to confirm that the operational state has been reached.

    Info

    In systems with multiple BlueField networking platforms, repeat steps 1 through 5 for all devices before proceeding.

  6. Power on the server.

Step by step process:

  1. Graceful shutdown of host OS by any means preferable.

  2. Graceful shutdown of BlueField Arm cores.

    Info

    This step normally takes up to 15 seconds to complete.

    • From the BlueField OS:

      shutdown -h now

    • Utilizing the BlueField BMC:

      • Using IPMI:

        ipmitool -C 17 -I lanplus -H <bmc_ip> -U root -P <password> power soft

      • Using Redfish (for BlueField-3 and above):

        curl -k -u root:<password> -H "Content-Type: application/json" -X POST https://<bmc_ip>/redfish/v1/Systems/Bluefield/Actions/ComputerSystem.Reset -d '{"ResetType": "GracefulShutdown"}'

  3. Query the BlueField's state utilizing the BlueField BMC:

    ipmitool -C 17 -I lanplus -H <bmc_ip> -U root -P <password> raw 0x32 0xA3

    Expected output: 06.

  4. Perform BlueField hard reset utilizing the BlueField BMC:

    Info

    This step takes up to 2 minutes to complete.

    • Using IPMI:

      ipmitool -C 17 -I lanplus -H <bmc_ip> -U root -P <password> power cycle

    • Using Redfish (for BlueField-3 and above):

      curl -k -u root:<password> -H "Content-Type: application/json" -X POST https://<bmc_ip>/redfish/v1/Systems/Bluefield/Actions/ComputerSystem.Reset -d '{"ResetType" : "PowerCycle"}'

  5. Query BlueField operational state utilizing the BlueField BMC:

    Info

    At this point, the BlueField is expected to be operational.

    ipmitool -C 17 -I lanplus -H <bmc_ip> -U root -P <password> raw 0x32 0xA3

    Expected output: 05.

  6. Power on/boot up the host OS.
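
The two Redfish requests in steps 2 and 4 differ only in their `ResetType` value, so they can share one wrapper. The following is a minimal sketch for BlueField-3 and above. `redfish_reset` is a hypothetical helper, and `BMC_IP` and `BMC_PASSWORD` are placeholder environment variables that are not defined by this document.

```shell
#!/bin/sh
# Sketch: send a ComputerSystem.Reset action to the BlueField BMC.
# redfish_reset, BMC_IP and BMC_PASSWORD are hypothetical names.
#
# Usage: redfish_reset GracefulShutdown|PowerCycle
redfish_reset() {
    # Accept only the two reset types used in this procedure.
    case "$1" in
        GracefulShutdown|PowerCycle) ;;
        *)
            echo "unsupported ResetType: $1" >&2
            return 2
            ;;
    esac

    if [ -z "$BMC_IP" ] || [ -z "$BMC_PASSWORD" ]; then
        echo "BMC_IP and BMC_PASSWORD must be set" >&2
        return 2
    fi

    # Same request as in the steps above; -k skips TLS certificate checks.
    curl -k -u "root:$BMC_PASSWORD" \
        -H "Content-Type: application/json" \
        -X POST "https://$BMC_IP/redfish/v1/Systems/Bluefield/Actions/ComputerSystem.Reset" \
        -d "{\"ResetType\": \"$1\"}"
}

# Example usage:
#   BMC_IP=<bmc_ip> BMC_PASSWORD=<password>
#   redfish_reset GracefulShutdown   # step 2
#   redfish_reset PowerCycle         # step 4, after the state query returns 06
```

Restricting the argument to the two documented values prevents a typo from sending an unintended reset type to the BMC.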

System-level Reset for BlueField in NIC Mode

Perform warm reboot of the host OS:

mlxfwreset -d <mst-device> -l 4 r

Or:

reboot

Note

For external hosts which do not toggle PERST# in their standard reboot command, use the mlxfwreset option.


© Copyright 2024, NVIDIA. Last updated on May 9, 2024.