NVIDIA BlueField BSP v4.9.0

RAS

Note

RAS is supported on NVIDIA® BlueField®-3 and higher only.

Reliability, availability and serviceability (RAS) reduces and avoids unplanned outages because:

  • Errors can be detected and corrected by the hardware before they cause system failures

  • Errors can be detected by hardware and reported to software. Software can take action by identifying and replacing failing components.

  • Errors can be predicted ahead of time to allow replacement of defective hardware

Note

The SDEI NMI watchdog is supported on the openEuler and Anolis Linux distributions only.

Software-delegated exception interface (SDEI) non-maskable interrupt (NMI) watchdog offers a way to detect a Linux kernel hard lockup (e.g., a hang in an interrupt handler). It uses a high-priority, per-core secure timer as a keepalive mechanism and prints the call stack information when the Linux kernel is stuck.

Info

The SDEI NMI watchdog is an implementation for the "Software watchdog timer" as described in section 2.1.1 "Typical use cases" of the Software Delegated Exception Interface (SDEI) Platform Design Document.

The boot error record table (BERT) is used to report errors that occurred and were unhandled in a previous boot. On the subsequent boot, the OS reports the error using the common platform error record (CPER) format.

Info

For more information on BERT, refer to section 18.3 of the ACPI specification.

Info

For more information on CPER, refer to appendix N of the UEFI specification.

In BlueField-3, BERT reports all RShim log messages. For example, ASSERTs which occurred in the UEFI generate a Linux error report in the subsequent boot.

Example of a boot where ASSERTs occurred:

INFO[BL2]: start
INFO[BL2]: DDR POST passed
INFO[BL2]: UEFI loaded
INFO[BL31]: start
INFO[BL31]: lifecycle GA Secured
INFO[BL31]: runtime
INFO[UEFI]: UPVS valid
WARN[UEFI]: UPVS full
WARN[UEFI]: UPVS reclaim
WARN[UEFI]: Var reclaim
WARN[UEFI]: Var reclaim done
INFO[UEFI]: eMMC init
INFO[UEFI]: eMMC probed
INFO[UEFI]: PMI: updates started
INFO[UEFI]: PMI: total updates: 1
INFO[UEFI]: PMI: updates completed, status 0
ASSERT[UEFI]: MdePkg/Library/BaseLib/String.c: 173
ASSERT[UEFI]: PC=0xF56706B8
ASSERT[UEFI]: PC=0xF5668A2C
ASSERT[UEFI]: PC=0xF567E414
ASSERT[UEFI]: PC=0xFDE33DA4
ASSERT[UEFI]: PC=0x88009338
ASSERT[UEFI]: PC=0x880094D4
INFO[UEFI]: PCIe enum start
INFO[UEFI]: PCIe enum end
INFO[UEFI]: exit Boot Service
INFO[MISC]: Found bf.cfg
INFO[MISC]: Ubuntu installation started
INFO[MISC]: Installing OS image
INFO[MISC]: Changing the default password for user ubuntu
INFO[MISC]: Running bfb_modify_os from bf.cfg
INFO[MISC]: ===================== bfb_modify_os =====================
INFO[MISC]: Installation finished

Example print out of the next boot:

[ 0.736968] [Hardware Error]: event severity: fatal
[ 0.736971] [Hardware Error]: Error 0, type: fatal
[ 0.736976] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.736978] [Hardware Error]: section length: 0x33
[ 0.736983] [Hardware Error]: 00000000: 53534120 5b545245 49464555 4d203a5d ASSERT[UEFI]: M
[ 0.736986] [Hardware Error]: 00000010: 6b506564 694c2f67 72617262 61422f79 dePkg/Library/Ba
[ 0.736989] [Hardware Error]: 00000020: 694c6573 74532f62 676e6972 203a632e seLib/String.c:
[ 0.736991] [Hardware Error]: 00000030: 31 37 33 173
[ 0.736993] [Hardware Error]: Error 1, type: fatal
[ 0.736995] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.736996] [Hardware Error]: section length: 0x1c
[ 0.737000] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737002] [Hardware Error]: 00000010: 78303d43 37363546 38423630 C=0xF56706B8
[ 0.737004] [Hardware Error]: Error 2, type: fatal
[ 0.737005] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.737007] [Hardware Error]: section length: 0x1c
[ 0.737009] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737012] [Hardware Error]: 00000010: 78303d43 36363546 43324138 C=0xF5668A2C
[ 0.737013] [Hardware Error]: Error 3, type: fatal
[ 0.737015] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.737016] [Hardware Error]: section length: 0x1c
[ 0.737019] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737021] [Hardware Error]: 00000010: 78303d43 37363546 34313445 C=0xF567E414
[ 0.737022] [Hardware Error]: Error 4, type: fatal
[ 0.737024] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.737025] [Hardware Error]: section length: 0x1c
[ 0.737028] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737030] [Hardware Error]: 00000010: 78303d43 33454446 34414433 C=0xFDE33DA4
[ 0.737031] [Hardware Error]: Error 5, type: fatal
[ 0.737033] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.737035] [Hardware Error]: section length: 0x1c
[ 0.737037] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737040] [Hardware Error]: 00000010: 78303d43 30303838 38333339 C=0x88009338
[ 0.737041] [Hardware Error]: Error 6, type: fatal
[ 0.737043] [Hardware Error]: section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[ 0.737044] [Hardware Error]: section length: 0x1c
[ 0.737060] [Hardware Error]: 00000000: 53534120 5b545245 49464555 50203a5d ASSERT[UEFI]: P
[ 0.737062] [Hardware Error]: 00000010: 78303d43 30303838 34443439 C=0x880094D4
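To see how the hex columns in such a report map back to the RShim message, the following sketch reverses the byte order of each 32-bit word and prints the resulting characters. The function name decode_words is illustrative and not part of any BlueField tool. It assumes the dump prints each word as a little-endian integer, which is consistent with the ASCII column in the output above.

```shell
#!/bin/sh
# decode_words WORD...: turn hex words from a "[Hardware Error]" dump line
# back into text. Assumption: each 8-digit word is a little-endian 32-bit
# value, so its bytes are emitted in reverse order
# (e.g. 53534120 -> 0x20 0x41 0x53 0x53 = " ASS").
decode_words() {
    for word in "$@"; do
        # Split the 8 hex digits into 4 bytes, last byte first.
        # Tokens that are already single bytes pass through unchanged.
        for byte in $(printf '%s\n' "$word" |
                sed 's/^\(..\)\(..\)\(..\)\(..\)$/\4 \3 \2 \1/'); do
            # Convert hex to a 3-digit octal escape and let printf emit it.
            printf "\\$(printf '%03o' "0x$byte")"
        done
    done
    printf '\n'
}

# Second dump line of "Error 1" above.
decode_words 78303d43 37363546 38423630   # prints: C=0xF56706B8
```

Passing the words of the first dump line (53534120 5b545245 49464555 4d203a5d) yields the text " ASSERT[UEFI]: M", including the leading space stored in the record.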

On BlueField-3, the hardware error source table (HEST) is used for reporting errors to the OS. The error injection (EINJ) table is used for injecting errors.

Info

For more information on HEST, refer to section 18.3.2 of the ACPI specification.

By default, Disable HEST is FALSE, which delegates error handling to the OS. Setting Disable HEST to TRUE, using Redfish or the UEFI menu, shifts error handling to the BlueField Arm firmware, which limits the chance of the error propagating through the system.

Disabling HEST via UEFI Menu

  1. Open two SSH terminal windows to the host:

    1. From Terminal 1, open the OOB console via the RShim console device file. Example:

      $ sudo minicom --color=on -D /dev/rshim0/console 115200

    2. From Terminal 2, reset Arm from the host:

      $ sudo sh -c "echo 'SW_RESET 1' > /dev/rshim0/misc"

  2. To enter the UEFI menu, press ESC in the OOB terminal when the following screen appears:

    arm-reset-version-1-modificationdate-1734031476300-api-v2.png

  3. Navigate the UEFI menu to Device Manager > System Configuration, which contains the Disable HEST setting.

  4. Navigate down to the Disable HEST setting and press Space to toggle its value to enabled.

    enable-disable-hest-version-1-modificationdate-1734031475827-api-v2.png

  5. Press F10 then Y to save the new setting.

    save-settings-version-1-modificationdate-1734031475400-api-v2.png

  6. UEFI reboots to apply the new setting.

Disabling HEST via Redfish

DisableHEST can be configured via the Redfish REST interface with the REST server running on the BMC. The process involves using curl commands from the host to send REST commands to the BMC's Redfish server.

Run the following steps on the host terminal:

  1. Install the prerequisite software:

    $ sudo yum install curl # or sudo apt install curl
    $ sudo yum install jq # or sudo apt install jq

  2. Check the current DisableHEST value:

    $ curl -sku root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios | jq '.Attributes|{DisableHEST}'

    Example output:

    { "DisableHEST": false }

  3. Set DisableHEST to true:

    $ curl -sku root:'<password>' -H 'content-type: application/json' -d '{ "Attributes": { "DisableHEST": true } }' -X PATCH https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios/Settings | jq '."@Message.ExtendedInfo"[0]|{Message}'

  4. Restart BlueField-3 to apply the pending configuration changes:

    $ sudo sh -c "echo 'SW_RESET 1' > /dev/rshim0/misc"
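The Redfish steps above can be wrapped in a small helper so the same PATCH request is reusable for other BIOS attributes. This is only a sketch: the function names bios_patch_body and set_bios_attr, and the BMC_IP and BMC_PASSWORD variables, are hypothetical and must be supplied by the caller. The endpoint and request body follow the curl commands shown in the procedure.

```shell
#!/bin/sh
# bios_patch_body NAME VALUE: print the JSON body for a BIOS attribute PATCH.
# VALUE is inserted as a raw JSON literal (true, false, or a number).
bios_patch_body() {
    printf '{ "Attributes": { "%s": %s } }\n' "$1" "$2"
}

# set_bios_attr NAME VALUE: send the PATCH to the BMC's Redfish server.
# Hypothetical wrapper; BMC_IP and BMC_PASSWORD are assumed to be exported.
set_bios_attr() {
    curl -sku "root:${BMC_PASSWORD}" -H 'content-type: application/json' \
        -d "$(bios_patch_body "$1" "$2")" \
        -X PATCH "https://${BMC_IP}/redfish/v1/Systems/Bluefield/Bios/Settings"
}
```

With the variables set, `set_bios_attr DisableHEST true` issues the same request as step 3. The new value still takes effect only after BlueField-3 is restarted.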

Memory Errors

BlueField-3 has RAS support for memory errors which can be one of these types:

  • Single-bit errors – Correctable errors (CE) which are non-fatal and are corrected by hardware

  • Double-bit errors which could be:

    • Non-fatal errors – Uncorrectable recoverable errors which do not interrupt services

    • Fatal errors – Uncorrectable errors which cause the system to abort or reboot

Processor Errors

BlueField-3 supports RAS for processor errors which can be one of the following:

  • Correctable errors

  • Uncorrectable non-fatal errors

  • Uncorrectable fatal errors

Error Injection

BlueField-3 error handling may be tested by injecting errors using the following methods:

  • Error injection (EINJ) ACPI table (for memory and processor errors)

  • ras-tools (for memory and processor errors)

    Note

    ras-tools has been verified to run with Anolis only. Use it with other OSs at your own discretion.

EINJ Table

The EINJ table is the standard way of injecting errors on BlueField-3 Linux distributions.

Info

For more information on EINJ, refer to section 18.6.1 of the ACPI specification.

Error Injection Commands

In Linux, the einj driver creates a sysfs interface for injecting errors. To load einj on BlueField-3:

# modprobe einj

The EINJ sysfs is:

# cd /sys/kernel/debug/apei/einj

The directory contains the following files:

[root@localhost ~]# cd /sys/kernel/debug/apei/einj/
[root@localhost einj]# ls
available_error_type  error_type  notrigger  param2  param4
error_inject          flags       param1     param3

Where:

Parameter

Definition

available_error_type

Contains the available error types:

  • 0x00000001 – Processor correctable error

  • 0x00000002 – Processor uncorrectable non-fatal error

  • 0x00000004 – Processor uncorrectable fatal error

  • 0x00000008 – Memory correctable error

  • 0x00000010 – Memory uncorrectable non-fatal error

  • 0x00000020 – Memory uncorrectable fatal error

error_type

The type of error to inject (according to available_error_type)

param1

Physical address to inject an error to

Info

Relevant to memory error injection only.

param2

Physical address mask of param1

Info

Relevant to memory error injection only.

param3

The number of the core to inject an error to

Info

Relevant to processor error injection only.

error_inject

Set to 1 to inject error

notrigger

Set to 0 by default. Setting it to 1 stops the error from triggering automatically. The user must then trigger the error with a simple access to the CPU, memory location, or device that is the target of the error injection.
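As a quick reference when scripting against these files, the error-type codes listed above can be translated to their descriptions. The helper name einj_type_name is illustrative only. The mapping is copied from the available_error_type list in this section, and the actual set of types exposed on a given system should still be read from that file.

```shell
#!/bin/sh
# einj_type_name CODE: print the description of an EINJ error-type code.
# Accepts hex (0x8, 0x00000008) or decimal (8) input.
einj_type_name() {
    # Shell arithmetic normalizes the input to decimal.
    case $(( $1 )) in
        1)  echo "Processor correctable error" ;;
        2)  echo "Processor uncorrectable non-fatal error" ;;
        4)  echo "Processor uncorrectable fatal error" ;;
        8)  echo "Memory correctable error" ;;
        16) echo "Memory uncorrectable non-fatal error" ;;
        32) echo "Memory uncorrectable fatal error" ;;
        *)  echo "Unknown error type: $1" >&2; return 1 ;;
    esac
}

einj_type_name 0x00000010   # prints: Memory uncorrectable non-fatal error
```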


Injecting Memory Errors

To inject memory errors:

  1. Specify the physical address you want to inject the error to in param1 and its address mask in param2:

    # echo 0x400000000 > param1
    # echo 0xfffffffffffff000 > param2

    Info

    The valid address range depends on the amount of memory available.

  2. Specify the error type to inject. To inject a memory correctable error:

    # echo 0x8 > error_type
    # echo 1 > error_inject
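The memory injection steps can be collected into one function, sketched below. The name inject_memory_error and the EINJ_DIR override are assumptions made for illustration. Pointing EINJ_DIR at a scratch directory allows a dry run that only writes regular files. The default mask is the one used in step 1.

```shell
#!/bin/sh
# EINJ_DIR defaults to the path documented above; override it for a dry run.
EINJ_DIR="${EINJ_DIR:-/sys/kernel/debug/apei/einj}"

# inject_memory_error TYPE ADDR [MASK]
#   TYPE: 0x8, 0x10 or 0x20 (memory error types from available_error_type)
#   ADDR: physical address to inject the error to
#   MASK: physical address mask (defaults to the value used in step 1)
inject_memory_error() {
    etype=$1 addr=$2 mask=${3:-0xfffffffffffff000}
    [ -d "$EINJ_DIR" ] || {
        echo "EINJ directory not found; is the einj driver loaded?" >&2
        return 1
    }
    echo "$addr"  > "$EINJ_DIR/param1" &&
    echo "$mask"  > "$EINJ_DIR/param2" &&
    echo "$etype" > "$EINJ_DIR/error_type" &&
    echo 1        > "$EINJ_DIR/error_inject"   # written last: triggers the injection
}
```

Run as root with einj loaded, `inject_memory_error 0x8 0x400000000` performs the same writes as the two steps above.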

Expected Memory Error Report Format

Correctable Memory Error

When a correctable memory error is injected, the BlueField console displays output similar to the following:

# dmesg
ERROR: BL31: MSS1 C0 Single bit ECC error detected. IRQ 91
[ 234.638586] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 234.638588] {1}[Hardware Error]: event severity: corrected
[ 234.638590] {1}[Hardware Error]: Error 0, type: corrected
[ 234.638591] {1}[Hardware Error]: section_type: memory error
[ 234.638592] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 234.638594] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 234.638595] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 234.638598] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 480
[ 234.638599] {1}[Hardware Error]: error_type: 2, single-bit ECC
[ 234.638617] EDAC MC0: 1 CE Single-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 page:0x0 offset:0x80 grain:-2814749776710655 syndrome:0x0 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 status(0x0000000000010400): Storage error in DRAM memory)

The RShim log would show a concise message such as the following:

ERR[BL31]: [75512]mss1: C0 single-bit ecc, IRQ[91]


Uncorrectable Non-fatal Memory Error

When an uncorrectable, non-fatal memory error is injected, the BlueField console displays output similar to the following:

ERROR: BL31: MSS1 C0 Double bit ECC error detected. IR
[ 219.784317] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 219.795253] {1}[Hardware Error]: event severity: recoverable
[ 219.800897] {1}[Hardware Error]: Error 0, type: recoverable
[ 219.806540] {1}[Hardware Error]: section_type: memory error
[ 219.812269] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 219.818518] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 219.825114] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 219.832146] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 219.841952] {1}[Hardware Error]: error_type: 3, multi-bit ECC
[ 219.847870] EDAC MC0: 1 UE Multi-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:0 page:0x0 offset:0x80 grain:-281474976710655 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:0 status(0x0000000000010400): Storage error in DRAM memory)
[ 219.847875] [Firmware Warn]: GHES: Invalid address in generic error data: 0x80

If the non-fatal error targets userspace memory, it is recovered by the OS. If the non-fatal error targets kernel memory, it is marked as fatal by the OS.

Uncorrectable Fatal Memory Error

When an uncorrectable, fatal memory error is injected, the BlueField console displays output similar to the following:

root@localhost:~[ 79.351190] EINJ: Error INJection is initialized.
ERROR: BL31: MSS1 C0 Double bit ECC error detected. IRQ 93
[ 79.636470] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 79.636476] {1}[Hardware Error]: event severity: fatal
[ 79.636478] {1}[Hardware Error]: Error 0, type: fatal
[ 79.636481] {1}[Hardware Error]: section_type: memory error
[ 79.636482] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 79.636483] {1}[Hardware Error]: physical_address: 0x000000016cb48000
[ 79.636484] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 79.636489] {1}[Hardware Error]: module: 0 rank: 0 bank: 1539 bank_group: 6 row: 15149 column: 2 bit_position: 0
[ 79.636491] {1}[Hardware Error]: error_type: 14, scrub uncorrected error
[ 79.636495] Kernel panic - not syncing: Fatal hardware error!
[ 79.636499] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE 5.15.0-1053.55.24.g9cc17fe-bluefield #g9cc17fe
[ 79.636502] Hardware name: https://www.mellanox.com BlueField-3 SmartNIC Main Card/BlueField-3 SmartNIC Main Card, BIOS 4.9.1.13411 Nov 18 2024
[ 79.636505] Call trace:
[ 79.636505] dump_backtrace+0x0/0x200
[ 79.636515] show_stack+0x20/0x2c
[ 79.636517] dump_stack_lvl+0x68/0x84
[ 79.636524] dump_stack+0x18/0x34
[ 79.636527] panic+0x1b0/0x3a8
[ 79.636529] __raw_spin_lock_irqsave.constprop.0+0x0/0xcc
[ 79.636533] ghes_in_nmi_queue_one_entry+0x204/0x300
[ 79.636535] ghes_sdei_critical_callback+0x58/0xd0
[ 79.636536] sdei_event_handler+0x28/0x90
[ 79.636541] do_sdei_event+0xa4/0x180
[ 79.636544] __sdei_handler+0x5c/0xa0
[ 79.636548] __sdei_asm_handler+0xe8/0x19c
[ 79.636551] cpu_do_idle+0x14/0x74
[ 79.636553] default_idle_call+0x44/0x150
[ 79.636555] cpuidle_idle_call+0x174/0x200
[ 79.636559] do_idle+0xac/0x100
[ 79.636561] cpu_startup_entry+0x30/0x70
[ 79.636563] rest_init+0xec/0x120
[ 79.636565] arch_call_rest_init+0x18/0x24
[ 79.636569] start_kernel+0x4b4/0x4ec
[ 79.636570] __primary_switched+0xbc/0xc4
[ 79.636573] SMP: stopping secondary CPUs
[ 79.636590] Kernel Offset: 0x20fb357f0000 from 0xffff800008000000
[ 79.636592] PHYS_OFFSET: 0x80000000
[ 79.636592] CPU features: 0x0,000005c1,a3332e5a
[ 79.636595] Memory Limit: none
[ 80.752101] pstore: crypto_comp_compress failed, ret = -22!
Nvidia BlueField-3 rev1 BL1 V1.0
INFO: psc supervisor init.
INFO: psc_irq_init...
INFO: force_crs_enable=0 pcr.lock0 = 1, time = 60105
INFO: enter idle task.
NOTICE: Running as 9009D3D400ENHAA system

A double-bit error should abort or reboot the system, depending on the OS. In this output, Ubuntu (kernel 5.15) reboots BlueField. Anolis only aborts and does not reboot.

The RShim log should also indicate that a double-bit ECC error has occurred.

Injecting Processor Errors

To inject processor errors:

  1. Specify the core you would like to inject your error to in param3:

    # echo 7 > param3

    To list all core IDs available, use cat /proc/cpuinfo.

  2. Specify the error type to inject. To inject a processor correctable error:

    # echo 0x1 > error_type
    # echo 1 > error_inject
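Before writing param3, it can help to confirm that the chosen core ID actually exists. The following sketch extracts the IDs from /proc/cpuinfo, as suggested in step 1. The function names list_core_ids and is_valid_core are illustrative, and the optional file argument exists only so the parsing can be tried on saved output.

```shell
#!/bin/sh
# list_core_ids [FILE]: print one core ID per line from cpuinfo-style input.
list_core_ids() {
    # Lines of interest look like "processor<TAB>: 7".
    awk -F': *' '/^processor[ \t]*:/ { print $2 }' "${1:-/proc/cpuinfo}"
}

# is_valid_core ID [FILE]: succeed only if ID appears in the list.
is_valid_core() {
    list_core_ids "${2:-/proc/cpuinfo}" | grep -qx "$1"
}
```

For example, `is_valid_core 7 && echo 7 > param3` writes the core number only when core 7 is present.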

Expected Processor Error Report Format

Correctable Processor Error

When a correctable processor error is injected, the BlueField console displays output similar to the following:

# dmesg
[ 4098.340991] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 4098.340995] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 4098.340996] {1}[Hardware Error]: event severity: corrected
[ 4098.340998] {1}[Hardware Error]: Error 0, type: corrected
[ 4098.340999] {1}[Hardware Error]: section_type: ARM processor error
[ 4098.341001] {1}[Hardware Error]: MIDR: 0x00000000410fd421
[ 4098.341002] {1}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081010000
[ 4098.341003] {1}[Hardware Error]: running state: 0x1
[ 4098.341004] {1}[Hardware Error]: Power State Coordination Interface state: 0
[ 4098.341005] {1}[Hardware Error]: Error info structure 0:
[ 4098.341005] {1}[Hardware Error]: num errors: 2
[ 4098.341006] {1}[Hardware Error]: error_type: 0, cache error


Uncorrectable Non-fatal Processor Error

When an uncorrectable, non-fatal processor error is injected, the BlueField console displays output similar to the following:

[ 5008.509118] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 5008.520195] {2}[Hardware Error]: event severity: recoverable
[ 5008.525837] {2}[Hardware Error]: Error 0, type: recoverable
[ 5008.531478] {2}[Hardware Error]: section_type: ARM processor error
[ 5008.537813] {2}[Hardware Error]: MIDR: 0x00000000410fd421
[ 5008.543366] {2}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081070100
[ 5008.552042] {2}[Hardware Error]: running state: 0x1
[ 5008.557074] {2}[Hardware Error]: Power State Coordination Interface state: 0
[ 5008.564275] {2}[Hardware Error]: Error info structure 0:
[ 5008.569741] {2}[Hardware Error]: num errors: 2
[ 5008.574340] {2}[Hardware Error]: error_type: 0, cache error


Uncorrectable Fatal Processor Error

When an uncorrectable, fatal processor error is injected, the BlueField console displays output similar to the following:

[ 5008.509118] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 5008.520195] {2}[Hardware Error]: event severity: fatal
[ 5008.525837] {2}[Hardware Error]:  Error 0, type: fatal
[ 5008.531478] {2}[Hardware Error]:   section_type: ARM processor error
[ 5008.537813] {2}[Hardware Error]:   MIDR: 0x00000000410fd421
[ 5008.543366] {2}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081070100
[ 5008.552042] {2}[Hardware Error]:   running state: 0x1
[ 5008.557074] {2}[Hardware Error]:   Power State Coordination Interface state: 0
[ 5008.564275] {2}[Hardware Error]:   Error info structure 0:
[ 5008.569741] {2}[Hardware Error]:   num errors: 2
[ 5008.574340] {2}[Hardware Error]:    error_type: 0, cache error

An uncorrectable fatal error should abort and reboot the system. In Anolis, the system aborts and reboots after a few seconds.
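The MPIDR value in the processor error reports identifies the reporting core by its affinity fields. The sketch below splits a value into those fields using the bit positions defined by the Arm architecture (Aff0 in bits 7:0, Aff1 in bits 15:8, Aff2 in bits 23:16, Aff3 in bits 39:32). The function name mpidr_affinity is illustrative. How the affinity fields correspond to the core number written to param3 is platform-specific and is not stated in this document.

```shell
#!/bin/sh
# mpidr_affinity VALUE: print the Aff3..Aff0 fields of an MPIDR value.
mpidr_affinity() {
    v=$(( $1 ))
    printf 'Aff3=%d Aff2=%d Aff1=%d Aff0=%d\n' \
        $(( (v >> 32) & 0xff )) $(( (v >> 16) & 0xff )) \
        $(( (v >> 8) & 0xff ))  $(( v & 0xff ))
}

# MPIDR from the uncorrectable error reports above.
mpidr_affinity 0x0000000081070100   # prints: Aff3=0 Aff2=7 Aff1=1 Aff0=0
```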

Ras-tools

Note

ras-tools has been verified for Anolis OS only. Use it with other OSs at your own discretion.

  1. Load the einj driver:

    # modprobe einj

  2. Install ras-tools:

    # yum install ras-tools

Setting Correctable Errors Thresholds

By default, the correctable errors threshold (CeThreshold) value is 5000 (i.e., a CE is only reported to the user at the 5000th error). This means that the user does not see any CE indication in the RShim log or the Arm console as long as the number of errors injected is not a multiple of 5000.
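The reporting rule described above can be modeled with simple modular arithmetic. This is only an illustration of the stated behavior, not firmware code, and the name ce_is_reported is hypothetical.

```shell
#!/bin/sh
# ce_is_reported COUNT THRESHOLD: succeed if the COUNT-th correctable error
# falls on a multiple of THRESHOLD and would therefore be reported.
ce_is_reported() {
    [ "$1" -gt 0 ] && [ $(( $1 % $2 )) -eq 0 ]
}

# With a threshold of 5, only the 5th and 10th of 12 errors are reported.
n=1
while [ "$n" -le 12 ]; do
    if ce_is_reported "$n" 5; then echo "CE #$n reported"; fi
    n=$((n + 1))
done
# prints:
#   CE #5 reported
#   CE #10 reported
```

Lowering CeThreshold, as in the procedure that follows, makes injected correctable errors visible after far fewer injections.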

The following procedure demonstrates how CeThreshold can be modified using Redfish:

  1. Check current BIOS settings:

    curl -k -u root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios

    Example output:

    "Attributes": {
      "Boot Partition Protection": false,
      "CeThreshold": 3,
      "CurrentUefiPassword": "",
      "DateTime": "2024-06-26T15:13:45Z",
      "DefaultPasswordPolicy": true,
      "Disable PCIe": false,
      "Disable SPMI": false,
      "Disable TMFF": false,
      "EmmcWipe": false,
      "Enable 2nd eMMC": false,
      "Enable OP-TEE": false,
      "Enable SMMU": true,
      "Field Mode": false,
      "Host Privilege Level": "Privileged",
      "Internal CPU Model": "Embedded",
      "LegacyPasswordEnable": false,
      "NicMode": "DpuMode",
      "NvmeWipe": false,
      "OsArgs": "",
      "ResetEfiVars": false,
      "SPCR UART": "Disabled",
      "UefiArgs": "",
      "UefiPassword": ""
    },

  2. Change CeThreshold to 5:

    curl -k -u root:'<password>' -H 'content-type: application/json' -d '{ "Attributes": { "CeThreshold":5 } }' -X PATCH https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios/Settings

  3. Check the new attribute settings:

    curl -k -u root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios

    Example output:

    {
      "@odata.id": "/redfish/v1/Systems/Bluefield/Bios/Settings",
      "@odata.type": "#Bios.v1_2_0.Bios",
      "Attributes": {
        "CeThreshold": 5
      },
      "Description": "BIOS Settings",
      "Id": "BIOS_Settings",
      "Name": "BIOS Configuration"
    }

  4. Reboot Arm. For example:

    dpu> echo "SW_RESET 1" > /dev/rshim0/misc

  5. Check that CeThreshold has been updated in the BIOS:

    curl -k -u root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios

    Example output:

    "Attributes": {
      "Boot Partition Protection": false,
      "CeThreshold": 5,
      "CurrentUefiPassword": "",
      "DateTime": "2024-06-26T15:13:45Z",
      "DefaultPasswordPolicy": true,
      "Disable PCIe": false,
      "Disable SPMI": false,
      "Disable TMFF": false,
      "EmmcWipe": false,
      "Enable 2nd eMMC": false,
      "Enable OP-TEE": false,
      "Enable SMMU": true,
      "Field Mode": false,
      "Host Privilege Level": "Privileged",
      "Internal CPU Model": "Embedded",
      "LegacyPasswordEnable": false,
      "NicMode": "DpuMode",
      "NvmeWipe": false,
      "OsArgs": "",
      "ResetEfiVars": false,
      "SPCR UART": "Disabled",
      "UefiArgs": "",
      "UefiPassword": ""
    },

BlueField Arm only supports memory and CPU error injection and handling.

BlueField-3 supports the handling of the following NIC firmware errors:

  • PCIe errors

    Note

    Currently, error injection is not supported.

  • NIC RAM errors

    Note

    Currently, error injection is not supported.

  • Network errors

NIC Network Errors

CRC errors can be forced during traffic by using the mlxreg command to write the PTER (Port Transmit Errors Register) access register.

Info

See section 30.4.11, "PTER - Port Transmit Errors Register" in the NVIDIA Adapters Programmer's Reference Manual.

{
  "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17",
  "@odata.type": "#LogEntry.v1_15_0.LogEntry",
  "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17/attachment",
  "CPER": {
    "NotificationType": "00000000-0000-0000-0000-000000000000",
    "Oem": {
      "Nvidia": {
        "@odata.type": "#NvidiaCPER.v1_0_0.NvidiaCPER",
        "Nvidia": {
          "ErrorInstance": 0,
          "ErrorType": 2,
          "InstanceBase": 0,
          "RegisterCount": 3,
          "Registers": [
            { "Address": 1, "Value": <icrc errors> },
            { "Address": 2, "Value": <vcrc errors> },
            { "Address": 3, "Value": <fc/eth crc errors> }
          ],
          "Severity": {
            "Code": 0,
            "Name": <severity string>
          },
          "Signature": "NBU",
          "Socket": 0
        }
      }
    },
    "SectionType": "6d5244f2-2712-11ec-bea7-cb3fdb95c786"
  },
  "Created": "2024-11-19T18:37:28+00:00",
  "DiagnosticDataType": "CPERSection",
  "EntryType": "Event",
  "Id": "17",
  "Message": "A platform error occurred.",
  "MessageArgs": [],
  "MessageId": "Platform.1.0.PlatformError",
  "Name": "System Event Log Entry",
  "Resolution": "Check additional diagnostic data if available.",
  "Resolved": false,
  "Severity": "Critical"
},


Expected NIC Error Report Format

If a NIC hardware error occurs, the report is only sent to the BMC. See section "BMC SEL" for more details.

The report would look similar to the following:

{
  "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17",
  "@odata.type": "#LogEntry.v1_15_0.LogEntry",
  "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17/attachment",
  "CPER": {
    "NotificationType": "00000000-0000-0000-0000-000000000000",
    "Oem": {
      "Nvidia": {
        "@odata.type": "#NvidiaCPER.v1_0_0.NvidiaCPER",
        "Nvidia": {
          "ErrorInstance": 0,
          "ErrorType": 1,
          "InstanceBase": 0,
          "RegisterCount": 2,
          "Registers": [
            { "Address": 1, "Value": <reg address> },
            { "Address": 2, "Value": <value> }
          ],
          "Severity": {
            "Code": <severity value>,
            "Name": <severity string>
          },
          "Signature": "NBU",
          "Socket": 0
        }
      }
    },
    "SectionType": "6d5244f2-2712-11ec-bea7-cb3fdb95c786"
  },
  "Created": "2024-11-19T18:37:28+00:00",
  "DiagnosticDataType": "CPERSection",
  "EntryType": "Event",
  "Id": "17",
  "Message": "A platform error occurred.",
  "MessageArgs": [],
  "MessageId": "Platform.1.0.PlatformError",
  "Name": "System Event Log Entry",
  "Resolution": "Check additional diagnostic data if available.",
  "Resolved": false,
  "Severity": "Critical"
},


Querying the SEL log can be done using IPMI or Redfish:

  • Redfish

    curl -k -u root:'<password>' -X GET https://<bmc-ip>/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries

    in "Members":

    {
      "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/694",
      "@odata.type": "#LogEntry.v1_15_0.LogEntry",
      "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/694/attachment",
      "Created": "2024-11-20T16:36:41+00:00",
      "EntryType": "Event",
      "Id": "694",
      "Message": "SEL event for single bit ECC",
      "Modified": "2024-11-20T16:36:41+00:00",
      "Name": "System Event Log Entry",
      "Resolved": false,
      "Severity": "OK"
    },

    {
      "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/696",
      "@odata.type": "#LogEntry.v1_15_0.LogEntry",
      "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/696/attachment",
      "Created": "2024-11-20T16:41:54+00:00",
      "EntryType": "Event",
      "Id": "696",
      "Message": "SEL event for multi bit ECC",
      "Modified": "2024-11-20T16:41:54+00:00",
      "Name": "System Event Log Entry",
      "Resolved": false,
      "Severity": "OK"
    },

  • IPMI

    # ipmitool sel list

    Example output:

    2b6 | 11/20/24 | 16:36:41 UTC | Memory #0x19 | Correctable ECC | Asserted
    2b8 | 11/20/24 | 16:41:54 UTC | Memory #0x19 | Uncorrectable ECC | Asserted
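When the SEL contains many entries, the ECC events can be picked out with a short filter. The sketch below assumes the pipe-separated column layout shown in the example output above. The function name sel_ecc_summary is illustrative.

```shell
#!/bin/sh
# sel_ecc_summary: read `ipmitool sel list` output on stdin and print
# "<date> <time> <event>" for every entry that mentions ECC.
sel_ecc_summary() {
    # Fields are separated by "|" with optional surrounding spaces:
    # id | date | time | sensor | event | state
    awk -F' *[|] *' '/ECC/ { print $2, $3, $5 }'
}
```

Typical usage is `ipmitool sel list | sel_ecc_summary`. For the two entries shown above, this prints the date and time followed by "Correctable ECC" and "Uncorrectable ECC" respectively.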

The bftraining_results script prints the training parameter results with results reported separately for each memory channel.

Example output:

dpu# bftraining_results
Memory controller 0 Channel 0
Read Data worst timing margin 23
Write Data worst timing margin 21
Read Data worst Vref margin 43
Write Data worst Vref margin 39
TX CS worst timing margin 58
TX CA worst timing margin 55

Memory controller 0 Channel 1
Unpopulated

Memory controller 1 Channel 0
Read Data worst timing margin 23
Write Data worst timing margin 22
Read Data worst Vref margin 41
Write Data worst Vref margin 40
TX CS worst timing margin 59
TX CA worst timing margin 55

Memory controller 1 Channel 1
Unpopulated

© Copyright 2025, NVIDIA. Last updated on Mar 9, 2025.