RAS - NVIDIA Docs

Note

RAS is supported on NVIDIA® BlueField®-3 and higher only.

Reliability, availability, and serviceability (RAS) features reduce unplanned outages in the following ways:

Errors can be detected and corrected by the hardware before they cause system failures.
Errors can be detected by hardware and reported to software. Software can then act by identifying failing components for replacement.
Errors can be predicted ahead of time to allow replacement of defective hardware.

SDEI NMI Watchdog

Note

The SDEI NMI watchdog is supported only on the openEuler and Anolis Linux distributions.

The software-delegated exception interface (SDEI) non-maskable interrupt (NMI) watchdog offers a way to detect Linux kernel hard lockups (e.g., a hang in an interrupt handler). It uses a high-priority, per-core secure timer as a keepalive mechanism and prints the call stack when the Linux kernel is stuck.

Info

The SDEI NMI watchdog is an implementation of the "Software watchdog timer" described in section 2.1.1, "Typical use cases", of the Software Delegated Exception Interface (SDEI) Platform Design Document.

Boot Error Record Table

The boot error record table (BERT) is used to report errors that occurred and were unhandled in a previous boot. On the subsequent boot, the OS reports the error using the common platform error record (CPER) format.

Info

For more information on BERT, refer to section 18.3 of the ACPI specification.

Info

For more information on CPER, refer to appendix N of the UEFI specification.

In BlueField-3, BERT reports all RShim log messages. For example, ASSERTs which occurred in the UEFI generate a Linux error report in the subsequent boot.

Example of a boot where ASSERTs occurred:

            INFO[BL2]: start
 INFO[BL2]: DDR POST passed
 INFO[BL2]: UEFI loaded
 INFO[BL31]: start
 INFO[BL31]: lifecycle GA Secured
 INFO[BL31]: runtime
 INFO[UEFI]: UPVS valid
 WARN[UEFI]: UPVS full
 WARN[UEFI]: UPVS reclaim
 WARN[UEFI]: Var reclaim
 WARN[UEFI]: Var reclaim done
 INFO[UEFI]: eMMC init
 INFO[UEFI]: eMMC probed
 INFO[UEFI]: PMI: updates started
 INFO[UEFI]: PMI: total updates: 1
 INFO[UEFI]: PMI: updates completed, status 0
 ASSERT[UEFI]: MdePkg/Library/BaseLib/String.c: 173
 ASSERT[UEFI]: PC=0xF56706B8
 ASSERT[UEFI]: PC=0xF5668A2C
 ASSERT[UEFI]: PC=0xF567E414
 ASSERT[UEFI]: PC=0xFDE33DA4
 ASSERT[UEFI]: PC=0x88009338
 ASSERT[UEFI]: PC=0x880094D4
 INFO[UEFI]: PCIe enum start
 INFO[UEFI]: PCIe enum end
 INFO[UEFI]: exit Boot Service
 INFO[MISC]: Found bf.cfg
 INFO[MISC]: Ubuntu installation started
 INFO[MISC]: Installing OS image
 INFO[MISC]: Changing the default password for user ubuntu
 INFO[MISC]: Running bfb_modify_os from bf.cfg
 INFO[MISC]: ===================== bfb_modify_os =====================
 INFO[MISC]: Installation finished

Example output from the next boot:

            [    0.736968] [Hardware Error]: event severity: fatal
[    0.736971] [Hardware Error]:  Error 0, type: fatal
[    0.736976] [Hardware Error]:   section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[    0.736978] [Hardware Error]:   section length: 0x33
[    0.736983] [Hardware Error]:   00000000: 53534120 5b545245 49464555 4d203a5d   ASSERT[UEFI]: M
[    0.736986] [Hardware Error]:   00000010: 6b506564 694c2f67 72617262 61422f79  dePkg/Library/Ba
[    0.736989] [Hardware Error]:   00000020: 694c6573 74532f62 676e6972 203a632e  seLib/String.c: 
[    0.736991] [Hardware Error]:   00000030: 31 37 33                                         173
[    0.736993] [Hardware Error]:  Error 1, type: fatal
[    0.736995] [Hardware Error]:   section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[    0.736996] [Hardware Error]:   section length: 0x1c
[    0.737000] [Hardware Error]:   00000000: 53534120 5b545245 49464555 50203a5d   ASSERT[UEFI]: P
[    0.737002] [Hardware Error]:   00000010: 78303d43 37363546 38423630           C=0xF56706B8
[    0.737004] [Hardware Error]:  Error 2, type: fatal
[    0.737005] [Hardware Error]:   section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[    0.737007] [Hardware Error]:   section length: 0x1c
[    0.737009] [Hardware Error]:   00000000: 53534120 5b545245 49464555 50203a5d   ASSERT[UEFI]: P
[    0.737012] [Hardware Error]:   00000010: 78303d43 36363546 43324138           C=0xF5668A2C
[    0.737013] [Hardware Error]:  Error 3, type: fatal
[    0.737015] [Hardware Error]:   section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[    0.737016] [Hardware Error]:   section length: 0x1c
[    0.737019] [Hardware Error]:   00000000: 53534120 5b545245 49464555 50203a5d   ASSERT[UEFI]: P
[    0.737021] [Hardware Error]:   00000010: 78303d43 37363546 34313445           C=0xF567E414
[    0.737022] [Hardware Error]:  Error 4, type: fatal
[    0.737024] [Hardware Error]:   section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[    0.737025] [Hardware Error]:   section length: 0x1c
[    0.737028] [Hardware Error]:   00000000: 53534120 5b545245 49464555 50203a5d   ASSERT[UEFI]: P
[    0.737030] [Hardware Error]:   00000010: 78303d43 33454446 34414433           C=0xFDE33DA4
[    0.737031] [Hardware Error]:  Error 5, type: fatal
[    0.737033] [Hardware Error]:   section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[    0.737035] [Hardware Error]:   section length: 0x1c
[    0.737037] [Hardware Error]:   00000000: 53534120 5b545245 49464555 50203a5d   ASSERT[UEFI]: P
[    0.737040] [Hardware Error]:   00000010: 78303d43 30303838 38333339           C=0x88009338
[    0.737041] [Hardware Error]:  Error 6, type: fatal
[    0.737043] [Hardware Error]:   section type: unknown, c6adf9e6-1108-4760-8827-003d059fe2e1
[    0.737044] [Hardware Error]:   section length: 0x1c
[    0.737060] [Hardware Error]:   00000000: 53534120 5b545245 49464555 50203a5d   ASSERT[UEFI]: P
[    0.737062] [Hardware Error]:   00000010: 78303d43 30303838 34443439           C=0x880094D4
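
The hex columns in the dump above can be turned back into the original RShim message text. The following is a minimal sketch, assuming (as the ASCII column of the dump suggests) that 8-digit tokens are 32-bit words printed in little-endian order and that 2-digit tokens are single bytes. The function name `decode_dump_row` is hypothetical and not part of any NVIDIA or kernel tool.

```python
def decode_dump_row(hex_tokens):
    """Decode one row of the hex dump into ASCII text.

    8-digit tokens are treated as little-endian 32-bit words;
    2-digit tokens are treated as single bytes.
    """
    raw = bytearray()
    for token in hex_tokens.split():
        if len(token) == 8:
            raw += int(token, 16).to_bytes(4, "little")
        elif len(token) == 2:
            raw.append(int(token, 16))
        else:
            raise ValueError(f"unexpected token width: {token!r}")
    return raw.decode("ascii", errors="replace")


# The two rows of "Error 1" from the example output above.
rows = [
    "53534120 5b545245 49464555 50203a5d",
    "78303d43 37363546 38423630",
]
message = "".join(decode_dump_row(row) for row in rows)
print(repr(message), hex(len(message)))
```

The decoded length matches the `section length: 0x1c` reported for that record, which is a quick sanity check that the rows were copied completely.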

Hardware Error Source Table

On BlueField-3, the hardware error source table (HEST) is used for reporting errors to the OS. The error injection (EINJ) table is used for injecting errors.

Info

For more information on HEST, refer to section 18.3.2 of the ACPI specification.

By default, Disable HEST is FALSE, which delegates error handling to the OS. Setting Disable HEST to TRUE, using Redfish or the UEFI menu, shifts error handling to the BlueField Arm firmware, which limits the chance of an error propagating through the system.

Disabling HEST via UEFI Menu

Open two SSH terminal windows to the host:

From Terminal 1, open the OOB console via the RShim console device file. Example:

            $ sudo minicom --color=on -D /dev/rshim0/console 115200

From Terminal 2, reset Arm from the host:

            $ sudo sh -c "echo 'SW_RESET 1' > /dev/rshim0/misc"

To enter the UEFI menu, press ESC in the OOB terminal when the UEFI boot screen appears.
Navigate to Device Manager > System Configuration, which contains the Disable HEST setting.
Move down to the Disable HEST setting and press Space to toggle its value to enabled.
Press F10, then Y, to save the new setting.
UEFI reboots to apply the new setting.

Disabling HEST via Redfish

DisableHEST can be configured via the Redfish REST interface with the REST server running on the BMC. The process involves using curl commands from the host to send REST commands to the BMC's Redfish server.

Run the following steps on the host terminal:

Install the prerequisite software:

            $ sudo yum install curl   # or sudo apt install curl
$ sudo yum install jq     # or sudo apt install jq

Check the current DisableHEST value:

            $ curl -sku root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios | jq '.Attributes|{DisableHEST}'

Example output:

            {
  "DisableHEST": false
}

Set DisableHEST to true:

            $ curl -sku root:'<password>' -H 'content-type: application/json' -d '{ "Attributes": { "DisableHEST": true } }' -X PATCH https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios/Settings | jq '."@Message.ExtendedInfo"[0]|{Message}'

Restart BlueField-3 to apply the pending configuration changes:

            $ sudo sh -c "echo 'SW_RESET 1' > /dev/rshim0/misc"
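
For scripted use, the PATCH request shown in the steps above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library. It builds the request without sending it. The helper name `build_bios_patch`, the address `192.0.2.10`, and the password string are placeholders for illustration, not values from this document.

```python
import base64
import json
import urllib.request


def build_bios_patch(bmc_ip, user, password, attributes):
    """Build (but do not send) a Redfish PATCH request for pending BIOS settings."""
    url = f"https://{bmc_ip}/redfish/v1/Systems/Bluefield/Bios/Settings"
    body = json.dumps({"Attributes": attributes}).encode("utf-8")
    # HTTP basic authentication, equivalent to curl's -u user:password.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PATCH")


# Placeholder BMC address and credentials (assumptions for this sketch).
request = build_bios_patch("192.0.2.10", "root", "example-password", {"DisableHEST": True})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose: the curl examples use `-k` to skip certificate verification, and how to handle the BMC certificate is a deployment decision.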

Arm Errors

Memory Errors

BlueField-3 has RAS support for memory errors which can be one of these types:

Single-bit errors – Correctable errors (CE) which are non-fatal and are corrected by hardware
Double-bit errors which could be:
- Non-fatal errors – Uncorrectable recoverable errors which do not interrupt services
- Fatal errors – Uncorrectable errors which cause the system to abort or reboot

Processor Errors

BlueField-3 supports RAS for processor errors which can be either:

Correctable errors
Uncorrectable non-fatal errors
Uncorrectable fatal errors

Error Injection

BlueField-3 error handling may be tested by injecting errors using the following methods:

Error injection (EINJ) ACPI table (for memory and processor errors)
ras-tools (for memory and processor errors)

Note

ras-tools has been verified to run with Anolis only. Use it with other OSs at your own discretion.

EINJ Table

The EINJ table is the standard way of injecting errors on Linux distributions running on BlueField-3.

Info

For more information on EINJ, refer to section 18.6.1 of the ACPI specification.

Error Injection Commands

In Linux, the einj driver creates a set of control files (under debugfs) for injecting errors. To load einj on BlueField-3:

            # modprobe einj

The EINJ control files are located in the following directory:

            # cd /sys/kernel/debug/apei/einj

The directory contains the following files:

            [root@localhost ~]# cd /sys/kernel/debug/apei/einj/
[root@localhost einj]# ls
available_error_type  error_type  notrigger  param2  param4
error_inject          flags       param1     param3

Where:

Parameter	Definition
`available_error_type`	Contains the available error types: `0x00000001` – Processor correctable error; `0x00000002` – Processor uncorrectable non-fatal error; `0x00000004` – Processor uncorrectable fatal error; `0x00000008` – Memory correctable error; `0x00000010` – Memory uncorrectable non-fatal error; `0x00000020` – Memory uncorrectable fatal error
`error_type`	The type of error to inject (according to `available_error_type`)
`param1`	Physical address to inject an error to (relevant to memory error injection only)
`param2`	Physical address mask of `param1` (relevant to memory error injection only)
`param3`	The number of the core to inject an error to (relevant to processor error injection only)
`error_inject`	Set to `1` to inject the error
`notrigger`	Set to `0` by default. Changing this to `1` stops the error from triggering automatically; the user must then trigger the error with a simple access to the CPU, memory location, or device that is the target of the error injection.
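
The error-type values listed for `available_error_type` are single-bit flags, so a combined value can be broken down with a bitwise test. The following sketch maps the six values from the table to their descriptions. The names `EINJ_ERROR_TYPES` and `describe_error_types` are hypothetical helpers, not part of the einj driver.

```python
# Error-type values and descriptions as listed in the table above.
EINJ_ERROR_TYPES = {
    0x00000001: "Processor correctable error",
    0x00000002: "Processor uncorrectable non-fatal error",
    0x00000004: "Processor uncorrectable fatal error",
    0x00000008: "Memory correctable error",
    0x00000010: "Memory uncorrectable non-fatal error",
    0x00000020: "Memory uncorrectable fatal error",
}


def describe_error_types(mask):
    """Return the descriptions of every error type whose bit is set in mask."""
    return [name for bit, name in EINJ_ERROR_TYPES.items() if mask & bit]


for name in describe_error_types(0x18):
    print(name)
```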

Injecting Memory Errors

To inject memory errors:

Specify the physical address you want to inject the error to in param1 and its address mask in param2:
```
# echo 0x400000000 > param1
# echo 0xfffffffffffff000 > param2
```
Info

The valid address range depends on the amount of memory available.

Specify the error type to inject. To inject a memory correctable error:

            # echo 0x8 > error_type 
# echo 1 > error_inject
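
The sequence of writes above can be wrapped in a small helper when injections are scripted. This is a minimal sketch that mirrors the order of the `echo` commands (parameters first, then `error_type`, then `error_inject`). The function name `write_einj_request` is hypothetical, and the demonstration writes into a temporary scratch directory rather than the real EINJ directory, so it can be run safely on any machine.

```python
import tempfile
from pathlib import Path


def write_einj_request(einj_dir, error_type, **params):
    """Write EINJ parameters, then the error type, then trigger the injection."""
    base = Path(einj_dir)
    for name, value in params.items():
        (base / name).write_text(f"{value:#x}\n")
    (base / "error_type").write_text(f"{error_type:#x}\n")
    # error_inject is written last, as in the shell commands above.
    (base / "error_inject").write_text("1\n")


# Dry run against a scratch directory (an assumption for this sketch).
with tempfile.TemporaryDirectory() as scratch:
    write_einj_request(scratch, 0x8, param1=0x400000000, param2=0xFFFFFFFFFFFFF000)
    for path in sorted(Path(scratch).iterdir()):
        print(path.name, path.read_text().strip())
```

On a real system, the directory would be `/sys/kernel/debug/apei/einj`, and the script would need the same root privileges as the shell commands.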

Expected Memory Error Report Format

Correctable Memory Error

When a correctable memory error is injected, the BlueField console displays output similar to the following:

            # dmesg
ERROR:   BL31: MSS1 C0 Single bit ECC error detected. IRQ 91
[  234.638586] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[  234.638588] {1}[Hardware Error]: event severity: corrected
[  234.638590] {1}[Hardware Error]:  Error 0, type: corrected
[  234.638591] {1}[Hardware Error]:   section_type: memory error
[  234.638592] {1}[Hardware Error]:   error_status: 0x0000000000010400
[  234.638594] {1}[Hardware Error]:   physical_address: 0x0000000000000080
[  234.638595] {1}[Hardware Error]:   physical_address_mask: 0x0000ffffffffffff
[  234.638598] {1}[Hardware Error]:   module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 480
[  234.638599] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[  234.638617] EDAC MC0: 1 CE Single-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 page:0x0 offset:0x80 grain:-281474976710655 syndrome:0x0 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 status(0x0000000000010400): Storage error in DRAM memory)

The RShim log would show a concise message such as the following:

            ERR[BL31]: [75512]mss1: C0 single-bit ecc, IRQ[91]
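
When many such reports need to be checked, the `key: value` lines of a `[Hardware Error]` record can be collected into a dictionary. The following is a rough sketch based only on the line format visible in the example output above. The regular expression and the name `parse_hardware_error` are assumptions for illustration, and lines that do not fit the simple `key: value` shape (such as `Error 0, type: ...`) are skipped.

```python
import re

# Matches "<prefix>[Hardware Error]:   key: value" lines.
FIELD_RE = re.compile(r"\[Hardware Error\]:\s+(?P<key>[A-Za-z_ ]+?):\s+(?P<value>.+)$")


def parse_hardware_error(lines):
    """Collect key/value fields from the lines of one hardware error record."""
    fields = {}
    for line in lines:
        match = FIELD_RE.search(line)
        if match:
            # Keep the first occurrence of each key.
            fields.setdefault(match.group("key").strip(), match.group("value").strip())
    return fields


sample = [
    "[  234.638588] {1}[Hardware Error]: event severity: corrected",
    "[  234.638590] {1}[Hardware Error]:  Error 0, type: corrected",
    "[  234.638591] {1}[Hardware Error]:   section_type: memory error",
    "[  234.638594] {1}[Hardware Error]:   physical_address: 0x0000000000000080",
    "[  234.638599] {1}[Hardware Error]:   error_type: 2, single-bit ECC",
]
fields = parse_hardware_error(sample)
print(fields["event severity"], fields["section_type"], fields["physical_address"])
```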

Uncorrectable Non-fatal Memory Error

When an uncorrectable, non-fatal memory error is injected, the BlueField console displays output similar to the following:

ERROR:   BL31: MSS1 C0 Double bit ECC error detected. IR
[  219.784317] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[  219.795253] {1}[Hardware Error]: event severity: recoverable
[  219.800897] {1}[Hardware Error]:  Error 0, type: recoverable
[  219.806540] {1}[Hardware Error]:   section_type: memory error
[  219.812269] {1}[Hardware Error]:   error_status: 0x0000000000010400
[  219.818518] {1}[Hardware Error]:   physical_address: 0x0000000000000080
[  219.825114] {1}[Hardware Error]:   physical_address_mask: 0x0000ffffffffffff
[  219.832146] {1}[Hardware Error]:   module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[  219.841952] {1}[Hardware Error]:   error_type: 3, multi-bit ECC
[  219.847870] EDAC MC0: 1 UE Multi-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:0 page:0x0 offset:0x80 grain:-281474976710655 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:0 status(0x0000000000010400): Storage error in DRAM memory)
[  219.847875] [Firmware Warn]: GHES: Invalid address in generic error data: 0x80

If the non-fatal error targets userspace memory, it is recovered by the OS. If the non-fatal error targets kernel memory, it is marked as fatal by the OS.

Uncorrectable Fatal Memory Error

When an uncorrectable, fatal memory error is injected, the BlueField console displays output similar to the following:

            root@localhost:~[   79.351190] EINJ: Error INJection is initialized.
ERROR:   BL31: MSS1 C0 Double bit ECC error detected. IRQ 93
[   79.636470] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   79.636476] {1}[Hardware Error]: event severity: fatal
[   79.636478] {1}[Hardware Error]:  Error 0, type: fatal
[   79.636481] {1}[Hardware Error]:   section_type: memory error
[   79.636482] {1}[Hardware Error]:   error_status: 0x0000000000010400
[   79.636483] {1}[Hardware Error]:   physical_address: 0x000000016cb48000
[   79.636484] {1}[Hardware Error]:   physical_address_mask: 0x0000ffffffffffff
[   79.636489] {1}[Hardware Error]:   module: 0 rank: 0 bank: 1539 bank_group: 6 row: 15149 column: 2 bit_position: 0
[   79.636491] {1}[Hardware Error]:   error_type: 14, scrub uncorrected error
[   79.636495] Kernel panic - not syncing: Fatal hardware error!
[   79.636499] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G     OE     5.15.0-1053.55.24.g9cc17fe-bluefield #g9cc17fe
[   79.636502] Hardware name: https://www.mellanox.com BlueField-3 SmartNIC Main Card/BlueField-3 SmartNIC Main Card, BIOS 4.9.1.13411 Nov 18 2024
[   79.636505] Call trace:
[   79.636505]  dump_backtrace+0x0/0x200
[   79.636515]  show_stack+0x20/0x2c
[   79.636517]  dump_stack_lvl+0x68/0x84
[   79.636524]  dump_stack+0x18/0x34
[   79.636527]  panic+0x1b0/0x3a8
[   79.636529]  __raw_spin_lock_irqsave.constprop.0+0x0/0xcc
[   79.636533]  ghes_in_nmi_queue_one_entry+0x204/0x300
[   79.636535]  ghes_sdei_critical_callback+0x58/0xd0
[   79.636536]  sdei_event_handler+0x28/0x90
[   79.636541]  do_sdei_event+0xa4/0x180
[   79.636544]  __sdei_handler+0x5c/0xa0
[   79.636548]  __sdei_asm_handler+0xe8/0x19c
[   79.636551]  cpu_do_idle+0x14/0x74
[   79.636553]  default_idle_call+0x44/0x150
[   79.636555]  cpuidle_idle_call+0x174/0x200
[   79.636559]  do_idle+0xac/0x100
[   79.636561]  cpu_startup_entry+0x30/0x70
[   79.636563]  rest_init+0xec/0x120
[   79.636565]  arch_call_rest_init+0x18/0x24
[   79.636569]  start_kernel+0x4b4/0x4ec
[   79.636570]  __primary_switched+0xbc/0xc4
[   79.636573] SMP: stopping secondary CPUs
[   79.636590] Kernel Offset: 0x20fb357f0000 from 0xffff800008000000
[   79.636592] PHYS_OFFSET: 0x80000000
[   79.636592] CPU features: 0x0,000005c1,a3332e5a
[   79.636595] Memory Limit: none
[   80.752101] pstore: crypto_comp_compress failed, ret = -22!
Nvidia BlueField-3 rev1 BL1 V1.0
INFO: psc supervisor init.
INFO: psc_irq_init...
INFO: force_crs_enable=0 pcr.lock0 = 1, time = 60105
INFO: enter idle task.
NOTICE:  Running as 9009D3D400ENHAA system

A fatal double-bit error aborts or reboots the system, depending on the OS. In the output above, Ubuntu (kernel 5.15) reboots BlueField. Anolis aborts without rebooting.

The RShim log should also indicate that a double-bit ECC error has occurred.

Injecting Processor Errors

To inject processor errors:

Specify the core you would like to inject your error to in param3:
```
# echo 7 > param3
```
To list all core IDs available, use cat /proc/cpuinfo.

Specify the error type to inject. To inject a processor correctable error:

            # echo 0x1 > error_type
# echo 1 > error_inject

Expected Processor Error Report Format

Correctable Processor Error

When a correctable processor error is injected, the BlueField console displays output similar to the following:

            # dmesg
[ 4098.340991] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 4098.340995] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 4098.340996] {1}[Hardware Error]: event severity: corrected
[ 4098.340998] {1}[Hardware Error]:  Error 0, type: corrected
[ 4098.340999] {1}[Hardware Error]:   section_type: ARM processor error
[ 4098.341001] {1}[Hardware Error]:   MIDR: 0x00000000410fd421
[ 4098.341002] {1}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081010000
[ 4098.341003] {1}[Hardware Error]:   running state: 0x1
[ 4098.341004] {1}[Hardware Error]:   Power State Coordination Interface state: 0
[ 4098.341005] {1}[Hardware Error]:   Error info structure 0:
[ 4098.341005] {1}[Hardware Error]:   num errors: 2
[ 4098.341006] {1}[Hardware Error]:    error_type: 0, cache error
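
The `MIDR` and `MPIDR` values in the report are packed register values. The sketch below splits them into fields using the standard Arm `MIDR_EL1` and `MPIDR_EL1` bit layouts. The function names are hypothetical, and how the affinity fields relate to the core number written to `param3` is not stated in this document, so the sketch makes no claim about that mapping.

```python
def decode_midr(midr):
    """Split a MIDR_EL1 value into its architectural fields."""
    return {
        "implementer": (midr >> 24) & 0xFF,
        "variant": (midr >> 20) & 0xF,
        "architecture": (midr >> 16) & 0xF,
        "part_num": (midr >> 4) & 0xFFF,
        "revision": midr & 0xF,
    }


def decode_mpidr(mpidr):
    """Split an MPIDR_EL1 value into affinity levels and the MT bit."""
    return {
        "aff0": mpidr & 0xFF,
        "aff1": (mpidr >> 8) & 0xFF,
        "aff2": (mpidr >> 16) & 0xFF,
        "aff3": (mpidr >> 32) & 0xFF,
        "mt": (mpidr >> 24) & 0x1,
    }


# Values taken from the example report above.
print({k: hex(v) for k, v in decode_midr(0x00000000410FD421).items()})
print(decode_mpidr(0x0000000081010000))
```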

Uncorrectable Non-fatal Processor Error

When an uncorrectable, non-fatal processor error is injected, the BlueField console displays output similar to the following:

            [ 5008.509118] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 5008.520195] {2}[Hardware Error]: event severity: recoverable
[ 5008.525837] {2}[Hardware Error]:  Error 0, type: recoverable
[ 5008.531478] {2}[Hardware Error]:   section_type: ARM processor error
[ 5008.537813] {2}[Hardware Error]:   MIDR: 0x00000000410fd421
[ 5008.543366] {2}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081070100
[ 5008.552042] {2}[Hardware Error]:   running state: 0x1
[ 5008.557074] {2}[Hardware Error]:   Power State Coordination Interface state: 0
[ 5008.564275] {2}[Hardware Error]:   Error info structure 0:
[ 5008.569741] {2}[Hardware Error]:   num errors: 2
[ 5008.574340] {2}[Hardware Error]:    error_type: 0, cache error

Uncorrectable Fatal Processor Error

When an uncorrectable, fatal processor error is injected, the BlueField console displays output similar to the following:

            [ 5008.509118] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 5008.520195] {2}[Hardware Error]: event severity: fatal
[ 5008.525837] {2}[Hardware Error]:  Error 0, type: fatal
[ 5008.531478] {2}[Hardware Error]:   section_type: ARM processor error
[ 5008.537813] {2}[Hardware Error]:   MIDR: 0x00000000410fd421
[ 5008.543366] {2}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081070100
[ 5008.552042] {2}[Hardware Error]:   running state: 0x1
[ 5008.557074] {2}[Hardware Error]:   Power State Coordination Interface state: 0
[ 5008.564275] {2}[Hardware Error]:   Error info structure 0:
[ 5008.569741] {2}[Hardware Error]:   num errors: 2
[ 5008.574340] {2}[Hardware Error]:    error_type: 0, cache error

A double-bit error should abort and reboot the system. In Anolis, the system aborts and reboots after a few seconds.

ras-tools

Note

ras-tools has been verified for Anolis OS only. Use it on other OSs at your own discretion.

Load the einj driver:

            # modprobe einj

Install ras-tools:

            # yum install ras-tools

Setting Correctable Errors Thresholds

By default, the correctable error threshold (CeThreshold) is 5000, meaning that a CE is reported to the user only at every 5000th error. The user therefore sees no CE indication in the RShim log or on the Arm console unless the number of injected errors reaches a multiple of 5000.
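
The threshold behavior described above can be modeled in a few lines. This is only an illustration of the stated rule (a report at every multiple of the threshold), not the firmware's actual logic, and the function name `ce_is_reported` is hypothetical.

```python
def ce_is_reported(error_count, threshold=5000):
    """Return True when the Nth correctable error would be reported."""
    if threshold <= 0:
        raise ValueError("threshold must be a positive integer")
    return error_count > 0 and error_count % threshold == 0


# With the default threshold, only every 5000th error is reported.
print([count for count in (1, 4999, 5000, 5001, 10000) if ce_is_reported(count)])

# With CeThreshold lowered to 5, every 5th injected error is reported.
print([count for count in range(1, 16) if ce_is_reported(count, threshold=5)])
```

Lowering the threshold, as shown in the procedure that follows, makes injected correctable errors visible after far fewer injections.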

The following procedure demonstrates how CeThreshold can be modified using Redfish:

Check current BIOS settings:

            curl -k -u root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios

Example output:

            "Attributes": {
    "Boot Partition Protection": false,
    "CeThreshold": 3,
    "CurrentUefiPassword": "",
    "DateTime": "2024-06-26T15:13:45Z",
    "DefaultPasswordPolicy": true,
    "Disable PCIe": false,
    "Disable SPMI": false,
    "Disable TMFF": false,
    "EmmcWipe": false,
    "Enable 2nd eMMC": false,
    "Enable OP-TEE": false,
    "Enable SMMU": true,
    "Field Mode": false,
    "Host Privilege Level": "Privileged",
    "Internal CPU Model": "Embedded",
    "LegacyPasswordEnable": false,
    "NicMode": "DpuMode",
    "NvmeWipe": false,
    "OsArgs": "",
    "ResetEfiVars": false,
    "SPCR UART": "Disabled",
    "UefiArgs": "",
    "UefiPassword": ""
  },

Change CeThreshold to 5:

            curl -k -u root:'<password>' -H 'content-type: application/json' -d '{ "Attributes": { "CeThreshold":5 } }' -X PATCH https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios/Settings

Check the pending attribute settings:

curl -k -u root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios/Settings

Example output:

            {
  "@odata.id": "/redfish/v1/Systems/Bluefield/Bios/Settings",
  "@odata.type": "#Bios.v1_2_0.Bios",
  "Attributes": {
    "CeThreshold": 5
  },
  "Description": "BIOS Settings",
  "Id": "BIOS_Settings",
  "Name": "BIOS Configuration"
}

Reboot Arm. For example:

            dpu> echo "SW_RESET 1" > /dev/rshim0/misc

Check that CeThreshold has been updated in the BIOS:

curl -k -u root:'<password>' -H 'content-type: application/json' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/Bios

Example output:

"Attributes": {
    "Boot Partition Protection": false,
    "CeThreshold": 5,
    "CurrentUefiPassword": "",
    "DateTime": "2024-06-26T15:13:45Z",
    "DefaultPasswordPolicy": true,
    "Disable PCIe": false,
    "Disable SPMI": false,
    "Disable TMFF": false,
    "EmmcWipe": false,
    "Enable 2nd eMMC": false,
    "Enable OP-TEE": false,
    "Enable SMMU": true,
    "Field Mode": false,
    "Host Privilege Level": "Privileged",
    "Internal CPU Model": "Embedded",
    "LegacyPasswordEnable": false,
    "NicMode": "DpuMode",
    "NvmeWipe": false,
    "OsArgs": "",
    "ResetEfiVars": false,
    "SPCR UART": "Disabled",
    "UefiArgs": "",
    "UefiPassword": ""
  },

NIC Errors

BlueField Arm only supports memory and CPU error injection and handling.

BlueField-3 supports the handling of the following NIC firmware errors:

PCIe errors (error injection is currently not supported)
NIC RAM errors (error injection is currently not supported)
Network errors

NIC Network Errors

CRC errors can be forced during traffic by using the mlxreg command to write to the PTER (Port Transmit Errors Register) access register.

Info

See section 30.4.11, "PTER - Port Transmit Errors Register", in the NVIDIA Adapters Programmer's Reference Manual.

A network error report looks similar to the following:

            {
            "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17",
            "@odata.type": "#LogEntry.v1_15_0.LogEntry",
            "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17/attachment",
            "CPER": {
                "NotificationType": "00000000-0000-0000-0000-000000000000",
                "Oem": {
                    "Nvidia": {
                        "@odata.type": "#NvidiaCPER.v1_0_0.NvidiaCPER",
                        "Nvidia": {
                            "ErrorInstance": 0,
                            "ErrorType": 2,
                            "InstanceBase": 0,
                            "RegisterCount": 3,
                            "Registers": [
                                {
                                    "Address": 1,
                                    "Value": <icrc errors>
                                },
                                {
                                    "Address": 2,
                                    "Value": <vcrc errors>
                                },
                                {
                                    "Address": 3,
                                    "Value": <fc/eth crc errors>
                                }
                            ],
                            "Severity": {
                                "Code": 0,
                                "Name": <severity string>
                            },
                            "Signature": "NBU",
                            "Socket": 0
                        }
                    }
                },
                "SectionType": "6d5244f2-2712-11ec-bea7-cb3fdb95c786"
            },
            "Created": "2024-11-19T18:37:28+00:00",
            "DiagnosticDataType": "CPERSection",
            "EntryType": "Event",
            "Id": "17",
            "Message": "A platform error occurred.",
            "MessageArgs": [],
            "MessageId": "Platform.1.0.PlatformError",
            "Name": "System Event Log Entry",
            "Resolution": "Check additional diagnostic data if available.",
            "Resolved": false,
            "Severity": "Critical"
        },

Expected NIC Error Report Format

If a NIC hardware error occurs, the report is only sent to the BMC. See section "BMC SEL" for more details.

The report would look similar to the following:

            {
            "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17",
            "@odata.type": "#LogEntry.v1_15_0.LogEntry",
            "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/17/attachment",
            "CPER": {
                "NotificationType": "00000000-0000-0000-0000-000000000000",
                "Oem": {
                    "Nvidia": {
                        "@odata.type": "#NvidiaCPER.v1_0_0.NvidiaCPER",
                        "Nvidia": {
                            "ErrorInstance": 0,
                            "ErrorType": 1,
                            "InstanceBase": 0,
                            "RegisterCount": 2,
                            "Registers": [
                                {
                                    "Address": 1,
                                    "Value": <reg address>
                                },
                                {
                                    "Address": 2,
                                    "Value": <value>
                                }
                            ],
                            "Severity": {
                                "Code": <severity value>,
                                "Name": <severity string>
                            },
                            "Signature": "NBU",
                            "Socket": 0
                        }
                    }
                },
                "SectionType": "6d5244f2-2712-11ec-bea7-cb3fdb95c786"
            },
            "Created": "2024-11-19T18:37:28+00:00",
            "DiagnosticDataType": "CPERSection",
            "EntryType": "Event",
            "Id": "17",
            "Message": "A platform error occurred.",
            "MessageArgs": [],
            "MessageId": "Platform.1.0.PlatformError",
            "Name": "System Event Log Entry",
            "Resolution": "Check additional diagnostic data if available.",
            "Resolved": false,
            "Severity": "Critical"
        },
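
Once such an entry has been retrieved from the BMC, the relevant fields can be pulled out of the nested `CPER` object. The following is a minimal sketch that follows the key names in the example report above. The helper name `summarize_nic_cper` is hypothetical, and every value in the sample entry (register values, severity code, and severity name) is made up for illustration, since the example above only shows placeholders.

```python
def summarize_nic_cper(entry):
    """Flatten the NVIDIA OEM CPER section of a Redfish log entry."""
    section = entry["CPER"]["Oem"]["Nvidia"]["Nvidia"]
    return {
        "id": entry["Id"],
        "signature": section["Signature"],
        "error_type": section["ErrorType"],
        "severity": section["Severity"]["Name"],
        "registers": {reg["Address"]: reg["Value"] for reg in section["Registers"]},
    }


# Hypothetical sample entry; the numeric and string values are placeholders.
sample_entry = {
    "Id": "17",
    "CPER": {
        "Oem": {
            "Nvidia": {
                "Nvidia": {
                    "ErrorType": 1,
                    "Registers": [
                        {"Address": 1, "Value": 4096},
                        {"Address": 2, "Value": 1},
                    ],
                    "Severity": {"Code": 0, "Name": "example-severity"},
                    "Signature": "NBU",
                }
            }
        }
    },
}
summary = summarize_nic_cper(sample_entry)
print(summary)
```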

BMC SEL

The system event log (SEL) can be queried using IPMI or Redfish:

Redfish

curl -k -u root:'<password>' -X GET https://<bmc_ip>/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries

Example entries in "Members":

     {
            "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/694",
            "@odata.type": "#LogEntry.v1_15_0.LogEntry",
            "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/694/attachment",
            "Created": "2024-11-20T16:36:41+00:00",
            "EntryType": "Event",
            "Id": "694",
            "Message": "SEL event for single bit ECC",
            "Modified": "2024-11-20T16:36:41+00:00",
            "Name": "System Event Log Entry",
            "Resolved": false,
            "Severity": "OK"
        },
 
{
            "@odata.id": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/696",
            "@odata.type": "#LogEntry.v1_15_0.LogEntry",
            "AdditionalDataURI": "/redfish/v1/Systems/Bluefield/LogServices/EventLog/Entries/696/attachment",
            "Created": "2024-11-20T16:41:54+00:00",
            "EntryType": "Event",
            "Id": "696",
            "Message": "SEL event for multi bit ECC",
            "Modified": "2024-11-20T16:41:54+00:00",
            "Name": "System Event Log Entry",
            "Resolved": false,
            "Severity": "OK"
        },

IPMI

            # ipmitool sel list

Example output:

            2b6 | 11/20/24 | 16:36:41 UTC | Memory #0x19 | Correctable ECC | Asserted
2b8 | 11/20/24 | 16:41:54 UTC | Memory #0x19 | Uncorrectable ECC | Asserted
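
The pipe-separated lines printed by `ipmitool sel list` are easy to split into fields. The sketch below does so for the two example lines above and converts the record ID from hexadecimal to decimal. In this example, the resulting values (694 and 696) appear to correspond to the Redfish entry IDs with the same timestamps. The field names and the function `parse_sel_line` are assumptions for illustration.

```python
def parse_sel_line(line):
    """Split one 'ipmitool sel list' line into named fields."""
    fields = [part.strip() for part in line.split("|")]
    record_id, date, time, sensor, event, state = fields
    return {
        "record_id": int(record_id, 16),  # IDs are printed in hexadecimal
        "date": date,
        "time": time,
        "sensor": sensor,
        "event": event,
        "state": state,
    }


sel_lines = [
    "2b6 | 11/20/24 | 16:36:41 UTC | Memory #0x19 | Correctable ECC | Asserted",
    "2b8 | 11/20/24 | 16:41:54 UTC | Memory #0x19 | Uncorrectable ECC | Asserted",
]
records = [parse_sel_line(line) for line in sel_lines]
for record in records:
    print(record["record_id"], record["event"])
```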

DDR Training Parameter Results Script

The bftraining_results script prints the DDR training parameter results, reported separately for each memory channel.

Example output:

            dpu# bftraining_results 
Memory controller 0 Channel 0 
Read Data worst timing margin 23
Write Data worst timing margin 21
Read Data worst Vref margin 43
Write Data worst Vref margin 39
TX CS worst timing margin 58
TX CA worst timing margin 55
 
Memory controller 0 Channel 1 
Unpopulated
 
Memory controller 1 Channel 0 
Read Data worst timing margin 23
Write Data worst timing margin 22
Read Data worst Vref margin 41
Write Data worst Vref margin 40
TX CS worst timing margin 59
TX CA worst timing margin 55
 
Memory controller 1 Channel 1 
Unpopulated
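
To compare margins across boards or boots, the script output can be parsed into a dictionary keyed by controller and channel. This is a minimal sketch that relies only on the line format shown in the example output above. The function name `parse_training_results` is hypothetical, and unpopulated channels are represented as `None`.

```python
import re

HEADER_RE = re.compile(r"^Memory controller (\d+) Channel (\d+)$")
MARGIN_RE = re.compile(r"^(.+ margin) (\d+)$")


def parse_training_results(text):
    """Parse bftraining_results output into {(controller, channel): margins}."""
    results = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = HEADER_RE.match(line)
        if header:
            current = (int(header.group(1)), int(header.group(2)))
            results[current] = {}
            continue
        if current is None or not line:
            continue
        if line == "Unpopulated":
            results[current] = None
            continue
        margin = MARGIN_RE.match(line)
        if margin and results[current] is not None:
            results[current][margin.group(1)] = int(margin.group(2))
    return results


# Excerpt of the example output above.
sample_output = """Memory controller 0 Channel 0
Read Data worst timing margin 23
Write Data worst timing margin 21
TX CA worst timing margin 55

Memory controller 0 Channel 1
Unpopulated
"""
parsed = parse_training_results(sample_output)
print(parsed)
```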
