Safety Features - NVIDIA Docs

Report Error (nvDemoAppSwErr)

The error reporting use case is demonstrated by the demo application nvDemoAppSwErr, which runs on CCPLEX and uses the Error Propagation Library to report errors.

Validation Steps

The app is located on CCPLEX at /opt/nvidia/ccplex_sf/error_propagation/nvDemoAppSwErr.

Enter sudo mode.

Set runtime library path as follows:


            
            export LD_LIBRARY_PATH=/opt/nvidia/ccplex_sf/error_propagation/:$LD_LIBRARY_PATH

Execute nvDemoAppSwErr.
Select the option s to report software errors. After selecting s, enter the details for the software error.
Enter the ReporterId.
Enter the Error Code.
Enter the Error Attribute.
Upon successful reporting of the software error, the reported software error details are displayed on the SMCU console.

Example


            
            root@IGX-ubuntu:/opt/nvidia/ccplex_sf/error_propagation# ./nvDemoAppSwErr

  DriveOS_ErrPropagationDemoApp main menu
  _____________________________________________

  [m] Display the menu
  [s] Report Software Error
  [q] Terminate the app

s
Enter the Software Error Reporter ID (0x0000 - 0xFFFF)
8051
Enter the Software Error Code (0x00000000 - 0xFFFFFFFF)
12
Enter the Software Error Attribute (0x00000000 - 0xFFFFFFFF)
1
Reported SW Error
Software Error ReporterId : 0x8051
Software Errror ErrorCode : 0x0012
Software Error Attribute   : 0x00000001

  DriveOS_ErrPropagationDemoApp main menu
  _____________________________________________

  [m] Display the menu
  [s] Report Software Error
  [q] Terminate the app

q

Result

The reported error should be displayed on the SMCU console.


            
            MCU_FOH: ErrReport: ErrorCode-0x12 ReporterId-0x8051 Error_Attribute-0x1 Timestamp-0x8387aa57
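
The check in this step can be scripted. The following is a minimal sketch that parses the ErrReport line and compares it with the values entered in the demo app. The field names in the pattern are taken from the sample line above, not from a published log specification, and the function names are illustrative.

```python
import re

# Pattern for the SMCU console line shown above. The field names come from
# that sample line only; other firmware versions may print a different layout.
ERR_REPORT = re.compile(
    r"ErrReport:\s+ErrorCode-(0x[0-9a-fA-F]+)\s+ReporterId-(0x[0-9a-fA-F]+)"
    r"\s+Error_Attribute-(0x[0-9a-fA-F]+)\s+Timestamp-(0x[0-9a-fA-F]+)"
)


def parse_err_report(line):
    """Return the four hex fields of an ErrReport line as integers, or None."""
    match = ERR_REPORT.search(line)
    if match is None:
        return None
    keys = ("error_code", "reporter_id", "attribute", "timestamp")
    return dict(zip(keys, (int(value, 16) for value in match.groups())))


def matches_reported(line, reporter_id, error_code, attribute):
    """True when the console line carries the values entered in the demo app."""
    fields = parse_err_report(line)
    if fields is None:
        return False
    return (fields["reporter_id"], fields["error_code"], fields["attribute"]) == (
        reporter_id, error_code, attribute)


sample = ("MCU_FOH: ErrReport: ErrorCode-0x12 ReporterId-0x8051 "
          "Error_Attribute-0x1 Timestamp-0x8387aa57")
print(matches_reported(sample, 0x8051, 0x12, 0x1))
```

The demo app treats the entered values as hexadecimal (8051 is echoed as 0x8051), so the comparison uses hexadecimal integers as well.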

ESTOP is the Emergency Stop functionality, which turns off the system when a critical safety-related issue is seen. There are two ESTOP GPIO pins, ESTOP_EN1 and ESTOP_EN2. Buttons can be attached to these GPIO pins and used to power off the NVIDIA IGX Orin Developer Kit.

Below are the hardware details for ESTOP-INX:

On the PCB, these pins are laid out as jumper headers Header1 and Header2, which can be replaced with a button.

Implementation Steps

ESTOP_IN1 has higher priority than ESTOP_IN2.
Power-off is triggered only when the NVIDIA IGX Orin Developer Kit is in the power-on state.

Validation Steps

Make sure the NVIDIA IGX Orin Developer Kit is in the Power-on state.
Short the ESTOP_PIN1 / ESTOP_PIN2 GPIO pin (or press the button, if one is attached as per the above schematic).
Expect a power-off sequence on the SMCU side. (There will not be any logs on the NVIDIA IGX Orin Developer Kit side.)

Result

What is the expected result?

If the NVIDIA IGX Orin Developer Kit is in the power-off state (fans/LED will turn OFF), verification is successful; otherwise, the verification has FAILED.

SPI-ROM Features

There is an SPI-ROM on the NVIDIA IGX Orin Developer Kit, which can be used to store hardware version details.

Validation Steps

Steps or commands to test the feature:

inforomflash
inforomdump

Example:


            
            inforomdump

Result

What is the expected result?


            
            NvShell>inforomdump
Info: Executing cmd: inforomdump, argc: 0, args:
4D 4F 52 03 00 58 00 D6 E0 0F
19 04 21 20 53 59 53 58 00 4B
57 52 88 01 FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF 53 59
53 05 00 30 01 07 19 04 21 20
30 30 30 2D 31 30 30 30 2D 33
36 36 33 36 2D 30 37 36 00 00
00 00 00 00 41 4E 33 34 30 30
34 36 31 32 39 30 38 35 31 00
00 00 30 00 00 00 00 00 FF FF
FF FF 30 30 30 2D 31 30 30 30
2D 33 36 36 33 36 2D 39 39 36
00 00 00 00 00 00 00 41 41 4E
33 34 30 30 34 36 31 32 39 30
38 35 31 00 00 00 00 00 05 00
BC B9 4C 2D B0 48 BD B9 4C 2D
B0 48 BE B9 4C 2D B0 48 BF B9
4C 2D B0 48 C0 B9 4C 2D B0 48
C1 B9 4C 2D B0 48 C2 B9 4C 2D
B0 48 C3 B9 4C 2D B0 48 C4 B9
4C 2D B0 48 C5 B9
Command Executed
NvShell>
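
To inspect a dump like the one above, the hex rows can be converted back to bytes and scanned for readable text. This is a minimal sketch in the spirit of the `strings` utility. The field layout of the SPI-ROM is not described here, so the sketch only extracts alphanumeric runs and does not interpret them. The function names and the minimum run length are illustrative choices.

```python
import re


def dump_to_bytes(dump_text):
    """Collect the two-digit hex tokens of an inforomdump printout into bytes."""
    data = bytearray()
    for line in dump_text.splitlines():
        tokens = line.split()
        # Keep only rows made entirely of hex byte tokens; skip shell messages.
        if tokens and all(re.fullmatch(r"[0-9A-Fa-f]{2}", t) for t in tokens):
            data.extend(int(t, 16) for t in tokens)
    return bytes(data)


def text_runs(data, min_len=6):
    """Return runs of letters, digits, and dashes at least min_len bytes long."""
    pattern = rb"[0-9A-Za-z-]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, data)]


# Five rows copied from the dump above.
excerpt = """\
53 05 00 30 01 07 19 04 21 20
30 30 30 2D 31 30 30 30 2D 33
36 36 33 36 2D 30 37 36 00 00
00 00 00 00 41 4E 33 34 30 30
34 36 31 32 39 30 38 35 31 00
"""
print(text_runs(dump_to_bytes(excerpt)))
```

For the five rows used here, the readable runs are `000-1000-36636-076` and `AN3400461290851`.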

SMCU-PMIC

SMCU-PMIC is the power supply regulator for the MCU. The PMIC provides a robust, safe, and reliable voltage supply for the AURIX microcontroller. It is rated for ASIL-D functional safety and supplies the AURIX microcontroller and the other loads in the AURIX system.

Validation Steps

Testing the SMCU PMIC directly is not recommended, because the SMCU cannot operate reliably if the PMIC misbehaves. If the SMCU boots successfully, the SMCU PMIC voltage regulator is working as expected.

Safe-Shutdown Feature

The Safe-Shutdown feature lets CCPLEX applications shut down IGX gracefully. It sends a “graceful” shutdown command to CCPLEX client applications over SPI. Upon receiving the command, the CCPLEX client application initiates the NVIDIA IGX Orin Developer Kit software shutdown sequence.

Validation Steps

Example of steps or commands to test the feature:


            
            poweroff safeshutdown

You need to open the SMCU console in BMC, then run the command. The commands are in the BMC User Guide.

Result

What is the expected result?

The SMCU console output for safeshutdown is shown below.


            
            NvShell>poweroff safeshutdown
Info: Executing cmd: poweroff, argc: 1, args: safeshutdown
NvShell>INFO: MCU_PLTFPWRMGR: Powering off
INFO: AURIX_CCPLEX_COM: Telemetry Disable!
INFO: MCU_PLTFPWRMGR: VRS11 PG Monitoring disable.
INFO: NVMCU_ORINPWRCTRL: Wait for Safe Shutdown notification (20s max)
ERROR: NvMCU_OrinVMON: VMON XA Notification
ERROR: NvMCU_OrinVMON: UV fault has occurred !
ERROR: NvMCU_OrinVMON: UV fault has occurred !
ERROR: NvMCU_OrinVMON: UV fault has occurred !
INFO: NvMCU_OrinVMON: Get VRS11 Status..
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x10: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x10: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x11: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x11: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x12: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x12: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x13: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x13: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x14: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x14: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x15: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x15: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x16: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x16: 0x0
INFO: NvMCU_OrinVMON: Get VRS10 Status..
INFO: NvMCU_OrinVMON: INT_SRC1: 0x0
INFO: NvMCU_OrinVMON: INT_SRC2: 0x0
INFO: NvMCU_OrinVMON: INT_VENDOR: 0x0
ERROR: MCU_ERRHANDLER: OrinVMON : ReportedID - 0x810B ErrorCode - 0xB
INFO: NVMCU_ORINPWRCTRL: Safe-Shutdown done!
INFO: SftyMon_IoHwAbs: Force Shutdown asserted
INFO: NVMCU_ORINPWRCTRL: P3740 Poweroff sequence start
INFO: NVMCU_ORINPWRCTRL: CARRIER_POWER_ON low - PASS
INFO: NVMCU_ORINPWRCTRL: ATX_PG low - PASS
INFO: NvMCU_OrinVMON: VMON Power-down sequence is ok !
INFO: AURIX_CCPLEX_COM: Telemetry enable/disable request already taken!
INFO: MCU_PLTFPWRMGR: Orin TMON disable.
INFO: MCU_PLTFPWRMGR: Board TMON disable.
INFO: MCU_PLTFPWRMGR: Power down sequence is complete !
Command Executed

IST-Manager – Features

IST-Manager runs on the Aurix-P3740 platform. It sets the IST-Trigger configuration and monitors the IST-DONE GPIO pins. Once IST completes and Orin is in a power-off state, IST-Manager wakes up Orin by issuing a poweron sequence.

Implementation Details

Commands are provided for the user to set an IST-Trigger configuration. With the current configuration, IST is triggered once the poweron sequence has been initiated.
Once the configuration is set and the poweron sequence is initiated, IST triggers, and IST-Manager monitors the IST-Done GPIO pins to make sure IST has completed.
After the IST-Done GPIO pin toggles, IST-Manager calls the IST-Callback notification and prints information about IST.
After the IST-Done notification, IST-Manager initiates the Orin poweron sequence. You can then issue the getistinfo / getistresult command to get IST-Result data from IST-Client.
Based on the IST-Result data, the IST-Result handler is called. The current implementation of the handler only prints messages about IST-Fail/Pass.

Validation Steps


            
            1.  setistconfig 0x1      // to enable IST-Trigger
2.  poweroff
3.  poweron (IST will trigger in this stage)
4.  wait for IST-Completion notification and then SMCU will poweron Orin.
5.  getistinfo (should be called once Orin is up)
6.  setistconfig 0x0       // to disable IST-Trigger

Result

What is the expected result?


            
            getistinfo/getistresult should show IST-FAIL/IST-PASS

CAN – Features

There are four CAN controllers in Aurix, but only CAN1 has been enabled. CAN1 has been validated by connecting Aurix-CAN1 to Orin-CAN0.

This diagram shows an example of a correct connection:

And below is a wire connection image:

Validation Steps

The following command should be run from SMCU:


            
            cycliccanon CAN     # this will transmit continuous CAN Data (0x119) over CAN1
cycliccanoff CAN     # this will stop transmitting.

The below commands need to be run on the CCPLEX side.

Make sure CAN nodes are enabled in kernel-dtb:


            
            cat /proc/device-tree/bus\@0/mttcan\@c310000/status
cat /proc/device-tree/bus\@0/mttcan\@c320000/status

Pinmux settings:


            
            CAN1_DIN: ./devmem2 0x0c303008 w 0x458
CAN1_DOUTM: ./devmem2 0x0c303000 w 0x400

Install CAN modules:


            
            modprobe can
modprobe can_raw
modprobe mttcan

Setup interface properties:


            
            ip link set can0 up type can bitrate 500000
ip link set can1 up type can bitrate 500000

Receive CAN Packet; K5.15 has CAN0 mapped to CAN1, so CAN data will appear on CAN1 instead of CAN0:

            candump -x any &

Result

What is the expected result?


            
            command “candump -x any &” should show CAN data as 0x119 transmitted from SMCU

TMON – Features

There are two TMP451 temperature sensors on the IGX Developer Kit: one on Orin and another on P3740. The TMON feature monitors both TMP451 sensors and takes action based on their values; the action could be to shut down Orin.

Implementation Details

This includes the TMON OV_HF/LF and UV_HF/LF tests:

Program the high-limit/low-limit temperature value and initialize TMON for further processing.
Check whether the SOC_THERM and SOC_ALERT GPIO pins are HIGH. If they are HIGH, the Orin temperature sensor is initialized; otherwise, initialization fails.
Program the Orin TMP451 to the threshold values and check that the SOC_ALERT/SOC_THERM pins go LOW. If the pins are LOW, the test passes; otherwise, it fails.
The above steps also apply to the SMCU TMP451.

Validation Steps

After poweron command on SMCU, the below prints should appear:


            
            NvShell>poweron
Info: Executing cmd: poweron, argc: 0, args:
NvShell>INFO: MCU_PLTFPWRMGR: Powering up
INFO: NVMCU_ORINPWRCTRL: P3740 Poweron sequence start
INFO: NVMCU_ORINPWRCTRL: ATX_PG high - PASS
INFO: PLTFPWRMGR_IOHWABS: PG Signal Passed
INFO: NVMCU_ORINPWRCTRL: FUNC_NIRQ continuous monitoring Enabled!
ERROR: MCU_ERRHANDLER: OrinPwrCtrl : ReporterID - 0x810D ErrorCode - 0xE
INFO: SMCU_INPUTS: Tegra Reset requested
ERROR: MCU_PLTFPWRMGR: Tegra Reset Request failed - Preconditions not met(0x504)
INFO: NvMCU_OrinVMON: VMON rail seq on order was ok!
INFO: NvMCU_OrinTMON: toggle check of local and remote sensor successfull
INFO: NvMCU_OrinTMON: Orin Temperature sensor initialized
INFO: MCU_PLTFPWRMGR: Orin TMON enabled ....
INFO: NVMCU_ORINPWRCTRL: Tegra x1 Boot Chain: A
INFO: SftyMon_tmon: Board Temperature sensor initialized
INFO: MCU_PLTFPWRMGR: Board TMON enabled.
INFO: NVMCU_ORINPWRCTRL: Carrier_POWER_ON high - PASS
MCU_FOH: MCU FOH : Initiate SOC Error Pin Monitoring & SPI communication
INFO: PLTFPWRMGR_IOHWABS: SW Powerup, Wait for Tegra UP or timeout: (180)s
NvShell>

Result

What is the expected result?


            
            If the prints from the “Validation Steps” above appear, consider TMON as passing. If there is any failure related to TMON, consider it a TMON failure.
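
The pass criterion can be checked against a captured SMCU console log. This is a minimal sketch. The list of required lines is taken from the poweron output in the validation steps, and treating any ERROR line that mentions TMON as a failure is an assumption of this sketch, not a documented rule.

```python
# TMON-related lines expected in the poweron output shown in the validation steps.
REQUIRED_TMON_LINES = (
    "NvMCU_OrinTMON: Orin Temperature sensor initialized",
    "MCU_PLTFPWRMGR: Orin TMON enabled",
    "SftyMon_tmon: Board Temperature sensor initialized",
    "MCU_PLTFPWRMGR: Board TMON enabled",
)


def missing_tmon_lines(console_log):
    """Return the expected TMON messages that do not appear in the log."""
    return [msg for msg in REQUIRED_TMON_LINES if msg not in console_log]


def tmon_error_lines(console_log):
    """Return ERROR lines that mention TMON (assumed failure indicator)."""
    return [line for line in console_log.splitlines()
            if line.lstrip().startswith("ERROR") and "tmon" in line.lower()]


def tmon_passes(console_log):
    """True when every expected message is present and no TMON error is logged."""
    return not missing_tmon_lines(console_log) and not tmon_error_lines(console_log)
```

Calling `tmon_passes` on the captured poweron output returns `True` only when all four messages are present.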

VMON – Features

There are three VRS12 voltage monitors on the Orin platform. The SMCU communicates with them over I2C, programs them for OV_HF/LF and UV_HF/LF, and then monitors the NIRQ pin for any OV/UV error. The implementation details in the SMCU software are provided below:

Initial configuration and setup for VMON, reading the results of the hardware BIST executed by VMON.
Programming the VMON chip with thresholds for UV/OV detection. Configurable thresholds are provided separately for functional mode, IST mode and SC7 mode.
Continuous monitoring of VMON NIRQ pin indicating UV/OV and reporting the detected errors to the Error handler module.
Monitors the power-up and power-down sequences for NVIDIA DRIVE Orin SoC rails.
Hardware Safety requirements from VMON HSIs and SISRs are supported in software. All the detected failures are reported to the customer Error Handler module.
The following error scenarios are detected:
1. VMON Chip internal errors are detected by BIST failures.
2. I2C communication failure while read/write operations are detected by CRC.
3. Errors during the programming of VMON chips are detected by reading back the values written to the VMON registers.
4. Voltage rails are checked for plausibility by reading the actual voltage from the VMON ADC and checking for deviation from the expected levels.
5. OV/UV power-up test is executed as per the safety manual of VMON and hardware failures are detected.
6. Power-up and power-down sequences are checked for the specified rails.
The VRS12 device stores the fault information in status registers, which are accessible over I2C. AUTOSAR RTE interfaces are provided to read the status from VMON and clear the faults if any.

Validation Steps

Run the poweron command on the SMCU side and during poweron, check if the below prints appear:


            
            INFO: NvMCU_OrinVMON: VMON rail seq on order was ok!

Result

What is the expected result?


            
            INFO: NvMCU_OrinVMON: VMON rail seq on order was ok!

Heartbeat – Features

Heartbeat functionality reports Orin’s status after Orin is powered on. The SMCU continuously pings Orin every 3 seconds over SPI3 to get its current status. Currently, only thermal details are retrieved from Orin; they can be printed with the print_telemetry command in SMCU.

Implementation Details

A telemetry client runs on CCPLEX. It registers itself with an SPI server that is used to exchange telemetry data with the SMCU over SPI3.
On the SMCU side, a periodic callback function is called every 1 second and sends commands to retrieve telemetry data from the telemetry client.
If there is a fault on the Orin side and the SMCU cannot receive telemetry data, the SMCU prints a warning message that Orin is in a bad state.
The print_telemetry command in SMCU retrieves the temperature for CPU/GPU/SOCX.

Validation Steps


            
            NvShell>print_telemetry
Info: Executing cmd: print_telemetry, argc: 0, args:
INFO: AURIX_CCPLEX_COM: Telemetry Data:
INFO: AURIX_CCPLEX_COM: THERMAL NODE PRESENT = 63
INFO: AURIX_CCPLEX_COM: CPU SOC THERM TEMPERATURE = 53
INFO: AURIX_CCPLEX_COM: GPU SOC THERM TEMPERATURE = 49
INFO: AURIX_CCPLEX_COM: SOC0 THERM TEMPERATURE = 53
INFO: AURIX_CCPLEX_COM: SOC1 THERM TEMPERATURE = 56
INFO: AURIX_CCPLEX_COM: SOC2 THERM TEMPERATURE = 49
INFO: AURIX_CCPLEX_COM: CX7 THERM TEMPERATURE = 46

Result

What is the expected result?

If there are 0 values for CPU/GPU/SOC0/SOC1/SOC2, consider this as FAIL; else it is a PASS.
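
The PASS/FAIL rule above can be applied to captured print_telemetry output. This is a minimal sketch. The pattern is derived from the sample output in the validation steps, and treating a missing node as a failure is an assumption of this sketch.

```python
import re

# Matches lines such as "... AURIX_CCPLEX_COM: SOC0 THERM TEMPERATURE = 53".
TELEMETRY_LINE = re.compile(
    r"AURIX_CCPLEX_COM:\s+(.+?) THERM TEMPERATURE = (-?\d+)")

# Nodes named in the expected-result rule.
CHECKED_NODES = ("CPU SOC", "GPU SOC", "SOC0", "SOC1", "SOC2")


def parse_telemetry(console_text):
    """Map each thermal node name to its reported temperature."""
    readings = {}
    for line in console_text.splitlines():
        match = TELEMETRY_LINE.search(line)
        if match:
            readings[match.group(1)] = int(match.group(2))
    return readings


def heartbeat_passes(console_text):
    """PASS when every checked node is present and non-zero."""
    readings = parse_telemetry(console_text)
    # A node missing from the output is treated as a failure (assumption).
    return all(readings.get(node, 0) != 0 for node in CHECKED_NODES)
```

The CX7 reading is parsed as well, but it is not part of the rule, so it does not affect the result.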

I2C-FPGA Feature

Orin error data must be sent to the FPGA. Based on the received data, the FPGA software takes further action: stop any work that is scheduled or in progress, issue a warning message and continue the task, or pause the running task. The Orin error data is transmitted between the SMCU and the FPGA over I2C1.

Below are the steps for routing Orin-Error data to FPGA:

The FOH module collects all Orin failures from the different SMCU software modules and sends a command to the FPGA. The command can be STOP, PAUSE, or RESUME, and is sent through an API exported by FOH. The failure sources are:
1. Aurix-CCPLEX-SW modules, which collect CCPLEX failure information with the help of the Telemetry application and send STOP/PAUSE/RESUME through the API (defined in the FOH file).
2. HSI and BPMP failures relayed by FSI to the SMCU. These failures are captured in the FOH file, which calls the API to send the STOP command to the FPGA.
3. ESTOP failure: if the emergency button is pressed, the STOP command is sent to the FPGA device through the API.
The commands are classified as critical, non-critical, and all-good:
1. Critical command: STOP
2. Non-Critical commands/SC7-Entry: PAUSE.
3. Error resolved/SC7-Exit: RESUME
If there is any issue or failure with a CCPLEX application, the SMCU sends PAUSE to the FPGA, and the FPGA can pause its task and wait for a RESUME command. RESUME is sent once the application has recovered from the failure, and the FPGA can then resume its task. If the application does not recover within 3 minutes, the SMCU sends the STOP command to the FPGA.
During SC7 entry, SMCU will send the PAUSE command to FPGA and while resuming from SC7, SMCU will send the RESUME command. Based on these commands, FPGA can pause/resume tasks.
Customers should program FPGA devices at address 0x65 on I2C1.
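
The PAUSE/RESUME/STOP escalation described above can be sketched as a small state machine. This is an illustration only and not the FOH implementation. The class and method names are hypothetical; the command codes are the ones shown in the SMCU log messages in this section. Sending nothing further once STOP has been issued is an assumption, because the behavior after STOP is not described here.

```python
RESUME, PAUSE, STOP = 0x1, 0x2, 0x3   # codes as printed in the MCU_FOH log lines
RECOVERY_TIMEOUT_S = 180              # "within 3 minutes"


class FpgaNotifier:
    """Decides which command, if any, to send to the FPGA on each health update."""

    def __init__(self, timeout_s=RECOVERY_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.paused_at = None   # time at which PAUSE was sent, None if not paused
        self.stopped = False

    def update(self, now_s, app_healthy):
        """Return PAUSE, RESUME, STOP, or None for the current health sample."""
        if self.stopped:
            return None  # assumption: no further commands after STOP
        if app_healthy:
            if self.paused_at is not None:
                self.paused_at = None
                return RESUME
            return None
        if self.paused_at is None:
            self.paused_at = now_s
            return PAUSE
        if now_s - self.paused_at >= self.timeout_s:
            self.stopped = True
            return STOP
        return None
```

In this sketch, a failure first yields PAUSE, a recovery before the timeout yields RESUME, and a failure that persists for 180 seconds yields STOP.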

Validation Steps

Example of steps or commands to test the feature:

Verification can be done only by getting logs in SMCU:

CCPLEX: Stop SPI-Server in CCPLEX:


            
            systemctl stop nv-ccplex-telemetry.service

SMCU: below message will appear within 3 seconds.


            
            MCU_FOH: Sending message to FPGA: PAUSE(0x2)

CCPLEX: Start SPI-Server in CCPLEX:


            
            systemctl start nv-ccplex-telemetry.service

SMCU: below message will appear within 3 seconds.


            
            MCU_FOH: Sending message to FPGA: RESUME(0x1)

CCPLEX: Stop SPI-Server in CCPLEX:


            
            systemctl stop nv-ccplex-telemetry.service

SMCU: PAUSE message will appear within 3 seconds and STOP will appear within 3 minutes.


            
            MCU_FOH: Sending message to FPGA: PAUSE(0x2)
MCU_FOH: Sending message to FPGA: STOP(0x3)

Result

What is the expected result?

The following messages are expected on the SMCU console:

In case of SC7 entry/Application failure:


            
            MCU_FOH: Sending message to FPGA: PAUSE(0x2)

In case of recovered applications:


            
            MCU_FOH: Sending message to FPGA: RESUME(0x1)

In case of error:


            
            MCU_FOH: Sending message to FPGA: STOP(0x3)

Fan Control

On the NVIDIA IGX Orin Developer Kit, there are two active cooling devices (fans) connected to the fan headers J55 and J64. The fan PWM and tachometer pins are connected to a fan IC, which the SMCU accesses through its I2C1 controller. The SMCU application:

1. Runs an algorithm to tune the fan PWM based on the hardware temperature sensors’ readings.
2. Reads the RPM reported by the tachometer and fine-tunes the PWM if the RPM does not meet the thermal requirements.

The important components of the fan control application in SMCU are:

TMARGIN Temperature - The temperature used to control the fan PWM is calculated with the equation below:

\[\mathrm{tmargin} = \min\bigl(\mathrm{MAXTEMP}(\mathrm{SOC}) - \mathrm{Average}(\mathrm{Temp}(\mathrm{Soctherms})),\ \mathrm{MAXTEMP}(\mathrm{CX7}) - \mathrm{Temp}(\mathrm{CX7})\bigr)\]

Fan Table - The following table provides the fan curve details which the fan control application uses to choose the pwm to set and rpm to check.

Tmargin	PWM	RPM
118	0	1000
30	0	1000
10	40	1400
8	105	1800
4	255	2700
0	255	2700
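
The TMARGIN equation and the fan table lookup can be sketched as follows. The maximum temperatures are passed in as parameters because their values are not given here. Linear interpolation between neighbouring table rows is an assumption of this sketch. The function names are illustrative and are not part of the SMCU software.

```python
# Fan table rows as (tmargin, pwm, rpm), copied from the table above.
FAN_TABLE = [
    (118, 0, 1000),
    (30, 0, 1000),
    (10, 40, 1400),
    (8, 105, 1800),
    (4, 255, 2700),
    (0, 255, 2700),
]


def tmargin(soc_temps, cx7_temp, max_soc, max_cx7):
    """min(MAXTEMP(SOC) - average SoC therm temp, MAXTEMP(CX7) - CX7 temp)."""
    soc_margin = max_soc - sum(soc_temps) / len(soc_temps)
    cx7_margin = max_cx7 - cx7_temp
    return min(soc_margin, cx7_margin)


def fan_target(margin):
    """Return (pwm, rpm) for a tmargin value.

    Values outside the table are clamped to the nearest row; values between
    two rows are linearly interpolated (assumption).
    """
    rows = sorted(FAN_TABLE)  # ascending tmargin
    if margin <= rows[0][0]:
        return rows[0][1], rows[0][2]
    if margin >= rows[-1][0]:
        return rows[-1][1], rows[-1][2]
    for (t0, p0, r0), (t1, p1, r1) in zip(rows, rows[1:]):
        if t0 <= margin <= t1:
            frac = (margin - t0) / (t1 - t0)
            return p0 + frac * (p1 - p0), r0 + frac * (r1 - r0)


print(fan_target(73))
```

A tmargin of 73 lies between the rows for 30 and 118, which both specify PWM 0 and 1000 RPM, so the target is PWM 0 at 1000 RPM.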

State Machine

The different states in the fan control application are:

Fan Init - Initialize the fan HW with the required configuration and set the default PWM.
Temp Control - Read the Tmargin value and set the corresponding PWM from the Temp-PWM mapping table.
RPM in limits - Based on the data sheet RPM values are expected to be in the 10% error range. Check if the current RPM is in the +/- 10% range of desired RPM.
RPM Stable - Wait for five seconds as the temperature does not change quickly.
Stabilize RPM - RPM is not in limits, so increase or decrease the PWM value to match the RPM.
Stabilize Timeout - If the RPM stabilization fails, revert to temperature-based control, because recovery is outside the scope of the software in this case.

Validation Steps

print_telemetry command in SMCU console should give you the TMARGIN TEMPERATURE.


            
            NvShell>print_telemetry
Info: Executing cmd: print_telemetry, argc: 0, args:
INFO: AURIX_CCPLEX_COM: Telemetry Data:
INFO: AURIX_CCPLEX_COM: THERMAL NODE PRESENT = 63
INFO: AURIX_CCPLEX_COM: CPU SOC THERM TEMPERATURE = 47
INFO: AURIX_CCPLEX_COM: GPU SOC THERM TEMPERATURE = 43
INFO: AURIX_CCPLEX_COM: SOC0 THERM TEMPERATURE = 46
INFO: AURIX_CCPLEX_COM: SOC1 THERM TEMPERATURE = 49
INFO: AURIX_CCPLEX_COM: SOC2 THERM TEMPERATURE = 42
INFO: AURIX_CCPLEX_COM: CX7 THERM TEMPERATURE = 37
**INFO: AURIX_CCPLEX_COM: TMARGIN TEMPERATURE = 73**

show_fanrpm command in SMCU console should give you the fan rpm values for both the fans.


            
            NvShell>show_fanrpm
Info: Executing cmd: show_fanrpm, argc: 0, args:
Read fan2: 0x3da
Read fan4: 0x3e4
**Fan 2 RPM : 986**
**Fan 4 RPM : 996**

Result

The tmargin and rpm values are expected to be within 10% of the desired values.
If the tmargin value falls between two entries in the Fan Table above, the rpm value is obtained by direct interpolation between them.
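
The 10% check can be applied to captured show_fanrpm output. This is a minimal sketch. The desired RPM is supplied by the caller (for example, from the Fan Table), and the function names are illustrative.

```python
import re


def parse_fan_rpm(console_text):
    """Map fan number to RPM from 'Fan N RPM : value' lines."""
    return {int(fan): int(rpm)
            for fan, rpm in re.findall(r"Fan (\d+) RPM : (\d+)", console_text)}


def rpm_in_limits(actual_rpm, desired_rpm, tolerance=0.10):
    """True when the measured RPM is within +/- tolerance of the desired RPM."""
    return abs(actual_rpm - desired_rpm) <= tolerance * desired_rpm


sample = """Read fan2: 0x3da
Read fan4: 0x3e4
Fan 2 RPM : 986
Fan 4 RPM : 996
"""
readings = parse_fan_rpm(sample)
print({fan: rpm_in_limits(rpm, 1000) for fan, rpm in readings.items()})
```

The raw values in the sample agree with the decoded ones: 0x3da is 986 and 0x3e4 is 996.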

DRAM-ECC

Over time (typically after many years of usage), DRAM hardware degrades and starts reporting errors. ECC protection helps detect these DRAM errors. When DRAM ECC is enabled, every 32 bytes of data is protected with 2 bytes of ECC. On every data write, a new ECC is calculated and the 2 ECC bytes are written. On every data read, the 2 ECC bytes are also read to confirm that the data is not corrupted (expected ECC equals calculated ECC). In case of a mismatch, the error is reported by the MSS hardware.
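
The 32-byte to 2-byte ratio above gives the ECC storage overhead directly. This is a short arithmetic sketch. Rounding a partial 32-byte block up to a full block is an assumption, and the function names are illustrative.

```python
DATA_BYTES_PER_BLOCK = 32   # data bytes protected by one ECC word
ECC_BYTES_PER_BLOCK = 2     # ECC bytes stored per 32-byte block


def ecc_bytes_for(data_bytes):
    """ECC bytes needed to protect data_bytes of data."""
    blocks = -(-data_bytes // DATA_BYTES_PER_BLOCK)  # ceiling division
    return blocks * ECC_BYTES_PER_BLOCK


def ecc_overhead_fraction():
    """ECC bytes per data byte."""
    return ECC_BYTES_PER_BLOCK / DATA_BYTES_PER_BLOCK


print(ecc_overhead_fraction())
```

The overhead is 2/32 = 0.0625, that is, 6.25% of the protected data size.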

Bad Page Partition

A bad page partition is a dedicated partition in QSPI that stores the list of bad pages; it is not part of any boot chain. During Coldboot, bad pages from this partition are used by MB1 (to skip these pages while allocating carveouts) and by UEFI (to skip these pages in physical memory handling for Kernel).

Page Retirement Flow (PRL)

Page retirement (PRL) flow is triggered in the event of DRAM ECC uncorrected errors. PRL flow is also triggered during the very first boot just after flashing.

Verification Steps

Enable injection and flash safety build:


            
            # Enable ECC Injection flag
cd <Linux_for_Tegra>
vim bootloader/tegra234-mb1-bct-dram-ecc-l4t.dtsi
# Enable the injection configuration
enable_dram_error_injection = <1>;
# flash igx-devkit/igx-safety

MB1 should read the ECC region correctly and load the bad-page binary successfully.


            
            [0000.401] I> Task: Load Page retirement list
[0000.405] I> Slot: 0
[0000.407] I> Binary[4] block-125952 (partition size: 0x80000)
[0000.413] I> Binary name: DRAM bad page list (P)
[0000.417] I> Size of crypto header is 8192
[0000.421] I> Size of crypto header is 8192
[0000.425] I> strt_pg_num(125952) num_of_pgs(16) read_buf(0x40050000)
[0000.431] I> BCH of DRAM bad page list (P) read from storage
[0000.437] I> BCH address is : 0x40050000
[0000.441] I> component binary type is 4
[0000.444] I> DRAM bad page list (P) header integrity check is success
[0000.451] I> Binary magic in BCH component 0 is BINF
[0000.456] I> component binary type is 4
[0000.459] I> component binary type is 4
[0000.463] I> Size of crypto header is 8192
[0000.467] I> component binary type is 4
[0000.471] I> strt_pg_num(125968) num_of_pgs(8) read_buf(0x40040000)
[0000.477] I> DRAM bad page list (P) binary is read from storage
[0000.483] I> DRAM bad page list (P) binary integrity check is success
**[0000.489] I> Binary DRAM bad page list (P) loaded successfully at 0x40040000 (0x1000)**
[0000.499] I> Task: SDRAM params override
[0000.503] I> Task: Save mem-bct info
[0000.506] I> Task: Carveout allocate
[0000.510] I> RCM blob carveout will not be allocated
[0000.515] I> Update CCPLEX IST carveout from MB1-BCT
**[0000.519] I> ECC region[0]: Start:0x80000000, End:0xe80000000**
[0000.525] I> ECC region[1]: Start:0x0, End:0x0
[0000.529] I> ECC region[2]: Start:0x0, End:0x0
[0000.534] I> ECC region[3]: Start:0x0, End:0x0
[0000.538] I> ECC region[4]: Start:0x0, End:0x0
[0000.542] I> Non-ECC region[0]: Start:0x0, End:0x0
[0000.547] I> Non-ECC region[1]: Start:0x0, End:0x0
[0000.551] I> Non-ECC region[2]: Start:0x0, End:0x0
[0000.556] I> Non-ECC region[3]: Start:0x0, End:0x0
[0000.561] I> Non-ECC region[4]: Start:0x0, End:0x0

Note down the carveout 49 base address.


            
            [0000.795] I> allocated(CO:50) base:0xe2c600000 size:0x200000 align: 0x100000
[0000.802] I> allocated(CO:52) base:0xe2cdc0000 size:0x30000 align: 0x10000
[0000.808] I> allocated(CO:48) base:0xe2cda0000 size:0x20000 align: 0x10000
[0000.815] I> allocated(CO:69) base:0xe2cd80000 size:0x20000 align: 0x10000
**[0000.822] I> allocated(CO:49) base:0xe2cd70000 size:0x10000 align: 0x10000**
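
The SBE and DBE tests in this section read from the carveout 49 base plus 0x1000 and plus 0x8000, respectively. This is a minimal sketch that extracts the base from a captured MB1 log and computes both addresses. The pattern is derived from the log lines above, and the function names are illustrative.

```python
import re

# Matches MB1 lines such as "allocated(CO:49) base:0xe2cd70000 size:0x10000 ...".
CARVEOUT = re.compile(r"allocated\(CO:(\d+)\) base:(0x[0-9a-fA-F]+)")

SBE_OFFSET = 0x1000   # offset used by the SBE test
DBE_OFFSET = 0x8000   # offset used by the DBE test


def carveout_base(mb1_log, carveout_id=49):
    """Return the base address of the given carveout, or None if not found."""
    for match in CARVEOUT.finditer(mb1_log):
        if int(match.group(1)) == carveout_id:
            return int(match.group(2), 16)
    return None


def injection_addresses(base):
    """Return the addresses to read for the SBE and DBE tests."""
    return {"sbe": base + SBE_OFFSET, "dbe": base + DBE_OFFSET}


log = "[0000.822] I> allocated(CO:49) base:0xe2cd70000 size:0x10000 align: 0x10000"
print({name: hex(addr) for name, addr in injection_addresses(carveout_base(log)).items()})
```

For the base 0xe2cd70000 shown above, the SBE address is 0xe2cd71000 and the DBE address is 0xe2cd78000.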

SBE Testing

Read ECC injected address at <Carveout 49 base> + 0x1000 from kernel console.


            
            ubuntu@jetson:~$ sudo ./devmem2 0xe2cd71000 w

Read the EMC channel status registers one by one from FSI-console to find the channel where ECC SBE count is increased.


            
            FSI-SHELL>readmemory 0x02c70ac4
20000000
FSI-SHELL>readmemory 0x02c80ac4
20000000
FSI-SHELL>readmemory 0x02c90ac4
20000000
FSI-SHELL>readmemory 0x02ca0ac4
20000000
FSI-SHELL>readmemory 0x02cb0ac4
20000000
FSI-SHELL>readmemory 0x02cc0ac4
20000000
FSI-SHELL>readmemory 0x02cd0ac4
20000000
FSI-SHELL>readmemory 0x02ce0ac4
20000000
FSI-SHELL>readmemory 0x01780ac4
20010100 **← Example only;**
FSI-SHELL>readmemory 0x01790ac4
20000000
FSI-SHELL>readmemory 0x017a0ac4
20000000
FSI-SHELL>readmemory 0x017b0ac4
20000000
FSI-SHELL>readmemory 0x017c0ac4
20000000
FSI-SHELL>readmemory 0x017d0ac4
20000000
FSI-SHELL>readmemory 0x017e0ac4
20000000
FSI-SHELL>readmemory 0x017f0ac4
20000000
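
One way to find the affected channel is to capture the register readings before and after the injected read and compare them. This is a minimal sketch. It assumes the console text has the form shown above (a readmemory command followed by a line with the value) and does not interpret the bit fields of the register. The function names are illustrative.

```python
def parse_readmemory(console_text):
    """Pair each 'readmemory <addr>' command with the hex value printed after it."""
    results = {}
    pending = None
    for raw in console_text.splitlines():
        line = raw.strip()
        if "readmemory" in line:
            pending = int(line.split("readmemory")[1].split()[0], 16)
        elif pending is not None and line:
            # Only the first token is the value; trailing annotations are ignored.
            results[pending] = int(line.split()[0], 16)
            pending = None
    return results


def changed_channels(before, after):
    """Return register addresses whose value differs between the two captures."""
    return sorted(addr for addr in after if after[addr] != before.get(addr))
```

Any address returned by `changed_channels` identifies a channel status register that changed after the read of the injected address.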

DBE Testing

Read ECC injected address at <Carveout 49 base> + 0x8000 from kernel console.


            
            ubuntu@jetson:~$ sudo ./devmem2 0xe2cd78000 w

FSI should trigger an L1 coldboot and MB2 should enter DRAM-ECC mode. Refer to the above diagram for the detailed flow of DRAM-ECC. Once the DRAM-ECC bad page update succeeds, the corresponding print should be seen on the console.


            
            I> MB2 (version: 0.0.0.0-t234-54845784-c6a05a9f)
I> t234-A01-1-Silicon (0x12347)
**I> Boot-mode : DRAM ECC**
...
...
**I> Read back verify success for primary and secondary blocks**
I> Task: DRAM ECC Mode PMC Reset
**I> Triggering PMC_RESET**

In the next boot cycle, when the MB1 reads the bad page partition, it should contain the retired bad page.


            
            [0000.480] I> DRAM bad page list (P) binary is read from storage
[0000.485] I> DRAM bad page list (P) binary integrity check is success
[0000.492] I> Binary DRAM bad page list (P) loaded successfully at 0x40040000 (0x1000)
**[0000.502] I> bad page addr 0: 0xe2cd70000**
[0000.506] I> Task: SDRAM params override
[0000.510] I> Task: Save mem-bct info
[0000.513] I> Task: Carveout allocate
[0000.517] I> RCM blob carveout will not be allocated
[0000.521] I> Update CCPLEX IST carveout from MB1-BCT
[0000.526] I> ECC region[0]: Start:0x80000000, End:0xe80000000
[0000.532] I> ECC region[1]: Start:0x0, End:0x0
[0000.536] I> ECC region[2]: Start:0x0, End:0x0
[0000.540] I> ECC region[3]: Start:0x0, End:0x0
[0000.545] I> ECC region[4]: Start:0x0, End:0x0
[0000.549] I> Non-ECC region[0]: Start:0x0, End:0x0
[0000.553] I> Non-ECC region[1]: Start:0x0, End:0x0
[0000.558] I> Non-ECC region[2]: Start:0x0, End:0x0
[0000.563] I> Non-ECC region[3]: Start:0x0, End:0x0
[0000.567] I> Non-ECC region[4]: Start:0x0, End:0x0

Accordingly, the carveout memory should be adjusted.


            
            [0000.808] I> allocated(CO:52) base:0xe2cdc0000 size:0x30000 align: 0x10000
[0000.815] I> allocated(CO:48) base:0xe2cda0000 size:0x20000 align: 0x10000
[0000.822] I> allocated(CO:69) base:0xe2cd80000 size:0x20000 align: 0x10000
**[0000.829] I> allocated(CO:49) base:0xe2cd60000 size:0x10000 align: 0x10000 ← Was previously 0xe2cd70000**

Kernel memory map entries should exclude the detected bad page and carveout 49.


            
            [ 0.000000] node 0: [mem 0x0000000080000000-0x00000000fffdffff]
[ 0.000000] node 0: [mem 0x00000000fffe0000-0x00000000ffffffff]
[ 0.000000] node 0: [mem 0x0000000100000000-0x0000000e18f95fff]
[ 0.000000] node 0: [mem 0x0000000e18f96000-0x0000000e1922bfff]
[ 0.000000] node 0: [mem 0x0000000e1922c000-0x0000000e2670ffff]
[ 0.000000] node 0: [mem 0x0000000e26710000-0x0000000e2864ffff]
[ 0.000000] node 0: [mem 0x0000000e28650000-0x0000000e2c5fffff]
[ 0.000000] node 0: [mem 0x0000000e2c600000-0x0000000e2c7fffff]
**[ 0.000000] node 0: [mem 0x0000000e2c800000-0x0000000e2cd5ffff] ← Does not contain 0xe2cd60000 and 0xe2cd70000 pages**
[ 0.000000] node 0: [mem 0x0000000e32000000-0x0000000e33ffffff]
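
The exclusion can be checked against the kernel memory map lines. This is a minimal sketch. The 0x10000 page size is an assumption based on the carveout size and alignment in the MB1 logs above, and the function names are illustrative.

```python
import re

# Matches kernel lines such as "node 0: [mem 0x...-0x...]".
MEM_RANGE = re.compile(r"\[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)\]")


def parse_ranges(kernel_log):
    """Return the (start, end) address pairs of all memory map entries."""
    return [(int(start, 16), int(end, 16))
            for start, end in MEM_RANGE.findall(kernel_log)]


def page_is_mapped(ranges, page_addr, page_size=0x10000):
    """True if any byte of the page overlaps a memory map entry."""
    page_end = page_addr + page_size - 1
    return any(start <= page_end and page_addr <= end for start, end in ranges)


log = """[ 0.000000] node 0: [mem 0x0000000e2c800000-0x0000000e2cd5ffff]
[ 0.000000] node 0: [mem 0x0000000e32000000-0x0000000e33ffffff]
"""
ranges = parse_ranges(log)
print(page_is_mapped(ranges, 0xe2cd60000), page_is_mapped(ranges, 0xe2cd70000))
```

Both calls should report that the page is not mapped when the retired page and carveout 49 have been excluded as described.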