Safety Features
The error reporting use case is demonstrated by the demo application nvDemoAppSwErr, which runs on CCPLEX and demonstrates usage of the Error Propagation Library to report errors.
Validation Steps
The app is located on CCPLEX at /opt/nvidia/ccplex_sf/error_propagation/nvDemoAppSwErr.
Enter sudo mode.
Set runtime library path as follows:
export LD_LIBRARY_PATH=/opt/nvidia/ccplex_sf/error_propagation/:$LD_LIBRARY_PATH
Execute nvDemoAppSwErr.
Select the option s to report Software Errors. After selecting s, enter the details for the software error.
Enter the ReporterId.
Enter the Error Code.
Enter the Error Attribute.
Upon successful reporting of the software error, nvDemoAppSwErr displays the reported software error details, and the error also appears on the SMCU console.
Example
root@IGX-ubuntu:/opt/nvidia/ccplex_sf/error_propagation# ./nvDemoAppSwErr
DriveOS_ErrPropagationDemoApp main menu
_____________________________________________
[m] Display the menu
[s] Report Software Error
[q] Terminate the app
s
Enter the Software Error Reporter ID (0x0000 - 0xFFFF)
8051
Enter the Software Error Code (0x00000000 - 0xFFFFFFFF)
12
Enter the Software Error Attribute (0x00000000 - 0xFFFFFFFF)
1
Reported SW Error
Software Error ReporterId : 0x8051
Software Errror ErrorCode : 0x0012
Software Error Attribute : 0x00000001
DriveOS_ErrPropagationDemoApp main menu
_____________________________________________
[m] Display the menu
[s] Report Software Error
[q] Terminate the app
q
Result
The reported error should be displayed on the SMCU console.
MCU_FOH: ErrReport: ErrorCode-0x12 ReporterId-0x8051 Error_Attribute-0x1 Timestamp-0x8387aa57
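To compare the SMCU console line with the values entered in the demo app, the report line can be parsed with a short script. The following is a minimal sketch in Python: the function name `parse_err_report` is hypothetical, and the field layout is inferred only from the single sample line above, not from a documented log format.

```python
import re

# Pattern inferred from the sample SMCU console line above (an assumption,
# not a documented log format).
ERR_REPORT = re.compile(
    r"ErrReport: ErrorCode-0x(?P<code>[0-9a-fA-F]+) "
    r"ReporterId-0x(?P<reporter>[0-9a-fA-F]+) "
    r"Error_Attribute-0x(?P<attr>[0-9a-fA-F]+) "
    r"Timestamp-0x(?P<ts>[0-9a-fA-F]+)"
)


def parse_err_report(line):
    """Return the hex fields of an ErrReport line as integers, or None."""
    match = ERR_REPORT.search(line)
    if match is None:
        return None
    return {name: int(value, 16) for name, value in match.groupdict().items()}


line = ("MCU_FOH: ErrReport: ErrorCode-0x12 ReporterId-0x8051 "
        "Error_Attribute-0x1 Timestamp-0x8387aa57")
report = parse_err_report(line)
# In the example session the inputs 8051, 12 and 1 are treated as hexadecimal.
print(report["reporter"] == 0x8051, report["code"] == 0x12, report["attr"] == 0x1)
```

The three comparisons print `True` when the console line matches what was typed into the demo app.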
ESTOP is the Emergency Stop functionality, which turns off the system when a critical safety-related issue is seen. There are two ESTOP GPIO pins: ESTOP_EN1 and ESTOP_EN2. Buttons can be attached to these GPIO pins and used to power off the NVIDIA IGX Orin Developer Kit.
Below are the hardware details for ESTOP-INX:
![estop-features.png](https://docscontent.nvidia.com/dims4/default/1addeaf/2147483647/strip/true/crop/610x321+0+0/resize/610x321!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-cb71-d700-a7ef-fbf5449f0000%2Figx-orin%2Fsep-developer-guide%2Flatest%2F_images%2Festop-features.png)
On the PCB, these pins are laid out as jumper headers Header1 and Header2, each of which can be replaced with a button.
![estop-jumpers.png](https://docscontent.nvidia.com/dims4/default/2a390c8/2147483647/strip/true/crop/610x305+0+0/resize/610x305!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-cb71-d700-a7ef-fbf5449f0000%2Figx-orin%2Fsep-developer-guide%2Flatest%2F_images%2Festop-jumpers.png)
Implementation Steps
ESTOP_IN1 has a higher priority than ESTOP_IN2.
Power-off triggers will happen only when the NVIDIA IGX Orin Developer Kit is in the power-on state.
Validation Steps
Make sure the NVIDIA IGX Orin Developer Kit is in the Power-on state.
Short the ESTOP_PIN1/ESTOP_PIN2 GPIO pin (or press the button, if one is attached as per the schematic above).
Expect a power-off sequence on the SMCU side. (There will not be any logs on the NVIDIA IGX Orin Developer Kit side.)
Result
What is the expected result?
If the NVIDIA IGX Orin Developer Kit is in the power-off state (fans/LED will turn OFF), verification is successful; otherwise, the verification has FAILED.
There is an SPI-ROM on the NVIDIA IGX Orin Developer Kit, which can be used to store hardware version details.
Validation Steps
Steps or commands to test the feature:
inforomflash
inforomdump
Example:
inforomdump
Result
What is the expected result?
NvShell>inforomdump
Info: Executing cmd: inforomdump, argc: 0, args:
4D 4F 52 03 00 58 00 D6 E0 0F
19 04 21 20 53 59 53 58 00 4B
57 52 88 01 FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF FF FF
FF FF FF FF FF FF FF FF 53 59
53 05 00 30 01 07 19 04 21 20
30 30 30 2D 31 30 30 30 2D 33
36 36 33 36 2D 30 37 36 00 00
00 00 00 00 41 4E 33 34 30 30
34 36 31 32 39 30 38 35 31 00
00 00 30 00 00 00 00 00 FF FF
FF FF 30 30 30 2D 31 30 30 30
2D 33 36 36 33 36 2D 39 39 36
00 00 00 00 00 00 00 41 41 4E
33 34 30 30 34 36 31 32 39 30
38 35 31 00 00 00 00 00 05 00
BC B9 4C 2D B0 48 BD B9 4C 2D
B0 48 BE B9 4C 2D B0 48 BF B9
4C 2D B0 48 C0 B9 4C 2D B0 48
C1 B9 4C 2D B0 48 C2 B9 4C 2D
B0 48 C3 B9 4C 2D B0 48 C4 B9
4C 2D B0 48 C5 B9
Command Executed
NvShell>
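Parts of the dump are plain ASCII, so the readable fields can be picked out by decoding the bytes. The following is a minimal sketch in Python: the helper name `printable_strings` is hypothetical, and it makes no assumption about the InfoROM field layout. It only extracts runs of printable characters.

```python
def printable_strings(hex_dump, min_len=4):
    """Return runs of printable ASCII characters found in a hex dump."""
    data = bytes(int(token, 16) for token in hex_dump.split())
    runs, current = [], []
    for byte in data:
        if 0x20 <= byte <= 0x7E:
            current.append(chr(byte))
        else:
            if len(current) >= min_len:
                runs.append("".join(current))
            current = []
    if len(current) >= min_len:
        runs.append("".join(current))
    return runs


# Two rows copied from the inforomdump output above.
sample = """
30 30 30 2D 31 30 30 30 2D 33
36 36 33 36 2D 30 37 36 00 00
"""
print(printable_strings(sample))  # ['000-1000-36636-076']
```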
SMCU-PMIC is a power supply regulator for the MCU. The PMIC provides a robust, safe, and reliable voltage supply for the AURIX microcontroller. The PMIC is rated ASIL-D for functional safety when supplying the AURIX microcontroller and the other loads in the AURIX system.
![smcu-pmic-flow.png](https://docscontent.nvidia.com/dims4/default/1138bf2/2147483647/strip/true/crop/952x425+0+0/resize/952x425!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-cb71-d700-a7ef-fbf5449f0000%2Figx-orin%2Fsep-developer-guide%2Flatest%2F_images%2Fsmcu-pmic-flow.png)
Validation Steps
Testing the SMCU PMIC directly is not recommended, because the SMCU cannot be relied on to work if the PMIC misbehaves. If the SMCU boots successfully, the SMCU PMIC voltage regulator is working as expected.
The Safe-Shutdown feature allows CCPLEX applications to shut down IGX gracefully. This feature sends a “graceful” shutdown command to CCPLEX client applications over SPI. Upon receiving the command, the CCPLEX-Client application initiates the NVIDIA IGX Orin Developer Kit software shutdown sequence.
Validation Steps
Example of steps or commands to test the feature:
poweroff safeshutdown
You need to open the SMCU console in BMC, then run the command. The commands are in the BMC User Guide.
Result
What is the expected result?
Below, we have the printout with safeshutdown.
NvShell>poweroff safeshutdown
Info: Executing cmd: poweroff, argc: 1, args: safeshutdown
NvShell>INFO: MCU_PLTFPWRMGR: Powering off
INFO: AURIX_CCPLEX_COM: Telemetry Disable!
INFO: MCU_PLTFPWRMGR: VRS11 PG Monitoring disable.
INFO: NVMCU_ORINPWRCTRL: Wait for Safe Shutdown notification (20s max)
ERROR: NvMCU_OrinVMON: VMON XA Notification
ERROR: NvMCU_OrinVMON: UV fault has occurred !
ERROR: NvMCU_OrinVMON: UV fault has occurred !
ERROR: NvMCU_OrinVMON: UV fault has occurred !
INFO: NvMCU_OrinVMON: Get VRS11 Status..
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x10: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x10: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x11: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x11: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x12: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x12: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x13: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x13: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x14: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x14: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x15: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x15: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x20 Register-0x16: 0x0
INFO: NvMCU_OrinVMON: VRS11@0x22 Register-0x16: 0x0
INFO: NvMCU_OrinVMON: Get VRS10 Status..
INFO: NvMCU_OrinVMON: INT_SRC1: 0x0
INFO: NvMCU_OrinVMON: INT_SRC2: 0x0
INFO: NvMCU_OrinVMON: INT_VENDOR: 0x0
ERROR: MCU_ERRHANDLER: OrinVMON : ReportedID - 0x810B ErrorCode - 0xB
INFO: NVMCU_ORINPWRCTRL: Safe-Shutdown done!
INFO: SftyMon_IoHwAbs: Force Shutdown asserted
INFO: NVMCU_ORINPWRCTRL: P3740 Poweroff sequence start
INFO: NVMCU_ORINPWRCTRL: CARRIER_POWER_ON low - PASS
INFO: NVMCU_ORINPWRCTRL: ATX_PG low - PASS
INFO: NvMCU_OrinVMON: VMON Power-down sequence is ok !
INFO: AURIX_CCPLEX_COM: Telemetry enable/disable request already taken!
INFO: MCU_PLTFPWRMGR: Orin TMON disable.
INFO: MCU_PLTFPWRMGR: Board TMON disable.
INFO: MCU_PLTFPWRMGR: Power down sequence is complete !
Command Executed
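When the console output is captured to a file, the key milestones of the shutdown can be checked automatically. The following is a minimal sketch in Python: the marker strings are copied from the sample log above, and whether every software release prints exactly these lines is an assumption. The function name `markers_in_order` is hypothetical.

```python
# Marker strings copied from the sample log above (assumed to be stable).
SHUTDOWN_MARKERS = [
    "Wait for Safe Shutdown notification",
    "Safe-Shutdown done!",
    "Power down sequence is complete",
]


def markers_in_order(log_text, markers=SHUTDOWN_MARKERS):
    """Return True if every marker occurs in the log, in the given order."""
    position = 0
    for marker in markers:
        found = log_text.find(marker, position)
        if found < 0:
            return False
        position = found + len(marker)
    return True


captured = """INFO: NVMCU_ORINPWRCTRL: Wait for Safe Shutdown notification (20s max)
INFO: NVMCU_ORINPWRCTRL: Safe-Shutdown done!
INFO: MCU_PLTFPWRMGR: Power down sequence is complete !
"""
print(markers_in_order(captured))
```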
IST-Manager runs on the Aurix-P3740 platform. It sets the IST-Trigger configuration and monitors the IST-DONE GPIO pins. Once IST completes and Orin is in a power-off state, IST-Manager wakes up Orin by issuing a poweron sequence.
![ist-manager.png](https://docscontent.nvidia.com/dims4/default/3e2cbf8/2147483647/strip/true/crop/610x151+0+0/resize/610x151!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-cb71-d700-a7ef-fbf5449f0000%2Figx-orin%2Fsep-developer-guide%2Flatest%2F_images%2Fist-manager.png)
Implementation Details
There are commands provided for the user to set an IST-Trigger configuration; in the current configuration, IST is triggered once the poweron sequence has been initiated.
Once the configuration is set and the poweron sequence is initiated, IST triggers, and IST-Manager monitors the IST-Done GPIO pins to make sure IST is completed.
After the IST-Done GPIO pin toggles, IST-Manager calls the IST-Callback notification and prints information about IST.
After the IST-Done notification, IST-Manager initiates the Orin poweron sequence. You can then issue the getistinfo/getistresult command to get IST-Result data from IST-Client.
Based on the IST-Result data, the IST-Result handler is called. The current implementation of the handler only prints messages about IST-Fail/Pass.
Validation Steps
1. setistconfig 0x1 // to enable IST-Trigger
2. poweroff
3. poweron (IST will trigger in this stage)
4. wait for IST-Completion notification and then SMCU will poweron Orin.
5. getistinfo (should be called once Orin is up)
6. setistconfig 0x0 // to disable IST-Trigger
Result
What is the expected result?
getistinfo/getistresult should show IST-FAIL/IST-PASS
There are four CAN controllers in Aurix, but only CAN1 has been enabled. CAN1 has been validated by connecting Aurix-CAN1 to Orin-CAN0.
This diagram shows an example of a correct connection:
![can-connection-diagram.png](https://docscontent.nvidia.com/dims4/default/b7ea9ce/2147483647/strip/true/crop/610x145+0+0/resize/610x145!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-cb71-d700-a7ef-fbf5449f0000%2Figx-orin%2Fsep-developer-guide%2Flatest%2F_images%2Fcan-connection-diagram.png)
And below is a wire connection image:
![can-wire-connection-image.png](https://docscontent.nvidia.com/dims4/default/2e75683/2147483647/strip/true/crop/610x373+0+0/resize/610x373!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-cb71-d700-a7ef-fbf5449f0000%2Figx-orin%2Fsep-developer-guide%2Flatest%2F_images%2Fcan-wire-connection-image.png)
![can-wire-connection-image2.png](https://docscontent.nvidia.com/dims4/default/e6a2705/2147483647/strip/true/crop/610x813+0+0/resize/610x813!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-cb71-d700-a7ef-fbf5449f0000%2Figx-orin%2Fsep-developer-guide%2Flatest%2F_images%2Fcan-wire-connection-image2.png)
Validation Steps
The following command should be run from SMCU:
cycliccanon CAN # this will transmit continuous CAN Data (0x119) over CAN1
cycliccanoff CAN # this will stop transmitting.
The below commands need to be run on the CCPLEX side.
Make sure CAN nodes are enabled in kernel-dtb:
cat /proc/device-tree/bus\@0/mttcan\@c310000/status
cat /proc/device-tree/bus\@0/mttcan\@c320000/status
Pinmux settings:
CAN1_DIN: ./devmem2 0x0c303008 w 0x458
CAN1_DOUTM: ./devmem2 0x0c303000 w 0x400
Install CAN modules:
modprobe can
modprobe can_raw
modprobe mttcan
Setup interface properties:
ip link set can0 up type can bitrate 500000
ip link set can1 up type can bitrate 500000
Receive CAN Packet; K5.15 has CAN0 mapped to CAN1, so CAN data will appear on CAN1 instead of CAN0:
candump -x any &
Result
What is the expected result?
The command “candump -x any &” should show CAN data as 0x119 transmitted from SMCU.
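As an alternative to reading candump output by eye, frames can be received through the Linux SocketCAN raw socket interface and filtered in a script. The following is a minimal sketch in Python under two assumptions: that 0x119 is the CAN identifier of the frames sent by cycliccanon (the text above only says "CAN Data (0x119)"), and that the traffic arrives on `can1` as described for K5.15. The function names and the `max_frames` parameter are hypothetical.

```python
import socket
import struct

# Layout of the Linux struct can_frame: 32-bit ID, 8-bit length,
# 3 padding bytes, 8 data bytes (16 bytes in total).
CAN_FRAME = struct.Struct("=IB3x8s")
CAN_EFF_MASK = 0x1FFFFFFF  # strips the flag bits from the ID field
EXPECTED_ID = 0x119        # assumption: 0x119 is the CAN ID sent by SMCU


def parse_can_frame(raw):
    """Split a raw SocketCAN frame into (can_id, payload)."""
    can_id, length, data = CAN_FRAME.unpack(raw)
    return can_id & CAN_EFF_MASK, data[:length]


def wait_for_expected_frame(interface="can1", timeout_s=5.0, max_frames=100):
    """Read up to max_frames frames and return the first matching payload.

    Linux only. Returns None if no matching frame is seen; raises
    socket.timeout if a single receive waits longer than timeout_s.
    """
    with socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW) as sock:
        sock.settimeout(timeout_s)
        sock.bind((interface,))
        for _ in range(max_frames):
            can_id, payload = parse_can_frame(sock.recv(CAN_FRAME.size))
            if can_id == EXPECTED_ID:
                return payload
    return None
```

Calling `wait_for_expected_frame()` on the CCPLEX side, after the interfaces are up and cycliccanon is running on SMCU, should return a payload if the wiring and pinmux are correct.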
There are two TMP451 temperature sensors present on the IGX Developer Kit: one TMP451 is on Orin, and the other is on P3740. The TMON feature monitors both TMP451 sensors and takes action based on their values; the action could be to shut down Orin.
Implementation Details
This includes TMON OV_HF/LF and UV_HF/LF tests:
Program the high-limit/low-limit temperature value and initialize TMON for further processing.
Check whether the SOC_THERM and SOC_ALERT GPIO pins are HIGH. If they are HIGH, the Orin temperature sensor is initialized; otherwise, it fails to initialize.
Program Orin-TMP451 to the threshold values and check that the SOC_ALERT/SOC_THERM pins go LOW. If the pins are LOW, the test passes; otherwise, it fails.
The above steps also apply to SMCU-TMP451.
Validation Steps
After the poweron command is run on SMCU, the below prints should appear:
NvShell>poweron
Info: Executing cmd: poweron, argc: 0, args:
NvShell>INFO: MCU_PLTFPWRMGR: Powering up
INFO: NVMCU_ORINPWRCTRL: P3740 Poweron sequence start
INFO: NVMCU_ORINPWRCTRL: ATX_PG high - PASS
INFO: PLTFPWRMGR_IOHWABS: PG Signal Passed
INFO: NVMCU_ORINPWRCTRL: FUNC_NIRQ continuous monitoring Enabled!
ERROR: MCU_ERRHANDLER: OrinPwrCtrl : ReporterID - 0x810D ErrorCode - 0xE
INFO: SMCU_INPUTS: Tegra Reset requested
ERROR: MCU_PLTFPWRMGR: Tegra Reset Request failed - Preconditions not met(0x504)
INFO: NvMCU_OrinVMON: VMON rail seq on order was ok!
INFO: NvMCU_OrinTMON: toggle check of local and remote sensor successfull
INFO: NvMCU_OrinTMON: Orin Temperature sensor initialized
INFO: MCU_PLTFPWRMGR: Orin TMON enabled ....
INFO: NVMCU_ORINPWRCTRL: Tegra x1 Boot Chain: A
INFO: SftyMon_tmon: Board Temperature sensor initialized
INFO: MCU_PLTFPWRMGR: Board TMON enabled.
INFO: NVMCU_ORINPWRCTRL: Carrier_POWER_ON high - PASS
MCU_FOH: MCU FOH : Initiate SOC Error Pin Monitoring & SPI communication
INFO: PLTFPWRMGR_IOHWABS: SW Powerup, Wait for Tegra UP or timeout: (180)s
NvShell>
Result
What is the expected result?
If the prints from the “Validation Steps” above appear, TMON passes. If any TMON-related failure is printed, consider it a TMON failure.
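The TMON-related lines can be picked out of a captured poweron log with a short check. The following is a minimal sketch in Python: the marker strings are copied from the sample output in the validation steps, and treating exactly these four lines as the TMON pass criteria is an assumption. The function name `missing_tmon_markers` is hypothetical.

```python
# TMON lines copied from the sample poweron log (assumed pass criteria).
TMON_MARKERS = (
    "Orin Temperature sensor initialized",
    "Orin TMON enabled",
    "Board Temperature sensor initialized",
    "Board TMON enabled",
)


def missing_tmon_markers(log_text):
    """Return the TMON marker strings that do not appear in the log."""
    return [marker for marker in TMON_MARKERS if marker not in log_text]


captured = """INFO: NvMCU_OrinTMON: Orin Temperature sensor initialized
INFO: MCU_PLTFPWRMGR: Orin TMON enabled ....
INFO: SftyMon_tmon: Board Temperature sensor initialized
INFO: MCU_PLTFPWRMGR: Board TMON enabled.
"""
print(missing_tmon_markers(captured))  # []
```

An empty list means all four lines were found; any returned entries name the prints that are missing.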
There are three VRS12 voltage monitors present on the Orin platform. SMCU communicates with these VRS devices over I2C. SMCU programs them for OV_HF/LF and UV_HF/LF, then monitors the NIRQ pin for any OV/UV error. The implementation details in SMCU software are provided below:
Initial configuration and setup for VMON, reading the results of the hardware BIST executed by VMON.
Programming the VMON chip with thresholds for UV/OV detection. Configurable thresholds are provided separately for functional mode, IST mode and SC7 mode.
Continuous monitoring of VMON NIRQ pin indicating UV/OV and reporting the detected errors to the Error handler module.
Monitors the power-up and power-down sequences for NVIDIA DRIVE Orin SoC rails.
Hardware Safety requirements from VMON HSIs and SISRs are supported in software. All the detected failures are reported to the customer Error Handler module.
The following error scenarios are detected:
VMON Chip internal errors are detected by BIST failures.
I2C communication failures during read/write operations are detected by CRC.
Errors during the programming of VMON chips are detected by reading back the values written to the VMON registers.
Voltage rails are checked for plausibility by reading the actual voltage from the VMON ADC and checking for deviation from the expected levels.
OV/UV power-up test is executed as per the safety manual of VMON and hardware failures are detected.
Power-up and power-down sequences are checked for the specified rails.
The VRS12 device stores the fault information in status registers, which are accessible over I2C. AUTOSAR RTE interfaces are provided to read the status from VMON and clear the faults if any.
Validation Steps
Run the poweron command on the SMCU side and during poweron, check if the below prints appear:
INFO: NvMCU_OrinVMON: VMON rail seq on order was ok!
Result
What is the expected result?
INFO: NvMCU_OrinVMON: VMON rail seq on order was ok!
Heartbeat functionality is added to get Orin’s status after Orin is powered on. SMCU continuously pings Orin every 3 seconds to get its current status over SPI3. Currently, only thermal details are retrieved from Orin; they can be printed with the print_telemetry command in SMCU.
Implementation Details
There is a telemetry-client running on CCPLEX, and this application registers itself to a spi-server that is used to send/receive telemetry data over SPI3 to SMCU.
On the SMCU side, there is a periodic callback function called every 1 second, and it sends commands to retrieve telemetry data from telemetry-client.
If there is any fault on the Orin side and SMCU cannot receive telemetry data, SMCU prints the warning message Orin is in bad state.
There is a print_telemetry command in SMCU to retrieve the temperature for CPU/GPU/SOCX.
Validation Steps
NvShell>print_telemetry
Info: Executing cmd: print_telemetry, argc: 0, args:
INFO: AURIX_CCPLEX_COM: Telemetry Data:
INFO: AURIX_CCPLEX_COM: THERMAL NODE PRESENT = 63
INFO: AURIX_CCPLEX_COM: CPU SOC THERM TEMPERATURE = 53
INFO: AURIX_CCPLEX_COM: GPU SOC THERM TEMPERATURE = 49
INFO: AURIX_CCPLEX_COM: SOC0 THERM TEMPERATURE = 53
INFO: AURIX_CCPLEX_COM: SOC1 THERM TEMPERATURE = 56
INFO: AURIX_CCPLEX_COM: SOC2 THERM TEMPERATURE = 49
INFO: AURIX_CCPLEX_COM: CX7 THERM TEMPERATURE = 46
Result
What is the expected result?
If there are 0 values for CPU/GPU/SOC0/SOC1/SOC2, consider this a FAIL; otherwise, it is a PASS.
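The PASS/FAIL rule can be applied to a captured print_telemetry output with a short script. The following is a minimal sketch in Python: the line format is inferred from the sample output above, and treating a missing or zero reading for any of the five sensors as a FAIL is one interpretation of the rule. The function names are hypothetical.

```python
import re

# Matches lines such as "AURIX_CCPLEX_COM: SOC0 THERM TEMPERATURE = 53".
TEMP_LINE = re.compile(
    r"AURIX_CCPLEX_COM: (?P<name>[A-Z0-9 ]+?) TEMPERATURE = (?P<value>-?\d+)"
)
REQUIRED = ("CPU SOC THERM", "GPU SOC THERM", "SOC0 THERM", "SOC1 THERM", "SOC2 THERM")


def parse_telemetry(log_text):
    """Map each sensor name to its reported temperature."""
    return {m["name"]: int(m["value"]) for m in TEMP_LINE.finditer(log_text)}


def telemetry_passes(log_text):
    """FAIL (False) if any required sensor is missing or reports 0."""
    temps = parse_telemetry(log_text)
    return all(temps.get(name, 0) != 0 for name in REQUIRED)


captured = """INFO: AURIX_CCPLEX_COM: Telemetry Data:
INFO: AURIX_CCPLEX_COM: THERMAL NODE PRESENT = 63
INFO: AURIX_CCPLEX_COM: CPU SOC THERM TEMPERATURE = 53
INFO: AURIX_CCPLEX_COM: GPU SOC THERM TEMPERATURE = 49
INFO: AURIX_CCPLEX_COM: SOC0 THERM TEMPERATURE = 53
INFO: AURIX_CCPLEX_COM: SOC1 THERM TEMPERATURE = 56
INFO: AURIX_CCPLEX_COM: SOC2 THERM TEMPERATURE = 49
INFO: AURIX_CCPLEX_COM: CX7 THERM TEMPERATURE = 46
"""
print("PASS" if telemetry_passes(captured) else "FAIL")
```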
There is a requirement to send Orin-error data to the FPGA. Based on the received data, FPGA-SW takes further action: it can stop any work that is scheduled or in progress, issue a warning message and continue the task, or pause the running task. The Orin-Error data transmission between SMCU and FPGA happens over I2C1.
Below are the steps for routing Orin-Error data to FPGA:
The FOH module collects all Orin failures from the different SMCU-SW modules and sends a command to the FPGA. The command can be STOP, PAUSE, or RESUME, and it is sent to the FPGA device through an API exported by FOH. The failure sources are:
Aurix-CCPLEX-SW modules, which collect CCPLEX failure information with the help of the Telemetry application and pass STOP/PAUSE/RESUME to the API (defined in the FOH file).
HSI and BPMP failures, which are relayed by FSI to SMCU. These failures are captured in the FOH file, which calls the API to send the STOP command to the FPGA.
ESTOP failure: if the emergency button is pressed, the STOP command is sent via the API to the FPGA device.
The commands are classified as critical, non-critical, and all-good as follows:
Critical command: STOP
Non-Critical commands/SC7-Entry: PAUSE.
Error resolved/SC7-Exit: RESUME
If there is any issue or failure with a CCPLEX application, SMCU sends PAUSE to the FPGA, and the FPGA can pause its task and wait for a RESUME command. The RESUME command is sent once the application has recovered from the failure, at which point the FPGA can resume its task. If the application does not recover within 3 minutes, SMCU sends the STOP command to the FPGA.
During SC7 entry, SMCU will send the PAUSE command to FPGA and while resuming from SC7, SMCU will send the RESUME command. Based on these commands, FPGA can pause/resume tasks.
Customers should program FPGA devices at address 0x65 on I2C1.
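The PAUSE/RESUME/STOP behavior for a CCPLEX application outage can be summarized as a small timing model. The following is a minimal sketch in Python, not the FOH implementation: the numeric command values are taken from the SMCU log messages in this section, the 3-minute limit comes from the description above, and the function name and the simplification that failures are detected instantly are assumptions.

```python
from enum import IntEnum


class FpgaCommand(IntEnum):
    # Values as printed in the SMCU log lines, e.g. "PAUSE(0x2)".
    RESUME = 0x1
    PAUSE = 0x2
    STOP = 0x3


RECOVERY_TIMEOUT_S = 180  # "3 minutes" from the description above


def commands_for_outage(failed_at_s, recovered_at_s=None):
    """Return [(time_s, command)] sent to the FPGA for one application outage.

    Simplification: detection latency is ignored, so PAUSE is sent at the
    moment of failure.
    """
    events = [(failed_at_s, FpgaCommand.PAUSE)]
    deadline = failed_at_s + RECOVERY_TIMEOUT_S
    if recovered_at_s is not None and recovered_at_s <= deadline:
        events.append((recovered_at_s, FpgaCommand.RESUME))
    else:
        events.append((deadline, FpgaCommand.STOP))
    return events


print(commands_for_outage(0, recovered_at_s=10))  # PAUSE, then RESUME
print(commands_for_outage(0))                     # PAUSE, then STOP at 180 s
```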
Validation Steps
Example of steps or commands to test the feature:
Verification can be done only by getting logs in SMCU:
CCPLEX: Stop SPI-Server in CCPLEX:
systemctl stop nv-ccplex-telemetry.service
SMCU: below message will appear within 3 seconds.
MCU_FOH: Sending message to FPGA: PAUSE(0x2)
CCPLEX: Start SPI-Server in CCPLEX:
systemctl start nv-ccplex-telemetry.service
SMCU: below message will appear within 3 seconds.
MCU_FOH: Sending message to FPGA: RESUME(0x1)
CCPLEX: Stop SPI-Server in CCPLEX:
systemctl stop nv-ccplex-telemetry.service
SMCU: The PAUSE message will appear within 3 seconds, and STOP will appear within 3 minutes.
MCU_FOH: Sending message to FPGA: PAUSE(0x2)
MCU_FOH: Sending message to FPGA: STOP(0x3)
Result
What is the expected result?
Below, we have the printout with FPGA. The below messages are expected in the SMCU console:
In case of SC7 entry/Application failure:
MCU_FOH: Sending message to FPGA: PAUSE(0x2)
In case of recovered applications:
MCU_FOH: Sending message to FPGA: RESUME(0x1)
In case of error:
MCU_FOH: Sending message to FPGA: STOP(0x3)
On the NVIDIA IGX Orin Boards Kit, there are two active cooling devices (fans) connected to the fan headers J55 and J64. The fan PWM and tachometer pins are connected to a fan IC, which is interfaced from SMCU via the I2C1 controller (on SMCU). The SMCU application:
Runs an algorithm to tune the fan PWM based on the hardware temperature sensors’ readings.
Reads the RPM reported by the tachometer and fine-tunes the PWM in case the RPM does not meet the thermal requirements.
The important components of the fan control application in SMCU are:
TMARGIN Temperature - The temperature used to control the fan PWM is calculated based on the equation below:
\[\mathrm{tmargin} = \min\bigl(\mathrm{MAXTEMP}(\mathrm{SOC}) - \mathrm{Average}(\mathrm{Temp}(\mathrm{Soctherms})),\ \mathrm{MAXTEMP}(\mathrm{CX7}) - \mathrm{Temp}(\mathrm{CX7})\bigr)\]
Fan Table - The following table provides the fan curve details which the fan control application uses to choose the pwm to set and rpm to check.
| Tmargin | PWM | RPM |
|---|---|---|
| 118 | 0 | 1000 |
| 30 | 0 | 1000 |
| 10 | 40 | 1400 |
| 8 | 105 | 1800 |
| 4 | 255 | 2700 |
| 0 | 255 | 2700 |
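The tmargin equation can be evaluated directly from the telemetry readings. The following is a minimal sketch in Python: the function and parameter names are hypothetical, and the MAXTEMP limits used in the example are made-up values chosen only to keep the arithmetic easy to follow, because this section does not state the real limits.

```python
def tmargin(soc_therm_temps, cx7_temp, max_temp_soc, max_temp_cx7):
    """min(MAXTEMP(SOC) - Average(SoC therms), MAXTEMP(CX7) - Temp(CX7))."""
    soc_margin = max_temp_soc - sum(soc_therm_temps) / len(soc_therm_temps)
    cx7_margin = max_temp_cx7 - cx7_temp
    return min(soc_margin, cx7_margin)


# Hypothetical limits: SoC margin is 100 - 52 = 48, CX7 margin is 95 - 40 = 55.
print(tmargin([50, 54, 52], cx7_temp=40, max_temp_soc=100, max_temp_cx7=95))  # 48.0
```

The smaller of the two margins wins, so whichever device is closer to its limit drives the fan curve.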
State Machine
The different states in the fan control application are:
![fan-control-state-machine.png](https://docscontent.nvidia.com/dims4/default/35b6f6c/2147483647/strip/true/crop/1344x1104+0+0/resize/1344x1104!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-cb71-d700-a7ef-fbf5449f0000%2Figx-orin%2Fsep-developer-guide%2Flatest%2F_images%2Ffan-control-state-machine.png)
Fan Init - Initialize the fan HW with the required configuration and set the default PWM.
Temp Control - Read the Tmargin value and set the corresponding PWM from the Temp-PWM mapping table.
RPM in limits - Based on the data sheet, RPM values are expected to be within a 10% error range. Check if the current RPM is within +/- 10% of the desired RPM.
RPM Stable - Wait for five seconds as the temperature does not change quickly.
Stabilize RPM - RPM is not in limits, so increase or decrease the PWM value to match the RPM.
Stabilize Timeout - If RPM stabilization fails, revert to temperature-based control, because recovery is not in the scope of software in this case.
Validation Steps
print_telemetry command in SMCU console should give you the TMARGIN TEMPERATURE.
NvShell>print_telemetry
Info: Executing cmd: print_telemetry, argc: 0, args:
INFO: AURIX_CCPLEX_COM: Telemetry Data:
INFO: AURIX_CCPLEX_COM: THERMAL NODE PRESENT = 63
INFO: AURIX_CCPLEX_COM: CPU SOC THERM TEMPERATURE = 47
INFO: AURIX_CCPLEX_COM: GPU SOC THERM TEMPERATURE = 43
INFO: AURIX_CCPLEX_COM: SOC0 THERM TEMPERATURE = 46
INFO: AURIX_CCPLEX_COM: SOC1 THERM TEMPERATURE = 49
INFO: AURIX_CCPLEX_COM: SOC2 THERM TEMPERATURE = 42
INFO: AURIX_CCPLEX_COM: CX7 THERM TEMPERATURE = 37
**INFO: AURIX_CCPLEX_COM: TMARGIN TEMPERATURE = 73**
show_fanrpm command in SMCU console should give you the fan rpm values for both the fans.
NvShell>show_fanrpm
Info: Executing cmd: show_fanrpm, argc: 0, args:
Read fan2: 0x3da
Read fan4: 0x3e4
**Fan 2 RPM : 986**
**Fan 4 RPM : 996**
Result
The tmargin and rpm values are expected to be within 10% of the desired values.
If the tmargin value falls between two entries in the Fan Table above, the rpm value is obtained by direct interpolation between those entries.
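The expected RPM for a given tmargin, and the 10% check, can be worked out from the Fan Table. The following is a minimal sketch in Python: it assumes a straight-line calculation between adjacent table rows, which the text does not state explicitly, and the function names are hypothetical.

```python
# (tmargin, pwm, rpm) rows copied from the Fan Table, highest tmargin first.
FAN_TABLE = [(118, 0, 1000), (30, 0, 1000), (10, 40, 1400),
             (8, 105, 1800), (4, 255, 2700), (0, 255, 2700)]


def desired_rpm(tmargin):
    """Desired RPM for a tmargin, computed linearly between table rows."""
    if tmargin >= FAN_TABLE[0][0]:
        return float(FAN_TABLE[0][2])
    if tmargin <= FAN_TABLE[-1][0]:
        return float(FAN_TABLE[-1][2])
    for (t_hi, _, rpm_hi), (t_lo, _, rpm_lo) in zip(FAN_TABLE, FAN_TABLE[1:]):
        if t_lo <= tmargin <= t_hi:
            fraction = (t_hi - tmargin) / (t_hi - t_lo)
            return rpm_hi + fraction * (rpm_lo - rpm_hi)


def rpm_in_limits(measured_rpm, tmargin, tolerance=0.10):
    """True if the measured RPM is within +/- tolerance of the desired RPM."""
    target = desired_rpm(tmargin)
    return abs(measured_rpm - target) <= tolerance * target


# Values from the sample output: TMARGIN 73, fans at 986 and 996 RPM.
print(desired_rpm(73))         # 1000.0
print(rpm_in_limits(986, 73))  # True
print(rpm_in_limits(996, 73))  # True
```

With a tmargin of 73 the two neighboring rows both specify 1000 RPM, so both sample fan readings fall inside the 900 to 1100 RPM window.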
Over a period of time (typically after many years of usage), DRAM hardware will degrade and start reporting errors. ECC protection helps detect these DRAM errors. When DRAM ECC is enabled, every 32 bytes of data is protected with 2 bytes of ECC. On every data write, a new ECC is calculated and the 2 ECC bytes are written. On every data read, the 2 ECC bytes are also read to confirm that the data is not corrupted (the expected ECC equals the calculated ECC). In case of any mismatch, the error is reported by the MSS hardware.
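The storage cost of this scheme follows from the 32-byte to 2-byte ratio. The following is a short worked example in Python: the helper name is hypothetical, and rounding partial blocks up to a full 32-byte word is an assumption made for the illustration.

```python
DATA_BYTES_PER_ECC_WORD = 32  # from the text: every 32 bytes of data ...
ECC_BYTES_PER_WORD = 2        # ... is protected with 2 bytes of ECC


def ecc_bytes_for(data_bytes):
    """ECC bytes needed for a block of data (partial words rounded up)."""
    words = -(-data_bytes // DATA_BYTES_PER_ECC_WORD)  # ceiling division
    return words * ECC_BYTES_PER_WORD


overhead = ECC_BYTES_PER_WORD / DATA_BYTES_PER_ECC_WORD
print(f"{overhead:.2%}")    # 6.25%
print(ecc_bytes_for(4096))  # 256
```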
Bad Page Partition
A bad page partition is a dedicated partition in QSPI that stores the list of bad pages; it is not part of any boot chain. During Coldboot, bad pages from this partition are used by MB1 (to skip these pages while allocating carveouts) and by UEFI (to skip these pages in physical memory handling for Kernel).
Page Retirement Flow (PRL)
Page retirement (PRL) flow is triggered in the event of DRAM ECC uncorrected errors. PRL flow is also triggered during the very first boot just after flashing.
![Page-Retirement-Flow.png](https://docscontent.nvidia.com/dims4/default/30041f5/2147483647/strip/true/crop/825x923+0+0/resize/825x923!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-cb71-d700-a7ef-fbf5449f0000%2Figx-orin%2Fsep-developer-guide%2Flatest%2F_images%2FPage-Retirement-Flow.png)
Verification Steps
Enable injection and flash safety build:
# Enable ECC Injection flag
cd <Linux_for_Tegra>
vim bootloader/tegra234-mb1-bct-dram-ecc-l4t.dtsi
# Enable the injection configuration
enable_dram_error_injection = <1>;
# flash igx-devkit/igx-safety
MB1 should read the ECC region correctly and read the bad-page binary successfully.
[0000.401] I> Task: Load Page retirement list
[0000.405] I> Slot: 0
[0000.407] I> Binary[4] block-125952 (partition size: 0x80000)
[0000.413] I> Binary name: DRAM bad page list (P)
[0000.417] I> Size of crypto header is 8192
[0000.421] I> Size of crypto header is 8192
[0000.425] I> strt_pg_num(125952) num_of_pgs(16) read_buf(0x40050000)
[0000.431] I> BCH of DRAM bad page list (P) read from storage
[0000.437] I> BCH address is : 0x40050000
[0000.441] I> component binary type is 4
[0000.444] I> DRAM bad page list (P) header integrity check is success
[0000.451] I> Binary magic in BCH component 0 is BINF
[0000.456] I> component binary type is 4
[0000.459] I> component binary type is 4
[0000.463] I> Size of crypto header is 8192
[0000.467] I> component binary type is 4
[0000.471] I> strt_pg_num(125968) num_of_pgs(8) read_buf(0x40040000)
[0000.477] I> DRAM bad page list (P) binary is read from storage
[0000.483] I> DRAM bad page list (P) binary integrity check is success
**[0000.489] I> Binary DRAM bad page list (P) loaded successfully at 0x40040000 (0x1000)**
[0000.499] I> Task: SDRAM params override
[0000.503] I> Task: Save mem-bct info
[0000.506] I> Task: Carveout allocate
[0000.510] I> RCM blob carveout will not be allocated
[0000.515] I> Update CCPLEX IST carveout from MB1-BCT
**[0000.519] I> ECC region[0]: Start:0x80000000, End:0xe80000000**
[0000.525] I> ECC region[1]: Start:0x0, End:0x0
[0000.529] I> ECC region[2]: Start:0x0, End:0x0
[0000.534] I> ECC region[3]: Start:0x0, End:0x0
[0000.538] I> ECC region[4]: Start:0x0, End:0x0
[0000.542] I> Non-ECC region[0]: Start:0x0, End:0x0
[0000.547] I> Non-ECC region[1]: Start:0x0, End:0x0
[0000.551] I> Non-ECC region[2]: Start:0x0, End:0x0
[0000.556] I> Non-ECC region[3]: Start:0x0, End:0x0
[0000.561] I> Non-ECC region[4]: Start:0x0, End:0x0
Note down the carveout 49 base address.
[0000.795] I> allocated(CO:50) base:0xe2c600000 size:0x200000 align: 0x100000
[0000.802] I> allocated(CO:52) base:0xe2cdc0000 size:0x30000 align: 0x10000
[0000.808] I> allocated(CO:48) base:0xe2cda0000 size:0x20000 align: 0x10000
[0000.815] I> allocated(CO:69) base:0xe2cd80000 size:0x20000 align: 0x10000
**[0000.822] I> allocated(CO:49) base:0xe2cd70000 size:0x10000 align: 0x10000**
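The carveout 49 base address, and the two injection addresses used in the SBE and DBE tests, can be derived from a captured MB1 log. The following is a minimal sketch in Python: the regular expression is inferred from the log lines above, and the function name `carveout_base` is hypothetical.

```python
import re

# Matches MB1 lines such as "allocated(CO:49) base:0xe2cd70000 ...".
CARVEOUT = re.compile(r"allocated\(CO:(\d+)\) base:(0x[0-9a-fA-F]+)")


def carveout_base(log_text, carveout_id=49):
    """Return the base address of a carveout, or None if it is not listed."""
    for co_id, base in CARVEOUT.findall(log_text):
        if int(co_id) == carveout_id:
            return int(base, 16)
    return None


log = "[0000.822] I> allocated(CO:49) base:0xe2cd70000 size:0x10000 align: 0x10000"
base = carveout_base(log)
print(hex(base + 0x1000))  # 0xe2cd71000, the SBE injection address
print(hex(base + 0x8000))  # 0xe2cd78000, the DBE injection address
```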
SBE Testing
Read ECC injected address at <Carveout 49 base> + 0x1000 from kernel console.
ubuntu@jetson:~$ sudo ./devmem2 0xe2cd71000 w
Read the EMC channel status registers one by one from FSI-console to find the channel where ECC SBE count is increased.
FSI-SHELL>readmemory 0x02c70ac4 20000000
FSI-SHELL>readmemory 0x02c80ac4 20000000
FSI-SHELL>readmemory 0x02c90ac4 20000000
FSI-SHELL>readmemory 0x02ca0ac4 20000000
FSI-SHELL>readmemory 0x02cb0ac4 20000000
FSI-SHELL>readmemory 0x02cc0ac4 20000000
FSI-SHELL>readmemory 0x02cd0ac4 20000000
FSI-SHELL>readmemory 0x02ce0ac4 20000000
FSI-SHELL>readmemory 0x01780ac4 20010100 **← Example only;**
FSI-SHELL>readmemory 0x01790ac4 20000000
FSI-SHELL>readmemory 0x017a0ac4 20000000
FSI-SHELL>readmemory 0x017b0ac4 20000000
FSI-SHELL>readmemory 0x017c0ac4 20000000
FSI-SHELL>readmemory 0x017d0ac4 20000000
FSI-SHELL>readmemory 0x017e0ac4 20000000
FSI-SHELL>readmemory 0x017f0ac4 20000000
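Scanning sixteen register reads by eye is error-prone, so the changed channel can be located with a short script over the captured FSI console text. The following is a minimal sketch in Python: it assumes that 0x20000000 is the value of a status register with no SBE count, which is inferred only from the example readings above, and the function name `changed_channels` is hypothetical.

```python
import re

# Matches "readmemory <address> <8-digit hex value>" in the console capture.
READ = re.compile(r"readmemory\s+(0x[0-9a-fA-F]+)\s+([0-9a-fA-F]{8})")


def changed_channels(console_text, baseline=0x20000000):
    """Return [(address, value)] for registers that differ from the baseline."""
    return [(int(addr, 16), int(value, 16))
            for addr, value in READ.findall(console_text)
            if int(value, 16) != baseline]


capture = """FSI-SHELL>readmemory 0x02ce0ac4 20000000
FSI-SHELL>readmemory 0x01780ac4 20010100
FSI-SHELL>readmemory 0x01790ac4 20000000
"""
for address, value in changed_channels(capture):
    print(hex(address), hex(value))  # 0x1780ac4 0x20010100
```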
DBE Testing
Read ECC injected address at <Carveout 49 base> + 0x8000 from kernel console.
ubuntu@jetson:~$ sudo ./devmem2 0xe2cd78000 w
FSI should trigger an L1 coldboot, and MB2 should enter DRAM-ECC mode. Refer to the above diagram for the detailed flow of DRAM-ECC. Once the DRAM-ECC bad page update succeeds, the corresponding print should be seen on the console.
I> MB2 (version: 0.0.0.0-t234-54845784-c6a05a9f)
I> t234-A01-1-Silicon (0x12347)
**I> Boot-mode : DRAM ECC**
...
...
**I> Read back verify success for primary and secondary blocks**
I> Task: DRAM ECC Mode PMC Reset
**I> Triggering PMC_RESET**
In the next boot cycle, when the MB1 reads the bad page partition, it should contain the retired bad page.
[0000.480] I> DRAM bad page list (P) binary is read from storage
[0000.485] I> DRAM bad page list (P) binary integrity check is success
[0000.492] I> Binary DRAM bad page list (P) loaded successfully at 0x40040000 (0x1000)
**[0000.502] I> bad page addr 0: 0xe2cd70000**
[0000.506] I> Task: SDRAM params override
[0000.510] I> Task: Save mem-bct info
[0000.513] I> Task: Carveout allocate
[0000.517] I> RCM blob carveout will not be allocated
[0000.521] I> Update CCPLEX IST carveout from MB1-BCT
[0000.526] I> ECC region[0]: Start:0x80000000, End:0xe80000000
[0000.532] I> ECC region[1]: Start:0x0, End:0x0
[0000.536] I> ECC region[2]: Start:0x0, End:0x0
[0000.540] I> ECC region[3]: Start:0x0, End:0x0
[0000.545] I> ECC region[4]: Start:0x0, End:0x0
[0000.549] I> Non-ECC region[0]: Start:0x0, End:0x0
[0000.553] I> Non-ECC region[1]: Start:0x0, End:0x0
[0000.558] I> Non-ECC region[2]: Start:0x0, End:0x0
[0000.563] I> Non-ECC region[3]: Start:0x0, End:0x0
[0000.567] I> Non-ECC region[4]: Start:0x0, End:0x0
Accordingly, the carveout memory should be adjusted.
[0000.808] I> allocated(CO:52) base:0xe2cdc0000 size:0x30000 align: 0x10000
[0000.815] I> allocated(CO:48) base:0xe2cda0000 size:0x20000 align: 0x10000
[0000.822] I> allocated(CO:69) base:0xe2cd80000 size:0x20000 align: 0x10000
**[0000.829] I> allocated(CO:49) base:0xe2cd60000 size:0x10000 align: 0x10000 ← Was previously 0xe2cd70000**
Kernel memory map entries should exclude the detected bad page and carveout 49.
[ 0.000000] node 0: [mem 0x0000000080000000-0x00000000fffdffff]
[ 0.000000] node 0: [mem 0x00000000fffe0000-0x00000000ffffffff]
[ 0.000000] node 0: [mem 0x0000000100000000-0x0000000e18f95fff]
[ 0.000000] node 0: [mem 0x0000000e18f96000-0x0000000e1922bfff]
[ 0.000000] node 0: [mem 0x0000000e1922c000-0x0000000e2670ffff]
[ 0.000000] node 0: [mem 0x0000000e26710000-0x0000000e2864ffff]
[ 0.000000] node 0: [mem 0x0000000e28650000-0x0000000e2c5fffff]
[ 0.000000] node 0: [mem 0x0000000e2c600000-0x0000000e2c7fffff]
**[ 0.000000] node 0: [mem 0x0000000e2c800000-0x0000000e2cd5ffff] ← Does not contain 0xe2cd60000 and 0xe2cd70000 pages**
[ 0.000000] node 0: [mem 0x0000000e32000000-0x0000000e33ffffff]
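The exclusion can be confirmed by testing the retired page and the relocated carveout against every range in the kernel memory map. The following is a minimal sketch in Python: the regular expression is inferred from the `[mem start-end]` format in the log above, and the function name `address_in_memory_map` is hypothetical.

```python
import re

# Matches kernel ranges such as "[mem 0x0000000e2c800000-0x0000000e2cd5ffff]".
MEM_RANGE = re.compile(r"\[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)\]")


def address_in_memory_map(log_text, address):
    """Return True if the address falls inside any listed memory range."""
    return any(int(start, 16) <= address <= int(end, 16)
               for start, end in MEM_RANGE.findall(log_text))


kernel_log = """[ 0.000000] node 0: [mem 0x0000000e2c800000-0x0000000e2cd5ffff]
[ 0.000000] node 0: [mem 0x0000000e32000000-0x0000000e33ffffff]
"""
print(address_in_memory_map(kernel_log, 0xE2CD60000))  # False: carveout 49
print(address_in_memory_map(kernel_log, 0xE2CD70000))  # False: retired page
```

Both checks returning `False` means neither page is handed to the kernel as usable memory.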