Power Management for Jetson AGX Xavier Devices

NVIDIA Tegra Linux Driver Package

Development Guide

31.1 Release

The NVIDIA® Tegra® system-on-a-chip (SoC) and the NVIDIA® Tegra® Board Support Package (BSP) provide many features related to power management, thermal management, and electrical management. These features deliver the best user experience possible given the constraints of a particular platform. The target user experience is the perception that the device provides:

• Uniformly high performance

• Excellent battery life

• Perfect stability

• Comfortable, cool-to-the-touch operation

This topic describes the power, thermal, and electrical management features visible to software, as well as some tools and related techniques.

Interacting Features

Power, thermal, and electrical management features place dynamic constraints on many operational settings (“knobs”), such as:

• Clock gate settings

• Clock frequencies

• Power gate (or regulator enable) settings

• Voltages

• Processor power state (e.g., which idle state is selected for the CPU)

• Peripheral power state (e.g., which idle state is selected for an I/O controller)

• Chipset power state

• Availability of CPU cores to the OS

Some of these knobs are constrained by more than one feature. For example, cpufreq implements load-based scaling: it adjusts the CPU frequency according to how busy the CPU is. CPU thermal management, however, can override the target frequency of cpufreq. Therefore, before you attempt to debug power, performance, thermal, or electrical problems, familiarize yourself with all of the power, thermal, and electrical management features in the BSP.

Kernel-Space Power Saving Features

This topic describes the features that the SoC implements to save power and extend battery life. Many of these features are implemented by the Linux kernel, with support from firmware and hardware, and without significant involvement from user space.

Chipset Power States

The supported power states are listed in order of increasing flexibility or configurability:

• Off: There is only one way for a system to be off.

• SC7 (Deep Sleep) offers a small amount of configurability. For example, prior to entering Deep Sleep, software can select which of the many hardware wake events can wake the chip from Deep Sleep.

• Active state is extraordinarily flexible in terms of power and performance. It encompasses activity levels from low power audio playback through peak performance scenarios. Power consumption in the active state can range from tens of milliwatts up to several watts.

Supported Power States

The following table describes the supported power states.

Power State	Functionality	Characteristics
Off	Power rails	None of the power rails supplying the SoC and DRAM are powered.
	State	No state is maintained in the SoC or DRAM.
	Exiting	The only way to exit this state is via a cold-boot (into active mode).
SC7 (Deep Sleep)	Power rails	VDD_RTC, VDDIO_DDR, VDDIO_SYS, and DRAM power rails are powered on. VDD_CORE and VDD_CPU are powered off.
	State	The SoC maintains a small amount of state in the PMC block. DRAM maintains state.
	Exiting	Exit from this state occurs based on a pre-defined set of wake events into active mode.
Active	Power rails	VDD_RTC, VDDIO_DDR, VDDIO_SYS, VDD_CORE, and DRAM power rails are powered on. Other power rails (including VDD_CPU) may be on.
	State	Software actively manages the power states of the many devices comprising the SoC.
	Exiting	Software can initiate a transition from Active to any other power state.

Power State Mapping to Linux

Tegra BSP maps these hardware power states onto Linux power states as follows.

Linux Power State	Chipset Power State	Comments
Off	Off	—
Suspend	SC7	Software can choose whether to enter SC7 before the OS enters suspend.
Running/Idle (display on or off)	Active	Many devices in the SoC may be idle or disabled. They are under driver control. For example, VDD_GPU may be powered off and the companion GPU may be power-gated.

Note:

For NVIDIA Tegra Xavier the chipset power state is named SCy instead of LPx.

Clock and Voltage Management

Because the frequency a clock can sustain depends on its supply voltage, dynamic voltage scaling is closely related to frequency scaling. Higher frequencies require higher voltages, and lower frequencies permit lower voltages.

Most clock register manipulation on Tegra Xavier is handled by the BPMP firmware, the power management firmware that runs on the Boot and Power Management Processor (BPMP). A Linux kernel driver exposes a somewhat simplified view of the physical clock tree to software on the main CPU via the Linux Common Clock Framework.

Each of the significant clock domains on the chip has its own dedicated clock source known as a Noise Aware Frequency Lock Loop (NAFLL).

Regulator Framework

The Linux regulator framework provides an abstraction that allows regulator consumer drivers to dynamically adjust voltage or current regulators at runtime, without knowledge of the underlying hardware power tree.

The framework provides a mechanism that platform initialization code can use to declare a power tree topology and assign a driver that provides regulators for each node in the hardware power tree. Such a driver is called a regulator provider driver.

Tegra BSP configures the platform power tree appropriately for Tegra devices. Additionally, drivers within Tegra BSP act as regulator consumers, where appropriate.

When porting Tegra BSP to a new platform, ensure that:

• The platform power tree is correctly configured to match the underlying hardware.

• All drivers for peripheral devices use the regulator consumer APIs correctly.

• The Device Tree and board configuration file information for your new platform avoids conflicts between functions using the same I/O pads. BSP drivers registering as regulator consumers can cause I/O pads on the chip to be unavailable for other functions.

The SoC core power rails (VDD_CORE, VDD_CPU, VDD_GPU, VDD_CV) are under the direct control of the BPMP firmware. They are configured via the BPMP device tree blob (which is distinct from the Linux device tree blob).

CPU Power Management

Tegra CPU power management strategy includes dynamic frequency scaling with dynamic voltage scaling, idle power states, and core management tuned for the Tegra Xavier architecture.

Frequency Management with cpufreq

Tegra BSP implements CPU Dynamic Frequency Scaling (DFS) with the Linux cpufreq subsystem. The cpufreq subsystem comprises:

• Platform drivers to implement the clock adjustment mechanism.

• Governors to implement frequency scaling policies

• Core framework to connect governors to platform drivers

The policy for frequency scaling depends on which cpufreq governor is selected at runtime.

For details consult the information available at:

<top>/kernel/kernel/kernel-4.9/Documentation/cpu-freq/

For each SoC hardware reference design, a cpufreq governor is selected and tuned to achieve a balance between power and performance.

When a governor requests a CPU frequency change, the SoC-specific cpufreq platform driver reconciles that request with limits imposed by thermal or electrical constraints. The driver updates the clock speed of the CPU.
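The governor and the frequency limits currently in force can be inspected from user space. The following is a minimal sketch that reads the standard Linux cpufreq sysfs attributes (scaling_governor, scaling_min_freq, scaling_max_freq, scaling_cur_freq). The function name and the sysfs_root parameter are hypothetical and exist only to make the example testable off-target. Whether a thermal or electrical cap is reflected in scaling_max_freq on a given release is an assumption to verify on the device.

```python
from pathlib import Path


def read_cpufreq_policy(cpu: int, sysfs_root: str = "/sys") -> dict:
    """Return the active governor and frequency limits (in kHz) for one CPU."""
    base = Path(sysfs_root) / "devices/system/cpu" / f"cpu{cpu}" / "cpufreq"
    policy = {"governor": (base / "scaling_governor").read_text().strip()}
    # cpufreq reports all frequencies in kilohertz.
    for key in ("scaling_min_freq", "scaling_max_freq", "scaling_cur_freq"):
        policy[key] = int((base / key).read_text())
    return policy
```

On a device, read_cpufreq_policy(0) returns the values for CPU 0. Comparing scaling_cur_freq with scaling_max_freq over time is one way to see whether the CPU is running below the ceiling the governor is allowed to request.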

Tegra Xavier uses an NAFLL to clock each CPU. The NAFLLs are configured for adaptive voltage and frequency scaling (AVFS). Hardware, with the assistance of the BPMP, ensures that the CPU voltage is appropriate for the NAFLL to deliver requested CPU frequencies.

Idle Management with cpuidle

The Linux cpuidle infrastructure supports the implementation of SoC-specific idle states for each CPU core. cpuidle lacks direct support for idle states applicable to an entire CPU cluster and for idle states extending beyond a CPU cluster.

For more information on the Linux cpuidle infrastructure, see:

<top>/kernel/kernel/kernel-4.9/Documentation/cpuidle/

NVIDIA provides an SoC-specific cpuidle driver that plugs into the cpuidle framework to enable CPU idle power management.

Per-Core CPU Idle States

The SoC cpuidle driver exposes two per-core CPU idle states as follows:

Core State	Description
C1	CPU core is clock-gated
C6	Virtual retention (CPU core is power-gated; architectural state is restored by MTS)

Transition CPU Cluster to Idle

When the final core within a cluster transitions to idle, the SoC-specific cpuidle driver can transition the CPU cluster to a cluster idle state.

Cluster State	Description
CC1	CPU cluster’s clock is halted
CC6	Cluster power gating (includes non-CPU logic)

Additionally, as the final core within a cluster transitions to idle, the SoC’s cpuidle driver optionally disables any SoC resources if the CPU was the last active user.

For example, the final CPU core transitioning to idle can optionally do one or more of the following:

• Transition DRAM to self-refresh

• Clock-gate MC/EMC

• Halt various PLLs

CPU Idle

The idle task is scheduled when there are no runnable tasks left in the run queue for a particular core. This task, through the cpuidle driver and cpuidle governor, selects a low-power state for the core and puts the core into that state, where it stays until an interrupt wakes it up to process more work.

When the last active core in a CPU cluster goes into an idle or offline state, the idle task puts the entire CPU cluster in a low-power state.

CCPLEX Idle States

Core states are denoted as Cx states, cluster states are denoted as CCx states, and CCPLEX states are denoted as CCPx states. The table below summarizes the different states available on Xavier.

State	Meaning
Core states
C1	Clock gating
C6	Virtual retention (power-gating and architecture state restored by MTS)
Cluster states
CC1	Auto clock gating
CC3	fmax@Vmin or specified idle frequency
CC6	Cluster power gating (includes non-CPU logic)
CCPlex states
CG7	CCPLEX rail gating

Using KConfig and Device Tree Node to Enable cpuidle

Use KConfig and the device tree node to enable cpuidle.

To enable cpuidle from the configuration file

• Set the following option:

CONFIG_CPU_IDLE=y

To enable cpuidle from device tree

• Use the following compatibility string if not already enabled:

cpuidle {
    compatible = "nvidia,tegra19x-cpuidle";
    status = "okay";
};

To get and set the core power state of the CPU

The pathnames of the nodes that represent core power states are:

/sys/devices/system/cpu/cpu<x>/cpuidle/state<y>

Where state<y> is mapped to a specific CPU core power state.

• To determine which core power state state<y> is mapped to, run the command:

cat /sys/devices/system/cpu/cpu<x>/cpuidle/state<y>/name

• To get the status of core power state <y> on core <x>, read the node:

cat /sys/devices/system/cpu/cpu<x>/cpuidle/state<y>/disable

• To change the status of core power state <y> on CPU core <x>, run the command:

echo <b> > /sys/devices/system/cpu/cpu<x>/cpuidle/state<y>/disable

Where <b> is 0 or 1. (This writes ASCII 0 or 1 to the node.)

Note:

ASCII 1 corresponds to “disabled,” and 0 to “enabled.”
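The same nodes can be read and written from a script. The following is a minimal sketch that lists the idle states of one core and toggles one of them. It relies on the name and disable attributes of the standard Linux cpuidle sysfs interface. The function names and the sysfs_root parameter are hypothetical, and writing to the disable node on a device requires root privileges.

```python
from pathlib import Path


def cpuidle_dir(cpu: int, sysfs_root: str = "/sys") -> Path:
    """Return the cpuidle directory for one CPU core."""
    return Path(sysfs_root) / "devices/system/cpu" / f"cpu{cpu}" / "cpuidle"


def list_idle_states(cpu: int, sysfs_root: str = "/sys") -> dict:
    """Map each state<y> index to its name and whether it is enabled."""
    states = {}
    for state in cpuidle_dir(cpu, sysfs_root).glob("state[0-9]*"):
        index = int(state.name[len("state"):])
        # The node holds ASCII 1 for "disabled" and ASCII 0 for "enabled".
        disabled = (state / "disable").read_text().strip() == "1"
        states[index] = {
            "name": (state / "name").read_text().strip(),
            "enabled": not disabled,
        }
    return states


def set_idle_state_enabled(cpu: int, index: int, enabled: bool,
                           sysfs_root: str = "/sys") -> None:
    """Enable or disable state<index> on one core by writing its disable node."""
    node = cpuidle_dir(cpu, sysfs_root) / f"state{index}" / "disable"
    node.write_text("0" if enabled else "1")
```

For example, set_idle_state_enabled(2, 1, False) has the same effect as echoing 1 into the disable node of state1 on core 2.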

To get cluster states

• To get the status of the cluster states enabled for each cluster, read the following node:

/sys/kernel/debug/tegra_cpuidle/deepest_cc_state

The value returned is:

• 1: Only CC1 is enabled

• 6: CC1 and CC6 are enabled
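A script that reads this node can translate the value into the list of enabled cluster states. The following minimal sketch covers only the two values listed above; the function name is hypothetical, and any other value is rejected rather than guessed at.

```python
def enabled_cluster_states(deepest_cc_state: str) -> list:
    """Translate the text read from deepest_cc_state into cluster state names."""
    mapping = {1: ["CC1"], 6: ["CC1", "CC6"]}
    value = int(deepest_cc_state.strip())
    if value not in mapping:
        raise ValueError(f"unexpected deepest_cc_state value: {value}")
    return mapping[value]
```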

To get the per-core state statistics

• To get the number of times the kernel requested a specified core to enter a specified state, read the following node:

cat /sys/devices/system/cpu/cpu<x>/cpuidle/state<y>/usage

• To get the number of times a specified core actually entered a specified state, run the following command:

cat /sys/kernel/debug/tegra_mce/cstats

The command requests information from MCE/MTS, which actually sets the state. MCE/MTS may decline to set a requested state, for example, because the actual idle time for the core is less than the crossover threshold value.

Note:

Background translations in Denver can bloat the Denver idle state counts.

• To get the total time in microseconds that a specified core has spent in a specified state since boot, read the following node:

cat /sys/devices/system/cpu/cpu<x>/cpuidle/state<y>/time

For example, to get the number of times that the kernel has requested Denver core 2 to enter state C6, run the following command:

cat /sys/devices/system/cpu/cpu2/cpuidle/state1/usage

To get the total time in microseconds that Denver core 2 has spent in state C6, run the following command:

cat /sys/devices/system/cpu/cpu2/cpuidle/state1/time
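The usage and time nodes can be combined to estimate the average residency per request. The following is a minimal sketch; the function name and the sysfs_root parameter are hypothetical. Because usage counts requests rather than confirmed entries, the average is only an approximation.

```python
from pathlib import Path


def idle_state_stats(cpu: int, index: int, sysfs_root: str = "/sys") -> dict:
    """Read usage and time for state<index> and derive the mean residency."""
    state = (Path(sysfs_root) / "devices/system/cpu" / f"cpu{cpu}"
             / "cpuidle" / f"state{index}")
    usage = int((state / "usage").read_text())
    time_us = int((state / "time").read_text())
    # Avoid dividing by zero when the state has never been requested.
    average_us = time_us / usage if usage else 0.0
    return {"usage": usage, "time_us": time_us, "avg_residency_us": average_us}
```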

To disable cpuidle at boot time

• Remove or disable the compatibility string nvidia,tegra19x-cpuidle from the appropriate device tree file.

To disable a core/cluster power state at boot time

• Remove or disable the appropriate core/cluster state nodes in the following device tree file:

tegra194-cpuidle.dtsi

Memory Power Management

NVIDIA SoC chipsets include power saving features whose operation is largely invisible to software at runtime. Most of those features are statically enabled at boot, according to settings in the boot configuration table (BCT).

Additionally, Tegra BSP implements EMC frequency scaling, which is dynamic frequency scaling for the memory controller (EMC/MC) and DRAM. This is a critical power saving feature that requires tuning and characterization for each new printed circuit board design.

The calibration results include a BCT and an EMC DVFS table specific to the board design. The EMC DVFS table must be included in the platform BPMP device tree file.

EMC Frequency Scaling Policy

The following factors affect EMC frequency scaling policy at runtime:

• The entries in the EMC DVFS table

• The average memory bandwidth being used (as measured by hardware)

• Requests made by various device drivers (cpufreq, graphics drivers, USB, HDMI™, and display)

• Any limits dynamically imposed by thermal throttling

Supported Modes and Power Efficiency

The Jetson AGX Xavier is designed with a high-efficiency PMIC, voltage regulators, and power tree to optimize power efficiency. It supports three power modes: 10W, 15W, and 30W. For each mode, several configurations with various CPU frequencies and numbers of online cores are possible.

Capping the memory, CPU, and GPU frequencies and number of online CPU, GPU TPC, DLA and PVA cores at a pre-qualified level confines the module to the target mode. The configurations pre-defined by NVIDIA are as follows.

NVPModel Clock Configuration for Jetson AGX Xavier

Mode Name	MaxN	10W	15W	30W	30W	30W	30W
Power Budget	n/a	10W	15W	30W	30W	30W	30W
Mode ID	0	1	2	3	4	5	6
Online CPU	8	2	4	8	6	4	2
CPU Maximal Frequency (MHz)	2265.6	1200	1200	1200	1450	1780	2100
GPU TPC	4	2	4	4	4	4	4
GPU Maximal Frequency (MHz)	1377	520	670	900	900	900	900
DLA cores	2	2	2	2	2	2	2
DLA Maximal Frequency (MHz)	1395.2	550	750	1050	1050	1050	1050
PVA cores	2	0	1	1	1	1	1
PVA Maximal Frequency (MHz)	1088	0	550	760	760	760	760
Memory Maximal Frequency (MHz)	2133	1066	1333	1600	1600	1600	1600

The default mode is 15W (ID 2).

To change the power mode

• Run the following command:

sudo /usr/sbin/nvpmodel -m x

Where x is the power mode ID.

To find the current power mode

• Run the following command:

sudo /usr/sbin/nvpmodel -q

To learn about other options

• Run the following command:

/usr/sbin/nvpmodel -h

Once you set a power mode, the module stays in that mode until you change it. The mode persists across power cycles and SC7.
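A script that switches modes can check the requested mode ID against the predefined table before it calls nvpmodel. The following is a minimal sketch: the mode data is copied from the table above, while the dictionary and function names are hypothetical. Custom modes added to the configuration file are not in this table and would have to be added to it.

```python
# Mode ID -> (mode name, online CPU cores, maximum CPU frequency in MHz),
# copied from the NVPModel clock configuration table.
NVPMODEL_MODES = {
    0: ("MaxN", 8, 2265.6),
    1: ("10W", 2, 1200.0),
    2: ("15W", 4, 1200.0),
    3: ("30W", 8, 1200.0),
    4: ("30W", 6, 1450.0),
    5: ("30W", 4, 1780.0),
    6: ("30W", 2, 2100.0),
}


def nvpmodel_set_command(mode_id: int) -> list:
    """Build the argument list for 'sudo /usr/sbin/nvpmodel -m <id>'."""
    if mode_id not in NVPMODEL_MODES:
        raise ValueError(f"unknown power mode ID: {mode_id}")
    return ["sudo", "/usr/sbin/nvpmodel", "-m", str(mode_id)]
```

On a device, the returned list can be passed to subprocess.run(..., check=True) to apply the mode.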

You can define your own custom mode by adding a mode definition to the following file:

<top>/device/t19x/t194/nvpmodel_t194ref.conf

Following is an example entry for mode 2:

< POWER_MODEL ID=2 NAME=MODE_15W >
CPU_ONLINE CORE_0 1
CPU_ONLINE CORE_1 1
CPU_ONLINE CORE_2 1
CPU_ONLINE CORE_3 1
CPU_ONLINE CORE_4 0
CPU_ONLINE CORE_5 0
CPU_ONLINE CORE_6 0
CPU_ONLINE CORE_7 0
CPU_DENVER_0 MIN_FREQ 1200000
CPU_DENVER_0 MAX_FREQ 1200000
CPU_DENVER_1 MIN_FREQ 1200000
CPU_DENVER_1 MAX_FREQ 1200000
GPU MIN_FREQ 0
GPU MAX_FREQ 670000000
EMC MAX_FREQ 1331200000
DLA_CORE MAX_FREQ 750000000
DLA_FALCON MAX_FREQ 450000000
PVA_VPS MAX_FREQ 550000000
PVA_CORE MAX_FREQ 385000000

The frequency unit of measure for the CPU is kilohertz. The unit for the GPU and EMC is hertz. You must assign a unique number in the ID field. Test your use case to determine:

• How many active cores to use

• The frequencies to use for each CPU cluster, the GPU, and the EMC

The frequencies you select are subject to the MaxN limit defined in mode 0.
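Before installing a custom mode, it can help to check its MAX_FREQ entries against the MaxN values. The following is a minimal sketch that parses one mode definition in the format of the example entry and reports entries above the MaxN ceiling. The ceilings are derived from the MaxN column of the mode table (CPU in kilohertz, GPU and EMC in hertz). The function names are hypothetical, and the parser handles only the line shapes that appear in the example entry.

```python
import re

# MaxN ceilings from the mode table, in the units the file uses:
# kHz for CPU clusters, Hz for GPU and EMC.
MAXN_LIMITS = {"CPU": 2_265_600, "GPU": 1_377_000_000, "EMC": 2_133_000_000}

HEADER = re.compile(r"<\s*POWER_MODEL\s+ID=(\d+)\s+NAME=(\S+)\s*>")


def parse_power_model(text: str) -> dict:
    """Parse one mode definition into its ID, name, online map, and caps."""
    mode = {"id": None, "name": None, "online": {}, "max_freq": {}}
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            mode["id"], mode["name"] = int(header.group(1)), header.group(2)
            continue
        fields = line.split()
        if len(fields) != 3:
            continue  # blank or unrecognized line
        target, key, value = fields
        if target == "CPU_ONLINE":
            mode["online"][key] = value == "1"
        elif key == "MAX_FREQ":
            mode["max_freq"][target] = int(value)
    return mode


def maxn_violations(mode: dict) -> list:
    """List (target, requested, limit) for each cap above its MaxN ceiling."""
    violations = []
    for target, freq in mode["max_freq"].items():
        group = "CPU" if target.startswith("CPU_") else target
        limit = MAXN_LIMITS.get(group)
        # Clocks without a known ceiling (DLA, PVA) are not checked here.
        if limit is not None and freq > limit:
            violations.append((target, freq, limit))
    return violations
```

An empty list from maxn_violations means that none of the checked caps exceeds the MaxN values used in this sketch.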

Thermal Management

Thermal management is essential for system stability and quality of user experience. Tegra Xavier thermal management provides the following capabilities:

• Sensing: for temperature reporting

• Cooldown: for removing heat via the fan and for controlling heat via software clock throttling

• Slowdown: for hardware clock throttling

• Shutdown: for orderly software shutdown and hardware thermal reset

Thermal management in Tegra Xavier is performed by:

• Drivers for the on-die thermal sensors

• The Boot and Power Management Processor (BPMP) for slowdown and hardware thermal reset

The following table identifies each thermal management action and the associated module for the SoC.

Thermal Action	Module Name	Tegra Xavier
Sensing	soctherm.c	BPMP firmware
	aotag.c	BPMP firmware
	nct1008.c	Kernel software
Cooldown for software throttling	tegraXX_throttle.c	Kernel software
Cooldown for software throttling	pwm_fan.c	Kernel software
Slowdown for hardware throttling	soctherm.c	BPMP firmware
Software shutdown	thermal_core.c	Kernel software
Hardware shutdown	soctherm.c and aotag.c	BPMP firmware

Linux Thermal Framework

The Linux thermal framework provides generic user space and kernel space interfaces for working with devices that measure temperature and devices that control temperature. The central component of the framework is the thermal zone.

More information about the Linux thermal framework is available at:

<top>/kernel-4.9/Documentation/thermal/sysfs-api.txt

Thermal Zone

A thermal zone is a virtual object that represents an area on the die whose temperature is monitored and controlled. A thermal zone acts as an object with the following components:

• Temperature sensor

• Cooling device

• Trip points

• Governor

Tegra BSP includes drivers that provide interfaces to these components.

This topic introduces these components and demonstrates how they form a thermal zone on a Tegra-based device.

Configuring a Thermal Zone Using the Device Tree

A thermal zone provides knobs to tune the thermal response of the zone. Tegra BSP provides several thermal zones tuned to provide optimum thermal performance. You can modify these thermal zones by editing their entries in the device tree. You can define which sensors to use, the temperature limits, and the cooling actions taken at those limits. When a device becomes too hot, tuning the thermal zone resolves the problem in most cases.

The following code snippet provides an example of a thermal zone for Tegra Xavier. This thermal zone monitors the temperature of the THERMAL_ZONE_GPU sensor.

Clock throttling is performed using the gpu_balanced cooling device when the passive trip point, trip_bthrot, is crossed at 88° Celsius.

GPU-therm {
    status = "okay";
    polling-delay-passive = <500>;

    thermal-zone-params {
        governor-name = "step_wise";
    };

    trips {
        trip_critical {
            temperature = <93500>;
            type = "critical";
            hysteresis = <0>;
            writable;
        };
        trip_bthrot {
            temperature = <88000>;
            type = "passive";
            hysteresis = <0>;
            writable;
        };
    };

    cooling-maps {
        map0 {
            trip = <&{/thermal-zones/GPU-therm/trips/trip_bthrot}>;
            cdev-type = "gpu-balanced";
            cooling-device = <&{/bthrot_cdev/gpu_balanced} THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
        };
    };
};

More information about thermal knobs is available at:

<top>/kernel/kernel-4.9/Documentation/devicetree/bindings/thermal/thermal.txt

Temperature Sensors

A temperature sensor in a thermal zone is responsible for reporting the temperature in millidegrees Celsius. Tegra has several types of temperature sensors spread across the die and the board.

For more information see Thermal Sensing in Linux.
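Because sensors report millidegrees Celsius, a reading of 45500 means 45.5° Celsius. The following is a minimal sketch that reads every thermal zone through the generic Linux thermal sysfs interface (the type, temp, trip_point_N_temp, and trip_point_N_type attributes) and converts the values to degrees. The function name and the sysfs_root parameter are hypothetical.

```python
from pathlib import Path


def read_thermal_zones(sysfs_root: str = "/sys") -> dict:
    """Return {zone type: {"temp_c": float, "trips": {index: (type, temp_c)}}}."""
    zones = {}
    for zone in (Path(sysfs_root) / "class/thermal").glob("thermal_zone*"):
        trips = {}
        for temp_node in zone.glob("trip_point_*_temp"):
            # File names look like trip_point_<index>_temp.
            index = temp_node.name.split("_")[2]
            trip_type = (zone / f"trip_point_{index}_type").read_text().strip()
            trips[int(index)] = (trip_type, int(temp_node.read_text()) / 1000.0)
        zones[(zone / "type").read_text().strip()] = {
            # sysfs reports millidegrees Celsius.
            "temp_c": int((zone / "temp").read_text()) / 1000.0,
            "trips": trips,
        }
    return zones
```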

Trip Points

A trip point is used to communicate with a thermal zone. A trip point specifies the temperature (the trip) at which to perform a thermal action. Trip points are classified as active or passive, based on the type of cooling they trigger. A trip point is classified as critical if it triggers a thermal shutdown. A cooling map specifies how a cooling device is associated with certain trip points. Tegra BSP supports fan and clock throttling.

Cooling Devices

Most cooling devices do not actually remove heat; only a fan cooling device does. The Linux thermal framework makes this distinction by classifying a fan cooling device as an “active” cooling device. Clock throttling, the other type of cooling device, is classified as a “passive” cooling device.

For more information, see Cooling Devices.

Governors

Thermal management requires some form of feedback control system that keeps the device within a safe operating temperature. A governor implements this feedback control loop. While the Linux thermal framework provides many different governors, Tegra BSP provides a simple Proportional Integral Derivative (PID) controller for all passive throttling needs.

Tegra BSP Specific Thermal Zones

The Tegra BSP defines platform specific thermal zones. They are tuned to provide the best performance within the thermal constraints of the device. Each thermal zone uses a temperature sensor that is controlled by either the Linux kernel or the BPMP firmware, as described in the following table.

Thermal Zone	Thermal Sensor ABI Name	Cooling Action	Balanced Throttle Temperature in Degrees Celsius
CPU-therm	THERMAL_ZONE_PLLX	cpu_balanced	90.0
GPU-therm	THERMAL_ZONE_GPU	gpu_balanced	92.5
AUX-therm	THERMAL_ZONE_AUX	aux_balanced	89.0
AO-therm	THERMAL_ZONE_AO	bwmgr-therm-handler	85.0
Tdiode_tegra	tmp451	tegra-eqos	40.0
PMIC-Die	PMIC	emergency-balanced	120.0
Tboard_tegra	tmp451	—	—
thermal-fan-est	Weighted average of CPU-therm, GPU-therm, & AUX-therm (3:3:4)	pwm_fan	50.0
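The thermal-fan-est row describes a 3:3:4 weighted average of the CPU-therm, GPU-therm, and AUX-therm temperatures. The following is a minimal worked sketch of that arithmetic. It assumes a plain weighted mean normalized by the sum of the weights; the function name is hypothetical, and the driver may apply further filtering that is not described here.

```python
def fan_estimate(cpu_c: float, gpu_c: float, aux_c: float,
                 weights: tuple = (3, 3, 4)) -> float:
    """Weighted mean of the CPU, GPU, and AUX zone temperatures in Celsius."""
    w_cpu, w_gpu, w_aux = weights
    total = w_cpu * cpu_c + w_gpu * gpu_c + w_aux * aux_c
    return total / (w_cpu + w_gpu + w_aux)
```

For example, with CPU-therm at 50°, GPU-therm at 60°, and AUX-therm at 40° Celsius, the estimate is (150 + 180 + 160) / 10 = 49° Celsius.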

For more information, see Thermal Management in BPMP.

All gains achieved by tuning are limited by the Thermal Design Power (TDP) of the system. Tuning cannot remedy a faulty TDP. Removing all of the thermal zones does not guarantee maximum performance, and can cause resets and/or irreversible damage to the device.

Thermal Sensing in Linux

Tegra BSP includes several drivers for temperature sensing.

NCT Sensors

Tegra BSP includes a driver for devices such as:

• NCT1008

• NCT72

• TMP451

These devices can sense their own temperature as well as the temperature of a remote diode. Tegra platforms have these sensors set up as follows:

Thermal Zone	Thermal Sensor	Sensed Location
Tdiode_tegra	Remote sensor	Temperature on die near GPU
Tboard_tegra	Local sensor	Temperature of the board

Tegra BSP configures these sensors to operate in an extended mode to increase the temperature range to −64° Celsius to 191° Celsius.

Operation During SC7

On many platforms, the voltage rail that powers the sensor is gated when Tegra enters SC7 state. Consequently, the sensor is stopped when Tegra enters SC7 and turned back on when Tegra exits SC7 state.

Thermal Capabilities

The NCT sensors generate thermal events for:

• Thermal zone trip points

• Hardware thermal shutdown

Correction Offset

The NCT sensors allow software to program a static offset temperature for the remote sensor. This accounts for any inaccuracy that may be present in the sensor hardware. Tegra BSP reads the offset from the Device Tree and programs it into the offset register on boot. The offset is calculated and validated via oil bath experiments.

BPMP Sensors

Tegra Xavier replaces the soctherm and aotag drivers in the Linux kernel with the tegra_bpmp_thermal sensor driver. This module registers itself as the sensor device driver with the Linux thermal framework for all the thermal sensors except the NCT sensors.

Each BPMP sensor is exposed using the Application Binary Interface (ABI), and has an ABI name as shown in the table in Tegra BSP Specific Thermal Zones. BPMP sensors, without the thermal_zone prefix, work as described in the following paragraphs. All BPMP sensors have one programmable temperature threshold (one trip), allocated for a thermal zone trip point.

The thermal framework walks through the list of thermal trip points in a thermal zone based on the current temperature. It then determines which trip to program into the BPMP sensor that is specified in the thermal zone. The tegra_bpmp_thermal driver then uses the following thermal message requests (MRQs) to communicate with the BPMP thermal framework:

• CMD_THERMAL_QUERY_ABI

• CMD_THERMAL_GET_TEMP

• CMD_THERMAL_SET_TRIP

• CMD_THERMAL_GET_NUM_ZONES

The driver receives a CMD_THERMAL_HOST_TRIP_REACHED MRQ message when a particular sensor crosses a trip. The message is then relayed back to the Linux thermal framework.

For more information on these thermal management features provided as part of the Tegra BSP, see Thermal Management in BPMP.

Cooling Devices

Tegra BSP provides thermal cooling via fan control and throttling of various clocks in the system.

Fan Management

Tegra BSP provides active cooling by fan management through the cooling device pwm-fan, which provides:

• Fan speed control by programming the PWM controller

• Ramp-up and ramp-down control to change the speed of the fan smoothly

• Fan control during various power states

The PWM-RPM mapping, and the various ramp rates, are stored as part of the device tree binary. The pwm-fan cooling device maps these PWM values to a cooling state. The fan cooling device can be attached to monitor the temperature of any of the Tegra BSP sensors. As the temperature increases, the governor progressively picks a deeper cooling state for the fan. This results in a higher RPM for the fan which then results in more cooling.

SoC thermal management uses the fan as the first line of defense to delay clock throttling until a much higher temperature is reached.

Clock Throttling

Tegra BSP provides thermal cooling by throttling various clocks in the system. When a rising temperature crosses a trip point, clock throttling relies on the DVFS capabilities of the clocks to reduce their operating frequency, and thereby the voltage of the rail that powers the clock. This lowered frequency and voltage reduces the power consumption which then helps in controlling the temperature.

Because cooling is achieved by reducing the clock frequency, it has a direct impact on performance and user experience. If a device feels warm and seems sluggish, it may be due to thermal throttling on the clocks. This can be remedied by tuning thermal zones that are mapped to the following balanced cooling devices:

• gpu_balanced

• cpu_balanced

• aux_balanced

• emergency_balanced

Each of these balanced cooling devices provides several cooling states that translate to a maximum allowable operating frequency for the CPU, GPU, and EMC clocks. These frequencies are optimized to provide the best possible performance at a given temperature. The frequency tables for these clocks are part of the device tree binary.

The governor uses the current temperature of the sensor as an input to the feedback control loop. Similarly, the governor uses the control loop’s output as the new cooling state for the operation of the cooling device. As the device heats up, the governor progressively picks a deeper cooling state, which results in a lower frequency cap for all the clocks and potentially more cooling. Tegra BSP performs this thermal throttling of the clocks to maintain the junction temperature of the die within the recommended safe limits.
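The relationship between cooling states and frequency caps can be sketched as a table lookup driven by a simple step policy. The following is an illustration only: the frequency table is hypothetical (the real tables live in the device tree binary), the function names are hypothetical, and the step rule is a simplified stand-in for a governor, not the kernel implementation.

```python
# Hypothetical GPU caps in kHz, indexed by cooling state.
# State 0 means no throttling; deeper states allow lower frequencies.
GPU_CAPS_KHZ = [1377000, 1100000, 900000, 670000, 520000]


def frequency_cap(cooling_state: int, caps: list = GPU_CAPS_KHZ) -> int:
    """Return the maximum allowed frequency for a cooling state."""
    state = max(0, min(cooling_state, len(caps) - 1))
    return caps[state]


def next_cooling_state(state: int, temp_mc: int, trip_mc: int,
                       max_state: int) -> int:
    """Step one state deeper at or above the trip, one state back below it."""
    if temp_mc >= trip_mc:
        return min(state + 1, max_state)
    return max(state - 1, 0)
```

Each polling interval, a caller would compute the next state from the current temperature and then apply frequency_cap for that state.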

Software Thermal Shutdown

The thermal zones also define a special type of trip point called a critical trip point that triggers a software shutdown. This special trip point allows the operating system to save its state and perform an orderly shutdown before a hardware reset occurs due to high temperature. Tegra BSP defines one critical trip point per thermal zone. Users can set the lower limit for the orderly shutdown. A thermal shutdown occurs after all the other cooling strategies have failed. It is considered a rare event. The shutdown limits are as follows.

Thermal Zone	Shutdown Limit in Degrees Celsius
CPU-therm	95.5
GPU-therm	98
AUX-therm	94.5

Thermal Management in BPMP

The Tegra BSP thermal management features are part of the firmware running on BPMP for Tegra platforms running any host operating system (host OS) on the CPU.

Thermal Sensing

The BPMP firmware hosts the soctherm and aotag drivers for the on-die thermal sensors as follows:

Thermal Sensor		ABI Name	Sensed Location
AOTAG	AOTAG	THERMAL_ZONE_AO	—
SOC_THERM	PLLX	THERMAL_ZONE_PLLX	Center of CPU cluster
	AUX (x3)	THERMAL_ZONE_AUX	Near CV cluster, SOC cluster
	CPU	THERMAL_ZONE_CPU	—
	GPU	THERMAL_ZONE_GPU	Center of GPU

SOC_THERM

SOC_THERM is the collection of on-chip ring oscillators whose frequency changes are based on the temperature. To convert a measured frequency to a temperature, the oscillating frequency of the sensor, at a fixed temperature, must be known in advance and stored in the on-chip fuses.

The soctherm driver uses these fuses during boot and calibrates the sensor. Once the calibration is complete, the temperature sensor reports the temperature, in degrees Celsius, with a precision margin of 0.5° Celsius.

Sensors and Sensor Groups

The temperature sensors on the chip are logically grouped into sensor groups, based on their proximity to certain hardware blocks. Each sensor group is represented as a single sensor to the host OS and the BPMP firmware.

For example, Tegra Xavier has two temperature sensors in the SOC cluster and one near the CV cluster. These are grouped as AUX sensors that are represented as THERMAL_ZONE_AUX to the operating system running on the CPUs. SOC_THERM reports the temperature of a given group by taking the maximum of all the sensors in the group.
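The grouping rule can be illustrated with a short sketch that reports the maximum reading of a group. The per-sensor names below are hypothetical placeholders for the two SOC cluster sensors and the CV cluster sensor; only the group name THERMAL_ZONE_AUX comes from this topic.

```python
# Hypothetical per-sensor names; the group name is the ABI name in this topic.
SENSOR_GROUPS = {"THERMAL_ZONE_AUX": ["soc0", "soc1", "cv0"]}


def group_temperature(group: str, readings_c: dict) -> float:
    """Return the hottest reading among the sensors that form a group."""
    members = SENSOR_GROUPS[group]
    return max(readings_c[name] for name in members)
```

With readings of 61.0, 63.5, and 58.0° Celsius, the group reports 63.5° Celsius.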

Thermal Event Detection

Thermal sensors can report the temperature when the current temperature crosses a software programmed trip point. The sensors are capable of monitoring several of these software trip points to perform the following thermal actions:

• Report when the thermal trip point has been crossed

• Trigger a hardware thermal shutdown

• Trigger hardware throttling

Voltage Rail Dependencies

To provide accurate temperature sensing, the sensors require a minimum voltage. Additionally, the sensors cannot operate when the rail is power-gated.

When the system is in a low-power state, the firmware provides the following modes of operation:

• No temperature measurements during SC7: Because the rail powering the sensor is power-gated in the SC7 state, the oscillator is not running. Therefore, the frequency-to-temperature conversion may result in inaccurate values. To ensure no spurious temperature reports from the sensors, stop the sensor before entering the SC7 state.

The firmware provides the AOTAG sensor for measuring temperature in the SC7 state. When the SC7 state is exited, the sensors are restarted.

• Fallback to PLLX sensor on Tegra Xavier: To ensure accurate temperature readings during minimum voltage, use the PLLX sensor’s oscillator. On platforms where the minimum voltage is not guaranteed, the firmware falls back on the PLLX sensor’s oscillator with a programmable offset. The result is that all the sensors invalidate their oscillators and use the PLLX sensor’s oscillator with the added offset. This fallback on the PLLX sensor’s oscillator allows for continuous temperature measurement, even at lower voltage levels.

As a side effect of PLLX fallback, the programmable offset compensates for the fact that the PLLX sensor’s oscillator is farther away from the oscillator that it is replacing. The host OS continues to use all the thermal zones without side effects. The offset ensures that the CPU sensor reports more accurate temperatures than the PLLX sensor. The host OS must therefore continue to use the right sensors for measuring the CPU temperatures.

AOTAG

The Always-On Thermal Alert Generator (AOTAG) is a ring-oscillator-based temperature sensor. It is in the always-on power domain and can monitor temperatures even when the device is in the SC7 state. Apart from this distinction, the AOTAG sensor operates in the same way as any of the SOC_THERM sensors.

Thermal Event Detection

Just like the SOC_THERM sensor, the AOTAG sensor can generate interrupts. Additionally, it can monitor two software programmed levels that the BSP uses as:

• Thermal zone trip points

• Hardware thermal shutdown

BPMP Thermal Framework

The BPMP firmware hosts a thermal framework to:

• Register thermal sensors as thermal zones as identified in Thermal Sensing

• Allow BPMP modules to register trips on the thermal zones

• Allow the host OS to register trips using thermal MRQ messages

• Provide trip management and reporting

The thermal framework maintains a list of trips per sensor that includes the current trip from the host OS and from various BPMP modules. As temperatures change, the framework examines the list of current trips and notifies the owner of each affected trip of the temperature change. For BPMP-owned trips, the notification is sent using a callback. For trips owned by the host OS, it is sent using the thermal MRQ command CMD_THERMAL_HOST_TRIP_REACHED.

The primary thermal MRQ requests handled by the framework are:

• CMD_THERMAL_QUERY_ABI

• CMD_THERMAL_GET_TEMP

• CMD_THERMAL_SET_TRIP

• CMD_THERMAL_GET_NUM_ZONES

For details on these MRQ requests, see the comments in this header file in the release package:

<top>/kernel/nvidia/include/soc/tegra/bpmp_abi.h

Because there can be several trips on a given sensor, the thermal framework must ensure that a notification is generated whenever any one of them is crossed. For example, if THERMAL_ZONE_CPU has trips at 55°, 60°, 65°, and 70° Celsius, the thermal framework sends a single notification each time the temperature crosses 55°, 60°, 65°, or 70°.

Additionally, the framework implements hysteresis to prevent sending too many notifications. So, for the above example, the framework:

• Sends one notification when the temperature reaches 55° Celsius

• Waits until the temperature drops below 54°

• Sends another notification when the temperature rises back to 55°

To perform the above notifications, the thermal framework sets low trips on the sensors to receive events that the temperature has dropped below the limit.
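
The hysteresis behavior described above can be illustrated with a small simulation. This is a minimal sketch and not the BPMP implementation: the function name trip_notify, the single trip, and the 1° hysteresis (chosen to match the 55°/54° example) are assumptions made for illustration.

```shell
# Illustrative simulation only; this is not BPMP firmware code.
# trip_notify TRIP HYST TEMP...
#   Prints "notify <temp>" when the temperature reaches TRIP, then stays
#   silent until the temperature has dropped below TRIP - HYST.
trip_notify() (
    trip=$1; hyst=$2; shift 2
    armed=1                       # 1 = ready to report the next upward crossing
    for t in "$@"; do
        if [ "$armed" -eq 1 ] && [ "$t" -ge "$trip" ]; then
            echo "notify $t"
            armed=0               # suppress repeats while hovering near the trip
        elif [ "$armed" -eq 0 ] && [ "$t" -lt $((trip - hyst)) ]; then
            armed=1               # low trip crossed: re-arm the high trip
        fi
    done
)

# Example: trip at 55, re-armed below 54 (integer degrees Celsius).
# Prints "notify 55" twice: once at the first 55, once after the dip to 53.
trip_notify 55 1 53 55 56 54 55 53 55
```

In this sequence the readings 56, 54, and the second 55 produce no output, because the temperature has not yet dropped below 54° since the first notification.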

Hardware Throttling

Each element in a power delivery system includes limitations such as:

• The amount of current a battery can supply without shutting down

• The amount of current a regulator can provide before it fails to maintain its output voltage

• The amount of ripple current an inductor in a switching regulator can tolerate without overheating

These limitations can result in fast transient electrical and thermal events such as:

• Overcurrent at the battery

• Voltage droop at the PMIC

• Temperature spikes

The firmware refers to these as OC alarms and triggers hardware throttling of the clocks to handle them.

Impact

Similar to software throttling, hardware throttling may cause lower performance. However, since the triggering events are rare and transient in nature, the user experience is minimally impacted.

The host OS is not notified of these events, but it can detect the resulting drop in clock rates using performance-measuring tools that sample the CPU cycle counters. While thermal management in the host OS works to keep the temperature under control, hardware throttling clamps down the clocks to handle these transient events.

Throttle Points and Vector Configuration

The BPMP device tree binary holds the various throttle points and the throttle settings that govern when and how throttling is performed. The soctherm driver in the firmware programs the hardware and handles any interrupts resulting from these events. The throttle points can be modified by changing the BPMP device tree.

The throttle temperatures are as follows:

Thermal Zone	Hardware Throttle Limit in Degrees Celsius	Hardware Throttling
THERMAL_ZONE_PLLX	94	Heavy
THERMAL_ZONE_GPU	96.5	Heavy
THERMAL_ZONE_AUX	93	Heavy

The hardware throttling levels are as follows:

Hardware Throttling	Clock Throttled Percentage
Heavy	87.5
Medium	75
Light	50

Throttle vectors are optimized for limiting peak current consumption while maximizing performance. To manage peak current consumption, the firmware supports capping the CPU and GPU clocks at three levels (light, medium, and heavy), as described in the device tree bindings. This capping prevents the CPU and GPU from drawing more current than their voltage regulators can supply.
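
As a rough illustration of what the three levels mean for a clock, the following sketch computes the effective rate at each level. It assumes that "Clock Throttled Percentage" is the share of the clock that is removed while throttling is active, so that heavy throttling leaves 12.5% of the nominal rate. This guide does not state that interpretation explicitly. The function name throttled_khz and the example frequency are hypothetical.

```shell
# Illustrative arithmetic only; assumes the percentage in the table above is
# the share of the clock that is removed while throttling is active.
# throttled_khz LEVEL NOMINAL_KHZ
throttled_khz() (
    level=$1; khz=$2
    case $level in
        heavy)  tenths=875 ;;   # 87.5 %
        medium) tenths=750 ;;   # 75 %
        light)  tenths=500 ;;   # 50 %
        *) echo "unknown level: $level" >&2; exit 1 ;;
    esac
    # Integer math in tenths of a percent avoids floating point in the shell.
    echo $(( khz * (1000 - tenths) / 1000 ))
)

# Example with a hypothetical 2000000 kHz nominal clock.
for level in light medium heavy; do
    echo "$level: $(throttled_khz "$level" 2000000) kHz"
done
```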

Design Considerations

Designing failsafe measures into the Power Management Integrated Circuit (PMIC) or the battery controller so that the device shuts down when these events occur results in a bad user experience. Similarly, designing power delivery hardware for worst-case loads results in large and costly components. Consequently, NVIDIA SoCs are designed for power delivery systems that are adequate for common loads, and they actively manage their components to avoid exceeding the design limits. When these events are transient in nature, the case for this kind of active management becomes more compelling.

Hardware Thermal Shutdown

The final failsafe for firmware thermal management is a hardware thermal reset or thermtrip. If software and hardware throttling are unable to control heat generation in the system, and the software becomes unresponsive, the SoC asserts the reset pin on the PMIC as the hardware shutdown mechanism.

The following are the thermtrip temperatures in Tegra Xavier:

Thermal Zone	Shutdown Limit in Degrees Celsius
THERMAL_ZONE_PLLX	96
THERMAL_ZONE_GPU	98.5
THERMAL_ZONE_AUX	95
THERMAL_ZONE_AO	109
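
The hardware throttle limits and the thermtrip limits in this section can be combined to show how much headroom a given reading has. The following is a minimal sketch under stated assumptions: the function name thermal_headroom is hypothetical, the limits are copied from the two tables, and the reading is passed in millidegrees Celsius, which is the unit used by the standard Linux thermal sysfs temp nodes. This guide does not give a mapping from sysfs zone names to these thermal zones, so the caller names the zone explicitly.

```shell
# Limits copied from the throttle and thermtrip tables in this section,
# expressed in millidegrees Celsius to keep the arithmetic integer-only.
# thermal_headroom ZONE TEMP_MILLIC   (ZONE is PLLX, GPU, AUX, or AO)
thermal_headroom() (
    zone=$1; temp=$2
    case $zone in
        PLLX) throttle=94000; shutdown=96000 ;;
        GPU)  throttle=96500; shutdown=98500 ;;
        AUX)  throttle=93000; shutdown=95000 ;;
        AO)   throttle="";    shutdown=109000 ;;  # no throttle limit is listed for AO
        *) echo "unknown zone: $zone" >&2; exit 1 ;;
    esac
    if [ -n "$throttle" ]; then
        echo "$zone: $((throttle - temp)) mC to throttle, $((shutdown - temp)) mC to shutdown"
    else
        echo "$zone: $((shutdown - temp)) mC to shutdown"
    fi
)

# Example: a GPU reading of 90.0 degrees Celsius.
thermal_headroom GPU 90000
```

A negative value means that the reading is already above the corresponding limit.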

Software-Based Power Consumption Modeling

The Jetson AGX Xavier module has 3-channel INA3221 power monitors at I2C addresses 0x40 and 0x41.

The information from the INA3221 power monitors can be read using sysfs nodes. The naming convention for sysfs nodes is as follows:

Sysfs Node	Description
rail_name_<N>	Exports the rail name.
in_current<N>_input	Exports rail current in mA.
in_voltage<N>_input	Exports rail voltage in mV.
in_power<N>_input	Exports rail power in mW.

Where <N> is a channel number from 0 to 2.

Note:

The INA driver may also present other nodes. Do not modify any INA sysfs node value. Modifying these values can result in damage to your device.

The sysfs nodes for reading the rail names, voltage, current, and power are at:

/sys/bus/i2c/drivers/ina3221x/1-0040/iio:device0

/sys/bus/i2c/drivers/ina3221x/1-0041/iio:device1

The rail names for I2C address 0x40 are:

Rail Name	Description
Channel 0: GPU	GPU power rail
Channel 1: CPU	CPU power rail
Channel 2: SOC	SOC power rail

The rail names for I2C address 0x41 are:

Rail Name	Description
Channel 0: CV	CV power rail
Channel 1: VDDRQ	DDR power rail
Channel 2: SYS5V	System 5V power rail

Examples

• To read the channel-1 rail name from the INA3221 at 0x40, execute the command:

cat /sys/bus/i2c/drivers/ina3221x/1-0040/iio:device0/rail_name_1

• To read the channel-1 current, voltage, and power, execute the commands:

cat /sys/bus/i2c/drivers/ina3221x/1-0040/iio:device0/in_current1_input

cat /sys/bus/i2c/drivers/ina3221x/1-0040/iio:device0/in_voltage1_input

cat /sys/bus/i2c/drivers/ina3221x/1-0040/iio:device0/in_power1_input
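
The per-channel reads above can be wrapped in a loop that reports every rail on a monitor. This is a minimal sketch: the function name read_ina3221 is hypothetical, and it assumes that each node prints a bare number. It only reads the nodes and never writes to them.

```shell
# read_ina3221 DEVICE_DIR
#   Prints the name, current, voltage, and power of each channel (0-2)
#   found in DEVICE_DIR. Channels without a rail_name node are skipped.
read_ina3221() (
    dir=$1
    for n in 0 1 2; do
        [ -r "$dir/rail_name_$n" ] || continue
        name=$(cat "$dir/rail_name_$n")
        ma=$(cat "$dir/in_current${n}_input")
        mv=$(cat "$dir/in_voltage${n}_input")
        mw=$(cat "$dir/in_power${n}_input")
        echo "$name: $ma mA, $mv mV, $mw mW"
    done
)

# Report both monitors; prints nothing on a system without these nodes.
read_ina3221 /sys/bus/i2c/drivers/ina3221x/1-0040/iio:device0
read_ina3221 /sys/bus/i2c/drivers/ina3221x/1-0041/iio:device1
```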

Related Tools and Techniques

This section describes the tools and techniques to manage power.

3D Frequency Scaling

3D frequency scaling is enabled by default.

To disable 3D frequency scaling

• Run the following command:

echo 0 > /sys/devices/17000000.gv11b/enable_3d_scaling

To enable 3D frequency scaling

• Run the following command:

echo 1 > /sys/devices/17000000.gv11b/enable_3d_scaling

Getting and Setting Frequencies

Use the following procedures to set frequencies and report current frequency settings.

In all of these procedures, <x> is a CPU core number. For example, to apply a command to CPU core 1, replace cpu<x> with cpu1.

To get system clock information

• Run the following command:

cat /sys/kernel/debug/bpmp/debug/clk/clk_tree

To print the CPU lower boundary, upper boundary, and current frequency

• Run the following commands:

cat /sys/devices/system/cpu/cpu<x>/cpufreq/cpuinfo_min_freq

cat /sys/devices/system/cpu/cpu<x>/cpufreq/cpuinfo_max_freq

cat /sys/devices/system/cpu/cpu<x>/cpufreq/cpuinfo_cur_freq

To change the CPU upper boundary

• Run the following command:

echo <cpu_freq> > /sys/devices/system/cpu/cpu<x>/cpufreq/scaling_max_freq

To change the CPU lower boundary

• Run the following command:

echo <cpu_freq> > /sys/devices/system/cpu/cpu<x>/cpufreq/scaling_min_freq

To set the static CPU frequency

• Run the following commands:

echo <cpu_freq> > /sys/devices/system/cpu/cpu<x>/cpufreq/scaling_min_freq

echo <cpu_freq> > /sys/devices/system/cpu/cpu<x>/cpufreq/scaling_max_freq

Where:

• <cpu_freq> is one of the frequency values listed in:

/sys/devices/system/cpu/cpu<x>/cpufreq/scaling_available_frequencies
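
The static-frequency procedure above can be sketched as a helper that first checks the requested value against scaling_available_frequencies. The function name set_static_cpu_freq is hypothetical. The write order is an assumption about kernel behavior: some kernels reject a write that would leave the minimum above the maximum, so the sketch raises the upper boundary first when moving up and lowers the lower boundary first otherwise. Frequencies are in kHz, as in the cpufreq sysfs nodes, and writing them requires root privileges.

```shell
# set_static_cpu_freq CPUFREQ_DIR FREQ_KHZ
#   CPUFREQ_DIR is, for example, /sys/devices/system/cpu/cpu1/cpufreq
set_static_cpu_freq() (
    dir=$1; freq=$2
    # Accept only values listed in scaling_available_frequencies.
    ok=0
    for f in $(cat "$dir/scaling_available_frequencies"); do
        if [ "$f" = "$freq" ]; then ok=1; fi
    done
    if [ "$ok" -ne 1 ]; then
        echo "unsupported frequency: $freq" >&2
        exit 1
    fi
    cur_max=$(cat "$dir/scaling_max_freq")
    if [ "$freq" -gt "$cur_max" ]; then
        # Moving up: raise the upper boundary before the lower one.
        echo "$freq" > "$dir/scaling_max_freq"
        echo "$freq" > "$dir/scaling_min_freq"
    else
        # Moving down or staying: set the lower boundary first.
        echo "$freq" > "$dir/scaling_min_freq"
        echo "$freq" > "$dir/scaling_max_freq"
    fi
)

# Usage on the target, as root (the frequency is a placeholder):
#   set_static_cpu_freq /sys/devices/system/cpu/cpu1/cpufreq <cpu_freq>
```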

To print the GPU lower boundary, upper boundary, and current frequency

• Run the following commands:

cat /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/min_freq

cat /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/max_freq

cat /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/cur_freq

To change the GPU upper boundary

• Run the following command:

echo <gpu_freq> > /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/max_freq

To change the GPU lower boundary

• Run the following command:

echo <gpu_freq> > /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/min_freq

To set the static GPU frequency

• Run the following commands:

echo <gpu_freq> > /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/min_freq

echo <gpu_freq> > /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/max_freq

Where <gpu_freq> is the value available in:

/sys/devices/17000000.gv11b/devfreq/17000000.gv11b/available_frequencies
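
The three GPU frequency nodes can be read together and shown in MHz. This is a minimal sketch: the function name devfreq_summary is hypothetical, and it assumes that the nodes report values in Hz, which is the usual devfreq convention. The division truncates to whole MHz.

```shell
# devfreq_summary DEVFREQ_DIR
#   Prints min_freq, max_freq, and cur_freq from DEVFREQ_DIR in whole MHz.
devfreq_summary() (
    dir=$1
    for node in min_freq max_freq cur_freq; do
        [ -r "$dir/$node" ] || continue
        hz=$(cat "$dir/$node")
        echo "$node: $((hz / 1000000)) MHz"   # integer division truncates
    done
)

# Prints nothing on a system without these nodes.
devfreq_summary /sys/devices/17000000.gv11b/devfreq/17000000.gv11b
```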

To print the EMC lower boundary, upper boundary, and current frequency

• Run the following commands:

cat /sys/kernel/debug/bpmp/debug/clk/emc/min_rate

cat /sys/kernel/debug/bpmp/debug/clk/emc/max_rate

cat /sys/kernel/debug/bpmp/debug/clk/emc/rate

To change the EMC upper boundary

• Run the following command:

echo <emc_freq> > /sys/kernel/debug/bpmp/debug/clk/emc/max_rate

To change the EMC lower boundary

• Run the following command:

echo <emc_freq> > /sys/kernel/debug/bpmp/debug/clk/emc/min_rate

To set the static EMC frequency

• Run the following commands:

echo 1 > /sys/kernel/debug/bpmp/debug/clk/emc/mrq_rate_locked

echo 1 > /sys/kernel/debug/bpmp/debug/clk/emc/state

echo <emc_freq> > /sys/kernel/debug/bpmp/debug/clk/emc/rate

Where <emc_freq> is a frequency value between the EMC min_rate and max_rate.
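
The range requirement on <emc_freq> can be checked before the three writes are issued. The following is a minimal sketch: the function name set_static_emc_rate is hypothetical, and it assumes that min_rate, max_rate, and rate all use the same unit. Writing these debugfs nodes requires root privileges.

```shell
# set_static_emc_rate CLK_DIR RATE
#   CLK_DIR is, for example, /sys/kernel/debug/bpmp/debug/clk/emc
set_static_emc_rate() (
    dir=$1; rate=$2
    min=$(cat "$dir/min_rate")
    max=$(cat "$dir/max_rate")
    # Refuse values outside the range reported by the clock itself.
    if [ "$rate" -lt "$min" ] || [ "$rate" -gt "$max" ]; then
        echo "rate $rate is outside [$min, $max]" >&2
        exit 1
    fi
    # Same three writes as in the procedure above, in the same order.
    echo 1 > "$dir/mrq_rate_locked"
    echo 1 > "$dir/state"
    echo "$rate" > "$dir/rate"
)

# Usage on the target, as root (the rate is a placeholder):
#   set_static_emc_rate /sys/kernel/debug/bpmp/debug/clk/emc <emc_freq>
```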

Maximizing Jetson AGX Xavier Performance

The Tegra BSP provides the jetson_clocks.sh script, which maximizes Jetson AGX Xavier performance by setting the CPU, GPU, and EMC clocks to their maximum frequencies as static values. The script can also show the current clock settings, store them in a file, and restore them from a file. The jetson_clocks.sh script is available at:

$HOME/jetson_clocks.sh

Basic usage is as follows:

jetson_clocks.sh [options]

Options	Description
--show	Displays the current settings.
--store [file]	Stores the current settings to a file. The default file is l4t_dfs.conf.
--restore [file]	Restores the saved settings from the file. The default file is l4t_dfs.conf.

To show the current settings

• Execute the command:

sudo ${HOME}/jetson_clocks.sh --show

To store the current settings

• Execute the command:

sudo ${HOME}/jetson_clocks.sh --store

To maximize Jetson AGX Xavier performance

• Execute the command:

sudo ${HOME}/jetson_clocks.sh

To restore the previous settings

• Execute the command:

sudo ${HOME}/jetson_clocks.sh --restore
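
The store, maximize, and restore steps above are often used together around a single workload. The following is a minimal sketch of that sequence: the function name run_at_max_clocks and the example file and workload names are hypothetical, and only the --store and --restore options documented above are used. It must be run as root, because the script changes clock settings.

```shell
# run_at_max_clocks SCRIPT CONF COMMAND [ARGS...]
#   Saves the current settings to CONF, maximizes the clocks, runs COMMAND,
#   then restores the saved settings and returns COMMAND's exit status.
run_at_max_clocks() (
    script=$1; conf=$2; shift 2
    "$script" --store "$conf" || exit 1    # save the current settings
    "$script" || exit 1                    # no options: maximize performance
    status=0
    "$@" || status=$?                      # run the workload, keep its status
    "$script" --restore "$conf"            # restore even if the workload failed
    exit "$status"
)

# Usage on the target, as root (file and workload names are placeholders):
#   run_at_max_clocks "${HOME}/jetson_clocks.sh" /tmp/l4t_dfs.conf ./my_benchmark
```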

Using CPU Hot Plug

Manage CPU hot plug with the following procedures.

To manually turn on/off slave CPUs

1. Run the following command to turn on the slave CPU:

echo 1 > /sys/devices/system/cpu/cpu<x>/online

2. Run the following command to turn the slave CPU off:

echo 0 > /sys/devices/system/cpu/cpu<x>/online

Where <x> is a CPU core number.

To check CPU state

• Run the following command:

cat /sys/devices/system/cpu/cpu<x>/online

Where <x> is the CPU core number.
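
To check every core at once, the same node can be read in a loop. This is a minimal sketch: the function name cpu_online_states is hypothetical. It assumes that a core without an online node cannot be taken offline, which is how the kernel typically presents such cores, and reports it as always online.

```shell
# cpu_online_states [CPU_ROOT]
#   Prints the online state (1 or 0) of each cpu<x> directory under CPU_ROOT.
cpu_online_states() (
    root=${1:-/sys/devices/system/cpu}
    for d in "$root"/cpu[0-9]*; do
        [ -d "$d" ] || continue
        if [ -r "$d/online" ]; then
            state=$(cat "$d/online")
        else
            state="always online"      # no online node for this core
        fi
        echo "${d##*/}: $state"
    done
)

cpu_online_states
```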