Jetson Thor Product Family#

This topic describes power and performance management features of the NVIDIA® Jetson™ Thor™ product family. It describes the power, thermal, and electrical management features visible to software, as well as some tools and related techniques.

NVIDIA Jetson Board Support Packages (BSPs) provide many features for power management, thermal management, and electrical management. Within the constraints of a particular platform, these features aim to deliver the best possible user experience:

  • Uniformly high performance

  • Excellent battery life

  • Perfect stability

  • Cool operation (the device is comfortable to touch)

Interacting Features#

Power, thermal, and electrical management features place dynamic constraints on many operational settings (“knobs”), such as:

  • Clock gate settings

  • Clock frequencies

  • Power gate (or regulator enable) settings

  • Voltages

  • Processor power state (i.e., which idle state is selected for the CPU)

  • Peripheral power state (i.e., which idle state is selected for an I/O controller)

  • Chipset power state

  • Availability of CPU cores to the OS

Some of these knobs are constrained by more than one feature. For example, cpufreq implements load-based scaling, which adjusts the CPU frequency according to how busy the CPU is. CPU thermal management, however, can override the target frequency of cpufreq. Consequently, before you attempt to debug power, performance, thermal, or electrical problems, familiarize yourself with all of the power, thermal, and electrical management features in the BSP.

Kernel Space Power Saving Features#

This section describes BSP features that save power and extend battery life. Many of these features are implemented by the Linux kernel, with support from firmware and hardware, and without significant involvement from the user space.

Chipset Power States#

The supported power states are listed in order of increasing flexibility or configurability:

  • Off: There is only one way for a system to be off.

  • Deep Sleep (SC7) offers a small amount of configurability. For example, before entering Deep Sleep, the software can select which hardware wake events can wake the chip from Deep Sleep.

  • Active state is extraordinarily flexible in terms of power and performance. It encompasses activity levels from low-power audio playback through peak performance. Power consumption in the Active state varies widely with the workload, from tens of milliwatts at the low end up to the power budget of the module.

Supported Power States#

The supported power states are:

| Power State | Power rails | State | Exiting |
| --- | --- | --- | --- |
| Off | None of the power rails supplying the SoC and DRAM are powered. | No state is maintained in the SoC or DRAM. | Into Active state via cold boot. |
| Deep Sleep (SC7) | VDD_RTC, VDD_1V2, DDR_VDD2, VDD_3V3, VDD_1V8, and DRAM power rails are powered on. VDD_SOC and VDD_MSS are powered off. | The SoC maintains a small amount of state information in the PMC block. DRAM maintains the state. | Into Active state via a predefined set of wake events. |
| Active | VDD_RTC, DDR_VDD2, VDD_SOC, VDD_MSS, and DRAM rails are powered on. | Software actively manages the power states of the devices that make up the SoC. | Software can initiate a transition from Active to any other power state. |

Power State Mapping to Linux#

BSP maps chipset power states to Linux power states as follows.

| Chipset power state | Linux power state | Comments |
| --- | --- | --- |
| Off | Off | |
| Deep Sleep (SC7) | Suspend to RAM | The software can choose whether to enter Deep Sleep before the OS enters Suspend. |
| Active | Running/Idle (display on or off) | Many SoC devices may be idle or disabled under driver control. For example, VDD_GPU may be powered off and the companion GPU may be power-gated. |

Deep Sleep (SC7)#

If the system uses the systemd init system, you can initiate deep sleep from user space with the following command:

$ sudo systemctl suspend

You can also use the following command:

$ sudo bash -c "echo mem > /sys/power/state"

The first method of entering deep sleep is preferred because it cooperates better with systemd, which maintains the Linux runlevel. If your system is not running systemd, use the second method.

The system can be awakened from deep sleep by the common wake sources available on Jetson platforms:

  • Power button: Press and release the power button on the Jetson device. If no power button is available, momentarily connect the power button pin to ground and then disconnect it.

  • RTC alarm: Before entering the low-power state, program the RTC alarm with the following command, which sets the alarm 10 seconds in the future:

    $ sudo bash -c "echo +10 > /sys/class/rtc/rtc<x>/wakealarm"

    where <x> is the RTC ID. rtc<x> is the RTC used by the system, which you can find with the following command:

    $ find /sys/class/rtc/* -maxdepth 0 -printf "%f:" -exec bash -c \
               "cat {}/hctosys" \; | grep :1 | cut -d: -f1 | head -n1

  • USB type-C cable hotplug: To wake the device, connect or disconnect a USB cable at the USB type-C port.

  • USB IO hotplug and USB remote wakeup: Before entering SC7 state, enable the wakeup functionality for USB devices:

    $ echo enabled | sudo tee /sys/bus/usb/devices/<usb_device>/power/wakeup

    In this command, <usb_device> is the USB device.

    To wake the device, plug in or unplug any USB I/O device on a Type-A port, or press any key on a connected keyboard.

  • Wake on LAN: Before entering SC7 state, enable Wake on LAN on the Ethernet port:

    $ sudo ethtool -s <interface> wol g

    On another machine on the same LAN, enter the following:

    $ sudo etherwake -i <interface> <MAC_address_of_target>
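As an illustration of the RTC alarm wake source above, the following minimal Python sketch mirrors the find command: it looks for the first RTC whose hctosys flag is 1 and builds the matching wakealarm command. The function names find_system_rtc and wakealarm_command are hypothetical helpers introduced here for illustration; they are not part of the BSP.

```python
from pathlib import Path


def find_system_rtc(rtc_class_dir="/sys/class/rtc"):
    """Return the name (for example "rtc0") of the first RTC whose
    hctosys flag is 1, or None if no such RTC is found."""
    for rtc in sorted(Path(rtc_class_dir).glob("rtc*")):
        flag = rtc / "hctosys"
        if flag.is_file() and flag.read_text().strip() == "1":
            return rtc.name
    return None


def wakealarm_command(rtc_name, seconds):
    """Build the shell command that arms a wake alarm <seconds> from now."""
    return f'sudo bash -c "echo +{seconds} > /sys/class/rtc/{rtc_name}/wakealarm"'


if __name__ == "__main__":
    rtc = find_system_rtc()
    if rtc is None:
        print("No RTC with hctosys=1 found")
    else:
        print(wakealarm_command(rtc, 10))
```

The sketch only prints the command rather than writing to sysfs, so it can be run without root privileges to check which RTC would be used.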

General Clock Management#

Clocks are managed by the BPMP on Tegra SoCs. You can manage the clocks in two ways:

  • Use the common clock framework (CCF) APIs provided by the kernel.

  • Use the BPMP clock debugfs interface.

Common Clock Framework#

The BPMP clock provider driver in the kernel implements the common clock framework (CCF) APIs. When kernel drivers manage clocks through CCF APIs such as clk_set_rate and clk_get_rate, the BPMP clock provider driver translates those requests into MRQ calls to the BPMP, and the BPMP performs the actual clock configuration.

To set the clock rate to a specific value, use the clk_set_rate API. To get the clock rate of a specific clock, use the clk_get_rate API.

Typically, the first argument of a CCF API is a clock handle of type struct clk *. To obtain the clock handle, define the clock properties in the device tree and call the devm_clk_get API in the kernel driver.

The following example shows how to define the clock properties in the device tree for the VIC hardware:

vic@8188050000 {
    clocks = <&bpmp TEGRA264_CLK_VIC>;
    clock-names = "vic";
};

The following example shows how to get the clock handle in the kernel driver:

struct clk *clk = devm_clk_get(dev, "vic");

BPMP Clock DebugFS#

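BPMP clock debugfs provides a low-level management interface to the clocks. Use it only for debugging; it is not intended for production software. Because debugfs is normally accessible only to root, run the commands in this section from a root shell (for example, one started with sudo -i).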

To get an overview of the BPMP clock topology, you can dump the clock tree using the following command:

$ cat /sys/kernel/debug/bpmp/debug/clk/clk_tree

The following is example output from the clk_tree command:

clock                                                 on       rate bpmp  mrq vdd
---------------------------------------------------------------------------------
clk_m                                                  1   19200000   11    1
    actmon                                             1   19200000    2    1 vdd_core@625000
    tach0                                              1    1010526    1    1 vdd_core@625000

The columns have the following meanings:

  • clock: Clock name. Indentation shows the clock hierarchy.

  • on: Clock state.

  • rate: Clock rate in Hertz.

  • bpmp: Reference count of the clock in BPMP.

  • mrq: Reference count of the clock outside BPMP.

  • vdd: Voltage domain of the clock and the requested voltage of the clock running at the current rate.
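To make the column layout concrete, here is a minimal Python sketch that parses clk_tree output into one record per clock. It assumes the columns are separated by whitespace and that the hierarchy is expressed with leading spaces, as in the example output above. The function name parse_clk_tree and the dictionary keys are illustrative choices, not part of any NVIDIA tool.

```python
def parse_clk_tree(text):
    """Parse clk_tree output into a list of dicts, one per clock."""
    clocks = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the dashed separator.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        clocks.append({
            "name": fields[0],
            # Number of leading spaces; deeper clocks are children.
            "indent": len(line) - len(line.lstrip(" ")),
            "on": fields[1] == "1",
            "rate_hz": int(fields[2]),
            "bpmp_refs": int(fields[3]),
            "mrq_refs": int(fields[4]),
            # The vdd column is empty for some clocks.
            "vdd": fields[5] if len(fields) > 5 else None,
        })
    return clocks


SAMPLE = """\
clock                                                 on       rate bpmp  mrq vdd
---------------------------------------------------------------------------------
clk_m                                                  1   19200000   11    1
    actmon                                             1   19200000    2    1 vdd_core@625000
    tach0                                              1    1010526    1    1 vdd_core@625000
"""

if __name__ == "__main__":
    for clock in parse_clk_tree(SAMPLE):
        print(clock["name"], clock["rate_hz"], clock["vdd"])
```

On a device, the same function could be fed the contents of the clk_tree node read as root.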

To get the minimum and maximum clock rates of a specific clock, you can read the min_rate and max_rate attributes:

$ cat /sys/kernel/debug/bpmp/debug/clk/<clock_name>/min_rate
$ cat /sys/kernel/debug/bpmp/debug/clk/<clock_name>/max_rate

To lock the clock rate to a specific value, you must first disable the MRQ interface of that clock:

$ echo 1 > /sys/kernel/debug/bpmp/debug/clk/<clock_name>/mrq_rate_locked

Then you can write the desired clock rate to the rate attribute:

$ echo <rate> > /sys/kernel/debug/bpmp/debug/clk/<clock_name>/rate

To unlock the clock rate, you must re-enable the MRQ interface of that clock:

$ echo 0 > /sys/kernel/debug/bpmp/debug/clk/<clock_name>/mrq_rate_locked

To get the HW clock rate of a specific clock, you can read the pto_counter attribute:

$ cat /sys/kernel/debug/bpmp/debug/clk/<clock_name>/pto_counter

The pto_counter attribute shows the real clock rate measured by the hardware. The value of pto_counter should be very close to the value read from the rate attribute, which is the software-requested clock rate.

General Regulator Management#

Regulators are managed by the BPMP on Tegra SoCs. BPMP abstracts the complexity of voltage scaling and regulator control from the kernel drivers. Kernel drivers do not have direct access to regulators; some operations are automatically handled by the BPMP.

For example, when a kernel driver scales the clock rate of a specific clock, the BPMP automatically scales the regulator output voltage up or down to support the new clock rate. When the kernel decides to power off a specific power domain, the BPMP can turn off the regulator to save power.

Four voltage domains come from two VRS11 switching regulators on Jetson Thor: VDD_CPU, VDD_GPU, VDD_MSS, and VDD_SOC.

To get the current output voltage of a specific voltage domain, you can read the voltage attribute:

$ cat /sys/kernel/debug/bpmp/debug/regulator/<voltage_domain_name>/voltage

To get the minimum and maximum output voltage of a specific voltage domain, you can read the min_uv and max_uv attributes:

$ cat /sys/kernel/debug/bpmp/debug/regulator/<voltage_domain_name>/min_uv
$ cat /sys/kernel/debug/bpmp/debug/regulator/<voltage_domain_name>/max_uv

To lock the voltage level of a specific voltage domain, write the desired voltage level to the override attribute:

$ echo <voltage> > /sys/kernel/debug/bpmp/debug/regulator/<voltage_domain_name>/override

To unlock the voltage level of a specific voltage domain, write 0 to the override attribute:

$ echo 0 > /sys/kernel/debug/bpmp/debug/regulator/<voltage_domain_name>/override

System Power Measurement#

Jetson Thor integrates a three-channel INA3221 power monitor on the module and a one-channel INA238 power monitor on the carrier board. These two sensors are exposed under the Linux hwmon subsystem at the following paths:

  • INA3221: /sys/bus/i2c/devices/2-0040/hwmon/hwmon*/

  • INA238: /sys/bus/i2c/devices/2-0044/hwmon/hwmon*/

Note

For a list of the sysfs nodes exposed by each power monitor, see the Linux kernel hwmon documentation for the ina3221 and ina238 drivers.

The following table describes the power channels monitored by INA3221:

| Channel Name | Description |
| --- | --- |
| Channel 1: VDD_GPU | Power consumed by GPU. |
| Channel 2: VDD_CPU_SOC_MSS | Power consumed by CPU, engines in SOC power domain, and engines in MSS power domain. |
| Channel 3: VIN_SYS_5V0 | Power consumed by system 5V rail, which supplies various I/O devices. |

The following table describes the power channels monitored by INA238:

| Channel Name | Description |
| --- | --- |
| Channel 1: VIN | Power consumed by overall system, including module power and carrier board power. |

To sample the measured power values periodically as a time series, run the tegrastats utility:

$ sudo tegrastats
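For scripted measurements, the hwmon nodes can also be read directly. The following Python sketch computes the power of one channel from its voltage and current readings. It assumes the driver follows the standard Linux hwmon naming convention, in which in<N>_input reports millivolts and curr<N>_input reports milliamperes; confirm the node names on your device before relying on them. The helper names power_mw and read_channel_power_mw are hypothetical.

```python
from pathlib import Path


def power_mw(voltage_mv, current_ma):
    """Convert a voltage in millivolts and a current in milliamperes
    to a power value in milliwatts."""
    return voltage_mv * current_ma / 1000.0


def read_channel_power_mw(hwmon_dir, channel):
    """Read one channel of a power monitor from its hwmon directory.

    Assumes standard hwmon attribute names: in<N>_input (mV) and
    curr<N>_input (mA).
    """
    base = Path(hwmon_dir)
    voltage_mv = int((base / f"in{channel}_input").read_text())
    current_ma = int((base / f"curr{channel}_input").read_text())
    return power_mw(voltage_mv, current_ma)


if __name__ == "__main__":
    # The INA3221 path comes from the list above; the hwmon index varies.
    for hwmon in Path("/sys/bus/i2c/devices/2-0040/hwmon").glob("hwmon*"):
        for channel in (1, 2, 3):
            print(channel, read_channel_power_mw(hwmon, channel), "mW")
```

Separating the arithmetic from the sysfs access keeps the conversion testable on a machine without the sensors.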

System Power Capping#

To protect both chip and board from damage, the INA3221 and INA238 power monitors can prevent power consumption from exceeding the Thermal Design Power (TDP) budget of the system.

Whenever the instantaneous or average current exceeds the configured overcurrent (OC) limit, the module applies hardware-based clock throttling to the CPU and GPU.

The following table shows TDP budgets and OC limits for various Jetson Thor modules.

| Module | Module TDP Budget | Limits | NVTHERM_OC Pin | Throttling Level |
| --- | --- | --- | --- | --- |
| Jetson AGX Thor Developer Kit | 130 watts | Average VDD_GPU power: 100 watts | OC3 | CPU: 50%, GPU: 50% |
| Jetson AGX Thor Developer Kit | 130 watts | Instantaneous VDD_GPU power: 120 watts | OC3 | CPU: 50%, GPU: 50% |
| Jetson AGX Thor Developer Kit | 130 watts | Instantaneous (VDD_GPU + VDD_CPU_SOC_MSS) power: 144 watts | OC3 | CPU: 50%, GPU: 50% |
| Jetson AGX Thor Developer Kit | 130 watts | Average VIN power: 168 watts | OC2 | CPU: 75%, GPU: 75% |
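The limits in the table above can be expressed as a small lookup, which is handy when checking logged power measurements offline. The following Python sketch is only an illustration: the throttling itself happens in hardware, and the measurement keys and the exceeded_limits function are hypothetical names chosen for this example.

```python
# Limits copied from the table above (Jetson AGX Thor Developer Kit).
# Each entry: (measurement key, limit in watts, OC pin, throttling level in %).
OC_LIMITS = [
    ("avg_vdd_gpu", 100, "OC3", 50),
    ("inst_vdd_gpu", 120, "OC3", 50),
    ("inst_vdd_gpu_plus_cpu_soc_mss", 144, "OC3", 50),
    ("avg_vin", 168, "OC2", 75),
]


def exceeded_limits(measurements_w):
    """Return (key, OC pin, throttling level) for every limit that the
    given measurements (in watts) exceed. Missing keys are ignored."""
    return [
        (key, pin, level)
        for key, limit, pin, level in OC_LIMITS
        if measurements_w.get(key, 0) > limit
    ]


if __name__ == "__main__":
    sample = {"avg_vdd_gpu": 105, "avg_vin": 150}
    print(exceeded_limits(sample))
```

A logged sample that returns a non-empty list indicates that the corresponding OC event would be expected to trigger.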

Caution

The INA3221 or INA238 driver might expose additional nodes. Avoid relaxing the current or power limit values in sysfs in an attempt to bypass hardware-based clock throttling. Modifying these limits does not ensure maximum performance and can cause permanent device damage. Instead, use more conservative settings to align with cooling and power management requirements.

Note

The included power adapter (ADP-240LB) can deliver up to 140 W (28 V × 5 A). To prevent excessive current draw, it features built-in overcurrent protection (OCP). When OCP is triggered, the adapter immediately cuts power to the device. To avoid tripping the adapter’s OCP, a 168 W limit is enforced on the Jetson AGX Thor Developer Kit. Users must ensure that the device’s power consumption does not exceed the power adapter’s rated capacity.

Overcurrent Event Status#

Three OC pins/events are available in the Jetson Thor series.

  • To determine which OC event is enabled, use the following sysfs nodes:

    $ grep "" /sys/class/hwmon/hwmon*/oc*_throt_en
    
  • Use the following sysfs nodes to learn the number of OC events that occurred:

    $ grep "" /sys/class/hwmon/hwmon*/oc*_event_cnt
    

CPU Power Management#

Jetson Thor provides three main features for CPU power management:

  • CPU dynamic frequency scaling

  • CPU hotplug power management

  • CPU idle power management

The following sections provide details about each of these features.

CPU Dynamic Frequency Scaling#

A Jetson Thor device contains 14 CPU clusters in 7 cluster pairs. CPU clusters in the same cluster pair share a clock unit so that they operate at the same frequency. The dynamic voltage and frequency scaling (DVFS) policy is enforced at the cluster-pair level rather than on individual CPU clusters.

By default, all cluster pairs use the schedutil cpufreq governor, which is a load-based frequency scaling policy. This governor reduces the CPU frequency when the estimated load on the CPU cluster is low.

Caution

With the real-time (RT) kernel, CPU DVFS is disabled by default to meet latency requirements.

To achieve maximum and consistent CPU performance, you can switch to the performance governor, which disables CPU DVFS and locks all CPU clusters at their maximum frequency:

$ echo performance > /sys/devices/system/cpu/cpufreq/policy<x>/scaling_governor

To check the current CPU frequency, you can read the cpuinfo_cur_freq node:

$ cat /sys/devices/system/cpu/cpufreq/policy<x>/cpuinfo_cur_freq

To check or update the maximum CPU frequency, use the scaling_max_freq node:

$ cat /sys/devices/system/cpu/cpufreq/policy<x>/scaling_max_freq
$ echo <KHz> > /sys/devices/system/cpu/cpufreq/policy<x>/scaling_max_freq

To check or update the minimum CPU frequency, use the scaling_min_freq node:

$ cat /sys/devices/system/cpu/cpufreq/policy<x>/scaling_min_freq
$ echo <KHz> > /sys/devices/system/cpu/cpufreq/policy<x>/scaling_min_freq
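To review the settings of all cluster pairs at once, the nodes described above can be collected in a short script. The following Python sketch reads only the scaling_governor, scaling_min_freq, and scaling_max_freq nodes mentioned in this section. The function name read_policies is a hypothetical helper, and the sketch assumes that these nodes are readable by the current user.

```python
from pathlib import Path

NODES = ("scaling_governor", "scaling_min_freq", "scaling_max_freq")


def read_policies(cpufreq_dir="/sys/devices/system/cpu/cpufreq"):
    """Return {policy name: {node name: value}} for every cpufreq policy."""
    policies = {}
    for policy in sorted(Path(cpufreq_dir).glob("policy*")):
        policies[policy.name] = {
            node: (policy / node).read_text().strip()
            for node in NODES
            if (policy / node).is_file()
        }
    return policies


if __name__ == "__main__":
    for name, values in read_policies().items():
        print(name, values)
```

Frequencies are reported in kHz, matching the units used when writing scaling_max_freq and scaling_min_freq.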

CPU Hotplug Power Management#

On a Jetson Thor device, each of the 14 clusters can be individually turned on or off during runtime to save CPU power.

To turn off a specific CPU cluster, write 0 to the online node in the cpu<x> directory:

$ echo 0 > /sys/devices/system/cpu/cpu<x>/online

To turn on a specific CPU cluster, write 1 to the online node in the cpu<x> directory:

$ echo 1 > /sys/devices/system/cpu/cpu<x>/online

CPU Idle Power Management#

Jetson Thor uses ARM CPU cores, and CPU idle management is supported in the Linux kernel through the upstream psci_idle driver.

For each core, an idle task is scheduled when no other runnable tasks are left in that CPU’s runqueue. This task puts the core into a low-power state selected by the cpuidle menu governor. The core stays in that state until an interrupt wakes it to process more work.

Idle States#

The following table summarizes the idle states that Jetson Thor devices and the BSP software support.

| State | Meaning |
| --- | --- |
| WFI | Core clock gating |
| CC7 | Cluster power gating |

Disable cpuidle Power Feature#

If you want to completely disable the cpuidle power feature, disable the cc7 idle state in the device tree by setting the status property of its node to "disabled":

idle-states {
      cc7 {
              entry-latency-us = <0x1388>;      /* 5000 us */
              exit-latency-us = <0x1388>;       /* 5000 us */
              state-name = "Cluster Powergate";
              arm,psci-suspend-param = <0x40000007>;
              compatible = "arm,idle-state";
              status = "disabled";              /* disables the cc7 idle state */
              phandle = <0x15f>;
              min-residency-us = <0x61a8>;      /* 25000 us */
      };
};

If you want to disable a specific idle state <y> on a specific cluster <x> at runtime, write 1 to the disable sysfs node:

$ echo 1 > /sys/devices/system/cpu/cpu<x>/cpuidle/state<y>/disable

Statistics of Idle State Usage#

To get the number of times that cluster <x> has entered idle state <y>, read the usage node:

$ cat /sys/devices/system/cpu/cpu<x>/cpuidle/state<y>/usage

To get the total time in microseconds that cluster <x> has spent in idle state <y> since boot, read the time node:

$ cat /sys/devices/system/cpu/cpu<x>/cpuidle/state<y>/time
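The usage and time nodes can be combined to estimate how long each idle period lasts on average, which helps to judge whether a state is entered for long enough to be worthwhile. The following Python sketch shows the calculation; the helper names read_idle_state and mean_residency_us are hypothetical.

```python
from pathlib import Path


def read_idle_state(cpu, state, cpu_root="/sys/devices/system/cpu"):
    """Return (usage count, total time in microseconds) for one idle state."""
    base = Path(cpu_root) / f"cpu{cpu}" / "cpuidle" / f"state{state}"
    usage = int((base / "usage").read_text())
    time_us = int((base / "time").read_text())
    return usage, time_us


def mean_residency_us(usage, time_us):
    """Average time in microseconds spent in the state per entry."""
    if usage == 0:
        return 0.0
    return time_us / usage


if __name__ == "__main__":
    usage, time_us = read_idle_state(cpu=0, state=1)
    print(f"entries={usage} total={time_us}us "
          f"mean={mean_residency_us(usage, time_us):.1f}us")
```

Comparing the mean residency with the min-residency-us value of the state in the device tree gives a rough indication of whether the state is being used effectively.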

GPU Power Management#

GPU Low Power#

The GPU Rail-Gating low power feature allows the GPU power to be turned off when the GPU is idle.

Use the following command to determine whether the GPU Rail-Gating feature is enabled:

$ cat /proc/driver/nvidia/gpus/0000:01:00.0/power | grep "Rail-Gating"

GPU Rail-Gating increases the wake-up latency of the GPU. If your workload is sensitive to wake-up latency, use the following command to disable GPU Rail-Gating:

$ echo on > /sys/bus/pci/devices/0000:01:00.0/power/control

If power consumption matters more to your workload than latency, use the following command to enable GPU Rail-Gating:

$ echo auto > /sys/bus/pci/devices/0000:01:00.0/power/control

GPU Dynamic Frequency Scaling#

The GPU has two main clock domains exposed through the Linux devfreq framework:

  • GPC: Controls performance of compute and graphics tasks.

  • NVD: Controls performance of multimedia tasks like video decode/encode, JPEG processing, and OFA.

These domains can be monitored and controlled through their respective sysfs interfaces:

  • GPC: /sys/class/devfreq/gpu-gpc-0/

  • NVD: /sys/class/devfreq/gpu-nvd-0/

By default, both domains use dynamic frequency scaling with the nvhost_podgov governor, which automatically adjusts clock speeds based on GPU load. The governor’s behavior can be customized through parameters in the nvhost_podgov directory of each domain:

$ cd /sys/class/devfreq/gpu-gpc-0/nvhost_podgov
$ cd /sys/class/devfreq/gpu-nvd-0/nvhost_podgov

Caution

With the real-time (RT) kernel, GPU DVFS is disabled to improve latency, because each DVFS cycle queries the GPU load through a slow path.

Note that NVIDIA does not guarantee the functionality or performance of the GPU kernel driver with the PREEMPT_RT kernel.

The following table lists the configurable parameters for the nvhost_podgov governor:

| Parameter | Description |
| --- | --- |
| load_max | Load threshold at which the governor scales the clock to maximum frequency. |
| load_target | Target load that the governor maintains for the engine. When the underlying load exceeds the target value, the governor scales up the clock rate to reduce the load. |
| load_margin | Load margin associated with the load_target parameter. When the underlying load drops below load_target - load_margin, the governor scales down the clock rate to increase the load. |
| k | The moving-average weight factor for the average load calculation inside the governor. The average load of the device is calculated as follows: load_avg = (load_avg * (2**k - 1) + load) / (2**k) |
| up_freq_margin | Number of frequency steps for each up-scaling operation. |
| down_freq_margin | Number of frequency steps for each down-scaling operation. |
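To see how the k parameter shapes the averaged load, the formula above can be evaluated directly. The following Python sketch applies one update step at a time; a larger k gives more weight to the history, so the average reacts more slowly to a load change. The function name update_load_avg is illustrative, the load values are arbitrary units, and the sketch uses floating-point arithmetic, which may differ from the arithmetic used inside the governor.

```python
def update_load_avg(load_avg, load, k):
    """Apply one step of the moving average:
    load_avg = (load_avg * (2**k - 1) + load) / (2**k)
    """
    weight = 2 ** k
    return (load_avg * (weight - 1) + load) / weight


if __name__ == "__main__":
    # Start idle, then apply a constant load of 800 and watch the
    # average converge for two different values of k.
    for k in (1, 3):
        avg = 0.0
        history = []
        for _ in range(5):
            avg = update_load_avg(avg, 800, k)
            history.append(round(avg, 1))
        print(f"k={k}: {history}")
```

With k = 0 the average simply tracks the latest sample, because the history term is multiplied by zero.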

For maximum performance, you can disable dynamic scaling by switching to the performance governor:

$ echo performance > /sys/class/devfreq/gpu-gpc-0/governor
$ echo performance > /sys/class/devfreq/gpu-nvd-0/governor

GPU Management Tool#

Use the following command to monitor the frequency and utilization of the GPU engines:

$ nvidia-smi dmon

The output contains the following information:

| Column | Description |
| --- | --- |
| gpu Idx | Index of GPU instance. |
| pwr | Not supported for Jetson. |
| gtemp | Not supported for Jetson. |
| mtemp | Not supported for Jetson. |
| sm | Utilization of GPU GPC engine. |
| mem | Not supported for Jetson. |
| enc | Utilization of GPU Encoder engine. |
| dec | Utilization of GPU Decoder engine. |
| jpg | Utilization of GPU JPG engine. |
| ofa | Utilization of GPU OFA engine. |
| mclk MHz | Not supported for Jetson. |
| pclk MHz | Not supported for Jetson. |

GPU Persistence Mode#

The NVIDIA kernel mode driver must be running and connected to a target GPU device before any user interactions with that device can occur. After the driver is loaded, GPU initialization is triggered only when a GPU client attempts to access the GPU. When all GPU clients terminate, the driver deactivates the GPU.

Driver initialization behavior is important for end users in two ways:

  • Application start latency: Applications that trigger GPU initialization might incur a startup cost per GPU, which can be avoided if the GPU is already initialized.

  • Preservation of driver state: If the driver deactivates a GPU, some non-persistent state associated with that GPU is lost and reverts to defaults when the GPU is initialized next time.

Under Linux systems where X runs by default on the target GPU, the kernel mode driver is generally initialized and kept alive from machine startup to shutdown, courtesy of the X process. On headless systems or situations where no long-lived X-like client maintains a handle to the target GPU, the kernel mode driver initializes and deactivates the target GPU each time a target GPU application starts and stops. Because keeping the GPU initialized is often desirable in these cases, NVIDIA provides the Persistence Mode (Legacy) option to change the driver behavior.

Persistence Mode is the term for a user-settable driver property that keeps a target GPU initialized even when no clients are connected to it.

Persistence mode can be set using nvidia-smi.

To enable persistence mode using nvidia-smi (as root):

$ nvidia-smi -i <target gpu> -pm ENABLED
Enabled Legacy persistence mode for GPU <target gpu>
All done.

To disable persistence mode using nvidia-smi (as root):

$ nvidia-smi -i <target gpu> -pm DISABLED
Disabled persistence mode for GPU <target gpu>
All done.

To view current persistence mode using nvidia-smi:

$ nvidia-smi -i <target gpu> -q
==============NVSMI LOG==============

Timestamp                           : ----
Driver Version                      : ----

Attached GPUs                       : ----
GPU 0000:01:00.0
    Product Name                    : ----
    Display Mode                    : ----
    Display Active                  : ----
    Persistence Mode                : Enabled
    Accounting Mode                 : ----
    ...

To keep the GPU powered and avoid additional wake-up latency, you might also want to disable the GPU Rail-Gating feature, which otherwise powers off the whole GPU when it is idle. For more information, refer to GPU Low Power.

Memory Power Management#

NVIDIA SoC chipsets include power saving features whose operation is largely invisible to software at runtime. Most of those features are statically enabled at boot, according to settings in the boot configuration table (BCT).

Additionally, the BSP implements dynamic voltage and frequency scaling (DVFS) for the memory controller (EMC/MC) and DRAM to save power. The EMC BCT and DVFS table are specific to the board design. The EMC DVFS table is included in the platform BPMP device tree file.

EMC Dynamic Frequency Scaling#

EMC dynamic frequency scaling (DFS) is realized in the BPMP firmware, and BPMP firmware considers two factors to determine the EMC frequency:

  • Memory bandwidth QoS requests from clients

  • Memory bandwidth utilization

The following sections describe these two factors in detail.

Memory Bandwidth QoS Requests From Clients#

To influence EMC frequency selection, clients like the GPU, PCIe, and display drivers can actively send two types of QoS requests to the BPMP firmware:

  • Average bandwidth request

  • Peak bandwidth request

For average bandwidth requests, BPMP firmware combines the requested bandwidth from clients and selects the EMC frequency point that can provide that much memory bandwidth.

For peak bandwidth requests, BPMP firmware chooses the greatest peak bandwidth request from all requested peak bandwidths and selects the EMC frequency point that can provide that much memory bandwidth.

The fundamental difference between these two types of requests is that the average bandwidth request considers the overall bandwidth requirements from all clients, but the peak bandwidth request considers only the maximum requested bandwidth at any point in time. The former is typically used for bandwidth-oriented devices, and the latter is typically used for latency-oriented devices.

To inspect the requests made by individual clients, examine the following debugfs node:

$ cat /sys/kernel/debug/bpmp/debug/bwmgr/bw_request_status

The output of the preceding node is similar to the following:

ID   |Niso BW     |Iso BW      |    Floor BW
0    |0           |0           |42598400
1    |0           |0           |0
2    |0           |0           |0
3    |0           |0           |0
4    |0           |0           |0
5    |0           |0           |0
6    |0           |0           |0
7    |0           |0           |0
8    |0           |0           |0
9    |0           |0           |0
10   |0           |0           |4723200
11   |0           |0           |0
12   |0           |0           |0
13   |0           |0           |0
14   |0           |0           |0
15   |161280000   |0           |0
16   |0           |0           |0
17   |0           |0           |0
18   |0           |0           |0
19   |0           |0           |0
20   |0           |0           |0
21   |0           |0           |0
22   |0           |0           |0
23   |0           |0           |0
24   |0           |0           |0
25   |173260800   |0           |0
26   |0           |0           |0
27   |0           |0           |0
28   |0           |0           |0
29   |0           |0           |0
30   |0           |0           |0
31   |0           |0           |0
32   |0           |0           |0
33   |0           |0           |0
34   |0           |0           |0
35   |0           |0           |0
36   |0           |0           |0
37   |0           |0           |0
38   |0           |0           |0
39   |0           |0           |0
40   |0           |0           |0
41   |0           |0           |0
42   |0           |0           |0
          Iso Rate Min (KHz) : 0
        Total Rate Min (KHz) : 4266000
         LA Rate Floor (KHz) : 0
  ISO Client Only Rate (KHz) : 0
         Max Floor BW (Kbps) : 42598400

The mapping of the ID column is as follows:

| Client | ID |
| --- | --- |
| TEGRA264_BWMGR_ICC_PRIMARY | 1U |
| TEGRA264_BWMGR_DEBUG | 2U |
| TEGRA264_BWMGR_CPU_CLUSTER0 | 3U |
| TEGRA264_BWMGR_CPU_CLUSTER1 | 4U |
| TEGRA264_BWMGR_CPU_CLUSTER2 | 5U |
| TEGRA264_BWMGR_CPU_CLUSTER3 | 6U |
| TEGRA264_BWMGR_CPU_CLUSTER4 | 7U |
| TEGRA264_BWMGR_CPU_CLUSTER5 | 8U |
| TEGRA264_BWMGR_CPU_CLUSTER6 | 9U |
| TEGRA264_BWMGR_CACTMON | 10U |
| TEGRA264_BWMGR_DISPLAY | 11U |
| TEGRA264_BWMGR_VI | 12U |
| TEGRA264_BWMGR_APE | 13U |
| TEGRA264_BWMGR_VIFAL | 14U |
| TEGRA264_BWMGR_GPU | 15U |
| TEGRA264_BWMGR_EQOS | 16U |
| TEGRA264_BWMGR_PCIE_0 | 17U |
| TEGRA264_BWMGR_PCIE_1 | 18U |
| TEGRA264_BWMGR_PCIE_2 | 19U |
| TEGRA264_BWMGR_PCIE_3 | 20U |
| TEGRA264_BWMGR_PCIE_4 | 21U |
| TEGRA264_BWMGR_PCIE_5 | 22U |
| TEGRA264_BWMGR_SDMMC_1 | 23U |
| TEGRA264_BWMGR_SDMMC_2 | 24U |
| TEGRA264_BWMGR_NVDEC | 25U |
| TEGRA264_BWMGR_NVENC | 26U |
| TEGRA264_BWMGR_NVJPG_0 | 27U |
| TEGRA264_BWMGR_NVJPG_1 | 28U |
| TEGRA264_BWMGR_OFAA | 29U |
| TEGRA264_BWMGR_XUSB_HOST | 30U |
| TEGRA264_BWMGR_XUSB_DEV | 31U |
| TEGRA264_BWMGR_TSEC | 32U |
| TEGRA264_BWMGR_VIC | 33U |
| TEGRA264_BWMGR_APEDMA | 34U |
| TEGRA264_BWMGR_SE | 35U |
| TEGRA264_BWMGR_ISP | 36U |
| TEGRA264_BWMGR_HDA | 37U |
| TEGRA264_BWMGR_VI2FAL | 38U |
| TEGRA264_BWMGR_VI2 | 39U |
| TEGRA264_BWMGR_RCE | 40U |
| TEGRA264_BWMGR_PVA | 41U |
| TEGRA264_BWMGR_NVPMODEL | 42U |

Given the type of each client, its average bandwidth request goes to either the Niso BW or the Iso BW column, and its peak bandwidth request goes to the Floor BW column.
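To illustrate how the two request types are aggregated, the following Python sketch parses bw_request_status output, totals the Niso BW and Iso BW columns, and takes the maximum of the Floor BW column. It assumes that "combines" means summation for average bandwidth requests, which is an interpretation of the description above rather than a documented formula. The function name summarize_bw_requests is hypothetical, and the sample contains only a few rows of the example output.

```python
def summarize_bw_requests(text):
    """Aggregate the per-client rows of bw_request_status output.

    Average requests (Niso BW, Iso BW) are summed; peak requests
    (Floor BW) are reduced to their maximum.
    """
    niso_total = iso_total = floor_max = 0
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Data rows have four cells and start with a numeric client ID.
        if len(cells) != 4 or not cells[0].isdigit():
            continue
        niso, iso, floor = (int(cell) for cell in cells[1:])
        niso_total += niso
        iso_total += iso
        floor_max = max(floor_max, floor)
    return {"niso_total": niso_total, "iso_total": iso_total,
            "floor_max": floor_max}


SAMPLE = """\
ID   |Niso BW     |Iso BW      |    Floor BW
0    |0           |0           |42598400
10   |0           |0           |4723200
15   |161280000   |0           |0
25   |173260800   |0           |0
         Max Floor BW (Kbps) : 42598400
"""

if __name__ == "__main__":
    print(summarize_bw_requests(SAMPLE))
```

For the sample rows, the computed maximum floor matches the Max Floor BW value reported at the end of the example output.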

Memory Bandwidth Utilization#

Information about system memory bandwidth utilization is provided by the central actmon hardware. BPMP firmware considers the current memory bandwidth utilization to determine whether the EMC frequency should be adjusted by scaling up or down. This process is referred to as central actmon DFS.

Internally, the BPMP compares the current memory bandwidth utilization against the boost-up threshold and the boost-down threshold. These threshold values are configured in the BPMP device tree file:

bwmgr {
        enabled = <0x1>;

        cactmon {
                enabled = <0x1>;

                mc_all {
                        /* Other properties omitted */
                        boost_up_threshold = <0x1e>;    /* 30% */
                        boost_down_threshold = <0x14>;  /* 20% */
                        /* Other properties omitted */
                };
        };
};

If current memory bandwidth usage exceeds 30%, the BPMP will scale up the EMC frequency, and if the memory bandwidth usage is lower than 20%, the BPMP will scale down the EMC frequency.
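The two thresholds form a hysteresis band: between 20% and 30% utilization, the EMC frequency stays where it is. The following Python sketch models only that decision. The function name cactmon_decision and its return values are hypothetical, and the real logic runs inside the BPMP firmware, which may apply further conditions.

```python
def cactmon_decision(utilization_pct,
                     boost_up_threshold=0x1e,
                     boost_down_threshold=0x14):
    """Return the scaling decision for a memory bandwidth utilization
    value in percent, using the thresholds from the device tree
    (0x1e = 30%, 0x14 = 20%)."""
    if utilization_pct > boost_up_threshold:
        return "scale_up"
    if utilization_pct < boost_down_threshold:
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    for utilization in (10, 25, 35):
        print(utilization, cactmon_decision(utilization))
```

Keeping a gap between the two thresholds prevents the frequency from oscillating when the utilization hovers around a single value.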

To disable central actmon DFS on the BPMP side, set the enabled property of the cactmon node to 0:

bwmgr {
        enabled = <0x1>;

        cactmon {
                enabled = <0x0>;

                /* Other properties omitted */
        };
};

To completely disable EMC DFS, including central actmon DFS and ICC memory bandwidth management, write 1 to the bwmgr_halt node:

$ echo 1 > /sys/kernel/debug/bpmp/debug/bwmgr/bwmgr_halt

MSS Profile Manager for EMC Clock Mapping#

The EMC clock affects the throughput and latency of the path from memory controller to DRAM module. To adjust the speed of EMC so that the SoC can saturate the available memory bandwidth, a clock mapping table is maintained inside the BPMP device-tree blob file for the following three clocks in the MSS and SOC domains:

  • UCF_SCF clock

  • UCF_MCF clock

  • SMMU clock

To check the clock mapping table, first get the current profile ID from the following node:

$ cat /sys/kernel/debug/bpmp/debug/mssprofiles/current_profile

Note

The default profile ID is 1.

After you obtain the profile ID, you can go inside the profile directory to check the clock mapping table:

$ cd /sys/kernel/debug/bpmp/debug/mssprofiles/profile_<profile_id>

Then you can check the clock mapping table for all three clocks with the following command:

$ grep "" *_setpoint_*

The preceding command provides output similar to the following:

smmu_setpoint_1:400000000
smmu_setpoint_2:900000000
smmu_setpoint_3:900000000
smmu_setpoint_4:900000000
smmu_setpoint_5:900000000
smmu_setpoint_6:900000000
smmu_setpoint_7:900000000
smmu_setpoint_8:900000000
ucf_slc_setpoint_1:864000000
ucf_slc_setpoint_2:1845000000
ucf_slc_setpoint_3:1845000000
ucf_slc_setpoint_4:1845000000
ucf_slc_setpoint_5:1845000000
ucf_slc_setpoint_6:1845000000
ucf_slc_setpoint_7:1845000000
ucf_slc_setpoint_8:1845000000
ucf_soc_setpoint_1:765000000
ucf_soc_setpoint_2:1971000000
ucf_soc_setpoint_3:1971000000
ucf_soc_setpoint_4:1971000000
ucf_soc_setpoint_5:1971000000
ucf_soc_setpoint_6:1971000000
ucf_soc_setpoint_7:1971000000
ucf_soc_setpoint_8:1971000000

where:

  • ucf_slc_setpoint_x: stands for the UCF_SCF clock

  • ucf_soc_setpoint_x: stands for the UCF_MCF clock

  • smmu_setpoint_x: stands for the SMMU clock

  • x: the index of the clock setpoint.

Four EMC frequency points are available on Jetson Thor. The mapping of EMC frequency to the setpoint index is as follows:

EMC Rate (kHz)    Setpoint ID
665600            1
2750000           6
3200000           7
4266000           8

For example, if EMC is running at 4,266,000 kHz, the setpoint ID is 8. Therefore, the UCF_SCF, UCF_MCF, and SMMU clocks are scaled to setpoint 8:

smmu_setpoint_8:900000000
ucf_slc_setpoint_8:1845000000
ucf_soc_setpoint_8:1971000000
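The lookup described above can be sketched in Python as follows. This is a minimal illustration, not an NVIDIA tool: `parse_setpoints` and `clocks_for_emc_rate` are hypothetical helper names, and the input is assumed to be the text printed by the grep command shown earlier.

```python
# EMC rate (kHz) -> setpoint ID, copied from the mapping table in this section.
EMC_RATE_TO_SETPOINT = {665600: 1, 2750000: 6, 3200000: 7, 4266000: 8}


def parse_setpoints(grep_output):
    """Parse lines such as 'smmu_setpoint_8:900000000' into a dict."""
    setpoints = {}
    for line in grep_output.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep:
            setpoints[name] = int(value)
    return setpoints


def clocks_for_emc_rate(emc_rate_khz, setpoints):
    """Return the three mapped clock rates (Hz) for a given EMC rate."""
    setpoint_id = EMC_RATE_TO_SETPOINT[emc_rate_khz]
    return {
        prefix: setpoints[f"{prefix}_setpoint_{setpoint_id}"]
        for prefix in ("smmu", "ucf_slc", "ucf_soc")
    }


if __name__ == "__main__":
    sample = (
        "smmu_setpoint_8:900000000\n"
        "ucf_slc_setpoint_8:1845000000\n"
        "ucf_soc_setpoint_8:1971000000\n"
    )
    print(clocks_for_emc_rate(4266000, parse_setpoints(sample)))
```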

To update the clock mapping table, you can update the mssprofmgr node within the BPMP device-tree blob file:

mssprofmgr {
    default-profile = <1>;
    profile@1 {
        profile-id = <1>;
        ucf-soc-mapping =  /bits/ 64 <
            765000000
            1971000000
            1971000000
            1971000000
            1971000000
            1971000000
            1971000000
            1971000000
        >;
        ucf-slc-mapping =  /bits/ 64 <
            864000000
            1845000000
            1845000000
            1845000000
            1845000000
            1845000000
            1845000000
            1845000000
        >;
        smmu-mapping =  /bits/ 64 <
            400000000
            900000000
            900000000
            900000000
            900000000
            900000000
            900000000
            900000000
        >;
    };
};

Multimedia Engine Power Management#

The Video Image Compositor (VIC) is used for common video-processing tasks like frame scaling, frame rotation, and pixel color space conversion.

The VIC has its own activity monitor (actmon) hardware used for monitoring its load, and this load information is used for load-based dynamic frequency scaling.

The VIC clock domain can be controlled through its sysfs interface:

  • /sys/class/devfreq/8188050000.vic

By default, VIC uses dynamic frequency scaling with the tegra_wmark governor, which automatically adjusts clock speeds based on VIC load. The governor’s behavior can be customized through parameters in the tegra_wmark directory:

$ cd /sys/class/devfreq/8188050000.vic/tegra_wmark

The following table lists the configurable parameters for the tegra_wmark governor:

Parameter            Description
load_target          Target load that the governor maintains for the engine.
up_wmark_margin      Load margin associated with the load_target parameter. When the underlying load exceeds load_target + up_wmark_margin, the governor scales up the clock rate to reduce the load.
down_wmark_margin    Load margin associated with the load_target parameter. When the underlying load drops below load_target - down_wmark_margin, the governor scales down the clock rate to increase the load.
up_freq_margin       Number of frequency steps for each up-scaling operation.
down_freq_margin     Number of frequency steps for each down-scaling operation.
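To make the interaction of these parameters concrete, here is a minimal Python sketch of the watermark logic as the descriptions above state it. It is not the kernel governor's code. The function names are hypothetical, all load values are assumed to share one unit, and clamping to the ends of the frequency table is an assumption.

```python
def wmark_decision(load, load_target, up_wmark_margin, down_wmark_margin):
    """Compare the load against the two watermarks around load_target."""
    if load > load_target + up_wmark_margin:
        return "scale_up"
    if load < load_target - down_wmark_margin:
        return "scale_down"
    return "hold"


def next_freq_index(index, decision, up_freq_margin, down_freq_margin, num_freqs):
    """Move up or down by the configured number of frequency steps.

    Assumption: the index is clamped to the available frequency table.
    """
    if decision == "scale_up":
        index += up_freq_margin
    elif decision == "scale_down":
        index -= down_freq_margin
    return max(0, min(index, num_freqs - 1))


if __name__ == "__main__":
    decision = wmark_decision(load=90, load_target=70,
                              up_wmark_margin=10, down_wmark_margin=10)
    print(decision, next_freq_index(2, decision, 1, 1, num_freqs=5))
```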

The tegra_wmark governor relies on the actmon hardware to periodically sample the load of the VIC engine. If needed, actmon triggers a DVFS cycle via interrupt.

To check the actmon settings for the VIC engine, go to the debugfs directory under tegra-host1x.0:

$ cd /sys/kernel/debug/tegra-host1x.0/actmon/vic

To check or update the sampling period (in microseconds) for the VIC engine, read from or write to the following node:

$ cat /sys/kernel/debug/tegra-host1x.0/actmon/vic/sample_period
$ echo <sample_period> > /sys/kernel/debug/tegra-host1x.0/actmon/vic/sample_period

To check the current load of the VIC engine, read from the following node:

$ cat /sys/kernel/debug/tegra-host1x.0/actmon/vic/module0/usage

In the background, the actmon hardware itself calculates the load as a moving average. The formula of the load calculation is as follows:

load_avg = (load_avg * (2**(k+1) - 1) + load) / (2**(k+1))

The value of load_avg is the average load value, and load is the value sampled by the actmon in the current cycle. The value k is a programmable variable that you can check or update through the following node:

$ cat /sys/kernel/debug/tegra-host1x.0/actmon/vic/module0/k
$ echo <k> > /sys/kernel/debug/tegra-host1x.0/actmon/vic/module0/k
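The effect of k can be explored with a short Python sketch. It reads the formula above as an exponential moving average with weight 2**(k+1). Floating-point arithmetic is used for clarity; the hardware's exact arithmetic is not described in this topic, and `actmon_update` is a hypothetical name.

```python
def actmon_update(load_avg, load, k):
    """One moving-average step: larger k gives a slower, smoother average."""
    weight = 2 ** (k + 1)
    return (load_avg * (weight - 1) + load) / weight


if __name__ == "__main__":
    # Feed a constant load of 100 into an average that starts at 0.
    for k in (0, 2):
        avg = 0.0
        history = []
        for _ in range(4):
            avg = actmon_update(avg, 100, k)
            history.append(round(avg, 2))
        print(f"k={k}: {history}")
```

A smaller k makes the average track the sampled load quickly, and a larger k filters out short bursts at the cost of a slower response.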

For maximum performance, you can disable dynamic scaling by switching to the performance governor:

$ echo performance > /sys/class/devfreq/8188050000.vic/governor

USB Power Management#

Autosuspend#

In Linux, the kernel can suspend (that is, change the link state to L2 or U3) an idle device by software. This action is called autosuspend.

The attributes to control autosuspend are under /sys/bus/usb/devices/<device-id>/power/:

  • wakeup

    This file is empty if the device does not support remote wakeup. Otherwise the file contains either enabled or disabled, indicating whether remote wakeup is enabled. You can write those words to the file.

  • control

    This file contains either on or auto. You can write those values to the file to change the device’s setting. Set to on to disallow autosuspend. In other words, the device won’t be placed in a low-power state by software. Set to auto to allow the kernel to autosuspend and autoresume the device.

  • autosuspend_delay_ms

    This file contains an integer value, which is the number of milliseconds the device remains idle before the kernel can autosuspend it. The default is 2000. Set to 0 to autosuspend as soon as the device becomes idle. Negative values mean never to autosuspend. You can write a number to the file to change the autosuspend idle-delay time.

Caution

Many devices do not fully support USB link power management. For this reason, by default the kernel disables autosuspend (the power/control attribute is initialized to on) for all devices other than hubs.

Also note that the kernel does not prevent you from enabling autosuspend on devices that can’t handle it. In theory, you could damage a device by suspending it at the wrong time. (Highly unlikely, but possible.)

We recommend that you manually enable autosuspend and check for issues. If autosuspend works for a device, you can add udev rules. You can also change the idle-delay time because two seconds might not be the best choice for every device.
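For manual testing, the two attribute writes can be wrapped in a small Python helper. This is a sketch under stated assumptions: `enable_autosuspend` is a hypothetical name, the path is the power directory described above, and the script needs sufficient privileges to write sysfs attributes.

```python
from pathlib import Path


def enable_autosuspend(power_dir, delay_ms=2000):
    """Write the idle delay, then allow autosuspend; return the control value."""
    power = Path(power_dir)
    (power / "autosuspend_delay_ms").write_text(f"{delay_ms}\n")
    (power / "control").write_text("auto\n")
    return (power / "control").read_text().strip()


if __name__ == "__main__":
    import sys

    # Example argument: /sys/bus/usb/devices/<device-id>/power
    if len(sys.argv) > 1:
        print(enable_autosuspend(sys.argv[1]))
```

Writing on to the control attribute reverts the device to the default behavior if problems appear.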

PCIe Power Management#

Active State Power Management#

Active State Power Management (ASPM) is a power-management mechanism for PCIe devices to garner power savings while otherwise in a fully active state. Predominantly, this is achieved through active-state link power management; that is, the PCI Express serial link is powered down when there is no traffic across it.

Although ASPM reduces power consumption, it can also increase latency because the serial bus must be “awakened” from low-power mode, possibly reconfigured, and the host-to-device link re-established. This delay is known as ASPM exit latency, and it can be noticeable to the end user. As a result, understanding the trade-off between performance and power for this mechanism can be critical.

When ASPM is enabled, the hardware can manage link state automatically without communication between host and device. Two low-power modes can be entered by ASPM: L0s and L1. L0s sets low-power mode for one direction of the serial link only, usually downstream of the PHY controller. L1 shuts off the PCI Express link completely, including the reference clock signal, until a dedicated signal (CLKREQ#) is asserted, and results in greater power reductions, although with the penalty of greater exit latency. L1 also has many substates to improve the granularity of the power/performance trade-off.

Configure the PCIe Controller#

Before you can enable ASPM on Jetson Thor for an endpoint device, you might need to configure the ASPM capability of the PCIe controller. Determine which controller corresponds to the device for ASPM and then add the aspm-capability attribute to ODMDATA before flashing. The following example applies to PCIe controller C2:

ODMDATA="pcie@2_status=okay_aspm-capability=14";

ASPM Control on Linux#

After you add ASPM support on the PCIe controller, the Linux kernel determines ASPM settings according to the parameter policy of the pcie_aspm driver. Read the /sys/module/pcie_aspm/parameters/policy node to determine the valid values to set on this parameter.

$ cat /sys/module/pcie_aspm/parameters/policy
[default] performance powersave powersupersave

On Jetson devices, ASPM has four policies to determine how the kernel controls ASPM.

  • default

    ASPM enablement is set according to the defaults specified by the bootloader or firmware on the system.

  • performance

    ASPM is disabled for all devices. This policy prevents devices from entering a low-power state, which enhances performance.

  • powersave

    Enable ASPM states L0s and L1 for power saving.

  • powersupersave

    Enable ASPM states L0s and L1 and the L1 substates for power saving.

To change the policy at runtime, write the policy string to /sys/module/pcie_aspm/parameters/policy. If you want the policy to apply to every system boot, prepend pcie_aspm.policy= to the policy string (for example, pcie_aspm.policy=powersupersave) and add it to bootargs.
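When scripting against the policy node, note that the active policy is the entry in square brackets. The following Python sketch parses the node's text as shown in the sample output above; `parse_aspm_policy` is a hypothetical helper name.

```python
def parse_aspm_policy(text):
    """Return (active_policy, all_policies) from the policy node's contents."""
    tokens = text.split()
    active = next(t[1:-1] for t in tokens if t.startswith("[") and t.endswith("]"))
    available = [t.strip("[]") for t in tokens]
    return active, available


if __name__ == "__main__":
    try:
        with open("/sys/module/pcie_aspm/parameters/policy") as node:
            print(parse_aspm_policy(node.read()))
    except OSError as err:
        # The node exists only on systems with the pcie_aspm driver.
        print(f"policy node not readable: {err}")
```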

You can check the ASPM settings by using the lspci utility. The enablement of the L0s and L1 states can be checked from the LnkCtl field, and the L1 substates can be checked from the L1SubCtl1 field.

Exit Latency Restriction for ASPM#

In some cases, ASPM still gets disabled even if you configure the PCIe controller to support ASPM and set the powersupersave policy. One reason might be low-power state exit latency. In short, even with the powersupersave mode, the Linux pcie_aspm driver might prevent the device from enabling ASPM because of the high exit latency.

For example, if the PCIe controller on a Jetson Thor device is configured at Gen5 speed, high exit latency can prevent the PCIe endpoints that connect to the controller from enabling ASPM.

As a result, we strongly recommend that you set max-link-speed according to the maximum speed of the endpoint device. For example, if we know that the PCIe controller C2 is always connected to a Gen3 network interface card, we can configure the controller as follows:

ODMDATA="pcie@2_status=okay_aspm-capability=14_max-link-speed=3";

Runtime Power Management#

PCIe also supports software-based control for link power management. In the Linux kernel, this is implemented through the runtime power management framework.

The interface to control PCIe runtime power management is under /sys/bus/pci/devices/<device-id>/power/: the control attribute. This file contains either on or auto. You can write those values to the file to change the device’s setting. Set to on to disallow autosuspend. In other words, the device won’t be placed in a low-power state by software. Set to auto to allow the kernel to autosuspend and autoresume the device.

The support for runtime power management of a PCIe device can be missing. In that case, setting control to auto has no effect.

UFS Power Management#

Hibernate is the deepest low-power link state for UFS. When no transmission occurs between the host and the UFS device, placing the link in this state yields the greatest power saving.

Auto-Hibernation#

Auto-hibernation is a feature that enables a UFS host controller to send the link to the hibernation state automatically.

For Jetson Thor with Linux, the interface to control auto-hibernation is under /sys/bus/platform/devices/a80b8d0000.ufshci/: the auto_hibern8 attribute. This file contains the auto-hibernate idle timer setting of a UFS host controller. A value of zero means that auto-hibernate is not enabled. A positive value specifies the number of microseconds of idle time before the UFS host controller autonomously puts the link into hibernate state. That state saves power at the expense of increased latency.

Software Hibernation#

Instead of controlling the link state by hardware, it is also possible to enter hibernation through software. On Linux, this is achieved by the runtime power management framework.

You must enable runtime power management for every SCSI LUN and the UFSHCD to enable the path for software hibernation.

# echo auto > /sys/bus/platform/devices/a80b8d0000.ufshci/power/control
# tee /sys/bus/scsi/devices/host0/target0\:0\:0/0\:0\:0\:*/power/autosuspend_delay_ms <<< 2000
# tee /sys/bus/scsi/devices/host0/target0\:0\:0/0\:0\:0\:*/power/control <<< auto

To verify whether software hibernation occurs, first ensure that the UFS device is idle. Then read /sys/bus/platform/devices/a80b8d0000.ufshci/power/runtime_status. If the device entered hibernation successfully, the value of the node is suspended.

NVMe Power Management#

Power States#

Power states represent various working states for an NVMe device. Each state corresponds to a specific maximum power consumption, entry and exit latency, read/write latency, and other characteristics. The controller can support up to 32 power states, PS0 through PS31. The larger the number, the lower the power consumption.

You can obtain the controller data structure through the NVMe Identify command. In that structure, the power state descriptor data structure describes the characteristics of each power state that is supported by the controller. For example, information for nvme0n1 is similar to the following:

# sudo nvme id-ctrl /dev/nvme0n1 | grep "ps      0"  -A 2
ps      0 : mp:5.60W operational enlat:0 exlat:0 rrt:0 rrl:0
        rwt:0 rwl:0 idle_power:0.2200W active_power:5.60W
        active_power_workload:80K 128KiB SW

For more information, see “Identify – Power State Descriptor Data Structure” in the NVM Express Base Specification.

Autonomous Power State Transitions#

Some controllers support Autonomous Power State Transition (APST), which allows the controller to automatically switch power states to control temperature (thermal management) or power consumption.

By default, APST is enabled on Linux if it is supported.

# nvme get-feature /dev/nvme0n1 -f 0x0c -H | head -n 2
get-feature:0x0c (Autonomous Power State Transition), Current value:0x00000001
        Autonomous Power State Transition Enable (APSTE): Enabled

If you are looking for better performance, you can sacrifice power by setting nvme_core.default_ps_max_latency_us in bootargs. The value specifies the acceptable latency for power states to be enabled by the kernel. For example, set nvme_core.default_ps_max_latency_us=0 to keep the device in PS0, preventing any autonomous power state switch.

Thermal Throttle Management#

Some NVMe controllers support a mechanism for temperature management, which you can use as a basis for the controller to automatically switch between power states, or to provide temperature-management functions to the host to meet specific requirements.

For example, if the NVMe is used in a desktop or server with good heat dissipation, you might raise the upper limit of the temperature to maintain operating performance. If it is used in a laptop or mobile device, though, you need to keep the NVMe as cool as possible to protect the battery. You can use the NVMe thermal throttle interface to meet a range of usage requirements and user expectations.

From the Identify command, you can find the mntmt and mxtmt fields, which indicate the minimum and maximum temperatures (in kelvins) that the host can set as thermal management thresholds.

# nvme id-ctrl /dev/nvme0n1 | grep tmt
mntmt     : 273
mxtmt     : 360
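Because these fields are reported in kelvins, converting them to degrees Celsius makes the limits easier to compare with other thermal data. The following Python sketch parses output in the format shown above and converts it; both function names are hypothetical.

```python
def parse_tmt_fields(id_ctrl_output):
    """Extract mntmt and mxtmt (kelvins) from 'nvme id-ctrl' style output."""
    fields = {}
    for line in id_ctrl_output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in ("mntmt", "mxtmt"):
            fields[name.strip()] = int(value)
    return fields


def kelvin_to_celsius(kelvin):
    """Convert kelvins to degrees Celsius, rounded to two decimals."""
    return round(kelvin - 273.15, 2)


if __name__ == "__main__":
    sample = "mntmt     : 273\nmxtmt     : 360\n"
    for name, kelvin in parse_tmt_fields(sample).items():
        print(name, kelvin, "K =", kelvin_to_celsius(kelvin), "C")
```

For the sample values, 273 K is about -0.15°C and 360 K is about 86.85°C.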

The host can specify Thermal Management Temperature 1 (bits 31:16) and Thermal Management Temperature 2 (bits 15:0) in the Set Features command with the feature identifier set to 10h. For the mntmt and mxtmt fields, a value of 0h indicates that the controller does not report the field or that the host-controlled thermal management feature is not supported.

# nvme set-feature /dev/nvme0 -f 0x10 -V value

For more information, see “8.1.17.5 Host Controlled Thermal Management” in the NVM Express Base Specification.

Ethernet Power Management#

Energy-Efficient Ethernet#

Energy-Efficient Ethernet (EEE) is a set of enhancements to reduce power consumption during periods of low data activity. The intention is to reduce power consumption by at least half while retaining full compatibility with existing equipment.

When the controlling software or firmware decides that no data needs to be sent, it can issue a low-power idle (LPI) request to the Ethernet controller physical layer. The PHY then sends LPI symbols for a specified time onto the link and then disables its transmitter. Turning off the unused circuit reduces power consumption. Refresh signals are sent periodically to maintain link signaling integrity. When there is data to transmit, a normal IDLE signal is sent for a predetermined period of time. The data link is considered to be always operational because the receive signal circuit remains active even when the transmit path is in sleep mode.

You can enable or disable EEE by using the ethtool utility.

# ethtool --set-eee eth0 eee [on|off]
EEE settings for eth0:
        EEE status: eth0 - inactive

Note that EEE can be active only if both sides of the link support it. For example, if you connect a Tegra device through its RJ45 port to a hub that doesn’t support EEE, EEE remains inactive.

You can check whether EEE is enabled and active by using ethtool --show-eee.

# ethtool --show-eee eth0
EEE settings for eth0:
        EEE status: eth0 - inactive

Display Power Management#

Display Power Management Signaling#

Display Power Management Signaling (DPMS) is a power-saving mechanism for video monitors. It was defined by the VESA consortium and specifies how the horizontal synchronization (H-Sync) and vertical synchronization (V-Sync) signals are used for power management. DPMS includes three levels of power saving (Standby, Suspend, and Off), and each can be configured with an inactivity time before the monitor enters the given level.

The following table provides a brief overview of the differences between the states.

State      H-Sync   V-Sync   Power Saving
On         On       On       None
Standby    Off      On       Minimal
Suspend    On       Off      Substantial
Off        Off      Off      Maximum

On Tegra, the X Window System provides the display service. Therefore, you can use the xset utility to check the display status or configure the display-related settings. The following command gives you the display status:

$ xset q

You can enable or disable DPMS by using the following commands:

# xset +dpms   # Enable DPMS
# xset -dpms   # Disable DPMS

To adjust the idle-time delay before the monitor enters a given power-saving state, use the following command. For example, xset dpms 10 20 30 means to enter “Standby” after 10 seconds of idle, “Suspend” after 20 seconds, and “Off” after 30 seconds. If you set a timeout to 0, the corresponding state is disabled. In other words, xset dpms 0 0 0 disables DPMS implicitly. This can be a more reliable way to “disable” DPMS, because the effect of -dpms can be reverted by other DPMS commands.

# xset dpms [standby [suspend [off]]]

You can also forcibly set the DPMS state instead of specifying the idle timeout to enter the state.

# xset dpms force on      # Turn on screen immediately
# xset dpms force off     # Turn off screen immediately
# xset dpms force standby # Standby screen
# xset dpms force suspend # Suspend screen

Supported Modes and Power Efficiency#

Jetson Thor is designed with a high efficiency Power Management Integrated Circuit (PMIC), voltage regulators, and power tree to optimize power efficiency. It supports multiple optimized power budgets, such as 10 watts, 15 watts, and 30 watts. For each power budget, several configurations are possible with various CPU frequencies and number of cores online.

Capping the memory, CPU, and GPU frequencies, and number of online CPU, GPU TPC, and PVA cores at a prequalified level confines the module to the target mode. Refer to the Thermal Design Guide, which you can find in the Jetson Download Center, for heavy workloads. The configurations predefined by NVIDIA are as follows.

The MAXN mode is an unconstrained power mode that allows a maximum number of cores and clock frequency for CPU, GPU, PVA, and SOC engines like VI, VIC, and so on. However, this mode does not guarantee the best performance for all use cases because hardware throttling is engaged when the total module power exceeds the TDP budget. Therefore, it is not the maximum performance mode. This is an experimental mode to tweak clock settings and create custom power modes that balance performance and power consumption. Refer to Power Estimator for more information about estimating the power and generating the nvpmodel configuration file for the custom power mode.

Because MAXN mode is an experimental setting for adjusting clock settings and creating custom power profiles, we don’t recommend running heavy workloads for prolonged periods in this mode.

NVP Model Clock Configuration for Jetson T5000

Property                                         Mode
                                                 MAXN    120W*
Power budget                                     n/a     120W
Mode ID                                          0       1
Online CPU                                       14      14
CPU maximum frequency (MHz)                      2601    2601
GPU TPC                                          10      10
GPU maximum frequency (MHz)                      1575    1386
NVDEC/NVENC/OFA/NVJPG maximum frequency (MHz)    1692    1557
PVA cores                                        1       1
PVA VPS maximum frequency (MHz)                  1215    1215
PVA AXI maximum frequency (MHz)                  909     909
Memory maximum frequency (MHz)                   4266    4266

All modes SOC clocks maximum frequency (MHz)

adsp: 800 display: 843 rce: 396 vi: 873
ape: 600 display_hub: 385.7 se: 855 vic: 1107
axi_cbb: 202.5 host1x: 202.5 smmu: 900
bpmp: 810 isp: 1215 sor: 833
dce: 396 mcf: 1503 tsec: 360
* The default mode is 120W (mode ID 1).

Power Mode Controls#

You can display and change the power mode with the nvpmodel command.

  • To change the power mode, enter the command:

    $ sudo /usr/sbin/nvpmodel -m <x>
    

    Where <x> is the power mode ID (for example, 0 or 1).

    Alternatively, use the nvpmodel GUI front end. For more information, see nvpmodel GUI, later in this topic.

    After you set a power mode, the module stays in that mode until you change it. The mode persists across power cycles and SC7.

Note

The GPU gpu_pg_mask value can be set only once, before the GPU golden context is created. If the nvpmodel power mode change requires a different gpu_pg_mask value, a system reboot is required.

  • Example:

    ubuntu@jetson:~$ sudo nvpmodel -m 0
    NVPM WARN: Golden image context is already created
    NVPM WARN: Reboot required for changing to this power mode: 0
    NVPM WARN: DO YOU WANT TO REBOOT NOW? enter YES/yes to confirm:
    

Type YES or yes to initiate reboot or press any other key to cancel. The settings will be in effect after the reboot.

  • To display the current power mode, enter the following command:

    $ sudo /usr/sbin/nvpmodel -q
    

    Alternatively, see the mode displayed to the right of the NVIDIA icon in the nvpmodel window’s menu bar. For more information, see nvpmodel GUI, later in this topic.

  • To add a custom power mode definition, edit this file:

    /etc/nvpmodel.conf
    

    This is an example entry for mode 1:

    < POWER_MODEL ID=1 NAME=120W >
    CPU_ONLINE CORE_0 1
    CPU_ONLINE CORE_1 1
    CPU_ONLINE CORE_2 1
    CPU_ONLINE CORE_3 1
    CPU_ONLINE CORE_4 1
    CPU_ONLINE CORE_5 1
    CPU_ONLINE CORE_6 1
    CPU_ONLINE CORE_7 1
    CPU_ONLINE CORE_8 1
    CPU_ONLINE CORE_9 1
    CPU_ONLINE CORE_10 1
    CPU_ONLINE CORE_11 1
    CPU_ONLINE CORE_12 1
    CPU_ONLINE CORE_13 1
    GPU_POWER_GATING GPU_PG_MASK 64
    GPU_POWER_CONTROL_ENABLE GPU_PWR_CNTL_EN on
    CPU_AE_0 MIN_FREQ 972000
    CPU_AE_0 MAX_FREQ 2601000
    CPU_AE_1 MIN_FREQ 972000
    CPU_AE_1 MAX_FREQ 2601000
    CPU_AE_2 MIN_FREQ 972000
    CPU_AE_2 MAX_FREQ 2601000
    CPU_AE_3 MIN_FREQ 972000
    CPU_AE_3 MAX_FREQ 2601000
    CPU_AE_4 MIN_FREQ 972000
    CPU_AE_4 MAX_FREQ 2601000
    CPU_AE_5 MIN_FREQ 972000
    CPU_AE_5 MAX_FREQ 2601000
    CPU_AE_6 MIN_FREQ 972000
    CPU_AE_6 MAX_FREQ 2601000
    CPU_AE_7 MIN_FREQ 972000
    CPU_AE_7 MAX_FREQ 2601000
    CPU_AE_8 MIN_FREQ 972000
    CPU_AE_8 MAX_FREQ 2601000
    CPU_AE_9 MIN_FREQ 972000
    CPU_AE_9 MAX_FREQ 2601000
    CPU_AE_10 MIN_FREQ 972000
    CPU_AE_10 MAX_FREQ 2601000
    CPU_AE_11 MIN_FREQ 972000
    CPU_AE_11 MAX_FREQ 2601000
    CPU_AE_12 MIN_FREQ 972000
    CPU_AE_12 MAX_FREQ 2601000
    CPU_AE_13 MIN_FREQ 972000
    CPU_AE_13 MAX_FREQ 2601000
    GPU MIN_FREQ 314000000
    GPU MAX_FREQ 1386000000
    VIDEO MIN_FREQ 314000000
    VIDEO MAX_FREQ 1557000000
    GPU_POWER_CONTROL_DISABLE GPU_PWR_CNTL_DIS auto
    EMC MAX_FREQ 4266000000
    PVA0_VPS MAX_FREQ 1215000000
    PVA0_AXI MAX_FREQ 909000000
    

    The unit of measure for CPU frequency is kilohertz. The unit for GPU, VIDEO, EMC, and PVA frequency is hertz. You must assign each custom mode a unique number in the ID field. Test your use case to determine:

    • How many active cores to use.

    • Frequency limits per engine.

    The frequencies you select are subject to the MAXN limit defined in mode 0.

  • To learn about other options, enter the command:

    $ /usr/sbin/nvpmodel -h
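Because nvpmodel.conf mixes kilohertz (CPU) and hertz (GPU, VIDEO, EMC, and PVA), a small normalizer helps when comparing a custom mode against the MHz values in the clock configuration table. The following Python sketch handles only MIN_FREQ and MAX_FREQ lines in the format of the example entry; the function names are hypothetical.

```python
def freq_mhz(component, value):
    """Convert an nvpmodel.conf frequency value to MHz.

    CPU entries are in kHz; GPU, VIDEO, EMC, and PVA entries are in Hz.
    """
    if component.startswith("CPU_"):
        return value / 1_000
    return value / 1_000_000


def parse_freq_line(line):
    """Return (component, key, MHz) for frequency lines, otherwise None."""
    parts = line.split()
    if len(parts) != 3 or parts[1] not in ("MIN_FREQ", "MAX_FREQ"):
        return None
    component, key, value = parts
    return component, key, freq_mhz(component, int(value))


if __name__ == "__main__":
    for entry in ("CPU_AE_0 MAX_FREQ 2601000",
                  "GPU MAX_FREQ 1386000000",
                  "CPU_ONLINE CORE_0 1"):
        print(parse_freq_line(entry))
```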
    

Fan Profile Control#

Jetson Thor supports a profile of fan operation named “cool”.

The userspace fan speed control daemon, nvfancontrol, manages fan speed based on the trip point temperatures configured for the selected profile.

Fan Profile Configuration#

Every fan speed step is associated with the trip point temperature and corresponding hysteresis. The following table shows the configurations predefined by NVIDIA.

Fan profile configuration for Jetson AGX Thor Developer Kit

Fan profile "cool"

Trip temperature*†    0      15     24     29     35     45     115
Hysteresis*           0      0      0      0      0      0      0
Fan PWM value         255    255    192    140    102    77     77
Fan RPM value         5371   5371   4170   2900   2300   1750   1750

* Trip temperature and hysteresis in degrees Celsius.
† Trip temperature is the TMARGIN temperature.

nvfancontrol#

nvfancontrol is a userspace fan speed control daemon. It manages the fan speed based on the temperature-to-fan-speed mapping table in the nvfancontrol configuration file.

Basic elements in the nvfancontrol service include TMARGIN, kickstart PWM, fan profile, fan control, and fan governor. All of these can be programmed via the configuration file based on the user’s preferences. The following sections explain each of them.

nvfancontrol.conf#

  • Location:

    /etc/nvfancontrol.conf
    
  • Sample nvfancontrol.conf file for Jetson Thor:

    POLLING_INTERVAL 2
    
    <FAN 1>
        TMARGIN ENABLED
        FAN_GOVERNOR pid {
                STEP_SIZE 10
        }
        FAN_GOVERNOR cont {
                STEP_SIZE 10
        }
        FAN_CONTROL close_loop {
                RPM_TOLERANCE 100
        }
        FAN_PROFILE cool {
                #TEMP   HYST    PWM     RPM
                0       0       255     5371
                15      0       255     5371
                24      0       192     4170
                29      0       140     2900
                35      0       102     2300
                45      0       77      1750
                115     0       77      1750
        }
        THERMAL_GROUP 0 {
                GROUP_MAX_TEMP 115
                #Thermal-Zone Coeffs Max-Temp
                cpu-thermal 25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
                gpu-thermal 25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
                soc012-thermal 25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
                soc345-thermal 25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
        }
        FAN_DEFAULT_CONTROL close_loop
        FAN_DEFAULT_PROFILE cool
        FAN_DEFAULT_GOVERNOR cont
        KICKSTART_PWM 51
    

Default Fan Profile#

For Jetson Thor devices, by default, the fan profile is set to “cool”. It is defined as FAN_DEFAULT_PROFILE in the configuration file /etc/nvfancontrol.conf.

Change the Default Fan Profile#

To change the fan profile, complete the following steps:

  1. Stop the nvfancontrol systemd service:

    sudo systemctl stop nvfancontrol
    
  2. Update the fan profile by editing /etc/nvfancontrol.conf

  3. Remove the status file:

    sudo rm /var/lib/nvfancontrol/status
    
  4. Start the nvfancontrol systemd service:

    sudo systemctl start nvfancontrol
    

Identifying the Current Fan Profile#

  • Enter the command:

    $ sudo nvfancontrol -q
    

Example:

$ sudo nvfancontrol -q
FAN1:FAN_PROFILE:cool
...
...

After you set a fan profile, the module stays in that profile until you change it. The profile persists across power cycles and SC7.

Fan Profile Table#

The fan profile table contains the mapping between the temperature and the fan speed. It also contains the hysteresis value for each step.

  • Syntax:

    FAN_PROFILE <fan_profile_name> {
            <temp>  <hyst>  <pwm>   <rpm>
    }
    
    Where:
    <fan_profile_name>: Fan Profile Name
    <temp>: Temperature step in degrees Celsius
    <hyst>: Hysteresis step
    <pwm>:  Fan PWM value
    <rpm>:  Fan RPM value
    
  • Example:

    FAN_PROFILE cool {
            #TEMP   HYST    PWM     RPM
            0       0       255     5371
            15      0       255     5371
            24      0       192     4170
            29      0       140     2900
            35      0       102     2300
            45      0       77      1750
            115     0       77      1750
    }
    

TMARGIN#

TMARGIN temperature is the difference between the maximum allowable temperature and the current thermal zone temperature. For example, if the maximum allowable temperature of cpu-thermal is 115°C and the current temperature of cpu-thermal is 45°C, the current TMARGIN temperature of cpu-thermal is 70°C (115 – 45).

Kickstart PWM#

The minimal required PWM value to start the fan from complete stop state is called kickstart PWM. The fan might not start spinning if the PWM value is less than kickstart PWM.

Thermal Group#

THERMAL_GROUP contains the group maximum temperature for calculating the TMARGIN temperature and the list of thermal zones considered for calculating the trip temperature.

  • Thermal group maximum temperature:

    GROUP_MAX_TEMP <temp_in_degrees_celsius>
    

    This parameter is used only when TMARGIN is enabled. The TMARGIN temperature is calculated as shown in the TMARGIN section.

  • Thermal zone name, coefficients, and the thermal zone maximum temperature:

    <thermal_zone_name> <coeff_0>,<coeff_1>,...,<coeff_19> <thermal_zone_max_temp>
    
    • <thermal_zone_name>: Thermal zone name

    • <coeff_0..coeff_19>: Coefficients used for calculating weighted average. Currently, only <coeff_0> is taken into consideration.

    • <thermal_zone_max_temp>: Thermal zone maximum temperature. This is used only when TMARGIN is enabled. If GROUP_MAX_TEMP is specified, this temperature is ignored.

The following example demonstrates how to calculate weighted average temperature with TMARGIN enabled. If the current cpu-thermal is 55°C, gpu-thermal is 75°C, soc012-thermal is 43°C, and soc345-thermal is 39°C, the weighted average TMARGIN temperature is (115 - 55) * 0.25 + (115 - 75) * 0.25 + (115 - 43) * 0.25 + (115 - 39) * 0.25 or 62°C when using the following thermal group:

THERMAL_GROUP 0 {
        GROUP_MAX_TEMP 115
        #Thermal-Zone Coeffs Max-Temp
        cpu-thermal 25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
        gpu-thermal 25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
        soc012-thermal 25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
        soc345-thermal 25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 0
}
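The worked example above can be reproduced with a few lines of Python. The sketch assumes, as the arithmetic in the example implies, that a coefficient of 25 is a weight of 0.25, and that only <coeff_0> is used. `weighted_tmargin` is a hypothetical name, not part of nvfancontrol.

```python
def weighted_tmargin(group_max_temp, zones):
    """Weighted average TMARGIN in degrees Celsius.

    zones is a list of (current_temp_c, coeff_0) pairs, where coeff_0 is
    treated as a percentage weight (assumption taken from the example).
    """
    return sum((group_max_temp - temp) * coeff / 100 for temp, coeff in zones)


if __name__ == "__main__":
    zones = [
        (55, 25),  # cpu-thermal
        (75, 25),  # gpu-thermal
        (43, 25),  # soc012-thermal
        (39, 25),  # soc345-thermal
    ]
    print(weighted_tmargin(115, zones))
```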

Fan Control#

The nvfancontrol service has two types of fan controls:

  • open-loop: The open-loop fan control adjusts the fan speed by setting the desired PWM value based on the current trip temperature step. The RPM values in the profile are ignored.

  • closed-loop: The closed-loop fan control makes the fan spin close to the desired RPM value based on the current trip temperature step. The PWM values in the profile are ignored.

    Forcing the fan to match the target RPM exactly requires constant speed adjustment, which costs performance and can shorten fan life. A programmable tolerance therefore specifies the acceptable difference between the target RPM and the current RPM. In the following example, a difference of up to 100 RPM is acceptable:

    FAN_CONTROL close_loop {
            RPM_TOLERANCE 100
    }
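As a rough illustration of how a tolerance band avoids constant adjustment, the following Python sketch decides whether a correction is needed. The function name is hypothetical, and treating a difference exactly equal to the tolerance as acceptable is an assumption. The actual nvfancontrol logic is not shown in this topic.

```python
def needs_adjustment(target_rpm, current_rpm, rpm_tolerance):
    """Return True when the measured RPM is outside the accepted band.

    Assumption: a difference equal to the tolerance is still acceptable.
    """
    return abs(target_rpm - current_rpm) > rpm_tolerance


print(needs_adjustment(2300, 2250, 100))  # False: within 100 RPM of the target
print(needs_adjustment(2300, 2150, 100))  # True: 150 RPM away from the target
```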
    

Fan Governor#

The fan governor decides the fan speed control logic based on the fan profile. The nvfancontrol service has two kinds of fan governors: pid and cont.

The examples in this section use the following profile, which enables TMARGIN and open-loop control:

TMARGIN ENABLED
FAN_PROFILE cool {
        #TEMP   HYST    PWM     RPM
        0       0       255     5371
        15      0       255     5371
        24      0       192     4170
        29      0       140     2900
        35      0       102     2300
        45      0       77      1750
        115     0       77      1750
}
  • pid: The pid governor changes the fan speed only when the weighted average temperature crosses the trip temperature step. The curve between the weighted average temperature and fan speed resembles a stair.

    For example, as the TMARGIN weighted average decreases through 70°C, the PWM is set to 77. The PWM stays at 77 while the TMARGIN weighted average decreases to 45°C. When the TMARGIN weighted average decreases to 44°C, the PWM is set to 102 and stays there until the next trip temperature step is crossed.

  • cont: The cont governor linearly interpolates the fan speed based on the upper and lower fan speeds between the trip temperature steps. Compared to the pid governor, the curve between weighted average temperature and fan speed is more continuous.

    For example, when the current TMARGIN weighted average is 32°C, the PWM will be set to 121 (140 + (32 - 29) * (102 - 140) / (35 - 29)).
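The two governor examples can be reproduced with the following Python sketch. It is only one reading of the behavior described above, not the nvfancontrol implementation: the function names are hypothetical, hysteresis is ignored (HYST is 0 in this profile), and the stair lookup rule (use the highest step at or below the current value) is inferred from the numbers in the pid example.

```python
from bisect import bisect_right

# (TMARGIN temperature, PWM) steps copied from the "cool" profile above.
COOL_PROFILE = [(0, 255), (15, 255), (24, 192), (29, 140),
                (35, 102), (45, 77), (115, 77)]


def pid_pwm(profile, temp):
    """Stair lookup: use the PWM of the highest step at or below temp."""
    temps = [t for t, _ in profile]
    index = max(bisect_right(temps, temp) - 1, 0)
    return profile[index][1]


def cont_pwm(profile, temp):
    """Linear interpolation between the two steps that bracket temp."""
    temps = [t for t, _ in profile]
    temp = min(max(temp, temps[0]), temps[-1])  # Clamp to the table range.
    low = max(bisect_right(temps, temp) - 1, 0)
    high = min(low + 1, len(profile) - 1)
    (t0, p0), (t1, p1) = profile[low], profile[high]
    if t1 == t0:
        return p0
    return p0 + (temp - t0) * (p1 - p0) / (t1 - t0)


print(pid_pwm(COOL_PROFILE, 70))   # 77
print(pid_pwm(COOL_PROFILE, 45))   # 77
print(pid_pwm(COOL_PROFILE, 44))   # 102
print(cont_pwm(COOL_PROFILE, 32))  # 121.0
```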

Hysteresis in nvfancontrol#

In nvfancontrol, hysteresis is used to define the temperature threshold for fan speed changes when using the pid governor.

Consider the following profile with TMARGIN enabled:

TMARGIN ENABLED
FAN_PROFILE cool {
        #TEMP   HYST    PWM     RPM
        0       0       255     2900
        18      9       255     2900
        30      11      202     2300
        45      11      149     1700
        60      14      88      1000
        115     0       0       0
}

The fan turns on when the TMARGIN temperature reaches 60°C. When the TMARGIN temperature exceeds 74°C (60 + 14 = 74), the fan turns off.

Polling Interval#

The nvfancontrol daemon polls the thermal zone temperatures at the time interval specified by POLLING_INTERVAL and sets the fan speed value specified in the fan profile table:

POLLING_INTERVAL <time_in_seconds>

TMARGIN Configuration#

The TMARGIN configuration must be specified for the nvfancontrol daemon to implement the Fan Profile Table correctly.

Example for TMARGIN Enabled#
  • Formula to calculate the TMARGIN temperature:

    Tmargin_sensor_temp = (GROUP_MAX_TEMP -OR- <thermal_zone_max_temp>) - <current_thermal_zone_temp>
    
  • Formula to calculate the TMARGIN weighted average of the thermal group sensors:

    Tmargin_thermgroup_weighted_average = Tmargin_sensor0_temp * sensor0_weight_ratio + Tmargin_sensor1_temp * sensor1_weight_ratio + ...
    
    Where:
    Tmargin_sensor<x>_temp - Tmargin sensor temperature calculated using the preceding formula.
    sensor<x>_weight_ratio - Currently only the <coeff_0> value is considered for the weight ratio, as mentioned in Thermal Group.
    x - sensor number
    
  • Fan profile table:

    TMARGIN ENABLED
    FAN_PROFILE cool {
            #TEMP   HYST    PWM     RPM
            0       0       255     2900
            10      0       255     2900
            11      0       215     2440
            30      0       215     2440
            60      0       66      750
            105     0       66      750
    }
    

Temperature steps defined in the preceding table are the TMARGIN temperatures calculated by using the formula mentioned at the start of this section.

Assume that GROUP_MAX_TEMP is set to 105, the current fan governor is cont, and the current fan control is closed-loop. In the fan profile table, the TMARGIN trip temperature step of 60°C corresponds to a weighted average thermal zone temperature of 45°C (105 - 60).

When the weighted average of the thermal zone temperature reaches 46°C (TMARGIN temperature 59°C), nvfancontrol sets the fan RPM to around 806, which is the value linearly interpolated between 750 RPM (TMARGIN 60°C) and 2440 RPM (TMARGIN 30°C).

In the preceding table, the fan RPM value will stay at 750 when the weighted average of the thermal zone temperature is between 0°C and 45°C (the TMARGIN temperature between 105°C and 60°C).
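The figure of about 806 RPM can be verified with a few lines of Python. The helper names are illustrative, and the two table rows used for interpolation (TMARGIN 30°C at 2440 RPM and TMARGIN 60°C at 750 RPM) are taken from the profile in this section.

```python
GROUP_MAX_TEMP = 105


def to_tmargin(weighted_avg_temp_c):
    """Convert a weighted average zone temperature to a TMARGIN temperature."""
    return GROUP_MAX_TEMP - weighted_avg_temp_c


def interpolate(x, x0, y0, x1, y1):
    """Linearly interpolate y at x between (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


tmargin = to_tmargin(46)
rpm = interpolate(tmargin, 30, 2440, 60, 750)
print(tmargin, round(rpm))  # 59 806
```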

Example for TMARGIN Disabled#
  • Formula to calculate the weighted average of the thermal group sensors:

    thermgroup_weighted_average = sensor0_temp * sensor0_weight_ratio + sensor1_temp * sensor1_weight_ratio + ...
    

    In this formula:

    • sensor<x>_temp: Current thermal zone temperature.

    • sensor<x>_weight_ratio: Currently only the <coeff_0> value is considered for the weight ratio, as mentioned in Thermal Group.

    • x: Sensor number.

  • Fan profile table:

    TMARGIN DISABLED
    FAN_PROFILE quiet {
            #TEMP   HYST    PWM     RPM
            0       0       0       0
            50      18      77      1000
            63      8       120     2000
            72      8       160     3000
            81      8       255     4000
    }
    

Temperature steps defined in the preceding table are the weighted average of the actual thermal zone temperature.

Assume that the current fan governor is pid and the current fan control is open-loop. As specified in the fan profile table, when the actual weighted average of the thermal zone temperature reaches 50°C and continues rising, nvfancontrol sets the fan PWM to 77. The fan PWM remains at 77 until the weighted average of the thermal zone temperature reaches 63°C.

Thermal Management#

Thermal management is essential for system stability and quality of user experience. Jetson Thor thermal management provides the following capabilities:

  • Sensing for on-board and on-chip thermal sensor temperature reporting.

  • Active Cooling for removing heat via the fan.

  • Passive Cooling for software and hardware clock throttling.

  • Shutdown for orderly software shutdown and hardware thermal shutdown.

Thermal management in Jetson Thor is performed by the following:

  • The Linux kernel, which monitors on-board and on-chip thermal sensors, performs cooldown, and supports software and hardware thermal shutdown.

  • The Board and Power Management Processor (BPMP), which monitors on-chip thermal sensors and performs slowdown and hardware thermal shutdown.

The following table identifies each thermal management action and the associated module for the SoC.

Thermal Action                      Linux Device Driver
Sensing                             tegra-bpmp-thermal.c, lm90.c
Slowdown for software throttling    cpufreq_cooling.c, devfreq_cooling.c
Cooldown for fan                    pwm-fan.c
Slowdown for hardware throttling    BPMP firmware
Software shutdown                   thermal_core.c
Hardware shutdown                   BPMP firmware, lm90.c

Linux Thermal Framework#

The Linux thermal framework provides generic user space and kernel space interfaces for working with devices that measure or control temperature. The central component of the framework is the thermal zone.

For more information about the Linux thermal framework, refer to <top>/kernel/3rdparty/canonical/linux-noble/Documentation/driver-api/thermal/sysfs-api.rst.

Thermal Zone#

A thermal zone is a virtual object that represents an area on the die whose temperature is monitored and controlled. A thermal zone acts as an object with the following components:

  • Temperature sensor

  • Cooling device

  • Trip points

  • Governor

BSP includes drivers that provide interfaces to these components.

This section introduces these components and demonstrates how they form a thermal zone on a Jetson device.

Configuring a Thermal Zone Using the Device Tree#

A thermal zone provides knobs to tune the thermal response of the zone. BSP provides several thermal zones tuned to provide optimum thermal performance. You can modify the provided thermal zones by editing the entries in the kernel device tree. For each zone, you can define which sensor to use, the temperature limits, and the cooling actions taken at those limits. Device overheating can be resolved in most cases by tuning the thermal zone.

The following code snippet provides an example of a thermal zone for Jetson Thor. This thermal zone monitors the temperature of the TEGRA264_THERMAL_ZONE_CPU_MAX sensor. When the passive trip point temperature is crossed, clock throttling is performed using the cooling devices listed in the map-cpufreq cooling map:

cpu-thermal {
        status = "okay";
        thermal-sensors = <0x21c 0x08>;
        polling-delay = <0x3e8>;
        polling-delay-passive = <0x28>;

        trips {

                cpu-sw-slowdown {
                        temperature = <0x1a9c8>;
                        hysteresis = <0x00>;
                        type = "passive";
                        phandle = <0x21d>;
                };

                cpu-sw-shutdown {
                        temperature = <0x1bf44>;
                        hysteresis = <0x00>;
                        type = "critical";
                        phandle = <0x2f2>;
                };
        };

        cooling-maps {

                map-cpufreq {
                        trip = <0x21d>;
                        cooling-device = <0x170 0xffffffff 0xffffffff 0x172 0xffffffff 0xffffffff 0x174 0xffffffff 0xffffffff 0x176 0xffffffff 0xffffffff 0x178 0xffffffff 0xffffffff 0x17a 0xffffffff 0xffffffff 0x17c 0xffffffff 0xffffffff>;
                };

                map-throttle-alert {
                        trip = <0x21d>;
                        cooling-device = <0x21e 0x01 0x01>;
                };
        };
};
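The cell values in this snippet are hexadecimal. Under the thermal-zones binding, trip temperatures are in millidegrees Celsius and polling delays are in milliseconds. The following Python sketch decodes the values shown above. The helper name is illustrative.

```python
def millicelsius_cell_to_celsius(cell):
    """Convert a hexadecimal device tree cell in millidegrees Celsius to degrees Celsius."""
    return int(cell, 16) / 1000.0


print(millicelsius_cell_to_celsius("0x1a9c8"))  # 109.0 (cpu-sw-slowdown)
print(millicelsius_cell_to_celsius("0x1bf44"))  # 114.5 (cpu-sw-shutdown)
print(int("0x3e8", 16), int("0x28", 16))        # 1000 40 (polling delays in ms)
```

The decoded trip temperatures match the software throttling and software shutdown values listed for cpu-thermal in Thermal Specifications.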

For more information about thermal knobs, refer to <top>/kernel/3rdparty/canonical/linux-noble/Documentation/devicetree/bindings/thermal/thermal-zones.yaml.

Temperature Sensors#

A temperature sensor in a thermal zone is responsible for reporting the temperature in millidegrees Celsius. Jetson Thor has several types of temperature sensors on the chip and board.

For more information, see Thermal Management in Linux.
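From user space, the Linux thermal framework exposes each zone under /sys/class/thermal, with a type file for the zone name and a temp file in millidegrees Celsius. The following Python sketch reads every zone it can. The function name is illustrative, and the set of zones present depends on the platform.

```python
from pathlib import Path


def read_zone_temperatures(base="/sys/class/thermal"):
    """Return {zone type: temperature in degrees Celsius} for readable zones."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # Skip zones that cannot be read right now.
        readings[name] = millidegrees / 1000.0
    return readings
```

Calling read_zone_temperatures() on a Jetson device returns a dictionary keyed by zone name, such as cpu-thermal. Zones that fail to read are omitted rather than reported as errors.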

Trip Points and Cooling Devices#

Thermal management uses trip points to communicate with thermal zones. A trip point describes the temperature at which cooling is recommended.

Trip points are classified by the type of cooling device triggered:

  • Passive trip points trigger passive cooling devices, which reduce the Jetson device’s performance and so reduce the amount of heat generated. Hardware or software clock throttling (reducing the frequency of a clock) is an example of a passive cooling device.

  • Active trip points trigger active cooling devices to remove the dissipated heat. A fan is an example of an active cooling device.

  • Critical trip points trigger a thermal shutdown.

A cooling map specifies how a cooling device is associated with certain trip points.

For more information, see Thermal Cooling.
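The trip points of a zone are also visible in sysfs as trip_point_<N>_type and trip_point_<N>_temp files in the zone directory. The following Python sketch lists them for one zone. The function name is illustrative, and the zone directory to pass in (for example, a thermal_zone<N> directory under /sys/class/thermal) depends on the platform.

```python
from pathlib import Path


def read_trip_points(zone_dir):
    """Return a list of (trip type, temperature in degrees Celsius) for one zone."""
    zone = Path(zone_dir)
    trips = []
    index = 0
    # Trip points are numbered consecutively from 0.
    while (zone / f"trip_point_{index}_temp").exists():
        trip_type = (zone / f"trip_point_{index}_type").read_text().strip()
        millidegrees = int((zone / f"trip_point_{index}_temp").read_text().strip())
        trips.append((trip_type, millidegrees / 1000.0))
        index += 1
    return trips
```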

Governors#

A governor implements a control loop that keeps a Jetson device within a safe operating temperature range. Although the Linux thermal framework provides a variety of governors, BSP provides a simple step_wise governor for all passive throttling needs.

BSP-specific Thermal Zones#

BSP defines platform-specific thermal zones. The zones are tuned to provide the best performance within the thermal constraints of the Jetson device. Each thermal zone uses a temperature sensor that is controlled by the Linux kernel or the BPMP firmware as described in the following table.

Thermal Zone      Thermal Sensor                       Associated Module
cpu-thermal       TEGRA264_THERMAL_ZONE_CPU_MAX        Linux kernel and BPMP firmware
gpu-thermal       TEGRA264_THERMAL_ZONE_GPU_MAX        Linux kernel and BPMP firmware
soc012-thermal    TEGRA264_THERMAL_ZONE_SOC_012_MAX    Linux kernel and BPMP firmware
soc345-thermal    TEGRA264_THERMAL_ZONE_SOC_345_MAX    Linux kernel and BPMP firmware
tj-thermal        TEGRA264_THERMAL_ZONE_TJ_MAX         Linux kernel and BPMP firmware

For more information, see Thermal Management in BPMP.

Gains achieved by tuning are limited by the Thermal Design Power (TDP) of the system. Tuning cannot remedy a faulty TDP. Removing all thermal zones does not guarantee maximum performance and can cause resets and irreversible damage to the device.

Thermal Management in Linux#

The Linux kernel provided by BSP includes several drivers for on-board and on-chip temperature sensing.

Thermal Sensors#

The Jetson Thor product family has several types of sensors to support hardware and software cooling strategies.

On-board Sensors#

BSP includes a driver for on-board sensor devices such as TMP451.

These thermal sensors can sense their own temperature as well as the temperature of a remote diode. Jetson platforms have these sensors set up as follows:

Thermal Sensor          Thermal Measurement Location
TMP451 remote sensor    Temperature on die near GPU
TMP451 local sensor     Temperature of the board

BSP configures these sensors to operate in an extended mode to increase the temperature range to −64 °C to 191 °C.

Operation in SC7#

The voltage rail that powers the on-board sensor is gated when the SoC enters the SC7 state on most Jetson platforms, except for Jetson Thor. For Jetson Thor, the voltage rail powering the TMP451 sensor remains on, so the sensor is operational in SC7 and can perform hardware thermal shutdown.

Thermal Capabilities#

The on-board sensors generate thermal events in the following situations:

  • Alert (interrupt) when crossing HIGH limit.

  • Hardware thermal shutdown when crossing THERM limit.

Correction Offset#

The on-board sensors allow software to program a static offset temperature for the remote sensor. This accounts for any inaccuracy that might be present in the sensor hardware. BSP reads the offset from the boot configuration table (BCT) and programs it into the offset register on boot. The offset is calculated and validated via oil-bath experiments.

On-chip Sensors#

The on-chip NV_THERM thermal sensors are controlled by BPMP firmware and the tegra-bpmp-thermal Linux kernel driver.

The BPMP firmware exposes each on-chip thermal sensor using the Application Binary Interface (ABI). Each sensor has an ABI name shown in the table in BSP-specific Thermal Zones. The on-chip sensors, whose names have the TEGRA264_THERMAL_ZONE_ prefix, work as described in the following paragraphs.

The BPMP firmware has one programmable temperature threshold (one trip) for each on-chip sensor, allocated for a Linux thermal zone trip point. The tegra_bpmp_thermal driver walks through the list of thermal trip points in a Linux thermal zone and, based on the current temperature, selects the trip to program as the sensor temperature threshold in BPMP firmware. The driver uses the following thermal message requests (MRQs) to communicate with the BPMP thermal framework:

  • CMD_THERMAL_QUERY_ABI

  • CMD_THERMAL_GET_TEMP

  • CMD_THERMAL_SET_TRIP

  • CMD_THERMAL_GET_NUM_ZONES

The driver receives a CMD_THERMAL_HOST_TRIP_REACHED MRQ message when a particular sensor crosses a trip. The message is then relayed back to the Linux thermal framework.

For more information on thermal management features provided as part of BSP, see Thermal Management in BPMP.

Thermal Cooling#

BSP provides thermal management using fan control and throttling of various clocks in the system.

Fan Management#

BSP provides active cooling by fan management through the pwm-fan driver, controlled by nvfancontrol, which provides the following:

  • Fan speed control by programming the PWM controller.

  • Ramp-up and ramp-down control to change the speed of the fan smoothly.

  • Fan control during various power states.

SoC thermal management uses the fan as the first line of defense to delay clock throttling until a much higher temperature is reached.

Note

If nvfancontrol fails to start, the kernel takes over fan speed control based on the trip point temperatures defined for the tj-thermal zone.

Software Clock Throttling#

BSP provides thermal cooling by throttling various clocks in the system. When a thermal sensor’s temperature rises above a throttling trip point, clock throttling employs the DVFS capabilities of the clocks to reduce their operating frequencies, and thereby the voltages of the rails that power the clocks. This reduction in frequency and voltage reduces power consumption, which helps to control the temperature.

Because BSP provides cooling by reducing the clock frequency, it directly impacts performance and the user experience. If a device feels warm and seems sluggish, it might be due to thermal throttling of the clocks. This can be remedied by tuning the trip points and cooling devices of thermal zones.

BSP provides the following cooling devices for software clock throttling:

  • cpufreq_cooling

  • devfreq_cooling

Each of these cooling devices provides several cooling states, each of which translates to a maximum allowable operating frequency for the CPU and GPU clocks. These frequencies are optimized to provide the best possible performance at a given temperature. The frequency tables for these clocks are available in the sysfs nodes exposed by cpufreq and devfreq frameworks.

The governor uses the current temperature of a thermal zone as an input to the feedback control loop. Similarly, it uses the output of the control loop to set a new cooling state for the thermal zone’s cooling device. As the device heats up, the governor picks progressively higher cooling states, which result in lower frequency caps for all of the clocks and potentially greater cooling. BSP performs this thermal throttling of the clocks to maintain the junction temperature of the die within the recommended safe limits. For software throttling trip temperatures, see the table in Thermal Specifications.
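The following Python sketch illustrates the idea of stepping through cooling states. It is a deliberate simplification and not the kernel's step_wise implementation, which also takes the temperature trend and hysteresis into account. The function name, the number of cooling states, and the sample temperatures are illustrative. The 109.0°C trip is the software throttling value from Thermal Specifications.

```python
def next_cooling_state(current_state, max_state, temp_c, trip_temp_c):
    """Move one cooling state up at or above the trip, one state down otherwise."""
    if temp_c >= trip_temp_c:
        return min(current_state + 1, max_state)
    return max(current_state - 1, 0)


state = 0
history = []
for temp_c in [108.0, 109.5, 110.0, 109.2, 108.5, 107.0]:
    state = next_cooling_state(state, max_state=3, temp_c=temp_c, trip_temp_c=109.0)
    history.append(state)
print(history)  # [0, 1, 2, 3, 2, 1]
```

Each higher state corresponds to a lower frequency cap, so the cap tightens while the zone stays at or above the trip and relaxes one step at a time as it cools.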

Software Thermal Shutdown#

A critical trip point triggers a software thermal shutdown. It allows the operating system to save its state and perform an orderly shutdown before a hardware thermal reset occurs.

A software thermal shutdown is considered a rare event. It occurs after all other cooling strategies have failed.

BSP defines one critical trip point per thermal zone. You can set the lower limit for the orderly shutdown. For software thermal shutdown trip temperatures, see the table in Thermal Specifications.

Hardware Thermal Shutdown#

The on-chip and on-board sensors can trigger hardware shutdown when all other cooling strategies have failed, and software shutdown has failed to occur when it should. For hardware shutdown limits, see the table in Thermal Specifications.

Thermal Management in BPMP#

BSP thermal management features are part of the firmware running on BPMP for Jetson platforms running any host operating system (host OS) on the CPU.

Thermal Sensing#

The BPMP firmware hosts the nvtherm drivers for the on-chip thermal sensors as follows:

Thermal Sensor    ABI Name                    Sensed Location
NV_THERM          CPU_MAX                     Hottest spot of CPU clusters.
NV_THERM          GPU_MAX                     Hottest spot of GPU clusters.
NV_THERM          SOC_012_MAX, SOC_345_MAX    The two maximum temperatures reported across the SOC region, which is the area outside the CPU and GPU regions.
NV_THERM          TJ_MAX                      Virtual sensor corresponding to the highest temperature among CPU_MAX, GPU_MAX, SOC_012_MAX, and SOC_345_MAX.

NV_THERM#

NV_THERM is the collection of on-chip ring oscillators whose frequency changes are based on temperature. To convert a measured frequency to a temperature, the oscillating frequency of the sensor, at a fixed temperature, must be known in advance and stored in the on-chip fuses.

The BPMP firmware nvtherm driver uses these fuses during boot and calibrates the sensor. When the calibration is complete, the temperature sensor reports the temperature, in degrees Celsius, with a 0.03125 °C precision margin.

Sensors and Sensor Groups#

The temperature sensors on the chip are logically classified in sensor groups, based on their proximity to certain hardware blocks. The sensor groups are represented as a single sensor to the host OS and the BPMP firmware.

For example, Jetson Thor has some temperature sensors in the CPU cluster. These are grouped as CPU sensors that are represented as TEGRA264_THERMAL_ZONE_CPU_MAX to the operating system running on the CPUs. The BPMP firmware reports the temperature of a given group by taking the maximum of all the sensors in the group.

Note

The GPU power rail might be turned off at idle by run-time power management. The temperature cannot be read from GPU thermal sensors when the power is off. An attempt to read a sensor with the power off returns error code EAGAIN (resource temporarily unavailable).
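A user-space reader can treat EAGAIN as "no reading available right now" instead of as a failure. The following Python sketch shows one way to do this. The function name and the opener parameter (used only to make the function easy to test) are illustrative, and the sysfs path of the GPU zone is platform specific.

```python
import errno


def read_temp_c(path, opener=open):
    """Return the temperature in degrees Celsius, or None if the sensor is unavailable."""
    try:
        with opener(path) as handle:
            return int(handle.read().strip()) / 1000.0
    except OSError as error:
        if error.errno == errno.EAGAIN:
            return None  # The sensor rail is off; try again later.
        raise  # Any other error is unexpected and is passed on.
```

A caller would pass the temp file of the gpu-thermal zone and skip the sample when the function returns None.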

Thermal Event Detection#

Thermal sensors can report the temperature when the current temperature crosses a software-defined trip point. The sensors are capable of monitoring several of these software trip points to perform the following thermal actions:

  • Report when the thermal trip point has been crossed.

  • Trigger a hardware thermal shutdown.

  • Trigger hardware throttling.

Voltage Rail Dependencies#

To provide accurate temperature sensing, the sensors require a minimum voltage. Additionally, the sensors cannot operate when the rail is power-gated.

When the system is in a low-power state, the firmware provides the following mode of operation:

  • No temperature measurements during SC7: Because the rail powering the sensor is power-gated in the SC7 state, the oscillator is not running. Therefore, the frequency-to-temperature conversion might produce inaccurate values. To avoid spurious temperature reports from the sensors, stop the sensors before entering the SC7 state.

BPMP Thermal Framework#

The BPMP firmware hosts a thermal framework to perform the following tasks:

  • Register thermal sensors as thermal zones, as identified in Thermal Sensing.

  • Allow BPMP modules to register trip points on the thermal zones.

  • Allow the host OS to register trips using thermal MRQ messages.

  • Provide trip management and reporting.

The thermal framework maintains a list of trips per sensor that includes the current trip from the host OS and various BPMP modules. As temperatures change, the framework examines the list of current trips and notifies the owners of the trips of the changes. The notification is sent using a callback for the BPMP owned trips and the thermal MRQ command CMD_THERMAL_HOST_TRIP_REACHED for trips that are owned by the host OS.

The primary thermal MRQ requests handled by the framework are as follows:

  • CMD_THERMAL_QUERY_ABI

  • CMD_THERMAL_GET_TEMP

  • CMD_THERMAL_SET_TRIP

  • CMD_THERMAL_GET_NUM_ZONES

Because a sensor might have several trips, the thermal framework must ensure that a notification is generated whenever a given trip is crossed. For example, if TEGRA264_THERMAL_ZONE_CPU_MAX has trips at 55 °C, 60 °C, 65 °C, and 70 °C, the thermal framework sends a single notification each time the temperature crosses one of these trips.

Additionally, the framework implements hysteresis to prevent sending too many notifications. Thus, for the preceding example, the framework does the following:

  • Sends one notification when the temperature reaches 55 °C.

  • Waits until the temperature drops below 54 °C.

  • Sends another notification when the temperature rises back to 55 °C.

To generate these notifications, the thermal framework sets low trips on the sensors to receive events that the temperature has dropped below the limit.
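The notification sequence above can be modeled with a small Python sketch. It is not the BPMP firmware implementation: the function name and sample temperatures are illustrative, and the 1 °C hysteresis is inferred from the 55 °C and 54 °C values in the example.

```python
def trip_notifications(temps_c, trip_c=55.0, hysteresis_c=1.0):
    """Yield the sample indices at which a trip notification would be sent."""
    armed = True
    for index, temp_c in enumerate(temps_c):
        if armed and temp_c >= trip_c:
            armed = False  # Notify once, then wait for the low trip.
            yield index
        elif not armed and temp_c < trip_c - hysteresis_c:
            armed = True  # Temperature dropped below the low trip; re-arm.


samples = [53.0, 55.0, 54.5, 55.2, 53.9, 55.0]
print(list(trip_notifications(samples)))  # [1, 5]
```

The samples at 54.5 °C and 55.2 °C produce no second notification because the temperature has not yet dropped below 54 °C.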

Hardware Throttling#

Each element in a power delivery system includes limitations, such as the following:

  • The amount of current a battery can supply without shutting down.

  • The amount of current a regulator can provide before it fails to maintain its output voltage.

  • The amount of ripple current an inductor in a switching regulator can tolerate without overheating.

These limitations can result in fast transient electrical and thermal events, such as the following:

  • Overcurrent at the battery.

  • Voltage drop at the PMIC.

  • Temperature spikes.

The firmware refers to these events as OC alarms and triggers clock hardware throttling to handle them.

Impact#

Like software throttling, hardware throttling can reduce performance. However, because the triggering events are rare and transient, the impact on the user experience is minimal.

The host OS is not notified of these events, but you can detect the drop in clock rates by using a performance measuring tool that samples the CPU cycle counters. While thermal management in the host OS seeks to control temperature on an ongoing basis, hardware throttling clamps down the clocks to handle transient events.

Throttle Points and Vector Configuration#

The BPMP device tree binary holds the various throttle points and the throttle settings that govern when and how throttling is performed. The nvtherm driver in the BPMP firmware handles any interrupts resulting from these events.

The following table shows the hardware throttling levels:

Hardware throttling    Clock throttled percentage
Heavy                  87.5
Medium                 75
Light                  50

Throttle vectors are optimized for limiting peak current consumption while maximizing performance. To manage peak current consumption, the firmware supports capping the CPU and GPU clocks at various levels (such as light, medium, and heavy), as described in the device tree bindings. Clock capping prevents the CPU and GPU from drawing more current than their voltage regulators can supply.
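The following Python sketch shows the effect of each throttling level on a clock rate. It assumes that the percentage in the preceding table is the share of the clock that is removed, so heavy throttling leaves one eighth of the nominal rate. That interpretation, the function name, and the 2 GHz nominal rate are assumptions made for illustration.

```python
# Percentages copied from the hardware throttling table above.
THROTTLE_PERCENT = {"light": 50.0, "medium": 75.0, "heavy": 87.5}


def throttled_rate_hz(nominal_hz, level):
    """Effective rate, assuming the percentage is the share of the clock removed."""
    return nominal_hz * (1.0 - THROTTLE_PERCENT[level] / 100.0)


for level in ("light", "medium", "heavy"):
    print(level, throttled_rate_hz(2_000_000_000, level))
```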

For hardware throttling trip temperatures, see the table in Thermal Specifications.

Design Considerations#

Designing failsafe measures into Power Management Integrated Circuits (PMICs), or using the battery controller to shut down the device when the events described here occur, results in a bad user experience. Similarly, designing power delivery hardware for worst-case loads results in large and costly components.

Consequently, NVIDIA SoCs are designed for use with power delivery systems that are adequate for common loads. NVIDIA SoCs actively manage their components to avoid exceeding their design limits. When events are transient, the advantage of this approach to power management becomes more compelling.

Hardware Thermal Shutdown#

The final failsafe for firmware thermal management is a hardware thermal reset, or thermtrip. If software and hardware throttling are unable to control heat generation in the system, and the software becomes unresponsive, the SoC asserts the reset pin on the PMIC as the hardware shutdown mechanism.

For hardware shutdown limits, see the table in Thermal Specifications.

Thermal Specifications#

The following table lists the trip temperature of each cooling action for each thermal zone and HWMON node.

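Thermal Zone or HWMON Node    Thermal Sensor                       Cooling Action    Jetson T5000
cpu-thermal                   TEGRA264_THERMAL_ZONE_CPU_MAX        SW throttling     109.0 °C
cpu-thermal                   TEGRA264_THERMAL_ZONE_CPU_MAX        HW throttling     113.0 °C
cpu-thermal                   TEGRA264_THERMAL_ZONE_CPU_MAX        SW shutdown       114.5 °C
cpu-thermal                   TEGRA264_THERMAL_ZONE_CPU_MAX        HW shutdown       115.0 °C
gpu-thermal                   TEGRA264_THERMAL_ZONE_GPU_MAX        SW throttling     109.0 °C
gpu-thermal                   TEGRA264_THERMAL_ZONE_GPU_MAX        HW throttling     113.0 °C
gpu-thermal                   TEGRA264_THERMAL_ZONE_GPU_MAX        SW shutdown       114.5 °C
gpu-thermal                   TEGRA264_THERMAL_ZONE_GPU_MAX        HW shutdown       115.0 °C
soc012-thermal                TEGRA264_THERMAL_ZONE_SOC_012_MAX    SW throttling     109.0 °C
soc012-thermal                TEGRA264_THERMAL_ZONE_SOC_012_MAX    HW throttling     113.0 °C
soc012-thermal                TEGRA264_THERMAL_ZONE_SOC_012_MAX    SW shutdown       114.5 °C
soc012-thermal                TEGRA264_THERMAL_ZONE_SOC_012_MAX    HW shutdown       115.0 °C
soc345-thermal                TEGRA264_THERMAL_ZONE_SOC_345_MAX    SW throttling     109.0 °C
soc345-thermal                TEGRA264_THERMAL_ZONE_SOC_345_MAX    HW throttling     113.0 °C
soc345-thermal                TEGRA264_THERMAL_ZONE_SOC_345_MAX    SW shutdown       114.5 °C
soc345-thermal                TEGRA264_THERMAL_ZONE_SOC_345_MAX    HW shutdown       115.0 °C
tmp451 hwmon temp1            TMP451 local sensor                  HW shutdown       117.0 °C
tmp451 hwmon temp2            TMP451 remote sensor                 HW shutdown       117.0 °C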

Note

When the threshold is exceeded, the TEMP_THERM signal is asserted by TMP451 thermal sensors and the hardware is shut down. The board should be sufficiently cooled before it is powered on again. The power rail for TMP451 on Jetson T5000 is always on, so powering the board on without sufficient cooling will fail. (The default hysteresis is 10°C.)

The board can be powered on again only after the temperature falls below (<threshold> - <hysteresis>) °C. If the board cannot be sufficiently cooled, completely cut off the power to TMP451 and reset the TEMP_THERM signal by unplugging the power supply and then plugging it in again.
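The thresholds in this section can be captured in a small lookup, sketched here in Python. The values are copied from the table for Jetson T5000 (the four on-chip zones share the same trip temperatures) and from the default hysteresis in the preceding note. The function names are illustrative.

```python
# Trip temperatures in degrees Celsius for the on-chip zones (Jetson T5000).
ON_CHIP_TRIPS_C = [
    ("SW throttling", 109.0),
    ("HW throttling", 113.0),
    ("SW shutdown", 114.5),
    ("HW shutdown", 115.0),
]
TMP451_HW_SHUTDOWN_C = 117.0
TMP451_DEFAULT_HYSTERESIS_C = 10.0


def actions_reached(temp_c):
    """Return the cooling actions whose trip temperature is at or below temp_c."""
    return [action for action, trip_c in ON_CHIP_TRIPS_C if temp_c >= trip_c]


def tmp451_power_on_limit_c():
    """Temperature the TMP451 reading must fall below before power-on succeeds."""
    return TMP451_HW_SHUTDOWN_C - TMP451_DEFAULT_HYSTERESIS_C


print(actions_reached(113.2))     # ['SW throttling', 'HW throttling']
print(tmp451_power_on_limit_c())  # 107.0
```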