Thermovision

class nsight.thermovision.ThermalController(thermal_mode: Literal['auto', 'manual', 'off'] = 'auto', thermal_wait: int | None = None, thermal_cont: int | None = None, thermal_timeout: int | None = None, verbose: bool = False)

Bases: object

GPU thermal monitoring and throttling prevention.

Manages GPU temperature and prevents thermal throttling by pausing profiling when the GPU gets too hot and resuming after cooling.

Parameters:

thermal_mode (Literal['auto', 'manual', 'off'])
thermal_wait (int | None)
thermal_cont (int | None)
thermal_timeout (int | None)
verbose (bool)

init()

Initialize NVML and get GPU handle.

Return type: bool
Returns: True if temperature retrieval is supported, False otherwise.
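
A minimal sketch of what such a support check might look like follows. It is not the library's implementation. It assumes the pynvml bindings (nvidia-ml-py package), and the function name nvml_temperature_supported and the gpu_index parameter are hypothetical. Unlike init(), which keeps the GPU handle for later use, this probe shuts NVML down again before returning.

```python
def nvml_temperature_supported(gpu_index: int = 0) -> bool:
    """Return True if NVML can report the GPU temperature, False otherwise."""
    try:
        import pynvml  # assumption: NVML bindings from the nvidia-ml-py package
    except ImportError:
        return False

    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # No NVIDIA driver or NVML library available on this machine.
        return False

    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        # Raises NVMLError if the device cannot report its temperature.
        pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        return True
    except pynvml.NVMLError:
        return False
    finally:
        # NVML was initialized successfully above, so release it again.
        pynvml.nvmlShutdown()
```

On a machine without an NVIDIA GPU or without the bindings installed, the function returns False instead of raising.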

throttle_guard()

Check thermal state and pause if GPU is too hot.

Thermal headroom is the temperature margin remaining before the GPU starts throttling.

Operates in two modes:
- Auto mode: Automatically adjusts thermal_cont based on workload
- Manual mode: Uses user-provided thresholds without adaptation

Adaptive Algorithm (in auto mode):
1. When thermal headroom reaches thermal_cont after cooling, start counting iterations
2. Run kernel as headroom drops from thermal_cont toward thermal_wait
3. When headroom drops below thermal_wait, analyze iteration count:
   - Few iterations (<TARGET_MIN_ITERATIONS): GPU heats quickly → increase thermal_cont (cool more)
   - Many iterations (>TARGET_MAX_ITERATIONS): GPU heats slowly → decrease thermal_cont (cool less)
4. Wait until GPU cools back to thermal_cont, then repeat

Return type: None
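
The adaptive steps described for auto mode can be sketched in plain Python as below. This is an illustration under stated assumptions, not the library's code. The numeric values of TARGET_MIN_ITERATIONS and TARGET_MAX_ITERATIONS, the STEP size, the clamp that keeps thermal_cont above thermal_wait, and the helpers adapt_thermal_cont and simulate_cycle are all hypothetical. Headroom is simulated rather than read from the GPU.

```python
# Assumed values: the documentation names these constants but gives no numbers.
TARGET_MIN_ITERATIONS = 50
TARGET_MAX_ITERATIONS = 200
STEP = 1  # hypothetical adjustment step, in degrees of headroom


def adapt_thermal_cont(thermal_cont: int, thermal_wait: int, iterations: int) -> int:
    """Return the thermal_cont to use for the next cycle."""
    if iterations < TARGET_MIN_ITERATIONS:
        # GPU heated quickly: require more headroom before resuming (cool more).
        return thermal_cont + STEP
    if iterations > TARGET_MAX_ITERATIONS:
        # GPU heated slowly: resume with less headroom (cool less).
        # Assumption: keep thermal_cont above thermal_wait so they never cross.
        return max(thermal_wait + 1, thermal_cont - STEP)
    return thermal_cont


def simulate_cycle(thermal_cont: int, thermal_wait: int, heat_per_iter: float) -> int:
    """Count iterations while simulated headroom falls below thermal_wait.

    The cycle starts at thermal_cont, standing in for a GPU that has just
    finished cooling; each iteration consumes heat_per_iter degrees of headroom.
    """
    headroom = float(thermal_cont)
    iterations = 0
    while headroom >= thermal_wait:
        headroom -= heat_per_iter
        iterations += 1
    return iterations


if __name__ == "__main__":
    cont, wait = 10, 2
    for _ in range(3):
        n = simulate_cycle(cont, wait, heat_per_iter=0.5)
        cont = adapt_thermal_cont(cont, wait, n)
        print(f"iterations={n}, next thermal_cont={cont}")
```

In manual mode, by contrast, the adaptation step would be skipped and the user-provided thermal_cont and thermal_wait would be used unchanged on every cycle.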