Thermovision

class nsight.thermovision.ThermalController(
thermal_mode: Literal['auto', 'manual', 'off'] = 'auto',
thermal_wait: int | None = None,
thermal_cont: int | None = None,
thermal_timeout: int | None = None,
verbose: bool = False,
)

Bases: object

GPU thermal monitoring and throttling prevention.

Monitors GPU temperature and prevents thermal throttling by pausing profiling when the GPU gets too hot and resuming once it has cooled.

Parameters:
  • thermal_mode (Literal['auto', 'manual', 'off'])

  • thermal_wait (int | None)

  • thermal_cont (int | None)

  • thermal_timeout (int | None)

  • verbose (bool)

init()

Initialize NVML and get GPU handle.

Return type:

bool

Returns:

True if temperature retrieval is supported, False otherwise.
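As a rough illustration of what an NVML-based support check can look like, the sketch below uses the pynvml bindings directly. It is not the body of init(); the helper name init_nvml and the device_index parameter are hypothetical, and the only assumption carried over from the description above is that "supported" means a temperature reading can be obtained.

```python
def init_nvml(device_index: int = 0):
    """Hypothetical helper: return (supported, handle) for one GPU.

    This is an illustrative sketch, not ThermalController.init() itself.
    """
    try:
        import pynvml  # NVML Python bindings; may not be installed
    except ImportError:
        return False, None

    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        # Probe once: if this call raises, temperature retrieval is unsupported.
        pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    except pynvml.NVMLError:
        # Covers a missing driver, an invalid index, or an unsupported query.
        return False, None

    return True, handle


supported, handle = init_nvml()
print("temperature retrieval supported:", supported)
```

On a machine without pynvml or without an NVIDIA driver, the helper reports False instead of raising, which mirrors the boolean contract documented for init().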

throttle_guard()

Check thermal state and pause if GPU is too hot.

Thermal headroom is the temperature margin remaining before the GPU starts throttling.

Operates in two modes:

  • Auto mode: Automatically adjusts thermal_cont based on workload

  • Manual mode: Uses user-provided thresholds without adaptation

Adaptive algorithm (in auto mode):

  1. When thermal headroom reaches thermal_cont after cooling, start counting iterations.

  2. Run the kernel as headroom drops from thermal_cont toward thermal_wait.

  3. When headroom drops below thermal_wait, analyze the iteration count:

     • Few iterations (< TARGET_MIN_ITERATIONS): the GPU heats quickly → increase thermal_cont (cool more)

     • Many iterations (> TARGET_MAX_ITERATIONS): the GPU heats slowly → decrease thermal_cont (cool less)

  4. Wait until the GPU cools back to thermal_cont, then repeat.

Return type:

None
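The adaptive loop described for auto mode can be sketched as a small simulation with no GPU involved. Everything numeric here is an assumption made for illustration: the values of TARGET_MIN_ITERATIONS and TARGET_MAX_ITERATIONS, the ADJUST_STEP size, the clamp that keeps thermal_cont above thermal_wait, and the linear heating model. The function names are hypothetical and are not part of ThermalController.

```python
TARGET_MIN_ITERATIONS = 5   # assumed value; the real constant is not documented here
TARGET_MAX_ITERATIONS = 20  # assumed value; the real constant is not documented here
ADJUST_STEP = 1             # assumed adjustment per cycle, in degrees of headroom


def adapt_thermal_cont(thermal_cont: int, thermal_wait: int, iterations: int) -> int:
    """Return the next thermal_cont given how many iterations fit in one cycle."""
    if iterations < TARGET_MIN_ITERATIONS:
        # GPU heated quickly: demand more headroom before resuming (cool more).
        thermal_cont += ADJUST_STEP
    elif iterations > TARGET_MAX_ITERATIONS:
        # GPU heated slowly: accept less headroom before resuming (cool less).
        thermal_cont -= ADJUST_STEP
    # Assumed safeguard: the resume threshold stays above the pause threshold.
    return max(thermal_wait + 1, thermal_cont)


def simulate(heat_per_iteration: int, thermal_wait: int = 5,
             thermal_cont: int = 10, cycles: int = 3):
    """Simulate heat/cool cycles; return (iterations, new thermal_cont) per cycle."""
    history = []
    for _ in range(cycles):
        headroom = thermal_cont  # the GPU has just cooled back to thermal_cont
        iterations = 0
        while headroom >= thermal_wait:
            headroom -= heat_per_iteration  # one kernel run consumes headroom
            iterations += 1
        thermal_cont = adapt_thermal_cont(thermal_cont, thermal_wait, iterations)
        history.append((iterations, thermal_cont))
    return history


for iterations, cont in simulate(heat_per_iteration=2):
    print(f"iterations={iterations} next thermal_cont={cont}")
```

With a workload that burns headroom quickly, each cycle fits fewer than TARGET_MIN_ITERATIONS runs, so the simulated thermal_cont climbs by one step per cycle. That is the "cool more" branch of step 3.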