core.energy_monitor#

Megatron Energy Monitoring (NVML)

Module Contents#

Classes#

EnergyMonitor

Energy monitoring using NVML.

API#

class core.energy_monitor.EnergyMonitor#

Energy monitoring using NVML.

All ranks in the process group are expected to call lap() and get_total(). Energy is measured on every rank and aggregated with an all-reduce, which is a collective operation, so a rank that skips either call can stall the others.

Initialization

Initialize EnergyMonitor.

setup() None#

Set up the NVML handle.

shutdown() None#

Shutdown NVML.

pause() None#

Pause the energy monitor. Call resume() afterward to continue measuring.

resume() None#

Start or resume the energy monitor.

_get_energy() int#

Get current energy consumption from NVML.
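
For orientation, a raw NVML energy reading can be taken with the `pynvml` bindings roughly as follows. This is a minimal sketch, not the body of _get_energy(): the helper name `read_gpu_energy_joules` and the choice to return `None` when NVML is unavailable are assumptions made for illustration. NVML exposes a cumulative counter in millijoules, which would be consistent with the `int` return type here, although this page does not state the unit _get_energy() uses.

```python
from typing import Optional

try:
    import pynvml  # provided by the nvidia-ml-py package
except ImportError:  # keep the sketch importable on machines without it
    pynvml = None


def read_gpu_energy_joules(device_index: int = 0) -> Optional[float]:
    """Return the GPU's cumulative energy counter in joules, or None if unavailable."""
    if pynvml is None:
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # no driver or NVML library on this machine
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        # NVML reports millijoules accumulated since the driver was last loaded.
        millijoules = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
        return millijoules / 1000.0
    except pynvml.NVMLError:
        return None  # e.g. the GPU does not support energy readings
    finally:
        pynvml.nvmlShutdown()


energy = read_gpu_energy_joules()
```

Because the counter is cumulative, the energy used by a region of code is the difference between two readings taken before and after it.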

lap() float#

Return the energy (J) consumed during the current lap (iteration) and update the total energy.

get_total() float#

Get total energy consumption (J) across all GPUs.
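
A typical call pattern for the methods above might look like the loop below. The ordering (setup, resume, one lap per iteration, get_total, shutdown) is an assumption inferred from the method descriptions, and `train_with_energy` and `RecordingMonitor` are hypothetical names used only so the sketch runs without Megatron or a GPU. In a distributed job, every rank would run the same loop so that each lap() and get_total() call is matched on all ranks.

```python
from typing import Callable, List, Tuple


def train_with_energy(
    monitor, num_steps: int, step_fn: Callable[[int], None]
) -> Tuple[List[float], float]:
    """Run num_steps iterations, recording per-iteration and total energy (J)."""
    monitor.setup()
    monitor.resume()  # start measuring
    per_step: List[float] = []
    try:
        for step in range(num_steps):
            step_fn(step)
            per_step.append(monitor.lap())  # energy of this iteration
        total = monitor.get_total()
    finally:
        monitor.shutdown()  # release NVML even if a step fails
    return per_step, total


class RecordingMonitor:
    """Hypothetical stub: records calls and reports 2.5 J for every lap."""

    def __init__(self) -> None:
        self.calls: List[str] = []
        self._total = 0.0

    def setup(self) -> None:
        self.calls.append("setup")

    def resume(self) -> None:
        self.calls.append("resume")

    def lap(self) -> float:
        self.calls.append("lap")
        self._total += 2.5
        return 2.5

    def get_total(self) -> float:
        self.calls.append("get_total")
        return self._total

    def shutdown(self) -> None:
        self.calls.append("shutdown")


stub = RecordingMonitor()
laps, total = train_with_energy(stub, num_steps=3, step_fn=lambda step: None)
```

Any object that provides the five documented methods can be passed as `monitor`, so the same loop could presumably be driven by an EnergyMonitor instance in a real training job.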