core.energy_monitor
Megatron Energy Monitoring (NVML)
Module Contents
Classes
EnergyMonitor | Energy monitoring using NVML.
API
- class core.energy_monitor.EnergyMonitor
Energy monitoring using NVML.
All ranks in the process group must call lap() and get_total(): energy is measured on every rank and aggregated with an all-reduce, a collective operation in which every rank has to participate.
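The aggregation step can be sketched as follows. This is a minimal illustration, not the module's actual code: `aggregate_energy` is a hypothetical helper, and the use of `torch.distributed.all_reduce` with a sum over a CPU tensor is an assumption about how the per-rank readings are combined. The function falls back to the local value when PyTorch is missing or no process group is initialized, so it also runs in a single process.

```python
def aggregate_energy(local_joules: float) -> float:
    """Sum a per-rank energy reading over all ranks (hypothetical helper)."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed: there is nothing to aggregate with.
        return float(local_joules)

    if not (dist.is_available() and dist.is_initialized()):
        # Single-process run: the local reading is already the total.
        return float(local_joules)

    # Assumption: a CPU tensor is fine for the gloo backend; with NCCL the
    # tensor would have to live on the current CUDA device instead.
    total = torch.tensor([local_joules], dtype=torch.float64)
    # all_reduce is a collective: every rank must reach this call,
    # otherwise the ranks that did call it will block.
    dist.all_reduce(total, op=dist.ReduceOp.SUM)
    return float(total.item())


if __name__ == "__main__":
    print(aggregate_energy(12.5))
```

Because the reduction is collective, skipping the call on a single rank (for example inside a rank-0-only logging branch) would stall the other ranks. That is why lap() and get_total() are expected on every rank.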
Initialization
Initialize EnergyMonitor.
- setup() -> None
Set up the NVML handler.
- shutdown() -> None
Shut down NVML.
- pause() -> None
Pause the energy monitor. Call resume() afterward to continue monitoring.
- resume() -> None
Resume or start the energy monitor.
- _get_energy() -> int
Get current energy consumption from NVML.
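A raw energy reading from NVML might look like the sketch below. The page does not say which NVML call the class uses. This example assumes the `pynvml` bindings and `nvmlDeviceGetTotalEnergyConsumption`, which returns a cumulative counter in millijoules. The helper names `read_gpu_energy_mj` and `millijoules_to_joules` are hypothetical.

```python
def read_gpu_energy_mj(device_index: int = 0) -> int:
    """Read the cumulative energy counter of one GPU, in millijoules."""
    # Deferred import so this file can be loaded on machines without NVML.
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        # Cumulative energy in mJ since the driver was last reloaded.
        return pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
    finally:
        # Always release NVML, even if the query raised an error.
        pynvml.nvmlShutdown()


def millijoules_to_joules(mj: int) -> float:
    """Convert an NVML millijoule reading to joules."""
    return mj / 1000.0
```

An integer millijoule counter is consistent with the `int` return type documented here, while lap() and get_total() report joules as floats.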
- lap() -> float
Return the energy (J) of the current lap (iteration) and update the total energy.
- get_total() -> float
Get total energy consumption (J) across all GPUs.
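The interplay of lap(), get_total(), pause(), and resume() can be illustrated with a small stand-in that performs the same kind of bookkeeping on a single process. `LapCounter` is hypothetical and is not the EnergyMonitor implementation. It takes an injectable reader instead of calling NVML, omits the cross-rank all-reduce, and assumes that energy consumed while paused is excluded from laps and from the total.

```python
from typing import Callable


class LapCounter:
    """Hypothetical single-process stand-in for lap/total bookkeeping."""

    def __init__(self, read_energy_mj: Callable[[], int]) -> None:
        self._read = read_energy_mj
        self._baseline_mj = read_energy_mj()  # counter value at last sample
        self._pending_j = 0.0  # energy counted since the last lap()
        self._total_j = 0.0
        self._paused = False

    def _collect(self) -> None:
        """Move energy used since the baseline into the pending lap."""
        now = self._read()
        self._pending_j += (now - self._baseline_mj) / 1000.0
        self._baseline_mj = now

    def pause(self) -> None:
        if not self._paused:
            self._collect()  # bank what was used before pausing
            self._paused = True

    def resume(self) -> None:
        if self._paused:
            # Reset the baseline so energy used while paused is skipped.
            self._baseline_mj = self._read()
            self._paused = False

    def lap(self) -> float:
        if not self._paused:
            self._collect()
        lap_j, self._pending_j = self._pending_j, 0.0
        self._total_j += lap_j
        return lap_j

    def get_total(self) -> float:
        return self._total_j


# Fake counter readings in millijoules, one per read.
readings = iter([0, 2000, 5000, 9000, 9500])
counter = LapCounter(lambda: next(readings))

first_lap = counter.lap()   # covers 0 -> 2000 mJ
counter.pause()             # banks 2000 -> 5000 mJ
counter.resume()            # skips 5000 -> 9000 mJ
second_lap = counter.lap()  # banked 3000 mJ plus 9000 -> 9500 mJ
total = counter.get_total()
```

With these readings the first lap is 2.0 J, the second is 3.5 J, and the total is 5.5 J. The 4000 mJ consumed between pause() and resume() is not counted.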