core.timers#

Megatron timers.

Module Contents#

Classes#

TimerBase

Timer base class.

DummyTimer

Dummy Timer.

Timer

Timer class with ability to start/stop.

Timers

Class for a group of Timers.

Data#

API#

core.timers.logger#

‘getLogger(…)’

class core.timers.TimerBase(name)#

Bases: abc.ABC

Timer base class.

Initialization

abstractmethod start(barrier=False)#

Start the timer.

Parameters:

barrier (bool, optional) – Synchronizes ranks before starting. Defaults to False.

abstractmethod stop(barrier=False)#

Stop the timer.

Parameters:

barrier (bool, optional) – Synchronizes ranks before stopping. Defaults to False.

abstractmethod reset()#

Reset timer.

abstractmethod elapsed(reset=True, barrier=False)#

Calculates the elapsed time and, by default, resets the timer.

Parameters:
  • reset (bool, optional) – Resets the timer after recording the elapsed time. Defaults to True.

  • barrier (bool, optional) – Synchronizes ranks before measuring. Defaults to False.

Returns:

Elapsed time.

Return type:

float
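The interface above can be illustrated with a minimal concrete subclass. This is a standalone single-process sketch built on `time.perf_counter`, not Megatron's actual implementation, which additionally handles CUDA synchronization and distributed barriers:

```python
import time
from abc import ABC, abstractmethod


class TimerBase(ABC):
    """Minimal restatement of the abstract interface."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def start(self, barrier=False): ...

    @abstractmethod
    def stop(self, barrier=False): ...

    @abstractmethod
    def reset(self): ...

    @abstractmethod
    def elapsed(self, reset=True, barrier=False): ...


class SimpleTimer(TimerBase):
    """Single-process timer honoring the start/stop/elapsed contract."""

    def __init__(self, name):
        super().__init__(name)
        self._elapsed = 0.0
        self._start_time = None
        self._started = False

    def start(self, barrier=False):
        assert not self._started, "timer has already been started"
        self._start_time = time.perf_counter()
        self._started = True

    def stop(self, barrier=False):
        assert self._started, "timer is not started"
        self._elapsed += time.perf_counter() - self._start_time
        self._started = False

    def reset(self):
        self._elapsed = 0.0
        self._started = False

    def elapsed(self, reset=True, barrier=False):
        # If the timer is running, fold in the in-flight interval,
        # report the total, optionally reset, and restart it.
        was_started = self._started
        if was_started:
            self.stop()
        total = self._elapsed
        if reset:
            self.reset()
        if was_started:
            self.start()
        return total
```

For example, `t = SimpleTimer('fwd'); t.start(); t.stop(); t.elapsed()` returns the accumulated seconds, and with `reset=True` zeroes the counter for the next measurement window.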

class core.timers.DummyTimer#

Bases: core.timers.TimerBase

Dummy Timer.

Initialization

start(barrier=False)#
stop(barrier=False)#
reset()#
elapsed(reset=True, barrier=False)#
active_time()#

Returns the cumulative duration the timer has been active. Note: Not supported for DummyTimer.

class core.timers.Timer(name)#

Bases: core.timers.TimerBase

Timer class with ability to start/stop.

A note on barrier: if this flag is set, all caller processes wait until every rank reaches the timing routine. It is up to the user to make sure that all ranks in barrier_group call it; otherwise the call will hang. A note on barrier_group: it defaults to None, which in torch.distributed corresponds to the global communicator.

Initialization

Initialize Timer.

Parameters:

name (str) – Name of the timer.

set_barrier_group(barrier_group)#

Sets barrier group.

Parameters:

barrier_group (ProcessGroup) – Torch ProcessGroup for barrier.

start(barrier=False)#

Start the timer.

Parameters:

barrier (bool, optional) – Synchronizes ranks before starting. Defaults to False.

stop(barrier=False)#

Stop the timer.

Parameters:

barrier (bool, optional) – Synchronizes ranks before stopping. Defaults to False.

reset()#

Reset timer.

elapsed(reset=True, barrier=False)#

Calculates the elapsed time and, by default, resets the timer.

Parameters:
  • reset (bool, optional) – Resets the timer after recording the elapsed time. Defaults to True.

  • barrier (bool, optional) – Synchronizes ranks before measuring. Defaults to False.

Returns:

Elapsed time.

Return type:

float

active_time()#

Calculates the cumulative duration for which the timer has been active.

class core.timers.Timers(log_level, log_option)#

Class for a group of Timers.

Initialization

Initialize group of timers.

Parameters:
  • log_level (int) – Log level to control what timers are enabled.

  • log_option (str) – Setting for logging statistics over ranks for all the timers. Allowed: [‘max’, ‘minmax’, ‘all’].

__call__(name, log_level=None)#

Call timer with name and log level.
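The dispatch behind `__call__` can be sketched as follows. This is a simplified stand-in, not Megatron's code: calling the group with a name returns a real timer when the requested log level is enabled, and a shared no-op dummy otherwise, so timing calls at filtered-out levels cost nothing. The actual `Timers` additionally validates names and log levels.

```python
import time


class _DummyTimer:
    """No-op timer returned when a timer's log level is filtered out."""

    def start(self, barrier=False):
        pass

    def stop(self, barrier=False):
        pass


class _SimpleTimer:
    """Bare-bones accumulating timer for the sketch."""

    def __init__(self, name):
        self.name = name
        self._elapsed = 0.0
        self._start_time = None

    def start(self, barrier=False):
        self._start_time = time.perf_counter()

    def stop(self, barrier=False):
        self._elapsed += time.perf_counter() - self._start_time


class SimpleTimers:
    """Sketch of __call__ dispatch: real timer if enabled, dummy otherwise."""

    def __init__(self, log_level, log_option):
        self._log_level = log_level
        self._log_option = log_option
        self._timers = {}
        self._dummy = _DummyTimer()

    def __call__(self, name, log_level=None):
        if log_level is None:
            log_level = self._log_level
        if log_level > self._log_level:
            # Filtered out: start()/stop() become no-ops.
            return self._dummy
        return self._timers.setdefault(name, _SimpleTimer(name))
```

Typical usage follows the pattern `timers('forward', log_level=1).start()` … `timers('forward').stop()`; a timer requested at a level above the configured one silently does nothing.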

_get_elapsed_time_all_ranks(names, reset, barrier)#

Returns the elapsed times of the timers in names.

Assumptions: all ranks call this function, and names is identical on all ranks. If these assumptions are not met, the call will hang.

Parameters:
  • names (List[str]) – List of timer names.

  • reset (bool) – Resets the timer after recording the elapsed time.

  • barrier (bool) – If set, do a global barrier before the time measurement.

Returns:

Tensor of size [world_size, len(names)] with times in float.

Return type:

torch.Tensor

_get_global_min_max_time(names, reset, barrier, normalizer)#

Report only min and max times across all ranks.

_get_global_min_max_time_string(
names,
reset,
barrier,
normalizer,
max_only,
)#

Report strings for max/minmax times across all ranks.

_get_all_ranks_time_string(names, reset, barrier, normalizer)#

Report times across all ranks.

get_all_timers_string(
names: List[str] = None,
normalizer: float = 1.0,
reset: bool = True,
barrier: bool = False,
)#

Returns the output string with logged timer values according to configured options.

Parameters:
  • names (List[str]) – Names of the timers to log. If None, all registered timers are fetched. Defaults to None.

  • normalizer (float, optional) – Normalizes the timer values by the factor. Defaults to 1.0.

  • reset (bool, optional) – Whether to reset timer values after logging. Defaults to True.

  • barrier (bool, optional) – Whether to do a global barrier before time measurements. Defaults to False.

Raises:

Exception – Raises if log option is invalid.

Returns:

Formatted string with the timer values.

Return type:

str

log(
names: List[str],
rank: int = None,
normalizer: float = 1.0,
reset: bool = True,
barrier: bool = False,
)#

Logs the timers passed in names to stdout. For example, to log the average per-step value of timer ‘foo’, call this function with the normalizer set to the logging interval.

Parameters:
  • names (List[str]) – Names of the timers to log.

  • rank (int, optional) – logs the timers to a specific rank. If set to None, logs to the last rank. Defaults to None.

  • normalizer (float, optional) – Normalizes the timer values by the factor. Defaults to 1.0.

  • reset (bool, optional) – Whether to reset timer values after logging. Defaults to True.

  • barrier (bool, optional) – Whether to do a global barrier before time measurements. Defaults to False.
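The normalizer turns cumulative timings into per-step averages. A minimal sketch of the arithmetic, where `normalized_time` is a hypothetical helper written for illustration (the docstring implies the logged value is the elapsed time divided by the normalizer):

```python
def normalized_time(elapsed_seconds, normalizer=1.0):
    """Divide a cumulative elapsed time by the normalizer, mirroring
    how a logged timer value is scaled before printing (an assumption
    based on the docstring, not Megatron's exact code)."""
    assert normalizer > 0.0
    return elapsed_seconds / normalizer


# Timer 'foo' accumulated 2.0 s over a 10-iteration logging interval,
# so logging with normalizer=10 reports the average per-step time.
per_step = normalized_time(2.0, normalizer=10)
```

Calling `log` every `interval` iterations with `normalizer=interval` (and the default `reset=True`) therefore reports an average per-iteration time for each window.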

write(
names: List[str],
writer,
iteration: int,
normalizer: float = 1.0,
reset: bool = True,
barrier: bool = False,
)#

Write timers to a tensorboard writer. Note that we only report maximum time across ranks to tensorboard.

Parameters:
  • names (List[str]) – Names of the timers to log.

  • writer (SummaryWriter) – Tensorboard SummaryWriter object

  • iteration (int) – Current iteration.

  • normalizer (float, optional) – Normalizes the timer values by the factor. Defaults to 1.0.

  • reset (bool, optional) – Whether to reset timer values after logging. Defaults to True.

  • barrier (bool, optional) – Whether to do a global barrier before time measurements. Defaults to False.
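The write pattern can be sketched with a stand-in writer object. `FakeSummaryWriter` and `write_timers` are hypothetical illustrations, not Megatron's API; in real use the writer would be a `torch.utils.tensorboard.SummaryWriter`, and the exact scalar tag format is an assumption:

```python
class FakeSummaryWriter:
    """Stand-in for torch.utils.tensorboard.SummaryWriter that records
    add_scalar calls instead of writing event files."""

    def __init__(self):
        self.scalars = []

    def add_scalar(self, tag, value, global_step):
        self.scalars.append((tag, value, global_step))


def write_timers(elapsed_by_name, writer, iteration, normalizer=1.0):
    """Hypothetical helper mirroring Timers.write: emit one scalar per
    timer, normalized. The '-time' tag suffix is an assumption made
    for illustration."""
    assert normalizer > 0.0
    for name, seconds in elapsed_by_name.items():
        writer.add_scalar(name + '-time', seconds / normalizer, iteration)


writer = FakeSummaryWriter()
write_timers({'forward': 1.5, 'backward': 3.0}, writer,
             iteration=100, normalizer=10)
```

Since only the maximum time across ranks is reported, each timer contributes a single scalar series per training run.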