nemo_automodel.components.training.timers

Megatron-based timers.

Module Contents

Classes

Name	Description
`DummyTimer`	Dummy Timer.
`Timer`	Timer class with ability to start/stop.
`TimerBase`	Timer base class.
`Timers`	Class for a group of Timers.

Data

dist_all_gather_func

API

class nemo_automodel.components.training.timers.DummyTimer()

Bases: TimerBase

Dummy Timer.

nemo_automodel.components.training.timers.DummyTimer.active_time()

Returns the cumulative duration the timer has been active.

Note: Not supported for DummyTimer.

nemo_automodel.components.training.timers.DummyTimer.elapsed(
    reset = True,
    barrier = False
)

Dummy timer elapsed time.

nemo_automodel.components.training.timers.DummyTimer.reset()

Dummy timer reset.

nemo_automodel.components.training.timers.DummyTimer.start(
    barrier = False
)

Dummy timer start.

nemo_automodel.components.training.timers.DummyTimer.stop(
    barrier = False
)

Dummy timer stop.

class nemo_automodel.components.training.timers.Timer(
    name
)

Bases: TimerBase

Timer class with ability to start/stop.

Note on barrier: if this flag is set, every calling process waits until all of them reach the timing routine. The caller must make sure that every rank in barrier_group makes the call; otherwise the program hangs.

Note on barrier_group: it defaults to None, which in torch.distributed selects the global communicator.

_active_time

= 0.0

_elapsed

= 0.0

_start_time

= time.time()

nemo_automodel.components.training.timers.Timer.active_time()

Calculates the cumulative duration for which the timer has been active.

nemo_automodel.components.training.timers.Timer.elapsed(
    reset = True,
    barrier = False
)

Calculates the elapsed time and restarts timer.

Parameters:

reset

bool, defaults to True

Resets timer before restarting. Defaults to True.

barrier

bool, defaults to False

Synchronizes ranks before stopping. Defaults to False.

Returns:

Elapsed time.
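The start/stop/elapsed behaviour described above can be illustrated with a small stand-in. This is a minimal sketch, not the library's implementation: `SketchTimer` and its injectable `clock` argument are hypothetical, the bookkeeping is inferred from the documented `_elapsed`, `_active_time` and `_start_time` attributes, and barriers and device synchronization are left out.

```python
import time


class SketchTimer:
    """Hypothetical stand-in for illustration only; not the library's Timer."""

    def __init__(self, name, clock=time.time):
        self.name = name
        self._clock = clock          # injectable clock (an assumption, for testing)
        self._elapsed = 0.0          # time accumulated since the last reset
        self._active_time = 0.0      # time accumulated over the timer's lifetime
        self._started = False
        self._start_time = clock()

    def start(self):
        assert not self._started, "timer has already been started"
        self._start_time = self._clock()
        self._started = True

    def stop(self):
        assert self._started, "timer is not started"
        delta = self._clock() - self._start_time
        self._elapsed += delta
        self._active_time += delta
        self._started = False

    def reset(self):
        self._elapsed = 0.0
        self._started = False

    def elapsed(self, reset=True):
        # A running timer is stopped, read, optionally reset, then restarted.
        was_running = self._started
        if was_running:
            self.stop()
        value = self._elapsed
        if reset:
            self.reset()
        if was_running:
            self.start()
        return value

    def active_time(self):
        return self._active_time


# Deterministic demo driven by a fake clock instead of wall time.
ticks = iter([0.0, 1.0, 3.0, 10.0, 14.0])
timer = SketchTimer("step", clock=lambda: next(ticks))

timer.start(); timer.stop()                 # first interval: 3.0 - 1.0
timer.start(); timer.stop()                 # second interval: 14.0 - 10.0

first_read = timer.elapsed(reset=False)     # accumulated value is kept
second_read = timer.elapsed(reset=True)     # same value returned, then cleared
third_read = timer.elapsed()                # nothing accumulated since the reset
lifetime = timer.active_time()              # unaffected by reset in this sketch
```

In this sketch `reset` clears only the per-interval accumulator, so `active_time()` keeps growing across resets; whether the real class behaves identically should be checked against its source.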

nemo_automodel.components.training.timers.Timer.reset()

Reset timer.

nemo_automodel.components.training.timers.Timer.set_barrier_group(
    barrier_group
)

Sets barrier group.

Parameters:

barrier_group

ProcessGroup

Torch ProcessGroup for barrier.

nemo_automodel.components.training.timers.Timer.start(
    barrier = False
)

Start the timer.

Parameters:

barrier

bool, defaults to False

Synchronizes ranks before starting. Defaults to False.

nemo_automodel.components.training.timers.Timer.stop(
    barrier = False
)

Stop the timer.

Parameters:

barrier

bool, defaults to False

Synchronizes ranks before stopping. Defaults to False.

class nemo_automodel.components.training.timers.TimerBase(
    name: str
)

Abstract

Timer base class.

nemo_automodel.components.training.timers.TimerBase.__enter__()

Start the timer when entering a context using the configured barrier option.

nemo_automodel.components.training.timers.TimerBase.__exit__(
    exc_type,
    exc_val,
    exc_tb
)

Stop the timer when exiting a context using the configured barrier option.

nemo_automodel.components.training.timers.TimerBase.elapsed(
    reset = True,
    barrier = False
)

abstract

Calculates the elapsed time and restarts timer.

Parameters:

reset

bool, defaults to True

Resets timer before restarting. Defaults to True.

barrier

bool, defaults to False

Synchronizes ranks before stopping. Defaults to False.

Returns:

Elapsed time.

nemo_automodel.components.training.timers.TimerBase.reset()

abstract

Reset timer.

nemo_automodel.components.training.timers.TimerBase.start(
    barrier = False
)

abstract

Start the timer.

Parameters:

barrier

bool, defaults to False

Synchronizes ranks before starting. Defaults to False.

nemo_automodel.components.training.timers.TimerBase.stop(
    barrier = False
)

abstract

Stop the timer.

Parameters:

barrier

bool, defaults to False

Synchronizes ranks before stopping. Defaults to False.

nemo_automodel.components.training.timers.TimerBase.with_barrier(
    barrier = True
)

Set the barrier option for use in context manager.

Parameters:

barrier

bool, defaults to True

Whether to use barrier in context manager. Defaults to True.

Returns:

Returns self for chaining.
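Because with_barrier returns self, it can be chained directly into a with statement. The following is a minimal sketch of that pattern under stated assumptions: `SketchContextTimer` is a hypothetical class that only records calls, and having `__enter__` return the timer is an assumption rather than something this page confirms.

```python
class SketchContextTimer:
    """Hypothetical stand-in that records calls instead of measuring time."""

    def __init__(self, name):
        self.name = name
        self._barrier = False   # barrier option used by the context manager
        self.calls = []         # (method, barrier) pairs, for illustration

    def with_barrier(self, barrier=True):
        self._barrier = barrier
        return self             # returning self is what enables chaining

    def start(self, barrier=False):
        self.calls.append(("start", barrier))

    def stop(self, barrier=False):
        self.calls.append(("stop", barrier))

    def __enter__(self):
        self.start(barrier=self._barrier)
        return self             # assumption: the timer itself is the context value

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.stop(barrier=self._barrier)


timer = SketchContextTimer("forward")

# The configured barrier option is forwarded to both start and stop.
with timer.with_barrier():
    pass

recorded = list(timer.calls)
```

Here `recorded` holds one start and one stop entry, both with the barrier flag set, which mirrors the documented "using the configured barrier option" wording for `__enter__` and `__exit__`.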

class nemo_automodel.components.training.timers.Timers(
    log_level,
    log_option
)

Class for a group of Timers.

_dummy_timer

= DummyTimer()

_log_levels

= {}

_max_log_level

= 2

_timers

= {}

nemo_automodel.components.training.timers.Timers.__call__(
    name,
    log_level = None,
    barrier = False
)

Call timer with name and log level.

Returns a timer object that can be used as a context manager.

Parameters:

name

str

Name of the timer.

log_level

int, defaults to None

Log level of the timer. Defaults to None.

barrier

bool, defaults to False

Whether to use barrier in context manager. Defaults to False.
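One plausible reading of how log_level interacts with the documented `_timers`, `_log_levels`, `_max_log_level` and `_dummy_timer` attributes is sketched below. Everything here is an assumption modelled on common Megatron-style timers: `SketchTimers`, `_NoOpTimer` and `_RealTimer` are hypothetical names, and the gating rule (timers above the configured level get the shared dummy) is not confirmed by this page.

```python
class _NoOpTimer:
    """Hypothetical do-nothing timer usable as a context manager."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        return None


class _RealTimer(_NoOpTimer):
    """Hypothetical placeholder for a timer that would actually measure."""

    def __init__(self, name):
        self.name = name


class SketchTimers:
    """Hypothetical group of timers with log-level gating (assumed behaviour)."""

    def __init__(self, log_level, max_log_level=2):
        self._log_level = log_level
        self._max_log_level = max_log_level
        self._timers = {}
        self._log_levels = {}
        self._dummy_timer = _NoOpTimer()

    def __call__(self, name, log_level=None):
        if name in self._timers:
            return self._timers[name]          # reuse an already registered timer
        if log_level is None:
            log_level = self._max_log_level    # assumed default: most verbose level
        if log_level > self._log_level:
            return self._dummy_timer           # too fine-grained: hand out the no-op
        self._timers[name] = _RealTimer(name)
        self._log_levels[name] = log_level
        return self._timers[name]


timers = SketchTimers(log_level=1)

with timers("train-step", log_level=0):
    pass                                       # registered and "measured"

with timers("optimizer-inner", log_level=2):
    pass                                       # silently skipped via the dummy

registered = sorted(timers._timers)
skipped = timers("optimizer-inner", log_level=2)
```

The appeal of this design is that call sites never need to check the verbosity themselves: disabled timers cost a dictionary lookup and a no-op context manager.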

nemo_automodel.components.training.timers.Timers._get_all_ranks_time_string(
    names,
    reset,
    barrier,
    normalizer
)

Report times across all ranks.

nemo_automodel.components.training.timers.Timers._get_elapsed_time_all_ranks(
    names,
    reset,
    barrier
)

Returns elapsed times of timers in names.

This function assumes that all ranks call it and that names is identical on every rank. If these assumptions are not met, the call hangs.

Parameters:

names

List[str]

List of timer names.

reset

bool

Whether to reset the timer after recording the elapsed time.

barrier

bool

If set, performs a global barrier before time measurements.

Returns:

torch.Tensor: Tensor of shape [world_size, len(names)] holding the times as floats.

nemo_automodel.components.training.timers.Timers._get_global_min_max_time(
    names,
    reset,
    barrier,
    normalizer
)

Report only min and max times across all ranks.
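The min/max reduction over a [world_size, len(names)] table of times can be sketched in plain Python. This is an illustrative sketch only: `min_max_per_name` is a hypothetical helper, nested lists stand in for the gathered tensor, and any special handling the library applies (for example to timers that never ran on a rank) is not modelled.

```python
def min_max_per_name(names, times_by_rank, normalizer=1.0):
    """Return {name: (min, max)} over ranks, each value divided by normalizer.

    times_by_rank is a list with one row per rank and one column per name,
    mirroring the documented [world_size, len(names)] layout.
    """
    assert normalizer > 0.0
    summary = {}
    for col, name in enumerate(names):
        column = [row[col] / normalizer for row in times_by_rank]
        summary[name] = (min(column), max(column))
    return summary


names = ["forward", "backward"]
times_by_rank = [
    [2.0, 4.0],   # rank 0
    [3.0, 6.0],   # rank 1
    [2.5, 5.0],   # rank 2
]

summary = min_max_per_name(names, times_by_rank, normalizer=2.0)
```

A large gap between the minimum and maximum for a name usually points at load imbalance or a straggling rank, which is why the pair is more informative than a single average.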

nemo_automodel.components.training.timers.Timers._get_global_min_max_time_string(
    names,
    reset,
    barrier,
    normalizer,
    max_only
)

Report strings for max/minmax times across all ranks.

nemo_automodel.components.training.timers.Timers.get_all_timers_string(
    names: typing.List[str] = None,
    normalizer: float = 1.0,
    reset: bool = True,
    barrier: bool = False
)

Returns the output string with logged timer values according to configured options.

Parameters:

names

List[str], defaults to None

Names of the timers to log. If None, all registered timers are fetched. Defaults to None.

normalizer

float, defaults to 1.0

Normalizes the timer values by the factor. Defaults to 1.0.

reset

bool, defaults to True

Whether to reset timer values after logging. Defaults to True.

barrier

bool, defaults to False

Whether to do a global barrier before time measurements. Defaults to False.

Returns:

Formatted string with the timer values.

Raises:

Exception: If the log option is invalid.

nemo_automodel.components.training.timers.Timers.log(
    names: typing.List[str],
    rank: int = None,
    normalizer: float = 1.0,
    reset: bool = True,
    barrier: bool = False
)

Logs the timers passed in names to stdout.

For example, to log the average per-step value of timer ‘foo’, call this function with normalizer set to the logging interval.

Parameters:

names

List[str]

Names of the timers to log.

rank

int, defaults to None

Rank on which to log the timers. If set to None, logs on the last rank. Defaults to None.

normalizer

float, defaults to 1.0

Normalizes the timer values by the factor. Defaults to 1.0.

reset

bool, defaults to True

Whether to reset timer values after logging. Defaults to True.

barrier

bool, defaults to False

Whether to do a global barrier before time measurements. Defaults to False.
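The role of normalizer in producing a per-step average comes down to one division. The numbers below are made up for illustration, and the variable names are hypothetical; the sketch only shows the arithmetic, not the library's output format or units.

```python
# Durations (in seconds) that a timer accumulated over one logging interval.
step_durations = [0.5, 0.25, 0.75, 0.5]

log_interval = len(step_durations)      # number of steps between two log calls
accumulated = sum(step_durations)       # what the timer holds before logging

# Passing the logging interval as the normalizer turns the accumulated
# total into an average per step.
average_per_step = accumulated / log_interval
```

With reset left at its default of True, the accumulated value is cleared after each log call, so every interval is averaged independently.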

nemo_automodel.components.training.timers.Timers.write(
    names: typing.List[str],
    writer,
    iteration: int,
    normalizer: float = 1.0,
    reset: bool = True,
    barrier: bool = False
)

Writes timers to a TensorBoard writer.

Only the maximum time across ranks is reported to TensorBoard.

Parameters:

names

List[str]

Names of the timers to log.

writer

SummaryWriter

TensorBoard SummaryWriter object.

iteration

int

Current iteration.

normalizer

float, defaults to 1.0

Normalizes the timer values by the factor. Defaults to 1.0.

reset

bool, defaults to True

Whether to reset timer values after logging. Defaults to True.

barrier

bool, defaults to False

Whether to do a global barrier before time measurements. Defaults to False.

nemo_automodel.components.training.timers.Timers.write_to_wandb(
    names: list[str],
    writer,
    iteration: int,
    normalizer: float = 1.0,
    reset: bool = True,
    barrier: bool = False
) -> None

Patch to write timers to wandb for Megatron Core Timers.

nemo_automodel.components.training.timers.dist_all_gather_func = torch.distributed.all_gather_into_tensor