bridge.training.inprocess_restart#

Module Contents#

Functions#

inprocess_restart

Wraps the train_fn with in-process restart functionality.

maybe_wrap_for_inprocess_restart

Conditionally wrap function for in-process restart.

Data#

API#

bridge.training.inprocess_restart.logger: logging.Logger#

‘getLogger(…)’

bridge.training.inprocess_restart.inprocess_restart(
train_fn: Callable,
config: megatron.bridge.training.config.InProcessRestartConfig,
global_state: megatron.bridge.training.state.GlobalState,
) Callable#

Wraps the train_fn with in-process restart functionality.

Parameters:
  • train_fn – The training function to wrap.

  • config – Configuration settings for in-process restart.

  • global_state – State object for the training function.

Returns:

The wrapped training function.

bridge.training.inprocess_restart.maybe_wrap_for_inprocess_restart(
train_fn: Callable,
config: megatron.bridge.training.config.InProcessRestartConfig,
state: megatron.bridge.training.state.GlobalState,
) tuple[Callable, Optional[torch.distributed.Store]]#

Conditionally wrap function for in-process restart.