nemo_automodel.training.step_scheduler
Module Contents

Classes

`StepScheduler` – Scheduler for managing gradient accumulation and checkpointing steps.

API
- class nemo_automodel.training.step_scheduler.StepScheduler(
- grad_acc_steps: int,
- ckpt_every_steps: int,
- dataloader: Optional[Iterable],
- val_every_steps: Optional[int] = None,
- start_step: int = 0,
- start_epoch: int = 0,
- num_epochs: int = 10,
- max_steps: Optional[int] = None,
- )
Bases:
torch.distributed.checkpoint.stateful.Stateful
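Because the class implements the `Stateful` protocol, its progress counters can be captured and restored alongside a checkpoint. A minimal sketch, given a constructed `StepScheduler` instance `scheduler` (see the construction sketch below); the contents of the state dict are an implementation detail not documented here:

```python
# Save the scheduler's progress (step/epoch counters) with a checkpoint...
state = scheduler.state_dict()

# ...and restore it when resuming, so step counting picks up where it left off.
scheduler.load_state_dict(state)
```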
Scheduler for managing gradient accumulation and checkpointing steps.
Initialization
Initialize the StepScheduler.
- Parameters:
grad_acc_steps (int) – Number of steps over which to accumulate gradients.
ckpt_every_steps (int) – Number of steps between checkpoint saves.
dataloader (Optional[Iterable]) – The training dataloader.
val_every_steps (Optional[int]) – Number of training steps between validation runs.
start_step (int) – Initial global step.
start_epoch (int) – Initial epoch.
num_epochs (int) – Total number of epochs.
max_steps (Optional[int]) – Total number of steps to run.
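A minimal construction sketch, assuming a standard PyTorch `DataLoader`; the toy dataset and the step counts are hypothetical:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from nemo_automodel.training.step_scheduler import StepScheduler

# Hypothetical toy dataloader standing in for a real training dataloader.
train_loader = DataLoader(TensorDataset(torch.randn(64, 8)), batch_size=4)

scheduler = StepScheduler(
    grad_acc_steps=4,       # accumulate gradients over 4 batches
    ckpt_every_steps=100,   # save a checkpoint every 100 steps
    dataloader=train_loader,
    val_every_steps=50,     # validate every 50 training steps
    num_epochs=3,
    max_steps=1000,         # hard cap on total steps
)
```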
- __iter__()
Iterates over the dataloader while keeping track of step counters.
- Raises:
StopIteration – If the dataloader is exhausted or max_steps is reached.
- Yields:
dict – the next batch from the dataloader.
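Continuing the sketch above, iterating the scheduler is a thin wrapper over the dataloader:

```python
# Each iteration yields the next batch while the scheduler advances its
# internal step counter; the loop ends when the dataloader is exhausted
# or max_steps is reached.
for batch in scheduler:
    pass  # forward/backward would go here
```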
- property is_optim_step
Returns whether the optimizer should step on this iteration.
- Returns:
True if the optimizer should run.
- Return type:
bool
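This property is the hook for gradient accumulation. A sketch of the intended gating, where `model`, `optimizer`, and `compute_loss` are hypothetical stand-ins and the loss is scaled by the accumulation factor by convention:

```python
GRAD_ACC_STEPS = 4  # must match the value passed to the constructor

for batch in scheduler:
    loss = compute_loss(model, batch)   # hypothetical forward pass and loss
    (loss / GRAD_ACC_STEPS).backward()  # scale so accumulated grads average out
    if scheduler.is_optim_step:
        optimizer.step()                # runs once per grad_acc_steps batches
        optimizer.zero_grad()
```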
- property is_val_step
Returns whether validation should run on this step.
- property is_ckpt_step
Returns whether a checkpoint should be saved on this step.
- Returns:
True if a checkpoint should be saved.
- Return type:
bool
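`is_val_step` and `is_ckpt_step` gate validation and checkpointing in the same way as `is_optim_step` above; `run_validation` and `save_checkpoint` below are hypothetical helpers:

```python
for batch in scheduler:
    # ... forward/backward/optimizer step as shown above ...
    if scheduler.is_val_step:
        run_validation(model)              # hypothetical validation routine
    if scheduler.is_ckpt_step:
        save_checkpoint(model, optimizer)  # hypothetical checkpoint writer
```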
- property epochs
Epoch iterator.
- Yields:
int – the next epoch to run.
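Putting it together, `epochs` drives the outer loop while iterating the scheduler itself drives the inner one; a sketch, with `train_step` as a hypothetical per-batch routine:

```python
for epoch in scheduler.epochs:
    for batch in scheduler:
        train_step(batch)  # hypothetical per-batch training step
```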