nemo_automodel.training.step_scheduler
Module Contents

Classes

`StepScheduler` – Scheduler for managing gradient accumulation and checkpointing steps.

API
- class nemo_automodel.training.step_scheduler.StepScheduler(
- grad_acc_steps: int,
- ckpt_every_steps: int,
- dataloader: Optional[Iterable],
- val_every_steps: Optional[int] = None,
- start_step: int = 0,
- start_epoch: int = 0,
- num_epochs: int = 10,
- max_steps: Optional[int] = None,
- )
Bases:
torch.distributed.checkpoint.stateful.Stateful
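Because the class implements the `Stateful` protocol, its progress counters can be captured and restored alongside a checkpoint. A minimal sketch, given a constructed `StepScheduler` instance `scheduler` (see the construction sketch below); the contents of the state dict are an implementation detail not documented here:

```python
# Save the scheduler's progress (step/epoch counters) with a checkpoint...
state = scheduler.state_dict()

# ...and restore it when resuming, so step counting picks up where it left off.
scheduler.load_state_dict(state)
```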
Scheduler for managing gradient accumulation and checkpointing steps.
Initialization
Initialize the StepScheduler.
- Parameters:
grad_acc_steps (int) – Number of steps over which to accumulate gradients.
ckpt_every_steps (int) – Number of steps between checkpoint saves.
dataloader (Optional[Iterable]) – The training dataloader.
val_every_steps (Optional[int]) – Number of training steps between validation runs.
start_step (int) – Initial global step.
start_epoch (int) – Initial epoch.
num_epochs (int) – Total number of epochs.
max_steps (Optional[int]) – Total number of steps to run.
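A minimal construction sketch, assuming a standard PyTorch `DataLoader`; the toy dataset and the step counts are hypothetical:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from nemo_automodel.training.step_scheduler import StepScheduler

# Hypothetical toy dataloader standing in for a real training dataloader.
train_loader = DataLoader(TensorDataset(torch.randn(64, 8)), batch_size=4)

scheduler = StepScheduler(
    grad_acc_steps=4,       # accumulate gradients over 4 batches
    ckpt_every_steps=100,   # save a checkpoint every 100 steps
    dataloader=train_loader,
    val_every_steps=50,     # validate every 50 training steps
    num_epochs=3,
    max_steps=1000,         # hard cap on total steps
)
```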
- __iter__()
Iterates over the dataloader while keeping track of step counters.
- Raises:
StopIteration – If the dataloader is exhausted or max_steps is reached.
- Yields:
dict – the next batch from the dataloader.
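Continuing the sketch above, iterating the scheduler is a thin wrapper over the dataloader:

```python
# Each iteration yields the next batch while the scheduler advances its
# internal step counter; the loop ends when the dataloader is exhausted
# or max_steps is reached.
for batch in scheduler:
    pass  # forward/backward would go here
```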
- property is_optim_step
Returns whether the optimizer should step on this iteration.
- Returns:
True if the optimizer should run.
- Return type:
bool
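This property is the hook for gradient accumulation. A sketch of the intended gating, where `model`, `optimizer`, and `compute_loss` are hypothetical stand-ins and the loss is scaled by the accumulation factor by convention:

```python
GRAD_ACC_STEPS = 4  # must match the value passed to the constructor

for batch in scheduler:
    loss = compute_loss(model, batch)   # hypothetical forward pass and loss
    (loss / GRAD_ACC_STEPS).backward()  # scale so accumulated grads average out
    if scheduler.is_optim_step:
        optimizer.step()                # runs once per grad_acc_steps batches
        optimizer.zero_grad()
```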
- property is_val_step
Returns whether validation should run on this step.
- property is_ckpt_step
Returns whether a checkpoint should be saved on this step.
- Returns:
True if a checkpoint should be saved.
- Return type:
bool
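`is_val_step` and `is_ckpt_step` gate validation and checkpointing in the same way as `is_optim_step` above; `run_validation` and `save_checkpoint` below are hypothetical helpers:

```python
for batch in scheduler:
    # ... forward/backward/optimizer step as shown above ...
    if scheduler.is_val_step:
        run_validation(model)              # hypothetical validation routine
    if scheduler.is_ckpt_step:
        save_checkpoint(model, optimizer)  # hypothetical checkpoint writer
```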
- property epochs
Epoch iterator.
- Yields:
int – the next epoch to run.
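Putting it together, `epochs` drives the outer loop while iterating the scheduler itself drives the inner one; a sketch, with `train_step` as a hypothetical per-batch routine:

```python
for epoch in scheduler.epochs:
    for batch in scheduler:
        train_step(batch)  # hypothetical per-batch training step
```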