nemo_automodel.components.training.step_scheduler

Module Contents

Classes

StepScheduler – Scheduler for managing gradient accumulation and checkpointing steps.

API
- class nemo_automodel.components.training.step_scheduler.StepScheduler(
- grad_acc_steps: int,
- ckpt_every_steps: int,
- dataloader: Optional[Iterable],
- val_every_steps: Optional[int] = None,
- start_step: int = 0,
- start_epoch: int = 0,
- num_epochs: int = 10,
- max_steps: Optional[int] = None,
- )
Bases:
torch.distributed.checkpoint.stateful.Stateful
Scheduler for managing gradient accumulation and checkpointing steps.
Initialization
Initialize the StepScheduler.
- Parameters:
grad_acc_steps (int) – Number of micro-batch steps to accumulate gradients before an optimizer step.
ckpt_every_steps (int) – Frequency, in global steps, of checkpoint saves.
dataloader (Optional[Iterable]) – The training dataloader.
val_every_steps (Optional[int]) – Number of training steps between validation runs.
start_step (int) – Initial global step.
start_epoch (int) – Initial epoch.
num_epochs (int) – Total number of epochs.
max_steps (Optional[int]) – Maximum total number of steps to run, if set.
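To show how these counters typically interact in a training loop, here is a minimal, self-contained sketch. `TinyScheduler` is a hypothetical stand-in written for illustration only; its names and gating conventions (1-indexed global steps, modulo gating) are assumptions, not the nemo_automodel implementation.

```python
from typing import Iterable, Iterator, Optional


class TinyScheduler:
    """Illustrative stand-in for a StepScheduler-like object (not the
    nemo_automodel source): it tracks a global micro-batch counter over a
    dataloader and exposes gradient-accumulation / checkpoint gating."""

    def __init__(self, grad_acc_steps: int, ckpt_every_steps: int,
                 dataloader: Iterable, max_steps: Optional[int] = None):
        self.grad_acc_steps = grad_acc_steps
        self.ckpt_every_steps = ckpt_every_steps
        self.dataloader = dataloader
        self.max_steps = max_steps
        self.step = 0  # global micro-batch counter

    def __iter__(self) -> Iterator:
        for batch in self.dataloader:
            if self.max_steps is not None and self.step >= self.max_steps:
                return  # stop early once max_steps is reached
            self.step += 1
            yield batch

    @property
    def is_optim_step(self) -> bool:
        # Optimizer fires once every grad_acc_steps micro-batches.
        return self.step % self.grad_acc_steps == 0

    @property
    def is_ckpt_step(self) -> bool:
        # Checkpoint fires once every ckpt_every_steps global steps.
        return self.step % self.ckpt_every_steps == 0


# Drive the sketch: 10 batches available, but max_steps caps us at 6.
sched = TinyScheduler(grad_acc_steps=2, ckpt_every_steps=4,
                      dataloader=range(10), max_steps=6)
optim_at, ckpt_at = [], []
for batch in sched:
    if sched.is_optim_step:
        optim_at.append(sched.step)  # optimizer runs at steps 2, 4, 6
    if sched.is_ckpt_step:
        ckpt_at.append(sched.step)   # checkpoint saved at step 4
```

Under these assumptions the loop processes six micro-batches, steps the optimizer three times, and saves one checkpoint.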
- __iter__()
Iterates over the dataloader while keeping track of the step counters.
- Raises:
StopIteration – If the dataloader is exhausted or max_steps is reached.
- Yields:
dict – The next batch from the dataloader.
- property is_optim_step
Returns whether the optimizer should run at the current step.
- Returns:
True if the optimizer step should run.
- Return type:
bool
- property is_val_step
Returns whether validation should run at the current step.
- property is_ckpt_step
Returns whether a checkpoint should be saved at the current step.
- Returns:
True if a checkpoint should be saved.
- Return type:
bool
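In isolation, gating properties like these typically reduce to modular arithmetic on the global step counter. A sketch under the assumption of 1-indexed global steps (the actual indexing convention in nemo_automodel may differ):

```python
def is_optim_step(step: int, grad_acc_steps: int) -> bool:
    # Assuming 1-indexed global steps: the optimizer fires on every
    # grad_acc_steps-th micro-batch.
    return step % grad_acc_steps == 0


def is_ckpt_step(step: int, ckpt_every_steps: int) -> bool:
    # A checkpoint is saved on every ckpt_every_steps-th global step.
    return step % ckpt_every_steps == 0


# With grad_acc_steps=4, the optimizer runs on micro-batches 4 and 8.
optim_micro_batches = [s for s in range(1, 9) if is_optim_step(s, 4)]
```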
- property epochs
Iterator over epochs.
- Yields:
int – each epoch index in turn.
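A minimal sketch of what such an epoch iterator might look like, assuming it simply resumes at start_epoch and runs through num_epochs. This is an inference from the constructor parameters above, not the actual source:

```python
from typing import Iterator


def epochs(start_epoch: int, num_epochs: int) -> Iterator[int]:
    # Yield epoch indices from start_epoch up to num_epochs - 1,
    # mirroring what an `epochs` property-iterator typically does
    # when resuming a run mid-training.
    yield from range(start_epoch, num_epochs)


# Resuming at epoch 2 of a 5-epoch run yields epochs 2, 3, 4.
remaining = list(epochs(2, 5))
```

The typical driver pattern is then an outer loop over `scheduler.epochs` with an inner loop over the scheduler itself to get batches.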