nemo_automodel.training.base_recipe

Module Contents

Classes

BaseRecipe

BaseRecipe provides checkpoint load/save functionality for recipes.

Functions

has_load_restore_state

Checks whether an object has load_state_dict and state_dict functions.

_find_latest_checkpoint

Find the latest checkpoint in the checkpoint directory and return it.

API

nemo_automodel.training.base_recipe.has_load_restore_state(object)

Checks whether an object has load_state_dict and state_dict functions.

TODO: also check the function signatures.

Parameters:

object (any) – the object to check.

Returns:

True if the object has callable load_state_dict and state_dict attributes, otherwise False.

Return type:

bool
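As a minimal sketch of the check described above (not the library's source), the function can be written with getattr and callable, assuming only presence and callability are tested, as the TODO suggests. Counter is a hypothetical stateful class used only for illustration.

```python
from typing import Any


def has_load_restore_state(obj: Any) -> bool:
    """Return True if obj exposes callable load_state_dict and state_dict."""
    # getattr with a default avoids AttributeError for objects lacking the attribute
    return all(
        callable(getattr(obj, name, None))
        for name in ("load_state_dict", "state_dict")
    )


class Counter:
    """Hypothetical stateful object used only for illustration."""

    def __init__(self) -> None:
        self.value = 0

    def state_dict(self) -> dict:
        return {"value": self.value}

    def load_state_dict(self, state: dict) -> None:
        self.value = state["value"]


print(has_load_restore_state(Counter()))  # True
print(has_load_restore_state(42))         # False
```

Because signatures are not inspected, any object with two callable attributes of those names passes, even if their arguments do not match what a checkpointing routine expects.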

class nemo_automodel.training.base_recipe.BaseRecipe

BaseRecipe provides checkpoint load/save functionality for recipes.

__setattr__(key, value)

Overridden __setattr__ that keeps track of attributes holding stateful objects.

Parameters:
  • key (str) – the attribute name.

  • value (Any) – the value to assign.

Raises:

ValueError – if an attempt is made to overwrite __state_tracked.
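The tracking behaviour can be sketched as follows. This is a hypothetical stand-in, not BaseRecipe itself: TrackingRecipe, Counter, and the attribute name _state_tracked are illustrative (a single underscore avoids Python's name mangling of __state_tracked in a standalone example), and the stateful test mirrors the one performed by has_load_restore_state.

```python
from typing import Any


class TrackingRecipe:
    """Hypothetical stand-in for BaseRecipe's attribute tracking."""

    def __init__(self) -> None:
        # Bypass the override once so the tracker itself is never recorded.
        object.__setattr__(self, "_state_tracked", set())

    def __setattr__(self, key: str, value: Any) -> None:
        # Refuse to replace the tracker after construction.
        if key == "_state_tracked":
            raise ValueError("_state_tracked cannot be overwritten")
        # Record attributes whose values can save and restore their own state.
        if callable(getattr(value, "state_dict", None)) and callable(
            getattr(value, "load_state_dict", None)
        ):
            self._state_tracked.add(key)
        super().__setattr__(key, value)


class Counter:
    """Hypothetical stateful object used only for illustration."""

    def __init__(self) -> None:
        self.value = 0

    def state_dict(self) -> dict:
        return {"value": self.value}

    def load_state_dict(self, state: dict) -> None:
        self.value = state["value"]


recipe = TrackingRecipe()
recipe.counter = Counter()  # stateful, so it is tracked
recipe.lr = 1e-3            # plain value, not tracked
print(recipe._state_tracked)  # {'counter'}
```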

save_checkpoint(epoch: int, step: int)

Save the current training state as a checkpoint.

Any attribute of the recipe whose value has both ‘load_state_dict’ and ‘state_dict’ functions is saved.

Parameters:
  • epoch (int) – The current epoch.

  • step (int) – The current step.
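A sketch of the save step, assuming the stateful attributes are available as a name-to-object mapping and using a hypothetical on-disk layout of one epoch_<epoch>_step_<step> directory per checkpoint with one JSON file per object. The real layout and serializer are not documented here; JSON only handles plain Python values, so real model and optimizer state would need a tensor-aware serializer such as torch.save. The standalone save_checkpoint function and Counter class are illustrative only.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_checkpoint(
    checkpoint_dir: str, stateful: Dict[str, Any], epoch: int, step: int
) -> str:
    """Write each object's state_dict() to <checkpoint_dir>/epoch_<e>_step_<s>/<name>.json."""
    # Hypothetical layout: one directory per (epoch, step), one file per object.
    path = os.path.join(checkpoint_dir, f"epoch_{epoch}_step_{step}")
    os.makedirs(path, exist_ok=True)
    for name, obj in stateful.items():
        with open(os.path.join(path, f"{name}.json"), "w") as f:
            json.dump(obj.state_dict(), f)
    return path


class Counter:
    """Hypothetical stateful object used only for illustration."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def state_dict(self) -> dict:
        return {"value": self.value}

    def load_state_dict(self, state: dict) -> None:
        self.value = state["value"]


with tempfile.TemporaryDirectory() as root:
    out = save_checkpoint(root, {"counter": Counter(7)}, epoch=1, step=200)
    print(os.path.basename(out), os.listdir(out))  # epoch_1_step_200 ['counter.json']
```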

load_checkpoint(restore_from: str | None = None)

Loads the latest checkpoint.
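A matching load sketch, assuming restore_from names a checkpoint directory written with a hypothetical <name>.json-per-object layout (the exact meaning of restore_from is not documented here). The fallback to the latest checkpoint when restore_from is None is not shown, and this standalone load_checkpoint function is illustrative, not the method's actual signature.

```python
import json
import os
import tempfile
from typing import Any, Dict


def load_checkpoint(restore_from: str, stateful: Dict[str, Any]) -> None:
    """Restore each object from <restore_from>/<name>.json via load_state_dict()."""
    for name, obj in stateful.items():
        with open(os.path.join(restore_from, f"{name}.json")) as f:
            obj.load_state_dict(json.load(f))


class Counter:
    """Hypothetical stateful object used only for illustration."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def state_dict(self) -> dict:
        return {"value": self.value}

    def load_state_dict(self, state: dict) -> None:
        self.value = state["value"]


with tempfile.TemporaryDirectory() as ckpt:
    # Write a checkpoint file by hand, then restore it into a fresh object.
    with open(os.path.join(ckpt, "counter.json"), "w") as f:
        json.dump({"value": 7}, f)
    counter = Counter()
    load_checkpoint(ckpt, {"counter": counter})
    print(counter.value)  # 7
```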

nemo_automodel.training.base_recipe._find_latest_checkpoint(checkpoint_dir)

Find the latest checkpoint in the checkpoint directory and return it.
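One way to pick the latest checkpoint, assuming a hypothetical epoch_<epoch>_step_<step> directory naming (the real naming scheme is not specified here): parse the numbers from each entry and keep the one with the largest (epoch, step) tuple. find_latest_checkpoint is an illustrative name, not the module's private helper.

```python
import os
import re
import tempfile
from typing import Optional

# Hypothetical naming scheme: epoch_<epoch>_step_<step>
_CKPT_RE = re.compile(r"^epoch_(\d+)_step_(\d+)$")


def find_latest_checkpoint(checkpoint_dir: str) -> Optional[str]:
    """Return the path of the checkpoint with the highest (epoch, step), or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    best, best_key = None, None
    for entry in os.listdir(checkpoint_dir):
        full = os.path.join(checkpoint_dir, entry)
        m = _CKPT_RE.match(entry)
        # Skip files and directories that do not follow the naming scheme.
        if m is None or not os.path.isdir(full):
            continue
        key = (int(m.group(1)), int(m.group(2)))
        if best_key is None or key > best_key:
            best, best_key = full, key
    return best


with tempfile.TemporaryDirectory() as root:
    for name in ("epoch_0_step_100", "epoch_1_step_50", "epoch_1_step_200", "notes"):
        os.makedirs(os.path.join(root, name))
    print(os.path.basename(find_latest_checkpoint(root)))  # epoch_1_step_200
```

Comparing parsed integers rather than names matters: as strings, epoch_1_step_50 sorts after epoch_1_step_200.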