PhysicsNeMo Launch Utils
The PhysicsNeMo Launch Utils module provides utilities that support saving and loading model checkpoints. These utilities are used internally by the LaunchLogger, but can also be used directly to save and load model checkpoints.
- physicsnemo.launch.utils.checkpoint.get_checkpoint_dir(base_dir: str, model_name: str) → str[source]
Get a checkpoint directory based on a given base directory and model name. A usage sketch follows the parameter list below.
- Parameters
base_dir (str) – Path to the base directory where checkpoints are stored
model_name (str) – Name of the model which is generating the checkpoint
- Returns
Checkpoint directory
- Return type
str
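For example, a minimal sketch of resolving a checkpoint directory. The base_dir and model_name values are illustrative, and the import path is assumed to match the module path shown above:

```python
from physicsnemo.launch.utils.checkpoint import get_checkpoint_dir

# Resolve the directory used for this model's checkpoints. The
# "./outputs" base directory and "fno_model" name are illustrative.
ckpt_dir = get_checkpoint_dir(base_dir="./outputs", model_name="fno_model")
print(ckpt_dir)
```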
- physicsnemo.launch.utils.checkpoint.load_checkpoint(path: str, models: Optional[Union[Module, List[Module]]] = None, optimizer: Optional[optimizer] = None, scheduler: Optional[scheduler] = None, scaler: Optional[scaler] = None, epoch: Optional[int] = None, metadata_dict: Optional[Dict[str, Any]] = {}, device: Union[str, device] = 'cpu') → int[source]
Checkpoint loading utility
This loader is designed to be used with the save_checkpoint utility in PhysicsNeMo Launch. Given a path, this method will try to find a checkpoint and load state dictionaries into the provided training objects. A usage sketch follows the parameter list below.
- Parameters
path (str) – Path to training checkpoint
models (Union[torch.nn.Module, List[torch.nn.Module], None], optional) – A single or list of PyTorch models, by default None
optimizer (Union[optimizer, None], optional) – Optimizer, by default None
scheduler (Union[scheduler, None], optional) – Learning rate scheduler, by default None
scaler (Union[scaler, None], optional) – AMP grad scaler, by default None
epoch (Union[int, None], optional) – Epoch checkpoint to load. If none is provided this will attempt to load the checkpoint with the largest index, by default None
metadata_dict (Optional[Dict[str, Any]], optional) – Dictionary in which metadata from the checkpoint is stored, by default {}
device (Union[str, torch.device], optional) – Target device, by default “cpu”
- Returns
Loaded epoch
- Return type
int
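As a sketch of a resume-from-checkpoint flow, with hypothetical stand-in training objects (a plain torch module stands in for a PhysicsNeMo model, and the "./checkpoints" path is illustrative):

```python
import torch
from physicsnemo.launch.utils.checkpoint import load_checkpoint

# Hypothetical training objects; any torch.nn.Module, optimizer, and
# scheduler can be passed in the same way.
model = torch.nn.Linear(32, 32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

# With epoch=None, the checkpoint with the largest index under
# "./checkpoints" is loaded; the return value is the epoch to resume from.
loaded_epoch = load_checkpoint(
    "./checkpoints",
    models=model,
    optimizer=optimizer,
    scheduler=scheduler,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

for epoch in range(loaded_epoch, 100):
    ...  # training loop body omitted
```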
- physicsnemo.launch.utils.checkpoint.save_checkpoint(path: str, models: Optional[Union[Module, List[Module]]] = None, optimizer: Optional[optimizer] = None, scheduler: Optional[scheduler] = None, scaler: Optional[scaler] = None, epoch: Optional[int] = None, metadata: Optional[Dict[str, Any]] = None) → None[source]
Training checkpoint saving utility
This will save a training checkpoint in the provided path following the file naming convention “checkpoint.{model parallel id}.{epoch/index}.mdlus”. The load_checkpoint method in PhysicsNeMo core can then be used to read this file. A usage sketch follows the parameter list below.
- Parameters
path (str) – Path to save the training checkpoint
models (Union[torch.nn.Module, List[torch.nn.Module], None], optional) – A single or list of PyTorch models, by default None
optimizer (Union[optimizer, None], optional) – Optimizer, by default None
scheduler (Union[scheduler, None], optional) – Learning rate scheduler, by default None
scaler (Union[scaler, None], optional) – AMP grad scaler. If none is provided, this will attempt to save the scaler used in static capture, by default None
epoch (Union[int, None], optional) – Epoch index to save the checkpoint at. If none is provided, this will save the checkpoint at the next valid index, by default None
metadata (Optional[Dict[str, Any]], optional) – Additional metadata to save, by default None
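A minimal end-of-epoch save, under the same assumptions as the loading sketch above (stand-in model and optimizer, illustrative path, epoch, and metadata). A checkpoint written this way can later be restored with load_checkpoint:

```python
import torch
from physicsnemo.launch.utils.checkpoint import save_checkpoint

# Stand-in training objects for illustration.
model = torch.nn.Linear(32, 32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Writes files into "./checkpoints" following the
# "checkpoint.{model parallel id}.{epoch/index}.mdlus" naming convention.
save_checkpoint(
    "./checkpoints",
    models=model,
    optimizer=optimizer,
    epoch=5,
    metadata={"note": "illustrative metadata"},
)
```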