PhysicsNeMo Launch Utils#
The PhysicsNeMo Launch Utils module provides utilities that support saving and loading model checkpoints. These utilities are used internally by the LaunchLogger, but can also be used directly to save and load model checkpoints.
Checkpointing#
- physicsnemo.launch.utils.checkpoint.get_checkpoint_dir(base_dir: str, model_name: str) → str[source]#
Get a checkpoint directory based on a given base directory and model name
- Parameters:
base_dir (str) – Path to the base directory where checkpoints are stored
model_name (str) – Name of the model generating the checkpoint
- Returns:
Checkpoint directory
- Return type:
str
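The exact directory layout is an implementation detail of the library; a minimal stdlib-only sketch of what such a helper might look like (hypothetical function body, not PhysicsNeMo's actual code):

```python
import os


def get_checkpoint_dir(base_dir: str, model_name: str) -> str:
    # Hypothetical sketch: join the base directory with a
    # model-specific checkpoint folder. The real utility may use a
    # different naming convention.
    return os.path.join(base_dir, f"checkpoints_{model_name}")


print(get_checkpoint_dir("/outputs", "fno"))  # /outputs/checkpoints_fno
```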
- physicsnemo.launch.utils.checkpoint.load_checkpoint(
- path: str,
- models: Module | List[Module] | None = None,
- optimizer: optimizer | None = None,
- scheduler: scheduler | None = None,
- scaler: scaler | None = None,
- epoch: int | None = None,
- metadata_dict: Dict[str, Any] | None = {},
- device: str | device = 'cpu',
- ) → int[source]#
Checkpoint loading utility
This loader is designed to be used with the save checkpoint utility in PhysicsNeMo Launch. Given a path, this method will try to find a checkpoint and load state dictionaries into the provided training objects.
- Parameters:
path (str) – Path to training checkpoint
models (Union[torch.nn.Module, List[torch.nn.Module], None], optional) – A single or list of PyTorch models, by default None
optimizer (Union[optimizer, None], optional) – Optimizer, by default None
scheduler (Union[scheduler, None], optional) – Learning rate scheduler, by default None
scaler (Union[scaler, None], optional) – AMP grad scaler, by default None
epoch (Union[int, None], optional) – Epoch checkpoint to load. If none is provided, this will attempt to load the checkpoint with the largest epoch index, by default None
metadata_dict (Optional[Dict[str, Any]], optional) – Dictionary to store metadata from the checkpoint, by default an empty dict
device (Union[str, torch.device], optional) – Target device, by default “cpu”
- Returns:
Loaded epoch
- Return type:
int
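When `epoch` is None, the loader falls back to the checkpoint with the largest epoch index. A stdlib-only sketch of that selection logic, matching the "checkpoint.{model_parallel_rank}.{epoch}.pt" naming scheme described under save_checkpoint (hypothetical helper; the real loader also restores the state dictionaries):

```python
import re


def latest_epoch(filenames):
    # Parse "checkpoint.{model_parallel_rank}.{epoch}.pt" filenames and
    # return the largest epoch index found, or None if no checkpoint
    # files are present.
    pattern = re.compile(r"^checkpoint\.(\d+)\.(\d+)\.pt$")
    epochs = [int(m.group(2)) for f in filenames if (m := pattern.match(f))]
    return max(epochs) if epochs else None


files = ["checkpoint.0.1.pt", "checkpoint.0.5.pt", "checkpoint.0.3.pt"]
print(latest_epoch(files))  # 5
```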
- physicsnemo.launch.utils.checkpoint.save_checkpoint(
- path: str,
- models: Module | List[Module] | None = None,
- optimizer: optimizer | None = None,
- scheduler: scheduler | None = None,
- scaler: scaler | None = None,
- epoch: int | None = None,
- metadata: Dict[str, Any] | None = None,
- ) → None[source]#
Training checkpoint saving utility.
This function saves training checkpoints to the provided path. Multiple files may be created depending on what is being saved:
- Model checkpoints (when models are provided): "{model_name}{model_id}.{model_parallel_rank}.{epoch}.{ext}", where ext is ".mdlus" for instances of Module or ".pt" for PyTorch models.
- Training state (when optimizer/scheduler/scaler are provided): "checkpoint.{model_parallel_rank}.{epoch}.pt"
For PhysicsNeMo models, the {model_name} is derived from the model's metadata through model.meta.name; if the model has no metadata, the model's class name model.__class__.__name__ is used. For PyTorch models, the model_name is always derived from the model's class name __class__.__name__. If multiple models share the same {model_name}, they are indexed by {model_id} (e.g., "MyModel0", "MyModel1").
The function load_checkpoint() can be used to restore from these files with models that are already instantiated. To load only the model checkpoint (even when the models are not already instantiated), use the method from_checkpoint() to instantiate and load the model from the checkpoint.
- Parameters:
path (str) – Path to save the training checkpoint
models (Union[torch.nn.Module, List[torch.nn.Module], None], optional) – A single or list of PyTorch models, by default None
optimizer (Union[optimizer, None], optional) – Optimizer, by default None
scheduler (Union[scheduler, None], optional) – Learning rate scheduler, by default None
scaler (Union[scaler, None], optional) – AMP grad scaler. If none is provided, this will attempt to save the scaler used in static capture, by default None
epoch (Union[int, None], optional) – Epoch index to save. If none is provided, this will save the checkpoint at the next valid index, by default None
metadata (Optional[Dict[str, Any]], optional) – Additional metadata to save, by default None
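The file-naming scheme described above can be sketched as a small formatter (hypothetical helper that mirrors the documented pattern, not the library's internals):

```python
def checkpoint_filename(model_name, model_id, model_parallel_rank, epoch,
                        is_physicsnemo_module):
    # Build "{model_name}{model_id}.{model_parallel_rank}.{epoch}.{ext}":
    # ".mdlus" for PhysicsNeMo Module instances, ".pt" for plain
    # PyTorch models.
    ext = "mdlus" if is_physicsnemo_module else "pt"
    return f"{model_name}{model_id}.{model_parallel_rank}.{epoch}.{ext}"


# Two models sharing the name "MyModel" are disambiguated by model_id.
print(checkpoint_filename("MyModel", 0, 0, 10, True))   # MyModel0.0.10.mdlus
print(checkpoint_filename("MyModel", 1, 0, 10, False))  # MyModel1.0.10.pt
```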