PhysicsNeMo Launch Utils#

The PhysicsNeMo Launch Utils module provides utilities that support saving and loading model checkpoints. These utilities are used internally by the LaunchLogger, but can also be used directly to save and load model checkpoints.

Checkpointing#

physicsnemo.launch.utils.checkpoint.get_checkpoint_dir(base_dir: str, model_name: str) → str[source]#

Get a checkpoint directory based on a given base directory and model name

Parameters:
  • base_dir (str) – Path to the base directory where checkpoints are stored

  • model_name (str) – Name of the model that is generating the checkpoint

Returns:

Checkpoint directory

Return type:

str
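
A minimal usage sketch; the base directory and model name below are illustrative values, not defaults of the API:

    from physicsnemo.launch.utils.checkpoint import get_checkpoint_dir

    # Build a model-specific checkpoint directory under a common base directory.
    # "./outputs" and "fno" are example values chosen for this sketch.
    ckpt_dir = get_checkpoint_dir(base_dir="./outputs", model_name="fno")
    print(ckpt_dir)  # a subdirectory of ./outputs; the exact layout is set by the utility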

physicsnemo.launch.utils.checkpoint.load_checkpoint(
path: str,
models: Module | List[Module] | None = None,
optimizer: optimizer | None = None,
scheduler: scheduler | None = None,
scaler: scaler | None = None,
epoch: int | None = None,
metadata_dict: Dict[str, Any] | None = {},
device: str | device = 'cpu',
) → int[source]#

Checkpoint loading utility

This loader is designed to be used with the save_checkpoint() utility in PhysicsNeMo Launch. Given a path, this method will try to find a checkpoint and load state dictionaries into the provided training objects.

Parameters:
  • path (str) – Path to training checkpoint

  • models (Union[torch.nn.Module, List[torch.nn.Module], None], optional) – A single or list of PyTorch models, by default None

  • optimizer (Union[optimizer, None], optional) – Optimizer, by default None

  • scheduler (Union[scheduler, None], optional) – Learning rate scheduler, by default None

  • scaler (Union[scaler, None], optional) – AMP grad scaler, by default None

  • epoch (Union[int, None], optional) – Epoch checkpoint to load. If None is provided, this will attempt to load the checkpoint with the largest index, by default None

  • metadata_dict (Optional[Dict[str, Any]], optional) – Dictionary that is populated in place with metadata loaded from the checkpoint, by default {}

  • device (Union[str, torch.device], optional) – Target device, by default “cpu”

Returns:

Loaded epoch

Return type:

int
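
A minimal sketch of restoring training state into already-instantiated objects; the toy model, optimizer, and checkpoint path are illustrative placeholders:

    import torch

    from physicsnemo.launch.utils.checkpoint import load_checkpoint

    # Stand-in training objects; in practice these are your real model and optimizer.
    model = torch.nn.Linear(32, 32)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # With epoch=None, the checkpoint with the largest index found under the path
    # is loaded, and the state of each provided object is restored in place.
    loaded_epoch = load_checkpoint(
        "./outputs/checkpoints",
        models=model,
        optimizer=optimizer,
        device="cuda" if torch.cuda.is_available() else "cpu",
    )
    start_epoch = loaded_epoch + 1  # resume training from the next epoch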

physicsnemo.launch.utils.checkpoint.save_checkpoint(
path: str,
models: Module | List[Module] | None = None,
optimizer: optimizer | None = None,
scheduler: scheduler | None = None,
scaler: scaler | None = None,
epoch: int | None = None,
metadata: Dict[str, Any] | None = None,
) → None[source]#

Training checkpoint saving utility.

This function saves training checkpoints to the provided path. Multiple files may be created depending on what is being saved:

  • Model checkpoints (when models are provided): “{model_name}{model_id}.{model_parallel_rank}.{epoch}.{ext}” where {ext} is “.mdlus” for instances of the PhysicsNeMo Module class or “.pt” for standard PyTorch models.

  • Training state (when optimizer/scheduler/scaler are provided): “checkpoint.{model_parallel_rank}.{epoch}.pt”

For PhysicsNeMo models, {model_name} is derived from the model’s metadata through model.meta.name; if the model has no metadata, the model’s class name model.__class__.__name__ is used. For PyTorch models, {model_name} is always derived from the model’s class name __class__.__name__. If multiple models share the same {model_name}, they are indexed by {model_id} (e.g., “MyModel0”, “MyModel1”).

The function load_checkpoint() can be used to restore from these files with models that are already instantiated. To load only the model checkpoint (even when the models are not already instantiated), use the method from_checkpoint() to instantiate and load the model from the checkpoint.

Parameters:
  • path (str) – Path to save the training checkpoint

  • models (Union[torch.nn.Module, List[torch.nn.Module], None], optional) – A single or list of PyTorch models, by default None

  • optimizer (Union[optimizer, None], optional) – Optimizer, by default None

  • scheduler (Union[scheduler, None], optional) – Learning rate scheduler, by default None

  • scaler (Union[scaler, None], optional) – AMP grad scaler. If none is provided, this will attempt to save the scaler used by static capture, by default None

  • epoch (Union[int, None], optional) – Epoch index under which to save the checkpoint. If None, this will save the checkpoint at the next valid index, by default None

  • metadata (Optional[Dict[str, Any]], optional) – Additional metadata to save, by default None
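
A minimal sketch of periodic checkpointing inside a training loop; the toy model, data, and the “./outputs/checkpoints” path are illustrative stand-ins for a real training setup:

    import torch

    from physicsnemo.launch.utils.checkpoint import save_checkpoint

    model = torch.nn.Linear(32, 32)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

    for epoch in range(1, 6):
        x = torch.randn(8, 32)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        # Writes the model file ("{model_name}{model_id}.{model_parallel_rank}.{epoch}.pt"
        # here, since this is a plain PyTorch model) plus the training-state file
        # "checkpoint.{model_parallel_rank}.{epoch}.pt" under the given path.
        save_checkpoint(
            "./outputs/checkpoints",
            models=model,
            optimizer=optimizer,
            scheduler=scheduler,
            epoch=epoch,
            metadata={"loss": loss.item()},
        )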