PhysicsNeMo Launch Utils
The PhysicsNeMo Launch Utils module provides utilities that support saving and loading model checkpoints. These utilities are used internally by the LaunchLogger, but can also be used directly to save and load model checkpoints.
- physicsnemo.launch.utils.checkpoint.get_checkpoint_dir(base_dir: str, model_name: str) → str[source]
Get a checkpoint directory based on a given base directory and model name. A usage sketch follows the parameter list below.
- Parameters
base_dir (str) – Path to the base directory where checkpoints are stored
model_name (str) – Name of the model which is generating the checkpoint
- Returns
Checkpoint directory
- Return type
str
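For example, a minimal sketch of resolving a checkpoint directory. The base_dir and model_name values are illustrative, and the import path is assumed to match the module path shown above:

```python
from physicsnemo.launch.utils.checkpoint import get_checkpoint_dir

# Resolve the directory used for this model's checkpoints. The
# "./outputs" base directory and "fno_model" name are illustrative.
ckpt_dir = get_checkpoint_dir(base_dir="./outputs", model_name="fno_model")
print(ckpt_dir)
```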
- physicsnemo.launch.utils.checkpoint.load_checkpoint(path: str, models: Optional[Union[Module, List[Module]]] = None, optimizer: Optional[optimizer] = None, scheduler: Optional[scheduler] = None, scaler: Optional[scaler] = None, epoch: Optional[int] = None, metadata_dict: Optional[Dict[str, Any]] = {}, device: Union[str, device] = 'cpu') → int[source]
Checkpoint loading utility
This loader is designed to be used with the save_checkpoint utility in PhysicsNeMo Launch. Given a path, this method will try to find a checkpoint and load state dictionaries into the provided training objects. A usage sketch follows the parameter list below.
- Parameters
path (str) – Path to training checkpoint
models (Union[torch.nn.Module, List[torch.nn.Module], None], optional) – A single or list of PyTorch models, by default None
optimizer (Union[optimizer, None], optional) – Optimizer, by default None
scheduler (Union[scheduler, None], optional) – Learning rate scheduler, by default None
scaler (Union[scaler, None], optional) – AMP grad scaler, by default None
epoch (Union[int, None], optional) – Epoch checkpoint to load. If none is provided this will attempt to load the checkpoint with the largest index, by default None
metadata_dict (Optional[Dict[str, Any]], optional) – Dictionary in which metadata from the checkpoint is stored, by default {}
device (Union[str, torch.device], optional) – Target device, by default “cpu”
- Returns
Loaded epoch
- Return type
int
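As a sketch of a resume-from-checkpoint flow, with hypothetical stand-in training objects (a plain torch module stands in for a PhysicsNeMo model, and the "./checkpoints" path is illustrative):

```python
import torch
from physicsnemo.launch.utils.checkpoint import load_checkpoint

# Hypothetical training objects; any torch.nn.Module, optimizer, and
# scheduler can be passed in the same way.
model = torch.nn.Linear(32, 32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

# With epoch=None, the checkpoint with the largest index under
# "./checkpoints" is loaded; the return value is the epoch to resume from.
loaded_epoch = load_checkpoint(
    "./checkpoints",
    models=model,
    optimizer=optimizer,
    scheduler=scheduler,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

for epoch in range(loaded_epoch, 100):
    ...  # training loop body omitted
```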
- physicsnemo.launch.utils.checkpoint.save_checkpoint(path: str, models: Optional[Union[Module, List[Module]]] = None, optimizer: Optional[optimizer] = None, scheduler: Optional[scheduler] = None, scaler: Optional[scaler] = None, epoch: Optional[int] = None, metadata: Optional[Dict[str, Any]] = None) → None[source]
Training checkpoint saving utility
This will save a training checkpoint in the provided path following the file naming convention “checkpoint.{model parallel id}.{epoch/index}.mdlus”. The load_checkpoint method in PhysicsNeMo core can then be used to read this file. A usage sketch follows the parameter list below.
- Parameters
path (str) – Path to save the training checkpoint
models (Union[torch.nn.Module, List[torch.nn.Module], None], optional) – A single or list of PyTorch models, by default None
optimizer (Union[optimizer, None], optional) – Optimizer, by default None
scheduler (Union[scheduler, None], optional) – Learning rate scheduler, by default None
scaler (Union[scaler, None], optional) – AMP grad scaler. If none is provided, this will attempt to save the scaler used in static capture, by default None
epoch (Union[int, None], optional) – Epoch index to save the checkpoint at. If none is provided, this will save the checkpoint at the next valid index, by default None
metadata (Optional[Dict[str, Any]], optional) – Additional metadata to save, by default None
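A minimal end-of-epoch save, under the same assumptions as the loading sketch above (stand-in model and optimizer, illustrative path, epoch, and metadata). A checkpoint written this way can later be restored with load_checkpoint:

```python
import torch
from physicsnemo.launch.utils.checkpoint import save_checkpoint

# Stand-in training objects for illustration.
model = torch.nn.Linear(32, 32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Writes files into "./checkpoints" following the
# "checkpoint.{model parallel id}.{epoch/index}.mdlus" naming convention.
save_checkpoint(
    "./checkpoints",
    models=model,
    optimizer=optimizer,
    epoch=5,
    metadata={"note": "illustrative metadata"},
)
```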