bridge.training.utils.checkpoint_utils
Module Contents
Functions
file_exists | Check if a file exists.
ensure_directory_exists | Ensure that the directory for a given filename exists.
get_checkpoint_name | Determine the directory name for a specific checkpoint.
get_checkpoint_train_state_filename | Get the filename for the train state tracker file.
get_checkpoint_run_config_filename | Get the filename for the run configuration file within a checkpoint directory.
get_checkpoint_tracker_filename | Get the filename of the tracker file, which records the latest checkpoint so training can restart from it.
checkpoint_exists | Check if a checkpoint directory exists.
read_run_config | Read the run configuration from a YAML file (rank 0 only).
read_train_state | Read the train state metadata from a YAML file (rank 0 only).
Data
API
- bridge.training.utils.checkpoint_utils.TRAIN_STATE_FILE
‘train_state.pt’
- bridge.training.utils.checkpoint_utils.TRACKER_PREFIX
‘latest’
- bridge.training.utils.checkpoint_utils.CONFIG_FILE
‘run_config.yaml’
- bridge.training.utils.checkpoint_utils.logger
‘getLogger(…)’
- bridge.training.utils.checkpoint_utils.file_exists(path: str) -> bool
Check if a file exists.
- Parameters:
path – The path to the file. Can be a local path or an MSC URL.
- Returns:
True if the file exists, False otherwise.
- bridge.training.utils.checkpoint_utils.ensure_directory_exists(filename: str, check_parent: bool = True)
Ensure that the directory for a given filename exists.
- Parameters:
filename – The path whose directory should be checked/created.
check_parent – If True (default), checks the parent directory of the filename. If False, treats the filename itself as the directory path.
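The difference between the two `check_parent` modes can be illustrated with a minimal local-filesystem sketch. The helper name `ensure_directory_exists_sketch` is hypothetical and uses only `os.makedirs`. It is not the library's implementation, which may also handle non-local paths.

```python
import os
import tempfile


def ensure_directory_exists_sketch(filename: str, check_parent: bool = True) -> None:
    """Create the directory for `filename` if it is missing (local paths only)."""
    # check_parent=True: the target is the directory that contains the file.
    # check_parent=False: the argument itself is treated as the directory.
    dirname = os.path.dirname(filename) if check_parent else filename
    if dirname:  # empty for a bare filename such as "weights.pt"
        os.makedirs(dirname, exist_ok=True)


base = tempfile.mkdtemp()

# Default mode: only the parent directory is created, not the file itself.
file_path = os.path.join(base, "ckpt", "weights.pt")
ensure_directory_exists_sketch(file_path)

# check_parent=False: the path itself becomes a directory.
dir_path = os.path.join(base, "logs")
ensure_directory_exists_sketch(dir_path, check_parent=False)
```

Passing `exist_ok=True` makes the call idempotent, so repeated calls on the same path do not raise.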
- bridge.training.utils.checkpoint_utils.get_checkpoint_name(checkpoints_path: str, iteration: int, release: bool = False)
Determine the directory name for a specific checkpoint.
Constructs the path based on iteration number or release flag.
- Parameters:
checkpoints_path – Base directory where checkpoints are stored.
iteration – The training iteration number.
release – If True, uses ‘release’ as the directory name instead of iteration.
- Returns:
The full path to the checkpoint directory.
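As a rough illustration of how such a path could be built, the sketch below assumes the Megatron-LM style directory name `iter_` followed by a seven-digit, zero-padded iteration number. That naming pattern is an assumption and is not stated on this page. `checkpoint_name_sketch` is a hypothetical stand-in.

```python
import os


def checkpoint_name_sketch(checkpoints_path: str, iteration: int, release: bool = False) -> str:
    """Build a checkpoint directory path from an iteration number or the release flag."""
    # ASSUMPTION: iteration directories are named "iter_" + 7-digit zero-padded number.
    directory = "release" if release else f"iter_{iteration:07d}"
    return os.path.join(checkpoints_path, directory)


iteration_dir = checkpoint_name_sketch("/ckpts", 100)
release_dir = checkpoint_name_sketch("/ckpts", 100, release=True)  # iteration is ignored
```

Zero-padding keeps directory listings sorted in training order when sorted lexicographically.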
- bridge.training.utils.checkpoint_utils.get_checkpoint_train_state_filename(checkpoints_path: str, prefix: Optional[str] = None)
Get the filename for the train state tracker file.
This file typically stores metadata about the latest checkpoint, like the iteration number.
- Parameters:
checkpoints_path – Base directory where checkpoints are stored.
prefix – Optional prefix (e.g., ‘latest’) to prepend to the filename.
- Returns:
The full path to the train state tracker file.
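The effect of the optional prefix can be sketched as follows. The constants mirror the values listed for `TRAIN_STATE_FILE` and `TRACKER_PREFIX` on this page. Joining prefix and filename with an underscore is an assumption, and `train_state_filename_sketch` is a hypothetical name.

```python
import os
from typing import Optional

TRAIN_STATE_FILE = "train_state.pt"  # value documented on this page
TRACKER_PREFIX = "latest"            # value documented on this page


def train_state_filename_sketch(checkpoints_path: str, prefix: Optional[str] = None) -> str:
    """Return the train state file path, optionally with a prefix on the filename."""
    # ASSUMPTION: prefix and filename are joined with "_", e.g. "latest_train_state.pt".
    name = f"{prefix}_{TRAIN_STATE_FILE}" if prefix else TRAIN_STATE_FILE
    return os.path.join(checkpoints_path, name)


plain = train_state_filename_sketch("/ckpts/iter_0000100")
latest = train_state_filename_sketch("/ckpts", prefix=TRACKER_PREFIX)
```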
- bridge.training.utils.checkpoint_utils.get_checkpoint_run_config_filename(checkpoints_path: str) -> str
Get the filename for the run configuration file within a checkpoint directory.
- Parameters:
checkpoints_path – Base directory where checkpoints are stored.
- Returns:
The full path to the run configuration file (e.g., run_config.yaml).
- bridge.training.utils.checkpoint_utils.get_checkpoint_tracker_filename(checkpoints_path: str) -> str
Get the filename of the tracker file, which records the latest checkpoint so training can restart from it.
Supports checkpoints produced by Megatron-LM.
- Parameters:
checkpoints_path – Base directory where checkpoints are stored.
- Returns:
The full path to the checkpoint tracker file (e.g., latest_checkpointed_iteration.txt).
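To show how a tracker file of this kind might be consumed when resuming, here is a minimal sketch. It assumes the Megatron-LM convention that the file holds either an integer iteration or the literal string `release`. The helper `read_tracker_sketch` is hypothetical and not part of this module.

```python
import os
import tempfile
from typing import Tuple


def read_tracker_sketch(tracker_filename: str) -> Tuple[int, bool]:
    """Return (iteration, release) parsed from a Megatron-LM style tracker file."""
    with open(tracker_filename, "r") as f:
        text = f.read().strip()
    # ASSUMPTION: the file contains either "release" or a plain integer.
    if text == "release":
        return 0, True
    try:
        return int(text), False
    except ValueError:
        raise ValueError(
            f"Unexpected tracker contents {text!r} in {tracker_filename}"
        ) from None


# Write a tracker file the way a training run might, then read it back.
ckpt_root = tempfile.mkdtemp()
tracker_path = os.path.join(ckpt_root, "latest_checkpointed_iteration.txt")
with open(tracker_path, "w") as f:
    f.write("1500\n")

iteration, release = read_tracker_sketch(tracker_path)
```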
- bridge.training.utils.checkpoint_utils.checkpoint_exists(checkpoints_path: Optional[str]) -> bool
Check if a checkpoint directory exists.
- Parameters:
checkpoints_path – Path to the potential checkpoint directory.
- Returns:
True if the path exists, False otherwise.
- bridge.training.utils.checkpoint_utils.read_run_config(run_config_filename: str) -> dict[str, Any]
Read the run configuration from a YAML file (rank 0 only).
Reads the file on rank 0 and broadcasts the result to other ranks.
- Parameters:
run_config_filename – Path to the run config YAML file.
- Returns:
A dictionary containing the run configuration.
- Raises:
RuntimeError – If reading the config file fails on rank 0.
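The read-on-rank-0-then-broadcast pattern described above can be sketched without any distributed backend by passing the loader and the broadcast step in as callables. Everything in this block (`read_on_rank0_sketch`, `fake_load`, `identity_broadcast`) is hypothetical scaffolding for illustration. Whether the library reports the failure on every rank, as this sketch does, is an assumption.

```python
from typing import Any, Callable, List


def read_on_rank0_sketch(
    path: str,
    rank: int,
    load: Callable[[str], Any],
    broadcast: Callable[[List[Any]], List[Any]],
) -> Any:
    """Load `path` on rank 0 only, then share the result with every rank."""
    payload: List[Any] = [None, None]  # [result, error message]
    if rank == 0:
        try:
            payload[0] = load(path)
        except Exception as exc:
            # Record the failure so it can be shared rather than leaving
            # the other ranks waiting on a broadcast that never arrives.
            payload[1] = f"Failed to read {path}: {exc}"
    payload = broadcast(payload)
    if payload[1] is not None:
        raise RuntimeError(payload[1])
    return payload[0]


# Single-process stand-ins so the sketch runs anywhere.
def fake_load(path: str) -> dict:
    return {"train": {"train_iters": 1000}}


def identity_broadcast(payload: List[Any]) -> List[Any]:
    return payload


config = read_on_rank0_sketch("run_config.yaml", 0, fake_load, identity_broadcast)
```

In a real multi-process job, `load` could wrap `yaml.safe_load` and `broadcast` could wrap `torch.distributed.broadcast_object_list(payload, src=0)`, which fills the list in place on the receiving ranks. Reading on a single rank avoids every process hitting shared storage at once.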
- bridge.training.utils.checkpoint_utils.read_train_state(train_state_filename: str)
Read the train state metadata from a YAML file (rank 0 only).
Reads the file on rank 0 and broadcasts the result to other ranks.
- Parameters:
train_state_filename – Path to the train state YAML file.
- Returns:
An initialized TrainState object.