bridge.training.utils.checkpoint_utils#
Module Contents#
Functions#
file_exists | Check if a file exists.
ensure_directory_exists | Ensure that the directory for a given filename exists.
get_checkpoint_name | Determine the directory name for a specific checkpoint.
get_checkpoint_train_state_filename | Get the filename for the train state tracker file.
get_checkpoint_run_config_filename | Get the filename for the run configuration file within a checkpoint directory.
get_checkpoint_tracker_filename | The tracker file records the latest checkpoint during training to restart from.
checkpoint_exists | Check if a checkpoint directory exists.
get_hf_model_id_from_checkpoint | Infer the HuggingFace model identifier recorded in a Megatron Bridge checkpoint.
read_run_config | Read the run configuration from a YAML file (rank 0 only).
read_train_state | Read the train state metadata from a YAML file (rank 0 only).
_sanitize_run_config_object | Remove runtime-only objects from run config dictionaries.
Data#
API#
- bridge.training.utils.checkpoint_utils.TRAIN_STATE_FILE#
‘train_state.pt’
- bridge.training.utils.checkpoint_utils.TRACKER_PREFIX#
‘latest’
- bridge.training.utils.checkpoint_utils.CONFIG_FILE#
‘run_config.yaml’
- bridge.training.utils.checkpoint_utils.logger#
‘getLogger(…)’
- bridge.training.utils.checkpoint_utils._RUNTIME_ONLY_TARGETS#
‘frozenset(…)’
- bridge.training.utils.checkpoint_utils.file_exists(path: str) bool#
Check if a file exists.
- Parameters:
path – The path to the file. Can be a local path or an MSC URL.
- Returns:
True if the file exists, False otherwise.
- bridge.training.utils.checkpoint_utils.ensure_directory_exists(filename: str, check_parent: bool = True)#
Ensure that the directory for a given filename exists.
- Parameters:
filename – The path whose directory should be checked/created.
check_parent – If True (default), checks the parent directory of the filename. If False, treats the filename itself as the directory path.
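The effect of `check_parent` can be illustrated with a minimal local-filesystem sketch. `ensure_dir_sketch` is a hypothetical name, and the use of `os.makedirs` is an assumption; the real function may also handle MSC URLs, which this sketch does not.

```python
import os


def ensure_dir_sketch(filename: str, check_parent: bool = True) -> None:
    """Hypothetical local-only stand-in for ensure_directory_exists."""
    # With check_parent=True the directory of interest is the parent of
    # `filename`; otherwise `filename` itself is treated as the directory.
    dirname = os.path.dirname(filename) if check_parent else filename
    if dirname:  # empty for a bare filename such as "model.pt"
        os.makedirs(dirname, exist_ok=True)
```

Calling it with a file path creates only the parent directories and never the file itself.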
- bridge.training.utils.checkpoint_utils.get_checkpoint_name(checkpoints_path: str, iteration: int, release: bool = False)#
Determine the directory name for a specific checkpoint.
Constructs the path based on iteration number or release flag.
- Parameters:
checkpoints_path – Base directory where checkpoints are stored.
iteration – The training iteration number.
release – If True, uses ‘release’ as the directory name instead of iteration.
- Returns:
The full path to the checkpoint directory.
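The path construction described above might look like the following sketch. The `iter_` prefix with a zero-padded, seven-digit iteration number follows the Megatron-LM convention and is an assumption here, as is the name `checkpoint_name_sketch`.

```python
import os


def checkpoint_name_sketch(checkpoints_path: str, iteration: int, release: bool = False) -> str:
    """Hypothetical illustration of get_checkpoint_name."""
    # ASSUMPTION: iteration directories are named "iter_" plus a zero-padded
    # 7-digit iteration number (Megatron-LM convention).
    directory = "release" if release else f"iter_{iteration:07d}"
    return os.path.join(checkpoints_path, directory)
```

Under that assumption, iteration 1500 maps to a directory named `iter_0001500`, while `release=True` ignores the iteration number entirely.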
- bridge.training.utils.checkpoint_utils.get_checkpoint_train_state_filename(checkpoints_path: str, prefix: Optional[str] = None)#
Get the filename for the train state tracker file.
This file typically stores metadata about the latest checkpoint, like the iteration number.
- Parameters:
checkpoints_path – Base directory where checkpoints are stored.
prefix – Optional prefix (e.g., ‘latest’) to prepend to the filename.
- Returns:
The full path to the train state tracker file.
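As a rough illustration of how the optional prefix could combine with `TRAIN_STATE_FILE`, consider the sketch below. Joining the prefix with an underscore is an assumption, not confirmed by this page, and `train_state_filename_sketch` is a hypothetical name.

```python
import os
from typing import Optional

TRAIN_STATE_FILE = "train_state.pt"  # value listed for TRAIN_STATE_FILE on this page


def train_state_filename_sketch(checkpoints_path: str, prefix: Optional[str] = None) -> str:
    """Hypothetical illustration of get_checkpoint_train_state_filename."""
    # ASSUMPTION: the prefix and the base filename are joined with "_".
    name = TRAIN_STATE_FILE if prefix is None else f"{prefix}_{TRAIN_STATE_FILE}"
    return os.path.join(checkpoints_path, name)
```

With `prefix="latest"` (the value of `TRACKER_PREFIX`), this sketch yields a file named `latest_train_state.pt` in the base directory.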
- bridge.training.utils.checkpoint_utils.get_checkpoint_run_config_filename(checkpoints_path: str) str#
Get the filename for the run configuration file within a checkpoint directory.
- Parameters:
checkpoints_path – Base directory where checkpoints are stored.
- Returns:
The full path to the run configuration file (e.g., run_config.yaml).
- bridge.training.utils.checkpoint_utils.get_checkpoint_tracker_filename(checkpoints_path: str) str#
The tracker file records the latest checkpoint written during training, from which a run can be restarted.
Supports checkpoints produced by Megatron-LM.
- Parameters:
checkpoints_path – Base directory where checkpoints are stored.
- Returns:
The full path to the checkpoint tracker file (e.g., latest_checkpointed_iteration.txt).
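Once the tracker filename is known, its contents can be parsed. The sketch below assumes the Megatron-LM convention that the file holds either an integer iteration number or the literal string `release`; `read_tracker_sketch` is a hypothetical helper and not part of this module.

```python
def read_tracker_sketch(tracker_filename: str) -> tuple[int, bool]:
    """Return (iteration, release) parsed from a Megatron-LM style tracker file."""
    with open(tracker_filename, "r") as f:
        content = f.read().strip()
    # ASSUMPTION: the file contains either "release" or an integer iteration.
    if content == "release":
        return 0, True
    return int(content), False
```

The returned pair could then be passed to `get_checkpoint_name` to locate the checkpoint directory to resume from.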
- bridge.training.utils.checkpoint_utils.checkpoint_exists(checkpoints_path: Optional[str]) bool#
Check if a checkpoint directory exists.
- Parameters:
checkpoints_path – Path to the potential checkpoint directory.
- Returns:
True if the path exists, False otherwise.
- bridge.training.utils.checkpoint_utils.get_hf_model_id_from_checkpoint(path: str | os.PathLike[str])#
Infer the HuggingFace model identifier recorded in a Megatron Bridge checkpoint.
- Parameters:
path – Path to a Megatron checkpoint directory. This can be either the root checkpoint directory containing iter_* subdirectories or a specific iteration directory.
- Returns:
The HuggingFace model identifier/path if present, otherwise None.
- Raises:
FileNotFoundError – If the provided path does not exist.
NotADirectoryError – If the provided path is not a directory.
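The path handling described above (root directory versus a specific iteration directory, plus the two documented exceptions) can be sketched as follows. Choosing the highest-numbered `iter_*` subdirectory is an assumption, and `resolve_iteration_dir_sketch` is a hypothetical helper. Where the model identifier is stored inside the checkpoint is not stated on this page, so the sketch stops at directory resolution.

```python
import os
import re
from pathlib import Path


def resolve_iteration_dir_sketch(path: "str | os.PathLike[str]") -> Path:
    """Hypothetical: map a root or iteration path to one iteration directory."""
    root = Path(path)
    if not root.exists():
        raise FileNotFoundError(f"Checkpoint path does not exist: {root}")
    if not root.is_dir():
        raise NotADirectoryError(f"Checkpoint path is not a directory: {root}")

    # Collect (iteration, directory) pairs for children named iter_<digits>.
    candidates = []
    for child in root.iterdir():
        match = re.fullmatch(r"iter_(\d+)", child.name)
        if match and child.is_dir():
            candidates.append((int(match.group(1)), child))

    # ASSUMPTION: when several iterations exist, the latest one is used.
    if candidates:
        return max(candidates, key=lambda pair: pair[0])[1]
    # No iter_* children: treat `path` as a specific iteration directory.
    return root
```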
- bridge.training.utils.checkpoint_utils.read_run_config(run_config_filename: str) dict[str, Any]#
Read the run configuration from a YAML file (rank 0 only).
Reads the file on rank 0 and broadcasts the result to other ranks.
- Parameters:
run_config_filename – Path to the run config YAML file.
- Returns:
A dictionary containing the run configuration.
- Raises:
RuntimeError – If reading the config file fails on rank 0.
- bridge.training.utils.checkpoint_utils.read_train_state(train_state_filename: str)#
Read the train state metadata from a YAML file (rank 0 only).
Reads the file on rank 0 and broadcasts the result to other ranks if torch.distributed is initialized. Otherwise, loads the file locally.
- Parameters:
train_state_filename – Path to the train state YAML file.
- Returns:
An initialized TrainState object.
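The read-on-rank-0-then-broadcast pattern, with the local fallback described above, might be sketched like this. `load_on_rank0_and_broadcast` and its `load_fn` parameter are hypothetical and only illustrate the control flow; using `torch.distributed.broadcast_object_list` is an assumption about how the result could be shared, and the actual file parsing is left to the caller.

```python
from typing import Any, Callable


def load_on_rank0_and_broadcast(path: str, load_fn: Callable[[str], Any]) -> Any:
    """Hypothetical helper: load on rank 0 and share the result with all ranks."""
    try:
        import torch.distributed as dist
    except ImportError:  # torch not installed: behave as a single process
        dist = None

    # Fallback: no initialized process group, so every caller loads locally.
    if dist is None or not dist.is_available() or not dist.is_initialized():
        return load_fn(path)

    # Only rank 0 touches the file; other ranks contribute a placeholder.
    holder = [load_fn(path) if dist.get_rank() == 0 else None]
    # broadcast_object_list pickles the object on src and fills `holder` elsewhere.
    dist.broadcast_object_list(holder, src=0)
    return holder[0]
```

Reading on a single rank avoids having every process hit shared storage at once, at the cost of one collective call.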
- bridge.training.utils.checkpoint_utils._sanitize_run_config_object(obj: Any) Any#
Remove runtime-only objects from run config dictionaries.
Timers and other runtime constructs are serialized with _target_ entries that cannot be recreated without additional context (e.g., constructor arguments provided at runtime). These objects are not required when loading a checkpoint configuration, so we replace them with None to avoid instantiation errors when the config is processed later.
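A recursive sanitizer along these lines could look like the sketch below. The target name in `RUNTIME_ONLY_TARGETS_SKETCH` is a made-up placeholder (the real contents of `_RUNTIME_ONLY_TARGETS` are not shown on this page), and `sanitize_sketch` is a hypothetical name.

```python
from typing import Any

# ASSUMPTION: placeholder target name for illustration only.
RUNTIME_ONLY_TARGETS_SKETCH = frozenset({"example.runtime.Timers"})


def sanitize_sketch(obj: Any) -> Any:
    """Hypothetical: replace runtime-only config nodes with None, recursively."""
    if isinstance(obj, dict):
        target = obj.get("_target_")
        # A dict whose _target_ names a runtime-only class is dropped entirely.
        if isinstance(target, str) and target in RUNTIME_ONLY_TARGETS_SKETCH:
            return None
        return {key: sanitize_sketch(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [sanitize_sketch(item) for item in obj]
    # Scalars and other leaf values pass through unchanged.
    return obj
```

Nodes with other `_target_` values are kept, so the rest of the configuration can still be instantiated later.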