`bridge.training.utils.checkpoint_utils`#

Module Contents#

Functions#

`file_exists`	Check if a file exists.
`ensure_directory_exists`	Ensure that the directory for a given filename exists.
`get_checkpoint_name`	Determine the directory name for a specific checkpoint.
`get_checkpoint_train_state_filename`	Get the filename for the train state tracker file.
`get_checkpoint_run_config_filename`	Get the filename for the run configuration file within a checkpoint directory.
`get_checkpoint_tracker_filename`	Get the filename of the tracker file, which records the latest checkpoint so that training can restart from it.
`checkpoint_exists`	Check if a checkpoint directory exists.
`get_hf_model_id_from_checkpoint`	Infer the HuggingFace model identifier recorded in a Megatron Bridge checkpoint.
`read_run_config`	Read the run configuration from a YAML file (rank 0 only).
`read_train_state`	Read the train state metadata from a YAML file (rank 0 only).
`_sanitize_run_config_object`	Remove runtime-only objects from run config dictionaries.

Data#

`TRAIN_STATE_FILE`
`TRACKER_PREFIX`
`CONFIG_FILE`
`logger`
`_RUNTIME_ONLY_TARGETS`

API#

bridge.training.utils.checkpoint_utils.TRAIN_STATE_FILE#: ‘train_state.pt’

bridge.training.utils.checkpoint_utils.TRACKER_PREFIX#: ‘latest’

bridge.training.utils.checkpoint_utils.CONFIG_FILE#: ‘run_config.yaml’

bridge.training.utils.checkpoint_utils.logger#: ‘getLogger(…)’

bridge.training.utils.checkpoint_utils._RUNTIME_ONLY_TARGETS#: ‘frozenset(…)’

bridge.training.utils.checkpoint_utils.file_exists(path: str) → bool#

Check if a file exists.

Parameters: path – The path to the file. Can be a local path or an MSC URL.
Returns: True if the file exists, False otherwise.

bridge.training.utils.checkpoint_utils.ensure_directory_exists( filename: str, check_parent: bool = True, ) → None#

Ensure that the directory for a given filename exists.

Parameters:

filename – The path whose directory should be checked/created.
check_parent – If True (default), checks the parent directory of the filename. If False, treats the filename itself as the directory path.
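
The effect of check_parent can be illustrated with a minimal standard-library sketch. The helper name `ensure_dir_sketch` is hypothetical, and this version only handles local paths; how the real function treats MSC URLs is not shown here.

```python
import os
import tempfile


def ensure_dir_sketch(filename: str, check_parent: bool = True) -> None:
    """Local-filesystem stand-in (hypothetical) for ensure_directory_exists."""
    # With check_parent=True the target directory is the parent of `filename`;
    # otherwise `filename` itself is treated as the directory to create.
    dirname = os.path.dirname(filename) if check_parent else filename
    if dirname:  # a bare filename has no parent component to create
        os.makedirs(dirname, exist_ok=True)


with tempfile.TemporaryDirectory() as root:
    # Default: only the parent directory "ckpt" is created, not the file.
    ensure_dir_sketch(os.path.join(root, "ckpt", "train_state.pt"))
    parent_created = os.path.isdir(os.path.join(root, "ckpt"))
    file_created = os.path.exists(os.path.join(root, "ckpt", "train_state.pt"))

    # check_parent=False: the path itself becomes a directory.
    ensure_dir_sketch(os.path.join(root, "logs"), check_parent=False)
    dir_created = os.path.isdir(os.path.join(root, "logs"))
```

Passing a directory path with the default check_parent=True would create only its parent, so check_parent=False is the setting to use when the argument is itself the directory.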

bridge.training.utils.checkpoint_utils.get_checkpoint_name( checkpoints_path: str, iteration: int, release: bool = False, ) → str#

Determine the directory name for a specific checkpoint.

Constructs the path based on iteration number or release flag.

Parameters:

checkpoints_path – Base directory where checkpoints are stored.
iteration – The training iteration number.
release – If True, uses ‘release’ as the directory name instead of iteration.

Returns:

The full path to the checkpoint directory.
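
The path construction can be sketched as follows. The function name `checkpoint_name_sketch` is hypothetical, and the `iter_` prefix with a zero-padded seven-digit iteration number is an assumption based on the iter_* directories mentioned elsewhere in this module; the actual format may differ.

```python
import os


def checkpoint_name_sketch(checkpoints_path: str, iteration: int, release: bool = False) -> str:
    """Hypothetical stand-in showing how the directory name might be chosen."""
    # Assumption: iteration directories are named "iter_" followed by a
    # zero-padded seven-digit iteration number.
    directory = "release" if release else f"iter_{iteration:07d}"
    return os.path.join(checkpoints_path, directory)


iter_dir = checkpoint_name_sketch("/ckpts", 500)
release_dir = checkpoint_name_sketch("/ckpts", 500, release=True)
```

With release=True the iteration argument is ignored, which matches the documented behavior of using 'release' instead of the iteration.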

bridge.training.utils.checkpoint_utils.get_checkpoint_train_state_filename( checkpoints_path: str, prefix: Optional[str] = None, ) → str#

Get the filename for the train state tracker file.

This file typically stores metadata about the latest checkpoint, like the iteration number.

Parameters:

checkpoints_path – Base directory where checkpoints are stored.
prefix – Optional prefix (e.g., ‘latest’) to prepend to the filename.

Returns:

The full path to the train state tracker file.
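
A minimal sketch of how the prefix might be applied is shown below. `train_state_filename_sketch` is a hypothetical name, and joining the prefix and filename with an underscore is an assumption; only the value of TRAIN_STATE_FILE is taken from this page.

```python
import os
from typing import Optional

TRAIN_STATE_FILE = "train_state.pt"  # value documented for this module


def train_state_filename_sketch(checkpoints_path: str, prefix: Optional[str] = None) -> str:
    """Hypothetical stand-in for get_checkpoint_train_state_filename."""
    # Assumption: the prefix and the filename are joined with an underscore.
    name = TRAIN_STATE_FILE if prefix is None else f"{prefix}_{TRAIN_STATE_FILE}"
    return os.path.join(checkpoints_path, name)


plain = train_state_filename_sketch("/ckpts/iter_0000500")
latest = train_state_filename_sketch("/ckpts", prefix="latest")
```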

bridge.training.utils.checkpoint_utils.get_checkpoint_run_config_filename(checkpoints_path: str) → str#

Get the filename for the run configuration file within a checkpoint directory.

Parameters: checkpoints_path – Base directory where checkpoints are stored.
Returns: The full path to the run configuration file (e.g., run_config.yaml).

bridge.training.utils.checkpoint_utils.get_checkpoint_tracker_filename(checkpoints_path: str) → str#

Get the filename of the tracker file, which records the latest checkpoint so that training can restart from it.

Supports checkpoints produced by Megatron-LM.

Parameters: checkpoints_path – Base directory where checkpoints are stored.
Returns: The full path to the checkpoint tracker file (e.g., latest_checkpointed_iteration.txt).
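
To show what a caller might do with the returned path, here is a sketch that parses a tracker file. It assumes the Megatron-LM convention that the file contains either an iteration number or the word `release`; the helper `read_tracker_sketch` is hypothetical and is not part of this module.

```python
import os
import tempfile
from typing import Tuple


def read_tracker_sketch(tracker_filename: str) -> Tuple[int, bool]:
    """Return (iteration, release) parsed from a tracker file (hypothetical helper)."""
    with open(tracker_filename, "r") as f:
        content = f.read().strip()
    # Assumption: the file holds either "release" or an integer iteration.
    if content == "release":
        return 0, True
    return int(content), False  # raises ValueError on unexpected content


with tempfile.TemporaryDirectory() as root:
    tracker = os.path.join(root, "latest_checkpointed_iteration.txt")
    with open(tracker, "w") as f:
        f.write("1500\n")
    iteration, release = read_tracker_sketch(tracker)
```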

bridge.training.utils.checkpoint_utils.checkpoint_exists(checkpoints_path: Optional[str]) → bool#

Check if a checkpoint directory exists.

Parameters: checkpoints_path – Path to the potential checkpoint directory.
Returns: True if the path exists, False otherwise.

bridge.training.utils.checkpoint_utils.get_hf_model_id_from_checkpoint( path: str | os.PathLike[str], ) → str | None#

Infer the HuggingFace model identifier recorded in a Megatron Bridge checkpoint.

Parameters:

path – Path to a Megatron checkpoint directory. This can be either the root checkpoint directory containing iter_* subdirectories or a specific iteration directory.

Returns:

The HuggingFace model identifier/path if present, otherwise None.

Raises:

FileNotFoundError – If the provided path does not exist.
NotADirectoryError – If the provided path is not a directory.
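
The two accepted forms of path (a root directory or a specific iteration directory) and the two documented exceptions can be sketched as below. Everything here is illustrative: `resolve_iteration_dir_sketch` is a hypothetical helper, and choosing the highest-numbered iter_* directory under a root is an assumption, not confirmed behavior of the real function.

```python
import os
import re
import tempfile
from typing import Optional

_ITER_DIR = re.compile(r"^iter_(\d+)$")


def resolve_iteration_dir_sketch(path: str) -> Optional[str]:
    """Return an iteration directory for `path`, or None if none is found."""
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    if not os.path.isdir(path):
        raise NotADirectoryError(path)
    # Case 1: `path` already names a specific iteration directory.
    if _ITER_DIR.match(os.path.basename(os.path.normpath(path))):
        return path
    # Case 2: `path` is a root directory; pick the highest iteration (assumption).
    found = []
    for name in os.listdir(path):
        m = _ITER_DIR.match(name)
        if m and os.path.isdir(os.path.join(path, name)):
            found.append((int(m.group(1)), name))
    if not found:
        return None
    return os.path.join(path, max(found)[1])


with tempfile.TemporaryDirectory() as root:
    for step in (100, 200):
        os.makedirs(os.path.join(root, f"iter_{step:07d}"))
    from_root = os.path.basename(resolve_iteration_dir_sketch(root))
    specific = os.path.join(root, "iter_0000100")
    from_iter = os.path.basename(resolve_iteration_dir_sketch(specific))
    try:
        resolve_iteration_dir_sketch(os.path.join(root, "missing"))
        missing_raises = False
    except FileNotFoundError:
        missing_raises = True
```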

bridge.training.utils.checkpoint_utils.read_run_config(run_config_filename: str) → dict[str, Any]#

Read the run configuration from a YAML file (rank 0 only).

Reads the file on rank 0 and broadcasts the result to other ranks.

Parameters: run_config_filename – Path to the run config YAML file.
Returns: A dictionary containing the run configuration.
Raises: RuntimeError – If reading the config file fails on rank 0.

bridge.training.utils.checkpoint_utils.read_train_state( train_state_filename: str, ) → megatron.bridge.training.state.TrainState#

Read the train state metadata from a YAML file (rank 0 only).

Reads the file on rank 0 and broadcasts the result to other ranks if torch.distributed is initialized. Otherwise, loads the file locally.

Parameters: train_state_filename – Path to the train state YAML file.
Returns: An initialized TrainState object.

bridge.training.utils.checkpoint_utils._sanitize_run_config_object(obj: Any) → Any#

Remove runtime-only objects from run config dictionaries.

Timers and other runtime constructs are serialized with _target_ entries that cannot be recreated without additional context (e.g., constructor arguments provided at runtime). These objects are not required when loading a checkpoint configuration, so we replace them with None to avoid instantiation errors when the config is processed later.
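
The replacement described above can be sketched as a recursive walk over the loaded config. This is an illustration under stated assumptions: `sanitize_sketch` is a hypothetical name, and `example.runtime.Timers` is a made-up target, since the actual contents of _RUNTIME_ONLY_TARGETS are not shown on this page.

```python
from typing import Any

# Hypothetical target name; the real _RUNTIME_ONLY_TARGETS contents are not shown.
RUNTIME_ONLY_TARGETS_SKETCH = frozenset({"example.runtime.Timers"})


def sanitize_sketch(obj: Any) -> Any:
    """Replace dicts whose _target_ is runtime-only with None, recursively."""
    if isinstance(obj, dict):
        target = obj.get("_target_")
        if isinstance(target, str) and target in RUNTIME_ONLY_TARGETS_SKETCH:
            return None
        return {key: sanitize_sketch(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [sanitize_sketch(value) for value in obj]
    return obj  # scalars and other values pass through unchanged


raw_config = {
    "model": {"_target_": "example.models.Model", "hidden_size": 1024},
    "timers": {"_target_": "example.runtime.Timers", "log_level": 0},
    "callbacks": [{"_target_": "example.runtime.Timers"}, {"name": "keep"}],
}
clean_config = sanitize_sketch(raw_config)
```

Returning a new structure rather than mutating the input keeps the original config intact, which is one reasonable design choice; the real function may behave differently.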
